# STATISTICA Data Miner

Training Course Outline

Overview of Data Mining
- What is Data Mining?
- Steps in Data Mining
- Overview of Data Mining techniques
- Points to Remember

What is Data Mining?

Data mining is an analytic process designed to explore large amounts of data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. A popular quip describes it as "a process of torturing the data until they confess."

The typical goals of data mining projects are:
- Identification of groups, clusters, strata, or dimensions in data that display no obvious structure.
- Identification of factors that are related to a particular outcome of interest (root-cause analysis).
- Accurate prediction of outcome variable(s) of interest (in the future, or in new customers, clients, applicants, etc.); this application is usually referred to as predictive data mining.

Data mining is used to:
- Detect fraudulent patterns in credit card transactions, insurance claims, etc.
- Detect default patterns.
- Model customer buying patterns and behavior for cross-selling, up-selling, and customer acquisition.
- Optimize engine performance and several other complex manufacturing processes.

Data mining can be utilized in any organization that needs to find patterns or relationships in its data.

Steps in Data Mining

- Stage 1: Precise statement of the problem.
- Stage 2: Initial exploration.
- Stage 3: Model building and validation.
- Stage 4: Deployment.

Stage 1: Precise statement of the problem. Before opening a software package and running an analysis, the analyst must be clear about which question needs to be answered. Without a precise formulation of the problem you are trying to solve, you are wasting time and money.

Stage 2: Initial exploration. This stage usually starts with data preparation, which may involve cleaning the data (e.g., identification and removal of incorrectly coded data), data transformations, selecting subsets of records, and, in the case of data sets with large numbers of variables (fields), performing preliminary feature selection. Data description and visualization are key components of this stage (e.g., descriptive statistics, correlations, scatterplots, box plots).

Stage 3: Model building and validation. This stage involves considering various models and choosing the best one based on predictive performance.

Stage 4: Deployment. When the goal of the data mining project is to predict or classify new cases (e.g., to predict the creditworthiness of individuals applying for loans), the fourth and final stage typically involves applying the best model or models (determined in the previous stage) to generate predictions.

Initial exploration

- Cleaning of data: identification and removal of incorrectly coded data, e.g., Degree = Graduate, salary = 100.
- Data transformations: data may be skewed (that is, outliers in one direction or another may be present); a log transformation, Box-Cox transformation, etc. can be applied.
- Data reduction: selecting subsets of records and, in the case of data sets with large numbers of variables (fields), performing preliminary feature selection.
- Data description and visualization are key components of this stage (e.g., descriptive statistics, correlations, scatterplots, box plots, brushing tools). Data description allows you to get a snapshot of the important characteristics of the data (e.g., central tendency and dispersion).

Model building and validation.

A model is typically rated according to two aspects:
- Accuracy
- Understandability

These aspects often conflict with one another. Decision trees and linear regression models are simpler than models such as neural networks or boosted trees and are thus easier to understand; however, you might be giving up some predictive accuracy.

Remember not to confuse the data mining model with reality (a road map is not a perfect representation of the road); the model can still serve as a useful guide.

Validation of the model requires that you train the model on one set of data and evaluate it on another, independent set of data.

There are two main methods of validation:
- Split the data into train/test datasets (e.g., a 75-25 split).
- If you do not have enough data for a holdout sample, use v-fold cross-validation.

Model Validation Measures

Possible validation measures:
- Classification accuracy
- Total cost/benefit, when different errors involve different costs
- Lift and gains curves
- Error in numeric predictions
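The two validation methods can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`train_test_split`, `v_folds`, `fit_threshold`), the synthetic data, and the toy threshold classifier are all hypothetical and are not part of STATISTICA.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def v_folds(rows, v=5, seed=0):
    """Yield (train, test) pairs so that each row is tested exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]
    for i in range(v):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

def accuracy(predict, rows):
    """Classification accuracy: proportion of correctly predicted labels."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_threshold(train):
    """Toy classifier (assumption): cut halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > cut)

# Synthetic data (assumption): the label is 1 when x is at least 50.
data = [(x, int(x >= 50)) for x in range(100)]

# Method 1: 75-25 holdout split.
train, test = train_test_split(data, test_fraction=0.25)
holdout_acc = accuracy(fit_threshold(train), test)

# Method 2: v-fold cross-validation, averaging accuracy over the folds.
fold_scores = [accuracy(fit_threshold(tr), te) for tr, te in v_folds(data, v=5)]
cv_acc = sum(fold_scores) / len(fold_scores)
```

In both cases the model is fitted on one part of the data and scored only on cases it has not seen.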

Error rate: the proportion of errors made over the whole set of instances. The training-set error rate is far too optimistic: you can find patterns even in random data.

Deployment.

A model is built once but can be used over and over again, so it should be easily deployable. A linear regression is easily deployed: simply gather the regression coefficients. For example, if a new observed data vector {X1, X2, X3} comes in, plug it into the linear equation to generate the predicted value:

Prediction = B0 + B1*X1 + B2*X2 + B3*X3
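As a minimal sketch of this kind of deployment, the stored coefficients can be wrapped in a small scoring function. The function name `deploy_linear_model` and the coefficient values are hypothetical, chosen only for illustration.

```python
def deploy_linear_model(coefficients):
    """Return a scoring function for coefficients [B0, B1, ..., Bp]."""
    intercept, *slopes = coefficients

    def predict(x):
        # One input value is required per slope coefficient.
        if len(x) != len(slopes):
            raise ValueError("expected %d inputs, got %d" % (len(slopes), len(x)))
        return intercept + sum(b * xi for b, xi in zip(slopes, x))

    return predict

# Hypothetical coefficients B0..B3 gathered from a fitted regression.
predict = deploy_linear_model([1.5, 2.0, -0.5, 0.25])

# New observation {X1, X2, X3}: 1.5 + 2.0*3.0 - 0.5*4.0 + 0.25*8.0 = 7.5
prediction = predict([3.0, 4.0, 8.0])
```

Once the coefficients are stored, scoring a new case needs no access to the training data or to the software that fitted the model.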

What about more complicated models, such as neural networks? Within STATISTICA, we will use the Rapid Deployment module to deploy models easily.

Data Mining Techniques

- Neural Networks
- Generalized EM and k-Means Cluster Analysis
- General CART Models
- General CHAID Models
- Interactive Trees (C&RT and CHAID)
- Boosted Tree Classifiers and Regression
- Association Rules
- MARSplines
- Machine Learning (Bayesian, Support Vectors, and Nearest Neighbors)
- Random Forests for Regression and Classification
- Generalized Additive Models (GAM)
- Feature Selection and Variable Screening

Supervised Learning

Supervised learning is a machine learning technique for deducing a function from training data. The training data consist of pairs of input variables and desired outputs.

The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples. Classification and regression are very popular techniques of supervised learning.

Unsupervised Learning

In unsupervised learning, the training data are not available in the form of input and output variables. Unsupervised learning is a class of problems in which the researcher seeks to determine how the data are organized. Cluster analysis and principal component analysis are very popular techniques for unsupervised learning.

Points to Remember

Data mining is a tool, not a magic box.

Data mining will not automatically discover solutions without guidance. To ensure meaningful results, it is vital that you understand your data. Data mining is a user-centric, interactive process that leverages analytic technologies and computing power. Its central quest: find true patterns and avoid overfitting (finding random patterns by searching too many possibilities).
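The risk of finding random patterns by searching too many possibilities can be demonstrated with purely random data. The sketch below is an illustration only; the sample sizes and the seed are arbitrary assumptions. It searches 1000 random yes/no features for the one that best "predicts" 20 random labels.

```python
import random

rng = random.Random(42)          # fixed seed (assumption) for repeatability
N_CASES, N_FEATURES = 20, 1000   # arbitrary sizes chosen for illustration

def random_bits(n):
    """n independent coin flips; there is no real pattern to find."""
    return [rng.randint(0, 1) for _ in range(n)]

def agreement(feature, labels):
    """Proportion of cases where the feature value equals the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

train_labels = random_bits(N_CASES)
train_features = [random_bits(N_CASES) for _ in range(N_FEATURES)]

# Search all features for the best apparent "predictor" on the training data.
best_index = max(range(N_FEATURES),
                 key=lambda i: agreement(train_features[i], train_labels))
train_accuracy = agreement(train_features[best_index], train_labels)

# On new cases the selected feature is again just coin flips.
new_labels = random_bits(N_CASES)
new_feature_values = random_bits(N_CASES)
new_accuracy = agreement(new_feature_values, new_labels)
```

With this many candidates, the best feature usually agrees with the training labels far more often than chance, while its accuracy on new cases tends to hover around 50%. This is why the training-set error rate should not be trusted on its own.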

Classification and Regression.

Databases are rich with hidden information that can be used to make intelligent business decisions. Classification and regression are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification is used to predict a categorical response variable, for example the type of an iris flower (Setosa, Virginica, Versicolor). Regression is used to predict a quantitative response variable, for example the average income of a household.

Statistical learning plays a key role in many areas of science, finance, industry, and many other applications. Here are some examples of learning problems:
- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet, and clinical measurements for that patient.
- Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data.
- Identify the loan applicants who will be beneficial customers for the bank.
- Identify the numbers in a handwritten ZIP code from a digitized image.
- Estimate the amount of glucose in the blood of a diabetic person from the infrared absorption spectrum of that person's blood.

Steps of Classification and Regression models

Step 1: A model is built describing a predetermined set of data classes (supervised learning).
Step 2: The predictive accuracy of the model is estimated.
Step 3: If the accuracy of the model is considered acceptable, the model can be used to classify future data for which the class label is unknown.

Techniques.

Different kinds of classification and regression techniques are available in STATISTICA, including:
1. Classification and Regression through STATISTICA Automated Neural Networks.
2. General Classification and Regression Trees.
3. General CHAID Models.
4. Boosted Tree Classification and Regression.
5. Random Forests for Classification and Regression, etc.

Decision Trees

For example, consider the widely referenced Iris data classification problem introduced by Fisher (1936). The purpose of the analysis is to learn how one can discriminate between the three types of flowers, based on the four measures of width and length of petals and sepals. A classification tree will determine a set of logical if-then conditions (instead of linear equations) for predicting or classifying cases.

Advantages of tree methods.

Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for rapid classification of new observations; it can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner. For example, when analyzing business problems, it is much easier to present a few simple if-then statements to management than some elaborate equations.

Tree methods are nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of logical if-then conditions. Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or are even monotonic in nature.

General Classification and Regression tree

The STATISTICA General Classification and Regression Trees module (GC&RT) builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C&RT algorithm and includes various methods for pruning and cross-validation, including the powerful v-fold cross-validation methods.

Classification and Regression Trees (C&RT)

In most general terms, the purpose of analyses via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.

Classification Trees

The example data file Irisdat.sta reports the lengths and widths of sepals and petals of three types of irises (Setosa, Versicol, and Virginic); as above, the goal is to discriminate between the three types based on these four measures. Discriminant function analysis would estimate several linear combinations of predictor variables for computing classification scores (or probabilities) that allow the user to determine the predicted classification for each observation. A classification tree instead determines a set of logical if-then conditions for predicting or classifying cases.

Regression Trees.

The general approach of deriving predictions from a few simple if-then conditions can be applied to regression problems as well. One example is based on the data file Poverty.sta, which contains 1960 and 1970 Census figures for a random selection of 30 counties. The research question (for that example) was to determine the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county.

CHAID Model

CHAID stands for CHi-squared Automatic Interaction Detector. CHAID, a technique whose original intent was to detect interaction between variables (i.e., find "combination" variables), recursively partitions a population into separate and distinct groups, which are defined by a set of independent (predictor) variables, such that the CHAID objective is met: the variance of the dependent (target) variable is minimized within the groups and maximized across the groups. Like other decision trees, its advantages are that its output is highly visual and easy to interpret.
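To illustrate why tree output is easy to interpret, the if-then conditions of a small classification tree for the iris data can be written directly as code. The two split thresholds below are illustrative assumptions of the kind commonly reported for Fisher's iris data; they are not output from STATISTICA, and a fitted tree may choose different variables or cut points.

```python
def classify_iris(petal_length, petal_width):
    """Apply two illustrative if-then split conditions (measurements in cm)."""
    if petal_length < 2.45:      # assumed first split
        return "Setosa"
    if petal_width < 1.75:       # assumed second split
        return "Versicol"
    return "Virginic"

# Hypothetical flowers given as (petal length, petal width).
examples = [(1.4, 0.2), (4.5, 1.5), (6.0, 2.5)]
labels = [classify_iris(length, width) for length, width in examples]
```

Each prediction can be explained by reading off the one or two conditions that the case satisfied, which is what makes such a model easy to present.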

Because CHAID uses multiway splits by default, it needs rather large sample sizes to work effectively. The basic algorithm constructs (non-binary) trees; for classification problems it relies on the Chi-square test to determine the best next split at each step, and for regression-type problems the program computes F-tests. Specifically, the algorithm proceeds as follows:

Preparing predictors. First, STATISTICA will create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (classes) are "naturally" defined.

Merging categories. Next, STATISTICA will cycle through the predictors to determine for each predictor the pair of (predictor) categories that is least significantly different with respect to the dependent variable; for classification problems (where the dependent variable is categorical as well) the program will compute a Chi-square test (Pearson Chi-square); for regression problems (where the dependent variable is continuous), the program will compute F-tests. If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha-to-merge value, then the program will merge the respective predictor categories and repeat this step.

Selecting the split variable. Next, STATISTICA will choose for the split the predictor variable with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split; if the smallest (Bonferroni) adjusted p-value for any predictor is greater than some alpha-to-split value, then no further splits are performed, and the respective node is a terminal node.

This process continues until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).

CHAID and Exhaustive CHAID Algorithms:

Exhaustive CHAID, a modification of the basic CHAID algorithm, performs a more thorough merging and testing of predictor variables, and hence requires more computing time. Specifically, the merging of categories continues (without reference to any alpha-to-merge value) until only two categories remain for each predictor. The program then proceeds as described above in the Selecting the split variable step and selects among the predictors the one that yields the most significant split. For large data sets with many continuous predictor variables, this modification of the simpler CHAID algorithm may require significant computing time.
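The category-merging step used by both CHAID variants can be sketched for a classification problem as follows. This is a simplified illustration, not STATISTICA's implementation: it assumes a two-class dependent variable (so each pairwise test has 1 degree of freedom and the p-value can be computed with `math.erfc`), it considers every pair of categories, and the counts and the alpha-to-merge value are hypothetical.

```python
import math
from itertools import combinations

def pearson_chi2(row_a, row_b):
    """Pearson Chi-square statistic for a 2 x c table of class counts."""
    total = sum(row_a) + sum(row_b)
    chi2 = 0.0
    for row in (row_a, row_b):
        for j, observed in enumerate(row):
            expected = sum(row) * (row_a[j] + row_b[j]) / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail p-value of a Chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def least_different_pair(counts):
    """Find the pair of predictor categories with the largest p-value."""
    best = None
    for a, b in combinations(sorted(counts), 2):
        p = p_value_df1(pearson_chi2(counts[a], counts[b]))
        if best is None or p > best[2]:
            best = (a, b, p)
    return best

# Hypothetical counts per predictor category: [class 0 cases, class 1 cases].
counts = {"low": [30, 10], "medium": [28, 12], "high": [10, 30]}
ALPHA_TO_MERGE = 0.05  # assumed threshold

cat_a, cat_b, p = least_different_pair(counts)
should_merge = p > ALPHA_TO_MERGE  # not significant, so the pair is merged
```

Here "low" and "medium" have nearly the same class distribution, so they form the least significantly different pair and would be merged before the step is repeated.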

Machine Learning Algorithms.

STATISTICA Machine Learning provides a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include:
- Support Vector Machines (SVM), for regression and classification.
- Naive Bayes, for classification.
- K-Nearest Neighbors (KNN), for regression and classification.

Support Vector Machines

STATISTICA Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. STATISTICA SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. To construct an optimal hyperplane, SVM employs an iterative training algorithm, which is used to minimize an error function. According to the form of the error function, SVM models can be classified into four distinct groups:

- Classification SVM Type 1 (also known as C-SVM classification).
- Classification SVM Type 2 (also known as nu-SVM classification).
- Regression SVM Type 1 (also known as epsilon-SVM regression).
- Regression SVM Type 2 (also known as nu-SVM regression).

Naive Bayes Classification

Bayesian classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers also offer high accuracy and speed when applied to large data sets.

Bayes' Theorem.

Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X. P(H|X) is called the posterior probability, and Bayes' theorem expresses it as P(H|X) = P(X|H) P(H) / P(X). Suppose the world of data samples consists of fruits, described by their color and shape. Suppose X is red and round and H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.

K-Nearest Neighbors.

STATISTICA K-Nearest Neighbors (KNN) is a memory-based model defined by a set of objects known as examples, for which the outcomes are known (i.e., the examples are labeled). The independent and dependent variables can be either continuous or categorical. For continuous dependent variables, the task is regression; otherwise it is classification. Thus, STATISTICA KNN can handle both regression and classification tasks.

Given a new case of predictor values (query point), we would like to estimate the outcome based on the KNN examples. STATISTICA KNN achieves this by finding the K examples that are closest in distance to the query point, hence the name K-Nearest Neighbors. For regression problems, KNN predictions are based on averaging the outcomes of the K nearest neighbors; for classification problems, majority voting is used.

Cross-Validation

K can be regarded as one of the most important factors of the model and can strongly influence the quality of predictions. There should be an optimal value for K that achieves the right trade-off between the bias and the variance of the model. STATISTICA KNN can provide an estimate of K using an algorithm known as cross-validation.

Cross-validation is a well-established technique that can be used to obtain estimates of model parameters that are unknown. Here we discuss the applicability of this technique to estimating K.

The general idea of this method is to divide the data sample into a number of v folds (randomly drawn, disjoint sub-samples or segments). For a fixed value of K, we apply the KNN model to make predictions on the vth segment (i.e., use the remaining v-1 segments as the examples) and evaluate the error. The most common choice for this error in regression is the sum of squared errors; for classification it is most conveniently defined as the accuracy (the percentage of correctly classified cases). This process is then successively applied to all possible choices of the held-out segment. At the end of the v folds (cycles), the computed errors are averaged to yield a measure of the stability of the model (how well the model predicts query points).
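The procedure can be sketched for a classification task as follows. This is a minimal illustration rather than STATISTICA's implementation: it assumes Euclidean distance, majority voting, synthetic two-class data, and hypothetical helper names (`knn_classify`, `cv_accuracy`, `choose_k`).

```python
import random
from collections import Counter

def knn_classify(examples, query, k):
    """Majority vote among the k examples closest to the query point."""
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], query))
    nearest = sorted(examples, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cv_accuracy(examples, k, v=5, seed=0):
    """Average accuracy over v folds for a fixed K."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)   # same folds for every K
    folds = [shuffled[i::v] for i in range(v)]
    scores = []
    for i, held_out in enumerate(folds):
        # The remaining v-1 segments serve as the labeled examples.
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        correct = sum(knn_classify(rest, x, k) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / v

def choose_k(examples, candidates, v=5):
    """Pick the candidate K with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(examples, k, v))

# Synthetic data (assumption): two Gaussian clouds labeled 0 and 1.
rng = random.Random(1)
data = ([((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)] +
        [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(40)])

best_k = choose_k(data, [1, 3, 5, 7, 9])
best_accuracy = cv_accuracy(data, best_k)
```

Odd candidate values of K are used so that a two-class vote cannot tie. For a regression task, the vote would be replaced by the average outcome of the K nearest neighbors and the accuracy by the sum of squared errors.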

The cross-validation steps described above are then repeated for various values of K, and the value achieving the lowest error (or the highest classification accuracy) is selected as the optimal value for K (optimal in a cross-validation sense). Note that cross-validation is computationally expensive, and you should be prepared to let the algorithm run for some time, especially when the examples sample is large.

Association Rule.

The goal of association rule mining is to detect relationships or associations among a large set of data items.

It is an important data mining model studied extensively by the database and data mining community. It assumes all data are categorical. It was initially used for market basket analysis, to find how items purchased by customers are related. The discovery of such association rules can help people develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

Transaction data: supermarket data

Market basket transactions:

t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
...
tn: {biscuit, eggs, milk}

Concepts:
- An item: an item/article in a basket.
- I: the set of all items sold in the store.
- A transaction: the items purchased in a basket; it may have a TID (transaction ID).
- A transactional dataset: a set of transactions.

The model: rules

A transaction t contains X, a set of items (itemset) in I, if X ⊆ t. An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅. An itemset is a set of items; e.g., X = {milk, bread, cereal} is an itemset. A k-itemset is an itemset with k items; e.g., {milk, bread, cereal} is a 3-itemset.

Rule strength measures

Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.

sup = Pr(X ∪ Y) = count(X ∪ Y) / total count.

Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.

conf = Pr(Y | X) = support(X ∪ Y) / support(X).

An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.

An Example.

Transaction data:

t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

Assume minsup = 30% and minconf = 80%.

An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]

Association rules from the itemset:
- Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
- Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

Cluster Analysis.

Cluster analysis is the process of grouping the data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Clustering is an example of unsupervised learning, where the learning does not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by example.

Area of Application.

- Market research. Clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
- Biology. Biologists can use clustering to discover distinct groups of species depending on some useful parameters.

k-Means clustering.

The basic operation of this algorithm is relatively simple: given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.

Extensions and generalizations.

The methods implemented in the Generalized EM and k-Means Cluster Analysis module of STATISTICA extend this basic approach to clustering in three important ways:

Instead of assigning cases or observations to clusters so as to maximize the differences in means for continuous variables, the EM (expectation maximization) clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm is to maximize the overall probability or likelihood of the data, given the (final) clusters.

Unlike the classic implementation of k-Means clustering in the Cluster Analysis module, the k-Means and EM algorithms in the Generalized EM and k-Means Cluster Analysis module can be applied to both continuous and categorical variables.

A major shortcoming of k-Means clustering has been that you need to specify the number of clusters before starting the analysis (i.e., the number of clusters must be known a priori); the Generalized EM and k-Means Cluster Analysis module uses a modified v-fold cross-validation scheme to determine the best number of clusters from the data. This extension makes the Generalized EM and k-Means Cluster Analysis module an extremely useful data mining tool for unsupervised learning and pattern recognition.

Thank you.

Krishnendu Kundu (Statistician), StatSoft India. Mobile: +919873119520.