Statistical Learning
Dong Liu, Dept. EEIS, USTC
03/03/2020

Chapter 1. Linear Regression
1. From one to two
2. Regularization
3. Basis functions
4. Bias-variance decomposition
5. Different regularization forms
6. Bayesian approach

A motivating example (1/2)
What is the height of Mount Qomolangma? A piece of knowledge, one variable:

    def HeightOfQomolangma():
        return 8848.0

How do we achieve this knowledge from data? We have a series of measurements. For example, we can use the (arithmetic) mean:

    def SLHeightOfQomolangma(data):
        return sum(data) / len(data)

A motivating example (2/2)
Or in this way:

Learning/Training:

    hQomo = 0

    def LearnHeightOfQomolangma(data):
        global hQomo
        hQomo = sum(data) / len(data)

Using/Testing:

    def UseHeightOfQomolangma():
        global hQomo
        return hQomo

Why arithmetic mean? Least squares
We formulate the problem as minimizing the sum of squared errors:

    min_h Σ_n (x_n − h)²

Solving the problem: setting the derivative to zero gives h = (1/N) Σ_n x_n, the arithmetic mean. Note the distinction between a relative (local) minimum and an absolute (global) minimum.
In statistical learning, we often formulate such optimization problems and try to solve them:
- How to formulate?
- How to solve?
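A quick numerical check that the mean minimizes the sum of squared errors (the measurement values here are made up for illustration):

```python
# Illustrative data: hypothetical measurements of the height.
data = [8848.2, 8847.9, 8848.4, 8848.1, 8847.8]

def squared_error(h, data):
    # Sum of squared errors of a candidate height h.
    return sum((x - h) ** 2 for x in data)

mean = sum(data) / len(data)

# Any small perturbation of the mean increases the squared error.
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert squared_error(mean + delta, data) > squared_error(mean, data)
```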

From the statistical perspective
The height of Qomolangma is a random variable, which obeys a specific probability distribution, for example a Gaussian (normal) distribution. The measurements are observations of the random variable and are used to estimate the distribution.
Assumption: the observations are independent and identically distributed (i.i.d.).

Maximum likelihood estimation
Likelihood function: p(x | μ, σ²), viewed as a function of the parameters μ, σ².
Overall likelihood function (recall i.i.d.): p(x₁, …, x_N | μ, σ²) = ∏_n N(x_n | μ, σ²)
We need to find the parameter that maximizes the overall likelihood. Since

    ln p(x₁, …, x_N | μ, σ²) = −(1/(2σ²)) Σ_n (x_n − μ)² − (N/2) ln(2πσ²)

maximizing over μ reduces to least squares!

More is implied
We can also estimate other parameters, e.g. the variance: σ²_ML = (1/N) Σ_n (x_n − μ_ML)²
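For example, the variance: the ML estimate divides by N, while the unbiased estimate divides by N − 1. An illustrative check with made-up data:

```python
# Hypothetical measurements, for illustration only.
data = [8848.2, 8847.9, 8848.4, 8848.1, 8847.8]
n = len(data)
mean = sum(data) / n

var_ml = sum((x - mean) ** 2 for x in data) / n            # ML (biased): divide by N
var_unbiased = sum((x - mean) ** 2 for x in data) / (n - 1)  # unbiased: divide by N-1

# The ML estimate is systematically smaller than the unbiased one.
assert var_unbiased > var_ml
```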

We can use other estimators, like the unbiased variance estimate s² = (1/(N−1)) Σ_n (x_n − x̄)². We can give range (interval) estimation rather than point estimation.

Correlated variables
The height of Mount Qomolangma is correlated with the season.

So what is the correlation between the two variables? Why not an affine function:

    def UseSeasonalHeight(x, a, b):
        return a * x + b

[Figure: measured heights plotted against the season (Spring, Summer, Fall, Winter).]

Least squares
We formulate the optimization problem as

    min_{a,b} Σ_n (y_n − (a x_n + b))²

and (fortunately) have a closed-form solution.
Result: the fit is seemingly not good. How to improve?
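A minimal sketch of the closed-form least-squares fit y ≈ a·x + b via the normal equations; the season codes and heights below are hypothetical:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])              # seasons coded 0..3
y = np.array([8847.0, 8849.0, 8848.5, 8846.5])  # hypothetical heights

X = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
a, b = np.linalg.lstsq(X, y, rcond=None)[0]     # solves min ||X w - y||^2

residual = y - (a * x + b)
print(a, b, (residual ** 2).sum())
```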

Variable (re)mapping
Previously we used: Spring = 0, Summer = 1, Fall = 2, Winter = 3.
Now we use: Summer = 1, Spring/Fall = 2, Winter = 3.

Result

    def ErrorOfHeight(datax, datay, a, b):
        fity = UseSeasonalHeight(datax, a, b)
        error = datay - fity
        return sum(error**2)

    Season: 3.6146646254138832
    Remapped season: 0.9404394822254982

From the statistical perspective
We have two random variables:
- Height: a dependent, continuous variable
- Season: an independent, discrete variable
The season's probability distribution, the height's conditional distribution, and the overall likelihood function follow from this model.

History review
Adrien-Marie Legendre (French, 1752-1833); Carl Friedrich Gauss (German, 1777-1855).

Notes
Correlation is not causality, but it inspires efforts to interpret.
Remapped/latent variables are important.

Chapter 1. Linear Regression
2. Regularization

As we are not confident about our data

Height is correlated with season, but also with other variables. Can we constrain the level of correlation between height and season? That is, we want to constrain the slope parameter. We have two choices:
- Given a range of possible values of the slope parameter, find the least squares
- Minimize the least squares and (e.g. the square of) the slope parameter simultaneously

Two optimization problems
Constrained form:   min_{a,b} Σ_n (y_n − (a x_n + b))²  subject to  a² ≤ t
Unconstrained form: min_{a,b} Σ_n (y_n − (a x_n + b))² + λ a²
Solutions for increasing regularization weight (Reg = 0, 1, …, 10):

    Reg  0: y = 1.184060 * x + 8846.369904, error: 0.94043
    Reg  1: y = 1.014908 * x + 8846.708207, error: 1.11211
    Reg  2: y = 0.888045 * x + 8846.961934, error: 1.46618
    Reg  3: y = 0.789373 * x + 8847.159277, error: 1.87510
    Reg  4: y = 0.710436 * x + 8847.317152, error: 2.28635
    Reg  5: y = 0.645851 * x + 8847.446322, error: 2.67845
    Reg  6: y = 0.592030 * x + 8847.553964, error: 3.04343
    Reg  7: y = 0.546489 * x + 8847.645045, error: 3.37941
    Reg  8: y = 0.507454 * x + 8847.723115, error: 3.68720
    Reg  9: y = 0.473624 * x + 8847.790776, error: 3.96875
    Reg 10: y = 0.444022 * x + 8847.849978, error: 4.2263
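The penalized (unconstrained) form has a closed-form solution. A sketch with hypothetical data, penalizing only the slope as in the table above; note how the slope shrinks as the weight grows:

```python
import numpy as np

def ridge_slope_fit(x, y, lam):
    # Minimize ||y - (a*x + b)||^2 + lam * a^2 (penalty on the slope only).
    X = np.column_stack([x, np.ones_like(x)])
    P = np.diag([lam, 0.0])            # penalize a, leave the bias b free
    return np.linalg.solve(X.T @ X + P, X.T @ y)   # returns (a, b)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([8847.0, 8849.0, 8848.5, 8846.5])     # hypothetical heights
for lam in (0.0, 1.0, 10.0):
    a, b = ridge_slope_fit(x, y, lam)
    print(lam, a, b)
```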

How to solve a constrained optimization problem?
Consider a general optimization problem:

    min f(x)  subject to  g_i(x) ≤ 0,  h_j(x) = 0

The basic idea is to construct an augmented objective function (the Lagrangian)

    L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x)

where λ_i and ν_j are Lagrange multipliers. Then consider the dual function

    d(λ, ν) = inf_x L(x, λ, ν)

When λ ≥ 0, the dual function is a lower bound on the optimal value of the original problem.

Duality
The dual problem: max_{λ ≥ 0, ν} d(λ, ν)
Weak duality (always true): the dual optimum is no larger than the primal optimum.
Strong duality (under some condition): the two optima coincide.
For differentiable functions, strong duality implies the KKT conditions; for convex optimization, the KKT conditions imply strong duality.

Understanding the KKT condition: from the equivalence of constrained and unconstrained problems
- Equality as constraint
- Inequality as constraint
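The KKT conditions referred to here can be written out explicitly (standard statement, using the multipliers λ_i, ν_j of the Lagrangian):

```latex
\begin{aligned}
&\nabla f(x^\star) + \sum_i \lambda_i^\star \nabla g_i(x^\star)
  + \sum_j \nu_j^\star \nabla h_j(x^\star) = 0 && \text{(stationarity)}\\
&g_i(x^\star) \le 0, \qquad h_j(x^\star) = 0 && \text{(primal feasibility)}\\
&\lambda_i^\star \ge 0 && \text{(dual feasibility)}\\
&\lambda_i^\star\, g_i(x^\star) = 0 && \text{(complementary slackness)}
\end{aligned}
```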

Understanding the KKT condition: from the geometrical perspective
Example: no constraint; with equality as constraint; with inequality as constraint.

More about convex optimization (1/2)
- Convex set
- Convex function: a function defined on a convex set whose value along any chord lies on or below the chord
- Concave function: a function whose negative is convex
- An affine function is both convex and concave
Convex optimization is to minimize a convex function (or maximize a concave function) over a convex set.

More about convex optimization (2/2)
- For a convex optimization problem, any local minimum is also a global minimum. Proof: by reductio ad absurdum.
- For a convex optimization problem, if the function is strictly convex, then there is only one global minimum. Proof: by the definition of a convex function.
- The dual problem is always a convex optimization problem.

What & why is regularization?
What: a process of introducing additional information in order to solve an ill-posed problem.
Why: we want to introduce additional information, or we have difficulty in solving the ill-posed problem.
[Figure: fits without regularization vs. with regularization.]

From the statistical perspective
The Bayes formula: p(θ | data) ∝ p(data | θ) p(θ)
Maximum a posteriori (MAP) estimation (Bayesian estimation): maximize the posterior.
We need to specify a prior, e.g. a zero-mean Gaussian. Finally it reduces to regularized least squares, with the regularization weight determined by the noise and prior variances.
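The reduction of MAP to regularized least squares can be sketched as follows (assuming Gaussian noise with variance σ² and a zero-mean Gaussian prior with variance τ²; these symbols are illustrative, not the slides' own):

```latex
\hat{w}_{\mathrm{MAP}}
 = \arg\max_{w}\; p(\mathbf{t}\mid w)\,p(w)
 = \arg\min_{w}\; \sum_{n=1}^{N}\bigl(t_n - w^{\top}\phi(x_n)\bigr)^{2}
   + \frac{\sigma^{2}}{\tau^{2}}\,\lVert w\rVert^{2}
```

Taking the negative logarithm of the posterior turns the Gaussian likelihood into the data term and the Gaussian prior into the quadratic penalty.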

Bayesian interpretation of regularization
The prior is additional information (many statisticians question this point).
How much regularization to apply depends on:
- How confident we are about the data
- How confident we are about the prior

Chapter 1. Linear Regression
3. Basis functions

Polynomial curve fitting

    y(x, w) = Σ_{j=0}^{M} w_j x^j

More generally, with basis functions φ_j and weights w_j:

    y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x)

Another form separates weights and bias: y(x, w) = w^T φ(x) + b.

Basis functions
Global vs. local:
- Polynomial (global)
- Gaussian (local)
- Sigmoid (local)
Other choices: Fourier basis (sinusoidal), wavelet, spline.

Variable remapping
Using basis functions will remap the variable(s) in a non-linear manner and change the dimensionality, to enable a simpler (linear) model.
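A minimal sketch of how such remappings look as design matrices; the degree, centers, and width below are illustrative choices:

```python
import numpy as np

def polynomial_design(x, degree):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_design(x, centers, width):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column.
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), phi])

x = np.linspace(0, 1, 5)
print(polynomial_design(x, 3).shape)                         # (5, 4)
print(gaussian_design(x, np.linspace(0, 1, 9), 0.1).shape)   # (5, 10)
```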

Maximum likelihood
Assume observations are from a deterministic function with additive Gaussian noise:

    t = y(x, w) + ε,   ε ~ N(0, β⁻¹)

so that p(t | x, w, β) = N(t | y(x, w), β⁻¹).
Given observed inputs x₁, …, x_N and targets t₁, …, t_N, the likelihood function is

    p(t | X, w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β⁻¹)

Maximum likelihood and least squares
Maximizing the likelihood with respect to w is equivalent to minimizing

    E_D(w) = ½ Σ_{n=1}^N (t_n − w^T φ(x_n))²

which is known as the sum of squared errors (SSE).

Maximum likelihood solution

The solution is

    w_ML = (Φ^T Φ)⁻¹ Φ^T t

where Φ is the design matrix with entries Φ_{nj} = φ_j(x_n), and Φ† = (Φ^T Φ)⁻¹ Φ^T is the (Moore-Penrose) pseudo-inverse.

Geometrical interpretation
Let t = (t₁, …, t_N)^T, and let the columns of Φ be φ₁, …, φ_M.

They span a subspace S. Then y = Φ w_ML is the orthogonal projection of t onto the subspace S, so as to minimize the Euclidean distance ‖t − y‖.

Regularized least squares
Construct the joint error function: data term + regularization term.

Use SSE as the data term and a quadratic regularization term (ridge regression):

    E(w) = ½ Σ_n (t_n − w^T φ(x_n))² + (λ/2) w^T w

The solution is w = (λI + Φ^T Φ)⁻¹ Φ^T t.

Equivalent kernel
For a new input x, the predicted output is

    y(x) = φ(x)^T (λI + Φ^T Φ)⁻¹ Φ^T t = Σ_n k(x, x_n) t_n

where k(x, x') = φ(x)^T (λI + Φ^T Φ)⁻¹ φ(x') is the equivalent kernel.

Predictions can be calculated directly from the equivalent kernel, without calculating the parameters.

Equivalent kernel for Gaussian basis functions
[Figure: equivalent kernel for Gaussian basis functions, localized around each input point.]
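A sketch of ridge regression and its equivalent-kernel (smoother) view; the basis, data, and λ here are illustrative. Row n of the smoother matrix S holds the weights k(x_n, x_m) that combine the targets into the prediction at x_n:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
centers = np.linspace(0, 1, 5)
Phi = np.column_stack([np.ones_like(x)] +
                      [np.exp(-(x - c) ** 2 / (2 * 0.1 ** 2)) for c in centers])

def smoother(Phi, lam):
    # S = Phi (lam I + Phi^T Phi)^{-1} Phi^T, so predictions are S @ t.
    return Phi @ np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T)

lam = 0.1
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
# Predicting via the parameters or via the equivalent kernel agrees:
assert np.allclose(Phi @ w, smoother(Phi, lam) @ t)
# With lam = 0, the kernel weights at each point sum to 1 (the entries
# may still be negative, as noted later in the deck):
assert np.allclose(smoother(Phi, 0.0).sum(axis=1), 1.0)
```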

Equivalent kernel for other basis functions
Polynomial; sigmoidal. In these cases too, the equivalent kernel is local: nearby points have more weight.

Properties of the equivalent kernel
- Sums to 1 if λ is 0
- May have negative values
- Can be seen as an inner product

Chapter 1. Linear Regression
4. Bias-variance decomposition

Example (reproduced from PRML)
Generate 100 data sets, each having 25 points: a sine function plus Gaussian noise. Perform ridge regression on each data set, with 24 Gaussian basis functions and different values of the regularization weight.

Simulation results (1/3)
High regularization: the variance is small but the bias is large.
[Figure: fitted curves (20 fits shown) and the average curve over 100 fits.]
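The experiment just described can be sketched in a few lines (a minimal sketch; the noise level, number of bases, and λ are illustrative, not the slides' exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
h = np.sin(2 * np.pi * x)                       # true function
lam, width = 1.0, 0.1
centers = np.linspace(0, 1, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

fits = []
for _ in range(100):                            # 100 data sets
    t = h + 0.3 * rng.standard_normal(x.size)   # sine plus Gaussian noise
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    fits.append(Phi @ w)                        # ridge fit on this data set
fits = np.array(fits)

avg = fits.mean(axis=0)                         # average curve over 100 fits
bias2 = ((avg - h) ** 2).mean()
variance = ((fits - avg) ** 2).mean()
print(bias2, variance)
```

Increasing `lam` drives `variance` down and `bias2` up, which is exactly the trade-off the three figures show.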

Simulation results (2/3)
Moderate regularization.
[Figure: fitted curves (20 fits shown) and the average curve over 100 fits.]

Simulation results (3/3)
Low regularization: the variance is large but the bias is small.
[Figure: fitted curves (20 fits shown) and the average curve over 100 fits.]

Bias-variance decomposition
The expected loss decomposes into terms; the second term is the intrinsic noise, so consider the first term. Suppose we have a dataset D, from which we calculate the parameters and hence the prediction y(x; D). Then we take the expectation with respect to the dataset:

    E_D[(y(x; D) − h(x))²] = (E_D[y(x; D)] − h(x))² + E_D[(y(x; D) − E_D[y(x; D)])²]

Finally we have: expected loss = (bias)² + variance + noise.

Bias-variance trade-off
An over-regularized model will have a high bias, while an under-regularized model will have a high variance. How can we achieve the trade-off? For example, by cross validation (to be discussed later).

Chapter 1. Linear Regression
5. Different regularization forms

Other forms?
Least squares; ridge regression; L_p-norm regularized regression:

    min_w Σ_n (t_n − w^T φ(x_n))² + λ ‖w‖_p^p,   where ‖w‖_p = (Σ_j |w_j|^p)^{1/p}

Different norms
[Figure: unit balls of the L_p norm for several values of p.] What about p = 0 and p = ∞?

Best subset selection
Define the L0 "norm" as the number of non-zero entries: ‖w‖₀ = #{j : w_j ≠ 0}.
Best subset selection regression:

    min_w Σ_n (t_n − w^T φ(x_n))² + λ ‖w‖₀

Also known as sparse regression. Unfortunately, this is NP-hard.

Example: Why do we need sparsity?
fMRI data help to understand the brain's functionality. Brain fMRI data may consist of 10,000~100,000 voxels; we want to identify the most relevant anchor points.

L1 norm replacing L0 norm
Interestingly, we can use the L1 norm to replace the L0 norm and still achieve a sparse solution.*
Geometric interpretation. [Figure: left, the L1-norm ball; right, the L2-norm ball; the least-squares solution's contours touch the L1 ball at a corner — sparsity here!]

* Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.

LASSO regression
LASSO: least absolute shrinkage and selection operator.*
Bayesian interpretation: the Laplace distribution as prior.
* Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288.

Solution of LASSO regression
Consider a special case: an orthonormal design matrix. The least-squares solution is ŵ = Φ^T t, and the LASSO solution (for the objective ‖t − Φw‖² + λ‖w‖₁) is the soft-thresholded version

    w_j = sign(ŵ_j) · max(|ŵ_j| − λ/2, 0)
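Under an orthonormal design, the LASSO objective separates per coordinate into (w − z)² + λ|w|, whose minimizer is exactly the soft threshold. A brute-force check (illustrative values of z and λ):

```python
import numpy as np

def soft_threshold(z, lam):
    # Minimizer of (w - z)^2 + lam * |w|.
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

z, lam = 1.3, 1.0
w_grid = np.linspace(-3, 3, 60001)
obj = (w_grid - z) ** 2 + lam * np.abs(w_grid)
w_brute = w_grid[np.argmin(obj)]           # grid-search minimizer

assert abs(w_brute - soft_threshold(z, lam)) < 1e-3
```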

Comparison between best subset, LASSO, and ridge
Consider the special case of an orthonormal design matrix:
- Best subset: hard thresholding of the LS solution
- LASSO: soft thresholding of the LS solution
- Ridge: uniformly shrink the LS solution

Implications of different norms
- Best subset and LASSO yield sparse solutions
- LASSO and ridge are convex optimization problems

Chapter 1. Linear Regression
6. Bayesian approach

Bayesian linear regression

Define a prior for the parameters, e.g. p(w) = N(w | 0, α⁻¹I).
Note the likelihood function is p(t | w) = ∏_n N(t_n | w^T φ(x_n), β⁻¹).
The posterior is Gaussian, p(w | t) = N(w | m_N, S_N), with

    S_N⁻¹ = αI + β Φ^T Φ,   m_N = β S_N Φ^T t

MAP estimation
The maximum a posteriori (MAP) estimate is w_MAP = m_N. Compare with the maximum likelihood estimate, and with the ridge regression solution: the MAP estimate coincides with ridge regression with λ = α/β.
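A sketch of the posterior computation and its agreement with ridge regression (the prior/noise precisions α, β and the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                  # prior and noise precisions
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, beta ** -0.5, x.size)

Phi = np.column_stack([np.ones_like(x), x])          # model y = w0 + w1*x
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t                         # posterior mean = MAP

# MAP equals ridge regression with lam = alpha / beta:
w_ridge = np.linalg.solve((alpha / beta) * np.eye(2) + Phi.T @ Phi, Phi.T @ t)
assert np.allclose(m_N, w_ridge)
```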

How to set the prior
- If using a zero-mean Gaussian prior, the Bayesian (MAP) estimate is equivalent to the ridge regression solution.
- If using a zero-mean Laplace prior, the Bayesian (MAP) estimate (no closed-form expression) is equivalent to the LASSO regression solution.
- Conjugate prior: chosen so that the posterior and prior follow the same family of distributions, e.g. Gaussian.

Example (reproduced from PRML)
0 data points observed. [Figure: prior and data space; parameters for simulation as in PRML.]

Simulation results (1/3)
1 data point observed. [Figure: likelihood, posterior, and data space.]

Simulation results (2/3)
2 data points observed. [Figure: likelihood, posterior, and data space.]

Simulation results (3/3)
20 data points observed. [Figure: posterior and data space.]
The variance of the posterior decreases as the number of data points increases.

Predictive distribution
In the Bayesian framework, every variable has a distribution, such as the predicted output given a new input:

    p(t | x, t) = N(t | m_N^T φ(x), σ_N²(x)),   σ_N²(x) = 1/β + φ(x)^T S_N φ(x)

As the number of data points N increases, the second term of σ_N²(x) will vanish.

Example

Reproduced from PRML: sinusoidal data, 9 Gaussian basis functions, 1 data point.
[Figure: true function, predictive mean, predictive variance, and different predicted functions.]
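The predictive mean and variance used in such figures can be computed as follows (a sketch with a simple linear basis and illustrative α, β):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 10)
t = -0.3 + 0.5 * x + rng.normal(0, beta ** -0.5, x.size)

Phi = np.column_stack([np.ones_like(x), x])
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predict(x_new):
    phi = np.array([1.0, x_new])
    mean = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi   # noise + parameter uncertainty
    return mean, var

mean, var = predict(0.5)
# The predictive variance always exceeds the noise floor 1/beta:
assert var > 1.0 / beta
```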

Simulation results (1/3)
Sinusoidal data, 9 Gaussian basis functions, 2 data points.

Simulation results (2/3)
Sinusoidal data, 9 Gaussian basis functions, 4 data points.

Simulation results (3/3)
Sinusoidal data, 9 Gaussian basis functions, 25 data points.

Model selection
Polynomial curve fitting: how to set the order?

From the statistical perspective
Bayesian model selection: given the dataset, estimate the posterior of different models, p(M_i | D) ∝ p(D | M_i) p(M_i).
ML model selection: choose the model that maximizes the model evidence function p(D | M_i).

Calculation of model evidence for Bayesian linear regression
Details: cf. PRML.

Example

Reproduced from PRML.

More about the hyper-parameters
We can estimate the hyper-parameters α and β based on, e.g., the ML criterion (maximizing the evidence). Define the eigenvalues λ_i of β Φ^T Φ, and let γ = Σ_i λ_i / (α + λ_i).

Interpretation (1/3)
The eigenvalues of the posterior precision αI + β Φ^T Φ are α + λ_i.

Interpretation (2/3)
By decreasing α, more parameters become well determined (learnt) from the data; γ measures the number of learnt parameters.

Interpretation (3/3)
Recall that for estimating Gaussian parameters, the ML variance divides by N while the unbiased estimate divides by N − 1. Now, for Bayesian linear regression, we have

    1/β = (1/(N − γ)) Σ_n (t_n − m_N^T φ(x_n))²

Notes
The hyper-parameters can be further regarded as random variables and integrated into the Bayesian framework.

Chapter summary

Dictionary: bias-variance decomposition; equivalent kernel; Gaussian distribution; Laplace distribution; KKT condition; model selection; prior (conjugate prior); posterior; regularization; sparsity.
Toolbox: basis functions; best subset selection; Lagrange multiplier; LASSO regression; least squares; maximum a posteriori (MAP) estimation, aka Bayesian estimation; maximum likelihood (ML) estimation; ridge regression.