# Exploratory Data Analysis Overview of TBW Data: Models to Represent the Relationships Between Variables (Regression)

## Learning Objectives

- Develop a model to estimate an output variable from input variables.
- Select from a variety of modeling approaches in developing a model.
- Quantify the uncertainty in model predictions.
- Use models to provide forecasts or predictions for inputs different from any previously observed.

## Readings

- Kottegoda and Rosso, Chapter 6
- Helsel and Hirsch, Chapters 9 and 11
- Hastie, Tibshirani and Friedman, Chapters 1-2
- Matlab Statistics Toolbox User's Guide, Chapter 4

## Regression

Regression is the use of mathematical functions to model and investigate the dependence of one variable, say Y, called the response variable, on one or more other observed variables, say X, known as the explanatory variables.

- Do not search for a cause-and-effect relationship without prior knowledge.
- Regression is an iterative process: formulate, fit, evaluate, validate.

## A Rose by Any Other Name...

| Explanatory variable | Response variable |
|---|---|
| Independent variable | Dependent variable |
| x-value | y-value |
| Predictor | Predictand |
| Input | Output |
| Regressor | |

## The Modeling Process

1. Data gathering and exploratory data analysis
2. Conceptual model development (hypothesis formulation)
3. Applying various forms of models to see which relationships work and which do not
4. Parameter estimation
5. Diagnostic testing
6. Interpretation of results

## Conceptual Model of System to Guide Analysis

- Natural drivers
  - Climate states: ENSO, PDO, NAO
  - Rainfall
  - Other climate variables: temperature, humidity
- Management drivers
  - Groundwater pumping
  - Surface water withdrawals
  - Surface water releases from storage
- Responses: groundwater level, streamflow

## Conceptual Model

*(Figure: conceptual diagram of the Great Salt Lake system. Solar radiation, precipitation, air humidity, and air temperature drive mountain snowpack and evaporation; the Bear, Weber, and Jordan Rivers deliver streamflow; soil moisture and groundwater also feed the lake; GSL level, volume, area, and salinity respond.)*

## Bear River Basin Macro-Hydrology

Streamflow response to basin and annual average forcing:

- *(Figure: streamflow Q/A (50-200 mm) versus precipitation (200-900 mm), with a LOWESS fit (R defaults). The runoff ratio rises from about 0.10 at the dry end to about 0.18 at the wet end.)*
- *(Figure: streamflow Q/A versus temperature (2.5-5.5 °C), with a LOWESS fit (R defaults).)*

## Great Salt Lake Evaporation

- *(Figure: annual evaporation loss E/A (0.4-1.2 m) versus lake area (2.5e9-5.5e9 m²), with a LOWESS fit (R defaults).)* Salinity decreases as volume increases; evaporation increases as salinity decreases.
- Salinity is estimated from total load and volume, C = 3.5 × 10¹² kg / Volume, in g/l. The decrease in E/A with decreasing lake volume is related to the increase in C. *(Figure: E/A versus salinity C (100-300 g/l), with a LOWESS fit.)*
- *(Figure: E/A versus annual temperature (9.0-12.0 °C), with a LOWESS fit.)*
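As a unit check on the load-volume relation above: C = (3.5 × 10¹² kg)/V in kg/m³ is numerically equal to g/l. A small sketch; the volumes used are illustrative, not values from the slides:

```python
# Salinity from total load and volume: C = L / V
# Total load from the slides: 3.5e12 kg. Volumes below are illustrative only.
LOAD_KG = 3.5e12

def salinity_g_per_l(volume_m3):
    """C = load/volume in kg/m^3, which is numerically equal to g/l."""
    return LOAD_KG / volume_m3

# Halving the lake volume doubles the salinity
print(salinity_g_per_l(3.0e10))  # ~117 g/l
print(salinity_g_per_l(1.5e10))  # ~233 g/l
```

This inverse relation is why a shrinking lake concentrates salt and, in turn, suppresses evaporation per unit area.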

## Conclusions

*(Figure: the conceptual diagram annotated with the dominant controls: air temperature increases evaporation; salinity (C = L/V) reduces evaporation; lake area controls total evaporative loss; precipitation and streamflow from the Bear (dominant), Weber, and Jordan Rivers supply the lake, with contributions from soil moisture and groundwater.)*

## Considerations in Model Selection

- Choice of complexity in the functional relationship; theoretically infinite choice of type of functional relationship
- Classes of functional relationships
- Interplay between bias, variance, and model complexity
- Generality/transferability: prediction capability on independent test data

## Model Selection Choices: Complexity, Generality, Transferability

*(Figures: the same (x, y) data fit two ways. An interpolation, y[order(x)] versus x[order(x)], passes through every point, so RSS = 0. A smooth functional fit leaves residuals, so RSS > 0.)*

## How Do We Quantify the Fit to the Data?

- Residual: the difference between the fitted value $f(x_i)$ and the observed value $y_i$: $e_i = f(x_i) - y_i$
- Residual Sum of Squared Error: $\mathrm{RSS} = \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2$

## Interpolation or Function Fitting?

Which has the smallest fitting error? Is that a valid measure? Each is useful for its own purpose. Selection may hinge on considerations outside the data, such as the nature and purpose of the model and understanding of the process it represents.
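The residual and RSS definitions, and the RSS = 0 versus RSS > 0 contrast, can be checked in a few lines of NumPy. The data here are synthetic, assumed for illustration:

```python
import numpy as np

# Hypothetical data: noisy observations of a linear trend
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Functional fit: least-squares line f(x) = a + b*x
b, a = np.polyfit(x, y, deg=1)  # polyfit returns [slope, intercept]
f = a + b * x

# Residuals e_i = f(x_i) - y_i and their sum of squares
e = f - y
rss = np.sum(e**2)

# An interpolant passes through every point, so its RSS is exactly 0
f_interp = np.interp(x, x, y)
rss_interp = np.sum((f_interp - y)**2)

print(rss > 0.0)          # the smooth fit leaves nonzero error
print(rss_interp == 0.0)  # the interpolation has zero fitting error
```

The interpolation "wins" on fitting error by construction, which is exactly why RSS alone is not a valid basis for choosing between the two.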

## Another Example

*(Figures: data y2 versus x on [-2, 2], shown as a raw scatter, as an interpolation through every point, and as a linear functional fit.)*

Which is better? Is a linear approximation appropriate? The actual functional relationship is random noise added to a cyclical function.

## Another Example of Two Approaches to Prediction

**Linear model fit by least squares:**

$\hat{Y}(x_0) = \hat{a} + \hat{b} x_0$

*(Figure: Linear Regression Fit, y versus x on [0, 1].)*

**k-Nearest Neighbor fit (k = 20):**

$\hat{Y}(x_0) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} y_i$

*(Figure: k-Nearest Neighbor Fit to the same data.)*

## General Function Fitting

$y = f(x)$, or with several inputs, $y = f(x_1, x_2, x_3, \ldots)$. For n independent data samples:

$y_i = f(x_{i1}, x_{i2}, x_{i3}, \ldots) + \varepsilon_i, \quad i = 1, \ldots, n$

Each row $(x_1, x_2, x_3, y)$ is an independent data vector of inputs and output. Example, linear regression: $y = a x + b + \varepsilon$.

## Statistical Decision Theory

- X: inputs, p-dimensional, real valued
- Y: real valued output variable
- Joint distribution Pr(X, Y)
- Seek a function f(X) for predicting Y given X
- Loss function to penalize errors in prediction, e.g. squared error $L(Y, f(X)) = (Y - f(X))^2$ or absolute error $L(Y, f(X)) = |Y - f(X)|$
- Criterion for choosing f: minimize expected loss, e.g. $E[L] = E[(Y - f(X))^2]$
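The two predictors, a least-squares line and a k-nearest-neighbor average, can be sketched side by side. The training data and the cyclical true function here are assumptions for illustration, not the slides' dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: y = f(x) + noise on [0, 1]
n = 100
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)  # assumed cyclical f

def linear_fit_predict(x_train, y_train, x0):
    """Least squares: Y_hat(x0) = a_hat + b_hat * x0."""
    b, a = np.polyfit(x_train, y_train, deg=1)
    return a + b * x0

def knn_predict(x_train, y_train, x0, k=20):
    """Average y over the k nearest neighbors of x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

x0 = 0.25
print(linear_fit_predict(x, y, x0))  # global linear approximation
print(knn_predict(x, y, x0, k=20))   # local average near x0
```

On a cyclical f, the local average tracks the curve while the global line cannot, which previews the bias-variance comparison later in the slides.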

## The Regression Function

$f(x) = E[Y \mid X = x]$

The conditional expectation, known as the regression function. This is the best prediction of Y at any point X = x when "best" is measured by average squared error.

**Basis for the nearest neighbor method:** $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$. The expectation is approximated by averaging over sample data, and conditioning at a point is relaxed to conditioning on some region close to the target point.

**Basis for linear regression:** model based. Assume a model, f(x) = a + b x, plug f(X) into the expected loss, $E[L] = E[(Y - a - bX)^2]$, and solve theoretically for the a and b that minimize it. This does not condition on X; rather it uses (assumed) knowledge of the functional relationship to pool over values of X.
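One way to see why the conditional mean is optimal under squared error: for a fixed sample, the constant c that minimizes the average of $(y_i - c)^2$ is the mean, while the median minimizes the average of $|y_i - c|$. A small numerical check (a sketch with assumed synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(1.0, 500)  # skewed sample, so mean != median

# Evaluate both losses over a grid of candidate constant predictions c
c = np.linspace(0.0, 3.0, 3001)
sq_loss = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
abs_loss = np.abs(y[:, None] - c[None, :]).mean(axis=0)

c_sq = c[np.argmin(sq_loss)]    # minimizer of mean squared loss
c_abs = c[np.argmin(abs_loss)]  # minimizer of mean absolute loss

print(c_sq, np.mean(y))     # close to the sample mean
print(c_abs, np.median(y))  # close to the sample median
```

Changing the loss function changes the optimal predictor: squared error leads to E[Y|X=x], absolute error leads to the conditional median.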

## Comparison of Assumptions

- The linear model fit by least squares assumes f(x) is well approximated by a global linear function.
- k nearest neighbor assumes f(x) is well approximated by a locally constant function.

*(Figure: Linear Regression Fit to data generated as $y = f(x) + \varepsilon$, $\varepsilon \sim N(0, 0.2)$. Data MSE, Mean((ŷ − y)²) = 0.0459; model MSE, Mean((f(x) − ŷ)²) = 0.00605.)*

*(Figure: k-Nearest Neighbor Fit, k = 20, to the same data. Data MSE, Mean((ŷ − y)²) = 0.0408; model MSE, Mean((f(x) − ŷ)²) = 0.00262.)*

## MSE of Model and Data

*(Figure: mean squared error versus k (20 to 100) for the k-nearest-neighbor fit: data MSE, Mean((ŷ − y)²), and model MSE, Mean((f(x) − ŷ)²), with the linear-regression data and model MSE shown as reference lines.)*

*(Figure: k-Nearest Neighbor Fit, k = 60. Data MSE = 0.0661; model MSE = 0.0221.)*

## Bias-Variance Decomposition

50 sets of samples were generated, and for each, $\hat{f}(x_0)$ was calculated at specific $x_0$ values for the linear fit and the knn fit:

$\mathrm{Err}(x_0) = E[(\hat{f}(x_0) - f(x_0))^2] = \mathrm{Var}(\hat{f}(x_0)) + \left( E[\hat{f}(x_0)] - f(x_0) \right)^2 = \text{Variance} + \text{Bias}^2$

| Model | $x_0$ | MSE | Variance | Bias² |
|---|---|---|---|---|
| Linear | 0.8 | 0.0012 | 0.0010 | 0.0001 |
| Linear | 0.6 | 0.0047 | 0.0005 | 0.0041 |
| k = 20 | 0.8 | 0.0025 | 0.0025 | 7.7E-06 |
| k = 20 | 0.6 | 0.0023 | 0.0023 | 1.4E-08 |
| k = 40 | 0.8 | 0.0025 | 0.0024 | 6.3E-05 |
| k = 40 | 0.6 | 0.0019 | 0.0018 | 0.0001 |

*(Figures: mse, bias², and variance versus k (20 to 100) at $x_0$ = 0.8 and $x_0$ = 0.6, with dashed lines showing the corresponding linear regression values.)*
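The Monte Carlo experiment above (generate many training sets, evaluate $\hat{f}(x_0)$ on each, split the error into variance plus squared bias) can be reproduced in outline. The true f below is an assumed stand-in, so the numbers will differ from the table:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical smooth truth; the slides' exact f(x) is not reproduced here
    return np.sin(2 * np.pi * x)

def knn_predict(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

n, sigma, x0, k, n_sets = 100, 0.2, 0.6, 20, 50

# Generate 50 independent training sets and record f_hat(x0) for each
preds = []
for _ in range(n_sets):
    x = rng.uniform(0.0, 1.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    preds.append(knn_predict(x, y, x0, k))
preds = np.array(preds)

variance = preds.var()                  # spread of f_hat(x0) across sets
bias2 = (preds.mean() - f(x0)) ** 2     # squared offset from the truth
mse = ((preds - f(x0)) ** 2).mean()     # Err(x0)

# The decomposition holds exactly for these sample statistics
print(mse, variance + bias2)
```

Rerunning with larger k or with a linear fit in place of `knn_predict` shows the trade: variance falls as k grows while bias grows, just as in the table.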

## Simple Linear Regression Model

$E(Y \mid x) = \beta_0 + \beta_1 x$

$\mathrm{Var}(Y \mid x) = \sigma^2$

$Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$

(Kottegoda and Rosso page 343)

Regression is performed to:

- learn something about the relationship between variables;
- remove a portion of the variation in one variable (a portion that may not be of interest) in order to gain a better understanding of some other, more interesting, portion;
- estimate or predict values of one variable based on knowledge of another variable.

(Helsel and Hirsch page 222)

## Regression Assumptions

(Helsel and Hirsch page 225)

## Regression Diagnostics

- Residuals (Kottegoda and Rosso page 350)
- Antecedent residual (Kottegoda and Rosso page 350)
- Test residuals for normality (Kottegoda and Rosso page 351)
- Residual versus explanatory variable (Kottegoda and Rosso page 351)
- Residual versus predicted response variable
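A minimal numerical sketch of fitting the simple linear regression model and extracting the residuals used in these diagnostics, with NumPy and synthetic data (not the slides' dataset):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data satisfying the model: Y ~ N(beta0 + beta1*x, sigma^2)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0.0, 10.0, 50)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)

# Least-squares estimates (np.polyfit returns [slope, intercept])
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

# Fitted values and residuals, the raw material for the diagnostic plots
y_hat = b0_hat + b1_hat * x
resid = y - y_hat

# Unbiased estimate of sigma^2 uses n - 2 degrees of freedom
sigma2_hat = np.sum(resid**2) / (x.size - 2)

print(b0_hat, b1_hat, sigma2_hat)
# Least-squares residuals are uncorrelated with x by construction,
# so any visible trend in a residual plot signals model misfit
print(np.corrcoef(x, resid)[0, 1])
```

Plotting `resid` against `x` (explanatory variable) and against `y_hat` (predicted response) gives the two diagnostic plots listed above.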

(Residual versus predicted response plots: Helsel and Hirsch page 232.)

## Quantile-Quantile Plots

*(Figures: Normal Q-Q plot for raw flows, sample quantiles 0 to 4000, and Q-Q plot for log-transformed flows, sample quantiles 3 to 8, each against theoretical quantiles from −3 to 3.)*

The raw flows need a transformation to normalize the data.
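The straightness of a Q-Q plot can be summarized by the probability-plot correlation coefficient. A sketch with SciPy's `probplot`, using synthetic lognormal "flows" rather than the actual record:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
flows = rng.lognormal(mean=6.0, sigma=1.0, size=200)  # synthetic skewed flows

# probplot returns (theoretical quantiles, ordered sample) and a line fit
# whose r measures how straight the Q-Q plot is
(_, _), (_, _, r_raw) = stats.probplot(flows, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(flows), dist="norm")

# The log-transformed data follow the normal quantile line more closely
print(r_raw, r_log)
```

The jump in r after the log transform is the numerical counterpart of the two panels above: skewed raw flows bend away from the line, log flows track it.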

## Bulging Rule for Transformations

- Up: power > 1 (x², etc.)
- Down: power < 1 (log x, 1/x, √x, etc.)

(Helsel and Hirsch page 229)

## Box-Cox Transformation

$z = \dfrac{x^{\lambda} - 1}{\lambda}, \quad \lambda \neq 0$

$z = \ln(x), \quad \lambda = 0$

(Kottegoda and Rosso page 381)

## Box-Cox Normality Plot for Alafia R.

*(Figure: Filliben's statistic (PPCC) versus the Box-Cox λ, from −2 to 2, for monthly September flows on the Alafia R. The optimal λ = −0.14 is close to 0, supporting a log transformation.)*
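SciPy can search for the Box-Cox λ directly: `boxcox` uses maximum likelihood, and `boxcox_normmax` with `method="pearsonr"` maximizes the probability-plot correlation, the same PPCC idea behind the Filliben's-statistic plot. The data here are synthetic lognormal (so λ near 0 is expected), not the Alafia R. record:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.lognormal(mean=2.0, sigma=0.5, size=500)  # log-normal, so lambda ~ 0

# Maximum-likelihood lambda, plus the transformed data
z, lam = stats.boxcox(x)
print(lam)  # should come out near 0

# PPCC-based choice, analogous to the Filliben's-statistic plot
lam_ppcc = stats.boxcox_normmax(x, method="pearsonr")
print(lam_ppcc)

# The transform matches z = (x**lam - 1)/lam for lam != 0
print(np.allclose(z, (x**lam - 1.0) / lam))
```

With the real September flows, the same call should land near the λ = −0.14 reported in the slide.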