# Session 7: Programming strategies Session 2: Programming strategies S Programming (the book) 1 Introduction 2 The S Language: Syntax and Semantics 5 3 The S Language: Advanced Aspects 4 Classes 5 New-style Classes 6 Using Compiled Code 7 General Strategies and Extended Examples 8 S Software Development 9 Interfaces under Windows A Compiling and Loading Code B The Interactive Environment C BATCH Operation References 255 Index 7 July 2006 S Programming in R 1

39 75 99 123 151 179 205 235 247 253 257 2 An Introductory Example How many balls of diameter 2 can you fit inside a box of side 2? n/2 n The volume of a sphere of side r is r n / 2 1 where n is the number of dimensions and r = 1, here.

The volume of the cube is 2 The proportion of the cube occupied by the sphere is n/2 Therefore the maximum number is n / 2 1 2 7 July 2006 n S Programming in R n / 2 1 n/2 3 A function to evaluate the relative volume

rvball <- function(d) exp(d/2 * log(pi) - d * log(2) - lgamma(d/2 + 1)) The function is vectorized so we can check several cases at once: > structure(floor(1/rvball(1:10)), names = 1:10) 1 2 3 4 5 6 7 8 9 10 1 1 1 3 6 12 27 63 155 401 Consider now how we might go about checking this relative volume formula by simulation. Sample n points at random in the origin-centred cube. Calculate how many of them are closer to the origin than 1 in the sense of Euclidean distance, say k, Return k and n, but can we make it behave like the ratio k/n when that would be convenient?

7 July 2006 S Programming in R 4 The algorithm Suppose we want to simulate N points inside the box X <- matrix(runif(N * d, -1, 1), nrow = N, ncol = d) The points will lie inside the ball if their distance to the origin is less than or equal to one. Calculate the (squared) distances: dis2 <- (X * X) %*% rep(1, d) or rowSums(X * X) Find how many are leq 1 inside <- sum(dis2 <= 1) Work in blocks to avoid excessive memory overhead 7 July 2006 S Programming in R 5

We allow for large numbers of trials but conduct them in blocks so as not to use too much memory all at once. The object will carry information about itself, including how it was generated. The object has a class so we can provide methods for key generic functions that allow results to be conveniently handled. mcrvball

6 A print method for "mcrvball" objects. print.mcrvball <- function(x, ...) { cat("Dim.:", x\$d, "Estimated:", signif(x\$inside/x\$total, 4), "Actual:", signif(rvball(x\$dim), 4), "\n") invisible(x) } Ops.mcrvball <- function(e1, e2) { if(!is.null(class(e1)) && class(e1) == "mcrvball") e1 <- e1\$inside/e1\$total if(!missing(e2) && !is.null(class(e2)) && class(e2) == "mcrvball") e2 <- e2\$inside/e2\$total NextMethod() } The second method allows arithmetic operations. 7 July 2006 S Programming in R 7

A test run > for(i in 4:10) print(mcrvball(i, 1000000)) Dim.: 4 Estimated: 0.3089 Actual: 0.3084 Dim.: 5 Estimated: 0.1650 Actual: 0.1645 Dim.: 6 Estimated: 0.08031 Actual: 0.08075 Dim.: 7 Estimated: 0.03687 Actual: 0.03691 Dim.: 8 Estimated: 0.01575 Actual: 0.01585 Dim.: 9 Estimated: 0.006486 Actual: 0.006442 Dim.: 10 Estimated: 0.002495 Actual: 0.00249 7 July 2006 S Programming in R 8 > X <- matrix(NA, 7, 2) > X[,2] <- floor(1/rvball(4:10)) > for(i in 4:10) X[i-3,1] <- floor(1/mcrvball(i, 100000)) > dimnames(X) <- list(4:10, c("Monte Carlo", "Actual"))

> X 4 5 6 7 8 9 10 Monte Carlo Actual 3 3 6 6 12 12 27 27 62 63 141 155 398 401

Are the Monte Carlo results likely to be biased downwards? 7 July 2006 S Programming in R 9 The call component and updating If an object has a call component, update() may be used to create a new object by making simple modifications to the call. Here it is not very useful, but it does work: > p1 <- mcrvball(10) > floor(1/p1)  555 > p2 <- update(p1, N = 200000) > floor(1/p2)  408 Special methods for lm and glm objects allow convenient forms for changes to formulas, such as the use of the period. 7 July 2006 S Programming in R 10

Combining two estimates We combine two relative frequencies by adding numerators and denominators. The most convenient way to do this is with a binary operator. "%+%" <- function(e1, e2) UseMethod("%+%") "%+%.mcrvball" <- function(e1, e2) { if(e1\$dimensions != e2\$dimensions) stop("ball dimensions differ!") res <- list(dimensions = e1\$dimensions, inside = e1\$inside + e2\$inside, total = e1\$total + e2\$total, call = e1\$call) oldClass(res) <- "mcrvball" res } > p1 %+% p2 Dim.: 10 Estimated: 0.002414 Actual: 0.00249 > floor(1/(p1 %+% p2))  414 7 July 2006

S Programming in R 11 Collecting results as we go > p0 <- p1 > for(i in 1:10) p0 <- p0 %+% print(update(p1)) Dim.: 10 Estimated: 0.0023 Actual: 0.00249 Dim.: 10 Estimated: 0.0030 Actual: 0.00249 ..... Dim.: 10 Estimated: 0.0022 Actual: 0.00249 Dim.: 10 Estimated: 0.0039 Actual: 0.00249 > p0 Dim.: 10 Estimated: 0.002473 Actual: 0.00249 > floor(1/p0)  404 > unlist(sapply(p0, as.character)) dimensions inside total call1 call2 "10" "272" "110000" "mcrvball" "10" 7 July 2006

S Programming in R 12 Some lessons 1. Vectorization. (rvball) 2. Taking the whole object view of the problem. (mcrvball) 3. Object orientation: put all the information you are likely to need into the object and give it a class. (mcrvball) 4. Methods for existing generics. (print.mcrvball) 5. Group generic functions. (Ops.mcrvball) 6. Binary operators and new generic functions. (%+%) 7 July 2006 S Programming in R 13 Some under-used array and other facilities 1. cumsum, cummax, cummin, cumprod 2. pmax, pmin 3. The empty index especially in replacements.

4. row, col and index.slice 5. outer 6. aperm 7. diag and diag<8. A matrix as an index. 7 July 2006 S Programming in R 14 Example: write a function to construct a tri-diagonal matrix with constant values in each diagonal position. > tridiag <- function(r, v) { cind <- as.vector(outer(-1:1, 1:r, "+")) rind <- rep(1:r, each = 3) mind <- cbind(rind, cind) mind <- mind[ -c(1, 3*r), ] X <- matrix(0, r, r) X[mind] <- rep(v, r)[ -c(1, 3*r)] X } > tridiag(4, c(1,2,1)) [,1] [,2] [,3] [,4] [1,] 2

1 0 0 [2,] 1 2 1 0 [3,] 0 1 2 1 [4,] 0 0 1 2 7 July 2006 S Programming in R 15 A direct calculation

> > > > X <- matrix(0, 5, 5) X[row(X) == col(X)] <- 2 X[abs(row(X) - col(X)) == 1] <- 1 X [,1] [,2] [,3] [,4] [,5] [1,] 2 1 0 0 0 [2,] 1 2 1 0 0 [3,] 0 1

2 1 0 [4,] 0 0 1 2 1 [5,] 0 0 0 1 2 7 July 2006 S Programming in R 16 Some little-used debugging functions and support systems In approximately increasing order of investment

and payoff: 1. cat or print statements, strategically placed. 2. trace, untrace and tprint. 3. browser. 4. inspect. 5. emacs with ESS. 7 July 2006 S Programming in R 17 Compiled code and Packages Example: a function to calculate the convolution of two numerical sequences a = {ai }, b = {bj }, (a * b)k = ab

i j i +j =k We first consider a C function to do the crude arithmetic (next slide) and how to link it to R Next consider four possible R functions to perform the computation Finally we look at system timings for a relatively large computation. 7 July 2006 S Programming in R 18 The C function in do_convolve.c void do_convolve(double *a, long *na, double *b, long *nb, double *ab) { int i, j, nab = *na + *nb - 1; for(i = 0; i < nab; i++) ab[i] = 0.0; for(i = 0; i < *na; i++) for(j = 0; j < *nb; j++)

ab[i + j] += a[i] * b[j]; } 7 July 2006 S Programming in R 19 Making the dll All arguments must be pointers Some care needed to ensure that the arguments supplied to the function have the correct storage mode (see next slide) Creating the dll using MinGW compilers is easy. Make sure R.exe and gcc are both visible on the %PATH% (Windows) environment variable In a DOS window: R CMD SHLIB do_convolve.c Several files can be compiled at once, Fortran and C mixed, &c. 7 July 2006 S Programming in R 20

Loading the shared library This is done in the function we propose to use: convolve3 <- function(a, b) { if(!is.loaded(symbol.C("do_convolve"))) dyn.load("D:/Data/R Course/do_convolve.dll") storage.mode(a) <- "double" storage.mode(b) <- "double" .C("do_convolve", a, length(a), b, length(b), ab = double(length(a) + length(b) - 1))\$ab } 7 July 2006 S Programming in R 21 Notes .C() takes the name of the entry point as a character string as its first argument subsequent arguments correspond to those of the C function itself, with correct storage mode ensured

Subsequent arguments may be named The result is a list with names as supplied to the arguments. The results of the C function must be reflected as changes in the supplied arguments In our case only one result was needed, so this is the only one we named (ab) is.loaded( ) may be used to check if an entry point is visible symbol.C() and symbol.For() can be used to ensure portability across platforms 7 July 2006 S Programming in R 22 Three pure S code contenders convolve0 <- function(a, b) { ## nave in the extreme ab <- rep(0, length(a) + length(b) - 1) for(i in 1:length(a)) for(j in 1:length(b)) ab[i+j-1] <- ab[i+j-1] + a[i]*b[j] ab } convolve1 <- function(a, b) {

## replace loop with vec. ops. ab <- rep(0, length(a) + length(b) - 1) ind <- 1:length(a) for(j in 1:length(b)) { ab[ind] <- ab[ind] + a*b[j] ind <- ind + 1 } ab } 7 July 2006 S Programming in R 23 Using the operators Contrary to popular belief the speed of R's interpreter is rarely the limiting factor to R's speed. People treating R like C is typically the limiting factor. You have vector operations, USE THEM! -- Byron Ellis, R-help (October 2005) convolve2 <- function(a, b) { ab <- outer(a, b)

tapply(ab, row(ab) + col(ab), sum) } 7 July 2006 S Programming in R 24 A timing run > > > > p <- runif(10) p <- p/sum(p) x <- x0 <- x1a <- x1b <- x2 <- x3 <- p times <- rbind( a0 = system.time(for(j in 1:400) x0 <- convolve0(x, x0))[1:3], a1a =system.time(for(j in 1:400) x1a <- convolve1(x, x1a))[1:3], a1b =system.time(for(j in 1:400) x1b <- convolve1(x1b, x))[1:3],

a2 = system.time(for(j in 1:400) x2 <- convolve2(x, x2))[1:3], a3 = system.time(for(j in 1:400) x3 <- convolve3(x, x3))[1:3]) > 7 July 2006 S Programming in R 25 > range(abs(x0-x1a)+abs(x1a-x1b)+ abs(x1b-x2)+abs(x2-x3))  0.000000e+00 1.040834e-17 > dimnames(times)[] times user CPU system CPU elapsed a0 94.35 1.94 96.78 a1a 9.97

0.16 10.15 a1b 1.78 0.01 1.78 a2 14.17 0.20 14.40 a3 0.27 0.00 0.27 > plot(x0[x0 > .00001], type = "h", col = "red") 7 July 2006 S Programming in R 26 The central limit theorem strikes again! 7 July 2006

S Programming in R 27 A more dynamic illustration convolveTest <- function(x, convolve = convolve3, k = 1000, goff = TRUE) { if(goff) graphics.off() t0 <- proc.time() xc <- 1 x <- abs(x)/sum(abs(x)) mu1 <- sum(x*(1:length(x) - 1)) va1 <- sum(x*(1:length(x) - 1 - mu1)^2) mu <- va <- 0 for(i in 1:k) { mu <- mu + mu1 va <- va + va1 stdev <- sqrt(va) xc <- convolve(xc, x) g3 <- sum(xc*(1:length(xc) - 1 - mu)^3)/stdev^3 g4 <- sum(xc*(1:length(xc) - 1 - mu)^4)/stdev^4 - 3 wh <- which(xc > 0.001*max(xc)) z <- (wh - 1 - mu)/stdev 7 July 2006

S Programming in R 30 p <- spline(z, c(0, diff(pnorm(z+0.5/stdev))), n = 500) plot(z, xc[wh], type = "h", ylab = "Prob", col = "orange", ylim = range(sp\$y, xc[wh]), xlim = c(-4,4)) lines(sp, col = "seagreen") par(usr = c(0,1,0,1)) text(0.1, 9:7/10, c(i, format(round(c(g3, g4), 5), scientific = FALSE)), cex = 1.5, font = 14, adj = 0) abline(h = 0) } t1 <- (proc.time() - t0)[1:3] names(t1) <- paste(c("User", "System", "Elapsed"), "secs") t1 } 7 July 2006

S Programming in R 31 Notes The slick version is NOT the fastest, probably because of memory use. Both conventional S versions are much faster than the translated Fortran first version With the loop+vector version, it pays to loop over the shorter of the two vectors. This first invocation will, in fact, be the slowest as it needs to load the dll before it does anything. A good strategy is to reserve C for the heavy computational part of the calculation but do everything else in pure S. The .Call interface allows you to manipulate R objects directly from within C code. This requires much more specialised coding using C macros that are rather sketchily documented (but R is better than S-PLUS in this respect!) 7 July 2006 S Programming in R

32 Making and maintaining Packages Read the section in Writing R Extensions (at least once). Read the readme.packages file and consult the web portals for the tools. Make sure they are visible within your %PATH %. This is a very specialised area, but the payoff is large when you have it under control The function package.skeleton() establishes a good part of the work needing to be done, but then you have to do it Building packages does the compilation as part of the process. install.packages() and update.packages() may be used to get packages from CRAN (or .zip files) and update packages from CRAN from within R itself. They require administrator capabilities within Windows (usually). 7 July 2006 S Programming in R 33