Monthly Archives: April 2013
This article is the first in a series on using R to answer typical business questions with data science. The question answered here is: what is prediction, and how does it work? To demonstrate the concept, I’ve chosen one of the oldest statistical techniques used for prediction: linear regression. Our example uses a widely available data set along with the R statistical language to show how it’s done. There’s no room here for a class on statistics or R, but I’ll provide a couple of resources for learning each.
Our test case involves the steps a data scientist commonly follows when building a prediction model: finding an appropriate data set, doing a little exploratory analysis to understand the data, defining a linear model, fitting a regression line, making plots, and using the model for prediction. The only thing we won’t do here is “data munging,” the often tedious exercise of massaging a dirty data source into a clean version for machine learning purposes. For our example, we’ll use an already-clean data set.
The Data Set and Exploratory Analysis
Let’s start by looking at the data set for our simple example. We’ll go way back to early thinking in linear regression – the 1885 Galton study to define a relationship between a parent’s height and their children’s. The data set is available with the “UsingR” package (http://cran.r-project.org/web/packages/UsingR/index.html). The R statistical software (http://www.r-project.org/) is available as open source software. You should also download the free RStudio development environment (http://www.rstudio.com/).
Here is the R code required to install the package, and open the data set:
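A minimal version, assuming the package is installed from CRAN with the standard install.packages and library calls:

```r
# Install the package once (assumes a CRAN mirror is reachable)
install.packages("UsingR")

# Load the package and bring the Galton data set into the session
library(UsingR)
data(galton)
```

After this, galton is available as a data frame with the columns child and parent.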
The data set consists of 928 pairs of child and parent heights in inches. For unfamiliar data sets, you typically explore the data with one or more of the following R commands (although there are many more ways to explore data sets in R):
head(galton) # Show first few records
tail(galton) # Show last few records
summary(galton) # Statistical summary
table(galton) # Cross-tabulate child and parent heights
hist(galton$child) # A visual distribution of values
We can also draw a simple scatter plot of child height versus parent height. This plot exhibits an ideal case for using regression: a cloud-shaped collection of data points.
plot(galton$parent, galton$child, pch=19, col="blue")
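Because the heights in this data set are tabulated values, many of the 928 points land exactly on top of one another in the plain scatter plot. One way to make the cloud easier to see is to add a little random noise before plotting. This is only a sketch: it assumes the UsingR package is installed, and the jitter factor of 2 is an arbitrary choice.

```r
library(UsingR)
data(galton)

# Count how many distinct (child, parent) pairs there really are
nrow(unique(galton))

# jitter() adds small random noise so overlapping points spread apart
plot(jitter(galton$parent, factor = 2),
     jitter(galton$child, factor = 2),
     pch = 19, col = "blue",
     xlab = "parent height (in)", ylab = "child height (in)")
```

The jittered values are for display only; the model should still be fitted to the original columns.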
Fitting a Linear Model
Next, we can fit a line to the Galton data using a simple linear model. In R this is very easy to do: just use the lm function, where parent is the explanatory (independent) variable and child is the response variable (the one to be predicted).
lm1 <- lm(galton$child ~ galton$parent)
If you display the contents of the lm1 model variable (just type lm1 at the RStudio command prompt), you’ll see it contains a number of items, including the coefficients of the linear model: the intercept 23.942 and the slope 0.646. These values are stored in a two-element vector, lm1$coeff. The slope means that each additional inch of parent height is associated with an average increase of 0.646 inch in child height. In other words, think back to high school algebra class when you graphed a line: child = 23.942 + 0.646 * parent. It is as simple as that!
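As a quick check of that reading of the slope, here is a small sketch that rebuilds the line from the two coefficients. It assumes the UsingR package is installed; the helper line_at is a hypothetical name introduced only for illustration.

```r
library(UsingR)
data(galton)

lm1 <- lm(galton$child ~ galton$parent)

# Coefficients by position: intercept first, slope second
intercept <- unname(coef(lm1)[1])
slope     <- unname(coef(lm1)[2])

# The fitted line: child = intercept + slope * parent
line_at <- function(parent_height) intercept + slope * parent_height

# Moving one inch along the parent axis raises the line by exactly the slope
line_at(69) - line_at(68)
```

The difference printed on the last line equals the slope, whichever two parent heights one inch apart you choose.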
Now let’s complete the picture by adding the regression line to the plot using the vector of fitted values found in lm1$fitted:
lines(galton$parent, lm1$fitted, col="red", lwd=3)
You can also plot the residual values (response values minus fitted values) using:
plot(galton$parent, lm1$residuals, pch=19, col="blue")
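To make the definition of a residual concrete, the sketch below computes the residuals by hand and compares them with the ones stored in the model. It assumes the UsingR package is installed; manual_resid is a name introduced here for illustration.

```r
library(UsingR)
data(galton)

lm1 <- lm(galton$child ~ galton$parent)

# Residuals are the observed responses minus the fitted values
manual_resid <- galton$child - fitted(lm1)
all.equal(unname(manual_resid), unname(resid(lm1)))

# With an intercept in the model, least-squares residuals average to zero
mean(resid(lm1))

# A horizontal reference line at zero makes any pattern easier to spot
plot(galton$parent, resid(lm1), pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 3)
```

If the points scatter evenly above and below the zero line across the range of parent heights, a straight line is a reasonable description of the data.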
Using the Model to Predict
Now you can use the model to make predictions using new data. For example, say you have a parent height of 80 (which is outside the range of the data set we used to train the model) and you want to predict child height. Just use the coefficients of the model:
lm1$coeff[1] + lm1$coeff[2] * 80
The predicted child height is about 75.6 inches. You can improve model accuracy by using a larger training set and refitting the linear model.
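R also provides the predict function for this step. The sketch below is one way to use it, assuming the UsingR package is installed. The model is refitted with the data argument (under the hypothetical name lm2) so that predict can match the column name parent in the new data frame.

```r
library(UsingR)
data(galton)

# Refit using column names plus data=, so predict() can find "parent"
lm2 <- lm(child ~ parent, data = galton)

# New observations go in a data frame with the same column name
new_parent <- data.frame(parent = 80)

# Point prediction with a 95% prediction interval for a single child
predict(lm2, newdata = new_parent, interval = "prediction")

# 80 inches lies outside the observed parent heights: this is extrapolation
range(galton$parent)
```

The fit column of the result matches the hand calculation from the coefficients, and the interval shows how uncertain a single prediction is, especially this far outside the training range.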
Uses of Linear Regression
Businesses use linear regression forecasting techniques in a multitude of ways. For example, you can model Google pay-per-click advertising costs versus sales: just download your cost data from Google AdWords and build an Excel spreadsheet that matches costs to sales by time period. Another example might be using the linear model to predict productivity gains resulting from a training program. The uses of linear regression are very broad: manufacturing, supply chain, real estate, the financial sector, and much more.
To supplement this article, here are a couple of excellent resources for you to learn more about general statistics and R.
• The Art of R Programming (http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf)
• OpenIntro Statistics (http://www.openintro.org/stat/textbook.php)