Great news! I am scheduled to teach a brand new class at UCLA Extension this coming Winter 2014 quarter starting January 8. Here is a LINK to the class description in the UNEX course catalog. As part of the course materials, I am providing all the source code from the required textbook “The Art of R Programming” by Norman Matloff via Dropbox. You’ll find a series of folders, one for each chapter, containing the R scripts for the examples found in the book, plus a few missing data files that were referenced but never provided. Unfortunately, the author did not provide most of the sample source code on the book website made available from the publisher, so rather than have my students type in all the code, I’ve provided an annotated version to help the learning process. Feel free to grab the scripts even if you’re not taking my class.
Additionally, Matloff has made available a 2009 draft of the book on his personal website, so feel free to download the book HERE (pdf).
See you all in class!
The field of data science is heating up fast. The following list of educational resources will let you join the data revolution by getting up to speed with data science.
Data science — and the driving force behind it, machine learning — is the process of deriving added value from data assets. Commerce and research are being transformed by data-driven discovery and prediction. Skills required for data analytics at massive levels span a variety of disciplines and are not easy to obtain through conventional curricula. These include algorithms for machine learning (e.g., neural networks and clustering), parallel algorithms, basic statistical modeling (logistic regression and linear/non-linear regression), and proficiency with a complex ecosystem of tools and platforms.
A good place to start is with meetup groups. Two of my favorite data science groups deal with the primary ingredients of data science work: R, which is the programming environment of choice for building algorithms, and machine learning. The LA area R user group is excellent; try to find one near you. The LA Machine Learning group has regular meetings that are extremely useful.
The Massive Open Online Course (MOOC) movement is very active in the data science space and constitutes a superb educational resource. These free courses (some offer certifications) offer an excellent path toward obtaining the requisite background for becoming a data scientist. I’ve put together a Radical Data Science “pseudo degree program” for you to follow.
Free Data Science Books
To go along with the coursework, there also are a number of excellent free books available:
Mining of Massive Datasets
Bayesian Reasoning and Machine Learning (pdf)
Information Theory, Inference, and Learning Algorithms
Gaussian Processes for Machine Learning (pdf)
The Elements of Statistical Learning
Introduction to Machine Learning (pdf)
Think Bayes (pdf)
As the interest in data science continues to grow, and as the shortage in talent becomes apparent, the timing is excellent to retool yourself and climb aboard the data science gravy train. If you know of any other good educational resources for data science and machine learning, please leave a note for all of us.
This article is the first in a series on using R to answer typical business questions with data science solutions. The question answered here is: what is prediction and how does it work? To demonstrate this concept, I’ve chosen one of the oldest statistical techniques used for predicting the future: linear regression. Our example will use a widely available data source along with the R statistical language to show how it’s done. There’s no room here to give a class on statistics or R, but I’ll provide a couple of resources for educating yourself on each.
Our test case will involve all the commonly used ingredients of a data scientist building a prediction model: finding an appropriate data set, doing a little exploratory analysis to fully understand the data, defining a linear model, fitting a regression line, making plots, and using the model for prediction. The only thing we won’t do here is “data munging,” the often tedious exercise of massaging a dirty data source into a clean version for machine learning purposes. For our example, we’ll use an already-clean data set.
The Data Set and Exploratory Analysis
Let’s start by looking at the data set for our simple example. We’ll go way back to early thinking in linear regression – the 1885 Galton study to define a relationship between a parent’s height and their children’s. The data set is available with the “UsingR” package (http://cran.r-project.org/web/packages/UsingR/index.html). The R statistical software (http://www.r-project.org/) is available as open source software. You should also download the free RStudio development environment (http://www.rstudio.com/).
Here is the R code required to install the package, and open the data set:
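A minimal sketch of those steps, assuming you install from CRAN (the mirror URL is just one choice; any mirror works):

```r
# Install UsingR once (skipped if it is already installed).
# The CRAN mirror below is an assumption, not a requirement.
if (!requireNamespace("UsingR", quietly = TRUE)) {
  install.packages("UsingR", repos = "https://cloud.r-project.org")
}

library(UsingR)  # attach the package
data(galton)     # load the Galton height data into the workspace
str(galton)      # two numeric columns: child and parent
```

After this, galton is an ordinary data frame you can pass to any of the commands that follow.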
The data set consists of 928 pairs of child and parent heights in inches. For unfamiliar data sets, you typically explore the data with one or more of the following R commands (although there are many more ways to explore data sets in R):
head(galton) # Show first few records
tail(galton) # Show last few records
summary(galton) # Statistical summary
table(galton) # Show a distribution of values
hist(galton$child) # A visual distribution of values
We can also do a simple scatter plot of child height versus parent height. This plot exhibits an ideal case for using regression: a cloud-shaped collection of data points.
plot(galton$parent, galton$child, pch=19, col="blue")
Fitting a Linear Model
Next, we can fit a line to the Galton data using a simple linear model. In R this is very easy to do: just use the lm function, where parent is the explanatory (independent) variable and child is the response variable (to be predicted).
lm1 <- lm(galton$child ~ galton$parent)
If you display the contents of the lm1 model variable (just type lm1 at the RStudio command prompt), you’ll see it contains a number of items including the coefficients of the linear model: the intercept 23.942 and the slope 0.646. These values are stored in a 2-element vector, lm1$coeff. The slope means that for each 1-inch increase in the parent’s height, the predicted child’s height increases by 0.646 inch. In other words, think back to high school algebra class when you graphed a line. It is as simple as that!
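If you want to convince yourself that there is no magic inside lm, here is a small sketch that computes the slope and intercept by hand and compares them with what lm reports. It uses simulated heights rather than the Galton data, so the seed, the 24 and 0.65 used to generate the data, and the noise level are all made-up assumptions for illustration only.

```r
set.seed(42)                                       # arbitrary seed, for reproducibility
parent <- rnorm(200, mean = 68, sd = 2)            # simulated parent heights (inches)
child  <- 24 + 0.65 * parent + rnorm(200, sd = 2)  # simulated child heights (inches)

fit <- lm(child ~ parent)

# Least-squares formulas for a single explanatory variable
slope     <- cov(parent, child) / var(parent)
intercept <- mean(child) - slope * mean(parent)

print(coef(fit))            # coefficients reported by lm
print(c(intercept, slope))  # the same two numbers, computed by hand
```

The two printed pairs should agree up to rounding, because lm solves exactly this least-squares problem when there is one explanatory variable.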
Now let’s complete the picture by adding the regression line to the plot using the vector of fitted values found in lm1$fitted:
lines(galton$parent, lm1$fitted, col="red", lwd=3)
You can also plot the residual values (response values minus fitted values) using:
plot(galton$parent, lm1$residuals, pch=19, col="blue")
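As a quick sanity check on the residuals, the sketch below confirms that they are just observed minus fitted values, and that they average out to zero, which always holds for a least-squares line that includes an intercept. It assumes UsingR is already installed and refits the model so the snippet runs on its own.

```r
library(UsingR)
data(galton)

lm1 <- lm(galton$child ~ galton$parent)

# Residual = observed response minus fitted value
res_by_hand <- galton$child - fitted(lm1)
print(all.equal(unname(res_by_hand), unname(resid(lm1))))

# With an intercept in the model, residuals average to (numerically) zero
print(mean(resid(lm1)))
```

If the residual plot shows no obvious curve or funnel shape, a straight line is a reasonable description of the data.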
Using the Model to Predict
Now you can use the model to make predictions using new data. For example, say you have a parent height of 80 (which is outside the range of the data set we used to train the model) and you want to predict child height. Just use the coefficients of the model:
lm1$coeff[1] + lm1$coeff[2] * 80
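The predicted child height is about 75.6 inches. Because 80 inches lies outside the range of the training data, treat this extrapolation with some caution. A larger training set generally gives more reliable coefficient estimates, so refit the linear model whenever you add data.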
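The same prediction can be made with R’s predict function, which also reports an interval around the estimate. One catch: predict can only match new data to the model if the model was fitted with column names and a data argument, so this sketch refits the model that way. The names fit, new_parent, and pred are mine, and the snippet assumes UsingR is installed.

```r
library(UsingR)
data(galton)

# Same model as lm1, but written with a data argument so predict()
# can look up "parent" in the new data frame
fit <- lm(child ~ parent, data = galton)

new_parent <- data.frame(parent = 80)  # a parent height of 80 inches
pred <- predict(fit, newdata = new_parent, interval = "prediction")
print(pred)  # columns: fit (point estimate), lwr and upr (95% prediction interval)
```

The fit column should match the hand calculation from the coefficients, while lwr and upr give a sense of how uncertain a single predicted height is.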
Uses of Linear Regression
Businesses use linear regression forecasting techniques in a multitude of ways. For example, you can model Google pay-per-click advertising costs versus sales; just download your cost data from Google AdWords and build an Excel spreadsheet that matches up the sales based on time period. Another example might be using the linear model to predict productivity gains resulting from a training program. Uses of linear regression are very broad: manufacturing, supply chain, real estate, financial sector, and much more.
To supplement this article, here are a couple of excellent resources for you to learn more about general statistics and R.
• The Art of R Programming (http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf)
• OpenIntro Statistics (http://www.openintro.org/stat/textbook.php)
Here is a new webinar from Revolution Analytics that introduces the use of the R statistical programming environment for doing data mining. Presented by Joe Rickert, the seminar demos several examples of data mining with various R packages. Rickert’s slides can be downloaded HERE. Enjoy!