The field of data science is heating up fast. The following list of educational resources will help you join the data revolution by getting up to speed with the discipline.
Data science (and the driving force behind it, machine learning) is the process of deriving added value from data assets. Commerce and research are being transformed by data-driven discovery and prediction. The skills required for data analytics at massive scale span a variety of disciplines and are not easy to obtain through conventional curricula. They include machine learning algorithms (e.g., neural networks and clustering), parallel algorithms, basic statistical modeling (logistic regression and linear/non-linear regression), and proficiency with a complex ecosystem of tools and platforms.
A good place to start is with meetup groups. Two of my favorite data science groups deal with the primary ingredients of data science work: R, which is the programming environment of choice for building algorithms, and machine learning. The LA area R user group is excellent; try to find one near you. The LA Machine Learning group has regular meetings that are extremely useful.
The Massive Open Online Course (MOOC) movement is very active in the data science space and constitutes a superb educational resource. These free courses (some offer certifications) offer an excellent path toward obtaining the requisite background for becoming a data scientist. I’ve put together a Radical Data Science “pseudo degree program” for you to follow.
Free Data Science Books
To go along with the coursework, there are also a number of excellent free books available:
Mining of Massive Datasets
Bayesian Reasoning and Machine Learning (pdf)
Information Theory, Inference, and Learning Algorithms
Gaussian Processes for Machine Learning (pdf)
The Elements of Statistical Learning
Introduction to Machine Learning (pdf)
Think Bayes (pdf)
As interest in data science continues to grow, and as the shortage of talent becomes apparent, the timing is excellent to retool yourself and climb aboard the data science gravy train. If you know of any other good educational resources for data science and machine learning, please leave a note for all of us.
This article is the first in a series on using R to answer typical business questions with data science techniques. The question answered here is: what is prediction and how does it work? To demonstrate the concept, I've chosen one of the oldest statistical techniques used for predicting the future: linear regression. Our example will use a widely available data source along with the R statistical language to show how it's done. There's no room here to give a class on statistics or R, but I'll provide a couple of resources for educating yourself on each.
Our test case will involve all the commonly used ingredients of a data scientist building a prediction model: finding an appropriate data set, doing a little exploratory analysis to fully understand the data, defining a linear model, fitting a regression line, making plots, and using the model for prediction. The only thing we won’t do here is “data munging,” the often tedious exercise of massaging a dirty data source into a clean version for machine learning purposes. For our example, we’ll use an already-clean data set.
The Data Set and Exploratory Analysis
Let's start by looking at the data set for our simple example. We'll go way back to early thinking in linear regression: the 1885 Galton study, which examined the relationship between parents' heights and those of their children. The data set is available with the "UsingR" package (http://cran.r-project.org/web/packages/UsingR/index.html). The R statistical software (http://www.r-project.org/) is available as open source. You should also download the free RStudio development environment (http://www.rstudio.com/).
Here is the R code required to install the package, and open the data set:
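install.packages("UsingR") # one-time install from CRAN
library(UsingR)            # load the package
data(galton)               # attach the Galton parent/child height data set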
The data set consists of 928 pairs of child and parent heights in inches. For unfamiliar data sets, you typically explore the data with one or more of the following R commands (although there are many more ways to explore data sets in R):
head(galton) # Show first few records
tail(galton) # Show last few records
summary(galton) # Statistical summary
table(galton) # Show a distribution of values
hist(galton$child) # A visual distribution of values
We can also do a simple scatter plot of child height versus parent height. This plot exhibits an ideal case for using regression: a cloud-shaped collection of data points.
plot(galton$parent, galton$child, pch=19, col="blue")
Fitting a Linear Model
Next, we can fit a line to the Galton data using a simple linear model. In R this is very easy to do: just use the lm function, where parent is the explanatory (independent) variable and child is the response variable (the one to be predicted).
lm1 <- lm(galton$child ~ galton$parent)
If you display the contents of the lm1 model variable (just type lm1 at the RStudio command prompt), you'll see it contains a number of items, including the coefficients of the linear model: the intercept 23.942 and the slope 0.646. These values are stored in the two-element vector lm1$coeff. The slope means that for each 1-inch increase in a parent's height, the predicted child's height increases by 0.646 inch. In other words, think back to high school algebra class when you graphed a line; it's as simple as that!
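If you want to work with those values directly, the coef helper returns the same two-element vector (b is just an illustrative variable name):
b <- coef(lm1) # same values as lm1$coeff
b[1]           # intercept: 23.942
b[2]           # slope: 0.646 inch of child height per inch of parent height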
Now let's complete the picture by adding the regression line to the plot, using the vector of fitted values found in lm1$fitted:
lines(galton$parent, lm1$fitted, col="red", lwd=3)
You can also plot the residual values (response values minus fitted values) using:
plot(galton$parent, lm1$residuals, pch=19, col="blue")
Using the Model to Predict
Now you can use the model to make predictions on new data. For example, say you have a parent height of 80 inches (which is outside the range of the data set we used to train the model) and you want to predict child height. Just use the coefficients of the model:
lm1$coeff[1] + lm1$coeff[2] * 80 # intercept + slope * parent height
The predicted child height is about 75.6 inches. You can improve model accuracy by using a larger training set and reevaluating the linear model.
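As a side note, if you fit the same model against the data frame's column names, R's built-in predict function does this arithmetic for you (lm2 is just an illustrative name):
lm2 <- lm(child ~ parent, data = galton)        # same model, written against the data frame
predict(lm2, newdata = data.frame(parent = 80)) # also returns roughly 75.6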
Uses of Linear Regression
Businesses use linear regression forecasting techniques in a multitude of ways. For example, you can model Google pay-per-click advertising costs versus sales: just download your cost data from Google AdSense and build a spreadsheet that matches costs with sales by time period. Another example might be using a linear model to predict productivity gains resulting from a training program. Uses of linear regression are very broad: manufacturing, supply chain, real estate, the financial sector, and much more.
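Here is a minimal R sketch of the advertising example; the numbers are made up purely for illustration:
# Hypothetical weekly ad spend and sales figures (illustrative only)
ads <- data.frame(cost  = c(100, 150, 200, 250, 300),
                  sales = c(1200, 1650, 2100, 2400, 2900))
fit <- lm(sales ~ cost, data = ads)            # sales as a linear function of ad spend
predict(fit, newdata = data.frame(cost = 400)) # forecast sales at a new spend level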
To supplement this article, here are a couple of excellent resources for you to learn more about general statistics and R.
• The Art of R Programming (http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf)
• OpenIntro Statistics (http://www.openintro.org/stat/textbook.php)
Much has been debated about the validity of p-values in determining statistical relationships between data elements. Here is a long list of 402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies. And here is a current and measured contribution to the debate appearing over at the Simply Statistics blog. How do you weigh in? Do you feel the use of p-values produces the kind of false positives we always see in the mainstream press?
Here is a new webinar from Revolution Analytics that introduces the use of the R statistical programming environment for data mining. Presented by Joe Rickert, the webinar demonstrates several examples of data mining with various R packages. Rickert's slides can be downloaded HERE. Enjoy!
I had another scintillating evening with the Los Angeles area R User Group on January 24, 2013. I always have a good time at this Meetup, which has become my favorite, but I was particularly intrigued by the theme: a panel of distinguished data scientists talking about "data science." Fun to be sure! The R group meetings are free, along with free validated parking and all the pizza and drinks you can navigate.
Once again held at Adconion Media Group in Santa Monica, the panel consisted of Avram Aelony, Eric Kostello, Yasmin Lucero, Szilard Pafka, Ryan Rosario, and Oliver Will (see the inset photo I took with my new iPhone 5). The first order of business was panelist introductions. It appears that all but one held a Ph.D., which led to the first discussion topic: do you need a Ph.D. to be a data scientist? Contrary to what the panel's own credentials might suggest, the consensus was that you don't. I agree in a sense, but I think a Ph.D. in a related field like statistics, computer science, mathematics, or even physics would certainly propel your career.
A lot of the hour-and-a-half discussion dealt with defining "data science" and with what a typical data scientist actually does. I found this dialog enlightening because I always like to better understand how my fellow data scientists view themselves and our field. The panel seemed to agree that the term "data science" is relatively new and perhaps somewhat overhyped, but that many of its methods have seasoned and sound foundations in statistics, computer science, and their various related fields (e.g., machine learning, databases, distributed computing, visualization).
Questions from the audience were welcome (nearly 100 were in attendance), so I brought up a point about how many of the job descriptions being floated these days for data science positions seem overly heavy with qualifications, to the point where the successful candidate needs to fill the roles of CTO, VP of Engineering, system admin, and coder all at once. Heck, if the employer is looking for a one-person company, why didn't they say so! The panel's response was that many times the employer has no idea what a data scientist is, so they just Google a bunch of terms and throw them into the job description. A statistically insignificant number of candidates will have all the qualifications, so the advice was: just go in and talk to them and see what sticks.
The discussion eventually moved into more technical subjects, such as version control for R and processing environments like Hadoop and Amazon EC2. I thought there was a good balance between technical and non-technical material, but organizer Szilard Pafka had so many other topics to cover that he's already scheduled a Part 2 panel for March.
After walking away from the thought-provoking panel discussion, I stopped to think that this is an excellent time for a technical person to retool herself and get aboard the data science gravy train. That is why I put together a list of FREE data science educational resources over at Big Data Republic.
It is a mighty good time to be a data scientist, and being part of the Silicon Beach data science scene is pretty energizing. I hope to see you all at the next Meetup!
I guess it had to happen eventually, but this soon? "Big Data" is an awfully hot buzzword seen in mainstream publications like the New York Times and Harvard Business Review, but a coffee table book? Yes indeed, The Human Face of Big Data is a 5 lb. hardcover published in November 2012 by authors Rick Smolan and Jennifer Erwitt.
Rumor has it that the book is landing on the desks of world leaders, dignitaries, and other notable people around the world. Among others, the list includes President Obama, the Dalai Lama, Pope Benedict XVI, and Chinese Premier Wen Jiabao, as well as Twitter CEO Jack Dorsey; Daniel Tunkelang, chief data scientist at LinkedIn; and actor Robin Williams.
Among the images in the book's 224 pages is a blended composite of 1,400 different shots of New York's Times Square taken across 15 hours. In an abstract sense, big data is about people: what they do, where they go, who they know, and so on. The book's overarching theme is how data, once harnessed, solves problems and in some ways creates new ones.
The perfect holiday gift from a Data Scientist!
I’d like to offer the following slide presentation on combining the power of R with the Hadoop platform. This presentation was made by Jeffrey Breen of the Chicago Area Hadoop Users Group and Chicago Area R Users Group in August 2012. It is a great overview for anyone considering placing an R-based application in a production computation environment.
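To give a flavor of what R on Hadoop can look like, here is a minimal sketch using the RHadoop rmr2 package; this is one common approach, not necessarily the one covered in the slides:
library(rmr2)                  # RHadoop's MapReduce interface for R
rmr.options(backend = "local") # run locally for testing; use "hadoop" on a real cluster
ints <- to.dfs(1:100)          # push a small vector into the (local or HDFS) file system
squares <- mapreduce(input = ints,
                     map = function(k, v) keyval(v, v^2)) # emit (value, value squared) pairs
from.dfs(squares)              # pull the key/value results back into the R session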
On Dec. 6, 2012, I attended another great evening event with the Los Angeles area R user group. This was my second meeting with the group, and I was looking forward to the topic: "Real-time dashboards with Node and R." As a data scientist, my development environment of choice is R (with a little Octave thrown in for good measure; Octave is an open source Matlab derivative), so I'm always looking for ways to extend my knowledge of R and for different ways to apply it to machine learning. This meeting served that purpose very well.
I arrived early at the meeting venue, Adconion Media Group in Santa Monica. Silicon Beach, of which Santa Monica is a part, is really heating up with interest in Big Data. Adconion is a fantastic Meetup venue: easily accessible, with great parking and high-tech appeal, and somewhat recognizable if you've ever watched "Curb Your Enthusiasm," since this is the same office park where Larry David's office is located. Adconion supplied a wide selection of pizza and soft drinks (thank you!). The first hour is a meet-and-greet with all the attendees; 74 registered this time. I met a bunch of like-minded data scientists, including a Caltech Ph.D. candidate in computational neuroscience and another attendee who uses R for sentiment analysis on publicly traded companies for investment managers.
Once the lecture began, the group was led through a fast-paced example of using Node and Google Chart Tools, along with a Twitter sentiment analysis algorithm in R, to yield a "happiness" indicator for countries around the world. The globe map used color coding to indicate general "happiness." Click HERE for a demo that presents "How popular is fast food." The presenter, Dr. Dirk Neumann, is a data scientist and founder of the social supercomputing startup Opani; he received his Ph.D. in Computation and Neural Systems from Caltech.
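For a sense of how such a sentiment score can be computed, here is a minimal word-list sketch in base R; it is not the presenter's actual algorithm, and the word lists are purely illustrative:
# Naive sentiment: count positive words minus negative words
score_sentiment <- function(text, pos, neg) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+")) # crude tokenizer
  sum(words %in% pos) - sum(words %in% neg)            # net score
}
score_sentiment("I love this burger but hate the fries",
                pos = c("love", "great", "happy"),
                neg = c("hate", "awful", "sad")) # returns 0: one positive, one negative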
As I mentioned in my previous field report for this group, I highly recommend it for anyone in the field. The presentation content tends to be fairly complex, but it’s a good way to push your personal envelope.
As a data scientist, I'm continually amused by the ongoing misuse of the principle of linear regression. You see it all the time in the press. Lately, regression theory underlies much of the talk surrounding the demise of American entitlement programs like Medicare and Social Security. Forecasts abound saying that the country can no longer afford retirement benefits and elder healthcare coverage. But you must stop to ask: what are these so-called forecasts based on?
Many forecasts unwisely project healthcare costs far into the future by assuming that the trends of the past will continue unaltered. But this mentality ignores reality. To look at this another way, just because your son is 4 feet tall at age 6 doesn't mean he'll be 12 feet tall at age 18. And just because the average American born today will live to the age of 78 doesn't mean that a baby born in 2032 will live to 100. Forecasting something like healthcare is just as prickly.
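To make the arithmetic explicit, here is that naive straight-line extrapolation in R:
height_at_6 <- 4        # feet tall at age 6
rate <- height_at_6 / 6 # growth rate if height grew linearly from birth
rate * 18               # predicted height at age 18: 12 feet, clearly absurd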
Making bets on the distant future is not wise because it's unknowable. Too many exceptional events can occur to affect life along the regression line: wars, financial bubbles, financial crashes, extreme weather, political demagoguery, and so on. No one, no business, no government agency, makes plans today based on a vision of the world 20 years ahead. Apple doesn't do it. Google doesn't do it. The Department of Defense doesn't do it. You and I don't do it. Not even insurance companies do it, and much of their business is tied to the future. But many in our U.S. government, prodded along by the anti-entitlement lobby, are pushing to gut these programs while relying on projections of past experience into the future without adjusting for changes in behavior or policy.
None of this means that uncertainty in economic forecasts makes economic planning pointless. There are indeed good reasons for looking ahead, in small increments, along the regression line, just not good reason for making sweeping changes to programs as complex and far-reaching as Social Security and Medicare. So let's accept life on the regression line for what it is: a limited and isolated view of the future.