Great news! I am scheduled to teach a brand new class at UCLA Extension this coming Winter 2014 quarter starting January 8. Here is a LINK to the class description in the UNEX course catalog. As part of the course materials, I am providing all the source code from the required textbook “The Art of R Programming” by Norman Matloff via Dropbox. You’ll find a series of folders, one for each chapter, containing the R scripts for the examples found in the book, plus a few missing data files that were referenced but never provided. Unfortunately, the author did not provide most of the sample source code on the book website made available by the publisher, so rather than have my students type in all the code, I’ve provided an annotated version to help the learning process. Feel free to grab the scripts even if you’re not taking my class.
Additionally, Matloff has made available a 2009 draft of the book on his personal website, so feel free to download the book HERE (pdf).
See you all in class!
For those of you out there wondering where I’ve gone, let me give you the lowdown. Shortly after I started the RadicalDataScience blog, I was hired as Managing Editor of insideBIGDATA, a news outlet covering the big data industry including areas such as data science, machine learning and visualization. Since Sept 1, 2013 I’ve been writing many features, news stories and technical articles including my “Data Science 101” series. As a result of this high-visibility assignment, and my less-frequent contributions over on BigDataRepublic, my writing will continue to appear through these outlets. RDS will therefore remain silent until something changes.
Have a superb 2014! And keep thinking Data Science!
Last night I attended the latest meeting of the Los Angeles area R User Group. The event was held in a very obscure rooftop location in Boelter Hall (the Engineering building). I’m glad I arrived a bit early; otherwise I would have missed out on the limited pizza supply. But no matter, my motivation for attending was to hear all about the international useR! 2013 conference recently held in Albacete, Spain. I wasn’t disappointed. The group’s organizer, Szilard Pafka, and Bryan Galvin gave useful presentations about all the new stuff that was offered up at the conference. I’d like to present a summary here for RDS blog readers. In no particular order:
- Check out the useR! 2013 website for detailed information about the conference.
- There is a new JVM-based interpreter for R called Renjin.
- It seems the evil empire Oracle is all into R now with Oracle R Enterprise.
- A brand new version of the plyr R package is in the works called dplyr. The main benefit is it will be much faster!
- The shiny web application framework package for R continues to be a rising star. Check out this tutorial.
- There is a new “pretty quick” version of R called pqR.
- And last but not least, I’m excited to say that the useR! 2014 conference will be held on June 30 – July 3 in Los Angeles at UCLA and sponsored by the UCLA Statistics department.
I always have a good time at this Meetup, which has become one of my favorites. I highly recommend this particular Meetup group if you happen to be in Los Angeles, but if not, try to locate a similar group near you.
Have you ever tried to buy tickets to a popular summer concert only to find it sold out within minutes of first availability? If so, you’re not alone. Apparently, this situation is becoming a frustrating ritual that frequently plays out at online ticket sources where tickets to hot concerts seem to vanish instantly. What’s going on? Bots, that’s what. Bots are software applications used by high-tech scalpers to feed a growing, multibillion-dollar secondary market for tickets. Leading ticket seller Ticketmaster claims that bots buy up in excess of 60% of the most desirable tickets for some events. A recent Ticketmaster lawsuit accuses one group of scalpers of using bots to request up to 200,000 tickets a day.
Machine Learning to the Rescue
Fortunately, Ticketmaster chose a big data solution to combat the onslaught of the ticket bots. The firm hired machine learning specialist John Carnahan, who had previous computational marketing roles at the Rubicon Project, Fox and Yahoo, to head up its Data Science Group and the bot-fighting program. Using a classification algorithm similar in nature to spam classifiers, the system distinguishes legitimate ticket requests from bot-driven ones. One of the feature variables used to detect a bot is the click speed in filling out the web form. Human click timing is irregular, whereas bots click quickly and at regular intervals. Carnahan uses a color-coded dashboard application to show ticket requests in real time. In some cases the bot requests outnumbered the human requests several hundred-fold.
It isn’t clear what the training set looks like for the ticket bot classifier since Ticketmaster won’t say how many of the 148 million tickets sold each year were purchased by bots.
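To make the spam-filter analogy concrete, here is a minimal sketch in R of a classifier trained on click timing. Everything in it is an assumption for illustration: the sessions are simulated, the two features (the mean and the spread of the gaps between clicks) and their distributions are invented, and logistic regression merely stands in for whatever algorithm Ticketmaster actually uses.

```r
set.seed(42)
n <- 500  # simulated sessions per class (hypothetical)

# Hypothetical features, in seconds: the mean gap between clicks and the
# standard deviation of those gaps. Humans are slower and more irregular;
# bots are fast and regular. abs() keeps the simulated values non-negative.
humans <- data.frame(mean_gap = abs(rnorm(n, mean = 2.0, sd = 0.6)),
                     sd_gap   = abs(rnorm(n, mean = 1.0, sd = 0.3)),
                     is_bot   = 0)
bots   <- data.frame(mean_gap = abs(rnorm(n, mean = 0.7, sd = 0.5)),
                     sd_gap   = abs(rnorm(n, mean = 0.3, sd = 0.25)),
                     is_bot   = 1)
sessions <- rbind(humans, bots)

# Logistic regression: models the probability that a session is a bot
fit <- glm(is_bot ~ mean_gap + sd_gap, data = sessions, family = binomial)

# Classify each session using a 0.5 probability cutoff
p_bot     <- predict(fit, type = "response")
predicted <- as.integer(p_bot > 0.5)
accuracy  <- mean(predicted == sessions$is_bot)

# Confusion matrix of actual versus predicted labels
print(table(actual = sessions$is_bot, predicted = predicted))
```

The accuracy here is measured on the same simulated sessions used for fitting, so it is optimistic; a real evaluation would hold out labeled requests that the model has never seen.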
Slowed Down But Not Out
The curious thing about the Ticketmaster anti-bot effort is that it was designed to slow down the bots but not knock them out altogether. The goal is to send the bot, when detected, to the end of the line in order to let fans get the tickets they desire.
This kind of limited tolerance for bots can still wreak havoc on performances. Some sold-out shows see up to 20% no-shows with many of the best seats vacant. Basically, the speculators who bought tickets with bots couldn’t get the prices they wanted.
Not Only About the Bots
The science of ticket selling may appear straightforward, but Ticketmaster is sitting on a treasure trove of unexplored transactional and social media data that it can use to better understand customer buying behavior in order to sell more tickets. In addition to bot detection, this application of big data can also lend itself to implementing recommender systems using scientific rigor in order to show customers related tickets they’re likely to buy and figure out how unstructured social media data can play a contributing role. With this initiative, Ticketmaster’s dominance in the highly competitive ticket industry can be furthered with data science.
The field of data science is heating up fast. The following list of educational resources will let you join the data revolution by getting up to speed with data science.
Data science — and the driving force behind it, machine learning — is the process of deriving added value from data assets. Commerce and research are being transformed by data-driven discovery and prediction. Skills required for data analytics at massive levels span a variety of disciplines and are not easy to obtain through conventional curricula. These include algorithms for machine learning (e.g., neural networks and clustering), parallel algorithms, basic statistical modeling (logistic regression and linear/non-linear regression), and proficiency with a complex ecosystem of tools and platforms.
A good place to start is with meetup groups. Two of my favorite data science groups deal with the primary ingredients of data science work: R, which is the programming environment of choice for building algorithms, and machine learning. The LA area R user group is excellent; try to find one near you. The LA Machine Learning group has regular meetings that are extremely useful.
The Massive Open Online Course (MOOC) movement is very active in the data science space and constitutes a superb educational resource. These free courses (some offer certifications) offer an excellent path toward obtaining the requisite background for becoming a data scientist. I’ve put together a Radical Data Science “pseudo degree program” for you to follow.
Free Data Science Books
To go along with the coursework, there also are a number of excellent free books available:
Mining of Massive Datasets
Bayesian Reasoning and Machine Learning (pdf)
Information Theory, Inference, and Learning Algorithms
Gaussian Processes for Machine Learning (pdf)
The Elements of Statistical Learning
Introduction to Machine Learning (pdf)
Think Bayes (pdf)
As the interest in data science continues to grow, and as the shortage in talent becomes apparent, the timing is excellent to retool yourself and climb aboard the data science gravy train. If you know of any other good educational resources for data science and machine learning, please leave a note for all of us.
This article is the first in a series on using R to answer typical business questions with data science solutions. The question answered here is: what is prediction and how does it work? In order to demonstrate this concept, I’ve chosen one of the oldest statistical techniques used for predicting the future: linear regression. Our example will use a widely available data source along with the R statistical language to show how it’s done. There’s no room here to give a class on statistics or R, but I’ll provide a couple of resources for educating yourself on each.
Our test case will involve all the commonly used ingredients of a data scientist building a prediction model: finding an appropriate data set, doing a little exploratory analysis to fully understand the data, defining a linear model, fitting a regression line, making plots, and using the model for prediction. The only thing we won’t do here is “data munging,” the often tedious exercise of massaging a dirty data source into a clean version for machine learning purposes. For our example, we’ll use an already-clean data set.
The Data Set and Exploratory Analysis
Let’s start by looking at the data set for our simple example. We’ll go way back to early thinking in linear regression – the 1885 Galton study to define a relationship between a parent’s height and their children’s. The data set is available with the “UsingR” package (http://cran.r-project.org/web/packages/UsingR/index.html). The R statistical software (http://www.r-project.org/) is available as open source software. You should also download the free RStudio development environment (http://www.rstudio.com/).
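Here is the R code required to install the package and open the data set:
install.packages("UsingR") # Install the package (only needed once)
library(UsingR) # Load the package
data(galton) # Load the Galton data set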
The data set consists of 928 pairs of child and parent heights in inches. For unfamiliar data sets you typically explore the data with one or more of the following R commands (although there are many more ways to explore data sets in R):
head(galton) # Show first few records
tail(galton) # Show last few records
summary(galton) # Statistical summary
table(galton) # Show a distribution of values
hist(galton$child) # A visual distribution of values
We also can do a simple scatter plot of child height versus parent height. This plot exhibits an ideal case for using regression – a cloud shaped collection of data points.
plot(galton$parent, galton$child, pch=19, col="blue")
Fitting a Linear Model
Next, we can fit a line to the Galton data using a simple linear model. In R this is very easy to do: just use the lm function, where parent is the explanatory (independent) variable and child is the response variable (to be predicted).
lm1 <- lm(galton$child ~ galton$parent)
If you display the contents of the lm1 model variable (just type lm1 at the RStudio command prompt), you’ll see it contains a number of items including the coefficients of the linear model: the intercept 23.942 and the slope 0.646. These values are stored in the two-element vector lm1$coeff. The slope means that each 1-inch increase in the parent’s height is associated with a 0.646-inch increase in the child’s predicted height. In other words, think back to high school algebra class when you graphed a line: child = 23.942 + 0.646 × parent. It is as simple as that!
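The intercept and slope that lm reports can also be reproduced from the textbook least-squares formulas: the slope is the covariance of parent and child divided by the variance of parent, and the intercept makes the line pass through the two means. The sketch below checks this. It fits the model with the data = form of lm, and it falls back to simulated stand-in data when UsingR is not installed; that fallback, including its parameter values, is an assumption and not Galton's data.

```r
# Load galton from UsingR when available; otherwise simulate a stand-in
# with a similar shape (assumption: these numbers are NOT Galton's data).
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(galton, package = "UsingR")
} else {
  set.seed(1885)
  parent <- rnorm(928, mean = 68.3, sd = 1.8)
  galton <- data.frame(child  = 23.9 + 0.646 * parent + rnorm(928, sd = 2.2),
                       parent = parent)
}

x <- galton$parent  # explanatory variable
y <- galton$child   # response variable

# Least-squares estimates computed by hand
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm
fit <- lm(child ~ parent, data = galton)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

With the real Galton data, both print statements should show the intercept and slope quoted in the text.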
Now let’s complete the picture by adding the regression line to the plot using the vector of fitted values found in lm1$fitted:
lines(galton$parent, lm1$fitted, col="red", lwd=3)
You can also plot the residual values (response values minus fitted values) using:
plot(galton$parent, lm1$residuals, pch=19, col="blue")
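A few quick numeric checks go well with the residual plot. For a least-squares line with an intercept, the residuals average to zero and are uncorrelated with the explanatory variable, so any visible trend in the plot hints that a straight line is the wrong model. The sketch below verifies both properties; as before, the simulated fallback used when UsingR is missing is an assumption, not Galton's data.

```r
# Load galton from UsingR when available; otherwise simulate a stand-in
# (assumption: these numbers are NOT Galton's data).
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(galton, package = "UsingR")
} else {
  set.seed(1885)
  parent <- rnorm(928, mean = 68.3, sd = 1.8)
  galton <- data.frame(child  = 23.9 + 0.646 * parent + rnorm(928, sd = 2.2),
                       parent = parent)
}

fit <- lm(child ~ parent, data = galton)

# Residual = observed response minus fitted value
res <- galton$child - fitted(fit)

# Does the hand computation agree with what lm stores?
same_as_lm <- isTRUE(all.equal(unname(res), unname(residuals(fit))))

mean_res   <- mean(res)                # zero up to rounding error
cor_with_x <- cor(res, galton$parent)  # zero up to rounding error

print(c(mean_residual = mean_res, cor_with_parent = cor_with_x))
```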
Using the Model to Predict
Now you can use the model to make predictions using new data. For example, say you have a parent height of 80 (which is outside the range of the data set we used to train the model) and you want to predict child height. Just use the coefficients of the model:
lm1$coeff[1] + lm1$coeff[2] * 80
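The predicted answer you get is about 75.6 inches for the child. Because a parent height of 80 inches lies outside the range of the data used to fit the model, treat this extrapolated value with caution. You may be able to improve the model’s accuracy by refitting it on a larger training set.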
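R's built-in predict function does the same arithmetic and can also report a prediction interval, which gives a sense of how uncertain a single predicted height is. The sketch below is one way to do it, not the only one. It fits the model with the data = form of lm because predict matches the columns of newdata by variable name. The simulated fallback used when UsingR is missing is an assumption, not Galton's data.

```r
# Load galton from UsingR when available; otherwise simulate a stand-in
# (assumption: these numbers are NOT Galton's data).
if (requireNamespace("UsingR", quietly = TRUE)) {
  data(galton, package = "UsingR")
} else {
  set.seed(1885)
  parent <- rnorm(928, mean = 68.3, sd = 1.8)
  galton <- data.frame(child  = 23.9 + 0.646 * parent + rnorm(928, sd = 2.2),
                       parent = parent)
}

fit <- lm(child ~ parent, data = galton)
new_parent <- data.frame(parent = 80)  # column name must match the formula

# Prediction from the coefficients, as in the text
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 80)

# The same prediction from predict()
by_predict <- unname(predict(fit, newdata = new_parent))

# 95% prediction interval for one new child height
pred_int <- predict(fit, newdata = new_parent,
                    interval = "prediction", level = 0.95)

# Is 80 inside the range of parent heights the model was fitted on?
in_range <- 80 >= min(galton$parent) && 80 <= max(galton$parent)

print(pred_int)
print(in_range)
```

The in_range flag makes the extrapolation explicit: the interval assumes the straight-line relationship continues to hold at 80 inches, which the data cannot confirm.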
Uses of Linear Regression
Businesses use linear regression forecasting techniques in a multitude of ways. For example, you can model Google pay-per-click advertising costs versus sales; just download your cost data from Google AdWords and build an Excel spreadsheet that matches up the sales based on time period. Another example might be using the linear model to predict productivity gains resulting from a training program. Uses of linear regression are very broad: manufacturing, supply chain, real estate, financial sector, and much more.
To supplement this article, here are a couple of excellent resources for you to learn more about general statistics and R.
• The Art of R Programming (http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf)
• OpenIntro Statistics (http://www.openintro.org/stat/textbook.php)
Much has been debated about the validity of p-values in determining statistical relationships between data elements. Here is a long list of 402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies. And here is a current and measured contribution to the debate appearing over at the Simply Statistics blog. How do you weigh in? Do you feel the use of p-values produces the kind of false positives we always see in the mainstream press?
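One way to get a feel for the false-positive issue is a small simulation. The sketch below, which is my own illustration rather than anything taken from the linked articles, tests many pairs of completely unrelated variables and counts how often the p-value falls below 0.05 anyway. The number of tests and the sample size are arbitrary choices.

```r
set.seed(2013)
n_tests <- 1000  # number of independent studies to simulate (arbitrary)
n_obs   <- 50    # observations per study (arbitrary)

# Each study correlates two unrelated random variables,
# so every "significant" result is a false positive.
p_values <- replicate(n_tests, {
  x <- rnorm(n_obs)
  y <- rnorm(n_obs)
  cor.test(x, y)$p.value
})

# Share of studies that look significant at the 0.05 level
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The share should land close to 5%, which is what the 0.05 threshold allows by construction. If only the significant results get reported, that trickle of false positives can look like a stream of discoveries.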
Here is a new webinar from Revolution Analytics that introduces the use of the R statistical programming environment for doing data mining. Presented by Joe Rickert, the seminar demos several examples of data mining with various R packages. Rickert’s slides can be downloaded HERE. Enjoy!
I had another scintillating evening with the Los Angeles area R User Group, January 24, 2013. I always have a good time at this Meetup, which has become my favorite, but I was particularly intrigued by the theme – a panel of distinguished data scientists talking about “data science.” Fun to be sure! The R group meetings are free along with free validated parking and all the pizza and drinks you can navigate.
Once again held at Adconion Media Group in Santa Monica, the panel consisted of: Avram Aelony, Eric Kostello, Yasmin Lucero, Szilard Pafka, Ryan Rosario, and Oliver Will (see inset photo I took with my new iPhone 5). The first order of business was panelist introductions. It appears that all but one held a Ph.D., which led to the first discussion topic – do you need a Ph.D. to be a data scientist? Despite the panelists’ own credentials, the consensus was you don’t. I agree in a sense, but I think a Ph.D. in a related field like statistics, computer science, mathematics or even physics would certainly propel your career.
A lot of the hour-and-a-half discussion dealt with defining “data science” along with what a typical data scientist actually does. I found this dialog enlightening because I always like to better understand how my fellow data scientists view themselves and our field. The panel seemed to agree that the term “data science” is relatively new and maybe somewhat overhyped by some, but many of its methods have seasoned and sound foundations in statistics, computer science and their various related fields (e.g., machine learning, databases, distributed computing, visualization, etc.).
Questions from the audience were welcome (nearly 100 in attendance) so I brought up a point about how many of the job descriptions being floated these days for data science positions seem to be overly heavy with qualifications to the point where the successful candidate needs to fill the role of CTO, VP Engineering, system admin, and coder all at once. Heck, if the employer is looking for a one-person company why didn’t they say so! The panel’s response was that many times, the employer has no idea what a data scientist is, so they just Google a bunch of terms and throw them into the job description. A statistically insignificant number of candidates will have all the qualifications, so the point was – just go in and talk to them and see what sticks.
The discussion eventually moved into more technical subjects such as using R version control and processing environments such as Hadoop and Amazon EC2. I thought there was a good balance between technical and non-technical material, but the organizer Szilard Pafka had so many other topics to cover he’s already scheduled a Part 2 panel for March.
After walking away from the thought-provoking panel discussion, I stopped to think that this is an excellent time for a technical person to retool herself to get aboard the data science gravy train. This is why I put together a list of FREE data science educational resources over at Big Data Republic.
It is a mighty good time to be a data scientist, and being part of the Silicon Beach data science scene is pretty energizing. I hope to see you all at the next Meetup!