Monthly Archives: December 2012

Do you have a Statistically Significant Other?


xkcd: Boyfriend

… Face it – I’m your statistically significant other


Big Data Coffee Table Book


I guess it had to happen eventually, but this soon? “Big Data” is an awfully hot buzzword, seen in mainstream publications like the New York Times and Harvard Business Review, but a coffee table book? Yes indeed, The Human Face of Big Data is a 5 lb. hardcover by Rick Smolan and Jennifer Erwitt, published in Nov. 2012.

Rumor has it that the book is landing on the desks of world leaders, dignitaries and other notable people around the world. The list includes President Obama, the Dalai Lama, Pope Benedict XVI and Chinese Premier Wen Jiabao, as well as Twitter co-founder Jack Dorsey; Daniel Tunkelang, chief data scientist at LinkedIn; and actor Robin Williams.

Among the images in the book’s 224 pages is a blended image of 1,400 different shots of New York’s Times Square taken across 15 hours. In an abstract sense, big data is about people: what they do, where they go, who they know and so on. The book’s overarching theme is the story of how data, once harnessed, solves problems and in some ways creates new ones.

The perfect holiday gift from a Data Scientist!

Getting Started with R and Hadoop


I’d like to offer the following slide presentation on combining the power of R with the Hadoop platform. This presentation was made by Jeffrey Breen of the Chicago Area Hadoop Users Group and Chicago Area R Users Group in August 2012. It is a great overview for anyone considering placing an R-based application in a production computation environment.

Los Angeles area R user group

LA R user group


On Dec. 6, 2012, I attended another great evening event with the Los Angeles area R user group. This was my second meeting with the group and I was looking forward to the topic: “Real-time dashboards with Node and R.” As a data scientist, my development environment of choice is R (with a little Octave thrown in for good measure – Octave is an open source language largely compatible with Matlab), so I’m always looking for ways to extend my knowledge of R and different ways to apply it to machine learning. This meeting served that purpose very well.

I arrived early at the meeting venue, Adconion Media Group in Santa Monica. Silicon Beach, of which Santa Monica is part, is really heating up with interest in Big Data. Adconion is a fantastic Meetup venue: easily accessible, with great parking and high tech appeal, and somewhat recognizable if you ever watched “Curb Your Enthusiasm,” since this is the same office park where Larry David’s office is located. Adconion supplied a wide selection of pizza and soft drinks (Thank you!). The first hour was a meet-and-greet with all the attendees; 74 registered this time. I met a bunch of like-minded data scientists, including a Caltech Ph.D. candidate in computational neuroscience and another who uses R for sentiment analysis on publicly traded companies for investment managers.

Once the lecture began, the group was led through a fast-paced example of using Node and Google Chart Tools along with a Twitter sentiment analysis algorithm in R to yield a “happiness” indicator for countries around the world. The globe map used color coding to indicate general “happiness.” Click HERE for a demo that presents “How popular is fast food.” The presenter, Dr. Dirk Neumann, is a data scientist and founder of the social supercomputing startup Opani. He received a Ph.D. in Computation and Neural Systems from Caltech.

As I mentioned in my previous field report for this group, I highly recommend it for anyone in the field. The presentation content tends to be fairly complex, but it’s a good way to push your personal envelope.

Linear Regression Fallacies


As a data scientist, I’m continually amused by the ongoing misuse of the principle of linear regression. You see it all the time in the press. Lately, regression theory has been underlying much of the talk surrounding the demise of American entitlement programs like Medicare and Social Security. Forecasts abound, saying that the country can no longer afford retirement benefits and elder healthcare coverage. But you must stop to ask what these so-called forecasts are based on.

Many forecasts unwisely project healthcare costs far into the future by assuming that the trends of the past will continue unaltered. But this mentality ignores reality. To look at this another way, just because your son is 4 feet tall at age 6 doesn’t mean he’ll be 12 feet tall at age 18. And just because the average American born today will live to the age of 78 doesn’t mean that the baby born in 2032 will live to 100. Forecasting something like healthcare is just as prickly.
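To make the arithmetic behind the 12-foot son concrete, here is a minimal Python sketch. It assumes the naive forecaster draws a straight line from zero through the single observation (4 feet at age 6) and simply keeps extending it; the function name is hypothetical and used only for illustration.

```python
def extrapolate_through_origin(x_known, y_known, x_new):
    """Naive straight-line extrapolation.

    Assumes y is proportional to x (a line through the origin),
    which is the flawed assumption being illustrated here.
    """
    return y_known * x_new / x_known


# The example from the text: 4 feet tall at age 6, projected to age 18.
height_at_18 = extrapolate_through_origin(6, 4.0, 18)
print(f"{height_at_18:.1f}")  # prints 12.0
```

The code is doing exactly what it was told; the trouble is the assumption that growth at ages 0 to 6 tells you anything about growth at ages 6 to 18. The further you move from the data you actually have, the more that assumption carries the whole forecast.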

Making bets on the distant future is not wise because it’s unknowable. Too many exceptional events can occur to affect life along the regression line – wars, financial bubbles, financial crashes, extreme weather, political demagoguery, etc. No one – no business, no government agency – makes plans today based on a vision of the world 20 years ahead. Apple doesn’t do it. Google doesn’t do it. The Department of Defense doesn’t do it. You and I don’t do it. Not even insurance companies do it, and much of their business is tied to the future. But many in our U.S. government, prodded along by the anti-entitlement lobby, are pushing to gut these programs while relying on projections of past experience into the future, without adjusting for changes in behavior or policy.

None of this means that uncertainty in economic forecasts makes economic planning pointless. There are indeed good reasons for looking ahead, in small increments, along the regression line, just not good reasons for making sweeping changes in programs as complex and far reaching as Social Security and Medicare. So let’s accept life on the regression line for what it is – a limited and isolated view of the future.
