I guess it had to happen eventually, but this soon? “Big Data” is an awfully hot buzzword, seen in mainstream publications like the New York Times and Harvard Business Review, but a coffee table book? Yes indeed: The Human Face of Big Data is a 5 lb. hardcover by Rick Smolan and Jennifer Erwitt, published in Nov. 2012.
Rumor has it that the book is landing on the desks of world leaders, dignitaries and other notable people around the world. Among others, the list includes: President Obama, the Dalai Lama, Pope Benedict XVI and Chinese Premier Wen Jiabao, and also Twitter CEO Jack Dorsey; Daniel Tunkelang, chief data scientist at LinkedIn; and actor Robin Williams.
Among the images in the book’s 224 pages is a blend of 1,400 different shots of New York’s Times Square taken across 15 hours. In an abstract sense, big data is about people: what they do, where they go, who they know and so on. The book’s overarching theme is the stories of how data, once harnessed, solves problems and in some ways creates new ones.
The perfect holiday gift from a Data Scientist!
I’d like to offer the following slide presentation on combining the power of R with the Hadoop platform. This presentation was made by Jeffrey Breen of the Chicago Area Hadoop Users Group and Chicago Area R Users Group in August 2012. It is a great overview for anyone considering placing an R-based application in a production computation environment.
On Dec. 6, 2012 I attended another great evening event with the Los Angeles area R user group. This was my second meeting with the group and I was looking forward to the topic: “Real-time dashboards with Node and R.” As a data scientist, my development environment of choice is R (with a little Octave thrown in for good measure; Octave is an open source language largely compatible with Matlab), so I’m always looking for ways to extend my knowledge of R and different ways to apply it to machine learning. This meeting served that purpose very well.
I arrived early at the meeting venue, Adconion Media Group in Santa Monica. Silicon Beach, of which Santa Monica is part, is really heating up with interest in Big Data. Adconion is a fantastic Meetup venue: easily accessible, with great parking and high-tech appeal, and somewhat recognizable if you ever watched “Curb Your Enthusiasm,” since this is the same office park where Larry David’s office is located. Adconion supplied a wide selection of pizza and soft drinks (Thank you!). The first hour is a meet-and-greet with all the attendees; 74 registered this time. I met a bunch of like-minded data scientists, including a Caltech Ph.D. candidate in computational neuroscience and another who uses R for sentiment analysis on publicly traded companies for investment managers.
Once the lecture began, the group was led through a fast-paced example of using Node and Google Chart Tools along with a Twitter sentiment analysis algorithm in R to yield a “happiness” indicator for countries around the world. The globe map used color coding to indicate general “happiness.” Click HERE for a demo that presents “How popular is fast food.” The presenter, Dr. Dirk Neumann, is a data scientist and founder of the social supercomputing startup Opani. He received a Ph.D. in Computation and Neural Systems from Caltech.
As I mentioned in my previous field report for this group, I highly recommend it for anyone in the field. The presentation content tends to be fairly complex, but it’s a good way to push your personal envelope.
As a data scientist, I’m continually amused by the ongoing misuse of linear regression. You see it all the time in the press. Lately, regression underlies much of the talk about the demise of American entitlement programs like Medicare and Social Security. Forecasts abound, saying that the country can no longer afford retirement benefits and elder healthcare coverage. But you must stop to ask what these so-called forecasts are based on.
Many forecasts unwisely project healthcare costs far into the future by assuming that the trends of the past will continue unaltered. But this mentality ignores reality. To look at this another way, just because your son is 4 feet tall at age 6 doesn’t mean he’ll be 12 feet tall at age 18. And just because the average American born today will live to the age of 78 doesn’t mean that the baby born in 2032 will live to 100. Forecasting something like healthcare is just as prickly.
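To make the height example concrete, here is a minimal Python sketch of the naive projection. The function name is hypothetical and the only numbers used are the ones in the paragraph above; it simply assumes that height keeps growing in proportion to age, which is exactly the assumption being criticized.

```python
def proportional_projection(age_now: float, value_now: float, age_later: float) -> float:
    """Naively project a value forward by assuming it stays proportional to age.

    This is the flawed reasoning from the text: it extends the straight line
    through (0, 0) and (age_now, value_now) out to age_later.
    """
    return value_now * (age_later / age_now)


# A boy who is 4 feet tall at age 6, projected to age 18.
projected_height = proportional_projection(age_now=6, value_now=4.0, age_later=18)
print(f"Projected height at 18: {projected_height:.0f} feet")
```

The arithmetic is flawless and the answer is absurd, because the trend was never going to hold over that range. The same caution applies to any long-range forecast that extends a fitted line well beyond the data behind it.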
Making bets on the distant future is not wise because it’s unknowable. Too many exceptional events can occur to affect life along the regression line – wars, financial bubbles, financial crashes, extreme weather, political demagoguery, etc. No one – no business, no government agency – makes plans today based on a vision of the world 20 years ahead. Apple doesn’t do it. Google doesn’t do it. The Department of Defense doesn’t do it. You and I don’t do it. Not even insurance companies do it, and much of their business is tied to the future. But many in our U.S. government, prodded along by the anti-entitlement lobby, are pushing to gut these programs while relying on projections of past experience into the future, without adjusting for changes in behavior or policy.
None of this means that uncertainty makes economic planning pointless. There are indeed good reasons for looking ahead, in small increments, along the regression line; there is just no good reason for making sweeping changes to programs as complex and far-reaching as Social Security and Medicare on that basis. So let’s accept life on the regression line for what it is – a limited and isolated view of the future.
I had an excellent alternative to all the media machinations on Election Day, Nov. 6, 2012: I went to a machine learning event! Hosted by the LA Machine Learning group on Meetup.com, the event went by the name of “The Unreasonable Effectiveness of Ensembles.” This topic was of specific interest to me because many of the entries on the Kaggle leaderboard for the $3 million Heritage Health data science competition use ensembles. I wanted to refine my knowledge about this useful ML technique.
The event took place at the Century City offices of Factual.com, a company that provides access to data for powering web and mobile apps, mobile advertising, and enterprise solutions. I arrived early to get a jump on the vibe of this Meetup group since this was my first event. The Factual office was perfect for this kind of thing, complete with a large meeting area that had ample lecture space, bean bag chairs, a piano and games designed to please the developer geeks working there. Factual did it right by providing plenty of pizza and beer. One guy had his tablet tuned to Nate Silver’s blog to monitor the election results. I chatted with a number of fellow data scientists until the talk was about to begin.
The lecturer was Rudiger Lippert, a software developer at Factual. Rudiger studied Electrical Engineering at Boston University and went on to get a Master’s degree at UCLA, specializing in Signal Processing. In graduate school his research centered around Speech Recognition. Rudi’s talk was excellent and covered all the areas of ensembles I had hoped for.
Ensemble methods are considered by many to be the most important development in machine learning of the last decade. By combining many weak models to produce a single strong model, ensemble methods have performance which rivals and very often beats that of other model classes such as Support Vector Machines (SVM) and Neural Networks. The talk started with simple Decision Trees, and went on to Bagging, Random Forests, Boosting, and newer developments such as Regularized Greedy Forests. It was a great overview of the subject and one that I plan to utilize in my consulting practice.
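The core idea of combining many weak models can be illustrated with a small calculation. The Python sketch below is not taken from the talk; it is an idealized illustration that assumes every model in the ensemble is right with the same probability and errs independently of the others, then computes how often a simple majority vote is right. The function name and the 60% accuracy figure are assumptions chosen for illustration only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a majority of n independent models votes correctly.

    Assumes each model is correct with probability p_correct and that errors
    are independent (an idealization; real ensemble members are correlated).
    Use an odd n_models so that ties cannot occur.
    """
    votes_needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(votes_needed, n_models + 1)
    )


# Each hypothetical weak model is only right 60% of the time.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote gets more reliable as models are added. In practice the members of an ensemble are never fully independent, which is why techniques such as Bagging and Random Forests deliberately inject randomness to reduce the correlation between them.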
I would definitely recommend this Meetup group, but if you’re not from LA, try to find a similar group in your area.
I’m pleased to announce that I have a new gig with Big Data Republic (sponsored by UBM Tech of InformationWeek fame) as a forum moderator. I’m looking forward to my participation in all the cool discussions. This is an exciting new destination in the Big Data space, so please stop by to see all the bleeding-edge technology being discussed.
On October 25, 2012 I attended the LA area R user group organized through: http://www.meetup.com/LAarea-R-usergroup/. This is a really nice user group with a focus on the R statistical programming environment that many data scientists use for developing machine learning algorithms. I’ve used R for years and it is my favorite, although I do on occasion use Octave, an open source language largely compatible with Matlab. If you’re looking to get into R or strengthen your skills with R, I highly recommend you seek out a local R group on Meetup.com.
The event was held over at the offices of Adconion Media Group in Santa Monica, which happens to be walking distance from my gym. So after running a couple of miles, I headed over to the event that provided ample pizza and soft drinks (to counteract the calories burned moments earlier). It was a great venue for a group of about 50 attendees or so. It was a typical hipster, high-tech workplace, complete with a large presentation area for the meetup.
The meeting I attended was entitled “More highlights of useR! 2012 conference (Part 2)” and amounted to a discussion of some cool stuff discovered at the useR! 2012 Conference. The group’s organizer Szilard Pafka led a very detailed tutorial/demo on how to load and use R with Amazon’s EC2. If you’re interested in this configuration, the recorded talk can be obtained here: http://www.r-bloggers.com/RUG/2012/08/highlights-from-the-user-2012-conference/
Although I wasn’t particularly interested in the topics for this meetup, I do greatly appreciate there being a user group that aligns so well with my interests in data science, machine learning and statistics. Meetup.com is an excellent resource for finding like-minded people in your area of expertise.
Welcome to Radical Data Science! This blog’s goal is to impart all that is contemporary within the field of Data Science including machine learning, data mining, predictive analytics, knowledge discovery in databases, statistical inference and most importantly the “Big Data” movement. Here at RDS, I’ll cover a wide range of aspects of this growing area of technology – Commentary, Opinion, Technology, How-To, Applications, Field Report, and Culture. I’ll try to evenly distribute the posts in each category, but the TECHNOLOGY and HOW-TO categories take longer to write so they might appear less frequently.
My name is Daniel D. Gutierrez and I come from a mathematics/computer science background, although for the last several years I’ve applied machine learning techniques to the analysis of astrophysical data sets for the LIGO project and the detection of gravitational waves. My background in data science extends back long before this cool name was in vogue. As a database technologist, I taught database courses at UCLA Extension for 18 years, wrote three database books, and served as a technical editor for Database Advisor Magazine. My academic history with data science stems from graduate machine learning work at Stanford and Caltech.
My company is AMULET Analytics, a Los Angeles-based consultancy specializing in data science services. We extract intelligence from your business data assets.
But what’s with the “radical” in RDS? This is because I firmly believe that data science is the most disruptive technology in decades. The ability to make scientific predictions based on business data sets is pretty radical in terms of competitive advantage. I first became enamored with prediction when I was a kid reading the “Foundation” series by Isaac Asimov. In the books, the character Hari Seldon is a mathematics professor who develops “psychohistory,” allowing him to predict the future in probabilistic terms. Now that’s disruptive!
As a long-time data scientist I am VERY excited about this field and with RDS I hope to spread this excitement to my readers. Oh yes, I almost forgot, please spread the word about RDS! I’d like to establish an active Data Science community here. I encourage you to leave comments on the posts, as this will enrich the learning experience for everyone.
Data Science rules!