Editor’s note: This is the first of a two-part interview.
The term “data scientist” has not yet jumped the shark. That’s according to Michael Driscoll, the chief technology officer and co-founder of Metamarkets, a startup company that delivers predictive analytics to digital, social and mobile media companies.
While Driscoll has embraced the term to describe an emerging role in the field of analytics and business intelligence, others are not quite ready to do so, and the title is a hotly debated one.
Driscoll likens data scientists to civil engineers.
“Civil engineers are part physicist and part construction worker,” he said. Likewise, the data scientist has to be able to find a balance between the theoretical and the practical within the data landscape.
What is data science?
Michael Driscoll: Data science is a neologism, and thus, like all neologisms, it’s an evolving term and title. Effectively, data scientists are those who combine the theoretical expertise of mathematicians and statisticians with the hard-nosed engineering chops of software developers. In the last decade, there’s been a renaissance in the field of machine learning, which sits at the intersection of statistics, applied mathematics and computer science. But for all of this theoretical work to be used, it ultimately needs to be coded. So data scientists are hybrids who can combine the two -- the theoretical and the practical.
When you talk about the practical piece of data science, what are you referring to?
Driscoll: I generally frame the three skills of data science as, first, "data munging", which involves the ability to slice and dice, transform, extract and work with data in a facile, fluid way. The second skill is data modeling, which basically means taking a set of data, developing a hypothesis about a pattern in the data and testing that hypothesis with statistical tools. The third skill is data visualization. Once you have transformed data into a usable form -- the first skill -- and you have developed a model about how some features of the data may relate to some set of observations, some outcomes of the data -- the second skill -- you then need to convey that insight in a way decision makers understand. That requires the ability to tell a story or build a narrative visually, and that’s where data visualization comes in.
Why is building a narrative so important?
Driscoll: If data scientists are to thrive in this age of massive inputs and outputs of information, we need ways of consuming information at a commensurately high rate. Data visualization is one of those ways. In fact, it’s probably the most important way we can consume information at a very high rate.
How do predictive analytics and data science fit together?
Driscoll: Data is what data does. The goal of all of this data science ultimately is to predict the behavior of consumers, of systems. Effectively, just having data surface insights isn’t enough. You want to be able to make predictions about what’s going to happen next. According to Karl Popper, the entire goal of science is to make predictions that can be falsified. And making predictions is really the end goal of all of the work that [data scientists] do. It’s looking forward, not looking backward. One might say that business intelligence and this world of reporting is all about the past; predictive analytics is about the future.
And yet, some say predictive analytics requires looking back in order to predict the future.
Driscoll: Absolutely. The goal of predictive analytics is to study the past but ultimately to generate predictions about the future. I’ll give you an example. Facebook was trying to understand what types of user behavior on the Facebook system would lead to higher engagement with the platform -- that is, the likelihood that users would stay active three months after signing up. So they looked historically, at the past, of all of their users. They looked at gender, how many friends users had, what colleges they were at. They looked at all of these different observed user features and then, for three months afterward, they studied which of those observed features corresponded most with a high level of engagement later on. What they found was that the feature most strongly correlated with using Facebook actively three months later was the number of friends a user had. That was a predictive analytic insight. As a result, once people signed up on Facebook, the company worked hard to suggest as many connections as possible for each new user’s network. Predictive analytics is essentially about connecting observed events with outcomes; that’s probably the simplest way to put it. There are lots of ways to slice it, but ultimately, you’re building a mathematical model of a system. To test whether that mathematical model is correct, you make predictions and then you observe whether future events actually confirm or refute your hypothesis about the system.
But, do you really need a data scientist to build your models?
Driscoll: Here’s an example of a predictive model: You want to look at features of credit card purchase behavior and whether or not a given purchase was fraudulent. Let’s say your two features are the time of day and the country of the purchase. In some cases, simply visualizing the number of fraudulent credit card transactions by country will jump out at you. Any purchases made in Estonia when the credit card holder is in America are fraudulent purchases. You don’t really need a statistical model to tell you that. It’s simply plotting the data. The truth is, when differences become small, you need to rely on statistics to tell you whether the trends you observe are significant. The obvious things are easy. It really comes down to the much more nuanced, smaller differences that require statistics to tease out the difference between something that’s noise and something that’s signal.
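Separating a small difference from noise is exactly what a significance test does. As a minimal sketch, here is a hand-rolled two-proportion z-test on invented fraud counts for two hypothetical countries:

```python
import math

# Invented counts: fraudulent vs. total purchases in two countries.
fraud_a, n_a = 130, 10_000   # Country A: 1.30% fraud rate
fraud_b, n_b = 100, 10_000   # Country B: 1.00% fraud rate

p_a, p_b = fraud_a / n_a, fraud_b / n_b
p_pool = (fraud_a + fraud_b) / (n_a + n_b)           # pooled fraud rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se                                  # two-proportion z-score

# |z| > 1.96 means the gap is unlikely to be noise at the 5% level.
print(f"rate A={p_a:.2%}  rate B={p_b:.2%}  z={z:.2f}")
```

With these numbers the 0.3-point gap is barely significant; eyeballing a plot would never settle it, which is Driscoll's point about small differences needing statistics.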
How have the kinds of data businesses are tapping into changed in recent years?
Driscoll: There are a few trends underneath this. The first is the rise of sensor technology. That would be cell phones, navigation devices or point-of-sale instruments on cash registers. Increasingly, we have these sensors in our cars and in our homes, tracking actions and events and consumer choices and purchases. That’s one thing that’s causing this massive increase in volume and velocity of data. Before, we had a lot of these devices that were chirping, but no one was listening. It’s part of this trend -- the exponential decrease in the cost of bandwidth, storage and compute power has made it worthwhile to keep data that previously would have been too expensive to keep.
The biggest class of data that’s emerging as the most interesting is transaction data -- transaction streams. Previously, systems were designed to roll those events up into more of a summary form, but now, increasingly, it’s possible for people to do analysis at the lowest grain of data, which is at the transaction level. Transactions are everything from when you go to the credit card machine at the supermarket and swipe your card, to when you go through an E-ZPass lane on the highway, to when you make a phone call. All of these transactions have many attributes attached to them and typically, as they’re occurring or after they occur, the data from those transactions is being pulsed into servers around the world. Collectively, these transactions represent the pulse of the planet. That to me is the most interesting type of structured data out there.
Why do you find transaction data the most interesting?
Driscoll: Transactions represent facts, and when you’re building models it’s much easier to build models over factual actions than it is over sentimental speech. To draw on my own experience: if we were building a model for customer retention when I was at a North American telco two years ago, we could have pulled the logs from all of the customer phone calls and attempted to do an analysis of the transcripts of customers who said they were leaving the provider. We could have done that and performed some sentiment analysis. People may have claimed -- and in fact, people often claim -- that it’s about the signal quality on their cell phones and that they were getting a lot of dropped calls. Therefore, they were upset, and that’s why they were going to cancel their contract. When we actually looked at the facts of the data, we found there wasn’t a high correlation between signal quality, the number of dropped calls and whether customers canceled their contracts. What was much more important was whether or not a friend -- someone they spoke with frequently -- had canceled their contract the month before. That’s the difference. Structured data can tell stories that are very hard to tease out of unstructured data.
How are these new data sources changing the way models are built?
Driscoll: Until recently, a lot of statistical modeling done over real-world data was typically performed over very small data sets. Or, I should say, a lot of statistical modeling was done over summarized data sets. With the rise and the availability of fine-grained transaction data on the scale of billions of events per day, it’s changed the way businesses build models about their customers. It’s made those models more complex, more powerful and more challenging. Ultimately, in terms of the time granularity of the models, it’s changed the scope of modeling, from talking about how customers behave over long periods of time -- whether that be quarters or months -- to how customers behave over the span of just minutes.
And the tools? What do those look like for a data scientist?
Driscoll: When you move from modeling relatively small, high-level summarized data to modeling over large-scale transaction logs, it is no longer possible to build models exogenously from the system that holds the data. So, one consequence has been that data scientists have increasingly had to find ways of moving the analytics to the data rather than moving the data to the analytics. That’s because data is heavy, and analytical algorithms are light. So, there’s been a real push in the last couple of years for people to try to push analytics into the database. As far as tools go, there’s more of a requirement now than in the past for a data scientist to be able to write code that can run inside of a database or write code that can scale.
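One way to picture "moving the analytics to the data" is an aggregation pushed into the database engine, so only summary rows cross the wire instead of the raw transactions. A sketch using Python's built-in sqlite3, with an invented schema:

```python
import sqlite3

# In-memory stand-in for a transaction store; schema invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tx (customer TEXT, amount REAL)")
db.executemany("INSERT INTO tx VALUES (?, ?)",
               [("alice", 10.0), ("alice", 30.0), ("bob", 5.0), ("bob", 7.0)])

# The aggregation runs inside the database; the client receives only
# one summary row per customer, not one row per transaction.
rows = db.execute(
    "SELECT customer, AVG(amount) FROM tx GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```

With four rows the difference is invisible; with billions of transactions per day, shipping the query to the data instead of the data to the query is the whole game.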
There’s so much talk about Hadoop these days. How does that fit in?
Driscoll: Hadoop is a platform for large-scale data processing, and ultimately, if you want to build models over large-scale data, you’ve got to find a way to do your modeling inside the Hadoop platform. And there’s an emerging set of tools that allows folks to do that. One is called Mahout; it’s an open source machine learning toolkit. That’s probably the one that’s got the most traction.
What do you mean by “large-scale data?”
Driscoll: Small data is data that can fit in RAM [random-access memory], in-memory, on your desktop. Medium data is data that can fit on a single machine. So, small data is from 0 to 10 gigs; medium data is from 100 gigs to a terabyte and can fit on a single hard drive. Big data is data that cannot fit on a single machine; it must be distributed over many machines. Ultimately, if you want to do big data analytics, you’ve got to find a way to write distributed algorithms that also run in parallel over many machines. That’s effectively what Hadoop is -- a platform for doing distributed computing.
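Hadoop itself is written in Java, but the map/shuffle/reduce pattern it implements can be sketched in single-process Python. Word count is the pattern's canonical illustration; the "machines" and documents below are invented stand-ins:

```python
from collections import defaultdict

# Documents spread across two "machines" (shards); contents are invented.
shards = [["big data", "data science"], ["big models", "data"]]

# Map phase: each machine independently emits (word, 1) pairs for its shard.
mapped = [(word, 1) for shard in shards
          for doc in shard for word in doc.split()]

# Shuffle phase: group pairs by key so each key lands on one reducer.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each reducer sums the counts for its keys.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)
```

The point of the real platform is that the map and reduce phases run in parallel on many machines, each touching only the shard of data stored locally -- which is what makes analytics over data that "cannot fit on a single machine" feasible.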
We’ve talked about open source tools with Hadoop and Mahout. Why are data scientists drawn to them?
Driscoll: The most popular tool for data science these days -- both open source and commercial -- is R, which is an environment for statistical computing and data visualization. There are a few reasons why open source has such a draw for data scientists. One is that there is a large community of individuals, both in academia and in industry, that uses R. Many of those users have created libraries that allow someone to use a new clustering algorithm, find a better way of doing a logistic regression or get a faster method for identifying statistical anomalies. All of these libraries created by users of the tool are shared freely. Right now, R has thousands of these libraries, made available through a website called CRAN -- the Comprehensive R Archive Network.
I think the draw is that data science, like regular science, advances most quickly when it’s done in the open. Because this field is changing so quickly, the open source community is able to disseminate new ideas and approaches so that new techniques can flow quickly between practitioners. By contrast, if you look at tools such as MATLAB or SAS, the time it takes for a new algorithm to be discovered and implemented in a commercial piece of software can be months. Commercial software, by its very nature, is going to move much more slowly in adoption than open source.