Stunning interactive visualization on how Americans spend their day.
![]()
WSJ with an interesting analysis on the ongoing big data hype:
Wielding numbers that stretched to 20 or more digits, researchers recently reported on the world’s massive ability to store, communicate and compute information. All three have grown at annual rates of at least 23% since 1986, according to a study published this month in Science.
…
But the digital avalanche isn’t as massive as those numbers suggest. Much of the growth reflects the surge in high-resolution video and photos. In addition, while there is much more information available, each piece is being consumed, on average, by far fewer people than in the past.
![]()
A reminder that experiments and data-driven decisions require guts, persistence and internal marketing.
Interesting new data quality service from Google. While most companies need a more automated approach, this looks great for one-time individual research projects.
In data mining, communication matters as much as analytical skills. This is true for analysts ‘selling’ their findings to others. It is also true for data mining algorithms. While some models, like regressions and decision trees, are relatively easy for humans to understand, the popular neural networks are not. Solving this problem is an interesting visualization and data mining question in itself.
One approach is displaying the network itself like in the new visualization offered by SPSS:
![]()
While this looks flashy, it doesn’t address the problem of understanding how inputs really drive the model.
Well-known data miner Gordon Linoff offers a more thoughtful approach on his blog, including scatter plots of scores depending on how neurons dominate predictions, as displayed below.
![]()
Highly recommended read.
Thanks to colleague Tim for inspiring this post.
The “Traveling Salesman Problem” is a classic in Operations Research. It asks for the shortest round trip through a set of cities given the distances between them. For 20 cities there are already 2,432,902,008,176,640,000 (20!) possible orderings of the cities. A computer able to calculate one trip length per millisecond would still need roughly 77 million years to check all of them.
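A quick back-of-the-envelope check of this arithmetic, as a minimal sketch. The one-millisecond evaluation time comes from the text; the names and the choice of a 365.25-day year are illustrative assumptions. The sketch counts both all 20! orderings and the smaller number of distinct round trips (fixed start city, direction ignored). Under these assumptions the brute-force estimate lands in the tens of millions of years for all orderings.

```python
import math

CITIES = 20
MS_PER_TOUR = 1  # assumption from the text: one millisecond per tour length
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

# Every possible ordering of the 20 cities.
orderings = math.factorial(CITIES)

# Distinct round trips: fix the start city and ignore travel direction.
distinct_tours = math.factorial(CITIES - 1) // 2


def years_to_check(tours, ms_per_tour=MS_PER_TOUR):
    """Years needed to evaluate `tours` tours one after another."""
    return tours * ms_per_tour / 1000 / SECONDS_PER_YEAR


print(f"{orderings:,} orderings -> {years_to_check(orderings):,.0f} years")
print(f"{distinct_tours:,} round trips -> {years_to_check(distinct_tours):,.0f} years")
```

Either way of counting makes the point: checking every tour is hopeless, which is why smarter optimization methods matter.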
It’s fascinating to see how researchers keep pushing the limits, solving ever larger problems using methods from mathematical optimization and Operations Research. William Cook, for example, claims to have calculated the best tour visiting 1.9 million cities:
![]()
Language matters. When studying math I was surprised at first how hard it was to come to terms with it. But once you are familiar with its notation, it is not only a powerful way to communicate ideas but even supports generating them.
For advancing analytical ideas and practices, agreeing on a common language is crucial. The field had its breakthrough when mathematicians and computer scientists finally started talking to each other. Today the increase in global digital communication accelerates what has always been true: new ideas in any field are rarely the product of individual geniuses alone but a social product of combining existing concepts.
Sadly we are not there yet. I recently posted on the confusion between BI and Business Analytics, but there is more confusion beneath the surface:
Who will take the lead resolving this?
Tim Elliot from SAP just posted on the difference between BI and Business Analytics and how loosely both terms are used in the market. Interestingly, BI searches are losing ground while analytics searches are increasing.
![]()
![]()
Infoboom just published a helpful introduction to advanced analytics. Here are the quotes I liked most:
“Top-performing companies are three times more likely than lower performers to be sophisticated users of analytics, and are two times more likely to say that their analytics use is a competitive differentiator.” - MIT Sloan Management Review (link)
“The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.” - Gartner (link)
“Success with business analytics requires more than the data and algorithms. It requires a company culture committed to using information for breakthrough ideas and operations.” - IBM Website (link)
“Need music data? Get all the data you want and more from the freely available million song dataset, offered by LabROSA at Columbia University and Echo Nest. There’s lots of metadata on song features and your standard stuff like year and artist. There are also several code wrappers and samples to help researchers make use of the data right away.”
Based on Rexer’s 4th Annual Survey (2010). Recommendations from data mining practitioners on:
Why social search will help Facebook more than its users.
An infographic classic: Google News as a real-time tile chart screen saver.
![]()
Gartner’s definition of Advanced Analytics.
It is further put into context by the observation that “this can be viewed as a third step in supporting operational business decisions. Fixed rules and prepared policies gave way to more informed decisions powered by the right information delivered at the right time, whether through customer relationship management (CRM) or enterprise resource planning (ERP) or other applications. The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.”
Data movie of people getting fatter around the globe. Data movies are often misused, but this one is impressive.
Found via Flowing Data.
I recently commented that analytics is more than just algorithms. This is true in many respects. One is viewing it from a model lifecycle perspective. While this process is described in many data mining books, the article Best Practices for Managing Predictive Models in a Production Environment is an excellent free resource loaded with expert advice. Here is a quick outline of the process steps mentioned:
1. Determine the Business Objective
Typical use cases include modeling customer behavior, risk management, credit scoring, rate making, fraud detection, customer retention, customer lifetime value, customer attrition/churn, and marketing response models.
2. Access and Manage the Data
In general a standard data mart is desirable to promote best practices but often new models will require new sources and data.
3. Develop the Model
Includes application of exploratory statistics and visual data discovery, further data preparation, and training and comparing models. Many modelling decisions, such as variable selection, are typically guided by organizational rules including legal and management requirements. There are many more best practices, such as validating the model on a hold-out data set, which can be a costly thing to forget.
4. Validate the Model
A continuous effort to ensure it meets business, operational, legal, analytical and other requirements.
5. Deploy the Model
Choose between several possible deployment scenarios incl. batch, transactional and on-demand.
6. Monitor the Model
Includes automated tracking of user-defined, portable KPIs and detection of disruptions in input data.
Anand Rajaraman and Jeff Ullman from Stanford University share a complete book about mining massive data sets as a PDF at this link.
Excellent balance between introducing concepts and examples. I just finished Chapter 2, which provides a nice introduction to the advantages of Google’s Map-Reduce (scalability, fault tolerance) and its limitations (it suits computationally intensive tasks on huge and stable data sets), followed by examples of how to implement typical algebraic and database set operations.
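To give a flavor of how set operations fit the map-reduce pattern, here is a minimal single-machine sketch. It is not code from the book: the `map_reduce` helper, the relation names `R` and `S`, and the sample rows are all hypothetical, and a real framework would distribute the map, shuffle and reduce phases across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    output = []
    for key in sorted(groups):              # reduce phase, one call per key
        output.extend(reducer(key, groups[key]))
    return output


# Two hypothetical relations; each row is tagged with the relation it came from.
R = [("alice", 1), ("bob", 2), ("carol", 3)]
S = [("bob", 2), ("carol", 3), ("dave", 4)]
tagged = [("R", row) for row in R] + [("S", row) for row in S]


def row_mapper(record):
    """Emit the row itself as the key and its source relation as the value."""
    source, row = record
    yield row, source


def intersection_reducer(row, sources):
    """Keep a row only if it was seen in both relations."""
    if {"R", "S"} <= set(sources):
        yield row


def union_reducer(row, sources):
    """Keep every row exactly once, whatever its source."""
    yield row


intersection = map_reduce(tagged, row_mapper, intersection_reducer)
union = map_reduce(tagged, row_mapper, union_reducer)
print(intersection)
print(union)
```

The same mapper serves both operations; only the reducer decides which keys survive, which is what makes these operations easy to parallelize.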
Interesting perspective of how the data deluge could transform science. While the complete book is available for download check out the linked review first.
I find inspiring:
The review kicks off with a comment on data deluge that points right at the heart of its challenge:
Gathering data is so easy and quick that it exceeds our capacity to validate, analyze, visualize, store, and curate the information.
Linear Programming (LP) is one of the earliest and most straightforward approaches to optimization. Its basic assumption is that even complicated real-world systems, such as supply chains, can be modeled with a set of simple linear equations and inequalities.
LPs scale in two ways. First, in quantity: modern software can solve models with millions of objects, sometimes almost in real time. Second, in quality: even complex and dynamic scenarios, such as network flows, economic markets, or auctions, can be expressed using basic equations.
The paper Decision Making Using PROC OPTMODEL illustrates this flexibility by demonstrating a case study of applying stochastic programming to business scenario planning. The basic idea is simple but effective: by building a tree of “potential futures” into the model, the planner obtains a production plan that is both feasible in any scenario and simultaneously optimizes the expected return.
I am surprised whenever I see “simple” concepts creating so much value. Modern Operations Research offers many more sophisticated tools than linear programs. But LPs will keep creating value as one of the all-time favorites in applied optimization.
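To make “a set of simple equations” concrete, here is a toy two-variable LP as a minimal sketch. The numbers are a made-up textbook-style production problem, not taken from the paper, and the solver is a brute-force enumeration of corner points rather than the simplex or interior-point methods real software uses. It assumes the feasible region is bounded, so that an optimum sits at a corner.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical problem: maximize 3x + 5y.
# Each constraint (a, b, c) means a*x + b*y <= c.
constraints = [
    (1, 0, 4),    # x <= 4
    (0, 2, 12),   # 2y <= 12
    (3, 2, 18),   # 3x + 2y <= 18
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
objective = (3, 5)


def solve_lp_2d(objective, constraints):
    """Return (best value, best point) by checking every feasible corner."""
    best_value, best_point = None, None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        # Keep the corner only if it satisfies all constraints.
        if all(a * x + b * y <= c for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if best_value is None or value > best_value:
                best_value, best_point = value, (x, y)
    return best_value, best_point


best_value, best_point = solve_lp_2d(objective, constraints)
print(best_value, best_point)
```

Corner enumeration grows combinatorially with the number of constraints, so it only works for toy cases; it is shown here because it makes the geometry of an LP easy to see.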

A good read about data visualization and storytelling. Features Charles Joseph Minard’s amazing visualization from 1869 depicting Napoleon’s ill-fated march to and from Russia (which made me love the field).
![]()
Werner Vogels does a nice job evangelizing Amazon’s elastic storage vision. One thought I found intriguing is his characterization of the challenge of big data: it’s not primarily that there is lots of it, but that it is collected without knowing which questions it will be used to answer.
While there are ever faster machines and algorithms, this challenge will not go away, and it is a strong case for statistical virtues that are not en vogue but absolutely crucial, such as design of experiments and data preparation.
Convincing and thought-provoking pitch from Ray Kurzweil, one of the most visionary and most controversial figures in IT, on how the exponential growth of technologies will transform industries and pose new opportunities, and hurdles, for business and society.
I think his judgement that the benefits of IT developments outweigh the perils is rather bold. The truth is, I suspect: no one knows.
![]()
Interesting move … but “Eighty percent of (SAS and IBM analytics) revenues are from 15 core statistical procedure.” is presumably not only wrong but — more importantly — mostly irrelevant.
Arguing about Analytics by counting procedures is like comparing two pieces of art by counting the brushes that have been used to create them. True analytics solutions include data management and integration, user interface, process support, scalability, LoB and industry best practices, reporting and much more.
And it’s more than a “product”. Analytics vendors must be committed to and capable of supporting their customers beyond the point of sale in creating business value for their particular industries and lines of business. It’s so much more than bits and bytes.
All opinions, as usual, my own.
One of the first things I learned at my financial market lectures in London a few years ago turned out to be wrong: the Efficient Market Hypothesis. Real stock markets do not follow a random walk. This inspires researchers and practitioners around the globe to try to predict them.
The research paper “Twitter mood predicts the stock market” created significant media buzz (e.g. Wired has a neat summary) and eventually even inspired a new hedge fund.
Don’t buy.
On the positive side, there are many good ideas in this work: sufficient data (~10M tweets), solid preparation (normalization, moods thoroughly classified along various dimensions) and various statistical tests and models (Granger, regression, neural net).
But it’s still useless. Here is the problem. In the approach described in their paper, Johan Bollen and his colleagues ignore one of data mining’s most critical practices: they neither tested nor validated their model properly (i.e. on separate data sets). Using a single data set to choose the model (“calm” over other emotions in this case) and then judging its performance on the very same data it has just been fitted to exaggerates its ability to predict.
Pick the emotion and lead time that predicts the Dow Jones best over a given period and guess what: It indeed will predict the Dow Jones pretty well over this same interval. It’s really like playing the lottery more than once but without drawing any new numbers. I’d bet you’d do pretty well after round 1.
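The effect is easy to reproduce with pure noise. The following is a minimal simulation sketch, not a reconstruction of the paper’s method: the market moves and all candidate “mood signals” are random coin flips, and the function name, parameter values and seed are arbitrary assumptions. The signal that looks best on the fitting window is then scored on fresh days it was not selected on.

```python
import random


def accuracy(signal, moves):
    """Fraction of days on which the signal matches the market direction."""
    return sum(s == m for s, m in zip(signal, moves)) / len(moves)


def run_experiment(n_signals=200, n_fit_days=50, n_new_days=1000, seed=0):
    """Pick the best of many random signals in-sample, then test it on new data."""
    rng = random.Random(seed)
    total_days = n_fit_days + n_new_days
    # Market direction and every candidate signal are independent coin flips.
    market = [rng.choice((-1, 1)) for _ in range(total_days)]
    signals = [
        [rng.choice((-1, 1)) for _ in range(total_days)]
        for _ in range(n_signals)
    ]
    # Model selection on the fitting window only.
    best = max(
        signals,
        key=lambda s: accuracy(s[:n_fit_days], market[:n_fit_days]),
    )
    in_sample = accuracy(best[:n_fit_days], market[:n_fit_days])
    out_of_sample = accuracy(best[n_fit_days:], market[n_fit_days:])
    return in_sample, out_of_sample


in_sample, out_of_sample = run_experiment()
print(f"in-sample accuracy of the selected signal: {in_sample:.2f}")
print(f"accuracy on fresh data:                    {out_of_sample:.2f}")
```

The selected signal looks clearly better than chance on the data used to pick it, while on fresh days it falls back to roughly a coin flip. That gap is exactly what a hold-out data set is meant to expose.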
This is not to say that Twitter cannot predict the stock market. But until someone gets the numbers right, you had better save your money.
![]()
Junkchart describes the two main challenges well:
Facebook is desperate. It is different from Google. A Google user is searching for something, and if an ad is relevant, the user will click on it. The Facebook user is typically chatting with friends; not surprisingly, advertisers have not been impressed with the effectiveness of Facebook ads. That’s why they are trying to insert these ads in to our conversations. We’d feel like a waiter standing next to our table at a restaurant, listening to our private conversation, and then inserting himself into the conversation to sell us the special whatever of the day.
It turns out most of Facebook conversations are not information-rich. Most of the chitchat is just that. So, Facebook wants to know our habits and likings. They want to know what we are doing when not chatting online. So they set up this network of feelers around the Web, the Like buttons. On the one hand, these sell convenience to the users and their communities; on the other hand, they compile profiles of users, secretly, that can be sold to marketers.
Some key points from Mark Madsen’s talk at STRATA 2011:
Mark is a great speaker, so check out the recording.
![]()
Following the misconception of the homo economicus, research on altruistic behavior is fashionable today. Psychologists have designed fascinating experiments and surveys on how social behavior is rewarded by the human brain.
Well — they could have saved their dollars.
One of the most intriguing aspects of today’s world is the vast amount of data that exists as a by-product of our day-to-day activities. “The Economist” estimates we have just crossed the point beyond which we can no longer store, let alone use, all of it.
There is one particular goldmine of data on what humans perceive as rewarding and engaging: computer games. They are not only an industry four times the size of the music recording business. They also create billions of data records a day, representing unmatched quantified traces of human decisions.
This is exciting data.
TED has an inspiring talk from Tom Chatfield on some of the lessons games teach us including elements of uncertainty, tight feedback cycles and social reputation.
The freely available report “Analytics: The new path to value” is a nice pitch on the value of analytics for an executive audience.
One of the charts I found inspirational for its clarity is the following:

Integration of these three entities — Data, Insights and Actions — versus just trying to optimize them locally is key to success.
A second chart suggests that analytics applications will mature in both visual and predictive quality:

InfoQ features this great talk on the history of machine learning.
The “AI Winter” ended in the 1990s when statisticians and computer scientists, after decades of failure, finally joined forces to develop “computer systems that improve with experience”. These probabilistic models, together with today’s huge amounts of data and elastic computing infrastructure, created a renaissance of the field. This new “Data Science” combines skills from engineering (i.e. “building scalable systems”), math and computer science, among others.
The talk includes hands-on tips such as using the NYT API for meta data or the Lynx text browser for crawling.
Professor Mason has a lovely sense of humor and the gift to explain complex ideas in easy words. If you have little time, focus on the first 15 minutes.
Great post from @FILWD on turning the tension between #datamining and #visualization into an opportunity.