In data mining, communication matters as much as analytical skill. This is true for analysts ‘selling’ their findings to others. It is also true for data mining algorithms. While some models, like regressions and decision trees, are relatively easy for humans to understand, the popular neural networks are not. Solving this pain is an interesting visualization/data-mining question in itself.
One approach is to display the network itself, as in the new visualization offered by SPSS:
While this looks flashy, it doesn’t address the problem of understanding how inputs really drive the model.
Well-known data miner Gordon Linoff offers a more thoughtful approach on his blog, including scatter plots of scores depending on which neurons dominate the predictions, as displayed below.
Highly recommended read.
Thanks to my colleague Tim for inspiring this post.
The “Traveling Salesman Problem” is a classic in Operations Research. It asks for the shortest round trip through a set of cities, given the distances between them. For 20 cities there are already 2,432,902,008,176,640,000 such tours. A computer able to evaluate one tour per millisecond would still need roughly 77 million years to check them all.
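A quick back-of-the-envelope check in Python (assuming, as the figure above does, that every ordering of the 20 cities counts as a separate tour):

```python
# Brute-force check: 20! orderings, each evaluated in one millisecond.
import math

tours = math.factorial(20)                 # 2,432,902,008,176,640,000
seconds = tours / 1000.0                   # one tour per millisecond
years = seconds / (365.25 * 24 * 3600)
print(f"{tours:,} tours -> about {years:,.0f} years of checking")
```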
It’s fascinating to see how researchers keep pushing the limits, solving ever larger problems with methods from mathematical optimization and Operations Research. William Cook, for instance, claims to have calculated the best tour visiting 1.9 million cities:
Language matters. When studying math I was surprised at first by how hard it was to come to terms with it. But once you are familiar with its notation, it is not only powerful for communicating ideas but even supports generating them.
For advancing analytical ideas and practices, agreeing on a common language is crucial. The field had its breakthrough when mathematicians and computer scientists finally started talking to each other. Today the rise of global digital communication accelerates what has always been true: new ideas in any field are rarely the product of individual geniuses alone but a social product of combining existing concepts.
Sadly, we are not there yet. I recently posted on the confusion between BI and Business Analytics, but there is more confusion beneath the surface:
- Predictions vs. Forecasting
- Decision Tree (in OR vs. Data Mining)
- Dynamic regression and Transfer Function Model in Forecasting
Who will take the lead in resolving this?
Timo Elliott from SAP just posted on the difference between BI and Business Analytics and how fuzzily both terms are used in the market. Interestingly, searches for BI are losing ground while searches for analytics are increasing.
Infoboom just published a helpful introduction to advanced analytics. Here are some of the quotes I liked most:
“Top-performing companies are three times more likely than lower performers to be sophisticated users of analytics, and are two times more likely to say that their analytics use is a competitive differentiator.“ - MIT Sloan Management Review (link)
“The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.” - Gartner (link)
“Success with business analytics requires more than the data and algorithms. It requires a company culture committed to using information for breakthrough ideas and operations.“ - IBM Website (link)
“Optimization and simulation is using analytical tools and models to maximize business process and decision effectiveness by examining alternative outcomes and scenarios, before, during and after process implementation and execution.”
— Gartner’s definition of Advanced Analytics.
It’s further put into context by the observation that “this can be viewed as a third step in supporting operational business decisions. Fixed rules and prepared policies gave way to more informed decisions powered by the right information delivered at the right time, whether through customer relationship management (CRM) or enterprise resource planning (ERP) or other applications. The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.”
“One business decision based on analytics: priceless” — http://analytics-magazine.com/january-february-2011/81-telecommunications-marketing-from-business-intelligence-to-analytics.html
I recently commented that analytics is more than just algorithms. This is true in many respects. One is viewing it from a model lifecycle perspective. While this process is described in many data mining books, the article Best Practices for Managing Predictive Models in a Production Environment is an excellent free resource loaded with expert advice. Here is a quick outline of the process steps mentioned:
1. Determine the Business Objective
Typical use cases include modeling customer behavior, risk management, credit scoring, rate making, fraud detection, customer retention, customer lifetime value, customer attrition/churn, and marketing response models.
2. Access and Manage the Data
This step includes:
- Data Integration, e.g. connecting customer data from various systems including billing, service and marketing.
- Data Quality, e.g. handling missing values and outliers.
- Data Pre-processing, e.g. computing roll-ups or interval variables, normalizing heavily skewed distributions, or clustering customers.
In general a standard data mart is desirable to promote best practices, but often new models will require new sources and data; a quick sketch of the quality and pre-processing steps is shown below.
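Here is a minimal pandas sketch of the integration, quality and pre-processing bullets above. The file names and column names are hypothetical stand-ins, not taken from the article:

```python
# Minimal sketch of typical data integration, quality and pre-processing steps;
# file names and columns are hypothetical.
import numpy as np
import pandas as pd

customers = pd.read_csv("customers.csv")      # one row per customer
calls = pd.read_csv("call_records.csv")       # one row per service call

# Data integration: roll up service calls and join them to the customer table.
call_counts = calls.groupby("customer_id").size().rename("num_calls")
customers = customers.join(call_counts, on="customer_id").fillna({"num_calls": 0})

# Data quality: impute missing spend with the median, cap outliers at the 99th percentile.
spend = customers["monthly_spend"]
customers["monthly_spend"] = spend.fillna(spend.median()).clip(upper=spend.quantile(0.99))

# Pre-processing: log-transform the heavily skewed spend distribution.
customers["log_spend"] = np.log1p(customers["monthly_spend"])
```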
3. Develop the Model
This includes applying exploratory statistics and visual data discovery, further data preparation, and training and comparing models. Many modeling decisions — such as variable selection — are typically guided by organizational rules, including legal and management requirements. There are many more best practices, such as validating the model on a hold-out data set, which can be a costly thing to forget.
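The hold-out habit itself is cheap. Here is a minimal scikit-learn sketch; the synthetic data merely stands in for whatever the data step actually produced:

```python
# Minimal hold-out validation sketch; synthetic data stands in for the
# prepared features and target from step 2.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Set aside 30% of the data; the model never sees it during training.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Judge the model only on the hold-out set.
auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
print(f"hold-out AUC: {auc:.3f}")
```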
4. Validate the Model
A continuous effort to ensure it meets business, operational, legal, analytical and other requirements.
5. Deploy the Model
Choose between several possible deployment scenarios, including batch, transactional and on-demand.
6. Monitor the Model
Includes automated tracking of user-defined, portable KPIs and detecting disruptions in the input data.
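One simple way to detect such disruptions is to compare the distribution of an input variable at scoring time against the distribution at training time. Below is a small sketch using a two-sample Kolmogorov-Smirnov test; the data is simulated for illustration:

```python
# Monitoring sketch: flag drift in one input variable by comparing scoring-time
# data against training-time data with a two-sample KS test (simulated data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_spend = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # seen at training time
current_spend = rng.lognormal(mean=3.4, sigma=1.0, size=10_000)   # seen in production

stat, p_value = ks_2samp(training_spend, current_spend)
if p_value < 0.01:
    print(f"Input drift detected (KS statistic {stat:.3f}); investigate before trusting scores.")
```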
Linear Programming (LP) is one of the earliest and most straightforward approaches to optimization. Its basic assumption is that even complicated real-world systems – such as supply chains – can be modeled with a set of simple equations.
LPs scale in two ways. First, in quantity: modern software can solve models with millions of variables – sometimes almost in real time. Second, in quality: even complex and dynamic scenarios, such as network flows, economic markets, or auctions, can be expressed using basic equations.
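To make the “set of simple equations” concrete, here is a toy supply-chain LP sketch. The numbers are invented, and scipy’s linprog simply stands in for whatever solver you actually use:

```python
# Toy supply-chain LP: ship goods from two plants to two customers at minimum
# cost, subject to plant capacities and customer demands (numbers made up).
from scipy.optimize import linprog

# Decision variables: x = [plant1->cust1, plant1->cust2, plant2->cust1, plant2->cust2]
cost = [4, 6, 5, 3]                      # shipping cost per unit

A_ub = [[1, 1, 0, 0],                    # plant 1 capacity
        [0, 0, 1, 1]]                    # plant 2 capacity
b_ub = [80, 70]

A_eq = [[1, 0, 1, 0],                    # customer 1 demand
        [0, 1, 0, 1]]                    # customer 2 demand
b_eq = [60, 50]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x, res.fun)                    # optimal shipments and total cost
```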
The paper Decision Making Using PROC OPTMODEL illustrates this flexibility with a case study applying stochastic programming to business scenario planning. The basic idea is simple but effective: by building a tree of “potential futures” into the model, the planner obtains a production plan that is feasible in every scenario and simultaneously optimizes the expected return.
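The scenario-tree idea can also be sketched in a few lines: one production decision is fixed before demand is known, per-scenario sales variables keep the plan feasible in every scenario, and the objective is the expected profit. The paper formulates this in PROC OPTMODEL; the toy numbers and scipy formulation below are only an illustration of the idea:

```python
# Two-stage stochastic LP sketch: choose production x before demand is known.
# Sales y_s in each scenario are limited by both x and that scenario's demand,
# so the plan is feasible in every scenario; the objective is expected profit.
from scipy.optimize import linprog

price, cost = 10.0, 6.0
demand = [100.0, 40.0]       # two demand scenarios ("potential futures")
prob = [0.5, 0.5]            # their probabilities

# Variables: [x, y_scenario1, y_scenario2]
# Maximize p1*price*y1 + p2*price*y2 - cost*x  ==  minimize the negative.
c = [cost, -prob[0] * price, -prob[1] * price]

A_ub = [[-1, 1, 0],          # y_1 <= x
        [-1, 0, 1]]          # y_2 <= x
b_ub = [0, 0]

bounds = [(0, None), (0, demand[0]), (0, demand[1])]  # y_s <= demand_s

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x, y1, y2 = res.x
print(f"produce {x:.0f}; expected profit {-res.fun:.1f}")
```

With these numbers the plan produces only 40 units, so it stays feasible even in the low-demand scenario: exactly the trade-off the scenario tree is meant to capture.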
I am surprised whenever I see “simple” concepts creating so much value. Beyond linear programs, modern Operations Research offers many more sophisticated tools, but LPs will keep creating value as one of the all-time favorites in applied optimization.
One of the first things I learned in my financial markets lectures in London a few years ago turned out to be wrong: the Efficient Market Hypothesis. Real stock markets do not follow a random walk. This inspires researchers and practitioners around the globe to try to predict them.
On the positive side, there are many good ideas in this work: sufficient data (~10M tweets), careful preparation (normalization, moods thoroughly classified along various dimensions), and various statistical tests and models (Granger causality, regression, neural networks).
But it’s still useless. Here is the problem: in the approach described in their paper, Johan Bollen and his colleagues ignore one of data mining’s most critical practices. They neither tested nor validated their model properly (i.e. on separate data sets). Using a single data set both to choose the model (“calm” over the other emotions in this case) and to judge its performance on the very data it has just been fitted to exaggerates its ability to predict.
Pick the emotion and lead time that predict the Dow Jones best over a given period and guess what: they will indeed predict the Dow Jones pretty well over that same interval. It’s really like playing the lottery more than once but without drawing any new numbers. I’d bet you’d do pretty well after round 1.
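The effect is easy to simulate: generate a handful of purely random “mood” series, pick the one that correlates best with equally random “returns”, and it looks predictive on that window while saying nothing about fresh data. A toy sketch, not the paper’s data:

```python
# Toy simulation of the selection problem: random "mood" series, pick the one
# with the best in-sample correlation to random "returns", then check it on
# fresh data. None of the series contains any real signal.
import numpy as np

rng = np.random.default_rng(42)
n_days, n_moods = 60, 20
returns_in = rng.normal(size=n_days)            # in-sample index changes
returns_out = rng.normal(size=n_days)           # out-of-sample index changes
moods = rng.normal(size=(n_moods, n_days))      # candidate mood series, pure noise

# Model selection and evaluation on the same data: pick the best in-sample fit.
in_sample = [np.corrcoef(m, returns_in)[0, 1] for m in moods]
best = int(np.argmax(np.abs(in_sample)))

print(f"best mood, in-sample correlation: {in_sample[best]: .2f}")
print(f"same mood, out-of-sample:         {np.corrcoef(moods[best], returns_out)[0, 1]: .2f}")
```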
This is not to say that Twitter might not predict the stock market. But until someone gets the numbers right, you’d better save your money.
With the homo economicus exposed as a misconception, research on altruistic behavior is fashionable today. Psychologists have designed fascinating experiments and surveys on how social behavior is rewarded by the human brain.
Well — they could have saved their dollars.
One of the most intriguing aspects of today’s world is the vast amount of data that exists as a by-product of our day-to-day activities. “The Economist” estimates we have just passed the point where we can still store, let alone use, all of it.
There is one particular goldmine of data on what humans perceive as rewarding and engaging: computer games. They are not only an industry four times the size of the music recording business. They also create billions of data records a day, representing unmatched quantified traces of human decisions.
This is exciting data.
TED has an inspiring talk from Tom Chatfield on some of the lessons games teach us including elements of uncertainty, tight feedback cycles and social reputation.
The freely available report “Analytics: The new path to value” is a nice pitch on the value of analytics for an executive audience.
One of the charts I found inspirational for its clarity is the following:
Integrating these three entities — Data, Insights and Actions — rather than just optimizing each of them locally is key to success.
A second chart suggests that analytics applications will mature in both visual and predictive quality:
InfoQ features this great talk on the history of machine learning.
The “AI Winter” ended in the 90s when statisticians and computer scientists — after decades of failure — finally joined forces to develop “computer systems that improve with experience”. These probabilistic models, together with today’s huge amounts of data and elastic computing infrastructures, created a new renaissance of the field. This new “Data Science” combines skills from engineering (i.e. “building scalable systems”), math, and computer science, among others.
The talk includes hands-on tips, such as using the NYT API for metadata or the Lynx text browser for crawling.
Professor Mason has a lovely sense of humor and a gift for explaining complex ideas in simple words. If you have little time, focus on the first 15 minutes.