Petzoldt

Month

March 2011

23 posts

How Different Groups Spend Their Day → nytimes.com

Stunning interactive visualization on how Americans spend their day.

Mar 31, 2011
#Visualization #statistics
Mar 30, 2011
#Visualization #Economy #Statistics
Mar 29, 2011
#Statistics #Video #National Geographics #World
"Pattern": Python API for Web Mining → clips.ua.ac.be

Mar 28, 2011 (4 notes)
#Programming #Web Mining #Data Mining #Text Analytics
Mar 28, 2011 (1 note)
#Statistics #Humor
Behind the Information Overload Hype → online.wsj.com

WSJ with an interesting analysis on the ongoing big data hype:

Wielding numbers that stretched to 20 or more digits, researchers recently reported on the world’s massive ability to store, communicate and compute information. All three have grown at annual rates of at least 23% since 1986, according to a study published this month in Science.

…

But the digital avalanche isn’t as massive as those numbers suggest. Much of the growth reflects the surge in high-resolution video and photos. In addition, while there is much more information available, each piece is being consumed, on average, by far fewer people than in the past.

Mar 27, 2011
#Big Data #Web
Trusting Data, Not Intuition → technologyreview.com

A reminder that experiments and data-driven decisions require guts, persistence and internal marketing.

Mar 26, 2011
#Decision Making #Experiments
“Complexity Science is a field of research that supports two key findings: first, marketers should hire scientists before advertising their products; second, scientists should hire marketers before naming their field of study.” —Funny! — and true even in a more general sense: scientists indeed need marketers and vice versa.
Mar 25, 2011 (1 note)
#Humor #Complexity
Google Refine → code.google.com

Interesting new data quality service from Google. While most companies need a more automated approach, this looks great for one-time individual research projects.

Mar 24, 2011
#Data Quality #Product
Visualization Challenge: Understanding Neural Nets

In data mining, communication matters as much as analytical skills. This is true for analysts ‘selling’ their findings to others. It is also true for data mining algorithms. While some models, like regressions and decision trees, are relatively easy for humans to understand, the popular neural networks are not. Solving this pain is an interesting visualization/data-mining question in itself.

One approach is displaying the network itself like in the new visualization offered by SPSS:

While this looks flashy, it doesn’t address the problem of understanding how inputs really drive the model.

Well-known data miner Gordon Linoff offers a more thoughtful approach on his blog, including scatter plots of scores that show how individual neurons dominate predictions, as displayed below.

Highly recommended read.

Thanks to colleague Tim for inspiring this post.

Mar 23, 2011
#Data Mining #Visualization
When faster computers won't help

The “Traveling Salesman Problem” is a classic in Operations Research. It asks for the shortest round trip through a set of cities given the distances between them. For 20 cities there are already 2,432,902,008,176,640,000 (20!) such tours. A computer able to calculate one trip length per millisecond would still need roughly 77 million years to check all of them.
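The arithmetic behind this kind of estimate is easy to check. The sketch below counts the orderings of n cities and converts a per-tour evaluation time into years. The one-millisecond figure comes from the text; the function names and the 365.25-day year are illustrative assumptions.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def orderings(n_cities: int) -> int:
    """Number of ways to order n cities (n!), as counted in the text."""
    return math.factorial(n_cities)


def brute_force_years(n_cities: int, seconds_per_tour: float) -> float:
    """Years needed to evaluate every ordering one after another."""
    return orderings(n_cities) * seconds_per_tour / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{orderings(20):,} orderings")
    print(f"{brute_force_years(20, 1e-3):,.0f} years at 1 ms per tour")
```

Treating round trips that differ only in starting city or direction as the same tour divides the count by 2n, which does not change the conclusion.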

It’s fascinating to see how researchers keep pushing the limits, solving ever larger problems using methods from mathematical optimization and Operations Research, such as William Cook, who claims to have calculated the best tour visiting 1.9 million cities:

Mar 21, 2011 (3 notes)
#OR #Operations Research
Language confusion in the analytics space today

Language matters. When studying math I was surprised at first how hard it was to come to terms with it. But once you are familiar with its notation, it is not only a powerful way to communicate ideas but even helps to generate them.

For advancing analytical ideas and practices, agreeing on a common language is crucial. The field had its breakthrough when mathematicians and computer scientists finally started talking to each other. Today the increase in global digital communication accelerates what has always been true: new ideas in any field are rarely the product of individual geniuses alone but a social product of combining existing concepts.

Sadly we are not there yet. I recently posted on the confusion between BI vs Business Analytics but there is more confusion beneath the surface:

  • Predictions vs. Forecasting
  • Decision Tree (in OR vs. Data Mining)
  • Dynamic regression and Transfer Function Model in Forecasting

Who will take the lead resolving this?

Mar 20, 2011 (2 notes)
#Analytics
On the difference between BI and Business Analytics

Tim Elliot from SAP just posted on the difference between BI and Business Analytics and on how fuzzily both terms are used in the market. Interestingly, BI searches are losing ground while analytics searches are increasing.

Mar 19, 2011
#BI #Business Analytics
Advanced Analytics Quotes

Infoboom just published a helpful introduction to advanced analytics. Here are the quotes I liked most:

“Top-performing companies are three times more likely than lower performers to be sophisticated users of analytics, and are two times more likely to say that their analytics use is a competitive differentiator.“  - MIT Sloan Management Review (link)

“The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.” - Gartner (link)

“Success with business analytics requires more than the data and algorithms. It requires a company culture committed to using information for breakthrough ideas and operations.“ - IBM Website (link)

Mar 18, 2011 (1 note)
#Advanced Analytics #Introduction
“So, yes, the interface matters. Now more than ever. But in our race to present data through interactive graphics, let’s not forget the real reasons why we do what we do. Users need this information to prevent healthcare fraud, improve product quality, predict criminal activity – and much, much more. We want those results to be accurate and we want business leaders to be confident in the decisions they make based on that information. Accuracy and confidence: that’s still what matters the most.” —Jim Davis from SAS.
Mar 9, 2011
#Analytics #User Experience
Mar 8, 2011 (3 notes)
#Visualization #Faith #Poverty #Statistics
Mar 7, 2011 (1 note)
#Humor #Data Mining #Customer Intelligence
Million song dataset available for download → flowingdata.com

“Need music data? Get all the data you want and more from the freely available million song dataset, offered by LabROSA at Columbia University and Echo Nest. There’s lots of metadata on song features and your standard stuff like year and artist. There are also several code wrappers and samples to help researchers make use of the data right away.”

Mar 6, 2011
#Open Data
Overcoming Data Mining Challenges  → rexeranalytics.com

Based on Rexer’s 4th Annual Survey (2010). Recommendations from data mining practitioners on:

  • Dirty Data
  • Explaining Data Mining to Others
  • Unavailability of Data / Difficult Access to Data
Mar 5, 2011
#Data Mining #Analytics #Best Practice
Mar 4, 2011
#infographic #visualization
Mar 3, 2011
#Social #Visualization
Mar 2, 2011
#Data #Visualization
“Social search built around similarity — the “like” rather than the “know” — could improve its reliability. To increase the chances and relevance of similarity, social engines need to also expand the boundaries of “social proximity” to include friends of friends and others adjacent to your social graph. This cocktail of similarity and expansion could yield the Page-Rankian shift that transforms social search from an occasional option to a reliable resource.” —Combining the “like” with the “know” could be the breakthrough for social search.
Mar 1, 2011
#Social

February 2011

22 posts

5 assumptions about social search → radar.oreilly.com

Why social search will help Facebook more than its users.

Feb 28, 2011
#Facebook #Search #Social
Newsmap → marumushi.com

An infographic classic: Google News as a real-time tile chart screen saver.

Feb 27, 2011 (5 notes)
#Visualization #Google #Screensaver
“The marketers job has changed from creating and pushing messages to one that requires listening, engaging, and reacting to potential and current customer needs.” —Erik Qualman: Socialnomics: How Social Media Transforms the Way We Live and Do Business. John Wiley & Sons, Hoboken (New Jersey), 2009.
Feb 26, 2011 (2 notes)
#Social Media #Marketing
“Advanced Analytics
Optimization and simulation is using analytical tools and models to maximize business process and decision effectiveness by examining alternative outcomes and scenarios, before, during and after process implementation and execution.”
—

Gartner’s definition of Advanced Analytics.

It’s further put into context by the observation that “this can be viewed as a third step in supporting operational business decisions. Fixed rules and prepared policies gave way to more informed decisions powered by the right information delivered at the right time, whether through customer relationship management (CRM) or enterprise resource planning (ERP) or other applications. The new step is to provide simulation, prediction, optimization and other analytics, not simply information, to empower even more decision flexibility at the time and place of every business process action. The new step looks into the future, predicting what can or will happen.”

Feb 25, 2011 (2 notes)
#Analytics
Feb 24, 2011 (2 notes)
#User Experience #Design #Creativity
Weight of the world → washingtonpost.com

Data movie of people getting fatter around the globe. Data movies are often misused, but this one is impressive.

Found via Flowing Data.

Feb 23, 2011
#visualization #overwheight #data #worldwide
“Data: petabytes
Reports: terabytes
Excel: gigabytes
PowerPoint: megabytes
Analytics: bytes
One business decision based on analytics: priceless”
—http://analytics-magazine.com/january-february-2011/81-telecommunications-marketing-from-business-intelligence-to-analytics.html
Feb 22, 2011 (2 notes)
#Business #Analytics #Data
Best Practices for Managing Predictive Models in a Production Environment

I recently commented that analytics is more than just algorithms. This is true in many respects. One is viewing it from a model lifecycle perspective. While this process is described in many data mining books, the article Best Practices for Managing Predictive Models in a Production Environment is an excellent free resource loaded with expert advice. Here is a quick outline of the process steps mentioned:

1. Determine the Business Objective

Typical use cases include modeling customer behavior, risk management, credit scoring, rate making, fraud detection, customer retention, customer lifetime value, customer attrition/churn, and marketing response models.

2. Access and Manage the Data, which includes:

  1. Data Integration, e.g. connecting customer data from various systems (including billing, service and marketing).
  2. Data Quality, e.g. handling missing values and outliers.
  3. Data Pre-processing, e.g. computing roll-ups or interval variables, normalizing heavily skewed distributions or clustering customers.

In general a standard data mart is desirable to promote best practices but often new models will require new sources and data.

3. Develop the Model

Includes application of exploratory statistics and visual data discovery, further data preparation, and training and comparing models. Many modelling decisions, such as variable selection, are typically guided by organizational rules including legal and management requirements. There are many more best practices, such as validating the model on a hold-out data set, which can be a costly thing to forget.

4. Validate the Model

A continuous effort to ensure it meets business, operational, legal, analytical and other requirements.

5. Deploy the Model

Choose between several possible deployment scenarios incl. batch, transactional and on-demand.

6. Monitor the Model

Includes automated tracking of user-defined, portable KPIs and detecting disruptions in input data.
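The monitoring step can be sketched in a few lines. The example below flags an input variable whose recent mean has drifted far from its training baseline. The function name, the z-score rule and the threshold of 3 are illustrative assumptions, not taken from the article.

```python
import math
import statistics


def input_drift_detected(baseline, recent, z_threshold=3.0):
    """Return True if the mean of `recent` is unusually far from `baseline`.

    Uses a z-score of the recent mean against the baseline mean, with the
    standard error estimated from the baseline's sample standard deviation.
    """
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        # Constant baseline: any change in the mean counts as drift.
        return statistics.mean(recent) != base_mean
    std_error = base_std / math.sqrt(len(recent))
    z = abs(statistics.mean(recent) - base_mean) / std_error
    return z > z_threshold


# Hypothetical values of one model input at training time and in production.
training_values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
stable_batch = [10, 9, 11, 10]
shifted_batch = [15, 16, 14, 15]
```

A production setup would track many inputs and the model's own KPIs over time; this only shows the basic comparison against a baseline.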

Feb 21, 2011 (3 notes)
#Analytics #Data Mining #Best Practices #Free Resource
Mining of Massive Datasets  → infolab.stanford.edu

Anand Rajaraman and Jeff Ullman from Stanford University share a complete book about mining massive data sets as a PDF at this link.

Excellent balance between introducing concepts and examples. Just finished Chapter 2, and it provides a nice introduction to the advantages of Google’s Map-Reduce (scalability, fault tolerance) and its limitations (computationally intense tasks on huge and stable data sets), followed by examples of how to implement typical algebraic and database set operations.
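To make the idea concrete, here is a minimal in-memory sketch of the map-reduce pattern applied to a set intersection. It is a single-process toy, not Google’s implementation and not code from the book; `map_reduce`, `intersect_map` and `intersect_reduce` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key, values in groups.items():
        results.extend(reducer(key, values))
    return results


def intersect_map(record):
    """Emit each item as the key, tagged with the relation it came from."""
    relation_name, item = record
    yield item, relation_name


def intersect_reduce(item, relation_names):
    """Keep the item only if both relations contributed it."""
    if {"R", "S"} <= set(relation_names):
        yield item


R = {1, 2, 3, 4}
S = {3, 4, 5}
records = [("R", x) for x in R] + [("S", x) for x in S]
common = sorted(map_reduce(records, intersect_map, intersect_reduce))
```

In a real deployment the map, grouping and reduce phases run on many machines, which is where the scalability and fault tolerance come from.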

Feb 20, 2011 (3 notes)
#Computer Science #Free Resource #Big Data #Analytics
“Quantity has a quality all its own.” —Joseph Stalin about his Red Army compared to the western forces, which were better equipped but fewer.
Feb 19, 2011
#Statistics #Big Data
The Fourth Paradigm: Data-Intensive Scientific Discovery → research.microsoft.com

Interesting perspective on how the data deluge could transform science. While the complete book is available for download, check out the linked review first.

What I find inspiring:

  • The four-stage historical model of science: from (1) experimental to (2) theory to (3) computation to (4) data-driven. I don’t think the shift is as radical as proposed, but it’s a nice concept for reflecting on the new science opportunities emerging with technology.
  • The vision of more open research embracing data and findings from various fields.

The review kicks off with a comment on data deluge that points right at the heart of its challenge:

Gathering data is so easy and quick that it exceeds our capacity to validate, analyze, visualize, store, and curate the information.

Feb 18, 2011
#science #data #analytics
“Dealing with uncertainty turned out to be more important than thinking with logical precision. […] The fundamental tools of A.I. shifted from Logic to Probability in the late 1980s, and fundamental progress in the theory of uncertain reasoning underlies many of the recent practical advances. Learning turned out to be more important than knowing.” —The Machine Age: Great summary from Google’s Peter Norvig on the current state of AI. His AI book is my favorite introduction to the field.
Feb 17, 2011
#AI #Analytics
Feb 17, 2011 (12 notes)
#AI
Scenario Planning using Linear Programming

Linear Programming (LP) is one of the earliest and most straightforward approaches to optimization. Its basic assumption is that even complicated real-world systems – such as supply chains – can be modeled with a set of simple equations.

LPs scale in two ways. First, in quantity. Modern software can solve models with millions of objects – sometimes almost in real-time. Secondly, in quality. Even complex and dynamic scenarios, such as network flows, economic markets, or auctions, can be expressed using basic equations.

The paper Decision Making Using PROC OPTMODEL illustrates this flexibility with a case study that applies stochastic programming to business scenario planning. The basic idea is simple but effective: by building a tree of “potential futures” into the model, the planner obtains a production plan that is both feasible in any scenario and simultaneously optimizes the expected return.
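A toy version of the scenario idea can be written without any solver. The sketch below picks one production quantity before demand is known and scores it by its expected profit over a handful of scenarios. It uses plain enumeration rather than PROC OPTMODEL or an LP solver, and all names and numbers are invented for illustration.

```python
def expected_profit(quantity, scenarios, unit_cost, unit_price):
    """Expected profit of producing `quantity` before demand is known.

    Each scenario is a (probability, demand) pair. Sales in a scenario are
    limited by both the produced quantity and that scenario's demand.
    """
    total = 0.0
    for probability, demand in scenarios:
        sales = min(quantity, demand)
        total += probability * (unit_price * sales - unit_cost * quantity)
    return total


def best_plan(scenarios, unit_cost, unit_price):
    """Try every integer quantity up to the largest demand; keep the best."""
    max_demand = max(demand for _, demand in scenarios)
    return max(
        range(max_demand + 1),
        key=lambda q: expected_profit(q, scenarios, unit_cost, unit_price),
    )


# Hypothetical "potential futures": (probability, demand in units).
scenarios = [(0.3, 80), (0.5, 100), (0.2, 130)]
plan = best_plan(scenarios, unit_cost=4.0, unit_price=10.0)
```

The same structure, one first-stage decision plus per-scenario sales limited by production and demand, can be written as linear constraints, which is what lets an LP solver handle far larger scenario trees.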

I am surprised whenever I see “simple” concepts creating so much value. Modern Operations Research offers many more sophisticated tools than linear programs. But LPs will keep creating value as one of the all-time favorites in applied optimization.

image


Feb 16, 2011 (1 note)
#Operations Research #Business Planning #Scenario Planning #Decision Trees
Strata 2011: Telling the Story with Data → infosthetics.com

A good read about data visualization and storytelling. Features Charles Joseph Minard’s amazing visualization from 1869 depicting Napoleon’s ill-fated march to and from Russia (which made me love the field).

Feb 15, 2011 (12 notes)
#visualization #Strata
Strata 2011: Werner Vogels, "Data Without Limits"  → youtube.com

Werner Vogels does a nice job evangelizing Amazon’s elastic storage vision. One thought I found intriguing is his characterization of the challenge of big data: it’s not primarily that there is lots of it but that it is collected without knowing the questions it will be used to answer.

While there are ever faster machines and algorithms, this challenge will not go away, and it’s a strong case for statistical virtues that are not en vogue but absolutely crucial, such as design of experiments and data preparation, among others.

Feb 14, 2011
#Big Data #Strata #Amazon #Statistics
IT growth and global change: A conversation with Ray Kurzweil  → mckinseyquarterly.com

Convincing and thought-provoking pitch from Ray Kurzweil, one of both the most visionary and the most controversial figures in IT, on how the exponential growth of technologies will transform industries and pose new opportunities, and hurdles, for business and society.

I think his judgement that the benefits of IT developments outweigh the perils is rather bold. The truth is, I suspect: no one knows.

Feb 13, 2011 (1 note)
#IT #McKinsey
Another Open Source Swipe at IBM and SAS → blogs.forbes.com

Interesting move … but “Eighty percent of (SAS and IBM analytics) revenues are from 15 core statistical procedure.” is presumably not only wrong but — more importantly — mostly irrelevant.

Arguing about Analytics by counting procedures is like comparing two pieces of art by counting the brushes that have been used to create them. True analytics solutions include data management and integration, user interface, process support, scalability, LoB and industry best practices, reporting and much more.

And it’s more than a “product”. Analytics vendors must be committed to and capable of supporting their customers beyond the point of sale in creating business value with it for their particular industries and lines of business. It’s so much more than bits and bytes.

All opinions, as usual, my own.

Feb 12, 2011 (4 notes)
#Analytics #R #SAS #IBM
Statistical flaws in "Twitter mood predicts the stock market" research paper

One of the first things I learned in my financial markets lectures in London a few years ago turned out to be wrong: the Efficient Market Hypothesis. Real stock markets do not follow a random walk. This inspires researchers and practitioners around the globe to try to predict them.

The research paper “Twitter mood predicts the stock market” created significant media buzz (e.g. Wired has a neat summary) and eventually even inspired a new hedge fund.

Don’t buy.

On the positive side, there are many good ideas in this work: sufficient data (~10M tweets), preparation (normalization, moods thoroughly classified along various dimensions) and various statistical tests and models (Granger, regression, neural net).

But it’s still useless. Here is the problem. In the approach described in their paper, Johan Bollen and his colleagues ignore one of data mining’s most critical practices: they neither tested nor validated their model properly (i.e. on separate data sets). Using a single data set to choose the model (“calm” over other emotions in this case) and then judging its performance on the very same data it has just been fitted to exaggerates its ability to predict.

Pick the emotion and lead time that predicts the Dow Jones best over a given period and guess what: It indeed will predict the Dow Jones pretty well over this same interval. It’s really like playing the lottery more than once but without drawing any new numbers. I’d bet you’d do pretty well after round 1.
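The effect is easy to reproduce with pure noise. The sketch below generates random up/down “signals”, picks the one that best matches a random “market” on the first half of the days, and then scores that same signal on the second half. It is an illustration of selection bias under made-up settings, not a re-analysis of the paper; all names and parameters are assumptions.

```python
import random


def accuracy(predictions, outcomes):
    """Share of days on which the predicted direction matched the outcome."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)


def pick_then_test(n_signals=200, n_days=100, split=50, seed=0):
    """Pick the best of many pure-noise signals, then test it on new days."""
    rng = random.Random(seed)
    market = [rng.choice((0, 1)) for _ in range(n_days)]
    signals = [[rng.choice((0, 1)) for _ in range(n_days)]
               for _ in range(n_signals)]

    # Model selection uses only the first `split` days.
    best = max(signals, key=lambda s: accuracy(s[:split], market[:split]))

    in_sample = accuracy(best[:split], market[:split])
    hold_out = accuracy(best[split:], market[split:])
    return in_sample, hold_out


if __name__ == "__main__":
    in_sample, hold_out = pick_then_test()
    print(f"accuracy on the data used for selection: {in_sample:.2f}")
    print(f"accuracy on held-out data:               {hold_out:.2f}")
```

Because every signal is noise, the selected one typically looks well above 50% on the days used to choose it, while its hold-out accuracy is expected to sit near 50%.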

This is not to say that Twitter cannot predict the stock market. But until someone gets the numbers right, you had better save your money.

Feb 11, 2011 (25 notes)
#Prediction #Social Media #Stock Market #Research
Feb 11, 2011
#Facebook #Online Marketing #Web Analytics
Two reasons why Facebook marketing is hard → junkcharts.typepad.com

Junk Charts describes the two main challenges well:

Facebook is desperate. It is different from Google. A Google user is searching for something, and if an ad is relevant, the user will click on it. The Facebook user is typically chatting with friends; not surprisingly, advertisers have not been impressed with the effectiveness of Facebook ads. That’s why they are trying to insert these ads in to our conversations. We’d feel like a waiter standing next to our table at a restaurant, listening to our private conversation, and then inserting himself into the conversation to sell us the special whatever of the day.

It turns out most of Facebook conversations are not information-rich. Most of the chitchat is just that. So, Facebook wants to know our habits and likings. They want to know what we are doing when not chatting online. So they set up this network of feelers around the Web, the Like buttons. On the one hand, these sell convenience to the users and their communities; on the other hand, they compile profiles of users, secretly, that can be sold to marketers.

Feb 9, 2011
#Facebook #Web Analytics #Social Media #Online Marketing
The Mythology of Big Data → youtube.com

Some key points from Mark Madsen’s talk at STRATA 2011:

  • Short data history: Since the ’60s, data has moved from product to by-product to asset to substrate.
  • De-hyping the gold rush of big data: Apart from some start-ups or consultants, data mining is usually not about the lonesome hero mining terabytes of raw data. Most companies don’t run “data as business”; it’s all about using data as an organization.
  • Decision making models are highly contextual: On executive levels, for example, it’s political and bureaucratic. Decisions can take months and involve tit-for-tat, cognitive bias and multiple contextual versions of the truth.

Mark is a great speaker, so check the recording.

Feb 8, 2011 (1 note)
#Data Mining #Decision Making #Big Data #Strata

January 2011

8 posts

“All models are false but some models are useful.” —George Box
Jan 23, 2011
#Quote #Modelling
Computer Games, Big Data and Human Psychology

Following the misconception of the homo economicus, research on altruistic behavior is fashionable today. Psychologists have designed fascinating experiments and surveys on how social behavior is rewarded by the human brain.

Well — they could have saved their dollars.

One of the most intriguing aspects of today’s world is the vast amount of data that exists as a by-product of our day-to-day activities. “The Economist” estimates that we have just passed the point up to which we could possibly save, let alone use, all of it.

There is one particular goldmine of data on what humans perceive as rewarding and engaging: computer games. They are not only an industry four times the size of the music recording business. They also create billions of data records a day, representing unmatched quantified traces of human decisions.

This is exciting data.

TED has an inspiring talk from Tom Chatfield on some of the lessons games teach us including elements of uncertainty, tight feedback cycles and social reputation.

Jan 22, 2011
#Games #Big Data #Psychology #Ted
Analytics: The new path to value

The freely available report “Analytics: The new path to value” is a nice pitch on the value of analytics for an executive audience.

One of the charts I found inspirational for its clarity is the following:

image

Integration of these three entities — Data, Insights and Actions — versus just trying to optimize them locally is key to success.

A second chart suggests that analytics applications will mature in both visual and predictive quality:

image

Jan 21, 2011 (3 notes)
#IBM #Analytics #Value
Machine Learning: A Love Story

InfoQ features this great talk on the history of machine learning.

The “AI Winter” ended in the ’90s when statisticians and computer scientists, after decades of failure, finally joined forces to develop “computer systems that improve with experience”. These probabilistic models, together with today’s huge amounts of data and elastic computing infrastructure, created a renaissance of the field. This new “Data Science” combines skills from engineering (i.e. “building scalable systems”), math and computer science, among others.

The talk includes hands-on tips such as using the NYT API for meta data or the Lynx text browser for crawling.

Professor Mason has a lovely sense of humor and the gift to explain complex ideas in easy words. If you have little time, focus on the first 15 minutes.

Jan 19, 2011
#Machine Learning #Talk #AI
Data Mining vs. Visualization → fellinlovewithdata.com

Great post from @FILWD on turning the tension between #datamining and #visualization into an opportunity.

Jan 7, 2011
#data mining #visualization