« November 2009
SunMonTueWedThuFriSat
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
     
       
Today

FEEDS

SEARCH



LINKS




CONTACT
tilmannsblog
Template by
Helquin

Thursday Jan 12, 2006

Too Good To Be True


Split (The town in Croatia) The greatest danger to success with Predictive Analytics may be to over-estimate the predictive power of a predictive model. One problem that can lead to over-estimating predictive power is over-fitting. Over fitting occurs when a predictive model is trained to memorize training data so well that the model will not perform well when scoring new data. Machine Learning algorithms typically split training data internally to test for and to avoid overfitting. This internal splitting is an important safe guard but it is advisable to take the additional precaution of setting a hold out data set aside against which the quality of a trained model can be tested.

A hold out data set is created by splitting your available historical data set into two subsets, one for training, and one for validation. It is crucial that the validation data set faithfully mimics new data coming in for scoring in the production environment. What is important is to exclude any inputs that carry information that is not available when the model is deployed, and also to exclude any information from the training data set that provides clues about the to-be-predicted outcomes for the validation data. If you make a mistake your validation will be meaningless.

Churn For example, consider anecdotes I heard at KDD04, Directions 2005 and ICDM05. A predictive model was being developed for a churn prediction application, and an account code was used as an input. The model validated with excellent precision but it was found later that the account code contained information about whether the account was active. This made it too easy for the model to predict churn because accounts that are canceled due to churn then become inactive. The account code is therefore an illegal input, at least if the account code represents current status as opposed to the historical status as of the time before the churn event occurred.

An image which includes a tank. In a military application the goal was to train a model to identify tanks in imagery. This model too performed exceedingly well but it was later found that all the training images containing tanks were taken at a different time of day than those images that did not contain tanks, and the validation data had the same problem. Again, it was easy for the model, in fact unrealistically easy, to identify the presence of tanks by assessing the overall brightness of the image.

I have yet to meet an experienced practitioner in Predictive Analytics who does not admit to accidentally using illegal inputs or allowing hints about outcomes in the validation hold-out data set to spill into the training data set.

Comments:

You can eliminate the problem by working with large training sets containing many observations and - most importantly - many patterns spread over large time periods. Introduce as much variance as possible in the training set thanks to intelligent design of experiment. For instance, include pictures of tanks in different environments, different time periods, and also different types of tanks - both in shape and size. Include tanks that are moving at various speeds and tanks that are partially destroyed.

Posted by Vincent Granville on April 19, 2006 at 03:49 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed