Too Good To Be True
The greatest danger to success with Predictive Analytics may be to over-estimate the predictive power of a predictive model. One problem that can lead to over-estimating predictive power is over-fitting. Over fitting occurs when a predictive model is trained to memorize training data so well that the model will not perform well when scoring new data. Machine Learning algorithms typically split training data internally to test for and to avoid overfitting. This internal splitting is an important safe guard but it is advisable to take the additional precaution of setting a hold out data set aside against which the quality of a trained model can be tested.
A hold out data set is created by splitting your available historical data set into two subsets, one for training, and one for validation. It is crucial that the validation data set faithfully mimics new data coming in for scoring in the production environment. What is important is to exclude any inputs that carry information that is not available when the model is deployed, and also to exclude any information from the training data set that provides clues about the to-be-predicted outcomes for the validation data. If you make a mistake your validation will be meaningless.
For example, consider anecdotes I heard at KDD04, Directions 2005 and ICDM05. A predictive model was being developed for a churn prediction application, and an account code was used as an input. The model validated with excellent precision but it was found later that the account code contained information about whether the account was active. This made it too easy for the model to predict churn because accounts that are canceled due to churn then become inactive. The account code is therefore an illegal input, at least if the account code represents current status as opposed to the historical status as of the time before the churn event occurred.
In a military application the goal was to train a model to identify tanks in imagery. This model too performed exceedingly well but it was later found that all the training images containing tanks were taken at a different time of day than those images that did not contain tanks, and the validation data had the same problem. Again, it was easy for the model, in fact unrealistically easy, to identify the presence of tanks by assessing the overall brightness of the image.
I have yet to meet an experienced practitioner in Predictive Analytics who does not admit to accidentally using illegal inputs or allowing hints about outcomes in the validation hold-out data set to spill into the training data set.
Posted at
01:18PM Jan 12, 2006
by tilmannsblog in The Predictive Business |
Posted by Vincent Granville on April 19, 2006 at 03:49 PM PDT #