Synthetic WOE
Weight of Evidence, often called WOE, is used to reduce noise in data and to handle outliers, missing data, and character or categorical variables. However, one key issue is that enough data must be available to estimate each WOE value within a variable. Limited data creates variables with very few bins, which results in poorly fitting models with low accuracy. Alternatively, people use whatever data is available and end up with poor WOE estimates because the data is sparse.

This paper shows how to keep business logic with monotonic restrictions while increasing the number of WOE bins and the information value of variables that have either sparse data overall or sparse data in particular ranges of the variable. First, we estimate WOE values for the ranges where enough data is present, which creates the skeleton of the WOE curve. This relationship should be explainable by the business. Second, we choose a method, such as linear (simple average), non-linear (cubic spline or polynomial), or local weights (simple weighting for LOESS/LOWESS), then add new WOE bins and assign the variable's data ranges to them according to that method. The results are more granular WOE bins for more accurate predictions, more information retained from the original non-transformed variable, and maintained explainability. This has worked well in practice for both decision-tree-based models such as XGBoost and standard statistical models such as logistic regression.

Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6283358
Consulting: https://www.fancyquantnation.com/industry-consulting
LinkedIn: https://www.linkedin.com/in/dimitri-bianco/
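The two steps described in the abstract (estimate a skeleton of WOE values where data is sufficient, then add new bins between them) can be illustrated with a minimal Python sketch of the linear variant only. It is not the paper's implementation. The function names `skeleton_woe`, `synthetic_woe`, and `is_monotonic` are hypothetical, as is the choice to split each skeleton bin into equal-width parts and to anchor each skeleton WOE value at its bin midpoint. The sketch also assumes the convention WOE = ln(share of goods / share of bads), that every skeleton bin contains both goods and bads, and that there are no missing values.

```python
import numpy as np

def skeleton_woe(x, y, edges):
    """WOE per skeleton bin; y is 1 for a bad (event) and 0 for a good.

    Assumes every bin holds at least one good and one bad, so no
    smoothing is applied and no log(0) can occur.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    idx = np.digitize(x, edges[1:-1])
    bad = np.bincount(idx, weights=y, minlength=n_bins)
    good = np.bincount(idx, weights=1 - y, minlength=n_bins)
    return np.log((good / good.sum()) / (bad / bad.sum()))

def synthetic_woe(edges, woe_values, splits_per_bin=2):
    """Split each skeleton bin into equal parts and fill in WOE linearly.

    Skeleton WOE values are anchored at skeleton bin midpoints (an
    assumption of this sketch). np.interp holds the end values flat
    outside the first and last midpoint, so nothing is extrapolated.
    """
    edges = np.asarray(edges, dtype=float)
    mids = (edges[:-1] + edges[1:]) / 2
    pieces = [np.linspace(a, b, splits_per_bin + 1)
              for a, b in zip(edges[:-1], edges[1:])]
    fine_edges = np.unique(np.concatenate(pieces))
    fine_mids = (fine_edges[:-1] + fine_edges[1:]) / 2
    fine_woe = np.interp(fine_mids, mids, woe_values)
    return fine_edges, fine_woe

def is_monotonic(woe_values):
    """True if the WOE curve never changes direction."""
    d = np.diff(woe_values)
    return bool(np.all(d >= 0) or np.all(d <= 0))

# Toy usage with made-up skeleton values (illustrative only).
fine_edges, fine_woe = synthetic_woe([0, 10, 20, 30], [-1.0, 0.0, 1.0])
print(fine_edges)
print(fine_woe, is_monotonic(fine_woe))
```

In the toy usage, three skeleton bins with WOE values -1, 0, and 1 become six bins with values -1, -0.75, -0.25, 0.25, 0.75, and 1. Linear interpolation between monotonic skeleton points stays monotonic. A cubic spline or polynomial can overshoot between points, so a check such as `is_monotonic` would be worth running if one of those methods were swapped in.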