It also explores the distribution of each variable, detecting singular values that are likely errors or ad hoc encodings of missing data.
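The text does not say how such singular values are detected; one simple heuristic is to flag candidate sentinel values (like -999 or 9999) that cover an unusually large share of a column. A minimal sketch with pandas, where the candidate list and threshold are illustrative assumptions:

```python
import pandas as pd

def detect_sentinels(s: pd.Series, candidates=(-1, -99, -999, 9999), min_share=0.05):
    """Flag candidate sentinel values that appear suspiciously often in a column."""
    shares = s.value_counts(normalize=True)
    # A candidate that covers a large share of the column probably encodes missing data.
    return [v for v in candidates if shares.get(v, 0.0) > min_share]

s = pd.Series([-999] * 20 + list(range(80)))
print(detect_sentinels(s))  # -999 appears in 20% of rows
```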
Depending on the model being fitted and the nature of the variables involved, it may be necessary to transform some of them (such as counts, quantities or elapsed times) to improve model performance.
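The specific transformations are not stated; a common choice for non-negative, right-skewed quantities such as counts and elapsed times is a log transform, where `log1p` keeps zero values valid. A minimal NumPy sketch:

```python
import numpy as np

# Skewed, non-negative quantities (counts, amounts, elapsed times) often
# behave better in linear or distance-based models after a log transform.
counts = np.array([0, 1, 3, 10, 250, 10_000], dtype=float)
transformed = np.log1p(counts)  # log(1 + x), so x = 0 maps to 0
print(transformed.round(2))
```

The transform is invertible via `np.expm1`, so predictions can be mapped back to the original scale.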
The results obtained are substantially better than standard approaches to the problem (Recursive Feature Elimination, mRMR, Boruta, Lasso, ...). We typically see double-digit improvements in several metrics (AUC, RMSE, log loss, normalized discounted cumulative gain, mean average precision, ...) on a held-out test set.
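For reference, one of the baselines named above, Recursive Feature Elimination, is available in scikit-learn; the dataset and estimator below are illustrative placeholders, not the benchmark actually used:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 20 features, 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
# RFE repeatedly drops the weakest feature until 5 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(selector.support_.sum())  # → 5 features kept
```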
This step is critical for most existing models and interacts with other infocells, such as Variable Representation and Feature Selection.
Most out-of-the-box procedures for assessing the relative importance of variables are biased and prone to overfitting. Our method is unbiased and works with one-hot feature representations (scoring each feature as a whole, not only its individual levels).
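The method itself is not described here; grouped permutation importance is one standard way to score a one-hot encoded feature as a whole, by permuting all of its columns jointly and measuring the drop in a score. A minimal sketch, with all names and the toy model illustrative:

```python
import numpy as np

def grouped_importance(model, X, y, metric, groups, n_repeats=5, seed=0):
    """Mean drop in a 'higher is better' metric when a column group is shuffled jointly."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    scores = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, cols] = X[perm][:, cols]  # permute the whole group together
            drops.append(baseline - metric(y, model.predict(Xp)))
        scores[name] = float(np.mean(drops))
    return scores

class FirstColumnModel:
    """Toy model that only uses column 0, so other groups should score ~0."""
    def predict(self, X):
        return X[:, 0]

def neg_mse(y, pred):
    return -float(np.mean((y - pred) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0]
scores = grouped_importance(FirstColumnModel(), X, y, neg_mse,
                            {"a": [0], "b_onehot": [1, 2]})
print(scores)
```

Shuffling the group jointly preserves the correlation between a feature's one-hot levels, which is what lets the group be scored as a single feature.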
This is very useful for understanding the relationship between the variables and the response in black-box models.
This gives us a gold-standard model and lets us compare the relative performance of every other model against it.
We currently use a rule-based system to select the initial values, options and parameter ranges for each model. In the near future this will change:
We are working on a meta-learning model that uses topological descriptors of the dataset to select the most suitable model family, the parameter ranges where that model will be optimal, and the variable transformations likely to perform best, all without needing to deploy a huge number of models.
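Since that meta-learner is still in progress, the sketch below only illustrates the general idea of dataset descriptors: cheap summary statistics (far simpler than the topological descriptors mentioned above) that a meta-model could consume to choose a model family. All names are hypothetical:

```python
import numpy as np

def dataset_descriptors(X: np.ndarray, y: np.ndarray) -> dict:
    """A few cheap meta-features of the kind a meta-learner could take as input."""
    return {
        "n_rows": X.shape[0],
        "n_cols": X.shape[1],
        "rows_per_col": X.shape[0] / X.shape[1],
        # Average absolute skewness across columns (third standardized moment).
        "mean_abs_skew": float(np.mean(np.abs(
            ((X - X.mean(0)) ** 3).mean(0) / (X.std(0) ** 3 + 1e-12)))),
        "target_balance": float(np.mean(y)),  # share of positives for a binary target
    }

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] > 0).astype(int)
print(dataset_descriptors(X, y))
```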