Custom ML Infill with Automunge

All learners are welcome

Nicholas Teague
Automunge

--

For anyone who hasn’t been following along, I’ve been using this forum to document the development of Automunge, an open source python library for tabular learning. One way to think about Automunge is that it is an abstraction that greatly simplifies pipelines of Pandas operations. Specifically, we focus on the workflow boundaries between received tidy data (one column per feature and one row per sample) and returned data suitable for direct application of machine learning. Sets of univariate pandas operations can be aggregated by simple specification, and then all of the pipeline management to prepare additional data is push button. If you are productionizing a model, you don’t even have to think about a pandas pipeline: all you need is the dictionary returned from training data preparation, and you have a key for automatically preparing streams of data for inference. You can even publish the dictionary to allow others to consistently prepare data to run inference on a model that you trained.

Basically we’re encapsulating pandas pipelines fit to a train set, similar to how machine learning is encapsulated in a trained model fit to a train set. We are defining and formalizing a new niche in the space between dataframe libraries and learning libraries. We’ve all heard a thousand times that 80% of a data scientist’s time is spent cleaning the data. We are attempting to do something about that in an open source setting.

Within that context, we’ve also integrated a push button solution for auto ML derived missing data infill, what we call ML infill. Missing data is a fundamental obstacle for machine learning, as backpropagation requires all valid entries. ML infill is a more sophisticated convention than common practices like mean imputation for numeric sets or constant imputation for categoric sets. Through feature set specific partitioning of the training data, feature set specific machine learning models are trained to impute missing data based on properties of the surrounding features. Sounds simple, doesn’t it?

In our paper Missing Data Infill with Automunge, we noted that we had a few auto ML options to choose from for this purpose. The option built around Scikit-learn’s random forest was selected as the default due to its simplicity, latency, and tendency not to overfit. We incorporated an option for the AutoGluon library having seen benchmarks suggesting it will likely outperform random forest, albeit with a tradeoff in disk storage associated with model ensembles and in latency. We included the FLAML library for the simplicity of setting a max training duration for hyperparameter tuning, as well as for its claimed latency performance. We selected CatBoost as a gradient boosting option and for GPU support. The glaring omission from this suite is obviously gradient boosting by the XGBoost library, which remains the most popular learning library for tabular applications. We’ve run experiments around incorporating this library as an additional option; what has primarily stopped us is the general consensus we’ve seen implied in our explorations that gradient boosting on its own has potential to overfit when hyperparameters are not tuned.

ML infill model tuning does have some support in the library. The default random forest implementation can have parameters tuned by grid search by passing parameters as lists of values, or even by random search with parameters passed as distributions. This makes sense for random forest since the bulk of tuning benefit is often achieved with just a few inputs, such as n_estimators. Our experience with XGBoost is that a well tuned model requires balancing a whole set of options. For example, in our Kaggle competition paper isFraud? we followed a handy procedure for iteratively tuning through parameters with progressive grid searches.
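As a rough sketch of that tuning interface (the 'hyperparam_tuner' and 'MLinfill_cmnd' key names shown here are recollections of the library documentation and should be treated as assumptions), grid search could be activated by passing candidate values as lists through ML_cmnd:

# hedged sketch: tuning the default random forest ML infill by grid search,
# with candidate parameter values passed as lists
ML_cmnd = {'hyperparam_tuner': 'gridCV',
           'MLinfill_cmnd': {'RandomForestClassifier': {'n_estimators': [100, 200, 400],
                                                        'max_depth': [3, 6, 12]},
                             'RandomForestRegressor': {'n_estimators': [100, 200, 400],
                                                       'max_depth': [3, 6, 12]}}}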

Part of what has kept the Automunge project feasible in this three year journey has been a recognition of circles of competence for our developers. We don’t claim to be highly sophisticated machine learning practitioners. We’ve built a library around Pandas dataframes, and that is really our strength. (Can you imagine the gall of trying to publish a paper at a mainstream research conference built around random forest? :) In sophisticated circles, tuning can be achieved with things like Bayesian optimization or other conventions. In unsophisticated circles, we’ll stick to the grid for now.

So that leaves us with kind of a conundrum. XGBoost is the gold standard. XGBoost without tuning is not. How can we bridge that gap? Easy, we don’t. We let the user bridge that gap on their own.

In other words, this essay is to announce a new convention allowing users to define custom machine learning algorithms for integration into Automunge’s ML infill. These custom learning algorithms could be built around gradient boosting, neural networks, or even quantum machine learning, whatever you want. All you have to do is define a wrapper function for your model tuning / training and a wrapper function for inference. You pass those functions as part of the automunge(.) call, and we do all the rest. Sounds simple, doesn’t it?

You can either define separate wrapper functions for classification and regression, or you can define a single wrapper function and use the received labels column header to distinguish whether a received label set is a target for classification (header 1) or regression (header 0).

def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
    # ...tune and train a classification model on the received features and labels...
    return model

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
    # ...tune and train a regression model on the received features and labels...
    return model

The convention is really simple: your wrapper function receives as input a dataframe of labels, a dataframe of features, a report of feature properties, any commands that you passed as part of the automunge(.) call for the operation, and a unique sampled randomseed. You then tune and train a model however you want, return the trained model from the function, and let us handle the rest (basically that means we’ll store the model in the returned dictionary that is used as the key to prepare additional data).

The features will be received as a numerically encoded dataframe consistent with the form returned from automunge(.), excluding any features from transforms that may return non-numeric entries or that are otherwise identified as a channel for data leakage. Any missing data will have received an initial imputation applied as part of the transformation functions, which may be replaced when that feature has already been targeted with ML infill. Categoric features will be integer encoded, which could include ordinal integers, one hot encodings, or binarizations. The columntype_report can be used to access feature properties, and will include a list of all categoric features, a list of all numeric features, as well as more granular details such as lists of categoric features of a certain type and groupings (the form will be similar to the final version returned as postprocess_dict['columntype_report']).
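For instance, inside a training wrapper you might inspect the report to branch handling by feature type. A minimal sketch, where the specific report keys shown are assumptions for illustration:

# hypothetical report keys shown for illustration
categoric_columns = columntype_report['all_categoric']
numeric_columns = columntype_report['all_numeric']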

The labels for a classification target will be received as a single column pandas series with header as the integer 1, and entries in the form of str(int), which basically means entries will be integers that have been converted to strings. We believe the str(int) convention is a neat concept, since some libraries like their classification targets as strings and some prefer integers; this way, if your library is integer based for classification you can just convert the labels with labels.astype(int) and you’re off to the races. The labels for a regression target will also be a single column pandas series, but this time with column header as the integer 0 and entries as floats. (Received continuous integer types for regression targets can be treated as floats since we’ll round them back to integers after inference.)

Any imports needed, such as for learning libraries and stuff, can either be performed external to the automunge(.) call or included as an import operation within the wrapper functions. Pandas is available as pd, Numpy as np, and scipy.stats as stats.

When it comes time to use that model for inference, we’ll access the appropriate model and pass it to your corresponding custom inference function along with the correspondingly partitioned features dataframe serving as basis and any commands a user passed as part of the automunge(.) call.

def customML_predict_classifier(features, model, commands):
    # ...run inference with the trained model to derive classification infill...
    return infill

def customML_predict_regressor(features, model, commands):
    # ...run inference with the trained model to derive regression infill...
    return infill

So then you just use the predict wrapper function to run your inference and return the resulting derived infill. The form of the returned infill is the user’s choice: you can provide the derivations as a single column array, a single column dataframe, or a series. Regression output is expected as floats. Classification output can be returned as int or as str(int). Up to you. Once we access the infill we’ll convert it back to whatever form is needed and take it from there.

Great, so we’ve defined our custom ML wrapper functions; now all it takes to integrate them into an automunge(.) call is passing them through the ML_cmnd parameter. Here we demonstrate choosing customML as the autoML_type (meaning we apply your defined functions instead of the default random forest), passing any desired parameters to your functions (which may differ between automunge(.) calls), and passing the functions themselves.
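A sketch of that specification follows; the 'customML_Classifier_train' and 'customML_Regressor_train' key names are assumptions patterned on the predict entries named in Appendix C, and the channel for passing parameters to your functions is noted only in a comment.

ML_cmnd = {'autoML_type': 'customML',
           'customML': {'customML_Classifier_train': customML_train_classifier,
                        'customML_Classifier_predict': customML_predict_classifier,
                        'customML_Regressor_train': customML_train_regressor,
                        'customML_Regressor_predict': customML_predict_regressor}}

# parameters destined for your wrapper functions' commands argument would be
# specified through an additional ML_cmnd entry (exact entry name not shown here)

# with am as an instantiated Automunge class, the specification is then passed
# to an automunge(.) call; the returned tuple of prepared sets and the
# postprocess_dict is abbreviated here as returned_sets
returned_sets = am.automunge(df_train, ML_cmnd = ML_cmnd)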

The only important asterisk is that since the function identifiers will be directly saved in the returned postprocess_dict dictionary (that is, the dictionary serving as a key to prepare additional data), if you download that dictionary with pickle and want to upload it in a separate notebook, you’ll need to first reinitialize your custom inference functions in that separate notebook prior to upload.
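A minimal sketch of that workflow, assuming postprocess_dict was returned from a prior automunge(.) call:

import pickle

# save the postprocess_dict returned from training data preparation
with open('postprocess_dict.pkl', 'wb') as f:
    pickle.dump(postprocess_dict, f)

# in a separate notebook: reinitialize the customML inference functions first,
# then load the dictionary before using it to prepare additional data
with open('postprocess_dict.pkl', 'rb') as f:
    postprocess_dict = pickle.load(f)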

And thus ML infill can now run with any tabular learning library or algorithm.

BYOML

Five Blind Boys of Alabama — Mother’s Song

Appendix A — Custom Training Functions

In this example, we define custom training functions built around Scikit-Learn’s random forest (sheepish grin :).
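A minimal sketch along those lines, following the input and output conventions described above (the commands argument is assumed to carry a dict of random forest parameters passed through ML_cmnd):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def customML_train_classifier(labels, features, columntype_report, commands, randomseed):
    # classification labels are received as str(int); converting to int here for consistency
    labels = labels.astype(int)
    # commands assumed to be a dict of random forest parameters
    model = RandomForestClassifier(**commands, random_state=randomseed)
    model.fit(features, labels)
    return model

def customML_train_regressor(labels, features, columntype_report, commands, randomseed):
    # regression labels are received as floats
    model = RandomForestRegressor(**commands, random_state=randomseed)
    model.fit(features, labels)
    return model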

Appendix B — Custom Inference Functions

And now here are the corresponding custom inference functions.
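Again as a minimal sketch consistent with the conventions above:

def customML_predict_classifier(features, model, commands):
    # derive classification infill; int outputs are accepted per the convention above
    infill = model.predict(features)
    return infill

def customML_predict_regressor(features, model, commands):
    # derive regression infill as floats
    infill = model.predict(features)
    return infill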

Appendix C — Default Inference Functions

Note that the library has an internal suite of inference functions for different ML libraries that can optionally be used in place of a user defined customML inference function. These can be activated by passing a string to the entries for ‘customML_Classifier_predict’ or ‘customML_Regressor_predict’ as one of {‘tensorflow’, ‘xgboost’, ‘catboost’, ‘flaml’, ‘autogluon’, ‘randomforest’}. Use of the internally defined inference functions allows a user to upload a postprocess_dict in a separate notebook without needing to first reinitialize the customML inference functions. For example, to apply a default inference function for the XGBoost library, one could apply:
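A sketch, under the assumption that these entries are nested under the same 'customML' specification shown above:

# the string entries activate the library's internal XGBoost inference functions;
# corresponding train entries would point to wrappers that train XGBoost models
ML_cmnd = {'autoML_type': 'customML',
           'customML': {'customML_Classifier_predict': 'xgboost',
                        'customML_Regressor_predict': 'xgboost'}}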

Voila.

For further readings please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
