Stochastic Perturbations Appendix

Keep Calm and Mind the Details

Nicholas Teague
Jan 11, 2022

Appendix A — Table of Contents

  • Appendix B — Automunge Demonstration
  • Appendix C — Train and test data
  • Appendix D — Noise options
  • Appendix E — Sampling parameters
  • Appendix F — Noise injection tutorial
  • F.1 — DP transformation categories
  • F.2 — Parameter assignment
  • F.3 — Numeric parameters
  • F.4 — Categoric parameters
  • F.5 — Noise injection under automation
  • F.6 — Data augmentation with noise
  • F.7 — Alternate random samplers
  • F.8 — QRAND library sampling
  • F.9 — All Together Now
  • F.10 — Noise directed at existing data pipelines
  • Appendix G — Advanced Noise Profiles
  • G.1 — Noise parameter randomization
  • G.2 — Noise profile composition
  • G.3 — Protected attributes
  • Appendix H — Distribution scaling
  • Appendix I — Causal inference
  • Appendix J — Benchmarking
  • Appendix K — Sensitivity analysis — Fastai
  • Appendix L — Sensitivity analysis — Catboost
  • Appendix M — Sensitivity analysis — XGBoost
  • Appendix N — Intellectual property disclaimer

Appendix B — Automunge demonstration

The Automunge interface is channeled through two master functions, automunge(.) for preparing data and postmunge(.) for preparing additional corresponding data. As an example, for a training set dataframe df_train which includes a label feature ‘labels’, automunge(.) can be applied under automation as:
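A sketch of the calling convention, with a toy dataframe whose contents are hypothetical; the Automunge import and call lines are shown commented, following the convention from the read me:

```python
# toy training set with a label feature 'labels' (contents hypothetical)
data = {'floats':  [0.1, 0.5, 0.9, 0.3],
        'strings': ['a', 'b', 'a', 'c'],
        'labels':  [0, 1, 0, 1]}

# df_train = pd.DataFrame(data)
#
# from Automunge import *
# am = AutoMunge()
#
# train, train_ID, labels, \
# val, val_ID, val_labels, \
# test, test_ID, test_labels, \
# postprocess_dict = \
# am.automunge(df_train, labels_column='labels')
```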

Some of the returned sets may be empty based on parameter selections. Using the returned dictionary postprocess_dict, corresponding data can then be prepared on a consistent basis with postmunge(.).
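A sketch of the corresponding postmunge(.) call, again with the call lines commented and a hypothetical test set carrying the same feature headers as the training data (less the label column):

```python
# test data shares feature headers with df_train, without the label column
data_test = {'floats':  [0.2, 0.7],
             'strings': ['b', 'c']}

# df_test = pd.DataFrame(data_test)
#
# test, test_ID, test_labels, \
# postreports_dict = \
# am.postmunge(postprocess_dict, df_test)
```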

To engineer a custom set of transformations, one can populate a transformdict and processdict entry for a new transformation category we’ll call ‘newt’. The functionpointer is used to match ‘newt’ to the processdict entries applied for ‘nmbr’, which is for z-score normalization. The transformdict is used to populate transformation category entries to the family tree primitives [Table 1] (Teague, 2021b) associated with a root category. The first four primitives are for upstream transforms. Since parents is a primitive with offspring, after applying transforms for the ‘newt’ entry, the downstream primitives from newt’s family tree will be inspected to apply ‘bsor’ for ordinal encoded standard deviation bins to the output of the upstream transform. The upstream ‘NArw’ is used to aggregate missing data markers. The assigncat parameter is used to assign ‘newt’ as a root category to a target input column ‘targetcolumn’. There are also many preconfigured trees available in the library.
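The specification can be sketched as follows, with the family tree primitives ordered per [Table 1]; the column header ‘targetcolumn’ is hypothetical, and the commented call follows the earlier convention:

```python
# family tree for the new root category 'newt':
# upstream primitives:   parents / siblings / auntsuncles / cousins
# downstream primitives: children / niecesnephews / coworkers / friends
transformdict = {'newt': {'parents':       ['newt'],
                          'siblings':      [],
                          'auntsuncles':   [],
                          'cousins':       ['NArw'],   # missing data markers
                          'children':      [],
                          'niecesnephews': [],
                          'coworkers':     [],
                          'friends':       ['bsor']}}  # downstream ordinal bins

# functionpointer matches 'newt' to the processdict entries applied for
# 'nmbr' (z-score normalization)
processdict = {'newt': {'functionpointer': 'nmbr'}}

# assign 'newt' as root category to a target input column
assigncat = {'newt': ['targetcolumn']}

# am.automunge(df_train, labels_column='labels', transformdict=transformdict,
#              processdict=processdict, assigncat=assigncat)
```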

This transformation set will return columns with headers logging the applied transformation categories as: ‘column_newt’ (z-score normalization), ‘column_newt_bsor’ (ordinal encoded standard deviation bins), and ‘column_NArw’ (missing data markers). In an alternate configuration ‘bsor’ could instead be entered to an upstream primitive; this is just an example to demonstrate applying generations of transformations. Since friends is a supplement primitive, the upstream output ‘column_newt’ to which the ‘bsor’ transform is applied is retained in the returned data. And since cousins and friends are primitives without offspring, no further generations are inspected after applying their entries.

Table 1. Family Tree Primitives

Parameters can be passed to the transformations through assignparam, as demonstrated here to update a parameter setting so that the number of standard deviation bins for ‘bsor’ as applied to column ‘column’ is increased from the default of 6 to 7; since this is an odd number, the center bin will straddle the mean.
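Roughly, as follows; the bin count parameter identifier ‘bincount’ is an assumption here (consult the read me for the canonical identifier):

```python
# increase standard deviation bins for 'bsor' on column 'column' from 6 to 7
# ('bincount' identifier assumed; an odd count centers a bin on the mean)
assignparam = {'bsor': {'column': {'bincount': 7}}}

# am.automunge(df_train, labels_column='labels', assignparam=assignparam)
```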

Under automation, autoML models are trained for each feature, and missing data marker activations are aggregated in the returned sets, to support missing data imputation. These options can be deactivated with the MLinfill and NArw_marker parameters. The function automatically shuffles the rows of training data and defaults to not shuffling rows of test data. To retain the order of train set rows, deactivate the shuffletrain parameter.
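For example, the deactivations could be sketched as:

```python
# deactivate ML infill, missing data markers, and train row shuffling
options = {'MLinfill':     False,
           'NArw_marker':  False,
           'shuffletrain': False}

# am.automunge(df_train, labels_column='labels', **options)
```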

The privacy_encode parameter offers an option to mask the returned feature headers and order of columns for purposes of retaining privacy of the model basis, which can later be inverted with a postmunge(.) inversion operation if desired. This option can also be combined with encryption of the postprocess_dict by the encrypt_key parameter. Here is an example of privacy encoding without encryption.
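As a sketch, with the call line commented per the earlier convention:

```python
# privacy encoding without encryption
privacy_settings = {'privacy_encode': True, 'encrypt_key': False}

# am.automunge(df_train, labels_column='labels', **privacy_settings)
```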

Putting it all together in an automunge(.) call simply means passing our parameter specifications.
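Aggregating the preceding specifications into one set of keyword arguments, roughly (column headers and the ‘bincount’ identifier are assumptions carried over from the demonstrations above):

```python
# keyword arguments collecting the preceding specifications
kwargs = dict(labels_column='labels',
              transformdict={'newt': {'parents': ['newt'], 'siblings': [],
                                      'auntsuncles': [], 'cousins': ['NArw'],
                                      'children': [], 'niecesnephews': [],
                                      'coworkers': [], 'friends': ['bsor']}},
              processdict={'newt': {'functionpointer': 'nmbr'}},
              assigncat={'newt': ['targetcolumn']},
              assignparam={'bsor': {'column': {'bincount': 7}}},
              privacy_encode=True)

# train, train_ID, labels, \
# val, val_ID, val_labels, \
# test, test_ID, test_labels, \
# postprocess_dict = \
# am.automunge(df_train, **kwargs)
```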

One can then save the returned postprocess_dict, such as by downloading with the pickle library, to use as a key for preparing additional corresponding data on a consistent basis with postmunge(.).
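A minimal pickle round trip, shown here with a toy stand-in for the returned dictionary:

```python
import pickle, tempfile, os

# stand-in for the postprocess_dict returned from automunge(.)
postprocess_dict = {'example_entry': 42}

# download with pickle
path = os.path.join(tempfile.gettempdir(), 'postprocess_dict.pickle')
with open(path, 'wb') as f:
    pickle.dump(postprocess_dict, f)

# upload, e.g. in a separate notebook, to serve as key for postmunge(.)
with open(path, 'rb') as f:
    reloaded = pickle.load(f)

# test, test_ID, test_labels, postreports_dict = \
#     am.postmunge(reloaded, df_test)
```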

Noise injection root categories are likewise assigned to targeted input columns through the assigncat automunge(.) parameter, and once assigned will be carried through as the basis for postmunge(.). Here we demonstrate assigning DPnb as the root category for a list of numeric features, DPod for a list of categoric features, and DPmm for a specific targeted numeric feature.
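A sketch of the specification, with hypothetical column headers:

```python
# DPnb: z-score normalization with Gaussian noise, list of numeric features
# DPod: ordinal encoding with weighted activation flips, categoric features
# DPmm: min-max scaling with noise, a single targeted numeric feature
assigncat = {'DPnb': ['numeric1', 'numeric2'],
             'DPod': ['categoric1', 'categoric2'],
             'DPmm': 'targetnumeric'}

# am.automunge(df_train, labels_column='labels', assigncat=assigncat)
```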

To default to applying noise injection under automation one can take advantage of the automunge(.) powertransform parameter which is used to select between scenarios for default transformations applied under automation. powertransform accepts specification as ‘DP1’ or ‘DP2’ resulting in automated encodings applying noise injection, further detailed in the read me powertransform parameter writeup (or DT and DB equivalents DT1 / DT2 / DB1 / DB2 for different default train and test noise configurations [Appendix C]).

Transformation category specific parameters can be passed to transformation functions through the automunge(.) assignparam parameter, which will then be carried through as the basis for preparing additional data in postmunge. In order of precedence, parameter assignments may be designated targeting a transformation category as applied to a specific column header with suffix appenders, a transformation category as applied to an input column header (which may include multiple instances), all instances of a specific transformation category, all transformation categories, or may be initialized as default parameters when defining a transformation category.

Here we demonstrate passing three different kinds of assignparam specifications.

  • ‘global_assignparam’ passes a specified parameter to all transformation functions applied to all columns; if a function does not accept that parameter it will just be ignored. In this demonstration we turn on test noise injection for all transforms via the ‘testnoise’ parameter.
  • ‘default_assignparam’ passes a specified parameter to all instances of a specified tree category (where tree category refers to the entries to the family tree primitives of a root category assigned to a column, and in many cases the tree category will be the same as the root category). Here we demonstrate updating the ‘flip_prob’ parameter from the 0.03 default for all instances of the DPod transform, which represents the ratio of entries that will be targeted for injection.
  • To target parameters to specific categories as applied to specific columns, one can specify as {category : {column : {parameter : value}}}. Here we demonstrate targeting the application of the DPmm transform to a column ‘targetcolumn’ in order to apply all positive signed noise injections by setting the ‘noisedistribution’ parameter to ‘abs_normal’, and also reducing the standard deviation of the injections from the default of 0.03 to 0.02 with the ‘sigma’ setting. ‘targetcolumn’ refers to the header configuration received as input to a transform, without the returned suffix.

Having defined our assignparam specification dictionary, it can then be passed to the automunge(.) assignparam parameter. As a caveat, it’s important to keep in mind that targeting a category for assignparam specification is based on that category’s use as a tree category (as opposed to its use as a root category), which in some cases may be different. The read me documentation on noise injection details any cases where the tree category accepting a noise injection parameter differs from the root category, as is the case for a few of the hashing noise injections. Having defined our relevant parameters, we can then pass them to an automunge(.) call.
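The three kinds of specification can be sketched together as follows; the 0.06 target for flip_prob is a hypothetical value, and the call line is commented per the earlier convention:

```python
assignparam = {
    # all transforms on all columns: turn on test noise injection
    'global_assignparam': {'testnoise': True},
    # all instances of the DPod tree category: flip_prob from the 0.03
    # default (0.06 is a hypothetical target value)
    'default_assignparam': {'DPod': {'flip_prob': 0.06}},
    # DPmm as applied to input column 'targetcolumn': all positive signed
    # noise with reduced standard deviation
    'DPmm': {'targetcolumn': {'noisedistribution': 'abs_normal',
                              'sigma': 0.02}},
}

# am.automunge(df_train, labels_column='labels', assignparam=assignparam)
```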

In addition to preparing our training data and any validation or test data, this function also populates the postprocess_dict dictionary, which we recommend downloading with pickle if you intend to train a model with the returned data (pickle code demonstrations provided in read me). The postprocess_dict can then be uploaded in a separate notebook to prepare additional corresponding test data on a consistent basis, as may be used for inference.

Appendix C — Train and test data

One of the key distinctions in the library with respect to data preparation is the difference between training data and test data. In a traditional supervised learning application, training data would be used to train a model, and test data would be used for validation or inference. When Automunge prepares training data, in many cases it fits transformations to properties of a feature as found in the training data, which is then used for preparing that feature in the test data on a consistent basis. The automunge(.) function for initial preparations accepts training data and optionally additionally test data with application. The postmunge(.) function for preparing additional data assumes that received data is considered test data by the default parameter setting traindata=False.

In the context of noise injections, the train/test distinction comes into play. Our original default configuration was that noise is injected to training data and not injected to test data. This was built around use cases of applying noise for model training, such as data augmentation, differential privacy, and model perturbation in the aggregation of ensembles. As we’ll demonstrate through benchmarking [Appendix K, L, M], there are also scenarios of non-deterministic inference where noise is injected to test data as well as train data (e.g. neural networks), or perhaps even just to test data and not to train data (e.g. gradient boosting).

We thus have a few relevant parameters for distinguishing between these scenarios. In the base configuration for ‘DP’ root categories, the training data set returned from automunge(.) receives noise when relevant transforms are applied, while the corresponding features in test data do not receive noise, including in test data sets returned from either automunge(.) or postmunge(.).

To treat data passed to postmunge(.) as training data, postmunge(.) has the traindata parameter, which can be turned on and off as desired with each postmunge(.) call. To configure a transformation to default to applying injected noise to train or test data, parameters can be passed to specific transformations as applied to specific columns with the automunge(.) assignparam parameter. The noise injection transforms accept a trainnoise specification (defaulting to True) signaling whether noise will be injected to training data, and a testnoise specification (defaulting to False) signaling whether noise will be injected to test data. Please note that these assignparam parameters, once specified in an automunge(.) call, are retained as the basis for preparing additional data in postmunge(.). If validation data is prepared in automunge(.) it is treated comparably to test data. Alternate root categories are available by replacing a category’s ‘DP’ prefix with ‘DT’ or ‘DB’ for other train and test injection defaults as shown [Table 2].

Table 2. Train and test injection scenarios

Please note that the postmunge(.) traindata parameter can also be passed as ‘train_no_noise’ or ‘test_no_noise’, for purposes of treating the data consistently with train or test data but without noise injections.
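For reference, a sketch of toggling between the scenarios, with the call lines commented and postprocess_dict per an earlier automunge(.) call:

```python
# accepted values for the postmunge(.) traindata parameter
traindata_options = [False, True, 'train_no_noise', 'test_no_noise']

# treat the data as training data (noise per the trainnoise settings):
# test, test_ID, test_labels, postreports_dict = \
#     am.postmunge(postprocess_dict, df_test, traindata=True)

# consistent with train data but without noise injections:
# am.postmunge(postprocess_dict, df_test, traindata='train_no_noise')
```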

Appendix D — Noise options

Noise injections in the library can be applied in conjunction with other preparations; for example, noise injection for numeric features can be applied in conjunction with normalizations and scaling, and noise injections to categoric features can be applied in conjunction with integer encodings. Thus root categories for noise injection transformations [Table 3] can be considered a drop-in replacement for the corresponding encoding. As used in [Table 3], distribution sampling refers to Gaussian noise that can be configured by parameter to alternate profiles like Laplace, uniform, or all positive/negative.

The library also supports the application of noise injections without otherwise editing a feature. DPse (passthrough swap noise), DPpc (passthrough weighted activation flips for categoric), DPne (passthrough Gaussian or Laplace noise for numeric), DPsk (passthrough mask noise for numeric or categoric), and excl (passthrough without noise) can be used in tandem to pass a dataframe to automunge(.) for noise injection without other edits or infill, such as could be used to incorporate noise into an existing tabular pipeline. When limited to these root categories, the returned dataframe will retain the same order of columns, with the only edits other than noise being updated column headers, and DPne overriding any data types other than float. (To retain the same order of rows, deactivate the shuffletrain parameter; original column headers can be retained with the orig_headers parameter.) This practice will be demonstrated in [Appendix F.10].

Table 3. Noise root categories

Appendix E — Sampling parameters

For entropy seeding and alternate random samplers, each automunge(.) or postmunge(.) call may be passed distinct parameters [Fig 6], meaning that the sampling parameters passed to automunge(.) are not carried through as a basis to postmunge(.). This is to support a potential workflow where entropy seeding may be desired for training and not inference, or vice versa. The three relevant parameters are entropy_seeds, random_generator, and sampling_dict (which aggregates multiple sub-specifications into a dictionary).

Figure 6. automunge(.) or postmunge(.) sampling parameters
  • entropy_seeds: defaults to False, accepts an integer or a list / flattened array of integers which may serve as supplemental sources of entropy for noise injections with DP transforms; we suggest integers in the range {0 : (2 ** 31 - 1)} to align with the int32 dtype. entropy_seeds are specific to an automunge(.) or postmunge(.) call, in other words they are not returned in the populated postprocess_dict. Please note that to determine how many entropy seeds are needed for various sampling_dict[‘sampling_type’] scenarios, one can inspect postprocess_dict[‘sampling_report_dict’]; if insufficient seeds are available for these scenarios, additional seeds will be derived with the extra_seed_generator. Note that the sampling_report_dict will report requirements separately for train and test data, and in the bulk_seeds case will have a row count basis. (If not passing test data to automunge(.) the test budget can be omitted.) Note that the entropy seed budget only accounts for preparing one set of data; for the noise_augment option we recommend passing a custom extra_seed_generator with a sampling_type specification, which will result in internal samplings of additional entropy seeds for each additional noise_augment duplicate (or for the bulk_seeds case with external sampling one can increase the entropy seed budget in proportion to the number of additional duplicates with noise).
  • random_generator: defaults to False, accepts numpy.random.Generator formatted random samplers which are applied for noise injections with DP transforms. Note that random_generator may optionally be applied in conjunction with entropy_seeds. When not specified, numpy.random.PCG64 is applied. An example of an alternate generator could be one initialized with the QRAND library (Rivero, 2021) to sample from a quantum circuit. Or if an alternate library does not have numpy.random support, its output can be channeled as entropy_seeds for a similar benefit. random_generator is specific to an automunge(.) or postmunge(.) call, in other words it is not returned in the populated postprocess_dict. Please note that numpy formatted generators of both forms, e.g. np.random.PCG64 or np.random.PCG64(), may be passed; in the latter case any entropy seeding to this generator will be turned off automatically.
  • sampling_dict: defaults to False, accepts a dictionary including possible keys of {sampling_type, seeding_type, sampling_report_dict, stochastic_count_safety_factor, extra_seed_generator, sampling_generator}. sampling_dict is specific to an automunge(.) or postmunge(.) call, in other words it is not returned in the populated postprocess_dict.

sampling_dict[‘sampling_type’] accepts a string as one of {‘default’, ‘bulk_seeds’, ‘sampling_seed’, ‘transform_seed’}

- default: every sampling receives a common set of entropy seeds per user specification, shuffled and passed to each sampling call

- bulk_seeds: every sampling receives a unique supplemental seed for every sampled entry for sampling from the sampling_generator (expended seed counts are dependent on the train/test/both configuration and numbers of rows). This scenario also defaults to sampling_dict[‘seeding_type’] = ‘primary_seeds’

- sampling_seed: every sampling operation receives one supplemental seed for sampling from the sampling_generator (expended seed counts are dependent on the train/test/both configuration)

- transform_seed: every noise transform receives one supplemental seed for sampling from the sampling_generator (expended seed counts are the same independent of the train/test/both configuration)

sampling_dict[‘seeding_type’] accepts one of {‘supplemental_seeds’, ‘primary_seeds’}, where ‘supplemental_seeds’ means that entropy seeds are integrated into np.random.SeedSequence along with entropy seeding from the operating system, and ‘primary_seeds’ means that user passed entropy seeds are the only source of seeding. Please note that ‘primary_seeds’ is used as the default for the bulk_seeds sampling_type and ‘supplemental_seeds’ is used as the default for the other sampling_type options.

sampling_dict[‘sampling_report_dict’] defaults to False, accepts a previously populated postprocess_dict[‘sampling_report_dict’] from an automunge(.) call; if this is not received it will be generated internally. sampling_report_dict is a resource for determining how many entropy seeds are needed for the various sampling_type scenarios.

sampling_dict[‘stochastic_count_safety_factor’]: defaults to 0.15, accepts a float in 0–1, is associated with the bulk_seeds sampling_type case, and is used as a multiplier for the number of seeds populated for sampling operations with a stochastic number of entries.

sampling_dict[‘sampling_generator’]: used to specify which generator will be used for sampling operations other than the generation of additional entropy seeds. Defaults to ‘custom’ (meaning the passed random_generator, or when unspecified the default PCG64), and accepts one of {‘custom’, ‘PCG64’, ‘MersenneTwister’}.

sampling_dict[‘extra_seed_generator’]: used to specify which generator will be used to sample additional entropy seeds when more are needed to meet the requirements of sampling_report_dict. Defaults to ‘custom’ (meaning the passed random_generator, or when unspecified the default PCG64), and accepts one of {‘custom’, ‘PCG64’, ‘MersenneTwister’, ‘off’, ‘sampling_generator’}, where ‘sampling_generator’ matches the sampling_generator specification and ‘off’ turns off sampling of additional entropy seeds.

  • noise_augment: accepts type int or float(int) ≥ 0, defaults to 0. Used to specify a count of additional duplicates of the training data prepared and concatenated with the original train set. Intended for use in conjunction with noise injection, such that the increased size of the training corpus can serve as a form of data augmentation. (Noise injection still needs to be assigned, e.g. by assigning root categories in assigncat, or automated noise can be turned on with powertransform = ‘DP1’.) Note that injected noise will be uniquely randomly sampled with each duplicate. When noise_augment is received as a dtype of int, one of the duplicates will be prepared without noise. When noise_augment is received as a dtype of float(int), all of the duplicates will be prepared with noise. When shuffletrain is activated the duplicates are collectively shuffled, and one can distinguish between duplicates by comparing the original df_train.shape to the ID set’s Automunge index. Please be aware that with large dataframes a large duplicate count may run into memory constraints, in which case additional duplicates can be prepared separately in postmunge(.). Note that the entropy seed budget only accounts for preparing one set of data; for the noise_augment option with entropy seeding we recommend passing a custom extra_seed_generator with a sampling_type specification, which will result in internal samplings of additional entropy seeds for each additional noise_augment duplicate (or for the bulk_seeds case with external sampling one can increase the entropy seed budget in proportion to the number of additional duplicates with noise).

Appendix F — Noise injection tutorial

F.1 DP transformation categories

The DP family of transforms is surveyed in the read me’s library of transformations section as Differential Privacy Noise Injections. The noise injections can be performed in conjunction with numeric normalizations or categoric encodings, options which were surveyed in [Appendix D] [Table 3].

Here is an example of assigning some of these root categories to received features with headers ‘column1’, ‘column2’, ‘column3’. DTnb is z-score normalization with Gaussian noise to test data, shown here assigned to column1. DBod is ordinal encoding with weighted activation flips to both train and test data, shown here assigned to column2 and column3. (To inject just to train data, the identifier string for that default configuration replaces the DT or DB prefix with DP.)
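A sketch of that assignment, with the call line commented per the earlier convention:

```python
# DTnb: z-score normalization, noise injected to test data
# DBod: ordinal encoding, weighted activation flips to both train and test
assigncat = {'DTnb': ['column1'],
             'DBod': ['column2', 'column3']}

# am.automunge(df_train, labels_column='labels', assigncat=assigncat)
```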

F.2 Parameter assignment

Each of these transformations accepts optional parameter specifications to vary from the defaults. Parameters are passed to transformations through the automunge(.) assignparam parameter. As we described in Appendix D, parameter assignments through assignparam can be conducted in three ways: global_assignparam passes a setting to every transform applied to every column, default_assignparam passes the same setting to every instance of a specific transformation’s tree category identifier applied to any column, and in the third option a parameter setting can be assigned to a specific transformation tree category identifier as applied to a specific column (where that column may be an input column or a derived column with suffix appenders passed to the transform). Note that the difference between a tree category and a root category (Teague, 2021b) is that a root category is the identifier of the family tree of transformation categories assigned to a column in the assigncat parameter, while a tree category is an entry to one of that family tree’s primitives which is used to access the transformation function. To restate for clarity, the column string designates either the input column header (before suffixes are applied) or an intermediate column header with suffixes that serves as input to the target transformation.

For noise injections that are performed in conjunction with a normalization or encoding, the noise transform is generally applied in a different tree category than the encoding transform, so if parameters are desired to be passed to the encoding, assignparam will need to target a different tree category for the encoding than for the noise. Generally speaking, the noise transform family trees have been configured so that the noise tree category matches the root category, which was intentional for simplicity of parameter assignment (with an exception for DPhs for esoteric reasons). To view the full family tree such as to inspect the encoding tree category, the set of family trees associated with various root categories are provided in the code repository as FamilyTrees.md.

Note that assignparam can also be used to deviate from the default train or test noise injection settings. As noted above, the convention for the string identifiers of noise root categories is that ‘DP’ injects noise to train and not test data, ‘DT’ injects noise to test and not train data, and ‘DB’ injects noise to both train and test data. These are the defaults, but each of these can be updated by parameter assignment with assignparam specification of ‘trainnoise’ or ‘testnoise’ parameters.

As noted in [Appendix C], for subsequent data passed to postmunge(.), the data can also be treated as test data or train data, and in both cases also have noise deactivated. The postmunge(.) traindata parameter defaults to False to prepare postmunge(.) as test data and accepts entries of {False, True, ‘test_no_noise’, ‘train_no_noise’}.

Most of the noise injection transforms share common parameters between those targeting numeric or categoric entries.

F.3 Numeric parameters

  • trainnoise: activates noise injection to train data (defaults True for DP or DB and False for DT)
  • testnoise: activates noise injection to test data (defaults True for DT or DB and False for DP)
  • flip_prob: ratio of train entries receiving injection
  • test_flip_prob: ratio of test entries receiving injection (defaults as matched to flip prob)
  • sigma: scale of train noise distribution
  • test_sigma: scale of test noise distribution
  • mu: mean of train noise distribution (before any scaling)
  • test_mu: mean of test noise distribution (before any scaling)
  • noisedistribution: train noise distribution, defaults to ‘normal’ (Gaussian), accepts one of ‘normal’, ‘laplace’, ‘uniform’, ‘abs_normal’, ‘abs_laplace’, ‘abs_uniform’, ‘negabs_normal’, ‘negabs_laplace’, ‘negabs_uniform’, where abs refers to all positive signed noise and negabs refers to all negative signed noise
  • test_noisedistribution: test noise distribution, comparable options supported
  • rescale_sigmas: for min-max normalization (DPmm) or retain normalization (DPrt), this activates the mean adjustment noted in Appendix E, defaults to True
  • retain_basis: for cases where distribution parameters are passed as a list or distribution, activating retain_basis means the basis sampled in automunge(.) is carried through to postmunge(.), while the default of False means a unique basis is sampled in each of automunge(.) and postmunge(.)
  • protected_feature: can be used to specify an adjacent categoric feature with sensitive attributes for a segment specific noise scaling adjustment which we speculate may reduce loss discrepancy

Default numeric parameters detailed in [Table 4]

Table 4. Default numeric parameter settings

F.4 Categoric parameters

  • trainnoise: activates noise injection to train data (defaults True for DP or DB and False for DT)
  • testnoise: activates noise injection to test data (defaults True for DT or DB and False for DP)
  • flip_prob: ratio of train entries receiving injection
  • test_flip_prob: ratio of test entries receiving injection
  • weighted: weighted vs uniform sampling of activation flips to train data
  • test_weighted: weighted vs uniform sampling of activation flips to test data
  • retain_basis: for cases where distribution parameters are passed as a list or distribution, activating retain_basis means the basis sampled in automunge(.) is carried through to postmunge(.), while the default of False means a unique basis is sampled in each of automunge(.) and postmunge(.)
  • protected_feature: can be used to specify an adjacent categoric feature with sensitive attributes for a segment specific noise scaling adjustment which we speculate may reduce loss discrepancy
Table 5. Default categoric parameter settings

Here is an example of assignparam specification to:

  • set an all positive noise distribution for category DPmm as applied to an input column with header ‘column1’, noting that for scaled noise like DPmm all positive or all negative noise should be performed with a deactivated noise scaling bias offset.
  • update the flip_prob parameter to 0.1 for all cases of DPnb injections via default_assignparam
  • apply testnoise injections to all noise transforms via global_assignparam
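Pulled together, roughly as follows; the noise scaling bias offset deactivation is noted only in a comment since its parameter identifier is not restated here (consult the read me):

```python
assignparam = {
    # DPmm on input column 'column1': all positive signed noise
    # (per the text, the noise scaling bias offset should also be
    # deactivated for scaled all positive or all negative noise; see
    # the read me for that parameter's identifier)
    'DPmm': {'column1': {'noisedistribution': 'abs_normal'}},
    # all instances of DPnb: flip_prob updated to 0.1
    'default_assignparam': {'DPnb': {'flip_prob': 0.1}},
    # all transforms on all columns: inject noise to test data
    'global_assignparam': {'testnoise': True},
}

# am.automunge(df_train, labels_column='labels', assignparam=assignparam)
```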

F.5 Noise injection under automation

The automunge(.) powertransform parameter can be used to select between alternate sets of default transformations applied under automation. We currently have six scenarios for default encodings with noise, including powertransform passed as one of ‘DP1’ or ‘DP2’. DP2 differs from DP1 in that numeric defaults to retain normalization instead of z-score and categoric defaults to ordinal instead of binarization. (DT and DB equivalents DT1 / DT2 / DB1 / DB2 allow for different default train and test noise configurations, i.e. DT injects just to test data and DB to both train and test [Appendix C].)

Shown following are the root categories applied under automation for these two powertransform scenarios of ‘DP1’ or ‘DP2’.

powertransform = ‘DP1’

  • numeric receives DPnb
  • categoric receives DP10
  • binary receives DPbn
  • hash receives DPhs
  • hsh2 receives DPh2
  • (labels do not receive noise)

powertransform = ‘DP2’

  • numeric receives DPrt
  • categoric receives DPod
  • binary receives DPbn
  • hash receives DPhs
  • hsh2 receives DPh2
  • (labels do not receive noise)

Otherwise noise can just be manually assigned in the assigncat parameter as demonstrated above, and those specifications will take precedence over what would otherwise be performed under automation.

F.6 Data augmentation with noise

Data augmentation refers to increasing the size of a training set with manipulations to increase variety. In the image modality it is common to achieve data augmentation by way of adjustments like image cropping, rotations, color shift, etc. Here we are simply injecting noise to training data for similar effect. In a deep learning benchmark performed in (Teague, 2020a) it was found that this type of data augmentation was fairly benign with a fully represented data set, but was increasingly beneficial with underserved training data. Note that this type of data augmentation can be performed in conjunction with non-deterministic inference by simply injecting to both train and test data.

Data augmentation can be realized by assigning noise transforms in conjunction with the automunge(.) noise_augment parameter, which accepts an integer count of additional duplicates to prepare, e.g. noise_augment=1 would double the size of the training set returned from automunge(.). For cases where too much duplication starts to run into memory constraints, additional duplicates can also be prepared with postmunge(.), which also has a noise_augment parameter option and accepts the traindata parameter to distinguish whether a data set is to be treated as train or test data.

Under the default configuration when noise_augment is received as an integer dtype, one of the duplicates will be prepared without noise. If noise_augment is received as a float(int) type, all of the duplicates will be prepared with noise.

Here is an example of preparing data augmentation for the data set loaded earlier.
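Roughly as follows, with the call line commented and the noise assignment to a hypothetical numeric feature:

```python
# int: one of the additional duplicates is prepared without noise
# float(int): all additional duplicates are prepared with noise
noise_augment = 2

# two additional duplicates concatenated with the original train set:
# am.automunge(df_train, labels_column='labels',
#              assigncat={'DPnb': ['numeric1']},  # noise still needs assignment
#              noise_augment=noise_augment)
```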

F.7 Alternate random samplers

The random sampling for noise injection defaults to numpy’s PCG64, which is based on the PCG pseudo random number generator algorithm (O’Neill, 2014). On its own this generator is not truly random, it relies on seedings of entropy provided by the operating system which are then enhanced through use. To support integration of enhanced randomness profiles, both automunge(.) and postmunge(.) accept parameters for entropy_seeds and random_generator.

entropy_seeds accepts an integer or list/array of integers which may serve as a supplemental source of entropy for the numpy.random generator to enhance randomness properties.

random_generator accepts a numpy.random.Generator formatted random sampler. An example could be numpy.random.MT19937 for Mersenne Twister, or even a generator from an external library following the numpy.random format, such as one that samples with the support of quantum circuits.

Specifications of entropy_seeds and random_generator are specific to an automunge(.) or postmunge(.) call; in other words, they are not returned in the populated postprocess_dict. The two parameters can also be passed in tandem, for sampling with a custom generator supported by custom supplemental entropy seeds.

If an alternate library does not offer a numpy.random formatted generator, its output can be channeled to entropy_seeds for similar benefit. Here is an example of specifying an alternate generator and supplemental entropy seedings.
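The underlying numpy mechanics can be sketched as follows (this illustrates the numpy conventions the two parameters build on, not the Automunge internals): supplemental integers fold into numpy's seeding sequence, and an alternate bit generator such as Mersenne Twister swaps in for the default PCG64.

```python
import numpy as np

# Supplemental entropy, e.g. output of an external randomness source
entropy_seeds = [11, 22, 33]

# numpy's SeedSequence mixes the supplemental integers into the seeding
seq = np.random.SeedSequence(entropy=entropy_seeds)

# swap in MT19937 (Mersenne Twister) in place of the default PCG64
rng = np.random.Generator(np.random.MT19937(seq))

sample = rng.normal(0, 1, size=5)
```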

In the default case, the same bank of entropy seeds is fed to each sampling operation with a shuffle. The library also supports several alternate sampling scenarios that can draw on entropy seedings. Alternate sampling scenarios can be specified to automunge(.) or postmunge(.) with the sampling_dict parameter. Here are a few scenarios to illustrate.

1) In one scenario, instead of passing externally sampled supplemental entropy seeds, a user can pass a custom generator for internal sampling of entropy seeds. Here is an example of using a custom generator to sample entropy seeds while the default PCG64 performs the sampling applied in the transformations. The bulk seeds sampling type means that a unique seed will be generated for each sampled entry. When not sampling externally, this scenario may be beneficial for improving the utilization rate of quantum hardware, since the quantum sampling will only take place once per automunge(.) or postmunge(.) call, with latency governed by the sampler instead of pandas operations.
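The bulk seeds mechanic can be sketched in plain numpy (a stand-in for the sampling_dict configuration, not the library internals): a custom generator is consulted once to produce one unique seed per entry, and the default PCG64 then performs the noise sampling from those seeds.

```python
import numpy as np

# Stand-in for a custom generator, e.g. one backed by quantum hardware
custom_gen = np.random.Generator(np.random.Philox(7))

n_entries = 10  # one seed per sampled entry in the bulk seeds case
seeds = custom_gen.integers(0, 2**32, size=n_entries)

# each entry's noise is then drawn from a PCG64 generator seeded accordingly
noise = np.array([
    np.random.Generator(np.random.PCG64(int(s))).normal(0, 0.06)
    for s in seeds
])
```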

2) In another scenario a user may want to reduce their sampling budget by only accessing one entropy seed for each set of entries. This is accessed with the sampling_type of sampling_seed.

3) There may be a case where a source of supplemental entropy seeds isn’t available as a numpy.random formatted generator. In this case, in order to apply one of the alternate sampling_type scenarios, a user may wish to know a budget for how many externally sampled seeds need to be passed through the entropy_seeds parameter. This can be accomplished by first running the automunge(.) call without entropy seeding specifications to generate the report returned as postprocess_dict[‘sampling_report_dict’]. (If sampling seeds internally with a custom generator this isn’t needed.) Note that the sampling report will state requirements separately for train and test data, and in the bulk seeds case will have a row count basis. (If not passing test data to automunge(.) the test budget can be omitted. For postmunge(.) the use of the train or test budget should align with the postmunge traindata parameter.) For example, if a user wishes to derive a set of entropy seeds to support a bulk seeds sampling type, they can produce a report and derive as follows:
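A sketch of the derivation, noting that the report structure shown here is hypothetical for illustration (consult the returned postprocess_dict['sampling_report_dict'] for the actual schema); the point is that in the bulk seeds case the seed requirement scales with the row count, and the budgeted number of seeds is then sampled from the external source.

```python
import numpy as np

# Hypothetical report shape: per-row seed requirements for train and test
sampling_report_dict = {
    'train': {'bulk_seeds_per_row': 3},  # e.g. three noise-targeted columns
    'test':  {'bulk_seeds_per_row': 2},
}

n_train_rows = 100
budget = sampling_report_dict['train']['bulk_seeds_per_row'] * n_train_rows

# sample the budgeted number of seeds from an external source (stand-in here)
external = np.random.default_rng(0)
entropy_seeds = external.integers(0, 2**32, size=budget).tolist()
```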

F.8 QRAND library sampling

To sample noise from a quantum circuit, a user can either pass externally sampled entropy seeds or make use of an external library with a numpy.random formatted generator. Here’s an example of using the QRAND library (Rivero, 2021) to sample from a quantum circuit, based on a tutorial provided in their read me which makes use of Qiskit (Anis et al., 2021).
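The wiring can be sketched as follows. Since quantum hardware and the QRAND / Qiskit dependencies aren't assumed here, numpy's Philox stands in for QRAND's quantum-backed bit generator; with QRAND the bit generator would instead be constructed from a Qiskit backend per their readme, but the hand-off to numpy's Generator is the same.

```python
import numpy as np

# Philox stands in for a QRAND bit generator constructed from a Qiskit
# backend; any numpy.random formatted bit generator plugs in identically.
quantum_bitgen = np.random.Philox(1234)
generator = np.random.Generator(quantum_bitgen)

# once wrapped, sampling proceeds identically to the default generator,
# and the wrapped generator could be passed to the random_generator parameter
draw = generator.normal(0, 1, size=3)
```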

F.9 All Together Now

Let’s do a quick demonstration tying it all together. Here we’ll apply the powertransform = ‘DP2’ option for noise under automation, override a few of the default transforms with assigncat, assign a few deviations to transformation parameters via assignparam, add some additional entropy seeds from some other resource, and prepare a few additional training data duplicates for data augmentation purposes.

Similarly we can prepare additional test data in postmunge(.) using the postprocess_dict returned from automunge(.), which since we set testnoise as globally activated will result in injected noise in the default traindata=False case.

F.10 Noise directed at existing data pipelines

One more thing. When noise is intended for an existing data pipeline, such as incorporating noise into test data for an inference operation on a previously trained model, there may be a desire to inject noise without other edits to a dataframe. This is possible by passing the dataframe as a df_train to an automunge(.) call to populate a postprocess_dict, with assignment of the various features to one of these four pass-through categories:

  • DPne: pass-through numeric with gaussian (or laplace) noise, comparable parameter support to DPnb
  • DPse: pass-through with swap noise (e.g. for categoric data), comparable parameter support to DPmc
  • DPpc: pass-through with weighted categoric noise (categoric activation flips), comparable parameter support to DPod
  • excl: pass-through without noise

Once populated, the postprocess_dict can be used to prepare additional data in postmunge(.), which has lower latency. Note that DPse injects swap noise by accessing an alternate row entry for a target. This type of noise may not be suitable for test data injections in a scenario where inference may be run on a test set with one or very few samples. The convention in the library is that data is received in a tidy form (one column per feature and one row per observation), so ideally categoric features should be received in a single column configuration for targeting with DPse.
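The swap noise mechanic can be sketched in pandas (an illustration of the concept, not the DPse implementation): for a Bernoulli-selected subset of rows, the entry is replaced by the entry from another randomly selected row of the same feature, which is why a target set with very few rows offers few alternate entries to draw from. The flip ratio here is exaggerated for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
feature = pd.Series(['a', 'b', 'c', 'd'] * 25)

flip_prob = 0.2  # illustrative Bernoulli ratio of entries receiving a swap
mask = rng.binomial(1, flip_prob, size=len(feature)).astype(bool)

# draw replacement entries from alternate rows of the same feature
donors = rng.integers(0, len(feature), size=mask.sum())
swapped = feature.copy()
swapped[mask] = feature.iloc[donors].to_numpy()
```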

Note that DPne will return entries as a float data type, converting any non-numeric entries to NaN. The default noise scale for DPne (sigma=0.06 / test sigma=0.03) is set to align with z-score normalized data. Since a pass-through feature may not be z-score normalized, under the defaulted parameter rescale_sigmas=True the scaling is adjusted by multiplication with the standard deviation of the feature as evaluated on the training data. This adjustment factor is derived based on the training data used to fit the postprocess_dict, and that same basis is carried through to postmunge(.). If a user doesn’t have access to the training data, they can fit the postprocess_dict to a representative df_test routed as the automunge(.) df_train.
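The rescaling adjustment can be sketched as follows (a stand-in for what rescale_sigmas accomplishes internally): the default sigma, calibrated for unit-variance data, is multiplied by the feature's standard deviation as measured on the training data, and that same train-data basis is reused when injecting to subsequent test data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
train_feature = pd.Series(rng.normal(50.0, 10.0, size=1000))

sigma = 0.06                      # default DPne train scale
train_std = train_feature.std()   # basis fit on training data
scaled_sigma = sigma * train_std  # adjustment applied to the noise scale

# the same train-derived basis applies when injecting to later test data
test_feature = pd.Series(rng.normal(50.0, 10.0, size=200))
noisy_test = test_feature + rng.normal(0, scaled_sigma, size=len(test_feature))
```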

Having populated the postprocess_dict, additional inference data can be channeled through postmunge, which has latency benefits.

The returned test dataframe can then be passed to inference. The order of columns in the returned dataframe will be retained for these transforms, and the orig_headers parameter retains original column headers without suffix appenders.

The postmunge(.) call can then be repeated as additional inference data becomes available, and could be applied sequentially to streams of data in inference.

Appendix G — Advanced noise profiles

The noise profiles discussed thus far have mostly been a composition of two perturbation vectors: one arising from a Bernoulli sampling to select injection targets, and a second arising from either a distribution sampling for numeric or a choice sampling for categoric. There may be use cases where a user desires additional variations in distribution shape. This section will survey a few of the more advanced noise profile composition methods available. Compositions beyond those discussed here are also available through custom defined transformation functions, which can be built from a simple template.

In some sense, the methods discussed here are a form of probabilistic programming, although not a Turing complete one. For Turing complete distribution compositionality, we recommend channeling through custom defined transformation functions that make use of external libraries with such capability, e.g. (Tran et al., 2017). Custom transformations can apply a simple template detailed in the read me (Teague, 2021b) section “Custom Transformation Functions.”

G.1 Noise parameter randomization

The generic numeric noise injection parameters were surveyed in [Appendix F.3] and their defaults presented in [Table 4]. Similarly, the generic categoric noise injection parameters were surveyed in [Appendix F.4] with defaults presented in [Table 5]. For each of the parameters related to noise scaling, weighting, or specification, the library offers options to randomize their derivation by random sampling, including support for such sampling to be conducted with entropy seeding. The random sampling of parameter values can either be activated by passing a parameter as a list of candidate values for a choice sampling, or, for parameters with float specification, by passing an arbitrary scipy stats (Virtanen et al., 2020) formatted distribution for a shaped sampling.
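The two conventions can be sketched as follows: a list of candidates implies a choice sampling, while a scipy stats formatted distribution object (anything exposing an .rvs() method) implies a shaped sampling. A small stand-in class takes the place of a frozen scipy distribution (e.g. stats.uniform) so the sketch stays dependency-free.

```python
import numpy as np

rng = np.random.default_rng(8)

# list of candidates => choice sampling between the entries
sigma_candidates = [0.03, 0.06, 0.12]
sigma = rng.choice(sigma_candidates)

class UniformSigma:
    """Stand-in for a scipy.stats frozen distribution, e.g. stats.uniform(0.02, 0.1)."""
    def rvs(self):
        return rng.uniform(0.02, 0.12)

# distribution object => shaped sampling via .rvs()
sigma_dist = UniformSigma()
sampled_sigma = sigma_dist.rvs()
```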

One parameter that we did not go into great detail on in earlier discussions was the retain_basis parameter. This parameter is relevant to noise parameter randomization, and refers to whether a common or a freshly sampled set of noise parameters is applied between a feature as prepared in the test data received by automunge(.) and each subsequent postmunge(.) call. We expect that in most cases a user will desire a common noise profile between initial test data prepared in automunge(.) and subsequent test data prepared in postmunge(.) for inference, as is the default True setting. A consistent noise profile should be appropriate when relying on a corresponding noise profile injected into the training data. We speculate that there may be cases where non-deterministic inference could benefit from a unique sampled noise profile across inference operations. Deactivating the retain_basis option can either be conducted specific to a feature, or globally using an assignparam[‘global_assignparam’] specification.

G.2 Noise profile composition

Another channel for adding additional perturbation vectors into a noise profile is available by composing sets of noise injection transforms using the family tree primitives [Table 1]. The primitives are for purposes of specifying the order, type, and retention of derivations applied when a ‘root category’ is assigned to an input feature, where each derivation is associated with a ‘tree category’ populated in either the root category’s family or some downstream family tree accessed from origination of the root category. As they are implemented with elements of recursion, they can be used to specify transformation sets that include generations and branches of univariate derivations. Thus, multiple noise injection operations can be applied to a single returned set, potentially including noise of different scaling and/or injection ratios.

Here is an example of a family tree specification for a numeric injection of Gaussian noise with two different profiles, applied downstream of a z-score normalization.
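A sketch of the shape such a specification might take, following the transformdict / processdict convention from Appendix B. The category keys here ('newt', 'DPn1', 'DPn2') are hypothetical, the primitive choices are illustrative, and the per-category sigmas would be set via assignparam; consult the read me for the exact entries.

```python
# Hypothetical categories: 'newt' applies z-score upstream with two
# downstream gaussian injections of differing scale; 'NArw' aggregates
# missing data markers as in the Appendix B example.
transformdict = {
    'newt': {'parents': ['newt'],            # upstream z-score (with offspring)
             'cousins': ['NArw'],            # missing data markers
             'niecesnephews': ['DPn1', 'DPn2']},  # two downstream noise profiles
}

processdict = {
    'newt': {'functionpointer': 'nmbr'},     # z-score normalization
    'DPn1': {'functionpointer': 'DPnb'},     # gaussian noise, e.g. sigma=0.03
    'DPn2': {'functionpointer': 'DPnb'},     # gaussian noise, e.g. sigma=0.12
}
```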

G.3 Protected attributes

We noted in our related work discussions in Section 7 that one possible consequence of noise injections is that different segments of a feature set, corresponding to categories in an adjacent protected feature, may be impacted more than others owing to diversity in segment distributions relative to a common noise profile, which without mitigation may contribute to loss discrepancy between categories of that protected feature (Khani & Liang, 2020). The mitigation available from Automunge was inspired by the description of loss discrepancy offered by that citation, and might be considered another contribution of this paper.

Loss discrepancy may arise in conjunction with noise due to the property that segments of a feature, when not randomly sampled, may not share the distribution profile of their aggregation. In some cases the segments of a noise targeted feature corresponding to the attributes of an adjacent protected feature may have this property. Thus, by injecting a single noise profile into segments with different scalings, those segments will be unequally impacted.

The Automunge solution is an as yet untested hypothesis with a clean implementation. When a user specifies an adjacent protected feature for a numeric noise feature, the noise scaling for each segment of the noise target feature corresponding to attributes in the adjacent feature is rescaled, with the train data basis carried through to test data. For example, if the aggregate feature has standard deviation σa, and the segment has standard deviation σs, the noise can be scaled for that segment by multiplication by the ratio σs /σa. Similarly, if a protected feature is specified for a categoric noise feature, the derivation of weights by frequency counts can be calculated for each segment individually. In both cases, the segment noise distributions will share a common profile between train and test data, and the aggregate noise distribution will too as long as the distribution properties of the protected feature remain consistent. These options can be activated by passing an input header string to the protected_feature parameter of a distribution sampled or weighted categoric noise transform through assignparam, with support for up to one protected feature per noise targeted feature.
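The numeric case can be sketched in pandas (an illustration of the described rescaling, not the library implementation): the noise scale for each segment of the target feature, grouped by the protected feature's attributes, is multiplied by the ratio of segment to aggregate standard deviation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({
    'target':    np.concatenate([rng.normal(0, 1.0, 500),
                                 rng.normal(0, 3.0, 500)]),
    'protected': ['a'] * 500 + ['b'] * 500,
})

sigma = 0.06
sigma_aggregate = df['target'].std()                 # aggregate sigma_a
segment_stds = df.groupby('protected')['target'].std()  # per-segment sigma_s

# per-segment noise scale: sigma * (sigma_s / sigma_a)
segment_sigmas = sigma * segment_stds / sigma_aggregate
```

The wider segment 'b' receives proportionally larger noise, so each segment is perturbed in proportion to its own spread rather than the aggregate's.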

Appendix H — Distribution scaling

The paper noted in Section 3.1 that the numeric injections may have the potential to increase the maximum value found in the returned feature set or decrease the minimum value. In some cases, applications may benefit from retention of such feature properties before and after injection. When a feature set is scaled with a normalization that produces a known range of values, as is the case with min-max scaling (which scales data to the range between 0–1 inclusive), it becomes possible to manipulate noise as a function of entry properties to ensure retained range of values after injection. Other normalizations with a known range other than 0–1 (such as for ‘retain’ normalization (Teague, 2020a)) can be shifted to the range 0–1 prior to injection and then reverted after for comparable effect. As this results in noise distribution derived as a function of the feature distribution, the sampled noise mean can be adjusted to closer approximate a zero mean for the scaled noise.

The noise scaling [Alg 1] assumes a min-max scaled feature; the algorithm works because both the min-max scaled entries and the noise have a known range. The pandas implementation looks a little different but is equivalent.

Algorithm 1. Noise scaling
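One plausible realization consistent with the description (the actual [Alg 1] may differ in detail): for a min-max scaled feature in [0, 1], positive noise entries are scaled by the headroom (1 - x) and negative entries by the floor distance x, so an injection can never push an entry outside the original range.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=1000)          # min-max scaled feature
noise = rng.normal(0, 0.03, size=1000)    # sampled noise
noise = np.clip(noise, -1, 1)             # known noise range

# scale noise as a function of entry properties to preserve [0, 1]
scaled_noise = np.where(noise > 0, noise * (1 - x), noise * x)
injected = x + scaled_noise
```

Because the scaling is a function of each entry's distance to the range bounds, the resulting noise mean drifts from zero, motivating the mean adjustment of [Alg 2].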

Since the noise scaling is based on the feature distribution, which is not known in advance, the mean adjustment to closer approximate a zero mean after scaling is derived by linear interpolation from iterated scalings, producing a final scaling used for seeded sampling and injection [Alg 2]. Note that this final adjusted mean μadjusted is derived based on the training data, and that same basis is used for subsequent test data.

Algorithm 2. Noise scaling mean adjustment and injection

Appendix I — Causal inference

The focus of the paper was non-deterministic inference by noise injection to tabular features in the context of a supervised learning workflow. Because Automunge is intended to be applied directly preceding the passing of data to training and inference, we don’t expect a great risk of noise injections polluting data that would otherwise be passed to causal inference. However, it should be noted that the output of non-deterministic inference will likely have a non-symmetric noise distribution of exotic nature based on properties of the model, and any downstream use of causal inference may be exposed to that profile. One way to circumvent this could be to run inference redundantly with and without noise so as to retain the deterministic inference outcome as a point of reference. Such an operation can easily be performed by passing the same test data to the Automunge postmunge(.) function with variations on the traindata parameter as per [Appendix C].

Appendix J — Benchmarking

A series of sensitivity analysis trials were conducted to evaluate different injection scenarios and parameter settings [Appendix K, L, M]. As we were primarily interested in relative performance between configurations, we narrowed focus to a single data set; resource limitations were also a contributing factor. Scenarios were trained on a dedicated workstation over the course of about two months. The IEEE-CIS data set (Vesta, 2019) was selected for scale and familiarity. The data was prepared in configurations of just numeric features, just categoric features, and all features. Injections were performed without supplemental entropy seeding. In addition to parameter settings for numeric or categoric noise types and scales, the benchmarks also considered scenarios associated with injections just to train data, just to test data, and to both train and test data (which we refer to as the traintest scenario). The trials were repeated in a few learning frameworks appropriate for the tabular domain, including neural networks via the Fastai library (Howard & Gugger, 2020) (with four dense layers and width equal to feature count), boosted trees via the CatBoost library (Dorogush et al., 2018), and gradient boosting with XGBoost (Chen & Guestrin, 2016), in each case with default hyperparameters for consistency. The primary considerations of interest were validating default noise parameter settings [App F.3, F.4] (they were confirmed as conservative) and demonstrating noise sensitivity in the context of relative performance between injections targeting train / test / traintest features. The full results are presented in [Appendix K, L, M], and a representative set of jupyter notebooks used to aggregate these figures is included in the supplementary material.

Appendix K — Sensitivity Analysis — Fastai

Highlights

The neural network scenario benefited from injecting noise to test data in conjunction with train data. The regularizing effect of the noise appeared to improve with increasing scale at low injection ratio [Fig 7]. When we increased the injection ratio (‘flip prob’), this tolerance of higher distribution scales (‘sigma’) faded [Fig 8]. Unlike the numeric distribution sampling, categoric activation flips demonstrated an apparently linear performance degradation profile with increasing noise scale [Fig 9]. Note that this data set had many more numeric than categoric features, which we suspect made the categoric benchmarks appear disproportionately affected. Laplace distributed noise performed almost as well as Gaussian; it was very close [Fig 10]. Note that the Laplace distribution’s thicker tails mean non-deterministic inference may be exposed to a broader range of scenarios.

Figure 7. Fastai Gaussian Injections
Figure 8. Fastai Gaussian scale and Bernoulli ratio
Figure 9. Fastai categoric injections
Figure 10. Fastai distribution comparisons

Appendix L — Sensitivity Analysis — Catboost

Highlights

Catboost appeared to tolerate training data injections better than gradient boosting, and the traintest scenario for the most part closely resembled the test scenario. Unique to this library, there was a characteristic drop at the smallest numeric noise scale, after which further intensity was more benign [Fig 11]. The categoric profile resembled a linear degradation with increasing ratio [Fig 12]. Consistent with the other libraries, the categoric weighted sampling did not exactly line up with swap noise in the test case [Fig 13]; we expect this is because weighted sampling is fit to the training data while swap noise samples based on the distribution of the target set, suggesting there was some amount of covariate shift in the validation set.

Figure 11. Catboost Gaussian injections
Figure 12. Catboost categoric injections
Figure 13. Catboost distribution comparisons

Appendix M — Sensitivity Analysis — XGBoost

Highlights

The biggest takeaway here is that XGBoost does not like injections to train data, but injections only to test data for inference are tolerable with only a small performance impact based on scale [Fig 14]. It appeared numeric features could tolerate increasing Gaussian scale with a sufficiently low injection ratio [Fig 15]. A subtle result, and we don’t know whether this would consistently replicate, is that for categoric injections [Fig 16], when injecting to just test data, weighted outperformed uniform when looking at only categoric features; however, when looking at all features (with the surrounding numeric features included), uniform slightly outperformed weighted. Numeric swap noise had much closer performance to distribution sampling [Fig 17] in comparison to neural networks [Fig 10].

Figure 14. XGBoost Gaussian injections
Figure 15. XGBoost Gaussian scale and Bernoulli ratio
Figure 16. XGBoost categoric injections
Figure 17. XGBoost distribution comparisons

Appendix N — Intellectual Property Disclaimer

Automunge is released under GNU General Public License v3.0. Full license details available with code repository. Contact available via automunge.com. Copyright © 2021 — All Rights Reserved. Patent Pending, including applications 16552857, 17021770

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
