In which we argue our case


A few excerpts from discussion with reviewers, shared for transparency:

To: Reviewer 1

Contributions

I appreciate that you offered two specific criteria for software packages. I believe this software meets both, as follows:

Criterion one: “The software implements a scientifically novel algorithm, framework, model, etc.” I believe the family tree primitives described in Figure 6 meet this criterion, because they formalize a fundamental aspect of processing tabular data: they enable a simple means for command line specification of multi-transform sets that may include generations and branches of derivations. …


How to dance the French way ;)


…our life, like the harmony of the world, is composed of contrary things — of diverse tones, sweet and harsh, sharp and flat, sprightly and solemn: the musician who should only affect some of these, what would he be able to do? he must know how to make use of them all, and to mix them; and so we should mingle the goods and evils which are consubstantial with our life; our being cannot subsist without this mixture, and the one part is no less necessary to it than the other.

Michel de Montaigne

In this world of on-demand, 24-hour streaming music, there is a risk of taking for granted the wonder of the form. The same songs played on repeat lose their meaning; they become part of the background. Perhaps comforting for their familiarity, but without the stirring of the soul, without the goosebumps and the exhilaration of newly discovered resonance. …


It’s both here and there


For those who haven’t been following along, I’ve been using this forum over the last two years to document the development of Automunge, an open source Python library platform for preparing tabular data for machine learning. …


On an election like no other


America is at a crossroads. This isn’t exaggeration. This isn’t hyperbole. It’s a simple statement of fact. The election taking place next month has so much on the ballot. The climate is on the ballot. Healthcare is on the ballot. Truth is on the ballot. Democracy is on the ballot.

The sitting president is unfit for office. He has demonstrated with fervent consistency that he is incapable of even remotely truthful communications. His social media streams of consciousness are a window into a self-destructive psyche. He openly praises dictators and refuses to disavow white supremacists. He has alienated our country from our allies and openly encouraged foreign interference in our elections. …


The unifying theory of the Wolfram Model


There have been a few paradigm shifts of note in modern physics. The principles of relativity bent the constancy of space and time at extreme scales, then quantum dynamics broke point-wise precision at the nano. The library of atoms and constituent particles was eventually revealed as an abstraction for aggregations of the subatomic, whose newest member, the Higgs boson, required near light speed particle collisions for evidence.

Marriages between these domains have long been sought by researchers, as macro scale relativity and nano scale quantum have trouble reconciling the nature of gravity, one of the four fundamental forces. One channel of investigation has been the invention of new kinds of mathematics, finding higher dimensions manifesting particles from the vibrations of strings and membranes, and symmetries between dimensions even demonstrated through AdS/CFT correspondence, which translations may yet be shown as a kind of Penrose triangle, with the direction determining the destination. …


Numeric Encoding Options with Automunge

St. Thomas Cathedral

Abstract

Mainstream practice in machine learning with tabular data may take for granted that any feature engineering beyond scaling for numeric sets is superfluous in the context of deep neural networks. This paper will offer arguments for potential benefits of extended encodings of numeric streams in deep learning by way of a survey of options for numeric transformations as available in the Automunge open source Python library platform for tabular data pipelines, where transformations may be applied to distinct columns in “family tree” sets with generations and branches of derivations. Automunge transformation options include normalization, binning, noise injection, derivatives, and more. The aggregation of these methods into family tree sets of transformations is demonstrated for use to present numeric features to machine learning in multiple configurations of varying information content, as may be applied to encode numeric sets of unknown interpretation. …
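To make the idea of presenting one numeric feature in several configurations concrete, here is a minimal sketch in plain Python. It is not Automunge's API or implementation: the function names, the equal-width binning, and the noise scale are all assumptions chosen for illustration. The z-score and the bins are two branches derived from the same source column, and the noise-injected version is a second-generation derivation applied to the z-score output.

```python
import random
import statistics


def z_score(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def equal_width_bins(values, n_bins=4):
    """Assign each value an ordinal bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def inject_noise(values, sigma=0.05, seed=0):
    """Add Gaussian noise; sigma and seed are illustrative defaults."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


def encode_numeric(values):
    """Return several encodings of one numeric column (hypothetical helper)."""
    normalized = z_score(values)  # first generation
    return {
        "zscore": normalized,
        "bins": equal_width_bins(values),          # sibling branch from the source
        "zscore_noise": inject_noise(normalized),  # second generation, from zscore
    }


ages = [22.0, 35.0, 41.0, 58.0, 63.0]
encoded = encode_numeric(ages)
```

Each key in `encoded` would become its own column in the prepared table, so a downstream model sees the same feature at several levels of information content.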


Artificial general language

Blondie — Rapture

Anyone who has been keeping up with chatter in the machine learning space has surely heard by now of OpenAI’s new GPT-3 model, a somewhat inelegantly named natural language algorithm. The acronym stands for Generative Pretrained Transformers, third generation, with each generation’s parameter set scaled progressively through orders of magnitude to reach the literally hundreds of billions of parameter weights of the current top tier. …


Automunge by the numbers


1. Automunge is an open source Python library available now for pip install.

2. Automunge prepares tabular data for machine learning by way of numeric encodings and missing data infill.

3. Automunge is built on top of the pandas and NumPy libraries, and also uses scikit-learn for predictive models and SciPy stats for statistics.

4. Automunge assumes data is provided in a tidy form, which means one column per feature and one row per observation.

5. Automunge has a library of feature engineering transforms intended for different data types such as numeric, categoric, sequential, and date-time.

6. Automunge automatically evaluates column data properties to assign appropriate types of encodings. …
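The points above about tidy data, type-dependent encodings, and missing data infill can be sketched with pandas alone. This is a simplified illustration under stated assumptions, not Automunge's API or its actual defaults: the `prepare` function, the `_zscore` suffix, and the choice of mean/mode infill are all hypothetical.

```python
import pandas as pd


def prepare(df):
    """Encode a tidy DataFrame: z-score numeric columns, one-hot the rest.

    Hypothetical helper for illustration only; infill uses the column mean
    for numeric data and the most common value for categoric data.
    """
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            filled = series.fillna(series.mean())
            std = filled.std(ddof=0)
            # Guard against a constant column, where std is zero.
            out[col + "_zscore"] = (filled - filled.mean()) / std if std else 0.0
        else:
            filled = series.fillna(series.mode().iloc[0])
            dummies = pd.get_dummies(filled, prefix=col, dtype=int)
            out = pd.concat([out, dummies], axis=1)
    return out


# Tidy form: one column per feature, one row per observation.
df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "color": ["red", "blue", None],
})
prepared = prepare(df)
```

The returned table is all numeric with no missing entries, which is the form most machine learning libraries expect as input.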


Hyperparameter free machine learning


I recently had the opportunity to attend another machine learning research conference, this one organized by ICML, hosted online with pre-recorded presentations, Zoom chats, and various other virtual interactive features (these things are too affordable not to attend, seriously no excuse). To be honest I’m finding these research conferences kind of difficult to navigate. Representing a startup company trying to attract users means navigating a field full of potential competitors necessitating a bit of caution. …


From zero to one to hero


A friend recently shared that they were thinking about getting into coding, so I thought I’d assemble a few pointers and helpful hints for getting started. Basically I was in a similar position a few years ago, so you can think of these as kind of a letter to my former self on what may have helped me get started way back when. I hope anyone not yet acclimated to the data science ecosystem finds this helpful. Yeah so without further ado.

Software Engineering vs Data Science

So there’s kind of a fundamental distinction between the kind of coding workflows that go into software engineering versus mainstream data science. Specifically, software engineering is the act of creating self-contained packaged systems with defined inputs and outputs. Engineering is an appropriate term because, done properly, it involves creating specifications, documentation, pseudocode, architectures, and implementations. Data science, on the other hand, is a little bit of a looser term. “Science” here is kind of a generous moniker; it’s not exactly a science as practiced in mainstream use. Data science is all about extracting insights from some data corpus. The data could come from any range of applications: financial data, business data, web data, or even, for advanced practice, stuff that you might not think about as “data” like images, video, language, speech, music, etc. Basically anything that we can represent in a digital form can be a target for a data science analysis by machine learning. That being said, a whole lot of the analysis performed in a business setting isn’t quite as exotic; a lot involves just getting numeric and categoric sets into properly formatted tables such as may be passed to machine learning algorithms. …

About

Nicholas Teague

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com
