My current blog is epistem.ink. This one is here just for archival purposes.
Disclaimer: opinions here are my own and are not endorsed by, held by, prescribed by, or legally binding to MindsDB.
I don't usually write about my professional work; this is an exception. I've been working on automatic machine learning for almost 3 years.
A small amount of that was focused on what I'd call the core of the problem; most of it was focused on platform building. This changed in the last 5 months, when I decided to quit management duties and focus solely on "research". The first thing to come out of this is the version 1 redesign of an automatic ML library called Lightwood.
This article might be a bit preachy; it's very hard to spend almost 3 years of your life working on something without being somewhat biased towards it and thinking it's the best thing since keto bread. I'm also affiliated with MindsDB in the strongest way possible: I own options, get money from them, and am friends with the people working on it. So take this with a ton of salt.
The objective of Lightwood, broadly speaking, is to provide a decent solution to any supervised learning problem, which the user can customize to the point where it's "state of the art".
The idea of automatic machine learning is not novel; there are older libraries and tools focused on the same thing, and there's been literature on the subject since ancient times. We aren't inventing anything entirely new, we are simply walking along what I'd hope is the best path towards achieving an idea that's yet to deliver on its potential.
The things that I hope will make Lightwood stand out are:

- It's fully open-source and free in the truest sense of the word, and it has a business model that will allow it to remain open source by keeping maintenance costs low and making it a critical component of for-profit products.
- It's very easy to contribute to and expand: someone with PyTorch familiarity and expertise in some specific area can make a significant contribution in hours.
- It has a benchmark suite that covers a wider range of problems than any other, and follows strict rules demanding that any single change improve (or at least not harm) benchmark performance.
i - Architecture
Lightwood works by generating code for `Predictor` objects out of structured data (e.g. a dataframe) and a problem definition, the simplest possible definition being the column to predict.
The data can be anything. It can contain numbers, dates, categories, text (in any language, but English is currently the primary focus), quantities, arrays, matrices, images, audio, or video. The latter three can be paths to the file system or URLs, since storing them as binary data can be cumbersome.
The generated `Predictor` object can be fitted by calling a `learn` method, or through a lower-level step-by-step API. It can then make predictions on similar data (same columns except for the target) by calling a `predict` method. That's the gist of it.
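A toy sketch of that interface shape; the `Predictor` class below is an invented stand-in that just memorizes per-category means, not the real generated code:

```python
# Toy stand-in for a generated Predictor: "learns" the mean target value per
# category of the single feature column. Invented for illustration only.

class Predictor:
    def __init__(self, target):
        self.target = target

    def learn(self, rows):
        self.col = next(k for k in rows[0] if k != self.target)
        sums, counts = {}, {}
        for r in rows:
            sums[r[self.col]] = sums.get(r[self.col], 0.0) + r[self.target]
            counts[r[self.col]] = counts.get(r[self.col], 0) + 1
        self.means = {k: sums[k] / counts[k] for k in sums}

    def predict(self, rows):
        # Predict on similar data: same columns as training, minus the target.
        fallback = sum(self.means.values()) / len(self.means)
        return [self.means.get(r[self.col], fallback) for r in rows]

predictor = Predictor(target='rental_price')
predictor.learn([
    {'location': 'city', 'rental_price': 2000},
    {'location': 'city', 'rental_price': 2400},
    {'location': 'rural', 'rental_price': 900},
])
print(predictor.predict([{'location': 'city'}]))  # [2200.0]
```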
In practice, video input is not yet supported, and outputting text or multi-media that's not gibberish requires custom code.
The other interesting bit is that it supports dependencies between rows when running in "time-series mode". We call this "time-series" because it's the most often encountered use-case for dependencies between rows, but it can be used for a broad set of applications where the values in a row are dependent on the values in some set of other rows.
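To make "values in a row depending on other rows" concrete, here's a minimal sketch of the most common case: attaching a sliding window of past target values to every row. The function and the history column name are invented for illustration:

```python
# Turn between-row dependencies into per-row features: each row gets a window
# of the previous `window` target values attached as a new column.

def add_history_window(rows, order_by, target, window=3):
    ordered = sorted(rows, key=lambda r: r[order_by])
    out = []
    for i, row in enumerate(ordered):
        enriched = dict(row)
        # Past target values from the preceding rows (empty at the start).
        enriched[f'{target}_history'] = [r[target] for r in ordered[max(0, i - window):i]]
        out.append(enriched)
    return out

rows = [{'t': i, 'sales': 10 * i} for i in range(5)]
enriched = add_history_window(rows, order_by='t', target='sales', window=2)
print(enriched[4]['sales_history'])  # [20, 30]
```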
There's an intermediary representation that gets turned into the final code, called `JsonAI`. This provides an easy way to edit the `Predictor` being generated from the original data and problem specifications, as well as to prototype custom code without modifying the library itself, or even having a "development" version of the library installed.
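To illustrate why a plain-data intermediate representation makes editing easy, here's a hypothetical JsonAI-like dictionary; all keys and module names below are invented and do not reflect the real schema:

```python
# Hypothetical JsonAI-like intermediate representation (invented keys).

json_ai = {
    'problem_definition': {'target': 'rental_price'},
    'encoders': {
        'location': {'module': 'OneHotEncoder', 'args': {}},
        'sqft': {'module': 'NumericEncoder', 'args': {}},
    },
    'mixers': [{'module': 'Neural', 'args': {'epochs': 100}}],
    'ensemble': {'module': 'BestOf'},
}

# Because it is plain data, prototyping custom code is a dictionary edit,
# with no changes to the library itself:
json_ai['mixers'].append({'module': 'my_package.MyCustomMixer', 'args': {}})

print([m['module'] for m in json_ai['mixers']])
```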
The way this happens is via a sequential~ish series of steps:
1. Types are inferred for every column/key.
2. A "statistical analysis" extracts various information from the data.
3. When between-row dependencies exist, the data is transformed to account for them (i.e. relevant values from the dependencies are sourced into each row).
4. The data is cleaned and standardized (e.g. dates represented as a mix of string formats and timestamp floats are converted to a single standard representation).
5. An "encoder" is "trained" for each individual column. This can be as simple as a one-hot encoder, or as complex as the time-series encoder, encoders that deal with sets of interacting columns (which a whole article could be written about), or a transformer encoder for generating text embeddings.
6. A series of "mixers" are trained on the resulting encoded representations.
7. An ensemble is "trained" and decides which mixers to use or how to combine their predictions.
8. A "model analysis" step looks at the whole ensemble and extracts some stats about it, as well as building confidence models that allow us to output a confidence and a confidence interval for each prediction. We also use this step to generate some "explanations" about model behavior, but they are fairly primitive right now. However, it's very easy to add new analysis blocks to improve this, if compute allows for it.
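The training steps above can be compressed into a toy pipeline so the data flow is explicit. Every component below is a deliberately trivial stand-in (the dependency-transform and model-analysis steps are omitted):

```python
# Toy pipeline: cleaning coerces to float, each column's "encoder" just
# mean-centers, the "mixers" are two tiny models, and the "ensemble" keeps
# whichever mixer has the lowest training error.

def train_pipeline(rows, target):
    columns = [c for c in rows[0] if c != target]  # type inference, stubbed: all numeric
    cleaned = [{k: float(v) for k, v in r.items()} for r in rows]  # cleaning
    encoders = {c: sum(r[c] for r in cleaned) / len(cleaned) for c in columns}
    encoded = [[r[c] - encoders[c] for c in columns] for r in cleaned]
    targets = [r[target] for r in cleaned]
    mean_pred = sum(targets) / len(targets)
    mixers = [lambda e: mean_pred,            # mixer 1: always predict the mean
              lambda e: mean_pred + e[0]]     # mixer 2: mean plus first feature
    sse = lambda m: sum((m(e) - t) ** 2 for e, t in zip(encoded, targets))
    ensemble = min(mixers, key=sse)           # keep the best-performing mixer
    return columns, encoders, ensemble

cols, encs, model = train_pipeline(
    [{'x': 1, 'y': 2}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}], target='y')

# Predicting repeats cleaning and encoding, then applies the chosen mixer:
encoded_row = [float({'x': 4}[c]) - encs[c] for c in cols]
print(model(encoded_row))  # 5.0
```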
Predicting is very similar: data is cleaned, then encoded, then mixers make their predictions and they get ensembled. Finally, explainer modules determine things like confidence, CIs, and column importances.
A lot of production pipelines handling mixed data seem to do something very similar.
The main drawback of this is that the separation doesn't allow phases to wield great influence on each other or run in a "united" fashion. This means you can't have things like mixer gradients propagating back through and training the encoders, or analysis blocks looking at the model and deciding the data cleaning procedure should change. Granted, there's no hard limit on this, but any such implementation would be rather unwieldy in terms of code complexity.
The main draw of this architecture is the main draw of Lightwood: it's very easy to extend. Full understanding (or even any understanding) of the pipeline is not required to improve a specific component. Users can easily integrate their custom code with minimal hassle, even if PRs are not accepted, while still pulling everything else from upstream. This works well with the open-source nature of the product and is what might really give it an edge, given the number of alternatives.
The second advantage this provides is that it's relatively trivial to parallelize, since most tasks are done per-feature. The bits which are done on all the data (mixer training and model analysis) are made up of multiple blocks with similar APIs which can themselves be run in parallel.
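A minimal standard-library sketch of why per-feature work parallelizes so cleanly (the real library's scheduling is more involved; the encoder here is an invented stand-in):

```python
# Per-column work is embarrassingly parallel: each encoder only needs its
# own column's values, so columns can be fitted concurrently.

from concurrent.futures import ThreadPoolExecutor

def train_encoder(column_values):
    # Stand-in for fitting one column's encoder: record min/max for scaling.
    return min(column_values), max(column_values)

data = {'age': [21, 35, 62], 'income': [30_000, 52_000, 48_000]}

with ThreadPoolExecutor() as pool:
    encoders = dict(zip(data, pool.map(train_encoder, data.values())))

print(encoders)  # {'age': (21, 62), 'income': (30000, 52000)}
```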
Finally, most of Lightwood is built on PyTorch, and PyTorch mixers and encoders are first-class citizens insofar as the data format makes it easiest to work with them. In that sense, performance on "professional" hardware and continued compatibility are taken care of for us, which frees up time to work on other things.
ii - Benchmarking
I think Lightwood really starts to shine when we get to its extensive benchmark suite. It's not as big as I want it to be, but now that the core of Lightwood is working, this has become my main focus, and I expect it to improve tremendously before 2021 ends.
It's easy to use; I'll let the README.md do the talking, but you can prototype a change in Lightwood and benchmark it against all of our versions of the library with just 3 commands. I'm still working on the problem of allowing people to publish their results, since without some sort of replication on our end it would invite cheating, but there are solutions to that; they are just annoying to implement.
You can run it on all datasets or on a specific bunch that’s relevant to your area of interest.
There are other AutoML benchmarks, OpenML being the most well-developed, but they lack diversity. The Lightwood benchmark suite encourages the inclusion and mixing of "typical" AutoML problems, those that deal in numerical and categorical features, with multimedia, text, time-series (i.e. constraints between rows and columns), arrays of numbers, lists of tags, dates, and anything else that can fit in a structured format. We are working on out-of-the-box support for some more domain-specific formats, but I can't give exact estimates yet.
OpenML is great, and more thorough than us in terms of number of datasets and methodology, but it only contains classification problems with categorical and numerical features, which seems like a very small subset of all ML problems.
Our suite doesn't force k-fold cross-validation and accepts that each dataset might want to use a different number of folds for CV. This keeps the runtime reasonable; currently, it runs in ~2 hours on a g4dn.12xlarge. I expect that number to go up to 6 hours soon, but I wouldn't want it to exceed that. In part this means allowing for better parallelization and waiting for hardware to catch up; in part, it means accepting the trade-off between the consistency offered by CV and the fact that some datasets take exponentially longer to run than others: 20% of datasets use 80% of the compute.
Each train + predict job is executed with ray, with centralization happening only for the evaluation and saving of the final results. This means that, if budget becomes less of an issue, the maximum runtime will be no longer than that of the lengthiest dataset.
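A back-of-the-envelope version of that claim, with made-up per-dataset runtimes: fully parallel execution collapses total wall time from the sum of runtimes to the longest single one.

```python
# With one worker per dataset, wall time is the longest single runtime,
# not the sum of all runtimes. The dataset names and minutes are invented.

runtimes_minutes = {'home_rentals': 3, 'hdi': 5, 'air_quality': 42, 'used_cars': 110}

serial = sum(runtimes_minutes.values())    # one machine, one dataset at a time
parallel = max(runtimes_minutes.values())  # one worker per dataset

print(serial, parallel)  # 160 110
```

It also shows the 20/80 skew mentioned above: the single slowest dataset dominates the budget.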
Here's how it looks when plotting the running accuracy of all (stable) Lightwood versions across all datasets and accuracy functions. Here's a report comparing a single version of the library to the best results per dataset picked from all other versions. Finally, here's a simpler report comparing two different versions. These are our private results, but you can generate the exact same reports and graphs by running locally and comparing them against your datasets.
iii - Why this is important
So what we have right now is:
An automatic machine learning library made out of small modules that are easy to swap, customize and write.
A decently thorough benchmark suite spread across a large variety of datasets.
Or, to put it another way:
A very simple interface for a “layman” to automatically build models for any predictive problem.
A simple interface for anyone that wants to customize the automatically generated models.
A moderately complex interface for someone with expertise to modify the library and benchmark their changes.
This means that the library ought to address two common concerns people have with machine learning.
There's a group of people that might want to use machine learning in theory but are put off by the complexity and/or the cost of hiring a specialist. For these people, I hope that requiring only 3 or 4 lines of code to compare their hand-written models with a machine learning implementation will prove sufficiently easy.
There’s a group of people that might be comfortable using existing automl solutions, but require their models to include some domain-specific logic. This can range from handling specific data formats, to not cleaning or ignoring data that seems “obviously bad”, to using some existing architecture tweaks that are known to yield very good results. Lightwood should be able to provide these people with what they need, without them needing to build everything from scratch, thus reducing the chance of messing up code unrelated to the “core” problem and freeing time.
This makes Lightwood a great tool for introducing machine learning to scientific problems, something that is steadily happening but seems to be stifled by a huge gap between applied researchers and people doing ML. That gap can be crossed, but usually only by having a team of decently well-trained ML people, or at least researchers that are fairly polymathic, and the supply side of this equation is lacking. Even if it weren't, growing a team is intrinsically problematic due to the added workflow complexity.
Will this actually happen? I don't know; I doubt it, because it seems like a monumental task. But at least I think we can aim for this as a north star.
More importantly, this gives a trivially easy way to get new research into the library without having to hire specialists in every single sub-sub-sub field of machine learning. The tried and true model of contests.
Got a new optimizer that blows everything out of the water? Copy-paste it, modify one line of code in our linear network mixer, run the benchmarks, PR it.
Got a new method for generating text embeddings that beats the state of the art? Ok, just PR a text encoder using it and the benchmarks will take it from there.
Think people are completely wrong about how they ought to normalize numeric values in X scenarios before feeding them to an FCNN? Great, add that `if` to our numeric encoder and let us see it.
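Here is what such an `if` could look like in miniature: a toy numeric encoder that switches normalization strategy for heavily skewed positive data. The class, the skew test, and the threshold are all invented for illustration:

```python
# Toy numeric encoder with a hypothetical domain-specific branch: strictly
# positive, heavily right-skewed data gets a log transform instead of
# plain standardization.

import math
import statistics

class ToyNumericEncoder:
    def fit(self, values):
        self.mean = sum(values) / len(values)
        var = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = math.sqrt(var) or 1.0
        # Invented skew heuristic: mean much larger than median.
        self.use_log = min(values) > 0 and self.mean / statistics.median(values) > 10
        return self

    def encode(self, v):
        if self.use_log:
            return math.log1p(v)
        return (v - self.mean) / self.std

enc = ToyNumericEncoder().fit([1, 2, 3, 1000])
print(enc.use_log)  # True
```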
The problem here is how you incentivize people to do this. I think part of the answer lies in Kaggle competitions and other avenues for paid collaboration. We sort of beta-tested this during Hacktoberfest, with a contribution-importance-weighted raffle for a very good GPU laptop. We've had two dozen PRs, from small bug fixes to fancy things like audio encoders and two mixers using quantum computing APIs.
I can’t say I’ve got a perfect solution for driving contributions, but if we can get to a critical mass, we can shift the onus from having a very large ML team to sponsoring a wide array of researchers to contribute code in the areas they know best.
This also places Lightwood in the hands of the community, which avoids the issue of being open source in name only, with employees being the only contributors. I've seen this pattern with many projects, and I can't say I fault them; getting enough attention and managing PRs is hard.
Ideally, we could reach a level similar to HuggingFace, but applied to any possible supervised problem, not just focused on text.
Finally, I think the benchmark suite, ideally together with Lightwood, can turn into a de facto tool for judging the quality of academic work much more easily.
Going back to a previous example, I see a lot of new optimizer papers, but most of them suffer from the issue of proof. It's impossible to really prove an optimizer is better than the alternatives; you can build a just-so theoretical story, but that's just selecting for optimizers written by capable math storytellers.
Papers address this by benchmarking, but their choice of architectures to try it on is often questionable and the amount and quality of datasets they use is subpar. You end up with things like:
"We've invented this amazing state-of-the-art optimizer. In order to test our theory, we used VGG16 on a subset of CIFAR-10 and MNIST and compared our optimizer with an untuned version of AdamW using no scheduler."
I might be hyperbolic here, but I'm not veering too far off from the truth. I'm not a "real" researcher; I skim 4 to 12 ML papers on a good week, and I need to know about a broad set of things. So most of the stuff I end up reading is from giant, well-funded groups like Facebook, Google Brain, DeepMind, OpenAI, Uber AI, Hugging Face, and the research groups of people with tenure at top universities… and it's in their papers that I see this kind of stuff. I don't have a significant sample, but I have to imagine that this gets much worse once you go from "papers where 1/3rd of the authors held a NeurIPS keynote" to "papers from a 2nd-tier European or Chinese university".
I'm not trying to assign any blame here, because there's none to assign; the incentive structures for being rigorous are not there. And even if a few people wanted to be rigorous, it's close to impossible, and it would barely increase their chances of being published.
Does something like this benchmark suite solve this issue? Not entirely, not in most cases, but in some cases, I think it does.
I can see the combination of Lightwood and our benchmarks setting a new standard for comparing optimizers, schedulers, NAS methods, hyperparameter search, dynamic hyperparameter setting, novel architectures meant for a wide class of problems, and data cleaning procedures. That's what I can think of, but I'm sure that, given sufficient adoption and extension, this could be used for much more.
Before anyone links the relevant XKCD: I know this is somewhat of a pipe dream, and that's why I list it last; it's the most unrealistic objective I've set out, one I'm not credentialed enough to speak about, and one that, I'm sure, many people have failed at implementing.
But alas, someone has to try.
Published on: 2022-12-01