Modeling Libraries Don’t Matter


When building the first machine learning pipelines for my company, I agonized over which modeling libraries to include in our stack. What would most model developers want to use? I felt strongly about scikit-learn and PyTorch, but what would be the consequences of baking my opinions about ML frameworks into our company’s infrastructure? Which modeling library would “win” in the long term? What if I wrote modeling code in a DSL that would become obsolete in a few years?

In 2016, I took an introductory deep learning class with assignments all in TensorFlow; my most recent deep learning course was conducted entirely in PyTorch. Four years later, it seems like all the ML researchers I know use PyTorch. The few who don’t use PyTorch use TF 1.0, with the “someday I’ll switch to TF 2.0 or PyTorch” mantra. What happened?

Over several months, I realized that the choice of library does not matter, because modeling is just a tiny step in the machine learning pipeline. Other steps are equally, if not more, challenging to maintain: for example, it is much harder to migrate data pipeline code than to rewrite basic TF modeling code in PyTorch. I’ve written models in TensorFlow, PyTorch, XGBoost, scikit-learn, and LightGBM for different tasks at my company. I’ve even written non-Python models in Scala. When I iterate on machine learning pipelines for a prediction task, I avoid changing the model architectures as much as possible, since I’d rather change parts of the pipeline I understand better, like data ingestion and featurization. My company’s pull requests show that people hardly touch their modeling code compared to pipeline code. What matters is having the infrastructure to “plug and play” ML model trainers and predictors, since there is almost never one programming library that meets all needs.
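To make the “plug and play” point concrete, here is a minimal sketch of the kind of thin interface I mean. The Model protocol and make_model factory below are my own illustration, not any real framework’s API: the rest of the pipeline only ever calls fit and predict, so swapping scikit-learn for XGBoost becomes a configuration change rather than a rewrite.

```python
# A minimal sketch of a library-agnostic "plug and play" model interface.
# The Model protocol and make_model factory are illustrative, not an existing
# framework's API.
from typing import Protocol

import numpy as np


class Model(Protocol):
    def fit(self, X, y): ...
    def predict(self, X): ...


def make_model(name: str) -> Model:
    """The rest of the pipeline only ever calls fit/predict on whatever this returns."""
    if name == "sklearn_logreg":
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(max_iter=1000)
    if name == "xgboost":
        from xgboost import XGBClassifier
        return XGBClassifier(n_estimators=200)
    raise ValueError(f"unknown model: {name}")


# Stand-ins for what the (unchanged) data pipeline would produce.
X_train, y_train = np.random.rand(200, 8), np.random.randint(0, 2, 200)
X_test = np.random.rand(20, 8)

# Swapping libraries is now a config change, not a pipeline rewrite.
model = make_model("sklearn_logreg")
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Both scikit-learn and XGBoost already follow this fit/predict duck typing, which is exactly why keeping the pipeline pinned to a narrow interface, rather than to a library, is the part worth getting right.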

Some would point to researchers overwhelmingly preferring PyTorch and JAX and argue that a winner here actually does matter, because researchers turn into data scientists at companies, these data scientists build models, and the models get productionized and used “forever.” But as a field, we’re still struggling with productionizing models, aligning their outputs with human incentives, iterating on these systems, and trusting these pipelines. For ML practitioners outside “big tech,” the biggest problems are model pipelines and value alignment between customers, themselves, and machines. After all, these are essential to product development. Even if we built frameworks for these central problems, the modeling library still wouldn’t matter, because multiple software frameworks for a problem can happily coexist. People are smart and can easily learn a different framework; the large, rapid transition from TensorFlow to PyTorch within a few years proves that developers will find the best tool for the job. It matters more that they have the correct foundation for the software they build.

Additionally, business considerations can override the choice of framework or even spawn new frameworks, particularly in startups. My company’s codebase for a particular ML problem has experienced something in this vein: first, I wrote experimental code in my DSLs of choice to “solve” the problem. Then, when we had to build a product, I rewrote the pipeline. Then, when we pivoted slightly, I refactored the pipeline to produce the live ML product we’re regularly releasing today. As I gained more clarity on the current version of the product and on how other stakeholders (technical or nontechnical) might interact with it, I realized that these business considerations drove pipeline development far more than the modeling libraries or my opinions on DSLs did.

So the horse race of modeling libraries is misleading, and the most challenging problems in “real-world” ML right now revolve around business values, productionization, miniaturization, and pipelining for repeated training and inference. But since programming languages and infrastructure drive innovation in software, it’s still worth thinking about the evolution of modeling libraries. In The Mythical Man-Month: Essays on Software Engineering, Fred Brooks describes the second-system effect as “the tendency of small, elegant, and successful systems to be succeeded by over-engineered, bloated systems, due to inflated expectations and overconfidence.” Famous examples include the IBM System/360 operating system (which succeeded the IBM 700/7000 series of the 1950s) and the Multics operating system (which succeeded the Compatible Time-Sharing System in the late 1960s).

I consider TF 1.0 a success: it accelerated a lot of deep learning research, was fairly narrow and thoughtful in scope, and spearheaded innovation in the hardware vertical with XLA compilation, TPUs, and more. But over time, as hundreds of TensorFlow engineers tried to address the software’s limitations and turn TensorFlow into a machine learning library for everybody, it suffered from the second-system effect and became TF 2.0, a machine learning library for nobody (except possibly Google). One set of problems they tried to address was “making models work in production settings”; TFX is a great example of an overhyped and underused TF 2.0 tool. Compare its PyPI download stats with Kubeflow’s for context: TFX was built to fit nicely with Kubeflow, and Kubeflow users still don’t want to use TFX. This is not to say the problems with production ML aren’t real; rather, TFX currently doesn’t seem to be the solution to many of these incredibly challenging problems. Having tutorials doesn’t help if the UX is counterintuitive and engineers need to become professional error-log parsers to get proficient with the tool.

For all my criticism of the TensorFlow UX, I actually use TF 2.0 at work, mainly out of laziness. The Spark to TFRecord to tf.data to TF model pipeline is partially documented, whereas the Spark to TFRecord to something to PyTorch model pipeline is only barely documented. But the DSL for my models is hardly something I think about on a day-to-day basis, since most of my problems aren’t “how quickly can I code up a transformer or convnet architecture from scratch.” People in this industry rarely build things from scratch. Software products are built on the principle of incrementality; code and features accumulate over time.
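For what that partially documented path looks like in practice, here is a rough sketch of the TFRecord-to-tf.data-to-model step, assuming the TFRecords have already been written out by an upstream Spark job; the feature names, shapes, and file paths are made-up placeholders.

```python
# A rough sketch of the TFRecord -> tf.data -> TF model step. Feature names,
# shapes, and the "data/part-*.tfrecord" path are placeholders; the TFRecords
# are assumed to have been written out by an upstream Spark job.
import tensorflow as tf

feature_spec = {
    "features": tf.io.FixedLenFeature([32], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse(serialized_example):
    parsed = tf.io.parse_single_example(serialized_example, feature_spec)
    return parsed["features"], parsed["label"]


files = tf.data.Dataset.list_files("data/part-*.tfrecord")
dataset = (
    files.interleave(tf.data.TFRecordDataset, cycle_length=4)
    .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=5)
```

Notice how little of this is “modeling”: the model is two Dense layers, while everything above it is plumbing that would survive a swap to a different architecture or, with a different reader, a different framework.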

So my answer to my original question is that it’s not worth worrying about which modeling library will “win” in the long run, because multiple libraries can win if they each do something important, such as championing the dataflow paradigm or easy autodifferentiation. If you’re an engineer, don’t build your pipelines around a specific modeling library. If you’re a researcher or data scientist, don’t worry about learning all the modeling libraries or whatever libraries a company’s job description mentions. Software history indicates that modeling-framework bloat is inevitable, and as long as these libraries’ biggest priority is to compete with each other, they will all converge on the same solutions to mission-critical modeling problems: eager execution, ease of building a model from scratch, the ability to view loss curves nicely, and more. But these are only a fraction of most “real-world” machine learning problems, and at the end of the day, you, as a machine learning practitioner, aren’t hired only for your expertise in training a model once; you’re hired to make a machine learning system consistently deliver value to an end user.

Thanks to Reese Pathak, Jay Bhambhani, and Debnil Sur for their feedback on multiple drafts.