I was recently asked to speak for 2 to 3 minutes at our All Hands meeting, to summarize progress on my main project over the past year. I looked up some metrics and made some quick comparisons. The results are … interesting.
The Comparison
I work on Assortment Optimization. We are developing a tool for internal use.
One year ago, I had just joined a team of 40 working on the data science bits of the backend. Ten of us worked specifically on the optimization piece. It took over 10 hours to run the main optimization script for one relatively small category. Much of the work involved repeated manual re-runs, exports to Excel sheets, and discussions with business experts. We adjusted parameters and tried to hit certain targets. This was necessary before we could present category results to users. Four categories had been marked as “passed” as of 28 January 2021. The script applied a set of heuristics sequentially; it was essentially a greedy algorithm. Our part of the code was tens of thousands of lines of R code spread across a few hundred files. The team that had developed the code was cycling off the project and documenting all that they had done. A bunch of new features were being rolled out in a hurry, and the code was breaking. It was fast becoming obvious that the few new hires who were staying wouldn’t be able to maintain the code. We were in crisis mode. I was wondering what I had signed up for.
Let’s fast forward to the present. We recently finished a run of 87 large categories in 15 minutes, and we are going to try to optimize 500 categories simultaneously next. Business validations are largely automated, and many are what I’d call “automatic”: the formulation’s constraints guarantee them by construction. We are able to make performance guarantees relative to optimality and/or to bound computation times. We are accounting for the same “complexities” of the problem, like demand transference, operational requirements at distribution centers, regulatory and contractual requirements, etc. We do all of this in maybe one thousand lines of Python code spread across a few dozen files. About 13 people total work on this project at any given time, and three of them work specifically on the optimization piece. For most of the year, that was me, a recent MS grad, and an MLE based in India. I actually felt good enough about the state of the project to step away and work on a sister project recently.
Over the course of the year, lots of talented engineers left the project. This includes both the entire team that built the original code and almost everyone else I worked with a year ago.
The five or so people who focused on Project and Engineering Management left. A year ago, Jira tickets were detailed and people moved them across various boards. We had sprint planning, retros, and cross-team demo and planning meetings. All were well attended and felt useful. None of that is true now. We haven’t had a PM on the science part of this project for about 10 months. There have been herculean efforts by one Data Engineer and, more recently, a new-hire Director to organize us and manage communications with the (slightly crazy) front-end team. We still have the demo meetings, I think, although almost no one I know attends anymore. Most of us have abandoned them as a waste of time, yet somehow I think the most senior among us still attend. Hmm. In any event, I don’t think I’ll offend anyone (and I certainly don’t mean to) by saying we are in a much worse position with regard to project management. Our documentation is poor.
And yet… I think we are in a much better place than we were a year ago. I think the top-level stats reflect that.
I hope I’m not offending anyone here. I never tried too hard to figure out what went wrong in the development process previously. I have no interest in laying blame anywhere. The engineers and project managers I interacted with were all excellent.
The Magic
One of the main reasons why we’re in a better place is simply the magic of Operations Research (OR).
People often compare Machine Learning (ML) to magic. (See also Deep Learning, Artificial Intelligence, Natural Language Processing, etc.) You throw data at sklearn and out comes a model capable of predicting flight delays remarkably well. The model itself can reveal previously unknown details about how a system works. I haven’t seen many people compare OR to magic. But it can be. And this project is a good example.
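The “throw data at sklearn” step really is about this short. Here is a minimal sketch, with an invented flight-delay dataset and made-up column names:

```python
# A minimal sketch of the "throw data at sklearn" magic.
# The file and column names are invented for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("flights.csv")  # hypothetical dataset
X = df[["departure_hour", "distance_miles", "day_of_week"]]
y = df["delayed"]  # 1 if the flight was delayed, else 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```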
I gave a brief introduction to OR in a previous post. You build a (simple) mathematical model, a “formulation,” consisting of an objective function and several constraints. You then develop an algorithm or ask a specialized solver to find the values for the decision variables that maximize or minimize the objective function while meeting all of the constraints. State-of-the-art solvers are remarkably efficient. They can also be managed, for example to limit their run times or memory use while still guaranteeing a solution within x% of optimality. They do counter-intuitive things because they focus exclusively on the underlying math of the problem at hand.
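In symbols, the generic shape of such a formulation (a sketch of the pattern, not our actual model) is just:

$$
\begin{aligned}
\max_{x} \quad & c^{\top} x \\
\text{s.t.} \quad & A x \le b, \\
& x_j \in \{0, 1\} \quad \text{for all } j,
\end{aligned}
$$

where the binary variables $x_j$ are the yes/no decisions, $c$ collects their values, and $A x \le b$ encodes the constraints. A solver run with, say, a 1% optimality gap returns an $x$ whose objective is provably within 1% of the best achievable.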
I came up with the assortment optimization formulation we are using now. I swear this post isn’t just a humble (or not so humble) brag. The formulation is quite simple. (More on this later.) My team and I coded it up. I thought the benefits would be that we’d get a relatively good solution to the underlying assortment optimization problem reasonably fast, and that it would all be thanks to the CPLEX solver being slightly smarter than whatever we could come up with. But the benefits ended up being so much more than that.
One benefit that became apparent early on was that our code was much simpler and easier to maintain than the code it was replacing. Everything was in Python. Our initial “pseudocode,” developed over about two weeks, was put straight into production and basically just worked. In the following months we spent countless hours debugging unexpected results and refining the flow of our script, but we never really had to change the core optimization code. The mathematical modeling was done using pyomo, and even this part looked, and still looks, embarrassingly simple.
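To give a feel for “embarrassingly simple,” here is a toy pyomo model in the same spirit. The products, scores, space requirements, and solver options are all invented; this is a sketch, not our production formulation:

```python
# A toy pyomo model in the spirit of an assortment formulation.
# All data here is made up for illustration.
import pyomo.environ as pyo

products = ["A", "B", "C", "D"]
score = {"A": 10.0, "B": 6.0, "C": 8.0, "D": 4.0}  # hypothetical demand scores
space = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 1.0}   # hypothetical shelf space used
capacity = 5.0                                      # hypothetical shelf capacity

m = pyo.ConcreteModel()
m.x = pyo.Var(products, domain=pyo.Binary)  # 1 if a product makes the assortment

# Objective: maximize the total score of the selected assortment.
m.obj = pyo.Objective(expr=sum(score[p] * m.x[p] for p in products),
                      sense=pyo.maximize)

# Constraint: the selection must fit in the available space.
m.fits = pyo.Constraint(expr=sum(space[p] * m.x[p] for p in products) <= capacity)

solver = pyo.SolverFactory("cplex")
# Option names vary by solver and interface; this one is illustrative.
solver.options["timelimit"] = 60  # cap the run time at 60 seconds
solver.solve(m)

print("Assortment:", [p for p in products if pyo.value(m.x[p]) > 0.5])
```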
We worked with the business to define what a good solution looks like, and then added the minimum necessary constraints to our formulation. Optimization forces you to put everything into a precise mathematical framework, which is great when requirements are otherwise missing. This is, for me, a non-obvious benefit of OR.
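For example, a fuzzy requirement like “make sure each brand stays represented” has to be pinned down before the solver can use it. With invented numbers, it might become

$$\sum_{i \in B} x_i \ge 2 \quad \text{for each brand } B,$$

that is, at least two products from every brand make the assortment. Writing the rule down this way tends to flush out exactly what the business actually wants.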
A few data engineers on my team guided staff on other teams to build data pipelines to feed our optimization code. A good amount of the speed-up we were able to achieve on the optimization part of this project stems from work being offloaded to the data pipelines. The data pipelines can be run offline, something the data engineers recognized well before I did. They did an amazing job. Here again, OR helped in a non-obvious way by convincing us to define and use relatively small, precisely defined, structured data sets.
More recently, a Distinguished Data Scientist revamped the code using pyspark. This resulted in another large speed-up. It helped that in OR everything is, essentially, matrix manipulation. I could write a whole separate article on the magic of pyspark, but I’m still learning and probably not the person to write that article.
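I won’t try to do pyspark justice here, but the flavor of the change is replacing row-by-row Python work with whole-column, distributed operations. A toy sketch with an invented schema:

```python
# A toy pyspark sketch: one distributed aggregation replaces a Python
# loop over categories. The schema and numbers are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("assortment-prep").getOrCreate()

sales = spark.createDataFrame(
    [("cat1", "A", 120.0), ("cat1", "B", 80.0), ("cat2", "C", 200.0)],
    ["category", "product", "revenue"],
)

# Compute per-category totals in one column-wise operation.
totals = sales.groupBy("category").agg(F.sum("revenue").alias("category_revenue"))
totals.show()
```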
So things are much better on the Assortment Optimization project now and, for me, a lot of it boils down to the magic of OR.
Side Notes
I have a few other thoughts about why things are better now.
Large software projects are… unwieldy. The costs and expectations are often too high. Some developers get annoyed at being unable to shape the direction of the project; others take no ownership. This is particularly true when people are maintaining code that they didn’t develop. Solid project management helps, but excess meetings waste developer time. Hold too many meetings, at different times on different days, and the deep thought needed to tackle the most important issues becomes impossible. The amount of time we were wasting on Assortment Optimization meetings one year ago honestly should have been red flag number one.
It is so much easier to maintain and add features to a script that takes 10 minutes rather than 10 hours to run. That is obvious, right? And yet I think people vastly underestimate the costs of additional runtime. Extra runtime wastes developer time and saps morale. If your script takes over, say, 40 minutes to run, then you’re in trouble. In this regard, OR is maybe not magic enough. A friend recently told me that she switched from working on OR to ML because “OR doesn’t scale.” I know exactly what she means. Want to add an extra variable to your problem? Your runtimes may explode. And look out if you want to do something non-linear.
I personally think OR formulations should be simple. That helps with the scaling. The job of OR is to solve the simple but challenging combinatorial math puzzle at the heart of your business problem. Think some type of knapsack or vehicle routing problem. Your problem really doesn’t have to be more complicated than this. Trust me. If it’s good enough for Amazon and Walmart, it’s good enough for you. The OR practitioner fresh out of graduate school, or with a background in ML, will try to fit every business concern into the model. You really don’t need or want 17 objective functions and 43 constraint sets. You can interpret the results of your optimization in 27 different ways, but don’t pollute the math.
In academia, I would often add a complication to an existing formulation: make it more “realistic” and use complex math to show how it can be done. In industry, I am often simplifying some complex and broken formulation or analysis: make it simpler and more reliable.