Wednesday, August 10, 2022

Do You Need a Prediction or a Prescription?

Machine Learning and Predictive Analytics immediately capture the imagination as people try to figure out how to use data to grow their business. “What would we do differently if we knew what was coming?” is a common question in the world of data analytics. However, there are a number of challenges with this problem framing that often reduce the success of data projects. In this article, you will learn about some of those challenges and a better approach to leveraging data.

First, two key definitions:

  • Predictive Analytics is focused on making a prediction. This is often a prediction of what will happen in the future (a forecast), but it can also be a prediction of an unknown value based on other values (e.g., is this credit card transaction fraudulent or not?). A lot of “Machine Learning” and “Artificial Intelligence” falls into this category.
  • Prescriptive Analytics is focused on prescribing an action. Typically, this involves creating a model of your system and of the effects possible actions would have. These actions often include preparing for the future (staffing and inventory choices), but they can also involve trading off side effects under uncertainty (e.g., do we deny or approve this credit card transaction?). The fields of “Optimization” and “Operations Research” tend to study this domain.
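
To make the contrast concrete, here is a minimal sketch in Python; all of the numbers (the model output, the fraud loss, the denial cost) are invented for illustration:

```python
# Predictive: some hypothetical model has estimated the chance of fraud.
p_fraud = 0.04  # invented model output for one transaction

# Prescriptive: pick the action with the lower expected cost.
FRAUD_LOSS = 500   # invented cost of approving a fraudulent charge
DENIAL_COST = 15   # invented cost of denying a legitimate customer

expected_cost = {
    "approve": p_fraud * FRAUD_LOSS,      # 0.04 * 500 = 20.0
    "deny": (1 - p_fraud) * DENIAL_COST,  # 0.96 * 15  = 14.4
}
action = min(expected_cost, key=expected_cost.get)
print(action)  # "deny" -- the same prediction, now driving an action
```

The prediction alone is just a number; the prescription is the deny/approve choice that number feeds.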

In general, there has been a lot of focus on Predictive Analytics. As an example, “Analytics Maturity Frameworks” typically suggest that companies that have nailed Descriptive Analytics can grow into Predictive Analytics, and that they should only layer in Prescriptive Analytics once they get good at predictions. However, what you will notice about Predictive Analytics problems is that they only impact the business once they have been translated into an action. A perfect prediction does nothing for your organization until it has actually changed something.

This is not to suggest that predictions are useless. But after reading this article, you will understand the role of prescriptions and how the two can be used together to drive value for your organization.

Going From Data to Action

The key reason Analytics Maturity frameworks place Prescription after Prediction is perfectly valid. If there is a problem with the prediction, there is an opportunity for some process to compensate since a prediction by itself does not change anything. Data generally has many inaccuracies and gaps unless your organization has already spent substantial effort solving those issues. Furthermore, if you are trying to make a decision based on incomplete information, it seems to make sense that the first priority should be to improve that information. Maybe you should start with predicting anything that is unknown yet key to making your decision?

By contrast, a broken prescription seems useless. If we ask our analytics what to do based on inaccurate or unrelated data, what benefit can it provide? Unfortunately, this framing glosses over the fact that predictions guide actions. As mentioned above, a prediction that never contributes to an action cannot improve your business. Fraud detection is called a Predictive Analytics problem, and yet it is only useful if you then deny the transactions that are predicted to be fraudulent.

You might think the above is fine because these predictive algorithms have been found to be accurate enough to link them to actions. In business, when predictions are not accurate enough, they are often presented instead as inputs to a human decision-making process. A prediction that is properly trained, tested, and validated will have a quantifiable “accuracy” which engenders trust in the output. It also helps to define progress for future development as data scientists try different features, algorithms, or tuning parameters.

But what if we defined accuracy in terms of the actions, not the predictions? Depending on the link from prediction to action, this translation may be simple or very challenging. In the simple case, the benefits of measuring accuracy in terms of actions are clear, since we can measure in business terms: dollars of uncaught fraud, number of transactions incorrectly denied. When the link from prediction to action is muddy, it is a lot less clear how to fairly assess the prediction. However, actions are often more forgiving than predictions. Ultimately, the measure of success for an analytics tool should be “how did this improve my business?” Anything less is selling yourself short.
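
As a minimal sketch of what measuring in business terms might look like for the fraud example, assuming a small hypothetical labeled test set of transactions:

```python
# Hypothetical test set: (was_fraud, was_denied, transaction_amount).
results = [
    (True,  False, 800.0),   # fraud we let through
    (True,  True,  120.0),   # fraud we caught
    (False, True,   60.0),   # good customer we denied
    (False, False,  45.0),   # good customer we approved
]

uncaught_fraud_dollars = sum(amt for fraud, denied, amt in results
                             if fraud and not denied)
wrongly_denied = sum(1 for fraud, denied, _ in results
                     if denied and not fraud)

print(f"uncaught fraud: ${uncaught_fraud_dollars:,.2f}")     # $800.00
print(f"transactions incorrectly denied: {wrongly_denied}")  # 1
```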

To illustrate this point, consider the problem of predicting how much a house is worth. Getting an accurate prediction is fraught with data challenges, including complex markets, changing trends, and text-based features. However, a natural use of a house-value prediction is to take an action: making an offer on a house. In the context of that action, a prediction is good if it improves profits. While improved prediction accuracy can help, the context of how the prediction will be used is critical to the value of the effort. Done poorly, prediction accuracy may be phenomenal on houses that are entirely irrelevant to your business and terrible on the ones you care about.
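
Here is a minimal sketch of scoring the prediction through the action; the offer rule, acceptance rule, and prices are all invented for illustration:

```python
# Hypothetical houses: (predicted_value, true_resale_value).
houses = [(310_000, 300_000), (195_000, 210_000), (520_000, 480_000)]

MARGIN = 0.90     # invented rule: offer 90% of our predicted value
ACCEPT_AT = 0.95  # invented rule: sellers accept offers near true value

profit = 0.0
for predicted, true in houses:
    offer = MARGIN * predicted
    if offer >= ACCEPT_AT * true:  # crude stand-in for "offer accepted"
        profit += true - offer     # buy at the offer, resell at true value

print(f"profit from acting on these predictions: ${profit:,.2f}")  # $12,000.00
```

Notice that only one offer is even accepted, and the margin in the offer rule absorbs a sizable prediction error; the same model errors can be harmless or fatal depending on the action they feed.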

Putting it into Practice

The first question you should ask yourself before starting a data analytics project is “What data do we have that is relevant to our goals?” At this point in the project, you must decide what data team to bring together to solve your business problems. In a world where you have an enterprise data warehouse team that can quickly respond to new data needs from the business, it makes sense to focus on building algorithms. By contrast, an organization that has not yet worked out the kinks of defining things like “how many active customers do we have now and at any point in the past?” should ensure that experienced data engineers and architects form the backbone of the team. A helpful heuristic is to think about how you would train a human to do the work you are hoping an algorithm will do for you. If that training process would be unpleasant, you probably have some work to do before you can try to train an algorithm.

Once the correct team is assembled for your analytics maturity, it makes sense to clearly define the “why” of your project. Everyone on the team should be able to articulate what would change for your business if the project were to be successful. This is also your key opportunity to ensure that the definition of success includes the action, not just the inputs to the action. A helpful thought experiment to distinguish between prediction and prescription is to imagine you have an Oracle and can make perfect predictions. What actions would you take, and how would they be different than in an uncertain world?

With a clear set of goals for the project comes the fun of the analysis phase. “Proof of Concept” and “Minimum Viable Product” are key concepts that can help your team prioritize competing elements of solving the problem. This is also the phase of the project to test competing ideas for how you might achieve your business objectives. Depending on the reality of your data and how you are trying to use it, this phase can be very quick or much slower. The quicker you can prove that the data you have can solve your business needs, the sooner you can quantify the potential ROI of your project.

Finally, even though this article is focused on the benefits of prescriptive analytics, in reality people are often reluctant to hand control over to an algorithm. Moreover, that reluctance is frequently justified due to gaps in the data or qualitative business rules. Given that awareness, choosing the right way to add your analytics solution to the business process is critical to realizing the return on investment. Non-traditional prescriptive solutions like a “what if” calculator can be extremely effective both for communicating with your users and for integrating smoothly into business processes.
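
As a minimal sketch of the “what if” calculator idea, with an invented staffing model standing in for whatever your real system looks like:

```python
def what_if(staff: int, expected_calls: int) -> dict:
    """Hypothetical model: project service level and cost for a staffing choice."""
    CALLS_PER_AGENT = 40  # invented capacity assumption
    COST_PER_AGENT = 300  # invented daily cost per agent
    service = min(1.0, staff * CALLS_PER_AGENT / expected_calls)
    return {"service_level": round(service, 2), "cost": staff * COST_PER_AGENT}

# Users explore scenarios instead of being handed a single answer.
print(what_if(staff=10, expected_calls=500))  # {'service_level': 0.8, 'cost': 3000}
print(what_if(staff=13, expected_calls=500))  # {'service_level': 1.0, 'cost': 3900}
```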

Conclusion

Data is often limited in how completely it describes the things you most care about for your business. Predictions help to fill the gap from your data to the things that drive decisions. However, there remains a gap from prediction to action which can be filled in part with Prescriptive Analytics. By framing your business problem all the way from data to action, you can more effectively drive business impact.

Friday, July 29, 2022

"Data Literacy" and Why it Matters

Most months I attend the INFORMS Practice Section Happy Hour. Today, the topic of discussion was “Data Literacy,” a term I had not previously paid any attention to. Over the course of the discussion and in reflecting afterwards, I realized that the need to become data literate is a key part of what makes it so hard for junior data folks to break into the field. It is also a very learnable skill with practice.
 
The way I had previously described this challenge is that everyone who works with data eventually learns that dealing with time zones is hard. A couple of years ago I watched this 10-minute video, which I *thought* had taught me everything there was to know about the difficulty of dealing with time zones. However, this recent article on tzdb covered even more complexities related to the politics of tracking time zones, and the challenge of who gets to make those decisions.

But even without going that deep, simply knowing what time zone your data is in can be tricky, and is frequently important. My first real data project was trying to develop a trading algorithm. At one point I realized I had accidentally been handicapping my algorithm by several hours because I had not properly handled time zones. That is an example of not having data literacy: it didn’t even occur to me to check until I had lost substantial time trying to figure out why my analysis was not making sense. Compare that to a recent experience when I was working on a database server and was told to filter for an end_date of “12-31-9999 23:59.” I was able to quickly shift to “12-31-9999 17:59” when my first query gave me nothing.
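
As a minimal sketch of the kind of check I wish I had done back then, using Python’s standard-library zoneinfo (the timestamp and zones here are arbitrary):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# A "naive" timestamp straight out of a data file -- no time zone attached.
naive = datetime(2022, 7, 29, 16, 0)

# The same wall-clock time means different instants in different zones.
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))
as_eastern = naive.replace(tzinfo=ZoneInfo("America/New_York"))

# Converting both to one reference zone makes the gap obvious.
print(as_utc.astimezone(ZoneInfo("America/Denver")))      # 10:00 in Denver
print(as_eastern.astimezone(ZoneInfo("America/Denver")))  # 14:00 in Denver
```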

With that new lens, I have a slightly different definition of data literacy than, say, Gartner. I would define data literacy as “the ability to know what questions you need to answer about a set of data in order to understand it.” I like this framing because whether you know the domain of a particular data set or not, someone who is data literate can make progress on understanding what the data means. Our discussion spent a lot of time on a debate around the “context” you may need to be data literate. I think it makes sense to have a general concept of “Data Literacy” that is not specific to a domain, i.e., context. A key part of my job is knowing when to guess that something means what I think it does, and when to ask a “Subject Matter Expert (SME)” to walk me through the nuance.

To close, I’ll add some common steps, from my own practice and from the happy hour discussion, for understanding a new-to-you data set:
  • Look at the column headers of the spreadsheet, or read the axes and legend of the graph.
  • Familiarize yourself with the data itself. Check what the set of distinct values is. Look at the most common values in each column. Make some graphs. Test some guesses about links in the data.
  • Imagine the process behind the data and see if it explains anything about what you have observed so far. Oftentimes there is a human involved, and that can substantially change the interpretation of what you find. For applications, you can often answer a lot of questions just by getting an SME to walk you through how the tool works.
  • In data science bootcamps you are always taught to look at the first few and last few rows of a data set. I have found grabbing a random sample (.sample instead of .head or .tail in Python) to be extremely helpful for getting an understanding of what you are dealing with. This is particularly useful for understanding which data may be missing a lot of the time (see the sketch after this list).
  • Consider whether anything in the prior steps leads you to believe your understanding of the data might not be quite right.
  • Think about whether the data itself might not be quite right.
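Here is a minimal pandas sketch of a few of these steps; the file name and column name are hypothetical:

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical data set

print(df.columns.tolist())          # what fields do we even have?
print(df["status"].value_counts())  # most common values in a column
print(df.sample(10))                # a random sample, not just head/tail
print(df.isna().mean())             # share of missing values per column
```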
After spending some time using these principles, I either have a good understanding of the data, or a good knowledge of what I don’t yet understand. Feel free to add any other tips in the comments!

Tuesday, April 5, 2022

Why I Call Myself a Data Scientist

This week, I am at the INFORMS Business Analytics conference, one of the two conferences I attend regularly as an Operations Research PhD and enthusiast. In fact, INFORMS conferences are the only ones I have attended at all since graduation. In my work, I identify myself as a data scientist. What is interesting about these two facts is that INFORMS is not even on the map when it comes to data science.


What I find most valuable about my background in Operations Research is that by the end of your PhD for sure, and likely after a Master's, you have internalized one key lesson: the problem is always up for discussion. Unfortunately, you don't receive that lesson explicitly. Instead, what you get is a series of courses focused on "reformulating" problems. As an example, you learn that linear problems are easiest mathematically, and so you use your training to rewrite problems as linear subproblems. After spending 2-6 years rewriting problems, it becomes crystal clear that the first way you think of writing down a problem is unlikely to be the best.
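
As a standard textbook example of that kind of rewriting (not a claim about any particular course): an absolute-value objective is nonlinear, but an auxiliary variable gives an exactly equivalent linear program:

$$
\min_{x} \; |x - c|
\quad\Longleftrightarrow\quad
\min_{x,\,t} \; t
\quad \text{s.t.} \quad x - c \le t, \quad c - x \le t
$$

Both problems share the same optimal value, but the right-hand form is something an off-the-shelf linear solver can handle directly.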


This mindset puts you in a place to succeed as a data scientist (and also as a consultant). Traditionally, the best data scientists are able to take a business problem and understand how to leverage ML and other analytics tools to solve that problem. Data science training programs focus primarily on teaching you how to use ML algorithms and code. This puts Operations Researchers in an odd position: they have so many of the hard-to-find skills on the business side that make exceptional data scientists, but often lack the ML skills that are considered "table stakes" for these roles.


As a result of all this, I call myself a data scientist. It is expedient, and people generally hand me the right kinds of problems when I market myself that way. At some point they catch on, or I warn them that not all data scientists approach problems the same way I do. Depending on the setting, I go so far as to explain Operations Research and why they should consider hiring OR professionals for their data science needs. However, what I really wish is that the mindset of OR would become the common framework for all data scientists. By articulating the notion that the problem is always up for discussion, you start to realize how much of your value comes not just from solving the problem you were asked to solve, but from getting to the why and how along the way.

Saturday, December 26, 2020

My bathroom non-remodel

In graduate school I took a class called “Systems Engineering,” somewhat on a whim, since it was being taught by the former Secretary of the Navy and I like systems. In preparing to now teach the same course in the spring quarter for DU, I have been reflecting on the guiding principles of the discipline and how I can convey those to my students.

I feel “mindset” is the most distinctive feature of many of my favorite disciplines (and what makes me a good consultant). What sets Operations Research apart is the perspective that the problem is up for discussion as well, not just the solution. Similarly, lean engineering is a way of understanding and improving production systems. I see systems engineering as also primarily a perspective and set of tools: one focused on how people can design large systems that succeed.

Today’s illustration however is not a large system. Namely, it is a home improvement project I was contemplating but hesitant to move forward with. When we first bought our house 3 years ago, I was suspicious of the shower bench in the master bathroom. It seemed to have serious mold issues, and as a former Michigander I am incredibly suspicious of mold. A month later, I had patched in new tile around the bench and felt confident for the short term but knew within a few years I would want to do the whole shower properly.

After finishing my last project this summer, I have been debating how urgent the master bathroom project is. It seems like the next major project I should tackle, but should I tackle it now? With three kids home and work to do, it seemed like an obvious “no.” But still, the moldy grout and caulk shouted for something to be done.

And then I realized I was making a classic mistake I learned about in systems engineering. I considered plenty of “revolutionary” alternatives (should I redo the whole bathroom, or just the shower?), but had forgotten to include an “evolutionary” alternative (leave things mostly the same, but re-caulk). One of the important lessons in Systems Engineering is to contemplate your alternatives carefully. If you are not mindful of the alternatives, you often end up with a sub-optimal solution. Like a large remodel in the middle of a pandemic.

I am happy to report that my shower is once-again mold free. And with just a couple hours invested!

Thursday, November 7, 2019

Computers like to cheat

One year ago I got to hear Janelle Shane speak about her blog "AI Weirdness." Her illustrations helped me start to understand sort of the... logic? of more sophisticated machine learning algorithms like neural nets.

This week, her book on the same topic was published and I got to hear her speak again. First off, I highly recommend both the blog and the book. Her justification for writing "You Look Like a Thing and I Love You" is that while we have many examples in science fiction of super-smart AIs like C-3PO and Ultron, we don't actually have examples of AI as it exists today: machines with brains maybe as powerful as a worm's, yet trusted to screen job applicants and drive cars.

The part that I continue to find most interesting in her presentations is the subject of this post. She explains that while sometimes you don't have enough data to train an algorithm, much more often the problem is that you asked the computer to solve the wrong problem. You wanted a computer to caption images, but neglected to mention that "I'm not sure" is an acceptable answer. Or you asked a computer to make unbiased hiring decisions, but fed it discriminatory examples. These may sound like isolated examples, but Janelle's presentation helps you understand that computers will always take advantage of the smallest oversights in your problem statement to win at the wrong task.
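
Here is a deliberately degenerate sketch of that "winning at the wrong task" behavior, with invented numbers: if only 1% of cases matter and you grade on accuracy alone, a model that never says yes looks excellent.

```python
# Hypothetical, deliberately degenerate "model": predict the majority class.
labels = [0] * 990 + [1] * 10   # only 1% of cases are the thing we care about

predictions = [0] * len(labels)  # the "cheating" model never says yes

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy: {accuracy:.0%}")  # 99% -- looks great
print(f"recall:   {recall:.0%}")    # 0% -- useless at the actual task
```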

This intuitive idea that the computer will be trying to "cheat" any way it can is helpful as we navigate hype around AI and decide when we should actually trust it. Can this image recognition software tell the difference between dogs and wolves? Or has it just learned that wolves are often photographed in snow? Should we trust an algorithm because it has a lot of training data? Or might that data have important holes on the issues we care about?

None of this is to say ML and AI don't have a place in the world today. But it does help us as individuals in the modern era understand how our lives may be changing for both good and bad as more decisions are handed over to computers.

I'm also going to throw out her TED talk from last week, if you're not quite ready to commit to a whole book.

Friday, September 7, 2018

Not that multiverse

I recently started reading the book "Fooled by Randomness" by Nassim Taleb. So far it is not a book I would recommend to most people (the person who suggested I read it said he usually recommends that people start with his most recent book, Antifragile). The author covers very interesting content, but not in a way that is easy to follow or digest. This is the first of probably (hopefully?) a series of posts trying to translate the subject of Taleb's book into an easier-to-digest format.

While I lived in Ann Arbor during graduate school, there was a turn I had to drive about once a month. The unfortunate thing about this turn was that it was a left turn immediately after taking a left at a light. The two were so close together that I had to make a decision: either move into the middle lane of the road, which was a left-turn lane for traffic coming from the opposite direction, or remain in the line of traffic and wait for any oncoming traffic to clear.

After a few times taking the turn, I wondered which of the two not-great options I should choose going forward. It seemed to me that I could either risk a low likelihood of a head-on collision in the middle lane, or a relatively higher likelihood of being rear-ended while waiting in the traffic lane. I settled on staying in my lane and risking being rear-ended because of how much more destructive head-on collisions are.

A few years after making the decision, I made the left turn and waited for the oncoming traffic to clear as usual. The person who was driving behind me saw the brake lights and stopped. Unfortunately, the person behind them didn't, and bumped the middle car into mine. It was fairly minor damage all around, but it is easy to wonder, given what happened, whether I actually made the right choice.

One of the messages from Taleb is that there is complexity in judging the quality of a decision based on random outcomes. For the person who bought a lottery ticket and won, the purchase looks like a good decision. However, we should still advise each person not to buy lottery tickets, because in most versions of the universe, the individual you are talking to does not win.
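
As a minimal sketch of that idea (the ticket price, jackpot, and odds are all made up), simulating many "versions of the universe" makes the advice obvious even though a few universes contain winners:

```python
import random

TICKET = 2             # hypothetical ticket price
JACKPOT = 1_000_000    # hypothetical payout
P_WIN = 1 / 10_000_000 # hypothetical odds of winning

universes = 1_000_000  # simulate many versions of the universe
outcomes = [JACKPOT - TICKET if random.random() < P_WIN else -TICKET
            for _ in range(universes)]

print(f"won in {sum(o > 0 for o in outcomes)} of {universes} universes")
print(f"average outcome: ${sum(outcomes) / universes:.2f}")  # negative
```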

This notion of "most versions of the universe" is a useful one when talking about randomness since it lets you still give weight to things that didn't happen. And while it can be a good idea to update your estimates of probabilities as you get new information, the fundamentals before an event are the same as they are after. As an example, after being rear-ended I did conclude that maybe I should be a bit more aggressive in taking my turn between oncoming traffic. But the fundamentals of my decision didn't change because of it.

Tuesday, April 17, 2018

Soft vs. Hard constraints

Last week, at a meeting to prepare for an on-site kickoff with a client, I was asked if I had any real-life examples of the "squishy rules" I wanted to discuss with the customer. At first nothing came to mind, but my airline helpfully solved that problem for me on my way to the kickoff.

After my first flight had departed, my second flight was cancelled. I found myself in the customer service line behind several other people also trying to figure out how to satisfy their constraints and priorities in the best way possible (scheduled meetings the next day, no private jets, how far they were willing to drive a rental car). What struck me was how much those constraints and priorities varied among the four people ahead of me in line. Some people were fine with getting in the next night; others (like me) were willing to give up anything except being on time the next day.

Now, you may have noticed above that I combined constraints and priorities into a single list. When I booked my flight, I chose to fly to the actual city I was headed to. Once that flight was cancelled, I had a choice to make. What used to be two hard constraints now gave me zero "feasible solutions" -- I could either miss one day of the one-and-a-half-day kickoff, or I needed to fly to a different city. Now, some very creative people find themselves in this situation and will fly to some other middle city and then on to their destination. But my airline either didn't or couldn't suggest those options, and if you had asked me before the cancellation whether I would consider a three-leg trip, I would have given a flat no. So if I had no possible solutions, what could I do?

Well, this happens a lot. People will often list their preferences as needs until pressed, and as long as there is a feasible solution, the difference never has to become obvious. One of the people ahead of me in line chose not to give up any of their hard constraints, which meant there were still no options available. It was obvious that something had to give unless the goal itself had changed ("never mind, I didn't need to go to that city after all"). But knowing which of your rules to turn "squishy" is the key to still achieving your goal.
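
Here is a minimal sketch of the difference in code; the rebooking options and penalty weights are invented:

```python
# Hypothetical rebooking options after a cancellation.
options = [
    {"city": "final",    "on_time": False, "legs": 2},
    {"city": "neighbor", "on_time": True,  "legs": 2},
    {"city": "final",    "on_time": True,  "legs": 3},
]

# Everything hard: final city, on time, at most 2 legs -> nothing qualifies.
hard = [o for o in options
        if o["city"] == "final" and o["on_time"] and o["legs"] <= 2]
print(hard)  # [] -- infeasible, just like at the service counter

# Keep "on time" hard; make city and leg count squishy penalties instead.
def penalty(option):
    wrong_city = 3 if option["city"] != "final" else 0  # invented weight
    extra_legs = 4 * max(0, option["legs"] - 2)         # invented weight
    return wrong_city + extra_legs

feasible = [o for o in options if o["on_time"]]
print(min(feasible, key=penalty))  # the neighboring-city option wins
```

The hard-filter version mirrors the dead end at the customer service counter; the penalty version is the reasoning that eventually got me to a neighboring city.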

In my case, I flew to a neighboring city instead. In fact, my boss had flown directly to my alternate city and had planned from the start to drive the remaining distance -- he had never made flying to the final city a constraint. As a result of this experience, I also finally bought some plane tickets for the summer that I had been putting off buying for weeks. I am now flying to the airport two hours away for less than half the price of tickets to the actual city.

Have you ever realized you were overconstraining your problem? Which constraints turned out to be a lot squishier than you realized?