Saturday, November 12, 2016

Rocky Mountain Datacon

I spent the previous two days at the first Rocky Mountain Datacon. I haven't yet figured out how to blog during a conference (I have two half-finished posts and a number of ideas), but it was a great experience and I learned a ton. All the talks were filmed, and it was successful enough that the organizers expect to do it again next year.

For the moment, I thought I would share a few thoughts on what I learned at the conference. Feel free to hit me up if you are interested in a discussion about any of them, since that will help my eventual posts be more useful and articulate for everyone.

  • Data has analogies to oil, currency, intellectual property, and inventory.
  • Data is a tax-free asset (though it does cost money to keep it and use it).
  • With the current technology and tools, we have distinct classes of big, medium, and small data. Accurately assessing what you have and will have in the future is important for picking the right technology stack.
  • I picked up a lot of data science 101 including what all the titles should mean, what a technology stack is, how to pronounce the word "munging," what the technology options right now look like, how to "break into the field," and a ton of other things.
  • And for the OR folks reading, very little of any of this is using optimization yet. Several people threw around "5 years" as the timeframe to get there, so it seems to be a pretty good time for us to join the data science world.

Thursday, October 6, 2016

Evaluating health claims

There is a large body of evidence (along with personal anecdotes) that getting people to change beliefs is very difficult. Studies have found that providing people with contradictory evidence can make them even more confident in their views. Given these challenges, how should we go about reducing misinformation in the world?

One in-progress study is looking at teaching primary school students in Uganda how to evaluate health claims as well as the evidence they are based on. While I had previously thought that a basic understanding of statistics was our best option, this kind of education is more clearly and directly related to the goal. It will be interesting to see what the results of the study end up being, particularly if we eventually find that learning in one area spills over to increase scientific literacy in general.

Sunday, September 25, 2016

Indicators and using noise as signal

How should you use an indicator? When you look at the weather report, it seems pretty straightforward. Sunny means you do not need an umbrella. A high of 36 means you should wear a coat. But what about when there is a 40% chance of rain? And what if you are trying to figure out if you can go for a hike this weekend?

When talking to my friend Chris Miller recently, he was trying to predict the weather before going on an aggressive hike. He mentioned that the noise in the forecast was part of his signal to decide how seriously he should take the report. If the forecast kept changing in the few days leading up to the proposed hike, that meant there was a decent chance that the weather would be unfavorable the day-of.

This got me thinking about the standard weather forecast as an indicator of the underlying data. I had a friend who was studying to be a meteorologist, and so she would go straight to the NOAA source data to predict the weather. For everyone else the weather report is basically a black box and all we have to work with are the indicators.

And when there are no indicators that answer your question? Or if it is simply impossible to interpret the underlying data? That is when it is time to get creative with the information you do have.

Friday, September 2, 2016

Voting and an intro to some game theory ideas

My apartment complex decided to show a movie via projector and sent out a poll with 6 options. We were asked to rank the options and told that the movie that won would be shown at the movie night.

If management only got one response, it would be easy to decide how to vote and which movie won. However, assuming there was more than one response, how should the winner be determined? And given that there will be other voters, what should your vote be? In this game, management has decided on the rules for voting (rank the 6 options) as well as the rule for which movie is selected based on those votes (which they did not tell us). The voters are then left to decide how to vote, given their guess of the rules.

The most likely way to score such a set of votes is that management could give a fixed number of points for each rank position (i.e., with 6 options, 5 points for being ranked 1st, 4 for 2nd, down to 0 for 6th) and then take the movie with the highest sum. However, there's no reason they couldn't instead pick the movie ranked 1st most often, using the later places as tie breakers. And in general, there can be truly crazy sets of rules. For example, if one movie is far-and-away the general favorite, we could handicap it by saying any vote for it is only worth half a vote (this seems ludicrous... but in auction theory, counting a "high-valuation" bidder's bid as a fraction of their actual bid is a standard tool for designing an "optimal" auction).
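To make the point-based rule concrete, here is a minimal sketch of that scoring scheme (often called a Borda count) in Python. The ballots and movie titles are entirely made up for illustration:

```python
from collections import defaultdict

# Hypothetical ballots: each voter ranks all options, best first.
ballots = [
    ["Up", "Frozen", "Cars", "Brave", "Moana", "Coco"],
    ["Frozen", "Up", "Moana", "Cars", "Coco", "Brave"],
    ["Frozen", "Cars", "Up", "Brave", "Coco", "Moana"],
]

def borda_scores(ballots):
    """Borda count: with n options, 1st place earns n-1 points, last earns 0."""
    n = len(ballots[0])
    scores = defaultdict(int)
    for ballot in ballots:
        for position, movie in enumerate(ballot):
            scores[movie] += (n - 1) - position
    return dict(scores)

scores = borda_scores(ballots)
winner = max(scores, key=scores.get)  # "Frozen" here, with 5 + 5 + 4 = 14 points
```

Note that a different rule on these same ballots could pick differently: "Up" has as many 1st-place votes as any movie needs under a plurality rule with tie breakers, which is exactly why the unannounced scoring rule matters.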

Depending on the rules, an individual voter has different incentives. Further, depending on their guess as to how everyone else will vote, they will have additional incentives. An important notion in game theory is a "Nash Equilibrium." A NE is a set of votes, one per voter, such that no individual can improve their outcome by unilaterally switching their vote. So if you knew that everyone else was following the NE, there would be no benefit to you from deviating either. But there are a lot of assumptions that go into the NE, including that it is unique, that there will be no collusion (let's both list our shared second-favorite movie as 1st), and that somehow there being a NE actually leads people to vote accordingly (here is the Wikipedia link on when that will happen).
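As a toy illustration of that definition (not management's actual rule, which we were never told), here is a brute-force best-response check on a made-up two-voter game. The strategy names and payoffs are invented; the structure is chosen so that both voters would jointly prefer sincere voting, but each is individually tempted to vote strategically:

```python
from itertools import product

# payoffs[(v1_strategy, v2_strategy)] = (payoff to voter 1, payoff to voter 2)
payoffs = {
    ("sincere", "sincere"):     (3, 3),
    ("sincere", "strategic"):   (1, 4),
    ("strategic", "sincere"):   (4, 1),
    ("strategic", "strategic"): (2, 2),
}
strategies = ["sincere", "strategic"]

def is_nash(profile):
    """True if neither voter gains by unilaterally switching strategies."""
    for player in (0, 1):
        for alternative in strategies:
            deviated = list(profile)
            deviated[player] = alternative
            if payoffs[tuple(deviated)][player] > payoffs[profile][player]:
                return False
    return True

equilibria = [p for p in product(strategies, repeat=2) if is_nash(p)]
```

In this payoff structure the unique NE is (strategic, strategic), even though both voters do better at (sincere, sincere), which is exactly the kind of outcome collusion tries to reach.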

Given all this complexity, you might wonder how anyone ever decides anything. In a world where so much is uncertain, though, many different decisions could be the best one. If I vote my genuine ranking, I'm at least giving my preferred movie its best shot at being selected. I could vote strategically and rank my 3rd favorite first because I think my least favorite movies are the most popular. I could also not vote at all because the effort involved in voting outweighs the difference my vote makes to the outcome. Which of these guesses about the uncertainty is right is impossible to say until after the votes are in, and those votes are in turn influenced by what everyone else is guessing to be the case.

Hopefully, this gives a taste of some of the difficulty both in designing the rules of a game, and the subsequent decisions by the voters. Early on in learning about game theory I was told "the devil is in the details," which I have found to be absolutely true. First-past-the-post voting seems sensible, until you realize the incentive issues when there are more than two choices.

Feel free to send me any follow-up game theory questions you have and I will do my best to get them answered!

Tuesday, August 23, 2016

Classification problems

A friend asked me for ideas of a good analogy for classification problems in machine learning. In a classification problem, we have a collection of objects, and we somehow want to separate them into groups. Ideally, when designing this analogy there are a few things we want to convey:

  • Not all properties of objects are equal when it comes to classification. Some will be highly predictive, while others just help you over-fit your model.
  • Some properties will be highly correlated, so it can be a waste of effort (and of your data's statistical power) to include them all.
  • Your desired classification informs which properties you should use for your classification.
  • What do "properties" even look like, and how do they help us get at a classification?

The example I suggested is a person deciding what to eat at a potluck. I usually have two different classification problems to worry about when I'm filling my plate. First off, I want to decide which things I want to eat. In addition, I'm one of those people who will get a main course plate, eat that, and then go get dessert later. So as I survey the food, I have to decide both which things I think will be delicious, and which things I want to save for dessert.

When trying to figure out what will be delicious, there are a lot of criteria I could use. Since I do not like cucumbers, anything with them is immediately excluded. Other properties besides ingredients could be smell, color, how much of it is available, how much has already been eaten, the temperature of the food, anything! Some of these criteria are more helpful than others. I've had a lot of delicious brown things in my life at potlucks, so color alone won't tell me much. And when I am trying to decide if something is dessert or not, how much people ate is not going to be very informative.
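To make "some criteria are more helpful than others" concrete, here is a small sketch that scores each feature by how well the best single-feature rule predicts deliciousness. All of the dishes, features, and labels below are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy potluck data, entirely made up: dish properties -> was it delicious?
dishes = [
    ({"cucumber": True,  "color": "green", "sweet_smell": False}, False),
    ({"cucumber": True,  "color": "white", "sweet_smell": False}, False),
    ({"cucumber": False, "color": "brown", "sweet_smell": True},  True),
    ({"cucumber": False, "color": "brown", "sweet_smell": True},  True),
    ({"cucumber": False, "color": "green", "sweet_smell": True},  True),
    ({"cucumber": False, "color": "brown", "sweet_smell": False}, False),
]

def best_split_accuracy(feature):
    """Score a feature by the best one-feature rule it supports:
    for each feature value, predict the majority label of that group."""
    groups = defaultdict(list)
    for props, delicious in dishes:
        groups[props[feature]].append(delicious)
    correct = sum(max(Counter(labels).values()) for labels in groups.values())
    return correct / len(dishes)

scores = {f: best_split_accuracy(f) for f in ["cucumber", "color", "sweet_smell"]}
# In this made-up data, smell is perfectly predictive, cucumber is close,
# and color ("brown") barely helps.
```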

How do you classify food at a potluck? 

Thursday, July 14, 2016

Organized Brainstorming

When you learn about brainstorming, there is often a focus on how "spontaneous" it should be. Don't worry if an idea is good or bad, just add it to the list! During undergrad I took several of Tau Beta Pi's "Engineering Futures" classes. One of the topics was a modified approach to brainstorming.

With the mindset they taught, the quality of the ideas is still unimportant, but you go about producing them in an organized way. Say you are trying to come up with the list of things you need to buy from the store. Instead of simply writing things down, you create categories and then fill in each category. For our shopping example, you might list each room in your house as a category and then think about what you need for each room. If you are trying to figure out the best way to solve an engineering problem at work, you might have categories like "new equipment" and "better software."

What are your thoughts on brainstorming? After a quick read through this link on the topic, it looks like I am describing a more task-oriented version of brainstorming. Which makes sense for an engineering-focused training on the subject.

Sunday, May 29, 2016

Using decision trees for sequential decision making

I am a planner by nature. As I get close to leaving my apartment of 6 years to go a third of the way across the country, I've found planning a move to be a complicated project. For most of the past year I have tried to avoid thinking about the move since there was a lot of uncertainty which would be resolved with time. With the move less than a month away there is still a lot of uncertainty, but very little of it will be resolved until weeks, months, or years from now.

Since my research is in stochastic optimization (making decisions under uncertainty), in principle I have a lot of tools at my disposal. But most of the complexity of a move is the sequential decision-making aspect. Initially you have a lot of degrees of freedom. With each decision you eliminate not only the alternatives to that choice, but also the feasibility of any decisions which come later.

In my research I have not come across tools built to handle this aspect of practical decision making. However, with a little thinking outside the box, decision trees can! Decision trees let you visually consider what will happen in a world after a series of decisions have been made and uncertainty has been resolved. For a situation where the order of decisions matters, you can draw a separate decision tree for each of those orderings.

I only explicitly drew decision trees for a couple aspects of my move. But for the early decisions, lots of flexibility down the road was an important factor. There were also a couple cases where I did draw the tree so I could articulate the relationship between specific uncertainty and my decisions. In one case I found that the uncertainty I was worried about did not actually change my choice!
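As a sketch of the mechanics (not the trees I actually drew for the move, and with invented numbers), a decision tree can be evaluated by recursion: chance nodes average over their outcomes, and decision nodes take the branch with the best expected value:

```python
# A minimal decision-tree evaluator. Decision nodes pick the best branch;
# chance nodes take the probability-weighted average of their outcomes.
def evaluate(node):
    kind = node["type"]
    if kind == "outcome":
        return node["value"]
    if kind == "chance":
        return sum(p * evaluate(child) for p, child in node["branches"])
    # decision node: choose the branch with the highest expected value
    return max(evaluate(child) for _, child in node["branches"])

# Hypothetical moving decision: ship belongings early (cheap, but costly if
# the lease falls through) vs. wait and pay for rush shipping later.
# The values are net utilities, purely illustrative.
tree = {
    "type": "decision",
    "branches": [
        ("ship early", {"type": "chance", "branches": [
            (0.8, {"type": "outcome", "value": 100}),   # lease works out
            (0.2, {"type": "outcome", "value": -200}),  # lease falls through
        ]}),
        ("wait", {"type": "outcome", "value": 30}),
    ],
}

best_value = evaluate(tree)  # ship early: 0.8*100 + 0.2*(-200) = 40 > 30
```

Drawing the same decisions in a different order just means building a different tree and comparing the results, which is exactly the "multiple trees for multiple orderings" idea above.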

In somewhat related news, hopefully I will get back to my once-a-week schedule for posting soon.

Friday, May 6, 2016

Link on stats and journalism

I've linked to his blog before, but this post by Andrew Gelman is a great read, whether you happen to be a journalist, someone trying to understand how to communicate statistical results to the public, or just someone who wants to know what is missing from the standard "science says x cures y!" news stories.

Thursday, April 14, 2016

Back to business

After defending (and revising) my thesis late last month, I'm back to blogging. This week I was in Orlando for the INFORMS Analytics conference. During the conference they host the Edelman competition with the goal of recognizing a project with huge tangible contributions made possible by OR and analytics.

I made it to three of the six talks, and enjoyed hearing about what people are doing with data.

  • The NYPD has developed "DAS," which integrates with department phones and provides real-time information about an address and the surrounding area before police even show up at the scene of an event. They also use the data to identify likely locations for crime, in order to target extra enforcement (it seemed somewhat less sinister than Minority Report).
  • Until very recently, UPS still had their drivers plan their routes each day. ORION changed that with major cost savings. One of the most interesting parts to me was that before they could roll out this project, they needed to collect much more accurate map data to avoid the problem of "Google maps tells me to drive a mile out of the way to make a u-turn."
  • 360i is a marketing firm which specializes in paid search advertising. Keyword-based advertising at Google is allocated based on the result of an auction. However, there are literally hundreds of thousands of strings that a particular company might want to bid on. 360i developed a set of tools to improve the effectiveness of search advertising by trying to figure out what people intended to search for (their example: you don't want to waste advertising dollars on someone trying to figure out how to sort out relationship problems).

UPS was chosen as the winner this year, and will be joining an impressive list going back to Pillsbury in 1972.

Tuesday, March 8, 2016

Links for Students

To anyone else who will eventually be doing a PhD defense, here are a few useful links I've come across recently.

10 Ways to Successfully Defend Your PhD: This focuses on the presentation itself. There appear to be other useful articles on the website if you want to look around.

Hints from CS at Columbia: This is the link that helped me understand the point of the defense: getting everyone on the same page in case some of your committee did not read every section, so that when one person asks a specific question, everyone will understand the context of your answer.

Preparation tips for Rochester University: This link talked about the basic logistics. If you are a PhD student and haven't attended a defense yet, make it a priority long before your turn. There are also always snacks.

Prepare your PhD Defense Presentation: This one did a nice job of demonstrating what a research question is.

Good luck!

Edit: Just found a super helpful Defense Outline.

Friday, February 19, 2016

Don't ask "which one is right?"

The above quote came from this blog post discussing the PACE trial, a large-scale randomized controlled study comparing different treatment approaches for Chronic Fatigue Syndrome. I would not necessarily recommend reading the post, since it can be hard to follow, and the comments suggest that the study may actually only have provided evidence that the placebo effect holds...

But separate from all that, people tend to be uncomfortable with uncertainty. The setup of the study was supposedly to demonstrate which of several treatment options was effective. However, different people are likely to have varying responses to the same treatment. For something like CFS, it then makes sense to start your study with a question like "how can we identify the right treatment for each person?" rather than "which treatment is right for everyone?"

These questions come up a lot in the relatively new field of personalized medicine. While doctors have always used patient-specific information to make healthcare decisions, policy makers typically have not. Sometimes that makes sense: Eating more plants and getting more exercise are good for pretty much everyone. But we should not let our desire to have one right answer get in the way of understanding complex systems.

Saturday, February 6, 2016

The Greenfield Bridge in Pittsburgh

A speaker in my department reminded us about the bridge-under-the-bridge in Pittsburgh, which involved building a secondary bridge to catch falling debris from the traffic-carrying bridge above. The speaker posited that perhaps the resources that went into building the second bridge would have been better spent fixing the broken one.

The bridge is finally being replaced (link here). The article actually provides numbers we can use to decide if, in hindsight, the secondary bridge was a good idea.

The replacement is projected to cost $19 million in today's dollars, while the secondary bridge cost $625,000 in 2003 dollars. Based on the CPI, $1 in 2003 is worth about $1.29 today, so the secondary bridge cost roughly $800,000 in 2015 dollars. But precision turns out to be pretty irrelevant here, since $800,000 / $19 million = 4.2%. Assuming building new bridges wasn't wildly cheaper back in 2003, this looks like an obviously good investment: the secondary bridge bought 12 years of delay, not one, for about 4% of the replacement cost.
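The back-of-the-envelope arithmetic, spelled out (the costs and CPI multiplier are the figures cited above, not independently verified):

```python
# Back-of-the-envelope check of the bridge numbers.
cost_2003 = 625_000            # secondary "catch" bridge, in 2003 dollars
cpi_multiplier = 1.29          # $1 in 2003 ~= $1.29 today, per the CPI
replacement_cost = 19_000_000  # projected replacement bridge, today's dollars

cost_today = cost_2003 * cpi_multiplier   # ~$806,000 in today's dollars
fraction = cost_today / replacement_cost  # ~0.042, i.e. about 4.2%
```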

I will add one caveat: I don't know a lot about how these kinds of capital projects are funded. The replacement bridge project page mentions that 5% of the funding is coming from the city, 15% from the state, and 80% from the federal government. The above analysis in some ways assumes that the breakdown is the same for the secondary bridge. However, it appears likely that the city of Pittsburgh covered the full cost of the under-bridge. That puts their investment in 2003 for the past 12 years of bridge as about the same as for an entirely new bridge which will hopefully last 100 years. From that perspective, the city's best decision is a little less clear.

Friday, January 29, 2016

Salt, plow, or do nothing?

Last week we had a reminder of what an inch of snow can do at the wrong time. But how should a city decide when and how to act to avoid unsafe roads?

Modeling what will happen is tricky. The Washington Post link mentions that they get information from the "High-Resolution Rapid Refresh" weather model, which they say typically does well for short-term storms. They also cited a concern by city officials that low temperatures would cause the brine application to freeze instead of keeping the roads ice-free. In other words, even the physics side of the problem is tricky to figure out.

In addition to predicting what different snow-removal solutions would do, cost plays a role. Salting in some ways is preferable since it can be done ahead of the storm, but it depends on an accurate estimate of not-too-much snow falling. Plowing on the other hand requires you to wait until snow has actually built up, but then you know with certainty how much snow there is (taken to the extreme, the city of Ann Arbor does nothing until it stops snowing).

There do not seem to be comprehensive models which optimally trade off the cost of accidents and lost productivity against the cost of snow removal. One could argue that a policy like Ann Arbor's makes the most sense, since discouraging drivers in bad weather limits the number of accidents. But that only works if people are able to stay home, which is only sometimes true. If snow falls after people have already driven to work (which is frequently the case for these public failures), doing something so they can get home safely becomes much more important.

Wednesday, January 20, 2016

Are you a change agent?

Last fall I attended a Lean Green Belt certification course put on by IIE. One of the things that it included was this partitioning of people:

  • 20% of people who are change agents.
  • 60% of people who get on board once they see results.
  • 20% of people who don't want to change.

While the exact percentages are obviously made up, the strategy they suggested based on this partitioning is still useful. You start by getting together a group of people who are excited for positive change. With your combined forces you get some initial results, which you can use to convince the middle 60%. Finally, you simply outrun the people who are unwilling to change.

I was talking this idea over with a friend, and we joked about making a survey which asked people to classify themselves. It was a fun idea, but it highlights one way new projects get derailed. In one study, 82% of adults said they had done something in the past six months to contribute to positive social change. In an optimistic world where that number is accurate, it matches up nicely with the made-up percentages above, if you assume the middle people find something with evidence behind it to pursue. But when nearly everyone self-identifies as a change agent, it is harder to figure out who to include in your initial concept-motivated group and who to bring on board after you have some success.

Wednesday, January 13, 2016

Think about the incentives.

When you start applying game theory to business, the first question to ask is "what are everyone's incentives?" For example, when you buy insurance, you are trying to limit your risk in the event of a low-probability but very high-cost event. An insurance provider, on the other hand, is interested in making as much profit as possible. They accomplish their goal by increasing total premium payments (rate * customers) while limiting how much they pay back out.

When incentives are misaligned, things that "don't make sense" from your point of view happen more frequently. If you have a car accident, from your perspective this is exactly why you bought insurance. From the insurance company's perspective, they would like to keep you as a customer in the future if the cost of fixing this accident is low enough¹. If the cost is higher, though... they'd rather make it as difficult as possible for you to get a payout.

My recent experience with this was renting a car from a company I won't name here. Most of the experience went well, which seems to be hard to come by with rental car companies. Except I forgot my GPS in the car. Since the GPS would clearly have been found when the car was cleaned, I reported a lost item right away, and a GPS is worth little compared to the value of my future rentals, you might expect it would have been returned. However, there are four players in this situation: me, the car rental company, the specific rental location, and the person actually cleaning the car. It clearly benefits me to get my GPS back. It benefits the car company if I choose to use them again. It benefits the specific rental location considerably less than HQ in expectation. And it benefits the person cleaning the car to keep a bonus GPS, unless there are negative repercussions for them.

In short, a new GPS has been purchased and is on its way to me. I don't plan on using this rental company again, but I even more strongly plan not to forget things in the car next time.



¹ The benefit of you staying a customer also includes any word-of-mouth generated by how the incident is handled.