A friend asked me for ideas for a good analogy for classification problems in machine learning. In a classification problem, we have a collection of objects, and we somehow want to separate them into groups. Ideally, when designing this analogy, there are a few things we want to convey:
- Not all properties of objects are equal when it comes to classification. Some will be highly predictive, while others just help you over-fit your model.
- Some properties will be highly correlated, so including all of them can waste effort and data.
- The classification you want determines which properties are worth using.
- What do "properties" even look like, and how do they help us get at a classification?
The example I suggested is a person deciding what to eat at a potluck. I usually have two different classification problems to worry about when I'm filling my plate. First off, I want to decide which things I want to eat. In addition, I'm one of these people who will get a main course plate, eat that, and then go get dessert later. So as I survey the food, I have to decide both which things I think will be delicious, and which things I want to get later as dessert.
When trying to figure out what will be delicious, there are a lot of criteria I could use. Since I do not like cucumbers, anything with them is immediately excluded. Other properties besides ingredients could be smell, color, how much of it is available, how much was already eaten, the temperature of the food, anything! Some of these criteria are more helpful than others. I've had a lot of delicious brown things in my life at potlucks. And when I am trying to decide if something is dessert or not, how much of it people have already eaten is not going to be very informative.
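To make "more helpful than others" a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption on my part: the dishes, the yes/no properties, and the labels are all made up, and I'm using information gain (how much knowing a property reduces uncertainty about the label) as one possible way to score helpfulness, not the only one.

```python
from collections import Counter
from math import log2

# Hypothetical potluck table: every dish and every value is invented for illustration.
COLUMNS = ("name", "has_cucumber", "is_brown", "mostly_eaten", "is_sweet",
           "delicious", "dessert")
ROWS = [
    ("cucumber salad",      True,  False, False, False, False, False),
    ("cucumber sandwiches", True,  False, True,  False, False, False),
    ("chili",               False, True,  True,  False, True,  False),
    ("brownies",            False, True,  True,  True,  True,  True),
    ("fruit tart",          False, False, False, True,  True,  True),
    ("mac and cheese",      False, False, True,  False, True,  False),
    ("burnt casserole",     False, True,  False, False, False, False),
    ("cookies",             False, True,  False, True,  True,  True),
]
DISHES = [dict(zip(COLUMNS, row)) for row in ROWS]

# The yes/no properties I could look at while standing at the table.
PROPERTIES = ["has_cucumber", "is_brown", "mostly_eaten", "is_sweet"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(dishes, prop, target):
    """How much splitting the dishes on `prop` reduces uncertainty about `target`."""
    base = entropy([d[target] for d in dishes])
    remainder = 0.0
    for value in (True, False):
        subset = [d[target] for d in dishes if d[prop] == value]
        if subset:  # skip empty groups so we never divide by zero
            remainder += len(subset) / len(dishes) * entropy(subset)
    return base - remainder


def rank_properties(dishes, target):
    """Return (property, gain) pairs, most informative first."""
    gains = {p: information_gain(dishes, p, target) for p in PROPERTIES}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for target in ("delicious", "dessert"):
        print(f"Ranking properties for '{target}':")
        for prop, gain in rank_properties(DISHES, target):
            print(f"  {prop:<13} {gain:.3f} bits")
```

On this made-up table, the cucumber property ranks first for "delicious" and sweetness ranks first for "dessert", while how much has been eaten tells me almost nothing about dessert. The same properties get ranked differently depending on which question I'm asking, which is exactly the point about the desired classification determining which properties matter.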
How do you classify food at a potluck?