Friday, July 29, 2022

"Data Literacy" and Why it Matters

Most months I attend the INFORMS Practice Section Happy Hour. Today, the topic of discussion was “Data Literacy,” a term I had not previously paid any attention to. During the discussion and while reflecting afterwards, I realized that the need for junior data folks to become data literate is a key part of what makes it hard to break into the field. It is also a very learnable skill with practice.
 
The way I had previously described this challenge is that everyone who works with data will eventually learn that dealing with time zones is hard. A couple of years ago I watched this 10-minute video, which I *thought* had taught me everything there was to know about the difficulty of dealing with time zones. However, this recent article on tzdb covered even more complexities related to the politics of tracking time zones and the challenge of who gets to make those decisions.

But even without going that deep, simply knowing what time zone your data is in can be tricky, and it is frequently important. My first real data project was trying to develop a trading algorithm. At one point I realized I had accidentally been handicapping my algorithm by several hours because I had not properly handled time zones. That is an example of not having data literacy, because it didn’t even occur to me to check until I had lost substantial time trying to figure out why my analysis was not making sense. Compare that to a recent experience when I was working on a database server and was told to filter for an end_date of “12-31-9999 23:59.” When my first query gave me nothing, I was able to quickly shift to “12-31-9999 17:59.”
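
A minimal sketch of that kind of shift, using only Python's standard library. The fixed UTC-6 offset is an assumption inferred from the six-hour gap between the two timestamps; the post does not say which time zones were actually involved, and the names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the value was quoted in UTC and the server stores times
# six hours behind it (23:59 -> 17:59). The real zones are not stated.
QUOTED_TZ = timezone.utc
STORED_TZ = timezone(timedelta(hours=-6))

FMT = "%m-%d-%Y %H:%M"


def to_stored_time(text: str) -> str:
    """Re-express a quoted timestamp in the zone the data is stored in."""
    quoted = datetime.strptime(text, FMT).replace(tzinfo=QUOTED_TZ)
    return quoted.astimezone(STORED_TZ).strftime(FMT)


print(to_stored_time("12-31-9999 23:59"))  # prints 12-31-9999 17:59
```

The point is less the arithmetic than the habit: when a filter on a timestamp returns nothing, the time zone of the stored values is one of the first things worth checking.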

With that new lens, I have a slightly different definition of data literacy than, say, Gartner. I would define data literacy as “the ability to know what questions you need to answer about a set of data in order to understand it.” I like this framing because whether you know the domain of a particular data set or not, someone who is data literate can make progress on understanding what the data means. Our discussion spent a lot of time on a debate around the “context” you may need to be data literate. I think it makes sense to have a general concept of “Data Literacy” that is not specific to a domain, i.e., context. A key part of my job is knowing when to guess that something means what I think it does, and when to ask a “Subject Matter Expert (SME)” to walk me through the nuance.

To close, here are some common steps for understanding a new-to-you data set, some that I use and some that came up during the happy hour:
  • Look at the column headers for the spreadsheet or read the axis and legend of the graph.
  • Familiarize yourself with the data itself. Maybe check what the set of values is. Look at the most common values in each column. Make some graphs. Test some guesses about links in the data.
  • Imagine the process behind the data and see if it explains anything about what you have observed so far. Oftentimes there is a human involved, and that can substantially change the interpretation of what you find. For applications, you can often answer a lot of questions just by getting an SME to walk you through how the tool works.
  • In data science bootcamps you always get taught to look at the first few and the last few rows of a data set. I have found grabbing a random sample (.sample instead of .head or .tail in pandas) to be extremely helpful for getting an understanding of what you are dealing with. This is particularly useful for spotting which fields are frequently missing.
  • Consider whether anything in the prior steps leads you to believe your understanding of the data might not be quite right.
  • Think about whether the data itself might not be quite right.
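
A few of the steps above can be sketched in pandas. This is only an illustration on made-up data: the DataFrame, its column names, and its values are hypothetical stand-ins for whatever data set you are actually exploring.

```python
import pandas as pd

# Hypothetical toy data standing in for a new-to-you data set.
df = pd.DataFrame({
    "region": ["east", "west", "east", None, "east", "west", "east", "east"],
    "amount": [10.0, 12.5, None, 8.0, 11.0, None, 9.5, 10.5],
})

# Column headers and their types.
print(df.dtypes)

# Most common values, counting missing entries as their own bucket.
region_counts = df["region"].value_counts(dropna=False)
print(region_counts)

# Random rows instead of the first or last few;
# random_state makes the draw repeatable.
peek = df.sample(n=3, random_state=0)
print(peek)

# Share of missing values in each column.
missing_share = df.isna().mean()
print(missing_share)
```

None of this replaces asking an SME, but it usually surfaces the questions worth asking.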
After spending some time using these principles, I either have a good understanding of the data, or a good knowledge of what I don’t yet understand. Feel free to add any other tips in the comments!

1 comment:

  1. On your bullet #5, some of the biggest bloopers I've seen have come about when two people *thought* they were communicating, but weren't. A recent example was "first transaction." One person was defining it as a customer's first transaction through all time (absolute). The other person was defining it as the first transaction within the time horizon under study (relative). These kinds of understanding gaps can be really difficult to find, and almost hilarious once you correct them. Trust but verify! Looking forward to the next INFORMS Practice Section Networking Happy Hour:
    https://connect.informs.org/practice/events/virtual-happy-hour
