The airport delay fallacy

2021-09-10

When you’re in the airport, if you look at the flights on the big board, chances are you’ll see a whole bunch of them saying “DELAYED.”

And if you look at the faces in the airport, you’ll see a lot of miserable people, ones who have clearly been there for hours longer than they expected.

I fly a fair amount, but I don’t find my flights are delayed all that often. It seems kind of rare. Am I just lucky? Am I taking better flights? No - in fact, most flights are on time, and most passengers leave on time. What’s going on here? The issue is a funny sort of sampling bias in time. Maybe it has a technical name, but I think of it as the “airport delay fallacy.”

Most of the flights leave on time. They are posted on the big board for perhaps an hour or two. Their passengers are in the airport for maybe 30 minutes.

Some small fraction of the flights don’t leave on time. So, they’re posted on the big board for more than two hours - maybe 6, maybe 8. Their passengers are in the airport for way longer, like 4 hours.

When you walk into the airport and look at the board, you’re taking a snapshot at a single point in time. Each of the on-time flights is on the board for less time, so it’s less likely to be caught in your snapshot. The delayed flights are there for way longer, so you’re more likely to see them.

This effect can be really strong! I wrote a little Python program to simulate it, and found that if 99% of passengers are in the airport for just 30 minutes, but 1% of passengers are there for 4 hours, then at any given moment about 8% of the passengers will be from the delayed population - they’re over-represented by about 8x!
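Here is a minimal sketch of what such a simulation could look like. It is not the original program, just a reconstruction under some assumptions: passengers arrive uniformly at random over a long stretch of time, everyone stays exactly 30 or 240 minutes, and the function names and parameters are made up for illustration. It also computes the share directly, by weighting each group by how long it stays: 0.01 × 240 / (0.01 × 240 + 0.99 × 30) = 2.4 / 32.1, or roughly 7.5%, which is in line with the "about 8%" figure.

```python
import random


def expected_snapshot_share(p_delayed, on_time_min, delayed_min):
    """Share of people present at a random instant who are delayed.

    Each group is weighted by (fraction of passengers) x (minutes spent).
    """
    delayed = p_delayed * delayed_min
    on_time = (1 - p_delayed) * on_time_min
    return delayed / (delayed + on_time)


def simulated_snapshot_share(p_delayed=0.01, on_time_min=30, delayed_min=240,
                             n_passengers=100_000, horizon_min=10_000,
                             n_snapshots=20, seed=0):
    """Estimate the same share by taking snapshots of a simulated airport."""
    rng = random.Random(seed)
    passengers = []
    for _ in range(n_passengers):
        arrival = rng.uniform(0, horizon_min)  # assumed: uniform arrivals
        is_delayed = rng.random() < p_delayed
        stay = delayed_min if is_delayed else on_time_min
        passengers.append((arrival, arrival + stay, is_delayed))

    # Skip the first few hours so the airport has had time to fill up.
    first, last = delayed_min, horizon_min - delayed_min
    present = delayed_present = 0
    for i in range(n_snapshots):
        t = first + (last - first) * (i + 0.5) / n_snapshots
        for arrival, departure, is_delayed in passengers:
            if arrival <= t < departure:
                present += 1
                delayed_present += is_delayed
    return delayed_present / present


if __name__ == "__main__":
    print(f"expected:  {expected_snapshot_share(0.01, 30, 240):.3f}")
    print(f"simulated: {simulated_snapshot_share():.3f}")
```

The simulated number wobbles a bit from seed to seed, since only a couple dozen delayed passengers are in the building at any one snapshot, but it should land close to the closed-form value.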

This form of sampling bias can show up all the time. It shows up when analyzing user behavior on a product: if you ever do something like take a snapshot of users who logged in in the last day, and analyze their behavior or send them a survey, you’re probably massively biasing towards users with a lot of activity.
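To put a rough number on the user-snapshot case, here is a small sketch. The user mix is entirely hypothetical: suppose 10% of users log in every day and 90% log in about once every 30 days. The helper `snapshot_share` is just an illustrative name; it weights each group by its chance of showing up in a one-day snapshot.

```python
def snapshot_share(groups):
    """Share of each group among users caught in a one-day snapshot.

    groups maps a name to (share of all users, chance of logging in on a given day).
    """
    weights = {name: share * p for name, (share, p) in groups.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical mix: 10% daily users, 90% roughly-monthly users.
shares = snapshot_share({"daily": (0.10, 1.0), "monthly": (0.90, 1 / 30)})

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of the snapshot")
```

Under those made-up numbers, the daily users are a tenth of the user base but make up roughly three quarters of the snapshot.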

If you’re analyzing a system’s behavior, and you take a snapshot of its point-in-time work, you’re likely to see it handling its slowest, most difficult workloads.

Sometimes, this bias isn’t undesirable. It can even be a good, useful thing! Stack-sampling profilers use this to efficiently make estimates of how a program spends its time.

In addition to the sampling effects, I think this is a useful mental model to keep when thinking about system behavior. If your system generally gets very easy, quick-to-handle requests, but a small number of slow requests, the slow requests will pile up and become the dominant work done by the system. This gets really dangerous in serial processing of a queue of work. When you let the fast requests get blocked behind the slow ones, they can all suffer dramatically.
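The serial-queue case is easy to see in a toy model. This is a sketch with made-up numbers, not a measurement of any real system: one worker handles requests strictly in arrival order, a request arrives every 10 ms, and every request takes 1 ms except a single one that takes 500 ms. `fifo_latencies` is a hypothetical helper name.

```python
def fifo_latencies(arrivals, service_times):
    """Latency (finish time minus arrival time) of each request
    when one worker processes them strictly in order."""
    latencies = []
    free_at = 0  # time at which the worker finishes its current request
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + service
        latencies.append(free_at - arrival)
    return latencies


# Made-up workload, times in milliseconds.
arrivals = [10 * i for i in range(100)]
service_times = [1] * 100
service_times[10] = 500  # the one slow request

latencies = fifo_latencies(arrivals, service_times)
slowed = sum(1 for i, lat in enumerate(latencies) if i != 10 and lat > 1)

if __name__ == "__main__":
    print(f"fast requests slowed down: {slowed}")
    print(f"worst fast-request latency: {max(latencies[11:])} ms")
```

In this setup the one slow request holds up the 55 fast requests behind it, and the first of those waits 491 ms for 1 ms of actual work. The queue only drains because the worker is otherwise mostly idle; with less headroom, the backlog would take much longer to clear.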