I saw the initial College Football Playoff top 25 rankings for 2018 and thought to myself how formulaic and wrong they were. Sure, they got Alabama right, but overall they sucked. In practice, win-loss record clearly trumps strength of schedule in the committee's rankings (which makes sense, because humans are not smart enough to balance the results of thousands of games in their heads), even though that is supposed to be the antithesis of how their rankings work. Additionally, they have a stated view that head-to-head results are important, which in my mind sounds really stupid. I think the worst part about a playoff is that it decreases the odds of the best team winning, since we know that the best team does not win every game. Each additional round of playoffs just introduces more upset opportunities. The best way to crown a champion would be based on overall body of work, and this also holds true when comparing two teams, even those who played one another. After all, just this year Purdue beat Ohio State and Oregon beat Washington. I decided to test my views by looking at the data.
First I selected ESPN FPI as my "true" rankings. Computers are meant for churning through data in exactly the way the playoff committee cannot, and thus can factor in game outcomes in a way that humans cannot. I used ESPN because, after perusing thepredictiontracker.com, I decided FPI's overall prediction percentage and record against the spread were at least as good as the best humans, and that was good enough (I'm just looking for macro results that are better than humans can produce); it was also the easiest source for me to gather recent historic rankings from. I used three seasons of results, 2015-2017, because I wanted a dataset large enough to be meaningful (I didn't do any work to validate this, but all of my buckets, other than games decided by a margin of 29-32, had over 100 results, so I assume it is pretty good) and I didn't want to do more work than necessary pulling in extra years of data. I was actually intrigued by the Pi-Rate Ratings, but it was more effort than I wanted to expend to manually scroll through their old posts for historic rankings. I used game data from seldomusedreserve.com because it was easy to obtain and I've emailed the owner before and he is a nice guy.
One comment on the rankings I used: I always used the end-of-season rankings. I think this was a good choice because it was easier than trying to analyze results week by week, and the end-of-season rankings predicted 79% of results correctly. During the three seasons selected, FPI's in-season predictions were correct 74% of the time, so I interpret this to mean that as each season went on the model was refined and became more accurate. I didn't validate this interpretation by actually looking at week-by-week accuracy, though. When calculating that 79% prediction rate I also don't know if ESPN gets fancy with FPI and actually looks at situational matchups and that sort of thing; I just defined its accuracy as "did the higher-ranked team win." I also assume the rankings are truth, and thus consider 21% of the games to be upsets.
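To make that definition concrete, here is a minimal sketch of the accuracy calculation. The file names and column names are assumptions about how the data might be laid out, not the actual schema from ESPN or seldomusedreserve.com:

```python
import pandas as pd

# Hypothetical layout: games.csv has one row per game with the winner,
# loser, and final scores; fpi.csv maps each team to its end-of-season
# FPI rank (1 = best). These column names are assumptions for illustration.
games = pd.read_csv("games.csv")  # columns: winner, loser, winner_pts, loser_pts
ranks = pd.read_csv("fpi.csv").set_index("team")["rank"]

games["winner_rank"] = games["winner"].map(ranks)
games["loser_rank"] = games["loser"].map(ranks)

# Accuracy exactly as defined above: did the higher-ranked (numerically
# lower) team win? Anything else counts as an upset.
games["upset"] = games["winner_rank"] > games["loser_rank"]
print(f"predicted correctly: {1 - games['upset'].mean():.0%}")
```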
With my datasets collected I first looked at how often upsets occurred for different ranking differentials.

Here we see that of the 237 games where teams were ranked within 5 spots of one another (e.g. rank 1 vs rank 6, or rank 105 vs rank 110), the higher-ranked team won 54% of the time. I think this is a key piece of data showing that the head-to-head result of a single game is a highly inaccurate way to determine which team is better. We're basically flipping a coin when we use a single game to decide who is better. There are assuredly edge cases, like Alabama this year, where a team is clearly distinguished from the teams just below it in the rankings and will win more frequently, but once you get down to teams that are better than one another but not by much, the result of a single game is close to a toss-up.
I didn't do any sort of natural-breaks or sensitivity analysis to choose these buckets (or the ones below). The 6-10 and 11-15 buckets have very similar percentages, with the more highly ranked team winning about 60% of the time in both. Rank differences of 16-20 and 21-25 are also similar to one another, with an upset occurring in nearly 1 in 4 games. Once we get to larger rank differentials of 26+ and 50+, upsets become less common, presumably because the meaningful difference in rank gives better teams an increasingly large margin for error.
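For reference, here is a sketch of how the rank-differential buckets could be computed, continuing from the snippet above. The exact bucket edges (in particular whether 26+ excludes 50+) are my reading of the chart, so treat them as assumptions:

```python
# Bucket games by absolute difference in FPI rank and compute the upset
# rate and sample size per bucket. `games` is the dataframe built above.
games["rank_diff"] = (games["winner_rank"] - games["loser_rank"]).abs()
rank_buckets = pd.cut(
    games["rank_diff"],
    bins=[0, 5, 10, 15, 20, 25, 49, 130],
    labels=["1-5", "6-10", "11-15", "16-20", "21-25", "26-49", "50+"],
)
print(games.groupby(rank_buckets, observed=True)["upset"].agg(["mean", "size"]))
```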
Next I looked at how frequently an upset occurred for different score differentials. That is: given the final margin of a game, how likely is it that the better team won?

Here we see that in games decided by 1-3 points there is basically a 40% chance the result was an upset. A 4-7 point margin carries a 35% chance of an upset, so for any game decided by a touchdown or less you should consider there to be a decent chance an upset occurred. As the score differential increases, the odds that an upset occurred drop, since lower-ranked teams are less likely to win by a lot of points. If a team wins by four scores, the better team won 19 times out of 20.
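The score-margin table can be built the same way. Again, the bucket edges here are my guesses at the ones used in the chart:

```python
# Bucket games by final score margin: given a margin, how often was the
# result an upset by end-of-season FPI rank?
games["margin"] = games["winner_pts"] - games["loser_pts"]
margin_buckets = pd.cut(
    games["margin"],
    bins=[0, 3, 7, 14, 21, 28, 32, 100],
    labels=["1-3", "4-7", "8-14", "15-21", "22-28", "29-32", "33+"],
)
print(games.groupby(margin_buckets, observed=True)["upset"].agg(["mean", "size"]))
```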
I looked at this through a Washington lens, which is a terrible idea since these results are population-based and Washington is an outlier: we're a generally good team that plays down to our competition. We haven't been winning by much, which gives low confidence that we're better than the teams we beat; however, we only get beaten by very little, so it is also possible we're better than the teams that have been beating us. Beating Utah by 14 points gives us about an 80% chance of being better than them, and they are a solid team by most accounts. That game is one of the strongest pieces of evidence a human can look at and say this is a good team, without having to solve a gigantic linear algebra problem. This is probably why we were unranked in the initial rankings despite every computer ranking I've looked at having us in the top 10.
Anyways, I think this is useful for thinking about single-game outcomes and what they mean over the course of an entire season. Clearly head-to-head results are a nonsense factor for definitively comparing two closely ranked teams who played each other in a close game. College football is a body of work, and everything was better before 1998.
Comments
Things doogs use to justify underperforming for $1000, Alex.
Why?
As much as head-to-head wins have really high variance, your overall record should, more or less, be a solid indicator of who you are as a team across a season. As you said, body of work.
law of av·er·ag·es
noun
the principle that supposes most future events are likely to balance any past deviation from a presumed average.
All that said (Jake Browning sucks, BUT), I think this year's losses are pretty squarely on the failings of the coaches and not some other outside variables or outliers: namely recruiting, game plans, and in-game decisions. The fact that our losses were so close is, to me, part of what's so damning. This team is just not that good. Hopefully, with some luck, it can be good enough to win a shitty Pac-12 and, with some extreme luck, win a shitty Rose Bowl. At this point, though, I am of the opinion that that is more the bargaining stage of grief and denial-with-hope than a likely outcome.
Since we don't play a series like in baseball, where we get to do multiple weighted coin flips, we can use the extra information we have (score) to infer more about the relative statures of the teams. I would trust the overall record to the law of averages for n >> 12. Given that a single 80/20 outcome coming up 20 can represent 8% of a total body of work (one game out of a 12-game season is 1/12 ≈ 8%), I think it is worth understanding what score differentials tell us beyond binary win-loss.
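As a quick illustration of the weighted-coin-flip point, here's a small simulation (the 80/20 win probability and season lengths are illustrative, not taken from the FPI data):

```python
import random

def bad_record_rate(n_games, p_win=0.8, trials=100_000):
    """Fraction of simulated seasons in which a team that wins 80% of
    the time still loses at least a quarter of its games."""
    threshold = n_games / 4
    bad = sum(
        sum(random.random() > p_win for _ in range(n_games)) >= threshold
        for _ in range(trials)
    )
    return bad / trials

print(bad_record_rate(12))   # ~0.44: 3+ losses in a 12-game season is common
print(bad_record_rate(120))  # ~0.10: the record tracks true strength as n grows
```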