By: Edward Egros


The Art of the Comeback

Pasted GraphicLast November, arguably five million people attended the Chicago Cubs victory parade, celebrating the team's first World Series Championship since 1908.

Last Summer,
Cleveland hosted hundreds of thousands of Cavaliers fans to celebrate that franchise's first title and the city's first pro championship in more than half a century.

This year in New England, they constantly win. We move on.

The common storyline among these three winners is "The Comeback". The Cubs overcame a 3-1 deficit in the World Series to claim their championship in an extra-inning Game 7, the Cavaliers also stormed back from down 3-1 in the NBA Finals and the Patriots trailed Atlanta by 25 in the second half of Super Bowl LI, to win in overtime. These comebacks were also nearly unprecedented.
Only five teams had come back from down 3-1 to win the World Series before the Cubs. Cleveland became the first NBA team to overcome a 3-1 deficit in the Finals to win. And, New England's 25-point comeback win is the largest in Super Bowl history. The second largest ever is merely ten points.

This confluence of sports drama may seem like supernatural intervention, but perhaps it can be explained in earthlier terms. In 2011, Brian Skinner published "
Scoring Strategies for the Underdog: A General, Quantitative Method for Determining Optimal Sports Strategies". Skinner explained how underdogs must call riskier plays to have a chance at success. In this case, we can refer to teams significantly trailing in series and games as underdogs when their probability of winning is significantly below 50%. Calling riskier plays might mean getting shellacked, but by finding specifically how much riskier a team should get, it might be the only way for those trailing to win.

Baseball closers are niche pitchers, often asked to pitch only one inning, with his team holding the lead. Aroldis Chapman, the Cubs' closer, came in to pitch 2.2 innings in Game 5, 1.1 innings in Game 6 and 1.1 innings in Game 7. Chapman had one day of rest and pitched Game 5, another day of rest before Game 6 and no days off in Game 7. While he did allow three earned runs in the last two games, Maddon believed the risky strategy of extending his closer was the only way to overcome his 3-1 deficit. Chapman did allow runs, but it left other relievers fresh for longer games. Hitters were also asked to swing for home runs, not mere singles or doubles. The Cubs ranked 13th in home runs last season, but in the World Series, they recorded at least one home run in games five, six and seven, en route to their title.

In basketball, Skinner's paper discussed two key concepts pertinent to the Cavs: how often to shoot 3's and when to stall. The logic in the first case is, depending upon how many possessions are left in the game, a team should resort to shooting triples when reaching its critical threshold. In the regular season, Cleveland ranked 7th in the NBA in three-point shooting percentage and 3rd in three-point shooting attempts, but going up against the Golden State Warriors who ranked first in both categories. The Cavs' two of the three highest rates of three-point shooting in that series
happened in games 6 and 7, two must-win games. As for pace, while Golden State had the second most possessions per 48 minutes in the NBA, Cleveland ranked 27th out of 30 teams. However, the Cavs played a faster pace for games 5 and 6, both resorting to a style more like the Warriors and not shortening the game like it is suggested for underdogs. It is worth noting there was a slower pace for Game 7, the most dramatic in the entire series.

Lastly, the Patriots helped themselves and the Falcons maimed themselves because of risk-taking.
Once Atlanta led 28-3, New England resorted to 40 pass plays (including sacks) and just 10 rushes. Before the deficit, the Patriots passed the ball 34 times and ran it 15 times, relying significantly more on the ground attack. Also, some of Brady's longest completions occurred in the 4th quarter during the comeback. Defensively, Matt Ryan and the Falcons leaned towards passing more frequently in the final minutes than sticking to the ground game, which would have taken more time off the clock. Perhaps the most egregious example was when Atlanta had the ball at the New England 22-yard line with 4:40 left in the game and leading by eight. Instead of running the ball three times and going for a two-possession lead, a sack, a pass (wiped away by offensive holding) and an incompletion took the Falcons out of field goal range AND gave Tom Brady 3:30 to tie the game. Overall, even play-count disparity factored into the outcome; Brady kept the Falcons' defense on the field and Ryan could not give his teammates a break.

Teams in any sport can calculate when it is time to run riskier plays. Many recent and high-profile examples suggest comebacks are more possible than ever before, when the right tactics are implemented.

There is a postscript: win probability charts have become more popular than ever. But these games and series show something seemingly calculated to have a .7% probability of happening can occur. Because underdogs can increase their own variance with their playcalling, perhaps these charts need to be updated in some way. Fortunately, this discussion is ongoing.

A New NCAA Tournament

There's no doubting the increased awareness of analytics in predicting the NCAA tournament field in college basketball. Instead of just diagnosing a team's record against the Top 50, it's Rating Percentage Index or Ken Pomeroy rankings, that are becoming more commonplace. It has gotten to where data scientists are actually meeting with the NCAA to determine if one metric should be used above all others to pick tournament teams.

Perhaps surprisingly, data scientists want simpler criteria for picking teams: who wins, who loses and who have you played. This is opposed to other explanatory variables used in more advanced metrics, like margin of victory and offensive/defensive efficiency. Coaches, on the other hand, would prefer more complex formulae for determining the tournament field. Logically, this approach makes more sense from their perspective, because of competition. If a coach has figured out a style of play or way to schedule opponents that increases the likelihood of making the tournament, they develop a competitive advantage. Data scientists want to keep it simple for fans, coaches want a figure out a competitive advantage.

Perhaps in this same spirit of transparency, the tournament selection committee released "in-season" projections for the first time ever, one month before Selection Sunday. It only has the top four seeds of every region, but it is added information for where highly ranked teams really sit. As with any analytic project, more data "usually" means more robust forecasts. Already, it is easier to make more accurate assumptions and offer a better glimpse as to what the committee is looking for.

However, these in-season projections do not include the full field of 68, and what usually causes the most consternation is simply who does and does not make the dance. While it makes sense not to include the full field because you have to assume certain conference champions in mid-major conferences, something that would include all "at large" teams would provide even more information as to the criteria for inclusion.

Nothing is easy about picking 68 teams to play in a tournament, and while analytics may be helpful in forecasting a Final Four, easy-to-understand criteria can help teams and fans quell any controversy.

How Predictive Is Scoring Differential?

Pasted GraphicHow important is an impenetrable goalie in the NHL? How much better is it to outscore opponents throughout the season, as opposed to dominating them defensively? Overall, how important is point differential to overall success?

In an earlier blog post, I discussed
playoff unpredictability when it comes to determining who will win a championship based upon how many games that team won. There, the NBA was the most predictable, then the NHL, NFL, then MLB is the most unpredictable (unless, of course, you are the 2016 Chicago Cubs). But how does point differential (or run differential in baseball or goal differential in hockey) translate to winning championships? And which league is most predictable when looking at that specific metric?

Once again, I am using
logistic regressions using one explanatory variable and if that team won a championship as the dependent variable. However, this time I am using three per sport: offensive output, defensive output and scoring differential. Also once again, here is what is noteworthy with our datasets:

- All data used begins with the 1989-90 season because the NFL had the biggest chance to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played the rest of the dataset, so they are still used.

Each explanatory variable has the appropriate and logical coefficient. In other words, scoring variables have a positive coefficient, defensive variables have a negative coefficient and scoring differential variables have a much larger positive coefficient. All of this equates to a better probability of winning a championship. Each variable is also statistically significant with 95% confidence, which is to be expected. A better offense, defense and scoring differential will obviously increase the likelihood of winning a championship. What is not clear is which of these indicators is most predictive.
A goodness-of-fit measure called AIC (Akaike Information Criterion) can shed some light. As this number gets smaller, the model has a better fit, explaining away more of the randomness of that sport.

The first chart is points (or runs or goals) modeled against championships:

Pasted Graphic 1

Before analyzing this chart, it is important to note the value of each point, goal and run, compared with the other sports. In 2016, the average MLB team scored 726 runs for the season. This number is different from the 325 points scored, on average, for an NFL team in 2015, the 8419 points scored for an NBA team for last season and the 222 goals scored for an NHL team for last season. Fortunately, the variation across each league is not so substantially different to where comparison becomes impossible.

In the chart, we see goals in hockey as being the best predictor for winning its championship, with football being slightly more random, then basketball, then baseball finishing as the most random. So far, these results are consistent with the previous study where MLB's postseason was the toughest to predict, based upon number of wins during the regular season. Basketball makes intuitive sense because teams play at different paces, and it is not conclusive if playing at a faster rate—which scores more points but not necessarily more points per possession—is the best way to win a title.

The next chart illustrates runs, points, and goals allowed, modeled against winning a championship:

Pasted Graphic 2

Comparatively, the trends are almost the same as they are with offensive output: Major League Baseball is the most random, followed by the NBA. However, an NFL scoring defense is now a better indicator than an NHL scoring defense, but only slightly so.

Now, let's combine these two charts into scoring differential, modeled against a championship:

Pasted Graphic 3

Here, we learn point differential is more predictive in basketball than in any other sport. Remember how different teams playing at different paces obscures the importance of points alone? Including the defensive component erases pace of play and gives a clearer predictor. It also coincides with how a win total in basketball is most predictive for winning a championship. Football and hockey are nearly equal in predictive ability and baseball is a distant fourth.

There are more trends to uncover if we combine all of these charts:

Pasted Graphic 4

In nearly every sport, scoring defense is more predictive than offense (with hockey being the lone exception). Scoring differential is predictably better for analysis than offense or defense by itself, but the degree to which it takes away the randomness is different for each sport. It is only a slight improvement in the NFL, but a drastic improvement for basketball.

Overall, these proportions could prove helpful when determining if a team is going in the right direction when devoting resources to offense and defense. Both are necessary, but perhaps more money should be proportionally allocated to the areas that best predict who will win a championship.

Playoff Unpredictability

Pasted GraphicUntil recently, the Los Angeles Lakers were one of the fixtures of the NBA Playoffs, and in many seasons, the Finals. They have put together dynasties in different generations of the sport, from Magic Johnson's teams to the Shaq and Kobe era. When the Lakers were not winning titles, chances are another team was enjoying its own dynasty, like the Boston Celtics, Chicago Bulls or San Antonio Spurs. Dynasties are so commonplace in the NBA, 15 franchises in the sport's history do not have a championship (and seven of those still in existence never even made it to the Finals).

The NBA is unique in this regard: championships are won in bulk. Other leagues offer more parity, where there is a larger pool of contenders vying for a title. There may be dynasties in other sports, but there seems to be fewer of them, each shorter in duration and there stood a better chance someone unexpected can claim the sport's top prize.

Which of the four top professional sports leagues (NFL, NBA, MLB and NHL) offers the most playoff unpredictability? Is the NBA truly the most predictable? Is it significantly more predictable or marginally so?

One approach to answering these questions is by using a statistical model for each sport. Here, we will use
logistic regressions, where we will look at only wins (or points in hockey) and see how well it predicts whether a team won a championship that year. Here are some other notes for setting up this project:

- All data used begins with the 1989-90 season because
the NFL had the biggest chance to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played the rest of the dataset, so they are still used.

At first glance, every variable representing wins is statistically significant with 99% confidence, which should be obvious because you need so many wins just to make the playoffs. What matters is how well wins alone predicts championships. In statistical parlance, we will use a goodness-of-fit measure called
AIC (Akaike Information Criterion) to answer this question. As this number gets smaller, the model has a better fit. The following shows how well each model performs:

Screen Shot 2016-04-17 at 7.47.11 AM
The larger the bar, the more unpredictable the league is. Again, as expected, the NBA is the most predictable, and by a considerable margin. This model also suggests Major League Baseball is the most unpredictable, with the NFL as a close second and the NHL as a close third.

There are a number of other variables that could be added to these models to help determine who will win a championship, but the simplicity of these models makes for an easier comparison across sports.

Evaluating Your Bracket

Pasted Graphic 1The Law of Conservation of Mass tells us: matter is neither created nor destroyed. When you burn your horribly incorrect college basketball bracket, remember, you never destroyed it, it is in another form somewhere in the universe. So instead of ignoring your transgressions, let's embrace what still exists and see which approaches were the best when predicting who will be in the Final Four.

There's a one-seed (North Carolina), a couple of two-seeds (Villanova and Oklahoma) and a 10-seed (Syracuse). There is not as much parity with this quartet as with some tournaments in the last few years. Still, some of the favorites to win the National Championship did not survive the first two weeks of this crucible. For instance, the top three teams in the Pythagorean Rating at the end of the conference tournaments are not playing in Houston. In fact,
Syracuse did not even crack the top 25, until recently. ESPN's Basketball Power Index offers these rankings: North Carolina (1), Villanova (3), Oklahoma (6) and Syracuse (39). The LRMC Basketball Rankings still has its two, three and seven, but ranks the Orange 41st.

Some computer models have resorted to predictions without solely implementing historical data. How is this possible? Microsoft's search engine, Bing, uses social media to determine which teams will survive and advance.
It has already proven successful in other sporting events like the World Cup and NFL games. But how did it fare for this tournament? Sadly for Bing, it only predicted one Final Four team correctly (North Carolina). In fact, the system predicted the Orange to lose their first game.

It should be clear by now the two schools that ruined this tournament's predictiveness: Kansas and Syracuse. The Jayhawks were the top team by nearly all accounts, yet lost in the Regional Final,
perhaps uncharacteristically. At the other end of the spectrum, Syracuse could be the worst team ever to make the Final Four. There have been 11-seeds to make it to the final weekend of the season, but many debated if Syracuse even deserved to make the tournament. Their RPI was 72 at the time of selection, worse than other schools that were not chosen (e.g. Valparaiso, San Diego St. and St. Bonaventure). Instead of the favorite vying for the National Championship, it's the controversial at-large two wins away from glory.

Even listening to me would not have been wise. Using my own system, I only correctly predicted one team (and it was a different school than what I said was coming out of that Region on Fox 4). My National Champion was knocked out during the Elite Eight (Kansas) and my second place team lost in the First Round (Michigan St.).

So what is the best way to fill out your bracket for the next tournament?

I don't know.