By: Edward Egros

baseball

The Statcast Revolution

There are more statistics about hitters than ever before. Thanks to Statcast, a baseball fan can learn how fast a ball comes off the bat on any hit, the angle at which it leaves the bat, how far it travels and more.

These statistics can help characterize and differentiate hitters. A natural extension is to ask whether they can also predict a hitter's success. For instance, if a hitter averages a higher exit velocity, does that mean he is generally a better hitter?

Fangraphs has kept a database with averages of these Statcast statistics for every hitter. Even though there is some missing data, Jeff Zimmerman made necessary corrections based upon the type of balls in play fielded by certain positions. Using 2016 season data, the variables include:


It makes intuitive sense for the second half of this list to be relevant to a hitter's success, but what about the first half? To answer that question, I merged this dataset with other advanced offensive statistics for the same hitters (this data came from Baseball Reference). While it would make sense to choose offensive wins above replacement (oWAR) as my dependent variable, there is a problem: WAR is a cumulative statistic, meaning it grows with additional plate appearances. Because I am already using averaged statistics for hitters and want to look at the average impact each statistic has on a hitter's overall performance, I divided oWAR by plate appearances and then multiplied by 1,000, so as not to have too many zeroes after the decimal point (this variable is named oWARavg).
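As a quick illustration of that rescaling, here is a minimal Python sketch. The function name and the two hitters are made up for the example; only the formula (oWAR divided by plate appearances, times 1,000) comes from the description above.

```python
def owar_avg(owar: float, plate_appearances: int, scale: int = 1000) -> float:
    """Offensive WAR per plate appearance, scaled (by default per 1,000 PA)."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return owar * scale / plate_appearances


# Two hypothetical hitters with the same total oWAR but different playing time.
full_season = owar_avg(3.0, 600)
half_season = owar_avg(3.0, 300)
print(full_season)  # 5.0
print(half_season)  # 10.0
```

The part-time hitter grades out twice as well per plate appearance, which is exactly the distinction the raw oWAR total hides.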

The next step is to determine which of the first group of variables are statistically significant at the 95% confidence level. I am using a backward elimination technique, where I start with a regression containing all three variables, then remove any that are not significant. With this approach, the only variable eliminated was speed. In other words, the average exit velocity of a batted ball is not a significant indicator of how successful a hitter is. However, the angle of the batted ball and the distance it travels are significant:

Pasted Graphic 1

The angle has a negative coefficient, meaning batted balls not hit as steeply tend to be hits. Distance has a positive coefficient, which makes intuitive sense, because the farther a ball travels, the likelier it is to become a hit or maybe even a home run. Although both variables are significant, the adjusted R-squared is only .1687, meaning these two variables explain only about 17% of the variability in average offensive WAR.
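For readers who want to see the backward elimination idea in code, here is a minimal Python sketch. It is not the code behind the results above. The data are synthetic, the column names (angle, distance, speed) are stand-ins, and the 1.96 cutoff on each t-statistic is a large-sample approximation of the 95% significance test.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_t_stats(columns, y):
    """Regress y on an intercept plus the columns; return one t-statistic per column."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    sigma2 = rss / (n - k)
    t_stats = []
    for j in range(1, k):  # index 0 is the intercept, which is always kept
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        var_j = sigma2 * solve(xtx, unit)[j]  # j-th diagonal of sigma^2 (X'X)^-1
        t_stats.append(beta[j] / math.sqrt(var_j))
    return t_stats


def backward_eliminate(data, y, threshold=1.96):
    """Drop the weakest variable until every remaining |t| clears the threshold."""
    kept = dict(data)
    while kept:
        names = list(kept)
        t_stats = ols_t_stats([kept[name] for name in names], y)
        weakest = min(range(len(names)), key=lambda i: abs(t_stats[i]))
        if abs(t_stats[weakest]) >= threshold:
            break
        del kept[names[weakest]]
    return list(kept)


# Synthetic data: the outcome depends on angle and distance but not on speed.
random.seed(0)
n_hitters = 300
data = {
    "angle": [random.gauss(12, 4) for _ in range(n_hitters)],
    "distance": [random.gauss(215, 15) for _ in range(n_hitters)],
    "speed": [random.gauss(89, 2) for _ in range(n_hitters)],
}
outcome = [-0.5 * a + 0.2 * d + random.gauss(0, 2)
           for a, d in zip(data["angle"], data["distance"])]
kept = backward_eliminate(data, outcome)
print(kept)
```

A statistics package would normally report exact p-values instead of the 1.96 shortcut, but the loop is the same: fit, find the weakest variable, drop it if it is not significant, and refit.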

Just for fun, let's see what impact angle and distance have when the second group of variables is included in a regression. Again, using the backward elimination technique, here are the results:


Pasted Graphic 2

Once again, backward elimination took out exit velocity. It also took out the expected ratio of home runs to fly balls. While it kept the original ratio, the negative coefficient does not make intuitive sense. The logic is that the more home runs a hitter gets out of his fly balls, the more successful he is. Instead, this model suggests the opposite. However, a positive coefficient for isolated power does make logical sense, and the adjusted R-squared is approximately 40%, making for a model that does a better job explaining what makes for a successful hitter.
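Adjusted R-squared matters when comparing these two models because plain R-squared never goes down when variables are added. Here is a small Python sketch of the standard adjustment. The sample size and R-squared in the example are hypothetical, since the post does not list them.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Hypothetical: the same raw R^2 is penalized more as predictors are added.
few_predictors = adjusted_r_squared(0.42, n_obs=200, n_predictors=3)
many_predictors = adjusted_r_squared(0.42, n_obs=200, n_predictors=12)
print(round(few_predictors, 3), round(many_predictors, 3))
```

So a larger model only earns a higher adjusted R-squared if the added variables explain enough extra variance to outweigh the penalty.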

Obviously there are a lot more advanced offensive variables that could be included in a model like this. At least there is a statistical approach for determining which of the variables Statcast emphasizes help explain offensive success. A similar study can be conducted when looking at baserunning, pitching, defense, etc.

The Art of the Comeback

Last November, an estimated five million people attended the Chicago Cubs' victory parade, celebrating the team's first World Series Championship since 1908.

Last summer, Cleveland hosted hundreds of thousands of Cavaliers fans to celebrate that franchise's first title and the city's first pro championship in more than half a century.

This year in New England, they constantly win. We move on.

The common storyline among these three winners is "The Comeback". The Cubs overcame a 3-1 deficit in the World Series to claim their championship in an extra-inning Game 7, the Cavaliers also stormed back from down 3-1 in the NBA Finals, and the Patriots trailed Atlanta by 25 in the second half of Super Bowl LI before winning in overtime. These comebacks were also nearly unprecedented. Only five teams had come back from down 3-1 to win the World Series before the Cubs. Cleveland became the first NBA team to overcome a 3-1 deficit in the Finals. And New England's 25-point comeback win is the largest in Super Bowl history. The second largest ever is merely ten points.

This confluence of sports drama may seem like supernatural intervention, but perhaps it can be explained in more earthly terms. In 2011, Brian Skinner published "Scoring Strategies for the Underdog: A General, Quantitative Method for Determining Optimal Sports Strategies". Skinner explained how underdogs must call riskier plays to have a chance at success. In this case, we can refer to teams significantly trailing in series and games as underdogs when their probability of winning is significantly below 50%. Calling riskier plays might mean getting shellacked, but for a trailing team, finding out specifically how much more risk to take on might be the only way to win.
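The intuition that risk helps the underdog can be shown with a toy calculation. The Python sketch below is my own simplification, not Skinner's model: it assumes the final margin is normally distributed, with an expected margin that is negative for the underdog, and it treats a "riskier" strategy as one that widens the spread (sigma) without changing the expected margin.

```python
import math


def win_probability(expected_margin: float, sigma: float) -> float:
    """P(final margin > 0) when the margin is Normal(expected_margin, sigma^2)."""
    z = expected_margin / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical underdog expected to lose by 10 points.
for sigma in (5.0, 10.0, 20.0):
    print(sigma, round(win_probability(-10.0, sigma), 3))
# 5.0 0.023
# 10.0 0.159
# 20.0 0.309
```

The expected result never changes, yet the chance of winning climbs from about 2% to about 31% as the variance grows. The same arithmetic run with a positive expected margin shows why the team in front should do the opposite and play it safe.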

Baseball closers are niche pitchers, often asked to pitch only one inning with their team holding the lead. Aroldis Chapman, the Cubs' closer, came in to pitch 2.2 innings in Game 5, 1.1 innings in Game 6 and 1.1 innings in Game 7. Chapman had one day of rest before pitching in Game 5, another day of rest before Game 6 and no days off before Game 7. While he did allow three earned runs in the last two games, manager Joe Maddon believed the risky strategy of extending his closer was the only way to overcome the 3-1 deficit, and it left other relievers fresh for longer games. Hitters were also asked to swing for home runs, not mere singles or doubles. The Cubs ranked 13th in home runs last season, but in the World Series, they recorded at least one home run in each of Games 5, 6 and 7, en route to their title.

In basketball, Skinner's paper discussed two key concepts pertinent to the Cavs: how often to shoot 3's and when to stall. The logic in the first case is that, depending upon how many possessions are left in the game, a trailing team should resort to shooting triples once it reaches a critical threshold. In the regular season, Cleveland ranked 7th in the NBA in three-point shooting percentage and 3rd in three-point attempts, but it was going up against the Golden State Warriors, who ranked first in both categories. Two of the Cavs' three highest rates of three-point shooting in that series happened in Games 6 and 7, two must-win games. As for pace, while Golden State had the second most possessions per 48 minutes in the NBA, Cleveland ranked 27th out of 30 teams. However, the Cavs played at a faster pace in Games 5 and 6, resorting to a style more like the Warriors' rather than shortening the game, as is suggested for underdogs. It is worth noting there was a slower pace in Game 7, the most dramatic game of the series.

Lastly, the Patriots helped themselves and the Falcons hurt themselves through risk-taking. Once Atlanta led 28-3, New England resorted to 40 pass plays (including sacks) and just 10 rushes. Before the deficit, the Patriots passed the ball 34 times and ran it 15 times, relying significantly more on the ground attack. Also, some of Brady's longest completions occurred in the 4th quarter during the comeback. On the other side, Matt Ryan and the Falcons leaned towards passing more frequently in the final minutes rather than sticking to the ground game, which would have taken more time off the clock. Perhaps the most egregious example was when Atlanta had the ball at the New England 22-yard line with 4:40 left in the game and leading by eight. Instead of running the ball three times and kicking a field goal for a two-possession lead, the Falcons took a sack, had a pass wiped away by offensive holding and threw an incompletion, which took them out of field goal range AND gave Tom Brady 3:30 to tie the game. Overall, even play-count disparity factored into the outcome; Brady kept the Falcons' defense on the field and Ryan could not give his teammates a break.

Teams in any sport can calculate when it is time to run riskier plays. Many recent and high-profile examples suggest comebacks are more possible than ever before, when the right tactics are implemented.

There is a postscript: win probability charts have become more popular than ever. But these games and series show that something calculated to have a 0.7% probability of happening can still occur. Because underdogs can increase their own variance with their playcalling, perhaps these charts need to be updated in some way. Fortunately, this discussion is ongoing.

How Predictive Is Scoring Differential?

How important is an impenetrable goalie in the NHL? How much better is it to outscore opponents throughout the season, as opposed to dominating them defensively? Overall, how important is point differential to overall success?

In an earlier blog post, I discussed playoff unpredictability when it comes to determining who will win a championship based upon how many games that team won. There, the NBA was the most predictable, followed by the NHL and the NFL, with MLB the most unpredictable (unless, of course, you are the 2016 Chicago Cubs). But how does point differential (or run differential in baseball or goal differential in hockey) translate to winning championships? And which league is most predictable when looking at that specific metric?

Once again, I am using logistic regressions with one explanatory variable and whether that team won a championship as the dependent variable. However, this time I am fitting three models per sport: offensive output, defensive output and scoring differential. Also once again, here is what is noteworthy about the datasets:

- All data used begins with the 1989-90 season because the NFL had the biggest change to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played in the rest of the dataset, so they are still used.

Each explanatory variable has the appropriate and logical coefficient. In other words, scoring variables have a positive coefficient, defensive variables have a negative coefficient and scoring differential variables have a much larger positive coefficient. In each case, a better number equates to a better probability of winning a championship. Each variable is also statistically significant with 95% confidence, which is to be expected. A better offense, defense and scoring differential will obviously increase the likelihood of winning a championship. What is not clear is which of these indicators is most predictive. A goodness-of-fit measure called AIC (Akaike Information Criterion) can shed some light. As this number gets smaller, the model has a better fit, explaining away more of the randomness of that sport.
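To show what is being compared, here is a minimal Python sketch of a one-variable logistic regression and its AIC, defined as 2k - 2 ln(L) with k = 2 parameters (intercept and slope). The two ten-team "leagues" are invented for illustration and the fit uses a bare-bones Newton's method, so this is a sketch of the idea rather than the models behind the charts.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to the intercept
            g1 += (yi - p) * xi     # gradient with respect to the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def aic(x, y):
    """AIC = 2k - 2 ln(L) for the two-parameter logistic model."""
    b0, b1 = fit_logistic(x, y)
    return 2 * 2 - 2 * log_likelihood(x, y, b0, b1)


# Hypothetical win totals, centered so Newton's method starts near the data.
wins = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
mean_wins = sum(wins) / len(wins)
x = [w - mean_wins for w in wins]

# 1 = won a title. In the first league, titles go to high-win teams;
# in the second, they are unrelated to win totals.
predictable_league = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
random_league = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

aic_predictable = aic(x, predictable_league)
aic_random = aic(x, random_league)
print(round(aic_predictable, 2), round(aic_random, 2))
```

The league where titles track win totals ends up with the smaller AIC, which is the sense in which a lower bar on the charts means a more predictable sport.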

The first chart is points (or runs or goals) modeled against championships:

Pasted Graphic 1

Before analyzing this chart, it is important to note how the value of each point, goal and run compares across sports. In 2016, the average MLB team scored 726 runs for the season. Compare that with the 325 points scored, on average, by an NFL team in 2015, the 8,419 points scored by an NBA team last season and the 222 goals scored by an NHL team last season. Fortunately, the variation within each league is not so different that comparison becomes impossible.

In the chart, we see goals in hockey as being the best predictor for winning its championship, with football being slightly more random, then basketball, then baseball finishing as the most random. So far, these results are consistent with the previous study where MLB's postseason was the toughest to predict, based upon number of wins during the regular season. Basketball makes intuitive sense because teams play at different paces, and it is not conclusive if playing at a faster rate—which scores more points but not necessarily more points per possession—is the best way to win a title.

The next chart illustrates runs, points, and goals allowed, modeled against winning a championship:

Pasted Graphic 2

Comparatively, the trends are almost the same as they are with offensive output: Major League Baseball is the most random, followed by the NBA. However, an NFL scoring defense is now a better indicator than an NHL scoring defense, but only slightly so.

Now, let's combine these two charts into scoring differential, modeled against a championship:

Pasted Graphic 3

Here, we learn point differential is more predictive in basketball than in any other sport. Remember how different teams playing at different paces obscures the importance of points alone? Including the defensive component erases pace of play and gives a clearer predictor. It also coincides with how a win total in basketball is most predictive for winning a championship. Football and hockey are nearly equal in predictive ability and baseball is a distant fourth.

There are more trends to uncover if we combine all of these charts:

Pasted Graphic 4

In nearly every sport, scoring defense is more predictive than offense (with hockey being the lone exception). Scoring differential is predictably better for analysis than offense or defense by itself, but the degree to which it takes away the randomness is different for each sport. It is only a slight improvement in the NFL, but a drastic improvement for basketball.

Overall, these proportions could prove helpful when determining if a team is going in the right direction when devoting resources to offense and defense. Both are necessary, but perhaps more money should be proportionally allocated to the areas that best predict who will win a championship.

Go Cubs Go

In just a few days, Wrigley Field's iconic scoreboard will showcase a World Series for the first time in more than seven decades. A franchise with questionable management and horrible luck has finally come within four wins of its first world championship in more than a century.

The Cubs have fielded formidable teams that have made the postseason, but they had never won the NLCS until this year. Postseason baseball can be so unpredictable that it is difficult to explain why the Cubs could not reach the World Series until now. But there are some trends that predict success in playoff baseball that do not have as great an impact in regular-season baseball.

While I have written a paper about this and have applied those lessons to the Texas Rangers in a previous post, I would like to look at alternative research. In the book "Baseball Between the Numbers", three qualities are listed that best determine postseason success:

  • Pitcher Strikeout Rate
  • Fielding Runs Above Average (FRAA)
  • Closer Expected Wins Over Replacement Pitcher (WXRL)

The Cubs finished 3rd in the majors in strikeout percentage and strikeouts per nine innings (the Dodgers, the team Chicago beat in the NLCS, finished first in both categories). Fangraphs uses a metric called Ultimate Zone Rating to calculate fielding, and listed the Cubs as the best fielding team this season. Lastly, the Cubs finished 19th in reliever Wins Above Replacement, but keep in mind that the team traded for Aroldis Chapman late in the season.

It is also worth noting that the Indians ranked highly in all three of these categories as well (5th, 4th and 7th, respectively). While the matchup should make for a fantastic World Series, given how well the Cubs have built this team for a postseason run, it should not come as a surprise if they end this 108-year drought.

No Range for the Texas Rangers

It's hard not to catch shortstop Elvis Andrus smiling these days. His Texas Rangers go into the postseason with home-field advantage all the way through the World Series, having finished one victory shy of the franchise record for most wins in a season and boasting the most home wins in the American League. Elvis himself finished the regular season as a .302/.362/.439 hitter. And yet, a few sabermetricians have spoken out, saying not only that the Rangers should not be among the favorites to win the World Series, but that their success is virtually fraudulent.

Their argument involves Pythagorean Expectation. This is the often-cited formula baseball guru Bill James invented to estimate how many wins a team "should" have based upon how many runs it scored and allowed. Since it became commonplace, the formula has worked quite well at explaining why teams are thriving or struggling. Even this season, the formula explains all but a handful of wins or losses for every MLB team. The one team the formula has done the poorest job with is the Texas Rangers.

For much of the season, this team's Pythagorean W-L hovered around .500. The Rangers finished 13 games above what was expected, at 95-67. Why? The Rangers were 36-11 in one-run games (the .766 winning percentage is a record in modern baseball). They were also 18-24 in games decided by 5+ runs. In other words, the Rangers won a lot of close games and lost a lot of blowouts.
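The formula itself is short enough to check by hand. Below is a minimal Python sketch using the classic exponent of 2. The run totals (765 scored, 757 allowed) are the 2016 figures as I understand them rather than numbers quoted in this post, so treat them as an assumption and verify them before reusing the example.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Expected winning percentage: RS^e / (RS^e + RA^e)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Assumed 2016 Rangers run totals (not taken from the post itself).
expected_pct = pythagorean_win_pct(765, 757)
expected_wins = round(expected_pct * 162)
print(expected_wins)       # 82
print(95 - expected_wins)  # 13, the gap between actual and expected wins

# The record in one-run games cited above: 36 wins, 11 losses.
print(round(36 / (36 + 11), 3))  # 0.766
```

A team that scores barely more runs than it allows projects as roughly a .500 club, which is why a 95-win season stands out so sharply against the formula.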

A discrepancy this large is unprecedented for the Rangers in the last decade:

Pasted Graphic

For most of that span, the Rangers performed roughly as expected, given their runs scored and allowed. But in the last two years, this team has over-performed. It might be a coincidence that those were the two years Jeff Banister has been the manager of the Rangers, but maybe not. Banister has a history of evaluating players and looking at skills during blowouts. He is certainly not the only manager to have this approach, but it is possible he takes it to the next level. Two years is not sufficient data to draw such a conclusion, but it is a noteworthy trend to consider.

So how accurate is this formula when predicting if the Rangers will win the World Series? Not very. Since 1969, 11 teams out of 47 had the best Pythagorean Expected record and went on to win the World Series. In fact, the likelihood has decreased since the postseason expanded. Many conclude the postseason is almost impossible to predict, though there are some helpful trends to consider. Most notably, "small ball" seems to be a more successful approach in the postseason than in the regular season. Among teams in the postseason, the Rangers rank 3rd in stolen bases, 5th in sacrifice flies and 3rd in hit by pitch (they are, however, last in walks and almost last in sacrifice hits).

If you believe the Rangers will eventually regress to the mean given this disparity, it has not happened through 162 games, so statistically nothing suggests this trend will automatically change after another 19 games. In a way, the Texas Rangers have just as good a chance to win the franchise's first world championship as anybody, and that smile from Elvis Andrus will be even wider.

Predicting Pitching Performance

Noah Syndergaard made his Major League debut last year for the New York Mets and made an immediate impact (3.24 ERA and 9.96 K/9). While his 9-7 record may not have been overly impressive, there were signs this was only the beginning. Now, Syndergaard has multiple National League player of the week awards and is one of the more reliable hurlers in the game.

But not every pitcher lives up to predictions. How can someone better determine which pitchers will become successful the following season? One of the more intriguing presentations concerning the future of baseball predictions involved creating a pitcher projection system based upon PITCHf/x (to read the paper and/or watch the presentation, click here). The traditional ways to gauge a successful pitcher do not always perform well when forecasting how he'll do the following year. According to this research, if next season's Earned Run Average (or Runs Allowed per 9 innings) is regressed onto one of these traditional metrics, these are the resulting R^2 values:

Metric    R^2
K%        0.67
SIERA     0.52
xFIP      0.46
BB%       0.45
FIP       0.35
HR%       0.18
ERA       0.14
BABIP     0.04

Strikeout percentage is the most successful traditional metric when determining future success. Here are the top ten pitchers in K% in 2015:

  • Clayton Kershaw (33.82%)
  • Chris Sale (32.08%)
  • Max Scherzer (30.7%)
  • Carlos Carrasco (29.59%)
  • Chris Archer (29.03%)
  • Corey Kluber (27.65%)
  • Jacob deGrom (27.03%)
  • Jake Arrieta (27.13%)
  • Madison Bumgarner (26.93%)
  • Francisco Liriano (26.52%)
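For anyone unfamiliar with the metric, K% is simply strikeouts divided by batters faced. The short Python sketch below reproduces the top entry on the list. The raw totals (301 strikeouts, 890 batters faced) are Kershaw's 2015 line as I recall it, not figures from the paper, so consider them an assumption.

```python
def strikeout_rate(strikeouts: int, batters_faced: int) -> float:
    """K% as a fraction: strikeouts per batter faced."""
    if batters_faced <= 0:
        raise ValueError("batters_faced must be positive")
    return strikeouts / batters_faced


# Assumed 2015 totals for Clayton Kershaw (not taken from the post).
kershaw_2015 = strikeout_rate(301, 890)
print(f"{kershaw_2015:.2%}")  # 33.82%
```

Because the denominator is batters faced rather than innings, K% is not inflated for pitchers who allow a lot of baserunners, which is one reason it differs from K/9.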

MLB is a quarter of the way through the 2016 season. As it stands, here are the top ten pitchers in K% this year:

  • Jose Fernandez (35.9%)
  • Clayton Kershaw (33.7%)
  • Noah Syndergaard (32.6%)
  • Max Scherzer (31.5%)
  • Stephen Strasburg (30.9%)
  • Danny Salazar (30.3%)
  • David Price (29.4%)
  • Vincent Velasquez (28.8%)
  • Drew Smyly (28.4%)
  • Drew Pomeranz (28.3%)

While many on the 2015 list currently rank just outside of the top ten this year, the comparison shows two things: the difficulty of predicting pitcher success with any traditional metric, and just how consistently dominant Clayton Kershaw and Max Scherzer really are.

This paper discussed combining the aforementioned statistics with an Arsenal/Zone rating. This metric uses PITCHf/x data, which track the speed, movement and placement of every pitch relative to the strike zone. The idea is that, with more data about the specifics of each pitch a pitcher throws, the pitch sequence and which pitches are most sustainable over time, it will be easier to predict success the following season.

Data scientists should always be careful about adding too many variables because of overfitting. In other words, a model with too many variables relative to the amount of data can end up fitting noise, making it hard to find trends that are actually meaningful. Still, this is an intriguing paper, and hopefully this Arsenal/Zone rating can become more readily available to baseball fans in an easily digestible way.

Playoff Unpredictability

Until recently, the Los Angeles Lakers were one of the fixtures of the NBA Playoffs, and in many seasons, the Finals. They have put together dynasties in different generations of the sport, from Magic Johnson's teams to the Shaq and Kobe era. When the Lakers were not winning titles, chances are another team was enjoying its own dynasty, like the Boston Celtics, Chicago Bulls or San Antonio Spurs. Dynasties are so commonplace in the NBA that 15 franchises in the sport's history do not have a championship (and seven of those still in existence never even made it to the Finals).

The NBA is unique in this regard: championships are won in bulk. Other leagues offer more parity, where there is a larger pool of contenders vying for a title. There may be dynasties in other sports, but there seem to be fewer of them, each is shorter in duration and there is a better chance someone unexpected can claim the sport's top prize.

Which of the four top professional sports leagues (NFL, NBA, MLB and NHL) offers the most playoff unpredictability? Is the NBA truly the most predictable? Is it significantly more predictable or marginally so?

One approach to answering these questions is to use a statistical model for each sport. Here, we will use logistic regressions, where we will look at only wins (or points in hockey) and see how well they predict whether a team won a championship that year. Here are some other notes for setting up this project:

- All data used begins with the 1989-90 season because the NFL had the biggest change to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played in the rest of the dataset, so they are still used.

At first glance, every variable representing wins is statistically significant with 99% confidence, which should be obvious because a team needs so many wins just to make the playoffs. What matters is how well wins alone predict championships. In statistical parlance, we will use a goodness-of-fit measure called AIC (Akaike Information Criterion) to answer this question. As this number gets smaller, the model has a better fit. The following shows how well each model performs:

Screen Shot 2016-04-17 at 7.47.11 AM
The larger the bar, the more unpredictable the league is. Again, as expected, the NBA is the most predictable, and by a considerable margin. This model also suggests Major League Baseball is the most unpredictable, with the NFL as a close second and the NHL as a close third.

There are a number of other variables that could be added to these models to help determine who will win a championship, but the simplicity of these models makes for an easier comparison across sports.