By: Edward Egros

The Truth About 3rd Down

Anyone paying attention to stats during an NFL broadcast has noticed 3rd down conversions being reported. It is an easy way for commentators to critique how clutch a team is and whether an offense can sustain a drive when the pressure is at its peak. Obviously, a team converting 100% of its 3rd down attempts is probably winning the game, but otherwise the statistic is not nearly as helpful as suggested.

For this exercise, I took 10 seasons' worth of NFL data (2007-2016) and looked at each team's conversion rates on 1st down, 2nd down and 3rd down, along with the number of regular season wins that team accumulated. Logically, it would make sense to have an increasing percentage with later downs because you often have fewer yards to go before moving the chains. The numbers reflect this trend: on 1st down, teams convert 20% of the time on average; on 2nd down, it's 30.3%; and on 3rd down, it's 38.1%.

To make things simple, I then calculated a linear regression, treating wins as my dependent variable and keeping it continuous so as not to lose information. Here are the results:

[Figure: linear regression results]

As expected, every down is significant to wins at the 99% level, because the more you convert, the greater your chances of success. The degree to which each down matters does go up, as reflected by the coefficients increasing with each successive down. And, even though later downs should be easier to convert, the coefficient is still increasing, perhaps suggesting third down conversions do matter more than first and second.

However, the R-squared and adjusted R-squared only hover around 28%. In other words, conversion rates on all three downs explain only about 28% of the variation in wins, so 3rd down conversion percentage by itself explains less than that figure (22% if the 3rd down rate is the only explanatory variable). While these rates are statistically significant (especially on 3rd down), they are also noisy.

In previous blog posts, I have outlined which factors best determine the outcome of football games (and they are detailed in my Cowboys data visualizations). One reason I never brought up 3rd down conversion rates is how noisy the variable is and how it takes attention away from 1st and 2nd down. Many others have their own ways of determining success based upon the down, but also the distance. I would suggest, for the sake of ease, promoting the discussion of 1st and 2nd down success rates, both as a pair and as a bridge to what is a reasonable 3rd down to convert when those plays occur.

A New Explanation of Cowboys Graphics

For the second straight year, after every Dallas Cowboys game, I will post a recap of the game with an analytic visualization. Once again, these metrics sum up all of the important factors that determine the outcome of a football game. Some of the metrics are the same, while others are more refined and better reflect certain concepts.

Going from the top and working down, once again I will chart turnovers, one of the more impactful statistics in the game. The numbers reflect the turnover margin and the bars reflect how many turnovers were committed.

The next box will look at how the quarterbacks performed, often looking at net yards per pass attempt. This metric is highly predictive, and while others may be more predictive, it is far easier to calculate.

Perhaps the biggest change comes where it is labeled "Time of Possession/Rushing Yards". This metric was designed to determine who "controlled" the game. It has since been updated to look at how many rushing yards a team had per quarter. As noted in a previous blog post, the more rushing yards a team gains later in the game, the likelier it is to win. The larger the number, the better that team "controlled" the game.

Overachiever/Underachiever refers to what the Cowboys' record should be, relative to their point differential for the whole season. In baseball, this idea is referred to as the Pythagorean Expectation. In football, there is debate as to how to calculate such a record, but here, the exponent is 2.37:

Expected Wins = (Points For^2.37 / (Points For^2.37 + Points Against^2.37)) * 16
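As a rough sketch of the formula above, here is how the calculation could look in Python. The function names, parameter names and sample point totals are hypothetical and chosen only for illustration; the 2.37 exponent and the 16-game season are the only pieces taken from the text.

```python
def pythagorean_wins(points_for, points_against, exponent=2.37, games=16):
    """Expected wins from season point totals (Pythagorean Expectation)."""
    if points_for < 0 or points_against < 0:
        raise ValueError("point totals cannot be negative")
    if points_for == 0 and points_against == 0:
        raise ValueError("at least one point total must be positive")
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa) * games


def achievement_gap(actual_wins, points_for, points_against):
    """Actual wins minus expected wins.

    Positive means the team is overachieving (more wins than its points
    suggest); negative means it is underachieving.
    """
    return actual_wins - pythagorean_wins(points_for, points_against)


if __name__ == "__main__":
    # Hypothetical season totals, for illustration only.
    expected = pythagorean_wins(400, 300)
    print(f"Expected wins: {expected:.1f}")
    print(f"Gap for a 12-win team: {achievement_gap(12, 400, 300):+.1f}")
```

A team that scores exactly as many points as it allows comes out at 8 expected wins, which is a quick sanity check on the formula.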

Finally, scoring efficiency has been tweaked. The idea here is to see how many points teams scored, relative to the number of yards they needed. The larger the bar and the bigger the number, the more efficient the team was. Simply put, it's points divided by yards, then multiplied by 15.457886 so that the average is approximately 1. Using data from 2009-2016, we can also see if a team was good, average or bad overall in its efficiency. If the result is less than .949394, the team was inefficient. If the result is between .949395 and 1.057116, the team was average and gets a blue bar. If the result is greater than that range, the team was efficient and gets a green bar.
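The scoring efficiency arithmetic can be sketched in a few lines of Python. The multiplier and the two cutoffs come from the paragraph above; the function names and the sample game totals are hypothetical. The text lists the lower cutoffs as .949394 and .949395, and this sketch simply treats anything from .949394 up to 1.057116 as average.

```python
SCALE = 15.457886       # from the text: makes the average roughly 1
LOWER_BOUND = 0.949394  # below this, the team was inefficient
UPPER_BOUND = 1.057116  # above this, the team was efficient


def scoring_efficiency(points, yards):
    """Points per yard, rescaled so an average team is near 1."""
    if yards <= 0:
        raise ValueError("yards must be positive")
    return points / yards * SCALE


def efficiency_label(value):
    """Bucket an efficiency value into the three ranges from the text."""
    if value < LOWER_BOUND:
        return "inefficient"
    if value <= UPPER_BOUND:
        return "average"
    return "efficient"


if __name__ == "__main__":
    # Hypothetical single-game totals (points, yards), for illustration only.
    for points, yards in [(17, 400), (26, 400), (30, 400)]:
        value = scoring_efficiency(points, yards)
        print(f"{points} points on {yards} yards: "
              f"{value:.3f} ({efficiency_label(value)})")
```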

Again, these metrics are meant to capture nearly everything that happened in a game that pertained to the result. Some of these metrics can also be used to forecast future games, but the intent is solely inference.

No Need to Establish the Run

David Johnson

Arizona Cardinals running back David Johnson (left) may understand the importance of balancing between rushing and passing about as well as anybody. Last season, he finished with the most touches, all-purpose yards and rushing/rec touchdowns of anyone in the NFL. For an encore, his head coach says he wants Johnson to average 30 touches per game.

It's one thing to strike the right balance between how to use Johnson as a rusher and as a receiver; it's another to make these decisions relative to the time of the game. Conventional wisdom in football has always championed the idea of "establishing the run", meaning no matter how long it takes to create an effective run game, it should be a point of emphasis early in a contest. More recently, rushing plays are called less frequently, regardless of what the clock reads. Knowing this recent trend, there is a way to explain why, at least analytically, attempting to establish the run is unnecessary.

I took NFL play-by-play data from the 2010 through 2015 seasons. This information included which team won and lost. Then, using only rushing plays, I summed up the rushing yards each team had per quarter, per game. (In this analysis, I am not including overtime rushing yards, both because of how infrequently they appeared and because of how much they swayed the results, since so many overtime rushing yards essentially end the game.) Using a logit regression with "win" as a binary dependent variable and rushing yards per quarter as my explanatory variables, here is the output:

=========================================
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8447  -0.9786  -0.5544   1.0545   2.0701

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -1.747385   0.105946 -16.493  < 2e-16 ***
yards.gained.1  0.006508   0.001922   3.386 0.000708 ***
yards.gained.2  0.007091   0.001953   3.632 0.000282 ***
yards.gained.3  0.015546   0.001910   8.137 4.05e-16 ***
yards.gained.4  0.035783   0.002156  16.594  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4251.8 on 3066 degrees of freedom
Residual deviance: 3711.2 on 3062 degrees of freedom
AIC: 3721.2

Number of Fisher Scoring iterations: 4
==========================================

First, all of these variables are statistically significant at the 99% level, which makes logical sense. The more yards a team has, no matter the type, the likelier they are to win. Second, there is a direct relationship between the time of the game and the magnitude of the coefficient. In other words, as the game goes on, the more important rushing yards are to the game's outcome. Having the largest coefficient for the fourth quarter makes sense because teams that are leading are trying to take time off the clock, and rushing makes that motive easier to fulfill. However, that the third quarter has a greater magnitude than the first half could suggest there is no statistical advantage to "establishing the run".

It is also important to convert these coefficients to odds ratios to know how important each rushing yard is to winning. Specifically, an extra first quarter yard increases the odds of winning by a factor of 1.0065. In the second quarter, it's 1.0071, a small difference. In the third quarter, it is 1.0157 and in the fourth, it is 1.0364.
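For readers who want to check the conversion, a logit coefficient becomes an odds ratio by exponentiating it. The short Python sketch below uses the four coefficients from the regression output above; the dictionary and function names are hypothetical.

```python
import math

# Coefficients copied from the logit output (rushing yards per quarter).
COEFFICIENTS = {
    "yards.gained.1": 0.006508,
    "yards.gained.2": 0.007091,
    "yards.gained.3": 0.015546,
    "yards.gained.4": 0.035783,
}


def odds_ratio(coefficient, yards=1):
    """Multiplicative change in the odds of winning for `yards` extra yards."""
    return math.exp(coefficient * yards)


if __name__ == "__main__":
    for name, coefficient in COEFFICIENTS.items():
        print(f"{name}: {odds_ratio(coefficient):.4f} per yard, "
              f"{odds_ratio(coefficient, yards=10):.4f} per 10 yards")
```

Because odds ratios multiply rather than add, ten extra fourth quarter yards correspond to roughly 1.43 times the odds of winning, not 1.364 times. These factors apply to the odds, not directly to the win probability.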

There may be value in wearing down a defense by running the ball earlier in a game, but this data and regression do not capture it. It may also be possible a running back needs several carries before knowing how to dissect a defense later in a game; but again, this idea is not captured in the aggregate. In short, establishing the run may not be as crucial an idea as originally thought.

However, one conventional bit of wisdom that is reflected is the idea a team controls the game more effectively by running the ball later in the contest. Quantifying how a team controls a game can be captured using a study like this one. In fact, I plan to use this analysis in my weekly Cowboys postgame graphics that explain why Dallas either won or lost a particular contest. I will go over these upgraded graphics in a later blog post.

(Special thanks to Luke Stanke for providing the data and helping me with the code!)

...One More Thing About the PGA Championship

(Photo courtesy: Stuart Franklin/Getty Images)

At one point, there was a five-way tie atop the leaderboard during the back nine of the final round of the 99th PGA Championship. Then Justin Thomas carded a birdie on the 13th hole and entered the Green Mile, where he posted a par on 16, a birdie on 17 and an insignificant bogey on 18. While the rest of the field struggled to finish, Thomas blazed through the toughest closing stretch at a major this year to capture his first Wanamaker Trophy.

My pick to win, Hideki Matsuyama, fared more than respectably, finishing tied for 5th. But as I watched the television coverage of the moments he struggled, one of the commentators pointed out his performance mirrored that of last year's PGA Championship, where he was the best hitter of the golf ball but could not make any putts. That year, he finished tied for 4th.

This year, Matsuyama missed a few critical putts, but he was 12th in Strokes Gained: Putting. However, he ranked 20th in SG: Approach the Green and 27th in SG: Around the Green. As for the champion, Thomas was tied for 15th in SG: Approach the Green, 22nd in SG: Around the Green and 4th in SG: Putting. Overall, these numbers are only slightly better, yet they added up to a commanding win.

I am reminded of a paper by Dr. George Kondraske of UT Arlington titled "General Systems Performance Theory and its Application to Understanding Complex System Performance". In it, Kondraske attempts to explain human systems through complex machines. Regressions have a number of components that are often considered additive (which is why we have a lot of "+" signs in our equations). But if one explanatory variable is largely deficient, it is not satisfactory to say the dependent variable decreases by the same amount. The output depends upon everything working together; components are so interconnected that any one piece that does not work or is largely deficient means the entire system might fail to perform.

What does this have to do with golf? If someone cannot putt at all, they will post a high score and have no chance of winning a tournament; they cannot simply overcompensate with a longer drive or a more accurate iron shot. Granted, professional golfers are at least competent in every component of a golf game, but any significant deficiency makes for a bigger setback than simply subtracting odds to win based upon a negative strokes gained metric.

This approach is intuitive to golf enthusiasts. It is why golfers work on everything, not just the skills at which they excel. What matters here is that when data scientists put together models for forecasting winners, it may be important to think less linearly. Maybe it has less to do with the sum of skills coming together and how they fit with a particular course, and more to do with whether every skill is adequate for the demands of a specific tournament. Justin Thomas' skills certainly were.
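To make the contrast concrete, here is a small Python sketch of the two ways of thinking: an additive score that sums skills, and an adequacy check that asks whether any single skill falls below what a tournament demands. This is only an illustration of the idea in this post, not Kondraske's actual formulation. The skill names, strokes gained values and demand thresholds are all hypothetical.

```python
def additive_score(strokes_gained):
    """Linear view: skills simply add up, so a strength offsets a weakness."""
    return sum(strokes_gained.values())


def limiting_skills(strokes_gained, demands):
    """Adequacy view: list every skill that falls short of the demand."""
    return [
        skill
        for skill, needed in demands.items()
        if strokes_gained.get(skill, float("-inf")) < needed
    ]


def is_adequate(strokes_gained, demands):
    """True only when no skill falls short of what the tournament demands."""
    return not limiting_skills(strokes_gained, demands)


if __name__ == "__main__":
    # Hypothetical minimum strokes gained per round needed to contend.
    demands = {"off_the_tee": -0.5, "approach": -0.5,
               "around_green": -0.5, "putting": -0.5}
    # Hypothetical golfers: A is elite tee to green but cannot putt.
    golfer_a = {"off_the_tee": 1.2, "approach": 1.0,
                "around_green": 0.3, "putting": -1.5}
    golfer_b = {"off_the_tee": 0.3, "approach": 0.2,
                "around_green": 0.1, "putting": 0.2}
    for name, golfer in [("A", golfer_a), ("B", golfer_b)]:
        print(name, round(additive_score(golfer), 2),
              is_adequate(golfer, demands), limiting_skills(golfer, demands))
```

In this made-up example, golfer A has the higher additive total, yet only golfer B clears every threshold, which is the kind of case a purely additive model would miss.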

Who Will Win the 2017 PGA Championship?

This year, the Wanamaker Trophy will be claimed at Quail Hollow Club, the same course that hosts the Wells Fargo Championship (previously the Wachovia Championship). No analysis of this year's PGA Championship would be robust without discussing Rory McIlroy's domination there.

A favorite to win the last major of the season, McIlroy has two victories and one playoff loss in seven appearances there. He also made the cut six of seven times and owns the course record, shooting a 61 in 2015. Also, as I mentioned in a previous article, McIlroy is not only successful in PGA Championships, he is one of the more dominant golfers of any specific event on Tour (even if that major is a hodgepodge of characteristics where no particular abilities stand out). Add to his resume a pair of Top 5 finishes in his last two tournaments, and McIlroy seems poised to win the PGA Championship for the third time.

However, as we have learned with other tournaments, Strokes Gained statistics have incredible predictive power. When it comes to who has won in North Carolina before, sometimes an already dominant golfer came in and continued his momentum to victory. More recently, Strokes Gained: Around-the-Green has become more crucial to success:

[Figure: Strokes Gained: Around-the-Green chart]

There are two periods when a player needed to rank in the Top 40 in SG: Around-the-Green: 2005-2007 and 2014-2016. This season, the Wells Fargo Championship was played elsewhere so Quail Hollow could be redone for a major. The two important changes here are the removal of trees and the adjusting of the front nine to where the final yardage is shorter but likely more challenging. It's possible these two details make SG: Around-the-Green all the more important.

At this point, the players leading in this statistic are: Ian Poulter, Jason Day, Bill Haas, Pat Perez and Cameron Smith. McIlroy barely cracks the Top 80. Jordan Spieth, another favorite who could complete the career Grand Slam at age 24, is 18th. As for Strokes Gained: Off-the-Tee, another stat with some predictive power, the current leaders are Jon Rahm, Dustin Johnson and Sergio Garcia. In terms of skills shown this season, there are several players who are perhaps more suited to win a revamped Quail Hollow than the favorites.

Perhaps the one player who seems to have put it all together, at this point, is Hideki Matsuyama. Fresh off a win at the WGC-Bridgestone Invitational, he is one of only four players with three wins on Tour this season. He also ranks 11th in Strokes Gained: Around-the-Green and 11th in Strokes Gained: Off-the-Tee. Lastly, he finished fourth in last year's PGA Championship and has two Top 20 finishes in the last four seasons. In other words, he makes up for statistical rankings slightly lower than those of the aforementioned players with overwhelming momentum and overall success at this specific event. While I expect solid games from the favorites, I am picking Hideki Matsuyama to capture his first major.