By: Edward Egros


NCAA Tournament Dos and Don'ts

With every NCAA Tournament come firmer beliefs that THIS is the year the secrets will be revealed, the formulas will be solved and the proprietor of how to win your pool will be unmasked like a Scooby-Doo villain.

Alas, not everything can be predicted.

Still, many tactics for winning your pool have stood the test of time. Last year, I went over a few strategies which I will revisit:

First, start with your National Champion. While Duke seems to be a heavy favorite, Virginia ranks first in KenPom, and college sports data scientist Luke Benz has Gonzaga winning most frequently in his simulations. One significant variable is the overall talent at point guard, and the most talented seem to play for Purdue and Michigan State. Certainly the Boilermakers and Spartans are riskier, but any one of these four teams is an acceptable alternative to the heavy favorite.

Next, you do not need as many upsets as you might think. Look at the size of your tournament pool and adjust the number of total upsets based upon how many other brackets you are up against. Per Andrew Beaton:

Pasted Graphic 1


If you are up against something much greater, you can see the growth is asymptotic, so do not go crazy picking dozens of underdogs.

Lastly, look at who everyone else is picking and choose those that are being undervalued. ESPN's Tournament Challenge shares its popular vote, and if you go by KenPom and Benz's simulations, participants are overvaluing North Carolina, Kentucky and Villanova, and undervaluing Virginia, Gonzaga and Virginia Tech. Regional bias may cause more people to pick a team close to their area, and if it is being overvalued, there may be an opportunity to oust them early and gain points others will not.
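One rough way to put "undervalued" into numbers is to divide a team's modeled chance of winning by the share of brackets picking it. The sketch below is only an illustration: the function name `pick_leverage`, the team labels and every number are hypothetical, not real KenPom, Benz or ESPN figures.

```python
def pick_leverage(model_prob, pick_share):
    """Ratio of a team's modeled title chance to the share of brackets picking it.

    Above 1.0 suggests the public is undervaluing the team; below 1.0, overvaluing.
    """
    if pick_share <= 0:
        raise ValueError("pick_share must be positive")
    return model_prob / pick_share


# Hypothetical (model probability, public pick share) pairs for illustration only.
teams = {
    "Team A": (0.20, 0.35),  # popular favorite
    "Team B": (0.15, 0.08),  # strong team few people pick
    "Team C": (0.05, 0.05),  # valued about right
}

# Sort from most undervalued to most overvalued.
ranked = sorted(teams, key=lambda name: pick_leverage(*teams[name]), reverse=True)

for name in ranked:
    prob, share = teams[name]
    print(f"{name}: leverage {pick_leverage(prob, share):.2f}")
```

A ratio like this ignores pool size and scoring rules, so treat it as a first filter rather than a complete strategy.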

Though I cannot independently confirm, there are roughly as many analytical projections of the perfect bracket as there are actual realistic permutations. What research can show is which factors typically matter. For starters, geography matters. Teams that play significantly closer to a site than their opponents have a tangible advantage. Two of the more notable matchups this year involve Oregon/Wisconsin and UC Irvine/Kansas State, both happening in San Jose, CA. The Ducks are roughly 558 miles away from the arena, while the Badgers are four times that distance. UC Irvine is located approximately 382 miles from the SAP Center, whereas Kansas State must travel more than 1,700 miles to play its first game of the tournament.

Preseason rankings also matter: no matter how the season played out, projections made before the first games are played typically perform well (good news for Kansas, Kentucky and Gonzaga). Offensive and defensive efficiency metrics are also significant (though never foolproof). One factor that does not seem to be important is team experience. While pundits seem to enjoy citing whether a team has been through the tournament rigors before or how many seniors and juniors it has, young teams have performed well and senior-laden teams have been upset early. Overall, it should not factor into decisions.

Taking as many of these factors into account as possible, here are two brackets I have submitted to a pool or two. The "Gonzaga" bracket is for smaller pools and the "Michigan State" bracket is for larger pools. I will blend these two into my Final Four picks for television, podcasts and such (subject to change).


Pasted Graphic 2

Pasted Graphic 3

May the odds be ever in your favor.

A Review of SXSW

For one weekend in Austin, while others attended festivals, concerts and technology symposiums, I wanted to explore the sports and media panel discussions at SXSW, a conference that bills itself as celebrating the convergence of the interactive, film and music industries. First, to be a panelist at SXSW, you must master the art of the humble brag. For instance, to paraphrase a filmmaker's list of his two main goals for any of his products, he said that he must love his work and his audience must love his work.

Pause to be inspired.

A couple of panel hosts reiterated their resumes aloud without using this background for an educated question.

Pause to emote.

One of the better tidbits of advice I ever received as a writer is to know your audience. While SXSW attracts a diverse group of scholars, professionals and fun-lovers, oftentimes this gallery wants to learn how to become the person they are watching, or at least morph into a unique iteration. If keeping trade secrets serves as the highest priority, at least teach us something that will help an aspiring audience. A disappointing talk featured a panel promoting "The Weekly", a television endeavor from the New York Times committed to long-form storytelling. Instead of previewing at least one specific story without a fancy montage, the conversation felt more like an academic presentation, philosophizing about why this format "will" work. I did want to learn more about the program, but the panel felt so much like an infomercial, Ron Popeil should have moderated. I wished for something with takeaways for my own work as a journalist, not a payment installment plan.

Oftentimes panels with star power offered fewer lessons about the ways of the world than those whose panelists let their experience speak for itself. One exception was NBA champion Chris Bosh, an advocate of analytics and an elite karaoke singer. It took losing to the Dallas Mavericks in 2011 and signing Shane Battier to make Bosh a believer in the numbers, so much so that he rightfully credits the adjusted approach with his 2012 NBA Championship. Alongside Spurs general manager R.C. Buford, the panel devoted to the future of basketball discussed the importance of the three-point shot, and how big men are taught to handle the ball like a guard and be unafraid to launch from deep.

Perhaps my favorite panel featured a discussion on how data has changed the NFL. In particular, Sarah Bailey (analyst with the Los Angeles Rams) and Namita Nandakumar (analyst with the Philadelphia Eagles and noted pugilist) discussed how they make the most of a small analytics staff and why any quantitative strides should be celebrated, not interpreted as a "glass half-empty" endeavor. For instance, while quants wish coaches would go for it on 4th down more often, Nandakumar advises at least taking satisfaction that coaches are getting smarter about it. A sobering reality to consider is how few jobs are available to data scientists in pro football, so self-promotion and engaging research projects in other sports are vital in this competitive industry. Bailey also offers nuanced advice for data scientists: when working with large datasets, take a subset, crunch the numbers in the R programming language, and if you are satisfied with the results, take a larger dataset and use the Python programming language.

The last lecture I attended featured Paul Bracewell, Managing Director of DOT Loves Data, discussing how to use machine learning to rate athletes in a variety of sports, most notably international events like rugby and cricket. Some of the better instruction he offered involved how to discuss analytics with coaches and players: "When predictivity is used as a benchmark, the model needs to generate supporting output to explain any departure from the predicted results". In other words, explain why a model works or doesn't work in a specific situation. Transparency and meaning to build trust are of the utmost concern.

I wrapped up my time recording an episode of "Outside the Box" with our usual EPlay crew. It featured a spirited conversation as to who is responsible for preventing young players from adopting a catch-and-shoot approach to basketball that some believe analytic enthusiasts have espoused. Trust me, you want to listen to this one. Until then, my biggest takeaway from my first SXSW is that the opportunities to learn and share ideas exist, but they often remain hidden. Until I revisit it all, I will take a break and watch a video of me petting a robot puppy at one of the innovation exhibits.

The Art of the Comeback

Last November, arguably five million people attended the Chicago Cubs victory parade, celebrating the team's first World Series Championship since 1908.

Last summer, Cleveland hosted hundreds of thousands of Cavaliers fans to celebrate that franchise's first title and the city's first pro championship in more than half a century.

This year in New England, they constantly win. We move on.

The common storyline among these three winners is "The Comeback". The Cubs overcame a 3-1 deficit in the World Series to claim their championship in an extra-inning Game 7, the Cavaliers also stormed back from down 3-1 in the NBA Finals and the Patriots trailed Atlanta by 25 in the second half of Super Bowl LI before winning in overtime. These comebacks were also nearly unprecedented. Only five teams had come back from down 3-1 to win the World Series before the Cubs. Cleveland became the first NBA team to overcome a 3-1 deficit in the Finals to win. And New England's 25-point comeback win is the largest in Super Bowl history. The second largest ever is merely ten points.

This confluence of sports drama may seem like supernatural intervention, but perhaps it can be explained in earthlier terms. In 2011, Brian Skinner published "Scoring Strategies for the Underdog: A General, Quantitative Method for Determining Optimal Sports Strategies". Skinner explained how underdogs must call riskier plays to have a chance at success. In this case, we can refer to teams significantly trailing in series and games as underdogs when their probability of winning is significantly below 50%. Calling riskier plays might mean getting shellacked, but by finding specifically how much riskier a team should get, it might be the only way for those trailing to win.
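To see why extra risk can help the team that is behind, here is a minimal sketch under a simplifying assumption that is mine, not necessarily the paper's: treat the final scoring margin as normally distributed, so a strategy is described only by its expected margin and its spread. The function name `win_probability` and all of the numbers are hypothetical.

```python
from statistics import NormalDist


def win_probability(expected_margin, margin_sd):
    """Chance the final margin is above zero, assuming a normally distributed margin."""
    return 1.0 - NormalDist(mu=expected_margin, sigma=margin_sd).cdf(0.0)


# Hypothetical underdog: expected to lose by 6 with a conservative game plan.
# The risky plan costs a point of expected margin but widens the spread.
underdog_safe = win_probability(expected_margin=-6.0, margin_sd=8.0)
underdog_risky = win_probability(expected_margin=-7.0, margin_sd=14.0)

# Hypothetical favorite facing the mirror-image choice.
favorite_safe = win_probability(expected_margin=6.0, margin_sd=8.0)
favorite_risky = win_probability(expected_margin=5.0, margin_sd=14.0)

print(f"Underdog: safe {underdog_safe:.3f}, risky {underdog_risky:.3f}")
print(f"Favorite: safe {favorite_safe:.3f}, risky {favorite_risky:.3f}")
```

With these made-up inputs the underdog's chances improve with the riskier plan even though its expected margin gets worse, while the favorite is better off playing it safe.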

Baseball closers are niche pitchers, often asked to pitch only one inning, with their team holding the lead. Aroldis Chapman, the Cubs' closer, came in to pitch 2.2 innings in Game 5, 1.1 innings in Game 6 and 1.1 innings in Game 7. Chapman had one day of rest before he pitched in Game 5, another day of rest before Game 6 and no days off before Game 7. While he did allow three earned runs in the last two games, manager Joe Maddon believed the risky strategy of extending his closer was the only way to overcome the 3-1 deficit. Chapman did allow runs, but it left other relievers fresh for longer games. Hitters were also asked to swing for home runs, not mere singles or doubles. The Cubs ranked 13th in home runs last season, but in the World Series, they recorded at least one home run in Games 5, 6 and 7, en route to their title.

In basketball, Skinner's paper discussed two key concepts pertinent to the Cavs: how often to shoot 3's and when to stall. The logic in the first case is that, depending upon how many possessions are left in the game, a team should resort to shooting triples when reaching its critical threshold. In the regular season, Cleveland ranked 7th in the NBA in three-point shooting percentage and 3rd in three-point shooting attempts, but it was going up against the Golden State Warriors, who ranked first in both categories. Two of the Cavs' three highest rates of three-point shooting in that series came in Games 6 and 7, two must-win games. As for pace, while Golden State had the second most possessions per 48 minutes in the NBA, Cleveland ranked 27th out of 30 teams. However, the Cavs played at a faster pace in Games 5 and 6, resorting to a style more like the Warriors' rather than shortening the game, as is suggested for underdogs. It is worth noting there was a slower pace for Game 7, the most dramatic in the entire series.

Lastly, the Patriots helped themselves and the Falcons maimed themselves because of risk-taking. Once Atlanta led 28-3, New England resorted to 40 pass plays (including sacks) and just 10 rushes. Before the deficit, the Patriots passed the ball 34 times and ran it 15 times, relying significantly more on the ground attack. Also, some of Brady's longest completions occurred in the 4th quarter during the comeback. On the other side, Matt Ryan and the Falcons leaned towards passing more frequently in the final minutes rather than sticking to the ground game, which would have taken more time off the clock. Perhaps the most egregious example was when Atlanta had the ball at the New England 22-yard line with 4:40 left in the game and leading by eight. Instead of Atlanta running the ball three times and going for a two-possession lead, a sack, a pass (wiped away by offensive holding) and an incompletion took the Falcons out of field goal range AND gave Tom Brady 3:30 to tie the game. Overall, even play-count disparity factored into the outcome; Brady kept the Falcons' defense on the field and Ryan could not give his teammates a break.

Teams in any sport can calculate when it is time to run riskier plays. Many recent and high-profile examples suggest comebacks are more possible than ever before, when the right tactics are implemented.

There is a postscript: win probability charts have become more popular than ever. But these games and series show that something seemingly calculated to have a 0.7% probability of happening can occur. Because underdogs can increase their own variance with their playcalling, perhaps these charts need to be updated in some way. Fortunately, this discussion is ongoing.

A New NCAA Tournament

There's no doubting the increased awareness of analytics in predicting the NCAA tournament field in college basketball. Instead of just diagnosing a team's record against the Top 50, it is a team's Rating Percentage Index or Ken Pomeroy ranking that is becoming more commonplace. It has gotten to where data scientists are actually meeting with the NCAA to determine if one metric should be used above all others to pick tournament teams.

Perhaps surprisingly, data scientists want simpler criteria for picking teams: who wins, who loses and whom you have played. This is opposed to other explanatory variables used in more advanced metrics, like margin of victory and offensive/defensive efficiency. Coaches, on the other hand, would prefer more complex formulae for determining the tournament field. Logically, this approach makes more sense from their perspective, because of competition. If a coach has figured out a style of play or way to schedule opponents that increases the likelihood of making the tournament, they develop a competitive advantage. Data scientists want to keep it simple for fans; coaches want to figure out a competitive advantage.

Perhaps in this same spirit of transparency, the tournament selection committee released "in-season" projections for the first time ever, one month before Selection Sunday. It only has the top four seeds of every region, but it is added information for where highly ranked teams really sit. As with any analytic project, more data "usually" means more robust forecasts. Already, it is easier to make more accurate assumptions and offer a better glimpse as to what the committee is looking for.

However, these in-season projections do not include the full field of 68, and what usually causes the most consternation is simply who does and does not make the dance. While it makes sense not to include the full field because you have to assume certain conference champions in mid-major conferences, something that would include all "at large" teams would provide even more information as to the criteria for inclusion.

Nothing is easy about picking 68 teams to play in a tournament, and while analytics may be helpful in forecasting a Final Four, easy-to-understand criteria can help teams and fans quell any controversy.

How Predictive Is Scoring Differential?

How important is an impenetrable goalie in the NHL? How much better is it to outscore opponents throughout the season, as opposed to dominating them defensively? Overall, how important is point differential to overall success?

In an earlier blog post, I discussed playoff unpredictability when it comes to determining who will win a championship based upon how many games that team won. There, the NBA was the most predictable, followed by the NHL and the NFL, with MLB the most unpredictable (unless, of course, you are the 2016 Chicago Cubs). But how does point differential (or run differential in baseball or goal differential in hockey) translate to winning championships? And which league is most predictable when looking at that specific metric?

Once again, I am using logistic regressions with one explanatory variable each and whether that team won a championship as the dependent variable. However, this time I am using three models per sport: offensive output, defensive output and scoring differential. Also once again, here is what is noteworthy with our datasets:

- All data used begins with the 1989-90 season because the NFL had the biggest change to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played the rest of the dataset, so they are still used.

Each explanatory variable has the appropriate and logical coefficient. In other words, scoring variables have a positive coefficient, defensive variables have a negative coefficient and scoring differential variables have a much larger positive coefficient. All of this equates to a better probability of winning a championship. Each variable is also statistically significant at the 95% confidence level, which is to be expected. A better offense, defense and scoring differential will obviously increase the likelihood of winning a championship. What is not clear is which of these indicators is most predictive. A goodness-of-fit measure called AIC (Akaike Information Criterion) can shed some light. As this number gets smaller, the model has a better fit, explaining away more of the randomness of that sport.
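For anyone who wants to see the mechanics, below is a minimal sketch of a one-variable logistic regression and its AIC, written in plain Python. It is not the code behind the charts in this post: the Newton-Raphson fitting routine, the function names and the tiny datasets are all hypothetical, and a real analysis would normally lean on a statistics package instead. AIC here is 2k minus twice the maximized log-likelihood, with k = 2 parameters (intercept and slope).

```python
import math


def _prob(b0, b1, x):
    """Logistic curve: modeled chance of a title given explanatory value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_logistic(xs, ys, iterations=25):
    """Fit intercept b0 and slope b1 by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = _prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to b0
            g1 += (y - p) * x      # gradient with respect to b1
            h00 += w               # entries of the (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def aic(xs, ys):
    """AIC = 2k - 2 * log-likelihood, with k = 2 fitted parameters."""
    b0, b1 = fit_logistic(xs, ys)
    log_lik = sum(
        math.log(_prob(b0, b1, x)) if y else math.log(1.0 - _prob(b0, b1, x))
        for x, y in zip(xs, ys)
    )
    return 2 * 2 - 2 * log_lik


# Hypothetical data: 1 means the team won the title, 0 means it did not.
won_title = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# Made-up standardized metric that tracks titles closely.
strong_metric = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -0.8, 0.3, 1.2, 0.8]
# Made-up standardized metric that barely tracks titles.
weak_metric = [0.5, -1.0, 1.5, 0.0, -0.5, 1.0, -1.5, 0.3, 2.0, -0.8, 0.2, 0.8]

print(f"AIC, strong metric: {aic(strong_metric, won_title):.2f}")
print(f"AIC, weak metric:   {aic(weak_metric, won_title):.2f}")
```

Because both models use the same outcomes and the same number of parameters, the lower AIC simply reflects the metric that does the better job of separating champions from everyone else.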

The first chart is points (or runs or goals) modeled against championships:

Pasted Graphic 1

Before analyzing this chart, it is important to note the value of each point, goal and run, compared with the other sports. In 2016, the average MLB team scored 726 runs for the season. This number is different from the 325 points scored, on average, by an NFL team in 2015, the 8,419 points scored by an NBA team last season and the 222 goals scored by an NHL team last season. Fortunately, the variation across each league is not so different that comparison becomes impossible.

In the chart, we see goals in hockey as being the best predictor for winning its championship, with football being slightly more random, then basketball, then baseball finishing as the most random. So far, these results are consistent with the previous study where MLB's postseason was the toughest to predict, based upon number of wins during the regular season. Basketball makes intuitive sense because teams play at different paces, and it is not conclusive if playing at a faster rate (which scores more points but not necessarily more points per possession) is the best way to win a title.

The next chart illustrates runs, points, and goals allowed, modeled against winning a championship:

Pasted Graphic 2

Comparatively, the trends are almost the same as they are with offensive output: Major League Baseball is the most random, followed by the NBA. However, an NFL scoring defense is now a better indicator than an NHL scoring defense, but only slightly so.

Now, let's combine these two charts into scoring differential, modeled against a championship:

Pasted Graphic 3

Here, we learn point differential is more predictive in basketball than in any other sport. Remember how different teams playing at different paces obscures the importance of points alone? Including the defensive component erases pace of play and gives a clearer predictor. It also coincides with how a win total in basketball is most predictive for winning a championship. Football and hockey are nearly equal in predictive ability and baseball is a distant fourth.

There are more trends to uncover if we combine all of these charts:

Pasted Graphic 4

In nearly every sport, scoring defense is more predictive than offense (with hockey being the lone exception). Scoring differential is predictably better for analysis than offense or defense by itself, but the degree to which it takes away the randomness is different for each sport. It is only a slight improvement in the NFL, but a drastic improvement for basketball.

Overall, these proportions could prove helpful when determining if a team is going in the right direction when devoting resources to offense and defense. Both are necessary, but perhaps more money should be proportionally allocated to the areas that best predict who will win a championship.

Playoff Unpredictability

Until recently, the Los Angeles Lakers were one of the fixtures of the NBA Playoffs, and in many seasons, the Finals. They have put together dynasties in different generations of the sport, from Magic Johnson's teams to the Shaq and Kobe era. When the Lakers were not winning titles, chances are another team was enjoying its own dynasty, like the Boston Celtics, Chicago Bulls or San Antonio Spurs. Dynasties are so commonplace in the NBA, 15 franchises in the sport's history do not have a championship (and seven of those still in existence never even made it to the Finals).

The NBA is unique in this regard: championships are won in bulk. Other leagues offer more parity, where there is a larger pool of contenders vying for a title. There may be dynasties in other sports, but there seem to be fewer of them, each is shorter in duration and there is a better chance someone unexpected can claim the sport's top prize.

Which of the four top professional sports leagues (NFL, NBA, MLB and NHL) offers the most playoff unpredictability? Is the NBA truly the most predictable? Is it significantly more predictable or marginally so?

One approach to answering these questions is by using a statistical model for each sport. Here, we will use logistic regressions, where we will look at only wins (or points in hockey) and see how well they predict whether a team won a championship that year. Here are some other notes for setting up this project:

- All data used begins with the 1989-90 season because the NFL had the biggest change to its playoff format at the turn of the new decade.

- Any season in any sport where a lockout shortened the number of games played considerably was removed (e.g., the 1998-99 NBA season, the 2012-13 NHL season, etc.)

- Though the NHL played 80 and 84 games in a few of these seasons, these numbers are not significantly different from the 82 played the rest of the dataset, so they are still used.

At first glance, every variable representing wins is statistically significant at the 99% confidence level, which should be obvious because you need so many wins just to make the playoffs. What matters is how well wins alone predict championships. In statistical parlance, we will use a goodness-of-fit measure called AIC (Akaike Information Criterion) to answer this question. As this number gets smaller, the model has a better fit. The following shows how well each model performs:

Screen Shot 2016-04-17 at 7.47.11 AM
The larger the bar, the more unpredictable the league is. Again, as expected, the NBA is the most predictable, and by a considerable margin. This model also suggests Major League Baseball is the most unpredictable, with the NFL as a close second and the NHL as a close third.

There are a number of other variables that could be added to these models to help determine who will win a championship, but the simplicity of these models makes for an easier comparison across sports.

Evaluating Your Bracket

The Law of Conservation of Mass tells us: matter is neither created nor destroyed. When you burn your horribly incorrect college basketball bracket, remember, you never destroyed it; it is in another form somewhere in the universe. So instead of ignoring your transgressions, let's embrace what still exists and see which approaches were the best when predicting who will be in the Final Four.

There's a one-seed (North Carolina), a couple of two-seeds (Villanova and Oklahoma) and a 10-seed (Syracuse). There is not as much parity with this quartet as with some tournaments in the last few years. Still, some of the favorites to win the National Championship did not survive the first two weeks of this crucible. For instance, the top three teams in the Pythagorean Rating at the end of the conference tournaments are not playing in Houston. In fact, Syracuse did not even crack the top 25, until recently. ESPN's Basketball Power Index offers these rankings: North Carolina (1), Villanova (3), Oklahoma (6) and Syracuse (39). The LRMC Basketball Rankings still have their second-, third- and seventh-ranked teams alive, but rank the Orange 41st.

Some computer models have resorted to predictions without solely implementing historical data. How is this possible? Microsoft's search engine, Bing, uses social media to determine which teams will survive and advance. It has already proven successful in other sporting events like the World Cup and NFL games. But how did it fare for this tournament? Sadly for Bing, it only predicted one Final Four team correctly (North Carolina). In fact, the system predicted the Orange to lose their first game.

It should be clear by now which two schools ruined this tournament's predictiveness: Kansas and Syracuse. The Jayhawks were the top team by nearly all accounts, yet lost in the Regional Final, perhaps uncharacteristically. At the other end of the spectrum, Syracuse could be the worst team ever to make the Final Four. There have been 11-seeds to make it to the final weekend of the season, but many debated if Syracuse even deserved to make the tournament. Their RPI was 72 at the time of selection, worse than other schools that were not chosen (e.g. Valparaiso, San Diego St. and St. Bonaventure). Instead of the favorite vying for the National Championship, it's the controversial at-large two wins away from glory.

Even listening to me would not have been wise. Using my own system, I only correctly predicted one team (and it was a different school than what I said was coming out of that Region on Fox 4). My National Champion was knocked out during the Elite Eight (Kansas) and my second place team lost in the First Round (Michigan St.).

So what is the best way to fill out your bracket for the next tournament?

I don't know.