Data Variety
What were we thinking?
The first step when creating Waiver Coach was to build a model that predicted player performance based on stats from previous games. Once that model was built, we started wondering:
What other data sources could we use to improve our predictions?
A couple of things came to mind right away. Most obvious were the same expert predictions that inspired Waiver Coach. Web sites like NFL.com and ESPN.com publish predictions made by some of the top analysts in football. We also thought that the odds and the over/under line might help us tap the collective wisdom of all the gamblers in Vegas. Weather data was another source that kept popping up in discussions. It is intuitive that players would be affected by rain, snow, or cold, at least in games played outdoors. In the interest of time, we started with the most promising source: expert predictions.
Getting the data
So we went about compiling the expert predictions. The Python library BeautifulSoup made it pretty easy. After identifying the URL and all of the querystring arguments we needed to manipulate, we could do something like this:
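    from urllib.request import urlopen  # Python 3; on Python 2 use urllib2
    from bs4 import BeautifulSoup

    # wk is the week number and offset is the startIndex paging value
    url = ("http://games.espn.go.com/ffl/tools/projections?&scoringPeriodId="
           + str(wk) + "&seasonId=2015&startIndex=" + str(offset))
    soup = BeautifulSoup(urlopen(url), "html.parser")
    table = soup.find('table', {'id': 'playertable_0'})
    # Keep only the rows whose class list includes pncPlayerRow
    rows = table.find_all('tr', {'class': lambda x: x and 'pncPlayerRow'
                                 in x.split()})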
This code loads a page and finds the HTML table with an id of playertable_0. All of the rows in that table with the CSS class pncPlayerRow are stored in a list, which we can loop through to grab the columns we are interested in.
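As a rough illustration of that loop, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The snippet, the column order, the player names, and the parse_projection_rows helper are all assumptions made up for this example; the real page has more columns and a different layout.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the projections page.
# Names and numbers are placeholders, not real projections.
SAMPLE_HTML = """
<table id="playertable_0">
  <tr class="playerTableBgRowHead">
    <td>PLAYER</td><td>RUSH</td><td>YDS</td><td>TD</td>
  </tr>
  <tr class="pncPlayerRow playerTableBgRow0">
    <td>Example Back, RB</td><td>18</td><td>82</td><td>0.6</td>
  </tr>
  <tr class="pncPlayerRow playerTableBgRow1">
    <td>Sample Runner, RB</td><td>12</td><td>51</td><td>0.3</td>
  </tr>
</table>
"""


def parse_projection_rows(html):
    """Return one dict per player row (assumed column order)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "playertable_0"})
    rows = table.find_all(
        "tr", {"class": lambda x: x and "pncPlayerRow" in x.split()}
    )
    projections = []
    for row in rows:
        # Pull the visible text out of every cell in the row.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        projections.append({
            "player": cells[0],
            "rush_att": float(cells[1]),
            "rush_yds": float(cells[2]),
            "rush_td": float(cells[3]),
        })
    return projections


projections = parse_projection_rows(SAMPLE_HTML)
```

The header row is skipped because it does not carry the pncPlayerRow class, so only the two player rows end up in the list.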
This same general process was repeated for a couple of sites that displayed their predictions back to the beginning of the season. This history was important because we needed data to train with and we had a deadline! The end result of this web scraping was several expert predictions, for each week, of every statistic we were already predicting from the historical data: rushing yards, attempts, touchdowns, etc.
Making a prediction
After we had gathered the source data into our database, we connected to it with Python and loaded it into a pandas dataframe. Now all we needed was the actual performance statistics for each week of the season. These actual statistics were queried from nfldb and stored in another dataframe. The two dataframes were joined on the week and year columns plus a key column built from the player's name by lowercasing it and stripping out punctuation and whitespace. Thankfully almost all players in the NFL have a unique name. After all the gathering and data structuring were complete, making a prediction was a breeze.
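The name-key join described above might look something like the sketch below. The name_key helper, the column names, the player names, and the numbers are all hypothetical placeholders; this is only meant to show the shape of the join, not our exact code.

```python
import string

import pandas as pd


def name_key(name):
    """Lowercase a player name and drop punctuation and whitespace."""
    drop = string.punctuation + string.whitespace
    return name.lower().translate(str.maketrans("", "", drop))


# Placeholder data: the two sources spell the same names differently.
experts = pd.DataFrame({
    "year": [2015, 2015],
    "week": [1, 1],
    "name": ["J.J. O'Sample", "Example Back Jr."],
    "expert_rush_yds": [82.0, 51.0],
})
actuals = pd.DataFrame({
    "year": [2015, 2015],
    "week": [1, 1],
    "name": ["JJ OSample", "Example Back Jr"],
    "actual_rush_yds": [74, 60],
})

# Build the same normalized key on both sides, then join on it.
experts = experts.assign(key=experts["name"].map(name_key))
actuals = actuals.assign(key=actuals["name"].map(name_key))

merged = experts.merge(
    actuals[["year", "week", "key", "actual_rush_yds"]],
    on=["year", "week", "key"],
    how="inner",
)
```

An inner join silently drops players whose keys do not match, so it is worth comparing row counts before and after the merge to catch names that normalize differently.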
The output of this process was a couple of new predictions. There is a prediction of each player's stats based solely on expert opinions. More importantly, there is also a prediction that used all of the data scraped from the web as features along with our original prediction based on historical data. This gives us the best of both data sources in a single value.
Other data sources
Vegas totals and spreads are the only other data sources that we have added to our analysis. The intuition here is that a team that is heavily favored is likely to rush the ball more, piling on points for the running backs, while a game with a high total is likely to see a lot of points scored overall by both teams' skill players.
After manually compiling these two data points for every game back to 2009, we did a little feature engineering. The spread is only useful when you know who is favored. So we created a new feature that was the spread multiplied by either 1 (if the current player's team was favored) or -1 (for the underdog). Then a model was built to predict the error left over from our historical model. We referred to this as our Vegas-adjusted prediction.
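To make the two steps concrete, here is a small sketch of the signed spread and of a model fit on the leftover error. It assumes the spread is stored as a positive number of points, and it uses a one-variable least-squares line purely as a stand-in for the residual model. The function names and the numbers are invented for illustration.

```python
def signed_spread(spread, player_team, favored_team):
    """Return +spread for the favored team's players, -spread otherwise.

    Assumes spread is stored as a positive number of points.
    """
    sign = 1 if player_team == favored_team else -1
    return sign * spread


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def vegas_adjusted(base_prediction, feature, slope, intercept):
    """Add the predicted residual back onto the historical-model prediction."""
    return base_prediction + intercept + slope * feature


# Placeholder training data: signed spreads and the historical model's
# residuals (actual minus predicted) for the same player-games.
features = [-7, -3, 3, 7]
residuals = [-14.0, -6.0, 6.0, 14.0]

slope, intercept = fit_line(features, residuals)
adjusted = vegas_adjusted(60.0, signed_spread(7, "PIT", "PIT"), slope, intercept)
```

Fitting on the residual rather than on the raw statistic means the Vegas feature only has to explain what the historical model missed.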
Results
Imagine our chagrin when these additional data sources did nothing to improve our results! Undeterred, we have a few ideas for how to incorporate these additional data sources more effectively.
Expert Opinions
- Use fewer, higher quality opinions (e.g., CBS) to reduce noise
- Create a feature that is the difference between our prediction and the expert's prediction
- Test combining the predictions using different models
Vegas Data
- Create a feature that is the difference between a team's average point total and what Vegas thinks they will score
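One way to get "what Vegas thinks they will score" is the implied team total, which splits the game total using the spread. The sketch below assumes the spread is a positive number of points for the favorite; the function names and example numbers are made up for illustration.

```python
def implied_team_totals(total, spread):
    """Split a game total into (favorite, underdog) implied points.

    Assumes spread is the positive number of points the favorite gives.
    The two values sum to the total and differ by the spread.
    """
    favorite = (total + spread) / 2
    underdog = (total - spread) / 2
    return favorite, underdog


def scoring_gap(average_points, implied_points):
    """Implied points minus the team's season average points per game."""
    return implied_points - average_points


# Placeholder game: total of 47 with the favorite giving 7 points.
favorite_pts, underdog_pts = implied_team_totals(47, 7)
gap = scoring_gap(24.0, favorite_pts)
```

A positive gap means Vegas expects the team to score more than it usually does, which is the signal this feature is meant to capture.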
This post will be updated as we try to make our predictions better with new data.