Scraping Football Stats with Python


I love stats, I love all sorts of games, and Fantasy Football is the perfect amalgamation of a game with statistics. Let’s scrape the web for some of our favorite players’ stats using Python, requests, and BeautifulSoup!

Pro Football Reference is a great site for looking up football stats, and it is the site we will scrape in this post. Carolina Panthers wide receiver D.J. Moore is an exciting young player as far as fantasy football is concerned, so let’s use him in our examples. Time to create a quick Python project that scrapes Pro Football Reference and spits out a game-by-game breakdown for a specific stat!

Initial Setup

Let’s set up a simple Python project, using pipenv as our packaging tool. The first step is to create a dedicated directory for our code. I am working on a Mac, so I run the following commands from the terminal to navigate to my Desktop directory and create a new scrape_stats directory, which will be our project’s main directory:

$ cd ~/Desktop
$ mkdir scrape_stats && cd scrape_stats

Note: I assume you already have Python 3.7 installed, as well as pipenv. For guidance with installing Python 3 please read this article. If you need assistance with pipenv please see its installation documentation.

Now we can create our Pipfile and activate our virtual environment. Open our bare project in a code editor of your choice (I suggest either VSCode or PyCharm) and create a new Pipfile:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
black = "==19.10b0"

[packages]
beautifulsoup4 = "==4.9.0"
requests = "==2.23.0"

[requires]
python_version = "3.7"

There is a lot going on above, and while I don’t want to get into a full discussion of how pipenv works, I just want to point out this Pipfile will install beautifulsoup4, requests, and optionally, black, for formatting our code.

Note: I always use Black in my Python projects. I suggest you do as well. It takes the guesswork out of formatting, allows peer reviews to focus on the actual code and its functionality, and spits out very tidy code.

Let’s go back to our terminal window, install everything via pipenv, and activate our virtualenv.

$ pipenv install --dev
$ pipenv shell

Above, we are installing everything listed in the Pipfile, including the dev-packages. Black is currently our only dev-package, so if you don’t want to bother with formatting the code, simply run pipenv install instead. We activate our virtualenv via pipenv shell, and are now ready to set up our project structure and write some code!

We want to lay out our project’s structure like so:

scrape_stats/
│
├── src/
│   ├── __init__.py
│   ├── scrape_stats.py
│   └── stats/
│       ├── __init__.py
│       └── player_stats.py
│
├── Pipfile
└── Pipfile.lock

First, add a src package from our project’s root directory which will contain all of our code. Additionally, create a stats package inside of this src package, which will contain the player_stats.py module. Be sure to add an __init__.py file where necessary.
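
If you would rather not create these files by hand, the layout above can be scaffolded with a few lines of Python. This is only a convenience sketch: create_layout is a hypothetical helper and not part of the project itself.

```python
from pathlib import Path


def create_layout(root):
    """Create the src/ and src/stats/ packages under the given root."""
    root = Path(root)
    stats_dir = root / "src" / "stats"
    # parents=True also creates src/ on the way down
    stats_dir.mkdir(parents=True, exist_ok=True)

    files = [
        root / "src" / "__init__.py",
        root / "src" / "scrape_stats.py",
        stats_dir / "__init__.py",
        stats_dir / "player_stats.py",
    ]
    for file_path in files:
        # touch() creates an empty file and keeps existing content intact
        file_path.touch(exist_ok=True)
    return files
```

Calling create_layout(".") from inside the scrape_stats directory should produce the src tree shown above; the Pipfile and Pipfile.lock are handled by pipenv.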

Now that our project is set up, we need to write some code. We use the popular requests library to fetch the data from Pro Football Reference. Add a simple function to player_stats.py that fetches the HTML content for a given player’s season.

import requests 


def get_player_stats_from_pfr(player_url, year):
    """
    Function that returns the HTML content for a particular player
    in a given NFL year/season.
    
    :param player_url: URL for the player's Pro Football Reference Page 
    :param year: Year to access the player's stats for
    :return: Raw HTML (bytes) of the player's game log page for the given year
    """
    get_url = player_url + year
    response = requests.get(get_url)
    # Fail loudly on an HTTP error instead of parsing an error page
    response.raise_for_status()
    return response.content
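
Because the function simply concatenates the base URL and the year, the base URL must end with a slash and the year must be a string. The sketch below shows one way to make that step more forgiving. Note that build_gamelog_url is a hypothetical helper, not part of the project; the base URL is the one used later in this post.

```python
def build_gamelog_url(player_url, year):
    """Hypothetical helper: join a player's game log URL and a season year."""
    # Ensure exactly one slash between the base URL and the year,
    # and accept the year as either a string or an integer
    return f"{player_url.rstrip('/')}/{year}"


base = "https://www.pro-football-reference.com/players/M/MoorD.00/gamelog/"
print(build_gamelog_url(base, "2019"))
print(build_gamelog_url(base.rstrip("/"), 2019))
```

Both calls print the same URL, ending in gamelog/2019.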

Our player_stats.py module will need a few more helper functions to make sure our scraped stats are accurate and correct. Football is a rough sport, and players often miss games. Unfortunately, Pro Football Reference does not record missed games in their game logs, and our code needs to keep track of any gaps in the stats. Some of this might not make sense until we see the finished product, but let’s lay out our groundwork now.

Add a function which quickly checks if we have any missed games to fill in:

def player_has_missed_games(game_being_processed, cur_pfr_game):
    """
    Function that checks if the player has missing games
    :param game_being_processed: Game (by week) we are processing
    :param cur_pfr_game: Game (by week) we are on for the player
    :return: Bool
    """
    return game_being_processed != cur_pfr_game

This function is used to quickly check if we have missing games. If this function returns True we need to fill in the correct number of missing games via the following function:

def fill_in_missed_games(game_being_processed, cur_pfr_game, stat_per_game):
    """
    Function that fills a list with 0s for missing games
    
    :param game_being_processed: Game (by week) we are processing
    :param cur_pfr_game: Game (by week) we are on for the player
    :param stat_per_game: List containing a stat's value for each game
    :return: Updated list with missing games filled in with "0"
    """
    games_missed = cur_pfr_game - game_being_processed
    for i in range(games_missed):
        stat_per_game.append("0")

    return stat_per_game

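Lastly, before returning the scraped stats we need to check whether the player missed any games at the end of the season (each team played 16 regular-season games in 2019, the last of them in week 17), and update our list of weekly stats accordingly:

def check_if_missed_week_sixteen(stat_per_game):
    """
    Function that checks a list of stats per game is the
    expected 16 games long, padding it with "0" if it is not

    :param stat_per_game: List of stats
    :return: Updated list, if applicable
    """
    # Pad in a loop so more than one missed game at the end is handled
    while len(stat_per_game) < 16:
        stat_per_game.append("0")

    return stat_per_game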
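
The gap-filling idea spread across these helpers can be condensed into one self-contained sketch. It assumes the game log has already been reduced to (game number, value) pairs; the name pad_season and the sample data are made up for illustration and do not come from the project.

```python
def pad_season(games, total_games=16):
    """Return one value per game, using "0" for games the player missed.

    games: list of (game_number, value) pairs in ascending game order.
    """
    padded = []
    for game_number, value in games:
        # Fill any gap between the last recorded game and this one
        while len(padded) < game_number - 1:
            padded.append("0")
        padded.append(value)
    # Fill games missed at the end of the season
    while len(padded) < total_games:
        padded.append("0")
    return padded


# Made-up data for a 6-game season: the player missed games 2, 4, and 6
sample = [(1, "5"), (3, "7"), (5, "2")]
print(pad_season(sample, total_games=6))
# ['5', '0', '7', '0', '2', '0']
```

Using the length of the list as the "next expected game" means there is no separate counter to keep in sync after a gap.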

We now have a handful of helper functions for manipulating and cleaning up our data from Pro Football Reference, but we still need to extract this data using Beautiful Soup.

# PFR data-stat id of the game-number column (assumed value; check the page source)
STAT_GAME_NUM = "game_num"


def get_stat_for_season(html_game_stats, stat_to_scrape):
    """
    Function that extracts a given stat from PFR table rows
    and returns the results in a list

    :param html_game_stats: BeautifulSoup PageElements containing our stats
    :param stat_to_scrape: PFR stat_id to extract
    :return: List of stats, 16 games long, for player's season
    """
    scraped_weekly_stats = []
    game_being_processed = 1

    for game in html_game_stats:
        cur_pfr_game = int(get_stat_value_from_game(game, STAT_GAME_NUM))

        if player_has_missed_games(game_being_processed, cur_pfr_game):
            scraped_weekly_stats = fill_in_missed_games(
                game_being_processed, cur_pfr_game, scraped_weekly_stats
            )

        game_stat_value = get_stat_value_from_game(game, stat_to_scrape)
        scraped_weekly_stats.append(game_stat_value)
        # Resync with PFR's numbering so a single gap is not counted again
        game_being_processed = cur_pfr_game + 1

    check_if_missed_week_sixteen(scraped_weekly_stats)
    return scraped_weekly_stats


def get_stat_value_from_game(game, pfr_stat_id):
    """
    Function that extracts a specific stat from a set of game stats
    :param game: Table Row extracted by BeautifulSoup
                 containing all player's stats for single game
    :param pfr_stat_id: PFR string element ID for the stat we want to extract
    :return: Extracted stat for provided game
    """
    data_stat_rec = game.find("td", {"data-stat": pfr_stat_id})
    # get_text returns the cell's text as a str, with whitespace stripped
    return data_stat_rec.get_text(strip=True)

There is a lot going on in the code above! The function get_stat_for_season loops over a list of table rows from Pro Football Reference, extracting the specific stat value we are looking for. It utilizes many of the functions we previously wrote, checking for missed games, and finally adding the current game’s value to our list of stats.
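
To make the data-stat lookup concrete, here is a small sketch that runs the same kind of query against a hand-written table row. The markup is a simplified, made-up stand-in for a Pro Football Reference row (the real table has many more columns), and the "game_num" id is an assumption.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup standing in for one PFR game log row
SAMPLE_ROW = """
<table>
  <tr id="stats.1">
    <td data-stat="game_num">1</td>
    <td data-stat="rec"> 7 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr")

# Look the cell up by its data-stat attribute, then read its text
cell = row.find("td", {"data-stat": "rec"})
receptions = cell.get_text(strip=True)
print(receptions)  # 7
```

Note that find returns None when no cell matches, so a misspelled stat id will surface as an AttributeError on the next line.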

Final step before running our code! Let’s write a short script that ties all of these functions together and extracts a specific stat for a given player over a single season.

import sys
from os import path

from bs4 import BeautifulSoup

sys.path.append(path.dirname(path.dirname(path.abspath(__file__))))

from src.stats.player_stats import (
    get_player_stats_from_pfr,
    get_stat_for_season,
)

# Feel free to change this to the URL of your favorite player
PLAYER_PFR_URL = "https://www.pro-football-reference.com/players/M/MoorD.00/gamelog/"
YEAR_2019 = "2019"


if __name__ == "__main__":
    html_doc = get_player_stats_from_pfr(PLAYER_PFR_URL, YEAR_2019)
    soup = BeautifulSoup(html_doc, "html.parser")
    game_stats = soup.find_all("tr", id=lambda x: x and x.startswith("stats."))

    # "rec" is the stat-id for receptions. See the GitHub repo for other stat-id vals
    receptions = get_stat_for_season(game_stats, "rec")
    print(f"DJ Moore 2019 receptions: {receptions}")

The above code is a simple, runnable Python script. It goes out to Pro Football Reference, gets D.J. Moore’s gamelog page for 2019, parses it via BeautifulSoup, and concludes by printing the results to the terminal window. The line that invokes soup.find_all finds all HTML table rows (tr) whose id attribute starts with the "stats." prefix. These are the table rows containing all of our desired game stats.
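
The id filter is easier to follow on a tiny example. The sketch below applies the same find_all call to made-up markup containing a header row without an id and a row with an unrelated id; only the two "stats." rows should survive.

```python
from bs4 import BeautifulSoup

# Made-up markup: two game rows, a header row with no id, and an unrelated row
SAMPLE_TABLE = """
<table>
  <tr class="thead"><th>Rec</th></tr>
  <tr id="stats.1"><td data-stat="rec">7</td></tr>
  <tr id="stats.2"><td data-stat="rec">9</td></tr>
  <tr id="other"><td>ignored</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")

# BeautifulSoup calls the function with each tag's id value, or None when
# the tag has no id; "x and ..." keeps None from reaching startswith()
rows = soup.find_all("tr", id=lambda x: x and x.startswith("stats."))
print([row["id"] for row in rows])  # ['stats.1', 'stats.2']
```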

Note: I use a lambda function above. For more on lambda functions, please read this article.

Our code should finally be in a runnable state! Running it from the command line should print out D.J. Moore’s reception totals, week-by-week, for the entire 2019 season.

(scrape_stats)$ python src/scrape_stats.py 
DJ Moore 2019 receptions: ['7', '9', '1', '3', '6', '7', '5', '7', '9', '8', '6', '6', '4', '8', '1', '0']
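
Note that the stats come back as strings, so they need converting before any arithmetic. As a quick sketch, here is the list printed above turned into a season total and a single-game high:

```python
receptions = ["7", "9", "1", "3", "6", "7", "5", "7",
              "9", "8", "6", "6", "4", "8", "1", "0"]

# Convert the scraped strings to integers before doing any arithmetic
per_game = [int(value) for value in receptions]

total = sum(per_game)
best_game = max(per_game)
print(f"Total: {total}, best game: {best_game}")
# Total: 87, best game: 9
```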

That’s it! This project is pretty raw and leaves a lot of room to be expanded and refactored for more stats fun … which we just might do in future posts! If you want to download the source code seen in this post, check out its GitHub repo here.
