Scraping football statistics is fun, but what good are these statistics if they are not persisted for later use? Let’s scrape some seasonal football stats, and store them in a relational database!
In my last post I made a small Python project capable of extracting a single statistic (receptions, for example) across all games in a given season from a particular player’s Pro Football Reference gamelog page. That’s quite the mouthful. It’d be a lot cooler to dump all players’ statistics from Pro Football Reference, both real stats and fantasy stats, to a relational database, which will eventually lend itself to more interesting projects … and blog posts! Let’s dive into some data!
Initial Setup
If you completed the project from my last post you should already have Python 3 and pipenv installed. If not, you will need to install both before continuing.
Once everything is installed we can create our project folder by entering the following in our terminal window:
# This assumes you have a "Desktop" directory.
# Feel free to put our project anywhere you see fit
$ cd ~/Desktop
$ mkdir fantasy_stats && cd fantasy_stats
Open up the fantasy_stats directory in your favorite code editor and create a new Pipfile:
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
black = "==22.12.0"

[packages]
beautifulsoup4 = "==4.11.1"
requests = "==2.28.1"

[requires]
python_version = "3.11"
Head back to a terminal window and install everything using pipenv. Once pipenv has finished installing all of our dependencies we can activate our virtualenv and start working.
# Install all dependencies in our Pipfile and activate our virtualenv
$ pipenv install
$ pipenv shell
Project Structure
We now have a Pipfile and a Pipfile.lock in our project’s directory. Lay out the rest of our project like so:
fantasy_stats/
│
├── src/
│   ├── __init__.py
│   ├── constants.py
│   ├── fantasy_stats_to_db.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── database.py
│   │   └── db_columns.py
│   └── stats/
│       ├── __init__.py
│       └── fantasy_stats.py
│
├── Pipfile
├── Pipfile.lock
└── players.sql
Implementation
Before we start implementing the necessary functionality to scrape a season’s worth of stats and load it into a sqlite3 database, we should define our players SQL schema and create some constants files that will help organize our code a bit.

First, let’s write our players.sql file. I’d like our database to mirror the columns of the PFR stats table, minus the fantasy-ranking columns at the end, so our schema file should look as follows:
-- ~/fantasy_stats/players.sql
CREATE TABLE players (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    player TEXT NOT NULL,
    team TEXT NOT NULL,
    fantasy_position TEXT NOT NULL,
    age INTEGER NOT NULL,
    games INTEGER NOT NULL,
    games_started INTEGER NOT NULL,
    completions INTEGER,
    pass_attempts INTEGER,
    pass_yards INTEGER,
    pass_tds INTEGER,
    interceptions INTEGER,
    rush_attempts INTEGER,
    rush_yards INTEGER,
    rush_yards_per_attempt REAL,
    rush_tds INTEGER,
    targets INTEGER,
    receptions INTEGER,
    rec_yards INTEGER,
    yards_per_reception REAL,
    rec_tds INTEGER,
    fumbles INTEGER,
    fumbles_lost INTEGER,
    total_tds INTEGER,
    two_point_con INTEGER,
    two_point_con_passing INTEGER,
    fantasy_points_standard INTEGER,
    fantasy_points_ppr REAL,
    fantasy_points_dk REAL,
    fantasy_points_fd REAL
);
Next, we should make a couple of constants files:
# ~/fantasy_stats/src/constants.py
"""
constants.py
File contains some constants shared across our modules
If you move this file the PROJECT_ROOT var needs fixing.
"""
from pathlib import Path

PROJECT_ROOT = Path(__file__).parent.parent

CSV_FILE = f"{PROJECT_ROOT}/player_data.csv"
DB_FILE = f"{PROJECT_ROOT}/football.db"
SCHEMA_FILE = f"{PROJECT_ROOT}/players.sql"
Here we create a PROJECT_ROOT constant which is used to help define where we store and find our CSV file, sqlite3 DB file, and the SQL schema file used to create our players table.
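To see how the pathlib chaining resolves, here is a quick sketch. The /home/user path is a made-up example location, used only for illustration:

```python
from pathlib import Path

# Pretend constants.py lives here (hypothetical path, for illustration only)
fake_file = Path("/home/user/fantasy_stats/src/constants.py")

# .parent is the containing directory; chaining it twice climbs from
# src/constants.py up to the project root
print(fake_file.parent)         # /home/user/fantasy_stats/src
print(fake_file.parent.parent)  # /home/user/fantasy_stats
```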
I don’t like literal strings littered throughout my code, so let’s create a db_columns.py module which will define strings for all of the column names in our players table.
"""
~/fantasy_stats/src/data/db_columns.py
This file contains constants representing the database columns
in the players table
"""
COLUMN_PLAYER = "player"
COLUMN_TEAM = "team"
COLUMN_FANTASY_POSITION = "fantasy_position"
COLUMN_AGE = "age"
COLUMN_GAMES = "games"
COLUMN_GAMES_STARTED = "games_started"
COLUMN_COMPLETIONS = "completions"
COLUMN_PASS_ATTEMPTS = "pass_attempts"
COLUMN_PASS_YARDS = "pass_yards"
COLUMN_PASS_TDS = "pass_tds"
COLUMN_INTERCEPTIONS = "interceptions"
COLUMN_RUSH_ATTEMPTS = "rush_attempts"
COLUMN_RUSH_YARDS = "rush_yards"
COLUMN_RUSH_YARDS_PER_ATTEMPT = "rush_yards_per_attempt"
COLUMN_RUSH_TDS = "rush_tds"
COLUMN_TARGETS = "targets"
COLUMN_RECEPTIONS = "receptions"
COLUMN_REC_YARDS = "rec_yards"
COLUMN_YARDS_PER_RECEPTION = "yards_per_reception"
COLUMN_REC_TDS = "rec_tds"
COLUMN_FUMBLES = "fumbles"
COLUMN_FUMBLES_LOST = "fumbles_lost"
COLUMN_TOTAL_TDS = "total_tds"
COLUMN_TWO_POINT_CON = "two_point_con"
COLUMN_TWO_POINT_CON_PASSING = "two_point_con_passing"
COLUMN_FANTASY_POINTS_STANDARD = "fantasy_points_standard"
COLUMN_FANTASY_POINTS_PPR = "fantasy_points_ppr"
COLUMN_FANTASY_POINTS_DK = "fantasy_points_dk"
COLUMN_FANTASY_POINTS_FD = "fantasy_points_fd"
# This is a list of all of the columns above, used to create our CSV header row
CSV_HEADER_ROW = [
    COLUMN_PLAYER,
    COLUMN_TEAM,
    COLUMN_FANTASY_POSITION,
    COLUMN_AGE,
    COLUMN_GAMES,
    COLUMN_GAMES_STARTED,
    COLUMN_COMPLETIONS,
    COLUMN_PASS_ATTEMPTS,
    COLUMN_PASS_YARDS,
    COLUMN_PASS_TDS,
    COLUMN_INTERCEPTIONS,
    COLUMN_RUSH_ATTEMPTS,
    COLUMN_RUSH_YARDS,
    COLUMN_RUSH_YARDS_PER_ATTEMPT,
    COLUMN_RUSH_TDS,
    COLUMN_TARGETS,
    COLUMN_RECEPTIONS,
    COLUMN_REC_YARDS,
    COLUMN_YARDS_PER_RECEPTION,
    COLUMN_REC_TDS,
    COLUMN_FUMBLES,
    COLUMN_FUMBLES_LOST,
    COLUMN_TOTAL_TDS,
    COLUMN_TWO_POINT_CON,
    COLUMN_TWO_POINT_CON_PASSING,
    COLUMN_FANTASY_POINTS_STANDARD,
    COLUMN_FANTASY_POINTS_PPR,
    COLUMN_FANTASY_POINTS_DK,
    COLUMN_FANTASY_POINTS_FD,
]
To recap, we have implemented a couple of constants files and our players table schema. The next logical step is to implement the fantasy_stats.py module, which will scrape the data we want from Pro Football Reference and dump it into a CSV file that is later loaded into our sqlite database.

We first need to GET and extract the HTML table containing all the data we want from Pro Football Reference using requests and BeautifulSoup4. Our fantasy_stats.py module is currently empty, but not for long. Open it in your code editor and add the following function:
# ~/fantasy_stats/src/stats/fantasy_stats.py
import csv
import logging

import requests
from bs4 import BeautifulSoup

from src.constants import CSV_FILE
from src.data.db_columns import CSV_HEADER_ROW

PFR_FANTASY_URL_PREFIX = "https://www.pro-football-reference.com/years/"
PFR_FANTASY_URL_SUFFIX = "/fantasy.htm"


def get_seasonal_fantasy_stats_table(year):
    """
    Function that extracts the HTML table from PFR for the provided year
    This table contains all of the stats for all players for the desired season.
    :param year: String representing the year
    :return: The table tag object containing all of the stats
    """
    pfr_url = f"{PFR_FANTASY_URL_PREFIX}{year}{PFR_FANTASY_URL_SUFFIX}"
    response = requests.get(pfr_url)
    soup = BeautifulSoup(response.content, "html.parser")
    parsed_table = soup.find_all("table")[0]
    return parsed_table
Ignoring the imports at the top (some needed for this function, and some that we will need further below), we just added a function that uses requests to get the HTML from the Pro Football Reference page containing all of our seasonal stats. The content returned by requests is then parsed with BeautifulSoup4, extracting the first HTML table on the page. This table contains all of the stats we want to load into our database.
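To make the extraction step concrete without hitting PFR, here is a miniature stand-in page run through the same soup.find_all("table")[0] call. The HTML snippet below is invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny made-up page standing in for the real PFR response content
html = """
<html><body>
  <p>Some page chrome we do not care about</p>
  <table>
    <tr><td data-stat="player"><a href="#">Demo Player</a></td></tr>
  </table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Same extraction as get_seasonal_fantasy_stats_table: grab the first table
parsed_table = soup.find_all("table")[0]
print(parsed_table.find("td", attrs={"data-stat": "player"}).get_text())  # Demo Player
```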
Now that we have the data we need loaded into Beautiful Soup, we want to write it to a CSV file. We need to add a few more functions to fantasy_stats.py to do so.
Note
We could just load the data right into sqlite
from Beautiful Soup, but I liked having a tangible file to check my results against when loading to the database. The CSV file also makes for a good backup to our very light-weight dataset, and can be imported into Excel for further data fun.
# ~/fantasy_stats/src/stats/fantasy_stats.py
def build_player_csv(parsed_table):
    """
    Function that creates a CSV from the PFR stats table
    :param parsed_table: Table object extracted with BeautifulSoup
    :return: None
    """
    pfr_table_rows = parsed_table.findAll("tr")
    # Slice the table rows from the third row on,
    # as the first two rows are header rows
    pfr_player_rows = pfr_table_rows[2:]
    _write_row_to_csv(CSV_HEADER_ROW)
    for row in pfr_player_rows:
        player_data = _extract_data_from_row(row)
        _write_row_to_csv(player_data)


def _extract_data_from_row(row):
    """
    Function that extracts a player's stats from an HTML table row
    via BeautifulSoup4
    :param row: 'tr' or table row containing our stats
    :return: List of player stats
    """
    player_data = []
    # Attempt to extract the player name first.
    # This is used to determine if we are in a header row or a player data row,
    # as PFR mixes "header" rows into the table
    player_name_data = row.find("td", attrs={"data-stat": "player"})
    if player_name_data is None:
        return
    player_name = player_name_data.a.get_text()
    player_data.append(player_name)
    # We want to slice the list from the first element after "player"
    # until the fourth-to-last data element. The last three elements are
    # fantasy-ranking related, and we do not want this data.
    remaining_cells = row.find_all("td")[1:-3]
    for cell in remaining_cells:
        player_data.append(cell.get_text())
    return player_data


def _write_row_to_csv(row):
    """
    Function that writes a row (list) of data to our player CSV
    :param row: List of player data
    :return: None
    """
    # PFR mixes some meta-rows into their stats table, and we do not want these
    if not row:
        logging.debug("Skipping non-player data row!")
        return
    # Open the CSV file in 'append' mode
    with open(CSV_FILE, mode="a") as player_data_file:
        player_data_writer = csv.writer(
            player_data_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
        )
        player_data_writer.writerow(row)
These three new functions take the data loaded via BeautifulSoup4, extract each table row, or tr, from the HTML table, then parse each row for the statistics we want. The build_player_csv function concludes by calling _write_row_to_csv, which does exactly as it sounds and writes the current table row to our CSV file. Notice the latter two functions are prefixed with _. This prefix is simply an indication that these functions are “private” to this module and should not be called elsewhere.
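The header-row detection in _extract_data_from_row hinges on PFR's repeated header rows using th cells where real player rows use a td with data-stat="player". A small sketch of that check, with both rows invented as samples:

```python
from bs4 import BeautifulSoup

# Invented sample rows mimicking the stats table's markup
player_row = BeautifulSoup(
    '<tr><td data-stat="player"><a href="#">A. Player</a></td>'
    '<td data-stat="team">KAN</td></tr>',
    "html.parser",
).tr
header_row = BeautifulSoup(
    '<tr><th data-stat="player">Player</th></tr>', "html.parser"
).tr

# Player rows yield a cell; header rows yield None and get skipped
print(player_row.find("td", attrs={"data-stat": "player"}) is None)  # False
print(header_row.find("td", attrs={"data-stat": "player"}) is None)  # True
```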
The end goal is to take the CSV file we create and load it into a sqlite database. Thus, we need to write a module containing a function that creates the players table in our database from the players.sql schema file. This module also needs a function that takes our CSV file and loads its contents into the players table. Open the database.py module and implement the following two functions:
# ~/fantasy_stats/src/data/database.py
import csv
import sqlite3

from src.data.db_columns import (
    COLUMN_PLAYER,
    COLUMN_TEAM,
    COLUMN_FANTASY_POSITION,
    COLUMN_AGE,
    COLUMN_GAMES,
    COLUMN_GAMES_STARTED,
    COLUMN_COMPLETIONS,
    COLUMN_PASS_ATTEMPTS,
    COLUMN_PASS_YARDS,
    COLUMN_PASS_TDS,
    COLUMN_INTERCEPTIONS,
    COLUMN_RUSH_ATTEMPTS,
    COLUMN_RUSH_YARDS,
    COLUMN_RUSH_YARDS_PER_ATTEMPT,
    COLUMN_RUSH_TDS,
    COLUMN_TARGETS,
    COLUMN_RECEPTIONS,
    COLUMN_REC_YARDS,
    COLUMN_YARDS_PER_RECEPTION,
    COLUMN_REC_TDS,
    COLUMN_FUMBLES,
    COLUMN_FUMBLES_LOST,
    COLUMN_TOTAL_TDS,
    COLUMN_TWO_POINT_CON,
    COLUMN_TWO_POINT_CON_PASSING,
    COLUMN_FANTASY_POINTS_STANDARD,
    COLUMN_FANTASY_POINTS_PPR,
    COLUMN_FANTASY_POINTS_DK,
    COLUMN_FANTASY_POINTS_FD,
    CSV_HEADER_ROW,
)


def create_db_schema(schema_file, db_name):
    """
    Function that loads a database table from a schema file
    to a sqlite3 database
    :param schema_file: SQL Schema file
    :param db_name: Name of the sqlite3 database
    :return: None
    """
    with open(schema_file, "r") as schema:
        sql_script = schema.read()
    db = sqlite3.connect(db_name)
    cursor = db.cursor()
    cursor.executescript(sql_script)
    db.commit()
    db.close()
def load_players_csv_to_db(db_name, players_csv_file):
    """
    Function that loads the contents of our player CSV file
    into the players table
    :param db_name: Name of the sqlite3 database
    :param players_csv_file: Path to the player CSV file
    :return: None
    """
    con = sqlite3.connect(db_name)
    cur = con.cursor()
    # Open the CSV file in "read" "text" (rt) mode
    with open(players_csv_file, "rt") as player_table:
        dr = csv.DictReader(player_table)
        player_data = [
            (
                i[COLUMN_PLAYER],
                i[COLUMN_TEAM],
                i[COLUMN_FANTASY_POSITION],
                i[COLUMN_AGE],
                i[COLUMN_GAMES],
                i[COLUMN_GAMES_STARTED],
                i[COLUMN_COMPLETIONS],
                i[COLUMN_PASS_ATTEMPTS],
                i[COLUMN_PASS_YARDS],
                i[COLUMN_PASS_TDS],
                i[COLUMN_INTERCEPTIONS],
                i[COLUMN_RUSH_ATTEMPTS],
                i[COLUMN_RUSH_YARDS],
                i[COLUMN_RUSH_YARDS_PER_ATTEMPT],
                i[COLUMN_RUSH_TDS],
                i[COLUMN_TARGETS],
                i[COLUMN_RECEPTIONS],
                i[COLUMN_REC_YARDS],
                i[COLUMN_YARDS_PER_RECEPTION],
                i[COLUMN_REC_TDS],
                i[COLUMN_FUMBLES],
                i[COLUMN_FUMBLES_LOST],
                i[COLUMN_TOTAL_TDS],
                i[COLUMN_TWO_POINT_CON],
                i[COLUMN_TWO_POINT_CON_PASSING],
                i[COLUMN_FANTASY_POINTS_STANDARD],
                i[COLUMN_FANTASY_POINTS_PPR],
                i[COLUMN_FANTASY_POINTS_DK],
                i[COLUMN_FANTASY_POINTS_FD],
            )
            for i in dr
        ]
    # Use a list comprehension to help build the cumbersome insert statement
    insert_values = ["?" for column in CSV_HEADER_ROW]
    insert_params = ",".join(insert_values)
    cur.executemany(
        # Instead of hand-typing this insert statement I used a list comprehension
        # above to create a list of ?s for each column in our table, and joined the
        # list together, comma separated. The starting 'null' represents the
        # auto-incremented 'id' column in our database
        f"INSERT INTO players VALUES (null, {insert_params});",
        player_data,
    )
    con.commit()
    con.close()
The first function in database.py, create_db_schema, takes the schema_file and db_name parameters, and connects to a sqlite3 database named after the db_name parameter. It then creates a cursor object and runs the schema_file against the database. Later in this tutorial we will call this function, providing our players.sql schema file as the schema_file parameter.
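Here is a minimal sketch of that flow against a throwaway in-memory database, with a trimmed-down schema inlined as a string instead of read from players.sql:

```python
import sqlite3

# Trimmed-down schema standing in for players.sql
schema = """
CREATE TABLE players (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    player TEXT NOT NULL
);
"""

db = sqlite3.connect(":memory:")  # swap for a file path to persist
cursor = db.cursor()
cursor.executescript(schema)
db.commit()

# Confirm the table now exists
tables = db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'players'"
).fetchall()
print(tables)  # [('players',)]
```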
The second function, load_players_csv_to_db, is a crude function that takes two parameters, db_name and players_csv_file. It initializes a connection to the database identified by db_name, opens our CSV file in rt (read-file-as-text) mode, iterates through the CSV file using a DictReader, and lastly inserts all of the data into our sqlite database.
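The same DictReader-plus-executemany pattern can be seen in miniature below, shrunk to two columns and using io.StringIO in place of the real CSV file so the snippet is self-contained (the sample rows are invented):

```python
import csv
import io
import sqlite3

# Stand-in for player_data.csv, shrunk to two columns
csv_text = "player,team\nChristian McCaffrey,CAR\nMichael Thomas,NOR\n"

dr = csv.DictReader(io.StringIO(csv_text))
player_data = [(i["player"], i["team"]) for i in dr]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE players (id INTEGER PRIMARY KEY AUTOINCREMENT, player TEXT, team TEXT)"
)
# Same placeholder-building trick as load_players_csv_to_db
insert_params = ",".join("?" for _ in ("player", "team"))
con.executemany(f"INSERT INTO players VALUES (null, {insert_params});", player_data)
con.commit()

rows = con.execute("SELECT player FROM players WHERE team = 'CAR'").fetchall()
print(rows)  # [('Christian McCaffrey',)]
```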
This code is pretty useless up to this point. We still need to make a runnable Python script that ties everything together. Open the fantasy_stats_to_db.py module listed in the project structure and implement it like so:
# ~/fantasy_stats/src/fantasy_stats_to_db.py
import logging
import os
import sys
from os import path

sys.path.append(path.dirname(path.dirname(path.abspath(__file__))))

from src.constants import CSV_FILE, DB_FILE, SCHEMA_FILE
from src.data.database import create_db_schema, load_players_csv_to_db
from src.stats.fantasy_stats import (
    build_player_csv,
    get_seasonal_fantasy_stats_table,
)

# Some logging for letting us know stuff is actually happening
logging.basicConfig(level=logging.INFO)


def cleanup_old_files(files):
    for file in files:
        try:
            os.remove(file)
        except OSError:
            logging.debug(f"File does not exist: {file}")


if __name__ == "__main__":
    # This cleanup was for my sanity, as multiple debugging runs
    # required a clean slate.
    cleanup_old_files([CSV_FILE, DB_FILE])
    # Feel free to change the year "2019" to any year you desire
    parsed_table = get_seasonal_fantasy_stats_table("2019")
    logging.info("Extracted HTML table from PFR...")
    build_player_csv(parsed_table)
    logging.info("Successfully built player CSV...")
    create_db_schema(SCHEMA_FILE, DB_FILE)
    logging.info("Created DB schema...")
    load_players_csv_to_db(DB_FILE, CSV_FILE)
    logging.info("Loaded data to sqlite3 successfully!")
This file can be run from our project’s root directory via the following command:
python src/fantasy_stats_to_db.py
If all goes well you should see output similar to this:
(fantasy_stats) bash-3.2$ python src/fantasy_stats_to_db.py
INFO:root:Extracted HTML table from PFR...
INFO:root:Successfully built player CSV...
INFO:root:Created DB schema...
INFO:root:Loaded data to sqlite3 successfully!
If you poke around the project a bit you should see that the files ~/fantasy_stats/player_data.csv and ~/fantasy_stats/football.db have been created. The file football.db is our new sqlite3 database loaded with all of the extracted stats! Let’s dive into the database and run a test query. Run the following command from our project’s root directory, which is also where our football.db database file is located:
bash-3.2$ sqlite3 football.db
# The next two lines are output from the above command
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
We are now connected to our sqlite3 database, loaded with all of our football statistics in the players table. Let’s run a query to test things out!
sqlite> SELECT player, fantasy_position FROM players WHERE receptions >= 100;
# The next few lines are output from our query above
Christian McCaffrey|RB
Michael Thomas|WR
DeAndre Hopkins|WR
Keenan Allen|WR
Julian Edelman|WR
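The same query can also be run from Python with the standard library's sqlite3 module. The sketch below seeds an in-memory table with invented sample rows so it stands alone; point sqlite3.connect() at football.db instead to query the real data:

```python
import sqlite3

# Swap ":memory:" for "football.db" to run against the database built above;
# the rows here are invented samples so the snippet is self-contained
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE players (player TEXT, fantasy_position TEXT, receptions INTEGER)"
)
con.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Christian McCaffrey", "RB", 116), ("Derrick Henry", "RB", 18)],
)

for player, position in con.execute(
    "SELECT player, fantasy_position FROM players WHERE receptions >= 100"
):
    print(player, position)  # Christian McCaffrey RB
```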
Note: I was listening to episode #183 of the Python Bytes podcast and learned of Beekeeper Studio, a very cool open-source SQL editor and database manager. I highly recommend checking it out!
Pretty cool! We can see from our little test query that five different players recorded 100 or more receptions in 2019, and one of them was a running back, Christian McCaffrey! That’s a wrap for this little project. If you want to download the source code seen in this post check out its GitHub repo here.