Getting Started

Installation

This package is available on PyPI, so you can install it with pip,

pip install pybbda

Or you can install the latest master branch directly from the GitHub repo using pip,

pip install git+https://github.com/bdilday/pybbda.git

or download the source,

git clone git@github.com:bdilday/pybbda.git
cd pybbda
pip install .

Requirements

This package explicitly supports Python 3.6 and Python 3.7. It aims to support Python 3.8, but this is not guaranteed. It explicitly does not support any version prior to Python 3.6, including Python 2.7.

Environment variables

The package uses the following environment variables

  • PYBBDA_DATA_ROOT

The root directory for storing data (See [Installing data](#Installing-data)). Defaults to ${INSTALLATION_ROOT}/data/assets, where ${INSTALLATION_ROOT} is the path to the pybbda installation. The code location is typically a path to the Python installation plus site-packages/pybbda.

This can cause a problem with write permissions if you’re using a system Python instead of a user-controlled virtual environment. For this reason, and to avoid duplication if the package is installed into multiple virtual environments, it’s recommended to use a custom path for PYBBDA_DATA_ROOT, for example,

export PYBBDA_DATA_ROOT=${HOME}/.pybbda/data

  • PYBBDA_LOG_LEVEL

This sets the logging level for the package at runtime. The default is INFO.

  • PYBBDA_MAX_OUTS

Sets the number of outs for the run expectancy tool. Defaults to 3, i.e., simulating a single inning. Use 27 to simulate a full game.
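
These variables can also be set from within Python before importing the package. The following is a minimal sketch, assuming the variables are read when pybbda is imported; adjust the values to your setup,

import os

# assumption: pybbda reads these environment variables at import time
os.environ["PYBBDA_DATA_ROOT"] = os.path.expanduser("~/.pybbda/data")
os.environ["PYBBDA_LOG_LEVEL"] = "INFO"
os.environ["PYBBDA_MAX_OUTS"] = "3"

import pybbda  # noqa: E402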

Installing data

This package ships without any data. Instead, it provides tools to fetch and store data from a variety of sources. To install data, use the update tool in the pybbda.data.tools sub-module.

Example,

python -m pybbda.data.tools.update -h

 usage: update.py [-h] [--data-root DATA_ROOT] --data-source
                  {Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
                  [--make-dirs] [--overwrite] [--create-event-database]
                  [--min-year MIN_YEAR] [--max-year MAX_YEAR]
                  [--min-date MIN_DATE] [--max-date MAX_DATE]
                  [--num-threads NUM_THREADS]

 optional arguments:
     -h, --help            show this help message and exit
     --data-root DATA_ROOT
                           Root directory for data storage
     --data-source {Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
                           Update source
     --make-dirs           Make root dir if does not exist
     --overwrite           Overwrite files if they exist
     --create-event-database
                           Create a sqlite database for retrosheet event files
     --min-year MIN_YEAR   Min year to download
     --max-year MAX_YEAR   Max year to download
     --min-date MIN_DATE   Min date to download
     --max-date MAX_DATE   Max date to download
     --num-threads NUM_THREADS
                           Number of threads to use for downloads

The data will be downloaded to --data-root, which defaults to PYBBDA_DATA_ROOT. By default the script expects the target directory to exist, and it will raise a ValueError and exit if it does not. You can create the directory yourself or pass the --make-dirs option to have update create it automatically.

The --create-event-database option will cause a sqlite database to be created in the retrosheet directory, under the --data-root directory.

The --min-year and --max-year arguments apply to Fangraphs leaderboards and to the retrosheet events database, if enabled. The --min-date and --max-date arguments apply to statcast pitch-level data.

Following are some examples of specific data sources.

Lahman

python -m pybbda.data.tools.update --data-source Lahman
python -m pybbda.data.tools.update --data-source Lahman --data-root /tmp/missing --make-dirs
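
After the download completes you can load the data from Python. The snippet below is a hedged sketch: the LahmanData class and its batting attribute are assumptions about the pybbda data API that are not shown in this section; check the package's data documentation for the exact interface,

from pybbda.data import LahmanData  # assumed class name

lahman_data = LahmanData()     # assumed to read from PYBBDA_DATA_ROOT
batting = lahman_data.batting  # assumed attribute holding the Lahman batting table
print(batting.head())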

Baseball Reference WAR

python -m pybbda.data.tools.update --data-source BaseballReference

Fangraphs leaderboards, park factors, and guts constants

python -m pybbda.data.tools.update --data-source Fangraphs

Note that because downloading the full set of leaderboard data starting from 1871 takes 5-10 minutes, only the years 2018-2019 are downloaded by default. To download all years, use --min-year 1871,

python -m pybbda.data.tools.update --data-source Fangraphs --min-year 1871

Retrosheet events

Retrosheet event data is accessed with the pychadwick package.

To store a local copy,

$ python -m pybbda.data.tools.update --data-source retrosheet

The pychadwick package provides a command-line tool to parse retrosheet events data as CSV. The following downloads the events data to /tmp/retrosheet-example and then parses it to CSV,

$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs
INFO:pybbda.data.sources.retrosheet._update:_update:downloading file from https://github.com/chadwickbureau/retrosheet/archive/master.zip

$ pycwevent --data-root /tmp/retrosheet-example/retrosheet/retrosheet-master/event/regular > /tmp/all_events.csv
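
The resulting CSV can then be loaded for analysis, for example with pandas. This is a minimal sketch, assuming pandas is installed in your environment,

import pandas as pd

# load the retrosheet event records parsed by pycwevent above
events = pd.read_csv("/tmp/all_events.csv")
print(events.shape)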

The --create-event-database argument will cause a sqlite database to be created. Inserting the data takes much longer than bulk loading a CSV, but it is provided as a convenience. The --min-year and --max-year arguments limit the years used to populate the database.

$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs --min-year 1982 --max-year 1982 --create-event-database
INFO:pybbda.data.sources.retrosheet._update:_update:path /tmp/retrosheet-example/retrosheet/retrosheet-master exists, not downloading
INFO:pybbda.data.sources.retrosheet._update:_update:creating database with 26 files
$ ls /tmp/retrosheet-example/retrosheet/
retrosheet.db  retrosheet-master

$ sqlite3
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open /tmp/retrosheet-example/retrosheet/retrosheet.db
sqlite> select GAME_ID, BAT_ID, EVENT_CD from event limit 2;
CIN198204050|willb101|23
CIN198204050|bowal001|2
sqlite> .q
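
The same database can also be queried from Python with the standard-library sqlite3 module; the path, table, and column names below follow the example above,

import sqlite3

# open the database created by --create-event-database
conn = sqlite3.connect("/tmp/retrosheet-example/retrosheet/retrosheet.db")
rows = conn.execute(
    "SELECT GAME_ID, BAT_ID, EVENT_CD FROM event LIMIT 2"
).fetchall()
print(rows)
conn.close()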

Statcast pitch-level events

Events from May 2019,

$ python -m pybbda.data.tools.update --data-source statcast --min-date 2019-05-01 --max-date 2019-05-31 --num-threads 7

All events from 2019,

$ python -m pybbda.data.tools.update --data-source statcast --min-year 2019 --max-year 2019 --num-threads 7

All sources

The argument --data-source all is a shortcut to download data from all the supported sources.

python -m pybbda.data.tools.update --data-source all

CLI tools

Run expectancy

There’s a CLI tool for computing run expectancies using Markov chains.

python -m pybbda.analysis.run_expectancy.markov.cli --help

This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.

You can assign batting-event probabilities using a sequence of probabilities, or by referencing a player-season with the format {playerID}_{season}, where playerID is the Lahman ID and season is a 4-digit year. For example, to refer to Rickey Henderson’s 1982 season, use henderi01_1982.

The lineup is assigned by giving the lineup slot followed by either 5 probabilities or a player-season id. Lineup slot 0 is a shortcut that assigns the given value to all nine batters; any other specific slots are then filled in as noted.

The number of outs to model is 3 by default. It can be changed by setting the environment variable PYBBDA_MAX_OUTS.

Example: use a default set of probabilities for all 9 slots, but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs instead of 3),

PYBBDA_MAX_OUTS=27  python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927