Getting Started¶
Installation¶
This package is available on PyPI, so you can install it with pip:
pip install pybbda
Or you can install the latest master branch
directly from the GitHub repo using pip:
pip install git+https://github.com/bdilday/pybbda.git
or clone the source and install it from the local checkout:
git clone git@github.com:bdilday/pybbda.git
cd pybbda
pip install .
Requirements¶
This package explicitly
supports Python 3.6 and Python 3.7. It aims
to support Python 3.8 but this is not guaranteed.
It explicitly does not support any versions
prior to Python 3.6, including Python 2.7.
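The support policy above can be checked before installing. The following is a minimal sketch, not part of pybbda: the function name `support_status` and its return labels are hypothetical, and only the version numbers come from the text.

```python
import sys

# Versions the text says are explicitly supported.
SUPPORTED = {(3, 6), (3, 7)}
# Version the text says is targeted but not guaranteed.
BEST_EFFORT = {(3, 8)}


def support_status(version_info=None):
    """Classify a (major, minor, ...) version tuple per the Requirements text.

    Hypothetical helper for illustration; pybbda does not provide it.
    """
    info = sys.version_info if version_info is None else version_info
    major_minor = (info[0], info[1])
    if major_minor in SUPPORTED:
        return "supported"
    if major_minor in BEST_EFFORT:
        return "best-effort"
    if major_minor < (3, 6):
        return "unsupported"
    # Newer than anything the text mentions, so no claim is made.
    return "unknown"


print(support_status())
```

Versions newer than 3.8 are reported as "unknown" because the text makes no statement about them.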
Environment variables¶
The package uses the following environment variables:
PYBBDA_DATA_ROOT
The root directory for storing data
(see [Installing data](#Installing-data)). Defaults to ${INSTALLATION_ROOT}/data/assets,
where ${INSTALLATION_ROOT} is the path to the pybbda installation.
The code location is typically a path to the Python installation
plus site-packages/pybbda.
This can cause a problem with write permissions
if you’re using a system Python instead of a user-controlled
virtual environment.
For this reason, and to avoid duplication if the package is
installed into multiple virtual environments, it’s
recommended to use a custom path for PYBBDA_DATA_ROOT, for example,
export PYBBDA_DATA_ROOT=${HOME}/.pybbda/data
PYBBDA_LOG_LEVEL
This sets the logging level for the package at runtime.
The default is INFO.
PYBBDA_MAX_OUTS
Sets the number of outs for the run expectancy tool. Defaults to 3, i.e., simulating an inning. Use 27 to simulate a full game.
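To make the defaults above concrete, here is a minimal sketch of how the three variables could be resolved. It only mirrors the documented defaults and is not the package's actual code: the function `resolve_settings` and its `installation_root` parameter are hypothetical.

```python
import os
from pathlib import Path


def resolve_settings(environ=None, installation_root="INSTALLATION_ROOT"):
    """Resolve the three documented variables, falling back to their defaults.

    Hypothetical helper: `installation_root` stands in for the pybbda
    installation path, which this sketch does not try to discover.
    """
    env = os.environ if environ is None else environ
    default_root = Path(installation_root) / "data" / "assets"
    return {
        "data_root": Path(env.get("PYBBDA_DATA_ROOT", default_root)),
        "log_level": env.get("PYBBDA_LOG_LEVEL", "INFO"),
        # Environment values are strings, so convert to an integer.
        "max_outs": int(env.get("PYBBDA_MAX_OUTS", 3)),
    }


for name, value in resolve_settings().items():
    print(name, value)
```

Passing a plain dictionary as `environ` makes it easy to try out different settings without changing the real environment.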
Installing data¶
This package ships without any data. Instead it provides tools
to fetch and store data from a variety of sources. To install
data you can use the update tool in the pybbda.data.tools
sub-module.
Example,
python -m pybbda.data.tools.update -h
usage: update.py [-h] [--data-root DATA_ROOT] --data-source
{Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
[--make-dirs] [--overwrite] [--create-event-database]
[--min-year MIN_YEAR] [--max-year MAX_YEAR]
[--min-date MIN_DATE] [--max-date MAX_DATE]
[--num-threads NUM_THREADS]
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT
Root directory for data storage
--data-source {Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
Update source
--make-dirs Make root dir if does not exist
--overwrite Overwrite files if they exist
--create-event-database
Create a sqlite database for retrosheet event files
--min-year MIN_YEAR Min year to download
--max-year MAX_YEAR Max year to download
--min-date MIN_DATE Min date to download
--max-date MAX_DATE Max date to download
--num-threads NUM_THREADS
Number of threads to use for downloads
The data will be downloaded to --data-root, which defaults to the value of
PYBBDA_DATA_ROOT.
By default the script expects the target directory
to exist; if it does not, the script raises a ValueError and exits.
Either create the directory yourself or pass --make-dirs to have the update tool create it automatically.
The --create-event-database option creates a sqlite database in the
retrosheet directory, under the --data-root directory.
The --min-year and --max-year arguments apply to Fangraphs leaderboards and to the retrosheet
events database, if enabled. The --min-date and --max-date arguments apply to statcast
pitch-level data.
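If you want to drive the update tool from a Python script, one option is to assemble the command line and pass it to `subprocess`. The sketch below uses only the flags listed in the help output above. The helper names `build_update_command` and `run_update` are hypothetical and not part of pybbda.

```python
import subprocess
import sys


def build_update_command(data_source, data_root=None, make_dirs=False,
                         overwrite=False, create_event_database=False,
                         min_year=None, max_year=None,
                         min_date=None, max_date=None, num_threads=None):
    """Build the argument list for `python -m pybbda.data.tools.update`."""
    cmd = [sys.executable, "-m", "pybbda.data.tools.update",
           "--data-source", data_source]
    if data_root is not None:
        cmd += ["--data-root", str(data_root)]
    # Boolean switches are appended only when enabled.
    switches = (
        ("--make-dirs", make_dirs),
        ("--overwrite", overwrite),
        ("--create-event-database", create_event_database),
    )
    for flag, enabled in switches:
        if enabled:
            cmd.append(flag)
    # Valued options are appended only when a value was given.
    options = (
        ("--min-year", min_year),
        ("--max-year", max_year),
        ("--min-date", min_date),
        ("--max-date", max_date),
        ("--num-threads", num_threads),
    )
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_update(**kwargs):
    """Run the update tool; requires pybbda to be installed."""
    return subprocess.run(build_update_command(**kwargs), check=True)


cmd = build_update_command("Lahman", data_root="/tmp/missing", make_dirs=True)
print(" ".join(cmd))
```

Building the command as a list, rather than a single string, avoids shell quoting problems when paths contain spaces.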
The following are examples for specific data sources.
Lahman¶
python -m pybbda.data.tools.update --data-source Lahman
python -m pybbda.data.tools.update --data-source Lahman --data-root /tmp/missing --make-dirs
Baseball Reference WAR¶
python -m pybbda.data.tools.update --data-source BaseballReference
Fangraphs leaderboards, park factors, and guts constants¶
python -m pybbda.data.tools.update --data-source Fangraphs
Because downloading the full set of
leaderboard data starting from 1871 takes 5-10 minutes,
only the years 2018-2019 are downloaded by default. To get them all,
use --min-year 1871:
python -m pybbda.data.tools.update --data-source Fangraphs --min-year 1871
Retrosheet events¶
Retrosheet event data is accessed with the pychadwick package.
To store a local copy,
$ python -m pybbda.data.tools.update --data-source retrosheet
The pychadwick package provides a command line tool to parse retrosheet events data as CSV.
The following downloads the events data to /tmp/retrosheet-example and then parses it to CSV:
$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs
INFO:pybbda.data.sources.retrosheet._update:_update:downloading file from https://github.com/chadwickbureau/retrosheet/archive/master.zip
$ pycwevent --data-root /tmp/retrosheet-example/retrosheet/retrosheet-master/event/regular > /tmp/all_events.csv
The --create-event-database argument causes a sqlite database to be created. Inserting the data
this way takes much longer than bulk-loading a CSV, but it is provided as a convenience.
The --min-year and --max-year arguments limit the years used to populate the database.
$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs --min-year 1982 --max-year 1982 --create-event-database
INFO:pybbda.data.sources.retrosheet._update:_update:path /tmp/retrosheet-example/retrosheet/retrosheet-master exists, not downloading
INFO:pybbda.data.sources.retrosheet._update:_update:creating database with 26 files
$ ls /tmp/retrosheet-example/retrosheet/
retrosheet.db retrosheet-master
$ sqlite3
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open /tmp/retrosheet-example/retrosheet/retrosheet.db
sqlite> select GAME_ID, BAT_ID, EVENT_CD from event limit 2;
CIN198204050|willb101|23
CIN198204050|bowal001|2
sqlite> .q
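The same query can be run from Python with the standard-library `sqlite3` module. This is a minimal sketch that assumes the database was created at the path used in the session above. The helper name `first_events` is hypothetical, while the table and column names are the ones shown in the session.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def first_events(conn, limit=2):
    """Return (GAME_ID, BAT_ID, EVENT_CD) rows from the event table."""
    query = "SELECT GAME_ID, BAT_ID, EVENT_CD FROM event LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()


# Path taken from the example session; adjust it to your own --data-root.
db_path = Path("/tmp/retrosheet-example/retrosheet/retrosheet.db")
if db_path.exists():
    with closing(sqlite3.connect(str(db_path))) as conn:
        for row in first_events(conn):
            print(row)
else:
    print(f"{db_path} not found; run the update tool first")
```

Passing the limit as a bound parameter keeps the query string constant, which is a good habit even for read-only exploration.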
Statcast pitch-level events¶
Events from May, 2019.
$ python -m pybbda.data.tools.update --data-source statcast --min-date 2019-05-01 --max-date 2019-05-31 --num-threads 7
All events from 2019.
$ python -m pybbda.data.tools.update --data-source statcast --min-year 2019 --max-year 2019 --num-threads 7
All sources¶
The argument --data-source all is a shortcut to download data from
all the supported sources.
python -m pybbda.data.tools.update --data-source all
CLI tools¶
Run expectancy¶
There’s a CLI tool for computing run expectancies from Markov chains.
python -m pybbda.analysis.run_expectancy.markov.cli --help
This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.
You can assign batting-event probabilities using a sequence of
probabilities, or by referencing a player-season with the
format {playerID}_{season}, where playerID is the
Lahman ID and season is a 4-digit year. For example, to
refer to Rickey Henderson’s 1982 season, use henderi01_1982.
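The {playerID}_{season} format can be illustrated with a small parser. This is a sketch of the format as described above, not the package's own parsing code. The function name `parse_player_season` is hypothetical.

```python
import re

# A player ID, an underscore, then a 4-digit year at the end of the string.
_PLAYER_SEASON = re.compile(r"^(?P<player_id>.+)_(?P<season>\d{4})$")


def parse_player_season(token):
    """Split a token such as 'henderi01_1982' into (playerID, season)."""
    match = _PLAYER_SEASON.match(token)
    if match is None:
        raise ValueError(f"expected playerID_season, got {token!r}")
    return match.group("player_id"), int(match.group("season"))


print(parse_player_season("henderi01_1982"))  # ('henderi01', 1982)
```

The sketch only checks the shape of the token. It does not check that the player ID exists in the Lahman data.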
The lineup is assigned by giving the lineup slot followed by either 5 probabilities or a player-season ID. Lineup slot 0 is a code that assigns the given value to all nine batters. Any other slots that are given explicitly are filled in with their own values.
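The slot rule can be sketched as follows. This only illustrates the behavior described above and is not the CLI's implementation. The function `build_lineup` is hypothetical, and it assumes slot 0 is applied first so that specific slots override it regardless of the order in which they are given.

```python
def build_lineup(assignments, size=9):
    """Turn (slot, value) pairs into a list with one entry per lineup slot.

    Assumption: slot 0 fills every slot first, then slots 1..size override it.
    """
    lineup = [None] * size
    for slot, value in assignments:
        if slot == 0:
            lineup = [value] * size
    for slot, value in assignments:
        if slot == 0:
            continue
        if not 1 <= slot <= size:
            raise ValueError(f"lineup slot must be between 0 and {size}, got {slot}")
        lineup[slot - 1] = value
    return lineup


# Five probabilities for every slot, with two player-seasons in slots 1 and 4.
default = (0.08, 0.15, 0.05, 0.005, 0.03)
lineup = build_lineup([(0, default), (1, "henderi01_1982"), (4, "ruthba01_1927")])
for slot, batter in enumerate(lineup, start=1):
    print(slot, batter)
```

Slots left unassigned when no slot 0 value is given stay `None` in this sketch. How the real tool treats that case is not covered here.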
The number of outs to model is 3 by default. It can be changed by setting the
environment variable PYBBDA_MAX_OUTS.
Example: Use a default set of probabilities for all 9 slots, but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs instead of 3):
PYBBDA_MAX_OUTS=27 python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927