# Getting Started

## Installation
This package is available on PyPI, so you can install it with `pip`:

```
pip install pybbda
```

Or you can install the latest master branch directly from the GitHub repo using `pip`:

```
pip install git+https://github.com/bdilday/pybbda.git
```

Or download the source and install from there:

```
git clone git@github.com:bdilday/pybbda.git
cd pybbda
pip install .
```
## Requirements

This package explicitly supports Python 3.6 and Python 3.7. It aims to support Python 3.8, but this is not guaranteed. It explicitly does not support any version prior to Python 3.6, including Python 2.7.
## Environment variables

The package uses the following environment variables:

`PYBBDA_DATA_ROOT`

The root directory for storing data (see [Installing data](#Installing-data)). Defaults to `${INSTALLATION_ROOT}/data/assets`, where `${INSTALLATION_ROOT}` is the path to the pybbda installation. This is typically the path to the Python installation plus `site-packages/pybbda`, which can cause write-permission problems if you're using a system Python instead of a user-controlled virtual environment. For this reason, and to avoid duplicating data if the package is installed into multiple virtual environments, it's recommended to set `PYBBDA_DATA_ROOT` to a custom path, for example:

```
export PYBBDA_DATA_ROOT=${HOME}/.pybbda/data
```
`PYBBDA_LOG_LEVEL`

Sets the logging level for the package at runtime. The default is `INFO`.
`PYBBDA_MAX_OUTS`

Sets the number of outs for the run expectancy tool. Defaults to 3, i.e. simulating a single inning. Use 27 to simulate a full game.
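To make the default behavior concrete, here is a small sketch of how these variables could be resolved in Python. The function names are illustrative and are not part of the pybbda API:

```python
import os
from pathlib import Path


def resolve_data_root(installation_root: str) -> Path:
    """Return PYBBDA_DATA_ROOT if set, else the packaged default location."""
    default = Path(installation_root) / "data" / "assets"
    return Path(os.environ.get("PYBBDA_DATA_ROOT", str(default)))


def resolve_max_outs() -> int:
    """Return PYBBDA_MAX_OUTS as an int, defaulting to 3 (one inning)."""
    return int(os.environ.get("PYBBDA_MAX_OUTS", "3"))
```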
## Installing data

This package ships without any data. Instead, it provides tools to fetch and store data from a variety of sources. To install data you can use the update tool in the `pybbda.data.tools` sub-module. For example:
```
$ python -m pybbda.data.tools.update -h
usage: update.py [-h] [--data-root DATA_ROOT] --data-source
                 {Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
                 [--make-dirs] [--overwrite] [--create-event-database]
                 [--min-year MIN_YEAR] [--max-year MAX_YEAR]
                 [--min-date MIN_DATE] [--max-date MAX_DATE]
                 [--num-threads NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --data-root DATA_ROOT
                        Root directory for data storage
  --data-source {Lahman,BaseballReference,Fangraphs,retrosheet,statcast,all}
                        Update source
  --make-dirs           Make root dir if does not exist
  --overwrite           Overwrite files if they exist
  --create-event-database
                        Create a sqlite database for retrosheet event files
  --min-year MIN_YEAR   Min year to download
  --max-year MAX_YEAR   Max year to download
  --min-date MIN_DATE   Min date to download
  --max-date MAX_DATE   Max date to download
  --num-threads NUM_THREADS
                        Number of threads to use for downloads
```
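For reference, the interface shown in that help text can be approximated with a short argparse sketch. Everything here is inferred from the help output above, not taken from the actual pybbda source:

```python
import argparse

SOURCES = ["Lahman", "BaseballReference", "Fangraphs", "retrosheet", "statcast", "all"]


def make_parser() -> argparse.ArgumentParser:
    """Approximate the update tool's CLI as shown in its --help output."""
    parser = argparse.ArgumentParser(prog="update.py")
    parser.add_argument("--data-root", help="Root directory for data storage")
    parser.add_argument("--data-source", required=True, choices=SOURCES,
                        help="Update source")
    parser.add_argument("--make-dirs", action="store_true",
                        help="Make root dir if does not exist")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite files if they exist")
    parser.add_argument("--create-event-database", action="store_true",
                        help="Create a sqlite database for retrosheet event files")
    parser.add_argument("--min-year", type=int, help="Min year to download")
    parser.add_argument("--max-year", type=int, help="Max year to download")
    parser.add_argument("--min-date", help="Min date to download")
    parser.add_argument("--max-date", help="Max date to download")
    parser.add_argument("--num-threads", type=int,
                        help="Number of threads to use for downloads")
    return parser
```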
The data will be downloaded to `--data-root`, which defaults to `PYBBDA_DATA_ROOT`. By default the script expects the target directory to exist, and it will raise a `ValueError` and exit if it does not. You can create the directory yourself, or pass the `--make-dirs` option to have the update tool create it automatically. The `--create-event-database` option creates a sqlite database in the `retrosheet` directory under the `--data-root` directory.

The `--min-year` and `--max-year` arguments apply to the Fangraphs leaderboards and to the retrosheet events database, if enabled. The `--min-date` and `--max-date` arguments apply to statcast pitch-level data.
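The update tool can also be driven from Python, for example via subprocess. The sketch below only builds the command line; the argument values are illustrative:

```python
import sys
from typing import List, Optional


def update_command(source: str, data_root: Optional[str] = None,
                   make_dirs: bool = False) -> List[str]:
    """Build the argv for the pybbda update tool without running it."""
    cmd = [sys.executable, "-m", "pybbda.data.tools.update",
           "--data-source", source]
    if data_root is not None:
        cmd += ["--data-root", data_root]
    if make_dirs:
        cmd.append("--make-dirs")
    return cmd


# To actually run the download (commented out to keep this example side-effect free):
# import subprocess
# subprocess.run(update_command("Lahman", "/tmp/pybbda-data", make_dirs=True), check=True)
```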
The following are examples for specific data sources.
### Lahman

```
python -m pybbda.data.tools.update --data-source Lahman
python -m pybbda.data.tools.update --data-source Lahman --data-root /tmp/missing --make-dirs
```
### Baseball Reference WAR

```
python -m pybbda.data.tools.update --data-source BaseballReference
```
### Fangraphs leaderboards, park factors, and guts constants

```
python -m pybbda.data.tools.update --data-source Fangraphs
```

Note that because downloading the full set of leaderboard data starting from 1871 takes 5-10 minutes, by default only the years 2018-2019 are downloaded. To get all of them, use `--min-year 1871`:

```
python -m pybbda.data.tools.update --data-source Fangraphs --min-year 1871
```
### Retrosheet events

Retrosheet event data is accessed with the pychadwick package. To store a local copy:

```
$ python -m pybbda.data.tools.update --data-source retrosheet
```

The pychadwick package provides a command-line tool to parse retrosheet events data as CSV. The following downloads the events data to /tmp/retrosheet-example and then parses them to CSV:

```
$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs
INFO:pybbda.data.sources.retrosheet._update:_update:downloading file from https://github.com/chadwickbureau/retrosheet/archive/master.zip

$ pycwevent --data-root /tmp/retrosheet-example/retrosheet/retrosheet-master/event/regular > /tmp/all_events.csv
```
The `--create-event-database` argument will cause a sqlite database to be created. Inserting the data takes much longer than bulk loading a CSV, but it is provided as a convenience. The `--min-year` and `--max-year` arguments limit the years used to populate the database.
```
$ python -m pybbda.data.tools.update --data-source retrosheet --data-root /tmp/retrosheet-example --make-dirs --min-year 1982 --max-year 1982 --create-event-database
INFO:pybbda.data.sources.retrosheet._update:_update:path /tmp/retrosheet-example/retrosheet/retrosheet-master exists, not downloading
INFO:pybbda.data.sources.retrosheet._update:_update:creating database with 26 files
```
```
$ ls /tmp/retrosheet-example/retrosheet/
retrosheet.db  retrosheet-master

$ sqlite3
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open /tmp/retrosheet-example/retrosheet/retrosheet.db
sqlite> select GAME_ID, BAT_ID, EVENT_CD from event limit 2;
CIN198204050|willb101|23
CIN198204050|bowal001|2
sqlite> .q
```
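The same query can be issued from Python with the stdlib `sqlite3` module. The sketch below builds a tiny in-memory table with just the three columns shown above (the real event table has many more) so it runs without the downloaded data; point the connection at `retrosheet.db` instead to query the real database:

```python
import sqlite3


def first_events(conn, n=2):
    """Return the first n (GAME_ID, BAT_ID, EVENT_CD) rows from the event table."""
    cur = conn.execute("SELECT GAME_ID, BAT_ID, EVENT_CD FROM event LIMIT ?", (n,))
    return cur.fetchall()


# Demo against an in-memory stand-in for retrosheet.db, seeded with the
# two rows from the sqlite3 session above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (GAME_ID TEXT, BAT_ID TEXT, EVENT_CD INTEGER)")
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?)",
    [("CIN198204050", "willb101", 23), ("CIN198204050", "bowal001", 2)],
)
print(first_events(conn))
```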
### Statcast pitch-level events

Events from May 2019:

```
$ python -m pybbda.data.tools.update --data-source statcast --min-date 2019-05-01 --max-date 2019-05-31 --num-threads 7
```

All events from 2019:

```
$ python -m pybbda.data.tools.update --data-source statcast --min-year 2019 --max-year 2019 --num-threads 7
```
### All sources

The argument `--data-source all` is a shortcut to download data from all the supported sources.

```
python -m pybbda.data.tools.update --data-source all
```
## CLI tools

### Run expectancy

There's a CLI tool for computing run expectancies from Markov chains:

```
python -m pybbda.analysis.run_expectancy.markov.cli --help
```
This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.
You can assign batting-event probabilities using a sequence of probabilities, or by referencing a player-season with the format `{playerID}_{season}`, where playerID is the Lahman ID and season is a 4-digit year. For example, to refer to Rickey Henderson's 1982 season, use `henderi01_1982`.
The lineup is assigned by giving the lineup slot followed by either 5 probabilities, or a player-season id. The lineup-slot 0 is a code to assign all nine batters to this value. Any other specific slots will be filled in as noted.
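The slot-0 convention described above could be implemented along these lines. This is a sketch of the semantics, not the actual pybbda code: slot 0 provides a default for all nine slots, and any specific slots then override it.

```python
def resolve_lineup(assignments):
    """Resolve lineup-slot assignments into a full 9-slot lineup.

    assignments maps a lineup slot to a batter spec (e.g. a player-season id
    or a tuple of event probabilities). Slot 0 sets the default for all nine
    slots; slots 1-9 override that default.
    """
    default = assignments.get(0)
    lineup = {slot: default for slot in range(1, 10)}
    for slot, spec in assignments.items():
        if slot != 0:
            lineup[slot] = spec
    return lineup
```

For the example command below, `resolve_lineup({0: default_probs, 1: "henderi01_1982", 4: "ruthba01_1927"})` would fill slots 1 and 4 with the named player-seasons and every other slot with the default probabilities.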
The number of outs to model is 3 by default. It can be changed by setting the environment variable `PYBBDA_MAX_OUTS`.
Example: Use a default set of probabilities for all 9 slots, but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs instead of 3):

```
PYBBDA_MAX_OUTS=27 python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927
```
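As a rough cross-check on what a run-expectancy tool computes, here is a simplified Monte Carlo sketch for a single set of batting-event probabilities (BB, 1B, 2B, 3B, HR, with the remaining probability mass treated as an out). It assumes identical batters and naive baserunning where every runner advances the same number of bases as the batter, including on walks; none of this code is from pybbda, and the Markov CLI will give different (and better-modeled) numbers:

```python
import random

# Bases gained by the batter for BB, 1B, 2B, 3B, HR, in the same order
# as the -b probabilities in the CLI example
BASES_GAINED = (1, 1, 2, 3, 4)


def simulate_inning(probs, max_outs=3, rng=random):
    """Sample the runs scored in one inning with naive runner advancement."""
    outs, runs = 0, 0
    bases = [0, 0, 0]  # first, second, third
    while outs < max_outs:
        r, cum, advance = rng.random(), 0.0, 0
        for gained, p in zip(BASES_GAINED, probs):
            cum += p
            if r < cum:
                advance = gained
                break
        if advance == 0:  # remaining probability mass: an out
            outs += 1
            continue
        for _ in range(advance):  # every runner moves up one base per step
            runs += bases[2]      # runner on third scores
            bases = [0] + bases[:2]
        if advance == 4:
            runs += 1             # home run: the batter scores too
        else:
            bases[advance - 1] = 1
    return runs


def run_expectancy(probs, n=20000, max_outs=3, seed=0):
    """Average runs per inning over n simulated innings."""
    rng = random.Random(seed)
    return sum(simulate_inning(probs, max_outs, rng) for _ in range(n)) / n
```

With the probabilities from the example (`0.08 0.15 0.05 0.005 0.03`), `run_expectancy` returns a small positive number of runs per 3-out inning, and setting `max_outs=27` scales the estimate up to a full game's worth of outs.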