Gameday, get yr Gameday
1 February 2010 5:30 pm
MLB.com provides, among other things, all of the pitch information for each MLB and AAA game in XML format. It’s the data that drives their wonderful little service Gameday. If you want to take a spin through what the data looks like, start here and poke around. What’s key, most folks agree, is the Pitch/FX information, but there’s also pitch-by-pitch logs for every game.
I’ve put together a little package which includes (1) a schema for a MySQL database to retain the information, and (2) a python script which will handle fetching and parsing the XML data found on MLB.com servers. If you’re interested in such a thing, you can download it from the github project page.
There is detailed installation and execution information found on the wiki at github as well but just to provide them here:
./gameday.py
With the following arguments:
--year=XXXXfour digit year--day=X,Ydays in a comma separated list--month=X,Ymonths in a comma separated list--type=[mlb, aaa]optional: which league to process. Default is ‘mlb’. Any of the categories found here (AA, etc) should work- I’ve just worked with MLB and AAA.--verboseShows every HTTP request--deltaUses delta mode.
When delta mode is run, the script will store the last date it processed in the database. Upon next execution, it will start from where it left off. This is useful for running the thing nightly to grab the latest stuff. ¶
5 Responses to Gameday, get yr Gameday
RZ on February 6th, 2010 at 11:48 am
Very cool. Two questions:
Is this like Baseball on Stick where is parses the data directly into the database?
I looked over the database structure and couldn’t find the balls and strike count for each pitch. Is it update pitch by pitch in the atbats table? If not, one could not find the pitch by pitch count which makes any run value formulas worthless.
Wells on February 6th, 2010 at 12:02 pm
RZ- yeah, this is similar to BBOS. The database table ‘atbat’ contains two columns – ‘b’ and ‘s’ – that contain the balls and strikes information at the time of the event (described by ‘des’ and ‘event’ columns).
Let me know if this helps.
Wells on February 6th, 2010 at 12:12 pm
RZ – if you have the 2009 data loaded, you can run this query:
SELECT
*
FROM pitch
WHERE game_id = ‘gid_2009_06_08_pitmlb_atlmlb_1′
ORDER BY num ASC;
Each pitch is grouped into an atbat by the pitch.num column. You can derive the count this way, also using pitch.type (S = strike, B = ball).
I could update the script to do this, tally the count per pitch. I will look into that.
RZ on February 6th, 2010 at 1:30 pm
I have a MySQL database using Mike Fast’s method with Perl. The problem I have with it is the long process of parsing the data into the database which takes a very,very long time.
That is why I am interested in your setup. But without having the balls and strike count for each pitch doesn’t make it worth to use your method until you add that in.
Wells on February 6th, 2010 at 2:44 pm
RZ – I just made an update. Now there are two columns added to the pitch table – ‘b’ and ‘s’ – for the count at the beginning of the pitch. Let me know if you find any issues, but it seems to work well. You can download the newest at github: http://github.com/wellsoliver/py-gameday/archives/master
Leave a comment