Follow up to Gameday data & python scripts
4 August 2009 10:50 am
In an earlier post I provided a script that spidered the MLB Gameday site and downloaded all of the XML information. I mentioned that I’d be updating the script at some point to pull nightlies and run the data into a MySQL database, allowing easier manipulation and exploration of the data.
First, I’ve updated the spider script to a) retrieve less data (there is a lot of redundant data in the Gameday collection), b) only pull information for games where data is missing (accomplishing the nightlies part of things, and c) accept a year parameter when it’s run.
Here’s the updated script (note: I had to rename it to .pyy so that Dreamhost would serve the thing; rename it back to .py). I won’t paste it here. You will need to edit line 12 and change the value for OUTPUT to wherever you’d like your XML files stored. I have it configured in cron to run nightly at 3am like so:
0 3 * * * /home/wells/baseball/pbp/pbp.py -y 2009 > /dev/null
Second, here is the schema I am using to create the Gameday database. It contains tables for games, players, hit charts, at-bats, and individual pitches, as well as indexes to speed up certain queries:
DROP TABLE IF EXISTS `game`; CREATE TABLE `game` ( id varchar(30) DEFAULT NULL, home varchar(30) DEFAULT NULL, away varchar(30) DEFAULT NULL, date date DEFAULT NULL ) ENGINE=InnoDB; DROP TABLE IF EXISTS `hitchart`; CREATE TABLE `hitchart` ( game varchar(30) DEFAULT NULL, des varchar(25) DEFAULT NULL, x decimal(6,2) DEFAULT NULL, y decimal(6,2) DEFAULT NULL, batter int(6) DEFAULT NULL, pitcher int(6) DEFAULT NULL, type enum('O','H','E') DEFAULT NULL, team enum('A', 'H') DEFAULT NULL, inning tinyint(1) DEFAULT NULL ) ENGINE=InnoDB; DROP TABLE IF EXISTS `player`; CREATE TABLE `player`( team varchar(3) DEFAULT NULL, id int(6) DEFAULT NULL, pos varchar(2) DEFAULT NULL, type varchar(10) DEFAULT NULL, first_name varchar(25) DEFAULT NULL, last_name varchar(25) DEFAULT NULL, jersey_number varchar(10) DEFAULT NULL, height varchar(5) DEFAULT NULL, weight varchar(3) DEFAULT NULL, bats varchar(10) DEFAULT NULL, throws varchar(10) DEFAULT NULL, dob varchar(15) DEFAULT NULL ) ENGINE=InnoDB; DROP TABLE IF EXISTS `atbat`; CREATE TABLE `atbat` ( game varchar(30) DEFAULT NULL, inning int DEFAULT NULL, num int UNSIGNED DEFAULT NULL, b int DEFAULT NULL, s int DEFAULT NULL, o int DEFAULT NULL, batter int DEFAULT NULL, stand enum('R','L'), b_height varchar(5), pitcher int DEFAULT NULL, p_throws enum('R', 'L'), des varchar(1000) DEFAULT NULL, event varchar(100) DEFAULT NULL, brief_event varchar(100) DEFAULT NULL, score varchar(1) DEFAULT NULL, home_team_runs tinyint DEFAULT NULL, away_team_runs tinyint DEFAULT NULL ) ENGINE=InnoDB; DROP TABLE IF EXISTS `pitch`; CREATE TABLE `pitch` ( game varchar(30) DEFAULT NULL, id int DEFAULT NULL, atbat int UNSIGNED NOT NULL, des varchar(50) DEFAULT NULL, type enum('S', 'B', 'X') DEFAULT NULL, x float DEFAULT NULL, y float DEFAULT NULL, sv_id varchar(25) DEFAULT NULL, start_speed float DEFAULT NULL, end_speed float DEFAULT NULL, sz_top float DEFAULT NULL, sz_bot float DEFAULT NULL, pfx_x float DEFAULT NULL, pfx_z float DEFAULT NULL, px float DEFAULT NULL, pz float DEFAULT NULL, x0 float DEFAULT NULL, y0 float DEFAULT NULL, z0 float DEFAULT NULL, vx0 float DEFAULT NULL, vy0 float DEFAULT NULL, vz0 float DEFAULT NULL, ax float DEFAULT NULL, ay float DEFAULT NULL, az float DEFAULT NULL, break_y float DEFAULT NULL, break_angle float DEFAULT NULL, break_length float DEFAULT NULL, pitch_type varchar(5) DEFAULT NULL, type_confidence float DEFAULT NULL, spin_dir float DEFAULT NULL, spin_rate float DEFAULT NULL, on_1b int(6) DEFAULT NULL, on_2b int(6) DEFAULT NULL, on_3b int(6) DEFAULT NULL, score tinyint DEFAULT NULL ) ENGINE=InnoDB; CREATE INDEX player_id ON player(id); CREATE INDEX game_id ON game(id); CREATE INDEX game_id ON atbat(game); CREATE INDEX game_id ON pitch(game); CREATE INDEX game_id ON hitchart(game);
Run that into a MySQL database called ‘pbp’.
LASTLY, here is the python script called parse.py which takes the XML from Gameday and puts it into the relevant database tables. A CAVEAT: you’ll need to change the location of the my.cnf file (specifying your access credentials) in line 39. If you did not call your database ‘pbp’ you will also need to change that, also conveniently located on line 39. Otherwise you should be good to go. This script, as do most of my python scripts located on this site, relies on MySQLdb. This isn’t standard in python 2.6 (and won’t work at all apparently in 3.0) – you will need to compile it manually for 2.6.
#!/usr/bin/env python from xml.dom.minidom import parse import glob import MySQLdb import sys, getopt import os YEAR = None try: opts, args = getopt.getopt(sys.argv[1:], "y:") except getopt.GetoptError, e: raise SystemExit for opt, arg in opts: if opt == "-y": YEAR = int(arg) if YEAR is None: print "please run like parse.py -y 2009" raise SystemExit def nodeToTable(table, node, extra = {}): values = [] keys = node.attributes.keys() for val in node.attributes.values(): values.append(val.value) for key in extra.keys(): keys.insert(0, key) for value in extra.values(): values.insert(0, value) query = "INSERT INTO %s (%s) VALUES(%s)" %(table, ','.join(keys),','.join(['%s'] * len(values))) cursor.execute(query, values) db = MySQLdb.connect(host="localhost", read_default_file="/home/wells/.my.cnf", db="pbp") cursor = db.cursor() for month in glob.glob("./data/year_%d/*" % YEAR): for day in glob.glob("%s/*" % month): for game in glob.glob("%s/*" % day): gid = os.path.basename(game) cursor.execute("SELECT * FROM game WHERE id = '%s'" % gid) if cursor.rowcount == 0: print "adding game %s" % gid xml = parse("%s/boxscore.xml" % game) box = xml.getElementsByTagName("boxscore").item(0) home = box.attributes["home_fname"].value away = box.attributes["away_fname"].value date = box.attributes["date"].value cursor.execute("INSERT INTO game (id, home, away, date) VALUES ('%s', '%s', '%s', STR_TO_DATE('%s', '%%M %%d, %%Y'))" % (gid, home, away, date)) xml = parse("%s/inning/inning_hit.xml" % game) for hip in xml.getElementsByTagName("hip"): nodeToTable("hitchart", hip, {'game': gid}) for ifile in glob.glob("%s/inning/*.xml" % game): xml = parse(ifile) for inning in xml.getElementsByTagName("inning"): for atbat in xml.getElementsByTagName("atbat"): nodeToTable('atbat', atbat, {'game': gid, 'inning': inning.getAttribute("num")}) for pitch in atbat.getElementsByTagName("pitch"): nodeToTable('pitch', pitch, {'atbat': atbat.getAttribute("num"), 'game': gid}) for batter in glob.glob("%s/batters/*.xml" % game): xml = parse(batter) for player in xml.getElementsByTagName("Player"): cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value)) if cursor.rowcount == 0: nodeToTable("player", player) for pitcher in glob.glob("%s/pitchers/*.xml" % game): xml = parse(pitcher) for player in xml.getElementsByTagName("Player"): cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value)) if cursor.rowcount == 0: nodeToTable("player", player) db.commit() cursor.close() db.close()
I have this cron’ed to run at 4am, an hour after the parsing script does its job. Like so:
0 4 * * * /home/wells/baseball/pbp/parse.py -y 2009 > /dev/null
Let me know if you have any issues with this. I’ll be tagging all baseball hack related posts with the baseballhacks tag. ¶
3 Responses to Follow up to Gameday data & python scripts
Brock on August 12th, 2009 at 7:21 pm
Keep it up man, this is some awesome stuff!
Brock on August 16th, 2009 at 10:30 am
Quick question. I know this is outside the scope of your post, but I have been learning a ton from your examples. I am trying to figure out why my regex is giving me an error. “sre_constants.error: multiple repeat” If you dont mind, would you lend your expertise? I figure it has to do with my searching in regex, but I dont get what syntax error exists. Thanks!
Keep up the great work!
———
import re
page = r’AJ Anxiety08-13-2009, 03:40 PMvery nice..’
s2 = r’div class=”post”>(.?+)(.?+)(.?+)’
for match in re.finditer(s2, page):
print match.group(0)
Brock on August 16th, 2009 at 10:31 am
I dont think the code came out right, so if you are willing to help, I can email if that is better.
Thanks again,
Brock
Leave a comment