All posts tagged with python

Gameday, get yr Gameday

MLB.com provides, among other things, all of the pitch information for each MLB and AAA game in XML format. It’s the data that drives their wonderful little service Gameday. If you want to take a spin through what the data looks like, start here and poke around. What’s key, most folks agree, is the Pitch/FX information, but there’s also pitch-by-pitch logs for every game.

I’ve put together a little package which includes (1) a schema for a MySQL database to retain the information, and (2) a python script which will handle fetching and parsing the XML data found on MLB.com servers. If you’re interested in such a thing, you can download it from the github project page.

There is detailed installation and execution information found on the wiki at github as well but just to provide them here:

./gameday.py

With the following arguments:

  • --year=XXXX four digit year
  • --day=X,Y days in a comma separated list
  • --month=X,Y months in a comma separated list
  • --type=[mlb, aaa] optional: which league to process. Default is ‘mlb’. Any of the categories found here (AA, etc) should work- I’ve just worked with MLB and AAA.
  • --verbose Shows every HTTP request
  • --delta Uses delta mode.

When delta mode is run, the script will store the last date it processed in the database. Upon next execution, it will start from where it left off. This is useful for running the thing nightly to grab the latest stuff.

Update to MLB API layer

I’ve updated my MLB API abstraction layer for python to include an easy way to persist player data, including basic player attributes and game log information, to a MySQL database.

The project is hosted up at Github and the Wiki there now includes an example usage for the library, a script that can be run nightly to create in-season game logs for players.

Please do let me know of any issues you may encounter.

Retrosheet parser update

I updated my Retrosheet parser / database creator thingy to work with Chadwick 0.5.2, which introduced six new extra fields. The change also includes a new way of running the CSV files into MySQL- it’s slower, but more complete, as it uses the actual headers from the Chadwick export rather than loading the entire file at once.

Parsing the files from 1953 – 2008 took a little over two hours on my rather macho Linux box. 8,594,270 events.

Also: in awesome news, Retrosheet has completed the game files for 2009 and they should be available this weekend. I can’t say enough good things about the guys over there. They do incredible work and our lives – all of us baseball obsessives – are far better for it.

py-mlb: Python library for MLB.com “API”

So in a follow-up to an earlier post regarding MLB.com’s unofficial, undocumented, kind-of, sort-of API, I’ve written a Python abstraction layer for the league, the teams, and the players. Like everything it’s beta, or maybe alpha, and anything it does to your system is likely your own doing.

That being said I’m rather happy with the first iteration, but there are some design issues to iron out and lots of to-dos. I wanted to throw out it out there to the community so that A) it can be used, B) my Python skills and design ideas can be harshly criticized (only partially kidding here, I love to hear feedback), and C) anyone who wants to can work on the code as well.

As an example, here’s how one might get the entire roster for the Seattle Mariners:

from py_mlb import player, team, league
l = league.League()
team = l.teams['sea']
team.fetchRoster()
 
for player_id, player in team.roster.iteritems():
    print player.name_full

I’ll setup a permanent page for the project in a bit. For now, the project is hosted here at GitHub.

Follow up to Gameday data & python scripts

In an earlier post I provided a script that spidered the MLB Gameday site and downloaded all of the XML information. I mentioned that I’d be updating the script at some point to pull nightlies and run the data into a MySQL database, allowing easier manipulation and exploration of the data.

First, I’ve updated the spider script to a) retrieve less data (there is a lot of redundant data in the Gameday collection), b) only pull information for games where data is missing (accomplishing the nightlies part of things, and c) accept a year parameter when it’s run.

Here’s the updated script (note: I had to rename it to .pyy so that Dreamhost would serve the thing; rename it back to .py). I won’t paste it here. You will need to edit line 12 and change the value for OUTPUT to wherever you’d like your XML files stored. I have it configured in cron to run nightly at 3am like so:

0 3 * * * /home/wells/baseball/pbp/pbp.py -y 2009 > /dev/null

Second, here is the schema I am using to create the Gameday database. It contains tables for games, players, hit charts, at-bats, and individual pitches, as well as indexes to speed up certain queries:

DROP TABLE IF EXISTS `game`;
CREATE TABLE `game` (
	id varchar(30) DEFAULT NULL,
	home varchar(30) DEFAULT NULL,
	away varchar(30) DEFAULT NULL,
	date date DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `hitchart`;
CREATE TABLE `hitchart` (
	game varchar(30) DEFAULT NULL,
	des varchar(25) DEFAULT NULL,
	x decimal(6,2) DEFAULT NULL,
	y decimal(6,2) DEFAULT NULL,
	batter int(6) DEFAULT NULL,
	pitcher int(6) DEFAULT NULL,
	type enum('O','H','E') DEFAULT NULL,
	team enum('A', 'H') DEFAULT NULL,
	inning tinyint(1) DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `player`;
CREATE TABLE `player`(
	team varchar(3) DEFAULT NULL,
	id int(6) DEFAULT NULL,
	pos varchar(2) DEFAULT NULL,
	type varchar(10) DEFAULT NULL,
	first_name varchar(25) DEFAULT NULL,
	last_name varchar(25) DEFAULT NULL,
	jersey_number varchar(10) DEFAULT NULL,
	height varchar(5) DEFAULT NULL,
	weight varchar(3) DEFAULT NULL,
	bats varchar(10) DEFAULT NULL,
	throws varchar(10) DEFAULT NULL,
	dob varchar(15) DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `atbat`;
CREATE TABLE `atbat` (
	game varchar(30) DEFAULT NULL,
	inning int DEFAULT NULL,
	num int UNSIGNED DEFAULT NULL,
	b int DEFAULT NULL,
	s int DEFAULT NULL,
	o int DEFAULT NULL,
	batter int DEFAULT NULL,
	stand enum('R','L'),
	b_height varchar(5),
	pitcher int DEFAULT NULL,
	p_throws enum('R', 'L'),
	des varchar(1000) DEFAULT NULL,
	event varchar(100) DEFAULT NULL,
	brief_event varchar(100) DEFAULT NULL,
	score varchar(1) DEFAULT NULL,
	home_team_runs tinyint DEFAULT NULL,
	away_team_runs tinyint DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `pitch`;
CREATE TABLE `pitch` (
	game varchar(30) DEFAULT NULL,
	id int DEFAULT NULL,
	atbat int UNSIGNED NOT NULL,
	des varchar(50) DEFAULT NULL,
	type enum('S', 'B', 'X') DEFAULT NULL,
	x float DEFAULT NULL,
	y float DEFAULT NULL,
	sv_id varchar(25) DEFAULT NULL,
	start_speed float DEFAULT NULL,
	end_speed float DEFAULT NULL,
	sz_top float DEFAULT NULL,
	sz_bot float DEFAULT NULL,
	pfx_x float DEFAULT NULL,
	pfx_z float DEFAULT NULL,
	px float DEFAULT NULL,
	pz float DEFAULT NULL,
	x0 float DEFAULT NULL,
	y0 float DEFAULT NULL,
	z0 float DEFAULT NULL,
	vx0 float DEFAULT NULL,
	vy0 float DEFAULT NULL,
	vz0 float DEFAULT NULL,
	ax float DEFAULT NULL,
	ay float DEFAULT NULL,
	az float DEFAULT NULL,
	break_y float DEFAULT NULL,
	break_angle float DEFAULT NULL,
	break_length float DEFAULT NULL,
	pitch_type varchar(5) DEFAULT NULL,
	type_confidence float DEFAULT NULL,
	spin_dir float DEFAULT NULL,
	spin_rate float DEFAULT NULL,
	on_1b int(6) DEFAULT NULL,
	on_2b int(6) DEFAULT NULL,
	on_3b int(6) DEFAULT NULL,
	score tinyint DEFAULT NULL
) ENGINE=InnoDB;
 
CREATE INDEX player_id ON player(id);
CREATE INDEX game_id ON game(id);
CREATE INDEX game_id ON atbat(game);
CREATE INDEX game_id ON pitch(game);
CREATE INDEX game_id ON hitchart(game);

Run that into a MySQL database called ‘pbp’.

LASTLY, here is the python script called parse.py which takes the XML from Gameday and puts it into the relevant database tables. A CAVEAT: you’ll need to change the location of the my.cnf file (specifying your access credentials) in line 39. If you did not call your database ‘pbp’ you will also need to change that, also conveniently located on line 39. Otherwise you should be good to go. This script, as do most of my python scripts located on this site, relies on MySQLdb. This isn’t standard in python 2.6 (and won’t work at all apparently in 3.0) – you will need to compile it manually for 2.6.

#!/usr/bin/env python
from xml.dom.minidom import parse
import glob
import MySQLdb
import sys, getopt
import os
 
YEAR = None
 
try:
	opts, args = getopt.getopt(sys.argv[1:], "y:")
except getopt.GetoptError, e:
	raise SystemExit
 
for opt, arg in opts:
	if opt == "-y":
		YEAR = int(arg)
 
if YEAR is None:
	print "please run like parse.py -y 2009"
	raise SystemExit
 
def nodeToTable(table, node, extra = {}):
	values = []
	keys = node.attributes.keys()
 
	for val in node.attributes.values():
		values.append(val.value)
 
	for key in extra.keys():
		keys.insert(0, key)
 
	for value in extra.values():
		values.insert(0, value)
 
	query = "INSERT INTO %s (%s) VALUES(%s)" %(table, ','.join(keys),','.join(['%s'] * len(values)))
	cursor.execute(query, values)
 
db = MySQLdb.connect(host="localhost", read_default_file="/home/wells/.my.cnf", db="pbp")
cursor = db.cursor()
 
for month in glob.glob("./data/year_%d/*" % YEAR):
	for day in glob.glob("%s/*" % month):
		for game in glob.glob("%s/*" % day):
			gid = os.path.basename(game)
 
			cursor.execute("SELECT * FROM game WHERE id = '%s'" % gid)
			if cursor.rowcount == 0:
				print "adding game %s" % gid
				xml = parse("%s/boxscore.xml" % game)
				box = xml.getElementsByTagName("boxscore").item(0)
				home = box.attributes["home_fname"].value
				away = box.attributes["away_fname"].value
				date = box.attributes["date"].value
				cursor.execute("INSERT INTO game (id, home, away, date) VALUES ('%s', '%s', '%s', STR_TO_DATE('%s', '%%M %%d, %%Y'))" % (gid, home, away, date))
 
				xml = parse("%s/inning/inning_hit.xml" % game)
				for hip in xml.getElementsByTagName("hip"):
					nodeToTable("hitchart", hip, {'game': gid})
 
				for ifile in glob.glob("%s/inning/*.xml" % game):
					xml = parse(ifile)
					for inning in xml.getElementsByTagName("inning"):
						for atbat in xml.getElementsByTagName("atbat"):
							nodeToTable('atbat', atbat, {'game': gid, 'inning': inning.getAttribute("num")})
							for pitch in atbat.getElementsByTagName("pitch"):
								nodeToTable('pitch', pitch, {'atbat': atbat.getAttribute("num"), 'game': gid})
 
			for batter in glob.glob("%s/batters/*.xml" % game):
				xml = parse(batter)
				for player in xml.getElementsByTagName("Player"):
					cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value))
					if cursor.rowcount == 0:
						nodeToTable("player", player)
 
			for pitcher in glob.glob("%s/pitchers/*.xml" % game):
				xml = parse(pitcher)
				for player in xml.getElementsByTagName("Player"):
					cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value))
					if cursor.rowcount == 0:
						nodeToTable("player", player)
 
db.commit()
cursor.close()
db.close()

I have this cron’ed to run at 4am, an hour after the parsing script does its job. Like so:

0 4 * * * /home/wells/baseball/pbp/parse.py -y 2009 > /dev/null

Let me know if you have any issues with this. I’ll be tagging all baseball hack related posts with the baseballhacks tag.