All posts tagged with baseballhacks

Retrosheet parser update

I updated my Retrosheet parser / database creator thingy to work with Chadwick 0.5.2, which introduced six new extra fields. The change also includes a new way of running the CSV files into MySQL- it’s slower, but more complete, as it uses the actual headers from the Chadwick export rather than loading the entire file at once.

Parsing the files from 1953 – 2008 took a little over two hours on my rather macho Linux box. 8,594,270 events.

Also: in awesome news, Retrosheet has completed the game files for 2009 and they should be available this weekend. I can’t say enough good things about the guys over there. They do incredible work and our lives – all of us baseball obsessives – are far better for it.

py-mlb: Python library for MLB.com “API”

So in a follow-up to an earlier post regarding MLB.com’s unofficial, undocumented, kind-of, sort-of API, I’ve written a Python abstraction layer for the league, the teams, and the players. Like everything it’s beta, or maybe alpha, and anything it does to your system is likely your own doing.

That being said I’m rather happy with the first iteration, but there are some design issues to iron out and lots of to-dos. I wanted to throw out it out there to the community so that A) it can be used, B) my Python skills and design ideas can be harshly criticized (only partially kidding here, I love to hear feedback), and C) anyone who wants to can work on the code as well.

As an example, here’s how one might get the entire roster for the Seattle Mariners:

from py_mlb import player, team, league
l = league.League()
team = l.teams['sea']
team.fetchRoster()
 
for player_id, player in team.roster.iteritems():
    print player.name_full

I’ll setup a permanent page for the project in a bit. For now, the project is hosted here at GitHub.

MLB.com’s unofficially, thoroughly awesome API

So MLB.come has an unofficial API. When you go to any player’s page and open up an inspector such as Firefox’s Firebug or Safari’s Web Inspector, you can see that the page initiates a bunch of AJAX requests for various information including awesomely useful information such as player statistics and game logs.

Behold Felix Hernandez’s ‘core’ player information. Mouse-over to see the URL.

And then you can get Felix’s game log using this incredibly long URL. I haven’t really played with the query string parameters too much, but there’s one, ‘results’, which is used on the player’s page to limit the number of games shown to 10. Making that 165 will show all for the season. Likewise there’s ‘year’ which corresponds to the season. I’m not quite sure how far back the data will go. There’s also a ‘game_type’, which given that the value used in the URLs in this post is ‘R’, means regular/post/preseason.

Here’s the URL for a position player, Ichiro’s 2009 stats.

There are more GET calls to return detailed player biography and news-related items, awards, and other stuff. There are also GET calls on the team pages for team information.

Anyway. I’m pretty certain that these APIs are not meant for public consumption, and could change at any time, though I hope not, because they are awesome. You can use the data returned with any basic knowledge of XML and JSON, and most (if not all) programming languages now have pretty solid libraries for that.

Update: I have made a python abstraction layer for the MLB.com API. You can find it on its github project page or read more about it on this blog post.

FINAL UPDATE – This project lives now on my baseball projects page.

Retrosheet and github

In a post a few months ago I put up some code to spider the Retrosheet site, download game data, and build a database out of event & game data. A reader responded a week ago or so recommending an update for the code to work with the newer (0.5.2) version of Chadwick as well as suggesting I put the code out on github so it can be worked on collaboratively. I’m all for it, so I went ahead and signed up and made the project publicly available.

The project is located here. Dude who recommended this all to me made a branch and submitted a patch (pardon, git folks, if patch isn’t really the word, I’m still firmly planted in the Subversion universe), which I then applied to the master project.

DO FEEL FREE to fork and work on the thing – I’d love to see whatever you might make of it. In my spare time as of late I’ve been building a python library which pulls data from MLB.com for the current season (which, sadly, is all but over) and provides nice abstractions for working with it. I’ll pit it up on github when it’s ready to be roundly criticized publicized.

Follow up to Gameday data & python scripts

In an earlier post I provided a script that spidered the MLB Gameday site and downloaded all of the XML information. I mentioned that I’d be updating the script at some point to pull nightlies and run the data into a MySQL database, allowing easier manipulation and exploration of the data.

First, I’ve updated the spider script to a) retrieve less data (there is a lot of redundant data in the Gameday collection), b) only pull information for games where data is missing (accomplishing the nightlies part of things, and c) accept a year parameter when it’s run.

Here’s the updated script (note: I had to rename it to .pyy so that Dreamhost would serve the thing; rename it back to .py). I won’t paste it here. You will need to edit line 12 and change the value for OUTPUT to wherever you’d like your XML files stored. I have it configured in cron to run nightly at 3am like so:

0 3 * * * /home/wells/baseball/pbp/pbp.py -y 2009 > /dev/null

Second, here is the schema I am using to create the Gameday database. It contains tables for games, players, hit charts, at-bats, and individual pitches, as well as indexes to speed up certain queries:

DROP TABLE IF EXISTS `game`;
CREATE TABLE `game` (
	id varchar(30) DEFAULT NULL,
	home varchar(30) DEFAULT NULL,
	away varchar(30) DEFAULT NULL,
	date date DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `hitchart`;
CREATE TABLE `hitchart` (
	game varchar(30) DEFAULT NULL,
	des varchar(25) DEFAULT NULL,
	x decimal(6,2) DEFAULT NULL,
	y decimal(6,2) DEFAULT NULL,
	batter int(6) DEFAULT NULL,
	pitcher int(6) DEFAULT NULL,
	type enum('O','H','E') DEFAULT NULL,
	team enum('A', 'H') DEFAULT NULL,
	inning tinyint(1) DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `player`;
CREATE TABLE `player`(
	team varchar(3) DEFAULT NULL,
	id int(6) DEFAULT NULL,
	pos varchar(2) DEFAULT NULL,
	type varchar(10) DEFAULT NULL,
	first_name varchar(25) DEFAULT NULL,
	last_name varchar(25) DEFAULT NULL,
	jersey_number varchar(10) DEFAULT NULL,
	height varchar(5) DEFAULT NULL,
	weight varchar(3) DEFAULT NULL,
	bats varchar(10) DEFAULT NULL,
	throws varchar(10) DEFAULT NULL,
	dob varchar(15) DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `atbat`;
CREATE TABLE `atbat` (
	game varchar(30) DEFAULT NULL,
	inning int DEFAULT NULL,
	num int UNSIGNED DEFAULT NULL,
	b int DEFAULT NULL,
	s int DEFAULT NULL,
	o int DEFAULT NULL,
	batter int DEFAULT NULL,
	stand enum('R','L'),
	b_height varchar(5),
	pitcher int DEFAULT NULL,
	p_throws enum('R', 'L'),
	des varchar(1000) DEFAULT NULL,
	event varchar(100) DEFAULT NULL,
	brief_event varchar(100) DEFAULT NULL,
	score varchar(1) DEFAULT NULL,
	home_team_runs tinyint DEFAULT NULL,
	away_team_runs tinyint DEFAULT NULL
) ENGINE=InnoDB;
 
DROP TABLE IF EXISTS `pitch`;
CREATE TABLE `pitch` (
	game varchar(30) DEFAULT NULL,
	id int DEFAULT NULL,
	atbat int UNSIGNED NOT NULL,
	des varchar(50) DEFAULT NULL,
	type enum('S', 'B', 'X') DEFAULT NULL,
	x float DEFAULT NULL,
	y float DEFAULT NULL,
	sv_id varchar(25) DEFAULT NULL,
	start_speed float DEFAULT NULL,
	end_speed float DEFAULT NULL,
	sz_top float DEFAULT NULL,
	sz_bot float DEFAULT NULL,
	pfx_x float DEFAULT NULL,
	pfx_z float DEFAULT NULL,
	px float DEFAULT NULL,
	pz float DEFAULT NULL,
	x0 float DEFAULT NULL,
	y0 float DEFAULT NULL,
	z0 float DEFAULT NULL,
	vx0 float DEFAULT NULL,
	vy0 float DEFAULT NULL,
	vz0 float DEFAULT NULL,
	ax float DEFAULT NULL,
	ay float DEFAULT NULL,
	az float DEFAULT NULL,
	break_y float DEFAULT NULL,
	break_angle float DEFAULT NULL,
	break_length float DEFAULT NULL,
	pitch_type varchar(5) DEFAULT NULL,
	type_confidence float DEFAULT NULL,
	spin_dir float DEFAULT NULL,
	spin_rate float DEFAULT NULL,
	on_1b int(6) DEFAULT NULL,
	on_2b int(6) DEFAULT NULL,
	on_3b int(6) DEFAULT NULL,
	score tinyint DEFAULT NULL
) ENGINE=InnoDB;
 
CREATE INDEX player_id ON player(id);
CREATE INDEX game_id ON game(id);
CREATE INDEX game_id ON atbat(game);
CREATE INDEX game_id ON pitch(game);
CREATE INDEX game_id ON hitchart(game);

Run that into a MySQL database called ‘pbp’.

LASTLY, here is the python script called parse.py which takes the XML from Gameday and puts it into the relevant database tables. A CAVEAT: you’ll need to change the location of the my.cnf file (specifying your access credentials) in line 39. If you did not call your database ‘pbp’ you will also need to change that, also conveniently located on line 39. Otherwise you should be good to go. This script, as do most of my python scripts located on this site, relies on MySQLdb. This isn’t standard in python 2.6 (and won’t work at all apparently in 3.0) – you will need to compile it manually for 2.6.

#!/usr/bin/env python
from xml.dom.minidom import parse
import glob
import MySQLdb
import sys, getopt
import os
 
YEAR = None
 
try:
	opts, args = getopt.getopt(sys.argv[1:], "y:")
except getopt.GetoptError, e:
	raise SystemExit
 
for opt, arg in opts:
	if opt == "-y":
		YEAR = int(arg)
 
if YEAR is None:
	print "please run like parse.py -y 2009"
	raise SystemExit
 
def nodeToTable(table, node, extra = {}):
	values = []
	keys = node.attributes.keys()
 
	for val in node.attributes.values():
		values.append(val.value)
 
	for key in extra.keys():
		keys.insert(0, key)
 
	for value in extra.values():
		values.insert(0, value)
 
	query = "INSERT INTO %s (%s) VALUES(%s)" %(table, ','.join(keys),','.join(['%s'] * len(values)))
	cursor.execute(query, values)
 
db = MySQLdb.connect(host="localhost", read_default_file="/home/wells/.my.cnf", db="pbp")
cursor = db.cursor()
 
for month in glob.glob("./data/year_%d/*" % YEAR):
	for day in glob.glob("%s/*" % month):
		for game in glob.glob("%s/*" % day):
			gid = os.path.basename(game)
 
			cursor.execute("SELECT * FROM game WHERE id = '%s'" % gid)
			if cursor.rowcount == 0:
				print "adding game %s" % gid
				xml = parse("%s/boxscore.xml" % game)
				box = xml.getElementsByTagName("boxscore").item(0)
				home = box.attributes["home_fname"].value
				away = box.attributes["away_fname"].value
				date = box.attributes["date"].value
				cursor.execute("INSERT INTO game (id, home, away, date) VALUES ('%s', '%s', '%s', STR_TO_DATE('%s', '%%M %%d, %%Y'))" % (gid, home, away, date))
 
				xml = parse("%s/inning/inning_hit.xml" % game)
				for hip in xml.getElementsByTagName("hip"):
					nodeToTable("hitchart", hip, {'game': gid})
 
				for ifile in glob.glob("%s/inning/*.xml" % game):
					xml = parse(ifile)
					for inning in xml.getElementsByTagName("inning"):
						for atbat in xml.getElementsByTagName("atbat"):
							nodeToTable('atbat', atbat, {'game': gid, 'inning': inning.getAttribute("num")})
							for pitch in atbat.getElementsByTagName("pitch"):
								nodeToTable('pitch', pitch, {'atbat': atbat.getAttribute("num"), 'game': gid})
 
			for batter in glob.glob("%s/batters/*.xml" % game):
				xml = parse(batter)
				for player in xml.getElementsByTagName("Player"):
					cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value))
					if cursor.rowcount == 0:
						nodeToTable("player", player)
 
			for pitcher in glob.glob("%s/pitchers/*.xml" % game):
				xml = parse(pitcher)
				for player in xml.getElementsByTagName("Player"):
					cursor.execute("SELECT id FROM player WHERE id = %d" % int(player.attributes["id"].value))
					if cursor.rowcount == 0:
						nodeToTable("player", player)
 
db.commit()
cursor.close()
db.close()

I have this cron’ed to run at 4am, an hour after the parsing script does its job. Like so:

0 4 * * * /home/wells/baseball/pbp/parse.py -y 2009 > /dev/null

Let me know if you have any issues with this. I’ll be tagging all baseball hack related posts with the baseballhacks tag.