Archive for June, 2009

Baseball, Retrosheet, Python, too much time on my hands…

UPDATE: There’s a much more complete, updated version of this located on my Baseball Projects page, with links to the github repositories.

If you’re not familiar with it, Retrosheet is one of the best baseball reference items around. Containing play-by-play data for every game back through the mid 1970s, and plenty of information before that, you can pull any kind of data you’d like, and spend hours exploring the dark corners of the game. It’s also a great way to blow through a five hour flight if you have a netbook handy that’s running MySQL. And, hey, who leaves the house without a netbook running MySQL? Between Retrosheet, Baseball DataBank, and MLB.tv, you could never leave the house again.

Anyway, Retrosheet provides their data in a collection of game files, which are a bunch of comma-separated files using the Project Shoresheet notation. To properly investigate them, you will need to import them into your MySQL database. Or, possibly, your Postgres, Oracle, MS SQL Server, etc, etc instance. But here, I’m only concerned with MySQL. I spent some of this weekend putting together a script in Python that combs the Retrosheet site for the event files, downloads, parses, and imports them. I owe a debt of gratitude to the Tangotiger wiki as he laid out a nice schema using Oracle. A few modifications and it’s good to go with MySQL.

First you’ll need to download Chadwick. It’s used to parse the Retrosheet files and create CSV exports, which will then be used to import into MySQL. I used version 0.5 – I’m not sure if 0.4 will work. It probably won’t. It might even kill your dog. BEWARE.

Here is the MySQL schema. It contains tables for events, games, rosters, and teams, as well as the lookup tables provided by Tangotiger. I was playing around with adding the park codes, but haven’t completed that yet. Primarily because Cliff Lee and Chris Carpenter were pitching tonight, along with the season premiere of True Blood. Certain things take precedent, you see.

Run this into MySQL like so. Running MySQL…

CREATE DATABASE retrosheet;
GRANT ALL ON retrosheet.* TO retrosheet@localhost IDENTIFIED BY 'retrosheet';

Then quit back to the command line and:

mysql -uretrosheet -pretrosheet -D retrosheet < retrosheet.sql

On to the Python. This is a script which spawns a few threads to comb the Retrosheet site, pulling down the zip files for the game data. It then uncompresses these files, converts them into CSV with Chadwick, and finally runs them into MySQL. It will also process the roster and team files. You will want to open the script in a text editor first and edit the variables at the top:

#!/usr/bin/env python
import urllib
import os
import subprocess
import threading
import Queue
import zipfile
import glob
import tempfile
import re
import time
import MySQLdb
import sys
 
THREADS = 20
RETROSHEET_URL = "http://www.retrosheet.org/game.htm"
DB_USER = "retrosheet"
DB_PASSWORD = "retrosheet"
DB_HOST = "localhost"
DB = "retrosheet"
CHADWICK = "/usr/local/bin/"
 
class Parser(threading.Thread):
        def __init__(self, queue):
                threading.Thread.__init__(self)
                self.queue = queue
 
        def run(self):
                while 1:
                        try:
                                year = self.queue.get_nowait()
                        except Queue.Empty:
                                break;
 
                        cmd = "%s/cwevent -q -f 0-96 -x 0-50 -y %d %d*.EV* &gt; events-%d.csv" % (CHADWICK, year, year, year)
                        subprocess.call(cmd, shell=True)
                        cmd = "%s/cwgame -q -f 0-83 -y %d %d*.EV* &gt; games-%d.csv" % (CHADWICK, year, year, year)
                        subprocess.call(cmd, shell=True)
 
                        for file in glob.glob("%d*" % year):
                                os.remove(file)
 
class Fetcher(threading.Thread):
        def __init__(self, queue, path):
                threading.Thread.__init__(self)
                self.queue = queue
                self.path = path
 
        def run(self):
                while 1:
                        try:
                                url = self.queue.get_nowait()
                        except Queue.Empty:
                                break;
 
                        f = "%s/%s" % (self.path, os.path.basename(url))
                        urllib.urlretrieve(url, f)
 
                        if (zipfile.is_zipfile(f)):
                                zip = zipfile.ZipFile(f, "r")
                                zip.extractall(self.path)
 
                        os.remove(f)
 
start = time.time()
path = tempfile.mkdtemp()
os.chdir(path)
 
db = MySQLdb.connect(DB_HOST, DB_USER, DB_PASSWORD, DB)
cursor = db.cursor()
 
print "fetching retrosheet files..."
queue = Queue.Queue()
pattern = r'href="(?Phttp://www.retrosheet.org/(?P\d{4})/\d{4}(?P\w{2}).htm)"'
for match in re.finditer(pattern, urllib.urlopen(RETROSHEET_URL).read(), re.S):
        url = "http://www.retrosheet.org/%s/%s%s.zip" % (match.group("year"), match.group("year"), match.group("league"))
        queue.put(url)
 
threads = []
for i in range(THREADS):
        t = Fetcher(queue, path)
        t.start()
        threads.append(t)
 
# be sure to finish fetching before parsing starts
for thread in threads:
        thread.join()
 
print "processing game files..."
queue = Queue.Queue()
 
years = []
for file in glob.glob("%s/*.EV*" % path):
        year = re.search(r"^\d{4}", os.path.basename(file)).group(0)
        if year not in years:
                queue.put(int(year))
                years.append(year)
 
for i in range(THREADS):
        t = Parser(queue)
        t.start()
        threads.append(t)
 
# finishing parsing before importing
for thread in threads:
        thread.join()
 
print "processing rosters..."
for file in glob.glob("*.ROS"):
        f = open(file, "r")
 
        team, year = re.findall(r"(^\w{3})(\d{4}).+?$", os.path.basename(file))[0]
        for line in f.readlines():
 
                if line.strip() == "":
                        continue
 
                info = line.strip().replace('"', '').split(",")
 
                info.insert(0, team)
                info.insert(0, year)
 
                # wacky '\x1a' ASCII characters, probably some better way of handling this
                if len(info) == 3:
                        continue
 
                # ROSTERS table has nine columns, let's fill it out
                if len(info) &lt; 9:
                        for i in range (9 - len(info)):
                                info.append('')
 
                sql = "INSERT INTO ROSTERS VALUES (%s)" % ", ".join(["%s"] * len(info))
                cursor.execute(sql, info)
                db.commit()
 
print "processing teams..."
for file in glob.glob("TEAM*"):
        f = open(file, "r")
 
        try:
                year = re.findall(r"^TEAM(\d{4})$", os.path.basename(file))[0]
        except:
                continue
 
        for line in f.readlines():
 
                if line.strip() == "":
                        continue
 
                info = line.strip().replace('"', '').split(",")
                info.insert(0, year)
 
                if len(info) &lt; 5:
                        continue
 
                sql = "INSERT INTO TEAMS VALUES (%s)" % ", ".join(["%s"] * len(info))
                cursor.execute(sql, info)
                db.commit()
 
for file in glob.glob("events-*.csv"):
        print "processing %s" % file
        sql = "LOAD DATA LOCAL INFILE \"%s\" INTO TABLE EVENTS FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" % file
        cursor.execute(sql)
        db.commit()
 
for file in glob.glob("games-*.csv"):
        print "processing %s" % file
        sql = "LOAD DATA LOCAL INFILE \"%s\" INTO TABLE GAMES FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" % file
        cursor.execute(sql)
        db.commit()
 
cursor.close()
db.close()
 
# cleanup!
for file in glob.glob("%s/*" % path):
        os.remove(file)
 
os.rmdir(path)
 
elapsed = (time.time() - start)
print "%d seconds!" % elapsed

Change the variables set in the first seven lines to whatever is appropriate for your system, though you shouldn’t need to change RETROSHEET_URL. Save this file off as retrosheet.py. Then, from the command line, making sure the file is marked executable, simply:

./retrosheet.py

This usually takes about 900-1000 seconds (15 minutes!) to run on my system. Your mileage… it may vary. Once it’s done, you should have a fully functional Retrosheet database with 8,594,270 events, 108,266 games, 56,757 roster records, 1,414 teams, and 0 park codes, ‘cos I’ve not done them yet.

If you have any feedback on anything – the MySQL schema, the Python script, or the fact that Cliff Lee just threw a 3-hit complete-game shutout of the Cardinals, please let me know. I’m always up for improving anything.

Sailing…

… or, not really.