All posts tagged with baseballhacks

2009 MLB park factors

Here they are in CSV with MLBAM’s venue ID added. Here’s a schema you can use should you want to run them into a database:

DROP TABLE IF EXISTS `park_factors`;
CREATE TABLE park_factors (
    venue_id INT DEFAULT NULL,
    year INT DEFAULT NULL,
    name VARCHAR(100) DEFAULT NULL,
    R FLOAT DEFAULT NULL,
    H FLOAT DEFAULT NULL,
    HR FLOAT DEFAULT NULL,
    H2B FLOAT DEFAULT NULL,
    H3B FLOAT DEFAULT NULL,
    BB FLOAT DEFAULT NULL,
    PRIMARY KEY (venue_id, year)
);

Gameday, get yr Gameday

MLB.com provides, among other things, all of the pitch information for each MLB and AAA game in XML format. It’s the data that drives their wonderful little service Gameday. If you want to take a spin through what the data looks like, start here and poke around. What’s key, most folks agree, is the Pitch/FX information, but there’s also pitch-by-pitch logs for every game.

I’ve put together a little package which includes (1) a schema for a MySQL database to retain the information, and (2) a python script which will handle fetching and parsing the XML data found on MLB.com servers. If you’re interested in such a thing, you can download it from the github project page.

There is detailed installation and execution information found on the wiki at github as well but just to provide them here:

./gameday.py

With the following arguments:

  • --year=XXXX four digit year
  • --day=X,Y days in a comma separated list
  • --month=X,Y months in a comma separated list
  • --type=[mlb, aaa] optional: which league to process. Default is ‘mlb’. Any of the categories found here (AA, etc) should work- I’ve just worked with MLB and AAA.
  • --verbose Shows every HTTP request
  • --delta Uses delta mode.

When delta mode is run, the script will store the last date it processed in the database. Upon next execution, it will start from where it left off. This is useful for running the thing nightly to grab the latest stuff.

Update to MLB API layer

I’ve updated my MLB API abstraction layer for python to include an easy way to persist player data, including basic player attributes and game log information, to a MySQL database.

The project is hosted up at Github and the Wiki there now includes an example usage for the library, a script that can be run nightly to create in-season game logs for players.

Please do let me know of any issues you may encounter.

Rate statistics for Baseball Databank

Baseball Databank is one of the best sources for free, reliable baseball information on the net, and they just recently released the data for the 2009 season. I put together a small MySQL script to pull together rate statistics for both pitchers and hitters. The script will create supplemental views with the following information: for batters, AB (though that’s in the standard Batting table, it’s helpful to have as to avoid unnecessary joins), PA, TB, OBP, SLG, OPS, ISO, and BABIP; for pitchers, IP, ERA (this is duplicated from the Pitching table), WHIP, K/9, BB/9, HR/9, H/9, K/BB, and LOB%. I’ve added a Qualified column on the batting view which indicates whether or not the person qualified for the batting title that year, according to the formulas laid out here. My totals for qualifiers don’t quite match up before 1957, but I swear I’ve been re-checking the math and it looks square to me. Let me know if you see something obvious (or not so obvious).

You’ll forgive the… ugly… formatting here; I’ve never found a comfortable way to properly indent SQL. It’s just a messy, toolsy language which gets uglier and toolsier the more nested you get. You do what you need to do.

I’d like at some point to add some league-averaged stuff such as OPS+ and ERA+, but for now it is as it stands. Here you go.

DROP VIEW IF EXISTS AVG_Batting;
 
CREATE VIEW AVG_Batting AS
SELECT playerID, yearID,
SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0) + IFNULL(SH, 0)) AS PA,
SUM(AB) AS AB,
SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)  AS TB,
SUM(H + BB + IFNULL(HBP, 0)) / SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0)) AS OBP,
(SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)) / SUM(AB) AS SLG,
(SUM(H + BB + IFNULL(HBP, 0)) / SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0))) +
	((SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)) / SUM(AB)) AS OPS,
(SUM(2B + (3B * 2) + (HR * 3))) / SUM(AB) AS ISO,
(SUM(H - HR)) / (SUM(AB - SO - HR + IFNULL(SF, 0))) AS BABIP,
(CASE
		WHEN yearID >= 1957 AND (SUM(AB + BB + IFNULL(HBP, 0) + SF + IFNULL(SH, 0)) >= (SELECT FLOOR(G * 3.1) FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID)) THEN 1
		WHEN yearID BETWEEN 1945 AND 1956 AND (SUM(AB) / (SELECT G FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) >= 2.6) THEN 1
		WHEN yearID BETWEEN 1938 AND 1944 AND (SELECT lgID FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) = 'NL' AND (SUM(G) >= 100) THEN 1
		WHEN yearID BETWEEN 1938 AND 1944 AND (SELECT lgID FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) = 'AL' AND (SUM(AB) >= 400) THEN 1
		WHEN yearID BETWEEN 1920 AND 1937 AND (SUM(G) >= 100) THEN 1
		when yearID < 1920 AND (ROUND(SUM(G) / (SELECT G FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) * 100) >= 60) THEN 1
		ELSE 0
		END) AS Qualified
FROM Batting
GROUP BY yearID, playerID;
 
DROP VIEW IF EXISTS AVG_Pitching;
 
CREATE VIEW AVG_Pitching AS
SELECT playerID, yearID,
ROUND((IPouts / 3), 1) AS IP,
ROUND(((ER * 9) / (IPouts / 3)), 2) AS ERA,
ROUND(((BB + H) / (IPouts / 3)), 2) AS WHIP,
ROUND(((9 * SO) / (IPouts / 3)), 1) AS K9,
ROUND(((9 * BB) / (IPouts / 3)), 1) AS BB9,
ROUND(((9 * HR) / (IPouts / 3)), 1) AS HR9,
ROUND(((9 * H) / (IPouts / 3)), 1) AS H9,
ROUND((SO / BB), 2) AS KBB,
(H + BB + IFNULL(HBP, 0) - R) / (H + BB + IFNULL(HBP, 0) - (1.4 * HR)) AS LOB
FROM Pitching
GROUP BY yearID, playerID;

Retrosheet parser update

I updated my Retrosheet parser / database creator thingy to work with Chadwick 0.5.2, which introduced six new extra fields. The change also includes a new way of running the CSV files into MySQL- it’s slower, but more complete, as it uses the actual headers from the Chadwick export rather than loading the entire file at once.

Parsing the files from 1953 – 2008 took a little over two hours on my rather macho Linux box. 8,594,270 events.

Also: in awesome news, Retrosheet has completed the game files for 2009 and they should be available this weekend. I can’t say enough good things about the guys over there. They do incredible work and our lives – all of us baseball obsessives – are far better for it.