All posts tagged with baseballhacks

Small updates, big updates

I’ve updated my MLB.com library for python, cleverly titled py-mlb. New in this version is a lot of logging and player transaction information. You can now retrieve all MLB transactions and persist them to a local database, or do whatever your devious mind can come up with. I’ve added a quick example script to the project’s wiki that shows how to save them as well as how to use logging if you’re interested in seeing what the library is doing. As always, the project is also located on my baseball projects page.

A note about the transactions though: I’m not sure how reliable they are prior to the current season. It seems as though the 2009 data is good, but before that it’s spotty. There does seem to be some kind of historical archive available through Javascript, but I’m still figuring that out. I’m fairly certain the 2010 information is 100%. Any thoughts you have on the subject are appreciated.

BUT IN BIGGER NEWS, I’m incredibly excited to mention that I’m joining the San Diego Padres front office in early May. Less than two weeks from today, really. I’ll be working on a lot of projects they have going on- most of which are highly top secret but with any luck will bring about world peace and a championship to the people of San Diego. There is a very talented group of people working there and it will be a great experience to learn from them and contribute my part.

It seems like a beautiful city and I’m eager to get out there and get started.

2009 MLB park factors

Here they are in CSV with MLBAM’s venue ID added. Here’s a schema you can use should you want to run them into a database:

DROP TABLE IF EXISTS `park_factors`;
CREATE TABLE park_factors (
    venue_id INT DEFAULT NULL,
    year INT DEFAULT NULL,
    name VARCHAR(100) DEFAULT NULL,
    R FLOAT DEFAULT NULL,
    H FLOAT DEFAULT NULL,
    HR FLOAT DEFAULT NULL,
    H2B FLOAT DEFAULT NULL,
    H3B FLOAT DEFAULT NULL,
    BB FLOAT DEFAULT NULL,
    PRIMARY KEY (venue_id, year)
);

Gameday, get yr Gameday

MLB.com provides, among other things, all of the pitch information for each MLB and AAA game in XML format. It’s the data that drives their wonderful little service Gameday. If you want to take a spin through what the data looks like, start here and poke around. What’s key, most folks agree, is the Pitch/FX information, but there’s also pitch-by-pitch logs for every game.

I’ve put together a little package which includes (1) a schema for a MySQL database to retain the information, and (2) a python script which will handle fetching and parsing the XML data found on MLB.com servers. If you’re interested in such a thing, you can download it from the github project page.

There is detailed installation and execution information found on the wiki at github as well but just to provide them here:

./gameday.py

With the following arguments:

  • --year=XXXX four digit year
  • --day=X,Y days in a comma separated list
  • --month=X,Y months in a comma separated list
  • --type=[mlb, aaa] optional: which league to process. Default is ‘mlb’. Any of the categories found here (AA, etc) should work- I’ve just worked with MLB and AAA.
  • --verbose Shows every HTTP request
  • --delta Uses delta mode.

When delta mode is run, the script will store the last date it processed in the database. Upon next execution, it will start from where it left off. This is useful for running the thing nightly to grab the latest stuff.

Update to MLB API layer

I’ve updated my MLB API abstraction layer for python to include an easy way to persist player data, including basic player attributes and game log information, to a MySQL database.

The project is hosted up at Github and the Wiki there now includes an example usage for the library, a script that can be run nightly to create in-season game logs for players.

Please do let me know of any issues you may encounter.

Rate statistics for Baseball Databank

Baseball Databank is one of the best sources for free, reliable baseball information on the net, and they just recently released the data for the 2009 season. I put together a small MySQL script to pull together rate statistics for both pitchers and hitters. The script will create supplemental views with the following information: for batters, AB (though that’s in the standard Batting table, it’s helpful to have as to avoid unnecessary joins), PA, TB, OBP, SLG, OPS, ISO, and BABIP; for pitchers, IP, ERA (this is duplicated from the Pitching table), WHIP, K/9, BB/9, HR/9, H/9, K/BB, and LOB%. I’ve added a Qualified column on the batting view which indicates whether or not the person qualified for the batting title that year, according to the formulas laid out here. My totals for qualifiers don’t quite match up before 1957, but I swear I’ve been re-checking the math and it looks square to me. Let me know if you see something obvious (or not so obvious).

You’ll forgive the… ugly… formatting here; I’ve never found a comfortable way to properly indent SQL. It’s just a messy, toolsy language which gets uglier and toolsier the more nested you get. You do what you need to do.

I’d like at some point to add some league-averaged stuff such as OPS+ and ERA+, but for now it is as it stands. Here you go.

DROP VIEW IF EXISTS AVG_Batting;
 
CREATE VIEW AVG_Batting AS
SELECT playerID, yearID,
SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0) + IFNULL(SH, 0)) AS PA,
SUM(AB) AS AB,
SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)  AS TB,
SUM(H + BB + IFNULL(HBP, 0)) / SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0)) AS OBP,
(SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)) / SUM(AB) AS SLG,
(SUM(H + BB + IFNULL(HBP, 0)) / SUM(AB + BB + IFNULL(HBP, 0) + IFNULL(SF, 0))) +
	((SUM(H - HR - 3B - 2B) + (SUM(2B) * 2) + (SUM(3B) * 3) + (SUM(HR) * 4)) / SUM(AB)) AS OPS,
(SUM(2B + (3B * 2) + (HR * 3))) / SUM(AB) AS ISO,
(SUM(H - HR)) / (SUM(AB - SO - HR + IFNULL(SF, 0))) AS BABIP,
(CASE
		WHEN yearID >= 1957 AND (SUM(AB + BB + IFNULL(HBP, 0) + SF + IFNULL(SH, 0)) >= (SELECT FLOOR(G * 3.1) FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID)) THEN 1
		WHEN yearID BETWEEN 1945 AND 1956 AND (SUM(AB) / (SELECT G FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) >= 2.6) THEN 1
		WHEN yearID BETWEEN 1938 AND 1944 AND (SELECT lgID FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) = 'NL' AND (SUM(G) >= 100) THEN 1
		WHEN yearID BETWEEN 1938 AND 1944 AND (SELECT lgID FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) = 'AL' AND (SUM(AB) >= 400) THEN 1
		WHEN yearID BETWEEN 1920 AND 1937 AND (SUM(G) >= 100) THEN 1
		when yearID < 1920 AND (ROUND(SUM(G) / (SELECT G FROM Teams WHERE teamID = Batting.teamID AND yearID = Batting.yearID) * 100) >= 60) THEN 1
		ELSE 0
		END) AS Qualified
FROM Batting
GROUP BY yearID, playerID;
 
DROP VIEW IF EXISTS AVG_Pitching;
 
CREATE VIEW AVG_Pitching AS
SELECT playerID, yearID,
ROUND((IPouts / 3), 1) AS IP,
ROUND(((ER * 9) / (IPouts / 3)), 2) AS ERA,
ROUND(((BB + H) / (IPouts / 3)), 2) AS WHIP,
ROUND(((9 * SO) / (IPouts / 3)), 1) AS K9,
ROUND(((9 * BB) / (IPouts / 3)), 1) AS BB9,
ROUND(((9 * HR) / (IPouts / 3)), 1) AS HR9,
ROUND(((9 * H) / (IPouts / 3)), 1) AS H9,
ROUND((SO / BB), 2) AS KBB,
(H + BB + IFNULL(HBP, 0) - R) / (H + BB + IFNULL(HBP, 0) - (1.4 * HR)) AS LOB
FROM Pitching
GROUP BY yearID, playerID;