First of all, Happy New Year, everyone! I hope you had a fun time doing whatever you do over the holidays. IceBat was a party pooper and decided to sleep all night in his freezer bed.

As you may remember, back in December I had to complete a couple of final projects. One idea that I didn’t use dealt with the concept that home runs are not all equal as displays of a player’s power or batting skill. We equate overpowering shots to right-center field by Prince Fielder with balls that graze the more-than-generous right field wall of Yankee Stadium. What I mean is, more variables than pure distance go into determining whether or not a fly ball becomes a home run. With this in mind, I can run a regression model to compute the probability that a fly ball will turn into a home run. I received a large data set (many thanks to Greg Rybarczyk at Hit Tracker) that spans the 2006-2008 seasons for three players (Adam Dunn, Manny Ramirez and Jason Bay). It includes observed and calculated values (similar to Hit Tracker’s own data, e.g. True Distance or Elevation Angle) for every long fly ball the players hit, totaling a tad over 700 observations. Also included are variables such as the ballpark the ball was hit in, the date and time, and the outcome of the play (single, double, home run, out, etc.).

As you can tell from the graph above, the outcome of the play isn’t so clear when we are given only the ball’s elevation angle and distance traveled. The outcomes are scattered enough that no real pattern emerges. I superimposed two boxes to show how similar balls can have different outcomes. In the case of the right-side box, a slightly different elevation angle could mean the difference between a home run and a ball that stays in the park.

In this next plot we’re seeing the outcomes of the play split into two events: a home run or no home run (the latter equating to zero, or the orange points). We see both smoothed curves have the same shape, with the home runs curve reaching further distances on average. However, the smoothed curves don’t show how the blue and orange points are still very much intermixed. The likelihood of a home run (based on knowing the ball’s angle and distance) is quite sporadic.

Now that I’ve convinced you of this issue, we can start modeling (yay!). For those who are statistically inclined, I used a multiple logistic regression model to look for a pattern that predicts home runs. This model will essentially spit out the probability of a ball becoming a home run, given a bunch of variables. The equation used for this type of modeling is:

Prob(HR) = 1 / (1 + exp(-z))

where z is a linear combination of the predictor variables, with coefficients estimated from the data (more on those later)…
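To make the formula concrete, here is a minimal Python sketch of the logistic function. The function name and the sample z values are my own choices for illustration and are not part of the original analysis.

```python
import math

def logistic(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative z values only: negative scores give probabilities below 0.5,
# zero gives exactly 0.5, and positive scores give probabilities above 0.5.
for z in (-4, -2, 0, 2, 4):
    print(f"z = {z:+d} -> Prob(HR) = {logistic(z):.3f}")
```

The takeaway is that anything pushing z up (a positive coefficient times a bigger value) pushes the home run probability toward 1, and anything pulling z down pushes it toward 0.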

The variables I used include the true distance the ball reached, time in the air, speed off the bat (SOB in the output below), elevation angle, horizontal angle (imagine a baseball field where a larger angle equates to left field, etc.) and apex of the ball (the highest vertical point the ball reached). After trimming the model down to only the statistically significant variables, the final coefficient confidence intervals looked like this:

                    2.5 %        97.5 %
(Intercept)  -69.55676316  -48.86556510
True.Dist.     0.08764763    0.12669753
Time          -9.12581512   -6.25056613
Elev Angle     0.85324327    1.24073619
SOB            0.16324949    0.33018947
Horiz Angle   -0.03296355   -0.00181532
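One way to read these intervals: if an interval sits entirely above or below zero, the direction of that variable's effect is settled at this confidence level, and exponentiating the endpoints gives the range of the odds multiplier for a one-unit increase (holding the other variables fixed). Here is a small Python sketch of that reading. The interval values are copied from the output above; the helper names are mine, and since the units of each variable aren't stated here, "one unit" means whatever unit the data set uses.

```python
import math

# 95% confidence intervals for the coefficients, copied from the output above.
intervals = {
    "True.Dist.": (0.08764763, 0.12669753),
    "Time": (-9.12581512, -6.25056613),
    "Elev Angle": (0.85324327, 1.24073619),
    "SOB": (0.16324949, 0.33018947),
    "Horiz Angle": (-0.03296355, -0.00181532),
}

def direction(low, high):
    """Classify an interval by whether it lies entirely above or below zero."""
    if low > 0:
        return "positive"
    if high < 0:
        return "negative"
    return "inconclusive"

def odds_ratio_range(low, high):
    """Exponentiate the endpoints: odds multiplier per one-unit increase."""
    return math.exp(low), math.exp(high)

for name, (low, high) in intervals.items():
    or_low, or_high = odds_ratio_range(low, high)
    print(f"{name:12s} {direction(low, high):9s} odds x {or_low:.4f} to {or_high:.4f}")
```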

From this we’re seeing that Distance, Elevation Angle and Speed off the Bat are all positively associated with home runs, while hang time of the ball and horizontal angle are negatively associated. This is hardly news to baseball fans; any physicist could probably come to the same conclusions. It may be a better use of time to focus on the less tangible effects on fly balls, such as temperature, wind, and elevation above sea level (i.e. the Coors Field effect). The next regression considers all the fly ball variables plus these uncontrolled park variables. Surprisingly, every variable in this regression comes out statistically significant. Better yet, the AIC (a measure of model fit that penalizes extra variables, where lower is better) from this model is lower than the original model’s. Remarkable stuff!
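For anyone curious how an AIC comparison works mechanically, here is a minimal Python sketch using the standard definition AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the likelihood. The outcomes and predicted probabilities below are made up purely for illustration (they are not from the Hit Tracker data), and the parameter count of 9 for the larger model is an arbitrary stand-in.

```python
import math

def log_likelihood(outcomes, probs):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probs)
    )

def aic(outcomes, probs, n_params):
    """AIC = 2k - 2*logL; lower is better, and extra parameters are penalized."""
    return 2 * n_params - 2 * log_likelihood(outcomes, probs)

# Made-up toy data, NOT from the real data set: 1 = home run, 0 = no home run.
outcomes = [1, 0, 1, 1, 0, 0]
# Hypothetical predictions from a smaller model (6 parameters: the intercept
# plus five variables) and a larger model (9 parameters, an arbitrary count).
probs_small = [0.70, 0.40, 0.60, 0.80, 0.30, 0.50]
probs_large = [0.95, 0.05, 0.90, 0.95, 0.10, 0.05]

print("small model AIC:", round(aic(outcomes, probs_small, 6), 2))
print("large model AIC:", round(aic(outcomes, probs_large, 9), 2))
```

Note the 2k term: a bigger model has to improve the log-likelihood by more than the number of parameters it adds before its AIC comes out lower.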

I won’t go into more detail (for now), but maybe models like this can help us determine how good a home run hitter a player really is. For example, if Manny has a higher average probability of a fly ball making it over the fence than Jason Bay, we should probably credit Manny as the better power hitter, despite possibly having lower home run totals (this is all hypothetical). Instead of looking at home run totals, there are probably better ways to discern power ability.
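As a rough sketch of what that comparison might look like, the Python below averages predicted home run probabilities per player. Everything in it is hypothetical: the player labels and the probabilities are made up for illustration and are not model output for any real hitter.

```python
from collections import defaultdict

# Made-up (player, predicted HR probability) pairs for illustration only.
# These are NOT real model outputs for any actual hitter.
predictions = [
    ("Player A", 0.62), ("Player A", 0.15), ("Player A", 0.88),
    ("Player B", 0.40), ("Player B", 0.35), ("Player B", 0.20),
]

def mean_hr_probability(rows):
    """Average the predicted home run probability over each player's fly balls."""
    totals = defaultdict(lambda: [0.0, 0])  # player -> [sum of probs, count]
    for player, prob in rows:
        totals[player][0] += prob
        totals[player][1] += 1
    return {player: s / n for player, (s, n) in totals.items()}

for player, avg in sorted(mean_hr_probability(predictions).items()):
    print(f"{player}: average Prob(HR) per fly ball = {avg:.3f}")
```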