Sunday, April 19, 2015

Visualizing Statistical Stabilization

I was listening to the Effectively Wild podcast from a few days ago, and Ben and Sam were going through the inevitable yearly motions of dismissing early-season performance as small sample size, while explaining that there is no magic number of games or at-bats at which stats stabilize. That got me interested in seeing what such stabilization actually looks like, so I dug into the Retrosheet event files and graphed cumulative batting averages over time for various players and other entities.
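
For anyone who wants to try something similar, here is a minimal Python sketch of the cumulative batting average calculation. It is not the code behind the charts in this post. It assumes the Retrosheet events have already been reduced to a chronological list of official at-bats, marked 1 for a hit and 0 for an out, and the function name is made up for illustration.

```python
from itertools import accumulate


def cumulative_batting_average(at_bat_results):
    """Return the running batting average after each official at-bat.

    at_bat_results: chronological iterable of 1 (hit) or 0 (out).
    This input format is an assumption, not the Retrosheet format itself.
    """
    results = list(at_bat_results)
    running_hits = accumulate(results)  # total hits after each at-bat
    # Divide the running hit total by the number of at-bats so far.
    return [hits / n for n, hits in enumerate(running_hits, start=1)]


# Hypothetical eight at-bat sample: three hits in eight at-bats.
sample = [1, 0, 0, 1, 0, 0, 0, 1]
averages = cumulative_batting_average(sample)
```

Plotting `averages` against the at-bat number (or the game date) gives the kind of line discussed here: wild swings early, then a line that flattens as the denominator grows.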


I started with individual players, whose charts clearly demonstrate how much variability there can be in the first couple of months of the season. A player like Punto, who didn't play every day or get a full set of at-bats when he did, doesn't seem to attain any sort of stability until July at the earliest. Below is what all three look like relative to each other, with the league average added in for comparison.

The league looks pretty stable, right? Here's what that line looks like on its own scale. Even the league-wide batting average doesn't stabilize until late May. Concerned about Scooter Gennett now?

Finally, I wondered how the variances at the player (Jimmy Rollins), team (Oakland), and league levels compare with each other. The specific team and player were chosen only because both ended up with nearly league-average numbers.

Here, also, is a table of the variances of each sample of the cumulative average, subdivided by month.

months     April      May         June        July       August      September
jrollvars  1.2059766  0.7116112   0.4704438   0.3695916  0.3124211   0.2803761
oakvars    2.0644219  1.0208267   0.6871219   0.5282449  0.4226945   0.3568851
mlbvars    0.1070076  0.06255327  0.04570331  0.0372675  0.03044601  0.02613776
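
A rough Python sketch of one way to produce a table like this follows. It rests on several assumptions that the post does not confirm: that each month's variance is taken over the cumulative-average values recorded within that month, that the sample (n - 1) variance is used, and that the input is a list of (month, cumulative average) pairs. The function name and the sample numbers are invented for illustration. The `scale` argument reflects the batting averages having been multiplied by 100 before the variances were computed.

```python
from collections import defaultdict
from statistics import variance


def monthly_variances(dated_averages, scale=100):
    """Sample variance of the scaled cumulative average within each month.

    dated_averages: iterable of (month_name, cumulative_average) pairs.
    Grouping within each month is an assumption about how the table
    was built, not a confirmed description of it.
    """
    by_month = defaultdict(list)
    for month, avg in dated_averages:
        by_month[month].append(avg * scale)
    # variance() needs at least two points, so skip months with fewer.
    return {m: variance(v) for m, v in by_month.items() if len(v) > 1}


# Made-up cumulative averages, not taken from Retrosheet.
data = [
    ("April", 0.300), ("April", 0.250), ("April", 0.275),
    ("May", 0.260), ("May", 0.262),
]
result = monthly_variances(data)
```

If the table was instead built from all values from Opening Day through the end of each month, the only change would be to accumulate the lists across months before calling `variance`.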

I think it's interesting that Oakland's variances are consistently higher than Rollins', probably due to greater changes in the underlying talent producing the numbers. The league's variances are much smaller, yet still tangible next to the team and player variances. Note that I multiplied the batting averages by 100 before computing the variances, i.e. I used 24.5 instead of .245. Hope this was interesting. Thanks for reading.
