Wednesday, February 29, 2012

WOD 12.1 - Early plots and analysis

Thanks to miked and killers411, newcomers to the CrossFit analysis scene, we have some early plots and discussion points! Thanks, guys!

Miked has plotted the real and fitted distributions for WOD 12.1 for all athletes, including masters. He's also very nicely calculated the mean and standard deviation of each group and included them in the legend. Note: the masters distributions are hard to see because the y-axis is the number of athletes, and of course the number of athletes in the non-masters groups dwarfs that of the masters.
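
One common way around that scale problem is to plot densities instead of raw counts, so each group's histogram has unit area no matter how many athletes it contains. The following is only a minimal sketch of that idea, not miked's actual code: the function name `density_histogram`, the bin width, and the scores are all made up for illustration.

```python
from statistics import mean, stdev


def density_histogram(scores, bin_width, lo, hi):
    """Bin scores into [lo, hi) and scale counts so the bars have total area 1."""
    n_bins = int((hi - lo) // bin_width)
    counts = [0] * n_bins
    for s in scores:
        idx = int((s - lo) // bin_width)
        if 0 <= idx < n_bins:  # ignore scores outside the plotted range
            counts[idx] += 1
    total = sum(counts)
    edges = [lo + i * bin_width for i in range(n_bins)]
    densities = [c / (total * bin_width) if total else 0.0 for c in counts]
    return edges, densities


# Made-up burpee counts, NOT real Open results: one large group, one small.
open_scores = [88, 92, 95, 101, 104, 104, 110, 113, 117, 121, 125, 130]
masters_scores = [70, 82, 85, 96]

for label, scores in [("open", open_scores), ("masters", masters_scores)]:
    edges, dens = density_histogram(scores, bin_width=20, lo=60, hi=140)
    print(label, f"mean={mean(scores):.1f}", f"std={stdev(scores):.1f}", dens)
```

With densities on the y-axis, a small group such as the masters can share a plot with the main divisions, although a wider bin is usually needed to keep the small group from looking noisy.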

He's also plotted performance against height and weight, and done some modeling of workout performance and workload. Outstanding stuff, and it'll be fun to see more.

Here's the link from miked.

Killers411 has done a nice comparison of the numbers for each region, which suggests the Open has grown 2.5 times since last year! The fight for regionals will certainly be tougher with all these newcomers.

...and the link from killers411.


  1. Yup, I will normalize them later. Since there are so few competitors, it gets really noisy, so I should increase the bin size a bit first. I do have the plot zoomed in on the masters, though, so we can see how everyone stacks up.

    I added another page to my analysis talking about height, weight, power and burpees:

    Back to my day job... Will do more as I get the time.

  2. Got the normalized histograms up...

  3. Ah, well, the CrossFit Games website has changed how they format their scores and rankings data. So, I will have to write new code to scrape the results from the Games website.

    On the plus side, the new format is faster and simpler to scrape. It's also less HTML/graphics, so it's less bandwidth for the Games site.

    The downside: I will have Week 2 raw data results for a while until I recode... stay tuned.

  4. I have developed Perl scripts to scrape athlete data (Region, Affiliate, Team, Age, Height, Weight, Gender, Division, plus scores and rank).

    Here is a sample from the top 10 teams:

    I can pull everyone's data without too much more effort.

    1. That looks good, but could you put -1 or something in the fields where there's invalid data? That makes it a whole bunch easier to parse and bin the data. I'd love to see those Perl scripts. I tried doing this in Perl last year and failed miserably, but I guess that's why I'm a scientific programmer: to me, HTML and services are just magic :)

    2. Hey Andrew,

      Boom! Nice work, sir... it looks like you've found a nice way to crack the leaderboard pages without having to go through each athlete's ID page. Very cool.

      Out of curiosity, how did you find the 'backpages' to the leaderboard? I was trying to figure that out and couldn't crack it.

      What I mean by backpages is something like this in your Excel file:

      I notice npp actually controls the number of athletes shown, which is nice. I tried 500 without too many problems.

    3. Also note, for some folks... setting rid=0 gets the main leaderboard page, without regard to region.

    4. Fancy!! All of this stealing of data from the interweb is magic to me. Is anyone able to get the whole set? I also wonder if it's a simple matter to scrape regional rankings as well, although I've started some code to do this manually for my own region.
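
For anyone following along, the two parameters mentioned in the replies above, `npp` (athletes per page) and `rid` (region, with 0 meaning no region filter), could be combined along these lines. This is a rough sketch only: the base address is a placeholder because the real leaderboard URL is not given in this thread, the helper name `leaderboard_url` is made up, and treating both parameters as ordinary query-string fields is an assumption.

```python
from urllib.parse import urlencode

# Placeholder base address: the real leaderboard URL is not given in this thread.
BASE_URL = "https://example.invalid/leaderboard"


def leaderboard_url(rid=0, npp=100, base=BASE_URL):
    """Build a leaderboard URL; rid=0 means no region filter, npp = athletes per page."""
    return f"{base}?{urlencode({'rid': rid, 'npp': npp})}"


# 500 athletes per page on the main (all-region) leaderboard.
print(leaderboard_url(rid=0, npp=500))
```
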

  5. We are all duplicating each other's efforts here.

    I am working on scraping the entire leaderboard, for all divisions, for all athletes, to finish the data set I started and posted two weeks ago.

    The CF Games site changed format, so my original code is no longer valid, and I am working on new code that uses the new website.

    My file format will remain as before: it has columns for all five events, height, weight, regional and worldwide ranks, age, and division.

    Mine is PHP, loading into a MySQL database. Then I can export to CSV, which seems to be the simplest format for everyone.

    1. I think there's bound to be some overlap here. My suspicion is that some people just like trying to scrape the data themselves, as a form of mental exercise. I partly scraped the leaderboard just to make a simple plot that I was interested in, but I'm absolutely certain your final dataset will be of much higher quality than what I could produce!

      BTW, I know what you mean about changing formats. Last year HQ did the same thing and killed my code (I guess it wasn't very robust). Can't HQ figure out that their improvements hinder our efforts? ;)
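
To illustrate two suggestions from the thread, a shared CSV export and a -1 placeholder for invalid fields, here is a small sketch of how such a file might be read and filtered before binning. The column names, the sample rows, and the helper names `load_rows` and `valid_values` are all hypothetical; this is not the format or code of anyone in the thread.

```python
import csv
import io

INVALID = -1  # sentinel suggested in the thread for missing or invalid fields

# Made-up rows in a hypothetical column layout, NOT real athlete data.
SAMPLE_CSV = """name,age,height_cm,weight_kg,score
Athlete A,27,180,84,121
Athlete B,-1,175,-1,109
Athlete C,41,-1,77,96
"""


def load_rows(text, numeric_fields=("age", "height_cm", "weight_kg", "score")):
    """Parse CSV text, converting the numeric fields to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in numeric_fields:
            row[field] = int(row[field])
        rows.append(row)
    return rows


def valid_values(rows, field):
    """Return the values of one field, skipping rows marked with the sentinel."""
    return [r[field] for r in rows if r[field] != INVALID]


rows = load_rows(SAMPLE_CSV)
print(valid_values(rows, "weight_kg"))
```

Because every field is always numeric, the parser never has to special-case blanks or text like "N/A"; dropping the sentinel is a single comparison per row.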