Thursday, March 8, 2012

Week 12.3 percentile plots

One thing I have trouble doing with the leaderboard is quickly scanning how many reps you need to rank in a certain percentile. This is especially true after Thursday, when droves of scores make scrolling through the leaderboard the devil.

Here - I make it easy! Thanks to Andrew, I've managed to scrape the leaderboard and plot cumulative distributions for week 3. Want to know how many reps you need for the 50th percentile? Just go to the y-axis, find 0.5, and run your finger across until it hits the pink (Ind. Women) or blue (Ind. Men) line. Drop down, and there's the target number of reps. Going from reps to percentile is just as easy: reverse the process.
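For anyone who would rather compute than trace a finger across the plot, here is a minimal sketch of the same lookup. It is written in Python, not the MATLAB file mentioned in this post. The rep counts are made up for illustration (not real leaderboard data), the function names `percentile_of` and `reps_for_percentile` are my own, and it assumes the curve shows the fraction of athletes scoring at or below a given rep count.

```python
import math
from bisect import bisect_right

def percentile_of(reps, scores):
    """Fraction of scores at or below `reps` (the y-value of the empirical CDF)."""
    ordered = sorted(scores)
    return bisect_right(ordered, reps) / len(ordered)

def reps_for_percentile(p, scores):
    """Smallest rep count whose CDF value is at least p (for 0 < p <= 1)."""
    ordered = sorted(scores)
    k = max(1, math.ceil(p * len(ordered)))  # how many athletes must be at or below
    return ordered[k - 1]

# Made-up rep counts, purely illustrative
sample_scores = [12, 45, 60, 72, 72, 90, 110, 135, 150, 200]

print(percentile_of(72, sample_scores))         # 0.5
print(reps_for_percentile(0.5, sample_scores))  # 72
```

With only a handful of scores the answers are coarse; the same early-week caveat about the extremes applies here too.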

Caveat: this is still pretty early, so the curves could change a bit, especially at the extremes (i.e., near 0 or near 1). Ballpark, though, I bet it'll work for most people.

Oh, the little bump on the pink line near low percentiles? That's apparently a few folks who have trouble doing a toe-to-bar. I wondered about that myself.

For those interested in the code, I've enclosed the MATLAB file I used to scrape the pages and organize the data the way I like it. It's definitely rough around the edges, so if somebody would like to use it and make it better, by all means!


  1. Okay, I pulled ALL of the athletes' data, self-reported biometrics and scores:

    It's a little tricky in that heights and weights are reported in either cm and kg or ft/in and lbs, depending on locale.

    Missing data is entered as '--'; if you want to change it to -1, find and replace '--'.

    Anyone know perl? I'll provide the perl scripts if you like.
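The mixed units and the '--' placeholder described above can be normalized in one pass. This is a rough Python sketch, not the Perl scripts from this thread. The exact string formats (`178 cm`, `5'10"`, `82 kg`, `180 lb`) are assumptions, since the file's layout is not shown here, and the helper names are hypothetical. It returns `None` for missing values rather than -1, so a placeholder cannot be averaged in as a real measurement by accident.

```python
import re

MISSING = "--"  # placeholder for missing data, as described in the comment above

def height_to_cm(raw):
    """Convert a height string to centimetres; None if missing.

    Assumed (hypothetical) formats: '178 cm' or 5'10".
    """
    raw = raw.strip()
    if not raw or raw == MISSING:
        return None
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", raw)
    if m:
        return float(m.group(1))
    m = re.fullmatch(r"(\d+)'\s*(\d+)\"?", raw)
    if m:
        feet, inches = int(m.group(1)), int(m.group(2))
        return round((feet * 12 + inches) * 2.54, 1)
    raise ValueError(f"unrecognized height: {raw!r}")

def weight_to_kg(raw):
    """Convert a weight string to kilograms; None if missing.

    Assumed (hypothetical) formats: '82 kg', '180 lb' or '180 lbs'.
    """
    raw = raw.strip()
    if not raw or raw == MISSING:
        return None
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kg|lbs?)", raw)
    if not m:
        raise ValueError(f"unrecognized weight: {raw!r}")
    value, unit = float(m.group(1)), m.group(2)
    return value if unit == "kg" else round(value * 0.45359237, 1)

print(height_to_cm("5'10\""))  # 177.8
print(weight_to_kg("180 lb"))  # 81.6
print(weight_to_kg("--"))      # None
```

Raising on anything unrecognized is deliberate: with self-reported data it is better to see the odd entries than to convert them silently.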

  2. Wicked, Andrew, thanks for your efforts! Am I correct that these are regional rankings/scores (not worldwide) for each WOD (but not overall)?
    I'm currently trying to figure out how to scrape the data using R (my language of choice) as a coding exercise, but as I don't know the first thing about HTML I am somewhat hindered. I'd be interested in your scripts for my own education :)

  3. The ranks, I think, are worldwide, pulled right from the page (the number outside the parentheses).

    The script to get the names:
    The script to get the scores:

    I recommend modifying the names script to work with a subset of the athletes because it takes two days to get everyone's data. It actually visits every athlete's athlete page to get their data.

    The scores script is much faster.

    Note that you will need the Perl packages HTML::Parser and LWP::Simple for this script to work. You can get them using cpan:

    >install HTML::Parser
    >install LWP::Simple

    1. Update: the rankings appear to be regional rankings.

    2. ahhh one more update:
      I just updated the file, so hit the link again to download the latest copy.

      I accidentally categorized the masters women 60+ with the individual men because I only matched did=1, which also matched did=10. So I re-categorized the masters women 60+ into the correct division.

      Also, I forgot this one script to scrape the athletes' bio info:

      Run the names script to get a roster, then the bio script to get their age, gender, and height, then the scores script to get their WOD scores.
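The did=1 versus did=10 mix-up above is a classic substring-matching trap. Here is a small Python illustration (not the Perl script from this thread) of how parsing the query string avoids it. The `example.com` URLs and the `division_id` helper are hypothetical; only the `did` and `rid` parameter names come from the comments.

```python
from urllib.parse import urlparse, parse_qs

def division_id(url):
    """Return the `did` query parameter as an int, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("did")
    return int(values[0]) if values else None

# Hypothetical URLs; only the did= and rid= parameter names come from the thread
individual_men = "http://example.com/leaderboard?did=1&rid=0"
masters_women_60 = "http://example.com/leaderboard?did=10&rid=0"

# A naive substring test cannot tell the two divisions apart
print("did=1" in individual_men)      # True
print("did=1" in masters_women_60)    # True (the bug)

# Comparing the parsed value does
print(division_id(individual_men))    # 1
print(division_id(masters_women_60))  # 10
```

An anchored pattern such as `did=1` followed by `&` or end-of-string would also work; parsing the query just makes the intent harder to get wrong.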

  4. Great work, Andrew. Thanks for your fantastic contributions! I actually think a column for the regional rankings is preferable, as I've found it's easier to recalculate the world rankings than the other way around. Filtering on each region and WOD can be a bit tedious.
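As a rough illustration of why worldwide ranks are cheap to rebuild, here is one way to recompute them from raw scores in Python. It assumes more reps is better and that tied athletes share the better rank ("1, 2, 2, 4" style); the actual tie-breaking on the leaderboard may differ. The rows are made up and `competition_ranks` is a hypothetical helper name.

```python
def competition_ranks(scores):
    """Rank scores from best (1) to worst; ties share the better rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[i] == scores[previous]:
            ranks[i] = ranks[previous]  # tie: reuse the rank of the athlete above
        else:
            ranks[i] = position
    return ranks

# Made-up rows: (athlete, region, reps)
rows = [("A", "North", 120), ("B", "South", 150),
        ("C", "North", 120), ("D", "South", 90)]

world = competition_ranks([reps for _, _, reps in rows])
print(world)  # [2, 1, 2, 4]

# Regional ranks need the filtering step first
north = [reps for _, region, reps in rows if region == "North"]
print(competition_ranks(north))  # [1, 1]
```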

  5. I have created my version of the CF Open dataset. It's similar to prior weeks; I just keep adding and improving as the weeks go on.

    Now includes Overall and per Event Ranks. These are worldwide ranks, not regional ranks. Regional Rank per event coming soon.

    Results as of Wed 3/14, 3pm MDT (after Event 3 closed)

    1. Using Jeff's data, I updated my site:

      We now have "Participation" through WOD 12.4 and "Who's Dropping Out?" through 12.3.

      Not much fancy analysis yet, as I'm still really busy at work, but the pictures are pretty.

  6. I have compiled CrossFit Open 12.4 results data.

    This update includes several new features:

    - Regional Ranks per event in addition to Overall Ranks per event.
    - Masters Divisions now properly account for bubble ages (e.g., 44 yrs old now, but will be 45 in July)
    - Age07 is Age in July. It differs from "current age" only for the 390 bubble Masters.

    Data file can be found here:

    J Young, you may want to repost this as a blog entry, rather than make folks find it buried as a comment.

    1. The proper division for each athlete can be found on the leaderboard (did=1 through 10; did=11 is the team ranks), irrespective of reported age.

      For the rankings, rid=0 is worldwide, and rid=1 through however many regions there are gives the leaderboard page for rankings in each region.