Tuesday, February 21, 2012

The CrossFit Open 2011 dataset, for download

At some point I thought I would write something up more formally about the 2011 Open, but that moment has certainly passed.  Many folks have asked for the data from the 2011 Open, and here it is!

The funny thing is, I have some suspicions that the 2011 Open dataset might be better than the 2012 dataset for the height/weight analyses.  Why?  Part of it is the registration process.  In 2011, when athletes registered, they were asked immediately for their height and weight.  In 2012, these questions have been eliminated, although folks can still enter this information freely in their profile.  As a result, I expect the amount of volunteered height/weight information to fall off dramatically.  Who knows, though: with how big the Open is getting, maybe it won't matter.


Click here for a .csv file of the 2011 Open dataset, courtesy of work done by Greg Perkins.  The dataset includes all athletes, including athletes that did not finish all six workouts.  The column headers should be:
- athlete ID, nameURL, age, sex&division, height, weight, overall-points, overall-rank, score1, rank1, score2, rank2, score3, rank3, score4, rank4, score5, rank5, score6, rank6
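For anyone loading the file outside of MATLAB, here is a minimal Python sketch that attaches the column names listed above while reading. It assumes the file has no header row of its own and that the columns appear in the order given. The sample line is made-up illustrative data, not a row from the dataset.

```python
import csv
import io

# Column names taken from the list above (spacing normalized).
COLUMNS = [
    "athlete ID", "nameURL", "age", "sex&division", "height", "weight",
    "overall-points", "overall-rank",
    "score1", "rank1", "score2", "rank2", "score3", "rank3",
    "score4", "rank4", "score5", "rank5", "score6", "rank6",
]

def read_open_csv(handle):
    """Read rows from an open file handle, pairing each value with its column name.

    Assumption: the file itself has no header row.
    """
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    return list(reader)

# Made-up sample row, for illustration only.
sample = io.StringIO("1,some-athlete,30,M,70,180,100,5,10,1,20,2,30,3,40,4,50,5,60,6\n")
rows = read_open_csv(sample)
print(rows[0]["age"], rows[0]["rank6"])
```

With the real download you would pass a file handle instead, for example one opened with `open(path, newline="")`, where `path` is wherever you saved the .csv. All values come back as strings, so numeric columns still need converting.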

Click here for some helpful MATLAB scripts, including one that breaks the overall dataset into separate structures for each competition category.

And here for some information about the .csv file and descriptions of the MATLAB scripts.
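If you are not a MATLAB user, the per-category split is easy to approximate elsewhere. The Python sketch below is a rough analogue, not a port of the scripts linked above. It assumes each row is a dict keyed by column name and that the sex&division column carries the category label. The labels "A" and "B" are placeholders, not the dataset's actual codes.

```python
from collections import defaultdict

def split_by_category(rows, key="sex&division"):
    """Group row dicts by the value found in their category column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Made-up rows; the category labels are placeholders.
rows = [
    {"athlete ID": "1", "sex&division": "A"},
    {"athlete ID": "2", "sex&division": "B"},
    {"athlete ID": "3", "sex&division": "A"},
]
by_category = split_by_category(rows)
print({label: len(members) for label, members in by_category.items()})
```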

11 comments:

  1. Was Greg planning to do this for the 2012 Open? I am in the process of doing this same basic thing. No use duplicating effort?

    Replies
    1. Hey Jeff,

      It's not clear if Greg will be available this year (he seems really busy), and having a few good people take a look at things is always a good idea IMO. I think if you're up to the challenge of creating and updating the dataset, many people would appreciate it!

  2. The height/weight values are reported at almost EXACTLY the same percentage as last year. As of Friday morning (35,000+ athletes), height was self-reported by 60% and weight by 58%. Last year it was 60% and 59%.

    Almost weird. Now as to the quality of those reported values, we'll never know....

    Replies
    1. Wow. Somewhat surprising, but I like how those numbers are headed! I think I questioned the accuracy last year as well, but after analyzing the data I feel things looked pretty good. At least if there was some wiggle in the precision, it was dwarfed by the numbers.

  3. I've been going through the data and the code over the last day or two.

    But first, THANK YOU! To whoever pulled the data, I wouldn't be able to do that on my own. The next part is intended to be constructive criticism.

    Main concern:
    I see in J's Matlab code he fixed the data (height converted from 5'8" format to inches, and centimeters to inches; weight in kilograms stripped of units and converted to pounds; weight in pounds stripped of units) only after it was loaded into Matlab, so the csv itself is still uncleaned. That's not great if somebody wants to use the data in another program like R or Excel. I don't know how to export the cleaned data from Matlab back into csv form. I don't have experience in Matlab, and it would probably not be efficient for me to learn it at this time - I am not sure what you guys are using.

    I showed my stats professor this, and she said: It is the job of the person collecting the data to make it easy for others to analyze with. Then she said she could easily fix it in C++ herself. But she has a 2nd child and didn't want to help me, and my coding skills are 7 years old.

    There is obviously a multitude of ways to fix this, but I managed to do it in Excel and I'll send the data to J to upload, if he doesn't mind.

    I hope whoever is collecting this year's data can fix the data first before posting. I would also hope we can format the ID and person's name in a consistent way, or fix last year's data, so we can compare this year's results with last year's for the same people. We can do some interesting things with that, if it's possible.


    But let me repeat, thanks for the data and code! And talk to you guys again soon!

    -Vivek

  4. I am successfully collecting 2012 CF Games data. I fully intend to make it available in its simplest form to either this blog owner, or elsewhere.

    The height/weight values this year are much cleaner, as the data input screen on the athlete profile page is more regimented. The data comes back as 5' 10", for example, rather than the 70" the user input. But it's consistently formatted.

    I can make the 2012 data available as csv similar to the 2011 data posted above, as well as MySQL dump or other formats as requested.

    I will attempt to do a person-to-person match of 2011 vs. 2012. That will take some tinkering over the coming weeks.

    Assuming the Event 1 closes at 5pm PT on Sunday and the results are updated and stop "moving", I should be able to scrape all the Event 1 data starting Sunday night. I could then have it available sometime later Monday or Tuesday Denver USA time.

    Replies
    1. Sounds wonderful. I'm okay with either option on hosting and I'll leave it up to you.

    2. Note: Affiliates have 1 day to validate scores that were entered by 5pm PT Sunday, so the data may not stop "moving" until Monday at 5pm.

      Can't wait to play with the data and validate my hypothesis that short people have an advantage at burpees. Thanks for your efforts!

  5. I have made a first cut of the data. It still needs work: some fields are missing, some are not filled in. Read the !ReadMe.txt file for more information.

    Any ideas, comments, feedback welcome.

    I'll post again after Event 1 closes Sunday night.

    http://media.jsza.com/CFOpen2012-02-24.zip

    Replies
    1. Incredible, Jeff! It looks super nice and I imagine it'll only get better.

      Some comments, as you requested:

      1) Some people might love a master sheet where all the data is in one spreadsheet and it grows wider with each year and piece of information. The obvious downside is many empty cells for most athletes that aren't multiyear competitors. I can deal with either scenario and I'll leave it up to you to decide, although I would imagine it's easier on your front to keep the years different.

      2) It's easy on this end to convert height strings into numbers, so if you'd like to capture it for now and not do anything with it until later that's fine by me.

      3) You've brought up the point about identifying which athletes are in which competition (open, masters, etc.). Last year, we were able to do this easily by capturing the leaderboard, something which doesn't seem possible this year. The only way I can think of to partition the athletes on the age borders (such as 44, 49, etc.) is to use their athlete page and cross-reference the other names on the list.

      For example, this athlete: http://games.crossfit.com/athlete/10184, is 44, but appears with names that are from the masters division. There might be a better way, I just don't know what that is. Ideas?

    2. How did a 44 year old do 111 burpees?!?!

      Good luck with the scrape!

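The height/weight cleanup discussed in the comments (feet-and-inches strings, centimeters, and weights with units attached) can be sketched in Python for anyone working outside MATLAB or Excel. This is only an illustration, not the cleanup anyone in the thread actually used. The unit suffixes it recognizes ("cm", "in", "kg", "lb", "lbs") are assumptions about how the raw values are written, so check them against the file before relying on it.

```python
import re

CM_PER_INCH = 2.54
LB_PER_KG = 2.20462

def height_to_inches(text):
    """Convert strings like 5'8", 5' 10", 70" or 178 cm to inches.

    Returns None when the format is not recognized.
    """
    text = text.strip().lower()
    # Feet and optional inches, e.g. 5'8" or 5' 10"
    m = re.fullmatch(r"(\d+)'\s*(\d+(?:\.\d+)?)?\"?", text)
    if m:
        inches = float(m.group(2)) if m.group(2) else 0.0
        return int(m.group(1)) * 12 + inches
    # Inches only, e.g. 70" or 70 in
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:\"|in)", text)
    if m:
        return float(m.group(1))
    # Centimeters, e.g. 178 cm
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if m:
        return float(m.group(1)) / CM_PER_INCH
    return None

def weight_to_pounds(text):
    """Convert strings like 80 kg or 176 lb to pounds; None if unrecognized."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kg|lbs?)", text.strip().lower())
    if not m:
        return None
    value = float(m.group(1))
    return value * LB_PER_KG if m.group(2) == "kg" else value

print(height_to_inches("5' 10\""), weight_to_pounds("80 kg"))
```

Returning None for unrecognized values, rather than guessing, keeps blank or odd entries visible so they can be counted or fixed by hand later.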