CrossFit Open Analysis: The CrossFit Open 2011 dataset, for download

Tuesday, February 21, 2012

The CrossFit Open 2011 dataset, for download

At some point I thought I would write something up more formally about the 2011 Open, but that moment has certainly passed. Many folks have asked for the data from the 2011 Open, and here it is!

The funny thing is, I suspect the 2011 Open dataset might be better than the 2012 dataset for the height/weight analyses. Why? Part of it is the registration process. In 2011, athletes were asked for their height and weight as soon as they registered. In 2012, these questions have been eliminated, although folks can still enter this information in their profile. As a result, I expect volunteered height/weight information to fall off dramatically. Who knows, though: with how big the Open is getting, maybe it won't matter.

Click here for a .csv file of the 2011 Open dataset, courtesy of work done by Greg Perkins. The dataset includes all athletes, even those who did not finish all six workouts. The column headers should be:

- athlete ID, nameURL, age, sex&division, height, weight, overall-points, overall-rank, score1, rank1, score2, rank2, score3, rank3, score4, rank4, score5, rank5, score6, rank6
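
For anyone loading the file outside MATLAB, here is a minimal Python sketch that attaches the column names above to each row. It assumes the .csv has no header row and that missing values are empty fields; the snake_case names, the `read_open_rows` helper, and the sample row are my own, purely for illustration.

```python
import csv
import io

# Column names from the list above; the snake_case spellings are my own.
COLUMNS = [
    "athlete_id", "name_url", "age", "sex_division", "height", "weight",
    "overall_points", "overall_rank",
    "score1", "rank1", "score2", "rank2", "score3", "rank3",
    "score4", "rank4", "score5", "rank5", "score6", "rank6",
]


def read_open_rows(lines):
    """Return one dict per athlete from an iterable of CSV lines.

    Assumes there is no header row, so the names in COLUMNS are applied in order.
    """
    return list(csv.DictReader(lines, fieldnames=COLUMNS))


# Made-up sample row, only to show the shape of the result.
sample = io.StringIO("1,some-athlete,30,M,70,180,10,1,5,2,5,2,5,2,5,2,5,2,5,2\n")
rows = read_open_rows(sample)
print(rows[0]["height"], rows[0]["overall_rank"])
```

For the real file, pass an open file object instead of the sample, for example `open(path, newline="")`, where `path` is wherever you saved the download.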

Click here for some helpful MATLAB scripts, including one that breaks the overall dataset into separate structures for each competition category.

And here for some information about the .csv file and descriptions of the MATLAB scripts.
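
If you are not using MATLAB, splitting the dataset by competition category can be sketched in a few lines of Python. This is not a port of the MATLAB script, just the same idea under my own assumptions: each athlete is a dict, the category lives in a field I have called `sex_division`, and the category labels below are placeholders rather than the values in the real file.

```python
from collections import defaultdict


def split_by_category(rows, key="sex_division"):
    """Group athlete records (dicts) into one list per competition category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Made-up records; the category labels are placeholders.
rows = [
    {"athlete_id": "1", "sex_division": "category-A"},
    {"athlete_id": "2", "sex_division": "category-B"},
    {"athlete_id": "3", "sex_division": "category-A"},
]
by_category = split_by_category(rows)
for name, members in sorted(by_category.items()):
    print(name, len(members))
```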

11 comments:

Jeff King, February 21, 2012 at 4:46 PM
Was Greg planning to do this for the 2012 Open? I am in the process of doing the same basic thing. No use duplicating effort.

Jeff King, February 21, 2012 at 11:02 PM
The share of athletes reporting height/weight is almost EXACTLY the same as last year. As of Friday morning (35,000+ athletes), height was self-reported by 60% and weight by 58%. Last year it was 60% and 59%.

Almost weird. Now as to the quality of those reported values, we'll never know....

Killers411, February 23, 2012 at 10:41 PM
I've been going through the data and the code over the last day or two.

But first, THANK YOU! To whoever pulled the data, I wouldn't be able to do that on my own. The next part is intended to be constructive criticism.

Main concern:
I see in J's Matlab code he fixed the data (height converted from 5'8" format to inches, and centimeters to inches; weight in kilograms stripped of units and converted to pounds, weight in pounds stripped of units) before it was loaded into Matlab. That's not great for anybody who wants to use the raw data in another program like R or Excel. I don't know how to export the cleaned data from Matlab back into csv form. I don't have experience in Matlab, and it would probably not be efficient for me to learn it at this time. I am not sure what you guys are using.

I showed my stats professor this, and she said: It is the job of the person collecting the data to make it easy for others to analyze. Then she said she could easily fix it in C++ herself. But she has a 2nd child and didn't want to help me, and my coding skills are 7 years old.

There is obviously a multitude of ways to fix this, but I managed to do it in Excel and I'll send the data to J to upload, if he doesn't mind.
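
As one illustration of the cleanup described above (feet-and-inches and centimeters to inches, kilograms to pounds, units stripped), here is a minimal Python sketch. The exact raw strings in the file are an assumption on my part: I am guessing at forms like `5'8"`, `173 cm`, `70"`, `80 kg` and `175 lb`, and treating a bare number as inches or pounds. The function names are made up for this example.

```python
import re

CM_PER_INCH = 2.54
LB_PER_KG = 2.20462


def height_to_inches(text):
    """Return height in inches, or None if the text is not recognised."""
    s = text.strip().lower()
    # Feet and inches, e.g. 5'8" or 5' 10" (inches part optional)
    m = re.fullmatch(r"(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*\"?", s)
    if m:
        return int(m.group(1)) * 12 + float(m.group(2) or 0)
    # Centimeters, e.g. 173 cm
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", s)
    if m:
        return float(m.group(1)) / CM_PER_INCH
    # Inches, e.g. 70" or 70 in; a bare number is assumed to be inches
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:\"|in)?", s)
    if m:
        return float(m.group(1))
    return None


def weight_to_pounds(text):
    """Return weight in pounds, or None if the text is not recognised."""
    s = text.strip().lower()
    # A bare number is assumed to be pounds already
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kg|kgs|lb|lbs)?", s)
    if not m:
        return None
    value = float(m.group(1))
    return value * LB_PER_KG if m.group(2) in ("kg", "kgs") else value


print(height_to_inches("5'8\""), weight_to_pounds("175 lb"))
```

Anything the patterns do not recognise comes back as `None` rather than a guess, so odd entries can be reviewed by hand before the cleaned values are written back out to csv.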

I hope whoever is collecting this year's data can fix the data first before posting. I would also hope we can format the ID and person's name consistently, or fix last year's data, so we can compare this year's results with last year's for the same people. We can do some interesting things with that, if it's possible.

But let me repeat, thanks for the data and code! And talk to you guys again soon!

-Vivek

Jeff King, February 24, 2012 at 11:57 AM
I am successfully collecting 2012 CF Games data. I fully intend to make it available in its simplest form, either to this blog's owner or elsewhere.

The height/weight values this year are much cleaner, as the data input screen on the athlete profile page is more regimented. The data comes back as 5' 10" for example, rather than the 70" the user input. But it's consistently formatted.

I can make the 2012 data available as csv similar to the 2011 data posted above, as well as a MySQL dump or other formats as requested.

I will attempt to do a person-to-person match of 2011 vs. 2012. That will take some tinkering over the weeks.

Assuming Event 1 closes at 5pm PT on Sunday and the results are updated and stop "moving", I should be able to scrape all the Event 1 data starting Sunday night. I could then have it available sometime later Monday or Tuesday, Denver USA time.

Jeff King, February 24, 2012 at 5:51 PM
I have made a first cut of the data. It still needs work: some fields are missing, and some are not filled in. Read the !ReadMe.txt file for more information.

Any ideas, comments, feedback welcome.

I'll post again after Event 1 closes Sunday night.

http://media.jsza.com/CFOpen2012-02-24.zip