I'm almost certain that I won't be able to maintain the blog as well as I did last year. And let's be clear: Greg Perkins was a HUGE help in acquiring the data in the last couple of weeks of 2011. Without him I probably would have spent many more hours just trying to mine the data.
So at this point, I'm trying to start a discussion among people who would be willing and able to help out this year. I can host the 2012 blog, or somebody else can if they prefer and want to make it look all fancy (instead of the terrible book background I've been using). I can share the MATLAB scripts I used to generate the plots last year, which can be easily adapted to other languages.
I'm just brainstorming now, but we need people who can do the following:
1) Understand how HQ is posting the workout scores on the website, and figure out a way to dump these scores along with all the other athlete information (name, region, age, height, weight, etc.). Greg, are you still around these days? Do you think your nid trick will work like last year?
2) Take data dumps in relatively real time and play around with the data, pose a few interesting questions and make some rough plots. Ideally, this person would also be writing something to go along with their plots. I can facilitate here a bit and help the crowd.
3) Folks that can take the plots and clean them up a bit for presentation purposes.
I think that's all for now. Please comment if you think you have the skill and desire to help!
Well, I'll probably be doing the stats anyway. I find grabbing the data the hardest part, but if someone can post the data, I'm happy to play around in near real time and do some simple correlations to pose interesting questions. Keep me in the loop. Depending on how busy work gets, I could be anywhere from obsessing over the data to just doing some simple stats.
m i k e d e s k e v i c h (at) g m a i l (dot) c o m
I'm in about the same boat as miked. If someone is grabbing data and putting it somewhere accessible, I'm happy to put together some plots and commentary in Excel/Matlab/Octave.
Thanks, Mike. Hopefully once the scores start to get posted we'll have a better idea on how to grab the data.
Okay, actually I think it might be easier to pull data this year, since each individual athlete has a page number, presumably with their results attached to the source.
For example: http://games.crossfit.com/athlete/2632
Here's the trick: we're going to need a list of the numbers (2632 in this case) that actually have athlete information once the final day of registration has passed. So far, I haven't found an athlete with a 6-digit number, which is great news. Has anybody found one yet?
Perhaps if enough people are interested, one person could focus on pulling the data, then make it available to a bunch of folks who will look at correlations, etc.
I think it's a great idea. xfit open analysis - collaborative style. Let's make it happen folks.
I have already written a script to download the athlete data. I just went from 1 to 50,000, like JYoung's example above. I found about 70% of the nodes to be athletes. The other integers are used by their website for content of other types (videos, articles, etc.). I just grabbed the athlete data at this point. We'll have to see what the score data looks like, since the current site provides no clues, but I'd guess it will be some form of the integer URL. At this point, the last athlete is about 53,000, so no 6-digit numbers yet. When you say "post the data so it's available", what format would be helpful? I might be able to do that part with some help.
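For anyone who wants to try the same enumeration, here is a minimal Python sketch of the idea described above. The URL pattern comes from the example earlier in the thread; the `fetch` callable and the marker string in `is_athlete_page` are hypothetical stand-ins, since the real page markup isn't shown here and this is not Jeff's actual script.

```python
from typing import Callable, Dict, Iterable, Optional

# URL pattern taken from the example earlier in this thread.
ATHLETE_URL = "http://games.crossfit.com/athlete/{node_id}"


def is_athlete_page(html: str) -> bool:
    """Hypothetical check: the marker string is an assumption, not the site's real markup."""
    return "athlete-profile" in html


def scan_nodes(
    node_ids: Iterable[int],
    fetch: Callable[[str], Optional[str]],
) -> Dict[int, str]:
    """Return {node_id: html} for every node that looks like an athlete page.

    `fetch` takes a URL and returns the page body, or None if nothing was found.
    """
    athletes = {}
    for node_id in node_ids:
        html = fetch(ATHLETE_URL.format(node_id=node_id))
        if html is not None and is_athlete_page(html):
            athletes[node_id] = html
    return athletes


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs offline; swap in a real HTTP client
    # (with polite rate limiting) to scan the live site.
    fake_site = {
        ATHLETE_URL.format(node_id=1): "<div class='athlete-profile'>A</div>",
        ATHLETE_URL.format(node_id=2): "<div class='video'>clip</div>",
    }
    found = scan_nodes(range(1, 4), fake_site.get)
    print(sorted(found))
```

Passing the fetcher in as a parameter keeps the scan logic separate from the network code, so the filtering can be checked against saved pages before hitting the site.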
Great work, Jeff. I just checked the website though, and it seems like newly registered folks are up in the 57000s. With regard to the data, I can send you an example of last year's data so you can see the format. Can you send me your email address?
I only went to 50,000 because at that time (last Thur/Fri) that was as high as they went. I intend to retrieve them all as high as needed.
As for posting the data, I can put it in whatever format is needed (email to J Young), and will post it on a website I have. That way, folks can download it from there, no emailing required.
Perhaps I'll make it available in different formats, such as a SQL dump from MySQL, XML, comma-delimited, etc.
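As a rough illustration of two of those formats, here is a small Python sketch that writes the same records as comma-delimited text and as XML using only the standard library. The column names are assumptions based on the athlete fields listed in the post, and the sample row is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Field names follow the athlete attributes listed in the post; the exact
# column names are an assumption.
FIELDS = ["node_id", "name", "region", "age", "height", "weight"]


def to_csv(rows):
    """Serialize a list of athlete dicts as comma-delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Serialize the same rows as XML, one <athlete> element per row."""
    root = ET.Element("athletes")
    for row in rows:
        athlete = ET.SubElement(root, "athlete")
        for field in FIELDS:
            ET.SubElement(athlete, field).text = str(row[field])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample row, for illustration only.
    sample = [{"node_id": 1, "name": "Example Athlete", "region": "Example Region",
               "age": 30, "height": "180 cm", "weight": "80 kg"}]
    print(to_csv(sample))
    print(to_xml(sample))
```

The CSV writer quotes any field that contains a comma, which matters for names or regions that include one.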
I have already discovered one slight issue with the athlete data. The "age" is today's age, not the July 15 cutoff age. So, for masters divisions, it's not clear whether an athlete who is currently 59 will be 60 on July 15.
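To make the ambiguity concrete, here is a short Python sketch. Without a birthdate, all we can say is that the cutoff age is either today's age or one year more. The function names, the dates in the example, and the set of division start ages are illustrative assumptions, not official values.

```python
from datetime import date


def possible_cutoff_ages(age_today, today, cutoff):
    """Ages an athlete could be on `cutoff`, given only their age on `today`.

    Assumes the two dates are less than a year apart, so at most one
    birthday can fall in between.
    """
    if cutoff < today:
        raise ValueError("cutoff must not be before today")
    if cutoff == today:
        return {age_today}
    # Birthdate is unknown: the birthday may or may not fall before the cutoff.
    return {age_today, age_today + 1}


def division_is_ambiguous(age_today, today, cutoff, division_start_ages):
    """True if the athlete might cross into a new division by the cutoff.

    `division_start_ages` is a set of ages at which a division begins
    (hypothetical values; pass in whatever the real rules specify).
    """
    ages = possible_cutoff_ages(age_today, today, cutoff)
    return any(a in division_start_ages for a in ages if a != age_today)


if __name__ == "__main__":
    scrape_day = date(2012, 2, 20)   # illustrative scrape date
    cutoff_day = date(2012, 7, 15)   # cutoff from the comment; year assumed
    print(division_is_ambiguous(59, scrape_day, cutoff_day, {60}))
    print(division_is_ambiguous(58, scrape_day, cutoff_day, {60}))
```

Flagging these borderline athletes, rather than guessing, keeps the masters division counts honest until the official results settle it.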
I think last year had a similar issue. It's an important one though, especially for the different competition classes. It'd also be great if we can match athlete data from this year to last, and I think age will help with this, along with, obviously, the name of the athlete.
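Here is one possible way to do that matching, sketched in Python. It pairs records by normalized name and only accepts a pair when the age went up by a plausible amount. The record layout (`name`, `age` keys) and the `max_age_gap` default are assumptions; the gap allows for the two scrapes falling on different sides of a birthday.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase and collapse whitespace so small formatting differences don't block a match."""
    return " ".join(name.lower().split())


def match_athletes(last_year, this_year, max_age_gap=2):
    """Pair last-year and this-year records that share a name and a plausible age.

    Returns a list of (last_record, this_record) tuples. Names with zero or
    several plausible candidates are skipped rather than guessed.
    """
    by_name = defaultdict(list)
    for rec in last_year:
        by_name[normalize_name(rec["name"])].append(rec)

    matches = []
    for rec in this_year:
        candidates = [
            old for old in by_name.get(normalize_name(rec["name"]), [])
            if 0 <= rec["age"] - old["age"] <= max_age_gap
        ]
        if len(candidates) == 1:
            matches.append((candidates[0], rec))
    return matches


if __name__ == "__main__":
    # Made-up records, for illustration only.
    old = [{"name": "Pat Example", "age": 29}, {"name": "Sam Sample", "age": 41}]
    new = [{"name": "pat  example", "age": 30}, {"name": "Sam Sample", "age": 25}]
    for before, after in match_athletes(old, new):
        print(before["name"], before["age"], "->", after["age"])
```

Common names will still collide, so adding region as a further check is probably worth it if that field turns out to be reliable.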
I'd like to join the fray analyzing and plotting.
I'm not sure how you spread the data before, but can I suggest posting the data on Google Fusion Tables: http://www.google.com/fusiontables/Home/
I've never used this service before, but it should prevent the need to email the data to every one of us.
Hey V,
I tried posting some data to fusion, and it seemed a little strange. Maybe it was just me, but it seemed clunky. Maybe b/c it's still in beta?
Your point is a good one though: what's the best way to collaborate and share files? Dropbox comes to mind, and it would offer a little bit of control over the data and plots. Are there better ways though?
I created an iPad data visualization using the data you guys provided. I had to clean the data a bit and get it into an SQLite database, but there are still about 10,000 records. I made sure to cite this blog as the source of the data. Check it out here: http://itunes.apple.com/us/app/2011-crossfit-games-visualization/id529206737?mt=8
Contact me at ryanrusnak.com if you have any questions. The code is also all open source so feel free to contribute.