When I was a young boy with a wild imagination, I used to try my hand at numerous sports ranging from tennis to Gaelic football to soccer, each with varying degrees of success. Living in the countryside throughout my childhood, a big garden allowed me to construct vivid simulations of soccer championships (crowd and all) in my head as I kicked a football, all too often, off my mother’s clean windows and my dad’s van. One funny memory stands out from this wonderful period of my life, however: my attempt at a ski jump off a pile of gravel my dad had delivered to create a new pathway into the garden. Did I have skis? Yes!! Two pieces of thin wood of the kind typically used for flooring.

While laughing at my younger self over this escapade, I decided to apply my newly acquired data analysis skills to a competition in this year’s men’s Ski Jumping World Cup. I chose Zakopane as my location and went to work. The competition runs over the weekend beginning the 20th of January, with qualifying on the 20th and the main competition on the 22nd. I have always followed ski jumping and have some background knowledge of the sport, but this is a chance to experiment with some data science tricks and see just how close I can get to predicting accurate results. Who knows, I may even put a little bet on!

### The Analysis

I scraped Zakopane ski jumping results data from the official FIS website, covering the period 2010-2016. I decided not to go back any further because some of today’s top ski jumpers were not competing before 2010, and some of the top athletes from the pre-2010 era have since retired. The 2010-2016 period captures the jumpers who are active at present.

Any good data analysis begins with a look at the distribution, or shape, of the data as well as some summary statistics. The histograms below show the distributions of all observed ski jumps and all podium ski jumps between 2010 and 2016. Note that podium jumps are those which contributed to a top 3 finish in the competition. This visualisation immediately shows us that most of the jumps at Zakopane are around 120 metres, while most of the jumps in the podium subset are around 130 metres. Indeed, a closer look using statistical functions confirms that the mean jump distances are 120.5 metres and 130.6 metres, respectively.
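As a rough sketch of this first step, the two histograms and group means could be produced in base R as follows. The `jumps` data frame and its column names here are my own stand-ins (simulated to resemble the figures quoted above), not the author’s actual scraped data:

```r
# Hypothetical data standing in for the scraped results: one row per jump,
# with the jump distance in metres and a podium flag
set.seed(1)
jumps <- data.frame(
  distance = c(rnorm(400, mean = 120.5, sd = 8), rnorm(30, mean = 130.6, sd = 4)),
  podium   = c(rep(FALSE, 400), rep(TRUE, 30))
)

# Side-by-side histograms of all jumps and of podium jumps only
par(mfrow = c(1, 2))
hist(jumps$distance, main = "All jumps", xlab = "Jump distance (m)")
hist(jumps$distance[jumps$podium], main = "Podium jumps", xlab = "Jump distance (m)")

# Summary statistics for the two groups
mean(jumps$distance)
mean(jumps$distance[jumps$podium])
```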

Of course, I could go into a lot more detail regarding summary statistics and list metrics such as the variance, standard deviation and interquartile range of the distributions but this blog is primarily for showcasing the data visualisation products and conclusions of my analyses. Therefore, I will move on to the next visualisation.

These plots are modified versions of the two histograms we just looked at. The main differences are that these are density plots, which give the frequency density of each jump class (the frequency divided by the class width); a density curve is overlaid; and the median and mean are added as vertical lines. The median is used for the all-jumps distribution because it deviates from normality, while the mean is appropriate for the podium-jumps distribution, which is approximately normal. Without delving too deeply into the statistics, the Shapiro-Wilk test can be used to assess normality (i.e. whether the data follow a bell-shaped distribution, a.k.a. the Gaussian distribution). We can see that the distribution of all jumps is negatively skewed by the presence of some shorter jumps.
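A minimal sketch of this check, again on simulated stand-in data rather than the author’s real samples: a small p-value from `shapiro.test()` is evidence against normality, which is why the median is preferred for the skewed all-jumps distribution.

```r
# Simulated stand-ins: a roughly normal podium sample, and an all-jumps
# sample skewed by a tail of shorter jumps
set.seed(2)
podium_jumps <- rnorm(30, mean = 130.6, sd = 4)
all_jumps    <- c(rnorm(400, mean = 120.5, sd = 8), runif(40, 90, 105))

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(podium_jumps)
shapiro.test(all_jumps)

# Density plot with the median as a vertical reference line, since the
# median is the better measure of centre for a skewed distribution
plot(density(all_jumps), main = "All jumps (density)", xlab = "Jump distance (m)")
abline(v = median(all_jumps), lty = 2)
```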

A little bit of data manipulation and querying was all it took to produce a table of the top jumpers in Zakopane between 2010-2016 based on podium finishes.
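The grouping and counting behind such a table can be sketched in base R like this. The `results` data frame and its columns are hypothetical examples of my own, purely to show the shape of the query:

```r
# Hypothetical per-competition results: athlete name and final rank
results <- data.frame(
  athlete = c("Stoch", "Stoch", "Kraft", "Freitag", "Stoch", "Kraft"),
  rank    = c(1, 3, 2, 3, 2, 5)
)
results$podium <- results$rank <= 3  # top-3 finish flag

# Count podiums and appearances per athlete
podiums     <- tapply(results$podium, results$athlete, sum)
appearances <- tapply(results$podium, results$athlete, length)
top_jumpers <- data.frame(
  athlete     = names(podiums),
  appearances = as.vector(appearances),
  podiums     = as.vector(podiums),
  podium_rate = as.vector(podiums / appearances)
)

# Order by podium count, best first
top_jumpers[order(-top_jumpers$podiums), ]
```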

The Polish native Kamil Stoch leads the way with 4 podium finishes out of 10 competitions, a historical podium rate of 40%, which I will take as a rough estimate of his chances of once again being on the podium at Zakopane in 2017. Of the rest, I would say that Stefan Kraft and Richard Freitag are the ones to watch. One must also be careful to study the jumpers’ current form, as this is another important factor.

A little more data manipulation allowed me to create a more statistically deep table of the top 3 jumpers in Zakopane between 2010-2016. The data suggests that Kamil Stoch, Richard Freitag and Stefan Kraft are the best performers at Zakopane. This table includes variables such as the number of first, second and third places, the probabilities of winning outright and of a podium finish, the mean jump when a podium was achieved, the overall mean, maximum and median jumps as well as number of appearances and podium finishes.

Okay. On to the really cool stuff…the predictions! There is an applied predictive modelling technique known as a decision tree, which works rather like a game of Guess Who? The machine asks a question that splits the data into lower-entropy, or purer, subgroups. These models are used by financial organisations, for example, when assessing the credit risk associated with a loan applicant. For this project, I split the ski jumping data 75%/25% into training and test sets in order to build the model and validate it. Here is the decision tree in all its glory.
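The split-then-fit workflow can be sketched with the `rpart` package (which ships with standard R distributions). The data here are simulated placeholders with assumed column names (`jump1`, `jump2`, `podium`), not the real Zakopane results:

```r
library(rpart)  # recursive partitioning for classification trees

# Hypothetical training data: two jump distances and a podium outcome
set.seed(42)
n <- 400
jumps <- data.frame(jump1 = rnorm(n, 124, 7), jump2 = rnorm(n, 124, 7))
jumps$podium <- factor(jumps$jump1 + jumps$jump2 > 258, labels = c("No", "Yes"))

# 75% / 25% train-test split
train_idx <- sample(seq_len(n), size = 0.75 * n)
train <- jumps[train_idx, ]
test  <- jumps[-train_idx, ]

# Fit the classification tree and check hold-out accuracy
tree  <- rpart(podium ~ jump1 + jump2, data = train, method = "class")
preds <- predict(tree, newdata = test, type = "class")
mean(preds == test$podium)
```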

These awesome models have the benefit of being easily interpreted, which is why they are useful to a broad range of organisations. The first question the machine asks is whether the first jump (competitions usually consist of two jumps per athlete) was over 133.5 metres. If so, the probability of the athlete sealing a podium place is almost 100%. If the first jump was 133.5 metres or less, further conditional questions are posed, the answers to which direct you to one of the other subgroups. For example, we see that a first jump under 133.5 metres combined with a second jump under 130.5 metres translates to a very low probability of getting onto the podium.

So there we have it. I will be using this model to determine whether each athlete jumping next weekend has a good chance of being on the podium, and I will be back with the results. To conclude, here are my predictions for the outright winner and podium finishers.

1.) Kamil Stoch

2.) Stefan Kraft

3.) Domen Prevc

Wildcards: Richard Freitag, Daniel Andre Tande and Peter Prevc (if competing)

### One last data visualisation

Following on from the brief discussion of the classification tree, this final plot shows the actual data for 2010-2016 and the classification zones as determined by the model. The machine looks to have done a good job! Actual podium finishes tend to occur in the podium prediction space while most of the non-podium jumps occur in the no podium prediction space. Let’s see what happens at Zakopane 2017!

### ***POST-COMPETITION DISCUSSION***

After an excellent competition, let’s see whose predictions were better, mine or the decision tree’s. I predicted Kamil Stoch to take 1st, Stefan Kraft to take 2nd and Domen Prevc to take 3rd. Stefan Kraft did not take part due to illness. However, Kamil Stoch did indeed win the competition while Domen Prevc finished in 9th position. Richard Freitag, one of my wildcard picks, took 3rd position with two great jumps. My two other wildcards, Peter Prevc and Daniel Andre Tande took 13th and 7th position, respectively.

Now how did the classification tree predict the outcome of the competition? The model correctly predicted podium finishes for Kamil Stoch (1st), Andreas Wellinger (2nd) and Richard Freitag (3rd). The tree incorrectly predicted podium outcomes for Michael Hayboeck (5th) and Peter Prevc (13th), while all 45 other jumpers were correctly predicted not to reach the podium. This translates to a model accuracy of 96% (48 of 50 jumpers classified correctly), which can be calculated from a confusion matrix built with the table() function, a table of predicted versus actual outcomes.

```r
table(predictions2017, Zakopane_2017Test$Podium)
```
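To make the 96% figure concrete, here is the same calculation on stand-in vectors encoding the outcome described above (5 predicted podiums, of which 3 were correct, and 45 correct non-podium predictions); the variable names are illustrative, not the author’s:

```r
# Hypothetical predicted vs. actual podium labels for the 50 jumpers
predicted <- factor(c(rep("Podium", 5), rep("No podium", 45)),
                    levels = c("No podium", "Podium"))
actual    <- factor(c(rep("Podium", 3), rep("No podium", 47)),
                    levels = c("No podium", "Podium"))

conf <- table(predicted, actual)  # confusion matrix
conf

# Accuracy = correct classifications / total observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 0.96: 48 of 50 jumpers classified correctly
```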

I loved this! I am curious about your scraper, in particular whether or not you can get other important information, such as performance and judge scores, out of it. What exactly can you scrape from the website directly?

Hello Torkild. Thanks for the feedback, it’s much appreciated. I used the RCurl and XML packages to scrape HTML tables from the FIS website (http://www.fis-ski.com/ski-jumping/events-and-places/results/), which has been excellently developed. getURL() and readHTMLTable() are the functions used for fetching the URL and scraping the tables of data. I used jump lengths to build the classification tree, but there are certainly other parameters that could be added. In my next project I may add style points and in-run velocity. Style points are obviously important…not so sure about the in-run velocity, however. Cheers!!
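For readers curious about the mechanics, a minimal sketch of that readHTMLTable() pipeline is below. Against the live site you would pass the output of RCurl’s getURL() straight into readHTMLTable(); here a tiny inline HTML snippet (my own invented table, not real FIS data) stands in so the example runs without a network call:

```r
library(XML)  # readHTMLTable(): parses <table> elements into data frames
# (the RCurl package provides getURL() for fetching the live page, e.g.
#  html <- getURL("http://www.fis-ski.com/ski-jumping/events-and-places/results/"))

html <- "<html><body><table>
  <tr><th>Rank</th><th>Athlete</th><th>Jump</th></tr>
  <tr><td>1</td><td>Stoch</td><td>135.0</td></tr>
  <tr><td>2</td><td>Wellinger</td><td>133.5</td></tr>
</table></body></html>"

tables  <- readHTMLTable(html, stringsAsFactors = FALSE)
results <- tables[[1]]
results$Jump <- as.numeric(results$Jump)  # distances come back as text
results
```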