ANALYST/SCIENTIST

Predictive Modeling: Logistic Regression

What Makes an Olympic Winner?

[Image: image1.jpg]

[Figure panels: 1st Half of Data Set, 2nd Half of Data Set]

Understanding Olympic Winning: 120 Years of History

This analysis uses data taken from the Kaggle website covering Olympic wins over the past 120 years. A logistic regression was performed to ascertain the effects of age, weight, height, gender, sport and location on the likelihood that a participant will win a Gold, Silver or Bronze medal. The logistic regression model was not statistically significant (log-likelihood = -70332, p = 1). The Nagelkerke R2 came out negative, at -60.1%, so the model did not explain the variance in the Olympic win outcome, even though it correctly classified 85.0% of cases. The weighted precision was .85 for losses on both halves of the data set, but for wins the precision was only .53.
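To make the fit statistics above concrete, here is a minimal sketch of how a Nagelkerke R2 and a per-class precision can be computed. The function names, the toy labels and the toy log-likelihoods are assumptions made for illustration; none of the numbers come from the Olympic data, and this is not necessarily how the original figures were produced.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke pseudo-R2 from the intercept-only (null) and fitted log-likelihoods."""
    # Cox-Snell R2, then rescaled by its maximum attainable value.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def precision(y_true, y_pred, label):
    """Share of predictions of `label` that were actually `label`."""
    hits = [t == label for t, p in zip(y_true, y_pred) if p == label]
    return sum(hits) / len(hits) if hits else 0.0


# Toy labels (hypothetical, not the Olympic data): 1 = win, 0 = loss.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
loss_precision = precision(y_true, y_pred, 0)
win_precision = precision(y_true, y_pred, 1)

# Toy log-likelihoods (hypothetical): the fitted model beats the null model here.
toy_r2 = nagelkerke_r2(ll_null=-100.0, ll_model=-80.0, n=200)

print(accuracy, loss_precision, win_precision, round(toy_r2, 3))
```

In this toy case the accuracy and the loss precision are high while the win precision is much lower, which is the pattern that appears when losses dominate the data. Under this formula, a negative value only arises when the fitted log-likelihood is lower than the null log-likelihood.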

I had over 700 different variables to consider, and the trickiest part of this model was testing the variables to see whether any combination would give me a low p-value. Using Recursive Feature Elimination, I was able to narrow the set down to 23 variables, all negatively correlated with Win. Ultimately, I concluded that the variables in this data set were not conducive to creating a predictive model of wins. The model actually ended up predicting losses. The variables selected for sport were: Baseball, Softball, Curling, Handball, and Taekwondo. The variables selected for location were: Angola, Egypt, Guatemala, Hong Kong, Israel, Virgin Islands (US), Puerto Rico, Senegal, Barbados, Cyprus, Luxembourg, Libya, Ghana, Russia, Gabon, Germany, and USA. The other variables did not show a strong correlation. Some of these countries may lack talent, as the coefficients suggest, but the smaller coefficients for countries like the USA, Russia and Germany are probably due to their participating more often than other countries.

In addition, my computer lacked the memory to analyze the full data set, so I used SQL to split the data in half by selecting every other row, with the goal of a more even split.
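The every-other-row split described above can be sketched as follows. This is only an illustration run against an in-memory SQLite database: the table name athlete_events, its columns and the toy rows are hypothetical, and the sketch assumes a consecutive integer id column, which the original SQL may not have relied on.

```python
import sqlite3

# In-memory database standing in for the real data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE athlete_events (id INTEGER PRIMARY KEY, age REAL, win INTEGER)"
)
conn.executemany(
    "INSERT INTO athlete_events VALUES (?, ?, ?)",
    [(i, 20.0 + i, int(i % 3 == 0)) for i in range(1, 11)],
)

# Odd ids go to the first half, even ids to the second half.
first_half = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM athlete_events WHERE id % 2 = 1 ORDER BY id"
    )
]
second_half = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM athlete_events WHERE id % 2 = 0 ORDER BY id"
    )
]

print(first_half)
print(second_half)
conn.close()
```

Alternating rows keeps the two halves the same size and spreads any ordering in the source file (for example by year or by country) across both halves, rather than cutting the file at its midpoint.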

If this were my study, I would suggest other factors to include in the analysis, such as hours of practice or days trained, determination ratings from coaches, and prior rankings outside of the Olympics. With the current factors we cannot reject the null hypothesis, so we will need to either segment the variables or reverse-score the Olympic win outcome and see what results.