项目作者: yg6688

项目描述 :
Sports Analytics
高级语言: HTML
项目地址: git://github.com/yg6688/sports-analytics.git
创建时间: 2019-03-29T01:05:29Z
项目社区:https://github.com/yg6688/sports-analytics

开源协议:

下载


Sports Analytics - An Introduction in R

Yongqi Gan

R and R Markdown

R is one of the most popular statistical programming languages. R is the only widely-used statistical programming language that is free and open source, with a wide developer community that contributes to the core language and corresponding packages. This file was created with RStudio, our recommended IDE for use with R. For more information and to download, see https://rstudio.com.

This is an R Markdown document generated by knitr and RStudio. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk and its corresponding output like this:

  1. summary(cars)
  1. ## speed dist
  2. ## Min. : 4.0 Min. : 2.00
  3. ## 1st Qu.:12.0 1st Qu.: 26.00
  4. ## Median :15.0 Median : 36.00
  5. ## Mean :15.4 Mean : 42.98
  6. ## 3rd Qu.:19.0 3rd Qu.: 56.00
  7. ## Max. :25.0 Max. :120.00

Loading Data

We are examining two datasets: The NBA player statistics and their historical salaries for the last 4 years.

Sources: https://stats.nba.com for performance data and https://hoopshype.com/salaries/players for salary data.

Let us first load the performance data into our R environment using the read.csv function. The header = TRUE flag tells the console that the first row of the file contains column names.

  1. players1819 <- read.csv(file = "players18-19.csv", header = TRUE)
  2. players1718 <- read.csv(file = "players17-18.csv", header = TRUE)
  3. players1617 <- read.csv(file = "players16-17.csv", header = TRUE)
  4. players1516 <- read.csv(file = "players15-16.csv", header = TRUE)

and the salary data…

  1. salary1819 <- read.csv(file = "18-19_salary.csv", header = TRUE)
  2. salary1718 <- read.csv(file = "17-18_salary.csv", header = TRUE)
  3. salary1617 <- read.csv(file = "16-17_salary.csv", header = TRUE)
  4. salary1516 <- read.csv(file = "15-16_salary.csv", header = TRUE)

Cleaning Data

We will now merge the four dataframes for salary data and remove the individual datasets from our working environment. Note that we only include players that are accounted for in all four datasets to ensure consistency in our analysis.

  1. #merge concactenates two dataframes together by values in a certain row or column.
  2. df_salary <- merge(salary1516, salary1617, by.x = "Player", by.y = "Player")
  3. df_salary <- merge(df_salary, salary1718, by.x = "Player", by.y = "Player")
  4. df_salary <- merge(df_salary, salary1819, by.x = "Player", by.y = "Player")
  5. #rm removes dataframes and vectors from our local working environment.
  6. #rm(salary1516, salary1617, salary1718, salary1819)

There are some issues with our data. If you click on the df_salary dataframe in the “environment” pane on the right, you will see that the year columns have weird names and that the dollar amounts are treated as factors. Factors are categorical variables best used to describe types of values (animals classified as “dog” “cat” etc.) This is undesirable for use in numerical data, and will impact any plotting or statistical analysis we want to do. Therefore, we will convert the factor vectors into numeric vectors and rename them using the code snippet below.

  1. names(df_salary) <- c("Player", "y1516", "y1617", "y1718", "y1819")
  2. df_salary$Player <- as.character(df_salary$Player)
  3. df_salary$y1516 = as.numeric(gsub("[\\$,]", "", as.character(df_salary$y1516)))
  4. df_salary$y1617 = as.numeric(gsub("[\\$,]", "", as.character(df_salary$y1617)))
  5. df_salary$y1718 = as.numeric(gsub("[\\$,]", "", as.character(df_salary$y1718)))
  6. df_salary$y1819 = as.numeric(gsub("[\\$,]", "", as.character(df_salary$y1819)))
  7. #Trim unnecessary whitespace to prevent string matching errors
  8. df_salary$Player = trimws(df_salary$Player)

Preliminary Analysis, Plotting, Data Visualization

We want to get an overview of the data in order to get an idea of how to consider analyzing it. We can start by simply viewing the header of the salary data, shown below.

  1. head(df_salary)
  1. ## Player y1516 y1617 y1718 y1819
  2. ## 1 Aaron Gordon 4405071 4549389 5662481 21590909
  3. ## 2 Al Horford 12671359 27748190 28530811 28928710
  4. ## 3 Al Jefferson 14255279 10695850 10050366 4000000
  5. ## 4 Al-Farouq Aminu 8492868 8030598 7529204 6957105
  6. ## 5 Alan Williams 120677 914448 6172292 77250
  7. ## 6 Alec Burks 9728947 10355341 11156939 11536515

We can make a dotplot of 2018-2019 season salaries against points scored using the ggplot2 package, a popul. We will also show minutes played using a color gradient.

  1. require(ggplot2)
  1. ## Loading required package: ggplot2
  1. require(scales)
  1. ## Loading required package: scales
  2. ## Warning: package 'scales' was built under R version 3.4.4
  1. data1819 <- merge(df_salary, players1819, by.x = "Player", by.y = "PLAYER")
  2. p <- ggplot(data = data1819, aes(y = y1819, x = PTS, colour = MIN))
  3. p + scale_y_continuous(labels = comma) + ylab("2018-2019 Salary") +
  4. xlab("Points") + ggtitle("2018-2019 Season: Salary Plotted Against Points") +
  5. scale_color_gradient(low="red", high="green") + geom_point()

Interesting. We see that a higher salary is generally correlated with more scored points and more minutes played on average. This is likely an obvious observation to most of you, but it is helpful to see it confirmed graphically.

The visual examination leads us to believe that there is a positive relationship between salary and scored points. Let’s run a linear regression model to examine this numerically.

  1. #lm stands for "linear model"
  2. points_salary <- lm(y1819 ~ PTS, data = data1819)
  3. summary(points_salary)
  1. ##
  2. ## Call:
  3. ## lm(formula = y1819 ~ PTS, data = data1819)
  4. ##
  5. ## Residuals:
  6. ## Min 1Q Median 3Q Max
  7. ## -21203611 -4704150 -789194 4028796 20410886
  8. ##
  9. ## Coefficients:
  10. ## Estimate Std. Error t value Pr(>|t|)
  11. ## (Intercept) 2174352 830607 2.618 0.00938 **
  12. ## PTS 843156 65573 12.858 < 2e-16 ***
  13. ## ---
  14. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  15. ##
  16. ## Residual standard error: 6814000 on 254 degrees of freedom
  17. ## Multiple R-squared: 0.3943, Adjusted R-squared: 0.3919
  18. ## F-statistic: 165.3 on 1 and 254 DF, p-value: < 2.2e-16

Let’s interpret this regression output. The line of best fit is given by data1819 = 2,174,352 + 843,156 * PTS. We can interpret this to mean that for every additional point scored, on average the salary of that player will increase by 843,156 dollars. We note the small p-value (less than 2−16), implying our results are statistically significant at the 99.99% confidence level.

The original plot with our newly found regression line is displayed below.

  1. p + scale_y_continuous(labels = comma) + ylab("2018-2019 Salary") +
  2. xlab("Points") + ggtitle("2018-2019 Season: Salary Plotted Against Points") +
  3. scale_color_gradient(low="red", high="green") + geom_point() + geom_abline(intercept = 2174352, slope = 843156)

Optimization Example: Linear Programming

To demonstrate the power of R packages, let’s consider a hypothetical scenario: You are the manager of an NBA team. Your goal is to score at least 100 points every game on average, while spending as little as possible on the players’ salaries.

This is an optimization problem. We are trying to minimize a cost function (total salary cost), given a constraint (average at least 100 points). We can solve this optimization problem using a technique known as linear programming. For technical details you are welcome to review https://www.math.ucla.edu/~tom/LP.pdf; however, a technical understanding of the algorithm is not necessary for this example.

We will use a R package built by Michel Berkelaar, lpSolve, to solve this problem. R packages are custom-written libraries developed by statistians, companies, and other users to extend or the functionality of R or to simplify certain procedures.

  1. require(lpSolve)
  1. ## Loading required package: lpSolve
  1. # obj represents the objective function. This is the function that calculates the total cost of the team and is what we are trying to minimize.
  2. obj <- data1819$y1819
  3. #constr represents the constraint functions. These are the conditions that a valid solution must satisfy. The conditions we impose are that the "team" must have a minimum sum of 100 points per game, the team must have at least 12 players, and each player can only be chosen at most once.
  4. constr <- matrix(append(append(data1819$PTS, rep(1, times = 256)), as.vector(diag(nrow = 256))), nrow = 258, byrow = TRUE)
  5. #right represents the right hand side of the constraint functions.
  6. right <- c(100, 12, rep(1, times = 256))
  7. #constraints_direction represents the sign of the constraint functions.
  8. constranints_direction <- c(">=", ">=", rep("<=", times = 256))
  9. optimum <- lp(direction="min",
  10. objective.in = obj,
  11. const.mat = constr,
  12. const.dir = constranints_direction,
  13. const.rhs = right,
  14. all.int = T)
  15. best_sol <- optimum$solution
  16. names(best_sol) <- data1819$Player
  17. print(best_sol)
  1. ## Aaron Gordon Al Horford Al-Farouq Aminu
  2. ## 0 0 0
  3. ## Alan Williams Alec Burks Alex Len
  4. ## 1 0 0
  5. ## Allen Crabbe Amir Johnson Andre Drummond
  6. ## 0 0 0
  7. ## Andre Iguodala Andrew Bogut Andrew Wiggins
  8. ## 0 1 0
  9. ## Anthony Davis Anthony Tolliver Aron Baynes
  10. ## 0 0 0
  11. ## Austin Rivers Avery Bradley Ben McLemore
  12. ## 0 0 0
  13. ## Bismack Biyombo Blake Griffin Boban Marjanovic
  14. ## 0 0 0
  15. ## Bobby Portis Bojan Bogdanovic Bradley Beal
  16. ## 0 0 0
  17. ## Brandon Knight Brook Lopez Bruno Caboclo
  18. ## 0 0 1
  19. ## Cameron Payne Carmelo Anthony Chandler Parsons
  20. ## 0 0 0
  21. ## Channing Frye Chris Paul CJ McCollum
  22. ## 0 0 0
  23. ## CJ Miles Clint Capela Cody Zeller
  24. ## 0 0 0
  25. ## Corey Brewer Cory Joseph Courtney Lee
  26. ## 1 0 0
  27. ## Cristiano Felicio D'Angelo Russell Damian Lillard
  28. ## 0 0 0
  29. ## Danilo Gallinari Danny Green Dante Cunningham
  30. ## 0 0 0
  31. ## Dante Exum Darren Collison DeAndre Jordan
  32. ## 0 0 0
  33. ## Delon Wright DeMar DeRozan DeMarcus Cousins
  34. ## 0 0 0
  35. ## DeMarre Carroll Derrick Favors Derrick Rose
  36. ## 0 0 1
  37. ## Devin Booker Devin Harris Dewayne Dedmon
  38. ## 1 0 0
  39. ## Dion Waiters Dirk Nowitzki Doug McDermott
  40. ## 0 0 0
  41. ## Draymond Green Dwight Howard Dwight Powell
  42. ## 0 0 0
  43. ## Dwyane Wade E'Twaun Moore Ed Davis
  44. ## 0 0 0
  45. ## Elfrid Payton Emmanuel Mudiay Enes Kanter
  46. ## 0 0 0
  47. ## Eric Bledsoe Eric Gordon Ersan Ilyasova
  48. ## 0 0 0
  49. ## Evan Fournier Evan Turner Frank Kaminsky
  50. ## 0 0 0
  51. ## Garrett Temple Gary Harris George Hill
  52. ## 0 0 0
  53. ## Gerald Green Giannis Antetokounmpo Glenn Robinson III
  54. ## 1 0 0
  55. ## Goran Dragic Gordon Hayward Gorgui Dieng
  56. ## 0 0 0
  57. ## Greg Monroe Harrison Barnes Hassan Whiteside
  58. ## 0 0 0
  59. ## Ian Clark Ian Mahinmi Iman Shumpert
  60. ## 0 0 0
  61. ## Isaiah Canaan Isaiah Thomas Jabari Parker
  62. ## 1 0 0
  63. ## Jae Crowder Jahlil Okafor Jamal Crawford
  64. ## 0 0 0
  65. ## James Harden James Johnson JaMychal Green
  66. ## 0 0 0
  67. ## Jared Dudley Jarell Martin Jason Smith
  68. ## 0 0 0
  69. ## JaVale McGee Jeff Green Jeff Teague
  70. ## 0 0 0
  71. ## Jerami Grant Jeremy Lamb Jeremy Lin
  72. ## 0 0 0
  73. ## Jerian Grant Jerryd Bayless Jimmy Butler
  74. ## 0 0 0
  75. ## JJ Redick Joakim Noah Jodie Meeks
  76. ## 0 0 0
  77. ## Joe Harris Joe Ingles Joel Embiid
  78. ## 0 0 0
  79. ## John Henson John Wall Jon Leuer
  80. ## 0 0 0
  81. ## Jonas Jerebko Jonas Valanciunas Jonathon Simmons
  82. ## 0 0 0
  83. ## Jordan Clarkson Jose Calderon Josh Richardson
  84. ## 0 0 0
  85. ## JR Smith Jrue Holiday Julius Randle
  86. ## 0 0 0
  87. ## Justin Anderson Justin Holiday Justise Winslow
  88. ## 0 0 0
  89. ## Jusuf Nurkic Karl-Anthony Towns Kawhi Leonard
  90. ## 0 0 0
  91. ## Kelly Olynyk Kemba Walker Kenneth Faried
  92. ## 0 0 0
  93. ## Kent Bazemore Kentavious Caldwell-Pope Kevin Durant
  94. ## 0 0 0
  95. ## Kevin Love Kevon Looney Khris Middleton
  96. ## 0 0 0
  97. ## Klay Thompson Kosta Koufos Kyle Anderson
  98. ## 0 0 0
  99. ## Kyle Korver Kyle Lowry Kyle O'Quinn
  100. ## 0 0 0
  101. ## Kyrie Irving LaMarcus Aldridge Lance Stephenson
  102. ## 0 0 0
  103. ## Lance Thomas Langston Galloway LeBron James
  104. ## 0 0 0
  105. ## Luc Mbah a Moute Luol Deng Marc Gasol
  106. ## 0 0 0
  107. ## Marcin Gortat Marco Belinelli Marcus Morris
  108. ## 0 0 0
  109. ## Marcus Smart Mario Hezonja Markieff Morris
  110. ## 0 0 0
  111. ## Marvin Williams Mason Plumlee Matthew Dellavedova
  112. ## 0 0 0
  113. ## Meyers Leonard Michael Beasley Michael Carter-Williams
  114. ## 0 0 0
  115. ## Michael Kidd-Gilchrist Mike Conley Mike Muscala
  116. ## 0 0 0
  117. ## Mike Scott Miles Plumlee Montrezl Harrell
  118. ## 0 0 0
  119. ## Myles Turner Nemanja Bjelica Nerlens Noel
  120. ## 0 0 0
  121. ## Nick Young Nicolas Batum Nik Stauskas
  122. ## 1 0 0
  123. ## Nikola Jokic Nikola Mirotic Nikola Vucevic
  124. ## 0 0 0
  125. ## Noah Vonleh Norman Powell Omri Casspi
  126. ## 0 0 0
  127. ## Pat Connaughton Patrick Beverley Patrick Patterson
  128. ## 0 0 0
  129. ## Pau Gasol Paul George Paul Millsap
  130. ## 0 0 0
  131. ## PJ Tucker Quincy Acy Quincy Pondexter
  132. ## 0 1 0
  133. ## Rajon Rondo Raul Neto Raymond Felton
  134. ## 0 0 0
  135. ## Reggie Bullock Reggie Jackson Richaun Holmes
  136. ## 0 0 0
  137. ## Ricky Rubio Robert Covington Robin Lopez
  138. ## 0 0 0
  139. ## Rodney Hood Rondae Hollis-Jefferson Rudy Gay
  140. ## 0 0 0
  141. ## Rudy Gobert Russell Westbrook Ryan Anderson
  142. ## 0 0 0
  143. ## Salah Mejri Sam Dekker Serge Ibaka
  144. ## 0 0 0
  145. ## Seth Curry Shabazz Napier Shaun Livingston
  146. ## 0 0 0
  147. ## Shelvin Mack Solomon Hill Spencer Dinwiddie
  148. ## 0 0 1
  149. ## Stanley Johnson Stephen Curry Steven Adams
  150. ## 0 0 0
  151. ## Taj Gibson Terrence Ross Terry Rozier
  152. ## 0 0 0
  153. ## Thabo Sefolosha Thaddeus Young Tim Frazier
  154. ## 0 0 0
  155. ## Tobias Harris Tony Parker Tony Snell
  156. ## 0 0 0
  157. ## Treveon Graham Trevor Ariza Trey Burke
  158. ## 0 0 0
  159. ## Trey Lyles Tristan Thompson Troy Daniels
  160. ## 0 0 0
  161. ## Tyler Johnson Tyler Zeller Tyreke Evans
  162. ## 0 1 0
  163. ## Tyson Chandler Tyus Jones Udonis Haslem
  164. ## 0 0 0
  165. ## Victor Oladipo Vince Carter Wayne Ellington
  166. ## 0 0 0
  167. ## Wesley Johnson Wesley Matthews Will Barton
  168. ## 0 0 0
  169. ## Willie Cauley-Stein Wilson Chandler Zach LaVine
  170. ## 0 0 0
  171. ## Zaza Pachulia
  172. ## 0

The players with a “1” under their name have been chosen for this “optimal” team.

  1. print(paste("Total cost: ", optimum$objval, sep=""))
  1. ## [1] "Total cost: 12387465"

We find that the total cost of this “optimal” team is 12.387 million dollars.

Categorical Variable Analysis: Does switching teams increase expected salary?

It is widely understood in the professional basketball world that switching teams generally results in an increase in salary. We will now quantitatively test this hypothesis and determine the size of this effect.

First, we need to extract only the players who were active in the last 4 seasons:

  1. player <- merge(merge(merge(players1516[ , c("PLAYER", "TEAM")], players1617[ , c("PLAYER", "TEAM")], by = "PLAYER"), players1718[ , c("PLAYER", "TEAM")], by = "PLAYER"), players1819[ , c("PLAYER", "TEAM")], by = "PLAYER")
  1. ## Warning in merge.data.frame(merge(merge(players1516[, c("PLAYER",
  2. ## "TEAM")], : column names 'TEAM.x', 'TEAM.y' are duplicated in the result
  1. names(player) <- c("PLAYER", "TEAM1516", "TEAM1617", "TEAM1718", "TEAM1819")

Now we can count the number of times each player switched teams:

  1. player$switches <- 0
  2. player$switches <- ifelse(player$TEAM1516 == player$TEAM1617, player$switches, player$switches + 1)
  3. player$switches <- ifelse(player$TEAM1617 == player$TEAM1718, player$switches, player$switches + 1)
  4. player$switches <- ifelse(player$TEAM1718 == player$TEAM1819, player$switches, player$switches + 1)

Let’s add the salary data to this dataset:

  1. player <- merge(player, df_salary, by.x = "PLAYER", by.y = "Player")
  2. player2 <- player
  3. player2$change <- player2$y1819 - player2$y1516
  4. require(reshape)
  1. ## Loading required package: reshape
  2. ## Warning: package 'reshape' was built under R version 3.4.4
  1. player <- melt(player, id = c("PLAYER", "TEAM1516", "TEAM1617", "TEAM1718", "TEAM1819", "switches"))

We can create a plot of salaries based on the number of team switches:

  1. p <- ggplot(data = player, aes(y = value, x = variable, colour = switches))
  2. p + scale_y_continuous(labels = comma) + ylab("Salary") +
  3. xlab("Season") + ggtitle("Salary Plotted Against Number of Team Switches") +
  4. scale_color_gradient(low="red", high="green") + geom_point()

The graph doesn’t seem to be too promising on first glance. Let’s run a regression analysis on the change in salary across these four seasons given the change in teams:

  1. switch_lm <- lm(change ~ switches, data = player2)
  2. summary(switch_lm)
  1. ##
  2. ## Call:
  3. ## lm(formula = change ~ switches, data = player2)
  4. ##
  5. ## Residuals:
  6. ## Min 1Q Median 3Q Max
  7. ## -20935449 -4550918 -402481 4353721 18122651
  8. ##
  9. ## Coefficients:
  10. ## Estimate Std. Error t value Pr(>|t|)
  11. ## (Intercept) 8427039 708532 11.894 < 2e-16 ***
  12. ## switches -3266267 477605 -6.839 6.23e-11 ***
  13. ## ---
  14. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  15. ##
  16. ## Residual standard error: 6925000 on 247 degrees of freedom
  17. ## Multiple R-squared: 0.1592, Adjusted R-squared: 0.1558
  18. ## F-statistic: 46.77 on 1 and 247 DF, p-value: 6.232e-11

We find that this is in fact exactly the case. A team switch on average will decrease a player’s earnings by $3.26 million.