项目作者: robertschnitman

项目描述 :
Diagnostic tools for regression modeling.
高级语言: R
项目地址: git://github.com/robertschnitman/diagnoser.git
创建时间: 2017-11-15T00:27:05Z
项目社区:https://github.com/robertschnitman/diagnoser

开源协议:GNU General Public License v3.0

下载


diagnoser

Robert Schnitman
2017-11-14
Recommended Citation:
Schnitman, Robert (2017). diagnoser v0.0.2.5. https://github.com/robertschnitman/diagnoser

Outline

  1. Installation
  2. Introduction
  3. diagnose(), ggdiagnose(), and cdiagnose()
  4. fitres()
  5. modeldf()
  6. validate()
  7. Conclusion
  8. References

0. Installation

  1. ## Ensure that you are running R 3.4.2 or higher.
  2. ## Package Dependencies:
  3. # lazyeval (>= 0.2.1)
  4. # Package Imports:
  5. # ggplot2 (>= 2.2.1), gridExtra (>= 2.3), scales (>= 0.5.0), car (>= 2.1)
  6. # install.packages("devtools")
  7. devtools::install_github("robertschnitman/diagnoser")

Preface

Alternative versions of this document:

  1. Gitbook: https://diagnoser.netlify.com/
  2. PDF: https://github.com/robertschnitman/diagnoser/blob/master/docs/diagnoser_handbook.pdf

1. Introduction

The diagnoser package contains tools for regression diagnostics. Base R’s plot(model.object) was the primary influence, as it was a useful tool for quickly assessing estimation bias and existence of heteroskedasticity; but interpreting more specialized concepts such as Cook’s Distance proved to be difficult to understand for those without linear algebra knowledge. To improve upon comprehension for introductory students, I developed diagnose() and ggdiagnose(). Individuals with a fondness for the classics would appreciate cdiagnose(), which recreates the original plot(model.object) with ggplot2 graphics.

Other functions such as fitres(), modeldf(), and validate() were inspired by tidyverse’s broom library. While broom eases the process of transforming model objects into data frames, outputs from tidy() lacked estimates integral to the social and health sciences, such as the margin of error for OLS estimates. Additionally, glance() does not produce a pseudo r-squared for general linear models. The functions modeldf() and validate() seek to close the gaps from these broom functions.

The following sections provide examples.

2. diagnose(), ggdiagnose(), and cdiagnose()

The functions diagnose() and ggdiagnose() provide alternatives for the plot(model.object) approach. The Q-Q, Scale-Location, and Residuals-vs.-Leverage plots in the latter method can present difficulties in interpretations. For example, Cook’s Distance typically is not taught at the secondary and undergraduate levels—when it is, teachers will forego explanation of the math due to its complexity and instead focus solely on the interpretation, leaving students in the dark on how the statistic works. If the goal is to maximize students’ comprehension of detecting heteroskedasticity, one option is to replace the three previously mentioned graphs with histograms and an addition of another variable: residuals as a percentage of the values for the dependent variable (i.e. (residuals ÷ actual values)*100).

Thinking of residuals in terms of percent differences can help determine their magnitude. For example, if you notice an outlier in the residuals having the value of “5”, does this issue necessitate a re-estimation of the model that excludes this observation? A common method is to examine the (adjusted) R-squared before-and-after the outlier exclusion. The problem of “mining” the model occurs, however, and heightens the risk of a Type 1 Error (i.e. false positive). One solution, then, is to confirm whether this extremity is substantively different from the rest of the values—you may, based on prior knowledge, decide whether thresholds of 10% or 15% should be marked as such.

Overall, with these functions, students will learn how to visualize homoskedasticity/heteroskedasticity and the magnitude of outliers based on familiar concepts as opposed to being inundated with hastily-taught new ones that assume a sufficient understanding of linear algebra.

However, for those with advanced training or simply disagree with me, I also present a “classic” version of the original base R residual diagnostics plot: cdiagnose(), a recreation of plot(model.object) with ggplot2 graphics. The Residuals vs. Leverage graph is the most differentiated one from the original, using the size of the points to indicate the degree of Cook’s Distance (as inspired by Raju Rimal’s diagPlot(): https://rpubs.com/therimalaya/43190).

Because base R’s plotting of model objects do not include NLM/NLS objects, neither does cdiagnose(), which is justified considering the linear algebra involved in leverage and Cook’s Distance. Nonetheless, future work will consider an alternative for non-linear models.

diagnose()

Case 1: OLS

  1. # OLS case
  2. model.lm <- lm(data = mtcars, formula = mpg ~ wt + gear)
  3. diagnose(model.lm, fit_type = 'response', residual_type = 'response')
  4. # The fit_type option specifies prediction type in predict().
  5. # Similarly, residual_type specifies for resid().
  6. # These inputs are useful for glm objects using the binomial family.

Case 2: NLS

  1. model.nls <- nls(Ozone ~ theta0 + Temp^theta1, airquality)
  2. diagnose(model.nls, point_color = '#00BFC4', line_color = '#F8766D', pch = 16, lwd = 2)
  3. # Graph editing inputs. Recommended for larger data, as ggplot2 in ggdiagnose() and cdiagnose() can be slow.

ggdiagnose()

  1. # NLS case
  2. model.nls <- nls(Ozone ~ theta0 + Temp^theta1, airquality)
  3. ggdiagnose(model.nls, fit_type = 'response', residual_type = 'response',
  4. bins = nobs(model.nls), se = TRUE, freqpct = TRUE, alpha = 0.5)
  5. # The fit_type option specifies prediction type in predict().
  6. # Similarly, residual_type specifies for resid().
  7. # These inputs are useful for glm objects using the binomial family.
  8. # Default bins value is 30.
  9. # Default se value is TRUE.
  10. # Default freqpct value is FALSE.
  11. # Default alpha value is 1.

cdiagnose()

  1. # OLS case
  2. model.lm <- lm(data = Orange, formula = log(circumference) ~ age)
  3. cdiagnose(model.lm, fit_type = 'response', residual_type = 'response', se = FALSE, alpha = 1)
  4. # The fit_type option specifies prediction type in predict().
  5. # Similarly, residual_type specifies for resid().
  6. # These inputs are useful for glm objects using the binomial family.
  7. # Default bins value is 30.
  8. # Default se value is FALSE.
  9. # Default alpha value is 1.

3. fitres()

The function fitres() will look similar to those who have used augment() from broom.

It creates a matrix of the fitted values, residuals, and residuals as a proportion (percent) based on the actual dependent variable’s values. When the data input is specified, the function produces a dataframe that merges the fitted values and residual variables as columns to said specified dataset.

  1. model.lm <- lm(data = mtcars, formula = mpg ~ wt + gear)
  2. head(fitres(model.lm, fit_type = 'response'))
  3. # default type value is 'response'.
  1. ## fit residual residual_pct
  2. ## Mazda RX4 23.26669 -2.2666926 -0.10793774
  3. ## Mazda RX4 Wag 21.86801 -0.8680127 -0.04133394
  4. ## Datsun 710 24.91220 -2.1121984 -0.09264028
  5. ## Hornet 4 Drive 20.32266 1.0773414 0.05034305
  6. ## Hornet Sportabout 19.08853 -0.3885293 -0.02077697
  7. ## Valiant 18.97883 -0.8788289 -0.04855408
  1. model.lm <- lm(data = mtcars, formula = mpg ~ wt + gear)
  2. head(fitres(model = model.lm, data = mtcars, fit_type = 'response'))
  1. ## mpg cyl disp hp drat wt qsec vs am gear carb fit residual residual_pct
  2. ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 23.26669 -2.2666926 -0.10793774
  3. ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.86801 -0.8680127 -0.04133394
  4. ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 24.91220 -2.1121984 -0.09264028
  5. ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.32266 1.0773414 0.05034305
  6. ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 19.08853 -0.3885293 -0.02077697
  7. ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.97883 -0.8788289 -0.04855408

4. modeldf()

The function modeldf() has similar features to tidying model objects with additions. The margin of error (moe) and confidence interval columns (ci_**) would inform those in the health sciences the impact range of their variables of interest—other discplines may benefit as well from these estimates. The variance inflation factors (VIF*)—which are estimated with vif() from car—measure the extent of collinearity in linear models.

Case 1: OLS

  1. model.lm <- lm(data = mtcars, formula = mpg ~ disp + hp + wt + gear + am)
  2. modeldf(model = model.lm, conf = 0.90) # conf = 0.95 is the default value; can be omitted.
  1. ## term coef se moe ci_lower ci_upper t p vif
  2. ## 1 (Intercept) 32.108024910 4.84359733 8.26132640 23.84669851 40.36935131 6.6289625 4.959127e-07 NA
  3. ## 2 am 1.605381694 1.78234460 3.03999888 -1.43461719 4.64538058 0.9007134 3.760085e-01 3.583076
  4. ## 3 disp 0.005352328 0.01178752 0.02010500 -0.01475267 0.02545733 0.4540675 6.535481e-01 9.668205
  5. ## 4 gear 0.651585626 1.21191542 2.06706466 -1.41547904 2.71865029 0.5376494 5.953915e-01 3.621713
  6. ## 5 hp -0.042892355 0.01424230 0.02429192 -0.06718428 -0.01860043 -3.0116168 5.721679e-03 4.319422
  7. ## 6 wt -3.113042246 1.17912588 2.01113824 -5.12418048 -1.10190401 -2.6401271 1.382770e-02 6.029643

Case 2: GLM (logit)

  1. model.glm <- glm(data = mtcars, formula = am ~ mpg + disp + hp, family = binomial(link = 'logit'))
  2. modeldf(model = model.glm, conf = 0.90) # conf = 0.95 is the default value; can be omitted.
  1. ## term coef se moe ci_lower ci_upper z p vif
  2. ## 1 (Intercept) -33.8128314 24.17533401 26.83504619 -94.19070326 -6.97778523 -1.398650 0.16191796 NA
  3. ## 2 disp -0.0654460 0.04304626 0.04707836 -0.16992781 -0.01836764 -1.520364 0.12841942 15.021316
  4. ## 3 hp 0.1493636 0.07871156 0.20423473 0.05857355 0.35359832 1.897607 0.05774791 23.014959
  5. ## 4 mpg 1.2849763 0.89894752 2.22397961 0.29320289 3.50895588 1.429423 0.15288269 8.822745

Case 3: NLS

  1. model.nls <- nls(Ozone ~ theta0 + Temp^theta1, airquality)
  2. modeldf(model = model.nls, conf = 0.85) # conf = 0.95 is the default value; can be omitted.
  1. ## parameter estimate se moe ci_lower ci_upper t p
  2. ## 1 theta0 -121.608226 13.2364581 19.40604989 -140.672736 -102.202176 -9.187369 2.167395e-15
  3. ## 2 theta1 1.170315 0.0182639 0.02489023 1.141823 1.195206 64.078073 3.014033e-91

5. validate()

The glance() function from broom had a vague label for the F statistic (simply “statistic”) and lacked any kind of pseudo R-squared for logistic regressions.

Furthermore, while the same function is friendly for data frames, its wide form is cumbersome for quickly ascertaining model validity. Thus, validate() produces similar output as a column vector, adding McFadden’s pseudo R-squared and the apparent error rate—defined as the ratio of the number of incorrect predictions to correct ones (i.e. number incorrect / number correct)—for logistic regressions. Those who wish to have the values in the format of broom can always transpose the vector. Alternatively, converting the output to a dataframe is simple by setting dataframe = TRUE in the function.

Output definitions are in the help file associated with this function.

Case 1: OLS

  1. model.lm <- lm(data = mtcars, formula = mpg ~ wt + gear)
  2. validate(model.lm)
  1. ## model.lm
  2. ## n 32.000000
  3. ## rsq 0.753842
  4. ## adj.rsq 0.736866
  5. ## F.stat 44.405361
  6. ## df.num 3.000000
  7. ## df.den 29.000000
  8. ## p.value 0.000000
  9. ## residual.median -0.293202
  10. ## residual.mean 0.000000
  11. ## residual.sd 2.990226
  12. ## residual.se 0.528602
  13. ## rmse 2.943133
  14. ## mad 1.943778
  15. ## mae 2.353567
  16. ## medianpe -0.016107
  17. ## mpe -0.015267
  18. ## sdpe 0.161915
  19. ## sepe 0.028623
  20. ## AIC 167.898446
  21. ## BIC 173.761389
  22. ## loglik -79.949223
  1. validate(model.lm, TRUE) # dataframe
  1. ## statistic model.lm
  2. ## 1 n 32.000000
  3. ## 2 rsq 0.753842
  4. ## 3 adj.rsq 0.736866
  5. ## 4 F.stat 44.405361
  6. ## 5 df.num 3.000000
  7. ## 6 df.den 29.000000
  8. ## 7 p.value 0.000000
  9. ## 8 residual.median -0.293202
  10. ## 9 residual.mean 0.000000
  11. ## 10 residual.sd 2.990226
  12. ## 11 residual.se 0.528602
  13. ## 12 rmse 2.943133
  14. ## 13 mad 1.943778
  15. ## 14 mae 2.353567
  16. ## 15 medianpe -0.016107
  17. ## 16 mpe -0.015267
  18. ## 17 sdpe 0.161915
  19. ## 18 sepe 0.028623
  20. ## 19 AIC 167.898446
  21. ## 20 BIC 173.761389
  22. ## 21 loglik -79.949223

Case 2: GLM (logit)

  1. model.glm <- glm(am ~ mpg + wt, mtcars, family = binomial(link = 'logit'))
  2. validate(model.glm) # Note the inapplicability of the percent error (pe) statistics.
  1. ## model.glm
  2. ## n 32.000000
  3. ## pseudo.rsq.mcfad 0.602490
  4. ## aer 0.062500
  5. ## null.deviance 43.229733
  6. ## residual.deviance 17.184255
  7. ## df.null 31.000000
  8. ## df.residual 29.000000
  9. ## residual.median -0.046842
  10. ## residual.mean -0.044152
  11. ## residual.sd 0.743181
  12. ## residual.se 0.131377
  13. ## rmse 0.732808
  14. ## mad 0.384793
  15. ## mae 0.508942
  16. ## medianpe -Inf
  17. ## mpe -Inf
  18. ## sdpe NaN
  19. ## sepe NaN
  20. ## AIC 23.184255
  21. ## BIC 27.581463
  22. ## loglik -8.592128

Case 3: NLS

  1. model.nls <- nls(Ozone ~ theta0 + Temp^theta1, airquality)
  2. validate(model.nls)
  1. ## model.nls
  2. ## n 116.000000
  3. ## iterations 4.000000
  4. ## convergence_tolerance 0.000001
  5. ## sigma 23.624178
  6. ## df.sigma 114.000000
  7. ## residual.median -0.684547
  8. ## residual.mean -0.000002
  9. ## residual.sd 23.521240
  10. ## residual.se 2.183892
  11. ## rmse 23.419636
  12. ## mad 15.047691
  13. ## mae 17.120045
  14. ## medianpe -0.011579
  15. ## mpe -0.287614
  16. ## sdpe 1.161944
  17. ## sepe 0.107884
  18. ## AIC 1066.823097
  19. ## BIC 1075.083868
  20. ## loglik -530.411549

6. Conclusion

The functions discussed and demonstrated will be improved on a continuing basis to (1) minimize the programming tedium in statistical reporting and (2) assist people in diagnosing the validity of their results. New functions to be added based on feasibility and future needs as necessary.

7. References

  1. broom library. https://github.com/tidyverse/broom
  2. Raju Rimal’s diagPlot. https://rpubs.com/therimalaya/43190

End of Document