项目作者: mdjeric

项目描述 :
Function for improving regression ceofficient plots
高级语言: R
项目地址: git://github.com/mdjeric/order-coef-plot.git
创建时间: 2018-06-05T13:33:48Z
项目社区:https://github.com/mdjeric/order-coef-plot

开源协议:

下载


Ordering regression coef plots

Function that orders values of coefficients from regression models, so the complex multiple models can be plotted at the same time and compared. My theme of choice is hrbrthemes. Although I do not know how to include label for reference category below variable name (ideally in smaller font size, or italic) but that it remains left-aligned.




It still requires a lot of fixing with label names, as can be seen, but I am working on making it into another function.

Function is located in 10-plot-sorting-function.R, and there are two samples of how to use it.

Here is how to do it.

Libraries and creating models

  1. library(tidyverse)
  2. library(car)
  3. library(margins)
  4. library(hrbrthemes)

Data is from GSS, however, .sav file is too large for github, and saved in .Rdata is data already prepared for OLS in GSS_14 (20-importing-and-cleaning-data.R is how I got there).

  1. load("GSS14.Rdata")
  2. ## Regression -------------------------------------------------------
  3. GSS_14$r_res16_c <- relevel(GSS_14$r_res16, ref = "CITY")
  4. GSS_14$polviews_m <- relevel(GSS_14$polviews, ref = "MODERATE")
  5. GSS_14$r_religion_m <- relevel(GSS_14$r_religion, ref = "MOD&LIB")
  6. clm_small <- lm(criteria_plus ~ age + sex + race + immigrant +
  7. veteran + prt_ba + r_degree +
  8. r_coninc + r_srcbelt, data = GSS_14)
  9. clm_small_res16 <- update(clm_small, ~ . + r_res16_c)
  10. clm_full <- update(clm_small_res16, ~ . + r_religion_m + r_region_4 +
  11. polviews_m)

Extract values and prepare labels, names, etc.

Now, extract the data from models. It’s much easier with tidy from broom package (e.g. mc_s <- tidy(clm_small)), but I started with margins for some reason. It’s just much slower, but function works best that way now. I’ll be making it so it can work either way, once I finish more elegant way of renaming of variables and levels.

In total, we need five variables: full_factor output name of the factor from lm(…); var name of the variable - it has to be ordered (alphabetic~ascending); factor probably more precise of the level, it is used in plotting and not sorting, since it is possible that two same levels for two different variables exist; AME and lower and upper.

  1. #### Extract coefficients (ame) ------------------------------------------------
  2. # mc_s - small model; mc_r - model with childhood place; mc_f - full model
  3. mc_s <- as.data.frame(summary(margins(clm_small)))
  4. mc_r <- as.data.frame(summary(margins(clm_small_res16)))
  5. mc_f <- as.data.frame(summary(margins(clm_full)))
  6. #### Fix the labels and prepare them for ordering and poloting ----------------
  7. # copy of the full name of factors from lm models which will be used later
  8. mc_s$full_factor <- mc_s$factor
  9. mc_r$full_factor <- mc_r$factor
  10. mc_f$full_factor <- mc_f$factor
  11. # create new variable with names of the variables from the model
  12. mc_s$var <- mc_s$factor
  13. mc_r$var <- mc_r$factor
  14. mc_f$var <- mc_f$factor
  15. # asing model names, so they can be distinguished
  16. mc_s$model <- as.factor(c("Reduced"))
  17. mc_r$model <- as.factor(c("W/ childhood"))
  18. mc_f$model <- as.factor(c("Large"))
  19. # first, variable names
  20. mc_s$var <- c(
  21. "Age", "Immigrant", "Parents' education", "Income",
  22. "Education", "Education", "SMSA", "SMSA", "SMSA", "Race",
  23. "Race", "Sex", "Veteran"
  24. )
  25. mc_r$var <- c(
  26. "Age", "Immigrant", "Parents' education", "Income", "Education",
  27. "Education", "Childhood", "Childhood", "SMSA", "SMSA", "SMSA",
  28. "Race", "Race", "Sex", "Veteran"
  29. )
  30. mc_f$var <- c(
  31. "Age", "Immigrant", "Political views", "Political views",
  32. "Political views", "Political views", "Political views",
  33. "Political views", "Parents' education", "Income", "Education",
  34. "Education", "Region", "Region", "Region", "Religion",
  35. "Religion", "Religion", "Religion", "Religion", "Childhood",
  36. "Childhood", "SMSA", "SMSA", "SMSA", "Race", "Race", "Sex", "Veteran"
  37. )
  38. ## remove variable name from the name of the factor
  39. mc_s$factor <- c(
  40. "One year older", "Yes", "Either has BA", "Increase of $20,000",
  41. "BA or more", "HS or JC", "Other urban", "Rural",
  42. "Suburban (top 100)", "Black", "Other", "Female", "Yes"
  43. )
  44. mc_r$factor <- c(
  45. "One year older", "Yes", "Either has BA", "Increase of $20,000",
  46. "BA or more", "HS or JC", "Large city or suburbs", "Rural",
  47. "Other urban", "Rural", "Suburban (top 100)", "Black",
  48. "Other", "Female", "Yes"
  49. )
  50. mc_f$factor <- c(
  51. "One year older", "Yes",
  52. "Conservative", "Extremly Liberal",
  53. "Extremly Conservative", "Liberal",
  54. "Slightly Conservative", "Slightly Liberal",
  55. "Either has BA", "Increase of $20,000",
  56. "BA or more", "HS or JC",
  57. "Northeast", "South",
  58. "West", "Cath. or orth.",
  59. "Christian NGG", "Luth., Epsic., or Morm.",
  60. "None, other, or Jewish", "Sect. or Bapt.",
  61. "Large city or suburbs","Rural",
  62. "Other urban", "Rural",
  63. "Suburban (top 100)",
  64. "Black","Other",
  65. "Female", "Yes"
  66. )

Now, we combine all three models into one, so they can be ordered and plotted.

  1. MB <- rbind(mc_f, mc_s)
  2. MB <- rbind(MB, mc_r)
  3. ## make all the labels to be factors
  4. MB$full_factor <- as.factor(MB$full_factor)
  5. MB$factor <- as.factor(MB$factor)
  6. MB$var <- as.factor(MB$var)

Use the function and sort

And we just call the sorting function. It reorders levels in var that will be used for faceting, and adds ordr to the data frame that will be used for ordering of values. It does it quite crudely, using numbers and additions, but it works. Several “rules” are followed: 1. variable levels are ordered according to the size of the mean values for all applicable models; 2. variables are ordered according to the number of models they are involved with (most models in the bottom); and 3. variables are ordered according to total mean of all values in them.

  1. MB <- sort_models(MB)

Additional fixing so the plot is nicer

Some more fixing before plotting - new line in variable names that includes a reference category.

  1. levels(MB$var)
  1. ## [1] "Religion" "Political views" "Region"
  2. ## [4] "Childhood" "Education" "Parents' education"
  3. ## [7] "Income" "Immigrant" "Age"
  4. ## [10] "SMSA" "Veteran" "Sex"
  5. ## [13] "Race"
  1. levels(MB$var)[1] <- "Religion\n[ref: mod. or lib. prot.]"
  2. levels(MB$var)[2] <- "Political views\n[ref: moderate]"
  3. levels(MB$var)[3] <- "Region\n[ref: Midwest]"
  4. levels(MB$var)[4] <- "Childhood\n[ref: medium city]"
  5. levels(MB$var)[5] <- "Education\n[ref: LT HS]"
  6. levels(MB$var)[10] <- "Metrop. area\n[ref: city (top 100)]"
  7. levels(MB$var)[13] <- "Race\n[ref: white]"

Since not variables as whole might not be stat. significant in the models, and thus would be just adding to the clutter (e.g. especially categorical ones), they can be identified and their alpha reduced. A new logical variable in dataframe is useful for that, and the values are manually assigned (at this point). This also allows, for example, to easily see in what models certain categorical variable as whole looses significance.

  1. MB$sig <- TRUE
  2. Anova(clm_full)
  1. ## Anova Table (Type II tests)
  2. ##
  3. ## Response: criteria_plus
  4. ## Sum Sq Df F value Pr(>F)
  5. ## age 16.306 1 60.7099 1.595e-14 ***
  6. ## sex 2.480 1 9.2328 0.002437 **
  7. ## race 8.194 2 15.2531 2.957e-07 ***
  8. ## immigrant 0.126 1 0.4688 0.493677
  9. ## veteran 0.764 1 2.8436 0.092039 .
  10. ## prt_ba 0.393 1 1.4620 0.226888
  11. ## r_degree 5.950 2 11.0757 1.739e-05 ***
  12. ## r_coninc 4.546 1 16.9268 4.191e-05 ***
  13. ## r_srcbelt 0.634 3 0.7866 0.501453
  14. ## r_res16_c 3.651 2 6.7972 0.001167 **
  15. ## r_religion_m 12.467 5 9.2835 1.160e-08 ***
  16. ## r_region_4 2.014 3 2.4997 0.058184 .
  17. ## polviews_m 15.893 6 9.8623 1.341e-10 ***
  18. ## Residuals 279.601 1041
  19. ## ---
  20. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Anova(clm_small)
  1. ## Anova Table (Type II tests)
  2. ##
  3. ## Response: criteria_plus
  4. ## Sum Sq Df F value Pr(>F)
  5. ## age 24.07 1 77.7871 < 2.2e-16 ***
  6. ## sex 5.41 1 17.4982 3.114e-05 ***
  7. ## race 10.70 2 17.2945 4.068e-08 ***
  8. ## immigrant 0.07 1 0.2422 0.6227255
  9. ## veteran 2.21 1 7.1273 0.0077083 **
  10. ## prt_ba 1.36 1 4.3806 0.0365875 *
  11. ## r_degree 12.13 2 19.6072 4.354e-09 ***
  12. ## r_coninc 3.76 1 12.1476 0.0005117 ***
  13. ## r_srcbelt 4.75 3 5.1219 0.0016056 **
  14. ## Residuals 327.07 1057
  15. ## ---
  16. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Anova(clm_small_res16)
  1. ## Anova Table (Type II tests)
  2. ##
  3. ## Response: criteria_plus
  4. ## Sum Sq Df F value Pr(>F)
  5. ## age 21.86 1 71.7931 < 2.2e-16 ***
  6. ## sex 5.43 1 17.8494 2.598e-05 ***
  7. ## race 11.65 2 19.1280 6.917e-09 ***
  8. ## immigrant 0.03 1 0.0847 0.771098
  9. ## veteran 2.29 1 7.5185 0.006210 **
  10. ## prt_ba 1.14 1 3.7518 0.053018 .
  11. ## r_degree 10.42 2 17.1176 4.830e-08 ***
  12. ## r_coninc 3.04 1 9.9854 0.001623 **
  13. ## r_srcbelt 1.97 3 2.1590 0.091243 .
  14. ## r_res16_c 5.84 2 9.5964 7.409e-05 ***
  15. ## Residuals 321.22 1055
  16. ## ---
  17. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. MB$sig[MB$var == "Immigrant"] <- FALSE
  2. MB$sig[(MB$var == "Veteran") &
  3. (MB$model == "Large")] <- FALSE
  4. MB$sig[(MB$var == "Parents' education") &
  5. (MB$model == "Large")] <- FALSE
  6. MB$sig[(MB$var == "Metrop. area\n[ref: city (top 100)]") &
  7. (MB$model == "Large")] <- FALSE
  8. MB$sig[(MB$var == "Region\n[ref: Midwest]") &
  9. (MB$model == "Large")] <- FALSE
  10. MB$sig[(MB$var == "Parents' education") &
  11. (MB$model == "W/ childhood")] <- FALSE
  12. MB$sig[(MB$var == "Metrop. area\n[ref: city (top 100)]") &
  13. (MB$model == "W/ childhood")] <- FALSE

If there are specific variables we are interested in, as I was, their background can be shaded to draw attention. Since plot is going to be faceted, adding full width/length geom_rect that has also same main variables is one option. Thus, we are going to assign NA values to all the facets, and add (-)Infs to variables we are interested in.

  1. intrst <- data.frame(var = unique(MB$var),
  2. xmn = NA,
  3. xmx = NA,
  4. ymn = NA,
  5. ymx = NA
  6. )
  7. intrst[ 9, 2:5] <- c(-Inf, Inf, -Inf, Inf)
  8. intrst[10, 2:5] <- c(-Inf, Inf, -Inf, Inf)

Ploting

Finally, to plot the data ordr should be used to reorder variable factor (it contains levels of categorical variables, or units of non-categorical and since it was not used in creation of the ordr var it can contain same names for different variables).

  1. p_m <- ggplot(data = MB, aes(x = reorder(factor, ordr),
  2. y = AME,
  3. ymin = lower,
  4. ymax = upper,
  5. colour = model,
  6. alpha = sig)
  7. )
  8. # add marginal effects
  9. p_m <- p_m +
  10. geom_pointrange(position = position_dodge(width = 0.5),
  11. shape = 21,
  12. fill = "white"
  13. ) +
  14. scale_colour_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))
  15. # make vars. that do not reach stat.sig. transparent
  16. p_m <- p_m + scale_alpha_discrete(range = c(0.5, 1.0))
  1. ## Warning: Using alpha for a discrete variable is not advised.
  1. # add intercept line
  2. p_m <- p_m + geom_hline(yintercept = 0,
  3. color = "dark gray")
  4. # rotate plot and add some labels
  5. p_m <- p_m +
  6. coord_flip() +
  7. labs(x = NULL,
  8. y = "Average Marginal Effect",
  9. colour = "Models",
  10. alpha = "Variable\nsig at p<.05",
  11. title = "Regression on criteria of national belonging (openess)")
  12. # add facets according to model variables
  13. # order of levels is going to take care of variable order
  14. p_m <- p_m + facet_grid(var ~ . ,
  15. scales = "free_y",
  16. space = "free_y",
  17. switch = "y")
  18. # fix the appearance of the plot
  19. p_m <- p_m +
  20. theme(strip.text.y = element_text(angle = 180),
  21. strip.placement = "outside",
  22. strip.background = element_rect(fill = "#d9d9d9", color = NA),
  23. plot.background = element_blank(),
  24. panel.background = element_blank(),
  25. panel.grid.major.x = element_line(colour = "#d9d9d9"),
  26. panel.grid.minor.x = element_line(colour = "#d9d9d9"),
  27. panel.grid.major.y = element_line(colour = "#d9d9d9"),
  28. panel.grid.minor.y = element_line(colour = "#d9d9d9"),
  29. axis.ticks = element_blank(),
  30. legend.key = element_blank(),
  31. legend.background = element_blank(),
  32. legend.box.background = element_blank(),
  33. legend.title = element_text(size = 10),
  34. axis.title = element_text(size = 10),
  35. title = element_text(size = 11)
  36. )
  37. # add rectangles to highlight variables we are interested in
  38. p_m <- p_m + geom_rect(data = intrst,
  39. aes(xmin = xmn,
  40. xmax = xmx,
  41. ymin = ymn,
  42. ymax = ymx),
  43. alpha = 0.09,
  44. fill = "black",
  45. inherit.aes = FALSE,
  46. show.legend = FALSE
  47. )

And this is how finished plot looks like, it does include several warnings, which are for facets that are not shaded.

  1. # finished plot
  2. p_m
  1. ## Warning: Removed 11 rows containing missing values (geom_rect).

Post-plot-fixing

But, the title comes awkwardly in the middle, to fix it we need to use ggplotGrob and grid.draw from grid package.

  1. p_m_fix <- ggplotGrob(p_m)
  2. p_m_fix$layout$l[p_m_fix$layout$name == "title"] <- 2
  3. grid::grid.draw(p_m_fix)

It’s all fairly simple, but the function cuts a lot of lines. It can also be cut more, and I’ll be trying to do it. :)

If you have any suggestion on how to further improve the plot, I would be thrilled to hear it!