Project author: rensa

Project description:
Tidy appearance frequency data for ABC's The Drum.
Primary language: HTML
Repository: git://github.com/rensa/drumguests.git
Created: 2018-07-30T08:47:36Z
Project community: https://github.com/rensa/drumguests


Analysis of The Drum hosts, panellists and guests

This script scrapes data on the hosts, panellists and guests of The Drum
from the ABC website. If you just want to grab some tidy data, it’s
currently in drum_tidy.csv. It goes back to 27 April 2018 (as at
2019-05-13).

Note: the formatted datetimes in the dt column are in UTC! You’ll
need to convert them to "Australia/Sydney" before using them.
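
As a rough sketch of that conversion, lubridate's `with_tz()` keeps the instant the same and only changes the zone it is displayed in. The single timestamp below is a made-up stand-in for one `dt` value, and the commented-out lines assume `drum_tidy.csv` sits in your working directory.

```r
library(lubridate)

# A toy value standing in for one entry of the dt column, as stored in UTC
dt_utc <- ymd_hms("2019-05-05 14:00:00", tz = "UTC")

# with_tz() keeps the same instant but changes the zone used for display
dt_sydney <- with_tz(dt_utc, tzone = "Australia/Sydney")
print(dt_sydney)

# Applied to the whole file (assumes drum_tidy.csv is in the working directory):
# drum <- readr::read_csv("drum_tidy.csv")
# drum$dt <- with_tz(drum$dt, tzone = "Australia/Sydney")
```

Sydney is ten hours ahead of UTC in May, so the toy value lands on midnight local time.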

To grab the data from the ABC site yourself, run this notebook!

Let’s scrape data from the ABC website and find out how often people
appear!

    # packages the code in this notebook relies on
    library(tidyverse)   # dplyr, tidyr, stringr, readr, ggplot2, tibble, purrr
    library(rvest)       # read_html(), html_nodes(), html_text()
    library(glue)        # glue() string interpolation
    library(magrittr)    # %<>% and %T>% pipes
    library(lubridate)   # parse_date_time()

    drum_url = 'http://www.abc.net.au/news/programs/the-drum/'
    pages = 1:10
    episodes_id = 'collectionId-4'

    # download data: one row per episode listed on each page
    episodes =
      map_dfr(pages, function(x) {
        episode_page =
          read_html(glue('{drum_url}?page={x}')) %>%
          html_nodes(glue('#{episodes_id} article'))
        tibble(
          title = episode_page %>% html_nodes('h3') %>% html_text(),
          description = episode_page %>% html_nodes('p') %>% html_text())
      }) %>%
      print()
    #> # A tibble: 250 x 2
    #>    title                    description
    #>    <chr>                    <chr>
    #>  1 "\n\n The Drum Friday M… Host: Ellen Fanning Panel: Kate Mills, David M…
    #>  2 "\n\n Health Care Speci… Host: Ellen Fanning Panel: Pat Turner, Profess…
    #>  3 "\n\n The Drum Wednesda… Host: Kathryn Robinson Panel: Geraldine Doogue…
    #>  4 "\n\n Corangamite Speci… Host: Ellen Fanning Panel: Dr Fiona Gray, Gabr…
    #>  5 "\n\n The Drum Monday M… "Host: Kathryn Robinson Panel: Kathryn Greiner…
    #>  6 "\n\n The Drum Friday M… Host: Ellen Fanning Panel: Nicki Hutley, Peter…
    #>  7 "\n\n The Drum Thursday… Host: Kathryn Robinson Panel: Robyn Parker, Ja…
    #>  8 "\n\n The Drum Wednesda… Host: Kathryn Robinson Panel: Amanda Rose, Kat…
    #>  9 "\n\n The Drum Tuesday … Host: Ellen Fanning Panel: Jenna Price, Scott …
    #> 10 "\n\n The Drum Monday A… In a special episode, our panel of Indigenous …
    #> # … with 240 more rows
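
If the selectors above look opaque, here is a minimal offline sketch of the same select-nodes-then-extract-text pattern. The HTML snippet is invented for illustration and the real ABC markup will differ; only the `collectionId-4` id and the `h3` and `p` selectors come from the code above.

```r
library(rvest)

# Hypothetical markup standing in for one page of episode listings
page <- read_html('
  <div id="collectionId-4">
    <article><h3>Episode one</h3><p>Host: A Panel: B, C</p></article>
    <article><h3>Episode two</h3><p>Host: D Panel: E, F</p></article>
  </div>')

# Select the <article> nodes inside the container, as the scraper does
articles <- html_nodes(page, "#collectionId-4 article")

# Pull the text out of each heading and each paragraph
titles <- html_text(html_nodes(articles, "h3"))
descriptions <- html_text(html_nodes(articles, "p"))

print(data.frame(title = titles, description = descriptions))
```

Building the table this way quietly assumes each article has exactly one `h3` and one `p`; if that ever stops being true, the two vectors will differ in length.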

Okay, let’s tidy it up and get the good bits out (regex makes me cry).

    episodes %<>%
      # format the date
      mutate(
        ep_date = str_replace_all(
          title, c("\n\n The Drum " = "", " \n" = "", "- " = "", "\\s$" = "")),
        dt = parse_date_time(ep_date, orders = "A, B d", tz = "Australia/Sydney")) %>%
      # isolate the host and people
      mutate(
        host = str_extract(description,
          regex("(?<=Host: )(.*)(?= Panel:)", dotall = TRUE)),
        panel = str_extract(description,
          regex(paste0("(?<=Panel: )(.*)(?=( Guest:| Guests:| Interview with:|",
                       "The panel|We have))"),
                ignore_case = TRUE, dotall = TRUE))) %>%
      # separate guest and/or interviewees...
      mutate(
        guest = str_extract(panel,
          regex("(?<=Guest: )(.*)$", dotall = TRUE, ignore_case = TRUE)),
        interviewee = str_extract(panel,
          regex("(?<=Interview with: )(.*)$", dotall = TRUE, ignore_case = TRUE))) %>%
      # ... and remove them from the panel
      # (unlike the extraction above, these patterns are case-sensitive
      # and `.` does not match newlines)
      mutate(
        panel = str_replace(panel, regex("Guest: (.*)$"), ""),
        panel = str_replace(panel, regex("Interview with: (.*)$"), ""),
        panel = str_replace(panel, "\\.$", "")) %>%
      select(ep_date, dt, host, panel, guest, interviewee) %>%
      print()
    #> Warning: 17 failed to parse.
    #> # A tibble: 250 x 6
    #>    ep_date    dt                  host    panel         guest   interviewee
    #>    <chr>      <dttm>              <chr>   <chr>         <chr>   <chr>
    #>  1 Friday May NA                  Ellen … "Kate Mills,… <NA>    <NA>
    #>  2 "\n\n Hea… NA                  Ellen … "Pat Turner,… <NA>    <NA>
    #>  3 Wednesday  NA                  Kathry… "Geraldine D… <NA>    <NA>
    #>  4 "\n\n Cor… NA                  Ellen … "Dr Fiona Gr… <NA>    <NA>
    #>  5 Monday Ma… 2019-05-06 00:00:00 Kathry… "Kathryn Gre… "John … <NA>
    #>  6 Friday Ma… 2019-05-03 00:00:00 Ellen … "Nicki Hutle… "Ece T… <NA>
    #>  7 Thursday … 2019-05-02 00:00:00 Kathry… "Robyn Parke… "Shane… <NA>
    #>  8 Wednesday… 2019-05-01 00:00:00 Kathry… "Amanda Rose… <NA>    <NA>
    #>  9 Tuesday A… 2019-04-30 00:00:00 Ellen … "Jenna Price… "Sarah… <NA>
    #> 10 Monday Ap… 2019-04-29 00:00:00 <NA>    <NA>          <NA>    <NA>
    #> # … with 240 more rows
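
To see what the lookbehind and lookahead patterns are doing, here is a small sketch on a single made-up description. The names are hypothetical and the panel pattern is simplified to stop at ` Guest:` only, so treat it as an illustration of the technique rather than the full set of cases handled above.

```r
library(stringr)

# Hypothetical description in the "Host: ... Panel: ... Guest: ..." shape
desc <- "Host: Alex Example Panel: Blair One, Casey Two and Drew Three Guest: Eden Four"

# (?<=...) requires the text before the match; (?=...) requires the text after it.
# Neither is included in the extracted string.
host  <- str_extract(desc, regex("(?<=Host: )(.*)(?= Panel:)", dotall = TRUE))
panel <- str_extract(desc, regex("(?<=Panel: )(.*)(?= Guest:)", dotall = TRUE))
guest <- str_extract(desc, regex("(?<=Guest: )(.*)$", dotall = TRUE))

print(c(host = host, panel = panel, guest = guest))
```

When a description has no matching label, `str_extract()` returns `NA`, which is why some rows above come out as `<NA>`.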

Okay, now let’s break these names up:

    episodes %<>%
      # one row per episode and role...
      gather(key = "role", value = "name", host, panel, guest, interviewee) %>%
      # ... then one row per person
      separate_rows(name, sep = ", and |, | and ") %>%
      # remove any trailing spaces that snuck in
      mutate(name = str_replace_all(name, "\\s$", "")) %T>%
      write_csv('drum_tidy.csv') %T>%
      print()
    #> # A tibble: 1,733 x 4
    #>    ep_date                    dt                  role name
    #>    <chr>                      <dttm>              <chr> <chr>
    #>  1 Friday May                 NA                  host Ellen Fanning
    #>  2 "\n\n Health Care Special" NA                  host Ellen Fanning
    #>  3 Wednesday                  NA                  host Kathryn Robinson
    #>  4 "\n\n Corangamite Special" NA                  host Ellen Fanning
    #>  5 Monday May 6               2019-05-06 00:00:00 host Kathryn Robinson
    #>  6 Friday May 3               2019-05-03 00:00:00 host Ellen Fanning
    #>  7 Thursday May 2             2019-05-02 00:00:00 host Kathryn Robinson
    #>  8 Wednesday May 1            2019-05-01 00:00:00 host Kathryn Robinson
    #>  9 Tuesday April 30           2019-04-30 00:00:00 host Ellen Fanning
    #> 10 Monday April 29            2019-04-29 00:00:00 host <NA>
    #> # … with 1,723 more rows
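
Here is the same reshape on a one-episode toy table, so you can see each step's effect. The names are hypothetical; the `gather()` and `separate_rows()` calls mirror the ones above, minus the guest and interviewee columns.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical one-episode table in the wide shape produced earlier
wide <- tibble(
  ep_date = "Monday May 6",
  host = "Alex Example",
  panel = "Blair One, Casey Two and Drew Three")

long <- wide %>%
  # stack the role columns into role/name pairs
  gather(key = "role", value = "name", host, panel) %>%
  # split "A, B and C" style lists into one row per person
  separate_rows(name, sep = ", and |, | and ")

print(long)
```

One thing to watch with this separator: any name that itself contains " and " would be split in two.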

Now we can visualise. For example, here are hosts by frequency:

    episodes %>%
      filter(role == "host") %>%
      # count appearances per host
      group_by(name) %>%
      summarise(n = n()) %>%
      ungroup() %>%
      drop_na(name) %T>%
      print() %>%
      {
        ggplot(., aes(x = reorder(name, n), y = n)) +
          geom_col() +
          coord_flip() +
          theme_minimal() +
          labs(
            x = 'Host',
            y = 'Number of appearances',
            title = 'The Drum hosts by appearance over the last year')
      }
    #> # A tibble: 10 x 2
    #>    name                  n
    #>    <chr>             <int>
    #>  1 Adam Spencer         21
    #>  2 Craig Reucassel      12
    #>  3 Dr Norman Swan        1
    #>  4 Ellen Fanning        93
    #>  5 Fran Kelly            1
    #>  6 John Barron           6
    #>  7 Julia Baird          71
    #>  8 Kathryn Robinson     10
    #>  9 Peter van Onselen    23
    #> 10 Sarah Dingle          7
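
If you only want the counts, dplyr's `count()` does the group, summarise and ungroup steps in one call. This is a sketch on made-up rows (the names and the `toy` table are hypothetical), not a change to the notebook's code.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy rows in the same role/name shape as drum_tidy.csv
toy <- tibble(
  role = c("host", "host", "host", "panel"),
  name = c("Alex Example", "Alex Example", "Blair One", "Casey Two"))

host_counts <- toy %>%
  filter(role == "host") %>%
  # equivalent to group_by(name) %>% summarise(n = n()) %>% ungroup()
  count(name, sort = TRUE)

print(host_counts)
```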

And here are the guests, panellists and interviewees:

    episodes %>%
      filter(role != "host") %>%
      # count appearances per person and role
      group_by(name, role) %>%
      summarise(n = n()) %>%
      ungroup() %>%
      drop_na(name) %>%
      top_n(30, n) %T>%
      print() %>%
      {
        ggplot(., aes(x = reorder(name, n), y = n)) +
          geom_col() +
          coord_flip() +
          theme_minimal(base_size = 8) +
          labs(
            x = 'Name',
            y = 'Number of appearances',
            title = 'Top 30 non-host appearances on The Drum over the last year')
      }
    #> # A tibble: 31 x 3
    #>    name                role      n
    #>    <chr>               <chr> <int>
    #>  1 Adrian Piccoli      panel     9
    #>  2 Avril Henry         panel     8
    #>  3 Bridie Jabour       panel     8
    #>  4 Caroline Overington panel     7
    #>  5 Craig Chung         panel    10
    #>  6 David Marr          panel     6
    #>  7 Greg Sheridan       panel     9
    #>  8 Jane Caro           panel     7
    #>  9 Jenna Price         panel     7
    #> 10 Jennifer Hewett     panel     6
    #> # … with 21 more rows
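
The tibble above has 31 rows even though `top_n(30, n)` asked for 30. One plausible explanation (an assumption; the notebook doesn't say) is ties: `top_n()` keeps every row that ties with the last-ranked one. The sketch below shows that behaviour on made-up counts, next to `slice_max()` with `with_ties = FALSE` for when you need an exact cut-off.

```r
library(dplyr)
library(tibble)

# Hypothetical appearance counts with a tie for second place
toy <- tibble(
  name = c("Alex Example", "Blair One", "Casey Two", "Drew Three"),
  appearances = c(5, 4, 4, 1))

# top_n() keeps ties, so asking for 2 rows can return more
with_ties <- top_n(toy, 2, appearances)

# slice_max() with with_ties = FALSE returns exactly n rows
strict <- slice_max(toy, order_by = appearances, n = 2, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(strict))
```

Which one you want depends on the chart: dropping ties gives an exact top 30, but it arbitrarily excludes someone with the same count as the last person included.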