Project author: Cirad-ASTRE

Project description: Match toponyms by similarity
Repository: git://github.com/Cirad-ASTRE/topomatch.git
Created: 2019-02-08T13:02:13Z
Project page: https://github.com/Cirad-ASTRE/topomatch

topomatch

This project moved to https://forgemia.inra.fr/umr-astre/topomatch/

Helper functions for matching toponyms from different sources, which may be spelled in slightly different ways. The result lets you inspect the matching and act accordingly.

    library(topomatch)

    ## Two lists of country names from different sources
    countries1 <- spData::world$name_long
    countries2 <- unique(maps::world.cities$country.etc)

    ## Match the first list against the second and print a summary
    (country_matches <- topomatch(countries1, countries2))
    #> 156 names matched exactly: Fiji, Tanzania, Western Sahara, ...
    #>
    #> 15 matches based on similarity:
    #> 1. United States: United Arab Emirates
    #> 2. Democratic Republic of the Congo: Congo Democratic Republic
    #> 3. Russian Federation: Russia
    #> 4. French Southern and Antarctic Lands: Northern Mariana Islands
    #> 5. Timor-Leste: East Timor
    #> 6. Côte d'Ivoire: Cape Verde
    #> 7. The Gambia: Gambia
    #> 8. United Kingdom: United Arab Emirates
    #> 9. Brunei Darussalam: Brunei
    #> 10. Antarctica: Vatican City
    #> 11. Northern Cyprus: Northern Mariana Islands
    #> 12. Somaliland: Swaziland
    #> 13. Serbia: Serbia and Montenegro
    #> 14. Montenegro: Serbia and Montenegro
    #> 15. South Sudan: South Africa
    #>
    #> 6 unresolved matches:
    #> 1. Republic of the Congo: Czech Republic, Dominican Republic, ...
    #> 2. eSwatini: Palestine, Estonia
    #> 3. Lao PDR: San Marino, ...
    #> 4. Dem. Rep. Korea: Korea South, Korea North
    #> 5. Republic of Korea: Czech Republic, Dominican Republic, ...
    #> 6. Kosovo: Comoros, Solomon Islands

Some toponyms were not matched correctly and need manual fixes. Write the fixes in a named vector, with the original toponym as the name and the correct match as the value. If a toponym has no correct match, set its value to NA.

    ## Inspect the competing candidates for the unmatched countries
    (bm <- best_matches(country_matches)[unmatched(country_matches)])
    #> $`Republic of the Congo`
    #> [1] "Czech Republic" "Dominican Republic"
    #> [3] "Congo Democratic Republic" "Central African Republic"
    #>
    #> $eSwatini
    #> [1] "Palestine" "Estonia"
    #>
    #> $`Lao PDR`
    #> [1] "San Marino" "Central African Republic"
    #> [3] "Sao Tome and Principe"
    #>
    #> $`Dem. Rep. Korea`
    #> [1] "Korea South" "Korea North"
    #>
    #> $`Republic of Korea`
    #> [1] "Czech Republic" "Dominican Republic"
    #> [3] "Congo Democratic Republic" "Central African Republic"
    #>
    #> $Kosovo
    #> [1] "Comoros" "Solomon Islands"

    ## One fix per unresolved toponym, in the same order as names(bm)
    cnames_fixes <- setNames(
      c("Congo Democratic Republic", NA, "Laos", "Korea North",
        "Korea South", NA),
      names(bm)
    )

    ## Fix the incorrect matches from similarity as well
    cnames_fixes <- c(
      cnames_fixes,
      "United States" = "USA",
      "French Southern and Antarctic Lands" = "France",
      "Côte d'Ivoire" = "Ivory Coast",
      "United Kingdom" = "UK",
      "Antarctica" = NA,
      "Northern Cyprus" = "Cyprus",
      "Somaliland" = "Somalia",
      "South Sudan" = "Sudan"
    )

Now you can transcribe the original toponyms to the matched terms.

    translate <- transcribe(country_matches, fixes = cnames_fixes)
    translate(c("United Kingdom", "Kosovo"))
    #> [1] "UK" NA

    ## "Translate" all of the original toponyms
    countries1_trans <- translate(countries1)

    ## Only those "fixed" as NA are not found in the second list
    countries1[!countries1_trans %in% countries2]
    #> [1] "eSwatini" "Antarctica" "Kosovo"
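
Conceptually, the function returned by transcribe behaves like a lookup in a named vector in which manual fixes take precedence over automatic matches. The base R sketch below illustrates only that idea. The names make_translator, auto_matches and translate_sketch are hypothetical, and this is not topomatch's actual implementation.

```r
## Hypothetical helper, not part of topomatch: build a lookup function
## from automatic matches plus manual fixes (fixes take precedence).
make_translator <- function(matches, fixes = character(0)) {
  ## Drop automatic matches that are overridden by a manual fix
  lookup <- c(fixes, matches[!names(matches) %in% names(fixes)])
  ## Return a closure that maps original names to matched names
  function(x) unname(lookup[x])
}

## Toy data for illustration (assumed, taken from the summary above)
auto_matches <- c(
  "Fiji" = "Fiji",
  "The Gambia" = "Gambia",
  "United Kingdom" = "United Arab Emirates"
)
fixes <- c("United Kingdom" = "UK", "Kosovo" = NA)

translate_sketch <- make_translator(auto_matches, fixes)
translate_sketch(c("United Kingdom", "Kosovo", "The Gambia"))
#> [1] "UK"     NA       "Gambia"
```

Returning a closure keeps the lookup table out of the calling environment, so the same translator can be applied to several vectors of names.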

Method

Wraps the local-global alignment algorithm borrowed from the Bioconductor package Biostrings. It works better than global alignment and requires less fine-tuning, although it is considerably slower. See https://ro-che.info/articles/2016-12-11-local-alignment.
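
To give an intuition for the alignment mentioned above, here is a minimal base R sketch of a local-global score: the whole of one string is aligned against the best-matching substring of the other, so unmatched text at either end of the longer string costs nothing. The function local_global_score and its unit scores (match = 1, mismatch = -1, gap = -1) are assumptions made for illustration. Biostrings uses its own scoring and implementation, and the direction of the alignment used by topomatch is not stated here.

```r
## Hypothetical helper, not topomatch's or Biostrings' code.
## Aligns all of `whole` against any substring of `part_of`.
local_global_score <- function(whole, part_of,
                               match = 1, mismatch = -1, gap = -1) {
  a <- strsplit(tolower(whole), "")[[1]]
  b <- strsplit(tolower(part_of), "")[[1]]
  n <- length(a)
  m <- length(b)
  ## H[i + 1, j + 1]: best score for a[1..i] against a substring of b ending at j
  H <- matrix(0, nrow = n + 1, ncol = m + 1)
  ## Skipping characters of `whole` is penalised; the first row stays 0,
  ## so the alignment may start anywhere in `part_of`
  H[, 1] <- gap * (0:n)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      H[i + 1, j + 1] <- max(
        H[i, j] + s,         # align a[i] with b[j]
        H[i, j + 1] + gap,   # gap in part_of
        H[i + 1, j] + gap    # gap in whole
      )
    }
  }
  ## The alignment may end anywhere in `part_of`
  max(H[n + 1, ])
}

## Rank a few candidates against one toponym
candidates <- c("Gambia", "Cape Verde", "Russia")
scores <- vapply(
  candidates,
  function(cand) local_global_score(cand, "The Gambia"),
  numeric(1)
)
names(which.max(scores))
#> [1] "Gambia"
```

With these unit scores, "Gambia" reaches the maximum of 6 against "The Gambia", whereas a global alignment of the same pair could score at most 2 because the four extra characters of "The " would each be charged as a gap.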

Installation

    remotes::install_github("Cirad-ASTRE/topomatch")