Project author: RGLab

Project description:
Tools for text standardization in situations with specialized vocabulary
Language: R
Project URL: git://github.com/RGLab/corpusFreq.git
Created: 2018-02-08T00:32:56Z
Project community: https://github.com/RGLab/corpusFreq


corpusFreq

Lifecycle: maturing

A utility package for creating word-frequency tables from various data types and then performing interactive spellchecking with these frequency tables.

The parsing functions are designed with immunological data in mind and focus on use cases such as manuscript abstracts and clinical forms.

Installation

The package can be installed from the RGLab GitHub repository:

  # install.packages("remotes")
  remotes::install_github("RGLab/corpusFreq")

Usage

The general idea is for users to create a corpus, or canonical frequency table, and use it to spellcheck other data in an R session. Besides being biology-focused, the main difference between corpusFreq and other spellchecking packages, e.g. refinr or hunspell, is that the interactive methods here use both string distance and the frequency of words in the corpus to determine the most likely correct replacement. Other spellchecking packages do not take word frequency into account; instead, they rely on some variation of string distance, stemming, or n-grams to compare incorrect words with possible replacements.
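To make the notion of a frequency table concrete, here is a minimal base-R sketch that counts words in free text. It is not how makeFreqTbl is implemented, and the object that function returns may be structured differently. The sample texts, the tokenize helper, and its punctuation rule are all assumptions made for illustration.

```r
# Hypothetical input: a few short free-text records
texts <- c("CD4+ T cells were sorted.",
           "We compared CD4+ and CD8+ T cells.",
           "CD4-CD8- cells were excluded.")

# Split on whitespace, then strip only sentence punctuation from both ends
# so that marker names such as "CD4+" and "CD4-CD8-" stay intact.
# (Illustrative rule only; the package's own parsing may differ.)
tokenize <- function(x) {
  tokens <- unlist(strsplit(x, "\\s+"))
  tokens <- gsub("^[.,;:!?()\"]+|[.,;:!?()\"]+$", "", tokens)
  tokens[nzchar(tokens)]
}

# Count tokens and sort from most to least frequent
freq_tbl <- sort(table(tokenize(texts)), decreasing = TRUE)
print(freq_tbl)
```

In this toy input, "cells" is counted three times and "CD4+" twice, while the trailing periods are dropped before counting.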

The “hello world” example:

  library(corpusFreq)

  # Make frequency table from large text document
  myData <- read.table("myText.tsv", sep = "\t", stringsAsFactors = FALSE)
  ft <- makeFreqTbl(myData)

  # Use frequency to spellcheck other texts
  myAbstract <- "This is my stuby loooking at CD4+ cells, but not CD4-CD8- ones."
  result <- interactiveSpellCheck(input = myAbstract,
                                  name = "CD4_study",
                                  outputDir = "home/CD4_work",
                                  freqTbl = ft)
  print(result)
  # "This is my study looking at CD4+ cells, but not CD4-CD8- ones."
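The combination of string distance and corpus frequency described above can be illustrated with a small base-R sketch. This is an assumption about how such a ranking might work, not the logic inside interactiveSpellCheck. The toy counts in corpus_freq, the rank_candidates helper, and the max_dist cutoff are hypothetical.

```r
# Toy corpus frequencies (hypothetical counts, for illustration only)
corpus_freq <- c(study = 120, sturdy = 3, stub = 8,
                 looking = 45, locking = 2, cells = 300)

# Rank corpus words as replacements for a misspelled word.
# Candidates within `max_dist` edits are ordered by edit distance first,
# then by corpus frequency, so the more frequent word wins a distance tie.
rank_candidates <- function(word, freq, max_dist = 2) {
  d <- drop(utils::adist(word, names(freq)))  # Levenshtein distances
  keep <- d <= max_dist
  cand <- data.frame(word = names(freq)[keep],
                     dist = d[keep],
                     freq = unname(freq[keep]),
                     stringsAsFactors = FALSE)
  cand[order(cand$dist, -cand$freq), ]
}

print(rank_candidates("stuby", corpus_freq))
print(rank_candidates("loooking", corpus_freq))
```

For "stuby", both "study" and "stub" are one edit away, so string distance alone cannot separate them; the much higher count of "study" in the toy corpus puts it first.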

If you are working with biological text and want to use a frequency table that has already been curated, you can use the data in the BioCorpus data package. It contains four frequency tables, each representing a different public or soon-to-be-public database: ImmPort, GEO, Center for AIDS Vaccine Data, and ONB. A combined version is also included in the package and can be loaded as follows:

  remotes::install_github("RGLab/BioCorpus")
  library(BioCorpus)
  ft <- BioCorpus::allDataFT

Examples & Documentation

For more advanced examples and detailed documentation, see the package vignettes.