Project author: AndresCalimero

Project description:
Automatization tools for CorpusSearch
Language: Java
Repository: git://github.com/AndresCalimero/corpus-search-auto.git
Created: 2017-04-13T18:12:56Z
Project page: https://github.com/AndresCalimero/corpus-search-auto

License: MIT License


CorpusSearchAuto

Automatization tools for CorpusSearch

Summary

CorpusSearchAuto is a Java program that automates large-scale research in corpus linguistics. It uses CorpusSearch to search syntactically annotated (parsed) corpora.

The software is a set of tools (currently two, described below).

Features

  • Classify the YCOE and PENN corpora by genre (tool: genre-finder)
  • Run bulk searches of lexical and syntactic configurations by genre in several corpora (tool: statistics-by-genres)
  • Export results to HTML and Excel formats
  • Resource optimization

Usage

    CorpusSearchAuto -tool [tool] [parameters]

The “data” folder

The “data” folder sits in the same directory as the CorpusSearchAuto executable and should store all the corpora files and the CS.jar executable. CorpusSearchAuto looks for both in this directory.

An example of its structure:

    data
        corporas
            YCOE
                info
                    YcoeTextInfo.htm
                    ...
                pos
                    coadrian.o34.pos
                    ...
                psd
                    coadrian.o34.psd
                    ...
            PPCEME
                info
                    description.html
                    ...
                pos
                    abott-e1-p1.pos
                    ...
                psd
                    abott-e1-p1.psd
                    ...
        output
            13-04-2017 18.46.36.333 YCOE
                prepositions
                    homilies_and_sermons
                        result.out
                        ...
                    ...
                ...
        searches
            queries
                YCOE
                    prepositions.q
                    ...
                    ...
                PENN
                    prepositions.q
                    ...
                    ...
            search.xml
            ...
        CS.jar

NOTE: Only the corpora and the CS.jar executable are mandatory in the data folder. The rest is just a suggested way to organize the .q search files and the output files.
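The requirement in the note can be checked before starting a long run. The following is a minimal sketch, not part of CorpusSearchAuto: the class name `DataFolderCheck` is hypothetical, and the `corporas` subfolder name is taken from the example layout above and is an assumption rather than a documented requirement.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CorpusSearchAuto.
class DataFolderCheck {

    /** Returns the problems found; an empty list means the layout looks usable. */
    static List<String> check(Path dataDir) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(dataDir)) {
            problems.add("missing data folder: " + dataDir);
            return problems;
        }
        // CS.jar is documented as mandatory inside the data folder.
        if (!Files.isRegularFile(dataDir.resolve("CS.jar"))) {
            problems.add("missing CS.jar in " + dataDir);
        }
        // Assumption: corpora live under "corporas", as in the example layout.
        if (!Files.isDirectory(dataDir.resolve("corporas"))) {
            problems.add("missing corpora folder: " + dataDir.resolve("corporas"));
        }
        return problems;
    }

    public static void main(String[] args) {
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "data");
        List<String> problems = check(dataDir);
        if (problems.isEmpty()) {
            System.out.println("data folder looks complete");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

Run it from the directory that contains the CorpusSearchAuto executable, or pass the path of the data folder as the first argument.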

The XML search file

CorpusSearchAuto requires an XML file that contains all the information needed to perform a bulk search across corpora. In that file you must specify the variables to search for (each associated with its corresponding .q file) and every corpus belonging to each corpora, classified by genre.

An example of an XML search file:

    <?xml version="1.0" encoding="UTF-8"?>
    <search xmlns="http://corpus.search.auto" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <corpora name="YCOE">
            <variables>
                <variable q-file="data\searches\queries\YCOE\adverbial_subordinators.q">Adverbial subordinators</variable>
                <variable q-file="data\searches\queries\YCOE\agentless_passives.q">Agentless passives</variable>
                <variable q-file="data\searches\queries\YCOE\attributive_adjetives.q">Attributive adjetives</variable>
                <variable q-file="data\searches\queries\YCOE\by_passives.q">By passives</variable>
                ...
            </variables>
            <genres>
                <genre name="Philosophy">
                    <corpus>coboeth.o2</corpus>
                    <corpus>codicts.o34</corpus>
                    ...
                </genre>
                <genre name="History">
                    <corpus>cobede.o2</corpus>
                    <corpus>cochronA.o23</corpus>
                    <corpus>cochronC</corpus>
                    ...
                </genre>
                ...
            </genres>
        </corpora>
        <corpora name="PPCEME">
            <variables>
                <variable q-file="data\searches\queries\PENN\adverbial_subordinators.q">Adverbial subordinators</variable>
                <variable q-file="data\searches\queries\PENN\agentless_passives.q">Agentless passives</variable>
                <variable q-file="data\searches\queries\PENN\attributive_adjetives.q">Attributive adjetives</variable>
                <variable q-file="data\searches\queries\PENN\by_passives.q">By passives</variable>
                ...
            </variables>
            <genres>
                <genre name="Philosophy">
                    <corpus>boethco-e1-h</corpus>
                    <corpus>boethco-e1-p1</corpus>
                    ...
                </genre>
                <genre name="History">
                    <corpus>burnetcha-e3-h</corpus>
                    <corpus>burnetcha-e3-p1</corpus>
                    <corpus>burnetcha-e3-p2</corpus>
                    ...
                </genre>
                ...
            </genres>
        </corpora>
        ...
    </search>
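To make the layout of the search file concrete, here is a minimal sketch that reads such a file with the JDK's DOM parser and lists each variable with its .q file. It is an illustration only and does not reproduce how CorpusSearchAuto itself parses the file. The class name `SearchFileReader` and the embedded `SAMPLE` document (a trimmed copy of the example, with the "..." placeholders removed) are assumptions made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader, not the parser used by CorpusSearchAuto.
class SearchFileReader {
    // Namespace copied from the example search file.
    static final String NS = "http://corpus.search.auto";

    // Trimmed-down copy of the example, without the "..." placeholders.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<search xmlns=\"http://corpus.search.auto\">"
        + "<corpora name=\"YCOE\">"
        + "<variables>"
        + "<variable q-file=\"data\\searches\\queries\\YCOE\\by_passives.q\">By passives</variable>"
        + "</variables>"
        + "<genres><genre name=\"Philosophy\"><corpus>coboeth.o2</corpus></genre></genres>"
        + "</corpora>"
        + "</search>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the namespace-based lookups below
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Maps each variable's display name to its q-file path for one <corpora> element. */
    static Map<String, String> variablesOf(Element corpora) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList variables = corpora.getElementsByTagNameNS(NS, "variable");
        for (int i = 0; i < variables.getLength(); i++) {
            Element variable = (Element) variables.item(i);
            result.put(variable.getTextContent().trim(), variable.getAttribute("q-file"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        NodeList corporaList = doc.getElementsByTagNameNS(NS, "corpora");
        for (int i = 0; i < corporaList.getLength(); i++) {
            Element corpora = (Element) corporaList.item(i);
            System.out.println(corpora.getAttribute("name") + " -> " + variablesOf(corpora));
        }
    }
}
```

A reader like this can be extended to confirm that every referenced .q file exists before a bulk search is started.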

Tools

genre-finder

Generates an XML file that lists every corpus of the specified corpora and classifies them by genre. Currently only the YCOE and PENN corpora are supported.

The resulting XML file is useful for creating the search file that the statistics-by-genres tool requires.

An example of the resulting XML would be:

    <?xml version="1.0" encoding="UTF-8"?>
    <genres xmlns="corpus.search.auto" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="corpus.search.auto genres-schema.xsd" corpora="YCOE">
        <genre name="religious treatise">
            <corpus>coadrian.o34</corpus>
            <corpus>coalcuin</corpus>
            <corpus>cocura.o2</corpus>
            ...
        </genre>
        <genre name="homilies">
            <corpus>coaelhom.o3</corpus>
            <corpus>coaugust</corpus>
            ...
        </genre>
        ...
    </genres>

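When the genre-finder output is used to prepare a search file, it can help to load it into a genre-to-corpus map first. The sketch below is one possible way to do that with the JDK's DOM parser. The class name `GenresFileReader` and the embedded `SAMPLE` document (the example output without its "..." placeholders and schema attributes) are assumptions for illustration. Elements are matched by local name with the `"*"` namespace wildcard, so the lookup does not depend on the exact namespace URI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the genre-finder output, for illustration only.
class GenresFileReader {
    // Trimmed-down copy of the example output, without the "..." placeholders.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<genres xmlns=\"corpus.search.auto\" corpora=\"YCOE\">"
        + "<genre name=\"religious treatise\">"
        + "<corpus>coadrian.o34</corpus><corpus>coalcuin</corpus><corpus>cocura.o2</corpus>"
        + "</genre>"
        + "<genre name=\"homilies\">"
        + "<corpus>coaelhom.o3</corpus><corpus>coaugust</corpus>"
        + "</genre>"
        + "</genres>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns genre name -> corpus names, in document order. */
    static Map<String, List<String>> genres(Document doc) {
        Map<String, List<String>> byGenre = new LinkedHashMap<>();
        NodeList genreNodes = doc.getElementsByTagNameNS("*", "genre");
        for (int i = 0; i < genreNodes.getLength(); i++) {
            Element genre = (Element) genreNodes.item(i);
            List<String> corpusNames = new ArrayList<>();
            NodeList corpusNodes = genre.getElementsByTagNameNS("*", "corpus");
            for (int j = 0; j < corpusNodes.getLength(); j++) {
                corpusNames.add(corpusNodes.item(j).getTextContent().trim());
            }
            byGenre.put(genre.getAttribute("name"), corpusNames);
        }
        return byGenre;
    }

    public static void main(String[] args) throws Exception {
        genres(parse(SAMPLE)).forEach((genre, corpusNames) ->
                System.out.println(genre + ": " + corpusNames));
    }
}
```
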
Usage

    CorpusSearchAuto -tool genre-finder -corpora [CORPORA] -in [PATH_OF_INFO_FILE] (-out [PATH_OF_OUT_FILE])

Parameters

| Name | Mandatory | Default value | Description |
|------|-----------|---------------|-------------|
| corpora | Yes | - | The corpora to be used, for example: YCOE |
| in | Yes | - | The corpora HTML file specifying the genres to which each corpus belongs, for example: “data\corporas\YCOE\info\YcoeTextInfo.htm” |
| out | No | in path | Path of the resulting XML file |

Examples

    CorpusSearchAuto -tool genre-finder -corpora YCOE -in "data\corporas\YCOE\info\YcoeTextInfo.htm"

    CorpusSearchAuto -tool genre-finder -corpora PPCEME -in "data\corporas\PPCEME\info\description.html"

    CorpusSearchAuto -tool genre-finder -corpora PPCME2 -in "data\corporas\PPCME2\info\description.html" -out "C:\genres.xml"

statistics-by-genres

Processes the XML search file (described above) and generates one or more output files with the results in the format(s) specified.

Usage

    CorpusSearchAuto -tool statistics-by-genres -in [PATH_OF_SEARCH_FILE] (-show-only [all|hits|tokens|total] -out-format [all|html|excel] -out [PATH_OF_OUT_FOLDER])

Parameters

| Name | Mandatory | Default value | Description |
|------|-----------|---------------|-------------|
| in | Yes | - | Path of the XML search file, for example: “data/searches/search.xml” |
| show-only | No | all | What should be displayed in the output file(s) (hits, tokens, total or all) |
| out-format | No | all | What format the output file(s) should have (html, excel or all) |
| out | No | in path | Path to the folder where the output files will be saved |

Examples

    CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml"

    CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml" -show-only hits

    CorpusSearchAuto -tool statistics-by-genres -in "data\searches\search.xml" -show-only hits -out-format excel -out "C:\search_output"
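For scripted runs, the command line can be assembled and validated in code before it is launched. The sketch below only builds the argument list from the documented flags and their allowed values. The class name `StatisticsCommand` is hypothetical, and it assumes the executable can be invoked as `CorpusSearchAuto` from the working directory. The resulting list could be handed to `java.lang.ProcessBuilder`, which the sketch deliberately does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a statistics-by-genres command line.
class StatisticsCommand {
    // Allowed values, taken from the usage line of the tool.
    static final List<String> SHOW_ONLY = Arrays.asList("all", "hits", "tokens", "total");
    static final List<String> OUT_FORMAT = Arrays.asList("all", "html", "excel");

    /** Builds the argument list; optional arguments are skipped when null. */
    static List<String> build(String searchFile, String showOnly, String outFormat, String outFolder) {
        if (searchFile == null || searchFile.isEmpty()) {
            throw new IllegalArgumentException("-in is mandatory");
        }
        // Assumption: the executable is reachable under this name.
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "CorpusSearchAuto", "-tool", "statistics-by-genres", "-in", searchFile));
        if (showOnly != null) {
            if (!SHOW_ONLY.contains(showOnly)) {
                throw new IllegalArgumentException("unsupported -show-only value: " + showOnly);
            }
            cmd.add("-show-only");
            cmd.add(showOnly);
        }
        if (outFormat != null) {
            if (!OUT_FORMAT.contains(outFormat)) {
                throw new IllegalArgumentException("unsupported -out-format value: " + outFormat);
            }
            cmd.add("-out-format");
            cmd.add(outFormat);
        }
        if (outFolder != null) {
            cmd.add("-out");
            cmd.add(outFolder);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("data\\searches\\search.xml", "hits", "excel", null);
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the arguments in a list rather than one concatenated string avoids quoting problems with paths that contain spaces, such as the timestamped output folders.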

License

MIT