Language Detector developed in Ruby + WikiScraper
This is a simple implementation of a Language Detector in Ruby. It uses n-grams to build language models, and then approximates the compatibility of the input with each model to predict the language.
You can read about the n-grams here.
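As a rough illustration of the idea, the sketch below splits a text into overlapping character n-grams and turns their counts into relative frequencies. The helper names `char_ngrams` and `build_model` are hypothetical and the normalization (lowercasing, collapsing whitespace) is an assumption; buildModel.rb may work differently.

```ruby
# Hypothetical helper: split text into overlapping character n-grams.
# Lowercasing and whitespace collapsing are assumptions, not taken from buildModel.rb.
def char_ngrams(text, n = 3)
  normalized = text.downcase.gsub(/\s+/, " ").strip
  normalized.each_char.each_cons(n).map(&:join)
end

# Hypothetical helper: count n-grams and convert counts to relative frequencies.
def build_model(text, n = 3)
  counts = Hash.new(0)
  char_ngrams(text, n).each { |gram| counts[gram] += 1 }
  total = counts.values.sum.to_f
  counts.transform_values { |count| count / total }
end

model = build_model("the cat and the hat", 3)
# Print the three most frequent 3-grams with their relative frequencies.
puts model.max_by(3) { |_, freq| freq }.inspect
```

A text shorter than n characters yields no n-grams, so its model is an empty hash.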
First, you have to build language models.
ruby buildModel.rb
The program automatically builds 3-gram models for every language in the trainData folder (file names should be the language names).
You can also run this with an argument to build any n-gram model. For example, to build 4-grams:
ruby buildModel.rb 4
To detect the language of an input text (e.g. testData/english.txt):
ruby detectLanguage.rb english.txt
Your input text should be located in the testData folder.
Without an extra parameter, the program uses 3-grams. To run it with custom n-grams, use:
ruby detectLanguage.rb english.txt 4
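The detection step can be pictured as scoring the input's n-gram profile against each language model and picking the best match. The sketch below uses cosine similarity as the score, which is an assumption; detectLanguage.rb may use a different measure. The names `ngram_profile`, `cosine_similarity` and `detect_language`, and the tiny in-memory models, are hypothetical.

```ruby
# Hypothetical helper: count character n-grams in a text.
def ngram_profile(text, n = 3)
  counts = Hash.new(0)
  text.downcase.each_char.each_cons(n) { |gram| counts[gram.join] += 1 }
  counts
end

# Cosine similarity between two n-gram count hashes (0.0 when either is empty).
# Assumption: the real scoring in detectLanguage.rb may differ.
def cosine_similarity(a, b)
  dot = a.sum { |gram, count| count * b.fetch(gram, 0) }
  norm_a = Math.sqrt(a.values.sum { |c| c * c })
  norm_b = Math.sqrt(b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Return the name of the model most similar to the input text.
def detect_language(text, models, n = 3)
  input = ngram_profile(text, n)
  best = models.max_by { |_, model| cosine_similarity(input, model) }
  best && best.first
end

# Toy models built from single sentences; real models would come from trainData.
models = {
  "english" => ngram_profile("the quick brown fox jumps over the lazy dog"),
  "german"  => ngram_profile("der schnelle braune fuchs springt ueber den faulen hund")
}

puts detect_language("the dog and the fox", models)
```

The models and the input must be built with the same n, since 3-gram and 4-gram profiles share no keys.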
WikiScraper is a program that lets you build text corpora scraped from Wikipedia articles, using Nokogiri and httparty. By default it is set to https://en.wikipedia.org/wiki/Earth, but you can change it to any Wikipedia page (it should be an English one).
To start the scraper:
ruby wikiScraper.rb
The program runs in an infinite loop. You have plenty of options to use:
Note: The scraper supports ISO-8859-1 encoding.
The file “demo.sh” contains a script that runs a demonstration. If you want to see how all the programs work without running each one manually, simply run the script and follow the instructions.
Demo instructions: