Project author: 0xsuid
Project description:
Text mining on a CBC News COVID-19 article - automatically summarize the given text using spaCy & Python.
Language: Jupyter Notebook
Project URL: git://github.com/0xsuid/text-summerizer.git
Text Summerizer
Text mining on a CBC News article with Natural Language Processing (NLP) - automatically summarize the given text using spaCy & Python.
Requirements
- spaCy
- spaCy model (en_core_web_sm), downloaded with:
python -m spacy download en_core_web_sm
Overview
- Convert the input text to a list of sentences, then count the number of sentences in the text.
- Calculate the frequency of words in each sentence:
- The output is a dictionary where each key is a sentence and each value is a dictionary of word frequencies for that sentence.
- Calculate Term frequency for each word in a sentence:
TF(word) = (Number of times term “word” appears in a sentence) / (Total number of terms in the sentence)
- Create a matrix termFrequency:
- The termFrequency matrix is a dictionary where each key is a sentence and each value is a dictionary mapping every word in that sentence to its TF.
- For each word, compute how many sentences contain that word.
- Calculate the inverse document frequency (IDF) for each word, treating each sentence as a document:
IDF(word) = log_e(Total number of sentences / Number of sentences containing the term “word”)
- Compute the TF-IDF for each word in each sentence.
- Use the TF-IDF values from the previous step to assign a weight to each sentence.
- Threshold: compute the average sentence weight.
- Generate the summary: select a sentence for the summary if its weight exceeds the threshold.
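The steps above can be sketched end to end in plain Python. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, a simple regex splitter and tokenizer stand in for spaCy so the example has no dependencies, and the sentence weight is assumed to be the average TF-IDF of the sentence's distinct words, since the overview does not say how word scores are combined.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # Assumption: this stands in for spaCy's sentence segmentation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens; no stop-word removal or lemmatization.
    return re.findall(r"[a-z']+", sentence.lower())


def term_frequency(sentences):
    # TF(word) = occurrences of word in the sentence / total terms in the sentence
    tf = {}
    for sent in sentences:
        words = tokenize(sent)
        counts = Counter(words)
        tf[sent] = {w: c / len(words) for w, c in counts.items()}
    return tf


def inverse_document_frequency(tf):
    # IDF(word) = ln(total sentences / sentences containing word)
    n = len(tf)
    sentence_counts = Counter(w for freqs in tf.values() for w in freqs)
    return {w: math.log(n / c) for w, c in sentence_counts.items()}


def score_sentences(tf, idf):
    # Assumption: sentence weight = mean TF-IDF over its distinct words.
    scores = {}
    for sent, freqs in tf.items():
        total = sum(f * idf[w] for w, f in freqs.items())
        scores[sent] = total / len(freqs) if freqs else 0.0
    return scores


def summarize(text):
    sentences = split_sentences(text)
    tf = term_frequency(sentences)
    idf = inverse_document_frequency(tf)
    scores = score_sentences(tf, idf)
    if not scores:
        return ""
    # Threshold is the average sentence weight; keep sentences above it.
    threshold = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores[s] > threshold)
```

To use spaCy for the first step instead, split_sentences could load the model with spacy.load("en_core_web_sm") and return [s.text.strip() for s in nlp(text).sents]; the remaining functions would not need to change.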