EE448 Big Data Mining Project: Query Expansion with Rocchio Algorithm & Document Ranking with BM25 Score
Ranks the documents for each query, using the BM25 score in the document ranking phase and the Rocchio algorithm in the query expansion phase.
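To make the query expansion phase more concrete, here is a minimal sketch of a Rocchio-style update. It is not the code in util.py: the function name `rocchio_expand`, the parameter `n_expansion_terms`, the raw term-frequency weighting, and the `alpha` and `beta` defaults are all assumptions chosen for illustration. Documents judged relevant pull the query toward their terms (weighted by `beta`), and documents judged non-relevant push it away (weighted by `gamma`).

```python
from collections import Counter


def rocchio_expand(query_terms, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15, n_expansion_terms=5):
    """Return the original query terms plus the top-weighted new terms.

    Each document is a list of tokens. Term weights are raw term
    frequencies averaged over each document set (an assumption of this
    sketch; a real system might use TF-IDF weights instead).
    """
    weights = Counter()

    # Original query contribution, scaled by alpha.
    for term, tf in Counter(query_terms).items():
        weights[term] += alpha * tf

    # Positive feedback adds the relevant centroid (beta);
    # negative feedback subtracts the non-relevant centroid (gamma).
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs or coeff == 0:
            continue
        for doc in docs:
            for term, tf in Counter(doc).items():
                weights[term] += coeff * tf / len(docs)

    # Keep only new terms with a positive weight, best first
    # (ties broken alphabetically so the result is deterministic).
    original = set(query_terms)
    candidates = [(t, w) for t, w in weights.items()
                  if t not in original and w > 0]
    candidates.sort(key=lambda tw: (-tw[1], tw[0]))

    return list(query_terms) + [t for t, _ in candidates[:n_expansion_terms]]
```

With `gamma=0` the non-relevant documents are ignored entirely, so only positive feedback shapes the expanded query. Lowering `n_expansion_terms` limits how many new words are appended, which matters most when documents are short and offer few reliable candidate terms.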
Create a folder named data and put the query and document text files (query.txt and doc.txt) in the ./data folder.
Run EE448.ipynb to visualize the output.
The ranked documents are written to ./data/bm25_score.txt
Python >= 3.0
You can get the dataset here or use your own data.
./data/query.txt: query_id \t query_text
./data/doc.txt: document_id \t document_text
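If you bring your own data, a small loader like the one below can check that the files follow the tab-separated layout described above. This is a sketch only: the helper names `parse_tsv_lines` and `load_tsv` are hypothetical and are not part of this project, and the UTF-8 encoding is an assumption.

```python
def parse_tsv_lines(lines):
    """Parse `id<TAB>text` lines into a dict, skipping lines without a tab."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # blank or malformed line
        item_id, text = line.split("\t", 1)  # split on the first tab only
        records[item_id.strip()] = text.strip()
    return records


def load_tsv(path):
    """Read a file such as ./data/query.txt or ./data/doc.txt (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return parse_tsv_lines(handle)
```

For example, `load_tsv("./data/query.txt")` would return a mapping from each query_id to its query_text.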
Set the number of expansion words (util.py/findNewQuery/loopRange) to a different value. If the documents are short, set loopRange to a smaller value.
Set k2 in score.py/bm25 to a larger value.
Set GAMMA to 0.15 to enable both positive and negative feedback, or to 0 to enable positive feedback only.
You can try a different scoring function, such as TF-IDF, to rank documents in score.py
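The scoring tips above can be illustrated with a small, self-contained BM25 sketch. It is not the code in score.py: the names `bm25_score` and `rank`, the default values of `k1`, `k2`, and `b`, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are assumptions made for this example. The `k2` factor scales the contribution of terms that occur more than once in the query, which can become relevant when an expanded query repeats terms. The corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.2, k2=100.0, b=0.75):
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc_terms)
    qf = Counter(query_terms)
    # Document-length normalisation term, often written K.
    length_norm = k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, q_count in qf.items():
        f = tf.get(term, 0)
        if f == 0:
            continue  # term absent from this document
        n_t = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        doc_part = (k1 + 1) * f / (length_norm + f)
        query_part = (k2 + 1) * q_count / (k2 + q_count)
        score += idf * doc_part * query_part
    return score


def rank(query_terms, docs, score_fn=bm25_score, **params):
    """Rank docs (doc_id -> token list); returns [(doc_id, score)], best first."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for d in docs.values() for t in set(d))
    scored = [(doc_id, score_fn(query_terms, d, doc_freq, n_docs, avg_len, **params))
              for doc_id, d in docs.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```

Because `rank` takes the scoring function as an argument, a TF-IDF scorer with the same signature could be passed as `score_fn` without changing the ranking loop.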