Implementing a fake news detector. Comparing different ML algorithms and NLP strategies.
In `hyperparameter_optimization/randomized_search.py`, line 44, change the value of the variable `n_iter_search`, which determines the number of combinations that are tested in the parameter search, to a value of your choice. The lower this value, the faster the script runs, but the fewer values are tested.
Then, execute the script `./run-project.sh`. A dataset for the submission has already been generated by the preprocessing script and placed in the corresponding output folder. It is not necessary to change anything here!
Run `pip3 install -r requirements.txt` in the `setup` folder. This installs the required Python libraries.

[...] `preprocessing/raw_data` directory.

Run `python3 generate_dataset.py`. This generates a dataset with the above mentioned [...]

Run `python3 generate_submission_dataset.py`. This generates a dataset with the [...]

python3 hashing_vectorizer.py -l <lower bound> -u <upper bound>
python3 ngram_vectorizer.py -l <lower bound> -u <upper bound> [-t]
Parameter | Values |
---|---|
lower_bound | tested with 1..3 |
upper_bound | tested with 1..3 |
-t | use term frequency–inverse document frequency (only for ngrams) |
In the experiment, `lower_bound` and `upper_bound` were set equal. The implementations are based on sklearn's `HashingVectorizer` and `CountVectorizer`.
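To illustrate what the bounds control, here is a minimal sketch of how they could map onto sklearn's `ngram_range` parameter. The sample documents, the value of `n_features`, and the use of `TfidfVectorizer` for the `-t` option are assumptions made for this example; the project's scripts may be wired differently.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy documents, invented for this example only.
docs = [
    "breaking news shocks the world",
    "local news reports calm weather",
]

# Equal bounds, as in the experiment: only bigrams are extracted.
lower_bound, upper_bound = 2, 2

# Plain n-gram counts.
count_vec = CountVectorizer(ngram_range=(lower_bound, upper_bound))
count_features = count_vec.fit_transform(docs)

# TF-IDF weighted n-grams (assumed equivalent of the -t option).
tfidf_vec = TfidfVectorizer(ngram_range=(lower_bound, upper_bound))
tfidf_features = tfidf_vec.fit_transform(docs)

# Hashed n-grams: no vocabulary is stored, so the width is fixed up front.
hashing_vec = HashingVectorizer(
    ngram_range=(lower_bound, upper_bound),
    n_features=2**10,  # assumed size, chosen small for the example
)
hashing_features = hashing_vec.transform(docs)

print(sorted(count_vec.vocabulary_))
print(count_features.shape, tfidf_features.shape, hashing_features.shape)
```

The hashing variant needs no fitted vocabulary, which keeps memory use bounded, while the count and TF-IDF variants keep a readable mapping from n-gram to column.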
To vectorize the training set and store the fitted vectorizers so that the feature vectors can later be regenerated for the test set, use `hashing_vectorizer_training.py` and `ngram_vectorizer_training.py` instead, with the same parameters.
To optimize the parameters for the machine learning algorithms in the experiment, run the hyperparameter optimization script. For this, all features for the validation set must already have been generated as explained above! In line 44 of the script, you can set the number of configurations that should be tested. A higher number may give better results, but also takes longer.
Then, run `python3 randomized_search.py`.
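The trade-off behind the number of tested configurations can be sketched with sklearn's `RandomizedSearchCV`. This is an illustration under assumptions, not the project's script: the synthetic data, the `LogisticRegression` classifier, the search space for `C`, the `f1` scoring, and the 3-fold cross-validation are all chosen for the example. Only the name `n_iter_search` is taken from the text above.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the generated feature vectors (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Number of sampled parameter combinations: lower is faster, higher is more thorough.
n_iter_search = 5

# Hypothetical search space: regularization strength drawn from [0.01, 10.01].
param_distributions = {"C": uniform(loc=0.01, scale=10.0)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=n_iter_search,
    scoring="f1",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Each additional iteration costs one more round of model fits, so the runtime grows roughly linearly with `n_iter_search`.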
This step creates the vectors for the test set that is used in the final assessment of the algorithms. It only works if the training set has already been vectorized.
Then, run both `python3 hashing_revectorizer.py` and `python3 ngram_revectorizer.py` with the same parameter settings that are mentioned in this step.
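The reason the training set must be vectorized first is that the test set has to be projected onto the vocabulary learned from the training data. A minimal sketch of that round trip follows, assuming the fitted vectorizer is stored with `pickle`; the file name and the toy documents are hypothetical, and the project's scripts may persist their vectorizers differently.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for this example only.
train_docs = ["fake story spreads fast", "real story stays accurate"]
test_docs = ["fake story stays online"]

with tempfile.TemporaryDirectory() as tmp:
    vectorizer_path = Path(tmp) / "ngram_vectorizer.pkl"  # hypothetical file name

    # Training phase: learn the vocabulary and store the fitted vectorizer.
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    train_features = vectorizer.fit_transform(train_docs)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Revectorization phase: reload and only transform, never refit.
    with open(vectorizer_path, "rb") as f:
        restored = pickle.load(f)
    test_features = restored.transform(test_docs)

# Both matrices share the same columns; unseen test words are dropped.
print(train_features.shape, test_features.shape)
```

Calling `transform` instead of `fit_transform` on the test set keeps the feature columns identical to those seen during training, so a trained classifier can consume them directly.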
The classifier performance is evaluated with the F1 score. The optimized parameters that were obtained in the hyperparameter optimization step must be configured in code. The output is printed directly on the console.
Run `experiment.py` with the vectorized training and test sets ready in the `data` folders in `feature_generation` and `feature_regeneration` respectively. This should have happened automatically.
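For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a small set of invented labels and compares the result with sklearn's `f1_score`; the labels and the choice of `1` as the positive class are assumptions for this example.

```python
from sklearn.metrics import f1_score

# Invented labels: 1 is treated as the positive class.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

# The library call should agree with the manual computation.
library_f1 = f1_score(y_true, y_pred)

print(manual_f1, library_f1)
```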