Project author: vrjkmr
Project description: Detecting topic clusters in arXiv ML papers.
Primary language: Jupyter Notebook
Repository: git://github.com/vrjkmr/arxiv-topic.git
Created: 2020-09-15T03:59:16Z
Project community: https://github.com/vrjkmr/arxiv-topic

ArXiv Topic Modeling

This repository contains the code for a Latent Dirichlet Allocation (LDA) topic model built and trained on the abstracts of ~160,000 ML-related research papers from the ArXiv.org dataset on Kaggle.
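For context, the core of such a pipeline can be expressed in a few lines of Gensim. The sketch below is illustrative rather than the repository's actual training code; `tokenized_abstracts` (a list of token lists, one per abstract) and the hyperparameter values are assumptions:

  from gensim.corpora import Dictionary
  from gensim.models import LdaModel

  # Assumes `tokenized_abstracts` is a list of token lists, one per abstract
  dictionary = Dictionary(tokenized_abstracts)
  dictionary.filter_extremes(no_below=20, no_above=0.5)  # prune rare/common terms
  corpus = [dictionary.doc2bow(doc) for doc in tokenized_abstracts]

  # Hyperparameters here are placeholders, not the repository's tuned values
  lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                 passes=10, random_state=42)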

The model can be used to detect topics in any machine-learning-related arXiv paper. To illustrate, below is an example of the model predicting the topics present in the seminal paper "Attention Is All You Need" by Vaswani et al. (2017).

  Paper
  -----
  "Attention Is All You Need" by Vaswani et al. (2017)

  Abstract
  --------
  The dominant sequence transduction models are based on complex recurrent or
  convolutional neural networks in an encoder-decoder configuration. The best
  performing models also connect the encoder and decoder through an attention
  mechanism. We propose a new simple network architecture, the Transformer, based
  solely on attention mechanisms, dispensing with recurrence and convolutions
  entirely. Experiments on two machine translation tasks show these models to be
  superior in quality while being more parallelizable and requiring significantly
  less time to train. Our model achieves 28.4 BLEU on the WMT 2014
  English-to-German translation task, improving over the existing best results,
  including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
  translation task, our model establishes a new single-model state-of-the-art
  BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
  of the training costs of the best models from the literature. We show that the
  Transformer generalizes well to other tasks by applying it successfully to
  English constituency parsing both with large and limited training data.

  Predicted topics
  ----------------
  [('Deep learning', 0.7827023),
   ('Natural language processing', 0.18202062),
   ('ML-related terms?', 0.022977384)]

Motivation

You know how, when reading a research paper, the first thing we read is the abstract? The abstract gives us (as humans) a general sense of the topics explored in the paper. But what if we could train an unsupervised model to "categorize" papers for us automatically?

In this project, my ultimate goal was to build a clustering model that could:

  1. Identify salient trends and sub-topics in machine learning research today, and
  2. Automatically predict the topic(s) explored in any given paper simply by looking at its abstract.

Project structure

This project is organized as follows.

  .
  ├── dataset.py                      # script containing the dataset class
  ├── model.py                        # script containing the topic model class
  ├── preprocess.py                   # script containing the text preprocessor class
  ├── utils.py                        # script containing helper functions
  ├── notebooks/
  │   ├── Dataset preparation.ipynb   # notebook to build the arXiv abstracts dataset
  │   ├── Inference.ipynb             # notebook illustrating how to predict topics of papers
  │   └── Training.ipynb              # notebook to train and tune LDA models
  └── README.md

Results: Topics

The final model achieved a c_v coherence score of 50.2%. While this score is on the low side (ideal c_v coherence scores tend to fall around 60-75%), the model was still able to detect some interesting topic clusters, a few of which are listed below; a sketch of how the coherence score is computed follows the list. Note that while the topic terms were generated by the LDA model, the topic titles themselves are subjective: I added them after inspecting each topic's term distribution.

  1. Natural language processing: “text”, “knowledge”, “language”, “information”, “semantic”
  2. Probability and inference: “model”, “distribution”, “inference”, “bayesian”, “parameter”
  3. Computer vision: “image”, “object”, “segmentation”, “detection”, “video”
  4. Recommendation systems: “user”, “group”, “item”, “preference”, “product”
  5. Sequences and time-series: “time”, “dynamic”, “series”, “sequence”, “temporal”
  6. Reinforcement learning: “agent”, “policy”, “environment”, “game”, “action”
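For reference, a c_v coherence score like the one reported above can be computed with Gensim's CoherenceModel. This is a minimal sketch; `lda`, `tokenized_abstracts`, and `dictionary` are assumed to come from a training setup like the one sketched earlier:

  from gensim.models import CoherenceModel

  # `texts` must be the tokenized abstracts, not bag-of-words vectors
  coherence_model = CoherenceModel(model=lda, texts=tokenized_abstracts,
                                   dictionary=dictionary, coherence="c_v")
  print(coherence_model.get_coherence())  # e.g. 0.502 for the final model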

Model inference

To predict which topics might be present in any paper on arXiv, simply build a TopicModel object, scrape the paper's abstract, and pass the raw text to the model's predict() method. The output is a list of (topic name, likelihood) tuples, ordered by how strongly each topic is present in the paper.

  from model import TopicModel
  from utils import scrape_arxiv_abstract

  lda_filepath = "./models/model_001"
  dataset_filepath = "./data/dataset.obj"
  topic_model = TopicModel(lda_filepath, dataset_filepath)

  # Optional: set topic names
  topic_model.set_topic_names([...])

  # Paper: "Personalized Re-ranking for Recommendation" by Pei et al. (2019)
  paper_url = "https://arxiv.org/abs/1904.06813"
  abstract = scrape_arxiv_abstract(paper_url)
  predictions = topic_model.predict(abstract)
  print(predictions)

  '''
  Output
  ------
  [('Recommendation systems', 0.32558212),
   ('Deep learning', 0.17530766),
   ('Paper-related terms?', 0.16065647)]
  '''
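Under the hood, a predict() method like this presumably maps the raw abstract onto the trained LDA model's topic distribution. The sketch below shows one plausible implementation; it is not the repository's actual code, and simple_preprocess stands in for the project's own preprocessor class:

  from gensim.utils import simple_preprocess

  def predict_topics(lda, dictionary, topic_names, raw_text):
      # Tokenize the raw abstract (the real project uses its own preprocessor)
      tokens = simple_preprocess(raw_text)
      bow = dictionary.doc2bow(tokens)
      # Query the trained LDA model for this document's topic mixture
      topics = lda.get_document_topics(bow, minimum_probability=0.01)
      # Rank topics by likelihood and attach human-readable names
      ranked = sorted(topics, key=lambda t: t[1], reverse=True)
      return [(topic_names[i], prob) for i, prob in ranked]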

Acknowledgements

  • Radim Řehůřek’s tips on building Gensim LDA models
  • Cornell University’s arXiv.org dataset hosted on Kaggle