Project author: PathwayCommons

Project description:
A simple semantic search engine for scientific papers.
Language: Python
Repository: git://github.com/PathwayCommons/semantic-search.git
Created: 2020-05-07T18:22:26Z
Project community: https://github.com/PathwayCommons/semantic-search

License: MIT License

Scientific Semantic Search

A simple semantic search engine for scientific papers. Check out our demo here.

Installation

This repository requires Python 3.7 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.

Installing the library and dependencies

If you don't plan on modifying the source code, install from git using pip:

  pip install git+https://github.com/PathwayCommons/semantic-search.git

Otherwise, clone the repository locally and then install it in editable mode:

  git clone https://github.com/PathwayCommons/semantic-search.git
  cd semantic-search
  pip install --editable .

Finally, if you would like to take advantage of a CUDA-enabled GPU, you must also install PyTorch with CUDA support by following the instructions for your system here.

Usage

To start up the server:

  uvicorn semantic_search.main:app

During development, you can pass the --reload flag to make the server reload whenever the code changes.

To provide arguments to the server, pass them as environment variables, e.g.:

  CUDA_DEVICE=0 MAX_LENGTH=384 uvicorn semantic_search.main:app
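
The sketch below illustrates one way a server process could pick up variables such as these. It is not the project's own code: the function name read_settings and the decision to return None for unset variables are assumptions made for illustration, and the project's real defaults are not shown here.

```python
import os
from typing import Dict, Mapping, Optional


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[int]]:
    """Read CUDA_DEVICE and MAX_LENGTH from an environment mapping.

    Hypothetical helper: unset or empty variables come back as None so the
    caller decides which default to apply.
    """

    def as_int(name: str) -> Optional[int]:
        raw = env.get(name)
        return int(raw) if raw not in (None, "") else None

    return {
        "cuda_device": as_int("CUDA_DEVICE"),
        "max_length": as_int("MAX_LENGTH"),
    }


# Mirrors the command line above: CUDA_DEVICE=0 MAX_LENGTH=384
print(read_settings({"CUDA_DEVICE": "0", "MAX_LENGTH": "384"}))
# prints {'cuda_device': 0, 'max_length': 384}
```

Passing the mapping explicitly, rather than always reading os.environ, keeps the helper easy to test.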

Once the server is running, you can make a POST request to the /search endpoint with a JSON body. For example:

  {
    "query": {
      "uid": "9887103",
      "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development."
    },
    "documents": [
      {
        "uid": "10320478",
        "text": "Drosophila dSmad2 and Atr-I transmit activin/TGFbeta signals. "
      },
      {
        "uid": "22563507",
        "text": "R-Smad competition controls activin receptor output in Drosophila. "
      },
      {
        "uid": "18820452",
        "text": "Distinct signaling of Drosophila Activin/TGF-beta family members. "
      },
      {
        "uid": "10357889"
      },
      {
        "uid": "31270814"
      }
    ],
    "top_k": 3
  }

The return value is a JSON array of the top_k most similar documents (top_k defaults to 10), each with its uid and score:

  [
    {
      "uid": "10320478",
      "score": 0.6997108459472656
    },
    {
      "uid": "22563507",
      "score": 0.6877762675285339
    },
    {
      "uid": "18820452",
      "score": 0.6436074376106262
    }
  ]
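
To make the shape of this result concrete, here is a minimal sketch of top_k ranking over embedding vectors. It assumes cosine similarity as the score and uses toy 2-d vectors in place of real text embeddings; the service's actual model, index and similarity measure are not described here, and the names cosine and rank are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[dict]:
    """Score every document against the query and keep the top_k best."""
    scored = [
        {"uid": uid, "score": cosine(query_vec, vec)}
        for uid, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumption for illustration).
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = rank([1.0, 0.1], docs, top_k=2)
print([hit["uid"] for hit in hits])  # prints ['a', 'c']
```

Each hit has the same "uid" and "score" keys as the response above, sorted from most to least similar.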

If "text" is not provided for an entry, we assume its "uid" is a valid PMID (PubMed identifier) and fetch the title and abstract text before embedding, indexing and searching.

  • Notes on optional parameters
    • top_k: A positive integer (default is 10) that limits the search results to this many of the most similar neighbours (articles)
    • docs_only: A boolean (default is False) that instructs the service to return scores for the provided documents. If true, top_k is disregarded.
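
The request described above can be sketched as a small Python client using only the standard library. The helper names build_payload and search are hypothetical, the base URL assumes uvicorn's default local address, and placing docs_only at the top level of the body next to top_k is an assumption based on the example request.

```python
import json
import urllib.request
from typing import List, Optional


def build_payload(
    query: dict,
    documents: List[dict],
    top_k: Optional[int] = None,
    docs_only: Optional[bool] = None,
) -> dict:
    """Assemble the JSON body for /search, omitting unset optional fields."""
    payload = {"query": query, "documents": documents}
    if top_k is not None:
        payload["top_k"] = top_k
    if docs_only is not None:
        # Assumption: docs_only sits at the top level, like top_k.
        payload["docs_only"] = docs_only
    return payload


def search(payload: dict, base_url: str = "http://127.0.0.1:8000") -> list:
    """POST the payload to /search and decode the JSON response."""
    request = urllib.request.Request(
        base_url + "/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    query={
        "uid": "9887103",
        "text": "The Drosophila activin receptor baboon signals through dSmad2 "
        "and controls cell proliferation but not patterning during larval development.",
    },
    documents=[
        {"uid": "10320478", "text": "Drosophila dSmad2 and Atr-I transmit activin/TGFbeta signals. "},
        {"uid": "10357889"},  # no "text": the service fetches it by PMID
    ],
    top_k=3,
)
```

With the server running, calling search(payload) should return a list of objects with "uid" and "score" keys, as in the example response.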

Running via Docker

Setup

If you intend to use a CUDA-enabled GPU, you must also install the NVIDIA Container Toolkit on the host, following the instructions for your system here.

For Ubuntu 18.04:

  curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list |\
    sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
  sudo apt-get update
  sudo apt-get install nvidia-container-runtime

Restart Docker

  sudo systemctl stop docker
  sudo systemctl start docker

Check your install

  docker run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi

Running a container

First, build the docker image:

  docker build -t semantic-search .

Then, run it, replacing <PORT> with the host port you want to map to the container's port 8000:

  docker run -it -p <PORT>:8000 semantic-search

For a CUDA-enabled GPU:

  docker run --gpus all -dt --rm --name semantic_container -p 8000:8000 --env CUDA_DEVICE=0 --env MAX_LENGTH=384 semantic-search:latest

Documentation

With the web server running, open http://127.0.0.1:8000/redoc in your browser for the API documentation.

For contributing guidelines, see CONTRIBUTING.md.