Project author: cgoliver

Project description:
Tools for discovering flexible motifs in RNA Graphs.
Language: Python
Repository: git://github.com/cgoliver/vernal.git
Created: 2020-08-20T18:43:23Z
Project page: https://github.com/cgoliver/vernal


vernal: Fuzzy Recurrent Subgraph Mining

This is a reference implementation of veRNAl, an algorithm for identifying fuzzy recurrent subgraphs in RNA 3D networks.

Please cite:

  @article{vernal,
    author = {Oliver, Carlos and Mallet, Vincent and Philippopoulos, Pericles and Hamilton, William L and Waldispühl, Jérôme},
    title = "{VeRNAl: A Tool for Mining Fuzzy Network Motifs in RNA}",
    journal = {Bioinformatics},
    year = {2021},
    month = {11},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab768},
    url = {https://doi.org/10.1093/bioinformatics/btab768},
    note = {btab768},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab768/41153095/btab768.pdf},
  }

See the full paper for a complete description of the algorithm.

You can browse the results from an already trained model here.

This repository has three main components:

  • Preparing Data /prepare_data
  • Subgraph Embeddings /train_embeddings
  • Motif Building /build_motifs

Each subdirectory contains a main.py file that controls the behaviour of that stage.
For full usage, run python <dir>/main.py -h

0. Install Dependencies

The main packages we use are:

  • multiset
  • NetworkX
  • BioPython
  • PyTorch
  • DGL (Deep Graph Library)
  • scikit-learn

The commands below install the full list of dependencies:

  conda env create -f environment.yml
  conda activate vernal
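
After activating the environment, it can be useful to confirm that the main packages listed above are importable. The snippet below is a small sanity check and not part of the repository; the helper name check_dependencies is hypothetical. Note that the import name can differ from the package name (BioPython is imported as Bio, scikit-learn as sklearn).

```python
from importlib.util import find_spec

# Import names for the main packages listed above.
MODULES = {
    "multiset": "multiset",
    "NetworkX": "networkx",
    "BioPython": "Bio",
    "PyTorch": "torch",
    "DGL": "dgl",
    "scikit-learn": "sklearn",
}


def check_dependencies(modules=MODULES):
    """Return {package name: bool} telling whether each module can be found.

    Hypothetical helper: it only checks that the module is locatable,
    it does not import it or verify its version.
    """
    return {name: find_spec(mod) is not None for name, mod in modules.items()}


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name:<14}{'ok' if ok else 'MISSING'}")
```

Any package reported as MISSING suggests the conda environment was not created or activated correctly.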

1. Data Preparation

This step loads the whole PDB structures, splits them into uniformly sized chunks (chopper.py) and builds
networkx graphs for each chunk.
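
To give an idea of what chopping a structure into roughly equal chunks can look like, here is a minimal sketch. It is not the algorithm in chopper.py, which is not described here. It assumes each residue has a 3D coordinate and recursively splits the residues at the median along the axis with the largest spread, until every chunk holds at most max_nodes residues. The function names, the plain-tuple edge format and the edge labels are illustrative assumptions; the repository itself works with networkx graphs.

```python
def chop(coords, max_nodes):
    """Recursively split residues into spatially compact chunks.

    coords: dict mapping node id -> (x, y, z).
    Returns a list of chunks, each a sorted list of node ids.
    Illustrative only; not the chopper.py implementation.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    nodes = sorted(coords)
    if len(nodes) <= max_nodes:
        return [nodes]
    # Pick the axis along which the points are most spread out.
    spreads = [
        max(coords[n][a] for n in nodes) - min(coords[n][a] for n in nodes)
        for a in range(3)
    ]
    axis = spreads.index(max(spreads))
    # Split at the median so both halves have (almost) the same size.
    ordered = sorted(nodes, key=lambda n: coords[n][axis])
    mid = len(ordered) // 2
    left = {n: coords[n] for n in ordered[:mid]}
    right = {n: coords[n] for n in ordered[mid:]}
    return chop(left, max_nodes) + chop(right, max_nodes)


def induced_subgraph(edges, chunk):
    """Keep only the edges whose two endpoints both fall in the chunk."""
    keep = set(chunk)
    return [(u, v, label) for u, v, label in edges if u in keep and v in keep]


if __name__ == "__main__":
    # Eight residues on a line, with backbone edges and one long-range pair.
    coords = {i: (float(i), 0.0, 0.0) for i in range(8)}
    edges = [(i, i + 1, "B53") for i in range(7)] + [(0, 7, "CWW")]
    for chunk in chop(coords, max_nodes=3):
        print(chunk, induced_subgraph(edges, chunk))
```

Edges that cross a chunk boundary are dropped in this sketch, which is one reason chunk size matters for which interactions a chunk graph can still contain.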

We build a rooted subgraph and graphlet hashtable for each node in annotate.py to
speed up the similarity function computations at training time.
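
The idea of precomputing per-node annotations can be sketched as follows. This is an assumption-laden illustration, not the data layout used by annotate.py and not the similarity function from the paper. For every node it caches the multiset of edge labels found at each hop distance, so that a node-to-node comparison at training time only touches the cached multisets instead of traversing the graph again. All function names here are hypothetical.

```python
from collections import Counter, deque


def build_adjacency(edges):
    """Undirected adjacency list: node -> [(neighbour, edge label), ...]."""
    adj = {}
    for u, v, label in edges:
        adj.setdefault(u, []).append((v, label))
        adj.setdefault(v, []).append((u, label))
    return adj


def edge_rings(adj, root, depth):
    """Multisets of edge labels first reached at hop 0 .. depth-1 from root."""
    rings = [Counter() for _ in range(depth)]
    dist = {root: 0}
    seen_edges = set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] >= depth:
            continue  # do not expand beyond the requested radius
        for v, label in adj.get(u, []):
            edge = (frozenset((u, v)), label)
            if edge not in seen_edges:
                seen_edges.add(edge)
                rings[dist[u]][label] += 1
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return rings


def annotate(edges, depth=2):
    """Precompute the rings of every node once (depth must be >= 1)."""
    adj = build_adjacency(edges)
    return {node: edge_rings(adj, node, depth) for node in adj}


def ring_similarity(rings_a, rings_b):
    """Mean multiset Jaccard overlap across hops, using cached rings only."""
    scores = []
    for a, b in zip(rings_a, rings_b):
        union = sum((a | b).values())
        scores.append(sum((a & b).values()) / union if union else 1.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    edges = [(0, 1, "B53"), (1, 2, "CWW"), (2, 3, "B53")]
    table = annotate(edges, depth=2)
    print(ring_similarity(table[0], table[3]))
    print(ring_similarity(table[0], table[1]))
```

The trade-off is the usual one for caching: annotation costs time and disk space up front, and in exchange each pairwise comparison during training no longer depends on graph traversal.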

Create two directories where the data will be kept:

  mkdir data/graphs
  mkdir data/annotated

Building and loading the data takes some time (~1 hr). To skip data preparation entirely, download a pre-built dataset, move it to the data/annotated/ folder, and continue with step 2.

Download RNA networks:

Save the crystal structures (first link) to the data/ folder.

Save the whole graphs (second link) to the data/graphs folder.

Build the dataset. This will take some time as it involves loading many large PDB files.

  python prepare_data/main.py -n <data-id>

2. Subgraph Embeddings

Once the training data is built, we train the RGCN.

  python train_embedding/main.py train -n my_model
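
For readers unfamiliar with relational graph convolutions, the sketch below shows the standard update rule of a single RGCN layer in plain Python: each node combines its own transformed features with the mean of its neighbours' features, using a separate weight matrix per edge type. This is only an illustration of the layer type. It is not the model architecture, loss or training loop used in this repository, which relies on PyTorch and DGL; the function names, weights and edge labels are made up for the example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relational graph convolution step followed by ReLU.

    features:    {node: feature vector}
    edges:       (src, dst, relation) triples; messages flow src -> dst
    rel_weights: {relation: weight matrix}
    self_weight: weight matrix applied to the node's own features
    """
    out_dim = len(self_weight)
    # Group the incoming neighbours of each node by relation type.
    incoming = {}
    for src, dst, rel in edges:
        incoming.setdefault(dst, {}).setdefault(rel, []).append(src)

    updated = {}
    for node, h in features.items():
        total = matvec(self_weight, h)
        for rel, neighbours in incoming.get(node, {}).items():
            for src in neighbours:
                msg = matvec(rel_weights[rel], features[src])
                for k in range(out_dim):
                    # Normalise by the number of neighbours of this relation.
                    total[k] += msg[k] / len(neighbours)
        updated[node] = [max(0.0, t) for t in total]
    return updated


if __name__ == "__main__":
    features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
    edges = [(1, 0, "B53"), (2, 0, "B53"), (0, 1, "CWW")]
    rel_weights = {
        "B53": [[2.0, 0.0], [0.0, 2.0]],
        "CWW": [[0.0, 1.0], [1.0, 0.0]],
    }
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(rgcn_layer(features, edges, rel_weights, identity))
```

Using one weight matrix per relation is what lets the network treat different interaction types differently when it aggregates a node's neighbourhood.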

3. Motif Building

Finally, the trained RGCN and the whole graphs are used to build motifs.

Here, you have three options:

  1. Build/load a new meta graph
  2. Use a meta graph to build motifs
  3. Use a meta graph to search for matches to a graph query

To build a new meta graph:

If this is the first time you build a meta-graph, create the following folder:

  mkdir results/mggs

  python build_motifs/main.py -r my_model --mgg_name my_metagraph

The new meta-graph will be built and saved as results/mggs/my_metagraph.p
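
If you want to inspect the saved file from your own script, the sketch below may help. It assumes that the .p file is a standard Python pickle, which the extension suggests but which is not confirmed here, and the helper name load_meta_graph is hypothetical. Unpickling an object generally requires the classes it was built from to be importable, so this should be run from the repository root, and only on files you trust.

```python
import pickle
from pathlib import Path


def load_meta_graph(path):
    """Load a pickled object from disk (assumes the file is a pickle).

    Hypothetical helper; the structure of the returned object depends on
    the code that wrote the file and is not described here.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No meta-graph file at {path}")
    with path.open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    try:
        mgg = load_meta_graph("results/mggs/my_metagraph.p")
        print(type(mgg))
    except FileNotFoundError as err:
        print(err)
```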