Combinatorial and Semantic Analysis of Functional Elements
This repository contains the implementation of the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE) presented in: Leone M, Galeota E, Masseroli M, Pellizzola M. "Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions", 2021
. It is a flexible computational method to identify combinations of static and dynamic functional elements genome-wide, or in a specific genomic region, and how they change across semantically annotated biological conditions.
Given as input a set of ChIP-seq dataset samples and the list of functional elements to be considered, the CombSAFE
pipeline allows:
CombSAFE/
|-- README.md
|-- LICENSE
|-- .gitignore
|-- notebook/
| |-- Functional_states_analysis.ipynb
|-- gene_list/
| |-- MYC_associated.txt
| |-- test_list.txt
| |-- tumor_suppressor.txt
|-- CombSAFE/
| |-- CombSAFE.py
|-- CombSAFE.yml
README.md
this fileLICENSE
MIT license file.gitignore
standard .gitignore file for Python projectsnotebook/Functional_states_analysis.ipynb
Python notebook to run a functional state analysisgene_list/
folder where gene name lists are stored for the CombSAFE single gene analysisgene_list/test_list.txt
list of random genesgene_list/tumor_suppressor.txt
list of tumor suppressor genesgene_list/MYC_associated.txt
list of MYC interacting genesCombSAFE/CombSAFE.py
core Python routines called from within the notebook to perform the functional state analysisCombSAFE.yml
dependency yaml file to load in order to perform the CombSAFE analysisIn order to run the CombSAFE
pipeline, please follow the steps below:
conda env create -f CombSAFE.yml
conda activate CombSAFE
ON Linux and macOS. On Windows systems digit activate CombSAFE
notebook/Functional_states_analysis.ipynb
NB: The pyGMQL
package additionally requires Java. Please follow the installation procedure here.
NB2: The PyEnsembl
package additionally requires Ensembl data. Please follow the installation procedure here.
NB3: For Windows users, Visual Studio v.14 or higher is required. Please follow the installation procedure here
In the following, we show how to call the functions implemented to easily perform the different steps of our CombSAFE
computational method, providing example resuls for some of them.
combsafe.create_dataset(path, organism, threads=4, from_GEO=False)
Run a ChIP-seq peak calling pipeline from input raw data.
For single-end reads Input files must be structured as follows:
Input_folder/
|-- Raw_Reads/
| |-- 1.rawreads.fastq
| |-- 2.rawreads.fastq
| |-- 3.rawreads.fastq
| |-- 4.rawreads.fastq
| |-- 5.rawreads.fastq
| |-- 6.rawreads.fastq
| |-- 7.rawreads.fastq
| |-- ...
|-- Textual_file.txt
Raw_Reads
a folder containing raw reads in fastq formatTextual_file.txt
a text file containing the following information:file
, filename of the corresponding raw reads file in the Raw_Reads folderfactor
, transcription factor or histone mark used for the analysisdescription
, all available iformations of the biological source from which to extract terms for semantic annotations.E.g.:
file | factor | description |
---|---|---|
1.rawreads.fastq | CTCF | low passage primary melanoma cultures |
2.rawreads.fastq | H3K4me3 | Bone marrow mononuclear cells |
3.rawreads.fastq | MYC | human primary monocytes isolated from PBMC of healthy donors |
… | … | … |
For paired-end reads Input files must be structured as follows:
Input_folder/
|-- Raw_Reads/
| |-- 1.forward_reads.fastq
| |-- 1.reverse_reads.fastq
| |-- 2.forward_reads.fastq
| |-- 2.reverse_reads.fastq
| |-- 3.forward_reads.fastq
| |-- 3.reverse_reads.fastq
| |-- ...
|-- Textual_file.txt
Raw_Reads
a folder containing raw reads in fastq formatTextual_file.txt
a text file containing the following information:file_1
, filename of the corresponding forward raw reads file in the Raw_Reads folderfile_2
, filename of the corresponding reverse raw reads file in the Raw_Reads folderfactor
, transcription factor or histone mark used for the analysisdescription
, all available informations of the biological source from which to extract terms for semantic annotations. E.g.:
file_1 | file_2 | factor | description |
---|---|---|---|
1.forward_reads.fastq | 1.reverse_reads.fastq | H3K4me1 | Human embryonic stem cells received from the John Doe laboratory |
2.forward_reads.fastq | 2.reverse_reads.fastq | H3K4me3 | Nuclei derived from crude preps of adipose tissue |
3.forward_reads.fastq | 3.reverse_reads.fastq | H3K27me3 | Monocyte-derived macrophage |
… | … | … | … |
If you want to start a functional state analysis on GEO experiments, set the from_GEO
label as True. In that scenario, input files must be structured as follows:
Input_folder/
|-- Textual_file.txt
Textual_file.txt
a text file containing the following information:sample_id
, Id of the samples on GEOfactor
, transcription factor or histone mark used for the analysisE.g.,
sample_id | factor |
---|---|
GSM648494 | H3K4me1 |
GSM648495 | H3K4me3 |
GSM575280 | H3K27me3 |
… | … |
Parameters:
Example:
>> dataset = create_dataset("./Input_folder/", assembly="hg38", threads=20, from_GEO=False)
combsafe.load_dataset(path, assembly, from_GEO=False)
Load the input path for the analysis. Skip this step if you have already generated the daatset.
Input files must be structured as follows:
Input_folder/
|-- Chip_Files/
| |-- 1.narrowPeak.bed
| |-- 2.broadPeak.bed
| |-- 3.narrowPeak.bed
| |-- 4.broadPeak.bed
| |-- 5.narrowPeak.bed
| |-- 6.broadPeak.bed
| |-- 7.narrowPeak.bed
| |-- ...
|-- Textual_file.txt
Chip_Files
a folder containing ChIP-Seq filesTextual_file.txt
a text file containing the following information:sample_id
, univolcal id for each samplefactor
, Transcription Factor or Histone Mark used for the analysisfile
, filename of the corresponding ChIp-seq file in the Chip_Files folderdescription
, all available informations of the biological source from which to extract terms for semantic annotations. E.g.:
sample_id | factor | file | description |
---|---|---|---|
1 | CTCF | 1.narrowPeak.bed | low passage primary melanoma cultures |
2 | H3K4me3 | 2.narrowPeak.bed | Bone marrow mononuclear cells |
3 | MYC | 3.narrowPeak.bed | human primary monocytes isolated from PBMC of healthy donors |
… | … | … |
If your dataset is generate from GEO samples and you want to get the description from the GSM GEO webpage, set the from_GEO
label as True. In that scenario, Textual_file.txt must be structured as follows:
Textual_file.txt
a text file containing the following information:sample_id
, Id of the samples on GEOfactor
, Transcription Factor or Histone Mark used for the analysisfile
, filename of the corresponding ChIp-seq file in the Chip_Files foldersample_id | factor | file |
---|---|---|
GSM648494 | H3K4me1 | 1.narrowPeak.bed |
GSM648495 | H3K4me3 | 2.broadPeak.bed |
GSM575280 | H3K27me3 | 3.narrowPeak.bed |
… | … | … |
Parameters:
Example:
>> input_path = import_path("./Input_folder/", assembly="hg38", from_GEO=True)
combsafe.generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = False, encode_convert=False)
Generate semantic annotations about tissue and disease types from the input dataset.
Parameters:
Returns:
Example:
>> ontology_1 = "https://raw.githubusercontent.com/obophenotype/cell-ontology/master/cl-basic.obo"
>> ontology_2 = "https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.obo"
>> semantic_df = generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = True, encode_convert=False)
combsafe.plot_factor_freq(semantic_dataframe, n)
Vertical barplot of the factor frequency in the input dataset.
Parameters:
Example:
>> plot_factor_freq(semantic_df, 30)
combsafe.generate_fixed_factor_pool(semantic_dataframe, factor_list, number_of_semantic_annotations)
Table containg lists of factors according to the selected parameters.
Parameters:
Example:
>> generate_fixed_factor_pool(semantic_df, ["CTCF", "MYC"], 5)
combsafe.get_semantic_annotation_list(semantic_dataframe, factor_list)
List of semantic annotations according to the selected factors.
Parameters:
Example:
>> get_semantic_annotation_list(semantic_df, ["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])
Combine sample replicas of the listed factors and extract their semantic annotations regarding the conditions in which they were mapped.<br/>
Parameters:
- ***factor_list***: list
- list of factors to include in the analysis
Example:
```python
>> extract_data(["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])
combsafe.add_custom_tracks(tracks_label, path_to_custom_tracks, index)
Add custom tracks of static genomic elements to the analysis (e.g., CpG islands).
Parameters:
Example:
>> add_custom_tracks("CpG_Islands", "./input_files/cpgIslandExt.txt")
combsafe.identify_functional_states(ChromHMM_path, number_of_states, n_core)
identification of combinations of static and dynamic functional elements throughout the genome.
Parameters:
Returns:
>> functional_states_df = identify_functional_states(chromhmm_path ="./ChromHMM/", number_of_states = 15, n_core = 20)
Alternatively, it is possible to load in house segmentated files from an other HMM segmentation tool and jump to the next step
combsafe.load_custom_segments(input_segment_dir, num_states)
load functional states files from input path.
Parameters:
Returns:
Example:
>> functional_states_df = load_custom_segments(input_segment_dir ="./Input_folder/in_house_segmentated/", num_states=15)
Input files must be structured as follows:
Input_folder/
|-- Segmentated_Files/
| |-- 1.semantic_annotation_15_segments.bed
| |-- 2.semantic_annotation_15_segments.bed
| |-- 3.semantic_annotation_15_segments.bed
| |-- 4.semantic_annotation_15_segments.bed
| |-- 5.semantic_annotation_15_segments.bed
| |-- 6.semantic_annotation_15_segments.bed
| |-- 7.semantic_annotation_15_segments.bed
| |-- ...
|-- emissions.txt
Segmentated_Files
a folder containing raw reads in fastq formatemissions.txt
a text file structured as follows:State | H3K4me3 | POLR2A | MYC | H3K27me3 |
---|---|---|---|---|
1 | 0.093326 | 0.457892 | 0.143540 | 0.924645 |
2 | 0.793153 | 0.658634 | 0.972344 | 0.487613 |
3 | 0.940996 | 0.000234 | 0.243758 | 0.187461 |
4 | 0.143540 | 0.763471 | 0.872346 | 0.104765 |
… | … | … | … | … |
Show emission parameters heatmap of genome functional states combination. <br/>
Parameters:
- ***custom_palette***: list of exadecimals, default=None
- add a list of customized colors in hexadecimal form to be assigned to the functional states.
Example:
```python
>> colors = ['#c9f9ff', '#e6beff', '#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231','#911eb4', '#bcf60c', '#f032e6', '#fffac8', '#fabebe', '#9a6324', '#46f0f0', '#008080']
>> show_emission_graph(custom_palette=colors)
Show distance matrix heatmap about functional states generated from the emission parameters file of an HMM model. <br/>
Example:
```python
>> show_distance_matrix()
combsafe.single_gene_analysis(functional_states_dataframe, path_to_gene_list_file, distance_metric )
Given a list of gene symbols in a textual file, the heatmap of the functional states of the related genomic regions is shown.
Parameters:
emission.txt
file a special metric functional_state_distance
to weight distances among functional states according to the their function. Alternatively, hamming
distance can be selectedExample:
>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = funtional_states_distance)
>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = "hamming")
Shows PCA heatmap among semantic annotation for selected components. <br/>
Parameters:
- ***functional_states_dataframe***: dataframe
- dataframe of functional states
- ***number_of_components***: int
- number of components for the principal component analysis
Return:
- ***loadings***: dataframe
- PCA loadings
Example:
```python
>> pca_analysis(functional_states_df, 10)
combsafe.genome_reduction(functional_states_dataframe, threshold)
Reduce the initial functional state dataframe to visualize the functional states of the various semantic annotations in the form of a heatmap.
NB: the proportions among the functional states are maintained as in the previous dataframe of functional states.
Parameters:
Return:
Example:
>> reducted_df = genome_reduction(functional_states_df, threshold=90)
combsafe.data_driven_heatmap(reducted_df, distance_metric, min_clust_size, min_sampl)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.
Parameters:
emission.txt
file a special metric functional_state_distance
to weight distances among functional states according to the their function. Alternatively, hamming
distance can be selectedReturn:
Example:
>> clustered_heatmap = data_driven_heatmap(reducted_df, functional_states_distance, min_clust_size=10, min_sampl=2)
combsafe.gene_ontology_enrichment_analysis(clustered_dataframe, distance_metric, goea_tool)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.
Parameters:
emission.txt
file a special metric functional_state_distance
to weight distances among functional states according to the their function. Alternatively, hamming
distance can be selectedExample:
>> gene_ontology_enrichment_analysis(clustered_heatmap, goea_tool = "great", distance_metric=functional_states_distance)