项目作者: sa-lee

项目描述 :
A grammar of genomic data transformation
高级语言: R
项目地址: git://github.com/sa-lee/plyranges.git
创建时间: 2017-08-28T05:12:39Z
项目社区:https://github.com/sa-lee/plyranges

开源协议:

下载


" class="reference-link">plyranges: fluent genomic data analysis

R-CMD-check-bioc
BioC
status

plyranges
provides a consistent interface for importing and wrangling genomics
data from a variety of sources. The package defines a grammar of genomic
data transformation based on dplyr and the Bioconductor packages
IRanges, GenomicRanges, and rtracklayer. It does this by providing
a set of verbs for developing analysis pipelines based on Ranges
objects that represent genomic regions:

  • Modify genomic regions with the mutate() and stretch() functions.
  • Modify genomic regions while fixing the start/end/center coordinates
    with the anchor_ family of functions.
  • Sort genomic ranges with arrange().
  • Modify, subset, and aggregate genomic data with the mutate(),
    filter(), and summarise()functions.
  • Any of the above operations can be performed on partitions of the data
    with group_by().
  • Find nearest neighbour genomic regions with the join_nearest_ family
    of functions.
  • Find overlaps between ranges with the join_overlaps_ family of
    functions.
  • Merge all overlapping and adjacent genomic regions with
    reduce_ranges().
  • Merge the end points of all genomic regions with disjoin_ranges().
  • Import and write common genomic data formats with the read_/write_
    family of functions.

For more details on the features of plyranges, read the
vignette.
For a complete case-study on using plyranges to combine ATAC-seq and
RNA-seq results read the fluentGenomics
workflow
.

plyranges is part of the tidyomics
project, providing a dplyr-based interface for many types of
genomics datasets represented in Bioconductor.

Installation

plyranges
can be installed from the latest Bioconductor release:

  1. # install.packages("BiocManager")
  2. BiocManager::install("plyranges")

To install the development version from GitHub:

  1. BiocManager::install("tidyomics/plyranges")

Quick overview

About Ranges

Ranges objects can either represent sets of integers as IRanges
(which have start, end and width attributes) or represent genomic
intervals (which have additional attributes, sequence name, and strand)
as GRanges. In addition, both types of Ranges can store information
about their intervals as metadata columns (for example GC content over a
genomic interval).

Ranges objects follow the tidy data principle: each row of a Ranges
object corresponds to an interval, while each column will represent a
variable about that interval, and generally each object will represent a
single unit of observation (like gene annotations).

We can construct a IRanges object from a data.frame with a start
or width using the as_iranges() method.

  1. library(plyranges)
  2. df <- data.frame(start = 1:5, width = 5)
  3. as_iranges(df)
  4. #> IRanges object with 5 ranges and 0 metadata columns:
  5. #> start end width
  6. #> <integer> <integer> <integer>
  7. #> [1] 1 5 5
  8. #> [2] 2 6 5
  9. #> [3] 3 7 5
  10. #> [4] 4 8 5
  11. #> [5] 5 9 5
  12. # alternatively with end
  13. df <- data.frame(start = 1:5, end = 5:9)
  14. as_iranges(df)
  15. #> IRanges object with 5 ranges and 0 metadata columns:
  16. #> start end width
  17. #> <integer> <integer> <integer>
  18. #> [1] 1 5 5
  19. #> [2] 2 6 5
  20. #> [3] 3 7 5
  21. #> [4] 4 8 5
  22. #> [5] 5 9 5

We can also construct a GRanges object in a similar manner. Note that
a GRanges object requires at least a seqnames column to be present in
the data.frame (but not necessarily a strand column).

  1. df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
  2. start = 1:5,
  3. width = 5)
  4. as_granges(df)
  5. #> GRanges object with 5 ranges and 0 metadata columns:
  6. #> seqnames ranges strand
  7. #> <Rle> <IRanges> <Rle>
  8. #> [1] chr1 1-5 *
  9. #> [2] chr2 2-6 *
  10. #> [3] chr2 3-7 *
  11. #> [4] chr1 4-8 *
  12. #> [5] chr2 5-9 *
  13. #> -------
  14. #> seqinfo: 2 sequences from an unspecified genome; no seqlengths
  15. # strand can be specified with `+`, `*` (mising) and `-`
  16. df$strand <- c("+", "+", "-", "-", "*")
  17. as_granges(df)
  18. #> GRanges object with 5 ranges and 0 metadata columns:
  19. #> seqnames ranges strand
  20. #> <Rle> <IRanges> <Rle>
  21. #> [1] chr1 1-5 +
  22. #> [2] chr2 2-6 +
  23. #> [3] chr2 3-7 -
  24. #> [4] chr1 4-8 -
  25. #> [5] chr2 5-9 *
  26. #> -------
  27. #> seqinfo: 2 sequences from an unspecified genome; no seqlengths

Example: finding GWAS hits that overlap known exons

Let’s look at a more a realistic example (taken from HelloRanges
vignette).

Suppose we have two GRanges objects: one containing coordinates of
known exons and another containing SNPs from a GWAS.

The first and last 5 exons are printed below, there are two additional
columns corresponding to the exon name, and a score.

We could check the number of exons per chromosome using group_by and
summarise.

  1. exons
  2. #> GRanges object with 459752 ranges and 2 metadata columns:
  3. #> seqnames ranges strand | name score
  4. #> <Rle> <IRanges> <Rle> | <character> <numeric>
  5. #> [1] chr1 11874-12227 + | NR_046018_exon_0_0_c.. 0
  6. #> [2] chr1 12613-12721 + | NR_046018_exon_1_0_c.. 0
  7. #> [3] chr1 13221-14409 + | NR_046018_exon_2_0_c.. 0
  8. #> [4] chr1 14362-14829 - | NR_024540_exon_0_0_c.. 0
  9. #> [5] chr1 14970-15038 - | NR_024540_exon_1_0_c.. 0
  10. #> ... ... ... ... . ... ...
  11. #> [459748] chrY 59338754-59338859 + | NM_002186_exon_6_0_c.. 0
  12. #> [459749] chrY 59338754-59338859 + | NM_176786_exon_7_0_c.. 0
  13. #> [459750] chrY 59340194-59340278 + | NM_002186_exon_7_0_c.. 0
  14. #> [459751] chrY 59342487-59343488 + | NM_002186_exon_8_0_c.. 0
  15. #> [459752] chrY 59342487-59343488 + | NM_176786_exon_8_0_c.. 0
  16. #> -------
  17. #> seqinfo: 93 sequences from an unspecified genome; no seqlengths
  18. exons %>%
  19. group_by(seqnames) %>%
  20. summarise(n = n())
  21. #> DataFrame with 49 rows and 2 columns
  22. #> seqnames n
  23. #> <Rle> <integer>
  24. #> 1 chr1 43366
  25. #> 2 chr10 19420
  26. #> 3 chr11 24476
  27. #> 4 chr12 24949
  28. #> 5 chr13 7974
  29. #> ... ... ...
  30. #> 45 chrUn_gl000222 20
  31. #> 46 chrUn_gl000223 22
  32. #> 47 chrUn_gl000228 85
  33. #> 48 chrX 18173
  34. #> 49 chrY 4128

Next we create a column representing the transcript_id with mutate:

  1. exons <- exons %>%
  2. mutate(tx_id = sub("_exon.*", "", name))

To find all GWAS SNPs that overlap exons, we use join_overlap_inner.
This will create a new GRanges with the coordinates of SNPs that
overlap exons, as well as metadata from both objects.

  1. olap <- join_overlap_inner(gwas, exons)
  2. olap
  3. #> GRanges object with 3439 ranges and 4 metadata columns:
  4. #> seqnames ranges strand | name.x name.y score
  5. #> <Rle> <IRanges> <Rle> | <character> <character> <numeric>
  6. #> [1] chr1 1079198 * | rs11260603 NR_038869_exon_2_0_c.. 0
  7. #> [2] chr1 1247494 * | rs12103 NM_001256456_exon_1_.. 0
  8. #> [3] chr1 1247494 * | rs12103 NM_001256460_exon_1_.. 0
  9. #> [4] chr1 1247494 * | rs12103 NM_001256462_exon_1_.. 0
  10. #> [5] chr1 1247494 * | rs12103 NM_001256463_exon_1_.. 0
  11. #> ... ... ... ... . ... ... ...
  12. #> [3435] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
  13. #> [3436] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0
  14. #> [3437] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
  15. #> [3438] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0
  16. #> [3439] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
  17. #> tx_id
  18. #> <character>
  19. #> [1] NR_038869
  20. #> [2] NM_001256456
  21. #> [3] NM_001256460
  22. #> [4] NM_001256462
  23. #> [5] NM_001256463
  24. #> ... ...
  25. #> [3435] NM_001042351
  26. #> [3436] NM_000402
  27. #> [3437] NM_001042351
  28. #> [3438] NM_000402
  29. #> [3439] NM_001042351
  30. #> -------
  31. #> seqinfo: 93 sequences from an unspecified genome; no seqlengths

For each SNP we can count the number of times it overlaps a transcript.

  1. olap %>%
  2. group_by(name.x, tx_id) %>%
  3. summarise(n = n())
  4. #> DataFrame with 1619 rows and 3 columns
  5. #> name.x tx_id n
  6. #> <character> <character> <integer>
  7. #> 1 rs10043775 NM_001271723 1
  8. #> 2 rs10043775 NM_030793 1
  9. #> 3 rs10078 NM_001242412 1
  10. #> 4 rs10078 NM_020731 1
  11. #> 5 rs10089 NM_001046 1
  12. #> ... ... ... ...
  13. #> 1615 rs9906595 NM_001008777 1
  14. #> 1616 rs9948 NM_017623 1
  15. #> 1617 rs9948 NM_199078 1
  16. #> 1618 rs995030 NM_000899 4
  17. #> 1619 rs995030 NM_003994 4

We can also generate 2bp splice sites on either side of the exon using
flank_left and flank_right. We add a column indicating the side of
flanking for illustrative purposes. The interweave function pairs the
left and right ranges objects.

  1. left_ss <- flank_left(exons, 2L)
  2. right_ss <- flank_right(exons, 2L)
  3. all_ss <- interweave(left_ss, right_ss, .id = "side")
  4. all_ss
  5. #> GRanges object with 919504 ranges and 4 metadata columns:
  6. #> seqnames ranges strand | name score
  7. #> <Rle> <IRanges> <Rle> | <character> <numeric>
  8. #> [1] chr1 11872-11873 + | NR_046018_exon_0_0_c.. 0
  9. #> [2] chr1 12228-12229 + | NR_046018_exon_0_0_c.. 0
  10. #> [3] chr1 12611-12612 + | NR_046018_exon_1_0_c.. 0
  11. #> [4] chr1 12722-12723 + | NR_046018_exon_1_0_c.. 0
  12. #> [5] chr1 13219-13220 + | NR_046018_exon_2_0_c.. 0
  13. #> ... ... ... ... . ... ...
  14. #> [919500] chrY 59340279-59340280 + | NM_002186_exon_7_0_c.. 0
  15. #> [919501] chrY 59342485-59342486 + | NM_002186_exon_8_0_c.. 0
  16. #> [919502] chrY 59343489-59343490 + | NM_002186_exon_8_0_c.. 0
  17. #> [919503] chrY 59342485-59342486 + | NM_176786_exon_8_0_c.. 0
  18. #> [919504] chrY 59343489-59343490 + | NM_176786_exon_8_0_c.. 0
  19. #> tx_id side
  20. #> <character> <character>
  21. #> [1] NR_046018 left
  22. #> [2] NR_046018 right
  23. #> [3] NR_046018 left
  24. #> [4] NR_046018 right
  25. #> [5] NR_046018 left
  26. #> ... ... ...
  27. #> [919500] NM_002186 right
  28. #> [919501] NM_002186 left
  29. #> [919502] NM_002186 right
  30. #> [919503] NM_176786 left
  31. #> [919504] NM_176786 right
  32. #> -------
  33. #> seqinfo: 93 sequences from an unspecified genome; no seqlengths

Learning more

Citation

If you found plyranges useful for your work please cite our
paper:

  1. @ARTICLE{Lee2019,
  2. title = "plyranges: a grammar of genomic data transformation",
  3. author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
  4. journal = "Genome Biol.",
  5. volume = 20,
  6. number = 1,
  7. pages = "4",
  8. month = jan,
  9. year = 2019,
  10. url = "http://dx.doi.org/10.1186/s13059-018-1597-8",
  11. doi = "10.1186/s13059-018-1597-8",
  12. pmc = "PMC6320618"
  13. }

Contributing

We welcome contributions from the R/Bioconductor community. We ask that
contributors follow the code of conduct
and the guide outlined here.