pyranges.genomicfeatures

Module Contents

class pyranges.genomicfeatures.GenomicFeaturesMethods(pr)

Namespace for methods using feature information.

Accessed through gr.features.

pr
tss(self)

Return the transcription start sites.

Returns the 5’ end of every interval with feature “transcript”.

See also

pyranges.genomicfeatures.GenomicFeaturesMethods.tes()
return the transcription end sites

Examples

>>> gr = pr.data.ensembl_gtf()
>>> gr
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
>>> gr.features.tss()
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | tss        | 11868     | 11869     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tss        | 12009     | 12010     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tss        | 29553     | 29554     | .          | +            | .          | lncRNA                             | ...   |
| 1            | havana     | tss        | 30266     | 30267     | .          | +            | .          | lncRNA                             | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | tss        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
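
The selection rule can be illustrated without pyranges. The following is a minimal pure-Python sketch, not the library's implementation: the function name `tss_sketch` and the dict-based records are hypothetical, and the minus-strand coordinates (End, End + 1) are an assumption copied from the example output above.

```python
def tss_sketch(records):
    """Return 1-bp start sites for records whose Feature is "transcript"."""
    sites = []
    for rec in records:
        if rec["Feature"] != "transcript":
            continue  # genes, exons, etc. are ignored
        # The 5' end is Start on the + strand and End on the - strand
        # (minus-strand convention copied from the example output above).
        pos = rec["Start"] if rec["Strand"] == "+" else rec["End"]
        sites.append({**rec, "Feature": "tss", "Start": pos, "End": pos + 1})
    return sites


# Rows taken from the example table above, reduced to the relevant columns.
records = [
    {"Feature": "gene", "Start": 11868, "End": 14409, "Strand": "+"},
    {"Feature": "transcript", "Start": 11868, "End": 14409, "Strand": "+"},
    {"Feature": "transcript", "Start": 1173055, "End": 1179555, "Strand": "-"},
]
for site in tss_sketch(records):
    print(site["Start"], site["End"], site["Strand"])
```

For the two transcript rows this yields the intervals 11868-11869 and 1179555-1179556, matching the first and last rows of the tss() output above.
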
tes(self, slack=0)

Return the transcription end sites.

Returns the 3’ end of every interval with feature “transcript”.

See also

pyranges.genomicfeatures.GenomicFeaturesMethods.tss()
return the transcription start sites

Examples

>>> gr = pr.data.ensembl_gtf()
>>> gr
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
>>> gr.features.tes()
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | tes        | 14409     | 14410     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tes        | 13670     | 13671     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tes        | 31097     | 31098     | .          | +            | .          | lncRNA                             | ...   |
| 1            | havana     | tes        | 31109     | 31110     | .          | +            | .          | lncRNA                             | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | tes        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
introns(self, by='gene', nb_cpu=1)

Return the introns.

Parameters:
  • by (str, {"gene", "transcript"}, default "gene") – Whether to find introns per gene or transcript.
  • nb_cpu (int, default 1) – Number of CPUs to use. At most one CPU can be used per chromosome or chromosome/strand pair. Only gives a speedup on large datasets.

See also

pyranges.genomicfeatures.GenomicFeaturesMethods.tss()
return the transcription start sites

Examples

>>> gr = pr.data.ensembl_gtf()
>>> gr
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
>>> gr.features.introns(by="gene")
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | +20   |
| (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | ...   |
|--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------|
| 1            | ensembl_havana | intron     | 1173926   | 1174265   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1174321   | 1174423   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1174489   | 1174520   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1175034   | 1179188   | .          | +            | .          | ...   |
| ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...   |
| 1            | havana         | intron     | 874591    | 875046    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 875155    | 875525    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 875625    | 876526    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 876611    | 876754    | .          | -            | .          | ...   |
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
Stranded PyRanges object has 311 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
20 hidden columns: gene_biotype, gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, ... (+ 11 more.)
>>> gr.features.introns(by="transcript")
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
| Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                     | +19   |
| (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                         | ...   |
|--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------|
| 1            | havana         | intron     | 818202    | 818722    | .          | +            | .          | lncRNA                           | ...   |
| 1            | ensembl_havana | intron     | 960800    | 961292    | .          | +            | .          | protein_coding                   | ...   |
| 1            | ensembl_havana | intron     | 961552    | 961628    | .          | +            | .          | protein_coding                   | ...   |
| 1            | ensembl_havana | intron     | 961750    | 961825    | .          | +            | .          | protein_coding                   | ...   |
| ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...                              | ...   |
| 1            | havana         | intron     | 732207    | 732980    | .          | -            | .          | transcribed_processed_pseudogene | ...   |
| 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
| 1            | havana_tagene  | intron     | 165942    | 167958    | .          | -            | .          | lncRNA                           | ...   |
| 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
Stranded PyRanges object has 1,043 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
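
One way to think about introns is as the gaps between the exons of a group (a gene or a transcript). The following pure-Python sketch shows that idea on half-open (start, end) tuples. It is an illustration under that assumption, not pyranges' implementation; the name `introns_from_exons` is hypothetical, and the third exon in the example is made up to show how overlapping exons are merged when grouping by gene.

```python
def introns_from_exons(exons):
    """Return the gaps between exons as half-open (start, end) tuples."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent exon: extend the previous block.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Each intron runs from the end of one block to the start of the next.
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(merged, merged[1:])]


# The first two exons come from the example table; the third is hypothetical.
exons = [(11868, 12227), (12612, 12721), (12009, 12057)]
print(introns_from_exons(exons))
```

With by="transcript" the same gap computation would apply to the exons of a single transcript, so overlapping exons from different transcripts would not be merged.
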
pyranges.genomicfeatures.genome_bounds(gr, chromsizes, clip=False)

Remove or clip intervals outside of genome bounds.

Parameters:
  • chromsizes (dict or PyRanges) – Dict or PyRanges describing the lengths of the chromosomes.
  • clip (bool, default False) – If True, keep the part of each interval that lies within bounds instead of removing the whole interval.

Examples

>>> d = {"Chromosome": [1, 1, 3], "Start": [1, 249250600, 5], "End": [2, 249250640, 7]}
>>> gr = pr.from_dict(d)
>>> gr
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            1 | 249250600 | 249250640 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> chromsizes = {"1": 249250621, "3": 500}
>>> chromsizes
{'1': 249250621, '3': 500}
>>> pr.gf.genome_bounds(gr, chromsizes)
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> pr.gf.genome_bounds(gr, chromsizes, clip=True)
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            1 | 249250600 | 249250621 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> del chromsizes['3']
>>> chromsizes
{'1': 249250621}
>>> pr.gf.genome_bounds(gr, chromsizes)
Traceback (most recent call last):
...
KeyError: '3'
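
The remove-versus-clip behaviour shown above can be sketched in plain Python. This is a minimal illustration on (chromosome, start, end) tuples, not the library's code; the name `genome_bounds_sketch` is hypothetical, and only the upper bound is checked (negative starts are not handled).

```python
def genome_bounds_sketch(intervals, chromsizes, clip=False):
    """Drop (or, with clip=True, trim) intervals that run past the chromosome end."""
    kept = []
    for chrom, start, end in intervals:
        size = chromsizes[chrom]  # raises KeyError for an unknown chromosome
        if end <= size:
            kept.append((chrom, start, end))  # fully within bounds
        elif clip and start < size:
            kept.append((chrom, start, size))  # keep only the in-bounds part
        # otherwise the interval is removed
    return kept


intervals = [("1", 1, 2), ("1", 249250600, 249250640), ("3", 5, 7)]
sizes = {"1": 249250621, "3": 500}
print(genome_bounds_sketch(intervals, sizes))
print(genome_bounds_sketch(intervals, sizes, clip=True))
```

As in the doctest above, a chromosome that is missing from the size mapping surfaces as a KeyError from the dictionary lookup.
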
pyranges.genomicfeatures.tile_genome(genome, tile_size, tile_last=False)

Create a tiled genome.

Parameters:
  • genome (dict or PyRanges) – Dict or PyRanges describing the lengths of the chromosomes.
  • tile_size (int) – Length of the tiles.
  • tile_last (bool, default False) – Use genome length as end of last tile.

See also

pyranges.PyRanges.tile()
split intervals into adjacent non-overlapping tiles.

Examples

>>> chromsizes = pr.data.chromsizes()
>>> chromsizes
+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int32)   | (int32)   |
|--------------+-----------+-----------|
| chr1         | 0         | 249250621 |
| chr2         | 0         | 243199373 |
| chr3         | 0         | 198022430 |
| chr4         | 0         | 191154276 |
| ...          | ...       | ...       |
| chr22        | 0         | 51304566  |
| chrM         | 0         | 16571     |
| chrX         | 0         | 155270560 |
| chrY         | 0         | 59373566  |
+--------------+-----------+-----------+
Unstranded PyRanges object has 25 rows and 3 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> pr.gf.tile_genome(chromsizes, int(1e6))
+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int32)   | (int32)   |
|--------------+-----------+-----------|
| chr1         | 0         | 1000000   |
| chr1         | 1000000   | 2000000   |
| chr1         | 2000000   | 3000000   |
| chr1         | 3000000   | 4000000   |
| ...          | ...       | ...       |
| chrY         | 56000000  | 57000000  |
| chrY         | 57000000  | 58000000  |
| chrY         | 58000000  | 59000000  |
| chrY         | 59000000  | 59373566  |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3,114 rows and 3 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
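
The tiling in the example output can be reproduced with a short pure-Python sketch. It assumes a plain dict of chromosome lengths and mirrors only the default output shown above, where the final tile of each chromosome ends at the chromosome length; the name `tile_genome_sketch` is hypothetical and the tile_last option is not modelled.

```python
def tile_genome_sketch(chromsizes, tile_size):
    """Split each chromosome into adjacent half-open tiles of tile_size bases."""
    tiles = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, tile_size):
            # The final tile is shortened so it does not run past the chromosome end.
            tiles.append((chrom, start, min(start + tile_size, size)))
    return tiles


# Lengths taken from the chromsizes table above.
tiles = tile_genome_sketch({"chrM": 16571, "chrY": 59373566}, int(1e6))
print(tiles[0])
print(tiles[-1])
```

The last tile for chrY is (59000000, 59373566), matching the final row of the output above.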