pyranges.genomicfeatures
Module Contents
class pyranges.genomicfeatures.GenomicFeaturesMethods(pr)
Namespace for methods using feature information.
Accessed through gr.features.
pr
tss(self)
Return the transcription start sites.
Returns the 5’ end of every interval with feature “transcript”.
See also
pyranges.genomicfeatures.GenomicFeaturesMethods.tes(): return the transcription end sites
Examples
>>> gr = pr.data.ensembl_gtf()
>>> gr
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
| 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
>>> gr.features.tss()
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | tss        | 11868     | 11869     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tss        | 12009     | 12010     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tss        | 29553     | 29554     | .          | +            | .          | lncRNA                             | ...   |
| 1            | havana     | tss        | 30266     | 30267     | .          | +            | .          | lncRNA                             | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | tss        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tss        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
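The coordinates in the tss() output suggest a simple rule: a "+" strand transcript yields a one-base interval starting at its Start, and a "-" strand transcript yields a one-base interval starting at its End. The following pure-Python sketch illustrates that rule. It is inferred from the example output rather than taken from the pyranges source, and the helper name `tss_of` and the tuple layout are hypothetical.

```python
def tss_of(records):
    """Return one-base TSS intervals for records whose feature is "transcript".

    Each record is a hypothetical (feature, start, end, strand) tuple.
    The minus-strand rule (use End) is an assumption read off the example output.
    """
    sites = []
    for feature, start, end, strand in records:
        if feature != "transcript":
            continue  # genes and exons are ignored
        five_prime = start if strand == "+" else end
        sites.append(("tss", five_prime, five_prime + 1, strand))
    return sites


# Coordinates borrowed from the first and last transcripts in the table.
rows = [
    ("gene", 11868, 14409, "+"),
    ("transcript", 11868, 14409, "+"),
    ("exon", 11868, 12227, "+"),
    ("transcript", 1173055, 1179555, "-"),
]
print(tss_of(rows))
```

Only the two transcript rows produce a site; the gene and exon rows are skipped.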
tes(self, slack=0)
Return the transcription end sites.
Returns the 3’ end of every interval with feature “transcript”.
See also
pyranges.genomicfeatures.GenomicFeaturesMethods.tss(): return the transcription start sites
Examples
>>> gr = pr.data.ensembl_gtf()
>>> gr.features.tes()
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
| 1            | havana     | tes        | 14409     | 14410     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tes        | 13670     | 13671     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
| 1            | havana     | tes        | 31097     | 31098     | .          | +            | .          | lncRNA                             | ...   |
| 1            | havana     | tes        | 31109     | 31110     | .          | +            | .          | lncRNA                             | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
| 1            | havana     | tes        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
| 1            | havana     | tes        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
introns(self, by='gene', nb_cpu=1)
Return the introns.
Parameters:
- by (str, {"gene", "transcript"}, default "gene") – Whether to find introns per gene or per transcript.
- nb_cpu (int, default 1) – How many CPUs to use. At most one CPU can be used per chromosome or chromosome/strand tuple, so this only leads to speedups on large datasets.
See also
pyranges.genomicfeatures.GenomicFeaturesMethods.tss(): return the transcription start sites
Examples
>>> gr = pr.data.ensembl_gtf()
>>> gr.features.introns(by="gene")
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | +20   |
| (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | ...   |
|--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------|
| 1            | ensembl_havana | intron     | 1173926   | 1174265   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1174321   | 1174423   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1174489   | 1174520   | .          | +            | .          | ...   |
| 1            | ensembl_havana | intron     | 1175034   | 1179188   | .          | +            | .          | ...   |
| ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...   |
| 1            | havana         | intron     | 874591    | 875046    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 875155    | 875525    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 875625    | 876526    | .          | -            | .          | ...   |
| 1            | havana         | intron     | 876611    | 876754    | .          | -            | .          | ...   |
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
Stranded PyRanges object has 311 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
20 hidden columns: gene_biotype, gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, ... (+ 11 more.)
>>> gr.features.introns(by="transcript")
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
| Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                     | +19   |
| (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                         | ...   |
|--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------|
| 1            | havana         | intron     | 818202    | 818722    | .          | +            | .          | lncRNA                           | ...   |
| 1            | ensembl_havana | intron     | 960800    | 961292    | .          | +            | .          | protein_coding                   | ...   |
| 1            | ensembl_havana | intron     | 961552    | 961628    | .          | +            | .          | protein_coding                   | ...   |
| 1            | ensembl_havana | intron     | 961750    | 961825    | .          | +            | .          | protein_coding                   | ...   |
| ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...                              | ...   |
| 1            | havana         | intron     | 732207    | 732980    | .          | -            | .          | transcribed_processed_pseudogene | ...   |
| 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
| 1            | havana_tagene  | intron     | 165942    | 167958    | .          | -            | .          | lncRNA                           | ...   |
| 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
+--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
Stranded PyRanges object has 1,043 rows and 28 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)
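Conceptually, an intron is the gap between consecutive exons that belong to the same group (a gene or a transcript). The sketch below shows that idea in pure Python. It is not the pyranges implementation; the function name `introns_between`, the (group_id, start, end) layout, and the ids "tx1" and "tx2" are assumptions made for illustration, while the exon coordinates are borrowed from the ensembl_gtf table.

```python
from collections import defaultdict


def introns_between(exons):
    """Return the gaps between exons, grouped by an identifier.

    `exons` is a list of hypothetical (group_id, start, end) tuples.
    Overlapping exons are merged on the fly, so the group id can be a
    gene id as well as a transcript id.
    """
    by_group = defaultdict(list)
    for group_id, start, end in exons:
        by_group[group_id].append((start, end))

    introns = []
    for group_id, spans in by_group.items():
        spans.sort()
        covered_to = spans[0][1]  # right edge of the exons seen so far
        for start, end in spans[1:]:
            if start > covered_to:  # uncovered gap between exons
                introns.append((group_id, covered_to, start))
            covered_to = max(covered_to, end)
    return introns


exons = [
    ("tx1", 11868, 12227),
    ("tx1", 12612, 12721),
    ("tx2", 1179364, 1179555),
    ("tx2", 1173055, 1176396),
]
print(introns_between(exons))
```

Grouping by a gene id instead of a transcript id pools exons from all of a gene's transcripts, which is why the merge step matters for the by="gene" case.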
pyranges.genomicfeatures.genome_bounds(gr, chromsizes, clip=False)
Remove or clip intervals outside of genome bounds.
Parameters:
- gr (PyRanges) – Intervals to check against the genome bounds.
- chromsizes (dict or PyRanges) – Dict or PyRanges describing the lengths of the chromosomes.
- clip (bool, default False) – If True, keep the part of each out-of-bounds interval that lies within bounds instead of removing the interval.
Examples
>>> d = {"Chromosome": [1, 1, 3], "Start": [1, 249250600, 5], "End": [2, 249250640, 7]}
>>> gr = pr.from_dict(d)
>>> gr
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            1 | 249250600 | 249250640 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

>>> chromsizes = {"1": 249250621, "3": 500}
>>> chromsizes
{'1': 249250621, '3': 500}

>>> pr.gf.genome_bounds(gr, chromsizes)
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

>>> pr.gf.genome_bounds(gr, chromsizes, clip=True)
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |         1 |         2 |
|            1 | 249250600 | 249250621 |
|            3 |         5 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

>>> del chromsizes['3']
>>> chromsizes
{'1': 249250621}

>>> pr.gf.genome_bounds(gr, chromsizes)
Traceback (most recent call last):
...
KeyError: '3'
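The remove-versus-clip behaviour in these examples can be sketched without pyranges. The following is a minimal pure-Python illustration under stated assumptions: the function name `genome_bounds_sketch` and the (chromosome, start, end) tuple layout are hypothetical, and the boundary handling is inferred from the example output rather than from the library source.

```python
def genome_bounds_sketch(intervals, chromsizes, clip=False):
    """Drop or clip intervals that extend past the end of their chromosome.

    `intervals` is a list of hypothetical (chromosome, start, end) tuples and
    `chromsizes` maps chromosome name to length. A chromosome missing from
    `chromsizes` raises KeyError.
    """
    kept = []
    for chrom, start, end in intervals:
        size = chromsizes[chrom]  # KeyError if the chromosome is unknown
        if end <= size:
            kept.append((chrom, start, end))
        elif clip and start < size:
            kept.append((chrom, start, size))  # keep only the in-bounds part
    return kept


intervals = [("1", 1, 2), ("1", 249250600, 249250640), ("3", 5, 7)]
chromsizes = {"1": 249250621, "3": 500}
print(genome_bounds_sketch(intervals, chromsizes))
print(genome_bounds_sketch(intervals, chromsizes, clip=True))
```

With clip=False the second interval is dropped because its End exceeds the chromosome length; with clip=True its End is shortened to 249250621.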
pyranges.genomicfeatures.tile_genome(genome, tile_size, tile_last=False)
Create a tiled genome.
Parameters:
- genome (dict or PyRanges) – Dict or PyRanges describing the lengths of the chromosomes.
- tile_size (int) – Length of the tiles.
- tile_last (bool, default False) – Use genome length as end of last tile.
See also
pyranges.PyRanges.tile(): split intervals into adjacent non-overlapping tiles.
Examples
>>> chromsizes = pr.data.chromsizes()
>>> chromsizes
+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int32)   | (int32)   |
|--------------+-----------+-----------|
| chr1         | 0         | 249250621 |
| chr2         | 0         | 243199373 |
| chr3         | 0         | 198022430 |
| chr4         | 0         | 191154276 |
| ...          | ...       | ...       |
| chr22        | 0         | 51304566  |
| chrM         | 0         | 16571     |
| chrX         | 0         | 155270560 |
| chrY         | 0         | 59373566  |
+--------------+-----------+-----------+
Unstranded PyRanges object has 25 rows and 3 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> pr.gf.tile_genome(chromsizes, int(1e6))
+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int32)   | (int32)   |
|--------------+-----------+-----------|
| chr1         | 0         | 1000000   |
| chr1         | 1000000   | 2000000   |
| chr1         | 2000000   | 3000000   |
| chr1         | 3000000   | 4000000   |
| ...          | ...       | ...       |
| chrY         | 56000000  | 57000000  |
| chrY         | 57000000  | 58000000  |
| chrY         | 58000000  | 59000000  |
| chrY         | 59000000  | 59373566  |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3,114 rows and 3 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
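The tiling shown in this output can be sketched in a few lines of pure Python. This is an illustration only, assuming the behaviour visible in the example (the last tile of each chromosome ends at the chromosome length); the function name `tile_genome_sketch` is hypothetical and the `tile_last` option is not modelled.

```python
def tile_genome_sketch(chromsizes, tile_size):
    """Tile each chromosome from 0 up to its length.

    `chromsizes` maps chromosome name to length. The final tile of each
    chromosome is shortened so that it ends at the chromosome length,
    matching the chrY row in the example output.
    """
    tiles = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, tile_size):
            tiles.append((chrom, start, min(start + tile_size, size)))
    return tiles


# Two chromosome lengths taken from the chromsizes table.
tiles = tile_genome_sketch({"chrM": 16571, "chrY": 59373566}, 1_000_000)
print(tiles[0], tiles[-1], len(tiles))
```

A chromosome shorter than the tile size, such as chrM here, yields a single tile covering the whole chromosome.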