:mod:`pyranges.genomicfeatures`
===============================

.. py:module:: pyranges.genomicfeatures


Module Contents
---------------

.. py:class:: GenomicFeaturesMethods(pr)

   Namespace for methods using feature information.

   Accessed through `gr.features`.

   .. attribute:: pr

   .. method:: tss(self)

      Return the transcription start sites.

      Returns the 5' end of every interval with feature "transcript".

      .. seealso::

         :meth:`pyranges.genomicfeatures.GenomicFeaturesMethods.tes`
            return the transcription end sites

      .. rubric:: Examples

      >>> gr = pr.data.ensembl_gtf()
      >>> gr
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
      | (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
      |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
      | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
      | 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

      >>> gr.features.tss()
      +--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      | Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
      | (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
      |--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
      | 1            | havana     | tss        | 11868     | 11869     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | tss        | 12009     | 12010     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | tss        | 29553     | 29554     | .          | +            | .          | lncRNA                             | ...   |
      | 1            | havana     | tss        | 30266     | 30267     | .          | +            | .          | lncRNA                             | ...   |
      | ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
      | 1            | havana     | tss        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tss        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tss        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tss        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
      +--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

   .. method:: tes(self, slack=0)

      Return the transcription end sites.

      Returns the 3' end of every interval with feature "transcript".

      .. seealso::

         :meth:`pyranges.genomicfeatures.GenomicFeaturesMethods.tss`
            return the transcription start sites

      .. rubric:: Examples

      >>> gr = pr.data.ensembl_gtf()
      >>> gr
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
      | (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
      |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
      | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
      | 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

      >>> gr.features.tes()
      +--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      | Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
      | (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
      |--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
      | 1            | havana     | tes        | 14409     | 14410     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | tes        | 13670     | 13671     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | tes        | 31097     | 31098     | .          | +            | .          | lncRNA                             | ...   |
      | 1            | havana     | tes        | 31109     | 31110     | .          | +            | .          | lncRNA                             | ...   |
      | ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
      | 1            | havana     | tes        | 1092813   | 1092814   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tes        | 1116087   | 1116088   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tes        | 1116089   | 1116090   | .          | -            | .          | protein_coding                     | ...   |
      | 1            | havana     | tes        | 1179555   | 1179556   | .          | -            | .          | lncRNA                             | ...   |
      +--------------+------------+------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      Stranded PyRanges object has 280 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

   .. method:: introns(self, by='gene', nb_cpu=1)

      Return the introns.

      :param by: Whether to find introns per gene or transcript.
      :type by: str, {"gene", "transcript"}, default "gene"
      :param nb_cpu: Number of CPUs to use. At most one CPU can be used per chromosome or chromosome/strand pair. Only leads to speedups on large datasets.
      :type nb_cpu: int, default 1

      .. seealso::

         :meth:`pyranges.genomicfeatures.GenomicFeaturesMethods.tss`
            return the transcription start sites

      .. rubric:: Examples

      >>> gr = pr.data.ensembl_gtf()
      >>> gr
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_biotype                       | +19   |
      | (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                           | ...   |
      |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------|
      | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | transcribed_unprocessed_pseudogene | ...   |
      | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                                | ...   |
      | 1            | havana     | gene         | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | transcript   | 1173055   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1179364   | 1179555   | .          | -            | .          | lncRNA                             | ...   |
      | 1            | havana     | exon         | 1173055   | 1176396   | .          | -            | .          | lncRNA                             | ...   |
      +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+------------------------------------+-------+
      Stranded PyRanges object has 2,446 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

      >>> gr.features.introns(by="gene")
      +--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
      | Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | +20   |
      | (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | ...   |
      |--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------|
      | 1            | ensembl_havana | intron     | 1173926   | 1174265   | .          | +            | .          | ...   |
      | 1            | ensembl_havana | intron     | 1174321   | 1174423   | .          | +            | .          | ...   |
      | 1            | ensembl_havana | intron     | 1174489   | 1174520   | .          | +            | .          | ...   |
      | 1            | ensembl_havana | intron     | 1175034   | 1179188   | .          | +            | .          | ...   |
      | ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...   |
      | 1            | havana         | intron     | 874591    | 875046    | .          | -            | .          | ...   |
      | 1            | havana         | intron     | 875155    | 875525    | .          | -            | .          | ...   |
      | 1            | havana         | intron     | 875625    | 876526    | .          | -            | .          | ...   |
      | 1            | havana         | intron     | 876611    | 876754    | .          | -            | .          | ...   |
      +--------------+----------------+------------+-----------+-----------+------------+--------------+------------+-------+
      Stranded PyRanges object has 311 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      20 hidden columns: gene_biotype, gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, ... (+ 11 more.)

      >>> gr.features.introns(by="transcript")
      +--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
      | Chromosome   | Source         | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_biotype                     | +19   |
      | (object)     | (object)       | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)                         | ...   |
      |--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------|
      | 1            | havana         | intron     | 818202    | 818722    | .          | +            | .          | lncRNA                           | ...   |
      | 1            | ensembl_havana | intron     | 960800    | 961292    | .          | +            | .          | protein_coding                   | ...   |
      | 1            | ensembl_havana | intron     | 961552    | 961628    | .          | +            | .          | protein_coding                   | ...   |
      | 1            | ensembl_havana | intron     | 961750    | 961825    | .          | +            | .          | protein_coding                   | ...   |
      | ...          | ...            | ...        | ...       | ...       | ...        | ...          | ...        | ...                              | ...   |
      | 1            | havana         | intron     | 732207    | 732980    | .          | -            | .          | transcribed_processed_pseudogene | ...   |
      | 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
      | 1            | havana_tagene  | intron     | 165942    | 167958    | .          | -            | .          | lncRNA                           | ...   |
      | 1            | havana_tagene  | intron     | 168165    | 169048    | .          | -            | .          | lncRNA                           | ...   |
      +--------------+----------------+------------+-----------+-----------+------------+--------------+------------+----------------------------------+-------+
      Stranded PyRanges object has 1,043 rows and 28 columns from 1 chromosomes.
      For printing, the PyRanges was sorted on Chromosome and Strand.
      19 hidden columns: gene_id, gene_name, gene_source, gene_version, tag, transcript_biotype, transcript_id, transcript_name, transcript_source, transcript_support_level, ... (+ 9 more.)

.. function:: genome_bounds(gr, chromsizes, clip=False)

   Remove or clip intervals outside of genome bounds.

   :param chromsizes: Dict or PyRanges describing the lengths of the chromosomes.
   :type chromsizes: dict or PyRanges
   :param clip: If True, keep the part of each interval that lies within bounds instead of removing the whole interval.
   :type clip: bool, default False

   .. rubric:: Examples

   >>> d = {"Chromosome": [1, 1, 3], "Start": [1, 249250600, 5], "End": [2, 249250640, 7]}
   >>> gr = pr.from_dict(d)
   >>> gr
   +--------------+-----------+-----------+
   | Chromosome   | Start     | End       |
   | (category)   | (int32)   | (int32)   |
   |--------------+-----------+-----------|
   | 1            | 1         | 2         |
   | 1            | 249250600 | 249250640 |
   | 3            | 5         | 7         |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> chromsizes = {"1": 249250621, "3": 500}
   >>> chromsizes
   {'1': 249250621, '3': 500}

   >>> pr.gf.genome_bounds(gr, chromsizes)
   +--------------+-----------+-----------+
   | Chromosome   | Start     | End       |
   | (category)   | (int32)   | (int32)   |
   |--------------+-----------+-----------|
   | 1            | 1         | 2         |
   | 3            | 5         | 7         |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 2 rows and 3 columns from 2 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> pr.gf.genome_bounds(gr, chromsizes, clip=True)
   +--------------+-----------+-----------+
   | Chromosome   | Start     | End       |
   | (category)   | (int32)   | (int32)   |
   |--------------+-----------+-----------|
   | 1            | 1         | 2         |
   | 1            | 249250600 | 249250621 |
   | 3            | 5         | 7         |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 3 rows and 3 columns from 2 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> del chromsizes['3']
   >>> chromsizes
   {'1': 249250621}

   >>> pr.gf.genome_bounds(gr, chromsizes)
   Traceback (most recent call last):
   ...
   KeyError: '3'


.. function:: tile_genome(genome, tile_size, tile_last=False)

   Create a tiled genome.

   :param genome: Dict or PyRanges describing the lengths of the chromosomes.
   :type genome: dict or PyRanges
   :param tile_size: Length of the tiles.
   :type tile_size: int
   :param tile_last: Use genome length as end of last tile.
   :type tile_last: bool, default False

   .. seealso::

      :func:`pyranges.PyRanges.tile`
         split intervals into adjacent non-overlapping tiles.

   .. rubric:: Examples

   >>> chromsizes = pr.data.chromsizes()
   >>> chromsizes
   +--------------+-----------+-----------+
   | Chromosome   | Start     | End       |
   | (category)   | (int32)   | (int32)   |
   |--------------+-----------+-----------|
   | chr1         | 0         | 249250621 |
   | chr2         | 0         | 243199373 |
   | chr3         | 0         | 198022430 |
   | chr4         | 0         | 191154276 |
   | ...          | ...       | ...       |
   | chr22        | 0         | 51304566  |
   | chrM         | 0         | 16571     |
   | chrX         | 0         | 155270560 |
   | chrY         | 0         | 59373566  |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 25 rows and 3 columns from 25 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> pr.gf.tile_genome(chromsizes, int(1e6))
   +--------------+-----------+-----------+
   | Chromosome   | Start     | End       |
   | (category)   | (int32)   | (int32)   |
   |--------------+-----------+-----------|
   | chr1         | 0         | 1000000   |
   | chr1         | 1000000   | 2000000   |
   | chr1         | 2000000   | 3000000   |
   | chr1         | 3000000   | 4000000   |
   | ...          | ...       | ...       |
   | chrY         | 56000000  | 57000000  |
   | chrY         | 57000000  | 58000000  |
   | chrY         | 58000000  | 59000000  |
   | chrY         | 59000000  | 59373566  |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 3,114 rows and 3 columns from 25 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.
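
The tiling shown in the ``tile_genome`` example can be illustrated without pyranges. The following is a minimal pure-Python sketch, not the library's implementation: ``tile_chromsizes`` is a hypothetical helper, intervals are assumed to be half-open (``Start`` inclusive, ``End`` exclusive), and the last tile of each chromosome is cut at the chromosome length, as in the ``chrY`` rows of the output above. The two chromosome lengths are taken from the ``chromsizes`` table in the example.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end), with start inclusive and end exclusive (assumed)
Tile = Tuple[str, int, int]


def tile_chromsizes(chromsizes: Dict[str, int], tile_size: int) -> List[Tile]:
    """Split each chromosome into adjacent, non-overlapping tiles.

    Hypothetical helper for illustration only; it is not part of pyranges.
    The final tile of each chromosome ends at the chromosome length.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be a positive integer")

    tiles: List[Tile] = []
    for chrom, length in chromsizes.items():
        # Tile starts are 0, tile_size, 2 * tile_size, ... below the length.
        for start in range(0, length, tile_size):
            end = min(start + tile_size, length)
            tiles.append((chrom, start, end))
    return tiles


# Lengths of chrM and chrY as listed in the chromsizes example.
tiles = tile_chromsizes({"chrM": 16571, "chrY": 59373566}, int(1e6))
print(tiles[0])    # ('chrM', 0, 16571)
print(tiles[-1])   # ('chrY', 59000000, 59373566)
print(len(tiles))  # 61
```

A chromosome shorter than ``tile_size`` (here ``chrM``) yields a single tile covering the whole chromosome, and a chromosome of length 0 yields no tiles.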