pyranges.readers

Module Contents

pyranges.readers.read_bed(f, as_df=False, nrows=None)

Return bed file as PyRanges.

This is a reader for files that follow the bed format. Such files have between 3 and 12 columns, which are named as follows, in order:

Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB BlockCount BlockSizes BlockStarts
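
As a rough illustration of the naming rule above, the sketch below assigns the standard names to however many columns a line has. The helper name_bed_fields is hypothetical and not part of pyranges; it only mirrors the column order listed above and does no type conversion.

```python
# Column names in the order listed above.
BED_COLUMNS = (
    "Chromosome Start End Name Score Strand "
    "ThickStart ThickEnd ItemRGB BlockCount BlockSizes BlockStarts"
).split()


def name_bed_fields(line):
    """Map the tab-separated fields of one bed line to the standard names.

    Hypothetical helper for illustration; not a pyranges function.
    """
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= len(BED_COLUMNS):
        raise ValueError(f"expected 3-12 columns, got {len(fields)}")
    # zip stops at the shorter sequence, so only the first len(fields) names are used.
    return dict(zip(BED_COLUMNS, fields))


# A six-column line, so the names run from Chromosome to Strand.
record = name_bed_fields("chr1\t9939\t10138\tH3K27me3\t7\t+")
print(list(record))
```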

Parameters:
  • f (str) – Path to bed file
  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
  • nrows (int, default None) – Number of rows to return.

Notes

If you just want to create a PyRanges from a tab-delimited bed-like file, use pr.PyRanges(pandas.read_table(f)) instead.

Examples

>>> path = pr.get_example_path("aorta.bed")
>>> pr.read_bed(path, nrows=5)
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
| chr1         |      9953 |     10152 | H3K27me3   |         5 | +            |
| chr1         |      9916 |     10115 | H3K27me3   |         5 | -            |
| chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
| chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> pr.read_bed(path, as_df=True, nrows=5)
  Chromosome  Start    End      Name  Score Strand
0       chr1   9916  10115  H3K27me3      5      -
1       chr1   9939  10138  H3K27me3      7      +
2       chr1   9951  10150  H3K27me3      8      -
3       chr1   9953  10152  H3K27me3      5      +
4       chr1   9978  10177  H3K27me3      7      -
pyranges.readers.read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1540)

Return bam file as PyRanges.

Parameters:
  • f (str) – Path to bam file
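  • sparse (bool, default True) – Whether to return only the columns Chromosome, Start, End, Strand and Flag. With sparse=False the columns QueryStart, QueryEnd, Name, Cigar and Quality are returned as well.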
  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
  • mapq (int, default 0) – Minimum mapping quality score.
  • required_flag (int, default 0) – Flags which must be present for the interval to be read.
  • filter_flag (int, default 1540) – Ignore reads with any of these flags set. The default, 1540 (4 + 512 + 1024), excludes reads that are unmapped, that failed vendor or platform quality checks, or that are PCR or optical duplicates.
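
The interplay of mapq, required_flag and filter_flag can be sketched in plain Python. This is a minimal sketch assuming samtools-like semantics (every required bit must be set, any filtered bit excludes the read, and mapq is an inclusive minimum); the actual filtering is done by bamread, and keep_read is a hypothetical helper, not part of pyranges.

```python
# SAM flag bits that make up the default filter_flag of 1540.
UNMAPPED = 0x4      # 4: read is unmapped
QC_FAIL = 0x200     # 512: failed vendor or platform quality checks
DUPLICATE = 0x400   # 1024: PCR or optical duplicate

DEFAULT_FILTER_FLAG = UNMAPPED | QC_FAIL | DUPLICATE


def keep_read(flag, mapq, min_mapq=0, required_flag=0,
              filter_flag=DEFAULT_FILTER_FLAG):
    """Return True if a read passes all three filters.

    Assumed semantics for illustration; not the bamread implementation.
    """
    has_required = (flag & required_flag) == required_flag
    has_filtered = (flag & filter_flag) != 0
    return mapq >= min_mapq and has_required and not has_filtered


print(DEFAULT_FILTER_FLAG)             # 1540
print(keep_read(flag=16, mapq=30))     # True: no filtered bit is set
print(keep_read(flag=1040, mapq=30))   # False: 1040 = 16 + 1024, a duplicate
```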

Notes

This functionality requires the library bamread. It can be installed with pip install bamread or conda install -c bioconda bamread.

Examples

>>> path = pr.get_example_path("control.bam")
>>> pr.read_bam(path)
+--------------+-----------+-----------+--------------+------------+
| Chromosome   | Start     | End       | Strand       | Flag       |
| (category)   | (int32)   | (int32)   | (category)   | (uint16)   |
|--------------+-----------+-----------+--------------+------------|
| chr1         | 887771    | 887796    | +            | 16         |
| chr1         | 994660    | 994685    | +            | 16         |
| chr1         | 1770383   | 1770408   | +            | 16         |
| chr1         | 1995141   | 1995166   | +            | 16         |
| ...          | ...       | ...       | ...          | ...        |
| chrY         | 57402214  | 57402239  | +            | 16         |
| chrY         | 10643526  | 10643551  | -            | 0          |
| chrY         | 11776321  | 11776346  | -            | 0          |
| chrY         | 20557165  | 20557190  | -            | 0          |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 10,000 rows and 5 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> pr.read_bam(path, sparse=False, as_df=True)
     Chromosome    Start      End Strand  Flag  QueryStart  QueryEnd Name Cigar Quality
0          chr1   887771   887796      +    16           0        25   U0   25M    None
1          chr1   994660   994685      +    16           0        25   U0   25M    None
2          chr1  1041102  1041127      -     0           0        25   U0   25M    None
3          chr1  1770383  1770408      +    16           0        25   U0   25M    None
4          chr1  1995141  1995166      +    16           0        25   U0   25M    None
...         ...      ...      ...    ...   ...         ...       ...  ...   ...     ...
9995       chrM     3654     3679      -     0           0        25   U0   25M    None
9996       chrM     3900     3925      +    16           0        25   U0   25M    None
9997       chrM    13006    13031      +    16           0        25   U0   25M    None
9998       chrM    14257    14282      -     0           0        25   U0   25M    None
9999       chrM    14257    14282      -     0           0        25   U0   25M    None
<BLANKLINE>
[10000 rows x 10 columns]
pyranges.readers._fetch_gene_transcript_exon_id(attribute, annotation=None)
pyranges.readers.skiprows(f)
pyranges.readers.read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False)

Read files in the Gene Transfer Format.

Parameters:
  • f (str) – Path to GTF file.
  • full (bool, default True) – Whether to read and interpret the attribute column.
  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
  • nrows (int, default None) – Number of rows to read. Default None, i.e. all.
  • duplicate_attr (bool, default False) – Whether to keep all values of attributes that occur more than once in a row, or just keep the last one.
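
To make the effect of duplicate_attr concrete, here is a small sketch of parsing a GTF attribute string in which the tag key appears twice. The helper parse_gtf_attributes is hypothetical, and joining duplicate values with a comma is an assumption made for illustration; how pyranges itself represents duplicates may differ.

```python
def parse_gtf_attributes(attribute, duplicate_attr=False):
    """Parse a GTF attribute string into a dict.

    Hypothetical helper; pyranges' own parsing may differ in detail.
    Splitting on ';' is a simplification that breaks on quoted semicolons.
    """
    parsed = {}
    for item in attribute.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing semicolon
        key, _, value = item.partition(" ")
        value = value.strip().strip('"')
        if duplicate_attr and key in parsed:
            # Keep every value (the comma separator is an illustrative choice).
            parsed[key] = parsed[key] + "," + value
        else:
            # A later occurrence overwrites an earlier one.
            parsed[key] = value
    return parsed


attr = 'gene_id "ENSG00000223972"; tag "basic"; tag "CCDS";'
print(parse_gtf_attributes(attr))
print(parse_gtf_attributes(attr, duplicate_attr=True))
```

With the default, only the last tag value survives; with duplicate_attr=True both values are retained.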

See also

pyranges.read_gff3()
read files in the General Feature Format

Examples

>>> path = pr.get_example_path("ensembl.gtf")
>>> gr = pr.read_gtf(path)
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | +18   |
>>> # | (category)   | (object)   | (category)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)        | (object)       | ...   |
>>> # |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------|
>>> # | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...   |
>>> # | 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
>>> # For printing, the PyRanges was sorted on Chromosome and Strand.
>>> # 18 hidden columns: gene_name, gene_source, gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, ... (+ 8 more.)
pyranges.readers.read_gtf_full(f, as_df=False, nrows=None, skiprows=0, duplicate_attr=False)
pyranges.readers.to_rows(anno)
pyranges.readers.to_rows_keep_duplicates(anno)
pyranges.readers.read_gtf_restricted(f, as_df=False, skiprows=0, nrows=None)

The fields of a GTF line are:

  • seqname – Name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
  • source – Name of the program that generated this feature, or the data source (database or project name).
  • feature – Feature type name, e.g. Gene, Variation, Similarity.
  • start – Start position of the feature, with sequence numbering starting at 1.
  • end – End position of the feature, with sequence numbering starting at 1.
  • score – A floating point value.
  • strand – Defined as + (forward) or - (reverse).
  • frame – One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
  • attribute – A semicolon-separated list of tag-value pairs, providing additional information about each feature.
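
The nine fields described above can be split out of a single line as in the sketch below. GTF positions are 1-based and inclusive, while PyRanges intervals are 0-based and half-open, so the sketch subtracts 1 from start and leaves end unchanged. The helper split_gtf_line is hypothetical, and the input line is an illustrative one whose values were chosen to match the first row of the read_gtf example.

```python
GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute"]


def split_gtf_line(line):
    """Split one GTF line into its nine named fields.

    Hypothetical helper for illustration; not a pyranges function.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_FIELDS):
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    record = dict(zip(GTF_FIELDS, fields))
    # 1-based inclusive -> 0-based half-open: shift start, keep end.
    record["start"] = int(record["start"]) - 1
    record["end"] = int(record["end"])
    return record


line = '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";'
record = split_gtf_line(line)
print(record["start"], record["end"])  # 11868 14409
```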

pyranges.readers.to_rows_gff3(anno)
pyranges.readers.read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0)

Read files in the General Feature Format.

Parameters:
  • f (str) – Path to GFF file.
  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
  • nrows (int, default None) – Number of rows to read. Default None, i.e. all.

See also

pyranges.read_gtf()
read files in the Gene Transfer Format