
Statistics useful for genomics.

Module Contents


Adjust p-values with Benjamini-Hochberg.

Parameters:data (array-like) – The p-values to adjust.
Returns:The p-values adjusted with Benjamini-Hochberg, one per input value.
Return type:Pandas.DataFrame


>>> np.random.seed(0)
>>> x = np.random.random(10) / 100
>>> gr = pr.random(10)
>>> gr.PValue = x
>>> gr
| Chromosome   | Start     | End       | Strand       | PValue               |
| (category)   | (int32)   | (int32)   | (category)   | (float64)            |
| chr1         | 176601938 | 176602038 | +            | 0.005488135039273248 |
| chr1         | 155082851 | 155082951 | -            | 0.007151893663724195 |
| chr2         | 211134424 | 211134524 | -            | 0.006027633760716439 |
| chr9         | 78826761  | 78826861  | -            | 0.005448831829968969 |
| ...          | ...       | ...       | ...          | ...                  |
| chr16        | 52216522  | 52216622  | +            | 0.004375872112626925 |
| chr17        | 8085927   | 8086027   | -            | 0.008917730007820798 |
| chr19        | 17333425  | 17333525  | +            | 0.009636627605010294 |
| chr22        | 16728001  | 16728101  | +            | 0.003834415188257777 |
Stranded PyRanges object has 10 rows and 5 columns from 9 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.FDR = pr.stats.fdr(gr.PValue)
>>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"})
| Chromosome   | Start     | End       | Strand       | PValue      | FDR         |
| (category)   | (int32)   | (int32)   | (category)   | (float64)   | (float64)   |
| chr1         | 176601938 | 176602038 | +            | 0.0055      | 0.01098     |
| chr1         | 155082851 | 155082951 | -            | 0.0072      | 0.00894     |
| chr2         | 211134424 | 211134524 | -            | 0.0060      | 0.01005     |
| chr9         | 78826761  | 78826861  | -            | 0.0054      | 0.01362     |
| ...          | ...       | ...       | ...          | ...         | ...         |
| chr16        | 52216522  | 52216622  | +            | 0.0044      | 0.01459     |
| chr17        | 8085927   | 8086027   | -            | 0.0089      | 0.009909    |
| chr19        | 17333425  | 17333525  | +            | 0.0096      | 0.009637    |
| chr22        | 16728001  | 16728101  | +            | 0.0038      | 0.03834     |
Stranded PyRanges object has 10 rows and 6 columns from 9 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
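
The FDR values printed above are consistent with scaling each p-value by n / rank and capping the result at 1. The following is a minimal sketch of that calculation, not the library's source. The helper name `bh_scaled` is hypothetical, and the sketch assumes there are no tied p-values. Many Benjamini-Hochberg implementations additionally enforce monotonicity of the adjusted values, a step the output above does not appear to include.

```python
import numpy as np


def bh_scaled(p_values):
    """Scale each p-value by n / rank and cap at 1 (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    # Ordinal ranks starting at 1; assumes no ties among the p-values.
    ranks = p.argsort().argsort() + 1
    return np.minimum(p * n / ranks, 1.0)


np.random.seed(0)
x = np.random.random(10) / 100  # same p-values as in the example above
adjusted = bh_scaled(x)
print(np.round(adjusted, 5))
```

The smallest p-value has rank 1, so it is multiplied by the full n = 10, which is why it receives the largest adjusted value in the table.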

pyranges.statistics.fisher_exact(n1, d1, n2, d2, pseudocount=0)

Fisher’s exact for contingency tables.

Computes the two-sided, less and greater alternative hypotheses at the same time.

  • n1 (array-like of int) – Top left square of contingency table.
  • d1 (array-like of int) – Bottom left square of contingency table.
  • n2 (array-like of int) – Top right square of contingency table.
  • d2 (array-like of int) – Bottom right square of contingency table.
  • pseudocount (float, default 0) – Values > 0 allow Odds Ratio to always be a finite number.


The odds-ratio is computed as follows:

((n1 + pseudocount) / (d2 + pseudocount)) / ((n2 + pseudocount) / (d1 + pseudocount))

Returns:DataFrame with columns OR, P, PLeft and PRight.
Return type:pandas.DataFrame

See also

pyranges.statistics.fdr: correct for multiple testing


>>> d = {"TP": [1, 0, 8], "FP": [11, 12, 1], "TN": [9, 10, 2], "FN": [3, 2, 5]}
>>> df = pd.DataFrame(d)
>>> df
   TP  FP  TN  FN
0   1  11   9   3
1   0  12  10   2
2   8   1   2   5
>>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN)
         OR         P     PLeft    PRight
0  0.407407  0.002759  0.001380  0.999966
1  0.000000  0.000067  0.000034  1.000000
2  0.800000  0.034965  0.999126  0.024476
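
The OR column above can be checked by hand against the odds-ratio formula given earlier. The sketch below only evaluates that formula with NumPy; the helper name `odds_ratio` is hypothetical and the code is not the library's implementation.

```python
import numpy as np


def odds_ratio(n1, d1, n2, d2, pseudocount=0):
    """Odds ratio as written in the formula above (hypothetical helper)."""
    n1, d1, n2, d2 = (
        np.asarray(a, dtype=float) + pseudocount for a in (n1, d1, n2, d2)
    )
    return (n1 / d2) / (n2 / d1)


# Columns from the example above: n1=TP, d1=FP, n2=TN, d2=FN.
ratios = odds_ratio([1, 0, 8], [11, 12, 1], [9, 10, 2], [3, 2, 5])
print(np.round(ratios, 6))

# With a zero cell (d2 = 0), a pseudocount keeps the ratio finite.
smoothed = odds_ratio([3], [4], [5], [0], pseudocount=1)
print(smoothed)
```

The first call agrees with the OR column of the example. The second shows why a pseudocount greater than 0 is useful when a cell of the table is zero.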

pyranges.statistics.mcc(grs, genome=None, labels=None, strand=False, verbose=False)

Compute the Matthews correlation coefficient for PyRanges overlaps.

  • grs (list of PyRanges) – PyRanges to compare.
  • genome (DataFrame or dict, default None) – Should contain chromosome sizes. By default, the end position of the rightmost interval on each chromosome is used as a proxy for the chromosome size, but it is recommended to use a genome.
  • labels (list of str, default None) – Names to give the PyRanges in the output.
  • strand (bool, default False) – Whether to compute correlations per strand.
  • verbose (bool, default False) – Warn if some chromosomes are in the genome, but not in the PyRanges.


>>> np.random.seed(0)
>>> chromsizes = {"chrM": 16000}
>>> grs = [pr.random(chromsizes=chromsizes) for _ in range(3)]
>>> labels = ["a", "b", "c"]
>>> mcc = pr.stats.mcc(grs, labels=labels, genome=chromsizes)
>>> mcc
   T  F     TP  FP  TN  FN       MCC
0  a  a  15920   0  80   0  1.000000
1  a  b  15875  65  15  45  0.213109
3  a  c  15896  72   8  24  0.155496
2  b  a  15875  45  15  65  0.213109
5  b  b  15940   0  60   0  1.000000
6  b  c  15916  52   8  24  0.180354
4  c  a  15896  24   8  72  0.155496
7  c  b  15916  24   8  52  0.180354
8  c  c  15968   0  32   0  1.000000

To create a symmetric matrix (useful for heatmaps of correlations):

>>> mcc.set_index(["T", "F"]).MCC.unstack()
F         a         b         c
a  1.000000  0.213109  0.155496
b  0.213109  1.000000  0.180354
c  0.155496  0.180354  1.000000
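
The MCC column can be recomputed from the TP, FP, TN and FN columns with the standard Matthews correlation coefficient formula. This is a minimal sketch using only the standard library; the helper name `mcc_from_counts` is hypothetical, and returning 0 for an empty margin is a common convention assumed here, not something the source specifies.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a margin is empty
    return (tp * tn - fp * fn) / denominator


# Counts taken from the "a" versus "b" and "a" versus "c" rows of the table above.
ab = mcc_from_counts(tp=15875, fp=65, tn=15, fn=45)
ac = mcc_from_counts(tp=15896, fp=72, tn=8, fn=24)
print(round(ab, 6), round(ac, 6))
```

Swapping FP and FN leaves the result unchanged, which is why the unstacked matrix is symmetric.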

pyranges.statistics.rowbased_spearman(x, y)

Fast row-based Spearman’s correlation.

  • x (matrix-like) – 2D numerical matrix. Same size as y.
  • y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:Array with one entry per input row, where values are correlation coefficients.
Return type:numpy.ndarray


See also

pyranges.statistics.rowbased_pearson: fast row-based Pearson’s correlation
pyranges.statistics.fdr: correct for multiple testing


>>> np.random.seed(0)
>>> x = np.random.randint(10, size=(10, 10))
>>> y = np.random.randint(10, size=(10, 10))

Perform Spearman’s correlation pairwise on each row in 10x10 matrices:

>>> pr.stats.rowbased_spearman(x, y)
array([ 0.07523548, -0.24838724,  0.03703774,  0.24194052,  0.04778621,
       -0.23913505,  0.12923138,  0.26840486,  0.13292204, -0.29846295])
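
Spearman's correlation is Pearson's correlation computed on ranks, so a row-wise version can be sketched by ranking each row and then correlating the ranks. The code below illustrates that idea with pandas and NumPy; `rowwise_spearman` is a hypothetical name, and the sketch is not claimed to match the library's implementation or its performance.

```python
import numpy as np
import pandas as pd


def rowwise_spearman(x, y):
    """Spearman's rho for each pair of corresponding rows (hypothetical helper)."""
    # Rank within each row; ties receive the average rank.
    rx = pd.DataFrame(x).rank(axis=1).to_numpy()
    ry = pd.DataFrame(y).rank(axis=1).to_numpy()
    # Centre the ranks per row, then apply the Pearson formula row by row.
    rx = rx - rx.mean(axis=1, keepdims=True)
    ry = ry - ry.mean(axis=1, keepdims=True)
    numerator = (rx * ry).sum(axis=1)
    denominator = np.sqrt((rx ** 2).sum(axis=1) * (ry ** 2).sum(axis=1))
    return numerator / denominator


x = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])
y = np.array([[1, 4, 9, 16], [4, 3, 2, 1]])
rho = rowwise_spearman(x, y)
print(rho)
```

The first pair of rows is monotonically increasing but not linear, and the second is strictly reversed, so the coefficients are 1 and -1 respectively.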

pyranges.statistics.rowbased_pearson(x, y)

Fast row-based Pearson’s correlation.

  • x (matrix-like) – 2D numerical matrix. Same size as y.
  • y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:Array with one entry per input row, where values are correlation coefficients.
Return type:numpy.ndarray


See also

pyranges.statistics.rowbased_spearman: fast row-based Spearman’s correlation
pyranges.statistics.fdr: correct for multiple testing


>>> np.random.seed(0)
>>> x = np.random.randint(10, size=(10, 10))
>>> y = np.random.randint(10, size=(10, 10))

Perform Pearson’s correlation pairwise on each row in 10x10 matrices:

>>> pr.stats.rowbased_pearson(x, y)
array([ 0.20349603, -0.01667236, -0.01448763, -0.00442322,  0.06527234,
       -0.36710862,  0.14978726,  0.32360286,  0.17209191, -0.08902829])
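
A row-wise Pearson correlation can be written without a Python loop by centring each row and combining row sums. The following sketch shows one way to do this in NumPy and compares it with `np.corrcoef` applied row by row. The name `rowwise_pearson` is hypothetical, the input data is generated separately from the example above, and the code is not taken from the library.

```python
import numpy as np


def rowwise_pearson(x, y):
    """Pearson's r for each pair of corresponding rows (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre each row on its own mean.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    numerator = (xc * yc).sum(axis=1)
    denominator = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return numerator / denominator


rng = np.random.default_rng(0)
x = rng.integers(10, size=(10, 10))
y = rng.integers(10, size=(10, 10))

vectorised = rowwise_pearson(x, y)
looped = np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(x, y)])
print(np.allclose(vectorised, looped))
```

A row in which every value is identical has zero variance, so its coefficient is undefined and the sketch would return NaN for it.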

pyranges.statistics.rowbased_rankdata(data)

Rank order of entries in each row.

Equivalent to applying SciPy's rankdata with method="average" to each row.

Parameters:data (matrix-like) – The data to rank.
Returns:DataFrame where values are the within-row ranks of the data.
Return type:pandas.DataFrame


>>> np.random.seed(0)
>>> x = np.random.randint(10, size=(3, 10))
>>> x
array([[5, 0, 3, 3, 7, 9, 3, 5, 2, 4],
       [7, 6, 8, 8, 1, 6, 7, 7, 8, 1],
       [5, 9, 8, 9, 4, 3, 0, 3, 5, 0]])
>>> pr.stats.rowbased_rankdata(x)
     0    1    2    3    4     5    6    7    8    9
0  7.5  1.0  4.0  4.0  9.0  10.0  4.0  7.5  2.0  6.0
1  6.0  3.5  9.0  9.0  1.5   3.5  6.0  6.0  9.0  1.5
2  6.5  9.5  8.0  9.5  5.0   3.5  1.5  3.5  6.5  1.5
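
The same ranks can be obtained with plain pandas, which is a convenient way to see what average ranking does to ties. This is an illustrative sketch using `DataFrame.rank`, not the library's implementation.

```python
import numpy as np
import pandas as pd

# The matrix printed in the example above.
x = np.array([
    [5, 0, 3, 3, 7, 9, 3, 5, 2, 4],
    [7, 6, 8, 8, 1, 6, 7, 7, 8, 1],
    [5, 9, 8, 9, 4, 3, 0, 3, 5, 0],
])

# Rank within each row; tied values share the average of their positions.
ranks = pd.DataFrame(x).rank(axis=1, method="average")
print(ranks)
```

In the first row the three 3s occupy sorted positions 3, 4 and 5, so each receives rank 4.0, and the two 5s occupy positions 7 and 8, so each receives 7.5.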

pyranges.statistics.simes(df, groupby, pcol, keep_position=False)

Apply Simes' method to combine the p-values of dependent events into a single p-value.

  • df (pandas.DataFrame) – Data to analyse with Simes.
  • groupby (str or list of str) – Rows with equal values in these columns are combined with Simes' method.
  • pcol (str) – Name of column with p-values.
  • keep_position (bool, default False) – Keep columns “Chromosome”, “Start”, “End” and “Strand” if they exist.

See also

pyranges.statistics.fdr: correct for multiple testing


>>> s = '''Chromosome Start End Strand Gene PValue
... 1 10 20 + P53 0.0001
... 1 20 20 + P53 0.0002
... 1 30 20 + P53 0.0003
... 2 60 65 - FOX 0.05
... 2 70 75 - FOX 0.0000001
... 2 80 90 - FOX 0.0000021'''
>>> gr = pr.from_string(s)
>>> gr
|   Chromosome |     Start |       End | Strand       | Gene       |      PValue |
|   (category) |   (int32) |   (int32) | (category)   | (object)   |   (float64) |
|            1 |        10 |        20 | +            | P53        |     0.0001  |
|            1 |        20 |        20 | +            | P53        |     0.0002  |
|            1 |        30 |        20 | +            | P53        |     0.0003  |
|            2 |        60 |        65 | -            | FOX        |     0.05    |
|            2 |        70 |        75 | -            | FOX        |     1e-07   |
|            2 |        80 |        90 | -            | FOX        |     2.1e-06 |
Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> simes = pr.stats.simes(gr.df, "Gene", "PValue")
>>> simes
  Gene         Simes
0  FOX  3.000000e-07
1  P53  3.000000e-04
>>> gr.apply(lambda df:
... pr.stats.simes(df, "Gene", "PValue", keep_position=True))
|   Chromosome |     Start |       End |       Simes | Strand     | Gene       |
|     (object) |   (int32) |   (int32) |   (float64) | (object)   | (object)   |
|            1 |        10 |        20 |      0.0001 | +          | P53        |
|            2 |        60 |        90 |      1e-07  | -          | FOX        |
Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
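
Simes' method sorts the n p-values of a group and takes the minimum of n * p(i) / i over the sorted positions i. The sketch below applies that rule per gene with a pandas groupby and agrees with the per-gene result of the first `simes` call above. The helper name `simes_pvalue` is hypothetical and the code is not the library's implementation.

```python
import numpy as np
import pandas as pd


def simes_pvalue(pvalues):
    """Combine p-values with Simes' rule: min over i of n * p(i) / i."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    positions = np.arange(1, n + 1)
    return float(np.min(n * p / positions))


df = pd.DataFrame({
    "Gene": ["P53", "P53", "P53", "FOX", "FOX", "FOX"],
    "PValue": [0.0001, 0.0002, 0.0003, 0.05, 0.0000001, 0.0000021],
})

combined = df.groupby("Gene")["PValue"].apply(simes_pvalue)
print(combined)
```

For P53 all three candidates n * p(i) / i equal 3e-4, and for FOX the smallest candidate is 3 * 1e-7 / 1 = 3e-7.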

class pyranges.statistics.StatisticsMethods(pr)

Namespace for statistical comparison operations.

Accessed with gr.stats.

forbes(self, other, chromsizes, strandedness=None)

Compute Forbes coefficient.

Ratio which represents observed versus expected co-occurrence.

Described in Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.

  • other (PyRanges) – Intervals to compare with.
  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regard to strand, or on the same or opposite strand.

Returns:Ratio of observed versus expected co-occurrence.

Return type:


See also

jaccard: compute the Jaccard coefficient


>>> gr, gr2 =,
>>> chromsizes =
>>> gr.stats.forbes(gr2, chromsizes=chromsizes)
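
One common way to express the Forbes coefficient for interval sets is N * |A ∩ B| / (|A| * |B|), where the sizes are counted in base pairs and N is the genome length. The toy sketch below assumes that formulation and represents intervals as half-open (start, end) tuples on a single chromosome. The function names are hypothetical and the code does not use the library.

```python
def covered_positions(intervals):
    """Set of integer positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def forbes_coefficient(a, b, genome_length):
    """Observed overlap divided by the overlap expected by chance."""
    pos_a = covered_positions(a)
    pos_b = covered_positions(b)
    observed = len(pos_a & pos_b)
    return genome_length * observed / (len(pos_a) * len(pos_b))


a = [(0, 10), (20, 30)]   # covers 20 positions
b = [(5, 15), (25, 35)]   # covers 20 positions, 10 shared with a
forbes = forbes_coefficient(a, b, genome_length=100)
print(forbes)  # 2.5
```

A value above 1 means the two sets overlap more than expected if they were placed independently on the genome, and a value below 1 means they overlap less.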

jaccard(self, other, **kwargs)

Compute the Jaccard coefficient.

Ratio of the size of the intersection to the size of the union of two sets.

  • other (PyRanges) – Intervals to compare with.
  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regard to strand, or on the same or opposite strand.

Returns:Ratio of the size of the intersection to the size of the union of two sets.

Return type:


See also

forbes: compute the Forbes coefficient


>>> gr, gr2 =,
>>> chromsizes =
>>> gr.stats.jaccard(gr2, chromsizes=chromsizes)
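
For intervals, the Jaccard coefficient is the number of base pairs covered by both sets divided by the number covered by either set. The toy sketch below illustrates this with half-open (start, end) tuples on a single chromosome. The function names are hypothetical and the code does not use the library.

```python
def covered_positions(intervals):
    """Set of integer positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    pos_a = covered_positions(a)
    pos_b = covered_positions(b)
    return len(pos_a & pos_b) / len(pos_a | pos_b)


a = [(0, 10), (20, 30)]
b = [(5, 15), (25, 35)]
jaccard = jaccard_coefficient(a, b)  # 10 shared positions out of 30 covered
print(jaccard)
```

The result ranges from 0 for disjoint sets to 1 for identical coverage and, unlike the Forbes coefficient, does not depend on the genome length.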

relative_distance(self, other)

Compute spatial correlation between two sets.

Metric describing the relative distance between each interval in one set and its two closest intervals in the other.

  • other (PyRanges) – Intervals to compare with.
  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regard to strand, or on the same or opposite strand.

Returns:DataFrame containing the frequency of each relative distance.
Return type:pandas.DataFrame


See also

jaccard: compute the Jaccard coefficient
forbes: compute the Forbes coefficient


>>> gr, gr2 =,
>>> chromsizes =
>>> gr.stats.relative_distance(gr2)
    reldist  count  total  fraction
0      0.00    264   9956  0.026517
1      0.01    226   9956  0.022700
2      0.02    206   9956  0.020691
3      0.03    235   9956  0.023604
4      0.04    194   9956  0.019486
5      0.05    241   9956  0.024207
6      0.06    201   9956  0.020189
7      0.07    191   9956  0.019184
8      0.08    192   9956  0.019285
9      0.09    191   9956  0.019184
10     0.10    186   9956  0.018682
11     0.11    203   9956  0.020390
12     0.12    218   9956  0.021896
13     0.13    209   9956  0.020992
14     0.14    201   9956  0.020189
15     0.15    178   9956  0.017879
16     0.16    202   9956  0.020289
17     0.17    197   9956  0.019787
18     0.18    208   9956  0.020892
19     0.19    202   9956  0.020289
20     0.20    191   9956  0.019184
21     0.21    188   9956  0.018883
22     0.22    213   9956  0.021394
23     0.23    192   9956  0.019285
24     0.24    199   9956  0.019988
25     0.25    181   9956  0.018180
26     0.26    172   9956  0.017276
27     0.27    191   9956  0.019184
28     0.28    190   9956  0.019084
29     0.29    192   9956  0.019285
30     0.30    201   9956  0.020189
31     0.31    212   9956  0.021294
32     0.32    213   9956  0.021394
33     0.33    177   9956  0.017778
34     0.34    197   9956  0.019787
35     0.35    163   9956  0.016372
36     0.36    191   9956  0.019184
37     0.37    198   9956  0.019888
38     0.38    160   9956  0.016071
39     0.39    188   9956  0.018883
40     0.40    200   9956  0.020088
41     0.41    188   9956  0.018883
42     0.42    230   9956  0.023102
43     0.43    197   9956  0.019787
44     0.44    224   9956  0.022499
45     0.45    184   9956  0.018481
46     0.46    198   9956  0.019888
47     0.47    187   9956  0.018783
48     0.48    200   9956  0.020088
49     0.49    194   9956  0.019486
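
The reldist values in the table lie between 0.00 and 0.49, which fits a definition where each query point is compared with the two reference points flanking it: the distance to the nearer one is divided by the distance between the two. The sketch below assumes that definition and works on single points (for example interval midpoints) on one chromosome. The name `relative_distances` is hypothetical, and the exact reference points and binning used by the library are not shown in this document.

```python
import numpy as np


def relative_distances(query_points, reference_points):
    """Relative distance of each query point to its two flanking reference points."""
    ref = np.sort(np.asarray(reference_points, dtype=float))
    q = np.asarray(query_points, dtype=float)
    # Keep only query points that have a reference point on both sides.
    q = q[(q >= ref[0]) & (q < ref[-1])]
    right_idx = np.searchsorted(ref, q, side="right")
    left = ref[right_idx - 1]
    right = ref[right_idx]
    # Distance to the nearer flank, scaled by the gap between the flanks.
    return np.minimum(q - left, right - q) / (right - left)


reference = [0, 100, 200]
query = [10, 50, 175, 250]  # 250 lies beyond the last reference point
distances = relative_distances(query, reference)
print(distances)
```

If the two sets are spatially unrelated, these values are expected to be spread roughly evenly over the range, which is the pattern the fraction column above shows. An excess near 0 would suggest that the intervals tend to sit close to each other.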