Motif simulation explained

[1]:
import numpy as np
import seqlogo

This notebook illustrates how motifs are generated and how the choice of Dirichlet parameters affects them.

Each position (letter) of a motif is a probability vector over the four bases, sampled from a Dirichlet distribution. The Dirichlet distribution is parameterized by 4 alpha values, one each for the A, C, G, and T bases.
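To make the sampling step concrete, here is a minimal sketch of one standard way to draw from a Dirichlet distribution: draw independent Gamma values with the alphas as shape parameters, then normalise each row. The names `n_positions` and `ppm_sketch` are illustrative only, and the notebook itself calls `np.random.dirichlet` directly.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0, 1.0, 1.0])  # one alpha per base: A, C, G, T
n_positions = 5  # hypothetical motif length for this sketch

# Draw independent Gamma(alpha_i, 1) values for every position and base ...
gamma_draws = rng.gamma(shape=alpha, scale=1.0, size=(n_positions, alpha.size))
# ... and normalise each row so that it sums to 1.
ppm_sketch = gamma_draws / gamma_draws.sum(axis=1, keepdims=True)

print(ppm_sketch.round(3))
```

Each row of `ppm_sketch` is non-negative and sums to 1, so it can be read as the base probabilities of one motif position.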

[2]:
l_motif = 5  # number of positions in each simulated motif
[3]:
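def get_ic_logo(ppm_np: np.ndarray):
    """Print the PPM and its information content (IC), then return a sequence logo."""
    ppm = seqlogo.Ppm(ppm_np)
    print(f'The motif ppm is: {ppm}')
    # Build the string outside the f-string: reusing the same quote type
    # inside an f-string is a syntax error before Python 3.12.
    ic_per_letter = ' '.join(str(round(i, 2)) for i in ppm.ic)
    print(f'IC per letter: {ic_per_letter}')
    print(f'IC per motif: {sum(ppm.ic):.2f}')
    logo_plot = seqlogo.seqlogo(ppm, ic_scale=True, format='png', size='small')
    return logo_plot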
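For readers who want to see what the information content measures, the sketch below computes the standard per-position value, 2 bits minus the Shannon entropy, with NumPy only. It assumes a uniform background over the four bases and no small-sample correction. Whether `seqlogo` applies any further correction is not checked here, and `information_content` is a hypothetical helper, not part of `seqlogo`.

```python
import numpy as np

def information_content(ppm: np.ndarray) -> np.ndarray:
    """Per-position IC in bits, assuming a uniform background over 4 letters."""
    p = np.asarray(ppm, dtype=float)
    # Treat 0 * log2(0) as 0 so that exact zeros do not produce NaN.
    log_p = np.log2(p, where=p > 0, out=np.zeros_like(p))
    entropy = -(p * log_p).sum(axis=1)
    return np.log2(p.shape[1]) - entropy

uniform = np.full((1, 4), 0.25)              # all letters equally likely
one_hot = np.array([[1.0, 0.0, 0.0, 0.0]])   # a single letter is certain

ic_uniform = information_content(uniform)    # 0 bits
ic_one_hot = information_content(one_hot)    # 2 bits
```

The two extremes bound every position: a uniform position carries 0 bits and a fully determined position carries 2 bits.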

Unconstrained information content through Dirichlet parameters

These settings give each letter an equal chance of being the dominant one at a position, without strongly favoring any of them. As a result, the information content varies from position to position.

[4]:
np.random.seed(42)
unconstrained = [1, 1, 1, 1]
motif_from_unconstr_prior = np.random.dirichlet(unconstrained, l_motif)
logo_plot = get_ic_logo(motif_from_unconstr_prior)
logo_plot
A C G T
0 0.082197 0.527252 0.230641 0.159911
1 0.070375 0.070363 0.024826 0.834435
2 0.161962 0.216972 0.003665 0.617401
3 0.735638 0.098290 0.082638 0.083434
4 0.179898 0.368931 0.280463 0.170708
The motif ppm is:
IC per letter: 0.31 1.11 0.64 0.75 0.07
IC per motif: 2.88
[4]:
../_images/usage_motif_simulation_6_2.png

Constrained low information content

When high (>1) alpha values are provided, the Dirichlet distribution is concentrated around its mean. With equal alphas the mean is uniform, so each letter has a probability close to 0.25 at every position. Thus, the motif has low information content.
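The concentration effect can be quantified with the closed-form mean and variance of a Dirichlet component. The sketch below uses the textbook formulas; `dirichlet_mean_var` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def dirichlet_mean_var(alpha):
    """Return the per-component mean and variance of Dirichlet(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    mean = alpha / alpha0
    var = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1.0))
    return mean, var

mean_1, var_1 = dirichlet_mean_var([1, 1, 1, 1])
mean_10, var_10 = dirichlet_mean_var([10, 10, 10, 10])

print(mean_1, var_1)    # mean 0.25 each, variance 0.0375 each
print(mean_10, var_10)  # mean 0.25 each, variance about 0.0046 each
```

Both settings share the same mean of 0.25 per letter, but raising every alpha from 1 to 10 shrinks the variance roughly eightfold, which is why the sampled positions stay near uniform.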

[5]:
np.random.seed(43)
constr_low = [10, 10, 10, 10]
motif_from_constr_low_prior = np.random.dirichlet(constr_low, l_motif)
logo_plot = get_ic_logo(motif_from_constr_low_prior)
logo_plot
A C G T
0 0.271576 0.184040 0.325902 0.218483
1 0.334526 0.197765 0.204022 0.263687
2 0.327130 0.269244 0.196121 0.207505
3 0.372365 0.171653 0.199164 0.256818
4 0.238419 0.293696 0.151122 0.316763
The motif ppm is:
IC per letter: 0.03 0.03 0.03 0.07 0.05
IC per motif: 0.21
[5]:
../_images/usage_motif_simulation_8_2.png

Constrained high information content

When low (<1) alpha values are given, the Dirichlet distribution is sparse: at each position of the motif, a single letter typically receives most of the probability mass while the others are close to 0. Thus, the motif has high information content.
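A single motif of five positions is a small sample, so it can help to average over many draws. The following sketch estimates the mean per-position information content for symmetric alphas of 0.1, 1, and 10. It assumes the standard definition of 2 bits minus the entropy, and `mean_position_ic` with its `n_draws` and `seed` parameters is hypothetical.

```python
import numpy as np

def mean_position_ic(alpha, n_draws=10_000, seed=0):
    """Average IC (bits) of positions sampled from Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(alpha, size=n_draws)
    # Treat 0 * log2(0) as 0 to guard against underflow for small alphas.
    log_p = np.log2(p, where=p > 0, out=np.zeros_like(p))
    ic = 2.0 + (p * log_p).sum(axis=1)
    return float(ic.mean())

mean_ic = {a: mean_position_ic([a] * 4) for a in (0.1, 1.0, 10.0)}
for a, ic in mean_ic.items():
    print(f"alpha = {a}: mean IC per position = {ic:.2f} bits")
```

The averages decrease as alpha grows, which matches the ordering of the three regimes described in this notebook.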

[6]:
np.random.seed(42)
constr_high = [0.1, 0.1, 0.1, 0.1]
motif_from_constr_high_prior = np.random.dirichlet(constr_high, l_motif)
logo_plot = get_ic_logo(motif_from_constr_high_prior)
logo_plot
A C G T
0 1.228498e-03 9.987713e-01 1.932663e-07 9.883878e-12
1 3.711972e-02 8.230591e-17 9.628800e-01 2.379930e-07
2 8.952084e-04 2.978610e-02 9.687221e-01 5.966345e-04
3 3.098475e-02 8.017583e-06 4.237311e-01 5.452761e-01
4 1.241914e-12 9.794086e-01 6.295173e-06 2.058508e-02
The motif ppm is:
IC per letter: 1.99 1.77 1.79 0.84 1.86
IC per motif: 8.24
[6]:
../_images/usage_motif_simulation_10_2.png

Imbalanced alphas: one or more letters are more probable than the rest

In this example, a higher value is given to the alpha corresponding to letter ‘A’, so ‘A’ is more likely to have the highest probability at each position. The alpha corresponding to letter ‘G’ is the lowest, so ‘G’ appears in the motif less frequently than ‘C’ or ‘T’.
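The effect of unequal alphas can be checked against the expected probabilities, which are the alphas divided by their sum. The sketch below compares that expectation with an empirical average over many draws, and counts how often each letter is the most probable one. The variable names and the number of draws are illustrative choices.

```python
import numpy as np

letters = "ACGT"
alpha = np.array([2.0, 0.5, 0.3, 0.5])
expected = alpha / alpha.sum()  # expected probability of each letter

rng = np.random.default_rng(0)
draws = rng.dirichlet(alpha, size=20_000)
empirical = draws.mean(axis=0)

# Fraction of sampled positions in which each letter has the highest probability.
top_fraction = np.bincount(draws.argmax(axis=1), minlength=4) / len(draws)

for letter, e, m, t in zip(letters, expected, empirical, top_fraction):
    print(f"{letter}: expected {e:.3f}, empirical {m:.3f}, top letter in {t:.1%}")
```

With these alphas, ‘A’ dominates most positions, but not all of them: any single motif can still contain positions led by another letter.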

[7]:
np.random.seed(42)
unequal = [2, 0.5, 0.3, 0.5]
motif_from_unequal_prior = np.random.dirichlet(unequal, l_motif)
logo_plot = get_ic_logo(motif_from_unequal_prior)
logo_plot
A C G T
0 0.582251 0.090481 0.000497 0.326770
1 0.689308 0.000195 0.295247 0.015249
2 0.900675 0.038215 0.000634 0.060476
3 0.059368 0.010963 0.000030 0.929639
4 0.299011 0.545008 0.075405 0.080577
The motif ppm is:
IC per letter: 0.7 1.02 1.43 1.59 0.43
IC per motif: 5.16
[7]:
../_images/usage_motif_simulation_12_2.png

In this second example, ‘G’ and ‘C’ are given higher alphas, and therefore higher expected probabilities, than ‘A’ or ‘T’. All alphas are below 1, so the sampled positions are still sparse.

[8]:
np.random.seed(42)
gc_rich = [0.1, 0.5, 0.5, 0.1]
motif_from_gc_rich_prior = np.random.dirichlet(gc_rich, l_motif)
logo_plot = get_ic_logo(motif_from_gc_rich_prior)
logo_plot
A C G T
0 0.000079 0.964303 0.035618 6.395171e-13
1 0.005590 0.000384 0.994026 3.583746e-08
2 0.000012 0.322111 0.677869 7.818585e-06
3 0.000914 0.093587 0.905499 3.185328e-12
4 0.631156 0.054725 0.314119 4.321596e-10
The motif ppm is:
IC per letter: 1.78 1.95 1.09 1.54 0.83
IC per motif: 7.18
[8]:
../_images/usage_motif_simulation_14_2.png
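As a quick check of the GC-rich result, the sketch below reads the consensus sequence and the overall G+C probability mass from the PPM printed above. The matrix is copied from that output; `consensus` and `gc_fraction` are names introduced here for illustration.

```python
import numpy as np

letters = np.array(list("ACGT"))
# PPM copied from the GC-rich output (rows = positions, columns = A, C, G, T).
gc_rich_ppm = np.array([
    [0.000079, 0.964303, 0.035618, 6.395171e-13],
    [0.005590, 0.000384, 0.994026, 3.583746e-08],
    [0.000012, 0.322111, 0.677869, 7.818585e-06],
    [0.000914, 0.093587, 0.905499, 3.185328e-12],
    [0.631156, 0.054725, 0.314119, 4.321596e-10],
])

# Most probable letter at each position.
consensus = "".join(letters[gc_rich_ppm.argmax(axis=1)])  # "CGGGA"
# Share of the total probability mass that falls on C and G.
gc_fraction = gc_rich_ppm[:, 1:3].sum() / gc_rich_ppm.sum()

print(consensus, round(gc_fraction, 2))
```

Four of the five positions are dominated by ‘C’ or ‘G’. The last position shows that a letter with a low alpha can still win an individual draw.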