Motif simulation explained

[1]:

import numpy as np
import seqlogo

This notebook intends to illustrate the generation of motifs and the effect of the choice of Dirichlet parameters.

Each position (letter) of a motif is sampled from a Dirichlet distribution. The Dirichlet distribution is initialized with 4 alpha parameters corresponding to A, C, G, and T bases.

[2]:

l_motif = 5

[3]:

def get_ic_logo(ppm_np: np.array):
    ppm = seqlogo.Ppm(ppm_np)
    print(f'The motif ppm is: {ppm}')
    print(f'IC per letter: {' '.join([str(round(i,2)) for i in list(ppm.ic)])}')
    print(f'IC per motif: {sum(ppm.ic):.2f}')
    logo_plot = seqlogo.seqlogo(ppm, ic_scale = True, format = 'png', size = 'small')
    return logo_plot

Unconstrained information content through Dirichlet parameters

This settings provide equal chances for each letter to be the main selection, but does not give very strong weight to any of them. This results in varying information content of any position of the motif.

[4]:

np.random.seed(42)
unconstrained = [1, 1, 1, 1]
motif_from_unconstr_prior = np.random.dirichlet(unconstrained, l_motif)
logo_plot = get_ic_logo(motif_from_unconstr_prior)
logo_plot

	A	C	G	T
0	0.082197	0.527252	0.230641	0.159911
1	0.070375	0.070363	0.024826	0.834435
2	0.161962	0.216972	0.003665	0.617401
3	0.735638	0.098290	0.082638	0.083434
4	0.179898	0.368931	0.280463	0.170708

The motif ppm is:
IC per letter: 0.31 1.11 0.64 0.75 0.07
IC per motif: 2.88

[4]:

../_images/usage_motif_simulation_6_2.png

Constrained low information content

When high (>1) alpha values are provided, the Dirichlet distribution will be concentrated. This means that in the motif, each letter has about the same probability for each position. Thus, the motif has low information content.

[5]:

np.random.seed(43)
constr_low = [10, 10, 10, 10]
motif_from_constr_low_prior = np.random.dirichlet(constr_low, l_motif)
logo_plot = get_ic_logo(motif_from_constr_low_prior)
logo_plot

	A	C	G	T
0	0.271576	0.184040	0.325902	0.218483
1	0.334526	0.197765	0.204022	0.263687
2	0.327130	0.269244	0.196121	0.207505
3	0.372365	0.171653	0.199164	0.256818
4	0.238419	0.293696	0.151122	0.316763

The motif ppm is:
IC per letter: 0.03 0.03 0.03 0.07 0.05
IC per motif: 0.21

[5]:

../_images/usage_motif_simulation_8_2.png

Constrained high information content

When low (<1) alpha values are given, the Dirichlet distribution will be sparse. This means that for each position in the motif a single value will get most mass while others are close to 0. Thus, the motif will have high information content.

[6]:

np.random.seed(42)
constr_high = [0.1, 0.1, 0.1, 0.1]
motif_from_constr_high_prior = np.random.dirichlet(constr_high, l_motif)
logo_plot = get_ic_logo(motif_from_constr_high_prior)
logo_plot

	A	C	G	T
0	1.228498e-03	9.987713e-01	1.932663e-07	9.883878e-12
1	3.711972e-02	8.230591e-17	9.628800e-01	2.379930e-07
2	8.952084e-04	2.978610e-02	9.687221e-01	5.966345e-04
3	3.098475e-02	8.017583e-06	4.237311e-01	5.452761e-01
4	1.241914e-12	9.794086e-01	6.295173e-06	2.058508e-02

The motif ppm is:
IC per letter: 1.99 1.77 1.79 0.84 1.86
IC per motif: 8.24

[6]:

../_images/usage_motif_simulation_10_2.png

Imbalanced alphas: one (or more) letter is more probable than the rest.

In this example, higher value is given for the alpha corresponding to letter ‘A’, thus it is more likely to have the highest probability in each position. The alpha corresponding to letter ‘G’ has the lowest probability, thus it is less frequently appears in the motif as ‘C’ or ‘T’.

[7]:

np.random.seed(42)
unequal = [2, 0.5, 0.3, 0.5]
motif_from_unequal_prior = np.random.dirichlet(unequal, l_motif)
logo_plot = get_ic_logo(motif_from_unequal_prior)
logo_plot

	A	C	G	T
0	0.582251	0.090481	0.000497	0.326770
1	0.689308	0.000195	0.295247	0.015249
2	0.900675	0.038215	0.000634	0.060476
3	0.059368	0.010963	0.000030	0.929639
4	0.299011	0.545008	0.075405	0.080577

The motif ppm is:
IC per letter: 0.7 1.02 1.43 1.59 0.43
IC per motif: 5.16

[7]:

../_images/usage_motif_simulation_12_2.png

In this second example, ‘G’ and ‘C’ have higher probability than ‘A’ or ‘T’. The values are generally small, so sparsity is still achieved.

[8]:

np.random.seed(42)
gc_rich = [0.1, 0.5, 0.5, 0.1]
motif_from_gc_rich_prior = np.random.dirichlet(gc_rich, l_motif)
logo_plot = get_ic_logo(motif_from_gc_rich_prior)
logo_plot

	A	C	G	T
0	0.000079	0.964303	0.035618	6.395171e-13
1	0.005590	0.000384	0.994026	3.583746e-08
2	0.000012	0.322111	0.677869	7.818585e-06
3	0.000914	0.093587	0.905499	3.185328e-12
4	0.631156	0.054725	0.314119	4.321596e-10

The motif ppm is:
IC per letter: 1.78 1.95 1.09 1.54 0.83
IC per motif: 7.18

[8]:

../_images/usage_motif_simulation_14_2.png