Command line options

There are four commands, each with its own set of options.

$ python -m inmotifin
Usage: python -m inmotifin [OPTIONS] COMMAND [ARGS]...

You are running inMOTIFin

Options:
--help  Show this message and exit.

Commands:
multimers          Multimerizing motifs given motifs and distances
motifs             Creating motifs given information content, length,...
motif-in-seq       Simulating sequences with inserted motif instances
random-sequences   Creating random sequences given length and alphabet

Simulation of sequences with inserted motif instances

python -m inmotifin motif-in-seq --help

Arguments

--workdir

Folder of the simulation outputs. Defaults to the current working directory.
NOTE: it should be a relative path. Absolute paths are not supported.

--title

The title of the simulation. A new folder workdir\title is
created and all simulated outputs will get the title prefix.
Default: sim

--seed

Random seed for the different steps in the simulation. All results
are reproducible when set. Default: None

--config

Config file for the simulation with the parameters for creating
sequences with motif instances in them. Default: None

--dirichlet_alpha

Alpha values for the Dirichlet distribution from which motifs are
sampled. Default: 0.5,0.5,0.5,0.5 (Jeffreys prior)
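
As an illustration of what sampling a motif from a Dirichlet distribution can look like, the sketch below draws one probability vector per motif position with NumPy. It is not inMOTIFin's own code; the function name sample_motif and the (length x alphabet) layout are assumptions made for this example.

```python
import numpy as np

def sample_motif(alpha, length, seed=None):
    """Draw a (length x alphabet size) matrix, one Dirichlet draw per position.

    Hypothetical helper for illustration; not part of inMOTIFin.
    """
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha, size=length)

alpha = [0.5, 0.5, 0.5, 0.5]  # the documented default (Jeffreys prior)
motif = sample_motif(alpha, length=5, seed=0)

# Every row is a probability distribution over the alphabet letters.
print(motif.round(2))
```

Alpha values below 1 tend to concentrate the probability mass on few letters per position, while larger values give flatter, less informative positions.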

--num_motifs

Number of motifs to create. Default: 10

--m_length_min

Length of the simulated motifs, unless --m_length_max is
specified. If --m_length_max is also specified, this is the
lower boundary of length. Default: 5

--m_length_max

Maximum allowed length of the motif. The actual length is sampled
from a uniform distribution. Default: None

--m_alphabet

String of letters in the motif alphabet. Default: ACGT

--motif_files

List of path(s) to the motif file(s). Supported formats are jaspar,
meme, csv with JASPAR motif ids. Default: None

--jaspar_db_version

Release name of JASPAR database version to use when fetching JASPAR
motif IDs. For further information see pyJASPAR’s documentation.
Example value: ‘JASPAR2024’. Default: None

--m_alphabet_pairs

Dictionary of letter pairs for reverse complementing the motif instance.
Default: A:T,C:G,G:C,T:A
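
The following sketch shows how a pair specification in the form shown above could be turned into a lookup table and used to reverse complement a motif instance. The helper names parse_pairs and reverse_complement are hypothetical, and the parsing is an assumption about the format rather than the tool's actual implementation.

```python
def parse_pairs(spec):
    """Turn 'A:T,C:G,G:C,T:A' into {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}."""
    pairs = {}
    for item in spec.split(","):
        letter, partner = item.strip().split(":")
        pairs[letter] = partner
    return pairs

def reverse_complement(instance, pairs):
    """Read the instance backwards and swap each letter for its partner."""
    return "".join(pairs[base] for base in reversed(instance))

pairs = parse_pairs("A:T,C:G,G:C,T:A")
print(reverse_complement("AACGT", pairs))  # ACGTT
```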

--num_backgrounds

Number of background sequences to simulate. Default: 100

--b_length_min

Length of the simulated background sequences, unless --b_length_max
is specified. If --b_length_max is also specified, this
is the lower boundary of length. Default: 50

--b_length_max

Maximum allowed length of the background sequences. The actual length
is sampled from a uniform distribution. Default: None

--b_alphabet

String of letters in the background alphabet. Default: ACGT

--b_alphabet_prior

Comma separated probability of the letters in the background alphabet.
Default: 0.25,0.25,0.25,0.25
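
To make the role of the alphabet and its prior concrete, here is a minimal sketch of drawing an independent and identically distributed background with NumPy. The function simulate_background is a hypothetical name, and the sketch assumes the prior values are listed in the same order as the alphabet letters.

```python
import numpy as np

def simulate_background(alphabet, prior, length, rng):
    """Draw each position independently according to the letter probabilities."""
    letters = rng.choice(list(alphabet), size=length, p=prior)
    return "".join(letters)

rng = np.random.default_rng(42)
# Documented defaults: alphabet ACGT, uniform prior, length 50.
seq = simulate_background("ACGT", [0.25, 0.25, 0.25, 0.25], 50, rng)
print(seq)
```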

--background_files

Path(s) to the background file(s) in fasta format. Default: None

--background_type

Parameter defining how the background sequences are used. Supported types are:
fasta_iid, random_nucl_shuffled_only, random_nucl_shuffled_addon,
iid, markov_fit, markov_sim.
fasta_iid: Fasta files are used as is. Default when background_files
is not None.
random_nucl_shuffled_only: Fasta files are used, nucleotides in
sequences are shuffled and only the shuffled sequences are used / returned.
random_nucl_shuffled_addon: Fasta files are used, nucleotides in
sequences are shuffled and both the original and the shuffled sequences
are used / returned.
iid: Fasta files are ignored if provided, b_alphabet_prior specifies
the nucleotide probabilities. Default when background_files is None.
markov_fit: Fasta files are used to fit a hidden Markov model for
posterior probabilities. Order specified with markov_order.
markov_sim: Fasta files are used to fit and sample from a hidden Markov
model, so this is a type of simulation. Order specified with markov_order.

--markov_order

Order of the Markov model to learn from sequences when background_type
is set to markov_fit or markov_sim. Defaults to 0, corresponding to
learning independent nucleotide frequencies.

--markov_n_iter

Number of iterations used when learning the Markov model from sequences.
Defaults to 100.

--markov_algorithm

Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’.
See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.

--num_shuffle

Number of shuffles of the backgrounds. Used when background_type is
set to random_nucl_shuffled_only or random_nucl_shuffled_addon.
Default: None
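
Nucleotide shuffling keeps the letter composition of a sequence but destroys the order. The sketch below illustrates the idea with the standard library; it assumes that the number of shuffles means the number of shuffled copies made per background, which is an interpretation and not confirmed by the text above. The name shuffle_sequence is hypothetical.

```python
import random

def shuffle_sequence(seq, rng):
    """Return a copy of seq with its letters in random order."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(0)
original = "ACGTACGTTTGA"
num_shuffle = 3  # assumed meaning: shuffled copies per background

shuffled = [shuffle_sequence(original, rng) for _ in range(num_shuffle)]
for s in shuffled:
    print(s)
```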

--num_groups

Number of groups into which motifs are assigned. If set to 1, the rest
of the options are ignored and all motifs are assigned to a
single group. Default: 1

--max_group_size

Maximum size of each group. It cannot be smaller than the number
of motifs. Each group size is sampled from a binomial distribution
with number of trials = max_group_size and success probability = group_size_p.
Default: inf

--group_size_p

This parameter controls the expected size of each group. Each
group size is sampled from a binomial distribution with number of
trials = max_group_size and success probability = group_size_p. Default: 1
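
The binomial sampling described above can be sketched with NumPy as follows. The concrete values are made up for illustration, and how inMOTIFin treats edge cases (for example a sampled size of 0) is not covered here.

```python
import numpy as np

rng = np.random.default_rng(1)

max_group_size = 8   # hypothetical value for --max_group_size
group_size_p = 0.5   # hypothetical value for --group_size_p
num_groups = 4       # hypothetical value for --num_groups

# One binomial draw per group: trials = max_group_size, success = group_size_p.
sizes = rng.binomial(n=max_group_size, p=group_size_p, size=num_groups)

# The mean of a binomial distribution is n * p.
expected_size = max_group_size * group_size_p
print(sizes, expected_size)
```

With group_size_p = 1 (the default) every draw equals max_group_size, so all groups reach the maximum size.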

--group_motif_assignment_file

Path to the motif-to-group assignment file in two-column tsv format.
The first column is the group IDs, and the second column lists the
motif IDs that are assigned to the corresponding group. Default: None

--group_freq_type

The method of selecting group background frequencies.
Values: uniform, random. Uniform means each group has
an equal chance to be selected. Random means each group is assigned a
probability of being selected. The difference between a frequent and
a rare group is controlled by the --group_freq_range parameter.
Default: uniform

--group_freq_range

The range of the potential differences between a frequent and a
rare group. Default: None

--motif_freq_type

The method of selecting motif background frequencies.
Values: uniform, random. Uniform means each motif has
an equal chance to be selected. Random means each motif is assigned
a probability of being selected. The difference between a frequent
and a rare motif is controlled by the --motif_freq_range parameter.
Default: uniform

--motif_freq_range

The range of the potential differences between a frequent and a
rare motif. Default: None

--concentration_factor

The preference of each group to be selected again when selecting
more than one group for insertion. Value between 0 and 1. Default: 1

--group_group_type

The method of selecting group-group transition probabilities.
Values: uniform, random. Uniform means any two
groups are equally likely to co-occur. Random means group pairs
are assigned a random probability of transition. Default: uniform

--group_freq_file

Tsv file including the background frequencies for the groups
to be selected. Default: None

--motif_freq_file

Tsv file including the background frequencies for the selection
of motifs to be inserted. Default: None

--group_group_file

Tsv file including the group-group transition probability matrix.
Default: None

--position_type

Type of position simulation. Supported values: central,
left_central, right_central, uniform, gaussian.
central: the first motif is inserted into the center of the
background (replacing existing bases).
left_central: the first base is aligned to the center (replacing
existing bases).
right_central: the last base is aligned to the position one before
the center (replacing existing bases).
uniform: all positions have an equal chance. With the to_replace
option, the user may choose between replacing existing bases or
inserting between them.
gaussian: positions follow the probabilities of a Gaussian
distribution centered on the given position of the background, as per
the position_means and position_variances parameters. Gaussian
insertions are left aligned and do not replace existing bases.
Default: central

--position_means

Comma separated mean values for the gaussian positioning option.
Default: None

--position_variances

Comma separated variance values for the gaussian positioning option.
Default: None

--num_motif_in_seq

Number of sequences with motifs in them to generate. Default: 100

--pc_no_motif

Percentage of sequences without motifs. Number between 0 and 100
is expected. Default: 0

--to_replace

Whether to replace background bases with the motif instance when sampling positions
uniformly. The alternative is to insert between existing bases. Default: True
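
The difference between replacing bases and inserting between them can be sketched with plain string slicing. The helper place_instance and the 0-based start coordinate are assumptions made for this example, not the tool's own code.

```python
def place_instance(background, instance, start, to_replace=True):
    """Put a motif instance into a background at a 0-based start position."""
    if to_replace:
        # Overwrite len(instance) bases: the sequence length stays the same.
        end = start + len(instance)
        return background[:start] + instance + background[end:]
    # Insert between existing bases: the sequence grows by len(instance).
    return background[:start] + instance + background[start:]

background = "AAAAAAAAAA"
replaced = place_instance(background, "CGT", 3, to_replace=True)
inserted = place_instance(background, "CGT", 3, to_replace=False)

print(replaced)  # AAACGTAAAA
print(inserted)  # AAACGTAAAAAAA
```

Because insertion without replacement lengthens the sequence, coordinates in the final sequence can differ from coordinates in the original background.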

--orientation_prob

Probability of reversing a motif instance in the motif-in-sequence.
Default: 0.5

--num_groups_per_seq

Number of groups to sample per sequence. Default: 1

--motif_sampling_replacement

Whether to select motifs from groups with replacement. Note: if more motifs
are requested than are available in a group, replacement will be used regardless
of this parameter. Default: True. Set to False by using the flag
--no-motif_sampling_replacement

--n_instances_per_sequence

Number of motif instances to be inserted per sequence. Takes
precedence over --n_instances_per_sequence_l. Default: 1

--n_instances_per_sequence_l

Lambda parameter of Poisson distribution for selecting the number
of motif instances to be inserted per sequence. Default: None
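
As a rough illustration of the Poisson option, the sketch below draws the number of instances for ten sequences with NumPy. The lambda value is made up, and whether inMOTIFin adjusts draws of 0 is not stated above, so no such handling is shown.

```python
import numpy as np

rng = np.random.default_rng(7)

lam = 2.0            # hypothetical value for --n_instances_per_sequence_l
num_sequences = 10   # hypothetical number of sequences

# One Poisson draw per sequence; the mean number of instances is lam.
counts = rng.poisson(lam=lam, size=num_sequences)
print(counts)
```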

--to_draw

Whether to draw the directed acyclic graph of the simulation steps.
Default: False

For further reference on hidden Markov modeling, see the hmmlearn documentation.

File format examples

motif_files in different formats

MEME format: see the official MEME format description page for the minimal MEME format expected here.

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF test1 test1
letter-probability matrix: alength= 4 w= 3 nsites= 99494 E= 0
0.13  0.10  0.34  0.43
0.75  0.06  0.19  0.00
0.00  0.00  0.00  1.00

JASPAR format: see the official JASPAR format description page.

>MA0001.1 MA0001.1.AGL3
A  [     0      3     79     40     66     48     65     11     65      0 ]
C  [    94     75      4      3      1      2      5      2      3      3 ]
G  [     1      0      3      4      1      0      5      3     28     88 ]
T  [     2     19     11     50     29     47     22     81      1      6 ]
>MA0003.1 MA0003.1.TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0002.2 MA0002.2.Runx1
A  [   287    234    123     57      0     87      0     17     10    131    500 ]
C  [   496    485   1072      0     75    127      0     42    400    463    158 ]
G  [   696    467    149      7   1872     70   1987   1848    251     81    289 ]
T  [   521    814    656   1936     53   1716     13     93   1339   1325   1053 ]

CSV format: comma separated list of JASPAR database matrix IDs (base ID + version) is expected.

MA0001.1 , MA0003.1 , MA0002.2

group_motif_assignment_file

<title>_group_0   <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1   <title>_motif_1,<title>_motif_5
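
A file in this layout could be read into a dictionary as sketched below. The parser assumes a tab between the two columns and commas between motif IDs; the function name and the sim_ prefix standing in for the title are hypothetical.

```python
def parse_group_assignment(text):
    """Map each group ID to the list of motif IDs assigned to it."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        group_id, motif_field = line.split("\t")
        groups[group_id.strip()] = [m.strip() for m in motif_field.split(",")]
    return groups

example = (
    "sim_group_0\tsim_motif_1,sim_motif_2,sim_motif_3,sim_motif_4\n"
    "sim_group_1\tsim_motif_1,sim_motif_5\n"
)
groups = parse_group_assignment(example)
print(groups["sim_group_1"])  # ['sim_motif_1', 'sim_motif_5']
```

Note that a motif may belong to more than one group, as sim_motif_1 does here.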

group_freq_file

<title>_group_1   0.5
<title>_group_2   0.5

motif_freq_file

                  <title>_group_1   <title>_group_2
<title>_motif_1   0.25              0.5
<title>_motif_2   0.25              0.0
<title>_motif_3   0.25              0.0
<title>_motif_4   0.25              0.0
<title>_motif_5   0.0               0.5

group_group_file

                  <title>_group_1   <title>_group_2
<title>_group_1   0.75              0.25
<title>_group_2   0.25              0.75

Outputs

dagsim_table.csv A table produced by DagSim framework, including all values of all nodes for each simulation round. Note: In case of inserting instances without replacing the existing bases, the <title>_inserted_instances.bed file contains the correct start and end coordinates. This csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.

<title>_dagsim_table.png The DAG of the simulation showing all nodes and their dependencies.

<title>_final_sequences.fa A fasta file containing all the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter). The sequence names have the form index_title_backgroundID.

<title>_probabilistic_final_sequences.npz An npz file containing the probabilistic version of the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter). The format is a dictionary with 2D numpy arrays as values. You can load this file using the numpy.load() command and fetch any sequence by its key. The keys of the sequences have the form index_title_backgroundID.
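
A minimal sketch of loading such a file with numpy.load() follows. To keep the example self-contained it first writes a small stand-in file; the key 0_sim_seq_1 and the (positions x alphabet) array shape are assumptions for illustration, not a guarantee of the real output layout.

```python
import os
import tempfile

import numpy as np

# Write a stand-in npz file (hypothetical key and shape).
demo = {"0_sim_seq_1": np.full((6, 4), 0.25)}
path = os.path.join(tempfile.mkdtemp(), "sim_probabilistic_final_sequences.npz")
np.savez(path, **demo)

# Load the archive and fetch one sequence by its key.
with np.load(path) as data:
    keys = list(data.files)
    first = data[keys[0]]

print(keys, first.shape)
```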

<title>_inserted_instances.bed A bed file containing the locations of the motifs in each of the motif-in-sequences. The first column is the name of the sequence as in the fasta file: index_title_backgroundID. The second and third columns are the start and end coordinates of the inserted motif instance. The fourth column is the name of the inserted instance: title_motifID_instance. The score column is ‘.’. The strand column is + or - depending on the orientation of the motif instance. Note: In case of inserting instances without replacing the existing bases, the bed file contains the correct start and end coordinates. The dagsim_table.csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.
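
The six columns described above can be read with the standard csv module, as in this sketch. The two records are invented for illustration (including the exact form of the instance names), and the column labels are names chosen for this example.

```python
import csv
import io

# Two made-up records in the column order described above.
bed_text = (
    "0_sim_seq_4\t23\t28\tsim_motif_2_GCCTA\t.\t+\n"
    "1_sim_seq_3\t23\t28\tsim_motif_0_TTAGG\t.\t-\n"
)
COLUMNS = ["sequence", "start", "end", "name", "score", "strand"]

def read_bed(handle):
    """Parse tab-separated bed lines into dictionaries with integer coordinates."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue
        row = dict(zip(COLUMNS, record))
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows

records = read_bed(io.StringIO(bed_text))
minus_strand = [r for r in records if r["strand"] == "-"]
print(len(records), len(minus_strand))
```

For a real file, pass an open file handle to read_bed instead of the io.StringIO object.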

<title>_simulated_backgrounds.fa A fasta file containing all the simulated sequences. Created only if the backgrounds are simulated.

<title>_simulated_motifs.meme A meme file containing all the simulated motifs. Created only if the motifs are simulated.

<title>_motif_freq_per_group.tsv A table exported from pandas showing the probability of selection of each motif from each group. Created only if the frequencies are simulated.

                  <title>_group_1   <title>_group_2
<title>_motif_1   0.25              0.5
<title>_motif_2   0.25              0.0
<title>_motif_3   0.25              0.0
<title>_motif_4   0.25              0.0
<title>_motif_5   0.0               0.5

<title>_motif_group_membership.tsv A tsv file where the first column is the group IDs, and the second column lists the motif IDs that are assigned to the corresponding group. Created only if the groups are simulated.

<title>_group_0      <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1      <title>_motif_1,<title>_motif_5

<title>_group_frequency.tsv The occurrence frequency or selection probability of each group. During simulation, sampling from these values defines the dominant group of the sequence. Created only if the groups are simulated.

<title>_group_1      0.5
<title>_group_2      0.5

<title>_group_group_transition_probabilities.tsv The transition probability of each group pair. If more than one group is selected during simulation, first a group is chosen, then the rest are selected based on the previous group and the respective transition values. The --concentration_factor defines how likely it is that the same group will be selected multiple consecutive times (i.e. staying in the same state).

                  <title>_group_1   <title>_group_2
<title>_group_1   0.75              0.25
<title>_group_2   0.25              0.75
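
The selection procedure described above (pick a first group, then pick each following group from the row of the previous one) can be sketched as a simple Markov chain. The code uses the example values from the tables; the function name sample_groups and the sim_ prefix are hypothetical, and this is not the tool's own implementation.

```python
import numpy as np

groups = ["sim_group_1", "sim_group_2"]
group_freq = np.array([0.5, 0.5])        # selection probability of the first group
transition = np.array([[0.75, 0.25],     # row: previous group
                       [0.25, 0.75]])    # column: next group

def sample_groups(n, rng):
    """Pick the first group by frequency, then follow the transition rows."""
    current = rng.choice(len(groups), p=group_freq)
    chain = [current]
    for _ in range(n - 1):
        current = rng.choice(len(groups), p=transition[current])
        chain.append(current)
    return [groups[i] for i in chain]

rng = np.random.default_rng(3)
chain = sample_groups(5, rng)
print(chain)
```

The diagonal values (0.75 here) are the probabilities of staying with the same group.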

<title>_occurrence_summaries.json A json file with counts for the values of all nodes. It lists the number of times each background, group, motif, and orientation was selected, and the number of occurrences of specific motif instances (note that the random background might include more such instances that are not controlled by actual insertion). It also lists how many times a specific number of instances per sequence was selected (relevant when a Poisson distribution is used), the counts of selected start positions, the occurrences of different motif lengths (relevant when motifs of different lengths are simulated), and the number of backgrounds without a motif inserted in them (applicable when pc_no_motif > 0).

{
   "selected_groups": {
      "<title>_group_0": 15
   },
   "orientations": {
      "1": 7,
      "0": 8
   },
   "selected_motifs": {
      "<title>_motif_2": 4,
      "<title>_motif_1": 2,
      "<title>_motif_0": 6,
      "<title>_motif_3": 2,
      "<title>_motif_4": 1
   },
   "instances": {
      "GCCTA": 1,
      "CCAGC": 1,
      "GCAGA": 1,
      "TCCTG": 1,
      "CTGTA": 1,
      "TTAGG": 1,
      "TAAAC": 1,
      "GTTCA": 1,
      "TGTGT": 1,
      "AGATA": 1,
      "TTACG": 1,
      "CCAAG": 1,
      "AGAGT": 1,
      "AAAAC": 1,
      "AGTCT": 1
   },
   "backgrounds": {
      "<title>_seq_4": 4,
      "<title>_seq_3": 4,
      "<title>_seq_2": 3,
      "<title>_seq_0": 2,
      "<title>_seq_5": 1,
      "<title>_seq_1": 1
   },
   "num_instances": {
      "1": 15
   },
   "position_starts": {
      "23": 15
   },
   "motif_lengths": {
      "5": 15
   },
   "no_motif_backgrounds": {
      "<title>_seq_1": 1,
      "<title>_seq_0": 2,
      "<title>_seq_4": 2
   }
}
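
A summary in this layout can be inspected with the standard json module. The sketch below uses a shortened copy of the example above, with sim standing in for the title; for a real run you would read the file from disk instead.

```python
import json

# Shortened copy of the example summary, with "sim" standing in for <title>.
summary_text = """
{
   "orientations": {"1": 7, "0": 8},
   "selected_motifs": {
      "sim_motif_2": 4,
      "sim_motif_1": 2,
      "sim_motif_0": 6,
      "sim_motif_3": 2,
      "sim_motif_4": 1
   },
   "num_instances": {"1": 15}
}
"""
summary = json.loads(summary_text)

motif_counts = summary["selected_motifs"]
total_insertions = sum(motif_counts.values())
top_motif = max(motif_counts, key=motif_counts.get)

# Share of insertions per motif.
motif_share = {m: c / total_insertions for m, c in motif_counts.items()}
print(total_insertions, top_motif, round(motif_share[top_motif], 2))
```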

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin motif-in-seq --title simulation_0 --dirichlet_alpha 1,2,3,4 --num_motifs 10 --m_length_min 3 --pc_no_motif 10

Simulation of random sequences

python -m inmotifin random-sequences --help

Arguments

--workdir

Folder of the simulation outputs. Defaults to the current working directory.

--title

The title of the simulation. A new folder workdir\title is created and all
simulated sequences will get the title prefix. Default: sim

--seed

Random seed for the different steps in the simulation. All results are reproducible
when set. Default: None

--config

Config file for the simulation with the parameters for creating random sequences.
Default: None

--num_backgrounds

Number of background sequences to simulate. Default: 100

--b_length_min

Length of the simulated background sequences, unless --b_length_max
is specified. If --b_length_max is also specified, this
is the lower boundary of length. Default: 50

--b_length_max

Maximum allowed length of the background sequences. The actual length
is sampled from a uniform distribution. Default: None

--b_alphabet

String of letters in the background alphabet. Default: ACGT

--b_alphabet_prior

Comma separated probability of the letters in the background alphabet.
Default: 0.25,0.25,0.25,0.25

--background_files

Path(s) to the background file(s) in fasta format. Default: None

--background_type

Parameter defining how the background sequences are used. Supported types are:
fasta_iid, random_nucl_shuffled_only, random_nucl_shuffled_addon,
iid, markov_fit, markov_sim.
fasta_iid: Fasta files are used as is. Default when background_files
is not None.
random_nucl_shuffled_only: Fasta files are used, nucleotides in
sequences are shuffled and only the shuffled sequences are used / returned.
random_nucl_shuffled_addon: Fasta files are used, nucleotides in
sequences are shuffled and both the original and the shuffled sequences
are used / returned.
iid: Fasta files are ignored if provided, b_alphabet_prior specifies
the nucleotide probabilities. Default when background_files is None.
markov_fit: Fasta files are used to fit a hidden Markov model for
posterior probabilities. Order specified with markov_order.
markov_sim: Fasta files are used to fit and sample from a hidden Markov
model, so this is a type of simulation. Order specified with markov_order.

--markov_order

Order of the Markov model to learn from sequences when background_type
is set to markov_fit or markov_sim. Defaults to 0, corresponding to
learning independent nucleotide frequencies.

--markov_n_iter

Number of iterations used when learning the Markov model from sequences.
Defaults to 100.

--markov_algorithm

Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’.
See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.

--num_shuffle

Number of shuffles of the backgrounds. Used when background_type is
set to random_nucl_shuffled_only or random_nucl_shuffled_addon.
Default: None

For further reference on hidden Markov modeling, see the hmmlearn documentation.

Outputs

<title>_simulated_backgrounds.fa A fasta file containing all the simulated sequences.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin random-sequences --title nmk_sim --b_alphabet NMK --b_alphabet_prior 0.3,0.4,0.3

Simulation of motifs

python -m inmotifin motifs --help

Arguments

--workdir

Folder of the simulation outputs. Defaults to the current working directory.

--title

The title of the simulation. A new folder workdir\title is created and all
simulated motifs will get the title prefix. Default: sim

--seed

Random seed for the different steps in the simulation. All results are reproducible
when set. Default: None

--config

Config file for the simulation with the parameters for creating motifs.
Default: None

--dirichlet_alpha

Alpha values for the Dirichlet distribution from which motifs are sampled.
Default: 0.5,0.5,0.5,0.5 (Jeffreys prior)

--num_motifs

Number of motifs to create. Default: 10

--m_length_min

Length of the simulated motifs, unless --m_length_max is
specified. If --m_length_max is also specified, this is the
lower boundary of length. Default: 5

--m_length_max

Maximum allowed length of the motif. The actual length is sampled from a uniform
distribution. Default: None

--m_alphabet

String of letters in the motif alphabet. Default: ACGT

Outputs

<title>_simulated_motifs.meme A meme file containing all the simulated motifs.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin motifs --title motif_sim --num_motifs 10 --m_length_min 3 --m_length_max 6

Multimerisation of motifs

python -m inmotifin multimers --help

Arguments

--workdir

Folder of the simulation outputs. Defaults to the current working directory.

--title

The title of the multimerisation. A new folder workdir\title is
created and all simulated multimers will get the title prefix.
Default: sim

--seed

Random seed for the insertion of non-informative bases between
two motifs when the distance is positive. All results are
reproducible when set. Default: None

--config

Config file for the simulation with the parameters for creating
multimers. Default: None

--motif_files

Path to the motif files. Meme, jaspar, and csv with jaspar motif
IDs formats are accepted. Default: None

--jaspar_db_version

Release name of JASPAR database version to use when fetching JASPAR
motif IDs. For further information see pyJASPAR’s documentation.
Example value: ‘JASPAR2024’. Default: None

--multimerisation_rules

Path to the multimerisation rule tsv file. It should have two
(optionally three) tab separated columns: a list of motif IDs separated
by comma, a list of distances between them separated by comma, and
optionally weights for the averaging of motifs (one per motif) separated
by comma. When weights (or a third column in general) are not present, each
motif is assigned a weight of 1. Default: None

File format examples

multimerisation_rules

motif_1,motif_2,motif_3      -2,-1   1,5,1
motif_2,motif_3      3
motif_1,motif_2      2

In the first multimer, the middle motif will have a higher weight at the overlaps. When the distances are positive, weighting does not play a role.
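
One way to read a rule line in this layout is sketched below. It assumes integer distances, one distance fewer than motifs (as in the examples above), and a default weight of 1 per motif when the third column is missing; the function name parse_rule is hypothetical and this is not the tool's own parser.

```python
def parse_rule(line):
    """Split one tab separated rule into motif IDs, distances, and weights."""
    fields = line.rstrip("\n").split("\t")
    motifs = fields[0].split(",")
    distances = [int(d) for d in fields[1].split(",")]
    if len(fields) > 2 and fields[2].strip():
        weights = [float(w) for w in fields[2].split(",")]
    else:
        weights = [1.0] * len(motifs)  # documented default weight
    # Consistency checks (assumed, based on the examples above).
    if len(distances) != len(motifs) - 1:
        raise ValueError("expected one distance between each pair of neighbours")
    if len(weights) != len(motifs):
        raise ValueError("expected one weight per motif")
    return motifs, distances, weights

print(parse_rule("motif_1,motif_2,motif_3\t-2,-1\t1,5,1"))
print(parse_rule("motif_2,motif_3\t3"))
```

In the rules shown above, a negative distance goes together with overlapping motifs, which is where the weights matter.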

Outputs

<title>_multimer_motifs.meme A meme file containing all the simulated multimers.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin multimers --title multimers_sim --motif_files multimer_sim/simulated_motifs.meme --multimerisation_rules multimer_sim/multimer_rules.tsv