Command line options

There are four commands, each with its own command line options.

$ python -m inmotifin
Usage: python -m inmotifin [OPTIONS] COMMAND [ARGS]...

You are running inMOTIFin

Options:
--help  Show this message and exit.

Commands:
multimers          Multimerizing motifs given motifs and distances
motifs             Creating motifs given information content, length,...
motif-in-seq       Simulating sequences with inserted motif instances
random-sequences   Creating random sequences given length and alphabet

Simulation of sequences with inserted motif instances

python -m inmotifin motif-in-seq --help

Arguments

`--workdir`	Folder of the simulation outputs. Defaults to current work directory. NOTE: it should be a relative path. Absolute paths are not supported.
`--title`	The title of the simulation. A new folder `workdir\title` is created and all simulated outputs will get the `title` prefix. Default: `sim`
`--seed`	Random seed for the different steps in the simulation. All results are reproducible when set. Default: `None`
`--config`	Config file for the simulation with the parameters for creating sequences with motif instances in them. Default: `None`
`--dirichlet_alpha`	Alpha values for the Dirichlet distribution from which motifs are sampled. Default: `0.5,0.5,0.5,0.5` (Jeffreys prior)
`--num_motifs`	Number of motifs to create. Default: `10`
`--m_length_min`	Length of the simulated motifs, unless `--m_length_max` is specified. If `--m_length_max` is also specified, this is the lower boundary of length. Default: `5`
`--m_length_max`	Maximum allowed length of the motif. The actual length is sampled from a uniform distribution. Default: `None`
`--m_alphabet`	String of letters in the motif alphabet. Default: `ACGT`
`--motif_files`	List of path(s) to the motif file(s). Supported formats are jaspar, meme, csv with JASPAR motif ids. Default: `None`
`--jaspar_db_version`	Release name of JASPAR database version to use when fetching JASPAR motif IDs. For further information see pyJASPAR’s documentation. Example value: ‘JASPAR2024’. Default: `None`
`--m_alphabet_pairs`	Dictionary of letter pairs for reverse complementing the motif instance. Default: `A:T,C:G,G:C,T:A`
`--num_backgrounds`	Number of background sequences to simulate. Default: `100`
`--b_length_min`	Length of the simulated background sequences, unless `--b_length_max` is specified. If `--b_length_max` is also specified, this is the lower boundary of length. Default: `50`
`--b_length_max`	Maximum allowed length of the background sequences. The actual length is sampled from a uniform distribution. Default: `None`
`--b_alphabet`	String of letters in the background alphabet. Default: `ACGT`
`--b_alphabet_prior`	Comma separated probability of the letters in the background alphabet. Default: `0.25,0.25,0.25,0.25`
`--background_files`	Path(s) to the background file(s) in fasta format. Default: `None`
`--background_type`	Parameter defining how the background sequences are used. Supported types are: `fasta_iid`, `random_nucl_shuffled_only`, `random_nucl_shuffled_addon`, `iid`, `markov_fit`, `markov_sim`. `fasta_iid`: Fasta files are used as is - default when `background_files` is not None. `random_nucl_shuffled_only`: Fasta files are used, nucleotides in sequences are shuffled and only shuffled ones are used / returned. `random_nucl_shuffled_addon`: Fasta files are used, nucleotides in sequences are shuffled and both the original and the shuffled sequences are used / returned. `iid`: Fasta files are ignored if provided, `b_alphabet_prior` specifies nucleotide probabilities - default when `background_files` is None. `markov_fit`: Fasta files are used to fit a hidden Markov model for posterior probabilities. Order specified with `markov_order`. `markov_sim`: Fasta files are used to fit and sample from a hidden Markov model, so this is a type of simulation. Order specified with `markov_order`.
`--markov_order`	Order of the Markov model to learn from sequences when `background_type` is set to `markov_fit` or `markov_sim`. Defaults to 0, corresponding to learning independent nucleotide frequencies.
`--markov_n_iter`	Number of iterations of the Markov model to learn from sequences. Defaults to 100.
`--markov_algorithm`	Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’. See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.
`--num_shuffle`	Number of shuffles of the backgrounds. Used when `background_type` is set to `random_nucl_shuffled_only` or `random_nucl_shuffled_addon`. Default: `None`
`--num_groups`	Number of groups into which motifs are assigned. If = 1 the rest of the options are ignored and all motifs are assigned to a single group. Default: `1`
`--max_group_size`	Maximum size of each group. It cannot be smaller than the number of motifs. Each group size is sampled from a binomial distribution with number of trials = max_group_size and success probability = group_size_p. Default: `inf`
`--group_size_p`	This parameter controls the expected size of each group. Each group size is sampled from a binomial distribution with number of trials = max_group_size and success probability = group_size_p. Default: `1`
`--group_motif_assignment_file`	Path to the motif-to-group assignment file in two column tsv format. The first column is the group IDs, and the second column lists the motif IDs that are assigned to the corresponding group. Default: `None`
`--group_freq_type`	The method of selecting group background frequencies. Values: `uniform`, `random`. `uniform` means each group has an equal chance to be selected. `random` means each group is assigned a probability of being selected. The difference between a frequent and a rare group is controlled by the `--group_freq_range` parameter. Default: `uniform`
`--group_freq_range`	The range of the potential differences between a frequent and a rare group. Default: `None`
`--motif_freq_type`	The method of selecting motif background frequencies. Values: `uniform`, `random`. `uniform` means each motif has an equal chance to be selected. `random` means each motif is assigned a probability of being selected. The difference between a frequent and a rare motif is controlled by the `--motif_freq_range` parameter. Default: `uniform`
`--motif_freq_range`	The range of the potential differences between a frequent and a rare motif. Default: `None`
`--concentration_factor`	The preference of each group to be selected again when selecting more than one group for insertion. Value between 0 and 1. Default: `1`
`--group_group_type`	The method of selecting group-group transition probabilities. Values: `uniform`, `random`. `uniform` means any two groups are equally likely to co-occur. `random` means group pairs are assigned a random probability of transition. Default: `uniform`
`--group_freq_file`	Tsv file including the background frequencies for the groups to be selected. Default: `None`
`--motif_freq_file`	Tsv file including the background frequencies for the selection of motifs to be inserted. Default: `None`
`--group_group_file`	Tsv file including the group-group transition probability matrix. Default: `None`
`--position_type`	Type of position simulation. Supported values: `central`, `left_central`, `right_central`, `uniform`, `gaussian`. `central` means the first motif is inserted into the center of the background (replacing existing bases). `left_central` means aligning the first base to the center (replacing existing bases). `right_central` means aligning the last base to the position one before the center (replacing existing bases). `uniform` means all positions have an equal chance. With the `to_replace` option, the user may choose between replacing existing bases or inserting between them. `gaussian` means following the probabilities of a Gaussian distribution centered on the given position of the background as per the `position_means` and `position_variances` parameters. Gaussian insertions are left aligned and the insertion is without replacing existing bases. Default: `central`
`--position_means`	Comma separated mean values for the gaussian positioning option. Default: `None`
`--position_variances`	Comma separated variance values for the gaussian positioning option. Default: `None`
`--num_motif_in_seq`	Number of sequences with motifs in them to generate. Default: `100`
`--pc_no_motif`	Percentage of sequences without motifs. Number between 0 and 100 is expected. Default: `0`
`--to_replace`	Whether to replace background bases with the motif instance when sampling positions uniformly. The alternative is to insert between existing bases. Default: `True`
`--orientation_prob`	Probability of reversing a motif instance in the motived sequence. Default: `0.5`
`--num_groups_per_seq`	Number of groups to sample per sequence. Default: `1`
`--motif_sampling_replacement`	Whether to select motifs from groups with replacement. Note, if more motifs are requested than available in a group, replacement will be used regardless of this parameter. Default: `True`, set to False using the flag `--no-motif_sampling_replacement`
`--n_instances_per_sequence`	Number of motif instances to be inserted per sequence. Takes precedence over `--n_instances_per_sequence_l`. Default: `1`
`--n_instances_per_sequence_l`	Lambda parameter of Poisson distribution for selecting the number of motif instances to be inserted per sequence. Default: `None`
`--to_draw`	Whether to draw the directed acyclic graph of the simulation steps. Default: `False`

For further reference on hidden Markov modeling, visit the hmmlearn documentation.
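The effect of `--m_alphabet_pairs` and `--to_replace` described above can be illustrated with plain string operations. The following is a minimal sketch, not inMOTIFin's implementation: the function names are hypothetical, and the start position is assumed to be a 0-based index into the background.

```python
def parse_alphabet_pairs(spec):
    """Turn a string such as 'A:T,C:G,G:C,T:A' into a lookup dictionary."""
    pairs = {}
    for item in spec.split(","):
        letter, partner = item.strip().split(":")
        pairs[letter] = partner
    return pairs


def reverse_complement(instance, pairs):
    """Complement every letter while walking the instance backwards."""
    return "".join(pairs[letter] for letter in reversed(instance))


def place_instance(background, instance, start, to_replace=True):
    """Place an instance at a 0-based start (assumed convention).

    to_replace=True overwrites existing bases, so the length is unchanged.
    to_replace=False inserts between bases, so the sequence grows.
    """
    if to_replace:
        end = start + len(instance)
        return background[:start] + instance + background[end:]
    return background[:start] + instance + background[start:]


pairs = parse_alphabet_pairs("A:T,C:G,G:C,T:A")
instance = reverse_complement("GATTC", pairs)
replaced = place_instance("AAAAAAAAAA", instance, 3, to_replace=True)
inserted = place_instance("AAAAAAAAAA", instance, 3, to_replace=False)
print(instance, replaced, inserted)
```

When bases are replaced the output keeps the background length; when the instance is inserted the output is longer by the instance length, which is why the coordinates in the bed output and in the DagSim table can differ.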

File format examples

`motif_files` in different formats

MEME: visit the official MEME format description page about the minimal meme format expected here.

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF test1 test1
letter-probability matrix: alength= 4 w= 3 nsites= 99494 E= 0
0.13  0.10  0.34  0.43
0.75  0.06  0.19  0.00
0.00  0.00  0.00  1.00

JASPAR format: visit the official JASPAR format description page.

>MA0001.1 MA0001.1.AGL3
A  [     0      3     79     40     66     48     65     11     65      0 ]
C  [    94     75      4      3      1      2      5      2      3      3 ]
G  [     1      0      3      4      1      0      5      3     28     88 ]
T  [     2     19     11     50     29     47     22     81      1      6 ]
>MA0003.1 MA0003.1.TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0002.2 MA0002.2.Runx1
A  [   287    234    123     57      0     87      0     17     10    131    500 ]
C  [   496    485   1072      0     75    127      0     42    400    463    158 ]
G  [   696    467    149      7   1872     70   1987   1848    251     81    289 ]
T  [   521    814    656   1936     53   1716     13     93   1339   1325   1053 ]
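To make the layout of the JASPAR example concrete, here is a small sketch that reads such text into count rows and normalises each position into probabilities. It is only an illustration under the assumption that every row has the form `LETTER [ counts ]`; the function names are hypothetical and this is not the parser used by inMOTIFin.

```python
import re


def parse_jaspar(text):
    """Parse JASPAR-formatted text into {motif_id: {letter: [counts]}}."""
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first token after '>' is the matrix ID.
            current = line[1:].split()[0]
            motifs[current] = {}
        else:
            letter = line[0]
            counts = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", line[1:])]
            motifs[current][letter] = counts
    return motifs


def to_probabilities(counts):
    """Divide each count by its column total so every position sums to 1."""
    letters = sorted(counts)
    width = len(counts[letters[0]])
    totals = [sum(counts[l][i] for l in letters) for i in range(width)]
    return {l: [counts[l][i] / totals[i] for i in range(width)] for l in letters}


example = """>MA0003.1 MA0003.1.TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
"""
motifs = parse_jaspar(example)
probs = to_probabilities(motifs["MA0003.1"])
print(probs["G"][0], probs["C"][1])
```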

CSV format: comma separated list of JASPAR database matrix IDs (base ID + version) is expected.

MA0001.1 , MA0003.1 , MA0002.2

`group_motif_assignment_file`

<title>_group_0   <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1   <title>_motif_1,<title>_motif_5
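As an illustration of the two-column layout, the sketch below reads such a file into a dictionary. It assumes the columns are separated by a single tab and uses `sim` as a stand-in for `<title>`; the function name is hypothetical.

```python
def parse_group_assignment(text):
    """Map each group ID to the list of motif IDs assigned to it."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        group_id, motif_list = line.split("\t")
        groups[group_id] = [m.strip() for m in motif_list.split(",")]
    return groups


example = (
    "sim_group_0\tsim_motif_1,sim_motif_2,sim_motif_3,sim_motif_4\n"
    "sim_group_1\tsim_motif_1,sim_motif_5\n"
)
groups = parse_group_assignment(example)
for group_id, members in groups.items():
    print(group_id, len(members))
```

Note that a motif may belong to more than one group, as `sim_motif_1` does here.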

`group_freq_file`

<title>_group_1   0.5
<title>_group_2   0.5

`motif_freq_file`

      <title>_group_1     <title>_group_2
<title>_motif_1   0.25    0.5
<title>_motif_2   0.25    0.0
<title>_motif_3   0.25    0.0
<title>_motif_4   0.25    0.0
<title>_motif_5   0.0     0.5
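A quick way to sanity-check a hand-written `motif_freq_file` is to read it and sum each group column. The sketch below assumes a tab-separated table whose header row starts with an empty cell, and it assumes (based on the example above) that the frequencies within each group should sum to 1. The function name is hypothetical and `sim` stands in for `<title>`.

```python
import csv
import io
import math


def read_motif_freq(text):
    """Return {group: {motif: frequency}} from a tab-separated table."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    group_names = rows[0][1:]  # the first header cell is empty
    freqs = {g: {} for g in group_names}
    for row in rows[1:]:
        motif = row[0]
        for group, value in zip(group_names, row[1:]):
            freqs[group][motif] = float(value)
    return freqs


example = "\n".join([
    "\tsim_group_1\tsim_group_2",
    "sim_motif_1\t0.25\t0.5",
    "sim_motif_2\t0.25\t0.0",
    "sim_motif_3\t0.25\t0.0",
    "sim_motif_4\t0.25\t0.0",
    "sim_motif_5\t0.0\t0.5",
])
freqs = read_motif_freq(example)
# Assumed requirement: per-group frequencies form a probability distribution.
column_ok = {g: math.isclose(sum(v.values()), 1.0) for g, v in freqs.items()}
print(column_ok)
```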

`group_group_file`

   <title>_group_1        <title>_group_2
<title>_group_1   0.75    0.25
<title>_group_2   0.25    0.75
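To show how a group frequency table and a group-group transition matrix can work together, here is a minimal sketch of sampling several groups for one sequence: the first group is drawn from the group frequencies and each following group is drawn from the row of the previously selected group. This is an illustration of the idea only, with hypothetical names, and it does not reproduce inMOTIFin's internals.

```python
import random


def sample_groups(group_freq, transitions, n_groups, rng):
    """Draw a chain of groups: the first from group_freq, the rest from transitions."""
    names = list(group_freq)
    current = rng.choices(names, weights=[group_freq[n] for n in names])[0]
    chosen = [current]
    for _ in range(n_groups - 1):
        row = transitions[current]
        current = rng.choices(names, weights=[row[n] for n in names])[0]
        chosen.append(current)
    return chosen


# Values taken from the file format examples, with 'sim' standing in for <title>.
group_freq = {"sim_group_1": 0.5, "sim_group_2": 0.5}
transitions = {
    "sim_group_1": {"sim_group_1": 0.75, "sim_group_2": 0.25},
    "sim_group_2": {"sim_group_1": 0.25, "sim_group_2": 0.75},
}
rng = random.Random(0)  # fixed seed so repeated runs give the same chain
chain = sample_groups(group_freq, transitions, 3, rng)
print(chain)
```

With large diagonal values, as in this matrix, the chain tends to stay in the group it started in.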

Outputs

dagsim_table.csv A table produced by DagSim framework, including all values of all nodes for each simulation round. Note: In case of inserting instances without replacing the existing bases, the <title>_inserted_instances.bed file contains the correct start and end coordinates. This csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.

<title>_dagsim_table.png The DAG of the simulation showing all nodes and their dependencies.

<title>_final_sequences.fa A fasta file containing all the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter). Sequence names follow the pattern index_title_backgroundID.

<title>_probabilistic_final_sequences.npz An npz file containing the probabilistic version of the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter). The format is a dictionary with 2D numpy arrays as values. You can load this file using the numpy.load() command and fetch any sequence by its key. The keys follow the pattern index_title_backgroundID.
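Loading the npz file might look like the sketch below. Because no real output is available here, the example first writes a small stand-in file; the key `0_sim_seq_0` and the 4 x 4 array are made up for illustration, and the orientation of the real arrays (positions by letters or the reverse) is not assumed.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sim_probabilistic_final_sequences.npz")

    # Stand-in for a file written by the simulation (hypothetical content).
    np.savez(path, **{"0_sim_seq_0": np.full((4, 4), 0.25)})

    # numpy.load returns a dictionary-like object for npz archives.
    with np.load(path) as data:
        keys = list(data.files)
        first = data[keys[0]]
        shape = first.shape

print(keys, shape)
```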

<title>_inserted_instances.bed A bed file containing the locations of the motifs in each of the motif-in-sequences. The first column is the name of the sequence as in the fasta file: index_title_backgroundID. The second and third columns are the start and end coordinates of the inserted motif instance. The fourth column is the name of the inserted instance: title_motifID_instance. The score column is ‘.’. The strand column is + or - depending on the orientation of the motif instance. Note: In case of inserting instances without replacing the existing bases, the bed file contains the correct start and end coordinates. The dagsim_table.csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.
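The six columns described above can be read with a few lines of Python. This is a sketch with hypothetical names and made-up example rows; it assumes tab-separated columns and the usual BED convention that `end - start` gives the instance length.

```python
from collections import namedtuple

BedRecord = namedtuple(
    "BedRecord", "seq_name start end instance_name score strand"
)


def parse_bed(text):
    """Read the six documented columns of each non-empty line."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        seq_name, start, end, name, score, strand = line.split("\t")[:6]
        records.append(
            BedRecord(seq_name, int(start), int(end), name, score, strand)
        )
    return records


# Made-up rows following the documented naming patterns.
example = (
    "0_sim_seq_4\t23\t28\tsim_motif_2_instance\t.\t+\n"
    "1_sim_seq_3\t23\t28\tsim_motif_0_instance\t.\t-\n"
)
records = parse_bed(example)
for rec in records:
    print(rec.seq_name, rec.end - rec.start, rec.strand)
```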

<title>_simulated_backgrounds.fa A fasta file containing all the simulated sequences. Created only if the backgrounds are simulated.

<title>_simulated_motifs.meme A meme file containing all the simulated motifs. Created only if the motifs are simulated.

<title>_motif_freq_per_group.tsv A table exported from pandas showing the probability of selection of each motif from each group. Created only if the frequencies are simulated.

      <title>_group_1        <title>_group_2
<title>_motif_1      0.25    0.5
<title>_motif_2      0.25    0.0
<title>_motif_3      0.25    0.0
<title>_motif_4      0.25    0.0
<title>_motif_5      0.0     0.5

<title>_motif_group_membership.tsv A tsv file where the first column is the group IDs, and the second column lists the motif IDs that are assigned to the corresponding group. Created only if the groups are simulated.

<title>_group_0      <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1      <title>_motif_1,<title>_motif_5

<title>_group_frequency.tsv The occurrence frequency or selection probability of each group. During simulation, sampling from these values defines the dominant group of the sequence. Created only if the groups are simulated.

<title>_group_1      0.5
<title>_group_2      0.5

<title>_group_group_transition_probabilities.tsv The transition probability of each group pair. If more than one group is selected during simulation, first a group is chosen, then the rest are selected based on the previous group and the respective transition values. The --concentration_factor defines how likely it is that the same group will be selected multiple consecutive times (i.e. staying in the same state).

     <title>_group_1 <title>_group_2
<title>_group_1      0.75    0.25
<title>_group_2      0.25    0.75

<title>_occurrence_summaries.json A json file with counts for the values of all nodes. Number of times each background, group, motif, and orientation was selected. The number of occurrences of specific motif instances (note that the random background might include more such instances not controlled by actual insertion). The number of how many times a specific number of instances per sequences was selected (relevant when Poisson distribution is used). Number of selected start positions, the occurrences of different motif lengths (relevant when motifs of different lengths are simulated), and the number of backgrounds without motif inserted in them (applicable when pc_no_motif > 0).

{
   "selected_groups": {
      "<title>_group_0": 15
   },
   "orientations": {
      "1": 7,
      "0": 8
   },
   "selected_motifs": {
      "<title>_motif_2": 4,
      "<title>_motif_1": 2,
      "<title>_motif_0": 6,
      "<title>_motif_3": 2,
      "<title>_motif_4": 1
   },
   "instances": {
      "GCCTA": 1,
      "CCAGC": 1,
      "GCAGA": 1,
      "TCCTG": 1,
      "CTGTA": 1,
      "TTAGG": 1,
      "TAAAC": 1,
      "GTTCA": 1,
      "TGTGT": 1,
      "AGATA": 1,
      "TTACG": 1,
      "CCAAG": 1,
      "AGAGT": 1,
      "AAAAC": 1,
      "AGTCT": 1
   },
   "backgrounds": {
      "<title>_seq_4": 4,
      "<title>_seq_3": 4,
      "<title>_seq_2": 3,
      "<title>_seq_0": 2,
      "<title>_seq_5": 1,
      "<title>_seq_1": 1
   },
   "num_instances": {
      "1": 15
   },
   "position_starts": {
      "23": 15
   },
   "motif_lengths": {
      "5": 15
   },
   "no_motif_backgrounds": {
      "<title>_seq_1": 1,
      "<title>_seq_0": 2,
      "<title>_seq_4": 2
   }
}
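The summary file is plain JSON, so it can be inspected with the standard library. The sketch below uses a shortened copy of the example above (with `sim` in place of `<title>`) and counts the insertions per motif; the variable names are illustrative.

```python
import json
from collections import Counter

# Shortened copy of the example summary shown above.
summary_text = """
{
   "selected_groups": {"sim_group_0": 15},
   "orientations": {"1": 7, "0": 8},
   "selected_motifs": {
      "sim_motif_2": 4,
      "sim_motif_1": 2,
      "sim_motif_0": 6,
      "sim_motif_3": 2,
      "sim_motif_4": 1
   },
   "num_instances": {"1": 15}
}
"""
summary = json.loads(summary_text)

motif_counts = Counter(summary["selected_motifs"])
total_insertions = sum(motif_counts.values())
top_motif, top_count = motif_counts.most_common(1)[0]
print(total_insertions, top_motif, top_count)
```

For a real run, `json.load` on the opened `<title>_occurrence_summaries.json` file would replace the inline string.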

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin motif-in-seq --title simulation_0 --dirichlet_alpha 1,2,3,4 --num_motifs 10 --m_length_min 3 --pc_no_motif 10

Simulation of random sequences

python -m inmotifin random-sequences --help

Arguments

`--workdir`	Folder of the simulation outputs. Defaults to current work directory.
`--title`	The title of the simulation. A new folder `workdir\title` is created and all simulated sequences will get the `title` prefix. Default: `sim`
`--seed`	Random seed for the different steps in the simulation. All results are reproducible when set. Default: `None`
`--config`	Config file for the simulation with the parameters for creating random sequences. Default: `None`
`--num_backgrounds`	Number of background sequences to simulate. Default: `100`
`--b_length_min`	Length of the simulated background sequences, unless `--b_length_max` is specified. If `--b_length_max` is also specified, this is the lower boundary of length. Default: `50`
`--b_length_max`	Maximum allowed length of the background sequences. The actual length is sampled from a uniform distribution. Default: `None`
`--b_alphabet`	String of letters in the background alphabet. Default: `ACGT`
`--b_alphabet_prior`	Comma separated probability of the letters in the background alphabet. Default: `0.25,0.25,0.25,0.25`
`--background_files`	Path(s) to the background file(s) in fasta format. Default: `None`
`--background_type`	Parameter defining how the background sequences are used. Supported types are: `fasta_iid`, `random_nucl_shuffled_only`, `random_nucl_shuffled_addon`, `iid`, `markov_fit`, `markov_sim`. `fasta_iid`: Fasta files are used as is - default when `background_files` is not None. `random_nucl_shuffled_only`: Fasta files are used, nucleotides in sequences are shuffled and only shuffled ones are used / returned. `random_nucl_shuffled_addon`: Fasta files are used, nucleotides in sequences are shuffled and both the original and the shuffled sequences are used / returned. `iid`: Fasta files are ignored if provided, `b_alphabet_prior` specifies nucleotide probabilities - default when `background_files` is None. `markov_fit`: Fasta files are used to fit a hidden Markov model for posterior probabilities. Order specified with `markov_order`. `markov_sim`: Fasta files are used to fit and sample from a hidden Markov model, so this is a type of simulation. Order specified with `markov_order`.
`--markov_order`	Order of the Markov model to learn from sequences when `background_type` is set to `markov_fit` or `markov_sim`. Defaults to 0, corresponding to learning independent nucleotide frequencies.
`--markov_n_iter`	Number of iterations of the Markov model to learn from sequences. Defaults to 100.
`--markov_algorithm`	Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’. See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.
`--num_shuffle`	Number of shuffles of the backgrounds. Used when `background_type` is set to `random_nucl_shuffled_only` or `random_nucl_shuffled_addon`. Default: `None`

For further reference on hidden Markov modeling, visit the hmmlearn documentation.

Outputs

<title>_simulated_backgrounds.fa A fasta file containing all the simulated sequences.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin random-sequences --title nmk_sim --b_alphabet NMK --b_alphabet_prior 0.3,0.4,0.3
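What the `iid` background type amounts to can be sketched in a few lines: each letter is drawn independently from the alphabet according to the prior. The code below mirrors the command above (alphabet `NMK`, prior `0.3,0.4,0.3`) but it is an illustration with a hypothetical function name, not the package's own sampler, so it will not reproduce inMOTIFin's output for a given seed.

```python
import random


def simulate_iid(num_sequences, length, alphabet, prior, seed=None):
    """Draw every letter independently according to the prior."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(alphabet, weights=prior, k=length))
        for _ in range(num_sequences)
    ]


sequences = simulate_iid(5, 50, "NMK", [0.3, 0.4, 0.3], seed=42)
for seq in sequences:
    print(seq)
```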

Simulation of motifs

python -m inmotifin motifs --help

Arguments

`--workdir`	Folder of the analysis outputs. Defaults to current work directory.
`--title`	The title of the simulation. A new folder `workdir\title` is created and all simulated motifs will get the `title` prefix. Default: `sim`
`--seed`	Random seed for the different steps in the simulation. All results are reproducible when set. Default: `None`
`--config`	Config file for the simulation with the parameters for creating motifs. Default: `None`
`--dirichlet_alpha`	Alpha values for the Dirichlet distribution from which motifs are sampled. Default: `0.5,0.5,0.5,0.5` (Jeffreys prior)
`--num_motifs`	Number of motifs to create. Default: `10`
`--m_length_min`	Length of the simulated motifs, unless `--m_length_max` is specified. If `--m_length_max` is also specified, this is the lower boundary of length. Default: `5`
`--m_length_max`	Maximum allowed length of the motif. The actual length is sampled from a uniform distribution. Default: `None`
`--m_alphabet`	String of letters in the motif alphabet. Default: `ACGT`

Outputs

<title>_simulated_motifs.meme A meme file containing all the simulated motifs.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin motifs --title motif_sim --num_motifs 10 --m_length_min 3 --m_length_max 6
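The idea behind sampling motifs from a Dirichlet distribution can be sketched with the standard library: one probability vector over the alphabet is drawn per motif position, here by normalising independent Gamma draws. The function names are hypothetical, and treating the length range as inclusive on both ends is an assumption; this is not inMOTIFin's own code.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_motif(alpha, length_min, length_max, rng):
    """Return a list of per-position probability vectors (one motif)."""
    if length_max is None:
        length = length_min
    else:
        # Assumption: uniform over the inclusive range [length_min, length_max].
        length = rng.randint(length_min, length_max)
    return [sample_dirichlet(alpha, rng) for _ in range(length)]


rng = random.Random(7)
motif = simulate_motif([0.5, 0.5, 0.5, 0.5], 3, 6, rng)
for position in motif:
    print(["%.3f" % p for p in position])
```

Alpha values below 1, such as the default `0.5`, tend to give positions dominated by one letter, while larger values give flatter positions.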

Multimerisation of motifs

python -m inmotifin multimers --help

Arguments

`--workdir`	Folder of the simulation outputs. Defaults to current work directory.
`--title`	The title of the multimerisation. A new folder `workdir\title` is created and all simulated multimers will get the `title` prefix. Default: `sim`
`--seed`	Random seed for the insertion of non-informative bases between two motifs when the distance is positive. All results are reproducible when set. Default: `None`
`--config`	Config file for the simulation with the parameters for creating multimers. Default: `None`
`--motif_files`	Path to the motif files. Meme, jaspar, and csv with jaspar motif IDs formats are accepted. Default: `None`
`--jaspar_db_version`	Release name of JASPAR database version to use when fetching JASPAR motif IDs. For further information see pyJASPAR’s documentation. Example value: ‘JASPAR2024’. Default: `None`
`--multimerisation_rules`	Path to the multimerisation rule tsv file. It should have two (optionally three) tab separated columns: a list of motif IDs separated by commas, a list of distances between them separated by commas, and optionally weights for the averaging of motifs (one per motif) separated by commas. When weights (or the third column in general) are not present, each motif is assigned a weight of 1. Default: `None`

File format examples

`multimerisation_rules`

motif_1,motif_2,motif_3      -2,-1   1,5,1
motif_2,motif_3      3
motif_1,motif_2      2

In the first multimer, the middle motif will have a higher weight at the overlaps. When the distances are positive, weighting does not play a role.
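The rule layout can be checked before running the command with a small reader such as the sketch below. It assumes tab-separated columns and integer distances, and it assumes (as the example rows suggest) that a rule with n motifs carries n - 1 distances. The function name and dictionary keys are hypothetical.

```python
def parse_multimer_rules(text):
    """Read rules as dictionaries with motifs, distances and weights."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        columns = line.split("\t")
        motifs = columns[0].split(",")
        distances = [int(d) for d in columns[1].split(",")]
        if len(columns) > 2 and columns[2].strip():
            weights = [float(w) for w in columns[2].split(",")]
        else:
            # Documented default: every motif gets a weight of 1.
            weights = [1.0] * len(motifs)
        # Assumption inferred from the example: one distance per adjacent pair.
        if len(distances) != len(motifs) - 1:
            raise ValueError("expected one distance per adjacent motif pair")
        if len(weights) != len(motifs):
            raise ValueError("expected one weight per motif")
        rules.append({"motifs": motifs, "distances": distances, "weights": weights})
    return rules


example = (
    "motif_1,motif_2,motif_3\t-2,-1\t1,5,1\n"
    "motif_2,motif_3\t3\n"
    "motif_1,motif_2\t2\n"
)
rules = parse_multimer_rules(example)
for rule in rules:
    print(rule)
```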

Outputs

<title>_multimer_motifs.meme A meme file containing all the simulated multimers.

Usage examples

Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.

python -m inmotifin multimers --title multimers_sim --motif_files multimer_sim/simulated_motifs.meme --multimerisation_rules multimer_sim/multimer_rules.tsv
