Command line options
There are four commands available on the command line.
$ python -m inmotifin
Usage: python -m inmotifin [OPTIONS] COMMAND [ARGS]...
You are running inMOTIFin
Options:
--help Show this message and exit.
Commands:
multimers Multimerizing motifs given motifs and distances
motifs Creating motifs given information content, length,...
motif-in-seq Simulating sequences with inserted motif instances
random-sequences Creating random sequences given length and alphabet
Simulation of sequences with inserted motif instances
python -m inmotifin motif-in-seq --help
Arguments
|
Folder of the simulation outputs. Defaults to current work directory.
NOTE: it should be a relative path. Absolute paths are not supported.
|
|
The title of the simulation. A new folder
workdir\title is created and all simulated outputs will get the
title prefix. Default:
sim |
|
Random seed for the different steps in the simulation. All results
are reproducible when set. Default:
None |
|
Config file for the simulation with the parameters for creating
sequences with motif instances in them. Default:
None |
|
Alpha values for the Dirichlet distribution from which motifs are
sampled. Default:
0.5,0.5,0.5,0.5 (Jeffreys prior) |
|
Number of motifs to create. Default: |
|
Length of the simulated motifs, unless
--len_motifs_max is specified. If
--len_motifs_max is also specified, this is the lower boundary of length. Default:
5 |
|
Maximum allowed length of the motif. The actual length is sampled
from a uniform distribution. Default:
None |
|
String of letters in the motif alphabet. Default: |
|
List of path(s) to the motif file(s). Supported formats are jaspar,
meme, csv with JASPAR motif ids. Default:
None |
|
Release name of JASPAR database version to use when fetching JASPAR
motif IDs. For further information see pyJASPAR’s documentation.
Example value: ‘JASPAR2024’ Default:
None |
|
Dictionary of letter pairs for reverse complementing the motif instance.
Default:
A:T,C:G,G:C,T:A |
|
Number of background sequences to simulate. Default: |
|
Length of the simulated background sequences, unless
--b_length_max is specified. If
--b_length_max is also specified, this is the lower boundary of length. Default:
50 |
|
Maximum allowed length of the background sequences. The actual length
is sampled from a uniform distribution. Default:
None |
|
String of letters in the background alphabet. Default: |
|
Comma separated probability of the letters in the background alphabet.
Default:
0.25,0.25,0.25,0.25 |
|
Path(s) to the background file(s) in fasta format. Default: |
|
Parameter defining how the background sequences are used. Supported types are:
fasta_iid, random_nucl_shuffled_only, random_nucl_shuffled_addon, iid, markov_fit, markov_sim.
fasta_iid: Fasta files are used as is - default when background_files is not None.
random_nucl_shuffled_only: Fasta files are used, nucleotides in sequences are shuffled and only shuffled ones are used / returned.
random_nucl_shuffled_addon: Fasta files are used, nucleotides in sequences are shuffled and both the original and the shuffled sequences are used / returned.
iid: Fasta files are ignored if provided, b_alphabet_prior specifies nucleotide probabilities - default when background_files is None.
markov_fit: Fasta files are used to fit a hidden Markov model for posterior probabilities. Order specified with markov_order.
markov_sim: Fasta files are used to fit and sample from a hidden Markov model, so this is a type of simulation. Order specified with markov_order. |
|
Order of Markov model to learn from sequences. Used when
background_type is set to
markov_fit or markov_sim. Defaults to 0, corresponding to learning independent nucleotide frequencies.
|
|
Number of iterations of Markov model to learn from sequences.
Defaults to 100
|
|
Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’.
See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.
|
|
Number of shuffles of the backgrounds. Used when
background_type is set to
random_nucl_shuffled_only or random_nucl_shuffled_addon. Default:
None |
|
Number of groups into which motifs are assigned. If set to 1, the rest
of the options are ignored and all motifs are assigned to a
single group. Default:
1 |
|
Maximum size of each group. It cannot be smaller than the number
of motifs. Each group size is sampled from binomial distribution
with number of trials = max_group_size and success = group_size_p.
Default:
inf |
|
This parameter controls the expected size of each group. Each
group size is sampled from binomial distribution with number of
trials = max_group_size and success = group_size_p. Default:
1 |
|
Path to the motif-to-group assignment file in two-column tsv format.
The first column is the group IDs, and the second column lists the
motif IDs that are assigned to the corresponding group. Default:
None |
|
The method of selecting group background frequencies.
Values:
uniform, random. Uniform means each group has an equal chance to be selected. Random means each group is assigned a
probability of being selected. The difference between a frequent and
rare group is controlled by the
--group_freq_range parameter. Default:
uniform |
|
The range of the potential differences between a frequent and a
rare group. Default:
None |
|
The method of selecting motif background frequencies.
Values:
uniform, random. Uniform means each motif has an equal chance to be selected. Random means each motif is assigned
a probability of being selected. The difference between a frequent
and rare motif is controlled by the
--motif_freq_range parameter. Default:
uniform |
|
The range of the potential differences between a frequent and a
rare motif. Default:
None |
|
The preference of each group to be selected again when selecting
more than one group for insertion. Value between 0 and 1. Default:
1 |
|
The method of selecting group-group transition probabilities.
Values:
uniform, random. Uniform means any two groups are equally likely to co-occur. Random means group pairs
are assigned a random probability of transition. Default:
uniform |
|
Tsv file including the background frequencies for the groups
to be selected. Default:
None |
|
Tsv file including the background frequencies for the selection
of motifs to be inserted. Default:
None |
|
Tsv file including the group-group transition probability matrix.
Default:
None |
|
Type of position simulation. Supported values:
central, left_central, right_central, uniform, gaussian.
Central means the first motif is inserted into the center of the
background (replacing existing bases). Left_central means aligning the
first base to the center (replacing existing bases). Right_central means
aligning the last base to the position one before center (replacing
existing bases). Uniform means all positions have an equal chance. With the
to_replace option, the user may choose between replacing existing bases or inserting between them. Gaussian means following the
probabilities of a Gaussian distribution centered on the given position
of the background as per the position_means and position_variances
parameters. Gaussian insertions are left aligned and the insertion
is without replacing existing bases. Default:
central |
|
Comma separated mean values for the gaussian positioning option.
Default:
None |
|
Comma separated variance values for the gaussian positioning option.
Default:
None |
|
Number of sequences with motifs in them to generate. Default:
100 |
|
Percentage of sequences without motifs. Number between 0 and 100
is expected. Default:
0 |
|
Whether to replace background bases with the motif instance when sampling positions
uniformly. Alternative is to insert between existing bases. Default:
True |
|
Probability of reversing a motif instance in the motif-containing sequence.
Default:
0.5 |
|
Number of groups to sample per sequence. Default: |
|
Whether to select motifs from groups with replacement. Note, if more motifs
are requested than available in a group, replacement will be used regardless
of this parameter. Default:
True; set to False using the flag --no-motif_sampling_replacement |
|
Number of motif instances to be inserted per sequence. Takes
precedence over
--n_instances_per_sequence_l. Default: 1 |
|
Lambda parameter of Poisson distribution for selecting the number
of motif instances to be inserted per sequence. Default:
None |
|
Whether to draw the directed acyclic graph of the simulation steps.
Default:
False |
For further reference on the hidden Markov modeling, visit the hmmlearn documentation.
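To illustrate the reverse-complement dictionary and the orientation probability described in the table above, here is a minimal sketch. The function names are hypothetical and are not part of inMOTIFin; the package's internal implementation may differ. The sketch only assumes the A:T,C:G,G:C,T:A mapping string format shown in the defaults.

```python
import random


def parse_complement_map(spec: str) -> dict:
    """Turn 'A:T,C:G,G:C,T:A' into {'A': 'T', 'C': 'G', ...}."""
    pairs = (item.split(":") for item in spec.split(","))
    return {src.strip(): dst.strip() for src, dst in pairs}


def reverse_complement(instance: str, mapping: dict) -> str:
    """Walk the instance backwards and complement every letter."""
    return "".join(mapping[letter] for letter in reversed(instance))


def maybe_reverse(instance: str, mapping: dict, probability: float, rng):
    """Reverse complement the instance with the given probability.

    Returns the (possibly reversed) instance and a strand symbol.
    Hypothetical helper for illustration only.
    """
    if rng.random() < probability:
        return reverse_complement(instance, mapping), "-"
    return instance, "+"


mapping = parse_complement_map("A:T,C:G,G:C,T:A")
rng = random.Random(0)  # fixed seed so the draw is reproducible
oriented, strand = maybe_reverse("GCCTA", mapping, 0.5, rng)
```

A probability of 0.5 gives both orientations the same chance, while 0 would keep every instance on the forward strand.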
File format examples
motif_files in different formats
MEME: visit the official MEME format description page about the minimal meme format expected here.
MEME version 4
ALPHABET= ACGT
strands: + -
Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF test1 test1
letter-probability matrix: alength= 4 w= 3 nsites= 99494 E= 0
0.13 0.10 0.34 0.43
0.75 0.06 0.19 0.00
0.00 0.00 0.00 1.00
JASPAR format: visit the official JASPAR format description page.
>MA0001.1 MA0001.1.AGL3
A [ 0 3 79 40 66 48 65 11 65 0 ]
C [ 94 75 4 3 1 2 5 2 3 3 ]
G [ 1 0 3 4 1 0 5 3 28 88 ]
T [ 2 19 11 50 29 47 22 81 1 6 ]
>MA0003.1 MA0003.1.TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
>MA0002.2 MA0002.2.Runx1
A [ 287 234 123 57 0 87 0 17 10 131 500 ]
C [ 496 485 1072 0 75 127 0 42 400 463 158 ]
G [ 696 467 149 7 1872 70 1987 1848 251 81 289 ]
T [ 521 814 656 1936 53 1716 13 93 1339 1325 1053 ]
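For readers who want to inspect such a file before passing it to the tool, the sketch below reads the JASPAR layout shown above and converts counts to per-position probabilities. It is a standalone illustration with hypothetical function names, not the parser inMOTIFin uses.

```python
import re


def parse_jaspar(text: str) -> dict:
    """Parse JASPAR-formatted text into {motif_id: {letter: [counts]}}."""
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header such as '>MA0003.1 MA0003.1.TFAP2A': keep the first token
            current = line[1:].split()[0]
            motifs[current] = {}
        else:
            # Row such as 'A [ 0 0 0 22 ... ]'
            letter, values = line.split("[", 1)
            counts = [float(v) for v in re.findall(r"[\d.]+", values)]
            motifs[current][letter.strip()] = counts
    return motifs


def to_probabilities(counts: dict) -> list:
    """Convert per-letter counts to one probability dict per position."""
    letters = sorted(counts)
    width = len(counts[letters[0]])
    columns = []
    for i in range(width):
        total = sum(counts[letter][i] for letter in letters)
        columns.append({letter: counts[letter][i] / total for letter in letters})
    return columns


example = """>MA0003.1 MA0003.1.TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
"""
motifs = parse_jaspar(example)
ppm = to_probabilities(motifs["MA0003.1"])
```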
CSV format: comma separated list of JASPAR database matrix IDs (base ID + version) is expected.
MA0001.1 , MA0003.1 , MA0002.2
group_motif_assignment_file
<title>_group_0 <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1 <title>_motif_1,<title>_motif_5
group_freq_file
<title>_group_1 0.5
<title>_group_2 0.5
motif_freq_file
<title>_group_1 <title>_group_2
<title>_motif_1 0.25 0.5
<title>_motif_2 0.25 0.0
<title>_motif_3 0.25 0.0
<title>_motif_4 0.25 0.0
<title>_motif_5 0.0 0.5
group_group_file
<title>_group_1 <title>_group_2
<title>_group_1 0.75 0.25
<title>_group_2 0.25 0.75
Outputs
dagsim_table.csv
A table produced by DagSim framework, including all values of all nodes for each simulation round.
Note: In case of inserting instances without replacing the existing bases, the <title>_inserted_instances.bed file contains the correct start and end coordinates.
This csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.
<title>_dagsim_table.png
The DAG of the simulation showing all nodes and their dependencies.
<title>_final_sequences.fa
A fasta file containing all the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter).
The sequence names have the form index_title_backgroundID.
<title>_probabilistic_final_sequences.npz
An npz file containing the probabilistic version of the motif-in-sequences and the sequences without motifs (controlled by the pc_no_motif parameter).
The format is a dictionary with 2D numpy arrays as values.
You can load this file using the numpy.load() command and fetch any sequence by its key.
The keys of the sequences are the index_title_backgroundID.
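A minimal sketch of loading such a file with numpy.load() follows. Because no real output is available here, the example first writes a small stand-in archive; the file name, the keys, and the (length, alphabet size) orientation of the arrays are assumptions made for the demonstration only.

```python
import os
import tempfile

import numpy as np

# Stand-in for an inMOTIFin output: two "sequences" of length 4 over ACGT.
# Keys and array orientation are assumptions, not taken from a real run.
demo = {
    "0_sim_seq_0": np.full((4, 4), 0.25),
    "1_sim_seq_1": np.eye(4),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sim_probabilistic_final_sequences.npz")
    np.savez(path, **demo)

    # numpy.load returns a dictionary-like NpzFile; arrays are read on access
    with np.load(path) as archive:
        keys = list(archive.files)
        first = archive[keys[0]]
        shapes = {key: archive[key].shape for key in keys}
```

Using the archive as a context manager closes the underlying file once the arrays of interest have been read.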
<title>_inserted_instances.bed
A bed file containing the locations of the motifs in each of the motif-in-sequences. The first column is the name of the sequence as in the fasta file: index_title_backgroundID.
The second and third columns are the start and end coordinates of the inserted motif instance. The fourth column is the name of the inserted instance: title_motifID_instance.
The score column is ‘.’. The strand column is + or - depending on the orientation of the motif instance.
Note: In case of inserting instances without replacing the existing bases, the bed file contains the correct start and end coordinates.
The dagsim_table.csv file contains the coordinates of the positions of insertion in the original background sequences before insertion.
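The six-column layout described above can be read with the standard library, as in the sketch below. The two rows are made up for illustration (they are not real inMOTIFin output), the instance names are a guess at the title_motifID_instance pattern, and the length calculation assumes the usual half-open BED convention.

```python
import csv
import io
from collections import Counter

# Illustrative rows only. Columns: sequence name, start, end,
# instance name, score, strand.
bed_text = (
    "0_sim_seq_4\t23\t28\tsim_motif_2_GCCTA\t.\t+\n"
    "1_sim_seq_3\t23\t28\tsim_motif_0_CCAGC\t.\t-\n"
)


def read_bed(handle):
    """Yield one dict per BED line with integer coordinates."""
    fields = ["sequence", "start", "end", "instance", "score", "strand"]
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        record = dict(zip(fields, row))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        yield record


records = list(read_bed(io.StringIO(bed_text)))
strand_counts = Counter(record["strand"] for record in records)
# Assumes half-open intervals, so length is end minus start
lengths = [record["end"] - record["start"] for record in records]
```

To read a real file, replace the io.StringIO object with an open file handle.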
<title>_simulated_backgrounds.fa
A fasta file containing all the simulated sequences. Created only if the backgrounds are simulated.
<title>_simulated_motifs.meme
A meme file containing all the simulated motifs. Created only if the motifs are simulated.
<title>_motif_freq_per_group.tsv
A table exported from pandas showing the probability of selection of each motif from each group. Created only if the frequencies are simulated.
<title>_group_1 <title>_group_2
<title>_motif_1 0.25 0.5
<title>_motif_2 0.25 0.0
<title>_motif_3 0.25 0.0
<title>_motif_4 0.25 0.0
<title>_motif_5 0.0 0.5
<title>_motif_group_membership.tsv
A tsv file where the first column is the group IDs, and the second column lists the motif IDs that are assigned to the corresponding group. Created only if the groups are simulated.
<title>_group_0 <title>_motif_1,<title>_motif_2,<title>_motif_3,<title>_motif_4
<title>_group_1 <title>_motif_1,<title>_motif_5
<title>_group_frequency.tsv
The occurrence frequency or selection probability of each group. During simulation, sampling from these values defines the dominant group of the sequence. Created only if the groups are simulated.
<title>_group_1 0.5
<title>_group_2 0.5
<title>_group_group_transition_probabilities.tsv
The transition probability of each group pair. If more than one group is selected during simulation, first a group is chosen, then the rest are selected based on the previous group and the respective transition values.
The --concentration_factor defines how likely it is that the same group will be selected multiple consecutive times (i.e. staying in the same state).
<title>_group_1 <title>_group_2
<title>_group_1 0.75 0.25
<title>_group_2 0.25 0.75
<title>_occurrence_summaries.json
A json file with counts for the values of all nodes. It reports:
the number of times each background, group, motif, and orientation was selected;
the number of occurrences of specific motif instances (note that the random background might include more such instances not controlled by actual insertion);
how many times a specific number of instances per sequence was selected (relevant when a Poisson distribution is used);
the number of selected start positions and the occurrences of different motif lengths (relevant when motifs of different lengths are simulated); and
the number of backgrounds without a motif inserted in them (applicable when pc_no_motif > 0).
{
"selected_groups": {
"<title>_group_0": 15
},
"orientations": {
"1": 7,
"0": 8
},
"selected_motifs": {
"<title>_motif_2": 4,
"<title>_motif_1": 2,
"<title>_motif_0": 6,
"<title>_motif_3": 2,
"<title>_motif_4": 1
},
"instances": {
"GCCTA": 1,
"CCAGC": 1,
"GCAGA": 1,
"TCCTG": 1,
"CTGTA": 1,
"TTAGG": 1,
"TAAAC": 1,
"GTTCA": 1,
"TGTGT": 1,
"AGATA": 1,
"TTACG": 1,
"CCAAG": 1,
"AGAGT": 1,
"AAAAC": 1,
"AGTCT": 1
},
"backgrounds": {
"<title>_seq_4": 4,
"<title>_seq_3": 4,
"<title>_seq_2": 3,
"<title>_seq_0": 2,
"<title>_seq_5": 1,
"<title>_seq_1": 1
},
"num_instances": {
"1": 15
},
"position_starts": {
"23": 15
},
"motif_lengths": {
"5": 15
},
"no_motif_backgrounds": {
"<title>_seq_1": 1,
"<title>_seq_0": 2,
"<title>_seq_4": 2
}
}
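A summary like the one above can be inspected with the json module. The sketch below uses a trimmed copy of the example, with the title assumed to be sim, and computes how often each motif was selected as a fraction of all selections.

```python
import json

# Trimmed copy of the example above; the title "sim" is an assumption.
summary_text = """
{
  "selected_motifs": {"sim_motif_2": 4, "sim_motif_1": 2, "sim_motif_0": 6,
                      "sim_motif_3": 2, "sim_motif_4": 1},
  "num_instances": {"1": 15},
  "motif_lengths": {"5": 15}
}
"""

summary = json.loads(summary_text)
motif_counts = summary["selected_motifs"]

total = sum(motif_counts.values())
most_common = max(motif_counts, key=motif_counts.get)
fractions = {name: count / total for name, count in motif_counts.items()}

# JSON object keys are always strings, so numeric keys need converting
length_counts = {int(k): v for k, v in summary["motif_lengths"].items()}
```

For a file on disk, json.load() with an open file handle takes the place of json.loads().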
Usage examples
Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.
python -m inmotifin motif-in-seq --title simulation_0 --dirichlet_alpha 1,2,3,4 --num_motifs 10 --m_length_min 3 --pc_no_motif 10
Simulation of random sequences
python -m inmotifin random-sequences --help
Arguments
|
Folder of the simulation outputs. Defaults to current work directory. |
|
The title of the simulation. A new folder
workdir\title is created and all simulated sequences will get the
title prefix. Default: sim |
|
Random seed for the different steps in the simulation. All results are reproducible
when set. Default:
None |
|
Config file for the simulation with the parameters for creating random sequences.
Default:
None |
|
Number of background sequences to simulate. Default: |
|
Length of the simulated background sequences, unless
--b_length_max is specified. If
--b_length_max is also specified, this is the lower boundary of length. Default:
50 |
|
Maximum allowed length of the background sequences. The actual length
is sampled from a uniform distribution. Default:
None |
|
String of letters in the background alphabet. Default: |
|
Comma separated probability of the letters in the background alphabet.
Default:
0.25,0.25,0.25,0.25 |
|
Path(s) to the background file(s) in fasta format. Default: |
|
Parameter defining how the background sequences are used. Supported types are:
fasta_iid, random_nucl_shuffled_only, random_nucl_shuffled_addon, iid, markov_fit, markov_sim.
fasta_iid: Fasta files are used as is - default when background_files is not None.
random_nucl_shuffled_only: Fasta files are used, nucleotides in sequences are shuffled and only shuffled ones are used / returned.
random_nucl_shuffled_addon: Fasta files are used, nucleotides in sequences are shuffled and both the original and the shuffled sequences are used / returned.
iid: Fasta files are ignored if provided, b_alphabet_prior specifies nucleotide probabilities - default when background_files is None.
markov_fit: Fasta files are used to fit a hidden Markov model for posterior probabilities. Order specified with markov_order.
markov_sim: Fasta files are used to fit and sample from a hidden Markov model, so this is a type of simulation. Order specified with markov_order. |
|
Order of Markov model to learn from sequences. Used when
background_type is set to
markov_fit or markov_sim. Defaults to 0, corresponding to learning independent nucleotide frequencies.
|
|
Number of iterations of Markov model to learn from sequences.
Defaults to 100
|
|
Algorithm of Markov model to learn from sequences. Options: ‘viterbi’ or ‘map’.
See hmmlearn 0.3.3 documentation. Defaults to ‘viterbi’.
|
|
Number of shuffles of the backgrounds. Used when
background_type is set to
random_nucl_shuffled_only or random_nucl_shuffled_addon. Default:
None |
For further reference on the hidden Markov modeling, visit the hmmlearn documentation.
Outputs
<title>_simulated_backgrounds.fa
A fasta file containing all the simulated sequences.
Usage examples
Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.
python -m inmotifin random-sequences --title nmk_sim --b_alphabet NMK --b_alphabet_prior 0.3,0.4,0.3
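Conceptually, the command above draws every letter independently from the given alphabet with the given probabilities. The following sketch reproduces that idea with the standard library; it is not inMOTIFin's implementation, and the function name and its arguments are hypothetical.

```python
import random


def simulate_iid_sequences(alphabet, prior, n_sequences, length, seed=None):
    """Draw sequences letter by letter from a fixed categorical prior.

    Hypothetical helper: mirrors the idea behind --b_alphabet and
    --b_alphabet_prior, not the package's actual code.
    """
    if len(alphabet) != len(prior):
        raise ValueError("alphabet and prior must have the same length")
    if abs(sum(prior) - 1.0) > 1e-9:
        raise ValueError("prior probabilities must sum to 1")
    rng = random.Random(seed)
    return [
        "".join(rng.choices(alphabet, weights=prior, k=length))
        for _ in range(n_sequences)
    ]


sequences = simulate_iid_sequences(
    "NMK", [0.3, 0.4, 0.3], n_sequences=3, length=50, seed=1
)
```

Passing the same seed yields the same sequences, which is the behaviour the seed option describes for the tool itself.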
Simulation of motifs
python -m inmotifin motifs --help
Arguments
|
Folder of the analysis outputs. Defaults to current work directory. |
|
The title of the simulation. A new folder
workdir\title is created and all simulated motifs will get the
title prefix. Default: sim |
|
Random seed for the different steps in the simulation. All results are reproducible
when set. Default:
None |
|
Config file for the simulation with the parameters for creating motifs.
Default:
None |
|
Alpha values for the Dirichlet distribution from which motifs are sampled.
Default:
0.5,0.5,0.5,0.5 (Jeffreys prior) |
|
Number of motifs to create. Default: |
|
Length of the simulated motifs, unless
--len_motifs_max is specified. If
--len_motifs_max is also specified, this is the lower boundary of length. Default:
5 |
|
Maximum allowed length of the motif. The actual length is sampled from a uniform
distribution. Default:
None |
|
String of letters in the motif alphabet. Default:
ACGT |
Outputs
<title>_simulated_motifs.meme
A meme file containing all the simulated motifs.
Usage examples
Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.
python -m inmotifin motifs --title motif_sim --num_motifs 10 --m_length_min 3 --m_length_max 6
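The idea behind this command can be sketched as follows: pick a length uniformly between the minimum and maximum, then draw each motif position from a Dirichlet distribution. The sketch uses the standard library and obtains Dirichlet samples by normalising independent Gamma draws. Function names are hypothetical, and treating the maximum length as inclusive is an assumption rather than documented behaviour.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def simulate_motif(alphas, min_length, max_length, rng):
    """Return a list of columns; each column is a probability vector.

    Assumption: the length is uniform on [min_length, max_length],
    with both ends included.
    """
    length = rng.randint(min_length, max_length)
    return [sample_dirichlet(alphas, rng) for _ in range(length)]


rng = random.Random(42)
jeffreys = [0.5, 0.5, 0.5, 0.5]  # the default alpha values listed above
motifs = [simulate_motif(jeffreys, 3, 6, rng) for _ in range(10)]
```

Alpha values below 1 concentrate the probability mass on few letters per position, giving more informative motifs, while larger values pull the columns toward uniform.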
Multimerisation of motifs
python -m inmotifin multimers --help
Arguments
|
Folder of the simulation outputs. Defaults to current work directory. |
|
The title of the multimerisation. A new folder
workdir\title is created and all simulated multimers will get the
title prefix. Default:
sim |
|
Random seed for the insertion of non-informative bases between
two motifs when the distance is positive. All results are
reproducible when set. Default:
None |
|
Config file for the simulation with the parameters for creating
multimers. Default:
None |
|
Path to the motif files. Meme, jaspar, and csv with jaspar motif
IDs formats are accepted. Default:
None |
|
Release name of JASPAR database version to use when fetching JASPAR
motif IDs. For further information see pyJASPAR’s documentation.
Example value: ‘JASPAR2024’ Default:
None |
|
Path to the multimerisation rule tsv files. It should have two
(optionally three) tab separated columns: list of motif IDs separated
by comma, and list of distances between them separated by comma, and
weights for the averaging of motifs (one per motif) separated by comma.
When weights (or third column in general) are not present, each
motif is assigned weight of 1. Default:
None |
File format examples
multimerisation_rules
motif_1,motif_2,motif_3 -2,-1 1,5,1
motif_2,motif_3 3
motif_1,motif_2 2
In the first multimer, the middle motif will have a higher weight at the overlaps. When the distances are positive, weighting does not play a role.
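Before running the command, a rules file can be checked with a short script such as the sketch below. It reads the two or three tab separated columns described above and fills in a weight of 1 per motif when the third column is absent. The function name is hypothetical, and the consistency checks (one fewer distance than motifs, integer distances) are assumptions about what a well-formed file looks like, not documented validation rules.

```python
rules_text = (
    "motif_1,motif_2,motif_3\t-2,-1\t1,5,1\n"
    "motif_2,motif_3\t3\n"
    "motif_1,motif_2\t2\n"
)


def parse_rules(text):
    """Parse multimerisation rules into a list of dicts."""
    rules = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        motif_ids = columns[0].split(",")
        # Assumption: distances are whole numbers of bases
        distances = [int(d) for d in columns[1].split(",")]
        if len(distances) != len(motif_ids) - 1:
            raise ValueError(f"line {line_number}: expected one distance per gap")
        if len(columns) > 2 and columns[2].strip():
            weights = [float(w) for w in columns[2].split(",")]
        else:
            # No third column: every motif gets a weight of 1
            weights = [1.0] * len(motif_ids)
        if len(weights) != len(motif_ids):
            raise ValueError(f"line {line_number}: expected one weight per motif")
        rules.append(
            {"motifs": motif_ids, "distances": distances, "weights": weights}
        )
    return rules


rules = parse_rules(rules_text)
```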
Outputs
<title>_multimer_motifs.meme
A meme file containing all the simulated multimers.
Usage examples
Parameters can be passed through a config file and as command line options. Command line options take precedence if both are provided.
python -m inmotifin multimers --title multimers_sim --motif_files multimer_sim/simulated_motifs.meme --multimerisation_rules multimer_sim/multimer_rules.tsv