Introduction
------------

inMOTIFin is a lightweight Python package for simulating cis-regulatory elements.
It consists of four command-line modules and Python access to the backend for advanced usage.

======================================================
Simulation of sequences with inserted motif instances
======================================================

This option of the inMOTIFin package is built using `DagSim <https://uio-bmi.github.io/dagsim/index.html>`_, a simulation framework for causal models.

The directed acyclic graph (DAG) shown below describes the nodes of the simulation and the relationships between them.
Each node describes a value or sampling strategy (depending on user input).
Each node gets input from an upstream node (or the config file).
Similarly, the output of each node is the input to its downstream nodes.
This creates a dependency structure where each node is dependent only on the ones above it.
For example: first groups are selected, then motifs from these groups, then instances from the motifs, then positions for each instance.
As the motif-group membership information is used only when sampling motifs, the locations of the instances will not depend on the original groups the motifs were sampled from.

.. image:: images/dag.png

+--------------------+-----------------------------------------------------------------------------------+
| selected_groups    | | Motifs are organized into groups. Groups have probabilities for being selected. | 
|                    | | Within each group, motifs also have probabilities of being selected.            | 
+--------------------+-----------------------------------------------------------------------------------+
| num_instances      | | Number of motif instances per sequence. Options: integer or lambda parameter    |
|                    | | for a Poisson distribution.                                                     | 
+--------------------+-----------------------------------------------------------------------------------+
| selected_motifs    | | PWMs selected from a provided list (either simulated or from user input).       | 
+--------------------+-----------------------------------------------------------------------------------+
| orientations       | | The orientation of the motif instances is sampled from a binomial distribution. | 
+--------------------+-----------------------------------------------------------------------------------+
| instances          | | List of instances sampled from the ``selected_motifs``.                         | 
+--------------------+-----------------------------------------------------------------------------------+
| backgrounds        | | List of backgrounds sampled from a provided list (either simulated or from      |
|                    | | user input).                                                                    | 
+--------------------+-----------------------------------------------------------------------------------+
| positions          | | Positions of the inserted instances. Options: centering at the middle           |
|                    | | of the background, aligning the right or left side of the instance to the       |
|                    | | center, uniform sampling along the length of the background, sampling from      |
|                    | | Gaussian distribution(s) from provided mean and variance values.                | 
+--------------------+-----------------------------------------------------------------------------------+
| motif_in_seq       | | Final sequence assembled from the sampled parts.                                | 
+--------------------+-----------------------------------------------------------------------------------+
| prob_motif_in_seq  | | Position-wise letter probabilities of the final sequence.                       |
+--------------------+-----------------------------------------------------------------------------------+
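
The ``positions`` options listed in the table can be illustrated with a minimal sketch.
It is not inMOTIFin's implementation: the function name ``sample_start``, the mode names, and the convention that a position is the 0-based start index of an instance that must fit inside the background are all assumptions made for illustration.

.. code-block:: python

   import numpy as np


   def sample_start(background_len, instance_len, mode, rng, mean=None, variance=None):
       """Return a 0-based start index for an instance (hypothetical helper)."""
       last_start = background_len - instance_len  # last index where the instance still fits
       centre = background_len // 2
       if mode == "central":            # instance centred on the background
           start = last_start // 2
       elif mode == "left_aligned":     # left edge of the instance at the centre
           start = centre
       elif mode == "right_aligned":    # right edge of the instance at the centre
           start = centre - instance_len
       elif mode == "uniform":          # any valid start is equally likely
           start = rng.integers(0, last_start, endpoint=True)
       elif mode == "gaussian":         # the scale of a normal is sqrt(variance)
           start = round(rng.normal(mean, np.sqrt(variance)))
       else:
           raise ValueError(f"unknown mode: {mode}")
       # Assumption: out-of-range draws are clipped so the instance always fits.
       return int(np.clip(start, 0, last_start))


   rng = np.random.default_rng(0)
   starts = {
       mode: sample_start(25, 5, mode, rng, mean=10, variance=4)
       for mode in ("central", "left_aligned", "right_aligned", "uniform", "gaussian")
   }

Clipping is only one way to handle draws that fall outside the background; resampling would be an equally reasonable choice.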

To read more about groups, visit the `Explanation of groups page <https://inmotifin.readthedocs.io/en/latest/usage/groups_freq_explained.html>`_.
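
The dependency chain described above (groups, then motifs, then instances, then positions) can be sketched with plain NumPy.
This is an illustration under stated assumptions, not the DagSim graph used by inMOTIFin: the toy groups, motifs, and PWMs are made up, and the sketch assumes that an instance overwrites background letters at its position, which is not specified on this page.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)

   # Hypothetical toy inputs (not taken from inMOTIFin).
   ALPHABET = np.array(list("ACGT"))
   GROUP_PROBS = {"group_a": 0.7, "group_b": 0.3}
   MOTIF_PROBS = {
       "group_a": {"motif_1": 0.5, "motif_2": 0.5},
       "group_b": {"motif_3": 1.0},
   }
   # One random 6-position PWM per motif; rows are positions, columns are letters.
   PWMS = {name: rng.dirichlet(np.ones(4), size=6)
           for name in ("motif_1", "motif_2", "motif_3")}
   COMPLEMENT = str.maketrans("ACGT", "TGCA")


   def pick(prob_by_name):
       """Draw one key of a {name: probability} dictionary."""
       names = list(prob_by_name)
       return str(rng.choice(names, p=list(prob_by_name.values())))


   def simulate_sequence(background, lam=2.0, p_reverse=0.5):
       """Walk the dependency chain once for a single background sequence."""
       sequence, records = background, []
       for _ in range(rng.poisson(lam)):                 # num_instances
           group = pick(GROUP_PROBS)                     # selected_groups
           motif = pick(MOTIF_PROBS[group])              # selected_motifs
           instance = "".join(rng.choice(ALPHABET, p=row) for row in PWMS[motif])
           if rng.random() < p_reverse:                  # orientations
               instance = instance.translate(COMPLEMENT)[::-1]
           # positions: uniform start, independent of the group that was drawn
           last_start = len(background) - len(instance)
           start = int(rng.integers(0, last_start, endpoint=True))
           # Assumption: the instance overwrites the background letters.
           sequence = sequence[:start] + instance + sequence[start + len(instance):]
           records.append((group, motif, instance, start))
       return sequence, records


   final_sequence, inserted = simulate_sequence("ACCGTACAAAGGAGGTAAATCGATG")

Note that ``start`` is drawn without looking at ``group``, which mirrors the statement that instance locations do not depend on the group a motif was sampled from.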

=================================
Simulation of random sequences 
=================================

For i.i.d. simulation, the user can set the minimum and maximum length, the alphabet, and a base preference that is shared across all sequences.
When both a minimum and a maximum length are provided, the length of each sequence is sampled uniformly at random between these two values (inclusive).

.. code-block::

   {'background_sim_seq_0': 'CGCGTACGGGGTCCGACCCTTAAAA',
   'background_sim_seq_1': 'CCGCGAACGTGGCCGTCTGCAGAAC',
   'background_sim_seq_2': 'AGAGCACCGGACGTGACCGCGGAAT',
   'background_sim_seq_3': 'GTTCCACCCGAAACCACAGCGGCTA',
   'background_sim_seq_4': 'ACCGTACAAAGGAGGTAAATCGATG'}
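
The i.i.d. scheme is simple enough to sketch in a few lines of NumPy.
The function ``simulate_backgrounds`` and its parameters are hypothetical and only mirror the description above; they are not the inMOTIFin API.

.. code-block:: python

   import numpy as np


   def simulate_backgrounds(n, min_len, max_len, alphabet="ACGT", probs=None, seed=None):
       """Draw n i.i.d. sequences (hypothetical helper, not the inMOTIFin API)."""
       rng = np.random.default_rng(seed)
       letters = np.array(list(alphabet))
       sequences = {}
       for i in range(n):
           # Length is uniform on [min_len, max_len], both ends included.
           length = rng.integers(min_len, max_len, endpoint=True)
           # Every letter is drawn independently from the shared preference;
           # probs=None means a uniform preference.
           drawn = rng.choice(letters, size=length, p=probs)
           sequences[f"background_sim_seq_{i}"] = "".join(drawn)
       return sequences


   backgrounds = simulate_backgrounds(5, 20, 30, probs=[0.2, 0.3, 0.3, 0.2], seed=0)

Setting ``min_len`` equal to ``max_len`` gives fixed-length sequences such as the 25-letter examples shown above.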

For context-dependent simulation, inMOTIFin can fit a hidden Markov model (HMM) on existing sequences and sample from the model.
This requires a set of input sequences in FASTA format, the expected alphabet (which may include more letters than the provided sequences, but not fewer), the minimum and maximum length for sampling, and the order of the fitted model.
When choosing an order, follow the `considerations described by Charles E Grant in the MEME suite Google forum <https://groups.google.com/g/meme-suite/c/yNascbE8Tig>`_:

.. code-block::

   The larger the set of "background" sequences is, the better the results will be. (...)
   The order of the Markov model is the number of preceding positions considered when calculating the character frequencies for the current position. 

   Typically, you should not specify an order larger than 3 for DNA sequences  or larger than 2 for protein sequences. (...)
   * For an accurate model of order N, you need to use a FASTA file as input (...) with at least 10 times 4**(N+1) DNA characters in it. So,

   order-3 requires 2560 characters 
   order 4 requires 10240 characters 
   order 5 requires 40960 characters 
   etc.

Note that here the model is only fitted and sampled from, so the considerations specific to motif finding are omitted from the quote and replaced with (...).
Also note that inMOTIFin can work with any alphabet, so it is not restricted to DNA or protein.
However, the size of the alphabet affects how well the model can be fitted: an order-N model has one letter distribution per N-letter context, so the amount of training sequence needed grows rapidly with the alphabet size.
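
To make the order and alphabet-size trade-off concrete, here is a minimal sketch of a plain order-N Markov chain that is fitted by counting and then sampled.
It is not inMOTIFin's model-fitting code, and all function names are hypothetical.
``min_training_chars`` reproduces the rule of thumb quoted above for DNA; using it with other alphabet sizes is an extrapolation, not something stated in the quote.

.. code-block:: python

   import random
   from collections import Counter, defaultdict


   def min_training_chars(order, alphabet_size=4, factor=10):
       """Rule of thumb quoted above: factor * alphabet_size ** (order + 1)."""
       return factor * alphabet_size ** (order + 1)


   def fit_markov(sequences, order, alphabet, pseudocount=1.0):
       """Count which letter follows each context of `order` letters."""
       counts = defaultdict(Counter)
       for seq in sequences:
           for i in range(order, len(seq)):
               counts[seq[i - order:i]][seq[i]] += 1

       def weights(context):
           # The pseudocount keeps unseen letters and unseen contexts possible,
           # e.g. alphabet letters that never occur in the training data.
           seen = counts.get(context, {})
           return [seen.get(letter, 0) + pseudocount for letter in alphabet]

       return weights


   def sample_markov(weights, order, alphabet, length, rng):
       """Sample one sequence; the first `order` letters are uniform (a simplification)."""
       seq = rng.choices(alphabet, k=order)
       while len(seq) < length:
           context = "".join(seq[len(seq) - order:])
           seq.append(rng.choices(alphabet, weights=weights(context))[0])
       return "".join(seq[:length])


   training = ["ACGTACGTTGCAACGT", "TTGACGTACGATCGTA"]
   model = fit_markov(training, order=2, alphabet="ACGTN")
   sampled = sample_markov(model, order=2, alphabet="ACGTN", length=30, rng=random.Random(0))

The toy training set is far smaller than the rule of thumb suggests, so the sampled sequence is dominated by the pseudocounts; with a realistic FASTA file the counts take over.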

======================
Simulation of motifs 
======================

Motifs can be simulated by controlling their length, information content, and alphabet.

.. image:: images/motif_1.png
   :width: 30%

.. image:: images/motif_2.png
   :width: 60%

Motif_1 and motif_2 were simulated from a Dirichlet distribution defined by the alpha values A: 0.4, C: 1, G: 0.9, T: 0.5.
For interpretation of values, visit the page `motif simulation explained <https://inmotifin.readthedocs.io/en/latest/usage/motif_simulation.html>`_.
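
As a rough illustration of how Dirichlet alpha values shape a motif, the sketch below draws one letter distribution per position and reports the information content of each position.
The helper names are hypothetical, and the information content is computed against a uniform background, which is an assumption rather than something stated on this page.

.. code-block:: python

   import numpy as np


   def simulate_motif(length, alpha, seed=None):
       """Draw a (length x alphabet) probability matrix, one Dirichlet draw per position."""
       rng = np.random.default_rng(seed)
       return rng.dirichlet(alpha, size=length)


   def information_content(pwm):
       """Bits per position relative to a uniform background (assumption)."""
       safe = np.where(pwm > 0, pwm, 1.0)  # avoid log2(0); those terms contribute 0
       entropy = -(pwm * np.log2(safe)).sum(axis=1)
       return np.log2(pwm.shape[1]) - entropy


   alpha = [0.4, 1.0, 0.9, 0.5]  # A, C, G, T, as in the example above
   pwm = simulate_motif(8, alpha, seed=1)
   ic = information_content(pwm)

Alpha values below 1 push probability mass towards a few letters at each position, which tends to raise the information content; larger values give flatter, less informative positions.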

=========================
Multimerisation of motifs 
=========================

Motifs can be multimerised by providing distances between the motif components.

Motif_1 and motif_2 multimerised by setting a positive distance.

.. image:: images/multimer_pos_example.png

...or a negative distance.

.. image:: images/multimer_neg_example.png
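
The effect of the distance can be sketched on probability matrices.
How inMOTIFin fills gaps or resolves overlaps is not described on this page, so this sketch makes two explicit assumptions: gap positions are uniform, and overlapping positions are averaged.
The function ``multimerise`` is hypothetical.

.. code-block:: python

   import numpy as np


   def multimerise(pwm_a, pwm_b, distance):
       """Join two (length x alphabet) matrices separated by `distance` positions."""
       n_letters = pwm_a.shape[1]
       if distance >= 0:
           # Assumption: the spacer between the components is uniform.
           gap = np.full((distance, n_letters), 1.0 / n_letters)
           return np.vstack([pwm_a, gap, pwm_b])
       overlap = -distance
       if overlap > min(len(pwm_a), len(pwm_b)):
           raise ValueError("overlap is longer than one of the motifs")
       # Assumption: overlapping positions are the average of both components.
       shared = (pwm_a[-overlap:] + pwm_b[:overlap]) / 2.0
       return np.vstack([pwm_a[:-overlap], shared, pwm_b[overlap:]])


   rng = np.random.default_rng(0)
   motif_1 = rng.dirichlet(np.ones(4), size=5)
   motif_2 = rng.dirichlet(np.ones(4), size=7)
   spaced = multimerise(motif_1, motif_2, distance=2)
   overlapped = multimerise(motif_1, motif_2, distance=-2)

With these toy inputs, a distance of 2 yields 14 positions and a distance of -2 yields 10; every row still sums to 1 because averaging two probability vectors gives a probability vector.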

================================
Manuscript and related resources 
================================

The inMOTIFin package is described in the manuscript: `inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences <https://arxiv.org/abs/2506.20769>`_.
The results and examples shown in this documentation are based on the data and code available at the `Bitbucket repository <https://bitbucket.org/CBGR/inmotifin_evaluation/src/main/>`_.
In this repository you may also find an inMOTIFin Nextflow module.