Rerun analysis
*The interpretation of the Advanced part is only available in EN.
Based on Snakemake
SAW pipelines are constructed using the Snakemake computing framework. Snakemake is a powerful and flexible workflow management system designed to create reproducible and scalable data analyses. It is particularly popular in bioinformatics and computational biology but can be applied to any domain involving data processing pipelines. Its workflows are defined using a Python-based language, which makes them both human-readable and highly customizable.
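As a concrete illustration (the rule and file names below are invented for this sketch and are not part of SAW), a minimal Snakefile chains two steps by matching file names:

```python
# Snakefile — a minimal two-rule workflow (illustrative names only)
rule all:
    input:
        "results/counts.txt"

# Sort an input file; Snakemake infers the dependency from the file names.
rule sort_reads:
    input:
        "data/reads.txt"
    output:
        "results/sorted.txt"
    shell:
        "sort {input} > {output}"

# Count lines in the sorted file; runs only after sort_reads succeeds.
rule count_reads:
    input:
        "results/sorted.txt"
    output:
        "results/counts.txt"
    shell:
        "wc -l < {input} > {output}"
```

Requesting the target `results/counts.txt` causes Snakemake to work backwards through the dependency chain and execute only the rules whose outputs are missing or outdated.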
Key features
- Reproducibility: Snakemake ensures that workflows are reproducible by automatically tracking dependencies between input files, output files, and the steps (rules) used to generate them.
- Scalability: It supports execution on various environments, from local machines to high-performance computing (HPC) clusters and cloud platforms (e.g., AWS, Google Cloud).
- Portability: Workflows are portable across different systems and environments, thanks to its integration with containerization tools like Docker and Singularity.
- Parallel execution: Snakemake automatically parallelizes tasks where possible, optimizing resource usage and reducing runtime.
- Dynamic workflow design: It allows for dynamic determination of input and output files during runtime, enabling flexible and adaptive workflows.
- Integration with Python: Snakemake leverages Python for scripting, enabling users to incorporate complex logic, data manipulation, and external libraries directly into their workflows.
- Visualization: It provides tools for visualizing workflows, making it easier to understand and debug complex pipelines.
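Several of these features are exposed directly on the Snakemake command line; for example:

```shell
snakemake -n                           # dry run: show what would be executed
snakemake --cores 4                    # run, parallelizing independent jobs on 4 cores
snakemake --dag | dot -Tsvg > dag.svg  # render the workflow DAG for inspection
snakemake --use-singularity            # run each rule in its declared container
```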
Functional modules
- Rules: The core building blocks of a Snakemake workflow. Each rule defines a step in the workflow, specifying input files, output files, and the command or script to execute.
- DAG (Directed Acyclic Graph): Snakemake constructs a DAG to represent the dependencies between rules and files, ensuring tasks are executed in the correct order.
- Configuration files: External configuration files (e.g., JSON, YAML) can be used to parameterize workflows, making them more modular and reusable.
- Logging and reporting: Snakemake provides detailed logging and reporting features, allowing users to monitor workflow execution and generate summary reports.
- Cluster and cloud integration: Snakemake supports submission of jobs to cluster schedulers (e.g., SLURM, PBS) and cloud platforms, enabling scalable execution.
- Containerization: Integration with Docker and Singularity ensures that workflows run in isolated environments, enhancing reproducibility.
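For instance, a rule can be parameterized through an external YAML file using the `configfile` directive. The tool and file names below are illustrative, not part of SAW:

```python
# Snakefile fragment (illustrative): parameterize a rule via an external config
configfile: "config.yaml"  # e.g. contains: threads: 8, genome: "ref/genome.fa"

rule align:
    input:
        reads="data/sample.fq",
        genome=config["genome"]
    output:
        "results/sample.bam"
    threads: config["threads"]
    shell:
        "bwa mem -t {threads} {input.genome} {input.reads} > {output}"
```

Swapping in a different config file changes the reference genome and thread count without touching the workflow code.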
SAW rerun
What is rerun?
Snakemake's rerun functionality is based on its ability to track dependencies and determine which parts of the workflow need to be re-executed. This is achieved through the following principles:
- File timestamps: Snakemake compares the timestamps of input and output files. If an input file is newer than its corresponding output file, the rule is marked for rerun.
- Content-based hashing: Snakemake can use content-based hashing (e.g., MD5 checksums) to detect changes in input files, ensuring that even minor modifications trigger a rerun.
- Rule dependencies: If a rule depends on the output of another rule and the upstream rule is rerun, all downstream rules are also marked for rerun.
- Conditional execution: Snakemake allows users to define conditions under which rules should be rerun, such as changes in configuration parameters or external data sources.
How to rerun?
The rerun function of SAW count inherits this dependency tracking from the Snakemake framework. When running analysis tasks, you may encounter the following situations:
- The analysis task is interrupted by insufficient computing resources or an unexpected server shutdown.
- You accidentally kill the analysis task.
If the input parameters of the second run are identical to those of the first run (you can verify this against the parameters recorded in /pipeline-logs/config.yaml), the analysis will resume from where it stopped rather than starting over.
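In practice, resuming amounts to re-issuing the original command; a hedged sketch (the output directory and arguments are placeholders for those of your first run):

```shell
# Compare the parameters of the intended rerun with those recorded by the first run
cat <output_dir>/pipeline-logs/config.yaml

# Re-issue the exact same command as the first run; Snakemake's dependency
# tracking skips steps whose outputs are already complete and up to date.
saw count <same arguments as the first run>
```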