Cluster mode
*The interpretation of the Advanced part is only available in EN.
Overview
SAW now supports submitting analysis tasks in cluster mode on Linux systems and is compatible with the SGE (Sun Grid Engine) job scheduler. Cluster mode lets analysts run tasks in parallel in high-performance computing environments, making fuller use of cluster resources and improving analysis efficiency.
- Cluster mode support: defines default job submission rules and resource requirements through configuration files.
- Resource allocation optimization: dynamically adjusts thread count, memory size, and memory retry factors to optimize resource utilization.
- Task management: supports job submission, logging, and task interruption.
- Parallel computing: supports intra-rule and inter-rule parallel computing, suitable for large-scale data analysis.
Enabling cluster mode
Enable cluster mode by specifying --job-mode=sge when running tasks. If cluster mode is not specified, the local mode is used by default.
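As a minimal sketch of choosing the mode in a wrapper script, the function below checks whether an SGE submission command is on the PATH. `pick_job_mode` is a hypothetical helper, not part of SAW; it assumes that the presence of `qsub` is a good enough signal that SGE is usable on the current host.

```shell
# Hypothetical helper, not part of SAW.
# Prints "sge" when the probed command exists on PATH, otherwise "local".
pick_job_mode() {
  # $1: command to probe; defaults to qsub, the SGE submission command
  if command -v "${1:-qsub}" >/dev/null 2>&1; then
    echo sge
  else
    echo local
  fi
}

echo "job mode: $(pick_job_mode)"
```

A wrapper could then add --job-mode=sge to the saw command only when the function prints "sge", and otherwise omit the option so that local mode is used by default.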
Resource configuration
A resource configuration file defines default job submission rules and resource requirements. The resource file, /path/to/software/package/saw/config/resources.yaml, is in YAML format and includes the following sections:
- Default resource configuration: defines global defaults for rules without explicit configurations.
- Rule-specific resource configuration: sets the following for specific rules:
  - thread count
  - memory size (GB)
  - maximum number of retries
  - memory retry factor (the memory allocated for the next retry is multiplied by this factor)
If the default value of a field is null, the program calculates the resources required for that component automatically at run time. Typically, these components do not consume excessive computing resources.
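To make the memory retry factor concrete, here is a small sketch of the arithmetic. `retry_mem_gb` is a hypothetical helper, not part of SAW, and it assumes the factor compounds, that is, each retry multiplies the memory of the previous attempt. The exact behavior inside SAW may differ.

```shell
# Hypothetical helper, not part of SAW.
# Assumption: memory is multiplied by retry_factor on every retry (compounding).
# $1: mem_gb, $2: retry_factor, $3: attempt number (0 = first run)
retry_mem_gb() {
  awk -v m="$1" -v f="$2" -v a="$3" 'BEGIN { printf "%g\n", m * f ^ a }'
}

# Values of the annotation rule in the example file: mem_gb 70, retry_factor 1.5
for attempt in 0 1 2; do
  echo "attempt ${attempt}: $(retry_mem_gb 70 1.5 "$attempt") GB"
done
```

Under this assumption the annotation rule would request 70 GB, then 105 GB, then 157.5 GB. How many retries actually happen is bounded by max_retries, which is 0 in the example file.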
Example resource configuration YAML file:
version: "1.0"
global:
  scheduler: "local" #choice: local, sge
  default_setting:
    max_retries: 0
  schedulers: #optional
    sge:
      queue: " " #if none, default queue will be used
#-------------------------------------------------------
#optional rule settings, pick needed ones
rules:
  generate_mask:
    threads: 1
    mem_gb: 1.5
    retry_factor: 2
  generate_cid_count:
    threads: 16
    mem_gb: 1
    retry_factor: 2
  alignment: #read alignment
    threads: null
    mem_gb: null #from CID counts
    retry_factor: 1.5
  annotation: #gene annotation
    threads: null #lowest 8
    mem_gb: 70
    retry_factor: 1.5
  merge_cid_info:
    threads: 10
    mem_gb: 3
    retry_factor: 2
  image_registration: #image-related processing
    threads: null
    mem_gb: 20
    retry_factor: 1.5
  tissue_cut:
    threads: 10
    mem_gb: 12
    retry_factor: 2
  microbe_analysis:
    threads: 10
    mem_gb: 66
    retry_factor: 2
  microbe_tissue_cut:
    threads: 10
    mem_gb: 11
    retry_factor: 2
  clustering: #clustering analysis based on matrices
    threads: 1
    mem_gb: 8
    retry_factor: 2
  cell_cut:
    threads: 1
    mem_gb: 15
    retry_factor: 2
  saturation:
    threads: 1
    mem_gb: 7
    retry_factor: 2
  protein_mapping:
    threads: null
    mem_gb: 45
    retry_factor: 2
  protein_tissue_cut:
    threads: 10
    mem_gb: 8
    retry_factor: 2
  protein_clustering:
    threads: 11
    mem_gb: 3
    retry_factor: 2
  protein_cell_cut:
    threads: 1
    mem_gb: 15
    retry_factor: 2
  protein_saturation:
    threads: 1
    mem_gb: 10
    retry_factor: 2
  protein_remove_background:
    threads: 10
    mem_gb: 20
    retry_factor: 2
  protein_calculate_statistics:
    threads: 1
    mem_gb: 45
    retry_factor: 2
  generate_gef:
    threads: 1
    mem_gb: 15
    retry_factor: 2
  generate_report:
    threads: 1
    mem_gb: 15
    retry_factor: 2
  package_visualization:
    threads: 1
    mem_gb: 1
    retry_factor: 2
The default resource configuration is suitable for most Stereo-seq datasets (chip size <= 2*3). If your Stereo-seq sequencing data volume is large or you are working with a large Stereo-seq chip (chip size > 2*3), it is essential to adjust the resource configuration in the resources.yaml file.
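Adjusting a single rule could be scripted roughly as follows. `set_rule_mem` is a hypothetical helper, not part of SAW; it assumes each rule name sits on its own line followed by its mem_gb field, as in the example file. The value 140 is only an illustration, not a recommended setting.

```shell
# Hypothetical helper, not part of SAW.
# Reads a resources YAML on stdin and writes a copy to stdout in which the
# mem_gb of one rule is replaced. Any trailing comment on that line is dropped.
# $1: rule name, $2: new mem_gb value
set_rule_mem() {
  awk -v rule="$1" -v mem="$2" '
    # A line holding only a key (plus an optional comment) opens a new block.
    /^[ \t]*[A-Za-z_]+:[ \t]*(#.*)?$/ {
      name = $1; sub(/:.*/, "", name); current = name
    }
    # Inside the requested rule, rewrite mem_gb and keep the indentation.
    current == rule && /^[ \t]*mem_gb:/ {
      match($0, /^[ \t]*/)
      print substr($0, 1, RLENGTH) "mem_gb: " mem
      next
    }
    { print }
  '
}

# Demo on a two-rule fragment; only the annotation rule is changed.
printf 'rules:\n  annotation:\n    mem_gb: 70\n  tissue_cut:\n    mem_gb: 12\n' |
  set_rule_mem annotation 140
```

In practice the input would be redirected from a copy of resources.yaml and the output written to a new file, leaving the packaged defaults untouched.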
Submit analysis
Enable SGE mode to submit a SAW count analysis task:
cd /saw/runs
saw count \
--id=SGE_test \
--sn=C04144D5 \
--omics=transcriptomics \
--kit-version='Stereo-seq N FFPE V1.0' \
--sequencing-type='PE75_25+59' \
--organism=mouse \
--tissue=brain \
--chip-mask=./C04144D5.barcodeToPos.h5 \
--fastqs=./reads \
--reference=./mouse_transcriptome \
--threads-num=96 \
--memory=100 \
--job-mode=sge
Alternatively, pass a specific configuration YAML file to --job-mode to start an analysis task.
Remember to set the scheduler to "sge" in that file, and to set a queue if needed, when running in SGE cluster mode:
cd /saw/runs
saw count \
--id=SGE_with_a_specific_yaml_test \
--sn=C04144D5 \
--omics=transcriptomics \
--kit-version='Stereo-seq N FFPE V1.0' \
--sequencing-type='PE75_25+59' \
--organism=mouse \
--tissue=brain \
--chip-mask=./C04144D5.barcodeToPos.h5 \
--fastqs=./reads \
--reference=./mouse_transcriptome \
--threads-num=96 \
--memory=100 \
--job-mode=./specific_configuration_for_my_stereoseq_chip.yaml
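The reminder about setting the scheduler and queue in a custom YAML can be sketched as a small script. `make_sge_config` is a hypothetical helper, not part of SAW; it assumes the scheduler and queue lines look like those in the example file, and the queue name `all.q` is a placeholder for whatever queue your site provides.

```shell
# Hypothetical helper, not part of SAW.
# Copies a resources YAML, switching the scheduler to "sge" and filling in a queue.
# $1: source YAML, $2: destination YAML, $3: SGE queue name (site-specific;
#     must not contain "/" because it is spliced into a sed expression)
make_sge_config() {
  sed -e 's/scheduler: *"local"/scheduler: "sge"/' \
      -e "s/queue: *\"[^\"]*\"/queue: \"$3\"/" \
      "$1" > "$2"
}

# Example call (paths as used elsewhere on this page, queue name is a placeholder):
# make_sge_config /path/to/software/package/saw/config/resources.yaml \
#   ./specific_configuration_for_my_stereoseq_chip.yaml all.q
```

The resulting file can then be passed to --job-mode as in the command above.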