Running StaG-mwc

You need to configure a workflow before you can run StaG-mwc. The code you downloaded in the previous git clone step includes a file called config.yaml, which is used to configure the workflow.

Selecting input files

There are two ways to define which files StaG-mwc should run on: either by specifying an input directory and a filename pattern, or by providing a samplesheet. The two options are mutually exclusive and cannot be combined, so pick the one that suits your situation best.

Input directory

If your input FASTQ files are all in the same folder and they all follow the same filename pattern, the input directory option is often the most convenient.

Open config.yaml in your favorite editor and change the input file settings under the Run configuration heading: the input directory (inputdir) and the input filename pattern (input_fn_pattern). Both can be declared using absolute or relative paths (relative to the StaG-mwc repository directory). Input and output directories can technically be located anywhere, i.e. their locations are not restricted to the repository folder, but it is recommended to keep them in the repository directory. A common practice is to put symlinks to the files you want to analyze in a folder called input in the repository folder.
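As a rough sketch, the relevant settings in config.yaml could look like the following (the exact wildcard names in the filename pattern are illustrative assumptions; check the comments in the bundled config.yaml for the supported pattern):

inputdir: "input"
input_fn_pattern: "{sample}_{readpair}.fastq.gz"  # wildcard names are assumed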

Samplesheet

If your input FASTQ files are spread across several filesystem locations or exist in remote locations (e.g. S3), or if your input FASTQ filenames do not follow a common pattern, the samplesheet option is the most convenient. The samplesheet input option also allows you to specify custom sample names that are not derived from a substring of the input filenames.

The format of the samplesheet is tab-separated text and it must contain a header line with at least the following three columns: sample_id, fastq_1, and fastq_2. An example file could look like this (columns are separated by TAB characters):

sample_id  fastq_1                             fastq_2
ABC123     /path/to/sample1_1.fq.gz            /path/to/sample1_2.fq.gz
DEF456     s3://bucketname/sample_R1.fq.gz     s3://bucketname/sample_R2.fq.gz
GHI789     http://domain.com/sample_R1.fq.gz   http://domain.com/sample_R2.fq.gz

Open config.yaml in your favorite editor and enter the path to your prepared samplesheet TSV file in the samplesheet field under the Run configuration heading. The FASTQ paths can be declared using absolute or relative paths (relative to the StaG-mwc repository directory). Input files can be located anywhere, i.e. their locations are not restricted to the repository folder, and they can even be located in remote storage systems like S3.
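A minimal sketch of the corresponding config.yaml settings (the placement of keep_local shown here is an assumption; see the note below for what it does):

samplesheet: "samplesheet.tsv"
keep_local: True  # optional: keep local copies of remote input files after the run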

Note

When the path to a samplesheet TSV file has been specified in the config file, StaG-mwc will ignore the inputdir and input_fn_pattern settings.

When using remote input files on S3, the access and secret keys must be available in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (see the example after this note).

Using remote files is also possible from http:// and https:// sources.

It is possible to keep a local copy of remote input files in the repository folder after the run by setting keep_local: True in the config file.

The samplesheet can also be specified on the command line by using Snakemake's built-in functionality for overriding configuration settings, via the command line option --config samplesheet=samplesheet.tsv.
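Putting these pieces together, a run with S3-hosted input files might look like the following sketch (the credential values are the well-known placeholder examples from the AWS documentation):

# Make S3 credentials available to the workflow (placeholder values):
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Override the samplesheet setting on the command line:
snakemake --use-conda --cores 4 --config samplesheet=samplesheet.tsv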

Configuring which tools to run

Next, configure the settings under the Pipeline steps included heading. This is where you define which steps to include in your workflow. Simply assign True or False to the steps you want to include or exclude. Note that the default configuration file already enables qc_reads and host_removal. These are the primary read processing steps, and most other steps depend on host-filtered reads (i.e. the output of the host_removal step). These two steps will therefore run pretty much regardless of their setting in the config file, because they produce output files that almost all other workflow steps depend on.
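A sketch of what this section might look like (all step names except qc_reads and host_removal are illustrative assumptions; the bundled config.yaml lists the actual steps):

qc_reads: True            # primary read processing, effectively always runs
host_removal: True        # primary read processing, effectively always runs
taxonomic_profile: False  # assumed name of an optional downstream step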

Note

You can create several copies of config.yaml, named whatever you want, in order to manage several analyses from the same StaG-mwc directory. If you create a copy called e.g. microbiome_analysis.yaml, you can easily run the workflow with this configuration file by using the --configfile command line argument when running the workflow, as shown below.
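For example:

snakemake --use-conda --cores 4 --configfile microbiome_analysis.yaml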

A reference database is required in order to run the host_removal step. If you already have it downloaded somewhere, point StaG-mwc to the location using the db_path parameter under the remove_host section of config.yaml.
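A sketch of what this could look like, using the example database location mentioned further down in this document (the exact nesting is inferred from the section and parameter names above; consult the comments in config.yaml for the authoritative structure):

remove_host:
    db_path: "/home/username/databases/kraken2/kraken2_human"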

The config file contains a parameter called email. This can be used to have the workflow send an email after a successful or failed run. Note that this requires that the Linux system your workflow is running on has a working email configuration. Also be aware that many email providers mark mail sent from unknown servers as spam, so don't forget to check your spam folder.

Running

It is recommended to run Snakemake with the -n/--dryrun argument before starting an analysis for real. During a dryrun, Snakemake checks that all requirements are available and prints a summary of what it intends to do, without actually doing anything. After finishing the configuration by editing config.yaml, test your configuration with:

snakemake --dryrun

If you are satisfied with the workflow plan output by the dryrun, you can run the workflow. The typical command to run StaG-mwc on your local computer is:

snakemake --use-conda --cores N

where N is the maximum number of cores you want to allow the workflow to use. Snakemake will automatically scale down the number of cores allotted to individual steps so that this limit is never exceeded. Another variant of --cores is --jobs, which you might encounter occasionally; when running locally, the two options are equivalent.

Note

If several people are running StaG-mwc on a shared server or on a shared file system, it can be useful to use the --conda-prefix/--singularity-prefix parameters to store the conda environments and Singularity images created by StaG-mwc in a common folder, so they can be re-used between different people or analyses. This reduces the risk of producing several copies of the same conda environment in different folders. It can also be necessary when running on cluster systems where working directory paths are usually very deep. For example, create a folder in your home directory and use that with the --conda-prefix/--singularity-prefix option.
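For example (the folder name is just a suggestion):

snakemake --use-conda --conda-prefix ~/stag-conda-envs --cores 4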

If you want to keep your customized config.yaml in a separate file, let’s say my_config.yaml, then you can run snakemake using that custom configuration file with the --configfile my_config.yaml command line argument.

Another useful command line argument to snakemake is --keep-going. This instructs snakemake to keep going even if a job fails, e.g. the taxonomic profiling step might fail for a sample that contains no assignable reads after quality filtering.

If you are having trouble running StaG-mwc with conda, try Singularity instead (assuming you have Singularity installed on your system). There are pre-built Singularity images that are ready to use with StaG-mwc. Consider using --singularity-prefix to specify a folder where Snakemake can store downloaded Singularity images for re-use in future invocations. The command to run StaG-mwc with Singularity instead of conda is:

snakemake --use-singularity --singularity-prefix /path/to/prefix/folder --dryrun

There are some additional details to consider when using Singularity instead of conda, most notably that you will have to specify bind paths (see specifying-bind-paths in the Singularity documentation) so that your reference databases are accessible from within the containers when running StaG-mwc. It might look something like this:

snakemake --use-singularity --singularity-prefix /path/to/prefix/folder --singularity-args "-B /home/username/databases"

The above example assumes you have entered paths to your databases in config.yaml with a base path like the one shown in the above command (e.g. /home/username/databases/kraken2/kraken2_human/).

Running on cluster resources

In order to run StaG-mwc on a cluster, you need a cluster profile. StaG-mwc ships with a pre-made cluster profile for use on CTMR's Gandalf Slurm cluster. The profile can be adapted for use on other Slurm systems if needed. The profile is distributed together with the StaG-mwc workflow code and is available in the profiles directory in the repository. The cluster profile specifies which cluster account to use (i.e. Slurm project account and partition), as well as the number of CPUs, time, and memory requirements for each individual step. Snakemake uses this information when submitting jobs to the cluster scheduler.
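As a rough orientation, a Snakemake cluster profile is a directory containing a config.yaml whose keys mirror Snakemake's command line options. The excerpt below is a hypothetical sketch; the actual contents of profiles/ctmr_gandalf/config.yaml may differ:

# Hypothetical excerpt of a cluster profile config.yaml
jobs: 100               # maximum number of concurrent cluster jobs
use-singularity: true   # run workflow steps inside Singularity containers
keep-going: true        # continue with independent jobs if one fails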

When running on a cluster it will likely work best if you run StaG using Singularity. The workflow comes preconfigured to automatically download and use containers from various sources for the different workflow steps. The CTMR Gandalf Slurm profile is preconfigured to use Singularity by default.

Note

Do not combine --use-conda with --use-singularity.

To prevent StaG-mwc from unnecessarily re-downloading the Singularity container images for each project, you can use the --singularity-prefix option to specify a directory where Snakemake can store the downloaded images for reuse between projects.

Database paths need to be accessible from inside the Singularity containers. It's easiest if they are all available from the same folder, so that you can bind the main database folder into the Singularity container with e.g. --singularity-args "-B /path/to/db". Note that database paths need to be specified in the config file so that they are correct from inside the Singularity container. Read more about specifying bind paths in the official Singularity docs: specifying-bind-paths.

To run StaG-mwc on CTMR’s Gandalf cluster, run the following command from inside the workflow repository directory:

snakemake --profile profiles/ctmr_gandalf

This will make Snakemake submit each workflow step as a separate cluster job using the CPU and time requirements specified in the profile. The above command assumes you are using the default config.yaml configuration file. If you are using a custom configuration file, just add --configfile <name_of_your_config_file> to the command line.
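For example:

snakemake --profile profiles/ctmr_gandalf --configfile my_config.yaml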

Note

Have a look at profiles/ctmr_gandalf/config.yaml to see how to modify the resource configurations used for the Slurm job submissions.

Some very lightweight rules will run on the submitting node (typically directly on the login node), but the number of concurrent local jobs is limited to 2 in the default profiles.

Execution report

Snakemake provides facilities to produce an HTML report of the execution of the workflow. A zipped HTML report is automatically created when the workflow finishes.
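If you need to regenerate the report manually after a completed run, you can use Snakemake's --report option (depending on your Snakemake version, the target can be an .html file or a .zip archive; the filename here is arbitrary):

snakemake --report report.zip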