6.2. Snakemake¶
In the following, we will play with two small FastQ files. Let us download them:
Download the two fastq files and unzip them for the next exercice. Rename them as A.fastq and B.fastq
Install snakemake:
conda install snakemake
6.2.1. Exercice 0: execution¶
Here is simple Snakefile that counts the number of reads in 1 input file:
rule all:
input:
"stats/A.txt"
rule count:
input: "A.fastq"
output: "stats/A.txt"
run:
count = 0
with open(input[0], "r") as fin:
for x in fin:
count += 1
with open(output[0], "w") as fout:
N = int(count / 4)
fout.write("{}".format(N))
Then execute the pipeline and get a flavor of what is going on.
snakemake -s name.workflow --cores 1
6.2.2. Exercice 1 : expand¶
Using wildcards (sample) with can generalise this pipeline:
samples = ['A', 'B']
rule all:
input:
expand("stats/{sample}.txt", sample=samples)
rule count:
input: "{sample}.fastq"
output: "stats/{sample}.txt"
run:
count = 0
with open(input[0], "r") as fin:
for x in fin:
count += 1
with open(output[0], "w") as fout:
N = int(count / 4)
fout.write("{}".format(N))
Then execute the pipeline and get a flavor of what is going on.
snakemake -s name.workflow --cores 1
6.2.3. Exercice 2 (deal with error)¶
Copy and try this Snakemake workflow. First get the two input files A.fastq.gz and B.fastq.gz in the local directory:
samples = ['A', 'B']
rule all:
input:
expand("fastqc/{sample}_fastqc.html", sample=samples)
rule fastqc:
input: "{sample}.fastq.gz"
output:
html="fastqc/{sample}_fastqc.html",
zip="fastqc/{sample}_fastqc.zip"
log: "logs/{sample}.log"
shell: "fastqc {input} >{log} 2>&1"
what is wrong with the workflow ? Look at fastqc documentation and change the pipeline to fix it. There are two solutions. One that consists in changing the two rules input/output. The second solution need to change only one line by adding 9 characters (only).
6.2.4. Exercice 3 (deal with error)¶
Execute this snakemake file. What is wrong ? Find the error. This is a quite current type of error:
samples = ['A', 'B']
rule all:
input:
expand("fastqc/{sample}_fastqc.html", sample=samples)
rule fastqc:
input: "{sample}.fastq.gz"
output:
html="fastqc/{sample}_fastqc.html",
zip="fastqc/{sample}_fastqc.zip"
shell:
"""
fastqc {input} -o fastqc -t {threads} >{log} 2>&1
6.2.5. Exercice 4 (benchmark)¶
From the fastqc example here below, add threading options (see the course material or online snakemake documention). You will need to look also at the fastqc documentation:
samples = ['A', 'B']
rule all:
input:
expand("fastqc/{sample}_fastqc.html", sample=samples)
rule fastqc:
input: "{sample}.fastq.gz"
output:
html="fastqc/{sample}_fastqc.html",
zip="fastqc/{sample}_fastqc.zip"
threads: 4
log: "logs/{sample}.log"
shell: """fastqc {input} -o fastqc -t {threads} >{log} 2>&1 """
Execute the pipeline with htreading options. Check that it is faster than when using a single thread.
Execute the pipeline using 4 cores then 8 cores. Check from the benchmark output files that indeed you have 5 entries indicating the time that each run took.
6.2.6. Exercice 5 (project)¶
Implement the workflow corresponding to the project in Snakemake.
6.2.7. Solution Exercice1 (deal with error)¶
If you data is not found because the files are not zipped, you will get this error:
Building DAG of jobs...
MissingInputException in line 11 of /pasteur/homes/tcokelae/NextFlow/Snakefile:
Missing input files for rule fastqc:
A.fastq.gz
Solution:
gzip [AB].fastq
If conda environment is missing the tools fastqc the pipeline cannot work. You will get an error as follows:
Error in rule fastqc:
jobid: 2
output: fastqc/B_fastqc.html, fastqc/B_fastqc.zip
log: logs/B.log (check log file(s) for error message)
shell:
fastqc B.fastq.gz >logs/B.log 2>&1
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /pasteur/homes/tcokelae/NextFlow/.snakemake/log/2021-06-25T141558.517347.snakemake.log
Solution:
conda install fastqc
If your input files are there and fastqc found, you should get the error expected by the exercice:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 11 of /pasteur/homes/tcokelae/NextFlow/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
fastqc/B_fastqc.html
fastqc/B_fastqc.zip
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /pasteur/homes/tcokelae/NextFlow/.snakemake/log/2021-06-25T142552.128080.snakemake.log
Here the problem is that the expected output of the pipeline (fastqc/{sample}_fastqc.html) do not match the output of the rule fastqc.
The first solution consists in the place where fastqc will store the results:
shell: "fastqc {input} -o fastqc/ >{log} 2>&1"
The second consists in changing the input/output to remove 'fastqc' directory but this is not recommended. It is better to think about the future when tens of samples and tens of tools are used. You will not want to store everything in the current directory:
rule all:
input:
expand("{sample}_fastqc.html", sample=samples)
rule fastqc:
input: "{sample}.fastq.gz"
output:
html="{sample}_fastqc.html",
zip="{sample}_fastqc.zip"