nf-core/bactmap

A mapping-based pipeline for creating a phylogeny from bacterial whole genome sequences

Introduction

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker / Singularity containers, making installation trivial and results highly reproducible.

Documentation

The nf-core/bactmap pipeline comes with documentation about the pipeline, found in the docs/ directory:

Installation
Pipeline configuration
- Local installation
- Adding your own system
Running the pipeline
Output and how to interpret the results
Troubleshooting

Pipeline description

This pipeline maps paired-end short reads to a bacterial FASTA reference sequence, calls and filters variants, produces a whole genome alignment from pseudogenomes derived from the variants and finally produces a robust maximum likelihood phylogenetic tree.
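
To make the data flow above concrete, here is a minimal sketch of how a pseudogenome could be derived from per-position calls and how a variant-only alignment could be cut from the resulting whole genome alignment. It is not the pipeline's own code: the function names, the `calls` dictionary layout and the toy sequences are all hypothetical, and the column rule is a simplification that may differ from what snp-sites does. The `-` and `N` encoding follows the step list in this README.

```python
# Illustrative only: names and data layout are hypothetical, not the pipeline's code.

def build_pseudogenome(reference, calls):
    """Return a pseudogenome with the same length as the reference.

    `calls` maps a 0-based reference position to (base, passed_filter).
    Positions absent from `calls` are treated as missing.
    """
    bases = []
    for pos in range(len(reference)):
        if pos not in calls:
            bases.append("-")  # missing position
            continue
        base, passed_filter = calls[pos]
        bases.append(base if passed_filter else "N")  # low quality position
    return "".join(bases)


def variant_only_columns(alignment):
    """Keep only columns in which at least two different A/C/G/T bases occur."""
    sequences = list(alignment.values())
    keep = []
    for col in range(len(sequences[0])):
        observed = {seq[col].upper() for seq in sequences} & set("ACGT")
        if len(observed) > 1:
            keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


reference = "ACGTAC"
sample_calls = {
    0: ("A", True),
    1: ("T", True),   # a SNP relative to the reference
    2: ("G", False),  # flagged as low quality
    3: ("T", True),
    5: ("C", True),   # position 4 has no call at all
}

# The whole genome alignment is the reference plus one pseudogenome per sample.
alignment = {
    "reference": reference,
    "sample_1": build_pseudogenome(reference, sample_calls),
}
variants_only = variant_only_columns(alignment)
```

Because every pseudogenome has the same length as the reference, the sequences are already aligned column by column, which is why simple concatenation into one multi-FASTA file is enough to form the alignment.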

Pipeline steps

The steps are:

Index a reference sequence using bwa (the reference sequence must contain only the chromosome and no additional sequences such as plasmids).
(Optional) Fetch reads from the ENA
Trim reads using trimmomatic (dynamic MIN_LEN based on 33% of the read length)
Count number of reads and estimate genome size using Mash
Downsample reads if the --depth_cutoff argument was specified
Map reads to the specified reference genome with bwa mem
Call variants with samtools
Filter variants to flag low quality SNPs
Produce a pseudogenome based on the variants called. Missing positions are encoded as - characters and low quality positions as N
All pseudogenomes are concatenated to make a whole genome alignment
(Optional) Recombination is removed from the alignment using gubbins
Invariant sites are removed using snp-sites
(Optional) Maximum likelihood tree generated using IQ-TREE
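
The arithmetic behind the trimming and downsampling steps can be sketched as follows. This is an assumption-laden illustration rather than the pipeline's implementation: the function names are hypothetical, the rounding of MIN_LEN is a guess, and depth is assumed to be estimated as total sequenced bases divided by the estimated genome size.

```python
# Illustrative only: function names, rounding and the depth formula are assumptions.

def min_length(read_length, percent=33):
    """Minimum read length to keep after trimming, as a percentage of read length."""
    return read_length * percent // 100  # integer arithmetic, rounds down


def estimated_depth(read_count, read_length, genome_size):
    """Approximate depth of coverage: total bases divided by genome size."""
    return read_count * read_length / genome_size


def downsample_fraction(read_count, read_length, genome_size, depth_cutoff):
    """Fraction of reads to keep so that depth does not exceed depth_cutoff."""
    depth = estimated_depth(read_count, read_length, genome_size)
    if depth <= depth_cutoff:
        return 1.0  # already at or below the cutoff, keep everything
    return depth_cutoff / depth


# Toy numbers: 2 million 150 bp reads against a 5 Mb genome give 60x depth.
fraction = downsample_fraction(
    read_count=2_000_000, read_length=150, genome_size=5_000_000, depth_cutoff=30
)
```

With these toy numbers half of the reads would be kept. Whether the read count refers to individual reads or to read pairs changes the estimate by a factor of two, so that choice needs to match how the reads were counted.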

A summary of this process is shown in the diagram below, which was generated by running Nextflow with the -with-dag option.

workflow diagram

Pipeline outputs

The outputs are written to the directory specified by the --output_dir argument:

(Optional) If accession numbers were used as the input source, a directory called fastqs containing the fastq file pairs for each accession number
A directory called trimmed_fastqs containing the reads after trimming with Trimmomatic
A directory called sorted_bams containing the alignments produced by mapping with bwa mem, converted to BAM and sorted
A directory called filtered_bcfs containing binary VCF (BCF) files after filtering to flag low quality positions with LowQual in the FILTER column
A directory called pseudogenomes containing
- the pseudogenome from each sample
- a whole genome alignment named aligned_pseudogenome.fas containing the concatenated sample pseudogenomes and the reference genome
- a variant only alignment named aligned_pseudogenome.variants_only.fas with the invariant sites removed from aligned_pseudogenome.fas using snp-sites. If recombination removal was specified, the file will be named aligned_pseudogenome.gubbins.variants_only.fas with gubbins having been applied prior to invariant site removal.
Two Newick tree files (produced only if tree generation was specified)
- aligned_pseudogenome.gubbins.variants_only.contree The consensus tree from IQ-TREE, with branch supports assigned and branch lengths optimized on the original alignment. If recombination removal was not specified, the file will be named aligned_pseudogenome.variants_only.contree
- aligned_pseudogenome.gubbins.variants_only.treefile The original IQ-TREE maximum likelihood tree without branch supports. If recombination removal was not specified, the file will be named aligned_pseudogenome.variants_only.treefile
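
The naming rules above can be summarised in a small helper, which may be handy when checking a finished run. It is a sketch only: the function and its two boolean parameters are hypothetical and are not pipeline options, and the file names are taken from the descriptions above.

```python
# Illustrative only: this helper and its parameters are not part of the pipeline.

def expected_outputs(remove_recombination, build_tree):
    """List the alignment and tree file names expected for a given configuration."""
    stem = "aligned_pseudogenome"
    if remove_recombination:
        variants_stem = f"{stem}.gubbins.variants_only"
    else:
        variants_stem = f"{stem}.variants_only"

    names = [f"{stem}.fas", f"{variants_stem}.fas"]
    if build_tree:
        # Consensus tree with branch supports, then the plain ML tree.
        names += [f"{variants_stem}.contree", f"{variants_stem}.treefile"]
    return names


with_everything = expected_outputs(remove_recombination=True, build_tree=True)
alignment_only = expected_outputs(remove_recombination=False, build_tree=False)
```

The returned names could then be compared against the contents of the output directory, for example with `pathlib.Path.exists`.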

Credits

nf-core/bactmap was originally written by Anthony Underwood.

Software used within the workflow

Trimmomatic: A flexible read trimming tool for Illumina NGS data.
mash: Fast genome and metagenome distance estimation using MinHash.
seqtk: A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
bwa mem: Burrows-Wheeler Aligner for short-read alignment.
samtools: Utilities for the Sequence Alignment/Map (SAM) format.
bcftools: Utilities for variant calling and manipulating VCFs and BCFs.
filtered_bcf_to_fasta.py: Python utility to create a pseudogenome from a BCF file where each position in the reference genome is included.
gubbins: Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences.
snp-sites: Finds SNP sites from a multi-FASTA alignment file.
IQ-TREE: Efficient software for phylogenomic inference.

Version History

Version 1 (earliest) Created 25th Feb 2020 at 11:02 by Finn Bacall
