4_ Variation
Version 1

Workflow Type: Galaxy

Analysis of variation within individual COVID-19 samples | March 20 2020

What's the point?

The absolute majority of SARS-COV-2 data is in the form of assembled genomic sequences. This is unfortunate because any variation that exists within individual samples is obliterated--converted to the most frequent base--during the assembly process. However, knowing underlying evolutionary dynamics is critical for tracing evolution of the virus as it allows identification of genomic regions under different selective regimes and understanding of its population parameters.

Data availability

Raw sequencing reads are required to detected within-sample variation. We update the list of available data daily using the following logic implemented in fetchsraacc.sh:

  1. Two resources list read data available for SARS-CoV-2: SARS-CoV-2 resource and SRA itself.
  2. We pull accession numbers from these two resources (genbank.txt and sra.txt) and compute their union (union.txt)
  3. From the list obtained at the previous step we exclude bad datasets and non human samples (see this file or view datasets directly in run selector). This produces a list of current SRA accession: current.txt.
  4. Finally, we restrict this list to only datasets produced with the Illumina platform (current_illumina.txt). Oxford Nanopore data is used later to confirm indel polymorphisms. This is done by uploading accessions listed in current.txt to SRA Run Selector and filtering on platform=ILLUMINA.

This is the current list of analyzed datasets:

SRR10903401 SRR10903402 SRR10971381 SRR11059940 SRR11059941 SRR11059942 SRR11059943 SRR11059944 SRR11059945 SRR11059946 SRR11059947 SRR11140744 SRR11140746 SRR11140748 SRR11140750 SRR11177792 SRR11241254 SRR11241255 SRR11247075 SRR11247076 SRR11247077 SRR11247078 SRR11278090 SRR11278091 SRR11278092 SRR11278164 SRR11278165 SRR11278166 SRR11278167 SRR11278168 SRR11314339

Next we fetch fastq datasets for accession listed in current_illumina.txt using Galaxy's wrapper for fasterq-dump located in "Get data" tool section. We also download Genbank file for SARS-CoV-2 reference NC_045512.2. Finally we apply the following workflows to Paired and Single end data, respectively:

Analysis of Illumina Paired End data

  1. Map all reads against COVID-19 reference NC_045512.2 using bwa mem
  2. Filter reads with mapping quality of at least 20, that were mapped as proper pairs
  3. Mark duplicate reads with picard markduplicates
  4. Perform realignments using lofreq viterbi
  5. Call variants using lofreq call
  6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
  7. Convert VCFs into tab delimited dataset

Analysis of Illumina Single End data

  1. Map all reads against COVID-19 reference NC_045512.2 using bowtie2 (because all SE datasets we have seen so far contain short, 75 bp, reads)
  2. Filter reads with mapping quality of at least 20
  3. Mark duplicate reads with picard markduplicates
  4. Perform realignments using lofreq viterbi
  5. Call variants using lofreq call
  6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
  7. Convert VCFs into tab delimited dataset

After running both workflows the data is combined into a single dataset (variant_list.tsv) and analyzed using in Jupyter notebook.

Inputs

Workflow

  1. GenBank file for the reference COVID-19 genome. The GenBank record is used by snpeff to generate a database for variant annotation.
  2. Downloaded reads as either paired (for paired end data) or single (for single end data) collections. Reads can be downloaded using Galaxy's wrapper for fasterq-dump located in "Get data" tool section.

Jupyter notebook

The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.

Outputs

The workflow produces a table of variants that looks like this:

  Sample
  CHROM
  POS
  REF
  ALT
  DP
  AF
  SB
  DP4
  IMPACT
  FUNCLASS
  EFFECT
  GENE
  CODON




  0
  SRR10903401
  NC_045512
  1409
  C
  T
  124
  0.040323
  1
  66,53,2,3
  MODERATE
  MISSENSE
  NON_SYNONYMOUS_CODING
  orf1ab
  Cat/Tat


  1
  SRR10903401
  NC_045512
  1821
  G
  A
  95
  0.094737
  0
  49,37,5,4
  MODERATE
  MISSENSE
  NON_SYNONYMOUS_CODING
  orf1ab
  gGt/gAt


  2
  SRR10903401
  NC_045512
  1895
  G
  A
  107
  0.037383
  0
  51,52,2,2
  MODERATE
  MISSENSE
  NON_SYNONYMOUS_CODING
  orf1ab
  Gta/Ata


  3
  SRR10903401
  NC_045512
  2407
  G
  T
  122
  0.024590
  0
  57,62,1,2
  MODERATE
  MISSENSE
  NON_SYNONYMOUS_CODING
  orf1ab
  aaG/aaT


  4
  SRR10903401
  NC_045512
  3379
  A
  G
  121
  0.024793
  0
  56,62,1,2
  LOW
  SILENT
  SYNONYMOUS_CODING
  orf1ab
  gtA/gtG

Here, most fields names are descriptive. SB = the Phred-scaled probability of strand bias as calculated by lofreq (0 = no strand bias); DP4 = strand-specific depth for reference and alternate allele observations (Forward reference, reverse reference, forward alternate, reverse alternate).

The variants we identified were distributed across the SARS-CoV-2 genome in the following way:

The following table describes variants with frequencies above 10%:

Workflow and Histories

Workflows

We use two separate workflows for performing paired and single end data analyses:

Variation analysis workflows

Workflow for the analysis of single end Illumina reads < 100 bp Galaxy workflow

Workflow for the analysis of paired end Illumina reads Galaxy workflow

Histories

We analyze paired end and single end data in separate histories. Next we combine output of the two workflows into a new history where Jupyter analysis is performed. These three histories are:

BioConda

Tools used in this analysis are also available from BioConda:

Name Link

bwa Anaconda-Server Badge

samtools Anaconda-Server Badge

lofreq Anaconda-Server Badge

snpeff Anaconda-Server Badge

snpsift Anaconda-Server Badge

porechop Anaconda-Server Badge

filtlong Anaconda-Server Badge

minimap2 Anaconda-Server Badge

Inputs

ID Name Description Type
bed_file bed_file runtime parameter for tool Filter SAM or BAM, output SAM or BAM n/a
intervals intervals runtime parameter for tool SnpEff eff: n/a
snpDb snpDb runtime parameter for tool SnpEff eff: n/a
input input runtime parameter for tool SnpEff eff: n/a
transcripts transcripts runtime parameter for tool SnpEff eff: n/a

Steps

ID Name Description
0 GenBank file
1 Reads
2 SnpEff build: toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/4.3+T.galaxy3
3 Map with BWA-MEM toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1
4 Filter SAM or BAM, output SAM or BAM toolshed.g2.bx.psu.edu/repos/devteam/samtool_filter2/samtool_filter2/1.8
5 Realign reads toolshed.g2.bx.psu.edu/repos/iuc/lofreq_viterbi/lofreq_viterbi/2.1.3.1+galaxy1
6 Call variants toolshed.g2.bx.psu.edu/repos/iuc/lofreq_call/lofreq_call/2.1.3.1+galaxy1
7 SnpEff eff: toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1
8 SnpSift Extract Fields toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_extractFields/4.3+t.galaxy0
9 Collapse Collection toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.1

Outputs

ID Name Description Type
snpeff_output snpeff_output n/a
output_fasta output_fasta n/a
bam_output bam_output n/a
output1 output1 n/a
realigned realigned n/a
variants variants n/a
snpeff_output snpeff_output n/a
statsFile statsFile n/a
output output n/a
output output n/a
Total size: 13.7 MB
help Creators and Submitter
Creator
Submitter
Activity

Views: 987   Downloads: 28

Created: 25th Mar 2020 at 10:03

Last updated: 25th Mar 2020 at 11:23

Last used: 27th Nov 2021 at 00:57

help Tags
help Attributions

None

Version History

Version 1 (earliest) Created 25th Mar 2020 at 10:03 by Finn Bacall

Added/updated 56 files


Open master 906d789

Related items

Powered by
(v.1.12.0-master)
Copyright © 2008 - 2021 The University of Manchester and HITS gGmbH