4_ Variation
Version 1

Analysis of variation within individual COVID-19 samples | March 20 2020

What's the point?

The absolute majority of SARS-COV-2 data is in the form of assembled genomic sequences. This is unfortunate because any variation that exists within individual samples is obliterated--converted to the most frequent base--during the assembly process. However, knowing underlying evolutionary dynamics is critical for tracing evolution of the virus as it allows identification of genomic regions under different selective regimes and understanding of its population parameters.

Data availability

Raw sequencing reads are required to detected within-sample variation. We update the list of available data daily using the following logic implemented in fetchsraacc.sh:

  1. Two resources list read data available for SARS-CoV-2: SARS-CoV-2 resource and SRA itself.
  2. We pull accession numbers from these two resources (genbank.txt and sra.txt) and compute their union (union.txt)
  3. From the list obtained at the previous step we exclude bad datasets and non human samples (see this file or view datasets directly in run selector). This produces a list of current SRA accession: current.txt.
  4. Finally, we restrict this list to only datasets produced with the Illumina platform (current_illumina.txt). Oxford Nanopore data is used later to confirm indel polymorphisms. This is done by uploading accessions listed in current.txt to SRA Run Selector and filtering on platform=ILLUMINA.

This is the current list of analyzed datasets:

SRR10903401

SRR10903402

SRR10971381

SRR11059940

SRR11059941

SRR11059942

SRR11059943

SRR11059944

SRR11059945

SRR11059946

SRR11059947

SRR11140744

SRR11140746

SRR11140748

SRR11140750

SRR11177792

SRR11241254

SRR11241255

SRR11247075

SRR11247076

SRR11247077

SRR11247078

SRR11278090

SRR11278091

SRR11278092

SRR11278164

SRR11278165

SRR11278166

SRR11278167

SRR11278168

SRR11314339

Next we fetch fastq datasets for accession listed in currentillumina.txt using Galaxy's wrapper for fasterq-dump located in "Get data" tool section. We also download Genbank file for SARS-CoV-2 reference <a href="https://www.ncbi.nlm.nih.gov/nuccore/NC045512">NC_045512.2. Finally we apply the following workflows to Paired and Single end data, respectively:

Analysis of Illumina Paired End data

  1. Map all reads against COVID-19 reference NC_045512.2 using bwa mem
  2. Filter reads with mapping quality of at least 20, that were mapped as proper pairs
  3. Mark duplicate reads with picard markduplicates
  4. Perform realignments using lofreq viterbi
  5. Call variants using lofreq call
  6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
  7. Convert VCFs into tab delimited dataset

Analysis of Illumina Single End data

  1. Map all reads against COVID-19 reference NC_045512.2 using bowtie2 (because all SE datasets we have seen so far contain short, 75 bp, reads)
  2. Filter reads with mapping quality of at least 20
  3. Mark duplicate reads with picard markduplicates
  4. Perform realignments using lofreq viterbi
  5. Call variants using lofreq call
  6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
  7. Convert VCFs into tab delimited dataset

After running both workflows the data is combined into a single dataset (variantlist.tsv) and analyzed using in <a href="variationanalysis.ipynb">Jupyter notebook.

Inputs

Workflow

  1. GenBank file for the reference COVID-19 genome. The GenBank record is used by snpeff to generate a database for variant annotation.
  2. Downloaded reads as either paired (for paired end data) or single (for single end data) collections. Reads can be downloaded using Galaxy's wrapper for fasterq-dump located in "Get data" tool section.

Jupyter notebook

The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.

Outputs

The workflow produces a table of variants that looks like this:

Sample CHROM POS REF ALT DP AF SB DP4 IMPACT FUNCLASS EFFECT GENE CODON 0 SRR10903401 NC_045512 1409 C T 124 0.040323 1 66,53,2,3 MODERATE MISSENSE NON_SYNONYMOUS_CODING orf1ab Cat/Tat 1 SRR10903401 NC_045512 1821 G A 95 0.094737 0 49,37,5,4 MODERATE MISSENSE NON_SYNONYMOUS_CODING orf1ab gGt/gAt 2 SRR10903401 NC_045512 1895 G A 107 0.037383 0 51,52,2,2 MODERATE MISSENSE NON_SYNONYMOUS_CODING orf1ab Gta/Ata 3 SRR10903401 NC_045512 2407 G T 122 0.024590 0 57,62,1,2 MODERATE MISSENSE NON_SYNONYMOUS_CODING orf1ab aaG/aaT 4 SRR10903401 NC_045512 3379 A G 121 0.024793 0 56,62,1,2 LOW SILENT SYNONYMOUS_CODING orf1ab gtA/gtG

Here, most fields names are descriptive. SB = the Phred-scaled probability of strand bias as calculated by lofreq (0 = no strand bias); DP4 = strand-specific depth for reference and alternate allele observations (Forward reference, reverse reference, forward alternate, reverse alternate).

The variants we identified were distributed across the SARS-CoV-2 genome in the following way:

The following table describes variants with frequencies above 10%:

Workflow and Histories

Workflows

We use two separate workflows for performing paired and single end data analyses:

Variation analysis workflows

Workflow for the analysis of single end Illumina reads < 100 bp Galaxy workflow

Workflow for the analysis of paired end Illumina reads Galaxy workflow

Histories

We analyze paired end and single end data in separate histories. Next we combine output of the two workflows into a new history where Jupyter analysis is performed. These three histories are:

BioConda

Tools used in this analysis are also available from BioConda:

Name

Link

bwa

Anaconda-Server Badge

samtools

Anaconda-Server Badge

lofreq

Anaconda-Server Badge

snpeff

Anaconda-Server Badge

snpsift

Anaconda-Server Badge

porechop

Anaconda-Server Badge

filtlong

Anaconda-Server Badge

minimap2

Anaconda-Server Badge

Inputs

ID Name Description Type
bed_file bed_file runtime parameter for tool Filter SAM or BAM, output SAM or BAM n/a
intervals intervals runtime parameter for tool SnpEff eff: n/a
snpDb snpDb runtime parameter for tool SnpEff eff: n/a
input input runtime parameter for tool SnpEff eff: n/a
transcripts transcripts runtime parameter for tool SnpEff eff: n/a

Steps

ID Name Description
0 GenBank file
1 Reads
2 SnpEff build: toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/4.3+T.galaxy3
3 Map with BWA-MEM toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1
4 Filter SAM or BAM, output SAM or BAM toolshed.g2.bx.psu.edu/repos/devteam/samtool_filter2/samtool_filter2/1.8
5 Realign reads toolshed.g2.bx.psu.edu/repos/iuc/lofreq_viterbi/lofreq_viterbi/2.1.3.1+galaxy1
6 Call variants toolshed.g2.bx.psu.edu/repos/iuc/lofreq_call/lofreq_call/2.1.3.1+galaxy1
7 SnpEff eff: toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1
8 SnpSift Extract Fields toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_extractFields/4.3+t.galaxy0
9 Collapse Collection toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.1

Outputs

ID Name Description Type
snpeff_output snpeff_output n/a snpeffdb
output_fasta output_fasta n/a fasta
bam_output bam_output n/a bam
output1 output1 n/a sam
realigned realigned n/a bam
variants variants n/a vcf
snpeff_output snpeff_output n/a vcf
statsFile statsFile n/a html
output output n/a tabular
output output n/a input
help Creators and Submitter
Creator
Submitter
Activity

Views: 424   Downloads: 9

Created: 25th Mar 2020 at 10:03

Last updated: 25th Mar 2020 at 11:23

Last used: 27th Nov 2020 at 07:02

help Tags
help Attributions

None

Related items

Powered by
(v.1.11.master)
Copyright © 2008 - 2020 The University of Manchester and HITS gGmbH