Analysis of variation within individual COVID-19 samples | March 20 2020

What's the point?

The absolute majority of SARS-COV-2 data is in the form of assembled genomic sequences. This is unfortunate because any variation that exists within individual samples is obliterated--converted to the most frequent base--during the assembly process. However, knowing underlying evolutionary dynamics is critical for tracing evolution of the virus as it allows identification of genomic regions under different selective regimes and understanding of its population parameters.

Data availability

Raw sequencing reads are required to detected within-sample variation. We update the list of available data daily using the following logic implemented in fetchsraacc.sh:

Two resources list read data available for SARS-CoV-2: SARS-CoV-2 resource and SRA itself.
We pull accession numbers from these two resources (genbank.txt and sra.txt) and compute their union (union.txt)
From the list obtained at the previous step we exclude bad datasets and non human samples (see this file or view datasets directly in run selector). This produces a list of current SRA accession: current.txt.
Finally, we restrict this list to only datasets produced with the Illumina platform (current_illumina.txt). Oxford Nanopore data is used later to confirm indel polymorphisms. This is done by uploading accessions listed in current.txt to SRA Run Selector and filtering on platform=ILLUMINA.

This is the current list of analyzed datasets:

SRR10903401 SRR10903402 SRR10971381 SRR11059940 SRR11059941 SRR11059942 SRR11059943 SRR11059944 SRR11059945 SRR11059946 SRR11059947 SRR11140744 SRR11140746 SRR11140748 SRR11140750 SRR11177792 SRR11241254 SRR11241255 SRR11247075 SRR11247076 SRR11247077 SRR11247078 SRR11278090 SRR11278091 SRR11278092 SRR11278164 SRR11278165 SRR11278166 SRR11278167 SRR11278168 SRR11314339

Next we fetch fastq datasets for accession listed in current_illumina.txt using Galaxy's wrapper for fasterq-dump located in "Get data" tool section. We also download Genbank file for SARS-CoV-2 reference NC_045512.2. Finally we apply the following workflows to Paired and Single end data, respectively:

Analysis of Illumina Paired End data

Map all reads against COVID-19 reference NC_045512.2 using bwa mem
Filter reads with mapping quality of at least 20, that were mapped as proper pairs
Mark duplicate reads with picard markduplicates
Perform realignments using lofreq viterbi
Call variants using lofreq call
Annotate variants using snpeff against database created from NC_045512.2 GenBank file
Convert VCFs into tab delimited dataset

Analysis of Illumina Single End data

Map all reads against COVID-19 reference NC_045512.2 using bowtie2 (because all SE datasets we have seen so far contain short, 75 bp, reads)
Filter reads with mapping quality of at least 20
Mark duplicate reads with picard markduplicates
Perform realignments using lofreq viterbi
Call variants using lofreq call
Annotate variants using snpeff against database created from NC_045512.2 GenBank file
Convert VCFs into tab delimited dataset

After running both workflows the data is combined into a single dataset (variant_list.tsv) and analyzed using in Jupyter notebook.

Inputs

Workflow

GenBank file for the reference COVID-19 genome. The GenBank record is used by snpeff to generate a database for variant annotation.
Downloaded reads as either paired (for paired end data) or single (for single end data) collections. Reads can be downloaded using Galaxy's wrapper for fasterq-dump located in "Get data" tool section.

Jupyter notebook

The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.

Outputs

The workflow produces a table of variants that looks like this:

	Sample	CHROM	POS	REF	ALT	DP	AF	SB	DP4	IMPACT	FUNCLASS	EFFECT	GENE	CODON
0	SRR10903401	NC_045512	1409	C	T	124	0.040323	1	66,53,2,3	MODERATE	MISSENSE	NON_SYNONYMOUS_CODING	orf1ab	Cat/Tat
1	SRR10903401	NC_045512	1821	G	A	95	0.094737	0	49,37,5,4	MODERATE	MISSENSE	NON_SYNONYMOUS_CODING	orf1ab	gGt/gAt
2	SRR10903401	NC_045512	1895	G	A	107	0.037383	0	51,52,2,2	MODERATE	MISSENSE	NON_SYNONYMOUS_CODING	orf1ab	Gta/Ata
3	SRR10903401	NC_045512	2407	G	T	122	0.024590	0	57,62,1,2	MODERATE	MISSENSE	NON_SYNONYMOUS_CODING	orf1ab	aaG/aaT
4	SRR10903401	NC_045512	3379	A	G	121	0.024793	0	56,62,1,2	LOW	SILENT	SYNONYMOUS_CODING	orf1ab	gtA/gtG

Here, most fields names are descriptive. SB = the Phred-scaled probability of strand bias as calculated by lofreq (0 = no strand bias); DP4 = strand-specific depth for reference and alternate allele observations (Forward reference, reverse reference, forward alternate, reverse alternate).

The variants we identified were distributed across the SARS-CoV-2 genome in the following way:

The following table describes variants with frequencies above 10%:

Workflow and Histories

Workflows

We use two separate workflows for performing paired and single end data analyses:

Variation analysis workflows

Workflow for the analysis of single end Illumina reads < 100 bp

Workflow for the analysis of paired end Illumina reads

Histories

We analyze paired end and single end data in separate histories. Next we combine output of the two workflows into a new history where Jupyter analysis is performed. These three histories are:

BioConda

Tools used in this analysis are also available from BioConda:

Name	Link
`bwa`
`samtools`
`lofreq`
`snpeff`
`snpsift`
`porechop`
`filtlong`
`minimap2`

ID	Name	Description
2	SnpEff build:	toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/4.3+T.galaxy3
3	Map with BWA-MEM	toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1
4	Filter SAM or BAM, output SAM or BAM	toolshed.g2.bx.psu.edu/repos/devteam/samtool_filter2/samtool_filter2/1.8
5	Realign reads	toolshed.g2.bx.psu.edu/repos/iuc/lofreq_viterbi/lofreq_viterbi/2.1.3.1+galaxy1
6	Call variants	toolshed.g2.bx.psu.edu/repos/iuc/lofreq_call/lofreq_call/2.1.3.1+galaxy1
7	SnpEff eff:	toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1
8	SnpSift Extract Fields	toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_extractFields/4.3+t.galaxy0
9	Collapse Collection	toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.1

ID	Name	Description	Type
_anonymous_output_3	_anonymous_output_3	n/a	File
_anonymous_output_4	_anonymous_output_4	n/a	File
_anonymous_output_5	_anonymous_output_5	n/a	File
_anonymous_output_6	_anonymous_output_6	n/a	File
_anonymous_output_7	_anonymous_output_7	n/a	File
_anonymous_output_8	_anonymous_output_8	n/a	File
_anonymous_output_9	_anonymous_output_9	n/a	File
_anonymous_output_10	_anonymous_output_10	n/a	File
_anonymous_output_11	_anonymous_output_11	n/a	File
_anonymous_output_12	_anonymous_output_12	n/a	File

4_ Variation
Version 1

Analysis of variation within individual COVID-19 samples | March 20 2020

What's the point?

Data availability

Analysis of Illumina Paired End data

Analysis of Illumina Single End data

Inputs

Workflow

Jupyter notebook

Outputs

Workflow and Histories

Workflows

Histories

BioConda

Inputs

Steps

Outputs

Version History

Version 1 (earliest) Created 25th Mar 2020 at 10:03 by Finn Bacall

Creator

Submitter

4_ Variation Version 1

Analysis of variation within individual COVID-19 samples | March 20 2020

What's the point?

Data availability

Analysis of Illumina Paired End data

Analysis of Illumina Single End data

Inputs

Workflow

Jupyter notebook

Outputs

Workflow and Histories

Workflows

Histories

BioConda

Inputs

Steps

Outputs

Version History

Version 1 (earliest) Created 25th Mar 2020 at 10:03 by Finn Bacall

Creator

Submitter

Related items

4_ Variation
Version 1