Analysis of variation within individual COVID-19 samples | March 20 2020
What's the point?
The absolute majority of SARS-COV-2 data is in the form of assembled genomic sequences. This is unfortunate because any variation that exists within individual samples is obliterated--converted to the most frequent base--during the assembly process. However, knowing underlying evolutionary dynamics is critical for tracing evolution of the virus as it allows identification of genomic regions under different selective regimes and understanding of its population parameters.
Data availability
Raw sequencing reads are required to detected within-sample variation. We update the list of available data daily using the following logic implemented in fetchsraacc.sh:
- Two resources list read data available for SARS-CoV-2: SARS-CoV-2 resource and SRA itself.
- We pull accession numbers from these two resources (genbank.txt and sra.txt) and compute their union (union.txt)
- From the list obtained at the previous step we exclude bad datasets and non human samples (see this file or view datasets directly in run selector). This produces a list of current SRA accession: current.txt.
- Finally, we restrict this list to only datasets produced with the Illumina platform (current_illumina.txt). Oxford Nanopore data is used later to confirm indel polymorphisms. This is done by uploading accessions listed in
current.txt to SRA Run Selector and filtering on
platform=ILLUMINA
.
This is the current list of analyzed datasets:
SRR10903401
SRR10903402
SRR10971381
SRR11059940
SRR11059941
SRR11059942
SRR11059943
SRR11059944
SRR11059945
SRR11059946
SRR11059947
SRR11140744
SRR11140746
SRR11140748
SRR11140750
SRR11177792
SRR11241254
SRR11241255
SRR11247075
SRR11247076
SRR11247077
SRR11247078
SRR11278090
SRR11278091
SRR11278092
SRR11278164
SRR11278165
SRR11278166
SRR11278167
SRR11278168
SRR11314339
Next we fetch fastq datasets for accession listed in current_illumina.txt using Galaxy's wrapper for fasterq-dump
located in "Get data" tool section. We also download Genbank file for SARS-CoV-2 reference NC_045512.2. Finally we apply the following workflows to Paired and Single end data, respectively:
Analysis of Illumina Paired End data
- Map all reads against COVID-19 reference NC_045512.2 using
bwa mem
- Filter reads with mapping quality of at least 20, that were mapped as proper pairs
- Mark duplicate reads with
picard markduplicates
- Perform realignments using
lofreq viterbi
- Call variants using
lofreq call
- Annotate variants using
snpeff
against database created from NC_045512.2 GenBank file - Convert VCFs into tab delimited dataset
Analysis of Illumina Single End data
- Map all reads against COVID-19 reference NC_045512.2 using
bowtie2
(because all SE datasets we have seen so far contain short, 75 bp, reads) - Filter reads with mapping quality of at least 20
- Mark duplicate reads with
picard markduplicates
- Perform realignments using
lofreq viterbi
- Call variants using
lofreq call
- Annotate variants using
snpeff
against database created from NC_045512.2 GenBank file - Convert VCFs into tab delimited dataset
After running both workflows the data is combined into a single dataset (variant_list.tsv) and analyzed using in Jupyter notebook.
Inputs
Workflow
- GenBank file for the reference COVID-19 genome.
The GenBank record is used by
snpeff
to generate a database for variant annotation. - Downloaded reads as either paired (for paired end data) or single (for single end data) collections. Reads can be downloaded using Galaxy's wrapper for
fasterq-dump
located in "Get data" tool section.
Jupyter notebook
The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.
Outputs
The workflow produces a table of variants that looks like this:
Sample | CHROM | POS | REF | ALT | DP | AF | SB | DP4 | IMPACT | FUNCLASS | EFFECT | GENE | CODON | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SRR10903401 | NC_045512 | 1409 | C | T | 124 | 0.040323 | 1 | 66,53,2,3 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | Cat/Tat |
1 | SRR10903401 | NC_045512 | 1821 | G | A | 95 | 0.094737 | 0 | 49,37,5,4 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | gGt/gAt |
2 | SRR10903401 | NC_045512 | 1895 | G | A | 107 | 0.037383 | 0 | 51,52,2,2 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | Gta/Ata |
3 | SRR10903401 | NC_045512 | 2407 | G | T | 122 | 0.024590 | 0 | 57,62,1,2 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | aaG/aaT |
4 | SRR10903401 | NC_045512 | 3379 | A | G | 121 | 0.024793 | 0 | 56,62,1,2 | LOW | SILENT | SYNONYMOUS_CODING | orf1ab | gtA/gtG |
Here, most fields names are descriptive. SB = the Phred-scaled probability of strand bias as calculated by lofreq (0 = no strand bias); DP4 = strand-specific depth for reference and alternate allele observations (Forward reference, reverse reference, forward alternate, reverse alternate).
The variants we identified were distributed across the SARS-CoV-2 genome in the following way:
The following table describes variants with frequencies above 10%:
Workflow and Histories
Workflows
We use two separate workflows for performing paired and single end data analyses:
Variation analysis workflows |
---|
Workflow for the analysis of single end Illumina reads < 100 bp |
Workflow for the analysis of paired end Illumina reads |
Histories
We analyze paired end and single end data in separate histories. Next we combine output of the two workflows into a new history where Jupyter analysis is performed. These three histories are:
BioConda
Tools used in this analysis are also available from BioConda:
Name | Link |
---|---|
bwa |
|
samtools |
|
lofreq |
|
snpeff |
|
snpsift |
|
porechop |
|
filtlong |
|
minimap2 |
Inputs
ID | Name | Description | Type |
---|---|---|---|
GenBank file | GenBank file | n/a |
|
Reads | Reads | n/a |
|
Steps
ID | Name | Description |
---|---|---|
2 | SnpEff build: | toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff_build_gb/4.3+T.galaxy3 |
3 | Map with BWA-MEM | toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1 |
4 | Filter SAM or BAM, output SAM or BAM | toolshed.g2.bx.psu.edu/repos/devteam/samtool_filter2/samtool_filter2/1.8 |
5 | Realign reads | toolshed.g2.bx.psu.edu/repos/iuc/lofreq_viterbi/lofreq_viterbi/2.1.3.1+galaxy1 |
6 | Call variants | toolshed.g2.bx.psu.edu/repos/iuc/lofreq_call/lofreq_call/2.1.3.1+galaxy1 |
7 | SnpEff eff: | toolshed.g2.bx.psu.edu/repos/iuc/snpeff/snpEff/4.3+T.galaxy1 |
8 | SnpSift Extract Fields | toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_extractFields/4.3+t.galaxy0 |
9 | Collapse Collection | toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.1 |
Outputs
ID | Name | Description | Type |
---|---|---|---|
_anonymous_output_3 | _anonymous_output_3 | n/a |
|
_anonymous_output_4 | _anonymous_output_4 | n/a |
|
_anonymous_output_5 | _anonymous_output_5 | n/a |
|
_anonymous_output_6 | _anonymous_output_6 | n/a |
|
_anonymous_output_7 | _anonymous_output_7 | n/a |
|
_anonymous_output_8 | _anonymous_output_8 | n/a |
|
_anonymous_output_9 | _anonymous_output_9 | n/a |
|
_anonymous_output_10 | _anonymous_output_10 | n/a |
|
_anonymous_output_11 | _anonymous_output_11 | n/a |
|
_anonymous_output_12 | _anonymous_output_12 | n/a |
|
Version History
Version 1 (earliest) Created 25th Mar 2020 at 10:03 by Finn Bacall
Added/updated 56 files
Open
master
906d789
Creator
Submitter
Views: 2451 Downloads: 352
Created: 25th Mar 2020 at 10:03
Last updated: 25th Mar 2020 at 11:23
None