BROAD Best Practices Somatic CNV Panel Workflow 4.1.0.0

BROAD Best Practices Somatic CNV Panel is used for creating a panel of normals (PON) given a set of normal samples.

Common Use Cases

For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool [1]. This produces a binary file that can be used as a PON. It is very important to use normal samples that are as technically similar as possible to the tumor samples (same exome or genome preparation methods, sequencing technology, etc.) [2].

Copy number variant detection is based on collected coverage counts, while the resolution of the analysis is defined by the genomic intervals list. For whole genome data, the reference genome is divided into equally sized intervals (bins); for exome data, the target regions of the capture kit should be padded. In either case, the PreprocessIntervals tool prepares the intervals list, which is then used for collecting raw integer counts. This step uses CollectReadCounts, which counts the reads that overlap each interval. Finally, a CNV panel of normals is generated using the CreateReadCountPanelOfNormals tool.
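
The three steps described above can be sketched as a sequence of GATK command lines. The snippet below is only an illustration and does not reproduce the workflow's actual wiring: the helper name `build_pon_commands`, the file names, and the example values for bin length and padding are assumptions made for this sketch. It only assembles the commands as argument lists and does not run GATK.

```python
from pathlib import Path
import shlex


def build_pon_commands(reference, intervals, normal_bams, bin_length, padding, pon_path):
    """Assemble (but do not run) the GATK commands for building a CNV PON.

    Hypothetical helper for illustration; intermediate file names are made up.
    """
    preprocessed = "preprocessed.interval_list"

    # Step 1: bin (WGS) or pad (WES) the intervals.
    commands = [[
        "gatk", "PreprocessIntervals",
        "-R", reference,
        "-L", intervals,
        "--bin-length", str(bin_length),
        "--padding", str(padding),
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        "-O", preprocessed,
    ]]

    # Step 2: collect raw integer read counts for every normal sample.
    count_files = []
    for bam in normal_bams:
        counts = Path(bam).stem + ".counts.hdf5"
        count_files.append(counts)
        commands.append([
            "gatk", "CollectReadCounts",
            "-I", bam,
            "-L", preprocessed,
            "--interval-merging-rule", "OVERLAPPING_ONLY",
            "-O", counts,
        ])

    # Step 3: combine all per-sample counts into one panel of normals.
    pon_cmd = ["gatk", "CreateReadCountPanelOfNormals"]
    for counts in count_files:
        pon_cmd += ["-I", counts]
    pon_cmd += ["-O", pon_path]
    commands.append(pon_cmd)
    return commands


if __name__ == "__main__":
    # Example values for an exome-style run (no binning, padded targets).
    for cmd in build_pon_commands(
        "ref.fasta", "targets.interval_list",
        ["normal_a.bam", "normal_b.bam"], 0, 250, "cnv.pon.hdf5",
    ):
        print(shlex.join(cmd))
```

Note that one CollectReadCounts command is produced per normal sample, while PreprocessIntervals and CreateReadCountPanelOfNormals each appear once.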

In creating a PON, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), a technique closely related to Principal Component Analysis (PCA). The normal samples in the PON should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise [3].
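
The role of the SVD can be illustrated with a generic denoising formulation. This is a simplified sketch and not necessarily the exact computation GATK performs. It assumes that $X$ is a matrix of standardized log copy ratios with one row per normal sample and one column per interval, that $c$ is the corresponding profile of a case sample, and that $k$ is the number of retained components (presumably what the `number_of_eigensamples` input controls).

```latex
\begin{aligned}
X &= U \Sigma V^{\top}, \qquad X \in \mathbb{R}^{n \times m}
  && \text{($n$ normals, $m$ intervals)} \\
V_k &= [\, v_1, \dots, v_k \,] \in \mathbb{R}^{m \times k}
  && \text{(top $k$ right singular vectors)} \\
\tilde{c} &= c - V_k V_k^{\top} c, \qquad c \in \mathbb{R}^{m}
  && \text{(case profile with systematic noise removed)}
\end{aligned}
```

Under these assumptions, the columns of $V_k$ capture the dominant patterns of systematic coverage variation shared by the normals, and subtracting the projection of $c$ onto them leaves the signal that the panel cannot explain.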

Some of the common input parameters are listed below:

Input reads (--input) - BAM/SAM/CRAM files containing reads. BAM and CRAM files require accompanying BAI and CRAI index files, respectively.
Intervals (--intervals) - required for both WGS and WES cases. Formats must be compatible with the GATK -L argument. For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be included, but care should be taken to avoid creating panels of mixed sex, and to denoise case samples only with panels containing only individuals of the same sex as the case samples) [4].
Bin length (--bin-length) - passed to the PreprocessIntervals tool. Read counts are collected per bin, and the final PON file contains read count information per bin. When calling CNVs in tumor samples, the bin length must therefore be set to the same value that was used when creating the PON file.
Padding (--padding) - also passed to the PreprocessIntervals tool; defines the number of base pairs added to each side of every interval.
Reference (--reference) - reference sequence file along with its FAI and DICT files.
Blacklisted Intervals (--exclude_intervals) - intervals that will be excluded from coverage collection and all downstream steps.
Do Explicit GC Correction - if enabled, intervals are annotated with GC content using the AnnotateIntervals tool.
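
The effect of the bin length and padding parameters listed above can be illustrated with a small pure-Python sketch. It is a deliberately simplified model, not the PreprocessIntervals implementation: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and steps such as merging overlapping padded intervals are left out.

```python
def bin_contig(contig, contig_length, bin_length):
    """Split one contig into consecutive bins of at most bin_length bases (WGS case)."""
    return [
        (contig, start, min(start + bin_length - 1, contig_length))
        for start in range(1, contig_length + 1, bin_length)
    ]


def pad_targets(targets, padding, contig_length):
    """Extend each target by `padding` bases on both sides, clipped to the contig (WES case)."""
    return [
        (contig, max(1, start - padding), min(contig_length, end + padding))
        for contig, start, end in targets
    ]


if __name__ == "__main__":
    # A toy 2,500 bp contig split into 1,000 bp bins; the last bin is shorter.
    print(bin_contig("chr1", 2500, 1000))
    # A single target padded by 250 bp; the left edge is clipped at position 1.
    print(pad_targets([("chr1", 100, 200)], 250, 2500))
```

Because the PON stores information per bin, a different bin length at CNV calling time would yield intervals that no longer line up with those in the panel, which is why the same value has to be reused.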

Changes Introduced by Seven Bridges

The entire workflow follows the best practices specification.

Performance Benchmarking

| Input Size | Experimental Strategy | Coverage | Duration | Cost (on demand) | AWS Instance Type |
| --- | --- | --- | --- | --- | --- |
| 2 x 45GB | WGS | 8x | 33min | $0.59 | c4.4xlarge 2TB EBS |
| 2 x 120GB | WGS | 25x | 1h 22min | $1.47 | c4.4xlarge 2TB EBS |
| 2 x 210GB | WGS | 40x | 2h 19min | $2.48 | c4.4xlarge 2TB EBS |
| 2 x 420GB | WGS | 80x | 4h 15min | $4.54 | c4.4xlarge 2TB EBS |

API Python Implementation

The app's draft task can also be submitted via the API. To learn how to get your authentication token and the API endpoint for the corresponding platform, visit our documentation.

# Initialize the SBG Python API
from sevenbridges import Api
api = Api(token="enter_your_token", url="enter_api_endpoint")
# Get project_id/app_id from your address bar. Example: https://igor.sbgenomics.com/u/your_username/project/app
project_id = "your_username/project"
app_id = "your_username/project/app"
# Replace inputs with appropriate values
inputs = {
	"sequence_dictionary": api.files.query(project=project_id, names=["enter_filename"])[0], 
	"intervals": api.files.query(project=project_id, names=["enter_filename"])[0], 
	"in_alignments": list(api.files.query(project=project_id, names=["enter_filename", "enter_filename"])), 
	"in_reference": api.files.query(project=project_id, names=["enter_filename"])[0], 
	"output_prefix": "sevenbridges"}
# Creates draft task
task = api.tasks.create(name="GATK CNV Somatic Panel Workflow - API Run", project=project_id, app=app_id, inputs=inputs, run=False)
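
Since the task above is created as a draft (run=False), it still has to be started and monitored. The helper below is a minimal sketch of a polling loop. The function name `wait_for_task` and the polling interval are assumptions for illustration; it only relies on the task object exposing `status` and `reload()`, as the objects returned by the sevenbridges client do.

```python
import time

# Statuses after which a task no longer changes state.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "ABORTED"}


def wait_for_task(task, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll a task until it reaches a terminal status and return that status.

    `task` is any object with a `status` attribute and a `reload()` method.
    `max_polls` optionally bounds the number of refreshes before giving up.
    """
    polls = 0
    while task.status not in TERMINAL_STATUSES:
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"task still {task.status} after {polls} polls")
        sleep(poll_seconds)
        task.reload()  # refresh the task state from the platform
        polls += 1
    return task.status


# Possible usage with the draft task created above:
#   task.run()
#   final_status = wait_for_task(task, poll_seconds=120)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without waiting, for example with a stand-in task object.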

Instructions for installing and configuring the API Python client are provided on GitHub. For more information about using the API Python client, consult the client documentation. More examples are available here.

Additionally, API R and API Java clients are available. To learn more about using these API clients, please refer to the API R client documentation and the API Java client documentation.

Inputs

ID	Name	Description	Type
bin_length	n/a	n/a	int?
padding	n/a	n/a	int?
sequence_dictionary	Sequence dictionary	Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.	File
intervals	Intervals	Genomic intervals over which to operate.	File
in_reference	Reference	Reference sequence file.	File
exclude_intervals	Blacklisted intervals	Genomic intervals to exclude from processing.	File?
feature_query_lookahead	n/a	n/a	int?
mappability_track	Mappability track	Umap single-read mappability track in .bed or .bed.gz format (see https://bismap.hoffmanlab.org/).	File?
segmental_duplication_track	Segmental duplication track	Segmental-duplication track in .bed or .bed.gz format	File?
in_alignments	Input reads	BAM/SAM/CRAM file containing reads. This argument must be specified at least once.	File[]
output_format	n/a	n/a	enum of: TSV, HDF5
do_impute_zeros	n/a	n/a	enum of: true, false
extreme_outlier_truncation_percentile	n/a	n/a	float?
extreme_sample_median_percentile	n/a	n/a	float?
maximum_chunk_size	n/a	n/a	int?
maximum_zeros_in_interval_percentage	n/a	n/a	float?
maximum_zeros_in_sample_percentage	n/a	n/a	float?
minimum_interval_median_percentile	n/a	n/a	float?
number_of_eigensamples	n/a	n/a	int?
pon_entity_id	PON entity id	PON entity id (output prefix) for the panel of normals.	string
do_explicit_gc_correction	Do explicit GC correction	Choose whether to annotate intervals with GC content.	boolean?

Steps

ID	Name	Description
gatk_annotateintervals_4_1_0_0	GATK AnnotateIntervals	n/a
gatk_collectreadcounts_4_1_0_0	GATK CollectReadCounts	n/a
gatk_createreadcountpanelofnormals_4_1_0_0	GATK CreateReadCountPanelOfNormals	n/a
gatk_preprocessintervals_4_1_0_0	GATK PreprocessIntervals	n/a

Outputs

ID	Name	Description	Type
preprocessed_intervals	Preprocessed Intervals	n/a	File?
read_counts	Read counts	n/a	File[]?
entity_id	Entity ID	n/a	string[]
panel_of_normals	Panel of normals	n/a	File?

Version History

Version 1 (earliest) Created 24th Mar 2020 at 14:17 by Kaushik Ghose
