BROAD Best Practices Somatic CNV Panel Workflow
Version 1

BROAD Best Practices Somatic CNV Panel is used for creating a panel of normals (PON) given a set of normal samples.

Common Use Cases

For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool [1]. This produces a binary file that can be used as a PON. The normal samples should be as technically similar as possible to the tumor samples (same exome or genome preparation methods, sequencing technology, etc.) [2].

Copy number variant detection is based on coverage counts, and the resolution of the analysis is defined by the genomic intervals list. For whole genome data, the reference genome is divided into equally sized intervals (bins), while for exome data, the target regions of the capture kit should be padded. In either case, the PreprocessIntervals tool prepares the intervals list, which is then used for collecting raw integer counts. This step is performed by CollectReadCounts, which counts the reads that overlap each interval. Finally, a CNV panel of normals is generated using the CreateReadCountPanelOfNormals tool.
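The three steps described above can be sketched as plain GATK command lines. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: the helper name `build_pon_commands` and all file names are hypothetical, and only the commonly documented GATK4 arguments `-R`, `-L`, `-I`, `-O`, `--bin-length`, `--padding` and `--interval-merging-rule` are used. The helper only assembles the commands and does not run GATK.

```python
import os
import shlex


def build_pon_commands(reference, intervals, normal_bams, bin_length, padding, pon_path):
    """Return the GATK command lines (as argument lists) for building a CNV PON.

    Hypothetical helper for illustration only; nothing is executed.
    """
    preprocessed = "preprocessed.interval_list"  # assumed output file name

    # Step 1: bin (WGS) or pad (WES) the intervals.
    commands = [[
        "gatk", "PreprocessIntervals",
        "-R", reference,
        "-L", intervals,
        "--bin-length", str(bin_length),
        "--padding", str(padding),
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        "-O", preprocessed,
    ]]

    # Step 2: collect raw integer read counts for every normal sample.
    count_files = []
    for bam in normal_bams:
        counts = os.path.splitext(os.path.basename(bam))[0] + ".counts.hdf5"
        count_files.append(counts)
        commands.append([
            "gatk", "CollectReadCounts",
            "-I", bam,
            "-L", preprocessed,
            "--interval-merging-rule", "OVERLAPPING_ONLY",
            "-O", counts,
        ])

    # Step 3: combine all per-sample counts into one panel of normals.
    pon_cmd = ["gatk", "CreateReadCountPanelOfNormals"]
    for counts in count_files:
        pon_cmd += ["-I", counts]
    pon_cmd += ["-O", pon_path]
    commands.append(pon_cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_pon_commands(
        "reference.fasta", "targets.interval_list",
        ["normal_1.bam", "normal_2.bam"],
        bin_length=0, padding=250, pon_path="cnv.pon.hdf5",
    ):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

In this usage example, a bin length of 0 with a padding of 250 is meant to reflect an exome-style setup (padded targets, no binning); for whole genome data a non-zero bin length would be used instead.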

In creating a PON, CreateReadCountPanelOfNormals summarizes the counts data across samples and intervals using Singular Value Decomposition (SVD), the matrix factorization that underlies Principal Component Analysis. The normal samples in the PON should match the sequencing approach of the case sample under analysis. This applies especially to targeted exome data because the capture step introduces target-specific noise [3].
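The idea behind SVD-based denoising can be illustrated with a toy example: systematic noise shared by the normals shows up as a dominant component of the panel matrix, and projecting that component out of a case profile leaves the sample-specific signal. The sketch below is a simplification and not the GATK implementation. It uses power iteration in place of a full SVD, keeps a single component (the actual tool retains several, see the number_of_eigensamples input), and skips the filtering and standardization steps. All function names and data are hypothetical.

```python
import math
import random


def dot(a, b):
    """Dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def top_eigensample(panel, iterations=100, seed=0):
    """Approximate the dominant right singular vector of `panel`.

    `panel` is a list of rows (one per normal sample), each holding one
    value per interval. Power iteration on M^T M stands in for a full SVD.
    """
    n_intervals = len(panel[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_intervals)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(row, v) for row in panel]  # M v
        w = [sum(s * row[j] for s, row in zip(scores, panel))  # M^T (M v)
             for j in range(n_intervals)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def denoise(profile, eigensample):
    """Subtract the projection of `profile` onto the unit-length eigensample."""
    score = dot(profile, eigensample)
    return [p - score * e for p, e in zip(profile, eigensample)]


# Toy data: three normals share one systematic per-interval bias.
bias = [0.5, -0.5, 0.2, -0.2]
panel = [[scale * b for b in bias] for scale in (1.0, 0.8, 1.2)]
eigensample = top_eigensample(panel)

# A case sample with the same bias plus a gain in the third interval.
case = [0.9 * b for b in bias]
case[2] += 1.0
print([round(x, 3) for x in denoise(case, eigensample)])
```

After denoising, the shared bias is removed and the gain in the third interval dominates the profile. Because the projection also removes the small part of the true signal that happens to align with the noise pattern, the recovered gain is slightly attenuated.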

Some of the common input parameters are listed below:
* Input reads (--input) - BAM/SAM/CRAM file containing reads. For BAM and CRAM files, the corresponding BAI or CRAI index files are required as secondary files.
* Intervals (--intervals) - required for both WGS and WES cases. Formats must be compatible with the GATK -L argument. For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be included, but care should be taken to avoid creating panels of mixed sex, and to denoise case samples only with panels containing only individuals of the same sex as the case samples) [4].
* Bin length (--bin-length) - passed to the PreprocessIntervals tool. Read counts are collected per bin, and the final PON file contains read counts per bin. Therefore, when calling CNVs in tumor samples, the bin length must be set to the same value that was used when creating the PON.
* Padding (--padding) - also passed to the PreprocessIntervals tool; defines the number of base pairs added to each side of each interval.
* Reference (--reference) - Reference sequence file along with FAI and DICT files.
* Blacklisted Intervals (--exclude-intervals) - genomic intervals that are excluded from coverage collection and all downstream steps.
* Do Explicit GC Correction - if enabled, intervals are annotated with GC content using the AnnotateIntervals tool.
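To make the effect of the bin length and padding parameters concrete, the following sketch shows how a contig could be split into equally sized bins and how a target could be padded. It is an illustration only, assuming 1-based closed coordinates; the function names are hypothetical, and the real PreprocessIntervals tool may apply additional handling (for example, for intervals that overlap after padding) that is not modeled here.

```python
def bin_contig(contig_length, bin_length):
    """Split a contig into consecutive 1-based, closed bins of `bin_length` bp."""
    if bin_length <= 0:
        raise ValueError("bin_length must be positive")
    return [(start, min(start + bin_length - 1, contig_length))
            for start in range(1, contig_length + 1, bin_length)]


def pad_interval(start, end, padding, contig_length):
    """Extend an interval by `padding` bp per side, clipped to the contig."""
    return (max(1, start - padding), min(contig_length, end + padding))


# WGS-style binning of a 2,500 bp contig into 1,000 bp bins.
print(bin_contig(2500, 1000))  # [(1, 1000), (1001, 2000), (2001, 2500)]

# WES-style padding of a target by 250 bp on a 1,000 bp contig.
print(pad_interval(100, 200, 250, 1000))  # (1, 450)
```

Because the PON stores read counts per bin, a different bin length at CNV calling time would produce intervals that no longer line up with those in the panel, which is why the same value has to be reused.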

Changes Introduced by Seven Bridges

The entire workflow follows the best practices specification.

Performance Benchmarking

| Input Size | Experimental Strategy | Coverage | Duration | Cost (on demand) | AWS Instance Type |
| --- | --- | --- | --- | --- | --- |
| 2 x 45GB | WGS | 8x | 33min | $0.59 | c4.4xlarge 2TB EBS |
| 2 x 120GB | WGS | 25x | 1h 22min | $1.47 | c4.4xlarge 2TB EBS |
| 2 x 210GB | WGS | 40x | 2h 19min | $2.48 | c4.4xlarge 2TB EBS |
| 2 x 420GB | WGS | 80x | 4h 15min | $4.54 | c4.4xlarge 2TB EBS |

API Python Implementation

The app's draft task can also be submitted via the API. To learn how to get your authentication token and the API endpoint for the corresponding platform, visit our documentation.

```python
# Initialize the SBG Python API
from sevenbridges import Api
api = Api(token="enter_your_token", url="enter_api_endpoint")
# Get project_id/app_id from your address bar. Example:
project_id = "your_username/project"
app_id = "your_username/project/app"
# Replace inputs with appropriate values
inputs = {
    "sequence_dictionary": api.files.query(project=project_id, names=["enter_filename"])[0],
    "intervals": api.files.query(project=project_id, names=["enter_filename"])[0],
    "in_alignments": list(api.files.query(project=project_id, names=["enter_filename", "enter_filename"])),
    "in_reference": api.files.query(project=project_id, names=["enter_filename"])[0],
    # PON entity id (output prefix); required string input
    "pon_entity_id": "sevenbridges"}
# Creates draft task
task = api.tasks.create(name="GATK CNV Somatic Panel Workflow - API Run", project=project_id, app=app_id, inputs=inputs, run=False)
```

Instructions for installing and configuring the API Python client are provided on GitHub. For more information about using the API Python client, consult the client documentation. More examples are available here.

Additionally, API R and API Java clients are available. To learn more about using these API clients, please refer to the API R client documentation and the API Java client documentation.


* [1]
* [2]
* [3]
* [4]


Inputs

| ID | Name | Description | Type |
| --- | --- | --- | --- |
| bin_length | n/a | n/a | int? |
| padding | n/a | n/a | int? |
| sequence_dictionary | Sequence dictionary | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. | File |
| intervals | Intervals | Genomic intervals over which to operate. | File |
| in_reference | Reference | Reference sequence file. | File |
| exclude_intervals | Blacklisted intervals | Genomic intervals to exclude from processing. | File? |
| feature_query_lookahead | n/a | n/a | int? |
| mappability_track | Mappability track | Umap single-read mappability track in .bed or .bed.gz format (see | File? |
| segmental_duplication_track | Segmental duplication track | Segmental-duplication track in .bed or .bed.gz format | File? |
| in_alignments | Input reads | BAM/SAM/CRAM file containing reads. This argument must be specified at least once. | File[] |
| output_format | n/a | n/a | enum (TSV, HDF5), optional |
| do_impute_zeros | n/a | n/a | enum (true, false), optional |
| extreme_outlier_truncation_percentile | n/a | n/a | float? |
| extreme_sample_median_percentile | n/a | n/a | float? |
| maximum_chunk_size | n/a | n/a | int? |
| maximum_zeros_in_interval_percentage | n/a | n/a | float? |
| maximum_zeros_in_sample_percentage | n/a | n/a | float? |
| minimum_interval_median_percentile | n/a | n/a | float? |
| number_of_eigensamples | n/a | n/a | int? |
| pon_entity_id | PON entity id | PON entity id (output prefix) for the panel of normals. | string |
| do_explicit_gc_correction | Do explicit GC correction | Choose whether to annotate intervals with GC content. | boolean? |


Steps

| ID | Name | Description |
| --- | --- | --- |
| gatk_annotateintervals_4_1_0_0 | GATK AnnotateIntervals | |
| gatk_collectreadcounts_4_1_0_0 | GATK CollectReadCounts | |
| gatk_createreadcountpanelofnormals_4_1_0_0 | GATK CreateReadCountPanelOfNormals | |
| gatk_preprocessintervals_4_1_0_0 | GATK PreprocessIntervals | |


Outputs

| ID | Name | Description | Type |
| --- | --- | --- | --- |
| preprocessed_intervals | Preprocessed Intervals | n/a | File? |
| read_counts | Read counts | n/a | File[]? |
| entity_id | Entity ID | n/a | |
| panel_of_normals | Panel of normals | n/a | File? |

Created: 24th Mar 2020 at 14:17
