BROAD Best Practices Somatic CNV Panel is used for creating a panel of normals (PON) given a set of normal samples.
Common Use Cases
For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool [1]. This produces a binary file that can be used as a PON. It is very important to use normal samples that are as technically similar as possible to the tumor samples (same exome or genome preparation methods, sequencing technology etc.) [2].
The basis of copy number variant detection is formed by collecting coverage counts, while the resolution of the analysis is defined by the genomic intervals list. In the case of whole genome data, the reference genome is divided into equally sized intervals or bins, while for exome data, the target regions of the capture kit should be padded. In either case, the PreprocessIntervals tool is used for preparing the intervals list which is then used for collecting raw integer counts. For this step CollectReadCounts is utilized, which counts reads that overlap the interval. Finally a CNV panel of normals is generated using the CreateReadCountPanelOfNormals tool.
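The three stages described above can be sketched as a sequence of GATK command lines. This is a minimal illustration that only builds and prints the commands; the file names, the intermediate `preprocessed.interval_list` name and the helper `build_pon_commands` are assumptions made for the example, not part of the workflow itself.

```python
from pathlib import PurePosixPath


def build_pon_commands(reference, intervals, normal_bams, bin_length, padding, pon_path):
    """Return the GATK command lines (as argument lists) for building a CNV PON."""
    preprocessed = "preprocessed.interval_list"  # hypothetical intermediate file name
    # Step 1: prepare the intervals list (binning for WGS, padding for exomes).
    commands = [[
        "gatk", "PreprocessIntervals",
        "-R", reference,
        "-L", intervals,
        "--bin-length", str(bin_length),
        "--padding", str(padding),
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        "-O", preprocessed,
    ]]
    # Step 2: collect raw integer read counts for every normal sample.
    count_files = []
    for bam in normal_bams:
        counts = PurePosixPath(bam).stem + ".counts.hdf5"
        count_files.append(counts)
        commands.append([
            "gatk", "CollectReadCounts",
            "-I", bam,
            "-L", preprocessed,
            "--interval-merging-rule", "OVERLAPPING_ONLY",
            "-O", counts,
        ])
    # Step 3: combine all count files into a single panel of normals.
    pon_command = ["gatk", "CreateReadCountPanelOfNormals"]
    for counts in count_files:
        pon_command += ["-I", counts]
    pon_command += ["-O", pon_path]
    commands.append(pon_command)
    return commands


# Example values are placeholders chosen for illustration only.
commands = build_pon_commands(
    reference="reference.fasta",
    intervals="targets.interval_list",
    normal_bams=["normal_1.bam", "normal_2.bam"],
    bin_length=0,
    padding=250,
    pon_path="cnv.pon.hdf5",
)
for command in commands:
    print(" ".join(command))
```

Building the commands as argument lists keeps the sketch free of side effects; each list could be passed to `subprocess.run` once the paths point at real files.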
When creating a PON, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), the matrix factorization that underlies Principal Component Analysis. The normal samples in the PON should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise [3].
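The idea behind SVD-based denoising can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the GATK implementation: it skips the filtering and imputation steps, and the function names (`standardize`, `build_panel`, `denoise`) are hypothetical.

```python
import numpy as np


def standardize(counts):
    """Divide each sample (row) by its median count, then take log2."""
    counts = np.asarray(counts, dtype=float)
    fractional = counts / np.median(counts, axis=1, keepdims=True)
    return np.log2(fractional)


def build_panel(normal_counts, num_eigensamples):
    """Learn per-interval medians and the top singular vectors from normals.

    normal_counts has shape (samples, intervals).
    """
    log_ratios = standardize(normal_counts)
    interval_medians = np.median(log_ratios, axis=0)
    centered = log_ratios - interval_medians
    # Rows of vt are orthonormal directions in interval space, ordered by
    # how much of the variation across normals they explain.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return interval_medians, vt[:num_eigensamples]


def denoise(case_counts, interval_medians, eigensamples):
    """Remove the systematic noise captured by the panel from one case sample."""
    centered = standardize([case_counts])[0] - interval_medians
    systematic = eigensamples.T @ (eigensamples @ centered)
    return centered - systematic
```

Variation that is shared across the normals is projected out of the case sample, so what remains is closer to the sample-specific copy ratio signal. This is also why the normals need to share the technical characteristics of the case sample: noise that the panel has never seen cannot be removed.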
Some of the common input parameters are listed below:
- Input reads (`--input`) - BAM/SAM/CRAM file containing reads. For BAM and CRAM files, the BAI and CRAI index files are required as secondary files.
- Intervals (`--intervals`) - required for both WGS and WES cases. Formats must be compatible with the GATK `-L` argument. For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be included, but care should be taken to avoid creating panels of mixed sex, and to denoise case samples only with panels containing only individuals of the same sex as the case samples) [4].
- Bin length (`--bin-length`) - passed to the PreprocessIntervals tool. Read counts are collected per bin, and the final PON file contains read count information per bin. When calling CNVs in tumor samples, the bin length must therefore be set to the same value that was used when creating the PON file.
- Padding (`--padding`) - also used by the PreprocessIntervals tool; defines the number of base pairs to pad each bin on each side.
- Reference (`--reference`) - reference sequence file along with its FAI and DICT files.
- Blacklisted Intervals (`--exclude_intervals`) - intervals that will be excluded from coverage collection and all downstream steps.
- Do Explicit GC Correction - annotate intervals with GC content using the AnnotateIntervals tool.
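The effect of the bin length and padding parameters can be illustrated with a small pure-Python sketch. It assumes 1-based inclusive intervals on a single contig and pads and merges the intervals before splitting them into bins; the function `pad_and_bin` is hypothetical and leaves out what the real tool does beyond this (for example, handling of multiple contigs and filtering against the reference).

```python
def pad_and_bin(intervals, bin_length, padding, contig_length):
    """Pad intervals, merge overlaps, and split the result into bins.

    intervals: list of (start, end) tuples, 1-based inclusive, one contig.
    bin_length: bin size in base pairs; 0 disables binning.
    """
    # Pad each interval on both sides, clipping to the contig boundaries.
    padded = sorted(
        (max(1, start - padding), min(contig_length, end + padding))
        for start, end in intervals
    )
    # Merge intervals that overlap after padding.
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    if bin_length == 0:
        return merged
    # Split every merged interval into consecutive bins of bin_length bases;
    # the last bin of an interval may be shorter.
    bins = []
    for start, end in merged:
        for bin_start in range(start, end + 1, bin_length):
            bins.append((bin_start, min(bin_start + bin_length - 1, end)))
    return bins


# WGS-style: one long interval cut into 1000 bp bins, no padding.
print(pad_and_bin([(1, 2500)], bin_length=1000, padding=0, contig_length=10000))
# WES-style: targets padded by 250 bp, binning disabled.
print(pad_and_bin([(101, 200), (401, 500)], bin_length=0, padding=250, contig_length=10000))
```

Because read counts are collected per resulting bin, any change to either parameter changes the set of bins, which is why the tumor analysis has to reuse the values used for the PON.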
Changes Introduced by Seven Bridges
The workflow follows the best practice specification in its entirety.
Performance Benchmarking
| Input Size | Experimental Strategy | Coverage | Duration | Cost (on demand) | AWS Instance Type |
| --- | --- | --- | --- | --- | --- |
| 2 x 45GB | WGS | 8x | 33min | $0.59 | c4.4xlarge 2TB EBS |
| 2 x 120GB | WGS | 25x | 1h 22min | $1.47 | c4.4xlarge 2TB EBS |
| 2 x 210GB | WGS | 40x | 2h 19min | $2.48 | c4.4xlarge 2TB EBS |
| 2 x 420GB | WGS | 80x | 4h 15min | $4.54 | c4.4xlarge 2TB EBS |
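As a quick consistency check on the benchmark figures, the cost per hour implied by each row can be computed from its duration and cost. The sketch below uses only the values from the table; the helper name `implied_hourly_cost` is made up for the example.

```python
# (coverage, duration in minutes, on-demand cost in USD), taken from the table.
benchmarks = [
    ("8x", 33, 0.59),
    ("25x", 82, 1.47),
    ("40x", 139, 2.48),
    ("80x", 255, 4.54),
]


def implied_hourly_cost(minutes, cost):
    """Cost divided by duration, expressed per hour."""
    return cost / (minutes / 60)


rates = {coverage: implied_hourly_cost(minutes, cost) for coverage, minutes, cost in benchmarks}
for coverage, rate in rates.items():
    print(f"{coverage}: ${rate:.2f} per hour")
```

All four rows come out at close to $1.07 per hour, so for this instance type the cost of a run scales essentially linearly with its duration.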
API Python Implementation
The app's draft task can also be submitted via the API. To learn how to get your authentication token and the API endpoint for the corresponding platform, visit our documentation.
# Initialize the SBG Python API
from sevenbridges import Api
api = Api(token="enter_your_token", url="enter_api_endpoint")
# Get project_id/app_id from your address bar. Example: https://igor.sbgenomics.com/u/your_username/project/app
project_id = "your_username/project"
app_id = "your_username/project/app"
# Replace inputs with appropriate values
inputs = {
    "sequence_dictionary": api.files.query(project=project_id, names=["enter_filename"])[0],
    "intervals": api.files.query(project=project_id, names=["enter_filename"])[0],
    "in_alignments": list(api.files.query(project=project_id, names=["enter_filename", "enter_filename"])),
    "in_reference": api.files.query(project=project_id, names=["enter_filename"])[0],
    # The output prefix is passed through the pon_entity_id input (see the Inputs table)
    "pon_entity_id": "sevenbridges",
}
# Creates draft task
task = api.tasks.create(name="GATK CNV Somatic Panel Workflow - API Run", project=project_id, app=app_id, inputs=inputs, run=False)
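Once the draft task exists, it still has to be started and monitored. The helper below is a minimal polling sketch, not part of the client library: it assumes the task object exposes a `status` attribute and a `reload()` method, as the task objects returned by the sevenbridges-python client do, and that terminal statuses are named `COMPLETED`, `FAILED` and `ABORTED`. The name `wait_for_task` and the polling interval are hypothetical.

```python
import time

# Statuses after which the task will not change any more (assumed names).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "ABORTED"}


def wait_for_task(task, poll_seconds=60, sleep=time.sleep):
    """Poll a task until it reaches a terminal status and return that status.

    task: any object with a `status` attribute and a `reload()` method.
    sleep: injectable so the function can be exercised without waiting.
    """
    while task.status not in TERMINAL_STATUSES:
        sleep(poll_seconds)
        task.reload()  # refresh the local copy of the task from the server
    return task.status
```

With the client, this would typically be used as `task.run()` followed by `wait_for_task(task)`.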
Instructions for installing and configuring the API Python client are provided on GitHub. For more information about using the API Python client, consult the client documentation. More examples are available here.
API R and API Java clients are also available. To learn more about using them, refer to the API R client documentation and the API Java client documentation.
References
Inputs
ID | Name | Description | Type |
---|---|---|---|
bin_length | n/a | n/a | |
padding | n/a | n/a | |
sequence_dictionary | Sequence dictionary | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. | |
intervals | Intervals | Genomic intervals over which to operate. | |
in_reference | Reference | Reference sequence file. | |
exclude_intervals | Blacklisted intervals | Genomic intervals to exclude from processing. | |
feature_query_lookahead | n/a | n/a | |
mappability_track | Mappability track | Umap single-read mappability track in .bed or .bed.gz format (see https://bismap.hoffmanlab.org/). | |
segmental_duplication_track | Segmental duplication track | Segmental-duplication track in .bed or .bed.gz format. | |
in_alignments | Input reads | BAM/SAM/CRAM file containing reads. This argument must be specified at least once. | |
output_format | n/a | n/a | |
do_impute_zeros | n/a | n/a | |
extreme_outlier_truncation_percentile | n/a | n/a | |
extreme_sample_median_percentile | n/a | n/a | |
maximum_chunk_size | n/a | n/a | |
maximum_zeros_in_interval_percentage | n/a | n/a | |
maximum_zeros_in_sample_percentage | n/a | n/a | |
minimum_interval_median_percentile | n/a | n/a | |
number_of_eigensamples | n/a | n/a | |
pon_entity_id | PON entity id | PON entity id (output prefix) for the panel of normals. | |
do_explicit_gc_correction | Do explicit GC correction | Choose whether to annotate intervals with GC content. | |
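Because the API expects the keys of the inputs dictionary to match these input IDs, a small check before submitting a task can catch typos early. The sketch below hard-codes the IDs from the table above; the helper `unknown_input_ids` is hypothetical and does not verify value types or which inputs are required.

```python
# Input IDs copied from the Inputs table of this workflow.
KNOWN_INPUT_IDS = {
    "bin_length", "padding", "sequence_dictionary", "intervals", "in_reference",
    "exclude_intervals", "feature_query_lookahead", "mappability_track",
    "segmental_duplication_track", "in_alignments", "output_format",
    "do_impute_zeros", "extreme_outlier_truncation_percentile",
    "extreme_sample_median_percentile", "maximum_chunk_size",
    "maximum_zeros_in_interval_percentage", "maximum_zeros_in_sample_percentage",
    "minimum_interval_median_percentile", "number_of_eigensamples",
    "pon_entity_id", "do_explicit_gc_correction",
}


def unknown_input_ids(inputs):
    """Return the keys of `inputs` that are not input IDs of this workflow."""
    return sorted(set(inputs) - KNOWN_INPUT_IDS)


# Placeholder values; only the keys matter for this check.
draft_inputs = {"bin_lenght": 1000, "padding": 250, "intervals": "targets.interval_list"}
print(unknown_input_ids(draft_inputs))  # the misspelled key is reported
```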
Steps
ID | Name | Description |
---|---|---|
gatk_annotateintervals_4_1_0_0 | GATK AnnotateIntervals | n/a |
gatk_collectreadcounts_4_1_0_0 | GATK CollectReadCounts | n/a |
gatk_createreadcountpanelofnormals_4_1_0_0 | GATK CreateReadCountPanelOfNormals | n/a |
gatk_preprocessintervals_4_1_0_0 | GATK PreprocessIntervals | n/a |
Outputs
ID | Name | Description | Type |
---|---|---|---|
preprocessed_intervals | Preprocessed Intervals | n/a | |
read_counts | Read counts | n/a | |
entity_id | Entity ID | n/a | |
panel_of_normals | Panel of normals | n/a | |
Version History
Version 1 (earliest) Created 24th Mar 2020 at 14:17 by Kaushik Ghose
Added/updated 1 files