Short reads were recorded in FASTQ format. The high quality data for bioinformatics analysis were generated as the following: (1) discard read pairs if either read contains adapter sequences (>10 nucleotides aligned to the adapter, allowing ≤ 10% mismatches); (2) discard read pairs if either read contains more than 10% of uncertain bases; and (3) discard read pairs if either read contains more than 50% low quality (Phred quality <5) bases. The following statistics were calculated and deposited in Supplementary Table: total number of reads; sequencing error rate; percentage of reads with average quality score >20; percentage of reads with average quality score >30; and GC content distribution.
Sequencing reads were aligned to the Human Genome Reference Consortium build 38 (GRCh38) using BWA v0.7.8. Unaligned reads that passed the Illumina’s quality filter (PF reads) were stored in the BAM file as well. We used the "Picard" pipeline to combine data from multiple libraries and flow cell runs into a single BAM file for any single sample. Only uniquely aligned, de-duplicated reads were used in subsequent analysis. Individual BAM file contains reads aligned to the human genome with quality scores recalibrated using Genome Analysis Toolkit (GATK)’s Table Recalibration tool. All sites potentially harboring small insertions or deletions in either the tumor or the matched normal were realigned using GATK.
Workflow | Software |
---|---|
Step1: Alignment | BWA MEM 、Samtools |
Step2: BAM Sort and Merge | Picard Tools 、Samtools |
Step3: Mark and Remove Duplicates | Picard Tools |
Step4: Local Realignment | GATK |
Step5: Base Quality Score Recalibration | GATK |
bwa mem \ -t 5 \ -M -R <read_group> \ <reference.fa> \ <fastq_1.fq.gz> \ <fastq_2.fq.gz> \ |samtools view -b -S -t \ <reference.fa.fai> \ - > \ <sample.bam>
# Bam Sort samtools sort \ <sample.bam> \ <sample.sort> # Bam Merge java -Xmx5g -jar MergeSamFiles.jar \ INPUT= <sample.L1.sort.bam> \ INPUT= <sample.L2.sort.bam> \ OUTPUT= <sample.sort.bam> \ SORT_ORDER=coordinate \ TMP_DIR= <outdir> \ USE_THREADING=true \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=SILENT \ MAX_RECORDS_IN_RAM=4000000
java -Xmx10g -jar MarkDuplicates.jar \ TMP_DIR= <outdir> \ INPUT= <sample.sort.bam> \ OUTPUT= <sample.nodup.bam> \ METRICS_FILE= <sample.sort.metrics> \ VALIDATION_STRINGENCY=SILENT \ ASSUME_SORTED=true \ CREATE_INDEX=true \ REMOVE_DUPLICATES=true \ MAX_RECORDS_IN_RAM=4000000
java -Xmx14g -jar GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -nt 6 \ -R <reference.fa> \ -I <sample.nodup.bam> \ -known <known_indels.vcf> \ -o <sample.nodup.bam.intervals> && \ java -Xmx14g -jar GenomeAnalysisTK.jar \ -T IndelRealigner \ -R <reference.fa> \ -I <sample.nodup.bam> \ -targetIntervals <sample.nodup.bam.intervals> \ -known <known_indels.vcf> \ -o <sample.nodup.realn.bam>
java -Xmx14g -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -nct 6 \ -R <reference.fa> \ -I <sample.nodup.realn.bam> \ -rf BadCigar \ -knownSites <dbsnp.vcf> \ -o <recal_data.table> && \ java -Xmx14g -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R <reference.fa> \ -I <sample.nodup.realn.bam> \ -BQSR <recal_data.table> \ -o <ample.nodup.realn.recal.bam>
We used GATK's HaplotypeCaller to perform germline mutation calling.
# GATK call Germline SNP/INDEL java -Xmx6g -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R <reference.fa> \ -I <sample.final.bam> \ -nct 8 \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o <sample.germline.raw.vcf> # Split SNP and INDEL, Then Filter java -Xmx2g -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R <reference.fa> \ -V <sample.germline.raw.vcf> \ -selectType SNP \ -o <sample.raw_snp.vcf> && \ java -Xmx2g -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R <reference.fa> \ -V <sample.raw_snp.vcf> \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName "my_snp_filter" \ -o <sample.snp.befil.vcf> \ 2>/dev/null && \ java -Xmx2g -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R <reference.fa> \ -V <sample.germline.raw.vcf> \ -selectType INDEL \ -o <sample.raw_indel.vcf> && \ java -Xmx2g -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R <reference.fa> \ -V <sample.raw_indel.vcf> \ --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ --filterName "my_indel_filter" \ -o <sample.indel.befil.vcf> \ 2>/dev/null
Control-FREEC was used to detect somatic CNAs; B-allele frequencies were used to infer loss-of-heterozygosity events. The genomic regions of frequent germline CNAs (1kb windows including SNP proves with freqcnv = TREU in snp6.na35.remap.hg38.subset.txt.gz, downloaded from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files ) and blacklist regions (downloaded from https://github.com/Boyle-Lab/Blacklist ) [ref: https://www.nature.com/articles/s41598-019-45839-z ] were excluded from the analyses.
The GISTIC2 algorithm was used to infer recurrently amplified or deleted genomic regions. G-scores were calculated for genomic and gene-coding regions on the basis of the frequency and amplitude of amplification or deletion affecting each gene. To determine significantly recurrent regions of SCNA, the GISTIC 2.0 version 6.2 GenePattern module was applied to the segmented data using amplifications and deletions thresholds of 0.2, join.segment.size of 100 probes, a copy-ratio cap of 1.5, arm-level peel-off enabled, broad length cutoff of 0.7 chromosome arms, and confidence level of 95%.
Control-FREEC
# Step1: Samtools Mpileup samtools mpileup \ -f <reference.fa> \ <sample.bam> \ > \ <sample.pileup> # Step2: Run Control-Freec freec -conf <somatic_freec.config>
somatic_freec.config
[general] BedGraphOutput = FALSE breakPointType = 2 chrFiles = <bychr> chrLenFile = <chr.24.length> maxThreads = 6 ploidy = 2 gemMappabilityFile = <mappability.100bp.out.mappability> uniqueMatch = TRUE sex = XY telocentromeric = 50000 forceGCcontentNormalization = 1 window = 2000 step = 1000 outputDir = <outdir> breakPointThreshold = 0.8 noisyData = FALSE readCountThreshold = 10 printNA = TRUE [sample] mateFile = <tumor.final.mpileup.gz> inputFormat = pileup mateOrientation = FR [control] mateFile = <normal.final.mpileup.gz> inputFormat = pileup mateOrientation = FR [BAF] SNPfile = <SingleDiNucl.1based.sorted.txt> minimalCoveragePerPosition = 5 shiftInQuality = 33
GISTIC
gistic2 \ -seg <segment.txt> \ -refgene <hg38.UCSC.add_miR.160920.refgene.mat> \ -b <outdir> \ -rx 1 -fname CPGEA_WGS_208pairs \ -ta 0.2 -td 0.2 -js 100 -qvt 0.25 -cap 1.5 -maxseg 2000\ -conf 0.95 -genegistic 1 -armpeel 1 \ -smallmem 1 -broad 1 -brlen 0.7 -savegene 1
To accurately detect the somatic breakpoint, we utilized the MeerKat software which was based on the paired-end’s approaches. The set of structural variant calls from each tumor sample is filtered by the calls from its matched normal to remove germ-line variants as described10. An event was removed if: 1) both breakpoints fell in satellite or simple repeats;2) filter TEI (transposable element insertion) for deletions; 3) the event featured more than 40bp of homology (or unmatched nucleotides) at breakpoints for intrachromosomal translocation and 20 bp of homology (or unmatched nucleotides) for interchromosomal translocation; 4) the event contains too many discordant pairs (>1) around the predicted breakpoints in the matched normal genome; 5) the event contains too many nonuniquely mapped reads (>1) around the predicted breakpoints in the matched normal genome; 6) the event contains too many nonuniquely mapped reads (>25%) around the predicted breakpoints in the matched normal genome was filtered; 7) cutoff for number of soft-clipped reads in normal bam file is 3; 8) minimum coverage for supporting discordant read pairs and split reads is 6; every discordant cluster must be supported by at least 3 read pair; every split cluster must be supported by at least 1 read pair. Should add the association in this part.
Meerkat
# Step1: Run Normal perl scripts/pre_process.pl -s 30 -k 1000 -q 15 -l 0 -b <normal.final.bam> -I <reference.fa> -A <reference.fa.fai> -W <bwa_path> -S <samtools_path> perl scripts/meerkat.pl -s 30 -d 4 -p 2 -m 0 -l 0 -b <normal.final.bam> -F <reference.fa> -W <bwa_path> -B <blastall_path> -S <samtools_path> # Step2: Run Tumor perl scripts/pre_process.pl -s 30 -k 1500 -l 0 -q 15 -b <tumor.final.bam> -I <reference.fa> -A <reference.fa.fai> -W <bwa_path> -S <samtools_path> perl scripts/meerkat.pl -s 30 -d 4 -p 2 -o 1 -m 0 -l 0 -b <tumor.final.bam> -F <reference.fa> -B <blastall_path> -S <samtools_path> perl scripts/mechanism.pl -b <tumor.final.bam> -R <rmsk.b38.txt> # Step3 ln -sf <normal.final.discord> <normal_discords_dir> perl scripts/somatic_sv.pl -i <tumor.final.variants> -o <tumor.finalsomatica.variants> -F <normal_discords_dir> -l 1000 -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomatica.variants> -o <tumor.finalsomaticb.variants> -n 1 -D 4 -Q 10 -I <normal.final.isinfo> -B <normal.final.bam> -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomaticb.variants> -o <tumor.finalsomaticc.variants> -u 1 -Q 10 -B <normal.final.bam> -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomaticc.variants> -o <tumor.finalsomaticd.variants> -f 1 -Q 10 -B <normal.final.bam> -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomaticd.variants> -o <tumor.finalsomatice.variants> -e 1 -D 4 -Q 10 -B <tumor.final.bam> -I <tumor.final.isinfo> -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomatice.variants> -o <tumor.finalsomaticf.variants> -z 1 -R <rmsk.b38.txt> perl scripts/somatic_sv.pl -S <samtools_path> -i <tumor.finalsomaticf.variants> -o <tumor.finalsomaticg.variants> -d 40 -t 20 -R <rmsk.b38.txt>
Somatic single nucleotide variants (SSNVs) were detected by MuTect. A minimum of 5 variants-containing reads and VAF>=0.04 in the tumor were required for mutation calling. In addition, we used a representative panel of 210 normal WGS Bams to build The Panel of Normals (PoN) and removed any mutation with a corresponding alternate allele appearing in more than 1 of the PoN samples. Somatic SNVs were called using MuTect v1.1.4 with default parameters. Further filtering was performed using the fpfilter.pl script with default parameters and a VAF threshold of 4%.
Somatic indels were detected using Strelka and Mutect2. Only indels agreed upon by both tools were retained. A minimum of 5 variant-containing reads and VAF>=0.04 in the tumor were required for mutation calling. Any indel appearing in more than 1 of the PoN samples was filtered.
ANNOVAR was used to annotate VCF (Variant Call Format). In order to ensure that no candidate driving mutations were mistakenly removed by the post-processing filtering, previously implicated cancer genes candidate mutations were manually reviewed using IGV.
muTect (Somatic SNV)
java -Xmx12G -Djava.io.tmpdir=<outdir> -jar muTect.jar \ -T MuTect \ -rf BadCigar \ -dt NONE \ --reference_sequence <reference.fa> \ --cosmic <cosmic.vcf> \ --dbsnp <dbsnp.vcf> \ --input_file:normal <normal.final.bam> \ --input_file:tumor <tumor.final.bam> \ --vcf <sample.muTect.call.vcf> \ --out <sample.muTect.call_stats.out> \ >/dev/null
muTect2 (Somatic SNV/INDEL)
java -Xmx12G -Djava.io.tmpdir=<outdir> -jar GenomeAnalysisTK.jar \ -T MuTect2 -rf BadCigar -dt NONE \ --reference_sequence <reference.fa> \ --input_file:normal <normal.final.bam> \ --input_file:tumor <tumor.final.bam> \ -o <sample.muTect2.call.vcf> \ --dbsnp <dbsnp.vcf> \ --cosmic <cosmic.vcf> \ >/dev/nullStrelka (Somatic SNV/INDEL)
configureStrelkaWorkflow.pl \ --tumor=<tumor.final.bam> \ --normal=<normal.final.bam> \ --ref=<reference.fa> \ --config=<strelka_config.ini> \ --output-dir=<outdir/build> && \ make -C <outdir/build> -j 8
We used funseq2 to detect recurrent noncoding mutations. The mutation data from 206 tumors were used as bed input files and the expression data from 139 pairs of RNA-seq was also used.
# SOFTWARE module load R/3.3.3 module load r-modules/3.1.2 # for DESeq module load bedtools/2.27.1 # for intersectBed fastaFromBed module load kentsrc/20170117 # for bigWigAverageOverBed module load htslib/1.6 # for tabix module load perl/5.20.3 # for funseq.pl module load perl-modules/5.20.3 # for Perl package Parallel::ForkManager funseq2.sh -f <tumor1.mutation.bed>, ... ,<tumorN.mutation.bed> \ -inf bed \ -nc \ -o <out_dir> \ -exp <expression matrix> \ -cls <sample class> \ -exf raw \ -p 24