# MicroSEC **Repository Path**: yym0924/MicroSEC ## Basic Information - **Project Name**: MicroSEC - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-01-26 - **Last Updated**: 2021-01-26 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README ![MicroSEC logo](MicroSEC_logo.png) # MicroSEC pipeline for FFPE artifacts This pipeline is designed for filtering sequence errors found in formalin-fixed and paraffin-embedded (FFPE) samples. This repository contains all the codes to regenerate results from our paper: "MicroSEC: Sequence error filtering pipeline for formalin-fixed and paraffin-embedded samples" ## Supplementary Code > M. Ikegami et al., "MicroSEC: Sequence error filtering pipeline for formalin-fixed and paraffin-embedded samples", in preparation. ## Contents - [Overview](#overview) - [Filtering detail](#filtering-detail) - [System Requirements](#system-requirements) - [Instructions for Use](#instructions-for-use) ## Overview This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file. ### File 1: mutation information file This excel file should contain at least these contents: ``` Sample Gene HGVS.c HGVS.p Mut_type Total_QV>=20 %Alt Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence Transition SL_1010-N6-B SLC25A24 _ _ 1-snv 366 1.0929 chr1 108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC C>T_t ``` - Notation: - Mut_type: [altered bases]-[mutation type (snv, ins, or del)]. 1-snv, 3-snv, 3-del, 1-ins, etc. - Total_QV>=20: The read number with total Q-value >=20. - Chr: chr1-22, chrX, and chrY for hg38. 1-22, X, and Y for hg19. - Pos/Ref/Alt: There are some difference from HGVS nomenclature as follows: - 131C>T -> Pos/Ref/Alt 131/C/T. - 514_515insC and 514A -> Pos/Ref/Alt 514/A/AC - 244delC and 243A -> Pos/Ref/Alt 243/AC/A - SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or not. - Neighborhood_sequence: [5'-20 bases] + [Alt sequence] + [3'-20 bases]. - Transition: 1-snv mutation pattern with a 3'-base. C>T_t represents CT to TT mutation. C>T_g_FFPE represents the possible FFPE artifact. - Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly. - Gene, HGVS.c, HGVS.p, Total_QV>=20, %Alt, SimpleRepeat_TRF, and Transition can be set to any values. ### File 2: BAM file This file should contain at least these contents: - QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL ### File 3: mutation supporting read ID information tsv file This file should contain at least these contents: ``` Chr Pos Ref Alt Mut_ID Mut chr1 2561609 T A _;ID001-1:579185f,ID004-1:1873933f;ID006-1:1131647f,ID001-1:570086f,ID008-1:1953407r,ID002-2:749570r .;A;N chr6 346429 G GACACACAC _;ID005-2:545593f,ID006-1:1132212f,ID006-1:1132212r;ID004-1:991476r,ID004-2:516896f;ID002-1:670026f .;.+ACACACAC;.-ACAC;N ``` - Mut_ID _;AAA,BBB,CCC;DDD,EEE,FFF represents the ID list corresponding to the Mut column separated by semicolons. Read ID is "QNAME in the BAM file" + "mapped strand information". Strand information can be extracted from the FLAG column in the BAM file. ID001-1:579185f represents that the read ID is ID001-1:579185 and the read was mapped to the forward strand. ID008-1:1953407r represents that the read ID is ID008-1:1953407 and the read was mapped to the reverse strand. - Mut ".;A;N" represents the mutation pattern: "reference base";"altered to A";"other alterations" ".;.+ACACACAC;.-ACAC;N" reapresents the mutation pattern: "reference base";"insertion ACACACAC";"deletion ACAC";"other alterations" ### File 4: sample information tsv file Seven or eight columns are necessary. The file contains no header. ``` [sample name] [mutation information excel file] [BAM file] [read ID information directory] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [reference genome] PC9 /mnt/source/CCLE.xlsx /mnt/source/PC9.bam /mnt/source/PC9_Cell_line 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT hg38 ``` - Reference genome: Human (hg38), Mouse (mm10), hg19, hg38, or mm10 ## Filtering detail This pipeline contains 8 filtering processes. - Filter 1 : Shorter-supporting lengths distribute too short to occur (1-1 and 1-2). - Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)). - Filter 1-2: The longest shorter-supporting lengths is shorter than 40% of the read length. - Filter 2 : Hairpin-structure induced error detection (2-1 or 2-2). - Filter 2-1: Palindromic sequences exist within 150 bases. - Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases. - Filter 3 : 3'-/5'-supporting lengths are too densely distributed to occur (3-1, 3-2, and 3-3). - Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)). - Filter 3-2: The distributions of 3'-/5'-supporting lengths are shorter than 80% of the read length. - Filter 3-3: <10% of bases are low quality (Quality score <18). - Filter 4 : >=15% mutations were called by chimeric reads comprising two distant regions. - Filter 5 : Mutations locating at simple repeat sequences. - Filter 6 : C>T_g false positive calls in FFPE samples. - Filter 7 : Indel mutations locating at a >=15 homopolymer. - Filter 8 : >=10% low quality bases in the mutation supporting reads. Supporting lengths are adjusted considering small repeat sequences around the mutations. Results are saved in a excel file. The explatation of the results is written in detail in the second sheet of the excel file. github url: https://github.com/MANO-B/MicroSEC ## System Requirements ### Hardware Requirements The scripts requires only a standard computer with enough RAM to support the operations defined by a user. For minimal performance, this will be a computer with about 32 GB of RAM. For optimal performance, we recommend a computer with the following specs: RAM: 128+ GB CPU: 4+ cores, 4.2+ GHz/core The runtimes below are generated using a computer with the recommended specs (128 GB RAM, 4 cores@4.2 GHz) and internet of speed 100 Mbps. ### Software Requirements ### R language This script files runs on `R` for Windows, Mac, or Linux, which requires the R version 3.4.0 or later. ### Package dependencies Users should install the following packages prior to use the scripts, from an `R` terminal: ``` if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } install.packages(c('tidyr', 'openxlsx', 'data.table', 'R.utils', 'stringr', 'magrittr', 'dplyr', 'gtools', 'devtools'), dependencies = TRUE) BiocManager::install(c("Rsamtools", "Biostrings", "GenomicAlignments", "GenomeInfoDb"), update=FALSE) # install necessary genomes BiocManager::install("BSgenome.Hsapiens.UCSC.hg38", update=FALSE) BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", update=FALSE) BiocManager::install("BSgenome.Mmusculus.UCSC.mm10", update=FALSE) ``` which will install in about 30 minutes on a recommended machine. ### Package Versions All packages are in their latest versions as they appear on `CRAN` on Oct. 31, 2020. Users can check [CRAN snapshot](https://mran.microsoft.com/timemachine/) for details. The versions of software are, specifically: ``` > packageVersion("tidyr") [1] '1.1.2' > packageVersion("openxlsx") [1] '4.2.2' > packageVersion("data.table") [1] '1.13.2' > packageVersion("R.utils") [1] '2.10.1' > packageVersion("stringr") [1] '1.4.0' > packageVersion("magrittr") [1] '1.5' > packageVersion("dplyr") [1] '1.0.2' > packageVersion("Biostrings") [1] '2.54.0' > packageVersion("BSgenome.Hsapiens.UCSC.hg38") [1] '1.4.1' > packageVersion("BSgenome.Hsapiens.UCSC.hg19") [1] '1.4.0' > packageVersion("BSgenome.Mmusculus.UCSC.mm10") [1] '1.4.0' > packageVersion("GenomicAlignments") [1] '1.22.1' > packageVersion("Rsamtools") [1] '2.0.3' > packageVersion("gtools") [1] '3.8.2' > packageVersion("devtools") [1] '2.3.2' > packageVersion("GenomeInfoDb") [1] '1.22.1' ``` ## Instructions for Use See also https://rdrr.io/cran/MicroSEC/ - How to install ``` # Stable version from CRAN install.packages("MicroSEC") # Developmental stable version from github (recommended) devtools::install_github("MANO-B/MicroSEC", upgrade="never", ref="v1.1.18") # Developmental unstable version from github (not recommended) devtools::install_github("MANO-B/MicroSEC", upgrade="never") ``` - How to use in command line ``` # download only once wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.R # Style 1 Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N] # Style 2 (for pipeline use) Rscript MicroSEC.R [output filename] [sample name] [mutation information tsv.gz] [BAM file] [read ID information directory] [read length] [adapter sequence read 1] [adapter sequence read 2] [sample type: Human or Mouse] ``` - Example ``` Rscript MicroSEC.R /mnt/HDD8TB/MicroSEC /mnt/HDD8TB/MicroSEC/source/Sample_list.txt Y Rscript MicroSEC.R /mnt/result/post_filter/SAMPLE.gz SAMPLE_NAME /mnt/result/mutation/SAMPLE.gz /mnt/result/BAM/SAMPLE.bam /mnt/result/ID 150 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT hg38 ``` - How to use in R Console ``` ## See also the vignette file. ## Necessary packages library(MicroSEC) options(show.error.messages = FALSE, warn = -1) wd = "/mnt/HDD8TB/MicroSEC" # set your working/output directory sample_list = "/mnt/HDD8TB/MicroSEC/source/Sample_list.txt" progress_bar = "Y" # load sample information tsv file sample_info = read.csv(sample_list, header = FALSE, stringsAsFactors = FALSE, sep = "\t") # initialize msec = NULL homology_search = NULL mut_depth = NULL for (sample in seq_len(dim(sample_info)[1])) { sample_name = sample_info[sample, 1] mutation_file = sample_info[sample, 2] bam_file = sample_info[sample, 3] read_list = sample_info[sample, 4] read_length = as.integer(sample_info[sample, 5]) adapter_1 = sample_info[sample, 6] if (sample_info[sample, 7] %in% c("Human", "Mouse", "hg19", "hg38", "mm10")) { adapter_2 = adapter_1 organism = sample_info[sample, 7] } else{ adapter_2 = sample_info[sample, 7] organism = sample_info[sample, 8] } # load mutation information df_mutation = fun_load_mutation(mutation_file, sample_name) df_bam = fun_load_bam(bam_file) df_mut_call = fun_load_id(read_list) # load genomic sequence ref_genome = fun_load_genome(organism) chr_no = fun_load_chr_no(organism) # analysis result = fun_read_check(df_mutation = df_mutation, df_bam = df_bam, df_mut_call = df_mut_call, ref_genome = ref_genome, sample_name = sample_name, read_length = read_length, adapter_1 = adapter_1, adapter_2 = adapter_2, short_homology_search_length = 4, progress_bar = progress_bar) msec = rbind(msec, result[[1]]) homology_search = rbind(homology_search, result[[2]]) mut_depth = rbind(mut_depth, result[[3]]) } # search homologous sequences msec = fun_homology(msec, homology_search, min_homology_search = 40, ref_genome, chr_no, progress_bar = progress_bar) # statistical analysis msec = fun_summary(msec) msec = fun_analysis(msec, mut_depth, short_homology_search_length = 4, min_homology_search = 40, threshold_p = 10 ^ (-6), threshold_hairpin_ratio = 0.50, threshold_short_length = 0.75, threshold_distant_homology = 0.15, threshold_low_quality_rate = 0.1, homopolymer_length = 15) # save the results fun_save(msec, sample_info[1,1], wd) ``` - Output files one excel file and one tsv file are saved in the working/output directory. ``` MicroSEC-result_[Sample name]_[Date].xlsx MicroSEC_[Sample name].tsv ``` like following: ``` MicroSEC-result_PC9_2020-11-05.xlsx MicroSEC_PC9.tsv ``` - Sample name is set to the sample of interest in the Sample column of the mutation information file. - Confirm the read length in the platform. - Confirm the adapter sequence; Todai Onco Panel ("AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" and "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"). - If you input only one adapter sequence, the sequence will be used for read 1 and read 2. - If you want to know the progress visually, [progress bar Y/N] should be Y. ### Conversion of mutation information files MicroSEC utilizes a mutation information file. The format of the file is written differently from common HGVS nomenclature. A conversion function is prepared. ``` # Conversion example # Set column names correctly. # # Before (hg19) # Sample Hugo_Symbol Chr Start_Position End_Position Variant_Type Reference Tumor_Seq Protein_Change # LS411N BRAF 7 140453136 140453136 SNP A T p.V600E # LS411N CELSR1 22 46931227 46931227 DEL C - p.G615fs # LS411N APC 5 112175951 112175952 INS - A p.E1554fs # # After # Sample Gene Chr Pos Mut_type Ref Alt HGVS.p Neighborhood_sequence # LS411N BRAF 7 140453136 1-snv A T p.V600E ACCCACTCCATCGAGATTTCTCTGTAGCTAGACCAAAATCA # LS411N CELSR1 22 46931226 1-del GC G p.G615fs TCTTAGGCCCAGCGCTGCCGCCCCCAGAAAGGTGGAGGCC # LS411N APC 5 112175951 1-ins G GA p.E1554fs AAAACCAAGAGAAAGAGGCAGAAAAAAACTATTGATTCTGAA library(openxlsx) library(MicroSEC) df_mutation = fun_convert( system.file("extdata", "convert_test.xlsx", package = "MicroSEC"), "hg19" ) write.xlsx(df_mutation, "/mnt/HDD8TB/MicroSEC/mutation_modified.xlsx") ``` ### Reproducibility The source code is available in MicroSEC.R. Source data will be available at the Japanese Genotype-Phenotype Archive (http://trace.ddbj.nig.ac.jp/jga), which is hosted by the DNA Data Bank of Japan (the accession number is written in our paper). Each filtering process takes about 5–120 minutes per sample on a recommended machine, according to the depth and mutation amount of the sample.