The ProDuSe Pipeline
The ProDuSe pipeline consists of multiple stages, each of which is described briefly here.
Trims the barcode sequence off each read and stores it in a FASTQ comment. Any reads where the barcode deviates significantly from the expected degenerate range are discarded.
- Input: Paired FASTQ files
- Output: Trimmed paired FASTQ files
Trim can be used to demultiplex samples, assuming the barcodes used in each sample are sufficiently distinct.
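The trimming step can be illustrated with a minimal Python sketch. It is not ProDuSe's implementation: the function names, the `BC:Z:` comment format, the `max_mismatch` threshold, and the example pattern `NNWS` are all assumptions made for illustration, and the sketch handles a single read rather than a pair.

```python
# IUPAC degenerate codes mapped to the nucleotides each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(barcode, pattern):
    """Count barcode bases not permitted by the degenerate pattern.

    An uncalled base ("N") in the read always counts as a mismatch.
    """
    return sum(base not in IUPAC[code] for base, code in zip(barcode, pattern))


def trim_read(name, seq, qual, pattern, max_mismatch=1):
    """Trim the leading barcode from one read (hypothetical helper).

    Returns (name_with_comment, trimmed_seq, trimmed_qual), or None when
    the read is too short or its barcode deviates too far from the pattern.
    """
    n = len(pattern)
    if len(seq) <= n:
        return None
    barcode = seq[:n]
    if count_mismatches(barcode, pattern) > max_mismatch:
        return None
    # The comment format below is an assumption, not ProDuSe's actual tag.
    return f"{name} BC:Z:{barcode}", seq[n:], qual[n:]


kept = trim_read("read1", "ACAGGGTTTA", "IIIIIIIIII", "NNWS")
dropped = trim_read("read2", "ACGTGGTTTA", "IIIIIIIIII", "NNWS")
```

Here `kept` carries the barcode `ACAG` in its comment, while `dropped` is `None` because two barcode positions fall outside the pattern.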
Maps provided reads to a reference genome using the Burrows-Wheeler Aligner (mem algorithm). The resulting SAM file is converted into a BAM file and sorted, with the FASTQ comment stored as a read tag.
- Command: bwa mem <reference> <trimmed_fastq.R1.fastq> <trimmed_fastq.R2.fastq> | samtools view -b | samtools sort > out.trim.bam
Collapses duplicate reads into a consensus sequence. In addition, reads that are in “duplex” (i.e. originate from the same parental molecule) are flagged here.
- Input: Trimmed BAM file
- Output: Collapsed BAM file
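The text above does not say how the consensus is computed, so the following is only a sketch of one common approach: a per-position majority vote across a family of duplicate reads. The `consensus` and `collapse` functions, the family key, and the tie rule (emit `N`) are assumptions, and duplex flagging is not covered.

```python
from collections import Counter


def consensus(reads):
    """Majority-vote consensus over equal-length reads; ties become 'N'."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch expects reads of equal length")
    bases = []
    for column in zip(*reads):
        ranked = Counter(column).most_common(2)
        # If the two most frequent bases are equally common, the call is ambiguous.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            bases.append("N")
        else:
            bases.append(ranked[0][0])
    return "".join(bases)


def collapse(families):
    """Collapse each family of duplicates into a single consensus sequence.

    `families` maps a hypothetical (barcode, start position) key to the
    list of read sequences that share it.
    """
    return {key: consensus(seqs) for key, seqs in families.items()}


families = {("ACAG", 100): ["ACGT", "ACGT", "ACTT"], ("TTGC", 250): ["AC", "AT"]}
collapsed = collapse(families)
```

In this example the first family collapses to `ACGT`, and the second to `AN` because its two reads disagree at the last position with no majority.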
Identifies bases that overlap between each read pair, and generates a consensus from the overlap. This consensus is then assigned to only one read in the read pair, thus removing overlapping bases.
- Input: Collapsed BAM file
- Output: Clipped BAM file
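As a rough illustration of overlap clipping, the sketch below models each read as an ungapped alignment given by a start coordinate, a sequence, and per-base qualities. The `clip_overlap` function is hypothetical, and so are its rules: disagreements are resolved in favour of the higher-quality base (ties go to read 1), the consensus is kept on read 1, and read 2 is clipped. Real alignments with indels or soft clipping would need CIGAR-aware handling.

```python
def clip_overlap(start1, seq1, quals1, start2, seq2, quals2):
    """Merge the overlap of a read pair onto read 1 and clip it from read 2.

    Returns ((start1, new_seq1), (new_start2, new_seq2)).
    Assumes ungapped alignments and that read 1 starts and ends no later
    than read 2.
    """
    end1 = start1 + len(seq1)
    end2 = start2 + len(seq2)
    if not (start1 <= start2 and end1 <= end2):
        raise ValueError("sketch expects read 1 to start and end no later than read 2")
    overlap = end1 - start2
    if overlap <= 0:
        # The reads do not overlap, so nothing needs clipping.
        return (start1, seq1), (start2, seq2)
    # Pair the tail of read 1 with the head of read 2, position by position.
    pairs = zip(seq1[-overlap:], quals1[-overlap:], seq2[:overlap], quals2[:overlap])
    merged = "".join(b1 if q1 >= q2 else b2 for b1, q1, b2, q2 in pairs)
    new_seq1 = seq1[:-overlap] + merged
    new_seq2 = seq2[overlap:]
    return (start1, new_seq1), (start2 + overlap, new_seq2)


read1, read2 = clip_overlap(
    100, "ACGTAC", [30, 30, 30, 30, 30, 30],
    103, "TTCGGA", [20, 40, 20, 30, 30, 30],
)
```

After clipping, the two reads cover adjacent, non-overlapping intervals, so each reference position in the former overlap is counted only once.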