Pipeline design

This page describes the computational logic at a conceptual level.

Key components

The published HmtG-PacBio pipeline concept combines:

  • Multiple sequence alignment (MSA) to place reads into a comparable coordinate system

  • A machine-learning representation step (modified variational autoencoders)

  • Density-based clustering (DBSCAN) to group reads into OTUs/haplotypes

The implementation in this repository follows that spirit, with a pragmatic script-first structure.

Step-by-step (conceptual)

  1. Read QC + subsampling
     • Filter reads by mean Q score
     • Randomly subsample a fixed number of reads (to keep runtimes predictable)
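The QC and subsampling step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the quality threshold, and the sample size are all placeholders, and reads are assumed to arrive as (sequence, per-base Phred scores) pairs.

```python
import random

def qc_and_subsample(reads, min_mean_q=20, n_subsample=500, seed=42):
    """Keep reads whose mean Phred quality meets the cutoff, then
    randomly subsample a fixed number to keep runtimes predictable.

    reads: list of (sequence, phred_scores) tuples.
    min_mean_q / n_subsample are illustrative defaults.
    """
    passed = [(seq, quals) for seq, quals in reads
              if quals and sum(quals) / len(quals) >= min_mean_q]
    # Fixed seed so the same input yields the same subsample.
    rng = random.Random(seed)
    if len(passed) <= n_subsample:
        return passed
    return rng.sample(passed, n_subsample)
```

Seeding the sampler makes reruns reproducible, which matters when the subsample feeds downstream clustering.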

  2. MSA
     • Convert subsampled reads to FASTA
     • Run MAFFT with nucleotide orientation adjustment enabled
     • Save the raw MSA for auditability
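A sketch of the MAFFT call, assuming MAFFT is on the PATH: `--adjustdirection` is MAFFT's flag for automatic strand orientation, and MAFFT writes the alignment to stdout, so capturing stdout to a file preserves the raw MSA. File names and thread count are illustrative.

```python
import subprocess

def mafft_command(fasta_in, threads=4):
    """Build a MAFFT call; --adjustdirection flips reads that are in
    reverse-complement orientation before aligning."""
    return ["mafft", "--adjustdirection", "--thread", str(threads),
            fasta_in]

def run_mafft(fasta_in, msa_out):
    # MAFFT emits the alignment on stdout; redirect it into the raw
    # MSA file so the alignment can be audited later.
    with open(msa_out, "w") as out:
        subprocess.run(mafft_command(fasta_in), stdout=out, check=True)
```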

  3. Haplotype inference
     • Transform aligned reads into a feature representation
     • Cluster reads into OTUs/haplotypes
     • Summarize each haplotype and write FASTA outputs
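The inference step can be sketched as below. Note the simplification: the published pipeline uses a modified variational autoencoder for the representation, while this sketch substitutes a plain one-hot encoding of the aligned reads to keep the example self-contained. The DBSCAN parameters are illustrative and need tuning per marker.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def one_hot(aligned_reads, alphabet="ACGT-"):
    """Flatten each aligned read into a one-hot vector. A simple
    stand-in for the pipeline's learned (VAE) representation."""
    index = {c: i for i, c in enumerate(alphabet)}
    n, L, k = len(aligned_reads), len(aligned_reads[0]), len(alphabet)
    X = np.zeros((n, L * k))
    for r, read in enumerate(aligned_reads):
        for p, base in enumerate(read.upper()):
            # Unknown characters (e.g. N) fall into the gap column.
            X[r, p * k + index.get(base, k - 1)] = 1.0
    return X

def cluster_reads(aligned_reads, eps=3.0, min_samples=5):
    """Group reads into OTUs/haplotypes; label -1 marks noise reads."""
    X = one_hot(aligned_reads)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```

Reads labeled -1 by DBSCAN are noise and would be excluded from the per-haplotype summaries and FASTA outputs.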

  4. Optional: local BLAST
     • If BLAST+ and a local database are present, search haplotypes against the reference set
     • Include results in the PDF/JSON report
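A sketch of the optional BLAST step, assuming BLAST+ is installed and a nucleotide database has been built with `makeblastdb`. Tabular output (`-outfmt 6`) is easy to fold into a JSON/PDF report; file names and the hit limit are illustrative.

```python
import shutil
import subprocess

def blastn_command(query_fasta, db_path, out_tsv, max_hits=5):
    """Build a BLAST+ blastn call against a local reference DB."""
    return ["blastn", "-query", query_fasta, "-db", db_path,
            "-outfmt", "6", "-max_target_seqs", str(max_hits),
            "-out", out_tsv]

def blast_if_available(query_fasta, db_path, out_tsv):
    # The BLAST step is optional: skip quietly when blastn is absent.
    if shutil.which("blastn") is None:
        return False
    subprocess.run(blastn_command(query_fasta, db_path, out_tsv),
                   check=True)
    return True
```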

  5. Distance matrix
     • Compute all-vs-all distances between inferred haplotypes
     • Write a TSV table (and summarize in the PDF)
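The distance step can be sketched with an uncorrected p-distance over aligned haplotypes; the metric, function names, and TSV layout here are assumptions for illustration, not necessarily the pipeline's choices.

```python
def p_distance(a, b):
    """Fraction of mismatching positions between two aligned
    sequences, ignoring positions where either has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def write_distance_tsv(haplotypes, path):
    """haplotypes: dict mapping name -> aligned sequence.
    Writes an all-vs-all distance matrix as TSV."""
    names = sorted(haplotypes)
    with open(path, "w") as fh:
        fh.write("\t".join([""] + names) + "\n")
        for a in names:
            row = [f"{p_distance(haplotypes[a], haplotypes[b]):.4f}"
                   for b in names]
            fh.write("\t".join([a] + row) + "\n")
```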

Why the pipeline is structured this way

  • Long reads that span the full amplicon reduce the need for assembly and help avoid chimeric consensus artifacts.

  • Subsampling makes it feasible to run many samples and still get informative haplotype sets.

  • A local reference DB keeps control in the researcher’s hands (you can include unpublished or curated sequences).