Pipeline design

This page describes the computational logic at a conceptual level.

Key components

The published HmtG-PacBio pipeline concept combines:

  • Multiple sequence alignment (MSA) to place reads into a comparable coordinate system

  • A machine-learning representation step (modified variational autoencoders)

  • Density-based clustering (DBSCAN) to group reads into OTUs/haplotypes

The implementation in this repository follows that spirit, with a pragmatic script-first structure.

Step-by-step (conceptual)

  1. Read QC + subsampling
     • Filter reads by mean Q score
     • Randomly subsample a fixed number of reads (to keep runtimes predictable)
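The QC and subsampling step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the quality threshold, and the sample size are all placeholders, and reads are assumed to arrive as (sequence, per-base Phred scores) pairs.

```python
import random

def qc_and_subsample(reads, min_mean_q=20, n_subsample=500, seed=42):
    """Keep reads whose mean Phred quality meets the cutoff, then
    randomly subsample a fixed number to keep runtimes predictable.

    reads: list of (sequence, phred_scores) tuples.
    min_mean_q / n_subsample are illustrative defaults.
    """
    passed = [(seq, quals) for seq, quals in reads
              if quals and sum(quals) / len(quals) >= min_mean_q]
    # Fixed seed so the same input yields the same subsample.
    rng = random.Random(seed)
    if len(passed) <= n_subsample:
        return passed
    return rng.sample(passed, n_subsample)
```

Seeding the sampler makes reruns reproducible, which matters when the subsample feeds downstream clustering.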

  2. MSA
     • Convert subsampled reads to FASTA
     • Run MAFFT with nucleotide orientation adjustment enabled
     • Save the raw MSA for auditability
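A sketch of the MAFFT call, assuming MAFFT is on the PATH: `--adjustdirection` is MAFFT's flag for automatic strand orientation, and MAFFT writes the alignment to stdout, so capturing stdout to a file preserves the raw MSA. File names and thread count are illustrative.

```python
import subprocess

def mafft_command(fasta_in, threads=4):
    """Build a MAFFT call; --adjustdirection flips reads that are in
    reverse-complement orientation before aligning."""
    return ["mafft", "--adjustdirection", "--thread", str(threads),
            fasta_in]

def run_mafft(fasta_in, msa_out):
    # MAFFT emits the alignment on stdout; redirect it into the raw
    # MSA file so the alignment can be audited later.
    with open(msa_out, "w") as out:
        subprocess.run(mafft_command(fasta_in), stdout=out, check=True)
```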

  3. Haplotype inference
     • Transform aligned reads into a feature representation
     • Cluster reads into OTUs/haplotypes
     • Summarize each haplotype and write FASTA outputs
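The inference step can be sketched as below. Note the simplification: the published pipeline uses a modified variational autoencoder for the representation, while this sketch substitutes a plain one-hot encoding of the aligned reads to keep the example self-contained. The DBSCAN parameters are illustrative and need tuning per marker.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def one_hot(aligned_reads, alphabet="ACGT-"):
    """Flatten each aligned read into a one-hot vector. A simple
    stand-in for the pipeline's learned (VAE) representation."""
    index = {c: i for i, c in enumerate(alphabet)}
    n, L, k = len(aligned_reads), len(aligned_reads[0]), len(alphabet)
    X = np.zeros((n, L * k))
    for r, read in enumerate(aligned_reads):
        for p, base in enumerate(read.upper()):
            # Unknown characters (e.g. N) fall into the gap column.
            X[r, p * k + index.get(base, k - 1)] = 1.0
    return X

def cluster_reads(aligned_reads, eps=3.0, min_samples=5):
    """Group reads into OTUs/haplotypes; label -1 marks noise reads."""
    X = one_hot(aligned_reads)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```

Reads labeled -1 by DBSCAN are noise and would be excluded from the per-haplotype summaries and FASTA outputs.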

  4. Optional: local BLAST
     • If BLAST+ and a local database are present, search haplotypes against the reference set
     • Include results in the PDF/JSON report
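A sketch of the optional BLAST step, assuming BLAST+ is installed and a nucleotide database has been built with `makeblastdb`. Tabular output (`-outfmt 6`) is easy to fold into a JSON/PDF report; file names and the hit limit are illustrative.

```python
import shutil
import subprocess

def blastn_command(query_fasta, db_path, out_tsv, max_hits=5):
    """Build a BLAST+ blastn call against a local reference DB."""
    return ["blastn", "-query", query_fasta, "-db", db_path,
            "-outfmt", "6", "-max_target_seqs", str(max_hits),
            "-out", out_tsv]

def blast_if_available(query_fasta, db_path, out_tsv):
    # The BLAST step is optional: skip quietly when blastn is absent.
    if shutil.which("blastn") is None:
        return False
    subprocess.run(blastn_command(query_fasta, db_path, out_tsv),
                   check=True)
    return True
```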

  5. Distance matrix
     • Compute all-vs-all distances between inferred haplotypes
     • Write a TSV table (and summarize in the PDF)
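The distance step can be sketched with an uncorrected p-distance over aligned haplotypes; the metric, function names, and TSV layout here are assumptions for illustration, not necessarily the pipeline's choices.

```python
def p_distance(a, b):
    """Fraction of mismatching positions between two aligned
    sequences, ignoring positions where either has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def write_distance_tsv(haplotypes, path):
    """haplotypes: dict mapping name -> aligned sequence.
    Writes an all-vs-all distance matrix as TSV."""
    names = sorted(haplotypes)
    with open(path, "w") as fh:
        fh.write("\t".join([""] + names) + "\n")
        for a in names:
            row = [f"{p_distance(haplotypes[a], haplotypes[b]):.4f}"
                   for b in names]
            fh.write("\t".join([a] + row) + "\n")
```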

Why the pipeline is structured this way

  • Long reads that span the full amplicon reduce the need for assembly and help avoid chimeric consensus artifacts.

  • Subsampling makes it feasible to run many samples and still get informative haplotype sets.

  • A local reference DB keeps control in the researcher’s hands (you can include unpublished or curated sequences).