Pipeline design
===============

This page describes the computational logic at a conceptual level.

Key components
--------------

The published HmtG-PacBio pipeline concept combines:

- Multiple sequence alignment (MSA) to place reads into a comparable coordinate system
- A machine-learning representation step (modified variational autoencoders)
- Density-based clustering (DBSCAN) to group reads into OTUs / haplotypes

The implementation in this repository follows that spirit, with a pragmatic script-first structure.

Step-by-step (conceptual)
-------------------------

1. **Read QC + subsampling**
   - Filter reads by mean Q score
   - Randomly subsample a fixed number of reads (to keep runtimes predictable)
2. **MSA**
   - Convert the subsampled reads to FASTA
   - Run MAFFT with nucleotide orientation adjustment enabled
   - Save the raw MSA for auditability
3. **Haplotype inference**
   - Transform aligned reads into a feature representation
   - Cluster reads into OTUs/haplotypes
   - Summarize each haplotype and write FASTA outputs
4. **Optional: local BLAST**
   - If BLAST+ and a local database are present, search haplotypes against the local reference set
   - Include the results in the PDF/JSON report
5. **Distance matrix**
   - Compute all-vs-all distances between the inferred haplotypes
   - Write a TSV table (and summarize it in the PDF)

Why the pipeline is structured this way
---------------------------------------

- Long reads that span the full amplicon reduce the need for assembly and help avoid chimeric consensus artifacts.
- Subsampling makes it feasible to run many samples and still get informative haplotype sets.
- A local reference DB keeps control in the researcher's hands (you can include unpublished or curated sequences).
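Step 1 (read QC + subsampling) can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names, the Q-score threshold, and the subsample size are all assumptions, and Phred+33 quality encoding is assumed for the quality strings.

```python
import random

def mean_q(qual: str) -> float:
    """Mean Phred quality of a read, assuming Phred+33 ASCII encoding."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def qc_subsample(reads, min_mean_q=20.0, n_keep=500, seed=42):
    """Keep reads whose mean Q passes the threshold, then randomly
    subsample a fixed number so runtimes stay predictable.

    `reads` is a list of (sequence, quality-string) pairs; the threshold
    and sample size shown here are illustrative defaults.
    """
    passed = [(s, q) for s, q in reads if mean_q(q) >= min_mean_q]
    rng = random.Random(seed)  # fixed seed -> reproducible subsample
    if len(passed) <= n_keep:
        return passed
    return rng.sample(passed, n_keep)
```

Fixing the random seed makes the subsample reproducible across reruns, which matters when the report is meant to be auditable.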
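Step 2 (MSA) could look like the sketch below. The MAFFT flag `--adjustdirection` is the real option that enables nucleotide orientation adjustment; the helper names, file paths, and thread count are illustrative assumptions, not the repository's actual code.

```python
import subprocess

def write_fasta(records, path):
    """Write (name, sequence) pairs as a FASTA file."""
    with open(path, "w") as fh:
        for name, seq in records:
            fh.write(f">{name}\n{seq}\n")

def mafft_command(in_fasta, threads=4):
    """Build the MAFFT command line; the alignment is emitted on stdout.
    --adjustdirection reorients reads that sequenced the reverse strand."""
    return ["mafft", "--adjustdirection", "--thread", str(threads), in_fasta]

def run_msa(records, in_fasta="reads.fasta", out_fasta="msa.fasta"):
    """Convert subsampled reads to FASTA, align with MAFFT, and keep
    the raw MSA on disk for auditability."""
    write_fasta(records, in_fasta)
    with open(out_fasta, "w") as out:
        subprocess.run(mafft_command(in_fasta), stdout=out, check=True)
```

Saving MAFFT's raw output (rather than only a post-processed form) is what makes the alignment step auditable after the fact.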
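For step 3 (haplotype inference), the published concept clusters a learned representation with DBSCAN. The sketch below is a deliberately minimal stand-in: it runs a from-scratch DBSCAN directly on Hamming distances between aligned reads, omitting the variational-autoencoder representation step, and all parameter values are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching columns between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def dbscan(seqs, eps=1, min_pts=2):
    """Label each aligned read with a cluster id; -1 marks noise."""
    def neighbors(i):
        return [j for j in range(len(seqs)) if hamming(seqs[i], seqs[j]) <= eps]

    labels = [None] * len(seqs)
    cluster = 0
    for i in range(len(seqs)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may become a border point later)
            continue
        labels[i] = cluster
        stack = [j for j in nbrs if j != i]
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is a core point: expand the cluster
                stack.extend(jn)
        cluster += 1
    return labels
```

Each resulting cluster would then be summarized (e.g. by a consensus over its member reads) and written out as a haplotype FASTA record; noise reads (label -1) are left out of the haplotype set.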
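For the optional step 4, a local BLAST search might be assembled like this. The `blastn` options shown (`-query`, `-db`, `-out`, `-outfmt 6`, `-num_threads`) are real BLAST+ flags; the function name and default paths are illustrative assumptions.

```python
def blastn_command(query_fasta, db, out_tsv, threads=4):
    """Tabular (outfmt 6) blastn search of the inferred haplotypes
    against a local reference database built with makeblastdb."""
    return [
        "blastn",
        "-query", query_fasta,   # haplotype FASTA from step 3
        "-db", db,               # local, possibly curated, reference DB
        "-out", out_tsv,
        "-outfmt", "6",          # tab-separated hits, easy to fold into a report
        "-num_threads", str(threads),
    ]
```

Because the database is local, the search works offline and can include unpublished or curated reference sequences; the tabular output is straightforward to merge into the PDF/JSON report.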
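Step 5 (distance matrix) reduces to an all-vs-all comparison plus a TSV writer. A minimal sketch, assuming equal-length haplotype sequences and Hamming distance as the metric (the actual pipeline may use a different distance); function names are illustrative.

```python
def distance_matrix(haplotypes):
    """Symmetric all-vs-all distances over a {name: sequence} dict,
    assuming equal-length (aligned) haplotype sequences."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    names = list(haplotypes)
    return [[hamming(haplotypes[a], haplotypes[b]) for b in names] for a in names]

def write_distance_tsv(haplotypes, path):
    """Write the matrix as a TSV table with haplotype names on both axes."""
    names = list(haplotypes)
    mat = distance_matrix(haplotypes)
    with open(path, "w") as fh:
        fh.write("\t".join([""] + names) + "\n")  # header row
        for name, row in zip(names, mat):
            fh.write("\t".join([name] + [str(d) for d in row]) + "\n")
```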