antonmry/FastFiRSt

FastFiRSt

FastFiRSt ports two DNA sequencing utilities to Rust so they can handle large datasets more efficiently. FLASH merges paired-end reads into longer fragments, while PERF identifies microsatellites in DNA sequences, both critical steps in genome assembly workflows.

Genomics workloads routinely process billions of bases, so every percentage point of performance improvement shortens total runtimes, cuts infrastructure costs, and reduces the carbon impact of large compute clusters. By consolidating the tools into optimized Rust binaries and streamlining their data pipelines, the project aims to deliver faster science with a lighter environmental footprint.

Rust combines C-like performance with memory safety, fearless concurrency, and rich tooling, making it ideal for rewriting high-throughput bioinformatics software that previously relied on native extensions or manual memory management. The resulting binaries are portable and efficient across CLI, data processing backends, and WebAssembly targets.

FastFiRSt also adopts the Apache Arrow ecosystem to keep data in a vectorized, columnar format from ingestion to analytics. Arrow's in-memory layout powers DataFusion, a Rust-native SQL query engine embedded in this workspace, and Ballista, its distributed execution layer. Together they enable the project to scale from local experiments to cluster-sized runs with minimal code changes, while reusing the same kernels for both batch analytics and interactive workflows.

The repository also reimagines the Hadoop-based BigFiRSt data pipeline that orchestrates both tools. The resulting improvements to that pipeline are documented in the paper "BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data".
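To make the "simple sequence repeat" (microsatellite) idea concrete, here is a minimal sketch of tandem-repeat detection for a single motif length. It is not the PERF algorithm and not code from this repository: the `Repeat` struct, the `find_repeats` function, and its parameters are hypothetical names chosen for illustration, and the sketch does not filter out motifs that are themselves repeats of a shorter unit (for example `AA` in a run of `A`).

```rust
/// One tandem repeat found in a sequence (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Repeat {
    start: usize,  // 0-based start offset, inclusive
    end: usize,    // 0-based end offset, exclusive
    motif: String, // the repeated unit, e.g. "AT"
    copies: usize, // number of consecutive copies
}

/// Scan `seq` for runs of at least `min_copies` back-to-back copies of any
/// motif of exactly `motif_len` bases.
fn find_repeats(seq: &str, motif_len: usize, min_copies: usize) -> Vec<Repeat> {
    let s = seq.as_bytes();
    let mut found = Vec::new();
    if motif_len == 0 {
        return found;
    }
    let mut i = 0;
    while i + motif_len <= s.len() {
        let motif = &s[i..i + motif_len];
        // Count how many times the motif repeats immediately after itself.
        let mut copies = 1;
        while i + (copies + 1) * motif_len <= s.len()
            && &s[i + copies * motif_len..i + (copies + 1) * motif_len] == motif
        {
            copies += 1;
        }
        if copies >= min_copies {
            found.push(Repeat {
                start: i,
                end: i + copies * motif_len,
                motif: String::from_utf8_lossy(motif).into_owned(),
                copies,
            });
            // Skip past the whole run so it is reported only once.
            i += copies * motif_len;
        } else {
            i += 1;
        }
    }
    found
}
```

Calling `find_repeats("GGATATATATCC", 2, 3)` reports a single `AT` repeat with four copies spanning offsets 2 to 10. A full tool would run this for every motif length of interest and apply per-length minimum thresholds.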

Project goals include:

  • Porting both tools to Rust for improved performance (see the related paper for details).
  • Applying Apache Arrow Ballista optimizations to the end-to-end pipeline.
  • Exposing the tools through WebAssembly so they remain easy to use without dedicated infrastructure.

Workspace structure

  • flash-lib: Core library crate exposing the merge algorithm.
  • flash-cli: Thin CLI wrapper that matches the original FLASH command-line flags and writes the three FASTQ outputs.
  • flash-df: Provides scaffolding for running the merge pipeline through DataFusion/Ballista (feature-gated stubs for now).
  • flash-wasm: Minimal WebAssembly interface exposing the FLASH merge logic for the playground UI.
  • fastq-gen-cli: Synthetic FASTQ generator for producing large paired-end datasets to feed the pipeline.
  • wasm-playground: Vite/Mantine web playground that can execute SQL queries against DataFusion and run FLASH locally in the browser.

Installing the CLI

The flash-cli binary is published on crates.io. You can install it with:

cargo install flash-cli

Requirements

  • Rust toolchain (cargo 1.89+)

Build

cargo build --release --workspace

Usage

cargo run --release --bin flash-cli -- READ1.fq READ2.fq \
  --output-dir output_dir [--output-prefix out]

Outputs are written to <prefix>.extendedFrags.fastq, <prefix>.notCombined_1.fastq, and <prefix>.notCombined_2.fastq in the given directory. Optional parameters default to the FLASH values; run cargo run --bin flash-cli -- --help for the full list.
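Whether a pair ends up in extendedFrags or in the two notCombined files depends on whether the reads overlap well enough to be merged. The following is a simplified sketch of that decision, not the flash-lib implementation: it ignores quality scores and "outie" orientations, and the names `merge_pair`, `min_overlap`, and `max_mismatch_ratio` are assumptions made for this example.

```rust
/// Reverse-complement a DNA sequence (A<->T, C<->G; anything else becomes N).
fn reverse_complement(seq: &str) -> String {
    seq.chars()
        .rev()
        .map(|c| match c {
            'A' => 'T',
            'T' => 'A',
            'C' => 'G',
            'G' => 'C',
            _ => 'N',
        })
        .collect()
}

/// Try to merge a forward read with its mate.
/// Returns the merged fragment, or `None` when no overlap qualifies
/// (such a pair would go to the not-combined outputs).
fn merge_pair(
    read1: &str,
    read2: &str,
    min_overlap: usize,
    max_mismatch_ratio: f64,
) -> Option<String> {
    let r1 = read1.as_bytes();
    let r2_rc = reverse_complement(read2);
    let r2 = r2_rc.as_bytes();
    let longest = r1.len().min(r2.len());

    // Compare the tail of read 1 with the head of the reverse-complemented
    // read 2 for every candidate overlap length, keeping the overlap with
    // the lowest mismatch ratio (ties go to the longer overlap).
    let mut best: Option<(usize, f64)> = None;
    for overlap in (min_overlap..=longest).rev() {
        let tail = &r1[r1.len() - overlap..];
        let head = &r2[..overlap];
        let mismatches = tail.iter().zip(head).filter(|(a, b)| a != b).count();
        let ratio = mismatches as f64 / overlap as f64;
        if ratio <= max_mismatch_ratio && best.map_or(true, |(_, r)| ratio < r) {
            best = Some((overlap, ratio));
        }
    }

    // Merged fragment = read 1 followed by the non-overlapping part of read 2.
    best.map(|(overlap, _)| format!("{}{}", read1, &r2_rc[overlap..]))
}
```

For example, `merge_pair("ACGTACGTAA", "AAGGCCTTAC", 4, 0.0)` finds a clean 4-base overlap and returns `Some("ACGTACGTAAGGCCTT")`, while two reads with no qualifying overlap return `None`.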

FASTQ generator CLI

The fastq-gen-cli crate produces synthetic paired FASTQ files with tunable record counts, read length, and output names. This is useful for benchmarking and integration testing without sharing private sequencing data.

# Generate 1M paired reads with a 150bp read length
cargo run -p fastq-gen-cli --release -- \
  --num-sequences 1000000 \
  --output-r1 synthetic_R1.fastq \
  --output-r2 synthetic_R2.fastq

Defaults write to generated_R1.fastq / generated_R2.fastq and use a 150bp read length:

cargo run -p fastq-gen-cli -- -n 1000
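For readers unfamiliar with the file format, the sketch below shows one way a synthetic paired-end generator can work: draw a random fragment, take the forward read from its start, and take the reverse read as the reverse complement of its end. This is an illustration under stated assumptions, not the fastq-gen-cli source. The `Lcg` generator, the `write_pairs` function, the `synthetic_N` read names, and the constant quality string are all hypothetical choices that keep the example free of external crates.

```rust
use std::io::{self, Write};

/// Tiny deterministic pseudo-random generator (linear congruential),
/// used only so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_base(&mut self) -> char {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The top two bits select one of the four bases.
        ['A', 'C', 'G', 'T'][(self.0 >> 62) as usize]
    }
}

fn complement(base: char) -> char {
    match base {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        _ => 'N',
    }
}

/// Write `pairs` four-line FASTQ records to each output.
fn write_pairs<W1: Write, W2: Write>(
    r1: &mut W1,
    r2: &mut W2,
    pairs: usize,
    read_len: usize,
    fragment_len: usize,
    seed: u64,
) -> io::Result<()> {
    assert!(fragment_len >= read_len, "fragment must be at least one read long");
    let mut rng = Lcg(seed);
    // 'I' encodes Phred quality 40 in the common Phred+33 scheme.
    let quality = "I".repeat(read_len);
    for i in 0..pairs {
        let fragment: Vec<char> = (0..fragment_len).map(|_| rng.next_base()).collect();
        let forward: String = fragment[..read_len].iter().collect();
        let reverse: String = fragment[fragment_len - read_len..]
            .iter()
            .rev()
            .map(|&b| complement(b))
            .collect();
        // Record layout: @name, sequence, '+' separator, quality string.
        write!(r1, "@synthetic_{i}/1\n{forward}\n+\n{quality}\n")?;
        write!(r2, "@synthetic_{i}/2\n{reverse}\n+\n{quality}\n")?;
    }
    Ok(())
}
```

Because the writers are generic over `Write`, the same function can target files, buffers, or in-memory `Vec<u8>` values in tests. Choosing `fragment_len` smaller than twice `read_len` guarantees that each pair overlaps, which is what a merge benchmark needs.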

To feed generated reads into the FLASH pipeline:

cargo run -p fastq-gen-cli -- -n 500 --output-r1 input1.fq --output-r2 input2.fq
cargo run --bin flash-cli -- input1.fq input2.fq --output-dir ./out

Library usage

The flash-lib crate is also published on crates.io. You can add it to your project with:

cargo add flash-lib

And use it:

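use flash_lib::{merge_fastq_files, CombineParams};

// Call this from a function that returns a Result so `?` can propagate errors.
let params = CombineParams::default();
// Arguments: forward reads, reverse reads, output directory, output prefix, merge parameters.
merge_fastq_files("input1.fq", "input2.fq", "./out", "out", &params)?;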

DataFusion/Ballista prototype

The flash-df crate exposes a FlashDistributedJob wrapper that executes the merge locally (re-using flash-lib) and provides feature-gated hooks for wiring the workflow into a datafusion::SessionContext. Enable the relevant feature flag when building:

cargo build -p flash-df --features datafusion

When the datafusion feature is enabled, FlashDistributedJob can register a FastqTableProvider that exposes paired FASTQ records as a tabular relation:

use flash_df::FlashDistributedJob;
use flash_lib::CombineParams;

// `config` is the job configuration, constructed beforehand.
let job = FlashDistributedJob::new(config, CombineParams::default());
let ctx = job.session_context().await?;
job.register_fastq_sources(&ctx).await?; // registers `flash_pairs` table
let plan = job.build_logical_plan(&ctx).await?; // logical plan with combined/not-combined annotations

// Or execute the full pipeline via DataFusion and write the three FASTQ outputs
job.execute_datafusion().await?;

For a quick preview, you can run the bundled examples against the sample FASTQ files checked into the workspace root:

# run an ad-hoc SQL query over the paired FASTQ rows
cargo run -p flash-df --example query --features datafusion -- \
  input1.fq \
  input2.fq \
  "SELECT tag1, seq1 FROM flash_pairs LIMIT 5"

# exercise the FLASH UDFs that produce the combined and not-combined outputs
cargo run -p flash-df --example flash_udf --features datafusion -- \
  input1.fq \
  input2.fq \
  5

# mirror the CLI behaviour via DataFusion execution
cargo run -p flash-df --example flash_cli --features datafusion -- \
  input1.fq \
  input2.fq \
  ./out-df \
  flash
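The `flash_pairs` relation queried above pairs each record from the first FASTQ file with the record at the same position in the second. The sketch below shows that pairing step with plain Rust types instead of Arrow arrays. It is not the FastqTableProvider implementation: only the `tag1` and `seq1` column names come from the example query, while `qual1`, `tag2`, `seq2`, `qual2`, the `PairRow` struct, and both functions are assumptions made for illustration.

```rust
/// One row of the paired relation (hypothetical layout for this sketch).
#[derive(Debug, PartialEq)]
struct PairRow {
    tag1: String,
    seq1: String,
    qual1: String,
    tag2: String,
    seq2: String,
    qual2: String,
}

/// Parse four-line FASTQ records into (tag, sequence, quality) tuples.
fn parse_fastq(text: &str) -> Result<Vec<(String, String, String)>, String> {
    let lines: Vec<&str> = text.lines().collect();
    if lines.len() % 4 != 0 {
        return Err(format!("expected a multiple of 4 lines, got {}", lines.len()));
    }
    lines
        .chunks(4)
        .map(|rec| {
            let tag = rec[0]
                .strip_prefix('@')
                .ok_or_else(|| format!("header does not start with '@': {}", rec[0]))?;
            if !rec[2].starts_with('+') {
                return Err(format!("separator line does not start with '+': {}", rec[2]));
            }
            if rec[1].len() != rec[3].len() {
                return Err(format!("sequence/quality length mismatch in {}", tag));
            }
            Ok((tag.to_string(), rec[1].to_string(), rec[3].to_string()))
        })
        .collect()
}

/// Zip two FASTQ inputs into rows, failing if the record counts differ.
fn pair_rows(r1: &str, r2: &str) -> Result<Vec<PairRow>, String> {
    let left = parse_fastq(r1)?;
    let right = parse_fastq(r2)?;
    if left.len() != right.len() {
        return Err(format!(
            "record count mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .into_iter()
        .zip(right)
        .map(|((tag1, seq1, qual1), (tag2, seq2, qual2))| PairRow {
            tag1,
            seq1,
            qual1,
            tag2,
            seq2,
            qual2,
        })
        .collect())
}
```

A columnar engine would store each of these fields as its own array rather than as a vector of structs, but the pairing and validation logic is the same.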

Website using WASM (Experimental)

This playground bundles the Rust implementation of FLASH compiled to WebAssembly, so you can upload paired FASTQ files and inspect the merged results directly in the browser:

  1. Build the WebAssembly artifact from the wasm-playground directory (the copy step below uses a path relative to it):

    rustup target add wasm32-unknown-unknown # once per environment
    cargo build -p flash-wasm --release --target wasm32-unknown-unknown
    mkdir -p public
    cp ../target/wasm32-unknown-unknown/release/flash_wasm.wasm public/
  2. Start the playground (pnpm dev/npm run dev) and open the FLASH Merge tab.

  3. In FLASH Merge, select your forward (R1) and reverse (R2) FASTQ files, then click Run FLASH to view or download the merged outputs. The uploader also registers four DataFusion views for direct querying:

    • flash_input_pairs with the original paired reads.
    • flash_combined with successfully merged reads.
    • flash_not_combined_left / flash_not_combined_right mirroring FLASH's not-combined outputs.