ESGI: Efficient splitting of generic indices in single-cell sequencing data
Stohn, T.; van de Brug, N. D.; Theodosiadou, A.; Thijssen, B.; Jastrzebski, K.; Wessels, L. F. A.; Bosdriesz, E.
Single-cell sequencing technologies increasingly rely on complex nucleotide barcoding schemes to encode cellular identities, experimental conditions, and multiple molecular modalities within a single experiment. While demultiplexing, alignment, and UMI-based quantification form the core preprocessing steps that transform raw sequencing reads into analyzable single-cell data, existing pipelines are often tightly coupled to specific experimental designs and typically assume fixed barcode positions and substitution-only error models. As a result, many emerging assays employing combinatorial, variable-length, or multimodal barcoding designs require custom, hard-coded preprocessing solutions that are difficult to generalize and maintain. Here, we present ESGI (Efficient Splitting of Generic Indices), a flexible and extendable framework for demultiplexing and processing single-cell sequencing data with arbitrary barcode architectures. ESGI operates directly on raw FASTQ files using a generic barcode pattern specification, supports barcode matching with insertions and deletions via Levenshtein distance, accommodates variable-length barcodes, and provides detailed quality metrics for barcode assignment. ESGI optionally integrates genome alignment via STAR and performs feature quantification and UMI collapsing to generate cell-by-feature count matrices. ESGI is well documented and readily applicable to novel single-cell experiments. We demonstrate the versatility of ESGI across six datasets spanning four distinct single-cell technologies, including combinatorial indexing-based transcriptomic and multimodal assays, feature barcode-based protein measurements, and spatial barcoding data. Across these applications, ESGI robustly demultiplexes complex barcode designs that are not natively supported by existing pipelines, while producing results comparable to established workflows where applicable.
Together, ESGI provides a general and future-proof solution for preprocessing single-cell sequencing data, enabling rapid adoption and analysis of emerging experimental designs.
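To illustrate what barcode matching with insertions and deletions means in practice, the following is a minimal Python sketch of assigning an observed barcode segment to a whitelist using Levenshtein distance. It is not ESGI's implementation or interface: the function names `levenshtein` and `assign_barcode`, the `max_dist` parameter, the example barcodes, and the rule of rejecting ambiguous matches are all assumptions made for illustration, and the sketch assumes the barcode segment has already been extracted from the read.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def assign_barcode(observed: str,
                   whitelist: Sequence[str],
                   max_dist: int = 1) -> Optional[str]:
    """Return the unique closest whitelist barcode within max_dist, else None.

    Hypothetical policy: reads that tie between two barcodes are discarded.
    """
    best, best_dist, tie = None, max_dist + 1, False
    for candidate in whitelist:
        d = levenshtein(observed, candidate)
        if d < best_dist:
            best, best_dist, tie = candidate, d, False
        elif d == best_dist and d <= max_dist:
            tie = True
    return None if tie else best


# Hypothetical whitelist; the observed barcode has lost one base (a deletion),
# which a substitution-only (Hamming) comparison could not handle.
whitelist = ["ACGTACGT", "TTGCAAGC"]
match = assign_barcode("ACGTCGT", whitelist, max_dist=1)
```

Because the distance is computed on sequences of unequal length, the same comparison also covers whitelists that mix barcodes of different lengths, which is the case a fixed-position, substitution-only model does not address.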