Motivation: Efficient and collision-free hashing of DNA sequences is essential for accuracy and performance in bioinformatics applications such as genome assembly, sequence alignment, and metagenomic classification. Traditional hashing methods often produce collisions, which degrade the precision or performance of downstream analyses. Hash functions that guarantee collision-free mappings for DNA sequences are therefore highly advantageous, particularly for k-mers up to length 16, the largest k for which the 4^16 = 2^32 possible k-mers can still be mapped one-to-one onto 32-bit hash values. In this study, we evaluate genCRC32 as a hashing primitive, reporting collision behavior, bucket balance, sensitivity to single-base changes, and speed to inform its potential use in downstream tools. Evaluation within specific software tools is outside the scope of this paper and is planned as future work.
Results: We present genCRC32, a hashing method that combines a straightforward preprocessing step (gen32) with CRC32 hashing, and we identify eight CRC32 polynomials that ensure collision-free hashing for all DNA k-mers up to 16 nucleotides in length. In extensive empirical evaluations, genCRC32 produced zero collisions for these k-mers, achieving a one-to-one mapping without auxiliary data structures. Benchmarks showed that the preprocessing step adds minimal computational overhead, keeping hashing performance comparable to established methods such as MurmurHash3 and xxHash32.
Availability and implementation: The source code for genCRC32 is publicly available at https://github.com/berybox/genCRC32. The implementation is written in Go (version 1.24) and uses only the standard library, ensuring portability and ease of integration into existing bioinformatics workflows.
© The Author(s) 2025. Published by Oxford University Press.