Stable DNA Storage Encoding Scheme Based on Repeating Substring Tree

IEEE Trans Comput Biol Bioinform. 2025 Sep-Oct;22(5):2184-2193. doi: 10.1109/TCBBIO.2025.3586008.

Abstract

DNA storage is considered a promising storage medium in the current era of data explosion. DNA encoding is the first step of the DNA storage process and lays the foundation for the subsequent steps. However, many encoding methods suffer from a low encoding rate, fail to satisfy important constraints, or produce sequences with insufficient stability. To address these issues and improve sequence stability, this paper proposes a novel approach called the Repeating Substring Tree Encoding (RSTE) method. The method begins by applying the Longest Substring Backtracking Method (LSBM) to identify the longest repeated substrings within the binary file. These substrings are then encoded into compact DNA motifs using Huffman encoding. Whereas previous studies target the ideal coding density of 2 bits per nucleotide (2 bit/nt), RSTE raises the encoding rate by 13% through efficient use of repeated substrings. Furthermore, the DNA sequences generated by the RSTE method meet three biological constraints: run-length limitation, GC content balance, and end constraints. Experimental results for minimum free energy and melting temperature indicate that the stability of the sequences encoded by RSTE is also greatly improved. A series of experiments showed that the sequences encoded by RSTE achieve a higher coding rate, satisfy the constraints, and are more stable.
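
To make two of the ideas in the abstract concrete, the following minimal Python sketch shows (a) a check for the run-length and GC-content constraints and (b) a naive search for the longest repeated substring in a bit string. It is an illustration under stated assumptions, not the RSTE or LSBM implementation: the thresholds (`max_run=3`, GC fraction between 0.4 and 0.6) and all function names are hypothetical choices made for this example, the brute-force search merely stands in for LSBM, whose details the abstract does not give, and the end constraints are omitted because the abstract does not define them.

```python
from itertools import groupby


def max_run_length(seq: str) -> int:
    """Length of the longest run of identical bases (e.g. 'AAAA' -> 4)."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 3,
                          gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """Check run-length and GC balance; thresholds are assumed, not from the paper."""
    return (max_run_length(seq) <= max_run
            and gc_low <= gc_content(seq) <= gc_high)


def longest_repeated_substring(bits: str) -> str:
    """Naive search for the longest substring that occurs at least twice.

    Occurrences may overlap. This brute-force version is only a stand-in
    for LSBM and is far too slow for real files.
    """
    n = len(bits)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        seen = set()
        for start in range(n - length + 1):
            piece = bits[start:start + length]
            if piece in seen:
                return piece
            seen.add(piece)
    return ""  # no substring repeats
```

For example, `satisfies_constraints("AAAACGCG")` is rejected under these assumed thresholds because of the run of four `A` bases, while `longest_repeated_substring("0110110")` returns `"0110"`, which occurs at positions 0 and 3.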

MeSH terms

  • Base Sequence
  • Computational Biology / methods
  • Computer Simulation
  • DNA Primers / chemistry
  • DNA Primers / genetics
  • DNA* / chemistry
  • DNA* / genetics
  • Sequence Analysis, DNA
  • Transition Temperature

Substances

  • DNA
  • DNA Primers