# Building an Ultra-lightweight Morphological Analyzer

Kakao developed a specialized, lightweight morphological analyzer to meet the strict resource constraints of mobile environments, where modern deep-learning models are often too heavy. By opting for a classical Viterbi-based approach implemented in C++20, the team reduced the library's binary size to approximately 200KB while maintaining high performance. The project shows how traditional algorithmic optimization and careful language selection remain vital for efficient mobile software.

## The Choice of C++ over Rust

- Rust was considered for its safety guarantees but was ultimately rejected: even with optimization, its default binary size reached several megabytes, too large for this project's requirements.
- C++ was chosen because mobile platforms such as iOS and Android already ship the standard libraries (libc++ or libstdc++), allowing the final analyzer binary to be stripped down to the core logic.
- The project uses C++20 features such as Concepts and `std::span` to replace older patterns like SFINAE and `gsl::span`, resulting in more readable and maintainable code without sacrificing performance (see the first sketch at the end of this post).

## Trie Compression using LOUDS

- To minimize the dictionary size, the team implemented a LOUDS (Level-Order Unary Degree Sequence) structure, which represents a trie as a bit sequence instead of pointers (sketched at the end of this post).
- This representation achieves a compression rate near the information-theoretic lower bound, storing roughly 760,000 nodes in just 9.4MB.
- A custom encoding scheme that represents Hangul in 2 bytes and English in 1 byte further reduces the dictionary's memory footprint compared to standard UTF-8.

## Optimizing the Select Bit Operation

- Initial profiling showed that the `select0` operation (finding the N-th zero in a bit sequence) consumed 90% of the dictionary search time because of its linear-scan overhead.
- The fix divides the bit sequence into 64-bit chunks and stores the cumulative count of zeros at each chunk boundary in a separate array.
- Using binary search to locate the correct chunk and parallel bit-counting techniques for the intra-chunk scan reduced the dictionary search time from 165ms to 10ms (see the last sketch below).
- Together, these optimizations cut the total analysis time from 182ms to 28ms, making the tool responsive enough for real-time mobile use.

For mobile developers facing strict hardware limitations, this project shows that combining classical data structures like LOUDS with modern low-level language features can deliver size and performance characteristics that deep-learning alternatives currently cannot match.
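
As a small illustration of the C++20 point above, here is a minimal, hypothetical sketch (not taken from Kakao's codebase) of how a concept-constrained template and `std::span` replace `std::enable_if`-style SFINAE and `gsl::span`: the constraint is stated directly on the template, and the function accepts any contiguous buffer as a non-owning view.

```cpp
#include <bit>
#include <concepts>
#include <cstddef>
#include <cstdint>
#include <span>
#include <vector>

// Constrain the template with a standard concept instead of SFINAE boilerplate.
template <std::unsigned_integral Word>
constexpr int popcount_word(Word w) {
    return std::popcount(w);
}

// std::span accepts any contiguous buffer (vector, array, raw array) as a
// non-owning view, where gsl::span was typically used before C++20.
std::size_t count_set_bits(std::span<const std::uint64_t> words) {
    std::size_t total = 0;
    for (std::uint64_t w : words) {
        total += static_cast<std::size_t>(popcount_word(w));
    }
    return total;
}

int main() {
    std::vector<std::uint64_t> bits = {0xF0F0F0F0F0F0F0F0ULL, 0x1ULL};
    return count_set_bits(bits) == 33 ? 0 : 1;  // 32 + 1 set bits
}
```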
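The LOUDS idea itself can be sketched in a few lines: walk the trie in level order and, for every node with d children, append d one-bits followed by a zero, so the whole topology collapses into roughly two bits per node. The `Node` type and `build_louds` function below are hypothetical illustrations rather than the library's actual API; real implementations also keep rank/select indexes over the bit sequence in order to navigate it.

```cpp
#include <memory>
#include <queue>
#include <vector>

struct Node {
    std::vector<std::unique_ptr<Node>> children;
};

// Flatten a trie into a LOUDS bit sequence: level-order traversal,
// each node's degree written in unary (d ones, then a terminating zero).
std::vector<bool> build_louds(const Node& root) {
    std::vector<bool> bits = {true, false};  // conventional "10" super-root
    std::queue<const Node*> pending;
    pending.push(&root);
    while (!pending.empty()) {
        const Node* node = pending.front();
        pending.pop();
        for (const auto& child : node->children) {
            bits.push_back(true);            // one 1-bit per child
            pending.push(child.get());
        }
        bits.push_back(false);               // close this node's degree
    }
    return bits;  // ~2 bits per node instead of one pointer per edge
}
```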
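The dictionary encoding claim (2 bytes per Hangul syllable, 1 byte per English letter) can also be illustrated, although the post does not spell out the exact scheme; the layout below, with ASCII stored high-bit-clear and a Hangul syllable index packed into two high-bit-set bytes, is purely an assumed example.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical compact encoding: ASCII in 1 byte (high bit clear), a Hangul
// syllable (U+AC00..U+D7A3, 11,172 code points) as a 14-bit index split
// across 2 bytes with the high bit set. UTF-8 needs 3 bytes per syllable.
std::optional<std::vector<std::uint8_t>> encode(const std::u32string& text) {
    std::vector<std::uint8_t> out;
    for (char32_t c : text) {
        if (c < 0x80) {
            out.push_back(static_cast<std::uint8_t>(c));
        } else if (c >= 0xAC00 && c <= 0xD7A3) {
            std::uint16_t idx = static_cast<std::uint16_t>(c - 0xAC00);  // 0..11171
            out.push_back(static_cast<std::uint8_t>(0x80 | (idx >> 7)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (idx & 0x7F)));
        } else {
            return std::nullopt;  // other scripts are out of scope for this sketch
        }
    }
    return out;
}
```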
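Finally, the `select0` acceleration can be sketched as follows. This is an assumption-laden simplification, not Kakao's implementation: zero counts are precomputed per 64-bit word, a binary search finds the word containing the N-th zero, and a short bit scan finishes inside that word. The article mentions a parallel bit-counting trick for the in-word step; the loop here is a simpler stand-in.

```cpp
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

class Select0Index {
public:
    explicit Select0Index(std::vector<std::uint64_t> words)
        : words_(std::move(words)), zeros_before_(words_.size() + 1, 0) {
        // zeros_before_[i] = number of 0-bits in words_[0..i)
        for (std::size_t i = 0; i < words_.size(); ++i) {
            zeros_before_[i + 1] =
                zeros_before_[i] + (64 - std::popcount(words_[i]));
        }
    }

    // Position of the n-th zero bit (n is 1-based); assumes it exists.
    std::size_t select0(std::uint64_t n) const {
        // Binary search for the first boundary whose prefix holds >= n zeros.
        auto it = std::lower_bound(zeros_before_.begin(), zeros_before_.end(), n);
        std::size_t word = static_cast<std::size_t>(it - zeros_before_.begin()) - 1;
        std::uint64_t remaining = n - zeros_before_[word];

        // Invert the word so zeros become ones, drop the lowest
        // (remaining - 1) set bits, and the answer is the next set bit.
        std::uint64_t inv = ~words_[word];
        for (std::uint64_t k = 1; k < remaining; ++k) {
            inv &= inv - 1;  // clear lowest set bit
        }
        return word * 64 + static_cast<std::size_t>(std::countr_zero(inv));
    }

private:
    std::vector<std::uint64_t> words_;
    std::vector<std::uint64_t> zeros_before_;
};
```

Whatever the production layout actually looks like, the point is the same: precomputed per-chunk counts turn the original linear scan into a logarithmic search plus a constant-time scan of a single word.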