Top-k string auto-completion with synonyms

Motivation

Keyword searching is a ubiquitous activity performed by millions of users daily. However, cognitively formulating and physically typing search queries is a time-consuming and error-prone process. In response, keyword search engines have widely adopted auto-completion as a means of reducing the efforts required to submit a query. As users enter their query into the search box, auto-completion suggests possible queries the user may have in mind.

usage
Example of usage scenario

Challenges

The existing solutions of auto-completion provide the suggestions based on the beginning of the currently input character sequence (i.e. prefix). Although this approach provides satisfactory auto-completion in many cases, it is far from optimal since it fails to take into account the semantic of users' input characters. There are many practical applications where syntactically different strings can represent the same real-world object. For example, Bill is a short form of William and Database Management Systems can be abbreviated as DBMS. However, these equivalence information suggests semantically similar strings that may have been missed by simple prefix based approaches. In this project, we expose these relations between different strings, to support efficient top-k completion queries with synonyms for different space and time complexity trade-offs.

Further reading

You may find our research paper at https://arxiv.org/abs/1611.03751.

Contributions

Twin tries (TT)

Two tries are constructed to present strings and synonym rules respectively in order to minimize the space occupancy. Each trie is a compact data structure, where the children of each node are ordered by the highest score among their respective descendants. Applicable synonym rules are indicated by pointers between two tries. An efficient top-k algorithm is developed to search both tries to find the synonym rules.

Expansion trie (ET)

A fast lookup-optimized solution by integrating synonym rules with the corresponding strings. Unlike TT, ET uses a single expended trie to represent both synonym and string rules. Therefore, by efficiently traversing this trie, ET is faster than TT to provide top-k completions. Meanwhile ET often takes larger space overhead than TT, because ET needs to expand the strings with their applicable rules.

Hybrid tries (HT)

An optimized structure to strike a good balance between space and time cost for TT and ET. We try to find a balance between lookup speed and space cost by judiciously select part of synonym rules to expand the strings. We show that given a predefined space constraint, the optimal selection of synonym rules is NP-hard, which can be reduced to a 0/1 knapsack problem with item interactions. We provide an empirically efficient heuristic algorithm by extending the branch and bound algorithm.

Want to get start? See download page to have a try!