Scalable way to search for (similar) strings in a database

Question

Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in database, in another column, so the ones with higher weights should have higher chance of appearing in results, if possible.

What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and Postgresql, but could also use e.g. Node.js, if needed.

*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.

Example input sentence:

  • I don't like cangaroos.

Example suggestions from the database:

  • Cangaroos are not my favorite.
  • Cangaroos are evil.
  • I once had a cangaroo. Never again.

These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance appearing in results. It is probably much harder to detect 'don't like', so that part is completely optional for me.

P.s. Could PostgreSQL's full text search do something like this?

Thank you.


Show source
| search   | postgresql   | elasticsearch   | full-text-search   | scalability   2016-12-24 11:12 1 Answers

Answers ( 1 )

  1. 2016-12-25 00:12

    PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.

    You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:

    CREATE EXTENSION pg_trgm
    CREATE EXTENSION btree_gist;
    

    I assume you have one table that looks like this one:

    CREATE TABLE sentences
    (
        sentence_id integer PRIMARY KEY,
        sentence text
    ) ;
    
    INSERT INTO sentences (sentence_id, sentence)
    VALUES
        (1, 'Cangaroos are not my favorite.'),
        (2, 'A vegetable sentence.'),
        (3, 'Cangaroos are evil.'),
        (4, 'Again, some plants in my garden.'),
        (5, 'I once had a cangaroo. Never again.') ;
    

    This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:

    CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
    

    To find the answers you're looking for, you execute:

    -- Set the minimum similarity you want to be able to search
    SELECT set_limit(0.2) ;
    
    -- And now, select the sentences 'similar' to the input one
    SELECT
        similarity(sentence, 'I don''t like cangaroos') AS similarity, 
        sentence_id,
        sentence
    FROM
        sentences
    WHERE
        /* That's how you choose your sentences:
           % means 'similar to', in the trigram sense */
        sentence % 'I don''t like cangaroos'
    ORDER BY
        similarity DESC ;
    

    The result that you get is:

    similarity | sentence_id | sentence
    -----------+-------------+-------------------------------------
        0.3125 |           3 | Cangaroos are evil.      
        0.2325 |           1 | Cangaroos are not my favorite.
        0.2173 |           5 | I once had a cangaroo. Never again.
    

    Hope this gives you what you want...

◀ Go back