Replicating your SQL into NoSQL and full-text search engines (Solr/Elasticsearch)

OpenSooq uses many open source technologies, because each piece has its own use case; for example, we use Apache Solr to power the content moderation dashboard.

When your main data source is a strict, classical ACID SQL database, and you want to ship that data to some other backend like a NoSQL store, Elasticsearch, or Solr as in our case, the copies will inevitably drift out of sync, and keeping them in sync becomes harder and harder.

Naive classical implementation

We have a microservice called Onix that is responsible for this. The first iterations of Onix used a timestamp column (called updated_at below), defined with DDL like

CREATE TABLE something (
    ...
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ...
);

A periodic job (say, every 10 minutes) would then incrementally continue from where the last job stopped. Unfortunately this method is not instant: a change may wait up to a full period before it is synced.
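
A minimal sketch of such a job, assuming hypothetical push_to_solr(), load_checkpoint(), and save_checkpoint() helpers (the table and column names are illustrative too):

import time
import pymysql

def sync_once(conn, last_seen):
    # Fetch rows changed since the last checkpoint, index them,
    # and return the new checkpoint.
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(
            "SELECT * FROM posts WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row in cur.fetchall():
            push_to_solr(row)          # hypothetical: index the row in Solr
            last_seen = row["updated_at"]
    return last_seen

conn = pymysql.connect(host="db", user="onix", password="SECRET", db="app")
last_seen = load_checkpoint()          # hypothetical: read the saved watermark
while True:
    last_seen = sync_once(conn, last_seen)
    save_checkpoint(last_seen)         # hypothetical: persist the watermark
    time.sleep(600)                    # wait 10 minutes between runs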

Later we made it application-triggered: when a change happens, the “afterSave” method of the application’s ORM sends the id of the affected post to Onix for an instant sync. Unfortunately this method does not cover changes made directly in the database behind the application’s ORM, or cases where the ORM was interrupted before the hook ran.
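
Conceptually the hook looks something like this Python sketch (shown Django-style for illustration; the real application uses its own ORM’s afterSave hook, and the Onix endpoint URL and payload are assumptions):

import requests
from django.db import models

class Post(models.Model):
    # ... fields ...

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # After every successful save, tell Onix which post to re-index.
        requests.post("http://onix.internal/sync", json={"post_id": self.pk})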

Even when both periodic and application-triggered sync are used, we still had out-of-sync data. For example, say the “posts” table has a foreign key referencing a category: when that category is updated, its own timestamp changes, but the timestamps of the posts referring to it do not, and the “afterSave” ORM method fires for the category but not for the posts.

Implementing the “binlog” replication protocol

Make your NoSQL a slave of your SQL

Unlike PostgreSQL’s, the MySQL replication protocol is high level: when the “binlog” is set to row format, an event reads like “in table ABC, at the row having id=123, change column col1 from value1 to value2”.
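
This requires the server to write row-based binlogs in the first place; a minimal my.cnf sketch (the values are illustrative) would be:

[mysqld]
server-id     = 1
log_bin       = mysql-bin
binlog_format = ROW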

In the latest update to our Onix service, we implemented the MySQL replication protocol using python-mysql-replication and made it send changes to Apache Solr directly. If someone opens the database directly and types an update statement, it is reflected in our search engine right away.

We created a MySQL user for this purpose

CREATE USER 'repl'@'%' IDENTIFIED BY 'SECRET';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
GRANT REPLICATION CLIENT ON *.* TO 'repl'@'%';
GRANT SELECT ON *.* TO 'repl'@'%';
GRANT SHOW DATABASES ON *.* TO 'repl'@'%';

and using this user, Onix receives changes instantly from MySQL, as if it were an actual slave, and sleeps when there are no events. The code looks something like

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)
# ....
        # Connect as a replication client, resuming from the last binlog
        # position we processed, and subscribe only to row-level events
        # on the schemas and tables we care about.
        self.stream = BinLogStreamReader(
            connection_settings=mysql_settings,
            server_id=server_id,
            blocking=True,  # wait for new events instead of returning
            resume_stream=resume_stream,
            log_file=self.last_log_file,
            log_pos=self.last_log_pos,
            skip_to_timestamp=self.last_timestamp,
            only_tables=tables,
            only_schemas=schemas,
            only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent],
        )

        for binlogevent in self.stream:
            # Map the event class to the kind of change it represents.
            if isinstance(binlogevent, UpdateRowsEvent):
                typ = "update"
            elif isinstance(binlogevent, WriteRowsEvent):
                typ = "insert"
            elif isinstance(binlogevent, DeleteRowsEvent):
                typ = "delete"
            # ....
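
From there, each event carries its changed rows: inserts and deletes expose them under row["values"], while updates expose row["before_values"] and row["after_values"]. A minimal sketch of pushing them to Solr with the pysolr client (the core URL and the mapping of the primary key to Solr’s id field are assumptions, not necessarily Onix’s actual setup):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/posts", always_commit=True)

for binlogevent in self.stream:
    for row in binlogevent.rows:
        if isinstance(binlogevent, DeleteRowsEvent):
            # Remove the matching document; the old row is in row["values"].
            solr.delete(id=row["values"]["id"])
        elif isinstance(binlogevent, UpdateRowsEvent):
            # Re-index the document with the new column values.
            solr.add([row["after_values"]])
        elif isinstance(binlogevent, WriteRowsEvent):
            # Index the freshly inserted row.
            solr.add([row["values"]])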