Opensooq uses many open source technologies, because each piece has its own use case; for example, we use Apache Solr to power the content moderation dashboard.
When your main data source is a strict, classical ACID SQL database and you want to mirror that data into some other backend like a NoSQL store, Elasticsearch, or Solr as in our case, the copies will inevitably drift out of sync, and keeping them in sync becomes harder and harder.
Naive classical implementation
We have a microservice called Onix that is responsible for this. The first iterations of Onix used a timestamp column (called updated_at below), created with DDL like:

CREATE TABLE something (
    ...
    updated_at TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP,
    ...
);
A job then runs periodically (let's say every 10 minutes) and incrementally continues from where the last run stopped. Unfortunately, this method is not instant: changes have to wait up to the polling interval before they are synced.
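The checkpointing logic of such a periodic job can be sketched as follows. This is a minimal in-memory illustration, not the actual Onix code; rows_since and the dict shape are our assumptions. Each run picks up the rows changed after the checkpoint saved by the previous run:

```python
def rows_since(rows, checkpoint):
    # rows touched since the previous run, ordered so that the newest
    # updated_at value can become the next checkpoint
    changed = sorted(
        (r for r in rows if r["updated_at"] > checkpoint),
        key=lambda r: r["updated_at"],
    )
    new_checkpoint = changed[-1]["updated_at"] if changed else checkpoint
    return changed, new_checkpoint
```

In production the same idea would run as a SELECT ... WHERE updated_at > :checkpoint query against MySQL; the in-memory version only shows how each run resumes from where the last one stopped.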
Later we made the sync application-triggered: when a change happens, the "afterSave" method of the application's ORM sends the id of the affected post to Onix for instant syncing. Unfortunately, this method misses changes made directly in the database behind the application's ORM, and changes lost when the ORM was interrupted.
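The application-side hook amounts to something like the sketch below. This is a hypothetical illustration, not OpenSooq's actual code; the after_save name and the payload shape are assumptions, and the transport is injected so the sketch stays self-contained:

```python
import json

def after_save(post_id, send):
    # called by the ORM after every write; forwards the id of the
    # affected post to the sync service over an injected transport
    # (in practice this could be an HTTP call or a message queue)
    payload = json.dumps({"id": post_id})
    send(payload)
```

The weakness described above is visible here: the hook only fires when the ORM itself performs the write.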
Even with both methods (periodic and application-triggered sync) in place, we still had out-of-sync data. For example, the "posts" table has a foreign key that references a category; when that category is updated, its own timestamp changes, but the timestamps of the posts referring to it do not, and the "afterSave" ORM method fires for the category but not for the posts.
Implementing “binlog” replication protocol
Make your NoSQL a slave of your SQL
Unlike PostgreSQL's, MySQL's replication protocol is high level: when the binlog is set to the "row" format, events read like "in table=ABC, at the row having id=123, change column col1 from value1 to value2".
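For this to work, the server must be writing row-based binlogs. A minimal configuration sketch (binary logging must also be enabled, which in my.cnf would look like log-bin=mysql-bin together with binlog_format=ROW and a unique server_id):

```sql
-- the binlog must use the ROW format so events carry actual row values
SET GLOBAL binlog_format = 'ROW';
```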
In the latest update to our Onix service, we implemented the MySQL replication protocol using python-mysql-replication and made it send changes directly to Apache Solr. If someone opens the database directly and types an update statement, the change is immediately reflected in our search engine.
We created a MySQL user for this purpose:
CREATE USER 'repl'@'%' IDENTIFIED BY 'SECRET';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
GRANT REPLICATION CLIENT ON *.* TO 'repl'@'%';
GRANT SELECT ON *.* TO 'repl'@'%';
GRANT SHOW DATABASES ON *.* TO 'repl'@'%';
Using this user, Onix receives changes from MySQL instantly, as if it were an actual slave, and it sleeps when there are no events. The code looks something like:
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

# ....
self.stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=server_id,
    blocking=True,
    resume_stream=resume_stream,
    log_file=self.last_log_file,
    log_pos=self.last_log_pos,
    skip_to_timestamp=self.last_timestamp,
    only_tables=tables,
    only_schemas=schemas,
    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent],
)
for binlogevent in self.stream:
    if isinstance(binlogevent, UpdateRowsEvent):
        typ = "update"
    elif isinstance(binlogevent, WriteRowsEvent):
        typ = "insert"
    elif isinstance(binlogevent, DeleteRowsEvent):
        typ = "delete"
    # ....
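After classifying each event, the rows still have to be turned into Solr updates. A hedged sketch of that last step, using Solr's JSON update syntax (row_to_solr_command is our illustration, not the actual Onix code, and assumes the row dict carries the document's id):

```python
import json

def row_to_solr_command(typ, row):
    # deletes become a delete-by-id command; inserts and updates
    # simply (re)index the full row as a Solr document
    if typ == "delete":
        return json.dumps({"delete": {"id": row["id"]}})
    return json.dumps({"add": {"doc": row}})
```

The resulting JSON body would be POSTed to the Solr core's /update handler.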