MySQL 8.0.12: The devil is in the detail

More than a billion page views now rely on MySQL to access posts, comments, members, and more. As we have evolved the product and added new functionality, the underlying technologies that power OpenSooq have changed substantially. We have dozens of background tasks that handle things like payments and analytics. Many of the queries run by these tasks are expensive SELECTs, and we were starting to notice an impact on frontend performance.

To improve OpenSooq even further, we now have to work on the storage service to make it faster, more efficient, more reliable, and easier to upgrade with new features. This evolution involved a major change: upgrading from MySQL 5.5 to MySQL 8.0.12, which promises up to 2x faster performance.

The major improvements in MySQL 8.0.12 can be summarized as follows:

  1. Invisible Indexes
  2. Configuration Persistence
  3. UUID Enhancements
  4. Descending Indexes
  5. Cost Model Improvements

And others. A couple of these features are shown in the short sketch below.
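As a minimal sketch of a few of these features (the posts table and the index names here are hypothetical examples, not our actual schema):

-- NOTE: "posts" and the index names are hypothetical, for illustration only
-- Invisible index: still maintained, but ignored by the optimizer until made visible again
ALTER TABLE posts ALTER INDEX idx_created_at INVISIBLE;

-- Configuration persistence: the setting survives a restart without editing my.cnf
SET PERSIST max_connections = 1000;

-- Descending index: optimized for ORDER BY created_at DESC queries
CREATE INDEX idx_posts_created_desc ON posts (created_at DESC);

-- UUID enhancements: generate a UUID and convert it to a compact BINARY(16) form
SELECT UUID_TO_BIN(UUID(), 1);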


We wanted to make sure people using OpenSooq had an uninterrupted experience during the migration. The migration required extensive read and write operations, which are I/O-bound. It was critical to have strong data consistency before and after the migration to ensure the correct product experience.

It was also crucial to migrate all post data and all microservices data, which meant terabytes of information. It is a nightmare to migrate terabytes of information from MySQL 5.5 to MySQL 8.0, since the upgrade path has to pass through MySQL 5.6 and 5.7.
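A simple way to sanity-check consistency before and after each hop is to compare row counts and table checksums on the source and the target; a rough sketch (the table names here are hypothetical):

-- NOTE: "posts" and "comments" are hypothetical example tables
-- Run the same statements before and after each upgrade step and compare the output.
-- CHECKSUM TABLE depends on the row format, so it is most meaningful when the
-- row format has not changed between the two snapshots; row counts are always comparable.
SELECT COUNT(*) FROM posts;
CHECKSUM TABLE posts, comments;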

While AWS EC2 is generally very reliable, there is always a chance of an instance failure. The recovery process we had in place was managed entirely by us: we would have had to launch a new instance, change the hostname, and restart the site.

The first challenge we faced with MySQL 8.0 was backup. Percona did not yet support backing up MySQL 8.0, and mysqldump was not workable for our huge data set. MySQL Enterprise Backup provides enterprise-grade backup and recovery for MySQL, and it solved this first issue for us.

We would have had to stop the migration to MySQL 8.0 once we noticed that Percona had no backup solution at that moment, but we continued the process because MySQL Enterprise Backup covered it.

While migrating to a new version of the database, we were also developing the service and designing and manufacturing the Lightning flash servers at the same time. Rolling out the new service required fixing software and even kernel issues, but the devil is in the detail.


The devil is in the detail

The database crashed when we started simulating traffic. Really? Yes. It crashed on a database server with 128GB of RAM and 16 cores while we were adding millions of rows.

The database error log was full of messages like this prior to the crash:
2018-10-07T15:52:27.287624Z 0 [ERROR] [MY-013059] [InnoDB] InnoDB: Log writer overwriting data after checkpoint - waited too long (1 second), lag: 3865468416 bytes, checkpoint LSN: 134684446208
2018-10-07T15:52:27.287686Z 0 [Warning] [MY-013060] [InnoDB] InnoDB: Log writer is waiting for checkpointer to to catch up lag: 3865468928 bytes, checkpoint LSN: 134684446208
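These messages mean the checkpointer cannot keep up with the log writer. While reproducing the load, the lag can be watched from the LOG section of the InnoDB status output; a rough diagnostic sketch (not necessarily the exact commands we ran):

-- "Log sequence number" minus "Last checkpoint at" in the LOG section
-- approximates the lag reported in the error message
SHOW ENGINE INNODB STATUS\G

-- The same LSN counters are also exposed as metrics
-- (some of them may need to be enabled with innodb_monitor_enable)
SELECT NAME, COUNT FROM information_schema.INNODB_METRICS WHERE NAME LIKE 'log_lsn%';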

When we first saw this issue, we limited the number of concurrent threads by setting the configuration parameter innodb_thread_concurrency. Once the number of executing threads reaches this limit, additional threads sleep for a number of microseconds, set by the configuration parameter innodb_thread_sleep_delay, before being placed into the queue. But this gave no positive results.
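For reference, this is roughly what we tried; the values below are illustrative rather than our exact production settings:

-- NOTE: illustrative values, not our production configuration
-- Cap the number of threads executing concurrently inside InnoDB
SET PERSIST innodb_thread_concurrency = 16;

-- Microseconds an extra thread sleeps before being placed into the wait queue
SET PERSIST innodb_thread_sleep_delay = 10000;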

MySQL developers acknowledged this bug in their bug list: they had missed a margin between the boundary up to which the log writer may write and the boundary up to which log_free_check() calls do not have to wait (after they introduced the concurrency margin). This margin has since been restored and is called "extra_margin".

Bug #90993 InnoDB crashes from long lock wait – log writer waiting for checkpointer

It is fixed as of the upcoming 8.0.13 release, but no one knows when that will be released; you can find it in the change log under:

An apparent hang due to the log writer running out of free space in the
redo log caused the server to exit. 

As a result, we stopped our migration process.