DAS: Distributed Analytics System for Arabic Search Engines

Today’s successful organizations are data driven. At Opensooq we have tens of engineers, analysts, and data scientists who crunch petabytes of data everyday to provide a great experience for our users. We execute at massive scale using data to connect our millions of users in global classified.

Opensooq has published a paper in IEEE, We introduce the fault-tolerant Distributed Analytics System (DAS) for analyzing big data collected from search engines in Arabic. This system consists of three main subsystems: Logging and Archiving Subsystem (LAS), Analytics Subsystem (AS), and a User Interface (UI). We used the data provided by opensooq.com, an online market with Arabic content, and compiled four datasets with sizes: 50 Million, 100 Million, 150 Million, and 200 Million events, in order to assess DAS. The experiments showed that DAS outperformed its sequential counterpart at datasets of 100 Million events and more, with the best speedup being 3.5 at 200 Million events. Additionally, DAS outperformed the well-known analytics system ElasticSearch (ES) in terms of response time for input sizes of 70 Million events and more, as the time per request achieved by DAS was 21% faster than ES’s time. Moreover, DAS turned out to be more energy-efficient in terms of CPU utilization, as ES’s CPU utilization was 2.4 times more than DAS’s utilization, on average.

The system consists of 3 main subsystems: Logging and Archiving Subsystem (LAS), Analytics Subsystem (AS), and a User Interface (UI). As for LAS, its main focus is storing data efficiently without elaborate processing. Thus, LAS utilizes SQLite, which is a light file-like database system, to store data as chunks on a daily basis. The Analytic Subsystem (AS), on the other hand, is an enterprise analytics system based on Lucene, an open-source high-performance text search engine library, which implements the aforementioned text analysis algorithms. The AS runs in a distributed fashion on 2p distinct servers (i.e nodes) that communicate over HTTP, utilize JSON for information interchange, and are programmed in Java. The AS subsystem consists of p leader servers (shards) and their corresponding p replica servers. This subsystem analyzes the data stored in the SQLite database via implementing distributed indexing and searching in a seamless fashion, and counts for high-availability as well. It also incorporates classification by Association Rule Analysis (ARA) while indexing to help address business owners’ queries in real-time. In particular, the AS utilizes the CBA, and CMAR methods to address business owners’ queries while it implements indexing. Finally, the UI subsystem is the interface via which the required statistical data is conveyed to the business owner. Again, the same aforementioned ARA tools are utilized in order to process the users’ queries sent through the UI. The figure below demonstrate the flow of the whole Analytics System.

The AS operations comprise two phases: indexing, and update. The indexing phase starts by a pre-processing step which is carried out by one of the leader servers, let us call it the distributor. In this step, the distributor periodically reads events’ data from LAS, create documents from them, and applies text-analysis algorithms to the documents. Then, it distributes the documents among the remaining leader servers for indexing.

Each such document consists of a set of fields, one of which is the query field that contains the search terms. Thus, in the indexing phase, each server will use the terms in its documents to build an inverted file, which will map each term to the documents in which it appeared together with the term’s ordinal position within those documents. It is quite intuitive that as new documents get distributed, the indices at each server will need to be updated, which is handled in the update phase.

When DAS is queried via its UI, the distributor will also pass the query to all the designated servers, which will handle the query using their own documents and indices, and send their results back to the distributor. The distributor will collect the partial results and reduce them to the final answer that will be returned to the business owner via the UI.

To count for high-availability, AS utilizes Apache ZooKeeper as a service that runs on each server in the AS subsystem and monitors the health of the servers and handles issues relevant to load-balancing. Upon the failure of a given server within the AS subsystem, the ZooKeeper transfers the role of the failed server to a suitable server within the subsystem. Similarly upon reviving the failed server, Zookeeper restores the roles back again.To help better illustrate the role of the Zookeeper, consider the below figure which provides a different perspective of the DAS system. Here, we see that the Zookeeper instances on all servers utilize broadcasting to stay up-to-date on the health of other servers within the AS.

Apache Zookeeper itself is replicated over the AS nodes forming an ensemble. The Zookeeper ensemble has one leader and many followers. The choice of leader is on the start of zookeeper ensemble. Apache zookeeper nodes contain information related to distributed nodes, changes in the data, timestamp as well as client uploaded information. The diagram below depicts the structure of zookeeper in a distributed environment.

Leave a Reply

Your email address will not be published.