Content websites need a system that ensures each page's category is relevant to the content shown on it. To keep data quality high, data warehouses must therefore validate and cleanse incoming data from users. OpenSooq.com, a leading classified ads website in the Middle East and North Africa, is one Arabic content website that badly needs such a system. The website, which is available in Arabic, serves more than a billion page views per month and receives more than one million new ad postings from users every month. Manually verifying that each ad is relevant to the category page it appears on is therefore impossible.
Such systems rely on artificial intelligence techniques to determine the similarity between reference data sets and the user's ad. Among these techniques, fuzzy matching is particularly useful. For every piece of data examined, the fuzzy matching process assigns a similarity score that indicates how close the match is. For example, compared with the actual name ‘Thomas Jones’, ‘Tomas Jones’ might receive a similarity score of 85 percent, while ‘Tom Jones’ might receive 70 percent. The major challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can clean incoming data when it fails to match the data sets exactly.
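The scoring idea can be sketched in a few lines of Python. This is only an illustration, assuming the standard library's difflib.SequenceMatcher as the similarity measure; the article does not say which measure produced the 85 and 70 percent figures, so the scores printed here will differ from them. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    """Accept the candidate when its score reaches the (assumed) threshold."""
    return similarity(candidate, reference) >= threshold


if __name__ == "__main__":
    reference = "Thomas Jones"
    for candidate in ("Thomas Jones", "Tomas Jones", "Tom Jones"):
        print(f"{candidate!r}: {similarity(candidate, reference):.2f}")
```

With this measure, an exact match scores 1.0 and ‘Tomas Jones’ scores higher than ‘Tom Jones’, which preserves the ordering described above even though the absolute numbers depend on the algorithm chosen.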
The Arabic language poses several challenges for Natural Language Processing (NLP), largely because Arabic, unlike European languages, has a very rich and complex morphological system. Its alphabet has 28 letters and is written cursively from right to left. The morphological representation of Arabic is complex because of morphological variation and the agglutination phenomenon. Letters change form according to their position in the word (beginning, middle, end, or isolated). Table 1 gives an example of the different forms of the letter “ain” at different positions. We can observe several general characteristics of this language, as follows:
TABLE 1: Different writings of the letter “ain” at different positions within a word or as a separate letter
• A large portion of nouns and verbs are derived from a small number (approximately 10,000) of roots. These roots are linguistic units that convey a semantic meaning; most of them consist of only 3 consonants, and rarely of 4 or 5.
• From these roots, nominal and verbal derivatives are produced by applying templates (morphological rules). Up to 30 words can be created from a three-consonant root. Table 2 shows an example with the three-consonant root “ktb” (to write), from which several words can be produced:
TABLE 2: Derivation of several words from the “ktb” root
• In written Arabic, the vowels (diacritics) are usually omitted, and as a result words tend to have a higher level of ambiguity. For example, the word (على) without vowels can mean the proper name (Ali) or the preposition (on). This ambiguity is a crucial problem in information retrieval, because a single Arabic word can have several meanings.
Checking Relevance with Solr
Apache Solr provides text analysis support that covers text-processing steps such as tokenization, case normalization, stemming, synonyms, and other miscellaneous text processing. In our case we need to build the following steps:
Tokenization typically occurs at the text level. It is the process of breaking a stream of text up into words or other meaningful elements called tokens. A Solr schema.xml file offers many ways to specify how a text field is tokenized. StandardTokenizer is simple and works fine with Arabic text.
For content with lots of words, common uninteresting words like “من” “from”, “على” “above”, and so on, make the index large and slow down phrase queries that use them. A simple solution to this problem is to filter them out of the fields where they show up often. Removing stop words is therefore important for both performance and quality.
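Outside Solr, tokenization followed by stop word removal can be sketched in plain Python. This is a simplified illustration, not Solr's StandardTokenizer: it assumes undiacritized text and splits on any non-word character. The stop word list and function names are hypothetical; a real list would be far longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"من", "على", "في"}


def tokenize(text: str) -> list[str]:
    """Split text into word tokens on any non-word character."""
    return re.findall(r"\w+", text)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    tokens = tokenize("سيارة من عمان")
    print(tokens)                     # three tokens, in reading order
    print(remove_stop_words(tokens))  # the stop word "من" is dropped
```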
There is a simple filter called StopFilterFactory that filters out certain so-called stop words specified in a file in the conf directory, optionally ignoring case. Example usage:
Arabic Normalization and Stemming
The Arabic corpus and queries should be normalized and stemmed. In Solr this is easily done by adding the following to schema.xml:
ArabicNormalizationFilterFactory normalizes the Arabic text according to the following steps:
• Remove punctuation
• Remove diacritics (primarily weak vowels). Some entries contained weak vowels, in particular, the dictionaries used in cross-language experiments. Removal made everything consistent.
• Remove non-letters
• Replace إ and أ with ا
• Replace final ى with ي
• Replace final ة with ه
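The steps listed above can be sketched in Python as follows. This is a standalone illustration of the list as written, not the source code of ArabicNormalizationFilterFactory, whose exact behavior may differ. The function and constant names are hypothetical, and non-letters are replaced by spaces rather than deleted outright (an assumption made here so that adjacent words do not fuse).

```python
import re

# Arabic diacritics (harakat): fathatan U+064B through sukun U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")

ALEF = "\u0627"              # ا
ALEF_HAMZA_ABOVE = "\u0623"  # أ
ALEF_HAMZA_BELOW = "\u0625"  # إ
ALEF_MAKSURA = "\u0649"      # ى
YEH = "\u064A"               # ي
TEH_MARBUTA = "\u0629"       # ة
HEH = "\u0647"               # ه


def normalize(text: str) -> str:
    """Apply the normalization steps listed above to a piece of Arabic text."""
    # Remove diacritics.
    text = DIACRITICS.sub("", text)
    # Remove punctuation and other non-letters (replaced by spaces here).
    text = "".join(ch if ch.isalpha() else " " for ch in text)
    # Replace alef with hamza above/below by bare alef.
    text = text.replace(ALEF_HAMZA_ABOVE, ALEF).replace(ALEF_HAMZA_BELOW, ALEF)
    words = []
    for word in text.split():
        # Replace a word-final alef maksura or teh marbuta.
        if word.endswith(ALEF_MAKSURA):
            word = word[:-1] + YEH
        elif word.endswith(TEH_MARBUTA):
            word = word[:-1] + HEH
        words.append(word)
    return " ".join(words)
```

Applying the same function to both the indexed text and the query means that spelling variants of one word collapse to a single form before they are compared.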
Stemming is another tool, besides normalization, that is used in information retrieval to combat the vocabulary mismatch problem, in which a query and a relevant document use different forms of the same word.
Of course, the Arabic language has so many spelling idiosyncrasies that algorithmic stemmers are imperfect: they sometimes stem incorrectly or fail to stem when they should. This filter will skip tokens already marked by KeywordMarkerFilter and it will keyword mark all tokens it replaces itself, so that the stemmer will skip them.
A user may search with a word that does not appear in the original document but is synonymous with a word that is indexed, and you want that document to match the query. Of course, the synonyms need not be strictly those found in a thesaurus; they can be whatever you want, including terminology specific to your application's domain.
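The effect of synonym processing on matching can be sketched in Python. The synonym groups below are hypothetical examples of domain terms for a classifieds site, and the function names are illustrative; this is not how Solr stores or applies its synonym file.

```python
# Hypothetical synonym groups; each group lists terms treated as equivalent.
SYNONYM_GROUPS = [
    {"سيارة", "عربية"},
    {"موبايل", "جوال", "هاتف"},
]


def build_synonym_index(groups: list[set[str]]) -> dict[str, set[str]]:
    """Map every term to the full group of terms it is equivalent to."""
    index: dict[str, set[str]] = {}
    for group in groups:
        for term in group:
            index[term] = set(group)
    return index


def matches(query_tokens: list[str], doc_tokens: list[str],
            index: dict[str, set[str]]) -> bool:
    """True when every query token, or one of its synonyms, occurs in the document."""
    doc_terms = set(doc_tokens)
    return all(index.get(token, {token}) & doc_terms for token in query_tokens)
```

A query for one term of a group then matches a document that contains only another term of the same group, while terms outside any group must match literally.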
Here is a sample analyzer configuration line for synonym processing:
Specifying an Analyzer in the schema
The configuration example defines an analyzer, which specifies an ordered sequence of processing steps. Solr applies these preprocessing steps (tokenization, stop word removal, case normalization, stemming, and synonyms) to the title and description of every indexed item and to every query.
Finally, the figure below shows the relevancy check implemented with a PHP library for indexing and searching documents in Apache Solr.
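The notion of an analyzer as an ordered chain of steps can be sketched in Python. This is a conceptual illustration with hypothetical names and a toy English stop word list, not Solr's analyzer implementation; it shows why the order of the steps matters and why the same chain is applied to both indexed text and queries.

```python
from typing import Callable

TokenFilter = Callable[[list[str]], list[str]]

# Hypothetical stop-word list, stored in lowercase.
STOP_WORDS = {"for", "the", "in"}


def lowercase(tokens: list[str]) -> list[str]:
    """Case normalization step."""
    return [token.lower() for token in tokens]


def drop_stop_words(tokens: list[str]) -> list[str]:
    """Stop-word removal step; expects tokens that are already lowercased."""
    return [token for token in tokens if token not in STOP_WORDS]


def make_analyzer(tokenizer: Callable[[str], list[str]],
                  filters: list[TokenFilter]) -> Callable[[str], list[str]]:
    """Build an analyzer that tokenizes, then applies each filter in order."""
    def analyze(text: str) -> list[str]:
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# Lowercasing runs before stop-word removal so that "For" is also removed.
analyze = make_analyzer(str.split, [lowercase, drop_stop_words])
```

Because the indexed title and the query both pass through the same chain, "Car For Sale" and "car for SALE" reduce to the same token list and can be matched directly.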