{"id":12463,"date":"2022-01-05T07:00:05","date_gmt":"2022-01-05T06:00:05","guid":{"rendered":"https:\/\/sii.pl\/blog\/?p=12463"},"modified":"2023-06-14T14:18:37","modified_gmt":"2023-06-14T12:18:37","slug":"specialized-semantic-search-engine","status":"publish","type":"post","link":"https:\/\/sii.pl\/blog\/en\/specialized-semantic-search-engine\/","title":{"rendered":"Specialized Semantic Search Engine"},"content":{"rendered":"\n<p>This article shares the experience collected in constructing a PoC of a semantic search engine for Polish courts\u2019 rulings. The project\u2019s goal was to determine possible gains over a classical, lexical-search solution that our customer used to work with daily.<\/p>\n\n\n\n<p> The most challenging parts were to deal with a highly specific language of the documents and to build a text-similarity machine learning model without an annotated dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lexical search methods<\/h2>\n\n\n\n<p>Lexical search methods are word frequency-based approaches, what means a set of documents is ranked according to the query\u2019s term frequency regardless of their proximity and context within the document. Hence <strong>it may produce many false matches<\/strong>. Let\u2019s assume a user is looking for:&nbsp;<em>a park in Barcelona worth visiting<\/em>. The method will probably return the correct document containing&nbsp;<em>Park Guel in Barcelona <\/em>is a place everyone must see. Still, it may as well answer with <em>Taxis Park&nbsp;near the stadium of FC Barcelona<\/em>. Both documents are equally good from the algorithm\u2019s perspective because both include the terms&nbsp;<em>park<\/em>&nbsp;and&nbsp;<em>Barcelona<\/em>.<\/p>\n\n\n\n<p>However, only the first one corresponds to the actual intention of the user. 
Although it\u2019s an artificial example, it clearly illustrates the limitations of lexical search.<\/p>\n\n\n\n<p>Modern semantic approaches aspire to outperform lexical methods and promise to capture the real intention and context of search queries. Unlike lexical techniques, they don\u2019t rely solely on keyword matching but exploit deep neural networks trained on massive corpora and adapted for text-similarity tasks. Hence, they are expected to produce more accurate results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How To Capture Text Similarity?<\/h2>\n\n\n\n<p>Transformer-based models have reached human-level performance<strong> in many challenging text processing tasks<\/strong>. BERT-based models in particular (<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" rel=\"nofollow\" >Devlin et al., 2018<\/a>) have been widely applied thanks to their fine-tuning capabilities. As they are pre-trained on massive corpora, one could expect them to immediately capture the concept of text similarity and produce text embeddings directly applicable to similarity tasks.<\/p>\n\n\n\n<p>Unfortunately, it appears that \u201craw\u201d text embeddings are outperformed even by much simpler models like GloVe (<a href=\"https:\/\/nlp.stanford.edu\/pubs\/glove.pdf\" rel=\"nofollow\" >Pennington et al., 2014<\/a>) vectors averaged over all words! Luckily, further supervised fine-tuning may help to improve these disappointing results.<\/p>\n\n\n\n<p>There are plenty of annotated datasets (at least for English) that may help to fine-tune a pretrained model for a text-similarity task, but they vary in their interpretation of text similarity. Some treat text similarity as textual entailment, where the statement in one sentence must follow from the other.<\/p>\n\n\n\n<p>Alternatively, textual agreement might be expressed in terms of reading comprehension, where the task is to determine if a given text fragment answers the stated question. 
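To make the averaged-vectors baseline concrete, here is a minimal sketch in which tiny made-up 3-dimensional "word vectors" stand in for real GloVe embeddings: a sentence embedding is the mean of its word vectors, and similarity is the cosine between two such embeddings.

```python
import math

# Hypothetical 3-d "word vectors" standing in for real GloVe embeddings.
vectors = {
    "park":      [0.9, 0.1, 0.0],
    "garden":    [0.8, 0.2, 0.1],
    "barcelona": [0.1, 0.9, 0.2],
    "taxi":      [0.0, 0.2, 0.9],
}

def embed(sentence):
    # Sentence embedding = mean of the (known) word vectors.
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return [sum(col) / len(words) for col in zip(*words)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# "garden" is closer to "park" than "taxi" is, and the averaged
# embeddings preserve that relationship:
print(cosine(embed("park barcelona"), embed("garden barcelona")) >
      cosine(embed("park barcelona"), embed("taxi barcelona")))  # → True
```

Real GloVe vectors are 50- to 300-dimensional, but the pooling and scoring steps are exactly this simple, which is why the baseline is so cheap to run.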
Another idea is to consider similarity as a specific duplication problem. The table below presents some of the available datasets and corresponding similarity tasks.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12467\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-1.png\"><img decoding=\"async\" width=\"511\" height=\"384\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-1.png\" alt=\"Sample labeled datasets that might be used for text-similarity fine-tuning\" class=\"wp-image-12467\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-1.png 511w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-1-300x225.png 300w\" sizes=\"(max-width: 511px) 100vw, 511px\" \/><\/a><figcaption class=\"wp-element-caption\">Tab. 1 Sample labeled datasets that might be used for text-similarity fine-tuning<\/figcaption><\/figure>\n\n\n\n<p>In the literature, text similarity models are often compared on the&nbsp;<a href=\"https:\/\/ixa2.si.ehu.eus\/stswiki\/index.php\/STSbenchmark\" rel=\"nofollow\" >Semantic Text Similarity Benchmark<\/a>. It defines similarity as relatedness: a score from 0 to 5, reflecting how closely two sentences are related in terms of their semantic, syntactic, and lexical properties. <strong>The model performance is expressed as a correlation coefficient between predictions and ground truths.<\/strong><\/p>\n\n\n\n<p>The following table presents the performance of&nbsp;<a href=\"https:\/\/nlp.stanford.edu\/pubs\/glove.pdf\" rel=\"nofollow\" >GloVe<\/a>,&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" rel=\"nofollow\" >BERT<\/a>, and&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1908.10084\" rel=\"nofollow\" >SBERT<\/a>&nbsp;(a model trained for text-similarity) on the&nbsp;<a href=\"https:\/\/ixa2.si.ehu.eus\/stswiki\/index.php\/STSbenchmark\" rel=\"nofollow\" >STS benchmark<\/a>. 
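To illustrate how such an evaluation works, here is a small pure-Python sketch of Spearman's rank correlation, a coefficient commonly reported for the STS benchmark. The gold and predicted scores are made up for the example, and ties are assumed absent to keep the ranking trivial.

```python
import math

def ranks(xs):
    # Assign ranks 1..n by sorted order (ties would need average ranks;
    # this sketch assumes distinct scores).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(pred, gold):
    # Spearman = Pearson correlation of the rank-transformed scores.
    return pearson(ranks(pred), ranks(gold))

gold = [0.2, 4.8, 2.5, 3.9, 1.1]       # hypothetical ground-truth scores (0-5)
pred = [0.31, 0.95, 0.60, 0.82, 0.40]  # hypothetical model cosine similarities
print(round(spearman(pred, gold), 3))  # → 1.0: the orderings agree perfectly
```

Note that only the ordering matters: the predictions live on a completely different scale than the gold scores, yet the correlation is perfect because the model ranks the pairs identically.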
It\u2019s easy to notice that&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1908.10084\" rel=\"nofollow\" >SBERT<\/a> clearly outperforms the \u201craw\u201d approaches, which demonstrates the need for task-specific adaptation. It also shows that CLS-vectors are not the best choice\u2026<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12468\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-2.png\"><img decoding=\"async\" width=\"491\" height=\"377\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-2.png\" alt=\"Models\u2019 performance comparison on the STSb\" class=\"wp-image-12468\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-2.png 491w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-2-300x230.png 300w\" sizes=\"(max-width: 491px) 100vw, 491px\" \/><\/a><figcaption class=\"wp-element-caption\">Tab. 2 <a href=\"https:\/\/arxiv.org\/abs\/1908.10084\" rel=\"nofollow\" >Models\u2019 performance comparison on the STSb<\/a><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Lack of Annotated Data<\/h2>\n\n\n\n<p>Some language domains differ significantly from everyday language. Documents created by lawyers, scientists, or pharmacists use distinctive vocabulary and contain lengthy, detailed sentences. Hence, to capture the nuances of such domain-specific text similarity, <strong>one would require a domain-specific dataset.<\/strong> Unfortunately, such datasets are rare, especially for less common languages like Polish.<\/p>\n\n\n\n<p>Therefore, one of the most significant challenges in our project was the lack of an annotated dataset for our text-similarity problem. 
Ultimately, we managed to get 450 query\/answer pairs that we could use for testing purposes and to better understand the requirements, but we had to turn to unsupervised approaches for training.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Domain Adaptation<\/h3>\n\n\n\n<p>At this point, one may ask: why not use a general-purpose language model adapted to text-similarity tasks, e.g., HerBERT (<a href=\"https:\/\/arxiv.org\/abs\/2105.01735\" rel=\"nofollow\" >Mroczkowski et al., 2021<\/a>) (<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" rel=\"nofollow\" >BERT<\/a>&nbsp;for Polish) fine-tuned on the&nbsp;<a href=\"https:\/\/ixa2.si.ehu.eus\/stswiki\/index.php\/STSbenchmark\" rel=\"nofollow\" >STSb<\/a>&nbsp;dataset translated to Polish? Indeed, it would work; however, as shown by&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2004.10964\" rel=\"nofollow\" >Gururangan et al. (2020)<\/a>, adapting models to a particular domain leads to significant performance gains. Moreover, <a href=\"https:\/\/arxiv.org\/abs\/2104.06979\" rel=\"nofollow\" >Wang et al. (2021)<\/a> show that unsupervised approaches trained on domain data outperform supervised generic models.<\/p>\n\n\n\n<p>Those findings were especially interesting, as the domain language of our problem was meaningfully different from everyday Polish. While performing the initial data analysis, <strong>we extracted the 10K most frequent lemmas from the judgment documents<\/strong> and compared them with the 10K most frequent lemmas from the&nbsp;<a href=\"http:\/\/nkjp.pl\/index.php?page=0&amp;lang=1\" rel=\"nofollow\" >National Corpus of Polish<\/a>&nbsp;(excluding stop words). 
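The overlap check itself is a straightforward set intersection over the two top-K lemma lists. A sketch with tiny, hypothetical frequency counters (the real ones held 10K lemmas each):

```python
from collections import Counter

def top_k_lemmas(freqs, k):
    return {lemma for lemma, _ in freqs.most_common(k)}

# Hypothetical lemma-frequency counters; in the project these came from
# the judgment corpus and the National Corpus of Polish (stop words excluded).
judgments = Counter({"court": 90, "ruling": 80, "appeal": 70, "party": 60, "year": 50})
everyday = Counter({"year": 95, "time": 85, "person": 75, "work": 65, "home": 55})

k = 5
a, b = top_k_lemmas(judgments, k), top_k_lemmas(everyday, k)
overlap = len(a & b) / k * 100
print(f"{overlap:.0f}% overlap")  # → 20% overlap (only 'year' is shared)
```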
We determined that there is just a 34% overlap between those two sets!<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12469\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-3.png\"><img decoding=\"async\" width=\"470\" height=\"788\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-3.png\" alt=\"Most common lemmas in both datasets and their respective positions\" class=\"wp-image-12469\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-3.png 470w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-3-179x300.png 179w\" sizes=\"(max-width: 470px) 100vw, 470px\" \/><\/a><figcaption class=\"wp-element-caption\">Tab. 3 Most common lemmas in both datasets and their respective positions<\/figcaption><\/figure>\n\n\n\n<p>Table 3 presents the top 5 lemmas from the judgments\u2019 corpus and the National Corpus of Polish, along with their respective positions in the other dataset. Moreover, we\u2019ve also noticed that the judgments\u2019 sentences are significantly longer than in everyday Polish and more frequently written in the passive voice.<\/p>\n\n\n\n<p>Thus, considering the specificity of our domain\u2019s language and the effectiveness of unsupervised approaches shown in the aforementioned papers, we decided to turn our attention to unsupervised methods. Moreover, we had access to 350 000 courts\u2019 rulings, making up an 8 GB corpus.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Unsupervised Semantic Search<\/h2>\n\n\n\n<p>Building a semantic search engine is about constructing an efficient model for text similarity and scaling the solution to perform an effective search among thousands or millions of documents. Therefore, <strong>there is a need to make the right architectural decisions<\/strong>. We dealt with roughly 350 000 documents, which, split into text passages, yielded 16 million data points. Thus, we decided to apply a typical retrieve-rerank pattern often used in such systems. 
It means a two-step solution composed of a fast but less accurate method that retrieves a list of potential matches and a precise but slower technique that re-ranks the retrieved candidates.<\/p>\n\n\n\n<p>We decided to rely on the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" rel=\"nofollow\" >Okapi BM25<\/a> algorithm from Elasticsearch with the&nbsp;<a href=\"https:\/\/spacy.io\/models\/pl\" rel=\"nofollow\" >spaCy<\/a>&nbsp;Polish lemmatizer for the retrieve part and then rerank the top-K candidates with the semantic similarity model.<\/p>\n\n\n\n<p>Unfortunately, none of the described unsupervised methods could be directly applied to our task, as we dealt with a highly asymmetric search problem. The models presented in this article were designed to work with sentences of similar (symmetric) length, and in such circumstances, they perform well. However, they provide disappointing results once the queries are significantly shorter than the answers (asymmetric). While collecting the requirements, we noticed that the users form short non-factoid questions and expect explanatory answers that often span several sentences.<\/p>\n\n\n\n<p>We experimentally determined that a single query made of several words requires a text passage of a hundred words on average to answer it. 
Hence, we had to somehow mitigate the query\/answer length asymmetry to make the models work.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter wp-image-12470\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Pic.-1.png\"><img decoding=\"async\" width=\"850\" height=\"359\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Pic.-1.png\" alt=\"High-level overview of the system based on the retrieve-rerank pattern\" class=\"wp-image-12470\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Pic.-1.png 850w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Pic.-1-300x127.png 300w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Pic.-1-768x324.png 768w\" sizes=\"(max-width: 850px) 100vw, 850px\" \/><\/a><figcaption class=\"wp-element-caption\">Fig. 1 High-level overview of the system based on the retrieve-rerank pattern<\/figcaption><\/figure>\n\n\n\n<p>Our approach might be summarized as replacing the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" rel=\"nofollow\" >BM25<\/a>&nbsp;scores with semantic scores computed for the retrieved text passages. We designed a method that calculates semantic scores of the terms found by&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" rel=\"nofollow\" >BM25<\/a>&nbsp;based on their context with respect to the query and aggregates such partial results to score the entire text passage. We used&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" rel=\"nofollow\" >BERT<\/a>&nbsp;trained on our 8GB corpus as the masked language model (MLM), and we fine-tuned it for text similarity using&nbsp;<a href=\"https:\/\/ixa2.si.ehu.eus\/stswiki\/index.php\/STSbenchmark\" rel=\"nofollow\" >STSb<\/a>&nbsp;translated to Polish. However, instead of using the model to construct the embeddings, <strong>we predict the similarity score directly given the query and the contexts of found terms<\/strong>. 
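The rerank step can be sketched roughly as follows. This is only an illustrative skeleton, not the project's implementation: `similarity_model` is a hypothetical stand-in for the fine-tuned BERT similarity predictor (here replaced by a trivial word-overlap stub so the sketch runs), and aggregating the per-term context scores with `max` is an assumption.

```python
# Sketch of the rerank step in the retrieve-rerank pattern. It assumes the
# retrieve stage (e.g., BM25) has already produced candidate passages and the
# set of query terms it matched.
def similarity_model(query, context):
    # Stub: a real implementation would run the fine-tuned model on (query, context).
    shared = set(query.lower().split()) & set(context.lower().split())
    return len(shared) / max(len(query.split()), 1)

def context_window(tokens, i, radius=5):
    # The context of a matched term: a few tokens around its position.
    return " ".join(tokens[max(0, i - radius):i + radius + 1])

def rerank(query, candidates, matched_terms):
    scored = []
    for passage in candidates:
        tokens = passage.lower().split()
        term_scores = [similarity_model(query, context_window(tokens, i))
                       for i, tok in enumerate(tokens) if tok in matched_terms]
        # Aggregate the per-term context scores into a single passage score
        # (max as an assumed aggregation).
        scored.append((max(term_scores, default=0.0), passage))
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)]
```

The key idea the sketch tries to capture is that each matched term is judged by its local context against the whole query, which lets a short query be compared against a long passage piecewise rather than as one oversized pair.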
The whole solution was deployed on Microsoft Azure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The results<\/h3>\n\n\n\n<p>To evaluate the solution\u2019s performance, we relied on the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_reciprocal_rank\" rel=\"nofollow\" >MRR (Mean Reciprocal Rank)<\/a>&nbsp;metric, which is widely used in evaluating information retrieval systems. At first, we focused on the test set of 450 documents and noticed a significant improvement provided by the semantic approach, especially when using the semantic similarity model pretrained on the domain-specific data.<\/p>\n\n\n\n<p>Next, we extended the test set with all documents from the courts\u2019 rulings corpus, which made the search task more difficult, as we expected <strong>to identify the 450 test-set documents among 350 000 other records<\/strong>. In such conditions, the semantic search proved its robustness once again, performing twice as well as the previous lexical system used by the client (baseline). What might be worrying are the low&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_reciprocal_rank\" rel=\"nofollow\" >MRR<\/a>&nbsp;values for this experiment.<\/p>\n\n\n\n<p>However, this can be explained: some of the 350 000 documents were better matches for the constructed queries but were simply not labeled. Meanwhile, the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_reciprocal_rank\" rel=\"nofollow\" >MRR<\/a>&nbsp;was calculated only on the 450 labeled documents. 
Manual verification of the results confirmed this hypothesis.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12471\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-4.png\"><img decoding=\"async\" width=\"490\" height=\"376\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-4.png\" alt=\"Performance evaluation of the built system and comparison to the baseline (previous lexical-search system)\" class=\"wp-image-12471\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-4.png 490w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-4-300x230.png 300w\" sizes=\"(max-width: 490px) 100vw, 490px\" \/><\/a><figcaption class=\"wp-element-caption\">Tab. 4 Performance evaluation of the built system and comparison to the baseline (previous lexical-search system)<\/figcaption><\/figure>\n\n\n\n<p>I believe that the presented&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_reciprocal_rank\" rel=\"nofollow\" >MRR<\/a>&nbsp;values don\u2019t reflect the system\u2019s real performance: first, the test set was small and could be biased; second, the extended test set was not fully annotated. However, in each case, the relative comparison of different methods shows the robustness of the semantic approach. 
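For reference, MRR simply averages the reciprocal rank of the first relevant result over all queries; a minimal sketch:

```python
def mean_reciprocal_rank(result_lists, relevant_docs):
    """MRR over a set of queries: average of 1/rank of the labeled document.

    result_lists  -- one ranked result list per query
    relevant_docs -- the labeled (expected) document per query
    """
    total = 0.0
    for results, target in zip(result_lists, relevant_docs):
        if target in results:
            total += 1.0 / (results.index(target) + 1)
        # Queries whose labeled document was not retrieved contribute 0.
    return total / len(relevant_docs)

# Labeled doc ranked 2nd for the first query, 1st for the second:
print(mean_reciprocal_rank([["b", "a", "c"], ["x", "y"]], ["a", "x"]))  # → 0.75
```

This also makes the behavior described above visible: if an unlabeled but genuinely better document pushes the labeled one down the list, the labeled document's reciprocal rank (and hence the MRR) drops even though the result the user sees may be excellent.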
What\u2019s most important from the business perspective is the thoroughly positive feedback from our customer after one month of using the solution.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full wp-image-12474\"><a href=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-5-1.png\"><img decoding=\"async\" width=\"925\" height=\"372\" src=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-5-1.png\" alt=\"An example query and results produced by the previous lexical system and our solution\" class=\"wp-image-12474\" srcset=\"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-5-1.png 925w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-5-1-300x121.png 300w, https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Tab.-5-1-768x309.png 768w\" sizes=\"(max-width: 925px) 100vw, 925px\" \/><\/a><figcaption class=\"wp-element-caption\">Tab. 5 An example query and results produced by the previous lexical system and our solution<\/figcaption><\/figure>\n\n\n\n<p>For the lexical approach, the matched keywords are in bold. The results were truncated for readability. I&#8217;m not a lawyer, so I&#8217;m not sure if the previous system&#8217;s result is correct. However, I have no doubts about the correctness of the second result. The text was translated from Polish with the help <a href=\"https:\/\/www.deepl.com\/\" rel=\"nofollow\" >of DeepL<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n\n\n\n<p>Adapting pretrained masked language models to particular domains and then further fine-tuning them on a text-similarity auxiliary dataset seems to yield surprisingly good results. 
Even though one can\u2019t directly use such methods in scenarios where queries are much shorter than answers, it\u2019s relatively easy to construct an algorithm that leverages the text-similarity models and handles such asymmetric cases.<\/p>\n\n\n\n<p>Thanks to pragmatic architectural decisions and a thorough review of existing unsupervised text-similarity methods, we proved that domain-specific semantic search produces far better results than lexical approaches.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article shares the experience collected in constructing a PoC of a semantic search engine for Polish courts\u2019 rulings. 
The &hellip; <a class=\"continued-btn\" href=\"https:\/\/sii.pl\/blog\/en\/specialized-semantic-search-engine\/\">Continued<\/a><\/p>\n","protected":false},"author":276,"featured_media":12479,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_editorskit_title_hidden":false,"_editorskit_reading_time":0,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","inline_featured_image":false,"footnotes":""},"categories":[1319],"tags":[1442,1501,1352,1348],"class_list":["post-12463","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-soft-development","tag-ai-en","tag-artifiical-intelligence-en","tag-data-science-en","tag-search-engine-en"],"acf":[],"aioseo_notices":[],"republish_history":[],"featured_media_url":"https:\/\/sii.pl\/blog\/wp-content\/uploads\/2022\/01\/Specialized-Semantic-Search-Engine.png","category_names":["Soft development"],"_links":{"self":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/12463"}],"collection":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/users\/276"}],"replies":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/comments?post=12463"}],"version-history":[{"count":2,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/12463\/revisions"}],"predecessor-version":[{"id":22288,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/posts\/12463\/revisions\/22288"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media\/12479"}],"wp:attachment":[{"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/media?parent=12463"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/categories?post=1246
3"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sii.pl\/blog\/en\/wp-json\/wp\/v2\/tags?post=12463"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}