In the third decade of the 21st century, each day brings us a tremendous amount of information, unimaginable to previous generations. The reason is not only widespread access to the Internet but also the digitalization taking place in institutions such as corporations, scientific and medical centres, and bookkeeping offices. Huge volumes of data, previously maintained only as physical paper copies, are now stored in databases and accessible from a personal computer without having to get up from your desk.
How to Navigate Through the Ocean of Information?
However, access to information alone is insufficient, as it’s impossible to manually review hundreds or thousands of potential documents to find what we are looking for. Therefore, search engines go hand in hand with data storage systems.
The most common type of information is raw, unprocessed data, e.g. in the form of text documents such as court decisions, invoices or contracts. While it is very easy to search for documents created or modified within a certain date range or written by a specific author, searching the content of the documents is much more challenging.
Harvey Wants to Know
Let’s imagine there is a database full of court decisions collected over the last dozen years, and a lawyer called Harvey. Harvey would like to know how the court has ruled in cases similar to the one he is currently pursuing. Let’s assume that the case involves tax evasion. Harvey logs into our system and types the following phrase into the search engine: “tax evasion”. How will the search system work?
Classic full-text search systems index documents based on word frequency. However, this raises problems in handling different wordings. For example, if Harvey’s query is “tax evasion” and a document contains “avoidance of paying VAT”, then such a document is unlikely to be found, even though both phrases express a very similar concept. Moreover, a document containing the phrases “the company paid the tax” and “firewall evasion techniques” is very likely to be returned, because both “tax” and “evasion” appear in it, even though it has nothing to do with tax evasion.
The described approach is called lexical search. It is unable to understand the intention of the searcher or the context in which information is presented.
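To make both problems concrete, here is a minimal sketch of a word-frequency scorer. It is a deliberately simplified illustration, not how any particular search product works: the function names, document names and document texts are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_score(query, document):
    """Sum how often each distinct query word occurs in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))


# Hypothetical documents modelled on Harvey's example.
documents = {
    "vat_case": "The defendant was convicted for avoidance of paying VAT.",
    "it_report": "The company paid the tax. The report describes firewall evasion techniques.",
}

query = "tax evasion"
scores = {name: lexical_score(query, text) for name, text in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranking)
```

The relevant VAT ruling shares no word with the query and scores zero, while the unrelated IT report matches both “tax” and “evasion” and is ranked first.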
AI comes to the rescue
Recent advances in artificial intelligence, and in natural language processing in particular, have made it possible to solve both of the above problems with what is known as semantic search. Such a system is able to understand not only the user’s intention but also the context in which the information is given.
Semantic search is usually built on deep neural networks trained on large amounts of data. For example, the GPT-3 model was trained on a corpus of nearly 500 billion tokens (words and word fragments) drawn from websites, books, and Wikipedia! The training procedure is designed specifically to capture the meaning of individual words based on the context in which they occur.
Since natural language differs from domain to domain (for example, lawyers, architects and medics use slightly different vocabulary), it is advisable to adapt such models to subject-specific documents to improve search effectiveness.
Finally, the AI model is used to index the documents by taking into account the semantic meaning of text rather than just ordinary word occurrence.
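The idea of indexing by meaning can be sketched as follows. In a real system, a trained neural network would turn each text into a vector (an embedding). In this toy example we assume a tiny hand-written table of three-dimensional “concept” vectors instead, so that the code runs without any model. The table, the document texts and all names are hypothetical.

```python
import math
import re

# Hypothetical hand-assigned vectors standing in for a trained model.
# Axes: (taxation, avoiding an obligation, network security).
TOY_WORD_VECTORS = {
    "tax": (1.0, 0.0, 0.0),
    "vat": (1.0, 0.0, 0.0),
    "evasion": (0.0, 1.0, 0.0),
    "avoidance": (0.0, 1.0, 0.0),
    "firewall": (0.0, 0.0, 1.0),
}


def embed(text):
    """Sum the vectors of all known words; unknown words contribute nothing."""
    total = [0.0, 0.0, 0.0]
    for word in re.findall(r"[a-z]+", text.lower()):
        for i, value in enumerate(TOY_WORD_VECTORS.get(word, (0.0, 0.0, 0.0))):
            total[i] += value
    return total


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = {
    "vat_case": "The defendant was convicted for avoidance of paying VAT.",
    "it_report": "The company paid the tax. The report describes firewall evasion techniques.",
}

# Indexing step: every document is embedded once, ahead of query time.
index = {name: embed(text) for name, text in documents.items()}

query_vector = embed("tax evasion")
scores = {name: cosine(query_vector, vector) for name, vector in index.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)
```

With these toy vectors, the VAT ruling points in the same direction as the query and is ranked first, while the IT report is pulled away by its network-security content. A trained model learns such vectors from data and, unlike this word-by-word table, also takes the surrounding context of each word into account.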
One step forward: question answering
Semantic search engines can locate the information you are looking for very accurately. Even among hundreds of thousands of different documents, they manage to indicate the right text fragment. But would it be possible to go one step further and extract the exact answer rather than a text snippet?
Let’s suppose Harvey’s query is “what is the fine for speeding in a built-up area?” and he expects an exact answer, like $500. Since the semantic engine may already return the exact snippet of text containing the answer, the next step would be to extract the appropriate value from it.
It turns out that artificial intelligence comes to the rescue in this case as well: such a system can be built, provided that an appropriate, domain-specific training dataset is available.
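The second step of such a pipeline can be illustrated with a rough stand-in. A production system would use a trained answer-extraction model. The sketch below swaps in a plain regular expression that pulls the first dollar amount out of a snippet, which we assume the search step has already returned. The snippet text and the function name are invented for this example.

```python
import re

# Matches amounts such as $500, $1,500 or $49.99.
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def extract_amount(snippet):
    """Return the first dollar amount found in the snippet, or None."""
    match = MONEY_PATTERN.search(snippet)
    return match.group(0) if match else None


# Hypothetical snippet, assumed to come from the semantic search step.
snippet = "Exceeding the speed limit in a built-up area is punishable by a fine of $500."
answer = extract_amount(snippet)

print(answer)
```

A regular expression only works when the answer has a predictable shape. A trained model is needed when the answer may be a name, a condition or a whole clause, which is why a domain-specific training dataset matters.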
Less obvious capabilities of semantic search engines
In the previous paragraphs, we described classic search engines, where potential answers appear for a given query. However, this is only the beginning… There are plenty of amazing tools that can be built on top of AI-powered search engines! Let us mention just two more.
Let’s assume that we would like to determine whether our documents comply with certain regulations and whether they meet the requirements imposed on a given type of document. Given a defined list of regulations and requirements, the AI model is able to learn the requirements and verify the correctness of user documents against them. What is more, it also allows checking the consistency of a document against a group of other documents, e.g. by comparing dates, sums of money or any other values. This significantly reduces the time needed for manual data validation and avoids many potential mistakes that are difficult for a human to notice. Such a solution can be useful e.g. in the healthcare industry, where it is often required to verify whether a given medical document meets regulatory standards, or to check the correctness of such a document against other documents already approved by the regulator.
Another challenge is extracting the key information from documents, such as names, dates, monetary values, laws, etc. This is especially relevant when there is a large number of long text documents and the user would like to get the key information quickly without having to read everything. Again, this is a search-like problem and AI can help. Moreover, it can even provide a short summary of the whole document! This may allow us to quickly identify and match the best-fitting offers to customers and present them with an excerpt of the most important information. In this way, customers will not only get the best-fitting offers, but they will also be able to get acquainted with them and compare them with other proposals faster.
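The consistency check itself can be pictured with a small sketch. We assume here that an AI model has already extracted the relevant fields from each document into a simple record. The comparison below is plain rule-based code, and the field names and values are purely illustrative.

```python
from datetime import date


def find_inconsistencies(document, reference):
    """Return the fields present in both records whose values differ."""
    shared_fields = document.keys() & reference.keys()
    return {
        field: (document[field], reference[field])
        for field in shared_fields
        if document[field] != reference[field]
    }


# Hypothetical records, assumed to have been extracted from two documents.
approved = {"patient_id": "P-001", "visit_date": date(2023, 5, 4), "total_cost": 1200.00}
submitted = {"patient_id": "P-001", "visit_date": date(2023, 5, 14), "total_cost": 1200.00}

issues = find_inconsistencies(submitted, approved)

print(issues)
```

Here the only reported difference is the visit date, which is exactly the kind of small discrepancy that is easy to overlook when comparing long documents by hand.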
Interestingly, from the technical point of view, the described problem is very similar to the problem of semantic search.
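As a simple illustration of turning free text into a structured record, the sketch below collects dates and monetary values from a document. It uses regular expressions as a stand-in for a trained extraction model, and it assumes ISO-formatted dates and dollar amounts. The offer text and all names are hypothetical.

```python
import re

# One pattern per kind of key information we want to collect.
PATTERNS = {
    "dates": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amounts": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}


def extract_key_information(text):
    """Return every match of each pattern, grouped by label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


# Hypothetical offer text.
offer = "Offer valid until 2024-06-30. The monthly fee is $49.99 and the setup fee is $1,200."
info = extract_key_information(offer)

print(info)
```

Names, laws and other free-form entities do not follow such fixed patterns, which is where a trained model becomes necessary.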
The above use cases are not purely theoretical. They describe experiences gathered by our AI team in real projects addressing real problems of our customers. However, this article by no means exhausts the possibilities offered by artificial intelligence. So if you are facing a challenge that requires efficient processing of a large number of text documents, but you do not find your use case in this article, please contact us. We will be happy to discuss and help you choose the right solution.
Modern search engines, based on artificial intelligence, allow almost unlimited freedom in formulating queries and indexing documents. They are able to understand both the meaning of information and the context in which it is presented, so that search results go hand in hand with user intent. Moreover, AI models can be tailored to the specific domain at hand.
Quick access to the right information is crucial in many areas of life and allows making the right decisions, which often translate into the financial health of a company, investment decisions or simply time-saving. Browsing through hundreds of search results can be compared to looking for a needle in a haystack. In such cases, Artificial Intelligence becomes a strong magnet that easily finds the needle you are looking for!