Have you ever struggled to find and pull information out of tons of documents, reports, and resources until your eyes felt like they were bleeding?

We know all of us have been there.

It’s like searching for a needle in a haystack, right?

At such moments, we desperately need a helping hand, somebody or something that would save our eyesight and time and drag out that data. Assistance like this is usually necessary when departments across industries (healthcare, engineering, etc.) need to process plenty of documentation.

Happily, we live in an era of rapidly evolving AI technologies, whose tools, methods, and techniques can make many lives easier.

For instance, did you know that conversational search can help you extract vital insights from a sea of data in the blink of an eye?

Yes, exactly; today, we will discuss how leveraging this technology can be a game-changer for businesses. We’ll peek behind the curtain to see how it works and what perks it brings. Also, we’ll reveal how it broadens the possibilities across industries (including EdTech).

And, of course, we can’t wait to share our experiment: building a QA bot for conversational search, enabling fast document processing and information retrieval. This time, we used 4 types of chains and will compare their results.
We play with technology quite often. If you haven’t read about how we built a helpdesk bot for an education platform, we recommend giving it a read.

What’s behind it?

Let’s clear up some tech aspects.

Conversational search is a technology that allows users to interact with search engines or digital assistants through conversation. It’s similar to how they would communicate with another person. It goes beyond traditional keyword-based searches and enables users to ask questions, provide context, and receive more contextually relevant results.

Conversational search usually leverages a mix of AI and ML technologies, such as NLP (with different LLMs) and text analysis, to understand user intent and provide meaningful responses.

Okay, now that we’re done with the tech, why does it matter to you?

Actually, businesses across industries might need conversational search for several compelling reasons:

  • Efficient information retrieval. The technology can streamline information retrieval and knowledge sharing within organizations, institutions, or educational platforms. Users, industry representatives, or employees can quickly find relevant documents and information, improving productivity and decision-making.
  • Personalization. Conversational search can consider user preferences and historical interactions, enabling businesses to provide personalized recommendations, content, and services. This approach would definitely enhance customer engagement and loyalty.
  • Deeper insights. Businesses can gain valuable insights into customer behavior, preferences, and pain points by analyzing conversational search queries and interactions. These insights can inform product development, marketing strategies, and decision-making processes.

What we wanted

To back up these claims and create an effective conversational search, we designed a demo pipeline using LangChain with ChatGPT through a question-answering chain. The main goal of this project was to build a conversational bot that would answer user questions over a given list of documents. We took 4 types of chains, each with a different working principle. We wanted to compare their speed, accuracy, and amount of output.

How it works

Here is the pipeline diagram, which describes all processes that run to get a correct answer to the given question:

The pipeline diagram, which describes all processes

So, what’s the process?  

  1. A user passes a folder with documents in .txt format to the system.
  2. All documents are loaded, and their text is split into chunks.
  3. When a user inputs a query, the TF-IDF retriever extracts the text chunks relevant to the question.
  4. These relevant text chunks, together with the user’s query, go to ChatGPT through the question-answering chain from the LangChain toolset.
  5. Finally, ChatGPT returns a formulated answer based on the corresponding text chunks.
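Steps 2 and 3 above can be sketched in pure Python. This is not our production pipeline (which used LangChain’s TF-IDF retriever); the chunk size, toy documents, and scoring details below are illustrative assumptions:

```python
import math
import re
from collections import Counter

def split_into_chunks(text, chunk_size=40):
    """Step 2: split a document's text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def tfidf_rank(chunks, query):
    """Step 3: score each chunk against the query with TF-IDF."""
    tokenized = [re.findall(r"\w+", c.lower()) for c in chunks]
    n = len(chunks)
    def idf(term):
        df = sum(1 for toks in tokenized if term in toks)
        return math.log((1 + n) / (1 + df)) + 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(tf[t] / len(toks) * idf(t)
                    for t in re.findall(r"\w+", query.lower()) if t in tf)
        scores.append(score)
    # Indices of chunks, most relevant first
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = ["A prompt is the text you pass to an LLM. " * 5,
        "Greedy decoding always picks the most probable word. " * 5]
chunks = [c for d in docs for c in split_into_chunks(d)]
ranking = tfidf_rank(chunks, "What is greedy decoding?")
top_chunk = chunks[ranking[0]]  # this chunk + the query would go to ChatGPT (step 4)
```

In the real pipeline, the top-ranked chunks are then packed into a prompt by one of the chain types below.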

In this experiment, we used 4 types of chains:

  • Stuff
  • Map Reduce
  • Refine
  • Map-Rerank

Stuff chain

The stuff documents chain (referring to “stuffing” as in “filling”) is the simplest form of document chaining. It takes a collection of documents, places them together within a single prompt, and sends that prompt to a language model. This chain is ideal for scenarios where documents are small and only a few are supplied per request.

The stuff documents chain

As the diagram shows, all relevant documents pass to the prompt together with the question, so this chain type sends the request to the LLM only once.
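To make that single-request behavior concrete, here is a minimal pure-Python sketch of the stuff chain’s control flow; the `stuff_chain` function, the `fake_llm` stand-in, and the prompt wording are our own illustrations, not LangChain’s actual implementation:

```python
def stuff_chain(llm, docs, question):
    """Stuff chain: put every document into one prompt, call the LLM once."""
    context = "\n\n".join(docs)
    prompt = (f"Use the following context to answer the question.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm(prompt)

calls = []                 # record every prompt sent to the "LLM"
def fake_llm(prompt):      # stand-in for a real ChatGPT call
    calls.append(prompt)
    return "stubbed answer"

answer = stuff_chain(fake_llm, ["doc one", "doc two", "doc three"],
                     "What is a prompt?")
# len(calls) == 1 -- the chain made exactly one LLM request with all documents
```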

Map Reduce chain

The first step in the map-reduce documents chain applies an LLM chain to each document individually (the Map step), treating the chain’s output as a new document. All of these new documents are then passed to a separate combine-documents chain, which produces a single output (the Reduce step). Optionally, the mapped documents can first be compressed or condensed so they fit into the combine-documents chain, which frequently passes them to an LLM; if required, this compression is performed recursively.

The map-reduce document chain

In this case, the chain sends requests to the LLM multiple times to get a correct answer, which can be more time-consuming.
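The map and reduce steps can be sketched as below; again, the function names, the `fake_llm` stand-in, and the prompts are illustrative assumptions rather than LangChain’s internals:

```python
def map_reduce_chain(llm, docs, question):
    # Map step: run the LLM over each document individually
    partials = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    # Reduce step: combine the per-document outputs in one final call
    joined = "\n".join(partials)
    return llm(f"Combine these partial answers into one:\n{joined}\n\n"
               f"Question: {question}")

calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"partial answer {len(calls)}"

final = map_reduce_chain(fake_llm, ["doc one", "doc two", "doc three"],
                         "What is a prompt?")
# len(calls) == 4: one map call per document plus one reduce call
```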

Refine chain

The refine documents chain generates a response by iterating over the input documents and incrementally improving its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to the LLM chain, producing a refreshed answer.

Since the Refine chain passes only one document at a time to the LLM, it is particularly well-suited for tasks that involve more documents than can fit within the model’s context window. It is worth noting that this approach makes significantly more LLM calls than, say, the Stuff documents chain. Additionally, certain tasks are hard to perform iteratively: for instance, the Refine chain may perform poorly when documents frequently reference each other or when a task requires detailed information from many documents at once.
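The refine loop can be sketched as follows; the `refine_chain` function and prompt wording are our own simplified illustrations under the assumption of one LLM call per document:

```python
def refine_chain(llm, docs, question):
    # Initial answer from the first document alone
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    # Refine step: revisit the answer once per remaining document
    for doc in docs[1:]:
        answer = llm(f"Existing answer:\n{answer}\n\n"
                     f"Refine it using this new context:\n{doc}\n\n"
                     f"Question: {question}")
    return answer

calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer v{len(calls)}"

final = refine_chain(fake_llm, ["doc one", "doc two", "doc three"],
                     "What is a prompt?")
# One LLM call per document; each later prompt carries the previous answer
```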

The refine documents chain

Map-Rerank chain

The map re-rank documents chain involves executing an initial prompt on each document to perform a specific task and assess its confidence level in generating an answer. Subsequently, the response with the highest score is selected and presented as the final result.
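A sketch of the select-the-highest-score behavior, assuming the model returns (answer, confidence) pairs; the `map_rerank_chain` function and the canned responses are illustrative, not LangChain’s actual scoring mechanism:

```python
def map_rerank_chain(llm_with_score, docs, question):
    """Map-Rerank: query each document separately, keep the best-scored answer."""
    scored = [llm_with_score(f"Context:\n{d}\n\nQuestion: {question}")
              for d in docs]
    best_answer, best_score = max(scored, key=lambda pair: pair[1])
    return best_answer

# Stand-in model that returns (answer, confidence score) pairs
canned = iter([("maybe BERT?", 40), ("BERT", 95), ("not sure", 10)])
def fake_llm(prompt):
    return next(canned)

result = map_rerank_chain(fake_llm, ["doc a", "doc b", "doc c"],
                          "What model is encoder-only?")
# result is "BERT", the answer with the highest confidence score (95)
```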

The map re-rank documents chain

How we did it

In this part, we experimented with the “Generative AI with Large Language Models” course to get answers to 5 questions about the corresponding materials. First, we took the lectures from the first part of this course, ending up with 10 files – each lecture as a separate file. Then we formulated 5 questions related to these lectures, passed them to ChatGPT through all the chain types described above, and received the following results.

Stuff chain

  1. How do LLM models work?

LLM models are decoder-only models that are commonly used today. They can generalize to most tasks and be trained to perform general text generation tasks. Examples of LLM models include the GPT family of models, BLOOM, Jurassic, and LLaMA.

  2. How is the text that you pass to an LLM named?

The text that you pass to an LLM is known as a prompt.

  3. What is the max new tokens parameter?

Max new tokens are a configuration parameter that can limit the number of tokens the model will generate. It can be seen as putting a cap on the number of times the model will go through the selection process.

  4. What is greedy decoding?

Greedy decoding is the simplest form of next-word prediction, where the model will always choose the word with the highest probability.

  5. What model is an example of an encoder-only model?

BERT is an example of an encoder-only model.

Map Reduce chain

  1. How do LLM models work?

Large language models (LLM) use the transformer architecture – split into two distinct parts: the encoder and the decoder. The input to the model is tokenized, which converts the words into numbers, each representing a position in a dictionary of all the possible words the model can work with. Popular decoder-only models include GPT, BLOOM, Jurassic, LLaMA, and many more.

  2. How is the text that you pass to an LLM named?

The text that you pass to an LLM is known as a prompt.

  3. What is the max new tokens parameter?

Max new tokens are a configuration parameter that can limit the number of tokens the model will generate during inference.

  4. What is greedy decoding?

Greedy decoding is the simplest form of next-word prediction, where the model will always choose the word with the highest probability.

  5. What model is an example of an encoder-only model?

BERT is an example of an encoder-only model.

Refine chain

  1. How do LLM models work?

Large language models (LLM) take natural language or human written instructions and use them to perform tasks. They are trained on large text datasets and use fine-tuning techniques to adapt them to specific use cases. This allows them to rapidly build customized solutions without preparing a new model from scratch. LLM models can understand the context of the instructions and use that to determine how to respond. For this purpose, the transformer architecture is split into two distinct parts: the encoder and the decoder. These components work in conjunction with each other and share several similarities. The inputs to the model are at the bottom, and the outputs are at the top. The words must be tokenized before passing texts into the model to process. This converts the words into numbers, each representing a position in a dictionary of all the possible words the model can work with. To facilitate this process, Hugging Face has built a library called Transformers. This Python library can load many standard public datasets that people use to train, fine-tune, or experiment with. Additionally, these models can be scaled to different lengths and trained to perform general text generation tasks. 

  2. How is the text that you pass to an LLM named?

The text that you pass to an LLM is known as a prompt. This prompt is fed into the model, and the output text is the completion. The total amount of text or the memory available for the prompt is called the context window. The act of generating text is known as inference. When defining the project’s scope, it is essential to consider what function the LLM will have in the specific application. Depending on the task, the model may need to be able to carry out many different jobs, including long-form text generation or with a high degree of capability, or it may only need to be good at one thing, such as named entity recognition. Knowing the specific task that the LLM will be used for can help determine the model’s size and architecture, which can help save time and resources.

Additionally, providing an example within the prompt can improve performance. For instance, if the task is sentiment analysis, the prompt text should include a sample review with the identified sentiment. This will help the model understand the job and the format of the response that you are expecting.

  3. What is the max new tokens parameter?

The text that you pass to an LLM is known as a prompt. This prompt is fed into the model, and the output text is the completion. The total amount of text or the memory available for the prompt is called the context window. The act of generating text is known as inference. When defining the project’s scope, it is essential to consider what function the LLM will have in the specific application. Depending on the task, the model may need to be able to carry out many different jobs, including long-form text generation or with a high degree of capability, or it may only need to be good at one thing, such as named entity recognition. Knowing the specific task that the LLM will be used for can help determine the model’s size and architecture, which can help save time and resources.

Additionally, providing an example within the prompt can improve performance. For instance, if the task is sentiment analysis, the prompt text should include a sample review with the identified sentiment. This will help the model understand the task and the format of the response that you are expecting.

  4. What is greedy decoding?

Greedy decoding is the simplest form of next-word prediction, where the model will always choose the word with the highest probability. It is used to generate summaries of conversations by comparing the baseline human summary to what the model predicts is the summary. To introduce some variability, instead of selecting the most probable word every time with random sampling, the model chooses an output word randomly using the probability distribution to weight the selection. For example, the term banana has a probability score of 0.02 in the illustration. Random sampling equates to a 2% chance that this word will be selected. By using this sampling technique, we reduce the likelihood that words will be repeated. 

However, depending on the setting, the output may be too creative, producing words that cause the generation to wander off into topics or words that don’t make sense. To help limit the random sampling and increase the likelihood of producing more accurate summaries, top k, and top p sampling techniques can be used. In some implementations, you may need to turn off greedy and enable random sampling explicitly. For example, the Hugging Face transformers implementation in the lab requires we set the sample to equal true. 

  5. What model is an example of an encoder-only model?

BERT is an example of an encoder-only model, which encodes input sequences into a profound representation of the structure and meaning of the input. The encoder does not use the decoder component of the transformer, which works from input token triggers to generate new tokens in a loop until some stop condition has been reached. This model interacts with libraries and APIs to create natural language or human written instructions and perform tasks much as a human would. The text passed to an encoder-only model is known as a prompt, and the model output is called a completion. The completion comprises the text contained in the original prompt, followed by the generated text. 

This model can be used to answer questions, such as determining where Ganymede is located in the solar system. It can also be used for tasks such as named entity recognition, where the model only needs to be good at one thing. When deciding whether to train your model from scratch or work with an existing base model, you’ll need to consider the feasibility of preparing your model and the computing cost. Once you have your model in hand, you can assess its performance and carry out additional training if needed.

Map-Rerank chain

  1. How do LLM models work?

Large language models can take natural language or human written instructions and perform tasks much as a human would.

  2. How is the text that you pass to an LLM named?

The text that you pass to an LLM is known as a prompt.

  3. What is the max new tokens parameter?

Max new tokens are a configuration parameter invoked at inference time and give you control over the maximum number of tokens in the completion. It limits the number of times the model will go through the selection process.

  4. What is greedy decoding?

Greedy decoding is the simplest form of next-word prediction, where the model will always choose the word with the highest probability.

  5. What model is an example of an encoder-only model?

BERT

The results we got

Summarizing all experiences during the experiment and its results, we can highlight the following outcomes: 

  • The answers with the most extensive content came from the Refine chain. Still, it was also the slowest of the chain types.
  • The fastest chain was Map-Rerank; it outpaced the others, although its answers were too short.
  • The Stuff and Map Reduce chains were the most balanced solutions: both work fast and return correct, well-presented answers.

Such a QA bot can answer questions over a given set of documents and return correct answers, significantly increasing the efficiency of information assimilation.

Afterwards

Similar QA bots can boost many spheres of life, helping businesses save time, improve accuracy, and enhance decision-making. This kind of convenient document processing efficiently extracts and presents information from extensive document repositories.

Here are 3 common examples:

EdTech (our experimental use case). Here, question-answering bots are already revolutionizing the learning experience. They can assist students by providing instant answers and explanations from course materials, textbooks, and educational resources. This technology can support both formal education in schools and universities and informal learning through online courses and e-learning platforms.

Healthcare and medical research. In this industry, such QA bots can help professionals access and interpret vast amounts of medical literature, research papers, and patient records. Medical experts and industry representatives can use these bots to find answers to clinical questions quickly, stay updated on the latest medical advancements, and make informed decisions about patient care. Also, such a feature can highlight and boost your healthcare startup.

Legal services. The legal industry can leverage question-answering bots that can sift through legal documents, case law, and statutes to provide attorneys and legal professionals with relevant information and precedents. These bots can streamline legal research, help with drafting legal documents, and support decision-making in legal cases, ultimately increasing the efficiency of legal services.

We at Unidatalab always use technologies to bring efficiency, high performance, and convenience to our clients and their customers. Got some ideas? Let’s connect then!
