How to Build a RAG-Powered Chatbot with Chat, Embed, and Rerank


This learning module is part of Cohere’s LLM University. We offer a comprehensive curriculum to give you a rock-solid foundation in large language models. To learn more, see the full course.

In the previous chapter of the Chat with Retrieval-Augmented Generation (RAG) module, we discussed building a chatbot using Cohere’s Chat endpoint. 

In this chapter, you’ll learn how to add RAG capabilities to the chatbot and enable it to connect to external documents, ground its responses on these documents, and produce document citations in its responses.

With RAG, developers can build powerful product experiences for the enterprise and mitigate hallucinations by producing grounded and verifiable generations. The Chat endpoint comes integrated with RAG features, which greatly simplifies the task of developing RAG-powered applications.

There are three RAG modes available with the Chat endpoint:

  • Document mode: Specifying the documents for the model to use when generating a response
  • Connector mode: Connecting the endpoint with an external service that handles all the logic of document retrieval
  • Query-generation mode: Generating one or more queries given a user message

In this chapter, you’ll learn how to use RAG in document mode, which will also require the query-generation mode. In the next chapter, you’ll learn how to use RAG in connector mode. Refer to the documentation for more details about these three modes of RAG.
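
To make these modes concrete, here is a rough sketch of what each call looks like with the Python SDK; the my_documents list and the web-search connector ID are placeholders for illustration, and the exact parameters are covered later in this chapter and in the documentation.

# Document mode: pass the documents for the model to ground its response on
# (my_documents is a placeholder list of dictionaries)
response = co.chat(message="...", documents=my_documents)

# Connector mode: have a connector (e.g., a web-search connector) handle retrieval
response = co.chat(message="...", connectors=[{"id": "web-search"}])

# Query-generation mode: only generate search queries from the user message
response = co.chat(message="...", search_queries_only=True)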

Before going into the step-by-step guide, let’s look at the high-level implementation plan of the demo application that we’ll build. Below is a diagram that provides an overview of what we’ll build, followed by a list of the key steps involved.

A high-level implementation plan for the RAG-powered chatbot

Setup phase:

  • Step 0: Ingest the documents – get documents, chunk, embed, and index.

For each user-chatbot interaction:

  • Step 1: Get the user message
  • Step 2: Call the Chat endpoint in query-generation mode
  • If at least one query is generated
    • Step 3: Retrieve and rerank relevant documents
    • Step 4: Call the Chat endpoint in document mode to generate a grounded response with citations
  • If no query is generated
    • Step 4: Call the Chat endpoint in normal mode to generate a response

Throughout the conversation:

  • Append the user-chatbot interaction to the conversation thread
  • Repeat with every interaction
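
To make the flow concrete, here is a hedged, pseudocode-style sketch of the interaction loop; ingest and retrieve_and_rerank are placeholder names for the components we'll build step by step below.

docs = ingest(sources)  # Step 0: get documents, chunk, embed, and index

while True:
    message = input("User: ")  # Step 1: get the user message

    # Step 2: call the Chat endpoint in query-generation mode
    response = co.chat(message=message, search_queries_only=True)

    if response.search_queries:
        # Steps 3-4: retrieve and rerank documents, then generate a grounded response
        documents = retrieve_and_rerank(docs, response.search_queries)
        answer = co.chat(message=message, documents=documents)
    else:
        # Step 4: generate a response in normal mode
        answer = co.chat(message=message)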

Below is a screenshot of a sample conversation between the chatbot and a user.

A sample conversation between the chatbot and a user

This chatbot acts as an intelligent knowledge assistant. It is capable of extracting relevant context from external documents and using it to provide helpful responses to a user, verifiable through document citations.

The chatbot provides helpful and verifiable responses through citations

By wrapping RAG capabilities in a chat paradigm, we can build context-aware applications that both maintain the state of a conversation and generate grounded, citation-backed responses. This enables practical applications in the enterprise, such as helping customer support agents synthesize information from multiple sources and helping knowledge workers refine reports.

The Chat endpoint wraps RAG capabilities with a chat paradigm

We’ll use Cohere’s Python SDK for the code examples. This chapter comes with a Google Colaboratory notebook. Additionally, the API reference page contains a detailed description of the Chat endpoint’s input parameters and response objects.

As mentioned earlier, this guide shows how to build a RAG system using the document mode of the Chat endpoint.

This application will use several Cohere API endpoints:

  • Chat: For handling the main logic of the chatbot, including turning a user message into queries, generating responses, and producing citations
  • Embed: For turning textual documents into their embedding representations, later to be used in retrieval (we’ll use the latest, state-of-the-art Embed v3 model)
  • Rerank: For reranking the retrieved documents according to their relevance to a query

This demo application will use Cohere’s Chat, Embed, and Rerank endpoints

Setup

First, let’s install the necessary libraries for this project. This includes cohere, hnswlib for the vector library, and unstructured for chunking the documents (more details on these later).

pip install cohere hnswlib unstructured -q

Then, import the necessary modules from these libraries in addition to other required modules. Let’s also create a Cohere client.

import cohere
import os
import hnswlib
import json
import uuid
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

co = cohere.Client(os.environ["COHERE_API_KEY"])

We’ll build three classes that form the key components of the application: Documents, Chatbot, and App.

Three key components of the application: Documents, Chatbot, and App

Now, let’s start building the first component: Documents.

Create Documents Component

The Documents class handles the ingestion of documents and, given a query, returns the relevant documents. This involves retrieval (using vector search) and reranking.

The Documents component handles document ingestion as well as retrieval

We start by creating the class, which takes a list of dictionaries representing the document sources.

class Documents:

    def __init__(self, sources: List[Dict[str, str]]):
        self.sources = sources
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load()
        self.embed()
        self.index()

Each dictionary item represents a web page and contains two keys: “title” and “url”. Here’s the format:

sources = [
    {
        "title": "Text Embeddings", 
        "url": "https://docs.cohere.com/docs/text-embeddings"
     },
    {
        ...
    }  
]

We also initialize a few instance attributes and methods. The attributes include self.sources to represent the raw documents, self.docs to represent the chunked version of the documents, self.docs_embs to represent the embeddings of the chunked documents, and a couple of top_k parameters to be used for retrieval and reranking. 

Meanwhile, the methods include load, embed, and index for ingesting documents. These methods load a set of raw documents, break them into smaller chunks, generate embeddings for each chunk, and store these in an index.

The document ingestion portion of the Documents component

Load and Chunk Documents

Next, we create the load method to load and chunk the documents.

During loading, each URL is processed and its content is turned into smaller chunks. Chunking for information retrieval is a broad topic in and of itself, with many strategies being discussed within the AI community. For our example, we’ll use the partition_html and chunk_by_title methods from the unstructured library. Read its documentation for more information about its chunking approach.

We turn each chunk into a dictionary object containing three fields: title (the web page’s title), text (the textual content of the chunk), and url (the web page’s URL). This information will eventually be passed to the chatbot’s prompt for generating the response, so it’s crucial to populate relevant information into this dictionary.

Note that we are not limited to these three fields. At a minimum, the Chat endpoint requires the text field, but beyond that, we can add custom fields that can provide more context about the document, such as subtitles, snippets, tags, and others.
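
For example, a chunk dictionary could look like the following; the snippet and tags fields are hypothetical extras shown only to illustrate that custom fields are allowed:

doc = {
    "title": "Text Embeddings",
    "text": "Embeddings associate words and sentences with vectors of numbers ...",
    "url": "https://docs.cohere.com/docs/text-embeddings",
    # Hypothetical extra fields that give the model more context
    "snippet": "An introduction to text embeddings",
    "tags": "embeddings, nlp",
}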

On another note, the text field will be required for prompt truncation purposes. There is another parameter of the endpoint called prompt_truncation which accepts either AUTO or OFF as an argument. With prompt_truncation set to AUTO, some elements from the conversation history and documents will be dropped in an attempt to construct a prompt that fits within the model’s context length limit. When this happens, the endpoint will use the text field for prompt truncation.
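
As a minimal sketch (assuming a documents list like the one returned by the retrieval step later in this chapter), enabling prompt truncation is just a matter of passing the parameter in the call:

response = co.chat(
    message="What did the report say about Q4 performance?",
    documents=documents,
    prompt_truncation="AUTO",  # or "OFF" to disable truncation
)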

The resulting documents are stored in the self.docs attribute.

class Documents:

    ...
    ...

    def load(self) -> None:
        """
        Loads the documents from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for source in self.sources:
            elements = partition_html(url=source["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": source["title"],
                        "text": str(chunk),
                        "url": source["url"],
                    }
                )

Embed Documents

Next, we create the embed method to generate the embeddings of the chunked documents. We use the Embed endpoint with the Embed v3 model, which offers state-of-the-art performance on the trusted MTEB and BEIR benchmarks. The model we’ll use is embed-english-v3.0.

With the Embed v3 model, we need to define an input_type, of which there are four options depending on the type of task. Using these input types ensures the highest possible quality for the respective tasks. For our documents, which will be the targets of retrieval, we use search_document as the input_type.

class Documents:

    ...
    ...

    def embed(self) -> None:
        """
        Embeds the documents using the Cohere API.
        """
        print("Embedding documents...")

        batch_size = 90
        self.docs_len = len(self.docs)

        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts,
                model="embed-english-v3.0",
                input_type="search_document",
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

Note that we are sending the documents to the Embed endpoint in batches because the endpoint has a limit of 96 documents per call.

The resulting document embeddings are stored in the self.docs_embs attribute.

Index Documents

Next, we create the index method to index the document embeddings.

We store the embeddings in an index for a number of reasons. One of them is retrieval efficiency. The index stores the embeddings in a structured and organized way. This organization ensures an efficient similarity search during retrieval.

There are many options available for building an index. For production environments, typically a vector database is required to handle the continuous process of indexing documents and maintaining the index. 

In our example, however, we’ll keep it simple and use a vector library instead. We can choose from many open-source projects, such as Faiss, Annoy, ScaNN, and Hnswlib, which is the one we’ll use. These libraries store embeddings in in-memory indexes and implement approximate nearest neighbor (ANN) algorithms to make similarity search efficient.

The resulting index is stored in the self.index attribute.

class Documents:

    ...
    ...

    def index(self) -> None:
        """
        Indexes the documents for efficient retrieval.
        """
        print("Indexing documents...")

        self.index = hnswlib.Index(space="ip", dim=1024)
        self.index.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.index.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.index.get_current_count()} documents.")

Implement Retrieval

Next, we create the retrieve method to retrieve relevant documents given a query.

We’ll implement a semantic search system that leverages embeddings to retrieve documents, offering significant improvements over basic keyword-matching approaches. Embeddings can capture the contextual meaning of a document, thus enabling the retrieval of highly relevant results to the given query.

First, we need to turn the query into embeddings. For this, we use the embed-english-v3.0 model, this time with search_query as the input_type.

The retrieval is performed by the knn_query method from the hnswlib library. Given a query, it returns the documents most similar to the query. We can define the number of top documents to retrieve using the attribute self.retrieve_top_k, for which we choose 10.

class Documents:

    ...
    ...

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves documents based on the given query.

        Parameters:
        query (str): The query to retrieve documents for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved documents, with 'title', 'text', and 'url' keys.
        """
        docs_retrieved = []
        query_emb = co.embed(
            texts=[query],
            model="embed-english-v3.0",
            input_type="search_query",
        ).embeddings

        doc_ids = self.index.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

Implement Reranking

Next, we implement a reranking step in the retrieve method.

While our semantic search component is already highly capable of retrieving relevant documents, the Rerank endpoint provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results, in our case, 10 documents, and sorts them according to their relevance to the query.

A more detailed view of document ingestion, retrieval, and reranking

Implementing reranking with the Rerank endpoint requires just one line of code. To call the endpoint, we pass the query and the list of documents to be reranked. We also define the number of top reranked documents to retrieve using the attribute self.rerank_top_k, for which we choose 3. The model we use is rerank-english-v2.0.

Finally, we store the top reranked documents in the docs_retrieved list and return them to the chatbot, which we’ll implement next.

Note: When prompt_truncation is set to AUTO, the endpoint handles reranking by default. In this case, we don’t have to implement our own rerank step. However, we can still do so if we want greater control over the reranking process, for example, to define the number of documents to keep after reranking. On this note, we’ll dive deeper into how prompt truncation works when we discuss connectors in the coming chapters.

class Documents:

    ...
    ...

    def retrieve(self, query: str) -> List[Dict[str, str]]:

        ...
        ...
				

        docs_to_rerank = []
        for doc_id in doc_ids:
            docs_to_rerank.append(self.docs[doc_id]["text"])

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v2.0",
        )

        doc_ids_reranked = []
        for result in rerank_results:
            doc_ids_reranked.append(doc_ids[result.index])

        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved
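
With the full retrieve method in place, it can be exercised directly; here is a quick sketch of its usage, assuming the sources list shown earlier:

docs = Documents(sources)
results = docs.retrieve("What is the attention mechanism?")
for doc in results:
    print(doc["title"], "-", doc["url"])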

Create Chatbot Component

The Chatbot class handles the logic of the chatbot, including generating search queries based on a user message, retrieving documents, and generating the response to the user. 

The Chatbot component handles the chatbot logic, from getting the user message to generating the response

This is where we implement the methods that call the Chat endpoint. As mentioned earlier, we’ll see how the endpoint is used in two ways: generating queries and generating responses in document mode.

We start by creating the Chatbot class, which takes an instance of the Documents class. We initialize a self.docs attribute for that instance, as well as a unique conversation ID that we’ll need for each conversation.

class Chatbot:

    def __init__(self, docs: Documents):
        self.docs = docs
        self.conversation_id = str(uuid.uuid4())

Generate Queries

The first step is to decide on how to handle a user message. With RAG-powered chatbots, there are two key decisions to make at this point:

  • Should it respond to the user message directly or retrieve external information before responding?
  • If the decision is to retrieve information, what is the optimal set of queries given the user message?

One characteristic of the Chat endpoint is that the underlying Command model has been trained to handle these scenarios. That means we can leverage the capability out of the box without any further finetuning.

Let’s illustrate what this means with a few examples.

First, we need to call the Chat endpoint in query-generation mode. The syntax is simple: pass the user message and set search_queries_only to True. As for the response, we are interested in the search_queries field of the cohere.Chat object.

response = co.chat(message=message, search_queries_only=True)

A few scenarios can happen in query-generation mode:

  • No query needed: Suppose we have a user message of “Hello, I need help with a report I’m writing.” This type of message doesn’t require any additional context from external information, hence retrieval is not required. A direct chatbot response will suffice (for example: “Sure, how can I help?”). When we send this to the Chat endpoint, we get an empty search_queries result, which is what we expect.
  • One query generated: Take this user message: “What did the report say about the company’s Q4 performance?”. This does require additional context as it refers to a report, hence retrieval is required. Given this message, the Chat endpoint returns the search_queries result of Q4 company performance. Here it turns the user message into a query optimized for search. Another important scenario is generating queries in the context of the conversation. Suppose there’s an ongoing conversation where the user is learning from the chatbot about deep learning. If at some point, the user asks, “Why is it important”, then the generated search_queries will become why is deep learning important, providing the much-needed context for the retrieval process.
  • More than one query generated: What if the user message is a bit more complex, such as “What did the report say about the company’s Q4 performance and its range of products and services?”. This requires multiple pieces of information to be retrieved. Given this message, the Chat endpoint returns two search_queries results: Q4 company performance and company's range of products and services.
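
A quick way to observe these scenarios yourself is to call the endpoint in query-generation mode with each message and inspect search_queries (the exact structure of the returned queries may vary slightly by SDK version):

messages = [
    "Hello, I need help with a report I'm writing.",
    "What did the report say about the company's Q4 performance?",
    "What did the report say about the company's Q4 performance and its range of products and services?",
]

for message in messages:
    response = co.chat(message=message, search_queries_only=True)
    print(message)
    print("Generated queries:", response.search_queries, "\n")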

These scenarios highlight the adaptability of the Chat endpoint to decide on the next course of action based on a user message. Thus, the first step we want to implement is the query generation step. This becomes the first part of a method that we’ll call generate_response.

class Chatbot:

    ...
    ...

    def generate_response(self, message: str):
        """
        Generates a response to the user's message.

        Parameters:
        message (str): The user's message.

        Yields:
        Event: A response event generated by the chatbot.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved documents.

        """

        # Generate search queries (if any)
        response = co.chat(message=message, search_queries_only=True)

Retrieve and Rerank Documents

If the chatbot response in the query-generation mode contains at least one search query, then the next step is to retrieve documents that are relevant to the queries. For this, we create the retrieve_docs method to retrieve and rerank the documents via the Documents class we created earlier.

The retrieved documents are then collected in the retrieved_docs list, which is returned to the caller.

class Chatbot:

    ...
    ...

    def generate_response(self, message: str):

        ...
        ...

        # If there are search queries, retrieve documents and respond
        if response.search_queries:
            print("Retrieving information...")

            documents = self.retrieve_docs(response)

            ...
            ...

    def retrieve_docs(self, response) -> List[Dict[str, str]]:
        """
        Retrieves documents based on the search queries in the response.

        Parameters:
        response: The response object containing search queries.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved documents.

        """
        # Get the query(s)
        queries = []
        for search_query in response.search_queries:
            queries.append(search_query["text"])

        # Retrieve documents for each query
        retrieved_docs = []
        for query in queries:
            retrieved_docs.extend(self.docs.retrieve(query))

        return retrieved_docs

Generate Response

Now that we have the relevant documents retrieved, we can pass them to the Chat endpoint in order to generate a response. For this, we call the Chat endpoint in document mode by adding a documents parameter to the call and passing the documents we retrieved earlier. There is no prompt engineering required as it’s handled by the endpoint.

Meanwhile, if the chatbot response in query-generation mode doesn’t contain any search queries, then it doesn’t require information retrieval. To generate the response, we call the Chat endpoint another time, passing the user message and without needing to add any documents to the call.

In either case, we also pass the conversation_id parameter, which retains the interactions between the user and the chatbot in the same conversation thread. We also enable the stream parameter so we can stream the chatbot response to the application.

class Chatbot:

    ...
    ...

    def generate_response(self, message: str):

        ...
        ...

        # If there are search queries, retrieve documents and respond
        if response.search_queries:
            print("Retrieving information...")

            documents = self.retrieve_docs(response)

            response = co.chat(
                message=message,
                documents=documents,
                conversation_id=self.conversation_id,
                stream=True,
            )
            for event in response:
                yield event

        # If there is no search query, directly respond
        else:
            response = co.chat(
                message=message,
                conversation_id=self.conversation_id,
                stream=True,
            )
            for event in response:
                yield event

Create App Component

The App class handles the interaction between the user and the chatbot. In our case, we are creating a simple text interface in a Jupyter notebook.

The App component handles the interaction between the user and the chatbot

We start by creating the App class, which takes an instance of the Chatbot class.

class App:
    def __init__(self, chatbot: Chatbot):
        """
        Initializes an instance of the App class.

        Parameters:
        chatbot (Chatbot): An instance of the Chatbot class.

        """
        self.chatbot = chatbot

Get User Message

Next, we create a run method and implement the logic for getting the user message, as well as providing a way for the user to end a conversation.

class App:

    ...
    ...

    def run(self):
        """
        Runs the chatbot application.
        """
        while True:
            # Get the user message
            message = input("User: ")

            # Typing "quit" ends the conversation
            if message.lower() == "quit":
                print("Ending chat.")
                break
            else:
                print(f"User: {message}")

Display Response with Citations

Next, we pass the user message to the generate_response method we created earlier in the Chatbot class, which goes through the steps of generating queries, retrieving relevant information, and generating a response.

To display the response, we print the text-generation events from the response stream.

On top of generating the response, the Chat endpoint also provides citations to indicate the spans of the retrieved documents on which the response is grounded. Here is one example:

[{'start': 59, 'end': 73, 'text': 'large datasets', 'document_ids': ['doc_0', 'doc_1']}]

The format of each citation is:

  • start: The starting point of a span where one or more documents are referenced
  • end: The ending point of a span where one or more documents are referenced
  • text: The text representing this span
  • document_ids: The IDs of the documents being referenced (doc_0 being the ID of the first document passed to the documents parameter in the endpoint call, and so on)
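
As a small, hedged illustration, the document_ids can be mapped back to the documents that were passed to the call (here, documents is the list returned by retrieve_docs, and the index is parsed from the doc_N ID):

citation = {'start': 59, 'end': 73, 'text': 'large datasets', 'document_ids': ['doc_0', 'doc_1']}

for doc_id in citation["document_ids"]:
    idx = int(doc_id.split("_")[1])  # "doc_0" -> 0
    source = documents[idx]          # the documents passed to co.chat(...)
    print(citation["text"], "->", source["title"], source["url"])
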
class App:

    ...
    ...

    def run(self):

        while True:

            ...
            ...

            # Get the chatbot response
            response = self.chatbot.generate_response(message)

            # Print the chatbot response
            print("Chatbot:")
            flag = False
            for event in response:
                # Text
                if event.event_type == "text-generation":
                    print(event.text, end="")

                # Citations
                if event.event_type == "citation-generation":
                    if not flag:
                        print("\n\nCITATIONS:")
                        flag = True
                    print(event.citations)

Run App

We have now completed creating the three components: Documents, Chatbot, and App, which means that we are now ready to run our chatbot app!

Define Documents

First, we define the list of documents we want to ingest and make available for retrieval. As an example, we’ll use the contents from the first module of LLM University: What are Large Language Models? It has four chapters, whose web URLs we define in the sources list.

sources = [
    {
        "title": "Text Embeddings", 
        "url": "https://docs.cohere.com/docs/text-embeddings"},
    {
        "title": "Similarity Between Words and Sentences", 
        "url": "https://docs.cohere.com/docs/similarity-between-words-and-sentences"},
    {
        "title": "The Attention Mechanism", 
        "url": "https://docs.cohere.com/docs/the-attention-mechanism"},
    {
        "title": "Transformer Models", 
        "url": "https://docs.cohere.com/docs/transformer-models"}   
]

Process Documents

Next, we process these documents by creating an instance of Documents. In our case, we get a total of 136 documents, chunked from the four web URLs.

documents = Documents(sources)
Loading documents...
Embedding documents...
Indexing documents...
Indexing complete with 136 documents.

Run Chatbot

We can now run the chatbot app. For this, we create the instances of Chatbot and App. Finally, we run the chatbot by invoking the run method.

chatbot = Chatbot(documents)

app = App(chatbot)

app.run()

Here’s an example of a conversation that happens over seven turns:

User: hello
Chatbot:
Hi! How can I help you today?
----------------------------------------------------------------------------------------------------

User: what is the difference between word and sentence embeddings
Chatbot:
Retrieving information...
Word embeddings associate words with lists of numbers (vectors). Similar words are associated with numbers that are close by and dissimilar words with numbers that are far away. A sentence embedding does the same thing but associates a vector with every sentence. Similar sentences are assigned to similar vectors, while different sentences are assigned to different vectors.

CITATIONS:
[{'start': 0, 'end': 63, 'text': 'Word embeddings associate words with lists of numbers (vectors)', 'document_ids': ['doc_0']}]
[{'start': 65, 'end': 177, 'text': 'Similar words are associated with numbers that are close by and dissimilar words with numbers that are far away.', 'document_ids': ['doc_0']}]
[{'start': 178, 'end': 263, 'text': 'A sentence embedding does the same thing but associates a vector with every sentence.', 'document_ids': ['doc_0', 'doc_2']}]
[{'start': 264, 'end': 375, 'text': 'Similar sentences are assigned to similar vectors, while different sentences are assigned to different vectors.', 'document_ids': ['doc_2']}]

----------------------------------------------------------------------------------------------------

User: what kind of technology that makes all this possible
Chatbot:
Retrieving information...
Transformer models utilize a lot of data (entire internet, large datasets), post-training, large datasets of conversations and a bias towards the last things it has learned. It relies on the principle of teaching a machine to perform tasks, similar to teaching a person.

CITATIONS:
[{'start': 36, 'end': 40, 'text': 'data', 'document_ids': ['doc_2']}]
[{'start': 41, 'end': 57, 'text': '(entire internet', 'document_ids': ['doc_0', 'doc_2']}]
[{'start': 59, 'end': 73, 'text': 'large datasets', 'document_ids': ['doc_0', 'doc_1']}]
[{'start': 76, 'end': 89, 'text': 'post-training', 'document_ids': ['doc_0', 'doc_1']}]
[{'start': 91, 'end': 122, 'text': 'large datasets of conversations', 'document_ids': ['doc_1']}]
[{'start': 129, 'end': 157, 'text': 'bias towards the last things', 'document_ids': ['doc_0']}]
[{'start': 204, 'end': 222, 'text': 'teaching a machine', 'document_ids': ['doc_0']}]
[{'start': 234, 'end': 239, 'text': 'tasks', 'document_ids': ['doc_0', 'doc_1']}]
[{'start': 252, 'end': 270, 'text': 'teaching a person.', 'document_ids': ['doc_0']}]

----------------------------------------------------------------------------------------------------

User: How does the model work
Chatbot:
Retrieving information...
Transformer models utilize the attention step, which helps language models understand the context. It is a powerful multi-head attention technique that has helped language models reach much higher levels of efficacy. Consider the following two sentences:
Sentence 1: The bank of the river.
Sentence 2: Money in the bank.

CITATIONS:
[{'start': 31, 'end': 45, 'text': 'attention step', 'document_ids': ['doc_0']}]
[{'start': 90, 'end': 98, 'text': 'context.', 'document_ids': ['doc_0']}]
[{'start': 116, 'end': 136, 'text': 'multi-head attention', 'document_ids': ['doc_1']}]
[{'start': 255, 'end': 289, 'text': 'Sentence 1: The bank of the river.', 'document_ids': ['doc_0']}]
[{'start': 290, 'end': 320, 'text': 'Sentence 2: Money in the bank.', 'document_ids': ['doc_0']}]

----------------------------------------------------------------------------------------------------

User: continue
Chatbot:
These two sentences have completely different meanings. However, fascinatingly, by just looking at the last words, the model can tell whether the sentence is about a river or money. This is done by calculating attention weights by measuring the attraction of two given phrases. The model then decides whether the sentences have similar or different meanings based on these attention weights.
----------------------------------------------------------------------------------------------------

User: How accurate can this be
Chatbot:
Retrieving information...
I'm sorry, I did not find any information in my search, but i'll reply to you anyway. It really depends on the quality of the model. Cohere’s model is much better than word embeddings because it can capture many more features of the words.

CITATIONS:
[{'start': 199, 'end': 239, 'text': 'capture many more features of the words.', 'document_ids': ['doc_1']}]

----------------------------------------------------------------------------------------------------

User: what do you mean by the quality of the model
Chatbot:
Retrieving information...
I couldn't find any precise sources to support the reply I'm about to write when I searched the databases I have access to -- instead, I'm basing it on my general knowledge. The quality of the model refers to how good it is at performing the task it was designed to do. For example, if you want to build a model that can understand and generate language, you would want to use a language model with high quality.
----------------------------------------------------------------------------------------------------

User: quit
Ending chat.

In the conversation above, note a few behaviors that reflect the different components of what we built:

  • Direct response: For user messages that don’t require retrieval, such as hello, the chatbot responds directly.
  • Citation generation: For responses that do require retrieval, the endpoint returns the response together with the citations.
  • State management: The endpoint maintains the state of the conversation via the conversation_id parameter, for example, by being able to correctly respond to a vague user message of “continue.”
  • Response synthesis: The endpoint can decide if none of the retrieved documents provide the necessary information required to answer a user message. For example, when asked the question “what do you mean by the quality of the model”, it responds with “I couldn’t find any precise sources to support the reply I’m about to write …”

Conclusion

In this chapter, you learned how to build a RAG-powered chatbot with the Chat endpoint. With access to a collection of documents, the chatbot is able to provide contextually relevant responses to user requests, along with verifiable citations.

This chapter used the Chat endpoint in document mode. This mode highlights the modularity of the endpoint, giving developers the flexibility to customize each component of the system.

An alternative to this is the connector mode. It abstracts away some of the steps we saw in the document mode, which makes it simpler to build applications. It also makes it easy to connect to enterprise data sources and do that at scale. You’ll learn about connector mode in the next chapter.

Get started by creating a Cohere account now.


About Cohere’s LLM University

Our comprehensive NLP curriculum aims to equip you with the skills to develop your own AI applications. We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs). Plus, you’ll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own solutions. Take a course today. 

This LLMU course consists of the following chapters:

  1. Foundations of Chat and RAG
  2. Using the Chat endpoint 
  3. Using the Chat endpoint with RAG in document mode (this chapter)
  4. Using the Chat endpoint with RAG in connector mode (coming soon)
  5. Creating custom models (coming soon)


