Tutorial: Model-Based Evaluation of RAG Pipelines


  • Level: Beginner
  • Time to complete: 10 minutes
  • Components Used: InMemoryDocumentStore, InMemoryBM25Retriever, PromptBuilder, OpenAIGenerator, UpTrainEvaluator
  • Prerequisites: You must have an API key from an active OpenAI account as this tutorial is using the gpt-3.5-turbo model by OpenAI: https://platform.openai.com/api-keys
  • Goal: After completing this tutorial, you’ll have learned how to evaluate your RAG pipelines using some of the model-based evaluation frameworks integrated into Haystack.

This tutorial uses Haystack 2.0 Beta. To learn more, read the Haystack 2.0 Beta announcement or see Haystack 2.0 Documentation.

Overview

This tutorial shows you how to evaluate a generative question-answering pipeline that uses the retrieval-augmented generation (RAG) approach with Haystack 2.0. As we’re doing model-based evaluation, no ground-truth labels are required. The evaluation relies on Haystack’s integrations with model-based evaluation frameworks; in this tutorial, you’ll use the UpTrain integration.

For this tutorial, you’ll use the Wikipedia pages of Seven Wonders of the Ancient World as Documents, but you can replace them with any text you want.

Preparing the Colab Environment

Installing Haystack

Install Haystack 2.0 Beta, datasets, and the UpTrain integration for Haystack with pip:

%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install uptrain-haystack

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(35)

Create the RAG Pipeline to Evaluate

To evaluate a RAG pipeline, we need a RAG pipeline to start with. So, we will start by creating a question answering pipeline.

💡 For a complete tutorial on creating Retrieval-Augmented Generation pipelines, check out the Creating Your First QA Pipeline with Retrieval-Augmentation Tutorial.

First, we will initialize a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you’ll be using the InMemoryDocumentStore.

You’ll use the Wikipedia pages of Seven Wonders of the Ancient World as Documents. We preprocessed the data and uploaded it to a Hugging Face Space: Seven Wonders. Thus, you don’t need to perform any additional cleaning or splitting.

from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()


dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]
document_store.write_documents(docs)

InMemoryDocumentStore is the simplest DocumentStore to get started with. It requires no external dependencies and it’s a good option for smaller projects and debugging. But it doesn’t scale up so well to larger Document collections, so it’s not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see DocumentStore Integrations.
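
As an optional sanity check, you can confirm how many Documents were written to the store:

# Optional sanity check: the store should report one Document per dataset entry.
print(document_store.count_documents())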

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we’ll be using:

  • InMemoryBM25Retriever, which will fetch the documents relevant to the query.
  • OpenAIGenerator to generate answers to queries. You can replace OpenAIGenerator in your pipeline with another Generator. Check out the full list of generators here.
import os
from getpass import getpass
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)


os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
generator = OpenAIGenerator()

To build a pipeline, add all components to your pipeline and connect them. Create connections from retriever to the prompt_builder and from prompt_builder to llm. Explicitly connect the output of retriever with the “documents” input of the prompt_builder to make the connection obvious, as prompt_builder has two inputs (“documents” and “question”). For more information on pipelines and creating connections, refer to the Creating Pipelines documentation.

from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

That’s it! The pipeline’s ready to generate answers to questions!
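
If you’d like to double-check the connections, you can render the pipeline graph to an image. This step is optional and assumes the draw() method is available in your Haystack 2.0 Beta installation:

# Optional: render the pipeline graph to a file to inspect its connections.
# This assumes Pipeline.draw() is available in your Haystack 2.0 Beta installation.
rag_pipeline.draw("rag_pipeline.png")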

Asking a Question

When asking a question, use the run() method of the pipeline. Make sure to provide the question to both the retriever and the prompt_builder. This ensures that the {{question}} variable in the prompt template gets replaced with your specific question.

question = "When was the Rhodes Statue built?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)

Now that the RAG pipeline is ready, we can create an evaluation pipeline. The evaluation pipeline defines which metrics we want to compute, using one of the evaluation frameworks integrated into Haystack.

Evaluate The Pipeline with UpTrain

Now that we have a RAG pipeline, let’s look at how we can evaluate it. Here, we’re using the Haystack UpTrain integration. We will perform 2 evaluations:

  • Context Relevance, grading how relevant the retrieved context is to the question
  • Critique Tone, assessing the tone of the generated responses against a given persona

For a full list of available metrics and their expected inputs, check out our UpTrainEvaluator Docs.

1) Evaluate Context Relevance

from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(metric=UpTrainMetric.CONTEXT_RELEVANCE, api="openai")

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)

Next, we can create a helper function to evaluate the context relevance of a RAG pipeline over multiple questions. The context relevance metric expects two inputs that should be provided by the RAG pipeline we are evaluating:

  • questions
  • contexts
def evaluate_context_relevance(questions, evaluation_pipeline):
    contexts = []
    for question in questions:
        response = rag_pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        # Collect the retrieved documents that were used to answer each question
        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])

    evaluation_results = evaluation_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts}})
    return evaluation_results

questions = ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?", "When was the pyramid built?"]

evaluate_context_relevance(questions=questions, evaluation_pipeline=evaluator_pipeline)
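
The helper function returns the raw output of the evaluation pipeline. If you prefer a more readable view, you can capture the returned dictionary and pretty-print it. The exact keys and score format are defined by the UpTrainEvaluator integration:

import json

# Capture the evaluation output and pretty-print it for easier reading.
# The exact structure of the result is defined by the UpTrainEvaluator integration.
results = evaluate_context_relevance(questions=questions, evaluation_pipeline=evaluator_pipeline)
print(json.dumps(results, indent=2, default=str))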

2) Critique Tone

An evaluator that uses the CRITIQUE_TONE metric needs to be initialized with an llm_persona. This is the persona the generative model being assessed is expected to follow, for example “methodical teacher” or “helpful chatbot”; here, we simply use “informative”.

from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_TONE, api="openai", metric_params={"llm_persona": "informative"}
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)

Next, we can create a helper function to critique the tone of the responses of a RAG pipeline. This metric expects one input that should be provided by the RAG pipeline we are evaluating:

  • responses
def evaluate_critique_tone(questions, evaluation_pipeline):
    responses = []
    for question in questions:
        response = rag_pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        # Collect the generated answer for each question
        responses.append(response["answer_builder"]["answers"][0].data)

    evaluation_results = evaluation_pipeline.run({"evaluator": {"responses": responses}})
    return evaluation_results

questions = ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?", "When was the pyramid built?"]

evaluate_critique_tone(questions=questions, evaluation_pipeline=evaluator_pipeline)
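
These two metrics are only a starting point. You can run other UpTrain metrics with the same evaluator component by passing a different UpTrainMetric value (and any metric_params it requires). The metric name below is only an illustration; check the UpTrainEvaluator Docs for the exact names and expected inputs:

# Hypothetical example: swap in another metric by changing the UpTrainMetric value.
# Verify the exact metric name and its required inputs in the UpTrainEvaluator Docs.
factual_accuracy_evaluator = UpTrainEvaluator(metric=UpTrainMetric.FACTUAL_ACCURACY, api="openai")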

What’s next

🎉 Congratulations! You’ve learned how to evaluate a RAG pipeline with model-based evaluation frameworks, without any labeling effort.

If you liked this tutorial, you may also enjoy our other Haystack tutorials.

To stay up to date on the latest Haystack developments, you can sign up for our newsletter. Thanks for reading!