Optimizing Language Models with DSPy

Prompt engineering alone lacks the rigor that enterprise solutions demand. Optimize your language models with DSPy!

Ben Hayes · 12 min read

Introduction

Feeling let down by prompt-engineering? Tired of fickle prompts that worked yesterday but not today? Need to swap in a newer, more powerful model but afraid your prompts won’t transfer reliably? You aren’t alone - a common point of feedback from AI researchers, developers, and enterprise decision-makers is that generative AI solutions lack robustness and enterprise reliability - so far.

Whether you are building a retrieval augmented generation (RAG) solution or something more complex, building around generative AI models requires tedious effort from teams to ensure the outputs align with the needs of users and the business. Additionally, as single-model AI solutions give way to compound AI systems, having the right tooling can mean the difference between success and failure, between reliability and fragility.

What is DSPy?

In October 2023, researchers at Stanford University introduced DSPy - a framework that provides developers and researchers with the ability to systematize usage of language models. In other words, users can algorithmically control how language models interact with other parts of an AI system. Instead of relying on manual prompting in isolation for a given step of a process/chain, users leveraging DSPy can define modules and optimizers - more on these concepts later.

This is a new paradigm in which LMs and their prompts fade into the background as optimizable pieces of a larger system that can learn from data. tl;dr: less prompting, higher scores, and a more systematic approach to solving hard tasks with LMs.

GitHub: stanfordnlp/dspy

Let’s build up our intuition of what DSPy does so we can understand how it works so well. First, let’s break down the DSPy components available today.

What are the components of DSPy?

As mentioned above, DSPy is a framework - another tool in your AI toolkit - that enables you to build robust, compound AI systems. In this section, we'll cover the following components of DSPy and how to approach using them in your own projects:

Signatures

Signatures express what we expect the language model to do without meticulously detailing how it should be done. By using signatures, our code becomes more modular and gains built-in support for reproducibility.

Code:

import dspy

# Assumes a language model has already been configured, e.g. via dspy.settings.configure(lm=...).
sentence = "You are a nice person."

classify = dspy.Predict('sentence -> sentiment')
classify(sentence=sentence).sentiment

Output:

'Positive'

Modules

As we aim to modularize our AI system code, DSPy gives us modules. Modules act as layers in a pipeline that provide signature-based language model interactions. In the example below, we use the most basic module, dspy.Predict(), but others include dspy.ChainOfThought(), dspy.ProgramOfThought(), dspy.ReAct(), and dspy.MultiChainComparison().

Code:

sentence = "You are a nice person."

# 1) Declare with a signature.
classify = dspy.Predict('sentence -> sentiment')

# 2) Call with input argument(s).
response = classify(sentence=sentence)

# 3) Access the output.
print(response.sentiment)

Output:

'Positive'
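The other modules follow the same calling pattern. As a minimal sketch (assuming the same configured language model), swapping in dspy.ChainOfThought() makes the model produce an intermediate rationale before filling in the output field:

Code:

classify = dspy.ChainOfThought('sentence -> sentiment')
response = classify(sentence="You are a nice person.")

print(response.rationale)   # the model's intermediate reasoning
print(response.sentiment)   # 'Positive'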

Data

While using DSPy, we need to define the data we are working with. By defining inputs, intermediate labels (optional), and final labels (optional), we can programmatically interact with different examples.

DSPy relies on the Example object, which contains key-value pairs. Additionally, fields can be marked as inputs using example.with_inputs(), as shown after the snippet below.

Code:

import dspy

qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Output:

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.
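Building on the output above, here is a brief sketch of with_inputs(). It returns a copy of the example with the given keys marked as inputs; the remaining keys are treated as labels, accessible via the Example API's inputs() and labels() methods:

Code:

# Mark 'question' as the input; 'answer' becomes a label.
qa_pair = qa_pair.with_inputs("question")

print(qa_pair.inputs())   # only the 'question' field
print(qa_pair.labels())   # only the 'answer' field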

Metrics

We move on to metrics so we begin to think with the outcome in mind - what are we looking to achieve, and what is important to us? Metrics in DSPy allow us to define functions that score the results of a model task. These are powerful because they are flexible, programmatic, and traceable (tracing is outside the scope of this post, but you can learn more here).

In the example below, we define a metric as a function and evaluate both the answer as well as the context - giving us a multi-perspective view into quality and, over time, consistency.

def validate_context_and_answer(example, pred, trace=None):
    # check the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()

    # check the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c) for c in pred.context)

    if trace is None: # if we're doing evaluation or optimization
        return (answer_match + context_match) / 2.0
    else: # if we're doing bootstrapping, i.e. self-generating good demonstrations of each step
        return answer_match and context_match
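As a minimal sketch of how such a metric is invoked (the example and prediction below are hypothetical stand-ins for real pipeline outputs):

Code:

import dspy

example = dspy.Example(question="What is the capital of France?", answer="Paris")
pred = dspy.Prediction(context=["paris is the capital of france."], answer="Paris")

# Both the answer and the context match, so the evaluation score is 1.0.
print(validate_context_and_answer(example, pred))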

Optimizers

Within DSPy, an Optimizer is an algorithm that tunes the prompts and/or the language model weights in your program. This tuning is performed with respect to the metric we defined. According to the DSPy docs (Optimizers), an Optimizer requires three inputs: a DSPy program (one or more modules), a metric, and a few training inputs.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Set up the optimizer: we want to "bootstrap" (i.e., self-generate) 8-shot examples of your program's steps.
# The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10, num_threads=4)

optimizer = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, **config)
optimized_program = optimizer.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE)
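Once compiled, the optimized program can be saved and reloaded so the tuning work isn't repeated on every run. A short sketch using DSPy's save/load methods (the file name and program class are illustrative placeholders):

Code:

# Persist the optimized prompts and demonstrations to disk...
optimized_program.save("optimized_program.json")

# ...and load them back into a fresh program instance later.
loaded_program = YOUR_PROGRAM_CLASS_HERE()
loaded_program.load("optimized_program.json")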

Now that we have examined each of these components within DSPy individually, we can get a high-level overview of all of them. Refer to the table below for a summary of each component.

Component Description
Signature: Expected behavior of a module, including its inputs and outputs.
Module: Conceptual and functional building block of a DSPy program (a program that uses language models). Modules can be chained together to form multi-module programs and are required as input to an Optimizer.
Data: Training inputs, often supplied to an Optimizer. Defined in a similar fashion to Python dictionaries but with extra utilities.
Metrics: Functions defined to evaluate the quality of a language model's performance on a task.
Optimizers: Components that tune the prompts or weights used in language model tasks. Take modules, a metric, and training inputs.

Now that we understand how each component is used, let’s walk through an example of how to use DSPy. In the next section, we’ll explore an end-to-end example of using DSPy to optimize a large language model.


DSPy: Learn through examples

In this section, we will build a multi-hop program that extends the retrieval augmented generation (RAG) approach. Oftentimes, simple RAG solutions are insufficient for real-world use cases: user queries can be complex and require chained reasoning over multiple retrieved passages to answer.

Installation

Fortunately, installing DSPy is incredibly simple. We can run the following command to install DSPy and the required dependencies. Note that other alternatives are shown in the code comments (e.g., how to install for Pinecone). You can read more about the installation here in the DSPy docs.

pip install dspy-ai
# pip install "dspy-ai[pinecone]"
# pip install "dspy-ai[chromadb]"
# pip install "dspy-ai[mongodb]"

An Example: Building a retrieval and evaluation system

We will leverage the example in the DSPy docs - the simplified Baleen example. You can read and follow along with the tutorial below or refer to the original documentation here. Note that the documentation tutorial has been modified/updated slightly to reflect DSPy changes, new models, etc.

Add models

import dspy

turbo = dspy.OpenAI(model='gpt-3.5-turbo', api_key=<REDACTED>)
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

Here we define our language model, GPT-3.5 Turbo, and set up a retriever of Wikipedia data (specifically the first paragraphs of articles). We then use dspy.settings to configure the language model and the retriever.
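Before composing the full pipeline, it can help to sanity-check the retriever on its own. A quick sketch with a hypothetical query:

# Retrieve the top 3 Wikipedia abstracts for a test query.
retrieve = dspy.Retrieve(k=3)
passages = retrieve("When did India gain independence?").passages

print(passages[0][:200])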

Define the model signature

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

Here we define the signatures for our two sub-tasks. Note that context and question are input fields, while answer and query are output fields.

Compose a module

from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

With our signatures defined in the previous step, we are ready to leverage modules. We'll generate multiple queries using ChainOfThought(), retrieve context, and then generate an answer. If you are curious about the forward() method, check out this part of the DSPy deep dive.
# Ask any question you like to this simple RAG program.
my_question = "In what year did India gain independence from British rule?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

To test the benefits of optimization and compilation in DSPy, we first run the program uncompiled. While the answer is fine, we still lack certainty that this solution is robust or will work across contexts.
Question: In what year did India gain independence from British rule?
Predicted Answer: 1947
Retrieved Contexts (truncated): ['Speeches about Indian independence | India won independence from Britain by the Indian Independence Act 1947, ending the British Raj. This act, passed in June and receiving royal assent in July, took ...', 'Dominion of India | Between gaining independence from the United Kingdom on 15 August 1947 and the proclamation of a republic on 26 January 1950, India was an independent dominion in the British Commo...', "Independence Day (India) | Independence Day is annually celebrated on 15 August, as a national holiday in India commemorating the nation's independence from the United Kingdom on 1 August 1947, the UK...", '1947 PGA Championship | The 1947 PGA Championship was the 29th PGA Championship, held June 18–24 at Plum Hollow Country Club in Southfield, Michigan, a suburb northwest of Detroit. Jim Ferrier won the...', "1947 Masters Tournament | The 1947 Masters Tournament was the 11th Masters Tournament, held April 3–6 at Augusta National Golf Club in Augusta, Georgia. The purse was $10,000 with a winner's share of ...", 'Time Out of Mind (1947 film) | Time Out of Mind is a 1947 American drama film directed by Robert Siodmak and starring Phyllis Calvert, Robert Hutton and Ella Raines. The film was made by Universal Pic...']
We print the uncompiled program's prediction and contexts. But can we improve this process?
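One way to see what DSPy actually sent to the model - and what compilation will later change - is to inspect the most recent call on the configured LM:

# Show the most recent prompt/completion pair sent to GPT-3.5 Turbo.
turbo.inspect_history(n=1)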

Optimization and Evaluation

def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred): return False
    if not dspy.evaluate.answer_passage_match(example, pred): return False

    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True

We now optimize our pipeline by leveraging DSPy's compilation capability. The exact validation logic is explained in this tutorial, but in short: we want answers to match our labels, the retrieved context to contain the answer, and the generated queries to be concise and non-repetitive.
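Note that the optimizer and evaluation steps below assume a trainset and a devset. The original tutorial builds them from the HotPotQA dataset, roughly as follows:

from dspy.datasets import HotPotQA

# Load a small sample of HotPotQA: 20 training and 50 dev examples.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that 'question' is the input field; the rest are labels.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]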

Build the optimizer

from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
compiled_baleen = optimizer.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)

Here we leverage the BootstrapFewShot optimizer and provide the metric defined in the previous step. The optimizer offers a compile() method.
100%|██████████| 20/20 [01:33<00:00,  4.70s/it]
Bootstrapped 2 full traces after 20 examples in round 0.

Success! You can learn more about the available optimizers and their parameters in the docs here.
from dspy.evaluate.evaluate import Evaluate

# Define metric to check if we retrieved the correct documents
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example["gold_titles"]))
    found_titles = set(
        map(dspy.evaluate.normalize_text, [c.split(" | ")[0] for c in pred.context])
    )
    return gold_titles.issubset(found_titles)

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

uncompiled_baleen_retrieval_score = evaluate_on_hotpotqa(uncompiled_baleen, metric=gold_passages_retrieved)
compiled_baleen_retrieval_score = evaluate_on_hotpotqa(compiled_baleen, metric=gold_passages_retrieved)

print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")

Now that we've set up our optimizer and compiled the program, we can evaluate the retrieval performance of the compiled and uncompiled AI systems. We want to know if the correct documents were retrieved.
# Higher is better
Retrieval Score for uncompiled Baleen: 56.0
Retrieval Score for compiled Baleen: 66.0
Here a higher score is better, and we can see a significant improvement in performance from leveraging compilation in DSPy.

Let’s take a look at a side-by-side view of the retrieval results:

Poorer retrieval results when not compiling (Displaying 5/50 results)

Better retrieval results when compiling (Displaying 5/50 results)

Please feel free to reach out if you would like access to the notebook (.ipynb) on Google Colab.


Conclusion

Let's take a moment to reflect on what we have accomplished above. We've successfully:

  • recognized that working with fragmented AI models and tools necessitates compound AI systems
  • built our intuition for why programmatic and algorithmic optimization is needed, especially for large-scale, enterprise AI systems
  • reviewed DSPy and the DSPy components
  • built a multi-hop retrieval system with tooling that helps optimize and evaluate performance
  • demonstrated how compiling multi-step AI programs can improve quality and performance

The benefits of this approach are that we can reduce time spent with fragile prompt-tinkering, reduce cost by leveraging smaller, pre-trained models, and more easily maintain this solution as models come and go. Given the nature of generative AI, this space is evolving rapidly so expect change. If you are interested in this subject, please feel free to reach out or refer to the resources below.

Additional Resources