If you have experienced first-hand or have stood by and witnessed the frustration experienced by coworkers as they grapple with prompts at the mercy of a language model, then you’ll want to read about DSPy. The main philosophy behind the DSPy framework is to be able to automatically generate prompt instructions, with minimum natural language intervention by the programmer.
Manual prompt engineering is frustrating because it’s so unwieldy, especially in the context of a pipeline consisting of a sequence of calls to a language model. There are often too many variables, like the nature of the prompts, the choice of the model, and the parameters of the model. All of these can result in a fragile pipeline with an increased likelihood of “breaking” it at different points as an unintended consequence of tweaking one of the variables. The cost of crafting prompts is also high, especially because changes to the prompts or model parameters create ripple effects, often requiring you to start from scratch after each failed attempt.
DSPy aims to circumvent some of these problems by providing the ability to do the following:
Generate simple prompts automatically
Validate the desired outcomes programmatically, as well as by using the underlying language model
Run optimizations that automatically refine prompt instructions
In this blog, we give a brief overview of the framework, followed by a description of the use case we applied it to. We then give details of what goes into a basic DSPy program and discuss what we learned. The framework itself is easy to use. The code snippets here are inspired largely by the examples shared in the official documentation.
Let’s get started!
The main idea is to build programs consisting of modules that make calls to language models under the hood.
Each module is passed a signature as an input. Informally, a signature is just information that’s used by the module for generating simple prompt instructions as output.
A prompt instruction is simply an instruction to the language model on how to respond to the prompts entered by the end user of the system being built.
The instructions that are automatically generated by a module can then be used directly to augment the prompt entered by the end user at execution time.
More interestingly, a prompt instruction can undergo a series of improvements especially when used with one of the optimizers that are part of the DSPy framework.
We toyed with DSPy and tested it for a small but real scenario relevant to our workflow. Here is the description of the use case (the goal and requirements):
Goal: We’d like to create meta descriptions for text-based courses. The meta descriptions are to be written with the intent of improving the SEO scores of the courses so that it impacts how these courses show up in search engine results.
A meta description is just a textual description with these constraints:
It should say what the course is about.
The meta description of a course should be based on a full and longer description of the course that was written manually at the time when that course was created.
It should be at most 150 characters.
It should include an SEO keyword that’s manually curated for the course. An SEO keyword is usually a short phrase that helps improve search engine ranking of the content.
The code given here was tested using the gpt-3.5-turbo-instruct model (with default configurations). We use the following code to load the language model (LM).
gpt_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=255, api_key="INSERT_API_KEY")
dspy.settings.configure(lm=gpt_turbo)
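As a quick check that the LM is wired up correctly, the configured model object can be called directly. This is a minimal sketch under the setup above; the prompt text is just an illustration:

# The LM object is callable; it returns a list of completion strings
print(gpt_turbo("Say OK if you can read this."))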
Let’s look into the basic ingredients needed to build a DSPy program.
The first thing to understand about the signature is that it simply communicates the intent of the prompts for a use case.
There are two ways to specify the signature:
Inline, as a short string
As a class that inherits from dspy.Signature; this way offers greater flexibility for including more information
The following string is an example of an inline signature. The text on the left side of the arrow indicates that a question is expected as input and its answer is expected as the output.
"question -> answer"
When this signature is used by a module to predict an answer, prompt instructions are automatically generated that essentially ask for an answer to the question being asked.
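If you want to see these instructions for yourself, a minimal sketch (assuming the LM configured above) is to run a one-off prediction and inspect the last LM call. With the DSPy version we used, the generated instructions are on the order of "Given the fields `question`, produce the fields `answer`.", followed by a simple field-by-field format template:

qa = dspy.Predict("question -> answer")
result = qa(question="What does an operating system do?")
print(result.answer)

# Show the full prompt (including the generated instructions) that was sent to the LM
gpt_turbo.inspect_history(n=1)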
As we can see, the generated instructions are, by design, quite simple.
The signature just needs to say what is required. If we want the AI to summarize some input text, we could use a signature like "text -> summary", or express it in some other clear way, like "document -> summary".
Suppose we want an answer to a question based on some context, then this would work:
"question, context -> answer"
In general, we can express multiple inputs or outputs using comma-separated lists.
These inline signatures are suitable for simple situations, but if we want to include even a little more information we’ll need to rely on class-based definitions of signatures.
For our requirements above, a signature like the following is inadequate because even though it says what the inputs (the course description and the SEO keyword) are and what the output is, there’s no way to specify our constraints (like the length of the meta description).
"content, keyword -> metadescription"
So we resort to defining it as a full-blown class:
class MySignature(dspy.Signature):
    """Generate a meta description that explains what the course covers.
    It must include the keyword and its length must be 150 characters or less."""
    content = dspy.InputField(desc="the course content")
    keyword = dspy.InputField(desc="the keyword")
    metadescription = dspy.OutputField(desc="contains the keyword and at most 150 characters")
Explanation:
The code defining the signature above may look unusual, but this is what it does:
The docstring at the top is included verbatim in the generated prompt instructions.
The names of the fields (content, keyword, metadescription) also become an actual part of the prompt. So, keeping meaningful field names is essential.
The input and output fields need to be marked clearly using dspy.InputField() and dspy.OutputField(), as shown. The descriptions of these fields can be passed as arguments using the desc named parameter, although the descriptions are optional. These descriptions become part of the prompt instructions, as we'll see shortly.
And that’s it—defining a class-based signature just requires specifying inputs, outputs, and concise descriptions of what’s expected.
Once we have written the signature, we need to declare a module that uses it for generating prompt instructions.
A module is just a class that inherits from dspy.Module, directly or indirectly. To declare a module instance, a signature must be passed to the module constructor. The instantiated module can then be used for making LM calls to generate outputs against the inputs passed to it.
There are currently five types of built-in modules that embody different prompting techniques. Understanding how to work with two of these is sufficient for getting started and may very well be all you need. The most basic of these is dspy.Predict. Others inherit from it or build on top of it.
In Python, it's possible to make calls to an object of a class in a way that's similar to how we call functions. An instance of the Predict module is callable in this sense; when it's called with an input, it returns the response containing the output. The returned object is an instance of a class called Prediction.
Here, an instance of the Predict module is declared and then called with our inputs; the call returns a Prediction object:
class MySignature(dspy.Signature):
    """Generate a meta description that explains what the course covers.
    It must include the keyword and its length must be 150 characters or less."""
    content = dspy.InputField(desc="the course content")
    keyword = dspy.InputField(desc="the keyword")
    metadescription = dspy.OutputField(desc="contains the keyword and at most 150 characters")

# Declare an instance of the Predict module
predictor = dspy.Predict(MySignature)

# Call it on some content and SEO keyword to predict the meta description
response = predictor(
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    keyword="operating system course"
)

# Print the machine-written meta description
print(f"Predicted meta description: {response.metadescription}")
Explanation:
We declare an instance of the Predict module, passing it the MySignature signature.
We call predictor by passing it the inputs, as directed by MySignature.
We expect the metadescription field of the response object to contain the predicted output, because we specified it as an output field in MySignature.
Here's the full prompt that's generated when the module is called:
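The exact text depends on the DSPy version, but the generated prompt follows roughly this template: the signature's docstring, a format section built from the field names and their descriptions, and then the actual inputs (a sketch, not the verbatim output):

Generate a meta description that explains what the course covers.
It must include the keyword and its length must be 150 characters or less.

---

Follow the following format.

Content: the course content
Keyword: the keyword
Metadescription: contains the keyword and at most 150 characters

---

Content: When it comes to operating systems, there are three main concepts: ...
Keyword: operating system course
Metadescription: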
Here’s the output of our program:
Predicted meta description: Master the fundamentals of operating systems with our extensive course on virtualization, concurrency, and persistence. Perfect for beginners.
Notice how the predicted description does not include the SEO keyword (“operating system course”). It’s also 153 characters.
In such a case, we must suppress the urge to fix things by resorting to old-fashioned prompt engineering and see if this can be fixed through automation. In our case, we tried dspy.ChainOfThought next.
The dspy.ChainOfThought module asks the LM to lay out a rationale (reasoning steps) before producing the desired output. We can instantiate it by passing a signature and the number of completions we want it to generate:

# Generate three completions, each with its own round of reasoning
predictor = dspy.ChainOfThought(MySignature, n=3)
The signature passed to ChainOfThought is modified under the hood to include an additional output field, rationale, so that the rationale for each of the completion rounds is generated before the desired output (the meta description).
The object returned can then be examined to see what transpired at the level of the LM.
# Declare an instance of the ChainOfThought module
predictor = dspy.ChainOfThought(MySignature, n=3)

# Call it on an input.
response = predictor(
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    keyword="operating system course"
)

# Look at the rationale and meta descriptions
print(response.completions)
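One way to turn the completions into a table like the one below is to read the individual output fields off response.completions; attribute access returns one list per field, with one entry per completion. A minimal sketch:

# Each output field on `response.completions` is a list, one entry per completion
rationales = response.completions.rationale
metas = response.completions.metadescription

for rationale, meta in zip(rationales, metas):
    print(len(meta), "|", rationale, "|", meta)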
Here’s the output presented as a table to see the rationale for each completion step and the corresponding meta description. It’s interesting to note that the length meets the 150 character limit in the last completion round. However, the keyword is still not included.
| Rationale | Meta Description | Meta Description Length |
| --- | --- | --- |
| 'produce a concise and informative meta description. We will cover virtualization, concurrency, and persistence in this comprehensive operating system course.' | 'Master the fundamental concepts of operating systems with our comprehensive course on virtualization, concurrency, and persistence. Perfect for beginners.' | 154 |
| 'introduce our comprehensive operating system course. We will cover virtualization, concurrency, and persistence.' | 'Master the foundations of operating systems with our comprehensive course. Learn about virtualization, concurrency, and persistence in just a few weeks!' | 152 |
| 'produce a metadescription. We will cover the basics of virtualization, concurrency, and persistence in this extensive operating system course. Master everything there is to know in just one course!' | 'Master everything there is to know about operating systems in this extensive course covering virtualization, concurrency, and persistence. Enroll now!' | 150 |
It's a little difficult to make sense of the rationale. We can inspect the generated prompt instructions using the command gpt_turbo.inspect_history(n=1).
See how the (AI generated) rationale in the first row of the table above gets plugged into the prompt instruction, immediately after the prefix "Reasoning: Let's think step by step in order to".
To better understand how the AI generated rationale gets plugged in, it’s worth noting that by default, the line of code
predictor = dspy.ChainOfThought(MySignature, n=3)
is equivalent to the following code, where an output field is created as a standalone object and then passed to the ChainOfThought constructor using an optional parameter:
rationale_type = dspy.OutputField(
prefix="Reasoning: Let's think step by step in order to",
desc="${produce the metadescription}. We ..."
)
predictor = dspy.ChainOfThought(MySignature, n=3, rationale_type=rationale_type)
For what it's worth, minor adjustments can be made to the text in the rationale_type before it's passed explicitly to the ChainOfThought constructor. For example, simply changing ${produce the metadescription} to ${produce the metadescription in 150 characters} would have an impact on the generated output.
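For illustration, that tweak would look like this:

rationale_type = dspy.OutputField(
    prefix="Reasoning: Let's think step by step in order to",
    desc="${produce the metadescription in 150 characters}. We ..."
)
predictor = dspy.ChainOfThought(MySignature, n=3, rationale_type=rationale_type)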
However, we did not engineer the prompts in this manner, since it's not in line with the DSPy philosophy. Instead, we considered using built-in DSPy optimizers as a next step to see if this would lead to improvements in meta descriptions on a set of examples.
Though there are other modules of interest that we did not use, they're still worth exploring for other use cases (like the ProgramOfThought module, which builds on top of ChainOfThought and is used for generating and executing Python code).
A DSPy optimizer uses a few examples for training purposes. So let's first see how to prepare training data before we think about optimizing.
The syntax for creating examples is straightforward. We create training examples by including hand-written meta descriptions against each provided content and keyword pair:
e = dspy.Example(
    keyword="operating system course",
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    metadescription="""This operating system course helps developers learn three main concepts:
OS virtualization, concurrency and persistence with hands-on practice."""
)
Once an example (e) is created, its input fields (content and keyword) must be marked explicitly, as shown below. Any remaining field (metadescription in our case) is treated as a training label.
e = e.with_inputs("content", "keyword")
A handful of labeled examples should suffice. We used nine examples (not listed here) that in our opinion are representative of the quality we seek.
If there’s a non-trivial number of examples, we can also read from a file and create them programmatically.
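For example, here is a minimal sketch, assuming a hypothetical JSON file courses.json in which each record has content, keyword, and metadescription fields:

import json

import dspy

# Load the records and turn each one into a labeled dspy.Example,
# marking content and keyword as the input fields
with open("courses.json") as f:
    records = json.load(f)

examples = [
    dspy.Example(**record).with_inputs("content", "keyword")
    for record in records
]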
We also create new examples as our test data, but we don't label them. So only the content and keyword fields are included and marked as input fields; the metadescription field is not included.
A DSPy program can consist of one or more modules. Such a program can be fine-tuned using a DSPy optimizer. An optimizer improves the prompt instructions by placing calls to an LM behind the scenes. It can also automatically generate more examples, called bootstrapped demos, to be included in the prompt instructions. Note, however, that some optimizers go beyond the prompt altogether and fine-tune the weights of the underlying LM itself.
It’s instructive to learn about some of these optimizers. Other optimizers are built around similar ideas.
LabeledFewShot picks a handful of examples from the training data and uses them without modification.
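For instance, here is a sketch of how it could be applied to our predictor, assuming examples is the list of labeled training examples from above (the k value is illustrative):

from dspy.teleprompt import LabeledFewShot

# Include up to k labeled examples from the training set as demos in the prompt
optimizer = LabeledFewShot(k=9)
optimized_program = optimizer.compile(student=predictor, trainset=examples)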
The BootstrapFewShot optimizer is passed two DSPy programs, called a student and a teacher, as well as a metric function.
The teacher program uses examples from the training set (up to max_labeled_demos of them) and generates additional bootstrapped examples. These examples are validated using a metric function (more on this later). Once validated, they are used as part of the prompt to make a prediction. The sequence of operations in generating a bootstrapped example is called a trace. There can be multiple traces. Here's how we can use a BootstrapFewShot optimizer to "compile" a predictor:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=4
)

# Pass the program (in our case, predictor, the module instance) and the labeled examples as arguments
optimized_program = optimizer.compile(student=predictor, trainset=examples)
Nomenclature: Previously, these optimizers were referred to as "teleprompters." That explains the use of dspy.teleprompt in the import statement. The function name compile just refers to the optimizations or refinements that the optimizer performs.
BootstrapFewShotWithRandomSearch generates multiple candidate programs, then picks the one that works best on a validation set. These candidate programs include:
A program compiled with LabeledFewShot
Programs compiled with BootstrapFewShot, with and without random shuffling of training examples
If you are not satisfied with your experimentation with BootstrapFewShot and are unsure about what to try next, BootstrapFewShotWithRandomSearch will try out many variations, saving you from a lot of headache. This is, of course, at the expense of additional costs incurred by the underlying LM calls.
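A sketch of what that could look like for our use case follows; the parameter values are illustrative, and each candidate program triggers its own round of LM calls:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=4,
    num_candidate_programs=8
)
optimized_program = optimizer.compile(student=predictor, trainset=examples)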
A metric is a function that returns a score (numeric or boolean) that represents the degree to which the predicted output conforms to our requirements.
We can write our own custom metric function to give a verdict on whether the generated output is good enough.
We use a simple metric function that, when called by an optimizer, returns True if all three of our requirements have been met. In particular:
The length and keyword requirements are checked programmatically.
The third requirement (that the meta description says what the course covers) is checked with an LM call made through a Predict module with signature AssessQuality.
So let's look at the signature AssessQuality first before looking at the code for the metric.
# Define a signature so we can use it to assess quality inside the metric function below
class AssessQuality(dspy.Signature):
    """Assess the quality of the metadescription according to the specified criteria."""
    metadescription = dspy.InputField()
    criteria_query = dspy.InputField()
    answer = dspy.OutputField(desc="yes or no")
We'll see how this signature is used inside the metric function below.
Note that the signature of a metric function is expected to specify three parameters: one for passing a labeled example, one for passing a prediction, and a parameter trace, which indicates the trace being run by an optimizer when it uses the metric function.
# Define a metric to validate quality
def my_metric(example, prediction, trace=None):
    # Retrieve the inputs from the example and the output from the prediction object
    content = example.content
    keyword = example.keyword
    metadescription = prediction.metadescription

    # Does the metadescription have a valid length?
    is_length_valid = (len(metadescription) <= 150)

    # Is the keyword present verbatim in the predicted metadescription?
    is_keyword_present = keyword.lower() in metadescription.lower()

    # We state the third requirement as a question with a yes/no answer
    coverage_query = f"Does the metadescription `{metadescription}` concisely express what a course covers if that course's description is `{content}`?"

    # We use the Predict module to get a yes/no answer from AI to the textual query
    coverage_response = dspy.Predict(AssessQuality)(metadescription=metadescription, criteria_query=coverage_query)

    # Does the description say what the course covers?
    is_coverage_adequate = (coverage_response.answer.lower() == 'yes')

    # The score equals the number of requirements that were met
    score = is_length_valid + is_keyword_present + is_coverage_adequate

    # When the metric function is called inside an optimizer, trace is not None.
    # In that case, return True only if the score is perfect; else return False.
    if trace is not None:
        return (score == 3)

    # When the metric function is called for evaluation purposes, we'd like to get
    # a more nuanced score to help us understand how badly we failed.
    return score
First, the inputs in example (a training example) are extracted, and the predicted meta description for this example is extracted from the prediction object.
Next, the boolean variables is_length_valid and is_keyword_present are set to True if the length of the meta description is at most 150 characters and the SEO keyword is contained in the meta description, respectively.
Our third requirement (the meta description should say what the course is about) is qualitative and cannot be checked programmatically. So it's articulated as a query and passed to the callable Predict module, which was instantiated using the AssessQuality signature above.
Note: A metric function can be used in multiple ways.
- It may be used in the training (optimizing) phase by being passed to an optimizer that runs multiple traces on different training examples. In such a case, the trace parameter is set to something other than None.
- It can also be used directly for evaluation purposes, where the predictor is evaluated on different testing examples.
Finally, instead of writing two separate metric functions, we include a conditional statement (if trace is not None) to check if the function is called internally from within an optimizer. In that case, it returns True or False to indicate pass or fail, which helps the optimizer fine-tune the training process; otherwise, it returns the numeric score so we can see how many of the requirements were met.
We'll see how to apply the metric function next.
We used a BootstrapFewShot optimizer for our use case. Notice how we pass the my_metric function as an argument to it. We specify the use of at most 9 training examples, with at most 3 bootstrapped (generated) examples:
from dspy.teleprompt import LabeledFewShot
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate  # To evaluate on the test set

optimizer = BootstrapFewShot(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=3
)

optimized_predictor = optimizer.compile(student=dspy.ChainOfThought(MySignature), trainset=examples)

evaluator = Evaluate(devset=test_data, metric=my_metric, display_progress=True, display_table=7)
eval_score = evaluator(optimized_predictor)
The my_metric function is also passed to the built-in Evaluate class, along with the test data. The examples in the test data contain two fields: keyword and content.
The evaluator is then applied to our optimized predictor on these examples, scoring each one with my_metric. Of the seven test examples, only two meet all our needs (with a score of 3); all the others meet only two of the three requirements. That's far from perfect, although to be fair, we should have tested with more examples.
Surprisingly, if we use the ChainOfThought module directly, without any optimizations, the results are better: four of the examples meet all the requirements, and three of them meet only two.
We tried other variations. This is what we observed for our small test data, but beware that this won't always be true in general:
ChainOfThought worked better than Predict.
When Predict is used with an optimizer, its performance does tend to improve.
The best results came from ChainOfThought with the constructor parameter n set to 1 (one round of completion) and no optimizer.
The results with BootstrapFewShot (after trying out multiple changes to the parameters) were comparable to the results with LabeledFewShot.
We could not experiment fully with BootstrapFewShotWithRandomSearch, since we used the gpt-3.5-turbo-instruct model and our token usage was limited to 90,000 tokens per minute. With the BootstrapFewShotWithRandomSearch optimizer, we kept exceeding this token limit.
The three requirements we began with—conciseness, keyword inclusion, and content coverage—are competing requirements in the following sense: it's hard to satisfy all three at once, even for human writers.
We assigned the task of writing meta descriptions for a hundred courses to different human writers, and the results were variable. Similar to the problems seen in the AI-generated descriptions, some writers omitted keywords, or settled for descriptions that did not fully capture the courses' content. Moreover, some of the writers' text was simply not as well written.
For a larger project, such as generating meta descriptions for 1,000+ courses, the time investment for human writers would be significantly greater than with DSPy (assuming the engineer is already familiar with its basic use). So for this use case, it's more efficient to use DSPy to generate the bulk of the descriptions and manually fix any problem cases afterward. In general, for other use cases, one needs to be aware that working with DSPy requires careful reflection over each decision made. The choice of modules, optimizers, parameters, and improvements to the metric function can all help in fine-tuning the application in incremental steps.
Want to learn more about working with generative models? Explore these courses on Educative to polish your skills!