If you have experienced first-hand or have stood by and witnessed the frustration experienced by coworkers as they grapple with prompts at the mercy of a language model, then you’ll want to read about DSPy. The main philosophy behind the DSPy framework is to be able to automatically generate prompt instructions, with minimum natural language intervention by the programmer.
Manual prompt engineering is frustrating because it’s so unwieldy, especially in the context of a pipeline consisting of a sequence of calls to a language model. There are often too many variables, like the nature of the prompts, the choice of the model, and the parameters of the model. All of these can result in a fragile pipeline with an increased likelihood of “breaking” it at different points as an unintended consequence of tweaking one of the variables. The cost of crafting prompts is also high, especially because changes to the prompts or model parameters create ripple effects, often requiring you to start from scratch after each failed attempt.
DSPy aims to circumvent some of these problems by providing the ability to do the following:
Generate simple prompts automatically
Validate the desired outcomes programmatically, as well as by using the underlying language model
Run optimizations that automatically refine prompt instructions
In this blog, we give a brief overview of the framework, followed by a description of the use case we applied it to. We then give details of what goes into a basic DSPy program and discuss what we learned. The framework itself is easy to use. The code snippets here are inspired largely by the examples shared in the official documentation.
Let’s get started!
The main idea is to build programs consisting of modules that make calls to language models under the hood.
Each module is passed a signature as an input. Informally, a signature is just information that’s used by the module for generating simple prompt instructions as output.
A prompt instruction is simply an instruction to the language model on how to respond to the prompts entered by the end user of the system being built.
The instructions that are automatically generated by a module can then be used directly to augment the prompt entered by the end user at execution time.
More interestingly, a prompt instruction can undergo a series of improvements especially when used with one of the optimizers that are part of the DSPy framework.
We toyed with DSPy and tested it for a small but real scenario relevant to our workflow. Here is the description of the use case (the goal and requirements):
Goal: We’d like to create meta descriptions for text-based courses. The meta descriptions are to be written with the intent of improving the SEO scores of the courses so that it impacts how these courses show up in search engine results.
A meta description is just a textual description with these constraints:
It should say what the course is about.
The meta description of a course should be based on a full and longer description of the course that was written manually at the time when that course was created.
It should be at most 150 characters.
It should include an SEO keyword that’s manually curated for the course. An SEO keyword is usually a short phrase that helps improve search engine ranking of the content.
The code given here was tested using the gpt-3.5-turbo-instruct model (with default configurations). We use the following code to load the language model (LM).
gpt_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=255, api_key="INSERT_API_KEY")
dspy.settings.configure(lm=gpt_turbo)
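As a quick check that the LM is wired up correctly, the configured model object can be called directly. This is a minimal sketch under the setup above; the prompt text is just an illustration:

# The LM object is callable; it returns a list of completion strings
print(gpt_turbo("Say OK if you can read this."))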
Let’s look into the basic ingredients needed to build a DSPy program.
The first thing to understand about the signature is that it simply communicates the intent of the prompts for a use case.
There are two ways to specify the signature:
Inline, as a short string
As a class that inherits from dspy.Signature; this way offers greater flexibility for including more information
The following string is an example of an inline signature. The text on the left side of the arrow indicates that a question is expected as input and its answer is expected as the output.
"question -> answer"
When this signature is used by a module to predict an answer, prompt instructions are automatically generated that essentially ask for an answer to the question being asked.
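If you want to see these instructions for yourself, a minimal sketch (assuming the LM configured above) is to run a one-off prediction and inspect the last LM call. With the DSPy version we used, the generated instructions are on the order of "Given the fields `question`, produce the fields `answer`.", followed by a simple field-by-field format template:

qa = dspy.Predict("question -> answer")
result = qa(question="What does an operating system do?")
print(result.answer)

# Show the full prompt (including the generated instructions) that was sent to the LM
gpt_turbo.inspect_history(n=1)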
As we can see, the generated instructions are, by design, quite simple.
The signature just needs to say what is required. If we want the AI to summarize some input text, we could use a signature like "text -> summary", or express it in some other clear way, like "document -> summary".
Suppose we want an answer to a question based on some context, then this would work:
"question, context -> answer"
In general, we can express multiple inputs or outputs using comma-separated lists.
These inline signatures are suitable for simple situations, but if we want to include even a little more information we’ll need to rely on class-based definitions of signatures.
For our requirements above, a signature like the following is inadequate because even though it says what the inputs (the course description and the SEO keyword) are and what the output is, there’s no way to specify our constraints (like the length of the meta description).
"content, keyword -> metadescription"
So we resort to defining it as a full-blown class:
class MySignature(dspy.Signature):
    """Generate a meta description that explains what the course covers.
    It must include the keyword and its length must be 150 characters or less."""
    content = dspy.InputField(desc="the course content")
    keyword = dspy.InputField(desc="the keyword")
    metadescription = dspy.OutputField(desc="contains the keyword and at most 150 characters")
Explanation:
The code defining the signature above may look unusual, but this is what it does:
The docstring at the top is included verbatim in the generated prompt instructions.
The names of the fields (content, keyword, metadescription) also become an actual part of the prompt. So, keeping meaningful field names is essential.
The input and output fields need to be marked clearly using dspy.InputField() and dspy.OutputField(), as shown. The descriptions of these fields can be passed as arguments using the desc named parameter, although the descriptions are optional. These descriptions become part of the prompt instructions, as we'll see shortly.
And that’s it—defining a class-based signature just requires specifying inputs, outputs, and concise descriptions of what’s expected.
Once we have written the signature, we need to declare a module that uses it for generating prompt instructions.
A module is just a class that inherits from dspy.Module, directly or indirectly. To declare a module instance, a signature must be passed to the module constructor. The instantiated module can then be used for making LM calls to generate outputs against the inputs passed to it.
There are currently five types of built-in modules that embody different prompting techniques. Understanding how to work with two of these is sufficient for getting started and may very well be all you need. The most basic of these is dspy.Predict. Others inherit from it or build on top of it.
In Python, it's possible to make calls to an object of a class in a way that's similar to how we call functions. An instance of the Predict module is callable in this sense; when it's called with an input, it returns the response containing the output. The returned object is an instance of a class called Prediction.
Here, an instance of the Predict module is declared and then called with our inputs; the call returns a Prediction object:
class MySignature(dspy.Signature):
    """Generate a meta description that explains what the course covers.
    It must include the keyword and its length must be 150 characters or less."""
    content = dspy.InputField(desc="the course content")
    keyword = dspy.InputField(desc="the keyword")
    metadescription = dspy.OutputField(desc="contains the keyword and at most 150 characters")

# Declare an instance of the Predict module
predictor = dspy.Predict(MySignature)

# Call it on some content and SEO keyword to predict the meta description
response = predictor(
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    keyword="operating system course"
)

# Print the machine-written meta description
print(f"Predicted meta description: {response.metadescription}")
Explanation:
We declare an instance of the Predict module, passing it the MySignature signature.
We call predictor by passing it the inputs, as directed by MySignature.
We expect the metadescription field of the response object to contain the predicted output, because we specified it as an output field in MySignature.
Here's the full prompt that's generated when the module is called:
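The exact text depends on the DSPy version, but the generated prompt follows roughly this template: the signature's docstring, a format section built from the field names and their descriptions, and then the actual inputs (a sketch, not the verbatim output):

Generate a meta description that explains what the course covers.
It must include the keyword and its length must be 150 characters or less.

---

Follow the following format.

Content: the course content
Keyword: the keyword
Metadescription: contains the keyword and at most 150 characters

---

Content: When it comes to operating systems, there are three main concepts: ...
Keyword: operating system course
Metadescription: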
Here’s the output of our program:
Predicted meta description: Master the fundamentals of operating systems with our extensive course on virtualization, concurrency, and persistence. Perfect for beginners.
Notice how the predicted description does not include the SEO keyword (“operating system course”). It’s also 153 characters.
In such a case, we must suppress the urge to fix things by resorting to old-fashioned prompt engineering and see if this can be fixed through automation. In our case, we tried dspy.ChainOfThought next.
The dspy.ChainOfThought module asks the LM to lay out a rationale (reasoning steps) before producing the desired output. We can instantiate it by passing a signature and the number of completions we want it to generate:

# Generate three completions, each with its own round of reasoning
predictor = dspy.ChainOfThought(MySignature, n=3)
The signature passed to ChainOfThought is modified under the hood to include an additional output field, rationale, so that the rationale for each of the completion rounds is generated before the desired output (the meta description).
The object returned can then be examined to see what transpired at the level of the LM.
# Declare an instance of the ChainOfThought module
predictor = dspy.ChainOfThought(MySignature, n=3)

# Call it on an input.
response = predictor(
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    keyword="operating system course"
)

# Look at the rationale and meta descriptions
print(response.completions)
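One way to turn the completions into a table like the one below is to read the individual output fields off response.completions; attribute access returns one list per field, with one entry per completion. A minimal sketch:

# Each output field on `response.completions` is a list, one entry per completion
rationales = response.completions.rationale
metas = response.completions.metadescription

for rationale, meta in zip(rationales, metas):
    print(len(meta), "|", rationale, "|", meta)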
Here’s the output presented as a table to see the rationale for each completion step and the corresponding meta description. It’s interesting to note that the length meets the 150 character limit in the last completion round. However, the keyword is still not included.
| Rationale | Meta Description | Meta Description Length |
| --- | --- | --- |
| 'produce a concise and informative meta description. We will cover virtualization, concurrency, and persistence in this comprehensive operating system course.' | 'Master the fundamental concepts of operating systems with our comprehensive course on virtualization, concurrency, and persistence. Perfect for beginners.' | 154 |
| 'introduce our comprehensive operating system course. We will cover virtualization, concurrency, and persistence.' | 'Master the foundations of operating systems with our comprehensive course. Learn about virtualization, concurrency, and persistence in just a few weeks!' | 152 |
| 'produce a metadescription. We will cover the basics of virtualization, concurrency, and persistence in this extensive operating system course. Master everything there is to know in just one course!' | 'Master everything there is to know about operating systems in this extensive course covering virtualization, concurrency, and persistence. Enroll now!' | 150 |
It's a little difficult to make sense of the rationale. We can inspect the generated prompt instructions using the command gpt_turbo.inspect_history(n=1).
See how the (AI generated) rationale in the first row of the table above gets plugged into the prompt instruction, immediately after the prefix "Reasoning: Let's think step by step in order to".
To better understand how the AI generated rationale gets plugged in, it’s worth noting that by default, the line of code
predictor = dspy.ChainOfThought(MySignature, n=3)
is equivalent to the following code, where an output field is created as a standalone object and then passed to the ChainOfThought constructor using an optional parameter:
rationale_type = dspy.OutputField(
prefix="Reasoning: Let's think step by step in order to",
desc="${produce the metadescription}. We ..."
)
predictor = dspy.ChainOfThought(MySignature, n=3, rationale_type=rationale_type)
For what it's worth, minor adjustments can be made to the text in the rationale_type before it's passed explicitly to the ChainOfThought constructor. For example, simply changing ${produce the metadescription} to ${produce the metadescription in 150 characters} would have an impact on the generated output.
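For illustration, that tweak would look like this:

rationale_type = dspy.OutputField(
    prefix="Reasoning: Let's think step by step in order to",
    desc="${produce the metadescription in 150 characters}. We ..."
)
predictor = dspy.ChainOfThought(MySignature, n=3, rationale_type=rationale_type)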
However, we did not engineer the prompts in this manner, since it's not in line with the DSPy philosophy. Instead, we considered using built-in DSPy optimizers as a next step to see if this would lead to improvements in meta descriptions on a set of examples.
Though there are other modules of interest that we did not use, they're still worth exploring for other use cases (like the ProgramOfThought module, which builds on top of ChainOfThought and is used for generating and executing Python code).
A DSPy optimizer uses a few examples for training purposes. So let's first see how to prepare training data before we think about optimizing.
The syntax for creating examples is straightforward. We create training examples by including hand-written meta descriptions against each provided content and keyword pair:
e = dspy.Example(
    keyword="operating system course",
    content="""When it comes to operating systems, there are three main concepts: virtualization,
concurrency, and persistence. These concepts lay the foundation for understanding how an operating
system works.
In this extensive course, you'll cover each of those in its entirety. You'll start by covering the
basics of CPU virtualization and memory such as: CPU scheduling, process virtualization, and API
virtualization. You will then move on to concurrency concepts where you'll focus heavily on locks,
semaphores, and how to triage concurrency bugs like deadlocks.
Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file
systems. By the time you're done, you'll have mastered everything there is to know about operating
systems.""",
    metadescription="""This operating system course helps developers learn three main concepts:
OS virtualization, concurrency and persistence with hands-on practice."""
)
Once an example (e) is created, its input fields (content and keyword) must be marked explicitly, as shown below. Any remaining field (metadescription in our case) is treated as a training label.
e = e.with_inputs("content", "keyword")
A handful of labeled examples should suffice. We used nine examples (not listed here) that in our opinion are representative of the quality we seek.
If there’s a non-trivial number of examples, we can also read from a file and create them programmatically.
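For example, here is a minimal sketch, assuming a hypothetical JSON file courses.json in which each record has content, keyword, and metadescription fields:

import json

import dspy

# Load the records and turn each one into a labeled dspy.Example,
# marking content and keyword as the input fields
with open("courses.json") as f:
    records = json.load(f)

examples = [
    dspy.Example(**record).with_inputs("content", "keyword")
    for record in records
]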
We also create new examples as our test data, but we don't label them. So only the content and keyword fields are included and marked as input fields; the metadescription field is not included.
A DSPy program can consist of one or more modules. Such a program can be fine-tuned using a DSPy optimizer. An optimizer improves the prompt instructions by placing calls to an LM behind the scenes. It can also automatically generate more examples, called bootstrapped demos, to be included in the prompt instructions. Note, however, that some optimizers go beyond the prompt altogether and fine-tune the weights of the underlying LM itself.
It’s instructive to learn about some of these optimizers. Other optimizers are built around similar ideas.
LabeledFewShot picks a handful of examples from the training data and uses them without modification.
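For instance, here is a sketch of how it could be applied to our predictor, assuming examples is the list of labeled training examples from above (the k value is illustrative):

from dspy.teleprompt import LabeledFewShot

# Include up to k labeled examples from the training set as demos in the prompt
optimizer = LabeledFewShot(k=9)
optimized_program = optimizer.compile(student=predictor, trainset=examples)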
The BootstrapFewShot optimizer is passed two DSPy programs, called a student and a teacher, as well as a metric function.
The teacher program uses examples from the training set (up to max_labeled_demos of them) and generates additional bootstrapped examples. These examples are validated using a metric function (more on this later). Once validated, they are used as part of the prompt to make a prediction. The sequence of operations in generating a bootstrapped example is called a trace. There can be multiple traces. Here's how we can use a BootstrapFewShot optimizer to "compile" a predictor:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=4
)

# Pass the program (in our case, predictor, the module instance) and the labeled examples as arguments
optimized_program = optimizer.compile(student=predictor, trainset=examples)
Nomenclature: Previously, these optimizers were referred to as "teleprompters." That explains the use of dspy.teleprompt in the import statement. The function name compile just refers to the optimizations or refinements that the optimizer performs.
BootstrapFewShotWithRandomSearch generates multiple candidate programs, then picks the one that works best on a validation set. These candidate programs include:
A program compiled with LabeledFewShot
Programs compiled with BootstrapFewShot, with and without random shuffling of training examples
If you are not satisfied with your experimentation with BootstrapFewShot and are unsure about what to try next, BootstrapFewShotWithRandomSearch will try out many variations, saving you from a lot of headache. This is, of course, at the expense of additional costs incurred by the underlying LM calls.
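A sketch of what that could look like for our use case follows; the parameter values are illustrative, and each candidate program triggers its own round of LM calls:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=4,
    num_candidate_programs=8
)
optimized_program = optimizer.compile(student=predictor, trainset=examples)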
A metric is a function that returns a score (numeric or boolean) that represents the degree to which the predicted output conforms to our requirements.
We can write our own custom metric function to give a verdict on whether the generated output is good enough.
We use a simple metric function that, when called by an optimizer, returns True if all three of our requirements have been met. In particular:
The length and keyword requirements are checked programmatically.
The third requirement (that the meta description says what the course covers) is checked with an LM call made through a Predict module with signature AssessQuality.
So let's look at the signature AssessQuality first before looking at the code for the metric.
# Define a signature so we can use it to assess quality inside the metric function below
class AssessQuality(dspy.Signature):
    """Assess the quality of the metadescription according to the specified criteria."""
    metadescription = dspy.InputField()
    criteria_query = dspy.InputField()
    answer = dspy.OutputField(desc="yes or no")
We'll see how this signature is used inside the metric function below.
Note that the signature of a metric function is expected to specify three parameters: one for passing a labeled example, one for passing a prediction, and a parameter trace, which indicates the trace being run by an optimizer when it uses the metric function.
# Define a metric to validate quality
def my_metric(example, prediction, trace=None):
    # Retrieve the inputs from the example and the output from the prediction object
    content = example.content
    keyword = example.keyword
    metadescription = prediction.metadescription

    # Does the metadescription have a valid length?
    is_length_valid = (len(metadescription) <= 150)

    # Is the keyword present verbatim in the predicted metadescription?
    is_keyword_present = keyword.lower() in metadescription.lower()

    # We state the third requirement as a question with a yes/no answer
    coverage_query = f"Does the metadescription `{metadescription}` concisely express what a course covers if that course's description is `{content}`?"

    # We use the Predict module to get a yes/no answer from AI to the textual query
    coverage_response = dspy.Predict(AssessQuality)(metadescription=metadescription, criteria_query=coverage_query)

    # Does the description say what the course covers?
    is_coverage_adequate = (coverage_response.answer.lower() == 'yes')

    # The score equals the number of requirements that were met
    score = is_length_valid + is_keyword_present + is_coverage_adequate

    # When the metric function is called inside an optimizer, trace is not None.
    # In that case, return True only if the score is perfect; else return False.
    if trace is not None:
        return (score == 3)

    # When the metric function is called for evaluation purposes, we'd like to get
    # a more nuanced score to help us understand how badly we failed.
    return score
First, the inputs in example (a training example) are extracted, and the predicted meta description for this example is extracted from the prediction object.
Next, the boolean variables is_length_valid and is_keyword_present are set to True if the length of the meta description is at most 150 characters and the SEO keyword is contained in the meta description, respectively.
Our third requirement (the meta description should say what the course is about) is qualitative and cannot be checked programmatically. So it's articulated as a query and passed to the callable Predict module, which was instantiated using the AssessQuality signature above.
Note: A metric function can be used in multiple ways.
- It may be used in the training (optimizing) phase by being passed to an optimizer that runs multiple traces on different training examples. In such a case, the trace parameter is set to something other than None.
- It can also be used directly for evaluation purposes, where the predictor is evaluated on different testing examples.
Finally, instead of writing two separate metric functions, we include a conditional statement (if trace is not None) to check if the function is called internally from within an optimizer. In that case, it returns True or False to indicate pass or fail, which helps the optimizer fine-tune the training process; otherwise, it returns the numeric score so we can see how many of the requirements were met.
We'll see how to apply the metric function next.
We used a BootstrapFewShot optimizer for our use case. Notice how we pass the my_metric function as an argument to it. We specify the use of at most 9 training examples, with at most 3 bootstrapped (generated) examples:
from dspy.teleprompt import LabeledFewShot
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate  # To evaluate on the test set

optimizer = BootstrapFewShot(
    metric=my_metric,
    max_labeled_demos=9,
    max_bootstrapped_demos=3
)

optimized_predictor = optimizer.compile(student=dspy.ChainOfThought(MySignature), trainset=examples)

evaluator = Evaluate(devset=test_data, metric=my_metric, display_progress=True, display_table=7)
eval_score = evaluator(optimized_predictor)
The my_metric function is also passed to the built-in Evaluate class, along with the test data. The examples in the test data contain two fields: keyword and content.
The evaluator is then applied to our optimized predictor on these examples, scoring each one with my_metric. Of the seven test examples, only two meet all our needs (with a score of 3); all the others meet only two of the three requirements. That's far from perfect, although to be fair, we should have tested with more examples.
Surprisingly, if we use the ChainOfThought module directly, without any optimizations, the results are better: four of the examples meet all the requirements, and three of them meet only two.
We tried other variations. This is what we observed for our small test data, but beware that this won't always be true in general:
ChainOfThought worked better than Predict.
When Predict is used with an optimizer, its performance does tend to improve.
The best results came from ChainOfThought with the constructor parameter n set to 1 (one round of completion) and no optimizer.
The results with BootstrapFewShot (after trying out multiple changes to the parameters) were comparable to the results with LabeledFewShot.
We could not experiment fully with BootstrapFewShotWithRandomSearch, since we used the gpt-3.5-turbo-instruct model and our token usage was limited to 90,000 tokens per minute. With the BootstrapFewShotWithRandomSearch optimizer, we kept exceeding this token limit.
The three requirements we began with—conciseness, keyword inclusion, and content coverage—are competing requirements in the following sense: it's hard to satisfy all three at once, even for human writers.
We assigned the task of writing meta descriptions for a hundred courses to different human writers, and the results were variable. Similar to the problems seen in the AI-generated descriptions, some writers omitted keywords, or settled for descriptions that did not fully capture the courses' content. Moreover, some of the writers' text was simply not as well written.
For a larger project, such as generating meta descriptions for 1,000+ courses, the time investment for human writers would be significantly greater than with DSPy (assuming the engineer is already familiar with its basic use). So for this use case, it's more efficient to use DSPy to generate the bulk of the descriptions and manually fix any problem cases afterward. In general, for other use cases, one needs to be aware that working with DSPy requires careful reflection over each decision made. The choice of modules, optimizers, parameters, and improvements to the metric function can all help in fine-tuning the application in incremental steps.
Want to learn more about working with generative models? Explore these courses on Educative to polish your skills!