Engineers Should Handle Prompting LLMs (and Prompts Should Live in Your Codebase)

Published on
Mar 29, 2024

We’ve seen many discussions around Large Language Model (LLM) software development allude to a workflow where prompts live apart from LLM calls and are managed by multiple stakeholders, including non-engineers. In fact, many popular LLM development frameworks and libraries are built in a way that requires prompts to be managed separately from their calls. 

We think this is an unnecessarily cumbersome approach that’s not scalable for complex, production-grade LLM software development. 

Here’s why: for anyone developing production-grade LLM apps, prompts that include code will necessarily be a part of your engineering workflow. Therefore, separating prompts from the rest of your codebase, especially from their API calls, means you’re splitting that workflow into different, independent parts.

Separating concerns and assigning different roles to manage each may seem to bring certain efficiencies, for example, easing collaboration between technical and non-technical roles. But it introduces fundamental complexity that can disrupt the engineering process. For instance, introducing a change in one place—like adding a new key-value pair to an input for an LLM call—means manually hunting down every other place that change touches. Even then, you will likely not catch every error.
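To make this concrete, consider a purely hypothetical setup where the template lives outside the codebase and the call site formats it blindly:

# Hypothetical illustration: the template is managed elsewhere (a prompt CMS, a separate repo).
TEMPLATE = "Recommend a {genre} book for {reader_name}."


def recommend(inputs: dict) -> str:
    # Nothing here checks that `inputs` matches what the template expects.
    return TEMPLATE.format(**inputs)


# An engineer adds a new input field; the unused key is silently ignored.
recommend({"genre": "fantasy", "reader_name": "Alice", "reading_level": "advanced"})
# And if the template owner adds {reading_level} before every call site passes it,
# the mismatch only surfaces as a KeyError at runtime.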

Don’t get us wrong. If your prompts are purely text or have very minimal code, then managing them separately from your calls may not have much of an impact. And there are legitimate examples of prompts with minimal or no code, like prompts for ChatGPT. In such cases, managing prompts separately from calls can make sense.

But any enterprise-grade LLM apps require sophisticated prompts, which means you’ll end up writing code for such prompts anyway. 

In fact, trying to write out that logic in plain text would be even more complicated. In our view, code makes prompting both efficient and manageable, and it places prompting squarely within the purview of engineers.

Below, we outline how we arrived at this truth, and how the solution we’ve developed (a Python-based LLM development library) helps developers manage prompts in the codebase easily and efficiently, making, in our experience, LLM app development faster and more enjoyable.

Our Frustrations with Developer Tools for Prompt Engineering

Our view on prompting started as we were using an early version of the OpenAI SDK to build out interpretable machine learning tools at a previous company. This was the standard OpenAI API for accessing GPT model functionalities.

Back then we didn’t have the benefit of any useful helper libraries, so we wrote all the API code ourselves. This amounted to writing lots of boilerplate to accomplish what seemed like simple tasks. For example, automatically extracting the model configuration (such as constraints) from just the names of the features in a given dataset required many prompt iterations, and evaluating them was a pain.

It was around that time that we began asking ourselves: why aren’t there better developer tools in the prompt engineering space? Is it because people are bringing experimental stuff into production too quickly? Or simply because the space is so new?

The more we worked on our LLM applications, the clearer it became that, from a software engineer's perspective, separating prompt management from the calls was fundamentally flawed. It made the actual engineering slow, cumbersome, and arguably more error-prone. It was almost as if current tools weren't built around developer best practices but rather around Jupyter notebook best practices (if there even is such a thing).

Beyond that, we noticed some other issues:

  • Our prompts became unmanageable past two versions. We weren’t using a prompt management workflow back then, so implementing changes was a manual process. We started telling colleagues not to touch the code because it might break a function somewhere else.

  • A lot of libraries tried to offer functionality for as many use cases as possible, sometimes making you feel dependent on them. They required you to do things their way, or you’d have to wait for them to catch up with new features from the LLMs.

All this led us to rethink how prompts should be managed to make developers’ lives easier. In the end, these frustrations boiled over into us wanting to build our own library that approached LLM development in a developer-first way to make LLM app development faster and more enjoyable. This ultimately became Mirascope.

How Mirascope Makes Prompt Engineering Intuitive and Scalable

For us, prompt engineering boils down to the relationship between the prompt and the API call. Mirascope represents what we feel is a best-in-class approach for generating that prompt, taking the LLM response, and tracking all aspects of that flow.

As developers, we want to focus on innovation and creativity, rather than on managing and troubleshooting underlying processes.

To that end, we designed Mirascope with the following features and capabilities to make your prompting more efficient, simpler, and scalable.

Code Like You Already Code, with Pythonic Simplicity

It was important to us to be able to just code in Python, without having to learn superfluous abstractions or extra, fancy structures that make development more cumbersome than it needs to be. So we designed Mirascope to do just that. 

For instance, we don’t make you implement directed acyclic graphs in the context of sequencing function calls. We provide code that’s eminently readable, lightweight, and maintainable.

An example of this is our `BasePrompt` class, which encapsulates as much logic within the prompt as feasible.

Within it, the `prompt_template` decorator supplies the prompt template, a string for generating a prompt that requests book recommendations based on topic and genre pairs. The `topics_x_genres` computed field constructs these pairs, and the combined string is injected into the template to create the final list of messages.

from pydantic import computed_field

from mirascope.core import BasePrompt, prompt_template


@prompt_template(
    """
    Can you recommend some books on the following topic and genre pairs?
    {topics_x_genres}
    """
)
class BookRecommendationPrompt(BasePrompt):
    topics: list[str]
    genres: list[str]

    @computed_field
    @property
    def topics_x_genres(self) -> list[str]:
        """Returns the cross product of `topics` and `genres` as formatted pair strings."""
        return [
            f"Topic: {topic}, Genre: {genre}"
            for topic in self.topics
            for genre in self.genres
        ]


prompt = BookRecommendationPrompt(
    topics=["coding", "music"], genres=["fiction", "fantasy"]
)
print(prompt)
# > Can you recommend some books on the following topic and genre pairs?
#   Topic: coding, Genre: fiction
#   Topic: coding, Genre: fantasy
#   Topic: music, Genre: fiction
#   Topic: music, Genre: fantasy


By default, Mirascope’s `BasePrompt` treats the prompt template as a single user message, which keeps initial use of the class simple for straightforward scenarios.

But you may want to add more context to prompts in the form of different roles, such as SYSTEM, USER, ASSISTANT, or TOOL (depending on which roles an LLM model can use), to generate responses that are more relevant and nuanced, as shown here:

from mirascope.core import BasePrompt, prompt_template


@prompt_template(
    """
    SYSTEM:
    You are the world's greatest librarian.

    USER:
    Can you recommend some books on {topic}?
    """
)
class BookRecommendationPrompt(BasePrompt):
    topic: str



Now, if you run this prompt, it will automatically parse into message objects in order to seamlessly call the specified model. You can extend `BasePrompt` to fit whatever use case you need, from few-shot prompting to chat interactions with an LLM.
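For example, running the multi-role prompt above looks the same as running the single-message version. A brief sketch, assuming your OpenAI credentials are configured:

from mirascope.core import openai


prompt = BookRecommendationPrompt(topic="history")
response = prompt.run(openai.call(model="gpt-4o"))
print(response.content)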

We also wanted to avoid introducing complexity where it’s not absolutely necessary. For example, given a choice in how to chain components together, we prefer relying on native Python—perhaps instantiating one class within another and calling its method directly (as shown in the code sample directly below)—rather than relying on the pipe operator.

Note: This isn’t to say there’s one “best” way to accomplish chaining, and you're certainly not required to do it with the subclass style shown below. For instance, you can also have two separate calls where you pass the output of the first into the second as an attribute during construction (rather than as an internal property). It's not our recommendation since it breaks colocation, but you're free to do what you like. We just have opinionated guidelines, not requirements.

Nevertheless, this approach to chaining encapsulates each step of the process within class methods, allowing for a clean and readable way to sequentially execute tasks that depend on the outcome of previous steps:

from pydantic import computed_field

from mirascope.core import BasePrompt, openai, prompt_template
from mirascope.core.openai import OpenAICallResponse


@prompt_template("Name a chef who is really good at cooking {food_type} food")
class ChefSelectionPrompt(BasePrompt):
    food_type: str


@prompt_template(
    """
    SYSTEM:
    Imagine that you are chef {chef}.
    Your task is to recommend recipes that you, {chef}, would be excited to serve.

    USER:
    Recommend a {food_type} recipe using {ingredient}.
    """
)
class RecipeRecommendationPrompt(ChefSelectionPrompt):
    ingredient: str

    @computed_field
    @property
    def chef(self) -> OpenAICallResponse:
        prompt = ChefSelectionPrompt(food_type=self.food_type)
        return prompt.run(openai.call(model="gpt-4o"))


prompt = RecipeRecommendationPrompt(food_type="japanese", ingredient="apples")
response = prompt.run(openai.call(model="gpt-4o"))
print(response.content)
# > Certainly! Here's a delightful Japanese-inspired recipe using apples:...
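The note above mentioned an alternative: run two separate calls and pass the first output into the second at construction time. Here's a minimal sketch of that style (the `ChefPrompt` and `RecipePrompt` variants are ours, for illustration only):

from mirascope.core import BasePrompt, openai, prompt_template


@prompt_template("Name a chef who is really good at cooking {food_type} food")
class ChefPrompt(BasePrompt):
    food_type: str


@prompt_template(
    """
    SYSTEM:
    Imagine that you are chef {chef}.

    USER:
    Recommend a {food_type} recipe using {ingredient}.
    """
)
class RecipePrompt(BasePrompt):
    food_type: str
    ingredient: str
    chef: str  # passed in at construction instead of computed internally


chef_response = ChefPrompt(food_type="japanese").run(openai.call(model="gpt-4o"))
recipe_prompt = RecipePrompt(
    food_type="japanese", ingredient="apples", chef=chef_response.content
)
print(recipe_prompt.run(openai.call(model="gpt-4o")).content)

As the note says, this works, but the relationship between the two calls now lives at the call site rather than inside the prompt itself.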


Finally, we show an example of how you can use Mirascope to do few-shot prompting, which provides the language model with a few examples (shots) to help it understand the task and generate better output.

Below are three example sets of book recommendations for different topics to guide the model in understanding the format and type of response expected when asked to recommend books on a new topic, such as "coding."

from mirascope.core import BasePrompt, prompt_template


@prompt_template(
    """
    I'm looking for book recommendations on various topics. Here are some examples:

    1. For a topic on 'space exploration', you might recommend:
       - 'The Right Stuff' by Tom Wolfe
       - 'Cosmos' by Carl Sagan

    2. For a topic on 'artificial intelligence', you might recommend:
       - 'Life 3.0' by Max Tegmark
       - 'Superintelligence' by Nick Bostrom

    3. For a topic on 'historical fiction', you might recommend:
       - 'The Pillars of the Earth' by Ken Follett
       - 'Wolf Hall' by Hilary Mantel

    Can you recommend some books on {topic}?
    """
)
class FewShotBookRecommendationPrompt(BasePrompt):
    topic: str

Minimizing complexity lowers the learning curve. In Mirascope’s case, beyond knowing our library and Python, the only framework to learn is Pydantic.

Built-in Data Validation for Error-Free Prompting

We find that high-quality prompts—ones that are type and error checked—lead to more accurate and useful LLM responses, and so data validation is at the heart of what we do. 

Automatic validation against predefined schemas is built into the fabric of our framework, allowing you to be more productive rather than having to chase down bugs or code your own basic error handling logic.

For starters, our `BasePrompt` class extends Pydantic’s `BaseModel`, ensuring valid and well-formed inputs for your prompts. This means:

  • Mirascope’s prompt class inherits Pydantic’s capability to ensure the data is correctly typed before it’s processed and sent over to the API, leading to cleaner, more maintainable code. Developers can focus more on the business logic specific to prompting rather than on writing boilerplate.
  • Pydantic easily serializes data both to and from JSON format, which simplifies the process of preparing request payloads and handling responses, eases integrations with any systems that accept JSON, and helps you quickly spin up FastAPI endpoints.
  • Pydantic is well supported in many IDEs, offering autocompletion and type hints.
  • It also lets developers define custom validation methods if needed, allowing them to enforce complex rules that go beyond type checks and basic validations.
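To illustrate that last point, here's a minimal sketch of a custom field validator on a prompt class (the class and the rule are hypothetical):

from pydantic import field_validator

from mirascope.core import BasePrompt, prompt_template


@prompt_template("Can you recommend some books on {topic}?")
class BookRecommendationPrompt(BasePrompt):
    topic: str

    @field_validator("topic")
    @classmethod
    def topic_must_not_be_blank(cls, value: str) -> str:
        """Reject empty or whitespace-only topics before any API call is made."""
        if not value.strip():
            raise ValueError("topic must not be blank")
        return value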

An example of using Pydantic for enforcing type validation (with graceful error handling) is shown below:

from pydantic import BaseModel, ValidationError

from mirascope.core import openai, prompt_template


class Book(BaseModel):
    title: str
    price: float


@openai.call(model="gpt-4o", response_model=Book)
@prompt_template("Please recommend a book.")
def recommend_book(): ...


try:
    book = recommend_book()
    assert isinstance(book, Book)
    print(book)
except ValidationError as e:
    print(e)
    # > 1 validation error for Book
    #  price
    #    Input should be a valid number, unable to parse string as a number [type=float_parsing, input_value='standard', input_type=str]
    #      For further information visit      https://errors.pydantic.dev/2.6/v/float_parsing

You can also validate data in ways that are difficult if not impossible to code successfully, but that LLMs excel at, such as analyzing sentiment. For instance, you can add Pydantic’s `AfterValidator` annotation to Mirascope’s extracted output as shown below:

from typing import Annotated, Literal

from pydantic import AfterValidator, BaseModel, Field, ValidationError

from mirascope.core import openai, prompt_template


class Label(BaseModel):
    sentiment: Literal["happy", "sad"] = Field(
        ..., description="Whether text is happy or sad"
    )


@openai.call(model="gpt-4o", response_model=Label)
@prompt_template("Is the following happy or sad? {text}")
def get_sentiment(text: str): ...


def validate_happy(story: str) -> str:
    """Check if the content follows the guidelines."""
    label = get_sentiment(story)
    assert label.sentiment == "happy", "Story wasn't happy."
    return story


class Story(BaseModel):
    story: Annotated[str, AfterValidator(validate_happy)]


@openai.call(model="gpt-4o", response_model=Story)
@prompt_template("Tell me a very sad story.")
def tell_sad_story(): ...


try:
    story = tell_sad_story()
    print(story)
except ValidationError as e:
    print(e)
    # > 1 validation error for Story
    #   story
    #     Assertion failed, Story wasn't happy. [type=assertion_error, input_value="Once upon a time, there ...er every waking moment.", input_type=str]
    #       For further information visit https://errors.pydantic.dev/2.6/v/assertion_error

Simplify LLM Interactions with Wrappers and Integrations

We believe in freeing you from writing boilerplate to interact with APIs, so we made available a number of wrappers for common providers. 

Whether you call the model using our function decorators or `BasePrompt`, the response comes back wrapped around the original response from the model. In OpenAI’s case, that wrapper is `OpenAICallResponse`:

from mirascope.core import BasePrompt, openai, prompt_template


@prompt_template("Recommend a {genre} book.")
class BookRecommendationPrompt(BasePrompt):
    genre: str


@openai.call(model="gpt-4o", call_params={"temperature": 0.5})
@prompt_template("Recommend a {genre} book.")
def recommend_book(genre: str): ...


# Both responses return `OpenAICallResponse`
prompt = BookRecommendationPrompt(genre="fantasy")
response1 = prompt.run(openai.call(model="gpt-4o", call_params={"temperature": 0.5}))
response2 = recommend_book(genre="fantasy")

print(response1.content)


In the decorator, you can use `call_params` to tie any parameters used for making the API call to OpenAI to that specific call. This means each decorated function (or prompt run) carries all the information it needs to make a tailored API request to OpenAI.

For streaming LLM responses, set `stream=True` in the function decorator. When streaming, the response is returned as an instance of `OpenAIStream`, which generates `OpenAICallResponseChunk` instances (and `OpenAITool` instances as the second item in the tuple if using tools). Both offer their own set of convenience wrappers:

from mirascope.core import openai, prompt_template


@openai.call(model="gpt-4o", call_params={"temperature": 0.5}, stream=True)
@prompt_template("Recommend a {genre} book.")
def recommend_book(genre: str): ...


response = recommend_book(genre="fantasy")

for chunk, _ in response:
    print(chunk.content, end="", flush=True)
# > Certainly! If you're looking for a captivating fantasy book, ...

`OpenAIStream` has convenience wrappers pertaining to the call as a whole, whereas `OpenAICallResponseChunk` has wrappers pertaining to each chunk.

from mirascope.core.openai import OpenAICallResponseChunk, OpenAIStream



stream = OpenAIStream(...)
stream.message_param
# > {
#       "role": "assistant",
#       "content": "Certainly! If you're looking for a captivating fantasy book, ...",
#       "tool_calls": None,
#   }
stream.call_params
# > {'temperature': 0.5}


chunk = OpenAICallResponseChunk(...)
chunk.content  # original.choices[0].delta.content
chunk.delta    # original.choices[0].delta
chunk.choice   # original.choices[0]
chunk.choices  # original.choices
chunk.chunk    # ChatCompletionChunk(...)


If a provider offers an OpenAI-compatible endpoint, you can access it through Mirascope by setting `client` in the decorator. This works for providers such as Ollama, Anyscale, Together, AzureOpenAI, and others that support the OpenAI API through a proxy.

from mirascope.core import openai, prompt_template
from openai import AzureOpenAI, OpenAI


@openai.call("gpt-4o", client=AzureOpenAI(azure_endpoint="ENDPOINT"))
@prompt_template("Recommend a {genre} book.")
def recommend_book(genre: str): ...


@openai.call("llama3", client=OpenAI(base_url="BASE_URL", api_key="ollama"))
@prompt_template("Recommend a {genre} book.")
def recommend_book(genre: str): ...


We also made sure that Mirascope integrates with Logfire, OpenTelemetry, HyperDX, LangSmith, Langfuse, and LangChain, so you can track experiments, visualize data, and improve prompt effectiveness through automated refinement and testing. We provide code examples for integrating with these libraries in our documentation.

You can use Mirascope with other LLM providers that also implement the OpenAI API, including:

  • Ollama
  • Anyscale
  • Together
  • Groq

Beyond OpenAI, Mirascope provides access to these other LLM providers (and those using their APIs):

  • Anthropic
  • Gemini
  • Mistral

If you wanted to switch to another model provider like Anthropic for instance, you’d just need to change the decorator and the corresponding call parameters:

from mirascope.core import anthropic, prompt_template


@anthropic.call(model="claude-3-5-sonnet-20240620")
@prompt_template("Recommend a {genre} book.")
def recommend_book(genre: str): ...

Expand LLM Capabilities with Tools

Although LLMs are known mostly for text generation, you can provide them with specific tools (also known as function calling) to extend their capabilities. 

Examples of what you can do with tools include:

  • Granting access to the Bing API for internet search to fetch the latest information on various topics.
  • Providing a secure sandbox environment like Repl.it for dynamically running code snippets provided by users in a coding tutorial platform.
  • Allowing access to the Google Cloud Natural Language API for evaluating customer feedback and reviews to determine sentiment and help businesses quickly identify areas for improvement.
  • Providing a Machine Learning (ML) recommendation engine API for giving personalized content or product recommendations for an e-commerce website, based on natural language interactions with users.

Mirascope lets you easily define a tool by documenting any function using a docstring as shown below. It automatically converts this into a tool, saving you additional work.

from typing import Literal

from mirascope.core import openai, prompt_template


def get_current_weather(
    location: str, unit: Literal["celsius", "fahrenheit"] = "fahrenheit"
):
    """Get the current weather in a given location."""
    if "tokyo" in location.lower():
        print(f"It is 10 degrees {unit} in Tokyo, Japan")
    elif "san francisco" in location.lower():
        print(f"It is 72 degrees {unit} in San Francisco, CA")
    elif "paris" in location.lower():
        print(f"It is 22 degress {unit} in Paris, France")
    else:
        print("I'm not sure what the weather is like in {location}")


@openai.call(model="gpt-4o", tools=[get_current_weather])
@prompt_template("What's the weather in {city}?")
def forecast(city: str): ...


response = forecast("Tokyo")
if tool := response.tool:
    tool.call()


Mirascope supports Google, ReST, Numpydoc, and Epydoc style docstrings for creating tools. If a particular function doesn’t have a docstring, you can define your own `BaseTool` class instead. Define its `call()` method to attach the docstring-less function’s functionality to the tool, while providing your own description:

from mirascope.core import BaseTool
from pydantic import Field


# has no docstring
def get_weather(city: str) -> str: ...


class GetWeather(BaseTool):
    """Gets the weather in a city."""

    city: str = Field(..., description="The city to forecast weather of.")

    def call(self) -> str:
        return get_weather(self.city)
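You can then pass the tool class to the call decorator just as you would a plain function. A brief sketch:

from mirascope.core import openai, prompt_template


@openai.call(model="gpt-4o", tools=[GetWeather])
@prompt_template("What's the weather in {city}?")
def forecast(city: str): ...


response = forecast("Tokyo")
if tool := response.tool:
    print(tool.call())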


Tools also allow you to dynamically generate prompts based on current or user-specified data, such as fetching the current weather in a given city before generating a prompt like, “Given the current weather conditions in Tokyo, what are fun outdoor activities?”

See our documentation for details on generating prompts in this way (for instance, by calling the `call` method).
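As a rough illustration of that pattern, here's a minimal sketch that feeds the tool's output into a follow-up call (the `suggest_activities` prompt is ours, and we assume the weather tool returns its report as a string, as `GetWeather` above does):

from mirascope.core import openai, prompt_template


@openai.call(model="gpt-4o")
@prompt_template(
    """
    Given the current weather conditions: {weather_report}
    What are fun outdoor activities in {city}?
    """
)
def suggest_activities(weather_report: str, city: str): ...


# Reuses the `GetWeather`-powered `forecast` call sketched above.
response = forecast("Tokyo")
if tool := response.tool:
    weather_report = tool.call()
    print(suggest_activities(weather_report=weather_report, city="Tokyo").content)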

Extract Structured Data from LLM-Generated Text

LLMs are great at producing conversations in text, which is unstructured information. But many applications need structured data from LLM outputs. Scenarios include:

  • Extracting structured information from a PDF invoice (i.e., invoice number, vendor, total charges, taxes, etc.) so that you can automatically insert that information into another system like a CRM or tracking tool, a spreadsheet, etc.
  • Automatically extracting sentiment, feedback categories (product quality, service, delivery, etc.), and customer intentions from customer reviews or survey responses.
  • Pulling out specific medical data such as symptoms, diagnoses, medication names, dosages, and patient history from clinical notes.
  • Extracting financial metrics, stock data, company performance indicators, and market trends from financial reports and news articles.

To handle such scenarios, we support extraction with the `response_model` argument in the decorator, which leverages tools (or optionally `json_mode=True`) to reliably extract structured data from the outputs of LLMs according to the schema defined in Pydantic’s `BaseModel`. In the example below you can see how due dates, priorities, and descriptions are being extracted:

from typing import Literal

from pydantic import BaseModel, Field

from mirascope.core import openai, prompt_template


class TaskDetails(BaseModel):
    due_date: str = Field(...)
    priority: Literal["low", "normal", "high"] = Field(...)
    description: str = Field(...)


@openai.call(
    model="gpt-4o",
    response_model=TaskDetails,
    call_params={"tool_choice": "required"},
)
@prompt_template(
    """
    Extract the task details from the following task:
    {task}
    """
)
def get_task_details(task: str): ...


task = "Submit quarterly report by next Friday. Task is high priority."
task_details = get_task_details(task)
assert isinstance(task_details, TaskDetails)
print(task_details)
# > due_date='next Friday' priority='high' description='Submit quarterly report'
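If you'd rather not rely on tools under the hood for extraction, you can pass `json_mode=True` instead. A minimal variant of the example above (the function name `get_task_details_json` is ours; it reuses the `TaskDetails` model):

from mirascope.core import openai, prompt_template


@openai.call(model="gpt-4o", response_model=TaskDetails, json_mode=True)
@prompt_template(
    """
    Extract the task details from the following task:
    {task}
    """
)
def get_task_details_json(task: str): ...


task_details = get_task_details_json(
    "Submit quarterly report by next Friday. Task is high priority."
)
assert isinstance(task_details, TaskDetails)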


You can define the schema against which data is extracted in Pydantic’s `BaseModel` class by setting attributes and fields in that class. Mirascope also lets you set the number of retries in case extraction fails. But you don’t have to use a detailed schema like `BaseModel` if you’re extracting base types like strings, integers, or booleans. The code sample below shows how extraction for a simple structure like a list of strings doesn’t need a full-fledged schema definition.

from mirascope.core import openai, prompt_template


@openai.call(model="gpt-4o", response_model=list[str])
@prompt_template("Recommend 3 {genre} books")
def recommend_books(genre: str): ...


books = recommend_books(genre="fantasy")
print(books)
# > [
#   'The Name of the Wind by Patrick Rothfuss',
#   'Mistborn: The Final Empire by Brandon Sanderson',
#   'The Way of Kings by Brandon Sanderson'
#   ]
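As for the retries mentioned above, Mirascope has its own support for retrying failed extractions; as a generic sketch, you can also wrap a call yourself with the tenacity library (the wrapper function name is ours, reusing `get_task_details` and `TaskDetails` from earlier):

from pydantic import ValidationError
from tenacity import retry, retry_if_exception_type, stop_after_attempt


# Retry the extraction up to three times if the response fails validation.
@retry(retry=retry_if_exception_type(ValidationError), stop=stop_after_attempt(3))
def get_task_details_with_retries(task: str) -> TaskDetails:
    return get_task_details(task)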


Mirascope makes things as simple as feasible, requiring you to write less code in cases where more code isn't necessary.

Facilitate Your Prompt Workflows with CLI and IDE Support

As mentioned earlier, our experience with prompts is that they generally become unmanageable after a certain number of iterations. Versioning is obviously a good idea, and we see some cloud prompting tools that offer this, but as they don’t generally colocate prompts with LLM calls, not all the relevant information gets versioned, unfortunately.

We believe it’s important to colocate as much information with the prompt as feasible, and that it should all be versioned together as a single unit. Our prompt management CLI is inspired by Alembic and lets you:

  • Create a local prompt repository and add prompts to it
  • Commit new versions of prompts
  • Switch between different versions of prompts
  • Remove prompts from the repository

Our CLI lets you commit your versions from development as part of your standard Git workflow, ensuring colleagues can see everything that was tried, as well as the differences between prompts. It’s also worth noting that the CLI works with calls and extractors since they subclass the `BasePrompt` class.

When installed, our CLI creates predefined working subdirectories and files as shown below:

|
|-- mirascope.ini
|-- mirascope
|   |-- prompt_template.j2
|   |-- versions/
|   |   |-- <directory_name>/
|   |   |   |-- version.txt
|   |   |   |-- <revision_id>_<directory_name>.py
|-- prompts/


This creates a prompt management environment that supports collaboration and allows you to centralize prompt development in one place. 

When you save a prompt in the `versions` subdirectory above with a command like:

mirascope add book_recommender


This versions the prompt, creating a `book_recommender` subdirectory and adding a version number to the prompt’s filename, e.g., `0001_book_recommender.py`.

The version number is also recorded inside the file itself:

# versions/book_recommender/0001_book_recommender.py
from mirascope.core import BasePrompt, prompt_template

prev_revision_id = "None"
revision_id = "0001"

@prompt_template("Can you recommend some books on {topic} in a list format?")
class BookRecommendationPrompt(BasePrompt):
    topic: str


Once the prompt file is versioned, you can continue iterating on the prompt, as well as switch and remove versions, etc.

Both Mirascope’s and Pydantic’s documentation are available for your IDE; for example, Mirascope provides help information for inline errors and autocomplete suggestions.

Inline errors:

Autocomplete:

If you want to give Mirascope a try, you can get started with our source code on GitHub. You can find our documentation (and more code samples) on our documentation site as well.