
Tips & Inspiration

LLM Chaining: Techniques and Best Practices

LLM chaining is a technique in artificial intelligence for connecting multiple LLMs, or their outputs, to other applications, tools, and services in order to produce better responses or to accomplish complex tasks.

Chaining lets applications do things like work with multiple files, refine content iteratively, and improve responses. It also overcomes some inherent limitations of LLMs themselves:

  • They generally accept only a limited amount of information in one prompt (despite “context windows” getting larger all the time), so connecting an LLM with a service that divides long documents into chunks and feeds them to the model over several calls can be very useful (a minimal sketch of this follows the list).
  • They only remember what’s been said in a given conversation but not outside of it, unless you store the memory or state externally.
  • They still generally only output their answers as text (e.g., prose, JSON, SQL code). But what if your application needs very specific outputs like a validated CSV file, flow chart, or knowledge graph?
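
To make that first limitation concrete, here is a minimal sketch of the divide-and-feed pattern, assuming the OpenAI Python SDK as the provider; the chunk size, model name, and summarization task are placeholders chosen purely for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_long_document(text: str, chunk_size: int = 8000) -> str:
    """Split a document that won't fit in one prompt and summarize it piece by piece."""
    # Naive character-based chunking; production code would split on
    # sentence or token boundaries instead.
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    partial_summaries = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": f"Summarize this passage:\n\n{chunk}"}],
        )
        partial_summaries.append(response.choices[0].message.content)

    # A final call combines the partial summaries into one answer.
    combined = "\n\n".join(partial_summaries)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Combine these summaries into one:\n\n{combined}"}],
    )
    return response.choices[0].message.content
```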

This is why LLM chains are useful: they let you build sophisticated applications like LLM agents and retrieval augmented generation (RAG) systems that typically include:

  • An input processing step, like preparing and formatting data that’ll be sent to the LLM — which often involves a prompt template.
  • APIs enabling interaction with both LLMs and external services and applications.
  • Language model output processing, such as parsing, validating, and formatting.
  • Data retrieval from external data sources, such as fetching relevant embeddings from a vector database to enhance contextual understanding in LangChain RAG applications.
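
Here is a rough sketch of how those pieces fit together in a two-step chain, with a prompt template, an LLM call, and simple output parsing and validation. It assumes the OpenAI Python SDK rather than any particular framework, and the model name, JSON schema, and example task are invented for the illustration.

```python
import json
from openai import OpenAI

client = OpenAI()

# Input processing: a prompt template that formats the data sent to the LLM.
EXTRACT_TEMPLATE = (
    "Extract the product name and price from this text. "
    'Respond with JSON like {{"name": "...", "price": 0.0}}.\n\n'
    "Text: {text}"
)

def extract_product(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": EXTRACT_TEMPLATE.format(text=text)}],
        response_format={"type": "json_object"},
    )
    # Output processing: parse and validate the model's text response.
    data = json.loads(response.choices[0].message.content)
    if "name" not in data or "price" not in data:
        raise ValueError(f"Unexpected LLM output: {data}")
    return data

def write_listing(product: dict) -> str:
    # Second link in the chain: the parsed output feeds the next prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a one-sentence sales listing for {product['name']} at ${product['price']}.",
        }],
    )
    return response.choices[0].message.content

listing = write_listing(extract_product("The AcmePhone 12 retails for $499."))
```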

But putting together an LLM chain isn’t always straightforward, which is why orchestration frameworks like LangChain and LlamaIndex exist, though these have their own inefficiencies that we’ll discuss later on.

It’s for that reason we designed Mirascope, our developer-friendly, Pythonic toolkit, to overcome the shortcomings of modern frameworks.

Mirascope works together with Lilypad, our open source prompt observability platform that allows software developers to version, trace, and optimize language model calls, treating these as non-deterministic functions.

Below, we dive into techniques for chaining and discuss what to look for in an LLM chaining framework.

LLM Integration: Key Tools and Techniques

LLM integration means embedding a language model into an application to give it superpowers like:

  • Understanding the subtleties of user inputs to provide relevant and coherent responses.
  • Writing human-like text for tasks like drafting emails, writing reports, or creating other content.
  • Executing multi-step responses by interpreting and following instructions.
  • Interacting with various tools and APIs to fetch data, perform actions, or control devices.

To get these benefits, you connect the language model with external resources and data sources, typically via an API.

But calling an API isn’t the end of the story — the software must be designed to handle certain challenges:

  • Non-deterministic responses need careful prompt engineering and error handling to maintain consistent outputs.
  • Network issues need to be reliably managed.
  • Model vendor lock-in requires flexible and open design.
  • Experimenting with different prompts requires version control and observability.
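
As one example of handling the second point, here is a hedged sketch of a retry-with-backoff wrapper around an LLM call. The retry count and delays are arbitrary, and the OpenAI SDK is just one possible provider; the same pattern applies to any API client.

```python
import time
from openai import OpenAI, APIConnectionError, RateLimitError

client = OpenAI()

def call_with_retries(prompt: str, max_attempts: int = 3) -> str:
    """Call the model, retrying transient network and rate-limit errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except (APIConnectionError, RateLimitError):
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```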

That’s why we created Mirascope, a user-friendly toolkit for integrating LLMs into applications, and Lilypad, our prompt management and observability platform.

Below, we dive into real-world examples of LLM integrations and then explain the different ways you can set this up.

Prompt Evaluation: Methods, Tools, And Best Practices

Prompt evaluation measures how effectively a prompt guides an LLM to generate responses that satisfy the goals of a task or application.

Unlike LLM evaluation, which assesses a model's overall performance across various tasks (its general strength), prompt evaluation zooms in on the prompt itself — judging how well a particular prompt is structured to produce your desired output — be it in customer support, content generation, or other use cases.

This is important because a poorly designed prompt can lead to unclear, irrelevant, or inaccurate responses, resulting in outputs that fail to meet the intended objectives.

When evaluating prompts, you should:

  1. Define clear and actionable criteria to measure prompt effectiveness (e.g., clarity, output relevance, bias, etc.). But since there’s no universal metric for defining a “perfect” prompt, evaluation success becomes subjective — especially when relying on qualitative criteria.
    Also, language models can carry biases from their training data, influencing responses in ways that weren’t intended.
  2. Implement a process to evaluate prompts against these criteria. This often involves using frameworks or automated tools for repetitive tasks like scoring outputs or ranking responses, while reserving more subtle aspects like detecting hallucinations or addressing culturally sensitive contexts for human oversight.
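
To make step 2 more concrete, here is a small sketch that scores a prompt template against a couple of automatable criteria. The test cases, criteria, length threshold, and model name are assumptions chosen for illustration, not a general-purpose evaluation framework.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical test cases: an input plus a keyword the output should mention.
TEST_CASES = [
    {"input": "Reset my password", "must_mention": "password"},
    {"input": "Cancel my subscription", "must_mention": "subscription"},
]

PROMPT_TEMPLATE = "You are a support agent. Answer concisely:\n\n{question}"

def evaluate_prompt(template: str) -> float:
    """Return the fraction of test cases where the output meets simple criteria."""
    passed = 0
    for case in TEST_CASES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": template.format(question=case["input"])}],
        )
        output = response.choices[0].message.content
        # Automatable criteria: relevant keyword present and output reasonably short.
        if case["must_mention"].lower() in output.lower() and len(output) < 600:
            passed += 1
    return passed / len(TEST_CASES)

print(f"Pass rate: {evaluate_prompt(PROMPT_TEMPLATE):.0%}")
```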

In this article, we show you different ways to evaluate prompts and share practical techniques and insights based on real-world applications. Along the way, we’ll also examine how Lilypad helps structure this evaluation process, making it easier to iterate, compare, and optimize prompts at scale.

Further down, we also share best practices for prompt evaluation and example evaluations using Mirascope, our lightweight toolkit for building with LLMs.

LLM Agents: What They Are, Tools, and Examples

We can define an "agent" as a person who acts on behalf of another person or group. However, the definition of an agent with respect to Large Language Models (LLMs) is hotly debated, with no single definition yet winning out.

We like to refer to an LLM agent as an autonomous or semi-autonomous system that can act on your behalf. The core concept is tool use: giving the LLM tools it can call so it can interact with its environment.

Agents can be used to handle complex, multi-step tasks that may require planning, data retrieval, or other dynamic paths that aren’t fully defined before the task begins.

This goes beyond what an LLM normally does on its own — which is to generate text responses to user queries based on its pre-training — and steps up its autonomy in planning, executing tasks, using tools, and retrieving external data.

What makes LLM agents useful is they can function within workflows that integrate multiple systems and services without having to fully define every step of the process beforehand.
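
The sketch below shows the basic tool-use loop this describes, assuming the OpenAI SDK's function-calling interface and a single made-up get_weather tool; a real agent would add planning, memory, and error handling on top of this.

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    """Stand-in tool; a real agent would call an actual weather API here."""
    return f"It is 18°C and cloudy in {city}."

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=messages,
            tools=TOOLS,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # the model answered directly
        # The model asked to use a tool: run it and feed the result back.
        messages.append(message)
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

print(run_agent("Should I bring an umbrella in Paris today?"))
```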

How to Make a Chatbot from Scratch (That Uses LLMs)

Chatbots are generally built in one of two ways:

  1. Using no-code tools, messaging platforms (like Facebook Messenger or WhatsApp), or a chatbot builder like Botpress. This is often the preferred route for non-developers or those who don’t need specific functionality, and allows them to create chatbots quickly and with minimal technical expertise.
  2. Developing one yourself, which is a great option for those needing custom functionality, advanced integrations, or unique designs.

Building your own chatbot gives you complete control over the bot’s behavior and features, but requires a basic grasp of programming.

As far as programming is concerned, Python’s readability and selection of libraries for working with language models make it a good choice, although you can build chatbots in any language.
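
For a taste of what that looks like, here is a minimal chat loop in Python that keeps the conversation history in a list; it assumes the OpenAI SDK as one possible backend, and the system prompt and model name are placeholders.

```python
from openai import OpenAI

client = OpenAI()

def chat() -> None:
    """A bare-bones REPL chatbot that remembers the conversation in memory."""
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user_input = input("You: ")
        if user_input.lower() in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=history,
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        print(f"Bot: {reply}")

if __name__ == "__main__":
    chat()
```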

The sheer number of options and approaches for building a chatbot from scratch can seem overwhelming, and it can be challenging even to decide where to start.

In this article, we help you navigate the process of building your own LLM-driven chatbot from scratch, including how (and where) to get started, as well as things to look out for when choosing a chat development library.

Finally, we include a step-by-step example of building a basic chatbot and extending it to use tools. We build this using Mirascope, our lightweight, Pythonic toolkit for developing LLM-powered applications that’s designed with software developer best practices in mind.

Using LLM-as-a-Judge to Evaluate AI Outputs

LLM as a judge is an evaluation technique that uses a language model to score the quality of another model’s answers.

This allows you to automate certain types of evaluations that typically have to be done by humans, such as:

  • Assessing the politeness or tone of chatbot responses
  • Comparing and ranking two or more LLM responses according to how well they directly address the prompt or question
  • Scoring the quality of a machine translation
  • Determining how creative or original a response is

Such tasks have traditionally been hard to measure using software since they require subjective human-level interpretation, and machine learning assessments typically rely on objective rules or evaluation metrics.

LLMs, however, can bridge this gap: there’s evidence that models like GPT-4 often come to conclusions on such tasks similar to those of human evaluators.
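
Here is roughly what such a judge could look like in code, as a minimal sketch assuming the OpenAI Python SDK; the politeness rubric, the 1-to-5 scale, and the model name are arbitrary choices for illustration rather than a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the politeness of the following chatbot reply on a scale of 1 (rude) "
    "to 5 (very polite). Respond with JSON: "
    '{{"score": <integer>, "reason": "<one sentence>"}}\n\n'
    "Reply to evaluate:\n{reply}"
)

def judge_politeness(reply: str) -> dict:
    """Use one model to score the output of another (or of a human-written baseline)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge_politeness("I already told you, read the manual."))
```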

Below, we describe how LLM judges work, the steps required to set up an LLM as a judge, and the unique challenges they bring. We also show you a practical example of how to implement an LLM evaluator using our lightweight Python toolkit, Mirascope.

RAG Application: Benefits, Challenges & How to Build One

Retrieval augmented generation (RAG) is a workflow that incorporates real-time information retrieval into AI-assisted outputs.

The retrieved information gives the language model added context, allowing it to generate responses that are factual, up-to-date, and domain specific, thereby reducing the risk of misinformation or hallucinated content.
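
In code, the core retrieve-and-augment pattern looks roughly like the sketch below, which uses OpenAI embeddings and a plain in-memory list in place of a real vector database; the documents, model names, and prompt wording are all stand-ins.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Stand-in knowledge base; a real system would store these in a vector database.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 via live chat.",
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

DOC_VECTORS = embed(DOCUMENTS)

def answer(question: str, top_k: int = 1) -> str:
    # Retrieve: find the documents most similar to the question (cosine similarity).
    query_vec = embed([question])[0]
    scores = DOC_VECTORS @ query_vec / (
        np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(query_vec)
    )
    context = "\n".join(DOCUMENTS[i] for i in scores.argsort()[::-1][:top_k])

    # Augment and generate: include the retrieved context in the prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content

print(answer("How long do I have to return an item?"))
```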

Hallucinations are LLM responses that sound plausible but are wrong, and the risk of these occurring is very real. This happens for a variety of reasons, not least because large language models rely on pre-training data that may be outdated, incomplete, or irrelevant to the specific context of a query.

This limitation is more pronounced in domains like healthcare, finance, or legal compliance, where the details of the background information need to be correct to be useful.

RAG, however, keeps responses anchored in reality, mitigating the chances of inaccuracies. It’s found in an increasing number of generative AI applications like:

  • A customer chatbot — to intelligently extract information from relevant knowledge base articles.
  • Healthcare — to provide evidence-based medical advice.
  • Legal research — to retrieve relevant case law and regulations.
  • Personalized education platforms — to provide up-to-date answers tailored to students’ specific questions and needs.
  • Financial advisory tools — to leverage timely market data for better decision-making.

In this article, we explain how RAG works and list its challenges. We also show an example of building an application using both LlamaIndex and Mirascope, our lightweight toolkit for building with LLMs.

We use LlamaIndex for data ingestion, indexing, and retrieval, while leveraging Mirascope’s straightforward and Pythonic approach to prompting.

A Guide to Synthetic Data Generation

Synthetic data generation involves creating artificial data that closely imitates the characteristics of real-world data, providing an alternative when real data is scarce, inaccessible, or difficult to collect.

Instead of collecting data from actual events or observations, you use algorithms and models to produce data that replicates the patterns, structures, and statistical attributes found in real datasets.

Such data is widely used in industries where real data is either sensitive, unavailable, or limited in scope. In healthcare, for instance, synthetic data lets you test medical systems or train AI models on patient-like records without risking sensitive information.

Similarly, in finance, you can simulate customer transactions for fraud detection and risk analysis without exposing sensitive customer information or confidential business data.

You might want to generate artificial data in order to:

  • Save the time and effort of collecting and managing real data yourself.
  • Augment your datasets with synthetic data that covers more scenarios and edge cases; for example, simulating rare instances of poor lighting so your facial recognition model adapts better to such situations.
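
As a hedged sketch of the idea, here is one way you might prompt a model to generate synthetic records; the transaction schema, prompt wording, and model name are invented for the example, and a real pipeline would add validation, deduplication, and statistical checks on the output.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_synthetic_transactions(n: int = 5) -> list[dict]:
    """Ask the model for fake but realistic-looking transaction records."""
    prompt = (
        f"Generate {n} synthetic credit card transactions as a JSON object with a "
        '"transactions" array. Each transaction needs "amount", "merchant", '
        '"timestamp", and "is_fraud" fields. Do not copy real people or businesses.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["transactions"]

for record in generate_synthetic_transactions():
    print(record)
```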

Retrieval Augmented Generation: Examples & How to Build One

RAG is a way to make LLM responses more accurate and relevant by connecting the model to an external knowledge base that pulls in useful information to include in the prompt.

This overcomes certain limitations of relying on language models alone, as responses can now include up-to-date, specific, and contextually relevant information that isn’t limited to what the model learned during its training.

It also contrasts with other techniques like semantic search, which retrieves relevant documents or snippets (based on the user’s meaning and intent) but leaves the task of understanding and contextualizing the information entirely to the user.
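
The retrieval half that semantic search performs (and that RAG builds on) can be sketched as a cosine-similarity lookup over embeddings. The snippet below assumes OpenAI's embeddings endpoint and a tiny hard-coded corpus, and it stops where semantic search stops, before any generation step.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Tiny stand-in corpus; real systems index many documents in a vector store.
SNIPPETS = [
    "Capital gains on assets held over a year are taxed at long-term rates.",
    "Standard deductions reduce your taxable income before rates apply.",
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

SNIPPET_VECTORS = embed(SNIPPETS)

def semantic_search(query: str, top_k: int = 1) -> list[str]:
    """Return the snippets whose embeddings are most similar to the query."""
    query_vec = embed([query])[0]
    scores = SNIPPET_VECTORS @ query_vec / (
        np.linalg.norm(SNIPPET_VECTORS, axis=1) * np.linalg.norm(query_vec)
    )
    return [SNIPPETS[i] for i in scores.argsort()[::-1][:top_k]]

# Semantic search hands these snippets to the user; RAG would instead feed
# them into an LLM prompt to produce a contextualized answer.
print(semantic_search("How are long-term investments taxed?"))
```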

RAG helps reduce the risk of hallucination and offers benefits in fields where accuracy, timeliness, and specialized knowledge are highly valued, such as healthcare, science, legal, and others.

As an alternative to RAG, you can fine-tune a model to internalize domain-specific knowledge, which can result in faster and more consistent responses — as long as those tasks have specialized, fixed requirements — but it’s generally a time-consuming and potentially expensive process.

Also, the model’s knowledge is static, meaning you’ll need to fine-tune the model again to update it.

RAG, in contrast, gives you up-to-date responses from a knowledge base that can be adapted on the fly.

Below, we explain how RAG works and then show you examples of using RAG for different applications. Finally, we walk you through an example of setting up a simple RAG application in Python.

For the tutorial we use LlamaIndex for data ingestion and storage, and also Mirascope, our user-friendly development library for integrating large language models with retrieval systems to implement RAG.

How to Build a Knowledge Graph from Unstructured Information

A knowledge graph is a structured representation of interconnected information where entities are linked through defined relationships.

Knowledge graphs show you which entities are connected and how they’re related, and are most useful for structuring and giving context to unstructured data (like text, images, and audio), allowing you to:

  • Visualize subtle (or hidden) patterns or insights that might not be immediately apparent in traditional data formats.
  • Get accurate and context-aware search results by better connecting related entities and concepts.
  • Bring data together from multiple, often unrelated sources into a single, unified system.

Building a knowledge graph involves setting up these entities and their relationships:

  • Entities are the primary subjects within the graph — whether people, organizations, places, or events — and each holds attributes relevant to that subject, like a "Person" entity with attributes of name, age, and occupation.
  • Relationships between entities — often called edges — show how these entities connect and interact, such as a "Person" node being linked to a "Company" node by a "works for" relationship.
  • Properties add additional context, or metadata like dates or locations, to entities and edges.
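
To make these three building blocks concrete, here is a tiny in-memory representation using Python dataclasses; the entity types, relationship names, and properties are made up for illustration, and real projects would typically use a graph database rather than plain objects.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A node in the graph: a primary subject with its attributes."""
    id: str
    type: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """An edge connecting two entities, with optional metadata."""
    source: str
    target: str
    relation: str
    properties: dict = field(default_factory=dict)

# Entities: a person and a company, each with attributes.
alice = Entity(id="alice", type="Person", properties={"name": "Alice", "occupation": "engineer"})
acme = Entity(id="acme", type="Company", properties={"name": "Acme Corp"})

# Relationship: the "works for" edge, with a property adding context.
works_for = Relationship(source="alice", target="acme", relation="works_for",
                         properties={"since": "2021"})

graph = {"entities": [alice, acme], "relationships": [works_for]}
```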

Traditionally, building knowledge graphs involved bringing together a wide range of disciplines to manually design ontologies, curate data, and develop algorithms for extracting entities and relationships, which required expertise in areas like data science, natural language processing, and semantic web technologies.

Today, you no longer need to be an expert in graph theory or taxonomies to build your own graph, especially when LLMs can help simplify entity recognition and relationship extraction.
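
For instance, here is a hedged sketch of asking an LLM to pull (source, relation, target) triples out of free text; the prompt wording and JSON shape are assumptions, and output from a real model would still need validation before being loaded into a graph.

```python
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract entities and relationships from the text below. Respond with JSON: "
    '{{"triples": [{{"source": "...", "relation": "...", "target": "..."}}]}}\n\n'
    "Text: {text}"
)

def extract_triples(text: str) -> list[dict]:
    """Use the model for entity recognition and relationship extraction."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=text)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["triples"]

print(extract_triples("Alice, a software engineer, has worked for Acme Corp since 2021."))
```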

We dive into key concepts and steps for getting started with knowledge graphs, and show you how to leverage an LLM to build a graph using Mirascope, our lightweight toolkit for developing AI-driven applications.