Evaluating Web Search Agent with LLM¶
In this recipe, we will take our Web Search Agent Tutorial and run evaluations on the LLM call. We will focus on writing a context relevance test, since context relevance is one of the most important aspects of web search.
Check out the Web Search Agent Tutorial
We will be using our `WebAssistant` agent for our evaluations. For a detailed explanation of this code snippet, refer to the Web Search Agent Tutorial.
Setup¶
To set up our environment, first let's install all of the packages we will use:
!pip install "mirascope[openai]"
# for web search functionality
!pip install requests beautifulsoup4 duckduckgo-search ipytest
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# Set the appropriate API key for the provider you're using
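The context relevance evaluation later in this recipe calls Anthropic's Claude, so you will likely also need the Anthropic extras and API key. A minimal sketch of that additional setup (assumed here, since it is not shown in the original setup):
!pip install "mirascope[anthropic]"
os.environ["ANTHROPIC_API_KEY"] = "YOUR_API_KEY"
# Needed for the Claude-based context relevance evaluator used later in this recipe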
Basic Evaluations¶
Let's start off with some basic evaluations to check that our agent is working in general. Given that updates to prompts can significantly influence LLM behavior, it's crucial to test each component of our agent individually.
Evaluating `_web_search` tool¶
Our goal is to ensure that the LLM consistently utilizes the web search tool, rather than relying on its inherent knowledge base to generate responses. We've intentionally refrained from explicitly instructing the agent to always utilize the web search tool, as some user queries may be more conversational in nature and not necessitate web searches. However, for user queries that are more information-seeking, the agent should always leverage the web search tool.
import re
from datetime import datetime
import ipytest
import pytest
import requests
from bs4 import BeautifulSoup
from duckduckgo_search import DDGS
from mirascope.core import BaseMessageParam, openai, prompt_template
from pydantic import BaseModel
ipytest.autoconfig(run_in_thread=True)
def extract_content(url: str) -> str:
"""Extract the main content from a webpage.
Args:
url: The URL of the webpage to extract the content from.
Returns:
The extracted content as a string.
"""
try:
response = requests.get(url, timeout=5)
soup = BeautifulSoup(response.content, "html.parser")
unwanted_tags = ["script", "style", "nav", "header", "footer", "aside"]
for tag in unwanted_tags:
for element in soup.find_all(tag):
element.decompose()
main_content = (
soup.find("main")
or soup.find("article")
or soup.find("div", class_=re.compile("content|main"))
)
if main_content:
text = main_content.get_text(separator="\n", strip=True)
else:
text = soup.get_text(separator="\n", strip=True)
lines = (line.strip() for line in text.splitlines())
return "\n".join(line for line in lines if line)
except Exception as e:
return f"{type(e)}: Failed to extract content from URL {url}"
class WebAssistant(BaseModel):
messages: list[BaseMessageParam | openai.OpenAIMessageParam] = []
search_history: list[str] = []
max_results_per_query: int = 2
def _web_search(self, queries: list[str]) -> str:
"""Performs web searches for given queries and returns URLs.
Args:
queries: List of search queries.
Returns:
str: Newline-separated URLs from search results or error messages.
Raises:
Exception: If web search fails entirely.
"""
try:
urls = []
for query in queries:
results = DDGS(proxies=None).text(
query, max_results=self.max_results_per_query
)
for result in results:
link = result["href"]
try:
urls.append(link)
except Exception as e:
urls.append(
f"{type(e)}: Failed to parse content from URL {link}"
)
self.search_history.append(query)
return "\n\n".join(urls)
except Exception as e:
return f"{type(e)}: Failed to search the web for text"
@openai.call(model="gpt-4o-mini", stream=True)
@prompt_template(
"""
SYSTEM:
You are an expert web searcher. Your task is to answer the user's question using the provided tools.
The current date is {current_date}.
You have access to the following tools:
- `_web_search`: Search the web when the user asks a question. Follow these steps for EVERY web search query:
1. There is a previous search context: {self.search_history}
2. There is the current user query: {question}
    3. Given the previous search context, generate multiple search queries that explore whether the new query might be related to or connected with the context of the current user query.
Even if the connection isn't immediately clear, consider how they might be related.
- `extract_content`: Parse the content of a webpage.
When calling the `_web_search` tool, the `body` is simply the body of the search
result. You MUST then call the `extract_content` tool to get the actual content
of the webpage. It is up to you to determine which search results to parse.
Once you have gathered all of the information you need, generate a writeup that
strikes the right balance between brevity and completeness based on the context of the user's query.
MESSAGES: {self.messages}
USER: {question}
"""
)
async def _stream(self, question: str) -> openai.OpenAIDynamicConfig:
return {
"tools": [self._web_search, extract_content],
"computed_fields": {
"current_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
},
}
async def _step(self, question: str):
print(self.messages)
response = await self._stream(question)
tools_and_outputs = []
async for chunk, tool in response:
if tool:
print(f"using {tool._name()} tool with args: {tool.args}")
tools_and_outputs.append((tool, tool.call()))
else:
print(chunk.content, end="", flush=True)
if response.user_message_param:
self.messages.append(response.user_message_param)
self.messages.append(response.message_param)
if tools_and_outputs:
self.messages += response.tool_message_params(tools_and_outputs)
await self._step("")
async def run(self):
while True:
question = input("(User): ")
if question == "exit":
break
print("(Assistant): ", end="", flush=True)
await self._step(question)
print()
@pytest.mark.asyncio
@pytest.mark.parametrize(
"user_query",
[
"How is the weather in New York?",
"What is the capital of France?",
"Who is the president of the United States?",
"What is the population of India?",
"What is an apple?",
],
)
async def test_web_search(user_query: str):
"""Tests that the web search agent always uses the web search tool."""
web_assistant = WebAssistant()
response = await web_assistant._stream(user_query)
tools = []
async for _, tool in response:
if tool:
tools.append(tool)
assert len(tools) == 1 and tools[0]._name() == "_web_search"
ipytest.run()
It's recommended to continually expand our golden dataset until we can confidently assert that the LLM uses web search when appropriate.
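For example, one might extend the parametrized cases to also cover conversational queries where no tool call is expected. A minimal sketch (the queries and the no-search assertion below are illustrative assumptions, not part of the original dataset):
@pytest.mark.asyncio
@pytest.mark.parametrize(
    "user_query,expects_search",
    [
        ("What are the latest developments in AI?", True),  # information-seeking
        ("Thanks, that was helpful!", False),  # conversational, no search expected
    ],
)
async def test_web_search_usage(user_query: str, expects_search: bool):
    """Sketch: `_web_search` should be used only for information-seeking queries."""
    web_assistant = WebAssistant()
    response = await web_assistant._stream(user_query)
    tools = []
    async for _, tool in response:
        if tool:
            tools.append(tool)
    used_search = any(tool._name() == "_web_search" for tool in tools)
    assert used_search == expects_search
ipytest.run()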
Evaluating `extract_content` tool¶
Our agent has been prompt engineered to utilize the `extract_content` tool at its discretion. Given the non-deterministic nature of this test, we'll implement a basic verification to ensure that the `extract_content` tool is invoked at least once per user query. We'll employ the same golden dataset used in `test_web_search`, allowing us to assume that `test_extract_content` will always have a functional `_web_search`.
test_extract_content_messages = [
{"role": "user", "content": "What is the capital of France?"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"type": "function",
"function": {
"arguments": '{"queries":["capital of France","capital city of France","France","Paris","France capital"]}',
"name": "_web_search",
},
"id": "call_ddhSpVcNeDg9bLz4C7HEw85P",
}
],
},
{
"role": "tool",
"content": "https://en.wikipedia.org/wiki/Paris\n\nhttps://www.britannica.com/place/Paris\n\nhttps://en.wikipedia.org/wiki/Paris\n\nhttps://www.britannica.com/place/Paris\n\nhttps://en.wikipedia.org/wiki/France\n\nhttps://www.britannica.com/place/France\n\nhttps://en.wikipedia.org/wiki/Paris\n\nhttps://www.britannica.com/place/Paris\n\nhttps://www.britannica.com/place/Paris\n\nhttps://en.wikipedia.org/wiki/Paris",
"tool_call_id": "call_ddhSpVcNeDg9bLz4C7HEw85P",
"name": "_web_search",
},
]
@pytest.mark.asyncio
async def test_extract_content():
"""Tests that the extract content tool gets called once."""
user_query = "What is the capital of France?"
web_assistant = WebAssistant(messages=test_extract_content_messages)
response = await web_assistant._stream(user_query)
tools = []
async for _, tool in response:
if tool:
tools.append(tool)
assert len(tools) == 1 and tools[0]._name() == "extract_content"
ipytest.run()
For brevity, we've included just one example from our golden dataset, as the full messages array would be too lengthy to show.
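If you do build out more message histories, the same check can be parametrized over them. A minimal sketch, assuming you collect the fixtures into a list of `(user_query, messages)` pairs (any additional fixture names would be hypothetical):
test_extract_content_cases = [
    ("What is the capital of France?", test_extract_content_messages),
    # Additional (user_query, messages) fixtures from the golden dataset would go here.
]


@pytest.mark.asyncio
@pytest.mark.parametrize("user_query,messages", test_extract_content_cases)
async def test_extract_content_golden(user_query: str, messages: list):
    """Sketch: `extract_content` should be called once for each golden message history."""
    web_assistant = WebAssistant(messages=messages)
    response = await web_assistant._stream(user_query)
    tools = []
    async for _, tool in response:
        if tool:
            tools.append(tool)
    assert len(tools) == 1 and tools[0]._name() == "extract_content"


ipytest.run()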
Now that we have our simple tests, let's take a look at a more complex evaluation-based test.
Evaluating context relevance of LLM-generated queries¶
LLMs can easily answer detailed queries, but real-world scenarios often involve vague questions from users who may not fully understand what they're seeking. Just as many people struggle to master advanced search techniques despite years of using search engines, becoming proficient at formulating effective queries for LLMs is equally challenging.
from mirascope.core import anthropic
from pydantic import Field
class ContextRelevant(BaseModel):
is_context_relevant: bool = Field(
description="Whether the LLM-generated query is context-relevant"
)
explanation: str = Field(description="The reasoning for the context relevance")
@anthropic.call(
model="claude-3-5-sonnet-20240620", response_model=ContextRelevant, json_mode=True
)
@prompt_template(
"""
Given:
Search history: {search_history}
User query: {user_query}
LLM-generated query: {llm_query}
Evaluate if the LLM-generated query is context-relevant using the following criteria:
Bridging Relevance:
Does {llm_query} effectively bridge the gap between {search_history} and {user_query}?
Does it incorporate elements from both {search_history} and {user_query} meaningfully?
Intent Preservation:
Does {llm_query} maintain the apparent intent of {user_query}?
Does it also consider the broader context established by {search_history}?
Topical Consistency:
Is {llm_query} consistent with the overall topic or theme of {search_history}?
If there's a shift in topic from {search_history} to {user_query}, does {llm_query} handle this transition logically?
Specificity and Relevance:
Is {llm_query} specific enough to be useful, considering both {search_history} and {user_query}?
Does it avoid being overly broad or tangential?
Contextual Enhancement:
Does {llm_query} add value by incorporating relevant context from {search_history}?
Does it expand on {user_query} in a way that's likely to yield more relevant results?
Handling of Non-Sequiturs:
If {user_query} is completely unrelated to {search_history}, does {llm_query} appropriately pivot to the new topic?
Does it still attempt to maintain any relevant context from {search_history}, if possible?
Semantic Coherence:
Do the terms and concepts in {llm_query} relate logically to both {search_history} and {user_query}?
Is there a clear semantic path from {search_history} through {user_query} to {llm_query}?
Evaluation:
Assess {llm_query} against each criterion, noting how well it performs.
Consider the balance between maintaining context from {search_history} and addressing the specific intent of {user_query}.
Evaluate how {llm_query} handles any topic shift between {search_history} and {user_query}.
Provide a final assessment of whether {llm_query} is context-relevant, with a brief explanation of your reasoning.
"""
)
async def check_context_relevance(
search_history: list[str], user_query: str, llm_query: str
): ...
We use an LLM to evaluate context-awareness and define a series of questions the LLM will answer to determine whether the generated `llm_query` makes sense given the `user_query`.
Examples¶
We can write some simple examples to verify if the evaluation is working properly, like so:
async def run(search_history: list[str], user_query: str, llm_query: str):
return await check_context_relevance(search_history, user_query, llm_query)
search_history = ["Best beaches in Thailand", "Thai cuisine must-try dishes"]
user_query = "How to book flights?"
llm_query = "How to book flights to Thailand for a beach and culinary vacation"
await run(search_history, user_query, llm_query)
ContextRelevant(is_context_relevant=True, explanation="The LLM-generated query 'How to book flights to Thailand for a beach and culinary vacation' effectively bridges the gap between the search history and the user query. It maintains the intent of booking flights while incorporating the context of Thai beaches and cuisine from the search history. The query is specific, relevant, and adds value by combining the user's immediate need (booking flights) with their apparent interest in Thai travel. It handles the topic shift smoothly and creates a semantically coherent link between all elements.")
Now let's update our `llm_query`:
llm_query = "General steps for booking flights online"
await run(search_history, user_query, llm_query)
ContextRelevant(is_context_relevant=False, explanation="The LLM-generated query 'General steps for booking flights online' focuses solely on the user's new query about booking flights, without incorporating any context from the previous search history about Thai beaches and cuisine. While it addresses the user's immediate question, it fails to bridge the contextual gap or maintain topical consistency with the established travel theme. A more context-relevant query might have been 'How to book flights to Thailand for a beach and culinary vacation', which would have preserved the intent while incorporating the previous search context.")
We can verify that the `llm_query` does not mention anything related to the `search_history`, and is therefore properly labeled as not context-relevant.
However, it's important to note that not all user queries need to be context-relevant to previous searches. Users may intentionally shift topics or ask unrelated questions in succession, which is a natural part of chatbot interactions.
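For instance, a clean topic shift can still be judged context-relevant under the "Handling of Non-Sequiturs" criterion. A quick sketch with made-up queries (the verdict will depend on the evaluating model):
search_history = ["Best beaches in Thailand", "Thai cuisine must-try dishes"]
user_query = "How do I renew my passport?"
llm_query = "US passport renewal process and requirements"
await run(search_history, user_query, llm_query)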
Implementing the test¶
Now that we have our evaluation, we can write our test.
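Since `check_context_relevance` returns a structured `ContextRelevant` instance, the test can assert directly on its `is_context_relevant` field. A minimal helper sketch (the `assert_context_relevant` name is our own, not part of the original recipe):
async def assert_context_relevant(
    search_history: list[str], user_query: str, llm_query: str
):
    # Sketch helper (not from the original recipe): fail with the evaluator's explanation.
    result = await check_context_relevance(search_history, user_query, llm_query)
    assert result.is_context_relevant, result.explanation
The golden conversation below provides the message history we will evaluate against.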
test_conversation_messages = [
{"role": "user", "content": "I am a SWE looking for a LLM dev tool library"},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"type": "function",
"function": {
"arguments": '{"queries":["best LLM development tools","top libraries for LLM development","LLM libraries for software engineers","LLM dev tools for machine learning","most popular libraries for LLM development"]}',
"name": "_web_search",
},
"id": "call_1",
}
],
},
{
"role": "tool",
"content": "https://www.blog.aiport.tech/p/top-9-libraries-to-accelerate-llm\n\nhttps://github.com/tensorchord/awesome-llmops\n\nhttps://www.blog.aiport.tech/p/top-9-libraries-to-accelerate-llm\n\nhttps://medium.com/pythonforall/essential-python-libraries-for-llms-and-application-development-in-2024-17c64b672421\n\nhttps://www.datacamp.com/blog/top-open-source-llms\n\nhttps://machinelearningmastery.com/5-essential-free-tools-getting-started-llms/\n\nhttps://github.com/princeton-nlp/swe-agent\n\nhttps://arxiv.org/html/2407.01489v1\n\nhttps://www.datacamp.com/blog/top-open-source-llms\n\nhttps://llmmodels.org/blog/top-10-open-source-llm-frameworks-2024/",
"tool_call_id": "call_1",
"name": "_web_search",
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"type": "function",
"function": {
"arguments": '{"url": "https://research.aimultiple.com/llmops-tools/"}',
"name": "extract_content",
},
"id": "call_UXnnadcCnki8qvCxrzRI1fXA",
},
{
"type": "function",
"function": {
"arguments": '{"url": "https://www.techradar.com/computing/artificial-intelligence/best-large-language-models-llms-for-coding"}',
"name": "extract_content",
},
"id": "call_S0OnNQqxtPH5HtDb1buzjedV",
},
{
"type": "function",
"function": {
"arguments": '{"url": "https://www.blog.aipo.rt.tech/p/top-9-libraries-to-accelerate-llm"}',
"name": "extract_content",
},
"id": "call_aCyaCFXUWMWloDkETTrxHyoJ",
},
{
"type": "function",
"function": {
"arguments": '{"url": "https://medium.com/pythonforall/essential-python-libraries-for-llms-and-application-development-in-2024-17c64b672421"}',
"name": "extract_content",
},
"id": "call_71e9AkvKjIuEp3QceqO4DCUK",
},
{
"type": "function",
"function": {
"arguments": '{"url": "https://github.com/tensorchord/awesome-llmops"}',
"name": "extract_content",
},
"id": "call_YeaR70E6l7iM7UHEtp709iVc",
},
{
"type": "function",
"function": {
"arguments": '{"url": "https://medium.com/@bijit2111987/top-llm-dev-tool-and-when-to-use-them-in-your-ai-stack-622a651ec0e6"}',
"name": "extract_content",
},
"id": "call_UWuyM3dy71Js7fspwSKnMlGC",
},
],
},
{
"role": "tool",
"content": "We follow\nethical norms\n&\nour process\nfor objectivity. Brands with links to their websites\nfund\nour research.\nTable of contents\nLLMOps Landscape\nCompare the top LLMOps platforms in 2023\nTools for secure and complaint LLMs\nDisclaimer about current categorization approach\nWhich LLMOps tool is the best choice for your business?\nFurther reading\nMLOps\nLLM\nUpdated on\nAug 7\nComparing 10+ LLMOps Tools: A Comprehensive Vendor Benchmark\nWritten by\nCem Dilmegani\nCem Dilmegani\nCem Dilmegani\nCem Dilmegani\nCem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.\nCem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.\nThroughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.\nHe led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.\nCem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.\nView Full Profile\nFollow on\nWe follow\nethical norms\n&\nour process\nfor objectivity. Brands with links to their websites\nfund\nour research.\nThe number of\nlarge language models (LLMs)\nhas been increasing since 2019 due to the models’ extensive application areas and capabilities (See Figure 1).\nYet, the estimates show that designing a new foundation model can cost up to $90 million while fine-tuning or enhancing existing large language models can cost $1 million to $100 thousand.\n1\nThese costs result from:\nComputational costs like hardware usage for training runs\nData gathering and labelling costs\nEngineering and R&D costs\nFigure 1: The increasing number of LLMs since 2019\n2\nLLMOps\ntools can reduce these costs by facilitating LLM management. However, LLMOps is a relatively recent solution and most business leaders are not aware of the leading players in this market. This article explains the LLMOps market and compares available tools.\nLLMOps Landscape\nThere are 20+ tools that claim to be LLMOps solutions, which can be evaluated under 6 main categories:\n1. LLMOps Platforms\nThese are either designed specifically for LLMOps or are MLOps platforms that started offering LLMOps capabilities. They include features that allow carrying out these operations on LLMs:\nFinetuning\nVersioning\nDeploying\nThese LLM platforms can offer different levels of flexibility and ease of use:\nNo-code LLM platforms:\nSome of these platforms are no-code and low-code, which facilitate LLM adoption. However, these tools typically have limited flexibility.\nCode-first platforms:\nThese platforms target machine learning engineers and data scientists. They tend to offer a higher level of flexibility.\n2. 
Integration frameworks\nThese tools are built to facilitate developing\nLLM applications\nsuch as document analyzers, code analyzers, chatbots etc.\n3.) Vector databases (VD)\nVDs store high-dimensional data vectors, such as patient data covering symptoms, blood test results, behaviors, and general health. Some VD software like deep lake can facilitate LLM operations.\n4.) Fine-tuning tools\nFine-tuning tools are frameworks, or platforms for fine-tuning pre-trained models. These tools provide a streamlined workflow to modify, retrain, and optimize pre-trained models for natural language processing, computer vision, and more tasks. Some libraries are also designed for fine-tuning, such as Hugging Face Transformers, PyTorch, and TensorFlow.\n5.) RLHF tools\nReinforcement learning from human feedback\n, or RLHF, is a way for AI to learn the best actions by listening to human input. Typically, Reinforcement learning includes an RL algorithm to learn by interacting with the environment and receiving rewards or penalties based on its actions.\nIn contrast, RLHF tools (e.g. Clickworker or Appen) include human feedback in the learning loop. RLHF can be useful to:\nEnhance LLM fine-tuning by large data labeling\nImplement AI governance by reducing biases in LLM responses and moderating content\nCustomize model\nImprove contextual understanding.\n6.) LLM testing tools\nLLM testing tools evaluate and assess LLMs by testing model performance, capabilities, and potential biases in various language-related tasks and applications, such as natural language understanding and generation. Testing tools may include:\nTesting frameworks\nBenchmark datasets\nEvaluation metrics.\n7.) LLM monitoring and observability\nLLM monitoring and observability tools ensure their proper functioning, user safety, and brand protection. LLM monitoring includes activities like:\nFunctional monitoring\n: Keeping track of factors like response time, token usage, number of requests, costs and error rates.\nPrompt monitoring\n: Checking user inputs and prompts to evaluate toxic content in responses, measure embedding distances, and identify malicious prompt injections.\nResponse monitoring:\nAnalyzing to discover\nhallucinatory\nbehavior, topic divergence, tone and sentiment in the responses.\nCompare the top LLMOps platforms in 2023\nIn this section, we focus on LLMOps platforms and excluded integration frameworks and other tools. LLMOps platforms can be examined in these categories:\n1. MLOps platforms\nSome\nMLOps platforms\noffer LLMOps toolkits.\nMachine Learning Operations (MLOps)\nmanages and optimizes the end-to-end machine learning lifecycle. Since LLMs are also machine learning models, MLOps vendors are naturally expanding into this domain.\n2. LLM platforms\nSome\nLLM providers\n, especially OpenAI, are also providing LLMOps capabilities to fine-tune, integrate and deploy their models.\n3. Data and cloud platforms\nData or cloud platforms are starting to offer LLMOps capabilities that allow their users to leverage their own data to build and finetune LLMs. For example, Databricks acquired MosaicML for $1.3 billion.\n3\nCloud platforms\nCloud leaders Amazon, Azure and Google have all launched their LLMOps offering which allows users to deploy models from different providers with ease\n4. LLMOPs frameworks\nThis category includes tools that exclusively focus on optimizing and managing LLM operations. 
The table below shows the Github stars, B2B reviews and average B2B score from B2B review pages (Trustradius, Gartner & G2) for some of these LLMOps tools:\nLLMOps Tools\nGithub Stars\nNumber of B2B Reviews*\nAverage Review Score**\nNemo by Nvidia\n7,900\nNA\nNA\nDeep Lake\n6,600\nNA\nNA\nFine-Tuner AI\n6,000\nNA\nNA\nSnorkel AI\n5,500\nNA\nNA\nZen ML\n3,000\nNA\nNA\nLamini AI\n2100\nNA\nNA\nComet\n54\nNA\nNA\nTitan ML\n47\nNA\nNA\nDeepset AI\n6\nNA\nNA\nValohai\nNot open source\n20\n4.9\nHere is a brief explanation for each tool in alphabetical order:\nComet:\nComet streamlines the ML lifecycle, tracking experiments and production models. Suited for large enterprise teams, it offers various deployment strategies. It supports private cloud, hybrid, and on-premise setups.\nFigure 2:\nComet LLMops platform\n4\nDeep Lake:\nDeep Lake combines the capabilities of Data Lakes and Vector Databases to create, refine, and implement high-quality LLMs and MLOps solutions for businesses. Deep Lake allows users to visualize and manipulate datasets in their browser or Jupyter notebook, swiftly accessing different versions and generating new ones through queries, all compatible with PyTorch and TensorFlow.\nDeepset AI:\nDeepset AI is a comprehensive platform that allows users to integrate their data with LLMs to build and deploy customized LLM features in their applications. Deepset supports Retrieval-augmented generation (RAG) and Enterprise knowledge search, as well.\nLamini AI:\nLamini AI provides an easy method for training LLMs through both prompt-tuning and base model training. Lamini AI users can write custom code, integrate their own data, and host the resulting LLM on their infrastructure.\nNemo by Nvidia:\nNvidia offers an end-to-end, cloud-native enterprise framework to develop, customize, and employ generative AI models and LLM applications. The framework can execute various tasks required to train LLMs, such as token classification, prompt learning and question answering.\nSnorkel AI:\nSnorkel AI empowers enterprises to construct or customize foundation models (FMs) and large language models (LLMs) to achieve remarkable precision on domain-specific datasets and use cases. Snorkel AI introduces programmatic labelling, enabling data-centric AI development with automated processes.\nFigure 3:\nSnorkel AI LLMOps platform\n5\n6.\nTitan ML:\nTitanML is an NLP development platform that aims to allow businesses to swiftly build and implement smaller, more economical deployments of large language models. It offers proprietary, automated, efficient fine-tuning and inference optimization techniques. This way, it allows businesses to create and roll out large language models in-house.\n7.\nValohai:\nValohai streamlines MLOps and LLMs, automating data extraction to model deployment. It can store models, experiments, and artefacts, making monitoring and deployment easier. Valohai creates an efficient workflow from code to deployment, supporting notebooks, scripts, and Git projects.\n8.\nZen ML:\nZenML primarily focuses on machine learning operations (MLOps) and the management of the machine learning workflow, including data preparation, experimentation, and model deployment.\nTools for secure and complaint LLMs\nSome LLMOps integrate with AI governance and\nLLM security\ntechnologies to ensure safe, unbiased, and ethical LLM deployment and operation. 
Check out more on these:\nCompare Top 25 AI Governance Tools: A Vendor Benchmark\nCompare 20 LLM Security Tools & Open-Source Frameworks\nDisclaimer about current categorization approach\nWe are aware that there are different approaches to categorize these tools. For instance, some vendors include other technologies that can help large language model development in this landscape, such as containerization or edge computing. However, such technologies are not built for designing or monitoring models, even though they can be paired with LLMOps tools to improve model performance. Therefore, we excluded these tools.\nA more classical approach categorizes tools based on licence type (e.g. open source or not) or whether the tool provides pre-trained models or not. While these are relevant categorizations, we think they are less critical than other functionality provided by the tool. For example, it is quite important whether an LLM is open source or not since it impacts how the end user can finetune the model. However, an LLMOps platform, like most other software, will be used by most end users without modifications to the software code and therefore it is less impactful for an LLMOps tool to be open source.\nWhich LLMOps tool is the best choice for your business?\nWe now provide relatively generic recommendations on choosing these tools. We will make these more specific as we explore LLMOps platforms in more detail and as the market matures.\nHere are a few steps you must complete in your selection process:\nDefine goals:\nClearly outline your business goals to establish a solid foundation for your LLMOps tool selection process. For example, if your goal requires training a model from scratch vs fine-tuning an existing model, this will have important implications to your LLMOps stack.\nDefine requirements:\nBased on your goal, certain requirements will become more important. For example, if you aim to enable business users to use LLMs, you may want to include no code in your list of requirements.\nPrepare a shortlist\n: Consider user reviews and feedback to gain insights into real-world experiences with different LLMOps tools. Rely on this market data to prepare a shortlist.\nCompare functionality:\nUtilize free trials and demos provided by various LLMOps tools to compare their features and functionalities firsthand.\nWhat is LLMOps?\nLarge Language Models (LLMs) are advanced machine learning models designed to understand and generate human-like text based on the patterns and information they’ve learned from training data. These models are built using deep learning models to capture intricate linguistic nuances and context.\nLLMOps refer to techniques and tools used for the operational model management of LLMs in production environments.\nKey components of LLMOps tools\nLarge Language Model Operations (LLMOps) tools encompass crucial components for efficient management and deployment of large language models (LLMs). 
These tools typically include features such as:\n–\nPrompt Engineering:\nCreating effective prompt templates for improved model performance.\n–\nData Management:\nHandling vast datasets, ensuring proper data versioning, and facilitating exploratory data analysis.\n–\nModel Fine Tuning:\nFine-tuning LLMs to specific tasks and refining models for optimal performance.\n–\nModel Monitoring:\nContinuous tracking of model outcomes, detection of accuracy degradation, and addressing model drift.\nWhat are LLMOps benefits?\nLLMOps delivers significant advantages to machine learning projects leveraging large language models:\n1.) Increased Accuracy:\nEnsuring high-quality data for training and reliable deployment enhances model accuracy.\n2.)\nReduced Latency:\nEfficient deployment strategies lead to reduced latency in LLMs, enabling faster data retrieval.\n3.) Fairness Promotion:\nStriving to eliminate bias ensures more impartial outputs, preventing discrimination.\nLLMOps challenges & solutions\nChallenges in large language model operations require robust solutions to maintain optimal performance:\n1.) Data Management Challenges:\nHandling vast datasets and sensitive data necessitates efficient data collection and versioning.\n2.)\nModel Monitoring Solutions:\nImplementing model monitoring tools to track model outcomes, detect accuracy degradation, and address model drift.\n3.)\nScalable Deployment:\nDeploying scalable infrastructure and utilizing cloud-native technologies to meet computational power requirements.\n4.)\nOptimizing Models:\nEmploying model compression techniques and refining models to enhance overall efficiency.\nLLMOps tools are pivotal in overcoming challenges and delivering higher quality models in the dynamic landscape of large language models.\nFurther reading\nExplore more on LLMs, MLOps and AIOps by checking out our articles:\nMLOps Tools & Platforms Landscape: In-Depth Guide\n15 Best AiOps Platforms: Streamline IT Ops with AI\nChatGPT AIOps in IT Automation: 8 Powerful Examples\nIf you still have questions about LLMOps tools and landscape, we would like to help:\nFind the Right Vendors\nExternal sources\n1. “\nThe CEO’s\xa0Roadmap on Generative AI\n” BCG. March 2023. Revisited August 11, 2023.\n2. “\nA Survey of Large Language Models.\n”\nGithub\n. March 2023. Revisited August 11, 2023.\n3. “\nDatabricks Signs Definitive Agreement to Acquire MosaicML, a Leading Generative AI Platform\n“.\nDatabricks\n. June 26, 2023. Retrieved August 24, 2023.\n4. “\nDebugging Large Language Models with Comet LLMOps Tools\n.”\nComet\n. Revisited August 16, 2023.\n5. Harvey, N(March 20, 2023). “\nSnorkel Flow Spring 2023: warm starts and foundation models.\n”\nSnorkelAI\n. Revisited August 16, 2023.\nShare This Article\nCem Dilmegani\nCem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.\nCem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.\nThroughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. 
He also published a McKinsey report on digitalization.\nHe led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.\nCem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.\nNext to Read\nGuide to RLHF LLMs in 2024: Benefits & Top Vendors\nJan 3\n5 min read\nLLMOPs vs MLOPs in 2024: Discover the Best Choice for You\nAug 7\n6 min read\nUsing Vector Databases for LLMs: Why is it Crucial in 2024?\nAug 14\n5 min read\nComments\nYour email address will not be published. All fields are required.\n0\nComments\nPost Comment\nRelated research\nCompare 20 LLM Security Tools & Open-Source Frameworks in '24\nAug 7\n12 min read\n12 Retrieval Augmented Generation (RAG) Tools / Software in '24\nAug 7\n6 min read",
"tool_call_id": "call_UXnnadcCnki8qvCxrzRI1fXA",
"name": "extract_content",
},
{
"role": "tool",
"content": "PRICE\nVERDICT\nREASONS TO BUY\nREASONS TO AVOID\nVERDICT\nREASONS TO BUY\nREASONS TO AVOID\n(Image credit: Future)\nJump to:\nBest for Enterprises\nBest free\nBest Value\nBest for code generation\nBest for debugging\nFAQs\nThe best Large Language Models (LLMs) for coding have been trained with code related data and are a new approach that developers are using to augment workflows to improve efficiency and productivity. These coding assistants can be used for a wide range of code related tasks, such as code generation, code analysis to help with debugging, refactoring, and writing test cases, as well offering chat capabilities to discuss problems and inspire developers with solutions. For this guide we tested several different LLMs that can be used for coding assistants to work out which ones present the best results for their given category.\nThe\nbest large language models\nare area of technology that is moving very quickly so while we do our best to keep this guide as up to date as possible, you may want to check if a newer model has been released and whether it fits your specific use case better.\nThe best large language models (LLMs) for coding\nWhy you can trust TechRadar\nWe spend hours testing every product or service we review, so you can be sure you’re buying the best.\nFind out more about how we test.\nBest for Enterprises\n(Image credit: Copilot)\nGitHub Copilot\nThe best LLM for business\nReasons to buy\n+\nOffers a first-party extension for direct integration into several popular development environments\n+\nMultiple subscription tiers with varying feature levels\n+\nBuilt upon OpenAI’s GPT-4 model\n+\nUnlimited messages and interactions for all subscription tiers\nReasons to avoid\n-\nRequires a subscription to use\n-\nCan’t be self-hosted\n-\nNot immune to providing inaccurate or incorrect prompts\nOriginally released in October 2021, GitHub Copilot is a version of\nMicrosoft\n’s Copilot LLM that is specifically trained with data to assist coders and developers with their work with the aim to improve efficiency and productivity. While the original release used\nOpenAI\n’s Codex model, a modified version of GPT-3 which was also trained as a coding assistant, GitHub Copilot was updated to use the more advanced GPT-4 model in November 2023.\nA core feature of GitHub Copilot is the extension provided that allows direct integration of the LLM into commonly used Integrated Development Environments (IDEs) popular among developers today, including Visual Studio Code, Visual Studio, Vim, Neovim, the JetBrains suite of IDEs, and Azure Data Studio. This direct integration allows GitHub Copilot to access your existing project to improve the suggestions made when given a prompt, while also providing users hassle free installation and access to the features provided. For enterprise users, the model can also be granted access to existing repositories and knowledge bases from your organization to further enhance the quality of outputs and suggestions.\nWhen writing code, GitHub Copilot can offer suggestions in a few different ways. Firstly, you can write a prompt using an inline comment that can be converted into a block of code. This works in a similar way to how you might use other LLMs to generate code blocks from a prompt, but with the added advantage of GitHub Copilot being able to access existing project files to use as context and produce a better output. Secondly, GitHub Copilot can provide real-time suggestions as you are writing your code. 
For example, if you are writing a regex function to validate an email address, simply starting to write the function can offer an autocomplete suggestion that provides the required syntax. Additionally, you can also use the GitHub Copilot Chat extension to ask questions, request suggestions, and help you to debug code in a more context aware fashion than you might get from LLMs trained on more broad datasets. Users can enjoy unlimited messages and interactions with GitHub Copilot’s chat feature across all subscription tiers.\nGitHub Copilot is trained using data from publicly available code repositories, including GitHub itself. GitHub Copilot claims it can provide code assistance in any language where a public repository exists, however the quality of the suggestions will depend on the volume of data available. All subscription tiers include a public code filter to reduce the risk of suggestions directly copying code from a public repository. By default, GitHub Copilot excludes submitted data from being used to train the model further for business and enterprise tier customers and offers the ability to exclude files or repositories from being used to inform suggestions offered. Administrators can configure both features as needed based on your business use cases.\nWhile these features aim to keep your data private, it’s worth keeping in mind that prompts aren’t processed locally and rely on external infrastructure to provide code suggestions and you should factor this into whether this is the right product for you. Users should also be cautious about trusting any outputs implicitly – while the model is generally very good at providing suggestions, like all LLMs it is still prone to hallucinations and can make poor or incorrect suggestions. Always make sure to review any code generated by the model to make sure it does what you intend it to do.\nIn the future it’s possible that GitHub will upgrade GitHub Copilot to use the recently released GPT-4o model. GPT-4 was originally released in March 2023, with GitHub Copilot being updated to use the new model roughly 7 months later. It makes sense to update the model further given the improved intelligence, reduced latency, and reduced cost to operate GPT-4o, though at this time there has been no official announcement.\nIf you want to try before you buy, GitHub Copilot offers a free 30 day trial of their cheapest package which should be sufficient to test out its capabilities, with a $10 per month fee thereafter. Copilot Business costs $19 per user per month, while Enterprise costs $39 per user per month\nBest for individuals\n(Image credit: Qwen)\nCodeQwen1.5\nBest coding assistant for individuals\nReasons to buy\n+\nOpen source\n+\nHas options for local hosting\n+\nCan be trained further using your own code repositories\n+\nOffers a range of model sizes to fit your requirements\nReasons to avoid\n-\nNo first-party extensions for popular IDEs\n-\nUp front hardware, cost needs to be considered when hosted locally\nCodeQwen1.5 is a version of Alibaba’s open-source Qwen1.5 LLM specifically trained using public code repositories to assist developers in coding related tasks. This specialized version was released in April 2024, a few months after the release of Qwen1.5 to the public in February 2024.\nThere are 2 different versions of CodeQwen1.5 available today. 
The base model of CodeQwen1.5 is designed for code generation and suggestions but has limited chat functionality, while the second version can also be used as a chat interface that can answer questions in a more human-like way. Both models have been trained with 3 trillion tokens of code related data and support a very respectable 92 languages, which include some of the most common languages in use today such as Python, C++, Java, PHP, C# and JavaScript.\nUnlike the base version of Qwen1.5, which has several different sizes available for download, CodeQwen1.5 is only available in a single size of 7B. While this is quite small when compared to other models on the market that can also be used as coding assistants, there are a few advantages that developers can take advantage of. Despite its small size, CodeQwen1.5 performs incredibly well compared to some of the larger models that offer coding assistance, both open and closed source. CodeQwen1.5 comfortably beats GPT3.5 in most benchmarks and provides a competitive alternative to GPT-4, though this can sometimes depend on the specific programming language. While GPT-4 may perform better overall by comparison, it’s important to remember that GPT-4 requires a subscription and has per token costs that could make using it very expensive compared to CodeQwen1.5 and GPT-4 cannot be hosted locally. Like with all LLMs, its risky to implicitly trust any suggestions or responses provided by the model. While steps have been taken to reduce hallucinations, always check the output to make sure it is correct.\nAs CodeQwen1.5 is open source, you can download a copy of the LLM to use at no additional cost beyond the hardware needed to run it. You’ll still need to make sure your system has enough resources to ensure the model can run well, but the bonus of the smaller model size means a modern system with GPU that has at least 16GB of VRAM and at least 32GB of system RAM should be sufficient. CodeQwen1.5 can also be trained using code from existing projects or other code repositories to further improve the context of the generated responses and suggestions. The ability to host CodeQwen1.5 within your own local or remote infrastructure, such as a Virtual Private Server (VPS) or dedicated server, should also help to alleviate some of the concerns related to data privacy or security often connected to submitting information to third party providers.\nAlibaba surprised us by releasing their new Qwen2 LLM at the start of June that they claim offers significant gains over the base model of Qwen1.5. Alibaba also mentioned that the training data used for CodeQwen1.5 is included in Qwen2-72B, so has the potential to offer improved results, but it’s currently unclear if there is a plan to upgrade CodeQwen to use the new model.\nBest Value\n(Image credit: Meta)\nLLama 3\nBest value LLM\nReasons to buy\n+\nOpen source\n+\nSmaller models can be hosted locally\n+\nCan be fine tuned with your own dataset\n+\nExternal hosting provided by AWS and Azure have low per token costs\nReasons to avoid\n-\nHardware requirements for the larger models could require significant upfront investment\n-\nNot specifically trained as a coding LLM\nWhen it comes to the best bang for buck, Meta’s open-source Llama 3 model released in April 2024 is one of the best low-cost models available on the market today. 
Unlike many other models specifically trained with code related data to assist developers with coding tasks, Llama 3 is a more general LLM capable of assisting in many ways – one of which also happens to be as a coding assistant – and outperforms CodeLlama, a coding model released by Meta in August 2023 based on Llama 2.\nIn like for like testing with models of the same size, Llama 3 outperforms CodeLlama by a considerable margin when it comes to code generation, interpretation, and understanding. This is impressive considering Llama 3 wasn’t trained specifically for code related tasks but can still outperform those that have. This means that not only can you use Llama 3 to improve efficiency and productivity when performing coding tasks, but it can also be used for other tasks as well. Llama 3 has a training data cutoff of December 2023, which isn’t always of critical importance for code related tasks, but some languages can develop quickly and having the most recent data available can be incredibly valuable.\nLlama 3 is an open-source model that allows developers to download and deploy the model to their own local system or infrastructure. Like CodeQwen1.5, Llama 3 8B is small enough that a modern system with at least 16GB of VRAM and 32GB of system RAM is sufficient to run the model. The larger 70B version of Llama 3 naturally has better capabilities due to the increased parameter number, but the hardware requirement is an order of magnitude greater and would require a significant injection of funds to build a system capable of running it effectively. Luckily, the Llama 3 8B offers enough capability that users can get excellent value without breaking the bank at the same time. If you find that you need the added capability of the larger model, the open-source nature of the model means you can easily rent an external VPS or dedicated server to support your needs, though costs will vary depending on the provider. If you decide that you’d like the increased capability of the larger model, but the investment needed for the required hardware, or the cost to rent an external host, is outside your budget, AWS offers API access to the model via a pay as you go plan which charges you by the token instead. AWS currently charges $3.50 per million output tokens, which is a considerable quantity for a very small price. For comparison, OpenAI’s GPT-4o costs $15.00 for the same quantity of tokens. If this type of solution appeals to you, make sure to shop around for the best provider for your location, budget, and needs.\nLlama 3 performs well in code generation tasks and adheres well to the prompts given. It will sometimes simplify the code based on the prompt, but it's reasonably receptive to being given instruction to provide a complete solution and will segment if it reaches the token limit for a single response if requested. During testing, we asked for Llama 3 to write a complete solution in Python for a chess game that would immediately compile and could be played via text prompts, and it dutifully provided the requested code. Although the code initially failed to compile, providing Llama 3 with the error messages from the compiler allowed it to identify where the mistakes were and provided a correction. Llama 3 can effectively debug code segments to identify issues and provide new code to fix the error. As a bonus, it can also explain where the error was located and why it needs to be fixed to help the user understand what the mistake was. 
However, like with all models generating code-related solutions, it's important to check the output and not trust it implicitly. Although the models are becoming increasingly intelligent and accurate, they also hallucinate at times and provide incorrect or insecure responses.\nLike with other open-source models, any data you submit to train Llama 3 from your own code repositories remains within your control. This helps to alleviate some of the concerns and risks associated with submitting proprietary and personal data to third parties, though keep in mind that also means that you should consider what that means for your information security policies where required. It doesn’t cost anything extra to train a model you have hosted within your own infrastructure, but some hosts providing API access do have an additional cost associated with further training.\nYou can download Llama 3 today directly from\nMeta\n.\nBest for code generation\n(Image credit: Claude AI)\nClaude 3 Opus\nThe best LLM for generating code\nReasons to buy\n+\nOutperforms most models for code generation tasks\n+\nCan provide detailed explanations of the generated code to assist developer understanding\n+\nProvides more human responses to prompts than other models\nReasons to avoid\n-\nClosed source and can’t be hosted locally\n-\nExpensive per token cost\n-\nCan’t be connected to existing knowledgebases\nReleased in April 2024, Claude 3 Opus is the latest and most capable LLM from Anthropic that they claim is the most intelligent LLM on the market today and is designed to tackle a variety of different tasks. Although most LLMs can generate code, the accuracy and correctness of the generated outputs can vary, and may have mistakes or be flat out incorrect due to not being specifically designed with code generation in mind. Claude 3 Opus bridges that gap by being trained to handle coding related tasks alongside the regular tasks LLMs are often used for, making for a very powerful multi-faceted solution.\nWhile Anthropic doesn’t mention how many programming languages it supports, Claude 3 Opus can generate code across a large range of programming languages, ranging from incredibly popular languages such as C++, C#, Python and Java, to older or more niche languages such as FORTRAN, COBOL, and Haskell. Claude 3 Opus relies on the patterns, syntaxes, coding conventions and algorithms identified within the code related training data provided to generate new code snippets from scratch to help avoid direct reproduction of code used to train it. The large 200k token context window offered by Claude 3 Opus is incredibly useful when working with large code blocks as you iterate through suggestions and changes. Like all LLMs, Claude 3 Opus also has an output token limit, and tends to either summarise or truncate the response to fit within a single reply. While summarisation of a purely text response isn’t too problematic as you can ask for additional context, not being provided with a large chunk of required code, such as when generating a test case, is quite a problem. Fortunately, Claude 3 Opus can segment its responses if you request it to do so in your initial prompt. You’ll still need to ask it to continue after each reply, but this does allow you to obtain more long form responses where needed. As well as generating functional code, Claude 3 Opus also adds comments to the code and provides explanations as to what the generated code does to help developers understand what is happening. 
In cases where you are using Claude 3 to debug code and generate fixes, this is extremely valuable as it not only helps solve the problem, but also provides context as to why changes were made, or why the code was generated in this specific way.\nFor those concerned about privacy and data security, Anthropic states that they don’t use any of the data submitted to Claude 3 for the purposes of training the model further, a welcome feature that many will appreciate when working with proprietary code. They also include copyright indemnity protections with their paid subscriptions.\nClaude 3 Opus does come with some limitations when it comes to improving the context of responses as it doesn’t currently offer a way to connect your own knowledge bases or codebases for additional training. This probably isn’t a deal breaker for most but could be something worth thinking about when choosing the right LLM for your code generation solution.\nThis does all come with a hefty price tag compared to other LLMs that offer code generation functionality. API access is one of the more expensive ones on the market at an eye watering $75 per 1 million output tokens, which is considerably more than GPT-4o’s $15 price tag. Anthropic do offer 2 additional models based on Claude 3, Haiku and Sonnet, which are much cheaper at $15 and $1.25 respectively for the same quantity of tokens, though they have reduced capability compared to Opus. In addition to API access, Anthropic offers 3 subscription tiers that grant access to Claude 3. The free tier has a lower daily limit and only grants access to the Sonnet model but should give those looking to test it’s capabilities a good idea of what to expect. To access Opus, you’ll need to subscribe to Pro or Team at $20 and $30 per person per month respectively. The Team subscription does need a minimum of 5 users for a total of $150 per month, but increases the usage limits for each user compared to the Pro tier.\nHead over to create a free account to access\nClaude 3\n.\nBest for debugging\n(Image credit: Open AI)\nGPT-4\nThe best LLM for debugging\nReasons to buy\n+\nIdentifies issues within blocks of code and suggests corrections\n+\nCan explain what the problem was and how the corrections solve it\n+\nLarge context window\nReasons to avoid\n-\nPer token cost can be expensive compared to coding-focused offerings with similar capability\n-\nRequires a subscription to gain access\n-\nManual opt-out needed to prevent data from being used to train the model\nSince the release of\nChatGPT\nin November 2022, OpenAI has taken the world by storm and offers some of the most intelligent and capable LLMs on the market today. GPT-4 was released in March 2023 as an update to GPT-3.5\nWhile GPT-4 isn’t an LLM designed specifically as a coding assistant, it performs well across a broad range of code related tasks, including real time code suggestions, generating blocks of code, writing test cases, and debugging errors in code. GitHub Copilot has also been using a version of GPT-4 with additional training data since November 2023, leveraging the human response capabilities of GPT-4 for code generation and within their chat assistant, which should give you an idea of the value it can provide.\nGPT-4 has been trained with code related data that covers many different programming languages and coding practices to help it understand the vast array of logic flows, syntax rules and programming paradigms used by developers. 
This allows GPT-4 to excel when debugging code by helping to solve a variety of issues commonly encountered by developers. Syntax errors can be incredibly frustrating when working with some languages - I’m looking at you and your indentations, Python – so using GPT-4 to review your code can massively speed up the process when code won’t compile due to errors that are difficult to find. Logical errors are one of the toughest errors to debug as code usually compiles correctly, but it doesn’t provide the correct output or operate as desired. By giving GPT-4 your code and an explanation of what it should be doing, GPT-4 can analyse and identify where the problem lies, offer suggestions or rewrites to solve the problem, and even provide an explanation as to what the problem is and how the suggested changes solve it. This can help developers quickly understand the cause of the problem and offers an opportunity to learn how to avoid it again in the future.\nAlthough the training data cutoff for GPT-4 is September 2021, which is quite a long time ago considering the advancements in LLMs over the last year, GPT-4 is continuously trained using new data from user interactions. This allows GPT-4’s debugging to become more accurate over time, though this does present some potential risk when it comes to the code you submit for analysis, especially when using it to write or debug proprietary code. Users do have the option to opt out of their data being used to train GPT-4 further, but it's not something that happens by default so keep this in mind when using GPT-4 for code related tasks.\nYou might be wondering why the recommendation here is to use GPT-4 when it is 4 times more expensive than the newer, cheaper, and more intelligent GPT-4o model released in May 2024. In general, GPT-4o has proven to be a more capable model, but for code related tasks GPT-4 tends to provide better responses that are more correct, adheres to the prompt better, and offers better error detection than GPT-4o. However, the gap is small and it's likely that GPT-4o will become more capable and overtake GPT-4 in the future as the model matures further through additional training from user interactions. If cost is a major factor in your decision, GPT-4o is a good alternative that covers the majority of what GPT-4 can provide at a much lower cost.\nBest LLM for Coding Assistants FAQs\nHow does a coding assistant work?\nCoding assistants use Large Language Models (LLMs) that are trained with code related data to provide developers with tools that help increase productivity and efficiency when performing code related tasks. The training data often leverages public code repositories, documentation and other licenced work to enable the LLM to recognise syntax, coding styles, programming practices and paradigms to provide code generation, debugging, code analysis, and problem-solving capabilities across many different programming languages.\nCoding assistants can be integrated into your development environments to provide inline code suggestions, and some can be trained further using an organization's knowledge bases and codebases to improve the context of suggestions.\nWhy shouldn’t I implicitly trust the code generated by a coding assistant?\nLLMs are becoming increasingly intelligent, but they aren’t immune to making mistakes known as “hallucinations”. Most coding assistants generate code that works well, but sometimes the code can be incomplete, inaccurate, or completely wrong. 
This can vary from model to model and has a high dependency on the training data used and the overall intelligence capability of the model itself.\nWhat is a context window?\nA context window is another way of describing how far back the LLM’s memory can go for a conversation, usually measured in tokens. LLMs with a large context window allow for responses that offer better context based on the conversation history which can be valuable for developers working on code related tasks when brainstorming ideas, debugging large sections of code, or iterating on a design.\nAre you a pro? Subscribe to our newsletter\nSign up to the TechRadar Pro newsletter to get all the top news, opinion, features and guidance your business needs to succeed!\nContact me with news and offers from other Future brands\nReceive email from us on behalf of our trusted partners or sponsors\nGrant Hickey\nFascinated by computers from a young age, Grant is on an endless quest to leverage existing and emerging technologies to augment and enhance the productivity of individuals and enterprises, and to improve the velocity at which teams can analyze data and identify trends within their customer base or organization. Grant has previously worked as a software engineer building cloud based CRMs, before moving into the games industry to work for Krafton on PUBG:Battlegrounds and later Creative Assembly. Always looking to improve his working practices he often builds his own tools to streamline tasks and become more efficient.\nLATEST ARTICLES\n1\nScuf Nomad review: a solid mobile controller that struggles to stand out\n2\nLG says next-gen \"dream OLED\" panel is finally real – but it might not come to TVs first\n3\nHP 325 FHD webcam: A budget-friendly choice with solid image quality\n4\nHow to watch After Baywatch: Moment in the Sun online from anywhere\n5\nAt last! Garmin Fenix 8 revealed, with an Apple Watch Ultra-beating dive mode – alongside the Garmin Enduro 3",
"tool_call_id": "call_S0OnNQqxtPH5HtDb1buzjedV",
"name": "extract_content",
},
{
"role": "tool",
"content": "<class 'requests.exceptions.ConnectionError'>: Failed to extract content from URL https://www.blog.aipo.rt.tech/p/top-9-libraries-to-accelerate-llm",
"tool_call_id": "call_aCyaCFXUWMWloDkETTrxHyoJ",
"name": "extract_content",
},
{
"role": "tool",
"content": "Member-only story\nEssential Python Libraries for LLMs and Application Development in 2024\nAhmad Waleed\n·\nFollow\nPublished in\nPythonForAll\n·\n3 min read\n·\nNov 30, 2023\n--\nShare\nIn the ever-evolving landscape of Language Models (LLMs), Natural Language Processing (NLP), and Machine Learning (ML), the arsenal of Python libraries continues to expand, bringing forth innovation and efficiency in diverse applications. For those venturing into this realm or seeking to enrich their projects, here’s a categorized guide to the essential Python libraries that are reshaping the field in 2024.\nKey Insights:\nNLP & ML Model Integration: Libraries such as Transformers and spaCy play pivotal roles in seamlessly integrating pre-trained NLP models, marking a paradigm shift towards context-aware language models across applications.\nDeep Learning & Neural Networks: TensorFlow and PyTorch stand tall as robust frameworks, showcasing Python’s prowess in handling complex neural network architectures and computations at scale.\nData Preprocessing & Validation: Libraries like Unstructured and Pydantic underscore the burgeoning emphasis on data quality and integrity, crucial in an era dominated by vast datasets and intricate ML models.\nApplication Development: Streamlit and Gradio emerge as transformative tools, converting data scripts into interactive web applications, emphasizing the significance of accessibility…",
"tool_call_id": "call_71e9AkvKjIuEp3QceqO4DCUK",
"name": "extract_content",
},
{
"role": "tool",
"content": "tensorchord\n/\nAwesome-LLMOps\nPublic\nNotifications\nYou must be signed in to change notification settings\nFork\n353\nStar\n3.7k\nAn awesome & curated list of best LLMOps tools for developers\nLicense\nCC0-1.0 license\n3.7k\nstars\n353\nforks\nBranches\nTags\nActivity\nStar\nNotifications\nYou must be signed in to change notification settings\ntensorchord/Awesome-LLMOps\nThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.\nmain\nBranches\nTags\nGo to file\nCode\nFolders and files\nName\nName\nLast commit message\nLast commit date\nLatest commit\nHistory\n150 Commits\nscripts\nscripts\n.gitignore\n.gitignore\nLICENSE\nLICENSE\nREADME.md\nREADME.md\ncontributing.md\ncontributing.md\nView all files\nRepository files navigation\nAwesome LLMOps\nAn awesome & curated list of the best LLMOps tools for developers.\nContribute\nContributions are most welcome, please adhere to the\ncontribution guidelines\n.\nTable of Contents\nTable of Contents\nModel\nLarge Language Model\nCV Foundation Model\nAudio Foundation Model\nServing\nLarge Model Serving\nFrameworks/Servers for Serving\nObservability\nSecurity\nLLMOps\nSearch\nVector search\nCode AI\nTraining\nIDEs and Workspaces\nFoundation Model Fine Tuning\nFrameworks for Training\nExperiment Tracking\nVisualization\nModel Editing\nData\nData Management\nData Storage\nData Tracking\nFeature Engineering\nData/Feature enrichment\nLarge Scale Deployment\nML Platforms\nWorkflow\nScheduling\nModel Management\nPerformance\nML Compiler\nProfiling\nAutoML\nOptimizations\nFederated ML\nAwesome Lists\nModel\nLarge Language Model\nProject\nDetails\nRepository\nAlpaca\nCode and documentation to train Stanford's Alpaca models, and generate the data.\nBELLE\nA 7B Large Language Model fine-tune by 34B Chinese Character Corpus, based on LLaMA and Alpaca.\nBloom\nBigScience Large Open-science Open-access Multilingual Language Model\ndolly\nDatabricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform\nFalcon 40B\nFalcon-40B-Instruct is a 40B parameters causal decoder-only model built by TII based on Falcon-40B and finetuned on a mixture of Baize. It is made available under the Apache 2.0 license.\nFastChat (Vicuna)\nAn open platform for training, serving, and evaluating large language models. 
Release repo for Vicuna and FastChat-T5.\nGemma\nGemma is a family of lightweight, open models built from the research and technology that Google used to create the Gemini models.\nGLM-6B (ChatGLM)\nAn Open Bilingual Pre-Trained Model, quantization of ChatGLM-130B, can run on consumer-level GPUs.\nChatGLM2-6B\nChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model\nChatGLM-6B\n.\nGLM-130B (ChatGLM)\nAn Open Bilingual Pre-Trained Model (ICLR 2023)\nGPT-NeoX\nAn implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.\nLuotuo\nA Chinese LLM, Based on LLaMA and fine tune by Stanford Alpaca, Alpaca LoRA, Japanese-Alpaca-LoRA.\nMixtral-8x7B-v0.1\nThe Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.\nStableLM\nStableLM: Stability AI Language Models\n⬆ back to ToC\nCV Foundation Model\nProject\nDetails\nRepository\ndisco-diffusion\nA frankensteinian amalgamation of notebooks, models and techniques for the generation of AI Art and Animations.\nmidjourney\nMidjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species.\nsegment-anything (SAM)\nproduces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image.\nstable-diffusion\nA latent text-to-image diffusion model\nstable-diffusion v2\nHigh-Resolution Image Synthesis with Latent Diffusion Models\n⬆ back to ToC\nAudio Foundation Model\nProject\nDetails\nRepository\nbark\nBark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.\nwhisper\nRobust Speech Recognition via Large-Scale Weak Supervision\nServing\nLarge Model Serving\nProject\nDetails\nRepository\nAlpaca-LoRA-Serve\nAlpaca-LoRA as Chatbot service\nCTranslate2\nfast inference engine for Transformer models in C++\nClip-as-a-service\nserving the OpenAI CLIP model\nDeepSpeed-MII\nMII makes low-latency and high-throughput inference possible, powered by DeepSpeed.\nFaster Whisper\nfast inference engine for whisper in C++ using CTranslate2.\nFlexGen\nRunning large language models on a single GPU for throughput-oriented scenarios.\nFlowise\nDrag & drop UI to build your customized LLM flow using LangchainJS.\nllama.cpp\nPort of Facebook's LLaMA model in C/C++\nInfinity\nRest API server for serving text-embeddings\nModelz-LLM\nOpenAI compatible API for LLMs and embeddings (LLaMA, Vicuna, ChatGLM and many others)\nOllama\nServe Llama 2 and other large language models locally from command line or through a browser interface.\nTensorRT-LLM\nInference engine for TensorRT on Nvidia GPUs\ntext-generation-inference\nLarge Language Model Text Generation Inference\ntext-embeddings-inference\nInference for text-embedding models\nvllm\nA high-throughput and memory-efficient inference and serving engine for LLMs.\nwhisper.cpp\nPort of OpenAI's Whisper model in C/C++\nx-stable-diffusion\nReal-time inference for Stable Diffusion - 0.88s latency. 
Covers AITemplate, nvFuser, TensorRT, FlashAttention.\n⬆ back to ToC\nFrameworks/Servers for Serving\nProject\nDetails\nRepository\nBentoML\nThe Unified Model Serving Framework\nJina\nBuild multimodal AI services via cloud native technologies · Model Serving · Generative AI · Neural Search · Cloud Native\nMosec\nA machine learning model serving framework with dynamic batching and pipelined stages, provides an easy-to-use Python interface.\nTFServing\nA flexible, high-performance serving system for machine learning models.\nTorchserve\nServe, optimize and scale PyTorch models in production\nTriton Server (TRTIS)\nThe Triton Inference Server provides an optimized cloud and edge inferencing solution.\nlangchain-serve\nServerless LLM apps on Production with Jina AI Cloud\nlanarky\nFastAPI framework to build production-grade LLM applications\nray-llm\nLLMs on Ray - RayLLM\nXinference\nReplace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.\n⬆ back to ToC\nSecurity\nFrameworks for LLM security\nProject\nDetails\nRepository\nPlexiglass\nA Python Machine Learning Pentesting Toolbox for Adversarial Attacks. Works with LLMs, DNNs, and other machine learning algorithms.\n⬆ back to ToC\nObservability\nProject\nDetails\nRepository\nAzure OpenAI Logger\n\"Batteries included\" logging solution for your Azure OpenAI instance.\nDeepchecks\nTests for Continuous Validation of ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.\nEvidently\nEvaluate and monitor ML models from validation to production.\nFiddler AI\nEvaluate, monitor, analyze, and improve machine learning and generative models from pre-production to production. Ship more ML and LLMs into production, and monitor ML and LLM metrics like hallucination, PII, and toxicity.\nGiskard\nTesting framework dedicated to ML models, from tabular to LLMs. Detect risks of biases, performance issues and errors in 4 lines of code.\nGreat Expectations\nAlways know what to expect from your data.\nwhylogs\nThe open standard for data logging\n⬆ back to ToC\nLLMOps\nProject\nDetails\nRepository\nagenta\nThe LLMOps platform to build robust LLM apps. Easily experiment and evaluate different prompts, models, and workflows to build robust apps.\nAI studio\nA Reliable Open Source AI studio to build core infrastructure stack for your LLM Applications. It allows you to gain visibility, make your application reliable, and prepare it for production with features such as caching, rate limiting, exponential retry, model fallback, and more.\nArize-Phoenix\nML observability for LLMs, vision, language, and tabular models.\nBudgetML\nDeploy a ML inference service on a budget in less than 10 lines of code.\nCometLLM\nThe 100% opensource LLMOps platform to log, manage, and visualize your LLM prompts and chains. Track prompt templates, prompt variables, prompt duration, token usage, and other metadata. Score prompt outputs and visualize chat history all within a single UI.\ndeeplake\nStream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, & version control data. 
Access data w/o the need to recompute the embeddings for the model finetuning.\nDify\nOpen-source framework aims to enable developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.\nDstack\nCost-effective LLM development in any cloud (AWS, GCP, Azure, Lambda, etc).\nEmbedchain\nFramework to create ChatGPT like bots over your dataset.\nEvidently\nAn open-source framework to evaluate, test and monitor ML and LLM-powered systems.\nFiddler AI\nEvaluate, monitor, analyze, and improve MLOps and LLMOps from pre-production to production.\nGlide\nCloud-Native LLM Routing Engine. Improve LLM app resilience and speed.\nGPTCache\nCreating semantic cache to store responses from LLM queries.\nHaystack\nQuickly compose applications with LLM Agents, semantic search, question-answering and more.\nHelicone\nOpen-source LLM observability platform for logging, monitoring, and debugging AI applications. Simple 1-line integration to get started.\nIzlo\nPrompt management tools for teams. Store, improve, test, and deploy your prompts in one unified workspace.\nKeywords AI\nA unified DevOps platform for AI software. Keywords AI makes it easy for developers to build LLM applications.\nlangchain\nBuilding applications with LLMs through composability\nLangFlow\nAn effortless way to experiment and prototype LangChain flows with drag-and-drop components and a chat interface.\nLangfuse\nOpen Source LLM Engineering Platform: Traces, evals, prompt management and metrics to debug and improve your LLM application.\nLangKit\nOut-of-the-box LLM telemetry collection library that extracts features and profiles prompts, responses and metadata about how your LLM is performing over time to find problems at scale.\nLiteLLM 🚅\nA simple & light 100 line package to\nstandardize LLM API calls\nacross OpenAI, Azure, Cohere, Anthropic, Replicate API Endpoints\nLiteral AI\nMulti-modal LLM observability and evaluation platform. Create prompt templates, deploy prompts versions, debug LLM runs, create datasets, run evaluations, monitor LLM metrics and collect human feedback.\nLlamaIndex\nProvides a central interface to connect your LLMs with external data.\nLLMApp\nLLM App is a Python library that helps you build real-time LLM-enabled data pipelines with few lines of code.\nLLMFlows\nLLMFlows is a framework for building simple, explicit, and transparent LLM applications such as chatbots, question-answering systems, and agents.\nLLMonitor\nObservability and monitoring for AI apps and agents. Debug agents with powerful tracing and logging. Usage analytics and dive deep into the history of your requests. Developer friendly modules with plug-and-play integration into LangChain.\nmagentic\nSeamlessly integrate LLMs as Python functions. Use type annotations to specify structured output. Mix LLM queries and function calling with regular Python code to create complex LLM-powered functionality.\nManag.ai\nYour all-in-one prompt management and observability platform. Craft, track, and perfect your LLM prompts with ease.\nMirascope\nIntuitive convenience tooling for lightning-fast, efficient development and ensuring quality in LLM-based applications\nOpenLIT\nOpenLIT is an OpenTelemetry-native GenAI and LLM Application Observability tool and provides OpenTelmetry Auto-instrumentation for monitoring LLMs, VectorDBs and Frameworks. 
It provides valuable insights into token & cost usage, user interaction, and performance related metrics.\nParea AI\nPlatform and SDK for AI Engineers providing tools for LLM evaluation, observability, and a version-controlled enhanced prompt playground.\nPezzo 🕹️\nPezzo is the open-source LLMOps platform built for developers and teams. In just two lines of code, you can seamlessly troubleshoot your AI operations, collaborate and manage your prompts in one place, and instantly deploy changes to any environment.\nPromptHub\nFull stack prompt management tool designed to be usable by technical and non-technical team members. Test, version, collaborate, deploy, and monitor, all from one place.\npromptfoo\nOpen-source tool for testing & evaluating prompt quality. Create test cases, automatically check output quality and catch regressions, and reduce evaluation cost.\nPromptFoundry\nThe simple prompt engineering and evaluation tool designed for developers building AI applications.\nPromptLayer 🍰\nPrompt Engineering platform. Collaborate, test, evaluate, and monitor your LLM applications\nPromptMage\nOpen-source tool to simplify the process of creating and managing LLM workflows and prompts as a self-hosted solution.\nPrompteams\nPrompt management system. Version, test, collaborate, and retrieve prompts through real-time APIs. Have GitHub style with repos, branches, and commits (and commit history).\nprompttools\nOpen-source tools for testing and experimenting with prompts. The core idea is to enable developers to evaluate prompts using familiar interfaces like code and notebooks. In just a few lines of codes, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.\nTreeScale\nAll In One Dev Platform For LLM Apps. Deploy LLM-enhanced APIs seamlessly using tools for prompt optimization, semantic querying, version management, statistical evaluation, and performance tracking. As a part of the developer friendly API implementation TreeScale offers Elastic LLM product, which makes a unified API Endpoint for all major LLM providers and open source models.\nTrueFoundry\nDeploy LLMOps tools like Vector DBs, Embedding server etc on your own Kubernetes (EKS,AKS,GKE,On-prem) Infra including deploying, Fine-tuning, tracking Prompts and serving Open Source LLM Models with full Data Security and Optimal GPU Management. Train and Launch your LLM Application at Production scale with best Software Engineering practices.\nReliableGPT 💪\nHandle OpenAI Errors (overloaded OpenAI servers, rotated keys, or context window errors) for your production LLM Applications.\nPortkey\nControl Panel with an observability suite & an AI gateway — to ship fast, reliable, and cost-efficient apps.\nVellum\nAn AI product development platform to experiment with, evaluate, and deploy advanced LLM apps.\nWeights & Biases (Prompts)\nA suite of LLMOps tools within the developer-first W&B MLOps platform. Utilize W&B Prompts for visualizing and inspecting LLM execution flow, tracking inputs and outputs, viewing intermediate results, securely managing prompts and LLM chain configurations.\nWordware\nA web-hosted IDE where non-technical domain experts work with AI Engineers to build task-specific AI agents. 
It approaches prompting as a new programming language rather than low/no-code blocks.\nxTuring\nBuild and control your personal LLMs with fast and efficient fine-tuning.\nZenML\nOpen-source framework for orchestrating, experimenting and deploying production-grade ML solutions, with built-in\nlangchain\n&\nllama_index\nintegrations.\n⬆ back to ToC\nSearch\nVector search\nProject\nDetails\nRepository\nAquilaDB\nAn easy to use Neural Search Engine. Index latent vectors along with JSON metadata and do efficient k-NN search.\nAwadb\nAI Native database for embedding vectors\nChroma\nthe open source embedding database\nInfinity\nThe AI-native database built for LLM applications, providing incredibly fast vector and full-text search\nLancedb\nDeveloper-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!\nMarqo\nTensor search for humans.\nMilvus\nVector database for scalable similarity search and AI applications.\nPinecone\nThe Pinecone vector database makes it easy to build high-performance vector search applications. Developer-friendly, fully managed, and easily scalable without infrastructure hassles.\npgvector\nOpen-source vector similarity search for Postgres.\npgvecto.rs\nVector database plugin for Postgres, written in Rust, specifically designed for LLM.\nQdrant\nVector Search Engine and Database for the next generation of AI applications. Also available in the cloud\ntxtai\nBuild AI-powered semantic search applications\nVald\nA Highly Scalable Distributed Vector Search Engine\nVearch\nA distributed system for embedding-based vector retrieval\nVectorDB\nA Python vector database you just need - no more, no less.\nVellum\nA managed service for ingesting documents and performing hybrid semantic/keyword search across them. Comes with out-of-box support for OCR, text chunking, embedding model experimentation, metadata filtering, and production-grade APIs.\nWeaviate\nWeaviate is an open source vector search engine that stores both objects and vectors, allowing for combining vector search with structured filtering with the fault-tolerance and scalability of a cloud-native database, all accessible through GraphQL, REST, and various language clients.\n⬆ back to ToC\nCode AI\nProject\nDetails\nRepository\nCodeGeeX\nCodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)\nCodeGen\nCodeGen is an open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.\nCodeT5\nOpen Code LLMs for Code Understanding and Generation.\nContinue\n⏩ the open-source autopilot for software development—bring the power of ChatGPT to VS Code\nfauxpilot\nAn open-source alternative to GitHub Copilot server\ntabby\nSelf-hosted AI coding assistant. An opensource / on-prem alternative to GitHub Copilot.\nTraining\nIDEs and Workspaces\nProject\nDetails\nRepository\ncode server\nRun VS Code on any machine anywhere and access it in the browser.\nconda\nOS-agnostic, system-level binary package manager and ecosystem.\nDocker\nMoby is an open-source project created by Docker to enable and accelerate software containerization.\nenvd\n🏕️ Reproducible development environment for AI/ML.\nJupyter Notebooks\nThe Jupyter notebook is a web-based notebook environment for interactive computing.\nKurtosis\nA build, packaging, and run system for ephemeral multi-container environments.\nWordware\nA web-hosted IDE where non-technical domain experts work with AI Engineers to build task-specific AI agents. 
It approaches prompting as a new programming language rather than low/no-code blocks.\n⬆ back to ToC\nFoundation Model Fine Tuning\nProject\nDetails\nRepository\nalpaca-lora\nInstruct-tune LLaMA on consumer hardware\nfinetuning-scheduler\nA PyTorch Lightning extension that accelerates and enhances foundation model experimentation with flexible fine-tuning schedules.\nFlyflow\nOpen source, high performance fine tuning as a service for GPT4 quality models with 5x lower latency and 3x lower cost\nLMFlow\nAn Extensible Toolkit for Finetuning and Inference of Large Foundation Models\nLora\nUsing Low-rank adaptation to quickly fine-tune diffusion models.\npeft\nState-of-the-art Parameter-Efficient Fine-Tuning.\np-tuning-v2\nAn optimized prompt tuning strategy achieving comparable performance to fine-tuning on small/medium-sized models and sequence tagging challenges.\n(ACL 2022)\nQLoRA\nEfficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance.\nTRL\nTrain transformer language models with reinforcement learning.\n⬆ back to ToC\nFrameworks for Training\nProject\nDetails\nRepository\nAccelerate\n🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.\nApache MXNet\nLightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.\naxolotl\nA tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures.\nCaffe\nA fast open framework for deep learning.\nColossalAI\nAn integrated large-scale model training system with efficient parallelization techniques.\nDeepSpeed\nDeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.\nHorovod\nDistributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.\nJax\nAutograd and XLA for high-performance machine learning research.\nKedro\nKedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.\nKeras\nKeras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.\nLightGBM\nA fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.\nMegEngine\nMegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.\nmetric-learn\nMetric Learning Algorithms in Python.\nMindSpore\nMindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.\nOneflow\nOneFlow is a performance-centered and open-source deep learning framework.\nPaddlePaddle\nMachine Learning Framework from Industrial Practice.\nPyTorch\nTensors and Dynamic neural networks in Python with strong GPU acceleration.\nPyTorch Lightning\nDeep learning framework to train, deploy, and ship AI products Lightning fast.\nXGBoost\nScalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library.\nscikit-learn\nMachine Learning in Python.\nTensorFlow\nAn Open Source Machine Learning Framework for Everyone.\nVectorFlow\nA minimalist neural network library optimized for sparse data and single machine environments.\n⬆ back to ToC\nExperiment Tracking\nProject\nDetails\nRepository\nAim\nan easy-to-use and performant 
open-source experiment tracker.\nClearML\nAuto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management\nComet\nComet is an MLOps platform that offers experiment tracking, model production management, a model registry, and full data lineage from training straight through to production. Comet plays nicely with all your favorite tools, so you don't have to change your existing workflow. Check out CometLLM for all your prompt engineering needs!\nGuild AI\nExperiment tracking, ML developer tools.\nMLRun\nMachine Learning automation and tracking.\nKedro-Viz\nKedro-Viz is an interactive development tool for building data science pipelines with Kedro. Kedro-Viz also allows users to view and compare different runs in the Kedro project.\nLabNotebook\nLabNotebook is a tool that allows you to flexibly monitor, record, save, and query all your machine learning experiments.\nSacred\nSacred is a tool to help you configure, organize, log and reproduce experiments.\nWeights & Biases\nA developer first, lightweight, user-friendly experiment tracking and visualization tool for machine learning projects, streamlining collaboration and simplifying MLOps. W&B excels at tracking LLM-powered applications, featuring W&B Prompts for LLM execution flow visualization, input and output monitoring, and secure management of prompts and LLM chain configurations.\n⬆ back to ToC\nVisualization\nProject\nDetails\nRepository\nFiddler AI\nRich dashboards, reports, and UMAP to perform root cause analysis, pinpoint problem areas, like correctness, safety, and privacy issues, and improve LLM outcomes.\nManiford\nA model-agnostic visual debugging tool for machine learning.\nnetron\nVisualizer for neural network, deep learning, and machine learning models.\nOpenOps\nBring multiple data streams into one dashboard.\nTensorBoard\nTensorFlow's Visualization Toolkit.\nTensorSpace\nNeural network 3D visualization framework, build interactive and intuitive model in browsers, support pre-trained deep learning models from TensorFlow, Keras, TensorFlow.js.\ndtreeviz\nA python library for decision tree visualization and model interpretation.\nZetane Viewer\nML models and internal tensors 3D visualizer.\nZeno\nAI evaluation platform for interactively exploring data and model outputs.\nModel Editing\nProject\nDetails\nRepository\nFastEdit\nFastEdit aims to assist developers with injecting fresh and customized knowledge into large language models efficiently using one single command.\n⬆ back to ToC\nData\nData Management\nProject\nDetails\nRepository\nArtiVC\nA version control system to manage large files. 
Lake is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size.\nDolt\nGit for Data.\nDVC\nData Version Control - Git for Data & Models - ML Experiments Management.\nDelta-Lake\nStorage layer that brings scalable, ACID transactions to Apache Spark and other engines.\nPachyderm\nPachyderm is a version control system for data.\nQuilt\nA self-organizing data hub for S3.\n⬆ back to ToC\nData Storage\nProject\nDetails\nRepository\nJuiceFS\nA distributed POSIX file system built on top of Redis and S3.\nLakeFS\nGit-like capabilities for your object storage.\nLance\nModern columnar data format for ML implemented in Rust.\n⬆ back to ToC\nData Tracking\nProject\nDetails\nRepository\nPiperider\nA CLI tool that allows you to build data profiles and write assertion tests for easily evaluating and tracking your data's reliability over time.\nLUX\nA Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process.\n⬆ back to ToC\nFeature Engineering\nProject\nDetails\nRepository\nFeatureform\nThe Virtual Feature Store. Turn your existing data infrastructure into a feature store.\nFeatureTools\nAn open source python framework for automated feature engineering\n⬆ back to ToC\nData/Feature enrichment\nProject\nDetails\nRepository\nUpgini\nFree automated data & feature enrichment library for machine learning: automatically searches through thousands of ready-to-use features from public and community shared data sources and enriches your training dataset with only the accuracy improving features\nFeast\nAn open source feature store for machine learning.\n⬆ back to ToC\nLarge Scale Deployment\nML Platforms\nProject\nDetails\nRepository\nComet\nComet is an MLOps platform that offers experiment tracking, model production management, a model registry, and full data lineage from training straight through to production. Comet plays nicely with all your favorite tools, so you don't have to change your existing workflow. Check out CometLLM for all your prompt engineering needs!\nClearML\nAuto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management.\nHopsworks\nHopsworks is a MLOps platform for training and operating large and small ML systems, including fine-tuning and serving LLMs. Hopsworks includes both a feature store and vector database for RAG.\nOpenLLM\nAn open platform for operating large language models (LLMs) in production. 
Fine-tune, serve, deploy, and monitor any LLMs with ease.\nMLflow\nOpen source platform for the machine learning lifecycle.\nMLRun\nAn open MLOps platform for quickly building and managing continuous ML applications across their lifecycle.\nModelFox\nModelFox is a platform for managing and deploying machine learning models.\nKserve\nStandardized Serverless ML Inference Platform on Kubernetes\nKubeflow\nMachine Learning Toolkit for Kubernetes.\nPAI\nResource scheduling and cluster management for AI.\nPolyaxon\nMachine Learning Management & Orchestration Platform.\nPrimehub\nAn effortless infrastructure for machine learning built on the top of Kubernetes.\nOpenModelZ\nOne-click machine learning deployment (LLM, text-to-image and so on) at scale on any cluster (GCP, AWS, Lambda labs, your home lab, or even a single machine).\nSeldon-core\nAn MLOps framework to package, deploy, monitor and manage thousands of production machine learning models\nStarwhale\nAn MLOps/LLMOps platform for model building, evaluation, and fine-tuning.\nTrueFoundry\nA PaaS to deploy, Fine-tune and serve LLM Models on a company’s own Infrastructure with Data Security and Optimal GPU and Cost Management. Launch your LLM Application at Production scale with best DevSecOps practices.\nWeights & Biases\nA lightweight and flexible platform for machine learning experiment tracking, dataset versioning, and model management, enhancing collaboration and streamlining MLOps workflows. W&B excels at tracking LLM-powered applications, featuring W&B Prompts for LLM execution flow visualization, input and output monitoring, and secure management of prompts and LLM chain configurations.\n⬆ back to ToC\nWorkflow\nProject\nDetails\nRepository\nAirflow\nA platform to programmatically author, schedule and monitor workflows.\naqueduct\nAn Open-Source Platform for Production Data Science\nArgo Workflows\nWorkflow engine for Kubernetes.\nFlyte\nKubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale.\nHamilton\nA lightweight framework to represent ML/language model pipelines as a series of python functions.\nKubeflow Pipelines\nMachine Learning Pipelines for Kubeflow.\nLangFlow\nAn effortless way to experiment and prototype LangChain flows with drag-and-drop components and a chat interface.\nMetaflow\nBuild and manage real-life data science projects with ease!\nPloomber\nThe fastest way to build data pipelines. Develop iteratively, deploy anywhere.\nPrefect\nThe easiest way to automate your data.\nVDP\nAn open-source unstructured data ETL tool to streamline the end-to-end unstructured data processing pipeline.\nZenML\nMLOps framework to create reproducible pipelines.\n⬆ back to ToC\nScheduling\nProject\nDetails\nRepository\nKueue\nKubernetes-native Job Queueing.\nPAI\nResource scheduling and cluster management for AI (Open-sourced by Microsoft).\nSlurm\nA Highly Scalable Workload Manager.\nVolcano\nA Cloud Native Batch System (Project under CNCF).\nYunikorn\nLight-weight, universal resource scheduler for container orchestrator systems.\n⬆ back to ToC\nModel Management\nProject\nDetails\nRepository\nComet\nComet is an MLOps platform that offers Model Production Management, a Model Registry, and full model lineage from training straight through to production. 
Use Comet for model reproducibility, model debugging, model versioning, model visibility, model auditing, model governance, and model monitoring.\ndvc\nML Experiments Management - Data Version Control - Git for Data & Models\nModelDB\nOpen Source ML Model Versioning, Metadata, and Experiment Management\nMLEM\nA tool to package, serve, and deploy any ML model on any platform.\normb\nDocker for Your ML/DL Models Based on OCI Artifacts\n⬆ back to ToC\nPerformance\nML Compiler\nProject\nDetails\nRepository\nONNX-MLIR\nCompiler technology to transform a valid Open Neural Network Exchange (ONNX) graph into code that implements the graph with minimum runtime support.\nTVM\nOpen deep learning compiler stack for cpu, gpu and specialized accelerators\n⬆ back to ToC\nProfiling\nProject\nDetails\nRepository\noctoml-profile\noctoml-profile is a python library and cloud service designed to provide the simplest experience for assessing and optimizing the performance of PyTorch models on cloud hardware with state-of-the-art ML acceleration technology.\nscalene\na high-performance, high-precision CPU, GPU, and memory profiler for Python\n⬆ back to ToC\nAutoML\nProject\nDetails\nRepository\nArchai\na platform for Neural Network Search (NAS) that allows you to generate efficient deep networks for your applications.\nautoai\nA framework to find the best performing AI/ML model for any AI problem.\nAutoGL\nAn autoML framework & toolkit for machine learning on graphs\nAutoGluon\nAutoML for Image, Text, and Tabular Data.\nautoml-gs\nProvide an input CSV and a target field to predict, generate a model + code to run it.\nautokeras\nAutoML library for deep learning.\nAuto-PyTorch\nAutomatic architecture search and hyperparameter optimization for PyTorch.\nauto-sklearn\nan automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.\nDragonfly\nAn open source python library for scalable Bayesian optimisation.\nDetermined\nscalable deep learning training platform with integrated hyperparameter tuning support; includes Hyperband, PBT, and other search methods.\nDEvol (DeepEvolution)\na basic proof of concept for genetic architecture search in Keras.\nEvalML\nAn open source python library for AutoML.\nFEDOT\nAutoML framework for the design of composite pipelines.\nFLAML\nFast and lightweight AutoML (\npaper\n).\nGoptuna\nA hyperparameter optimization framework, inspired by Optuna.\nHpBandSter\na framework for distributed hyperparameter optimization.\nHPOlib2\na library for hyperparameter optimization and black box optimization benchmarks.\nHyperband\nopen source code for tuning hyperparams with Hyperband.\nHypernets\nA General Automated Machine Learning Framework.\nHyperopt\nDistributed Asynchronous Hyperparameter Optimization in Python.\nhyperunity\nA toolset for black-box hyperparameter optimisation.\nIntelli\nA framework to connect a flow of ML models by applying graph theory.\nKatib\nKatib is a Kubernetes-native project for automated machine learning (AutoML).\nKeras Tuner\nHyperparameter tuning for humans.\nlearn2learn\nPyTorch Meta-learning Framework for Researchers.\nLudwig\na toolbox built on top of TensorFlow that allows to train and test deep learning models without the need to write code.\nMOE\na global, black box optimization engine for real world metric optimization by Yelp.\nModel Search\na framework that implements AutoML algorithms for model architecture search at scale.\nNASGym\na proof-of-concept OpenAI Gym environment for Neural Architecture Search (NAS).\nNNI\nAn open 
source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.\nOptuna\nA hyperparameter optimization framework.\nPycaret\nAn open-source, low-code machine learning library in Python that automates machine learning workflows.\nRay Tune\nScalable Hyperparameter Tuning.\nREMBO\nBayesian optimization in high-dimensions via random embedding.\nRoBO\na Robust Bayesian Optimization framework.\nscikit-optimize(skopt)\nSequential model-based optimization with a\nscipy.optimize\ninterface.\nSpearmint\na software package to perform Bayesian optimization.\nTPOT\none of the very first AutoML methods and open-source software packages.\nTorchmeta\nA Meta-Learning library for PyTorch.\nVegas\nan AutoML algorithm tool chain by Huawei Noah's Arb Lab.\n⬆ back to ToC\nOptimizations\nProject\nDetails\nRepository\nFeatherCNN\nFeatherCNN is a high performance inference engine for convolutional neural networks.\nForward\nA library for high performance deep learning inference on NVIDIA GPUs.\nNCNN\nncnn is a high-performance neural network inference framework optimized for the mobile platform.\nPocketFlow\nuse AutoML to do model compression.\nTensorFlow Model Optimization\nA suite of tools that users, both novice and advanced, can use to optimize machine learning models for deployment and execution.\nTNN\nA uniform deep learning inference framework for mobile, desktop and server.\n⬆ back to ToC\nFederated ML\nProject\nDetails\nRepository\nEasyFL\nAn Easy-to-use Federated Learning Platform\nFATE\nAn Industrial Grade Federated Learning Framework\nFedML\nThe federated learning and analytics library enabling secure and collaborative machine learning on decentralized data anywhere at any scale. 
Supporting large-scale cross-silo federated learning, cross-device federated learning on smartphones/IoTs, and research simulation.\nFlower\nA Friendly Federated Learning Framework\nHarmonia\nHarmonia is an open-source project aiming at developing systems/infrastructures and libraries to ease the adoption of federated learning (abbreviated to FL) for researches and production usage.\nTensorFlow Federated\nA framework for implementing federated learning\n⬆ back to ToC\nAwesome Lists\nProject\nDetails\nRepository\nAwesome Argo\nA curated list of awesome projects and resources related to Argo\nAwesome AutoDL\nAutomated Deep Learning: Neural Architecture Search Is Not the End (a curated list of AutoDL resources and an in-depth analysis)\nAwesome AutoML\nCurating a list of AutoML-related research, tools, projects and other resources\nAwesome AutoML Papers\nA curated list of automated machine learning papers, articles, tutorials, slides and projects\nAwesome-Code-LLM\n👨\u200d💻 An awesome and curated list of best code-LLM for research.\nAwesome Federated Learning Systems\nA curated list of Federated Learning Systems related academic papers, articles, tutorials, slides and projects.\nAwesome Federated Learning\nA curated list of federated learning publications, re-organized from Arxiv (mostly)\nawesome-federated-learning\nacc\nAll materials you need for Federated Learning: blogs, videos, papers, and software, etc.\nAwesome Open MLOps\nThis is the Fuzzy Labs guide to the universe of free and open source MLOps tools.\nAwesome Production Machine Learning\nA curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning\nAwesome Tensor Compilers\nA list of awesome compiler projects and papers for tensor computation and deep learning.\nkelvins/awesome-mlops\nA curated list of awesome MLOps tools.\nvisenger/awesome-mlops\nMachine Learning Operations - An awesome list of references for MLOps\ncurrentslab/awesome-vector-search\nA curated list of awesome vector search framework/engine, library, cloud service and research papers to vector similarity search.\npleisto/flappy\nProduction-Ready LLM Agent SDK for Every Developer\n⬆ back to ToC\nAbout\nAn awesome & curated list of best LLMOps tools for developers\nTopics\nawesome-list\nmlops\nai-development-tools\nllmops\nResources\nReadme\nLicense\nCC0-1.0 license\nActivity\nCustom properties\nStars\n3.7k\nstars\nWatchers\n64\nwatching\nForks\n353\nforks\nReport repository\nReleases\nNo releases published\nPackages\n0\nNo packages published\nContributors\n81\n+ 67 contributors\nLanguages\nShell\n86.2%\nPython\n13.8%",
"tool_call_id": "call_YeaR70E6l7iM7UHEtp709iVc",
"name": "extract_content",
},
{
"role": "tool",
"content": "Medium\nOpen in app\nSign up\nSign in\nWrite\nSign up\nSign in\nPAGE NOT FOUND\n404\nOut of nothing, something.\nYou can find (just about) anything on\nMedium\n— apparently even a page that doesn’t exist. Maybe these stories will take you somewhere new?\nHome\n“You Can’t Eat Technology.”\nAdam DeMartino\nAug 25, 2024\n·\n11 min read\n“You Can’t Eat Technology.”\nAdam DeMartino\nAug 25, 2024\n·\n11 min read\nReflections from the 2024 DNC\nIsaac Saul\nin\nThe Political Prism\nAug 25, 2024\n·\n15 min read\nReflections from the 2024 DNC\nIsaac Saul\nin\nThe Political Prism\nAug 25, 2024\n·\n15 min read\nIt’s cognitive bias week, because thinking is hard\nThe Medium Newsletter\nin\nThe Medium Blog\nAug 27, 2024\n·\n3 min read\nIt’s cognitive bias week, because thinking is hard\nThe Medium Newsletter\nin\nThe Medium Blog\nAug 27, 2024\n·\n3 min read\nThe Fun and Games of College Tours\njen murphy parker\nAug 17, 2024\n·\n12 min read\nMember-only\nThe Fun and Games of College Tours\njen murphy parker\nAug 17, 2024\n·\n12 min read\nMember-only",
"tool_call_id": "call_UWuyM3dy71Js7fspwSKnMlGC",
"name": "extract_content",
},
]
@pytest.mark.asyncio
async def test_conversation():
    """Tests that, in a conversation setting, the LLM-generated query is context-relevant."""
    web_assistant = WebAssistant(
        search_history=[
            "best LLM development tools",
            "top libraries for LLM development",
            "LLM libraries for software engineers",
            "LLM dev tools for machine learning",
            "most popular libraries for LLM development",
        ],
        messages=test_conversation_messages,
    )
    response = await web_assistant._stream("What is mirascope library?")
    async for _, tool in response:
        # Collect the search queries the LLM generated for this turn (if any).
        queries = tool.args.get("queries", []) if tool else []
        is_context_relevant = False
        for query in queries:
            context_relevant = await check_context_relevance(
                web_assistant.search_history, "What is mirascope library?", query
            )
            is_context_relevant = context_relevant.is_context_relevant
            if is_context_relevant:
                # One context-relevant query is enough for the check to pass.
                break
        assert is_context_relevant


ipytest.run()
A few things to note:
- Messages are appended ahead of time to reduce testing time and token usage; check messages.py for the full conversation history used.
- Our test asserts that at least one of the LLM-generated queries must be context-relevant.
Evaluating context relevance is just one of the crucial steps towards enhancing LLM-powered search systems, enabling them to provide more coherent, personalized, and valuable results across diverse user interactions.
When adapting this recipe to your specific use-case, consider the following:
- Use pytest.mark.parametrize and add more examples drawn from real user queries to further identify areas for improvement (a sketch follows this list).
- Implement a context relevance check inside the agent to create a feedback loop that lets the agent generate a better search query (a second sketch follows this list).
- Context relevance is not always what the user is asking for; the challenge lies in distinguishing between unintentional context loss and intentional topic shifts. This can be addressed by implementing a like/dislike feature on answers and using that feedback to refine the agent's searches.
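For example, the test above could be parametrized over additional user queries. The following is a minimal sketch, assuming the same WebAssistant, test_conversation_messages fixture, and check_context_relevance evaluator defined earlier; the extra user queries and the test name are illustrative placeholders.

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "user_query",
    [
        "What is mirascope library?",
        "Which of these libraries supports streaming?",  # hypothetical follow-up
        "How do I get started with one of these tools?",  # hypothetical follow-up
    ],
)
async def test_conversation_parametrized(user_query: str):
    """Checks that at least one generated search query is context-relevant for each user query."""
    web_assistant = WebAssistant(
        search_history=[
            "best LLM development tools",
            "top libraries for LLM development",
            "LLM libraries for software engineers",
            "LLM dev tools for machine learning",
            "most popular libraries for LLM development",
        ],
        messages=test_conversation_messages,
    )
    response = await web_assistant._stream(user_query)
    async for _, tool in response:
        queries = tool.args.get("queries", []) if tool else []
        is_context_relevant = False
        for query in queries:
            context_relevant = await check_context_relevance(
                web_assistant.search_history, user_query, query
            )
            is_context_relevant = context_relevant.is_context_relevant
            if is_context_relevant:
                break
        assert is_context_relevant


ipytest.run()

For the feedback loop, one option is to screen the generated queries before they reach the search tool. The helper below is a sketch only, not part of the agent: its name and integration point are hypothetical, and it assumes the async check_context_relevance evaluator used in the test above.

async def filter_relevant_queries(
    search_history: list[str], user_query: str, queries: list[str]
) -> list[str]:
    """Returns only the generated queries judged context-relevant."""
    relevant_queries = []
    for query in queries:
        result = await check_context_relevance(search_history, user_query, query)
        if result.is_context_relevant:
            relevant_queries.append(query)
    # If the list comes back empty, the agent could be re-prompted to generate
    # new search queries instead of searching with irrelevant ones.
    return relevant_queries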