{ "cells": [ { "cell_type": "markdown", "id": "6f8bf5c97b4d23d8", "metadata": {}, "source": [ "# Removing Semantic Duplicates\n", "\n", " In this recipe, we show how to use LLMs — in this case, OpenAI's `gpt-4o-mini` — to answer remove semantic duplicates from lists and objects.\n", "\n", "
\n", "

Mirascope Concepts Used

\n", "\n", "
\n", "\n", "
\n", "

Background

\n", "

\n", "Semantic deduplication, or the removal of duplicates which are equivalent in meaning but not in data, has been a longstanding problem in NLP. LLMs which have the ability to comprehend context, semantics, and implications within that text trivializes this problem.\n", "

\n", "
" ] }, { "cell_type": "markdown", "id": "7636106d", "metadata": {}, "source": [ "## Setup\n", "\n", "Let's start by installing Mirascope and its dependencies:" ] }, { "cell_type": "code", "execution_count": null, "id": "c210ca9f", "metadata": {}, "outputs": [], "source": [ "!pip install \"mirascope[openai]\"" ] }, { "cell_type": "code", "execution_count": null, "id": "4a1ef922", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n", "# Set the appropriate API key for the provider you're using" ] }, { "cell_type": "markdown", "id": "cab69a8c", "metadata": {}, "source": [ "## Deduplicating a List\n", "\n", "To start, assume we have a some entries of movie genres with semantic duplicates:" ] }, { "cell_type": "code", "execution_count": 1, "id": "65835a1c932e3226", "metadata": { "ExecuteTime": { "end_time": "2024-09-30T07:44:43.248848Z", "start_time": "2024-09-30T07:44:43.245404Z" } }, "outputs": [], "source": [ "movie_genres = [\n", " \"sci-fi\",\n", " \"romance\",\n", " \"love story\",\n", " \"action\",\n", " \"horror\",\n", " \"heist\",\n", " \"crime\",\n", " \"science fiction\",\n", " \"fantasy\",\n", " \"scary\",\n", "]" ] }, { "cell_type": "markdown", "id": "704fa237654a4539", "metadata": {}, "source": [ "To deduplicate this list, we’ll extract a schema containing `genres`, the deduplicated list, and `duplicates`, a list of all duplicate items. The reason for having `duplicates` in our schema is that LLM extractions can be inconsistent, even with the most recent models - forcing it to list the duplicate items helps it reason through the call and produce a more accurate answer.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "ec73f974659a2527", "metadata": { "ExecuteTime": { "end_time": "2024-09-30T07:44:44.571453Z", "start_time": "2024-09-30T07:44:44.495200Z" } }, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "\n", "\n", "class DeduplicatedGenres(BaseModel):\n", " duplicates: list[list[str]] = Field(\n", " ..., description=\"A list containing lists of semantically equivalent items\"\n", " )\n", " genres: list[str] = Field(\n", " ..., description=\"The list of genres with semantic duplicates removed\"\n", " )" ] }, { "cell_type": "markdown", "id": "2a43771f239cd3fb", "metadata": {}, "source": [ "\n", "We can now set this schema as our response model in a Mirascope call:\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "ff39bf501b0da43c", "metadata": { "ExecuteTime": { "end_time": "2024-09-30T07:44:48.163992Z", "start_time": "2024-09-30T07:44:46.204147Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['action', 'horror', 'heist', 'crime', 'fantasy', 'scary']\n", "[['sci-fi', 'science fiction'], ['love story', 'romance']]\n" ] } ], "source": [ "from mirascope.core import openai, prompt_template\n", "\n", "\n", "@openai.call(model=\"gpt-4o-mini\", response_model=DeduplicatedGenres)\n", "@prompt_template(\n", " \"\"\"\n", " SYSTEM:\n", " Your job is to take a list of movie genres and clean it up by removing items\n", " which are semantically equivalent to one another. When coming across multiple items\n", " which refer to the same genre, keep the genre name which is most commonly used.\n", " For example, \"sci-fi\" and \"science fiction\" are the same genre.\n", "\n", " USER:\n", " {genres}\n", " \"\"\"\n", ")\n", "def deduplicate_genres(genres: list[str]): ...\n", "\n", "\n", "response = deduplicate_genres(movie_genres)\n", "assert isinstance(response, DeduplicatedGenres)\n", "print(response.genres)\n", "print(response.duplicates)" ] }, { "cell_type": "markdown", "id": "c395b2ccbbfec826", "metadata": {}, "source": [ "\n", "Just like with a list of strings, we can create a schema of `DeduplicatedBooks` and set it as the response model, with a modified prompt to account for the different types of differences we see:\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "98ed7c009cfb57f8", "metadata": { "ExecuteTime": { "end_time": "2024-09-30T07:44:51.482573Z", "start_time": "2024-09-30T07:44:49.667933Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title='The War of the Worlds' author='H. G. Wells' genre='scifi'\n", "title='War of the Worlds' author='H.G. Wells' genre='science fiction'\n" ] } ], "source": [ "class Book(BaseModel):\n", " title: str\n", " author: str\n", " genre: str\n", "\n", "\n", "duplicate_books = [\n", " Book(title=\"The War of the Worlds\", author=\"H. G. Wells\", genre=\"scifi\"),\n", " Book(title=\"War of the Worlds\", author=\"H.G. Wells\", genre=\"science fiction\"),\n", " Book(title=\"The Sorcerer's stone\", author=\"J. K. Rowling\", genre=\"fantasy\"),\n", " Book(\n", " title=\"Harry Potter and The Sorcerer's stone\",\n", " author=\"J. K. Rowling\",\n", " genre=\"fantasy\",\n", " ),\n", " Book(title=\"The Name of the Wind\", author=\"Patrick Rothfuss\", genre=\"fantasy\"),\n", " Book(title=\"'The Name of the Wind'\", author=\"Patrick Rofuss\", genre=\"fiction\"),\n", "]\n", "\n", "\n", "@openai.call(model=\"gpt-4o\", response_model=list[Book])\n", "@prompt_template(\n", " \"\"\"\n", " SYSTEM:\n", " Your job is to take a database of books and clean it up by removing items which are\n", " semantic duplicates. Look out for typos, formatting differences, and categorizations.\n", " For example, \"Mistborn\" and \"Mistborn: The Final Empire\" are the same book \n", " but \"Mistborn: Shadows of Self\" is not.\n", " Then return all the unique books.\n", "\n", " USER:\n", " {books}\n", " \"\"\"\n", ")\n", "def deduplicate_books(books: list[Book]): ...\n", "\n", "\n", "books = deduplicate_books(duplicate_books)\n", "assert isinstance(books, list)\n", "for book in books:\n", " print(book)" ] }, { "cell_type": "markdown", "id": "6fd02e1f9483587b", "metadata": {}, "source": [ "
\n", "

Additional Real-World Examples

\n", "\n", "
\n", "\n", "When adapting this recipe to your specific use-case, consider the following:\n", "\n", "- Refine your prompts to provide clear instructions and examples tailored to your requirements.\n", "- Experiment with different model providers and version to balance accuracy and speed.\n", "- Use multiple model providers to evaluate whether all duplicates have bene removed.\n", "- Add more information if possible to get better accuracy, e.g. some books might have similar names but are released in different years.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }