{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6f8bf5c97b4d23d8",
   "metadata": {},
   "source": [
    "# Removing Semantic Duplicates\n",
    "\n",
    " In this recipe, we show how to use LLMs — in this case, OpenAI's `gpt-4o-mini` — to answer remove semantic duplicates from lists and objects.\n",
    "\n",
    "<div class=\"admonition tip\">\n",
    "<p class=\"admonition-title\">Mirascope Concepts Used</p>\n",
    "<ul>\n",
    "<li><a href=\"../../../learn/prompts/\">Prompts</a></li>\n",
    "<li><a href=\"../../../learn/calls/\">Calls</a></li>\n",
    "<li><a href=\"../../../learn/response_models/\">Response Model</a></li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "<div class=\"admonition note\">\n",
    "<p class=\"admonition-title\">Background</p>\n",
    "<p>\n",
    "Semantic deduplication, or the removal of duplicates which are equivalent in meaning but not in data, has been a longstanding problem in NLP. LLMs which have the ability to comprehend context, semantics, and implications within that text trivializes this problem.\n",
    "</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7636106d",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Let's start by installing Mirascope and its dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c210ca9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"mirascope[openai]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a1ef922",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n",
    "# Set the appropriate API key for the provider you're using"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab69a8c",
   "metadata": {},
   "source": [
    "## Deduplicating a List\n",
    "\n",
    "To start, assume we have a some entries of movie genres with semantic duplicates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "65835a1c932e3226",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T07:44:43.248848Z",
     "start_time": "2024-09-30T07:44:43.245404Z"
    }
   },
   "outputs": [],
   "source": [
    "movie_genres = [\n",
    "    \"sci-fi\",\n",
    "    \"romance\",\n",
    "    \"love story\",\n",
    "    \"action\",\n",
    "    \"horror\",\n",
    "    \"heist\",\n",
    "    \"crime\",\n",
    "    \"science fiction\",\n",
    "    \"fantasy\",\n",
    "    \"scary\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "704fa237654a4539",
   "metadata": {},
   "source": [
    "To deduplicate this list, we’ll extract a schema containing `genres`, the deduplicated list, and `duplicates`, a list of all duplicate items. The reason for having `duplicates` in our schema is that LLM extractions can be inconsistent, even with the most recent models - forcing it to list the duplicate items helps it reason through the call and produce a more accurate answer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ec73f974659a2527",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T07:44:44.571453Z",
     "start_time": "2024-09-30T07:44:44.495200Z"
    }
   },
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class DeduplicatedGenres(BaseModel):\n",
    "    duplicates: list[list[str]] = Field(\n",
    "        ..., description=\"A list containing lists of semantically equivalent items\"\n",
    "    )\n",
    "    genres: list[str] = Field(\n",
    "        ..., description=\"The list of genres with semantic duplicates removed\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a43771f239cd3fb",
   "metadata": {},
   "source": [
    "\n",
    "We can now set this schema as our response model in a Mirascope call:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ff39bf501b0da43c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T07:44:48.163992Z",
     "start_time": "2024-09-30T07:44:46.204147Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['action', 'horror', 'heist', 'crime', 'fantasy', 'scary']\n",
      "[['sci-fi', 'science fiction'], ['love story', 'romance']]\n"
     ]
    }
   ],
   "source": [
    "from mirascope.core import openai, prompt_template\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o-mini\", response_model=DeduplicatedGenres)\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    SYSTEM:\n",
    "    Your job is to take a list of movie genres and clean it up by removing items\n",
    "    which are semantically equivalent to one another. When coming across multiple items\n",
    "    which refer to the same genre, keep the genre name which is most commonly used.\n",
    "    For example, \"sci-fi\" and \"science fiction\" are the same genre.\n",
    "\n",
    "    USER:\n",
    "    {genres}\n",
    "    \"\"\"\n",
    ")\n",
    "def deduplicate_genres(genres: list[str]): ...\n",
    "\n",
    "\n",
    "response = deduplicate_genres(movie_genres)\n",
    "assert isinstance(response, DeduplicatedGenres)\n",
    "print(response.genres)\n",
    "print(response.duplicates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c395b2ccbbfec826",
   "metadata": {},
   "source": [
    "\n",
    "Just like with a list of strings, we can create a schema of `DeduplicatedBooks` and set it as the response model, with a modified prompt to account for the different types of differences we see:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "98ed7c009cfb57f8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T07:44:51.482573Z",
     "start_time": "2024-09-30T07:44:49.667933Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title='The War of the Worlds' author='H. G. Wells' genre='scifi'\n",
      "title='War of the Worlds' author='H.G. Wells' genre='science fiction'\n"
     ]
    }
   ],
   "source": [
    "class Book(BaseModel):\n",
    "    title: str\n",
    "    author: str\n",
    "    genre: str\n",
    "\n",
    "\n",
    "duplicate_books = [\n",
    "    Book(title=\"The War of the Worlds\", author=\"H. G. Wells\", genre=\"scifi\"),\n",
    "    Book(title=\"War of the Worlds\", author=\"H.G. Wells\", genre=\"science fiction\"),\n",
    "    Book(title=\"The Sorcerer's stone\", author=\"J. K. Rowling\", genre=\"fantasy\"),\n",
    "    Book(\n",
    "        title=\"Harry Potter and The Sorcerer's stone\",\n",
    "        author=\"J. K. Rowling\",\n",
    "        genre=\"fantasy\",\n",
    "    ),\n",
    "    Book(title=\"The Name of the Wind\", author=\"Patrick Rothfuss\", genre=\"fantasy\"),\n",
    "    Book(title=\"'The Name of the Wind'\", author=\"Patrick Rofuss\", genre=\"fiction\"),\n",
    "]\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o\", response_model=list[Book])\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    SYSTEM:\n",
    "    Your job is to take a database of books and clean it up by removing items which are\n",
    "    semantic duplicates. Look out for typos, formatting differences, and categorizations.\n",
    "    For example, \"Mistborn\" and \"Mistborn: The Final Empire\" are the same book \n",
    "    but \"Mistborn: Shadows of Self\" is not.\n",
    "    Then return all the unique books.\n",
    "\n",
    "    USER:\n",
    "    {books}\n",
    "    \"\"\"\n",
    ")\n",
    "def deduplicate_books(books: list[Book]): ...\n",
    "\n",
    "\n",
    "books = deduplicate_books(duplicate_books)\n",
    "assert isinstance(books, list)\n",
    "for book in books:\n",
    "    print(book)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fd02e1f9483587b",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip\">\n",
    "<p class=\"admonition-title\">Additional Real-World Examples</p>\n",
    "<ul>\n",
    "<li><b>Customer Relationship Management (CRM)</b>: Maintaining a single, accurate view of each customer.</li>\n",
    "<li><b>Database Management</b>: Removing duplicate records to maintain data integrity and improve query performance</li>\n",
    "<li><b>Email</b>: Clean up digital assets by removing duplicate attachments, emails.</li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "When adapting this recipe to your specific use-case, consider the following:\n",
    "\n",
    "- Refine your prompts to provide clear instructions and examples tailored to your requirements.\n",
    "- Experiment with different model providers and version to balance accuracy and speed.\n",
    "- Use multiple model providers to evaluate whether all duplicates have bene removed.\n",
    "- Add more information if possible to get better accuracy, e.g. some books might have similar names but are released in different years.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}