{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3495c6e316ccfa49",
   "metadata": {},
   "source": [
    "# Generate Synthetic Data\n",
    "\n",
    "In this tutorial, we go over how to generate synthetic data for LLMs, in this case, OpenAI’s `gpt-4o-mini`. When using LLMs to synthetically generate data, it is most useful to generate non-numerical data which isn’t strictly dependent on a defined probability distribution - in those cases, it will be far easier to define a distribution and generate these points directly from the distribution.\n",
    "\n",
    "However, for:\n",
    "\n",
    "- data that needs general intelligence to be realistic\n",
    "- data that lists many items within a broad category\n",
    "- data which is language related\n",
    "\n",
    "and more, LLMs are far easier to use and yield better (or the only feasible) results.\n",
    "\n",
    "<div class=\"admonition tip\">\n",
    "<p class=\"admonition-title\">Mirascope Concepts Used</p>\n",
    "<ul>\n",
    "<li><a href=\"../../../learn/prompts/\">Prompts</a></li>\n",
    "<li><a href=\"../../../learn/calls/\">Calls</a></li>\n",
    "<li><a href=\"../../../learn/response_models/\">Response Models</a></li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "<div class=\"admonition note\">\n",
    "<p class=\"admonition-title\">Background</p>\n",
    "<p>\n",
    "Large Language Models (LLMs) have emerged as powerful tools for generating synthetic data, particularly for text-based applications. Compared to traditional synthetic data generation methods, LLMs can produce more diverse, contextually rich, and human-like textual data, often with less need for domain-specific rules or statistical modeling.\n",
    "</p>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd472be3",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "To set up our environment, first let's install all of the packages we will use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de766ec5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"mirascope[openai]\" pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49f293a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n",
    "# Set the appropriate API key for the provider you're using"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf95d7e939b2fee8",
   "metadata": {},
   "source": [
    "## Generate Data as CSV\n",
    "\n",
    "To generate realistic, synthetic data as a csv, you can accomplish this in a single call by requesting a csv format in the prompt and describing the kind of data you would like generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "358f2e43ea30e094",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T11:39:44.294304Z",
     "start_time": "2024-09-30T11:39:42.835288Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name, Price, Inventory  \n",
      "\"4-Slice Toaster\", 29.99, 150  \n",
      "\"Stainless Steel Blender\", 49.99, 75  \n",
      "\"Robot Vacuum Cleaner\", 199.99, 30  \n",
      "\"Microwave Oven 1000W\", 89.99, 50  \n",
      "\"Electric Kettle\", 24.99, 200\n"
     ]
    }
   ],
   "source": [
    "from mirascope.core import openai, prompt_template\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o-mini\")\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    Generate {num_datapoints} random but realistic datapoints of items which could\n",
    "    be in a home appliances warehouse. Output the datapoints as a csv, and your\n",
    "    response should only be the CSV.\n",
    "\n",
    "    Format:\n",
    "    Name, Price, Inventory\n",
    "\n",
    "    Name - the name of the home appliance product\n",
    "    Price - the price of an individual product, in dollars (include cents)\n",
    "    Inventory - how many are left in stock\n",
    "    \"\"\"\n",
    ")\n",
    "def generate_csv_data(num_datapoints: int): ...\n",
    "\n",
    "\n",
    "print(generate_csv_data(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82c3e7974f491198",
   "metadata": {},
   "source": [
    "\n",
    "Note that the prices and inventory of each item are somewhat realistic for their corresponding item, something which would be otherwise difficult to accomplish.\n",
    "\n",
    "\n",
    "## Generate Data with `response_model`\n",
    "\n",
    "Sometimes, it will be easier to integrate your datapoints into your code if they are defined as some schema, namely a Pydantic `BaseModel`. In this case, describe each column as the `description` of a `Field` in the `BaseModel` instead of the prompt, and set `response_model` to your defined schema:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "44d7b2c55386292a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T11:39:46.607894Z",
     "start_time": "2024-09-30T11:39:44.988847Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[HomeAppliance(name='Refrigerator', price=899.99, inventory=25), HomeAppliance(name='Microwave', price=129.99, inventory=50), HomeAppliance(name='Washing Machine', price=499.99, inventory=15), HomeAppliance(name='Dishwasher', price=749.99, inventory=10), HomeAppliance(name='Air Conditioner', price=349.99, inventory=30)]\n"
     ]
    }
   ],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class HomeAppliance(BaseModel):\n",
    "    name: str = Field(description=\"The name of the home appliance product\")\n",
    "    price: float = Field(\n",
    "        description=\"The price of an individual product, in dollars (include cents)\"\n",
    "    )\n",
    "    inventory: int = Field(description=\"How many of the items are left in stock\")\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o-mini\", response_model=list[HomeAppliance])\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    Generate {num_datapoints} random but realistic datapoints of items which could\n",
    "    be in a home appliances warehouse. Output the datapoints as a list of instances\n",
    "    of HomeAppliance.\n",
    "    \"\"\"\n",
    ")\n",
    "def generate_home_appliance_data(num_datapoints: int): ...\n",
    "\n",
    "\n",
    "print(generate_home_appliance_data(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed5c4be43093ac85",
   "metadata": {},
   "source": [
    "\n",
    "## Generate Data into a pandas `DataFrame`\n",
    "\n",
    "Since pandas is a common library for working with data, it’s also worth knowing how to directly create and append to a dataframe with LLMs.\n",
    "\n",
    "### Create a New `DataFrame`\n",
    "\n",
    "To create a new `DataFrame`, we define a `BaseModel` schema with a simple function to generate  `DataFrame` via a list of list of data and the column names:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "537ac4107a9792b0",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T11:39:49.791477Z",
     "start_time": "2024-09-30T11:39:48.165697Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             Name         Price Inventory\n",
      "0  Microwave Oven  Refrigerator   Blender\n",
      "1           79.99        899.99     49.99\n",
      "2              25            10        40\n"
     ]
    }
   ],
   "source": [
    "from typing import Any, Literal\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "class DataFrameGenerator(BaseModel):\n",
    "    data: list[list[Any]] = Field(\n",
    "        description=\"the data to be inserted into the dataframe\"\n",
    "    )\n",
    "    column_names: list[str] = Field(description=\"The names of the columns in data\")\n",
    "\n",
    "    def append_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:\n",
    "        return pd.concat([df, self.generate_dataframe()], ignore_index=True)\n",
    "\n",
    "    def generate_dataframe(self) -> pd.DataFrame:\n",
    "        return pd.DataFrame(dict(zip(self.column_names, self.data, strict=False)))\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o-mini\", response_model=DataFrameGenerator)\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    Generate {num_datapoints} random but realistic datapoints of items which could\n",
    "    be in a home appliances warehouse. Generate your response as `data` and\n",
    "    `column_names`, so that a pandas DataFrame may be generated with:\n",
    "    `pd.DataFrame(data, columns=column_names)`.\n",
    "\n",
    "    Format:\n",
    "    Name, Price, Inventory\n",
    "\n",
    "    Name - the name of the home appliance product\n",
    "    Price - the price of an individual product, in dollars (include cents)\n",
    "    Inventory - how many are left in stock\n",
    "    \"\"\"\n",
    ")\n",
    "def generate_df_data(num_datapoints: int): ...\n",
    "\n",
    "\n",
    "df_data = generate_df_data(5)\n",
    "df = df_data.generate_dataframe()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a604276753bcbe8",
   "metadata": {},
   "source": [
    "### Appending to a `DataFrame`\n",
    "\n",
    "To append to a `DataFrame`, we can modify the prompt so that instead of describing the data we want to generate, we ask the LLM to match the type of data it already sees. Furthermore, we add a `append_dataframe()` function to append to an existing `DataFrame`. Finally, note that we use the generated `df` from above as the `DataFrame` to append to in the following example:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f1c01d8ce8efa9c6",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T11:39:53.553873Z",
     "start_time": "2024-09-30T11:39:51.984027Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             Name         Price     Inventory\n",
      "0  Microwave Oven  Refrigerator       Blender\n",
      "1           79.99        899.99         49.99\n",
      "2              25            10            40\n",
      "3         Toaster   Slow Cooker  Coffee Maker\n",
      "4           29.99         49.99         39.99\n",
      "5             150            80           200\n"
     ]
    }
   ],
   "source": [
    "@openai.call(model=\"gpt-4o-mini\", response_model=DataFrameGenerator)\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    Generate {num_datapoints} random but realistic datapoints of items which would\n",
    "    make sense to the following dataset:\n",
    "    {df}\n",
    "    Generate your response as `data` and\n",
    "    `column_names`, so that a pandas DataFrame may be generated with:\n",
    "    `pd.DataFrame(data, columns=column_names)` then appended to the existing data.\n",
    "    \"\"\"\n",
    ")\n",
    "def generate_additional_df_data(num_datapoints: int, df: pd.DataFrame): ...\n",
    "\n",
    "\n",
    "df_data = generate_additional_df_data(5, df)\n",
    "df = df_data.append_dataframe(df)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0e37ff81554e5ed",
   "metadata": {},
   "source": [
    "\n",
    "## Adding Constraints\n",
    "\n",
    "While you cannot successfully add complex mathematical constraints to generated data (think statistics, such as distributions and covariances), asking LLMs to abide by basic constraints will (generally) prove successful, especially with newer models. Let’s look at an example where we generate TVs where the TV price should roughly linearly correlate with TV size, and QLEDs are 2-3x more expensive than OLEDs of the same size:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "55bb308c777edc75",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-30T11:39:59.880354Z",
     "start_time": "2024-09-30T11:39:57.499118Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "size=32 price=299.99 tv_type='OLED'\n",
      "size=32 price=549.99 tv_type='QLED'\n",
      "size=43 price=399.99 tv_type='OLED'\n",
      "size=43 price=749.99 tv_type='QLED'\n",
      "size=55 price=699.99 tv_type='OLED'\n",
      "size=55 price=1399.99 tv_type='QLED'\n",
      "size=65 price=999.99 tv_type='OLED'\n",
      "size=65 price=1999.99 tv_type='QLED'\n",
      "size=75 price=1299.99 tv_type='OLED'\n",
      "size=75 price=2499.99 tv_type='QLED'\n"
     ]
    }
   ],
   "source": [
    "class TV(BaseModel):\n",
    "    size: int = Field(description=\"The size of the TV\")\n",
    "    price: float = Field(description=\"The price of the TV in dollars (include cents)\")\n",
    "    tv_type: Literal[\"OLED\", \"QLED\"]\n",
    "\n",
    "\n",
    "@openai.call(model=\"gpt-4o-mini\", response_model=list[TV])\n",
    "@prompt_template(\n",
    "    \"\"\"\n",
    "    Generate {num_datapoints} random but realistic datapoints of TVs.\n",
    "    Output the datapoints as a list of instances of TV.\n",
    "\n",
    "    Make sure to abide by the following constraints:\n",
    "    QLEDS should be roughly (not exactly) 2x the price of an OLED of the same size\n",
    "    for both OLEDs and QLEDS, price should increase roughly proportionately to size\n",
    "    \"\"\"\n",
    ")\n",
    "def generate_tv_data(num_datapoints: int): ...\n",
    "\n",
    "\n",
    "for tv in generate_tv_data(10):\n",
    "    print(tv)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7170831ce7c4ef6c",
   "metadata": {},
   "source": [
    "To demonstrate the constraints’ being followed, you can graph the data using matplotlib, which shows the linear relationships between size and price, and QLEDs costing roughly twice as much as OLED:\n",
    "\n",
    "<img src=\"../../../assets/generating-synthetic-data-chart.png\" alt=\"generating-synthetic-data-chart\">\n",
    "\n",
    "<div class=\"admonition tip\">\n",
    "<p class=\"admonition-title\">Additional Real-World Examples</p>\n",
    "<ul>\n",
    "<li><b>Healthcare and Medical Research</b>: Generating synthetic patient records for training machine learning models without compromising patient privacy</li>\n",
    "<li><b>Environmental Science</b>: Generating synthetic climate data for modeling long-term environmental changes</li>\n",
    "<li><b>Fraud Detection Systems</b>: Generating synthetic data of fraudulent and legitimate transactions for training fraud detection models.</li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "When adapting this recipe to your specific use-case, consider the following:\n",
    "    - Add Pydantic `AfterValidators` to constrain your synthetic data generation\n",
    "    - Verify that the synthetic data generated actually matches real-world data.\n",
    "    - Make sure no biases are present in the generated data, this can be prompt engineered.\n",
    "    - Experiment with different model providers and versions for quality.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}