{ "cells": [ { "cell_type": "markdown", "id": "3495c6e316ccfa49", "metadata": {}, "source": [ "# Generate Synthetic Data\n", "\n", "In this tutorial, we go over how to generate synthetic data using LLMs, in this case OpenAI’s `gpt-4o-mini`. LLMs are most useful for generating non-numerical data that isn’t strictly dependent on a defined probability distribution. For data that does follow a known distribution, it is far easier to define that distribution and sample points from it directly.\n", "\n", "However, for:\n", "\n", "- data that needs general intelligence to be realistic\n", "- data that lists many items within a broad category\n", "- data that is language-related\n", "\n", "and more, LLMs are far easier to use and yield better results, and are sometimes the only feasible option.\n", "\n", "
Mirascope Concepts Used
\n", "Background
\n", "\n", "Large Language Models (LLMs) have emerged as powerful tools for generating synthetic data, particularly for text-based applications. Compared to traditional synthetic data generation methods, LLMs can produce more diverse, contextually rich, and human-like textual data, often with less need for domain-specific rules or statistical modeling.\n", "
\n", "Additional Real-World Examples
\n", "