{ "cells": [ { "cell_type": "markdown", "id": "6fcff7e8de11a4b7", "metadata": {}, "source": [ "# Extracting from PDF\n", "\n", "This recipe demonstrates how to leverage Large Language Models (LLMs) -- specifically the OpenAI API -- to extract pages and content from PDF files. We'll cover a single PDF document as well as multiple PDF files, and also use OCR to extract text from scanned documents.\n", "\n", "
Mirascope Concepts Used
\n", "Background
\n", "\n", "Prior to LLMs, extracting pages from PDF files was a time-consuming and expensive task. Natural Language Processing (NLP) would be used to identify and categorize information in text specific to the PDF document or, worse yet, the work would be done manually by humans. This had to happen for every new PDF with new categories, which is not scalable. LLMs can understand context and are versatile enough to handle diverse data beyond PDFs, such as Word documents, PowerPoint presentations, emails, and more.\n", "
\n", "Additional Real-World Examples
\n", "ResumeInfo
schema to a format for a Customer Relationship Management (CRM) tool.