Named Entity Recognition

This guide demonstrates techniques to perform Named Entity Recognition (NER) using Large Language Models (LLMs) with various levels of nested entity recognition. We'll use Groq's llama-3.1-8b-instant model, but you can adapt this approach to other models with similar capabilities.

Mirascope Concepts Used

Background

Named Entity Recognition is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as person names, organizations, locations, etc. LLMs have revolutionized NER by enabling more context-aware and hierarchical entity recognition, going beyond traditional rule-based or statistical methods.

LLMs are not trained specifically for NER

It's worth noting that some models are trained specifically for NER (such as GLiNER). These models are often much smaller and cheaper, and they can often get better results on the right tasks. LLMs should generally be reserved for quick-and-dirty NER prototyping or for tasks that require a more nuanced, open-ended, language-based approach. For example, an NER system that accepts user input to guide extraction may be easier to build with an LLM than with a traditionally trained NER-specific model.

Setup

To set up our environment, first let's install all of the packages we will use:

!pip install "mirascope[groq]" pytest
!pip install ipytest # For running pytest in Jupyter Notebooks

import os

os.environ["GROQ_API_KEY"] = "YOUR_API_KEY"
# Set the appropriate API key for the provider you're using

Simple NER

We'll implement NER with different levels of complexity: simple and nested entity recognition. Let's start with the simple version:

from __future__ import annotations  # noqa: F404

import textwrap

from mirascope.core import groq, prompt_template
from pydantic import BaseModel, Field

unstructured_text = """
Apple Inc., the tech giant founded by Steve Jobs and Steve Wozniak, recently announced a partnership with OpenAI, the artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. This collaboration aims to enhance Siri, Apple's virtual assistant, which competes with Amazon's Alexa and Google Assistant, a product of Alphabet Inc.'s Google division. The joint project will be led by Apple's AI chief John Giannandrea, a former Google executive, and will take place at Apple Park, the company's headquarters in Cupertino, California.
"""


class SimpleEntity(BaseModel):
    entity: str = Field(description="The entity found in the text")
    label: str = Field(
        description="The label of the entity (e.g., PERSON, ORGANIZATION, LOCATION)"
    )


@groq.call(
    model="llama-3.1-8b-instant",
    response_model=list[SimpleEntity],
    json_mode=True,
    call_params={"temperature": 0.0},
)
def simple_ner(text: str) -> str:
    return f"Extract the entities from this text: {text}"


print("Simple NER Results:")
simple_result = simple_ner(unstructured_text)
for entity in simple_result:
    print(f"Entity: {entity.entity}, Label: {entity.label}")

Simple NER Results:
Entity: Apple Inc., Label: ORGANIZATION
Entity: Steve Jobs, Label: PERSON
Entity: Steve Wozniak, Label: PERSON
Entity: OpenAI, Label: ORGANIZATION
Entity: OpenAI LP, Label: ORGANIZATION
Entity: OpenAI Inc., Label: ORGANIZATION
Entity: Amazon, Label: ORGANIZATION
Entity: Google, Label: ORGANIZATION
Entity: Alphabet Inc., Label: ORGANIZATION
Entity: John Giannandrea, Label: PERSON
Entity: Apple Park, Label: LOCATION
Entity: Cupertino, Label: LOCATION
Entity: California, Label: LOCATION

In this example, we're extracting entities that have just the entity's text and label. However, entities often have relationships that are worth extracting and understanding.
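Before moving on to relationships, the flat results can already be organized by label. The following is a minimal sketch: `Entity` is a plain dataclass standing in for `SimpleEntity` so that it runs without an API call, and `group_by_label` is a hypothetical helper, not part of Mirascope.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Stand-in for SimpleEntity (same field names); an assumption for this sketch."""

    entity: str
    label: str


def group_by_label(entities):
    """Map each label to its unique entity strings, in first-seen order."""
    grouped = defaultdict(list)
    for item in entities:
        if item.entity not in grouped[item.label]:
            grouped[item.label].append(item.entity)
    return dict(grouped)


# Hand-written sample data, not real model output.
sample = [
    Entity("Apple Inc.", "ORGANIZATION"),
    Entity("Steve Jobs", "PERSON"),
    Entity("Cupertino", "LOCATION"),
    Entity("Steve Jobs", "PERSON"),
]
grouped = group_by_label(sample)
print(grouped)
```

Because the helper only reads the `entity` and `label` attributes, the list returned by `simple_ner` should work in place of `sample`.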

Nested NER

Now, let's implement a more sophisticated version that can handle nested entities:

class NestedEntity(BaseModel):
    entity: str = Field(description="The entity found in the text")
    label: str = Field(
        description="The label of the entity (e.g., PERSON, ORGANIZATION, LOCATION)"
    )
    parent: str | None = Field(
        description="The parent entity if this entity is nested within another entity",
        default=None,
    )
    children: list[NestedEntity] = Field(
        default_factory=list, description="Nested entities within this entity"
    )


@groq.call(
    model="llama-3.1-8b-instant",
    response_model=list[NestedEntity],
    json_mode=True,
    call_params={"temperature": 0.0},
)
@prompt_template(
    """
    Identify all named entities in the following text, including deeply nested entities. 
    For each entity, provide its label and any nested entities within it.

    Guidelines:
    1. Identify entities of types PERSON, ORGANIZATION, LOCATION, and any other relevant types.
    2. Capture hierarchical relationships between entities.
    3. Include all relevant information, even if it requires deep nesting.
    4. Be thorough and consider all possible entities and their relationships.

    Example:
    Text: "John Smith, the CEO of Tech Innovations Inc., a subsidiary of Global Corp, announced a new product at their headquarters in Silicon Valley."
    Entities:
    - Entity: "John Smith", Label: "PERSON", Parent: None
      Children:
        - Entity: "Tech Innovations Inc.", Label: "ORGANIZATION", Parent: "John Smith"
          Children:
            - Entity: "Global Corp", Label: "ORGANIZATION", Parent: "Tech Innovations Inc."
    - Entity: "Silicon Valley", Label: "LOCATION", Parent: None

    Now, analyze the following text: {text}
    """
)
def nested_ner(text: str): ...


print("\nNested NER Results:")
improved_result = nested_ner(unstructured_text)


def print_nested_entities(entities, level=0):
    for entity in entities:
        indent = "  " * level
        entity_info = (
            f"Entity: {entity.entity}, Label: {entity.label}, Parent: {entity.parent}"
        )
        print(textwrap.indent(entity_info, indent))
        if entity.children:
            print_nested_entities(entity.children, level + 1)


print_nested_entities(improved_result)

Nested NER Results:
Entity: Steve Jobs, Label: PERSON, Parent: None
  Entity: Apple Inc., Label: ORGANIZATION, Parent: Steve Jobs
    Entity: Steve Wozniak, Label: PERSON, Parent: Apple Inc.
    Entity: Apple Park, Label: LOCATION, Parent: Apple Inc.
    Entity: Cupertino, Label: LOCATION, Parent: Apple Park
    Entity: California, Label: LOCATION, Parent: Cupertino
Entity: Steve Wozniak, Label: PERSON, Parent: None
  Entity: Apple Inc., Label: ORGANIZATION, Parent: Steve Wozniak
Entity: Apple Inc., Label: ORGANIZATION, Parent: None
  Entity: John Giannandrea, Label: PERSON, Parent: Apple Inc.
  Entity: Apple Park, Label: LOCATION, Parent: Apple Inc.
  Entity: Cupertino, Label: LOCATION, Parent: Apple Park
  Entity: California, Label: LOCATION, Parent: Cupertino
  Entity: OpenAI, Label: ORGANIZATION, Parent: Apple Inc.
    Entity: OpenAI LP, Label: ORGANIZATION, Parent: OpenAI
    Entity: OpenAI Inc., Label: ORGANIZATION, Parent: OpenAI
Entity: John Giannandrea, Label: PERSON, Parent: None
  Entity: Apple Inc., Label: ORGANIZATION, Parent: John Giannandrea
Entity: Apple Park, Label: LOCATION, Parent: None
  Entity: Cupertino, Label: LOCATION, Parent: Apple Park
  Entity: California, Label: LOCATION, Parent: Cupertino
Entity: Cupertino, Label: LOCATION, Parent: None
  Entity: California, Label: LOCATION, Parent: Cupertino
Entity: California, Label: LOCATION, Parent: None
Entity: OpenAI, Label: ORGANIZATION, Parent: None
  Entity: OpenAI LP, Label: ORGANIZATION, Parent: OpenAI
  Entity: OpenAI Inc., Label: ORGANIZATION, Parent: OpenAI
Entity: OpenAI LP, Label: ORGANIZATION, Parent: None
Entity: OpenAI Inc., Label: ORGANIZATION, Parent: None
Entity: Amazon, Label: ORGANIZATION, Parent: None
  Entity: Alexa, Label: PRODUCT, Parent: Amazon
Entity: Alexa, Label: PRODUCT, Parent: None
Entity: Google, Label: ORGANIZATION, Parent: None
  Entity: Google Assistant, Label: PRODUCT, Parent: Google
  Entity: Alphabet Inc., Label: ORGANIZATION, Parent: Google
Entity: Google Assistant, Label: PRODUCT, Parent: None
Entity: Alphabet Inc., Label: ORGANIZATION, Parent: None
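The results above repeat many entities, both at the top level and as children of other entities. One way to make such output easier to consume is to flatten it into a list of unique parent-child pairs. The sketch below is one possible approach, not part of the original recipe: `Node` is a dataclass standing in for `NestedEntity`, `collect_edges` is a hypothetical helper, and it assumes the model-reported `parent` field is the relationship you want to keep.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Stand-in for NestedEntity (same field names); an assumption for this sketch."""

    entity: str
    label: str
    parent: Optional[str] = None
    children: list = field(default_factory=list)


def collect_edges(nodes, edges=None):
    """Walk the tree depth-first and return unique (parent, entity) pairs.

    Uses the reported `parent` field rather than the position in the tree.
    """
    if edges is None:
        edges = []
    for node in nodes:
        if node.parent is not None:
            edge = (node.parent, node.entity)
            if edge not in edges:
                edges.append(edge)
        collect_edges(node.children, edges)
    return edges


# Hand-written sample mirroring the duplication seen in the output above.
tree = [
    Node(
        "OpenAI",
        "ORGANIZATION",
        children=[
            Node("OpenAI LP", "ORGANIZATION", parent="OpenAI"),
            Node("OpenAI Inc.", "ORGANIZATION", parent="OpenAI"),
        ],
    ),
    Node(
        "Apple Inc.",
        "ORGANIZATION",
        children=[
            Node(
                "OpenAI",
                "ORGANIZATION",
                parent="Apple Inc.",
                children=[Node("OpenAI LP", "ORGANIZATION", parent="OpenAI")],
            )
        ],
    ),
]
edges = collect_edges(tree)
for parent, child in edges:
    print(f"{parent} -> {child}")
```

The repeated `OpenAI -> OpenAI LP` pair is reported only once. Whether a given pair reflects a real relationship still depends on the model's output, so the edge list is best treated as a starting point for review.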

Testing

To ensure robustness, it's crucial to test the NER system with diverse scenarios. Here's a function to run multiple test cases:

import ipytest  # noqa: E402
import pytest  # noqa: E402

ipytest.autoconfig()


test_cases = [
    (
        """
    The multinational conglomerate Alphabet Inc., parent company of Google, has acquired 
    DeepMind, a leading AI research laboratory based in London. DeepMind's founder, 
    Demis Hassabis, will join Google Brain, a division of Google AI, as Chief AI Scientist. 
    This move strengthens Alphabet's position in the AI field, challenging competitors like 
    OpenAI, which is backed by Microsoft, and Facebook AI Research, a part of Meta Platforms Inc.
        """,
        [
            NestedEntity(
                entity="Alphabet Inc.",
                label="ORGANIZATION",
                parent=None,
                children=[
                    NestedEntity(
                        entity="Google",
                        label="ORGANIZATION",
                        parent="Alphabet Inc.",
                        children=[
                            NestedEntity(
                                entity="Google Brain",
                                label="ORGANIZATION",
                                parent="Google",
                                children=[],
                            ),
                            NestedEntity(
                                entity="Google AI",
                                label="ORGANIZATION",
                                parent="Google",
                                children=[
                                    NestedEntity(
                                        entity="Google Brain",
                                        label="ORGANIZATION",
                                        parent="Google AI",
                                        children=[],
                                    )
                                ],
                            ),
                        ],
                    ),
                    NestedEntity(
                        entity="DeepMind",
                        label="ORGANIZATION",
                        parent="Alphabet Inc.",
                        children=[
                            NestedEntity(
                                entity="Demis Hassabis",
                                label="PERSON",
                                parent="DeepMind",
                                children=[],
                            )
                        ],
                    ),
                ],
            ),
            NestedEntity(entity="London", label="LOCATION", parent=None, children=[]),
            NestedEntity(
                entity="Demis Hassabis", label="PERSON", parent=None, children=[]
            ),
            NestedEntity(
                entity="OpenAI",
                label="ORGANIZATION",
                parent=None,
                children=[
                    NestedEntity(
                        entity="Microsoft",
                        label="ORGANIZATION",
                        parent="OpenAI",
                        children=[],
                    )
                ],
            ),
            NestedEntity(
                entity="Facebook AI Research",
                label="ORGANIZATION",
                parent=None,
                children=[
                    NestedEntity(
                        entity="Meta Platforms Inc.",
                        label="ORGANIZATION",
                        parent="Facebook AI Research",
                        children=[],
                    )
                ],
            ),
            NestedEntity(
                entity="Meta Platforms Inc.",
                label="ORGANIZATION",
                parent=None,
                children=[],
            ),
            NestedEntity(
                entity="Microsoft", label="ORGANIZATION", parent=None, children=[]
            ),
        ],
    ),
]


@pytest.mark.parametrize("text,expected_output", test_cases)
def test_nested_ner(text: str, expected_output: list[NestedEntity]):
    output = nested_ner(text)
    assert len(output) == len(expected_output)
    for entity, expected_entity in zip(output, expected_output, strict=False):
        assert entity.model_dump() == expected_entity.model_dump()


ipytest.run()  # Run the tests in Jupyter Notebook

It's important to test any system heavily before you put it into practice. The above example demonstrates how to test such a function (nested_ner in this case), but it shows only a single input/output pair for brevity.

We strongly encourage you to write far more robust tests in your applications, with many more test cases. This is why our example uses @pytest.mark.parametrize, which makes it easy to add test cases.
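Exact equality on the full nested structure can be stricter than needed, since a model may return the same entities in a different order or nesting. As a complementary check, the sketch below scores a prediction by precision and recall over (entity, label) pairs. It is an illustration under assumptions: `Node` stands in for `NestedEntity`, and `flatten_pairs` and `precision_recall` are hypothetical helpers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for NestedEntity (only the fields this sketch needs)."""

    entity: str
    label: str
    children: list = field(default_factory=list)


def flatten_pairs(entities):
    """Collect (entity, label) pairs from a nested result, ignoring position."""
    pairs = set()
    for item in entities:
        pairs.add((item.entity, item.label))
        pairs |= flatten_pairs(item.children)
    return pairs


def precision_recall(predicted, expected):
    """Compare two nested results as sets of (entity, label) pairs."""
    pred, gold = flatten_pairs(predicted), flatten_pairs(expected)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hand-written sample data, not real model output.
expected = [
    Node("Alphabet Inc.", "ORGANIZATION", children=[Node("Google", "ORGANIZATION")]),
    Node("London", "LOCATION"),
]
predicted = [
    Node("London", "LOCATION"),
    Node("Google", "ORGANIZATION"),
    Node("DeepMind", "PERSON"),  # wrong label and not in the expected set
]
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A test could then assert that both scores stay above a threshold you choose, rather than requiring an exact structural match. This check ignores hierarchy entirely, so it complements the structural comparison rather than replacing it.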

Further Improvements

This Named Entity Recognition (NER) system leverages the power of LLMs to perform context-aware, hierarchical entity extraction with various levels of nesting. It can identify complex relationships between entities, making it suitable for a wide range of applications.

Additional Real-World Applications

Information Extraction: Extracting structured information from unstructured text data.
Question Answering: Identifying entities relevant to a given question.
Document Summarization: Summarizing documents by extracting key entities and relationships.
Sentiment Analysis: Analyzing sentiment towards specific entities or topics.

When adapting this recipe to your specific use-case, consider the following:

Prompt customization to guide the model towards specific entity types or relationships.
Fine-tuning the model on domain-specific data for better accuracy in particular fields.
Implementing a confidence score for each identified entity.
Integrating with a knowledge base to enhance entity disambiguation.
Developing a post-processing step to refine and validate the LLM's output.
Exploring ways to optimize performance for real-time applications.
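As an illustration of the post-processing idea, the sketch below checks whether each extracted entity actually appears in the source text and separates out those that do not. It is a minimal example under assumptions: `split_grounded` is a hypothetical helper, and a case-insensitive substring match is only one possible validation rule.

```python
def split_grounded(entity_names, source_text):
    """Separate names that appear verbatim in the text from those that do not.

    Matching is a case-insensitive substring check, which is a simplifying
    assumption; it will not catch paraphrases or partial names.
    """
    haystack = source_text.casefold()
    grounded, ungrounded = [], []
    for name in entity_names:
        target = grounded if name.casefold() in haystack else ungrounded
        target.append(name)
    return grounded, ungrounded


# Hand-written sample data, not real model output.
text = (
    "The joint project will be led by Apple's AI chief John Giannandrea "
    "and will take place at Apple Park."
)
names = ["John Giannandrea", "Apple Park", "Tim Cook"]
grounded, ungrounded = split_grounded(names, text)
print("Grounded:", grounded)
print("Ungrounded:", ungrounded)
```

Names that land in the ungrounded list can be dropped or flagged for review, depending on how strict your application needs to be.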

By leveraging the power of LLMs and the flexibility of the Mirascope library, you can create sophisticated NER systems that go beyond traditional approaches, enabling more nuanced and context-aware entity recognition for various applications.