Named Entity Recognition (NER) with Python, OpenSearch, and BERT

Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP), aiming to identify and classify entities in text into predefined categories such as names of persons, organizations, locations, dates, and more. In this blog, we’ll explore how to build an end-to-end NER pipeline using Python, BERT, and OpenSearch.

We will:

  1. Understand the basics of NER and why it matters.
  2. Train and/or fine-tune a BERT-based NER model in Python.
  3. Use OpenSearch ingest pipelines to preprocess and enrich data.
  4. Query the enriched data in OpenSearch.

1. What is Named Entity Recognition?

Named Entity Recognition (NER) identifies entities like names, dates, and places in unstructured text. Examples:

  • Input: “Barack Obama was born in Hawaii.”
  • Output: [("Barack Obama", "PERSON"), ("Hawaii", "LOCATION")]

NER is widely used in applications like:

  • Information retrieval: Extracting structured information from unstructured data.
  • Search engines: Boosting relevance in search queries.
  • Content tagging: Automatically tagging articles.

2. Setting Up the Environment

Install the necessary Python libraries for fine-tuning the model and for indexing with OpenSearch. Also, ensure that OpenSearch itself is installed and running locally or in your cloud environment.
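A minimal set of packages, assuming the Hugging Face stack for modeling and the official opensearch-py client for indexing (exact versions will depend on your environment):

Shell

pip install transformers datasets torch opensearch-py
pip install openai   # optional, for the LLM-based dataset creation covered later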

3. Fine-tuning a BERT Model for NER

We’ll use the transformers library to fine-tune a BERT model for NER. Below is a code snippet to prepare and fine-tune the model.

Dataset Preparation

Prepare a sample dataset in the BIO (Begin-Inside-Outside) tagging format:

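A tiny illustrative sample (one token and its tag per line, with sentences separated by blank lines); a real training set would be much larger:

Plain Text

Barack B-PER
Obama I-PER
was O
born O
in O
Hawaii B-LOC
. O

Tim B-PER
Cook I-PER
works O
at O
Apple B-ORG
. O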

Save this data as train.txt and dev.txt.


Python Code: Fine-Tuning a BERT NER Model

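Below is a minimal sketch of one way to do this with Hugging Face transformers and datasets; the label set, file names, model checkpoint, and hyperparameters are illustrative and should be adapted to your data.

Python

from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Illustrative label set; it must match the tags used in train.txt / dev.txt.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

def read_bio(path):
    """Read a BIO-tagged file ("token TAG" per line, blank line between sentences)."""
    sentences, current = [], {"tokens": [], "ner_tags": []}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current["tokens"]:
                    sentences.append(current)
                    current = {"tokens": [], "ner_tags": []}
                continue
            token, tag = line.split()
            current["tokens"].append(token)
            current["ner_tags"].append(label2id[tag])
    if current["tokens"]:
        sentences.append(current)
    return Dataset.from_list(sentences)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def tokenize_and_align(example):
    """Tokenize pre-split words and align BIO labels with the resulting word pieces."""
    encoded = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    labels, previous_word = [], None
    for word_id in encoded.word_ids():
        if word_id is None:
            labels.append(-100)      # special tokens are ignored by the loss
        elif word_id != previous_word:
            labels.append(example["ner_tags"][word_id])
        else:
            labels.append(-100)      # label only the first sub-token of each word
        previous_word = word_id
    encoded["labels"] = labels
    return encoded

train_ds = read_bio("train.txt").map(tokenize_and_align, remove_columns=["tokens", "ner_tags"])
dev_ds = read_bio("dev.txt").map(tokenize_and_align, remove_columns=["tokens", "ner_tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-bert", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

# Save the model and tokenizer so the indexing step can load them with pipeline().
trainer.save_model("ner-bert")
tokenizer.save_pretrained("ner-bert")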


4. Ingesting Data into OpenSearch

Create an OpenSearch Index

First, create an OpenSearch index to store enriched data.

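One way to do this with the opensearch-py client; the connection settings, index name, and mapping below are placeholders for a local development cluster:

Python

from opensearchpy import OpenSearch

# Connection details are placeholders; adjust host, port, auth, and TLS for your cluster.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=False,
    verify_certs=False,
)

index_name = "ner-documents"
index_body = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "entities": {
                "type": "nested",
                "properties": {
                    "text": {"type": "keyword"},
                    "label": {"type": "keyword"},
                },
            },
        }
    }
}

# Create the index only if it does not already exist.
if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name, body=index_body)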

Ingest Pipeline for Entity Extraction

We use a custom Python script to run the fine-tuned model over each document, extract its entities, and upload the enriched documents to OpenSearch.

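A sketch of such a script, assuming the model directory saved in the fine-tuning step ("ner-bert") and the client and index_name defined above:

Python

from transformers import pipeline

# Load the fine-tuned model; "ner-bert" is the directory saved earlier.
ner = pipeline("token-classification", model="ner-bert", aggregation_strategy="simple")

documents = [
    "Barack Obama was born in Hawaii.",
    "Tim Cook announced the new iPhone in Cupertino.",
]

for i, text in enumerate(documents):
    # aggregation_strategy="simple" merges word pieces back into whole entities.
    entities = [
        {"text": ent["word"], "label": ent["entity_group"], "score": float(ent["score"])}
        for ent in ner(text)
    ]
    client.index(index=index_name, id=i, body={"text": text, "entities": entities})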

5. Querying Enriched Data

With the enriched data, you can now query OpenSearch for specific entities.

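For example, a nested query for documents that mention a specific person (the field names follow the mapping used above; your labels may differ):

Python

query = {
    "query": {
        "nested": {
            "path": "entities",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"entities.label": "PER"}},
                        {"term": {"entities.text": "Barack Obama"}},
                    ]
                }
            },
        }
    }
}

response = client.search(index=index_name, body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["text"], "->", hit["_source"]["entities"])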

6. Automating the Workflow with Ingest Pipelines

Create an Ingest Pipeline

Define an OpenSearch ingest pipeline for processing and enriching documents:

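The exact processors depend on your setup. As a sketch, the pipeline below only stamps each document with ingest metadata using the built-in set processor, with entity extraction still done client-side as in the script above; newer OpenSearch releases also offer ML inference processors that can move that step server-side.

Python

# A minimal pipeline definition; entity extraction is assumed to happen client-side
# before indexing, so the pipeline only records when each document was ingested.
pipeline_body = {
    "description": "Pipeline for NER-enriched documents",
    "processors": [
        {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}}
    ],
}

client.ingest.put_pipeline(id="ner_pipeline", body=pipeline_body)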

Ingest Data

Python

doc = {"text": "Barack Obama was born in Hawaii."}
client.index(index=index_name, body=doc, pipeline="ner_pipeline")

7. Datasets Available Online for NER Tasks

When building Named Entity Recognition (NER) models, having access to high-quality labeled datasets is crucial. Below are some widely used datasets, their features, and their links. These datasets can be used for training and evaluating NER systems.


1. CoNLL-2003

The CoNLL-2003 dataset is one of the most popular datasets for NER tasks. It was created for the Conference on Computational Natural Language Learning (CoNLL) shared task in 2003.

  • Entities: PER (Person), LOC (Location), ORG (Organization), and MISC (Miscellaneous).
  • Format: BIO tagging format.
  • Languages: English and German.
  • Description: Contains newswire text with manually annotated entities.
  • Link: CoNLL-2003 Dataset

2. OntoNotes 5.0

The OntoNotes 5.0 dataset is a large-scale corpus with multiple types of annotations, including NER.

  • Entities: PERSON, ORG, GPE (Geopolitical entity), DATE, TIME, PERCENT, MONEY, etc.
  • Format: BIO tagging.
  • Languages: English, Chinese, Arabic.
  • Description: A broad-coverage corpus with text from multiple sources such as news articles, broadcast conversations, and more.
  • Link: Available via the Linguistic Data Consortium (LDC); requires a license.

3. WNUT 2017

The WNUT 2017 dataset focuses on emerging and rare entities, making it a valuable resource for training NER systems in dynamic contexts.

  • Entities: Emerging named entities like corporations, creative works, and products.
  • Format: BIO tagging format.
  • Languages: English.
  • Description: Designed for NER systems that handle entities not present in traditional training data.
  • Link: WNUT 2017 Dataset

4. MIT Movie Corpus

The MIT Movie Corpus is a domain-specific dataset designed for NER in movie-related text.

  • Entities: MOVIE, GENRE, ACTOR, DIRECTOR, etc.
  • Format: BIO tagging format.
  • Languages: English.
  • Description: Useful for domain-specific NER tasks in entertainment or recommendation systems.
  • Link: MIT Movie Corpus

5. GUM (Georgetown University Multilayer Corpus)

The GUM corpus is a multilayer annotated dataset that includes NER annotations.

  • Entities: Includes standard categories like PER, LOC, ORG, and some others.
  • Format: BIO tagging format.
  • Languages: English.
  • Description: Covers a variety of text types such as interviews, travel guides, and scientific articles.
  • Link: GUM Corpus

6. WikiANN (Panx Dataset)

The WikiANN (Panx) dataset provides NER annotations for 282 languages, making it suitable for multilingual NER tasks.

  • Entities: PER, LOC, ORG.
  • Format: BIO tagging format.
  • Languages: Multilingual (282 languages).
  • Description: Generated using Wikipedia’s structured information and is a good resource for low-resource language NER.
  • Link: WikiANN on Hugging Face

7. FinNER

FinNER is a domain-specific dataset for NER tasks in the financial domain.

  • Entities: FINANCIAL PRODUCT, ORG, PER, and others relevant to the financial sector.
  • Format: BIO tagging.
  • Languages: English.
  • Description: Useful for NER systems designed for finance-related applications like news and market analysis.
  • Link: Check repositories like Hugging Face or financial NLP research papers for availability.

8. I2B2 (Healthcare and Clinical NER)

The I2B2 datasets are widely used for NER in the healthcare and clinical domain.

  • Entities: Disease, Medication, Test, Treatment, etc.
  • Format: BIO tagging.
  • Languages: English.
  • Description: Extracted from de-identified clinical records, useful for healthcare-specific NLP tasks.
  • Link: Requires a license, available through research collaborations. Check the I2B2 Website.

9. Kaggle NER Datasets

Kaggle hosts multiple NER datasets contributed by the community.

  • Entities: Vary depending on the dataset.
  • Languages: Primarily English.
  • Description: Diverse datasets, including customer support tickets, product descriptions, and more.
  • Link: Kaggle NER Datasets

10. Twitter NER

A dataset focused on extracting named entities from social media text, particularly Twitter.

  • Entities: Custom entities relevant to social media, such as USER, HASHTAG, URL.
  • Languages: English.
  • Description: Handles noisy and informal text found in tweets.
  • Link: GitHub – Twitter NER Dataset

Choosing the Right Dataset

The choice of dataset depends on the application:

  • General-purpose NER: CoNLL-2003, OntoNotes.
  • Domain-specific NER: MIT Movie Corpus, FinNER, I2B2.
  • Multilingual NER: WikiANN.
  • Social Media: Twitter NER, WNUT 2017.

Having a good understanding of your use case will help you select and prepare the right dataset for training your NER models.

8. Using Large Language Models (LLMs) to Create a NER Dataset

Large Language Models (LLMs) such as OpenAI’s GPT models, Google’s PaLM and T5, and Meta’s LLaMA have revolutionized NLP tasks. They can be leveraged to generate synthetic NER datasets or to annotate raw text efficiently, especially when high-quality labeled datasets are unavailable.

Here, we discuss how to use LLMs for creating a NER dataset and walk through a step-by-step guide.


Why Use LLMs for Dataset Creation?

  1. Cost Efficiency: Manual annotation is expensive and time-consuming. LLMs can automate the initial annotation process.
  2. Scalability: LLMs can process large text corpora and create datasets in minutes.
  3. Low-Resource Domains: For niche or low-resource domains, LLMs can bootstrap annotations with minimal human involvement.
  4. Custom Labels: LLMs can adapt to specific entities (e.g., product names, medical terms).

How LLMs Can Help Create NER Datasets

1. Generating Annotated Text

LLMs can generate synthetic text along with corresponding entity annotations. For example:

  • Prompt: “Generate a sentence mentioning a person, a location, and an organization with NER tags.”
  • Output: "Barack Obama visited Google in California." -> [("Barack Obama", "PERSON"), ("Google", "ORG"), ("California", "LOCATION")]

2. Annotating Raw Text

LLMs can label entities in pre-existing raw text. For example:

  • Input: “Tim Cook announced the new iPhone in Cupertino.”
  • Output: [("Tim Cook", "PERSON"), ("iPhone", "PRODUCT"), ("Cupertino", "LOCATION")]

3. Domain-Specific Annotation

LLMs fine-tuned for specific domains can identify specialized entities (e.g., DISEASE, GENE, DRUG in biomedical texts).


Steps to Use LLMs for NER Dataset Creation

Step 1: Choose an LLM

Popular LLMs for dataset creation include:

  • GPT-4: Best for few-shot or zero-shot tasks.
  • T5: Ideal for tasks requiring sequence-to-sequence learning.
  • FLAN-T5: Fine-tuned for instruction following.
  • LLaMA: Open-source LLM that can be adapted for specific tasks.

Step 2: Define Entity Types

Clearly define the entity categories you want to extract. For example:

  • General: PERSON, ORG, LOC.
  • Domain-specific: PRODUCT, DISEASE, GENE.

Step 3: Prompt Engineering

Design effective prompts for LLMs to generate or annotate text. Below are examples:

Prompt for Generating Text with Annotations
Plain Text

Generate a sentence mentioning a PERSON, an ORGANIZATION, and a LOCATION. Provide BIO-style annotations.

Prompt for Annotating Raw Text
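One possible wording; the sentence and label set are placeholders to swap in for your own data:

Plain Text

Annotate the following sentence with named entities using the labels PERSON, ORG, and LOCATION. Return the result as a list of (entity text, label) pairs.

Sentence: "Tim Cook announced the new iPhone in Cupertino."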

Step 4: Implement the Pipeline

Below is a Python code snippet using OpenAI’s GPT model to annotate text:

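A minimal sketch, assuming the openai Python package (v1 client), an OPENAI_API_KEY environment variable, and a chat model such as gpt-4o-mini; swap in whichever model you have access to:

Python

from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(text: str) -> str:
    """Ask the model to return (entity, label) pairs for one sentence."""
    prompt = (
        "Annotate the following sentence with named entities using the labels "
        "PERSON, ORG, and LOCATION. Return a list of (entity text, label) pairs.\n\n"
        f"Sentence: {text}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(annotate("Tim Cook announced the new iPhone in Cupertino."))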

Step 5: Validate and Clean the Data

The output generated by LLMs may contain errors. Perform the following:

  1. Human-in-the-Loop: Validate a subset of annotations manually.
  2. Automated Validation: Check for consistency in entity types and formats, as in the sketch below.
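A minimal automated check, assuming each record stores the sentence plus a list of (entity, label) pairs, as in the generation example later in this section:

Python

ALLOWED_LABELS = {"PERSON", "ORG", "LOCATION"}

def validate(record):
    """Return a list of problems found in one annotated record."""
    problems = []
    for entity, label in record["entities"]:
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label}")
        if entity not in record["sentence"]:
            problems.append(f"entity not found in sentence: {entity}")
    return problems

record = {
    "sentence": "Barack Obama visited Google in California.",
    "entities": [["Barack Obama", "PERSON"], ["Google", "ORG"], ["California", "LOCATION"]],
}
print(validate(record))  # an empty list means the record passed both checks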

Fine-Tuning LLMs for Better NER Performance

LLMs can be fine-tuned on your specific dataset for improved accuracy:

  1. Collect a small manually annotated dataset in BIO format.
  2. Fine-tune models like T5 or BERT using frameworks such as Hugging Face Transformers.

Advantages and Challenges

Advantages

  1. Rapid dataset creation.
  2. Adaptability to specific domains and languages.
  3. Handles low-resource scenarios effectively.

Challenges

  1. Errors in complex or ambiguous text.
  2. Potential biases in LLMs.
  3. Over-reliance on generated datasets without human validation.

Practical Example: Generating a Small Dataset

Below is an example of generating a small dataset using GPT:

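A sketch using the same client and model assumptions as the annotation snippet above; the prompt, record format, and output file name are illustrative:

Python

import json

from openai import OpenAI

llm = OpenAI()
prompt = (
    "Generate one sentence mentioning a PERSON, an ORGANIZATION, and a LOCATION. "
    'Return only a JSON object of the form {"sentence": ..., "entities": [[text, label], ...]}.'
)

records = []
for _ in range(5):  # five synthetic examples; scale this up as needed
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # a higher temperature encourages varied sentences
    )
    # The model may occasionally return malformed JSON; skip anything that fails to parse.
    try:
        records.append(json.loads(response.choices[0].message.content))
    except json.JSONDecodeError:
        continue

with open("synthetic_ner.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)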

Scaling the Process

For larger datasets:

  1. Use batch processing with APIs.
  2. Store outputs in structured formats like JSON or CSV.
  3. Automate error correction with custom scripts.

Using LLMs to create NER datasets is a powerful way to bootstrap labeled data, especially in scenarios with limited resources or specific requirements. By carefully engineering prompts, validating results, and fine-tuning models, you can create high-quality NER datasets for diverse applications. This process empowers researchers and practitioners to develop accurate NER systems without the bottleneck of manual annotation.

9. Conclusion

This blog outlined an end-to-end NER pipeline using Python, OpenSearch, and a BERT-based model. We covered:

  • Training a BERT model for NER.
  • Creating an OpenSearch index with mappings for entity data.
  • Enriching data with OpenSearch ingest pipelines.
  • Querying enriched data.

With this workflow, you can scale your NER tasks for real-world applications like search engines, chatbots, or content management.

