Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP), aiming to identify and classify entities in text into predefined categories such as names of persons, organizations, locations, dates, and more. In this blog, we’ll explore how to build an end-to-end NER pipeline using Python, BERT, and OpenSearch.
We will:
- Understand the basics of NER and why it matters.
- Train and/or fine-tune a BERT-based NER model in Python.
- Use OpenSearch ingest pipelines to preprocess and enrich data.
- Query the enriched data in OpenSearch.
1. What is Named Entity Recognition?
Named Entity Recognition (NER) identifies entities like names, dates, and places in unstructured text. Examples:
- Input: “Barack Obama was born in Hawaii.”
- Output:
[("Barack Obama", "PERSON"), ("Hawaii", "LOCATION")]
NER is widely used in applications like:
- Information retrieval: Extracting structured information from unstructured data.
- Search engines: Boosting relevance in search queries.
- Content tagging: Automatically tagging articles.
2. Setting Up the Environment
Install the necessary Python libraries:
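A minimal set of packages for this walkthrough (names as published on PyPI; add `datasets` or `seqeval` if you want ready-made corpora and evaluation):

```shell
pip install transformers torch opensearch-py
```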
Also, ensure that OpenSearch is installed and running locally or in your cloud environment.
3. Fine-tuning a BERT Model for NER
We’ll use the transformers library to fine-tune a BERT model for NER. Below is a code snippet to prepare and fine-tune the model.
Dataset Preparation
Prepare a sample dataset in the BIO (Begin-Inside-Outside) tagging format:
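In this two-column format, each line holds one token and its tag, with a blank line separating sentences. For example:

```
Barack B-PER
Obama I-PER
was O
born O
in O
Hawaii B-LOC
. O
```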
Save this data as train.txt and dev.txt.
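Wait — the anchor list above governs inserts; skip this placeholder.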
Python Code: Fine-Tuning a BERT NER Model
4. Ingesting Data into OpenSearch
Create an OpenSearch Index
First, create an OpenSearch index to store enriched data.
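One possible mapping stores the raw text alongside a `nested` list of extracted entities so each entity's text and label can be queried together. Index and field names here are illustrative; the client call assumes an unsecured local dev cluster:

```python
index_name = "ner-documents"
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "entities": {
                # nested keeps each entity's text/label pair intact
                # so queries can match them jointly.
                "type": "nested",
                "properties": {
                    "text": {"type": "keyword"},
                    "label": {"type": "keyword"},
                },
            },
        }
    },
}

def create_index(host="localhost", port=9200):
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": host, "port": port}])
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=index_body)
    return client
```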
Ingest Pipeline for Entity Extraction
We use a custom Python script to process the data and extract entities. After extracting entities, upload them to OpenSearch.
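A sketch of that script: `to_entities` reshapes the output of a Hugging Face token-classification pipeline (with an aggregation strategy, so subwords are already merged) into the index's entity format, and `enrich_and_upload` runs the model and indexes each document. It assumes a model saved locally as `ner-bert` (any NER checkpoint would do) and a local cluster:

```python
def to_entities(pipeline_output):
    """Convert aggregated token-classification output into
    the {"text", "label"} dicts our index mapping expects."""
    return [{"text": e["word"], "label": e["entity_group"]}
            for e in pipeline_output]

def enrich_and_upload(texts, host="localhost", port=9200):
    from opensearchpy import OpenSearch
    from transformers import pipeline

    ner = pipeline("token-classification", model="ner-bert",
                   aggregation_strategy="simple")
    client = OpenSearch(hosts=[{"host": host, "port": port}])
    for text in texts:
        client.index(index="ner-documents",
                     body={"text": text, "entities": to_entities(ner(text))})

# enrich_and_upload(["Barack Obama was born in Hawaii."])
```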
5. Querying Enriched Data
With the enriched data, you can now query OpenSearch for specific entities.
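Because entities are stored as a `nested` field, a nested query can match an entity's text and label together. A sketch (entity values are examples):

```python
# Find documents containing the PERSON entity "Barack Obama".
query = {
    "query": {
        "nested": {
            "path": "entities",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"entities.label": "PER"}},
                        {"term": {"entities.text": "Barack Obama"}},
                    ]
                }
            },
        }
    }
}

def search(host="localhost", port=9200):
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": host, "port": port}])
    response = client.search(index="ner-documents", body=query)
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```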
6. Automating the Workflow with Ingest Pipelines
Create an Ingest Pipeline
Define an OpenSearch ingest pipeline for processing and enriching documents:
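Note that OpenSearch has no built-in NER processor, so the pipeline below only prepares documents (an ingest timestamp plus an empty `entities` field that the external NER script fills in); the processors shown are standard `set` processors, and the pipeline id matches the one used when indexing:

```python
pipeline_body = {
    "description": "Prepare documents for NER enrichment",
    "processors": [
        # Record when the document was ingested.
        {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}},
        # Placeholder for entities added later by the external NER script.
        {"set": {"field": "entities", "value": []}},
    ],
}

def register_pipeline(host="localhost", port=9200):
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": host, "port": port}])
    client.ingest.put_pipeline(id="ner_pipeline", body=pipeline_body)
```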
Ingest Data
doc = {"text": "Barack Obama was born in Hawaii."}
client.index(index=index_name, body=doc, pipeline="ner_pipeline")
7. Datasets Available Online for NER Tasks
When building Named Entity Recognition (NER) models, having access to high-quality labeled datasets is crucial. Below are some widely used datasets, their features, and their links. These datasets can be used for training and evaluating NER systems.
1. CoNLL-2003
The CoNLL-2003 dataset is one of the most popular datasets for NER tasks. It was created for the Conference on Computational Natural Language Learning (CoNLL) shared task in 2003.
- Entities: PER (Person), LOC (Location), ORG (Organization), and MISC (Miscellaneous).
- Format: BIO tagging format.
- Languages: English and German.
- Description: Contains newswire text with manually annotated entities.
- Link: CoNLL-2003 Dataset
2. OntoNotes 5.0
The OntoNotes 5.0 dataset is a large-scale corpus with multiple types of annotations, including NER.
- Entities: PERSON, ORG, GPE (Geopolitical Entity), DATE, TIME, PERCENT, MONEY, etc.
- Format: BIO tagging.
- Languages: English, Chinese, Arabic.
- Description: A broad-coverage corpus with text from multiple sources such as news articles, broadcast conversations, and more.
- Link: Available via LDC (Linguistic Data Consortium) (requires a license).
3. WNUT 2017
The WNUT 2017 dataset focuses on emerging and rare entities, making it a valuable resource for training NER systems in dynamic contexts.
- Entities: Emerging named entities such as corporations, creative works, and products.
- Format: BIO tagging format.
- Languages: English.
- Description: Designed for NER systems that handle entities not present in traditional training data.
- Link: WNUT 2017 Dataset
4. MIT Movie Corpus
The MIT Movie Corpus is a domain-specific dataset designed for NER in movie-related text.
- Entities: MOVIE, GENRE, ACTOR, DIRECTOR, etc.
- Format: BIO tagging format.
- Languages: English.
- Description: Useful for domain-specific NER tasks in entertainment or recommendation systems.
- Link: MIT Movie Corpus
5. GUM (Georgetown University Multilayer Corpus)
The GUM corpus is a multilayer annotated dataset that includes NER annotations.
- Entities: Includes standard categories like PER, LOC, ORG, and others.
- Format: BIO tagging format.
- Languages: English.
- Description: Covers a variety of text types such as interviews, travel guides, and scientific articles.
- Link: GUM Corpus
6. WikiANN (PAN-X Dataset)
The WikiANN (PAN-X) dataset provides NER annotations for 282 languages, making it suitable for multilingual NER tasks.
- Entities: PER, LOC, ORG.
- Format: BIO tagging format.
- Languages: Multilingual (282 languages).
- Description: Generated using Wikipedia’s structured information and is a good resource for low-resource language NER.
- Link: WikiANN on Hugging Face
7. FinNER
FinNER is a domain-specific dataset for NER tasks in the financial domain.
- Entities: FINANCIAL PRODUCT, ORG, PER, and others relevant to the financial sector.
- Format: BIO tagging.
- Languages: English.
- Description: Useful for NER systems designed for finance-related applications like news and market analysis.
- Link: Check repositories like Hugging Face or financial NLP research papers for availability.
8. I2B2 (Healthcare and Clinical NER)
The I2B2 datasets are widely used for NER in the healthcare and clinical domain.
- Entities: Disease, Medication, Test, Treatment, etc.
- Format: BIO tagging.
- Languages: English.
- Description: Extracted from de-identified clinical records, useful for healthcare-specific NLP tasks.
- Link: Requires a license, available through research collaborations. Check the I2B2 Website.
9. Kaggle NER Datasets
Kaggle hosts multiple NER datasets contributed by the community.
- Entities: Vary depending on the dataset.
- Languages: Primarily English.
- Description: Diverse datasets, including customer support tickets, product descriptions, and more.
- Link: Kaggle NER Datasets
10. Twitter NER
A dataset focused on extracting named entities from social media text, particularly Twitter.
- Entities: Custom entities relevant to social media, such as USER, HASHTAG, URL.
- Languages: English.
- Description: Handles noisy and informal text found in tweets.
- Link: GitHub – Twitter NER Dataset
11. Choosing the Right Dataset
The choice of dataset depends on the application:
- General-purpose NER: CoNLL-2003, OntoNotes.
- Domain-specific NER: MIT Movie Corpus, FinNER, I2B2.
- Multilingual NER: WikiANN.
- Social Media: Twitter NER, WNUT 2017.
Having a good understanding of your use case will help you select and prepare the right dataset for training your NER models.
8. Using Large Language Models (LLMs) to Create an NER Dataset
Large Language Models (LLMs) like OpenAI's GPT, Google's PaLM, or Google's T5 (available through Hugging Face) have revolutionized NLP tasks. These models can be leveraged to generate synthetic NER datasets or annotate raw text efficiently, especially when high-quality labeled datasets are unavailable.
Here, we discuss how to use LLMs for creating an NER dataset and walk through a step-by-step guide.
Why Use LLMs for Dataset Creation?
- Cost Efficiency: Manual annotation is expensive and time-consuming. LLMs can automate the initial annotation process.
- Scalability: LLMs can process large text corpora and create datasets in minutes.
- Low-Resource Domains: For niche or low-resource domains, LLMs can bootstrap annotations with minimal human involvement.
- Custom Labels: LLMs can adapt to specific entities (e.g., product names, medical terms).
How LLMs Can Help Create NER Datasets
1. Generating Annotated Text
LLMs can generate synthetic text along with corresponding entity annotations. For example:
- Prompt: “Generate a sentence mentioning a person, a location, and an organization with NER tags.”
- Output:
"Barack Obama visited Google in California." -> [("Barack Obama", "PERSON"), ("Google", "ORG"), ("California", "LOCATION")]
2. Annotating Raw Text
LLMs can label entities in pre-existing raw text. For example:
- Input: “Tim Cook announced the new iPhone in Cupertino.”
- Output:
[("Tim Cook", "PERSON"), ("iPhone", "PRODUCT"), ("Cupertino", "LOCATION")]
3. Domain-Specific Annotation
LLMs fine-tuned for specific domains can identify specialized entities (e.g., DISEASE, GENE, DRUG in biomedical texts).
Steps to Use LLMs for NER Dataset Creation
Step 1: Choose an LLM
Popular LLMs for dataset creation include:
- GPT-4: Best for few-shot or zero-shot tasks.
- T5: Ideal for tasks requiring sequence-to-sequence learning.
- FLAN-T5: Fine-tuned for instruction following.
- LLaMA: Open-source LLM that can be adapted for specific tasks.
Step 2: Define Entity Types
Clearly define the entity categories you want to extract. For example:
- General: PERSON, ORG, LOC.
- Domain-specific: PRODUCT, DISEASE, GENE.
Step 3: Prompt Engineering
Design effective prompts for LLMs to generate or annotate text. Below are examples:
Prompt for Generating Text with Annotations
Generate a sentence mentioning a PERSON, an ORGANIZATION, and a LOCATION. Provide BIO-style annotations.
Prompt for Annotating Raw Text
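For annotation, a prompt along these lines works well (wording and labels are illustrative; asking for a machine-readable format simplifies parsing):

```
Annotate the following sentence with named entities. Respond only with a
JSON array of objects like {"text": ..., "label": ...}, using the labels
PERSON, ORG, and LOCATION.

Sentence: "Tim Cook announced the new iPhone in Cupertino."
```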
Step 4: Implement the Pipeline
Below is a Python code snippet using OpenAI’s GPT model to annotate text:
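A sketch of that script using the OpenAI Python SDK. It assumes `OPENAI_API_KEY` is set in the environment; the model name is illustrative. `parse_entities` is kept separate and defensive, since LLM replies are not always valid JSON:

```python
import json

PROMPT_TEMPLATE = (
    "Extract named entities from the sentence below. Respond only with a "
    'JSON array of objects like {{"text": ..., "label": ...}}, using the '
    "labels PERSON, ORG, and LOCATION.\n\nSentence: {sentence}"
)

def parse_entities(reply):
    """Parse the model's JSON reply into (text, label) tuples,
    skipping malformed items rather than failing."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    return [(i["text"], i["label"]) for i in items
            if isinstance(i, dict) and "text" in i and "label" in i]

def annotate(sentence):
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
    )
    return parse_entities(resp.choices[0].message.content)

# annotate("Tim Cook announced the new iPhone in Cupertino.")
```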
Step 5: Validate and Clean the Data
The output generated by LLMs may contain errors. Perform the following:
- Human-in-the-Loop: Validate a subset of annotations manually.
- Automated Validation: Check for consistency in entity types and formats.
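The automated checks can be as simple as verifying that every annotated span actually occurs in its sentence and that every label belongs to the agreed set. A minimal sketch (the label set is an assumption):

```python
ALLOWED_LABELS = {"PERSON", "ORG", "LOCATION"}

def validate(sentence, entities):
    """Return a list of problems in one annotated example: spans that do
    not occur in the sentence, and labels outside the allowed set."""
    problems = []
    for text, label in entities:
        if text not in sentence:
            problems.append(f"span not found: {text!r}")
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label!r}")
    return problems
```

Examples whose problem list is non-empty can be routed to the human-in-the-loop queue instead of being dropped silently.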
Fine-Tuning LLMs for Better NER Performance
LLMs can be fine-tuned on your specific dataset for improved accuracy:
- Collect a small manually annotated dataset in BIO format.
- Fine-tune models like T5 or BERT using frameworks such as Hugging Face Transformers.
Advantages and Challenges
Advantages
- Rapid dataset creation.
- Adaptability to specific domains and languages.
- Handles low-resource scenarios effectively.
Challenges
- Errors in complex or ambiguous text.
- Potential biases in LLMs.
- Over-reliance on generated datasets without human validation.
Practical Example: Generating a Small Dataset
Below is an example of generating a small dataset using GPT:
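A sketch of the final step: converting LLM-annotated examples into a BIO-tagged training file. `to_bio` uses naive whitespace tokenization (with light punctuation stripping when matching spans), which is good enough for a quick bootstrap dataset; the sample sentence and `annotate`-style entity lists are illustrative:

```python
def to_bio(sentence, entities):
    """Convert (sentence, [(entity_text, label), ...]) into a list of
    (token, tag) pairs using whitespace tokenization."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for text, label in entities:
        ent_tokens = text.split()
        for i in range(len(tokens) - len(ent_tokens) + 1):
            window = [t.strip(".,") for t in tokens[i:i + len(ent_tokens)]]
            if window == ent_tokens:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(ent_tokens)):
                    tags[j] = f"I-{label}"
                break
    return list(zip(tokens, tags))

def write_bio(examples, path):
    """Write (sentence, entities) pairs to a BIO file, one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence, entities in examples:
            for token, tag in to_bio(sentence, entities):
                f.write(f"{token} {tag}\n")
            f.write("\n")

examples = [("Barack Obama visited Google in California.",
             [("Barack Obama", "PERSON"), ("Google", "ORG"),
              ("California", "LOCATION")])]
# write_bio(examples, "llm_ner_dataset.txt")
```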
Scaling the Process
For larger datasets:
- Use batch processing with APIs.
- Store outputs in structured formats like JSON or CSV.
- Automate error correction with custom scripts.
Summary
Using LLMs to create NER datasets is a powerful way to bootstrap labeled data, especially in scenarios with limited resources or specific requirements. By carefully engineering prompts, validating results, and fine-tuning models, you can create high-quality NER datasets for diverse applications. This process empowers researchers and practitioners to develop accurate NER systems without the bottleneck of manual annotation.
9. Conclusion
This blog outlined an end-to-end NER pipeline using Python, OpenSearch, and a BERT-based model. We covered:
- Training a BERT model for NER.
- Creating an OpenSearch index with mappings for entity data.
- Enriching data with OpenSearch ingest pipelines.
- Querying enriched data.
With this workflow, you can scale your NER tasks for real-world applications like search engines, chatbots, or content management.
