NLP & DOCUMENT AI

Text & Document
Annotation Services

Named entity recognition, relation extraction, document parsing, text classification, and OCR correction — with domain expertise across medical, legal, financial, and technical documents in 30+ languages. Every annotation meets measurable F1 and IAA benchmarks.

2M+
Documents Processed
F1 ≥ 0.93
Entity-Level Accuracy
30+
Languages Supported
κ ≥ 0.85
Inter-Annotator Agreement
15+
Document Types
L1→L2→L3
QA Pipeline
Capabilities

Six Core NLP Annotation Services

Each capability includes specific accuracy benchmarks, throughput rates, and output format options. All configurable to your schema and training pipeline.

Capability 1

Named Entity Recognition (NER)

Custom entity extraction with nested, overlapping, and discontinuous span support

Precision entity extraction across unlimited custom types — persons, organizations, medical terms, financial instruments, legal clauses, product attributes, and domain-specific terminology. Support for nested entities, overlapping spans, discontinuous entities, and cross-sentence coreference chains.

TECHNICAL DETAILS
  • Custom entity schemas: unlimited types with hierarchical parent-child relationships
  • Overlapping spans: same text span tagged with multiple entity types
  • Cross-sentence coreference: “Dr. Smith… she… the physician” linked across paragraphs
  • Nested entities: “Bank of America” (ORG) with the nested LOC “America”
  • Discontinuous entities: “left and right ventricle” → two linked annotations
  • Domain taxonomies: ICD-10, SNOMED CT, MeSH, NAICS, legal citation standards
PERFORMANCE
F1
Entity-level F1 ≥ 0.93 (exact span match)
Agreement
Cohen’s κ ≥ 0.85
Throughput
500–2K entities/annotator/day
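
The F1 benchmark above uses exact span match: an entity is credited only when its start offset, end offset, and type all agree with gold. A minimal sketch of that scoring, with illustrative entity tuples:

```python
def entity_f1(gold, pred):
    """Entity-level F1 with exact span match: an entity counts as correct
    only if its (start, end, type) triple matches the gold set exactly."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 9, "PERSON"), (24, 31, "ORG"), (45, 52, "DRUG")]
pred = [(0, 9, "PERSON"), (24, 31, "ORG"), (45, 50, "DRUG")]  # partial span: no credit
print(round(entity_f1(gold, pred), 2))  # 0.67
```

Running the same comparison per entity type (rather than pooled) is how per-type benchmarks are typically enforced.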
Capability 2

Relation Extraction & Knowledge Graphs

Entity-to-entity relationships for knowledge graph construction

Annotate directed relationships between entities — drug-disease interactions, cause-effect chains, organizational hierarchies, contractual obligations, and scientific claims. Build the structured knowledge graphs that power question answering, recommendation engines, and decision support systems.

TECHNICAL DETAILS
  • Directed relation annotation: subject → predicate → object triples
  • Evidence span marking: text supporting each relation linked to annotation
  • Cross-document relations: entities and relations linked across document sets
  • Relation types: causal, temporal, hierarchical, associative, negated
  • Confidence scoring: annotator confidence (high/medium/low) per relation
  • Knowledge graph-ready output: RDF triples, Neo4j import format, custom schema
PERFORMANCE
F1
Relation F1 ≥ 0.85
Agreement
κ ≥ 0.80
Throughput
200–500 relations/annotator/day
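
The subject → predicate → object triples described above serialize directly to graph-import formats. A minimal sketch of RDF N-Triples output — the `example.org` namespace and entity naming are placeholders, not a fixed schema:

```python
def to_ntriples(relations, base="http://example.org/kg/"):
    """Serialize annotated (subject, predicate, object) relations as
    N-Triples lines for knowledge-graph import. URIs under `base` are
    illustrative; a real project would supply its own namespace/schema."""
    lines = []
    for subj, pred, obj in relations:
        s = f"<{base}{subj.replace(' ', '_')}>"
        p = f"<{base}rel/{pred}>"
        o = f"<{base}{obj.replace(' ', '_')}>"
        lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)

triples = [("warfarin", "interacts_with", "aspirin"),
           ("aspirin", "treats", "headache")]
print(to_ntriples(triples))
```

Evidence spans and confidence scores would ride alongside each triple as reified metadata or as extra columns in a Neo4j import CSV.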
Capability 3

OCR Correction & Document Parsing

Post-OCR validation with layout-aware field extraction

Post-OCR quality assurance, field extraction validation, and layout-aware document parsing for scanned forms, invoices, receipts, contracts, handwritten documents, and historical archives. We correct OCR errors, validate field extractions, and annotate document structure for training document AI models.

TECHNICAL DETAILS
  • OCR error correction: character-level correction with error type classification
  • Table annotation: row/column structure, cell content, header detection, spanning cells
  • Handwriting recognition validation: word-level and character-level correction
  • Field extraction validation: key-value pairs verified against source document
  • Document structure: heading hierarchy, paragraph boundaries, list detection
  • Historical documents: degraded scan handling, archaic typography, multi-language
PERFORMANCE
Accuracy
≥ 99% character accuracy post-correction
Throughput
100–300 pages/annotator/day
Field Accuracy
≥ 97% field extraction accuracy
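
One common way to verify the ≥ 99% character-accuracy target is edit distance between the corrected text and a ground-truth transcription of sampled pages. A self-contained sketch:

```python
def char_accuracy(reference, ocr_output):
    """Character accuracy = 1 - (Levenshtein distance / reference length),
    computed with a standard two-row dynamic-programming table."""
    m, n = len(reference), len(ocr_output)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == ocr_output[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return 1.0 - prev[n] / m if m else 1.0

print(char_accuracy("Invoice total: $1,240.00", "Invoice total: $l,240.OO"))
```

The same measure applied at word level gives the coarser word-accuracy figure often quoted for handwriting validation.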

Capability 4

Text Classification & Sentiment Analysis

Multi-granularity classification across custom taxonomies

Document-level, paragraph-level, and sentence-level classification across sentiment polarity, topic categorization, intent detection, urgency scoring, and custom taxonomies. Support for hierarchical labels with configurable confidence thresholds and inter-annotator agreement enforcement.

TECHNICAL DETAILS
  • Granularity: document-level, paragraph-level, sentence-level, aspect-level
  • Multi-label: unlimited tags per unit with independent confidence scores
  • Intent + slot labeling: for conversational AI and chatbot training
  • Sentiment: 3-point (pos/neg/neutral), 5-point, or continuous scale (0.0–1.0)
  • Hierarchical taxonomy: parent-child class relationships with inheritance
  • Aspect-based sentiment: entity-specific sentiment within the same document
PERFORMANCE
Accuracy
Classification accuracy ≥ 95%
Agreement
κ ≥ 0.85
Throughput
1K–5K documents/annotator/day
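
The κ ≥ 0.85 agreement figure is Cohen's kappa: observed agreement between two annotators, corrected for the agreement expected by chance given each annotator's label distribution. A minimal sketch with illustrative sentiment labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    chance = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if chance == 1.0:
        return 1.0
    return (observed - chance) / (1 - chance)

a = ["pos", "pos", "neg", "neutral", "pos", "neg"]
b = ["pos", "neg", "neg", "neutral", "pos", "neg"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```

For more than two annotators or for span-valued data, Krippendorff's α (mentioned under Quality below) generalizes the same chance-corrected idea.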
Capability 5

Document Understanding & Key-Value Extraction

Layout-aware annotation for intelligent document processing

Key-value pair extraction, form field mapping, table structure annotation, section classification, and relationship mapping across PDFs, scanned images, and structured formats. We annotate the spatial and semantic structure that document AI models need to understand complex real-world documents.

TECHNICAL DETAILS
  • Key-value extraction: field name → field value with bounding region linking
  • Section classification: headers, paragraphs, lists, footnotes, signatures, stamps
  • Form field types: text, checkbox, radio, date, signature, handwritten entries
  • Table structure: cell detection, row/column headers, spanning cells, nested tables
  • Multi-page linking: cross-page references, continuation markers, page-level metadata
  • Spatial relationships: above/below/left-of/right-of/contains between regions
PERFORMANCE
Accuracy
Field extraction F1 ≥ 0.95
Throughput
50–150 documents/annotator/day
Table Accuracy
Cell detection F1 ≥ 0.92
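
The spatial relationships listed above (above/below/left-of/right-of/contains) reduce to simple comparisons on bounding-box coordinates. A sketch with an origin at the top-left of the page — the precedence order and the example boxes are illustrative assumptions:

```python
def spatial_relation(a, b):
    """Coarse spatial relation from region a to region b, each a bounding
    box (x0, y0, x1, y1) with y increasing downward. Checks containment
    first, then vertical separation, then horizontal separation."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if ax0 <= bx0 and ay0 <= by0 and ax1 >= bx1 and ay1 >= by1:
        return "contains"
    if ay1 <= by0:
        return "above"
    if by1 <= ay0:
        return "below"
    if ax1 <= bx0:
        return "left-of"
    if bx1 <= ax0:
        return "right-of"
    return "overlaps"

label_box = (40, 100, 120, 115)   # hypothetical "Invoice No.:" region
value_box = (130, 100, 210, 115)  # hypothetical "INV-2024-…" region
print(spatial_relation(label_box, value_box))  # left-of
```

Key-to-value linking in form annotation typically pairs a label region with the nearest value region to its right or directly below, using exactly this kind of predicate.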
Capability 6

Multilingual & Cross-Lingual Annotation

30+ languages with native-speaker annotators and linguistic expertise

NER, classification, sentiment, and document parsing across 30+ languages — each with native-speaker annotators who understand linguistic nuances, cultural context, and domain-specific terminology in their language. Cross-lingual alignment annotation for multilingual model training.

TECHNICAL DETAILS
  • 30+ languages: European, Asian, Middle Eastern, African language families
  • Script-specific handling: CJK segmentation, Arabic RTL, Devanagari conjuncts
  • Code-switching annotation: mixed-language text with language-span tagging
  • Native-speaker annotators: no machine translation, no non-native bilingual substitutes
  • Cross-lingual entity alignment: same entities linked across language versions
  • Dialect awareness: regional variations, formal/informal register differences
PERFORMANCE
Languages
30+ with native speakers
Cross Lingual
Entity alignment across versions
Throughput
Varies by script complexity
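
Code-switching annotation with language-span tagging, mentioned above, comes down to segmenting mixed-language text into contiguous spans with character offsets and a language code each. A hypothetical record shape (the JSONL field names and example sentence are illustrative, not a fixed schema):

```python
# Hypothetical code-switching record: Spanish/English mixed text segmented
# into language spans with character offsets into the source string.
record = {
    "text": "Por favor revisa el pull request antes del standup.",
    "spans": [
        {"start": 0,  "end": 19, "lang": "es"},  # "Por favor revisa el"
        {"start": 20, "end": 32, "lang": "en"},  # "pull request"
        {"start": 33, "end": 42, "lang": "es"},  # "antes del"
        {"start": 43, "end": 50, "lang": "en"},  # "standup"
    ],
}

def span_texts(record):
    """Materialize (substring, lang) pairs; offsets must index the text."""
    return [(record["text"][s["start"]:s["end"]], s["lang"])
            for s in record["spans"]]

for text, lang in span_texts(record):
    print(f"{lang}: {text!r}")
```

Validating that every span's offsets slice out the intended substring is the same offset check applied to entity spans elsewhere in the pipeline.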
Domains

Domain-Specific NLP Expertise

Our annotators are trained on industry-specific terminology, taxonomies, and edge cases — not generic crowdworkers labeling random text.

Medical & Clinical NLP

Clinical note parsing, medical NER (drugs, symptoms, procedures, diagnoses), ICD-10/CPT code mapping, clinical trial document annotation, and HIPAA-compliant de-identification of protected health information.

  • ICD-10 / SNOMED CT / MeSH taxonomy
  • PHI de-identification (99.7%+ recall)
  • Clinical note structure parsing
  • Drug-drug interaction annotation

Legal & Compliance NLP

Contract clause extraction, legal entity recognition, obligation/risk identification, regulatory document parsing, case citation linking, and confidentiality-aware annotation with NDA coverage.

  • Contract clause taxonomy (50+ types)
  • Obligation vs. right classification
  • Citation extraction and linking
  • Confidentiality-compliant workflows

Financial Services NLP

Financial entity extraction, earnings call analysis, SEC/EDGAR filing parsing, KYC document processing, transaction classification, and sentiment analysis on financial news and analyst reports.

  • Financial entity taxonomy
  • SEC filing structure parsing
  • Transaction categorization
  • Market sentiment (bullish/bearish/neutral)

E-Commerce & Retail NLP

Product attribute extraction from listings, review sentiment analysis (aspect-level), catalog classification with SKU mapping, search query intent detection, and customer support ticket routing.

  • Product taxonomy alignment
  • Aspect-based sentiment per feature
  • Intent/slot for search queries
  • Multi-language product descriptions

Technical Documentation

API documentation parsing, code comment extraction, technical spec annotation, knowledge base structuring, and developer documentation classification for AI-powered developer tools.

  • Code-text boundary detection
  • API parameter extraction
  • Error message classification
  • Multi-language code samples

Content Moderation & Safety

Toxicity detection, hate speech classification, misinformation labeling, policy violation flagging, and age-appropriateness rating across user-generated content in 20+ languages.

  • 40+ harm categories
  • Severity scoring (0–5 scale)
  • Context-aware moderation
  • Multi-language coverage
FORMATS

Supported Input & Output Formats

We ingest any text format and deliver in your training pipeline's preferred schema.

Input Formats

PDF (native + scanned)
DOCX / DOC
Plain Text / TXT
HTML / XML
CSV / TSV
JSON / JSONL
Scanned Images
Handwritten Docs
Emails (EML/MSG)
Chat Logs

Output Formats

CoNLL / CoNLL-U
IOB2 / BIOES
spaCy JSON
Prodigy JSONL
BRAT Standoff
Hugging Face
RDF Triples
CSV / Parquet
Custom JSON Schema
Neo4j Import
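
Of the output formats above, IOB2 is the most compact for token-level NER: each token gets `B-TYPE` (begin), `I-TYPE` (inside), or `O` (outside). A minimal conversion sketch — note that IOB2 encodes one flat layer only, so nested or overlapping entities need multiple tag layers or a standoff format such as BRAT:

```python
def to_iob2(tokens, entities):
    """Convert token-index entity spans to IOB2 tags.
    `entities` are (start_token, end_token_exclusive, type) tuples
    over an already-tokenized, non-overlapping annotation layer."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Dr.", "Smith", "joined", "Bank", "of", "America", "in", "2019", "."]
entities = [(0, 2, "PERSON"), (3, 6, "ORG"), (7, 8, "DATE")]
print(list(zip(tokens, to_iob2(tokens, entities))))
```

BIOES adds explicit `E-` (end) and `S-` (single-token) tags to the same scheme, which some sequence models train on more cleanly.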
Quality

NLP-Specific Quality Controls

Text annotation demands linguistic precision. Here's how we maintain consistency across annotators, languages, and document types.

Span Boundary Validation

Automated checks ensure entity spans are clean — no trailing whitespace, partial words, inconsistent boundary definitions, or invalid character offsets. Rule-based validation catches 80%+ of span errors before human review.

Automated span validation catches 80%+ of boundary errors pre-review
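
A rule-based validator of this kind is straightforward to sketch. The error codes are illustrative, and the word-boundary rule assumes a space-delimited language — CJK text needs a segmenter-based check instead:

```python
def validate_span(text, start, end):
    """Rule-based span checks mirroring the validations described above:
    offsets in range, no whitespace at boundaries, no partial words.
    Returns a list of error codes (empty list = clean span)."""
    if not (0 <= start < end <= len(text)):
        return ["invalid_offsets"]
    errors = []
    span = text[start:end]
    if span != span.strip():
        errors.append("whitespace_boundary")
    if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
        errors.append("partial_word_start")
    if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        errors.append("partial_word_end")
    return errors

text = "Patient was prescribed warfarin daily."
print(validate_span(text, 23, 31))  # "warfarin" -> []
print(validate_span(text, 23, 28))  # "warfa"    -> ['partial_word_end']
```

Running such checks on submission, before any human review, is what keeps boundary errors out of the adjudication queue.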

Taxonomy Consistency Enforcement

Entity types, relation labels, and classification categories are validated against the project taxonomy on every annotation. Out-of-schema labels are blocked automatically. Schema versioning tracks changes across guideline iterations.

Zero out-of-schema labels in delivered data

Cross-Document Entity Consistency

Same entities are labeled the same way across all documents — 'JPMorgan Chase', 'JP Morgan', and 'JPMC' all resolve to the same canonical entity. We track entity-level consistency scores across annotators and batches.

Entity consistency score ≥ 0.95 across document sets
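
Canonical-entity resolution of this kind is driven by a project alias dictionary. A minimal sketch — the alias map and the casefold-plus-strip-punctuation normalization are simplifying assumptions, not a production linker:

```python
def canonicalize(mention, alias_map):
    """Resolve surface-form variants to one canonical entity via a
    normalized lookup; unknown mentions pass through unchanged."""
    key = mention.casefold().replace(".", "").replace(",", "").strip()
    return alias_map.get(key, mention)

# Hypothetical project alias dictionary.
ALIASES = {
    "jpmorgan chase": "JPMorgan Chase & Co.",
    "jp morgan": "JPMorgan Chase & Co.",
    "jpmc": "JPMorgan Chase & Co.",
}

for mention in ["JPMorgan Chase", "JP Morgan", "JPMC"]:
    print(canonicalize(mention, ALIASES))  # all resolve to the same entity
```

The consistency score cited above is then the share of mentions whose annotated canonical entity matches the dictionary resolution, tracked per annotator and per batch.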

Inter-Annotator Agreement (IAA)

Token-level agreement scores using exact-match F1, Cohen's κ, and Krippendorff's α. Computed per entity type and relation type. Disagreements trigger L3 adjudication and targeted recalibration.

Cohen's κ ≥ 0.85 enforced per entity type per batch

Domain Vocabulary Validation

Custom dictionaries for medical, legal, financial, and technical terminology. Annotators are tested on domain-specific gold sets. Vocabulary updates are distributed to all annotators within 24 hours of approval.

Domain vocabulary test score ≥ 90% for production access

Automated Linguistic Checks

Rule-based validation for language-specific issues: tokenization boundary errors (CJK), script consistency (mixed scripts flagged), encoding issues (UTF-8 validation), and sentence boundary detection accuracy.

Language-specific validation rules per supported language
COMPARISON

UTL NLP Annotation vs. Typical Providers

Capability                                     | UTL Data Engine | Typical Providers
Nested & overlapping entity support            | Yes             | Flat entities only
Cross-document coreference resolution          | Yes             | No
Relation extraction for knowledge graphs       | Yes             | No
30+ languages with native-speaker annotators   | Yes             | 10–15 languages
Domain taxonomy integration (ICD-10, SNOMED)   | Yes             | Generic schemas
Per-entity-type IAA tracking (κ)               | Yes             | Aggregate only
Automated span boundary validation             | Yes             | Manual QA only
Cross-lingual entity alignment                 | Yes             | No
Layout-aware document parsing                  | Yes             | Text-only
HIPAA-compliant de-identification              | Yes             | Basic redaction
“We needed custom NER across 50+ medical entity types with nested span support and cross-document coreference. UTL's team understood the clinical domain from day one, delivered 98.5% F1 on our validation set, and maintained κ ≥ 0.87 across 15 annotators. Their automated span validation alone saved us 30% in review time.”
Data Science Director
Enterprise Health-Tech Platform
FAQS

NLP & Document AI Questions

Can you handle nested and overlapping entities?

Yes. We support complex NER schemas with nested entities (e.g., 'Bank of America' as ORG with the nested LOC 'America'), overlapping spans, discontinuous entities, and cross-sentence coreference chains. Our annotation tooling and QA validation are built specifically for these complex span types.

How do you handle domain-specific terminology?

We train annotators on your domain's specific taxonomy, terminology, and edge cases. This includes 20–40 hours of domain training, gold set calibration with domain-specific examples, and ongoing terminology dictionary updates. For medical NLP, we integrate ICD-10, SNOMED CT, and MeSH coding standards directly into the annotation workflow.

Can you process handwritten or degraded documents?

We provide OCR correction and validation for handwritten documents, including historical archives and degraded scans. Our annotators correct OCR errors at character level, validate field extractions, and annotate document structure. For severely degraded documents, we apply multi-pass review with confidence scoring.

Can your annotations feed a knowledge graph?

Yes. Our relation extraction annotation produces subject-predicate-object triples with evidence spans, confidence scores, and cross-document linking. Output formats include RDF triples, Neo4j import format, and custom graph schemas. We've built knowledge graphs with 100K+ entities and 500K+ relations for enterprise clients.

How do you maintain quality across 30+ languages?

Each language has native-speaker annotators with linguistic training, separate gold sets and calibration processes, and language-specific validation rules (CJK tokenization, Arabic RTL handling, etc.). We measure IAA independently per language and never mix language teams.

How quickly can a project launch?

Scoping + schema design: 3–5 days. Annotator training + calibration: 5–7 days. Pilot (1K–5K documents): 5–10 days. First delivered batch by Day 20. For domain-specific projects (medical, legal), add 3–5 days for domain training. We maintain pre-qualified teams across major domains for faster ramp-up.

Need Text & Document Annotation?

Let's discuss your text and document annotation pipeline — from task design to quality-assured delivery. We'll scope a pilot within 48 hours.