Docling

Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types.
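
As an illustration of what a hierarchical model of this kind can look like, the following is a minimal sketch in plain Python. It is not Docling's actual data model: the Node class, its fields, and the outline helper are hypothetical names chosen for this example. Each node records an element kind, optional text, an optional bounding box for spatial position, and child nodes for semantic nesting.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

# Hypothetical names throughout; this is not Docling's actual data model.
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom) in page units


@dataclass
class Node:
    kind: str                    # e.g. "document", "section", "text", "table"
    text: str = ""
    bbox: Optional[BBox] = None  # spatial position on the page, if known
    children: List["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in pre-order, i.e. reading order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def outline(root: Node) -> str:
    """Render the tree as an indented outline, one node per line."""
    return "\n".join(f"{'  ' * d}{n.kind}: {n.text}".rstrip() for d, n in root.walk())


doc = Node("document", children=[
    Node("section", "Results", children=[
        Node("text", "Accuracy improved.", bbox=(72.0, 100.0, 540.0, 120.0)),
        Node("table", "Table 1", bbox=(72.0, 130.0, 540.0, 300.0)),
    ]),
])

print(outline(doc))
```

Because every input format would be normalized into the same tree, a single traversal such as walk can serve all document types.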

The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures.
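
The schema-driven idea can be sketched with standard-library dataclasses. This is an assumption-laden illustration, not the library's extraction API: the Invoice template, the SchemaError exception, and the extract and coerce helpers are all hypothetical. The sketch takes raw strings, as they might come out of document regions, and either returns a strongly-typed object or fails validation with the name of the offending field.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


@dataclass
class Invoice:
    """Hypothetical template describing the fields to extract."""
    number: str
    total: float
    paid: bool


class SchemaError(ValueError):
    """Raised when extracted content does not fit the template."""


_BOOLS = {"true": True, "yes": True, "false": False, "no": False}


def coerce(text: str, target: type) -> Any:
    """Convert a raw string to the target type, raising ValueError on failure."""
    if target is bool:
        try:
            return _BOOLS[text.lower()]
        except KeyError:
            raise ValueError(f"not a boolean: {text!r}") from None
    return target(text)


def extract(schema: Type[T], raw: Dict[str, str]) -> T:
    """Validate raw region text against the schema and build a typed object."""
    values: Dict[str, Any] = {}
    for f in fields(schema):
        if f.name not in raw:
            raise SchemaError(f"missing field: {f.name}")
        try:
            values[f.name] = coerce(raw[f.name].strip(), f.type)
        except ValueError as exc:
            raise SchemaError(f"invalid value for {f.name}: {exc}") from exc
    return schema(**values)


invoice = extract(Invoice, {"number": " INV-7 ", "total": "19.50", "paid": "yes"})
print(invoice)
```

Failing loudly on a missing or malformed field, rather than silently returning a partial object, is the design choice that makes the typed result trustworthy for downstream code.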

The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It uses optional dependencies so that specific features, such as media rendering or advanced processing capabilities, can be installed only when an application requires them.
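
One common way to support optional features in Python is to check at runtime whether the module a feature needs is importable. The sketch below shows that general pattern only; the feature names and the module mapping are hypothetical and do not reflect the project's real extras. It relies on the standard-library function importlib.util.find_spec.

```python
import importlib.util
from typing import Dict, List


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


# Hypothetical mapping from feature name to the module that feature needs.
OPTIONAL_FEATURES: Dict[str, str] = {
    "json_export": "json",                           # stdlib, always present
    "imaginary_ocr": "docling_sketch_missing_module",  # deliberately absent
}


def available_features(features: Dict[str, str]) -> List[str]:
    """List the features whose backing module is installed."""
    return sorted(name for name, module in features.items() if has_module(module))


print(available_features(OPTIONAL_FEATURES))
```

A caller can use such a check to skip or disable a feature with a clear message instead of failing with an import error deep inside a processing run.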

Features

Structured Data Extraction - The library identifies and pulls specific information from documents by applying user-defined templates or schemas to extract the exact fields and data types required.
Document Format Converters - Transforms diverse file types and web content into a unified, machine-readable format for downstream data processing and analysis.
Document Layout Analyzers - A computer vision and text processing suite that maps the spatial relationships between text, tables, and images within digital documents.
Hierarchical Document Models - The library organizes text, tables, images, and layout information into a unified hierarchical tree structure that preserves the spatial and semantic relationships between document elements.
Schema-Driven Extractors - Maps unstructured document regions to strongly-typed objects by validating extracted content against predefined structural templates and data models.
Document Parsing Pipelines - A processing engine that transforms unstructured files and web content into a unified, hierarchical data model for downstream analysis.
Document Layout Analysis - Parses the hierarchical structure of documents to identify text, tables, and images and to relate them to one another.
Intermediate Representations - Normalizes diverse file formats into a consistent internal model to enable uniform processing across different input sources.
Structured Data Extractors - A tool that identifies and pulls specific information from complex document layouts based on predefined schemas and data types.
Hierarchical Document Models - Organizes heterogeneous document elements into a unified data structure that preserves spatial relationships and semantic document hierarchy.
Schema-Based Data Validation - The library validates extracted information against defined schemas and maps results into strongly-typed objects to ensure data accuracy and simplify programmatic access.
Document Conversion Toolkits - A command-line and programmatic interface for converting diverse file formats into standardized, machine-readable representations for automated workflows.
Conversion Engines - The library provides a document conversion engine that transforms diverse file formats and web addresses into structured models via both programmatic and command-line interfaces.
Document Transformation Pipelines - Processes raw input files through a sequence of modular stages to extract, normalize, and structure document content.
Automated Document Processing - Integrates document parsing into software pipelines so that larger application workflows can handle and analyze data without manual intervention.
Processing Backends - Uses a modular architecture to dynamically load specialized engines for optical character recognition and complex visual layout analysis.
Declarative Configuration Schemas - Allows users to define extraction parameters and processing rules through external configuration files to control the document parsing behavior.
Extraction Configurations - The library lets users specify the input types and file formats to process, so that documents such as PDFs or images are handled according to custom requirements.
Automated Workflow Integration - The library enables integration with automated agents and server-based architectures, allowing document processing tasks to be embedded directly into complex application workflows.
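
Several of the features above (modular pipeline stages, pluggable processing backends, and declarative configuration) can be pictured together in a small sketch. Everything here is hypothetical: the registries, the backend and stage names, and the JSON configuration keys are assumptions made for illustration, and a simple UTF-8 decoder stands in for a real engine such as OCR. The configuration names a backend and an ordered list of stages, and the pipeline looks both up by name at run time.

```python
import json
from typing import Callable, Dict

# Hypothetical registries; real backends would wrap OCR or layout engines.
BACKENDS: Dict[str, Callable[[bytes], dict]] = {}
STAGES: Dict[str, Callable[[dict], dict]] = {}


def register(registry: dict, name: str):
    """Decorator that stores a function in a registry under the given name."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(BACKENDS, "utf8_text")
def utf8_text(data: bytes) -> dict:
    """Stand-in backend: decode raw bytes into a document dictionary."""
    return {"text": data.decode("utf-8")}


@register(STAGES, "normalize_whitespace")
def normalize_whitespace(doc: dict) -> dict:
    """Collapse runs of spaces within each line."""
    lines = [" ".join(line.split()) for line in doc["text"].splitlines()]
    return {**doc, "text": "\n".join(lines)}


@register(STAGES, "split_paragraphs")
def split_paragraphs(doc: dict) -> dict:
    """Split the text on blank lines into a list of paragraphs."""
    return {**doc, "paragraphs": [p for p in doc["text"].split("\n\n") if p]}


def run_pipeline(config_text: str, data: bytes) -> dict:
    """Load a declarative config, pick the backend, then apply stages in order."""
    config = json.loads(config_text)
    doc = BACKENDS[config["backend"]](data)
    for stage_name in config["stages"]:
        doc = STAGES[stage_name](doc)
    return doc


CONFIG = '{"backend": "utf8_text", "stages": ["normalize_whitespace", "split_paragraphs"]}'
result = run_pipeline(CONFIG, b"Title   line\n\nBody    text  here\n")
print(result["paragraphs"])
```

Keeping the choice of backend and stages in configuration rather than in code means the same pipeline function can be driven from a script, a command-line tool, or a server process by changing only the configuration text.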