Docling

Features

  • Structured Data Extraction: The library identifies and pulls specific information from documents by applying user-defined templates or schemas to extract the exact fields and data types required.
  • Document Format Converters: Transforms diverse file types and web content into a unified, machine-readable format for downstream data processing and analysis.
  • Document Layout Analyzers: A computer vision and text processing suite that maps the spatial relationships between text, tables, and images within digital documents.
  • Hierarchical Document Models: The library organizes text, tables, images, and layout information into a unified hierarchical tree structure that preserves the spatial and semantic relationships between document elements.
  • Schema-Driven Extractors: Maps unstructured document regions to strongly-typed objects by validating extracted content against predefined structural templates and data models.
  • Document Parsing Pipelines: A processing engine that transforms unstructured files and web content into a unified, hierarchical data model for downstream analysis.
  • Document Layout Analysis: Parses the hierarchical structure of documents to accurately identify text, tables, and images and relate them to one another for content understanding.
  • Intermediate Representations: Normalizes diverse file formats into a consistent internal model to enable uniform processing across different input sources.
  • Structured Data Extractors: A tool that identifies and pulls specific information from complex document layouts based on predefined schemas and data types.
  • Hierarchical Document Models: Organizes heterogeneous document elements into a unified data structure that preserves spatial relationships and semantic document hierarchy.
  • Schema-Based Data Validation: The library validates extracted information against defined schemas and maps results into strongly-typed objects to ensure data accuracy and simplify programmatic access.
  • Document Conversion Toolkits: A command-line and programmatic interface for converting diverse file formats into standardized, machine-readable representations for automated workflows.
  • Conversion Engines: The library provides a document conversion engine that transforms diverse file formats and web addresses into structured models via both programmatic and command-line interfaces.
  • Document Transformation Pipelines: Processes raw input files through a sequence of modular stages to extract, normalize, and structure document content.
  • Automated Document Processing: Integrates document parsing capabilities into software pipelines so that larger application workflows can handle and analyze data autonomously.
  • Processing Backends: Uses a modular architecture to dynamically load specialized engines for optical character recognition and complex visual layout analysis.
  • Declarative Configuration Schemas: Allows users to define extraction parameters and processing rules through external configuration files to control the document parsing behavior.
  • Extraction Configurations: The library allows users to define specific input types and file formats to ensure that documents like PDFs or images are processed according to custom requirements.
  • Automated Workflow Integration: The library enables integration with automated agents and server-based architectures, allowing document processing tasks to be embedded directly into complex application workflows.
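
The two ideas that recur in the list above, a hierarchical document model and schema-driven extraction into typed objects, can be illustrated with a small standalone sketch. This is not Docling's API: `Node`, `Invoice`, and `extract_invoice` are hypothetical names invented for the example, and the "key: value" text convention is an assumption made only to keep it short.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """One element of a hypothetical document tree (not a Docling class)."""
    kind: str                      # e.g. "document", "section", "text"
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node, then every descendant, in reading order."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Invoice:
    """Hypothetical target schema; field names and types are illustrative."""
    number: str
    total: float


def extract_invoice(root: Node) -> Invoice:
    """Collect 'key: value' text nodes and map them onto the Invoice schema."""
    found: dict[str, str] = {}
    for node in root.walk():
        if node.kind == "text" and ":" in node.text:
            key, value = node.text.split(":", 1)
            found[key.strip().lower()] = value.strip()

    # Validation step: every schema field must be present before typing it.
    missing = {"number", "total"} - found.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Invoice(number=found["number"], total=float(found["total"]))


# A tiny tree standing in for a parsed document.
doc = Node("document", children=[
    Node("section", "Header", [Node("text", "Number: INV-7")]),
    Node("section", "Summary", [Node("text", "Total: 129.50")]),
])

invoice = extract_invoice(doc)
print(invoice)  # Invoice(number='INV-7', total=129.5)
```

The tree keeps the nesting of sections and text, while the extractor either returns a fully typed `Invoice` or fails with a `ValueError` naming the missing fields, which is the general shape of validating extracted content against a predefined schema.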