docling

mirror of https://github.com/DS4SD/docling.git synced 2025-07-31 22:44:27 +00:00

Get your documents ready for gen AI




Additional Features:

Integrated PaddleOCR - For improved OCR capabilities.

To learn more about the original repository, refer to its README and documentation:
Docling GitHub Repo, Docling Documentation

PaddleOCR Usage - Demo:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PaddleOcrOptions,  # added by this fork; assumed to live alongside the other OCR option classes
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, ImageFormatOption, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_table_structure=True, generate_page_images=True, images_scale=2.0)
# Build the table options in one step so the ACCURATE mode is not overwritten by a later assignment
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,
    mode=TableFormerMode.ACCURATE,  # use the more accurate TableFormer model
)
pipeline_options.ocr_options = PaddleOcrOptions(lang="en")

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)
result = doc_converter.convert("sample_file.pdf")
print(result.document.export_to_markdown())
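
The demo above only prints the converted Markdown. To keep the output, one option is to write it to a file named after the input document. The following is a minimal sketch using only the standard library; the helper name `save_markdown` and the `output` directory are illustrative assumptions, not part of Docling or this fork.

```python
from pathlib import Path


def save_markdown(markdown_text: str, source_path: str, out_dir: str = "output") -> Path:
    """Write Markdown text to <out_dir>/<source file stem>.md and return the path.

    Hypothetical helper for illustration; not part of the Docling API.
    """
    target_dir = Path(out_dir)
    # Create the output directory (and any parents) if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    # Reuse the input file name, swapping its extension for .md
    target = target_dir / (Path(source_path).stem + ".md")
    target.write_text(markdown_text, encoding="utf-8")
    return target
```

With the converter from the demo, this could be called as `save_markdown(result.document.export_to_markdown(), "sample_file.pdf")`, which would write `output/sample_file.md`.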

License

The Docling codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

IBM ❤️ Open Source AI

Docling has been brought to you by IBM.