Get your documents ready for gen AI

Additional Features:

  • Integrated PaddleOCR - For improved OCR capabilities.

To learn more about the original repository, refer to its README and documentation:
Docling GitHub Repo | Docling Documentation

PaddleOCR Usage - Demo:

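from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, ImageFormatOption, PdfFormatOption
# PaddleOcrOptions is added by this fork; its import path is assumed to match the other pipeline option classes
from docling.datamodel.pipeline_options import PaddleOcrOptions, PdfPipelineOptions, TableFormerMode, TableStructureOptions

# Enable table structure recognition and generate page images at 2x scale
pipeline_options = PdfPipelineOptions(do_table_structure=True, generate_page_images=True, images_scale=2.0)
pipeline_options.table_structure_options = TableStructureOptions(do_cell_matching=True)
# Set the mode after assigning the options object; otherwise the assignment above would overwrite it
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
# Use PaddleOCR as the OCR engine
pipeline_options.ocr_options = PaddleOcrOptions(lang="en")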

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)
result = doc_converter.convert("sample_file.pdf")
print(result.document.export_to_markdown())
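
The demo above prints the Markdown to the console. To write the output to disk for several inputs instead, a small helper along the following lines may be useful. This is only a minimal sketch and is not part of Docling or this fork: `save_markdown` and its parameter names are hypothetical, and the conversion step is passed in as a callable so that the helper does not depend on how the converter is configured.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def save_markdown(
    sources: Iterable[str],
    to_markdown: Callable[[str], str],
    out_dir: str,
) -> List[Path]:
    """Convert each source with `to_markdown` and write one .md file per input.

    Hypothetical helper for illustration only; not part of Docling.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for source in sources:
        # Name the output after the input file, e.g. sample_file.pdf -> sample_file.md.
        # Inputs that share a file stem will overwrite each other.
        out_path = target / (Path(source).stem + ".md")
        out_path.write_text(to_markdown(source), encoding="utf-8")
        written.append(out_path)
    return written
```

With the converter from the demo, the callable could be `lambda path: doc_converter.convert(path).document.export_to_markdown()`, passed together with a list such as `["sample_file.pdf"]` and an output directory of your choice.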

License

The Docling codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

IBM ❤️ Open Source AI

Docling has been brought to you by IBM.