fix: Support for RTL programmatic documents
fix(parser): detect and handle rotated pages
fix(parser): fix bug causing duplicated text
fix(formula): improve stopping criteria
chore: update lock file
fix: temporary constrain beautifulsoup
* switch to code formula model v1.0.1 and new test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* switch to code formula model v1.0.1 and new test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* cleaned up the data folder in the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* switch to code formula model v1.0.1 and new test pdf
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* added three test-files for right-to-left
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix black
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* added new gt for test_e2e_conversion
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* added new gt for test_e2e_conversion
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* Add code to expose text direction of cell
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* new test file
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
* update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix mypy reports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix example filepaths
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add test data results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* pin wheel of latest docling-parse release
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use latest docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove debugging code
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix path to files in example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Revert unwanted RTL additions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix test data paths in examples
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Updated label parsing to use `str_to_int` with a default value to prevent potential conversion errors.
Signed-off-by: Vladimir Gurevich <vladimir@beaconcure.com>
Co-authored-by: Vladimir Gurevich <vladimir@beaconcure.com>
* fix(docx): merged cells not properly converted
Fix conversion issue of merged cells in Word tables leading to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add type hinting to docx backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Processing of placeholder shapes in pptx that have text but no bbox
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Support of table of content containers in docx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix for docx when headers are also lists, now recorded as appropriate headers and subheaders, unit test included
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
* Update docling/backend/msword_backend.py
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Fix for the crash when encountering WMF images in pptx and docx backends on non Windows platforms
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated faq
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Filetype library may not identify some files as PDF. Leverage the file extension
as a simple solution.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* fix: parse HTML files without body tag
Parse HTML files without 'body' tag, since it is optional in HTML5 specification.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: ensure docling converts HTML without body tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries
- Added `rec_keys_path` to `RapidOcrOptions` to align with RapidOCR's capability to use custom character dictionaries.
- Passed `rec_keys_path` to `RapidOcrModel` initialization, ensuring the recognition model can load the correct dictionary (e.g., for Latin characters).
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* style(rapidocr-options): fix alignment of `rec_keys_path` comment
Adjusted the alignment of the comment for `rec_keys_path` to maintain consistent formatting. No functional changes were made.
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
---------
Signed-off-by: Yorick Terweijden <yorick@spread.ai>
* chore: remove type-ignore marks for attaching text to non GroupItems
After commit b74208 of docling-core, text items can be attached to any NodeItem
and therefore the ignore[arg-type] type marks can be removed.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* test: remove unnecessary imports
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add documentation on supported formats and backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* docs: add notebook example with XML backends
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* feat: Introduce automatic language detection in tesseract_ocr_cli model. Extend unit tests.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* docs: Add example how to use "auto" language with tesseract OCR engines
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Refactor the TesseractOcrModel and TesseractOcrCliModel to validate if the auto-detected
language is installed in the system and if not fall back to a default option without language.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Add "auto" language for TesseractOcr
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Add tesseract-ocr-script-latn installation for the "auto" language
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Modify "auto" language in TesseractOcr to initialize the script readers lazily
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Finalize script readers
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
* Fix script models prefix for Linux
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
---------
Signed-off-by: Pavel Denisov <pavel.denisov@iais.fraunhofer.de>
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load
Signed-off-by: Tendo33 <sjf1998112@gmail.com>
* added http header support for document converter and cli
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* fixed formatting and typing issues
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
* use pydantic to parse dict
suggested by @dolfim-ibm
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
---------
Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierachical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Correct the way to set GPU for EasyOCR, RapidOCR
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Ocr AccleratorDevice
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Merge pull request #556 from DS4SD/cau/layout-processing-improvement
feat: layout processing improvements and bugfixes
* Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update HF model ref, reset test generate
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Repin to release package versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Many layout processing improvements, add document index type
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update pinnings to docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix table box snapping
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for cluster pre-ordering
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Introduce OCR confidence, propagate to orphan in post-processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix form and key value area groups
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust confidence in EasyOcr
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Roll back CLI changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docling-core pinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Annoying fixes for historical python versions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test GT for legacy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Comment cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: add PATENT_USPTO as input format
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* feat: add USPTO backend parser
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: change the name of the USPTO input format
Change the name of the patent USPTO input format to show the typical format (XML).
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: address several input formats with same mime type
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* refactor: group XML backend parsers in a subfolder
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* chore: add safe initialization of PatentUsptoDocumentBackend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Update easyocr_model.py
Added this line of code to get recog_network of easyocr parameter
recog_network = self.options.recog_network
Signed-off-by: itsainii <aininawawii@gmail.com>
* Update pipeline_options.py
Added this line in EasyOcrOptions function
recog_network: Optional[str] = 'standard'
Signed-off-by: itsainii <aininawawii@gmail.com>
* Add Easyocr recog_network parameter
Signed-off-by: itsainii <aininawawii@gmail.com>
---------
Signed-off-by: itsainii <aininawawii@gmail.com>
* Upgraded Layout Postprocessing, sending old code back to ERZ
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Implement hierachical cluster layout processing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested cluster processing through full pipeline
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Pass nested clusters through GLM as payload
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move to_docling_document from ds-glm to this repo
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Clean up imports again
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Updated test ground-truth
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Updated test ground-truth (again), bugfix for empty layout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Rollback changes from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update test gt
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove unused debug settings
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Review fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Nail the accelerator defaults for MPS
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com>
Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.
(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|image|pdf|asciido Specify input formats to convert │
│ c|md|xlsx] from. Defaults to all formats. │
│ [default: None] │
│ --to [md|json|html|text|doctags] Specify output formats. Defaults to │
│ Markdown. │
│ [default: None] │
│ --image-export-mode [placeholder|embedded|referenced] Image export mode for the document │
│ (only in case of JSON, Markdown or │
│ HTML). With `placeholder`, only the │
│ position of the image is marked in │
│ the output. In `embedded` mode, the │
│ image is embedded as base64 encoded │
│ string. In `referenced` mode, the │
│ image is exported in PNG format and │
│ referenced from the main exported │
│ document. │
│ [default: embedded] │
│ --ocr --no-ocr If enabled, the bitmap content will │
│ be processed using OCR. │
│ [default: ocr] │
│ --force-ocr --no-force-ocr Replace any existing text with OCR │
│ generated text over the full │
│ content. │
│ [default: no-force-ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract| The OCR engine to use. │
│ ocrmac|rapidocr] [default: easyocr] │
│ --ocr-lang TEXT Provide a comma-separated list of │
│ languages used by the OCR engine. │
│ Note that each OCR engine has │
│ different values for the language │
│ names. │
│ [default: None] │
│ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. │
│ [default: dlparse_v2] │
│ --table-mode [fast|accurate] The mode to use in the table │
│ structure model. │
│ [default: fast] │
│ --artifacts-path PATH If provided, the location of the │
│ model artifacts. │
│ [default: None] │
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will │
│ be processed using OCR. │
│ [default: no-abort-on-error] │
│ --output PATH Output directory where results are │
│ saved. │
│ [default: .] │
│ --verbose -v INTEGER Set the verbosity level. -v for │
│ info logging, -vv for debug │
│ logging. │
│ [default: 0] │
│ --debug-visualize-cells --no-debug-visualize-cells Enable debug output which │
│ visualizes the PDF cells │
│ [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr --no-debug-visualize-ocr Enable debug output which │
│ visualizes the OCR cells │
│ [default: no-debug-visualize-ocr] │
│ --debug-visualize-layout --no-debug-visualize-layout Enable debug output which │
│ visualizes the layour clusters │
│ [default: │
│ no-debug-visualize-layout] │
│ --debug-visualize-tables --no-debug-visualize-tables Enable debug output which │
│ visualizes the table cells │
│ [default: │
│ no-debug-visualize-tables] │
│ --version Show version information. │
│ --document-timeout FLOAT The timeout for processing each │
│ document, in seconds. │
│ [default: None] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯