Commit Graph

293 Commits

Author SHA1 Message Date
João
82441ed6d2 sync with docling main 2025-01-09 12:25:16 -03:00
Panos Vagenas
4fa8028bd8
docs: add LangChain docs (#717)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 14:12:05 +01:00
Michele Dolfi
e64b5a2f62
fix: allow earlier requests versions (#716)
allow earlier requests versions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-09 13:30:40 +01:00
github-actions[bot]
9a94b54f6c chore: bump version to 2.15.0 [skip ci] 2025-01-08 12:06:38 +00:00
Christoph Auer
5cb4cf6f19
fix: Correct scaling of debug visualizations, tune OCR (#700)
* fix: Correct scaling of debug visualizations, tune OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: remove unused imports

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-08 12:26:44 +01:00
Michele Dolfi
ead396ab40
docs: specify docstring types (#702)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-01-08 09:05:18 +01:00
Michele Dolfi
6701f34c85
docs: add link to rag with granite (#698)
* docs: add link to rag with granite

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update mkdocs.yml

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-07 20:01:41 +01:00
Christoph Auer
42856fdf79
fix: Let BeautifulSoup detect the HTML encoding (#695)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-01-07 15:49:28 +01:00
Panos Vagenas
2d24faecd9
docs: add integrations, revamp docs (#693)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-07 14:15:54 +01:00
Jinfeng Sun
d49650c54f
fix(mspowerpoint): handle invalid images in PowerPoint slides (#650)
- Add error handling for images that cannot be loaded by Pillow
- Improve resilience when encountering corrupted or unsupported image formats
- Maintain processing of other slide elements even if an image fails to load

Signed-off-by: Tendo33 <sjf1998112@gmail.com>
2025-01-07 13:58:10 +01:00
Luke Harrison
0ee849e8bc
feat: added http header support for document converter and cli (#642)
* added http header support for document converter and cli

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>

* fixed formatting and typing issues

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>

* use pydantic to parse dict

suggested by @dolfim-ibm

Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>

---------

Signed-off-by: Luke Harrison <Luke.Harrison1@ibm.com>
Signed-off-by: Luke Harrison <luke.harrison1@ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2025-01-07 10:15:14 +01:00
JSIV
569038df42
docs: Add OpenContracts as an integration (#679)
* Add OpenContracts as an open source project

OpenContracts now offers Docling as a document ingestion and parsing pipeline

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>

* Update mkdocs.yml

Added OpenContracts to the nav configs

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>

---------

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
2025-01-07 10:14:42 +01:00
m-newhauser
2b591f9872
docs: add Weaviate RAG recipe notebook (#451)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-19 21:57:40 +01:00
Panos Vagenas
fc645ea531
docs: document Haystack & Vectara support (#628)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-19 13:33:02 +01:00
github-actions[bot]
1418fa1488 chore: bump version to 2.14.0 [skip ci] 2024-12-18 07:04:47 +00:00
Lucas Morin
fd034802b6
feat: Create a backend to transform PubMed XML files to DoclingDocument (#557)
Signed-off-by: lucas-morin <lucas.morin222@gmail.com>
2024-12-17 19:27:09 +01:00
github-actions[bot]
e31f09f71f chore: bump version to 2.13.0 [skip ci] 2024-12-17 17:01:04 +00:00
Christoph Auer
60dc852f16
feat: Updated Layout processing with forms and key-value areas (#530)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement hierachical cluster layout processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Updated test ground-truth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Correct the way to set GPU for EasyOCR, RapidOCR

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Ocr AccleratorDevice

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Merge pull request #556 from DS4SD/cau/layout-processing-improvement

feat: layout processing improvements and bugfixes

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update HF model ref, reset test generate

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Repin to release package versions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Many layout processing improvements, add document index type

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update pinnings to docling-core

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix table box snapping

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for cluster pre-ordering

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce OCR confidence, propagate to orphan in post-processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix form and key value area groups

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust confidence in EasyOcr

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Roll back CLI changes from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docling-core pinning

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Annoying fixes for historical python versions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test GT for legacy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Comment cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-17 17:32:24 +01:00
Cesar Berrospi Ramis
00dec7a2f3
test: generate file from CLI in a temporary directory (#618)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2024-12-17 16:35:42 +01:00
Cesar Berrospi Ramis
4e087504cc
feat: create a backend to parse USPTO patents into DoclingDocument (#606)
* feat: add PATENT_USPTO as input format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat: add USPTO backend parser

Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor: change the name of the USPTO input format

Change the name of the patent USPTO input format to show the typical format (XML).

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor: address several input formats with same mime type

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor: group XML backend parsers in a subfolder

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add safe initialization of PatentUsptoDocumentBackend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2024-12-17 16:35:23 +01:00
Panos Vagenas
3e599c7bbe
docs: add Haystack RAG example (#615)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-17 14:24:40 +01:00
itsainii
3b53bd38c8
feat: Add Easyocr parameter recog_network (#613)
* Update easyocr_model.py

Added this line of code to get recog_network of easyocr parameter
recog_network = self.options.recog_network

Signed-off-by: itsainii <aininawawii@gmail.com>

* Update pipeline_options.py

Added this line in EasyOcrOptions function

recog_network: Optional[str] = 'standard'

Signed-off-by: itsainii <aininawawii@gmail.com>

* Add Easyocr recog_network parameter

Signed-off-by: itsainii <aininawawii@gmail.com>

---------

Signed-off-by: itsainii <aininawawii@gmail.com>
2024-12-17 09:47:18 +01:00
Nikos Livathinos
3bb3bf5715
docs: Fix the path to the run_with_accelerator.py example (#608)
docs: Fix the path to the run_with_accelerator.py example

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-16 15:03:06 +01:00
Fabio
a5c2fb3052
Merge branch 'DS4SD:main' into main 2024-12-13 22:54:00 -03:00
Fabio
162e89e013
Merge pull request #2 from 0xCarbon/feat/disable_image_labeling
Force Image Content OCR
2024-12-13 22:52:52 -03:00
github-actions[bot]
a2db5fbd0f chore: bump version to 2.12.0 [skip ci] 2024-12-13 18:27:00 +00:00
João
e0aaa783c5 Merge branch 'main' of github.com:0xCarbon/docling into feat/disable_image_labeling 2024-12-13 13:57:35 -03:00
jpcanesin
1e016fd776
Merge pull request #1 from DS4SD/main
feat: Introduce support for GPU Accelerators (#593)
2024-12-13 13:55:48 -03:00
João
a4dc21395d blacklisted the picture layout tag so that it is forced to interpret the contents of the image and retrieve text that otherwise would be lost with an image tag 2024-12-13 13:46:25 -03:00
Nikos Livathinos
19fad9261c
feat: Introduce support for GPU Accelerators (#593)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Implement hierachical cluster layout processing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Clean up imports again

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Updated test ground-truth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Rollback changes from main

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update test gt

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove unused debug settings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Review fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Nail the accelerator defaults for MPS

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-12-13 17:45:22 +01:00
github-actions[bot]
365a1e7b98 chore: bump version to 2.11.0 [skip ci] 2024-12-12 08:16:05 +00:00
Abhishek Kumar
3da166eafa
feat: Add timeout limit to document parsing job. DS4SD#270 (#552)
Signed-off-by: Abhishek Kumar <abhishekrocketeer@gmail.com>

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                                       [docx|pptx|html|image|pdf|asciido  Specify input formats to convert    │
│                                                              c|md|xlsx]                         from. Defaults to all formats.      │
│                                                                                                 [default: None]                     │
│ --to                                                         [md|json|html|text|doctags]        Specify output formats. Defaults to │
│                                                                                                 Markdown.                           │
│                                                                                                 [default: None]                     │
│ --image-export-mode                                          [placeholder|embedded|referenced]  Image export mode for the document  │
│                                                                                                 (only in case of JSON, Markdown or  │
│                                                                                                 HTML). With `placeholder`, only the │
│                                                                                                 position of the image is marked in  │
│                                                                                                 the output. In `embedded` mode, the │
│                                                                                                 image is embedded as base64 encoded │
│                                                                                                 string. In `referenced` mode, the   │
│                                                                                                 image is exported in PNG format and │
│                                                                                                 referenced from the main exported   │
│                                                                                                 document.                           │
│                                                                                                 [default: embedded]                 │
│ --ocr                         --no-ocr                                                          If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: ocr]                      │
│ --force-ocr                   --no-force-ocr                                                    Replace any existing text with OCR  │
│                                                                                                 generated text over the full        │
│                                                                                                 content.                            │
│                                                                                                 [default: no-force-ocr]             │
│ --ocr-engine                                                 [easyocr|tesseract_cli|tesseract|  The OCR engine to use.              │
│                                                              ocrmac|rapidocr]                   [default: easyocr]                  │
│ --ocr-lang                                                   TEXT                               Provide a comma-separated list of   │
│                                                                                                 languages used by the OCR engine.   │
│                                                                                                 Note that each OCR engine has       │
│                                                                                                 different values for the language   │
│                                                                                                 names.                              │
│                                                                                                 [default: None]                     │
│ --pdf-backend                                                [pypdfium2|dlparse_v1|dlparse_v2]  The PDF backend to use.             │
│                                                                                                 [default: dlparse_v2]               │
│ --table-mode                                                 [fast|accurate]                    The mode to use in the table        │
│                                                                                                 structure model.                    │
│                                                                                                 [default: fast]                     │
│ --artifacts-path                                             PATH                               If provided, the location of the    │
│                                                                                                 model artifacts.                    │
│                                                                                                 [default: None]                     │
│ --abort-on-error              --no-abort-on-error                                               If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: no-abort-on-error]        │
│ --output                                                     PATH                               Output directory where results are  │
│                                                                                                 saved.                              │
│                                                                                                 [default: .]                        │
│ --verbose                 -v                                 INTEGER                            Set the verbosity level. -v for     │
│                                                                                                 info logging, -vv for debug         │
│                                                                                                 logging.                            │
│                                                                                                 [default: 0]                        │
│ --debug-visualize-cells       --no-debug-visualize-cells                                        Enable debug output which           │
│                                                                                                 visualizes the PDF cells            │
│                                                                                                 [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr         --no-debug-visualize-ocr                                          Enable debug output which           │
│                                                                                                 visualizes the OCR cells            │
│                                                                                                 [default: no-debug-visualize-ocr]   │
│ --debug-visualize-layout      --no-debug-visualize-layout                                       Enable debug output which           │
│                                                                                                 visualizes the layour clusters      │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-layout]          │
│ --debug-visualize-tables      --no-debug-visualize-tables                                       Enable debug output which           │
│                                                                                                 visualizes the table cells          │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-tables]          │
│ --version                                                                                       Show version information.           │
│ --document-timeout                                           FLOAT                              The timeout for processing each     │
│                                                                                                 document, in seconds.               │
│                                                                                                 [default: None]                     │
│ --help                                                                                          Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-11 15:06:10 +01:00
Christoph Auer
aee9c0b324
fix: Do not import python modules from deepsearch-glm (#569)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-11 12:29:06 +01:00
Christoph Auer
f45499ce93
fix: Handle no result from RapidOcr reader (#558)
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-12-10 16:25:05 +01:00
Panos Vagenas
d0c9e8e508
docs: update chunking usage docs, minor reorg (#550)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-10 16:03:02 +01:00
Michele Dolfi
a7df337654
fix: make enum serializable with human-readable value (#555)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-10 13:12:44 +01:00
github-actions[bot]
eb30c4f763 chore: bump version to 2.10.0 [skip ci] 2024-12-09 16:28:46 +00:00
Christoph Auer
7972d47f88
fix: Call into docling-core for legacy document transform (#551)
Call into docling-core for legacy document transform

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 17:06:47 +01:00
Nikos Livathinos
78f61a8522
fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544)
* fix: main: Introduce format options for Image with the same pdf pipeline_options.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Silence the tqdm messages during the downloading of model files

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code styling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Use the HF API to disable the tqdm progress bars

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-12-09 15:57:37 +01:00
Christoph Auer
aca57f0527
feat: docling-parse v2 as default PDF backend (#549)
* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Upgrade to ds-glm 1.0 and docling-parse 3.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lock

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix DP2 backend code, change CLI default backend

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-09 13:26:17 +01:00
github-actions[bot]
9fd2cf847a chore: bump version to 2.9.0 [skip ci] 2024-12-09 09:33:55 +00:00
Panos Vagenas
c8ecdd987e
feat: expose new hybrid chunker, update docs (#384)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-09 08:28:29 +01:00
Maxim Lysak
eb7ffcdd1c
fix: Correcting DefaultText ID for MS Word backend (#537)
Correcting DefaultText ID for MS Word backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:48:35 +01:00
Maxim Lysak
3e073dfbeb
feat(MS Word backend): Make detection of headers and other styles localization agnostic (#534)
Using style id instead of style names, which should be localization agnostic

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 15:17:56 +01:00
Michele Dolfi
53039a8367
ci: allow ! in conventionalcommits (#533)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 14:50:10 +01:00
Sander Maijers
9102fe1adc
fix: Add py.typed marker file (#531)
feat: add `py.typed` marker file

See: https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-type-information

Signed-off-by: Sander Maijers <3374183+sanmai-NL@users.noreply.github.com>
2024-12-06 13:42:14 +01:00
Panos Vagenas
e780333440
docs: document new integrations (#532)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-12-06 13:18:14 +01:00
Peter W. J. Staar
0d11e30dd8
fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the referenced seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Maxim Lysak
b730b2d7a0
fix: Missing text in docx (t tag) when embedded in a table (#528)
Fix for missing text in docx (t tag) when embedded in a table

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-12-06 12:37:25 +01:00
Michele Dolfi
c830b92b2e
fix: restore pydantic version pin after fixes (#512)
* test: pin new docling-core changes and release pydantic pinning

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-12-06 09:33:39 +01:00