Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification ( #1562 ) ( #1563 )
...
Fix regular expression to identify header lines in AsciiDoc avoiding to
match defined blocks.
Signed-off-by: Marco Fargetta <mfargett@redhat.com >
2025-05-13 11:17:26 +02:00
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file ( #1582 )
...
* fix click dependency and update lock file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update test GT
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Christoph Auer <cau@zurich.ibm.com >
Co-authored-by: Christoph Auer <cau@zurich.ibm.com >
2025-05-13 10:40:08 +02:00
github-actions[bot]
0d0fa6cbe3
chore: bump version to 2.31.1 [skip ci]
v2.31.1
2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils ( #1577 )
...
add smoldocling in download utils
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-12 10:48:07 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit ( #1559 )
...
Update data_prep_kit.md
The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.
Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com >
2025-05-11 20:38:25 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows ( #1536 )
...
* chore(HTML): log the stacktrace of errors
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(HTML): handle row headers like in pivot tables
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-05-09 15:14:32 +02:00
Panos Vagenas
3220a592e7
docs: add serialization docs, update chunking docs ( #1556 )
...
* docs: add serializers docs, update chunking docs
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update notebook to improve MD table rendering
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-08 21:43:01 +02:00
DavidLee
f1658edbad
fix: mime error in document streams ( #1523 )
...
Update document.py
edit got file mime error
Signed-off-by: DavidLee <yongsheng_li@foxmail.com >
2025-05-06 09:30:46 +02:00
Michele Dolfi
7c705739f9
fix: usage of hashlib for FIPS ( #1512 )
...
fix usage of hashlib for FIPS
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-05-02 15:03:29 +02:00
Panos Vagenas
de56523974
chore: format JSON test files to enable comparison ( #1511 )
...
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-05-02 10:52:18 +02:00
Ihar Hrachyshka
b147331f2a
chore: restore typing hint for self.script_readers ( #1500 )
...
With future annotations, typing hints resolution is always deferred.
https://peps.python.org/pep-0563/
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com >
2025-04-30 20:33:27 +02:00
Ben Browning
4ab7e9ddfb
fix: Guard against attribute errors in TesseractOcrModel __del__ ( #1494 )
...
This moves the initialization of the `reader` and `script_readers`
attributes to before we attempt to import tesserocr, so that when later
accessing these attributes in the garbage collection method `__del__`
the attributes exist.
This requires changing the typing of the `script_readers` dict value to
`Any` because we cannot yet reference its actual strong type, since it's
a tesserocr value.
This prevents throwing an exception during garbage collection for
cases where the TesseractOcrModel instance didn't properly initialize,
like when it throws an `ImportError` during its initializer.
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-04-30 17:51:33 +02:00
Zach Cox
cc453961a9
fix: enable cuda_use_flash_attention2 for PictureDescriptionVlmModel ( #1496 )
...
fix: enable use_cuda_flash_attention2 for PictureDescriptionVlmModel
Signed-off-by: Zach Cox <zach.s.cox@gmail.com >
2025-04-30 08:02:52 +02:00
Peter W. J. Staar
976e92e289
fix: updated the time-recorder label for reading order ( #1490 )
...
* fix: updated the time-recorder label for reading order
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
2025-04-29 13:02:53 +02:00
Michele Dolfi
d8959c6b19
chore: update dependencies in lock file ( #1458 )
...
update lock: h11 vuln and torch update
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-28 08:52:46 +02:00
nkh0472
a097ccd8d5
chore: typo fix ( #1465 )
...
* typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
* chore: typo fix
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
---------
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-28 08:52:09 +02:00
Emmanuel Ferdman
3afbe6c969
docs: update supported formats guide ( #1463 )
...
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com >
2025-04-28 08:51:54 +02:00
Maxim Lysak
94d66a0765
fix: Incorrect scaling of TableModel bboxes when do_cell_matching is False ( #1459 )
...
fixing double scaling in case of do_cell_matching is False
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-25 12:34:12 +02:00
github-actions[bot]
c67133dde4
chore: bump version to 2.31.0 [skip ci]
v2.31.0
2025-04-25 08:28:25 +00:00
Ryan Lin
a2fbbba9f7
feat: add tutorial using Milvus and Docling for RAG pipeline ( #1449 )
...
* feat: add milvus rag with docling tutorial
Signed-off-by: Ryan Lin <linjinhong@yandex.com >
* chore: run pre-commit
Signed-off-by: Ryan Lin <linjinhong@yandex.com >
* feat: add RAG with Milvus example to mkdocs
Signed-off-by: Ryan Lin <linjinhong@yandex.com >
---------
Signed-off-by: Ryan Lin <linjinhong@yandex.com >
2025-04-25 09:12:35 +02:00
Michele Dolfi
976431ed7f
chore: update locked deps ( #1442 )
...
update deps
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-23 14:59:31 +02:00
Cesar Berrospi Ramis
ed20124544
fix(html): handle address, details, and summary tags ( #1436 )
...
* fix(html): handle 'address' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* fix(html): handle 'details' tag
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-23 09:30:59 +02:00
nkh0472
c2470ed216
docs: Fix wrong output format in example code ( #1427 )
...
fix: wrong output format
Signed-off-by: nkh0472 <67589323+nkh0472@users.noreply.github.com >
2025-04-22 12:32:55 +02:00
Michele Dolfi
64918a81ac
docs: Add OpenSSF Best Practices badge ( #1430 )
...
* docs: add openssf badge
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add badge to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-22 11:23:28 +02:00
Ben Cox
995b3b0ab1
docs: Typo fixes in docling_document.md ( #1400 )
...
Signed-off-by: Ben Cox <1038350+ind1go@users.noreply.github.com >
2025-04-22 08:49:08 +02:00
Eugene
8012a3e4d6
fix: Treat overflowing -v flags as DEBUG ( #1419 )
...
Signed-off-by: Eugene <fogaprod@gmail.com >
2025-04-19 11:02:41 +02:00
Leandro Rosas
88948b0bba
docs: Updated the [Usage] link in architecture.md ( #1416 )
...
Fixed the [Usage] link in architecture.md
Changed the usage link in the tip box from "../usage.md#adjust-pipeline-features" to "../usage/index.md#adjust-pipeline-features" as the previous link is not valid.
Signed-off-by: Leandro Rosas <36343022+leandrosas101@users.noreply.github.com >
2025-04-19 10:20:52 +02:00
Cesar Berrospi Ramis
fa7fc9e63d
fix(codecov): fix codecov argument and yaml file ( #1399 )
...
* fix(codecov): fix codecov argument and yaml file
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* ci: set the codecov status to success even if the CI fails
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-15 18:12:57 +02:00
Panos Vagenas
550b1ca2f8
chore: propagate docling-core fix ( #1389 )
...
* chore: propagate docling-core fix
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
* update lock to latest docling-core release
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
2025-04-15 10:51:47 +02:00
Felix Dittrich
a7dd59c5cb
docs(ocr): Add docs entry for OnnxTR OCR plugin ( #1382 )
...
feat(ocr): Add docs entry for OnnxTR OCR plugin
Signed-off-by: felix <felixdittrich92@gmail.com >
2025-04-15 09:46:59 +02:00
Michele Dolfi
06227e9970
ci: sign pypi packages ( #1392 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-15 08:59:16 +02:00
Michele Dolfi
5458a88464
ci: add coverage and ruff ( #1383 )
...
* add coverage calculation and push
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* new codecov version and usage of token
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* enable ruff formatter instead of black and isort
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff lint fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply ruff unsafe fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add removed imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* runs 1 on linter issues
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* finalize linter fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* Update pyproject.toml
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-14 18:01:26 +02:00
Michele Dolfi
293c28ca7c
docs(security): more statements about secure development ( #1381 )
...
docs: more statement about secure development
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 13:53:26 +02:00
Michele Dolfi
01fbfd5652
docs: Add testing in the docs ( #1379 )
...
* add testing to CONTRIBUTING
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* document test generation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* typo
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 12:31:48 +02:00
Michele Dolfi
d9c3999175
chore: update lock file ( #1378 )
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 10:38:10 +02:00
Juil Park
a026b4e84b
docs: Add Notes for Installing in Intel macOS ( #1377 )
...
docs: Add Notes for Intel macOS
Signed-off-by: Juil Park <park@juil.dev >
2025-04-14 10:21:13 +02:00
github-actions[bot]
c391adb5f0
chore: bump version to 2.30.0 [skip ci]
v2.30.0
2025-04-14 08:20:31 +00:00
Michele Dolfi
7e40ad3261
fix(deps): widen typer upper bound ( #1375 )
...
bump typer
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 09:23:39 +02:00
Peter W. J. Staar
c0ba88edf1
feat(cli): add option for html with split-page mode ( #1355 )
...
* updated the cli to output html in split-page mode
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
* add pin for new docling-core with html split argument
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* relock with fixed html export in docling-core
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update more tests
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update lock with docling-core fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* update test results
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add again chunking extras
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 08:41:50 +02:00
Tim Kellogg
0de70e7991
fix: auto-recognize .xlsx, .docx and .pptx files ( #1340 )
...
* bug: auto-recognize .xlsx files
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
* apply styling
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* apply to other ms office zip formats
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Tim Kellogg <timothy.kellogg@gmail.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-14 07:45:13 +02:00
Simon Leiß
b295da4bfe
chore: Update repository URL in CITATION.cff ( #1363 )
...
Update repository URL in CITATION.cff
Repository was moved to docling-project/docling, so adjust the URL.
Signed-off-by: Simon Leiß <5084100+sleiss@users.noreply.github.com >
2025-04-14 06:57:04 +02:00
Cesar Berrospi Ramis
415b877984
fix(docx): declare image_data variable when handling pictures ( #1359 )
...
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 13:04:00 +02:00
Rowan Skewes
250399948d
fix: Implement PictureDescriptionApiOptions.bitmap_area_threshold ( #1248 )
...
fix: Implement PictureDescriptionApiOptions.picture_area_threshold
Signed-off-by: Rowan Skewes <rowan.skewes@gmail.com >
2025-04-11 11:14:05 +02:00
Cesar Berrospi Ramis
eef2bdea77
feat(xlsx): create a page for each worksheet in XLSX backend ( #1332 )
...
* sytle(xlsx): enforce type hints in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* feat(xlsx): create a page for each worksheet in XLSX backend
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docs(xlsx): add docstrings to XLSX backend module.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
* docling(xlsx): add bounding boxes and page size information in cell units
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
---------
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com >
2025-04-11 10:29:53 +02:00
Gabe Goodhart
c605edd8e9
feat: OllamaVlmModel for Granite Vision 3.2 ( #1337 )
...
* build: Add ollama sdk dependency
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Add option plumbing for OllamaVlmOptions in pipeline_options
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Full implementation of OllamaVlmModel
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* feat: Connect "granite_vision_ollama" pipeline option to CLI
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* Revert "build: Add ollama sdk dependency"
After consideration, we're going to use the generic OpenAI API instead
of the Ollama-specific API to avoid duplicate work.
This reverts commit bc6b366468cdd66b52540aac9c7d8b584ab48ad0.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Move OpenAI API call logic into utils.utils
This will allow reuse of this logic in a generic VLM model
NOTE: There is a subtle change here in the ordering of the text prompt and
the image in the call to the OpenAI API. When run against Ollama, this
ordering makes a big difference. If the prompt comes before the image, the
result is terse and not usable whereas the prompt coming after the image
works as expected and matches the non-OpenAI chat API.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* refactor: Refactor from Ollama SDK to generic OpenAI API
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* fix: Linting, formatting, and bug fixes
The one bug fix was in the timeout arg to openai_image_request. Otherwise,
this is all style changes to get MyPy and black passing cleanly.
Branch: OllamaVlmModel
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
* remove model from download enum
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* generalize input args for other API providers
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* rename and refactor
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add example
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* require flag for remote services
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* disable example from CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
* add examples to docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com >
2025-04-10 18:03:04 +02:00
Joan Fabrégat
6b696b504a
fix: Properly address page in pipeline _assemble_document when page_range is provided ( #1334 )
...
* Fixes #1333
Signed-off-by: Joan Fabrégat <j@fabreg.at >
* fix for the (dumb) MyPy type checker
Signed-off-by: Joan Fabrégat <j@fabreg.at >
---------
Signed-off-by: Joan Fabrégat <j@fabreg.at >
2025-04-10 16:11:28 +02:00
github-actions[bot]
72ab8e1821
chore: bump version to 2.29.0 [skip ci]
v2.29.0
2025-04-10 12:24:09 +00:00
Maxim Lysak
355d8dc7a6
chore: Logo parameter in docling CLI, prints cute ascii logo ( #1294 )
...
logo parameter in docling cli, prints cute ascii logo
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
2025-04-09 05:29:48 +02:00
Rafael Teixeira de Lima
14e9c0ce9a
fix(docx): Adding new latex symbols, simplifying how equations are added to text ( #1295 )
...
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(docx): Improve text parsing (#1268 )
* chore: bump version to 2.28.4 [skip ci]
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Improve text parsing
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201 )
fix wrong type text extracted by tesseract_ocr_cli_model
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Flexibilize heading detection
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Fix trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Remove trailing space
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add visual grounding example (#1270 )
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat(docx): add text formatting and hyperlink support (#630 )
* feat: Enable markdown text formatting for docx
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix imports
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use Formatting
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle hyperlink
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle formatting properly for DocItemLabel.PARAGRAPH
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline group
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle bullet lists
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Strip elements
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run black and mypy
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Handle header and footer
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Use inline_fmt everywhere
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Run precommit
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Address feedback
Signed-off-by: SimJeg <sjegou@nvidia.com >
* Fix add_list_item
Signed-off-by: SimJeg <sjegou@nvidia.com >
* fix minor bugs, mark helper methods internal
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
---------
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* fix(pptx): check if picture shape has an image attached (#1316 )
Check if picture shape has an image attached in pptx backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* chore: update lock file (#1315 )
chore: update lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* docs: add plugins docs (#1319 )
add plugin docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* feat: handle <code> tags as code blocks (#1320 )
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Identify headers through inhenrited style
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Log warning message instead of print
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
* Adding new latex symbols, simplifying how equations are added to text
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
---------
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com >
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com >
Signed-off-by: SimJeg <sjegou@nvidia.com >
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com >
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com >
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com >
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com >
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com >
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com >
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com >
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com >
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com >
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com >
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com >
2025-04-08 17:11:37 +02:00
Fernando Santos
0499cd1c1e
feat: handle <code> tags as code blocks ( #1320 )
...
handle <code> tags as code blocks
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com >
2025-04-08 10:32:06 +02:00