100 Commits

Author SHA1 Message Date
github-actions[bot]
9d8865856d chore: bump version to 2.3.1 [skip ci] 2024-10-30 18:23:53 +00:00
Michele Dolfi
eb679ccbb4
fix: simplify torch dependencies and update pinned docling deps (#190)
* fix: simplify torch dependencies and update pinned docling deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docling-ibm-models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-30 18:44:08 +01:00
github-actions[bot]
43349865d0 chore: bump version to 2.3.0 [skip ci] 2024-10-30 14:47:37 +00:00
Peter W. J. Staar
f542460af3
fix: fix duplicate title and heading + add e2e tests for html and docx (#186)
* add real e2e tests for html and docx

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the output of itxt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the text

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the tests (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the examples (1)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the output of the test

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the tests, moved the ground-truth

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* moved the ground-truth data

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the html tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restructure title fix (#187)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-30 13:14:56 +01:00
github-actions[bot]
dda2645d4c chore: bump version to 2.2.1 [skip ci] 2024-10-28 17:18:41 +00:00
Maxim Lysak
88c1673057
fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178)
* Small fix to properly handle trailing inline text in the md backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added proper handling of headers with bold, italic or emphasis

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed print

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Made smarter processing of headers, with arbitrary styling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated docling-core to 2.2.1

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests because of the change in Markdown export in docling-core

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-25 18:02:20 +02:00
github-actions[bot]
8208c93e3a chore: bump version to 2.2.0 [skip ci] 2024-10-23 16:04:55 +00:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history (#170)
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update imports for docling_parse.pdf_parser_v1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Lock docling-parse 2.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Lock docling-parse 2.0.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* repin poetry.lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-23 17:20:11 +02:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format (#168)
* updated the base-model and added the asciidoc_backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the asciidoc backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Ensure all models work only on valid pages (#158)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* ci: run ci also on forks (#160)


---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* fix: fix legacy doc ref (#162)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* docs: typo fix (#155)

* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: add coverage_threshold to skip OCR for small images (#161)

* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore: bump version to 2.1.0 [skip ci]

* adding tests for asciidocs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working asciidoc parser

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding test_02.asciidoc

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Drafting Markdown backend via Marko library

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* work in progress on MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* md_backend produces docling document with headers, paragraphs, lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improvements in md parsing

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Detecting and assembling tables in markdown in temporary buffers

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added initial docling table support to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned code, improved logging for MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes MyPy requirements, and rest of pre-commit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed example run_md, added origin info to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* working on asciidocs, struggling with ImageRef

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* able to parse the captions and image URIs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update all backends with proper filename in DocumentOrigin

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update to docling-core v2.1.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for MD Backend, to avoid duplicated text inserts into docling doc

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix styling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added support for code blocks and fenced code in MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned prints

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added proper processing of in-line textual elements for MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issues with duplicated paragraphs and incorrect lists in pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issue with group ordering in pptx backend, added debug log into run with formats

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00
github-actions[bot]
d5460e2d1f chore: bump version to 2.1.0 [skip ci] 2024-10-18 13:21:15 +00:00
Michele Dolfi
61c092f445
docs: add use docling (#150)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
github-actions[bot]
a29c256041 chore: bump version to 2.0.0 [skip ci] 2024-10-16 19:48:06 +00:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 (#117)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas
d504432c1e
docs: introduce docs site (#141)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
github-actions[bot]
4672b24c1a chore: bump version to 1.20.0 [skip ci] 2024-10-11 13:48:02 +00:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend (#131)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 15:12:49 +02:00
github-actions[bot]
2ec39636f0 chore: bump version to 1.19.1 [skip ci] 2024-10-11 08:52:09 +00:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Panos Vagenas
6924999f1f
chore: explicitly manage pandas dependency (#134)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 14:50:39 +02:00
github-actions[bot]
0ffc1708d2 chore: bump version to 1.19.0 [skip ci] 2024-10-08 17:42:29 +00:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
github-actions[bot]
9b82ae3324 chore: bump version to 1.18.0 [skip ci] 2024-10-03 17:16:00 +00:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models (#120)
---------

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
github-actions[bot]
9ebbbc1245 chore: bump version to 1.17.0 [skip ci] 2024-10-03 13:44:52 +00:00
Michele Dolfi
d44c62d7ce
feat: windows support (#122)
* feat: windows support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add Windows in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
github-actions[bot]
cde671cf34 chore: bump version to 1.16.1 [skip ci] 2024-09-27 14:36:40 +00:00
Michele Dolfi
34bd887a7f
fix: allow usage of opencv 4.6.x (#110)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-27 15:51:43 +02:00
github-actions[bot]
6760571fe1 chore: bump version to 1.16.0 [skip ci] 2024-09-27 06:21:15 +00:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group (#103)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-25 15:47:48 +02:00
github-actions[bot]
3dfd02a7e9 chore: bump version to 1.15.0 [skip ci] 2024-09-24 15:58:16 +00:00
Michele Dolfi
6a03c208ec
feat: add figure in markdown (#98)
* feat: add figures in markdown

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update to new docling-core and update test results with figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update with improved docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-24 17:28:23 +02:00
github-actions[bot]
001d214a13 chore: bump version to 1.14.0 [skip ci] 2024-09-24 13:38:23 +00:00
github-actions[bot]
c65a01c9b7 chore: bump version to 1.13.1 [skip ci] 2024-09-23 19:04:01 +00:00
Peter W. J. Staar
4794ce460a
fix: updated the render_as_doctags with the new arguments from docling-core (#93)
* updated the render_as_doctags with the new arguments from docling-core

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ensuring that docling-core is >1.5.0 to accommodate the latest export-to-doctags parameters

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the doctags tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix poetry lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Fix formatting problems

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fixed the doctag export in docling/utils/export.py

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* propagate xsize and ysize

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-23 20:12:18 +02:00
github-actions[bot]
6dd1e91c4a chore: bump version to 1.13.0 [skip ci] 2024-09-18 09:26:03 +00:00
Michele Dolfi
f19bd43798
feat: add table exports (#86)
* feat: expose docling-core table exporters and add examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove temp internal implementation of html export

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin latest docling-core 1.4.0 with table exports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 08:44:13 +02:00
Peter W. J. Staar
442443a102
fix: bumped the glm version and adjusted the tests (#83)
* bumped the glm version and adjusted the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the poetry lock

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fix hooks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fixed the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the tests for tables

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-18 07:43:49 +02:00
github-actions[bot]
8242bce4fa chore: bump version to 1.12.2 [skip ci] 2024-09-17 16:01:34 +00:00
Nikos Livathinos
fa9699fa3c
fix(tests): Adjust the test data to match the new version of LayoutPredictor (#82)
* fix(tests): Adjust the test data to match the new version of LayoutPredictor from docling-ibm-models

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update poetry to use `docling-ibm-models` at version `v1.2.0`

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-09-17 15:50:35 +02:00
github-actions[bot]
f1932fd8c5 chore: bump version to 1.12.1 [skip ci] 2024-09-16 10:58:09 +00:00
github-actions[bot]
34b2772a2e chore: bump version to 1.12.0 [skip ci] 2024-09-13 12:34:15 +00:00
Peter W. J. Staar
98990784df
feat: add docling cli (#75)
* chore: add simple convert script

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted all

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted all

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added default arg

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* use typer for the docling CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* describe output when saving

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add tests for CLI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add export options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-13 14:03:09 +02:00
Michele Dolfi
8aa476ccd3
test: improve typing definitions (part 1) (#72)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-12 15:56:29 +02:00
Panos Vagenas
53569a1023
docs: showcase RAG with LlamaIndex and LangChain (#71)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-11 15:07:08 +02:00
github-actions[bot]
e66dc53765 chore: bump version to 1.11.0 [skip ci] 2024-09-10 16:18:59 +00:00
Peter W. J. Staar
bdfdfbf092
feat: adding txt and doctags output (#68)
* feat: adding txt and doctags output

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaned up the export

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Fix datamodel usage for Figure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* updated all the examples to deal with new rendering

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-10 17:30:52 +02:00
github-actions[bot]
cd5b6293cc chore: bump version to 1.10.0 [skip ci] 2024-09-10 14:38:07 +00:00
Michele Dolfi
27a7a152e1
feat: linux arm64 support and reducing dependencies (#69)
* feat: linux arm64 support and reducing dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* downgrade pyarrow for wider support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-10 15:43:27 +02:00
github-actions[bot]
d3711437f6 chore: bump version to 1.9.0 [skip ci] 2024-09-03 13:33:40 +00:00
Michele Dolfi
1de2e4f924
feat: export document pages as multimodal output (#54)
* feat: export document pages as multimodal output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* create a single parquet output

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add loading into HF datasets library

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* renaming

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-03 15:05:35 +02:00