Commit Graph

176 Commits

Author SHA1 Message Date
Maksym Lysak
186d71a057 Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-22 15:45:47 +02:00
Christoph Auer
4fb803f46c Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 15:30:47 +02:00
Maksym Lysak
47a4d314ea Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-22 14:39:44 +02:00
Christoph Auer
578e30e23b Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 14:34:38 +02:00
Christoph Auer
b1a2af6d39 Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 14:11:40 +02:00
Christoph Auer
789b29bb24 Merge ASCIIDoc and Markdown backends in, fixes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 11:34:35 +02:00
Christoph Auer
0bbd50f500 Merge branch 'dev/add-asciidocs-backend' of github.com:DS4SD/docling into cau/backend-document-origin
2024-10-22 11:04:49 +02:00
Peter Staar
bb3db07836 fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 09:44:27 +02:00
Peter Staar
b04f14ec24 able to parse the captions and image uri's
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 09:13:08 +02:00
Peter Staar
1c0a766cc5 working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 07:42:42 +02:00
Maksym Lysak
8c60dfa0e6 Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 16:42:18 +02:00
Maksym Lysak
1456a36618 Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:43:39 +02:00
Maksym Lysak
dae366440c Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
ba9beb65e3 Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
fa2f8cf236 Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
bef429fee3 Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
534b2203f6 md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
1df89f79ff work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
5986213cfe Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:42 +02:00
Peter Staar
c23d049270 adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-21 06:01:56 +02:00
Peter Staar
e60c52586b fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-19 06:23:35 +02:00
Peter Staar
70b2ae3fab reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:57:26 +02:00
Peter Staar
5016daeae3 first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
Peter Staar
1138cae7f1 adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
github-actions[bot]
c60c402e15 chore: bump version to 2.1.0 [skip ci]
2024-10-18 16:51:39 +02:00
Michele Dolfi
006cfb4125 feat: add coverage_threshold to skip OCR for small images (#161)
* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
ABHISHEK FADAKE
b6c061093c docs: typo fix (#155)
* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
Panos Vagenas
eb154a1c28 fix: fix legacy doc ref (#162)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 16:51:39 +02:00
Michele Dolfi
77fa1db3a1 ci: run ci also on forks (#160)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-18 16:51:39 +02:00
Christoph Auer
63d3704e54 Ensure all models work only on valid pages (#158)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
github-actions[bot]
d5460e2d1f chore: bump version to 2.1.0 [skip ci]
2024-10-18 13:21:15 +00:00
Michele Dolfi
b346faf622 feat: add coverage_threshold to skip OCR for small images (#161)
* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:58:23 +02:00
ABHISHEK FADAKE
f799e777c1 docs: typo fix (#155)
* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:56:48 +02:00
Panos Vagenas
63bef59d9e fix: fix legacy doc ref (#162)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 13:11:20 +02:00
Michele Dolfi
bb7a58d45d ci: run ci also on forks (#160)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-18 12:32:27 +02:00
Christoph Auer
a00c937e19 Ensure all models work only on valid pages (#158)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-18 08:54:06 +02:00
Peter Staar
c1d9241b39 updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 08:28:02 +02:00
Peter Staar
12033537e3 updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-17 19:58:07 +02:00
Maxim Lysak
034a411057 docs: add graphical band in readme (#154)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 18:15:40 +02:00
Michele Dolfi
61c092f445 docs: add use docling (#150)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
Michele Dolfi
24f949ada2 chore: run apt-get update before install (#156)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 17:27:16 +02:00
github-actions[bot]
a29c256041 chore: bump version to 2.0.0 [skip ci]
2024-10-16 19:48:06 +00:00
Christoph Auer
7d3be0edeb feat!: Docling v2 (#117)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas
d504432c1e docs: introduce docs site (#141)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
Michele Dolfi
2b1e72d327 refactor: fix type of tesseractocr options (#140)
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-14 08:40:22 +02:00
github-actions[bot]
4672b24c1a chore: bump version to 1.20.0 [skip ci]
2024-10-11 13:48:02 +00:00
Christoph Auer
5e4944f15f feat: new experimental docling-parse v2 backend (#131)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 15:12:49 +02:00
github-actions[bot]
2ec39636f0 chore: bump version to 1.19.1 [skip ci]
2024-10-11 08:52:09 +00:00
Nikos Livathinos
dae2a3b667 fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138)
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-11 10:21:19 +02:00
Panos Vagenas
5f1bd9e9c8 docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00