Maksym Lysak
e8229fdd4c
cleaned prints
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-22 15:48:33 +02:00
Maksym Lysak
186d71a057
Added support for code blocks and fenced code in MD
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-22 15:45:47 +02:00
Christoph Auer
4fb803f46c
Fix styling
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 15:30:47 +02:00
Maksym Lysak
47a4d314ea
Fixes for MD Backend, to avoid duplicated text inserts into docling doc
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-22 14:39:44 +02:00
Christoph Auer
578e30e23b
Update to docling-core v2.1.0
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 14:34:38 +02:00
Christoph Auer
b1a2af6d39
Update all backends with proper filename in DocumentOrigin
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 14:11:40 +02:00
Christoph Auer
789b29bb24
Merge ASCIIDoc and Markdown backends in, fixes
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-22 11:34:35 +02:00
Christoph Auer
0bbd50f500
Merge branch 'dev/add-asciidocs-backend' of github.com:DS4SD/docling into cau/backend-document-origin
2024-10-22 11:04:49 +02:00
Peter Staar
bb3db07836
fixed the mypy
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 09:44:27 +02:00
Peter Staar
b04f14ec24
able to parse the captions and image URIs
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 09:13:08 +02:00
Peter Staar
1c0a766cc5
working on asciidocs, struggling with ImageRef
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-22 07:42:42 +02:00
Maksym Lysak
8c60dfa0e6
Fixed example run_md, added origin info to md_backend
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 16:42:18 +02:00
Maksym Lysak
1456a36618
Fixes MyPy requirements, and rest of pre-commit
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:43:39 +02:00
Maksym Lysak
dae366440c
Cleaned code, improved logging for MD
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
ba9beb65e3
Added initial docling table support to md_backend
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
fa2f8cf236
Detecting and assembling tables in markdown in temporary buffers
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
bef429fee3
Improvements in md parsing
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
534b2203f6
md_backend produces docling document with headers, paragraphs, lists
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
1df89f79ff
work in progress on MD backend
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:47 +02:00
Maksym Lysak
5986213cfe
Drafting Markdown backend via Marko library
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-21 15:11:42 +02:00
Peter Staar
c23d049270
adding test_02.asciidoc
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-21 06:01:56 +02:00
Peter Staar
e60c52586b
fixed the mypy
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-19 06:23:35 +02:00
Peter Staar
70b2ae3fab
reformatted the code
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:57:26 +02:00
Peter Staar
5016daeae3
first working asciidoc parser
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
Peter Staar
1138cae7f1
adding tests for asciidocs
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
github-actions[bot]
c60c402e15
chore: bump version to 2.1.0 [skip ci]
2024-10-18 16:51:39 +02:00
Michele Dolfi
006cfb4125
feat: add coverage_threshold to skip OCR for small images (#161)
...
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
ABHISHEK FADAKE
b6c061093c
docs: typo fix (#155)
...
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
Panos Vagenas
eb154a1c28
fix: fix legacy doc ref (#162)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 16:51:39 +02:00
Michele Dolfi
77fa1db3a1
ci: run ci also on forks (#160)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-18 16:51:39 +02:00
Christoph Auer
63d3704e54
Ensure all models work only on valid pages (#158)
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-18 16:51:39 +02:00
github-actions[bot]
d5460e2d1f
chore: bump version to 2.1.0 [skip ci]
2024-10-18 13:21:15 +00:00
Michele Dolfi
b346faf622
feat: add coverage_threshold to skip OCR for small images (#161)
...
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:58:23 +02:00
ABHISHEK FADAKE
f799e777c1
docs: typo fix (#155)
...
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:56:48 +02:00
Panos Vagenas
63bef59d9e
fix: fix legacy doc ref (#162)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 13:11:20 +02:00
Michele Dolfi
bb7a58d45d
ci: run ci also on forks (#160)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-18 12:32:27 +02:00
Christoph Auer
a00c937e19
Ensure all models work only on valid pages (#158)
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-18 08:54:06 +02:00
Peter Staar
c1d9241b39
updated the asciidoc backend
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-18 08:28:02 +02:00
Peter Staar
12033537e3
updated the base-model and added the asciidoc_backend
...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2024-10-17 19:58:07 +02:00
Maxim Lysak
034a411057
docs: add graphical band in readme (#154)
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 18:15:40 +02:00
Michele Dolfi
61c092f445
docs: add use docling (#150)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
Michele Dolfi
24f949ada2
chore: run apt-get update before install (#156)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 17:27:16 +02:00
github-actions[bot]
a29c256041
chore: bump version to 2.0.0 [skip ci]
2024-10-16 19:48:06 +00:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 (#117)
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas
d504432c1e
docs: introduce docs site (#141)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
Michele Dolfi
2b1e72d327
refactor: fix type of tesseractocr options (#140)
...
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-14 08:40:22 +02:00
github-actions[bot]
4672b24c1a
chore: bump version to 1.20.0 [skip ci]
2024-10-11 13:48:02 +00:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend (#131)
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 15:12:49 +02:00
github-actions[bot]
2ec39636f0
chore: bump version to 1.19.1 [skip ci]
2024-10-11 08:52:09 +00:00
Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138)
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-11 10:21:19 +02:00