Commit Graph

216 Commits

Author SHA1 Message Date
Michele Dolfi
c31045754d Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-10-10 17:41:07 +02:00
Michele Dolfi
50c05b262a pin updates compatible with each other
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-10 17:40:32 +02:00
Christoph Auer
99cfea38d6 Added verify_conversion_result_v2, Regenerate v1 and v2 test data
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 15:37:59 +02:00
Christoph Auer
7cad290ceb Refactor test data, legacy usage and more
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-10 13:54:44 +02:00
Panos Vagenas
5f1bd9e9c8 docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Maxim Lysak
da0700f959 Fixes for docx backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-09 16:52:44 +02:00
Christoph Auer
b5a27386c1 Merge from main, update OCR model and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 16:04:19 +02:00
Christoph Auer
0dfbd0b6fc Update examples and test cases
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-09 15:20:27 +02:00
Panos Vagenas
6924999f1f chore: explicitly manage pandas dependency (#134)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 14:50:39 +02:00
github-actions[bot]
0ffc1708d2 chore: bump version to 1.19.0 [skip ci]
2024-10-08 17:42:29 +00:00
Michele Dolfi
f96ea86a00 feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Christoph Auer
080042d06d Merge from upstream
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:40:55 +02:00
Christoph Auer
203cf19b1b Lots of improvements
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 16:38:42 +02:00
Maxim Lysak
07d952acf9 Improved backends
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 16:37:47 +02:00
Christoph Auer
c0447206af Merge from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 14:42:33 +02:00
Christoph Auer
1d55cbdca9 Updates for Powerpoint backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-08 13:19:58 +02:00
Maxim Lysak
89e58ca730 Added HTML backend implementation, few improvements for other backends
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-08 11:14:44 +02:00
Fasal Shah
d412c363d7 fixed unload pdf backend resources (#129)
Signed-off-by: faisal shah <fashah@redhat.com>
Co-authored-by: faisal shah <fashah@redhat.com>
2024-10-08 10:46:43 +02:00
Maxim Lysak
f773d8a621 Improved demo code, that saves output mds to files
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 17:25:17 +02:00
Maxim Lysak
bea9fc22af Added mspowerpoint backend first implementation, improvements on msword backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 14:55:21 +02:00
Maxim Lysak
1346843301 Improved docx parsing
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-07 13:00:50 +02:00
Christoph Auer
e613f7bc6c Add comments
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-07 12:35:25 +02:00
Maxim Lysak
cefc34e8d8 Working on a first version of DOCX native backend
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-04 18:19:40 +02:00
github-actions[bot]
9b82ae3324 chore: bump version to 1.18.0 [skip ci]
2024-10-03 17:16:00 +00:00
Maxim Lysak
2422f706a1 feat: new torch-based docling models (#120)
---------

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
github-actions[bot]
9ebbbc1245 chore: bump version to 1.17.0 [skip ci]
2024-10-03 13:44:52 +00:00
Rui Dias Gomes
dde0aff8bd update examples (#123)
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2024-10-03 14:28:25 +02:00
Michele Dolfi
d44c62d7ce feat: windows support (#122)
* feat: windows support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add Windows in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Christoph Auer
1fa7cd9855 Fundamental refactoring for multi-format support
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-01 16:54:09 +02:00
Christoph Auer
cd06d89c2a Merge branch 'cau/experimental-format' of github.com:DS4SD/docling into cau/input-format-abstraction
2024-09-30 13:47:57 +02:00
Christoph Auer
0a86529afb Repinning
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-30 13:47:22 +02:00
github-actions[bot]
cde671cf34 chore: bump version to 1.16.1 [skip ci]
2024-09-27 14:36:40 +00:00
Michele Dolfi
34bd887a7f fix: allow usage of opencv 4.6.x (#110)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-27 15:51:43 +02:00
Christoph Auer
91ab382129 Renaming changes
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-27 15:20:01 +02:00
Panos Vagenas
c05b692d69 docs: document chunking (#111)
[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
Christoph Auer
2461b56b84 Import rewrites, adapt to changes in docling-core
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-27 09:21:15 +02:00
github-actions[bot]
6760571fe1 chore: bump version to 1.16.0 [skip ci]
2024-09-27 06:21:15 +00:00
Christoph Auer
d6df76f90b feat: Support tableformer model choice (#90)
* Support tableformer model choice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update datamodel structure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test unit for table options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Ensure import backwards-compatibility for PipelineOptions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust parameters on custom_convert

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Christoph Auer
9ffd1dc396 Merge from main
2024-09-26 18:06:08 +02:00
Christoph Auer
0ee82a5e78 Bump deepsearch-glm
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 16:05:54 +02:00
Christoph Auer
ba9d115f64 Examples: Don't export experimental output by default
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 15:56:29 +02:00
Christoph Auer
ad2bd714d4 Update GT test files for pages
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 15:54:55 +02:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group (#103)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-25 15:47:48 +02:00
Christoph Auer
48d8b7bf70 Sync test data from main
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 12:26:12 +02:00
Christoph Auer
3efc2bbbf4 Apply renamings to DocItemLabel
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 12:22:02 +02:00
Christoph Auer
95c539579d [WIP] introducting extra backend abstraction and input formats
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-25 11:17:49 +02:00
github-actions[bot]
3dfd02a7e9 chore: bump version to 1.15.0 [skip ci]
2024-09-24 15:58:16 +00:00
Michele Dolfi
6a03c208ec feat: add figure in markdown (#98)
* feat: add figures in markdown

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update to new docling-core and update test results with figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update with improved docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-24 17:28:23 +02:00
Christoph Auer
850a521195 Update lockfile
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-24 16:26:22 +02:00
Christoph Auer
33373ac0dd Switch everything to use label enum, and more
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-09-24 16:00:39 +02:00