Commit Graph

  • afafb97b87 Update CLI Christoph Auer 2024-10-15 09:50:06 +0200
  • aa22fd31db small corrections to pptx Maxim Lysak 2024-10-15 09:43:06 +0200
  • 5b33b12660 renaming BaseTableData Christoph Auer 2024-10-14 17:01:50 +0200
  • b964c4bb69 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Christoph Auer 2024-10-14 16:54:56 +0200
  • 57de8ad63a Fix generate_multimodal_pages Christoph Auer 2024-10-14 16:52:58 +0200
  • 98ca58ffd0 added support for enumerated lists Maxim Lysak 2024-10-14 16:48:55 +0200
  • 3f0b01702b Update example export code Christoph Auer 2024-10-14 16:40:40 +0200
  • a50ba57a1f Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Christoph Auer 2024-10-14 16:36:20 +0200
  • 497ddb34a8 Big refactoring for legacy_document support Christoph Auer 2024-10-14 16:36:11 +0200
  • e87bf9ae06 Updated pptx backend, fixes issues with lists, also added more different list cases to example Maxim Lysak 2024-10-14 16:20:17 +0200
  • d504432c1e docs: introduce docs site (#141) Panos Vagenas 2024-10-14 14:13:13 +0200
  • 8da09e4133 docs: introduce docs site Panos Vagenas 2024-10-14 13:05:25 +0200
  • 08ab628e75 use self.artifacts_path Michele Dolfi 2024-10-14 09:03:49 +0200
  • ab8f71511b fix artifacts_path via pipeline_options Michele Dolfi 2024-10-14 08:57:15 +0200
  • 2b1e72d327 refactor: fix type of tesseractocr options (#140) Michele Dolfi 2024-10-14 08:40:22 +0200
  • e3bbaeff5e refactor: fix type of tesseractocr options Michele Dolfi 2024-10-13 20:13:52 +0200
  • 245b6c4c01 pin picture data with molecule Michele Dolfi 2024-10-13 18:07:43 +0200
  • ddb509628e use do_ flag in pipeline_options Michele Dolfi 2024-10-13 16:54:46 +0200
  • 7c8d7e222e use new PictureData Michele Dolfi 2024-10-13 16:48:16 +0200
  • c1ed447c21 propagate raises, add enrichment model, some renaming Michele Dolfi 2024-10-13 16:03:19 +0200
  • 941b51aa3e missing renamed files Michele Dolfi 2024-10-11 18:10:45 +0200
  • 7f10a546d3 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Michele Dolfi 2024-10-11 17:04:01 +0200
  • 98f1a4597e rename and refactor *model* Michele Dolfi 2024-10-11 16:57:40 +0200
  • 69f0ab419c Bump docling-core version Christoph Auer 2024-10-11 16:55:01 +0200
  • 2a259b9723 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Christoph Auer 2024-10-11 16:47:20 +0200
  • 6efcf0a5a5 Add image format support to PdfBackend Christoph Auer 2024-10-11 16:47:15 +0200
  • 6c9f869dc7 fix default _enrich_document Michele Dolfi 2024-10-11 16:38:45 +0200
  • 5b5c99e9da Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Michele Dolfi 2024-10-11 16:31:28 +0200
  • ca2a96d982 initial refactor iteration Michele Dolfi 2024-10-11 16:31:13 +0200
  • d0fccb9342 Merge from simplify-conv-api Christoph Auer 2024-10-11 15:57:08 +0200
  • 4672b24c1a chore: bump version to 1.20.0 [skip ci] v1.20.0 github-actions[bot] 2024-10-11 13:48:02 +0000
  • 5e4944f15f feat: new experimental docling-parse v2 backend (#131) Christoph Auer 2024-10-11 15:12:49 +0200
  • 95c1f80087 Change code to use unordered/ordered list, robustifications Christoph Auer 2024-10-11 14:53:38 +0200
  • 136f16e85a feat!: simplify conversion API (#139) Panos Vagenas 2024-10-11 14:52:37 +0200
  • 8777b759ae feat!: simplify conversion API Panos Vagenas 2024-10-11 14:31:07 +0200
  • 7fad5cc785 pin docling-parse release Michele Dolfi 2024-10-11 13:37:12 +0200
  • 753f67a434 fixes Michele Dolfi 2024-10-11 13:06:32 +0200
  • 94b5e1532d add GlmOptions Michele Dolfi 2024-10-11 13:03:38 +0200
  • 786b89efd9 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Michele Dolfi 2024-10-11 12:59:11 +0200
  • c6e1471e02 use options objects Michele Dolfi 2024-10-11 12:58:59 +0200
  • 3ee97c42b2 Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Christoph Auer 2024-10-11 12:57:56 +0200
  • 52713f0cf5 Optionally produce legacy_doc Christoph Auer 2024-10-11 12:57:47 +0200
  • cc9bcc424d fix generation enabled Michele Dolfi 2024-10-11 11:49:38 +0200
  • 331ab36f04 Merge remote-tracking branch 'origin/main' into cau/input-format-abstraction Michele Dolfi 2024-10-11 11:23:04 +0200
  • 025983f07b Backend error handling fixes Christoph Auer 2024-10-11 11:18:47 +0200
  • e9ab3e8849 rerun lock Michele Dolfi 2024-10-11 11:05:47 +0200
  • 4a5576c4c9 Merge remote-tracking branch 'origin/main' into cau/integrate-docling-parse-v2 Michele Dolfi 2024-10-11 10:59:08 +0200
  • 8f6347dbb1 Merge remote-tracking branch 'origin/main' into cau/integrate-docling-parse-v2 Michele Dolfi 2024-10-11 10:56:57 +0200
  • c47ee35bc4 set log level in constructor Michele Dolfi 2024-10-11 10:54:11 +0200
  • 2ec39636f0 chore: bump version to 1.19.1 [skip ci] v1.19.1 github-actions[bot] 2024-10-11 08:52:09 +0000
  • ca79cbcc2c pin latest docling-parse PR Michele Dolfi 2024-10-11 10:50:38 +0200
  • 304d16029a More renaming, design enrichment interface Christoph Auer 2024-10-11 10:21:31 +0200
  • dae2a3b667 fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) Nikos Livathinos 2024-10-11 10:21:19 +0200
  • 051beae203 use new interface in minimal example Michele Dolfi 2024-10-11 08:30:09 +0200
  • 7aad3dc946 Update test cases for v2 Christoph Auer 2024-10-10 18:51:19 +0200
  • cd72ea2412 Added verify_conversion_result_v2, Regenerate v1 and v2 test data Christoph Auer 2024-10-10 15:37:36 +0200
  • 1bcad334f2 pin docling-parse release Michele Dolfi 2024-10-10 18:30:09 +0200
  • 3794f8245e add example PNG Michele Dolfi 2024-10-10 18:29:26 +0200
  • a84ba6ddec list all PIL supported mime types Michele Dolfi 2024-10-10 18:28:56 +0200
  • bde8186700 update pinning Michele Dolfi 2024-10-10 17:54:05 +0200
  • c31045754d Merge branch 'cau/input-format-abstraction' of github.com:DS4SD/docling into cau/input-format-abstraction Michele Dolfi 2024-10-10 17:41:07 +0200
  • 50c05b262a pin updates compatible with each other Michele Dolfi 2024-10-10 17:40:32 +0200
  • 4a0c9be576 fix(TesseractOcrCliModel): Send the stderr to devnull to avoid poluting the console with messages from tesseract cmd Nikos Livathinos 2024-10-10 17:19:44 +0200
  • 99cfea38d6 Added verify_conversion_result_v2, Regenerate v1 and v2 test data Christoph Auer 2024-10-10 15:37:36 +0200
  • 7cad290ceb Refactor test data, legacy usage and more Christoph Auer 2024-10-10 13:54:44 +0200
  • 9e0a6ffd41 feat(OCR tests): Introduce fuzziness in the text validation of OCR tests Nikos Livathinos 2024-10-10 11:37:43 +0200
  • 5f1bd9e9c8 docs: simplify LlamaIndex example using Docling extension (#135) Panos Vagenas 2024-10-09 22:17:56 +0200
  • 5bc67d5a71 chore: simplify LlamaIndex example using Docling extension Panos Vagenas 2024-10-09 17:31:49 +0200
  • da0700f959 Fixes for docx backend Maxim Lysak 2024-10-09 16:47:23 +0200
  • b5a27386c1 Merge from main, update OCR model and test cases Christoph Auer 2024-10-09 16:04:19 +0200
  • 0dfbd0b6fc Update examples and test cases Christoph Auer 2024-10-09 15:20:27 +0200
  • 6924999f1f chore: explicitly manage pandas dependency (#134) Panos Vagenas 2024-10-09 14:50:39 +0200
  • d0c0d18c18 chore: explicitly manage pandas dependency Panos Vagenas 2024-10-09 14:06:09 +0200
  • 0ffc1708d2 chore: bump version to 1.19.0 [skip ci] v1.19.0 github-actions[bot] 2024-10-08 17:42:29 +0000
  • f96ea86a00 feat: add options for choosing OCR engines (#118) Michele Dolfi 2024-10-08 19:07:08 +0200
  • a8cada3694 Update to latest HF models Christoph Auer 2024-10-08 18:54:15 +0200
  • 3d66062db8 missing one part of the comment Michele Dolfi 2024-10-08 18:37:24 +0200
  • 800b16beff keep only one example Michele Dolfi 2024-10-08 18:36:53 +0200
  • bb8cd0f7fc fix: Rename the tesseract OCR related classes and filenames Nikos Livathinos 2024-10-08 16:46:25 +0200
  • 080042d06d Merge from upstream Christoph Auer 2024-10-08 16:40:55 +0200
  • 203cf19b1b Lots of improvements Christoph Auer 2024-10-08 16:38:42 +0200
  • 07d952acf9 Improved backends Maxim Lysak 2024-10-08 16:37:47 +0200
  • 70a8a2cc82 chore(OCR): Rename class names to use Tesseract for the tesserocr and TesseractCLI for the tesseract process Nikos Livathinos 2024-10-08 14:44:23 +0200
  • c0447206af Merge from main Christoph Auer 2024-10-08 14:42:33 +0200
  • 074acd703c feat(OCR): Introduce support for the language path in the pipelines of both Tesseract OCR engines. Nikos Livathinos 2024-10-08 14:24:13 +0200
  • 118afee1f3 fix(TesserOcrModel): Fix cell coordinates Nikos Livathinos 2024-10-08 14:17:54 +0200
  • 29e65e911b fix(test): Introduce parameter in verify_conversion_result() to allow skipping the verification of the cells. It is used in case of OCR tests. Nikos Livathinos 2024-10-08 14:14:43 +0200
  • 072aaf6bb1 fix(test): Update test data for OCR Nikos Livathinos 2024-10-08 14:12:22 +0200
  • 1d55cbdca9 Updates for Powerpoint backend Christoph Auer 2024-10-08 13:19:58 +0200
  • dd8a0e9e44 Support Document Index as a layout class Christoph Auer 2024-10-08 12:32:03 +0200
  • 5bd64779d1 add docs for TESSDATA_PREFIX Michele Dolfi 2024-10-08 11:37:24 +0200
  • ea3f720ef5 remove pydantic warning for model_ Michele Dolfi 2024-10-08 11:32:54 +0200
  • 89e58ca730 Added HTML backend implementation, few improvements for other backends Maxim Lysak 2024-10-08 11:14:44 +0200
  • 67746044a9 Merge remote-tracking branch 'origin/main' into feat-multiple-ocr-engines Michele Dolfi 2024-10-08 10:55:08 +0200
  • 73108d597c docs: explain OCR options Michele Dolfi 2024-10-08 10:54:43 +0200
  • d412c363d7 fixed unload pdf backend resources (#129) Fasal Shah 2024-10-08 14:16:43 +0530
  • 94bffd7383 Silence v2 parser logging Christoph Auer 2024-10-08 10:34:21 +0200
  • 471daee277 reorder sections in custom_convert Michele Dolfi 2024-10-08 09:53:52 +0200
  • 8ec8c38de8 fix(CI/CD): Use the eng language package location to set the TESSDATA_PREFIX envvar Nikos Livathinos 2024-10-07 18:34:50 +0200
  • be6489bde0 fix(tests): Refactor the data_scanned with a very simple document that allows all OCR engines to produce the same result. Nikos Livathinos 2024-10-07 18:15:36 +0200