Commit Graph

  • af18215714 Rename docling backend to v4 Christoph Auer 2025-03-14 12:35:06 +01:00
  • fa16b12316 chore: move to docling-project org (#1160) Michele Dolfi 2025-03-14 12:35:29 +01:00
  • b77f73beec Text fixes, new test data Christoph Auer 2025-03-14 11:44:09 +01:00
  • f94da44ec5 fix(html): handle nested empty lists (#1154) Cesar Berrospi Ramis 2025-03-13 16:56:58 +01:00
  • e00f362405 Update tests, use TextCell.from_ocr property Christoph Auer 2025-03-13 16:04:08 +01:00
  • 0945973b79 fix: use first table row as col headers (#1156) Panos Vagenas 2025-03-13 15:34:18 +01:00
  • 6eb718f849 feat: equations to latex in MSWord backend (with inline groups) (#1114) Rafael Teixeira de Lima 2025-03-13 15:12:22 +01:00
  • aa92a57fa9 fix: Pass tests, update docling-core to 2.22.0 (#1150) Cesar Berrospi Ramis 2025-03-13 09:45:55 +01:00
  • 6e06040da6 Fix tests Christoph Auer 2025-03-12 20:04:17 +01:00
  • f1cce8ff07 Ground-truth files updated Christoph Auer 2025-03-12 19:57:18 +01:00
  • 519bc43e47 fix: update docling-core to 2.22.0 Cesar Berrospi Ramis 2025-03-12 19:38:03 +01:00
  • 90b0f73d06 Update locks Christoph Auer 2025-03-12 16:54:23 +01:00
  • 9ebd7108f2 Add back DoclingParse v1 backend, pipeline options Christoph Auer 2025-03-12 16:28:25 +01:00
  • 8a45a2cafa update test units Christoph Auer 2025-03-12 12:07:03 +01:00
  • 15282547cb update test cases Christoph Auer 2025-03-12 11:04:48 +01:00
  • 18b4991aa4 Reset tests Christoph Auer 2025-03-11 16:34:38 +01:00
  • a5089ef8f6 Merge branch 'cau/docling-parse-api' of github.com:DS4SD/docling into cau/docling-parse-api Christoph Auer 2025-03-11 16:31:50 +01:00
  • 1b9fcf0edf Fix streams Christoph Auer 2025-03-11 16:24:49 +01:00
  • 31c86613e5 Fix streams Christoph Auer 2025-03-11 16:24:49 +01:00
  • fbcde2cdeb Merge branch 'main' of github.com:DS4SD/docling into cau/docling-parse-api Christoph Auer 2025-03-11 16:06:55 +01:00
  • f411772569 Fixes and test updates Christoph Auer 2025-03-11 16:06:28 +01:00
  • 0dd596ff09 Draft implementation of Doctag backend dev/doctag_backend Maksym Lysak 2025-03-11 14:02:34 +01:00
  • 78353f1697 Use docling-core with docling-parse types Christoph Auer 2025-03-11 13:37:24 +01:00
  • 17c5bf1242 chore: bump version to 2.26.0 [skip ci] v2.26.0 github-actions[bot] 2025-03-11 11:12:43 +00:00
  • eb97357b05 feat: Use new TableFormer model weights and default to accurate model version (#1100) Christoph Auer 2025-03-11 10:53:49 +01:00
  • 5e30381c0d perf: New revision code formula model and document picture classifier (#1140) Matteo 2025-03-11 09:15:28 +00:00
  • 099aa4da83 Updates for DoclingParseV3DocumentBackend Christoph Auer 2025-03-10 17:11:20 +01:00
  • 4d64c4c0b6 fix(CLI): fix help message for abort options (#1130) Michele Dolfi 2025-03-07 14:47:49 +01:00
  • e1c49ad727 docs: add description of DOCLING_ARTIFACTS_PATH env var (#1124) Michele Dolfi 2025-03-06 07:30:07 +01:00
  • a3c957ca6b chore: bump version to 2.25.2 [skip ci] v2.25.2 github-actions[bot] 2025-03-05 14:51:57 +00:00
  • c56ab3a66b fix: Proper handling of orphan IDs in layout postprocessing (#1118) Christoph Auer 2025-03-05 14:30:59 +01:00
  • 655e95dd72 Upgrading docling core and adding groups rtdl/docx_latex Rafael Teixeira de Lima 2025-03-04 17:18:40 +01:00
  • 5630c6b8fd Merge branch 'main' into rtdl/docx_latex Rafael Teixeira de Lima 2025-03-04 16:51:53 +01:00
  • 357d41cc47 docs: Enrichment models (#1097) Michele Dolfi 2025-03-04 14:24:38 +01:00
  • b1e79cadc7 chore: bump version to 2.25.1 [skip ci] v2.25.1 github-actions[bot] 2025-03-03 00:56:40 +00:00
  • 0c1e9391de chore: use gh cache for huggingface models (#1096) Michele Dolfi 2025-03-03 00:13:47 +01:00
  • 8dc0562542 fix: enable locks for threadsafe pdfium (#1052) Michele Dolfi 2025-03-02 20:06:44 +01:00
  • e25d557c06 refactor: add the contentlayer to html-backend (#1040) Peter W. J. Staar 2025-03-02 10:37:53 -05:00
  • db3ceefd4a docs: improve docs on token limit warning triggered by HybridChunker (#1077) Panos Vagenas 2025-02-28 14:54:46 +01:00
  • de7b963b09 fix(html): use 'start' attribute when parsing ordered lists from HTML docs (#1062) Cesar Berrospi Ramis 2025-02-27 09:46:57 +01:00
  • 37dd8c1cc7 chore: bump version to 2.25.0 [skip ci] v2.25.0 github-actions[bot] 2025-02-26 14:16:15 +00:00
  • 3c9fe76b70 feat: [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054) Christoph Auer 2025-02-26 14:43:26 +01:00
  • ab683e4fb6 feat(cli): add option for downloading all models, refine help messages (#1061) Panos Vagenas 2025-02-26 13:27:29 +01:00
  • e197225739 fix: vlm using artifacts path (#1057) Michele Dolfi 2025-02-26 08:33:50 +01:00
  • c84b973959 docs: extend chunking docs, add FAQ on token limit (#1053) Panos Vagenas 2025-02-25 13:07:38 +01:00
  • 1c75b52f85 re-built poetry.lock mly/smol-docling-integration Maksym Lysak 2025-02-24 17:37:35 +01:00
  • 9ecec1d330 Updated poetry.lock Maksym Lysak 2025-02-24 17:27:50 +01:00
  • 923f766ada Replaced remaining strings to appropriate enums Maksym Lysak 2025-02-24 16:54:59 +01:00
  • a095a7c5b7 removing changes from base_pipeline Maksym Lysak 2025-02-24 15:13:59 +01:00
  • a7a1f32b10 Added example on how to get original predicted doctags in minimal_smol_docling Maksym Lysak 2025-02-24 14:39:18 +01:00
  • 1dbedcbb4e removed pipeline_options.generate_table_images from vlm_pipeline (deprecated in the pipelines) Maksym Lysak 2025-02-24 14:17:06 +01:00
  • 0c60ef199a Moved keep_backend = True to vlm pipeline Maksym Lysak 2025-02-13 17:53:03 +01:00
  • 853544ba11 Addressing PR comments, added enabled property to SmolDocling, and related VLM pipeline option, few other minor things Maksym Lysak 2025-02-13 17:19:53 +01:00
  • b0935daec4 Removed special html code wrapping when exporting to docling document, cleaned up comments Maksym Lysak 2025-02-13 10:29:37 +01:00
  • b12f5ba80f removed minimal_smol_docling example from CI checks Maksym Lysak 2025-02-13 09:42:45 +01:00
  • 66532eadb6 More elegant solution in removing the input prompt Maksym Lysak 2025-02-12 18:48:48 +01:00
  • e486eb1720 Cleaned up unnecessary logging Maksym Lysak 2025-02-12 17:56:37 +01:00
  • 55fa4eb4e3 Fix repo id Christoph Auer 2025-02-12 17:09:56 +01:00
  • 6f9f4f4aee Update minimal smoldocling example Christoph Auer 2025-02-12 17:07:00 +01:00
  • b1df461ca8 Added captions for the images for SmolDocling assembly code, improved provenance definition for all elements Maksym Lysak 2025-02-11 16:42:23 +01:00
  • d7abe1b1cd Updated example of Smol Docling usage Maksym Lysak 2025-02-11 13:53:19 +01:00
  • 479ee239aa New assembly code for latest model revision, updated prompt and parsing of doctags, updated logging Maksym Lysak 2025-02-11 13:34:14 +01:00
  • 7c4ab5c716 Moved artifacts_path for SmolDocling into vlm_options instead of global pipeline option Maksym Lysak 2025-01-21 18:00:05 +01:00
  • f2751e11f9 Introduced SmolDoclingOptions to configure model parameters (such as query and artifacts path) via client code, see example in minimal_smol_docling. Provisioning for other potential vlm all-in-one models. Maksym Lysak 2025-01-21 17:37:11 +01:00
  • 88b9ac6706 Fixing doctags starting tag, that broke elements on first line during assembly Maksym Lysak 2025-01-21 11:14:55 +01:00
  • 0fe12d819a Updated vlm pipeline assembly and smol docling model code to support updated doctags Maksym Lysak 2025-01-17 17:54:55 +01:00
  • f6d123a01c Flipped keep_backend to True for vlm_pipeline assembly to work Maksym Lysak 2025-01-16 16:51:27 +01:00
  • 9901729d8c Exposed "force_backend_text" as pipeline parameter Maksym Lysak 2025-01-16 14:23:59 +01:00
  • 0dc3ac43b1 Added capability for vlm_pipeline to grab text from preconfigured backend Maksym Lysak 2025-01-16 10:44:49 +01:00
  • e0929781f4 Added tokens/sec measurement, improved example Maksym Lysak 2025-01-15 10:22:48 +01:00
  • 437053572d Replaced hardcoded otsl tokens with the ones from docling-core tokens.py enum Maksym Lysak 2025-01-14 16:07:37 +01:00
  • 2a43c199d5 Cleaned up logs, added pages to vlm_pipeline, basic timing per page measurement in smol_docling models Maksym Lysak 2025-01-14 14:04:47 +01:00
  • 61bb9dbba2 Properly propagating image data per page, together with predicted tags in VLM pipeline. This enables correct figure extraction and page numbers in provenances Maksym Lysak 2025-01-13 15:21:19 +01:00
  • 01c46e24b1 Fix for table span compute in vlm_pipeline Maksym Lysak 2025-01-10 16:30:12 +01:00
  • ef079e4e78 Enabled figure support in vlm_pipeline Maksym Lysak 2025-01-10 13:56:46 +01:00
  • 1b968e4984 Fixes to preserve page image and demo export to html Maksym Lysak 2025-01-10 10:50:35 +01:00
  • 3c4c647615 WIP, first working code for inference of SmolDocling, and vlm pipeline assembly code, example included. Maksym Lysak 2025-01-09 18:41:00 +01:00
  • 03c8d45790 wip smolDocling inference and vlm pipeline Maksym Lysak 2025-01-09 14:43:04 +01:00
  • 1b0ead6907 fix(html): Parse text in div elements as TextItem (#1041) Cesar Berrospi Ramis 2025-02-24 12:38:29 +01:00
  • dc3a388aa2 Skeleton for SmolDocling model and VLM Pipeline Christoph Auer 2025-01-08 10:16:54 +01:00
  • 1d17e7397a test: avoid testing exact JSON in CSV backend (#1038) Suehtam 2025-02-24 07:10:40 +00:00
  • d8a81c3168 chore: bump version to 2.24.0 [skip ci] v2.24.0 github-actions[bot] 2025-02-20 18:31:20 +00:00
  • c93e36988f feat: Implement new reading-order model (#916) Christoph Auer 2025-02-20 17:51:17 +01:00
  • c031a7ae47 chore: bump version to 2.23.1 [skip ci] v2.23.1 github-actions[bot] 2025-02-20 16:26:41 +00:00
  • 1ac010354f test: avoid testing exact JSON (#1027) Cesar Berrospi Ramis 2025-02-20 16:20:07 +01:00
  • 6796f0a132 fix: Runtime error when Pandas Series is not always of string type (#1024) fanszoro 2025-02-20 22:41:41 +08:00
  • dfcc30dddb chore: Update tests and lockfile (#1021) Christoph Auer 2025-02-19 16:51:53 +01:00
  • 27c04007bc docs: revamp picture description example (#1015) Panos Vagenas 2025-02-19 11:28:54 +01:00
  • 7450050ace refactor: upgrade BeautifulSoup4 with type hints (#999) Cesar Berrospi Ramis 2025-02-18 11:30:47 +01:00
  • dadff50589 fix: Disable the TOKENIZERS_PARALLELISM in test_e2e_ocr_conversion.py to avoid warning messages from HF nli/fix_ocr_tests Nikos Livathinos 2025-02-18 10:58:11 +01:00
  • 75db61127c chore: bump version to 2.23.0 [skip ci] v2.23.0 github-actions[bot] 2025-02-17 14:22:49 +00:00
  • 6e75f0b5d3 fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) Maxim Lysak 2025-02-17 14:11:55 +01:00
  • 77eb77bdc2 feat: Support cuda:n GPU device allocation (#694) Ahmed Nassar 2025-02-17 11:31:13 +01:00
  • 428b656793 feat(xml-jats): parse XML JATS documents (#967) Cesar Berrospi Ramis 2025-02-17 10:43:31 +01:00
  • e1436a8b05 test: validate actual docitems in tests (#966) Michele Dolfi 2025-02-14 17:47:53 +01:00
  • b5b1ddca3b chore: Restore the orphan clusters Nikos Livathinos 2025-02-14 11:13:54 +01:00
  • ffbde1d1b0 chore: bump version to 2.22.0 [skip ci] v2.22.0 github-actions[bot] 2025-02-14 08:53:20 +00:00
  • 00d9405b0a feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) Tobias Strebitzer 2025-02-14 15:55:09 +08:00
  • 7493d5b01f docs: update example Dockerfile with download CLI (#929) Michele Dolfi 2025-02-13 14:19:50 +01:00
  • af19c03f6e fix: update Pillow constraints (#958) Michele Dolfi 2025-02-13 14:19:37 +01:00