Commit Graph

  • a1cb0dd344 fix minor bugs, mark helper methods internal Panos Vagenas 2025-04-03 14:21:34 +0200
  • 88a9756861 Detecting table orientation dev/table-orientation Maksym Lysak 2025-04-03 11:10:57 +0200
  • c4f9916fbb Fix add_list_item SimJeg 2025-04-02 17:48:03 +0200
  • da25453155 Address feedback SimJeg 2025-04-02 17:20:52 +0200
  • f40b21e94c Run precommit SimJeg 2025-04-02 16:14:10 +0200
  • cd4b214f05 Merge branch 'main' into docx-markdown-formatting SimJeg 2025-04-02 14:56:30 +0200
  • 71148eb381 docs: add visual grounding example (#1270) Panos Vagenas 2025-04-02 14:03:19 +0200
  • 1028df66e4 Merge remote-tracking branch 'upstream/main' into docx-markdown-formatting SimJeg 2025-04-02 13:47:47 +0200
  • 21ee884c54 Merge branch 'main' into show-visual-grounding Panos Vagenas 2025-04-02 13:29:30 +0200
  • d2d68747f9 fix(docx): Improve text parsing (#1268) Rafael Teixeira de Lima 2025-04-02 12:56:44 +0200
  • c0f769cdd0 Merge branch 'main' into rtdl/improve_text_parsing Rafael Teixeira de Lima 2025-04-02 12:05:03 +0200
  • c979eaab1a Remove trailing space Rafael Teixeira de Lima 2025-04-02 12:02:05 +0200
  • 331c6ab466 Fix trailing space Rafael Teixeira de Lima 2025-04-02 11:29:14 +0200
  • e535209c75 Flexibilize heading detection Rafael Teixeira de Lima 2025-04-02 10:32:36 +0200
  • d5431577f0 fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) Guilhem VERMOREL 2025-03-31 10:53:49 +0200
  • 76982a5b15 Improve text parsing Rafael Teixeira de Lima 2025-03-31 11:41:06 +0200
  • eb4d17bba5 chore: bump version to 2.28.4 [skip ci] github-actions[bot] 2025-03-29 11:56:42 +0000
  • 895cedb9ab Use inline_fmt everywhere SimJeg 2025-04-01 11:52:20 +0200
  • 61bd559f78 Handle header and footer SimJeg 2025-04-01 11:36:47 +0200
  • 1033c25435 Run black and mypy SimJeg 2025-04-01 11:19:28 +0200
  • eebe88162d feat(ocr): Add OnnxTR as possible OCR engine felix 2025-04-01 08:59:14 +0200
  • 60306f9a83 Strip elements SimJeg 2025-03-31 16:39:40 +0200
  • 473a51adca Strip elements SimJeg 2025-03-31 16:38:44 +0200
  • 5b4464a741 Handle bullet lists SimJeg 2025-03-31 16:27:50 +0200
  • fcaad41a0a Use inline group SimJeg 2025-03-31 15:55:51 +0200
  • f96da371ab docs: add visual grounding example Panos Vagenas 2025-03-31 14:48:09 +0200
  • 094a674bf3 Handle formatting properly for DocItemLabel.PARAGRAPH SimJeg 2025-03-31 14:36:29 +0200
  • 1f9872c8ae feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-31 14:31:47 +0200
  • 4cd2ec5515 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-25 16:07:53 +0100
  • 45fc08b9cb feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-25 10:42:52 +0100
  • 07dfdaf7ba feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-25 10:07:33 +0100
  • e74d229d4b feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-25 08:38:13 +0100
  • 98496fafcc feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-25 08:00:47 +0100
  • c28907ed9c feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-23 10:53:39 +0100
  • f6560cf662 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-22 13:36:26 +0100
  • 87fa9ae7a4 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-22 13:31:05 +0100
  • 15799af736 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-22 13:30:28 +0100
  • f8dba5891c feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-22 13:29:11 +0100
  • e4ab4ce576 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-22 13:28:17 +0100
  • 268fa98821 feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-21 22:27:35 +0100
  • 0f183af23b feat(ocr): Add OnnxTR as possible OCR engine felix 2025-03-21 22:26:07 +0100
  • cfc42458ae [Feature] Add OnnxTR as possible OCR engine felix 2025-03-21 22:17:51 +0100
  • a19cf81f98 style & quality applied felix 2025-03-21 21:49:50 +0100
  • 7c87467ea5 format felix 2025-03-21 21:11:31 +0100
  • 35f185f545 init felix 2025-03-21 21:09:16 +0100
  • d3362d1553 Handle hyperlink SimJeg 2025-03-31 12:32:22 +0200
  • 23fa9b9902 Use Formatting SimJeg 2025-03-31 12:20:48 +0200
  • 01b4c12d3b Fix imports SimJeg 2025-03-31 11:38:18 +0200
  • fbfb37f363 Merge branch 'main' into docx-markdown-formatting SimJeg 2025-03-31 11:22:23 +0200
  • 806a090e65 Merge branch 'main' into docx-markdown-formatting SimJeg 2025-03-31 10:56:32 +0200
  • b3d111a3cd fix: Tesseract OCR CLI can't process images composed with numbers only (#1201) Guilhem VERMOREL 2025-03-31 10:53:49 +0200
  • 5e12d0795a fix: ensure handling of pictures only applies to picture with an image attribute and image part of all extensions except with emf or wmf extensions to avoid bug in adding picture to doc (just added ny signoff) Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr Benichou 2025-03-30 16:08:19 -0400
  • 68be6e1873 fix: ensure handling of pictures only applies to picture with an image attribute and image part of all extensions except with emf or wmf extensions to avoid bug in adding picture to doc Benichou 2025-03-30 16:02:01 -0400
  • fbc4c4d103 bug fix to ensure handling of pictures only applies to picture with an image attribute and image part of all extensions except with emf or wmf extensions to avoid bug in adding picture to doc Benichou 2025-03-30 15:50:24 -0400
  • 44f2b081ec chore: bump version to 2.28.4 [skip ci] v2.28.4 github-actions[bot] 2025-03-29 11:56:42 +0000
  • 7afad7e52d fix: Fixes tables when using OCR (#1261) Maxim Lysak 2025-03-29 10:06:00 +0100
  • d3c10b24a4 Fix for the tables when using OCR Maksym Lysak 2025-03-29 09:33:54 +0100
  • 124f921077 chore: bump version to 2.28.3 [skip ci] v2.28.3 github-actions[bot] 2025-03-28 18:30:03 +0000
  • 8bd71e8e33 fix: Word-level pdf cells for tables (#1238) Maxim Lysak 2025-03-28 16:34:48 +0100
  • 3d4a163945 Updated dependency to docling-core Maksym Lysak 2025-03-28 14:36:23 +0100
  • a4f6762aef removed comments Maksym Lysak 2025-03-28 12:19:45 +0100
  • 787c6d8ace word-level pdf cells for tables Maksym Lysak 2025-03-25 13:36:58 +0100
  • 396dc66077 fix wrong type text extracted by tesseract_ocr_cli_model gvl4 2025-03-19 11:32:31 +0100
  • 82694b2136 chore: bump version to 2.28.2 [skip ci] v2.28.2 github-actions[bot] 2025-03-26 16:52:06 +0000
  • 9210812bfa fix: improve HTML layer detection, various MD fixes (#1241) Panos Vagenas 2025-03-26 16:07:14 +0100
  • 7c143e2fb1 docs: replace 'poetry shell' with 'poetry env activate' for poetry>=2.0.0 Michael Krissgau 2025-03-26 15:03:38 +0100
  • 40c099ee62 fix: improve HTML furniture detection, various MD fixes Panos Vagenas 2025-03-26 15:30:52 +0100
  • 85c4df887b fix(html): fix HTML parsed heading level (#1244) Panos Vagenas 2025-03-26 10:30:23 +0100
  • ddc632a675 fix(html): fix HTML parsed heading level Panos Vagenas 2025-03-26 09:32:55 +0100
  • 9eb1686f93 chore: bump version to 2.28.1 [skip ci] v2.28.1 github-actions[bot] 2025-03-25 18:20:23 +0000
  • 38b7108a22 chore: update locked deps (#1239) Panos Vagenas 2025-03-25 15:48:02 +0100
  • 3823f579c9 update poetry lock Panos Vagenas 2025-03-25 14:00:02 +0100
  • f1f7df49e3 Update test-cases cau/test-dp-word-lines Christoph Auer 2025-03-25 13:49:08 +0100
  • ec958be03c propagate core update Panos Vagenas 2025-03-25 13:03:03 +0100
  • 825b226fab fix(converter): Cache same pipeline class with different options (#1152) mislavmartinic 2025-03-26 00:18:44 +1300
  • 6df8827231 fix(debug): Missing translation of bbox to to_bounding_box (#1220) Hoang-Long Do 2025-03-25 18:18:10 +0700
  • f739d0e4c5 fix(docx): identifying numbered headers (#1231) Rafael Teixeira de Lima 2025-03-25 11:41:02 +0100
  • 0974ba4e1c docs(examples): batch conversion doc raises_on_error (#1147) Clément Doumouro 2025-03-25 11:14:39 +0100
  • 71a7d4bf74 propagate core update Panos Vagenas 2025-03-25 11:11:52 +0100
  • 8ebb0bf1a0 chore: properly clean up apt temporary files in Dockerfile (#1223) Peter Dave Hello 2025-03-25 18:10:09 +0800
  • b7fc13f3c4 don't use raw doctags serializer Yusik Kim 2025-03-25 10:23:53 +0100
  • 3a09ca50bb fix: make sure page_items are sorted by page_no Yusik Kim 2025-03-20 16:59:35 +0100
  • 38a23eb50b fix: change argument to single doctags string Yusik Kim 2025-03-20 16:02:23 +0100
  • f6892c7877 fix: properly serialize per page Yusik Kim 2025-03-20 15:18:33 +0100
  • 303b77f03d fix: remove duplicate lines Yusik Kim 2025-03-20 13:27:45 +0100
  • 411a33fc69 fix: make it work for multi-page Yusik Kim 2025-03-20 13:23:48 +0100
  • 50d2ef1ad6 fix: add pages to DoclingDoc Yusik Kim 2025-03-20 09:14:49 +0100
  • b52b5672c9 feat: add function to remove content from DocTags Yusik Kim 2025-03-19 18:22:15 +0100
  • 7415fd251e pre-commit chain run Mislav 2025-03-25 22:24:29 +1300
  • bc3f5bd839 Signed-off-by: hl2311 <dhlong2301@gmail.com> hl2311 2025-03-25 00:34:33 +0700
  • af342d6c95 fix: refactor bbox attr pdf hl2311 2025-03-25 00:19:40 +0700
  • 8d453286a6 fix: Refactor missing bbox attribute to PdfTextCell hl2311 2025-03-24 23:34:41 +0700
  • 448cbb2680 Fix: Add missing bbox attribute to PdfTextCell hl2311 2025-03-22 02:37:46 +0700
  • 921a6e7c97 chore: check new core Panos Vagenas 2025-03-24 15:00:11 +0100
  • 8875e5d581 Add style check Rafael Teixeira de Lima 2025-03-24 12:19:37 +0100
  • ce7760f4a1 Modifications to identify numbered headers Rafael Teixeira de Lima 2025-03-24 12:08:10 +0100
  • caf1b61ac9 chore: properly clean up apt temporary files in Dockerfile Peter Dave Hello 2025-03-22 15:36:11 +0800
  • c030211fdc Fix: Add missing bbox attribute to PdfTextCell hl2311 2025-03-22 02:37:46 +0700
  • 2b6fd251a7 Merge branch 'docling-project:main' into main Ulan Yisaev 2025-03-20 13:15:22 +0200
  • 7df157204b chore: bump version to 2.28.0 [skip ci] v2.28.0 github-actions[bot] 2025-03-19 15:18:10 +0000