Commit Graph

  • ad313473c4 updated the cli argument to show_layout Peter Staar 2025-05-27 10:43:20 +0200
  • e838818783 reformatted code Peter Staar 2025-05-27 09:01:32 +0200
  • bb12c96094 updated the cli Peter Staar 2025-05-27 08:56:54 +0200
  • c9735de4c6 feat: Add visualization of bbox on page with html export. Peter Staar 2025-05-27 05:41:51 +0200
  • 218b520ec4 xlsm file ShiroYasha18 2025-05-27 05:12:24 +0530
  • 4b8dfa326b Delete tests/test_backend_msexcel_xlsm.py ShiroYasha18 2025-05-27 05:09:29 +0530
  • 22e7635a0a Update document_converter.py ShiroYasha18 2025-05-27 05:07:53 +0530
  • 3ecbd9a115 Update test_backend_msexcel_xlsm.py ShiroYasha18 2025-05-27 05:06:15 +0530
  • 1f645e62c8 Update test_backend_msexcel.py ShiroYasha18 2025-05-27 02:31:43 +0530
  • 96377cb81e Update base_models.py ShiroYasha18 2025-05-27 01:50:53 +0530
  • 7c7baf814d Merge branch 'docling-project:main' into main ShiroYasha18 2025-05-27 00:11:41 +0530
  • 08beb406d9 fix: when .simplify_text_elements() always put a space between chunks, checks for alphanumeric characters creates more problems than it does good. commit new that testfiles that got forgotten in the last commit. Roman Kayan BAZG 2025-05-25 18:14:32 +0200
  • c75b75e8af fix: pptx shape order Martin Wind 2025-05-25 10:31:06 +0200
  • f11880d5ad Merge branch 'main' into html_backend vaaale 2025-05-24 22:46:28 +0200
  • 5d08b749af A new HTML backend that handles styled html (ignors it) as well as images. vaaale 2025-05-24 22:25:51 +0200
  • 733360c7b2 A new HTML backend that handles styled html (ignors it) as well as images. Alexander Vaagan 2025-04-26 16:30:09 +0200
  • 0c88c5b90f html-backend feature: implement reading anchor tags. currently only paragraphs handle them as in-line groups. Roman Kayan BAZG 2025-05-24 13:01:44 +0200
  • a4e6777bb3 fixed the merge conflicts Peter Staar 2025-05-23 16:30:18 +0200
  • 2579d89510 chore: bump version to 2.34.0 [skip ci] v2.34.0 github-actions[bot] 2025-05-22 18:44:45 +0000
  • fffa865014 test: add test file and case for fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Michael Krissgau 2025-05-22 19:02:59 +0200
  • af4aaa28af fix(msword_backend): Identify text in the same line after an image / image anchor #1425 Michael Krissgau 2025-05-22 17:45:15 +0200
  • c2f595d283 fix: fix ZeroDivisionError for cell_bbox.area() (#1636) Said Gürbüz 2025-05-22 13:43:33 +0200
  • 0d1ff3d8e7 fix ZeroDivisionError for cell_bbox.area() Saidgurbuz 2025-05-22 10:36:35 +0200
  • 45265bf8b1 feat(ocr): auto-detect rotated pages in Tesseract (#1167) Clément Doumouro 2025-05-21 18:12:33 +0200
  • eb4fd8f73e Merge 82af331dae into 90875247e5 Manuel030 2025-05-21 10:32:56 +0000
  • 90875247e5 feat: Establish confidence estimation for document and pages (#1313) Christoph Auer 2025-05-21 12:32:49 +0200
  • 1ce40a7097 chore(ocr): improve logging in case of OSD failure in TesseractOcrCliModel and TesseractOcrModel Clément Doumouro 2025-05-21 11:16:08 +0200
  • 14d4f5b109 fix(integration): update the Apify Actor integration (#1619) Václav Vančura 2025-05-21 02:47:55 +0200
  • 84d0889829 chore: bump version to 2.33.0 [skip ci] v2.33.0 github-actions[bot] 2025-05-20 19:54:51 +0000
  • 30f9570e6e fix(ocr): fix TesseractOcrCliModel._is_auto computation Clément Doumouro 2025-05-20 20:42:57 +0200
  • bac5ce6e38 chore(ocr): default TesseractOcrCliModel._is_auto to False Clément Doumouro 2025-05-20 20:27:22 +0200
  • f4d9d4111b fix: Fix issue with detecting docx files, and files with upper case extensions (#1609) MoheyElDin Badr 2025-05-20 20:42:37 +0300
  • 08832bfd60 chore(ocr): proceed to OCR without rotation when OSD fails in TesseractOcrModel Clément Doumouro 2025-05-20 19:20:42 +0200
  • 0e00a263fa fix: load_from_doctags static usage (#1617) Said Gürbüz 2025-05-20 15:06:12 +0200
  • 0df8b65a05 Merge pull request #1 from apify/vaclav/docling-sync Václav Vančura 2025-05-20 13:26:30 +0200
  • 480d8318a7 update lock file Saidgurbuz 2025-05-20 12:47:24 +0200
  • 44603b0951 revert lock file Saidgurbuz 2025-05-20 12:44:38 +0200
  • 908d38cd67 Update .actor/actor.json Jan Čurn 2025-05-20 10:50:54 +0200
  • 9ac6a33169 fix lock file Saidgurbuz 2025-05-20 10:14:42 +0200
  • 4c8f1caba3 chore(ocr): proceed to OCR without rotation when OSD fails in TesseractOcrCliModel Clément Doumouro 2025-05-19 18:30:54 +0200
  • 5698dd9eee update dependencies Saidgurbuz 2025-05-20 10:03:02 +0200
  • f2e9c0784c fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) Krishnan 2025-05-20 13:29:38 +0530
  • 0867936a3a fix load_from_doctags usage Saidgurbuz 2025-05-20 09:42:34 +0200
  • d73c3eb071 chore(ocr): update e2e OCR test data Clément Doumouro 2025-05-16 11:30:39 +0200
  • bf00fa1a9f chore(ocr): revert layout updates Clément Doumouro 2025-04-22 17:35:43 +0200
  • 1181338737 fix(ocr): avoid to swallow tesseract errors causing orientation detection failures Clément Doumouro 2025-04-09 11:31:44 +0200
  • c1ac22947f chore(ocr): update e2e OCR test data Clément Doumouro 2025-04-08 17:48:55 +0200
  • e2becbcaaf chore(ocr): revert layout updates Clément Doumouro 2025-04-08 17:37:58 +0200
  • fdc6a01bc8 fix(ocr): refactor rotation utilities Clément Doumouro 2025-04-08 17:28:06 +0200
  • 0b39bb58bf fix(ocr): move bounding bow rotation util to orientation.py Clément Doumouro 2025-04-08 15:08:24 +0200
  • 6c88365c66 fix(ocr): rotate image to the natural orientation before layout prediction Clément Doumouro 2025-04-04 17:31:45 +0200
  • 17f208633f fix(ocr): update missing test data Clément Doumouro 2025-04-01 14:26:25 +0200
  • 7a3ef336fd fix(ocr): tesseract support mis-oriented documents Clément Doumouro 2025-03-14 14:31:24 +0100
  • 98b5eeb844 fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) Pedro Ribeiro 2025-05-19 14:26:00 +0100
  • 12a0e64892 feat: add textbox content extraction in msword_backend (#1538) AndrewTsai0406 2025-05-19 21:01:36 +0800
  • a55f45f8e8 fix: add line breaks in table cells George Fonseca 2025-05-16 14:45:44 -0300
  • b438197cdf empty commit to retrigger CI Panos Vagenas 2025-05-19 11:24:19 +0200
  • 2a63e82825 fix detecting files with uppercase extensions MoheyElDin Badr 2025-05-19 07:48:40 +0300
  • ca07927792 get merged_text from boundingbox instead of merging it to prevent overlaps Pedro Ribeiro 2025-05-08 08:47:57 +0100
  • 286aac38c1 chore(actor): bump the Actor version number Václav Vančura 2025-05-18 14:30:51 +0200
  • 8a2550f390 chore(actor): improve the Actor README.md local header link Václav Vančura 2025-05-18 14:30:31 +0200
  • 5006acc01e chore(actor): update Actor README.md with recent repo URL changes Václav Vančura 2025-05-18 14:29:58 +0200
  • d21a870a73 fix(actor): remove references to missing docling_processor.py Václav Vančura 2025-05-18 14:29:05 +0200
  • 374ecd4890 fixed the static load_from_doctags Peter Staar 2025-05-18 10:58:39 +0200
  • 1ada7bfee7 added the html backend to the VLM pipeline Peter Staar 2025-05-18 10:55:27 +0200
  • e93cc3ce09 fixing the tests Peter Staar 2025-05-18 07:38:06 +0200
  • ce6b4d8b9e Merge remote-tracking branch 'origin/fix/fix-issue-with-detecting-docx-files' into fix/fix-issue-with-detecting-docx-files MoheyElDin Badr 2025-05-17 16:28:37 +0300
  • e5a1a077d3 Add other types Mohey El-Din Badr 2025-05-07 12:27:29 +0300
  • 7885c1d751 Update document.py MoheyElDin Badr 2025-05-06 09:40:13 +0300
  • 82af331dae Merge branch 'docling-project:main' into main Manuel030 2025-05-16 16:55:46 +0200
  • 0c7c7c11c2 reformatted the code Peter Staar 2025-05-16 16:31:11 +0200
  • d5b6c871cf streamlining all code Peter Staar 2025-05-16 16:27:27 +0200
  • 661f7c9780 fixed the pipeline for Phi4 Peter Staar 2025-05-16 15:55:49 +0200
  • d41b856961 finalising last points for vlms support Peter Staar 2025-05-16 12:39:26 +0200
  • 7c4c356e76 chore: fix chunking example data link (#1596) Panos Vagenas 2025-05-16 08:44:47 +0200
  • 2f5ad50105 chore: fix chunking example data link Panos Vagenas 2025-05-15 16:39:25 +0200
  • fc61258273 merged with main Peter Staar 2025-05-15 07:46:06 +0200
  • e2c95d09bc need to get Phi4 working again ... Peter Staar 2025-05-15 07:32:55 +0200
  • 15a8f328c2 added pipeline_model_specializations file Peter Staar 2025-05-15 05:27:16 +0200
  • 56208f6dc0 fix/ran poetry run pre-commit run --all-files to format the file Signed-off-by: Franck Benichou franck.benichou@sciencespo.fr Benichou 2025-05-14 15:35:50 -0400
  • fab016226f Custom Serializer for Table Enrichment Nikhil Khandelwal 2025-05-15 00:36:55 +0530
  • 25856e1392 Added Custom Serializer for Table enrichment Nikhil Khandelwal 2025-05-15 00:26:57 +0530
  • 718633cdec chore: bump version to 2.32.0 [skip ci] github-actions[bot] 2025-05-14 14:28:21 +0000
  • 6116c78717 feat: Improve parallelization for remote services API calls (#1548) Vinay R Damodaran 2025-05-14 06:47:55 -0700
  • e30a703759 fix(ocr): orig field in TesseractOcrCliModel as str (#1553) jimkarag02 2025-05-14 16:05:52 +0300
  • e694a36eda docs: add advanced chunking & serialization example (#1589) Panos Vagenas 2025-05-14 13:35:07 +0100
  • 5653595a0d fix(settings): fix nested settings load via environment variables (#1551) Alex Sokolov 2025-05-14 14:42:10 +0300
  • 336d1c34c8 feat: support image/webp file type (#1415) Elwin 2025-05-14 15:47:28 +0800
  • d452521a2f chore: bump version to 2.31.2 [skip ci] github-actions[bot] 2025-05-13 10:09:19 +0000
  • 8e51785216 fix: AsciiDoc header identification (#1562) (#1563) Marco Fargetta 2025-05-13 11:17:26 +0200
  • e64fd66341 fix: restrict click version and update lock file (#1582) Michele Dolfi 2025-05-13 10:40:08 +0200
  • 540610e4dc Added Custom Serializer for Table enrichment Nikhil Khandelwal 2025-05-15 00:11:00 +0530
  • f2c019cad7 table enrichments - Description and Indexing Shivani Kabu 2025-05-13 13:49:10 +0530
  • 7c67d2b2fe fixed the MyPy Peter Staar 2025-05-14 17:51:43 +0200
  • aeb0716bbb chore: bump version to 2.32.0 [skip ci] v2.32.0 github-actions[bot] 2025-05-14 14:28:21 +0000
  • 3a04f2a367 feat: Improve parallelization for remote services API calls (#1548) Vinay R Damodaran 2025-05-14 06:47:55 -0700
  • 7bc9e6f963 Merge d7922ab31d into 9f8b479f17 Rafael Torres Coelho Soares (aka Tuelho) 2025-05-14 13:06:24 +0000
  • 9f8b479f17 fix(ocr): orig field in TesseractOcrCliModel as str (#1553) jimkarag02 2025-05-14 16:05:52 +0300
  • 9f28abf061 docs: add advanced chunking & serialization example (#1589) Panos Vagenas 2025-05-14 13:35:07 +0100
  • bba056b2da fix: ensure orig and text are both strings in TesseractOcrCliModel DimtrisKaragatslis 2025-05-08 15:12:22 +0300