From 1df0560ec2cafcd95f2240e6188385e1ec117110 Mon Sep 17 00:00:00 2001
From: Myles <tripflex@users.noreply.github.com>
Date: Tue, 9 Dec 2025 03:33:09 -0500
Subject: [PATCH] fix: Clear word/char cells when force_full_page_ocr is used
 (#2738)

* fix: Clear word/char cells when force_full_page_ocr is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells were not cleared, causing downstream components like
TableStructureModel to use unreliable PDF-extracted text containing
GLYPH artifacts (e.g., GLYPH<c=1,font=/AAAAAH+font000000002ed64673>).

This fix clears word_cells and char_cells when force_full_page_ocr
is enabled, ensuring TableStructureModel falls back to the OCR-
extracted textline cells via its existing fallback logic.

Fixes an issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.

* fix: Filter out PDF-extracted word/char cells when force_full_page_ocr is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells from the PDF backend were not handled, causing downstream
components like TableStructureModel to use unreliable PDF-extracted
text containing GLYPH artifacts.

Instead of clearing all word/char cells (which would be destructive
for backends like mets_gbs that provide OCR-generated word cells),
this fix filters out only cells where from_ocr=False, preserving any
OCR-generated cells.

This ensures TableStructureModel only sees OCR-derived text: it uses
any remaining OCR-generated word cells, or falls back to the OCR-
extracted textline cells via its existing fallback logic when
word_cells is empty.

Fixes an issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.

* DCO Remediation Commit for Myles McNamara <myles@smyl.es>

I, Myles McNamara <myles@smyl.es>, hereby add my Signed-off-by to this commit: 4197a4e273250637e474804517c0cd76bf5ea56e
I, Myles McNamara <myles@smyl.es>, hereby add my Signed-off-by to this commit: a4f4e3fc5cf9822d192dc2cc6248010593f7e761

Signed-off-by: Myles McNamara <myles@smyl.es>

---------

Signed-off-by: Myles McNamara <myles@smyl.es>
---
 docling/models/base_ocr_model.py | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
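
Reviewer note: the filter in this patch can be tried in isolation with the
minimal sketch below. Cell, ParsedPage, drop_pdf_cells and cells_for_table
are hypothetical stand-ins written for this note, not docling classes or
functions; only the from_ocr flag and the word_cells / char_cells /
textline_cells / has_words / has_chars names come from the patch. The
consumer fallback shown is an assumption about how a component such as the
table structure model picks its cells.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; only the fields used here."""
    text: str
    from_ocr: bool


@dataclass
class ParsedPage:
    """Hypothetical stand-in for page.parsed_page."""
    textline_cells: List[Cell] = field(default_factory=list)
    word_cells: List[Cell] = field(default_factory=list)
    char_cells: List[Cell] = field(default_factory=list)
    has_words: bool = False
    has_chars: bool = False


def drop_pdf_cells(parsed: ParsedPage) -> None:
    """Keep only OCR-generated word/char cells, mirroring the patch's filter."""
    parsed.word_cells = [c for c in parsed.word_cells if c.from_ocr]
    parsed.char_cells = [c for c in parsed.char_cells if c.from_ocr]
    parsed.has_words = len(parsed.word_cells) > 0
    parsed.has_chars = len(parsed.char_cells) > 0


def cells_for_table(parsed: ParsedPage) -> List[Cell]:
    """Assumed consumer behaviour: prefer word cells, else use text lines."""
    return parsed.word_cells if parsed.word_cells else parsed.textline_cells


# Page whose PDF text layer yielded GLYPH debris, after full-page OCR
# has already replaced the textline cells.
page = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell("GLYPH<c=1,font=/F1>", from_ocr=False)],
    char_cells=[Cell("G", from_ocr=False)],
    has_words=True,
    has_chars=True,
)
drop_pdf_cells(page)

# Page from a backend that supplies its own OCR word cells: these survive.
mixed = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell("Total", from_ocr=True), Cell("GLYPH<c=2>", from_ocr=False)],
    has_words=True,
)
drop_pdf_cells(mixed)
```

In the first page every PDF-derived cell is dropped, so the assumed consumer
falls back to the OCR text lines. In the second, the OCR-generated word cell
is kept, which is why the patch filters on from_ocr instead of clearing the
lists outright.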

diff --git a/docling/models/base_ocr_model.py b/docling/models/base_ocr_model.py
index 67ada340..31f44ae0 100644
--- a/docling/models/base_ocr_model.py
+++ b/docling/models/base_ocr_model.py
@@ -154,6 +154,20 @@ class BaseOcrModel(BasePageModel, BaseModelWithOptions):
         page.parsed_page.textline_cells = final_cells
         page.parsed_page.has_lines = len(final_cells) > 0
 
+        # When force_full_page_ocr is used, PDF-extracted word/char cells are
+        # unreliable. Filter out cells where from_ocr=False, keeping any OCR-
+        # generated cells. This ensures downstream components (e.g., table
+        # structure model) fall back to OCR-extracted textline cells.
+        if self.options.force_full_page_ocr:
+            page.parsed_page.word_cells = [
+                c for c in page.parsed_page.word_cells if c.from_ocr
+            ]
+            page.parsed_page.char_cells = [
+                c for c in page.parsed_page.char_cells if c.from_ocr
+            ]
+            page.parsed_page.has_words = len(page.parsed_page.word_cells) > 0
+            page.parsed_page.has_chars = len(page.parsed_page.char_cells) > 0
+
     def _combine_cells(
         self, existing_cells: List[TextCell], ocr_cells: List[TextCell]
     ) -> List[TextCell]: