mirror of https://github.com/DS4SD/docling.git synced 2025-07-25 19:44:34 +00:00

copilot-swe-agent[bot] 252e5e214f Fix timeout status preservation issue by extending _determine_status method

Co-authored-by: cau-git <60343111+cau-git@users.noreply.github.com>

2025-07-23 09:44:10 +00:00


Fix for ReadingOrderModel AssertionError with document_timeout

Problem Description

With pipeline_options.document_timeout set in docling v2.24.0 and later, the ReadingOrderModel raised an AssertionError at line 132 (previously line 140):

assert size is not None, "Page size is not initialized."

This error occurred in ReadingOrderModel._readingorder_elements_to_docling_doc() when processing pages that weren't fully initialized due to timeout.

A secondary issue was that the _determine_status method overwrote the ConversionStatus.PARTIAL_SUCCESS status that had been correctly set when the timeout occurred.
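The assertion failure described above can be reproduced in isolation. The following is a minimal sketch, not docling code: `Page` and `build_reading_order` are hypothetical stand-ins that model only the `size` precondition quoted in the assertion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a docling page; only the fields needed here."""
    page_no: int
    size: Optional[Tuple[float, float]] = None  # None means "never initialized"


def build_reading_order(pages: List[Page]) -> List[int]:
    """Hypothetical consumer that mimics the precondition: every page needs a size."""
    ordered = []
    for page in pages:
        assert page.size is not None, "Page size is not initialized."
        ordered.append(page.page_no)
    return ordered


# Page 2 was never reached because the (simulated) timeout fired first.
pages = [Page(1, (612.0, 792.0)), Page(2)]

try:
    build_reading_order(pages)
    outcome = "ok"
except AssertionError as exc:
    outcome = f"AssertionError: {exc}"

print(outcome)  # AssertionError: Page size is not initialized.
```

A single uninitialized page is enough to abort the whole conversion, even though page 1 was processed successfully.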

Root Cause

The issue had two parts:

1. Uninitialized Pages: When a document processing timeout occurs:
   - The pipeline stops processing pages mid-way through the document
   - Some pages remain uninitialized with page.size = None
   - These uninitialized pages are passed to the ReadingOrderModel
   - The ReadingOrderModel expects all pages to have size != None, causing the assertion to fail
2. Status Overwriting: The _determine_status method would:
   - Always start with ConversionStatus.SUCCESS
   - Only change to PARTIAL_SUCCESS based on backend validation issues
   - Ignore that timeout might have already set the status to PARTIAL_SUCCESS
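Both parts of the root cause can be illustrated with a small simulation. This is a sketch under stated assumptions: `Status`, `Page`, `Result`, `process_with_budget`, and `determine_status_before_fix` are hypothetical names, the "timeout" is modeled as a page budget rather than wall-clock time, and every backend is assumed to report itself as valid.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    """Hypothetical stand-in for ConversionStatus."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


@dataclass
class Page:
    page_no: int
    size: Optional[Tuple[float, float]] = None  # stays None if never processed
    backend_valid: bool = True                  # assumption: backends are healthy


@dataclass
class Result:
    pages: List[Page]
    status: Status = Status.SUCCESS


def process_with_budget(res: Result, budget: int) -> None:
    """Initialize pages until the simulated budget (counted in pages) runs out."""
    for spent, page in enumerate(res.pages):
        if spent >= budget:
            res.status = Status.PARTIAL_SUCCESS  # timeout path sets the status
            break
        page.size = (612.0, 792.0)


def determine_status_before_fix(res: Result) -> Status:
    """Pre-fix behaviour as described above: always starts from SUCCESS."""
    status = Status.SUCCESS
    for page in res.pages:
        if not page.backend_valid:
            status = Status.PARTIAL_SUCCESS
    return status


res = Result(pages=[Page(n) for n in range(1, 5)])
process_with_budget(res, budget=2)

uninitialized = [p.page_no for p in res.pages if p.size is None]
print(uninitialized)                     # [3, 4]
print(res.status)                        # Status.PARTIAL_SUCCESS
print(determine_status_before_fix(res))  # Status.SUCCESS
```

The last line shows the overwrite: the timeout path recorded PARTIAL_SUCCESS, but a status check that starts from SUCCESS and only looks at backends reports SUCCESS.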

Solution

The fix was implemented in two parts in docling/pipeline/base_pipeline.py:

Part 1: Page Filtering (lines 196-206)

# Filter out uninitialized pages (those with size=None) that may remain
# after timeout or processing failures to prevent assertion errors downstream
initial_page_count = len(conv_res.pages)
conv_res.pages = [page for page in conv_res.pages if page.size is not None]

if len(conv_res.pages) < initial_page_count:
    _log.info(
        f"Filtered out {initial_page_count - len(conv_res.pages)} uninitialized pages "
        f"due to timeout or processing failures"
    )
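The filtering step above can be tried outside the pipeline. The sketch below is self-contained and uses a hypothetical `Page` dataclass and a hypothetical helper `drop_uninitialized`; it is not the code from base_pipeline.py, only the same idea applied to stand-in objects.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

_log = logging.getLogger(__name__)


@dataclass
class Page:
    """Hypothetical stand-in for a docling page."""
    page_no: int
    size: Optional[Tuple[float, float]] = None


def drop_uninitialized(pages: List[Page]) -> List[Page]:
    """Keep only pages whose size was set; log how many were dropped."""
    kept = [p for p in pages if p.size is not None]
    dropped = len(pages) - len(kept)
    if dropped:
        # Lazy %-formatting defers string building until the record is emitted.
        _log.info("Filtered out %d uninitialized pages", dropped)
    return kept


pages = [Page(1, (612.0, 792.0)), Page(2, (612.0, 792.0)), Page(3), Page(4)]
kept = drop_uninitialized(pages)
print([p.page_no for p in kept])  # [1, 2]
```

Returning a new list instead of mutating in place keeps the helper easy to test; the pipeline code above achieves the same effect by reassigning conv_res.pages.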

Part 2: Status Preservation (lines 220-221)

def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
    # Preserve PARTIAL_SUCCESS status if already set (e.g., due to timeout)
    status = ConversionStatus.SUCCESS if conv_res.status != ConversionStatus.PARTIAL_SUCCESS else ConversionStatus.PARTIAL_SUCCESS
    
    for page in conv_res.pages:
        if page._backend is None or not page._backend.is_valid():
            conv_res.errors.append(
                ErrorItem(
                    component_type=DoclingComponentType.DOCUMENT_BACKEND,
                    module_name=type(page._backend).__name__,
                    error_message=f"Page {page.page_no} failed to parse.",
                )
            )
            status = ConversionStatus.PARTIAL_SUCCESS

    return status

This fix:

- Filters out uninitialized pages before they reach the ReadingOrderModel
- Prevents the AssertionError by ensuring all pages have size != None
- Preserves timeout-induced PARTIAL_SUCCESS status through the status determination process
- Maintains partial conversion results by keeping successfully processed pages
- Logs the filtering action for transparency
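The status-preservation rule can be checked on its own with a small truth table. This is a simplified sketch: `Status` and `determine_status` are hypothetical stand-ins, backend validity is passed in as plain booleans, and the error bookkeeping of the real method is left out.

```python
from enum import Enum
from typing import Iterable


class Status(Enum):
    """Hypothetical stand-in for ConversionStatus."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


def determine_status(current: Status, backend_valid: Iterable[bool]) -> Status:
    """Start from the current status if it is already PARTIAL_SUCCESS."""
    status = (
        Status.PARTIAL_SUCCESS
        if current is Status.PARTIAL_SUCCESS
        else Status.SUCCESS
    )
    # Any invalid backend still downgrades the result, as before the fix.
    if not all(backend_valid):
        status = Status.PARTIAL_SUCCESS
    return status


for current in Status:
    for flags in ([True, True], [True, False]):
        print(current.name, flags, "->", determine_status(current, flags).name)
```

Only the combination of a SUCCESS starting status and all-valid backends yields SUCCESS; a PARTIAL_SUCCESS set earlier can no longer be upgraded.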

Verification

The fix has been verified with tests that:

✅ Confirm timeout scenarios don't cause AssertionError
✅ Validate that filtered pages are compatible with ReadingOrderModel
✅ Ensure timeout-induced PARTIAL_SUCCESS status is preserved
✅ Ensure normal processing (without timeout) still works correctly

Status

✅ FIXED - The issue has been resolved.

When a timeout occurs, the conversion now completes with ConversionStatus.PARTIAL_SUCCESS instead of crashing with an AssertionError, and that status is preserved through the rest of the pipeline execution.
