Fix: inconsistency in how heading levels are calculated in the msword_backend.py file compared to the AsciiDoc and HTML backends

After reviewing the code you provided for the AsciiDoc, HTML, and MS Word backends, I found a key inconsistency in how `msword_backend.py` calculates heading levels compared with the other two. This inconsistency is the likely cause of the limited number of heading levels available when converting Word documents.

### Analysis of the Inconsistency

1.  **`asciidoc_backend.py`**: In the `_parse_section_header` method, the heading level is calculated as `header_level - 1`, where `header_level` is the number of `=` characters. For example, `===` (3 characters) correctly becomes `level=2`.

2.  **`html_backend.py`**: In the `handle_header` method, the level for tags like `<h2>`, `<h3>`, etc., is calculated as `hlevel - 1`. For example, an `<h4>` tag results in `level=3`. (Note: `<h1>` is correctly treated as a document title).

3.  **`msword_backend.py`**: In the `_add_header` method, the level is determined by the number in the style name (e.g., "Heading 4" provides `curr_level = 4`). However, the final level passed to the document model is set by `add_level = curr_level`. This means a "Heading 4" style results in `level=4`.

This is the inconsistency: for a semantically equivalent heading (like `<h4>`, `====`, or "Heading 4"), the MS Word backend produces a level that is one greater than the other backends do. This can lead to downstream processing or rendering issues that make it seem like the depth is "cut off," especially if the downstream system doesn't expect a heading with `level=4` or higher from this parser.
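The three calculations can be compared side by side in a minimal sketch. The function names below (`asciidoc_level`, `html_level`, `msword_level`) are hypothetical stand-ins for illustration only, not the backends' actual methods; they only mirror the arithmetic described in the list above.

```python
import re


def asciidoc_level(line: str) -> int:
    """AsciiDoc: count the leading '=' characters, then subtract one."""
    header_level = len(line) - len(line.lstrip("="))
    return header_level - 1


def html_level(tag: str) -> int:
    """HTML: take N from an 'hN' tag name, then subtract one."""
    hlevel = int(tag.lower().lstrip("h"))
    return hlevel - 1


def msword_level(style_name: str) -> int:
    """MS Word (before the fix): use the number in the style name unchanged."""
    match = re.search(r"(\d+)\s*$", style_name)
    if match is None:
        raise ValueError(f"no heading number in style name: {style_name!r}")
    curr_level = int(match.group(1))
    return curr_level


# The same depth-4 heading expressed in each source format.
levels = {
    "asciidoc": asciidoc_level("==== Section"),
    "html": html_level("h4"),
    "msword": msword_level("Heading 4"),
}
print(levels)  # {'asciidoc': 3, 'html': 3, 'msword': 4}
```

The MS Word result is the only one that is not reduced by one, which is the off-by-one gap described above.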

### The Fix

To resolve this and make the MS Word backend consistent with the others, you need to adjust the level calculation. The fix is a one-line change in `docling/backend/msword_backend.py`.

In the `_add_header` method, change the line that assigns `add_level`.

**File:** `docling/backend/msword_backend.py`
**Function:** `_add_header`

**Original Code (near line 874, per the diff hunk header):**
```python
        current_level = curr_level
        parent_level = curr_level - 1
        add_level = curr_level
```

**Corrected Code:**
```python
        current_level = curr_level
        parent_level = curr_level - 1
        add_level = curr_level - 1
```

By subtracting 1 from `curr_level`, you align the MS Word backend's behavior with the HTML and AsciiDoc backends. A "Heading 2" will now correctly be parsed as `level=1`, "Heading 3" as `level=2`, and so on, which should solve the depth problem you observed.
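As a rough illustration of the effect, the sketch below tabulates the level before and after the change for a few heading styles. The helpers `add_level_before` and `add_level_after` are hypothetical names that only reproduce the two assignments shown above; they are not part of the backend.

```python
def add_level_before(curr_level: int) -> int:
    """Original assignment: add_level = curr_level."""
    return curr_level


def add_level_after(curr_level: int) -> int:
    """Corrected assignment: add_level = curr_level - 1."""
    return curr_level - 1


# curr_level is the number taken from the style name, e.g. 3 for "Heading 3".
for n in range(2, 6):
    print(f"Heading {n}: before={add_level_before(n)}, after={add_level_after(n)}")
```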

~~~

Validated by Gemini 2.5 Pro, o3, o3-pro, Claude 4. 

Signed-off-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
Artus Krohn-Grimberghe 2025-06-18 15:37:14 +02:00 committed by GitHub
parent 215b540f6c
commit 42af299fa2

@@ -874,7 +874,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
         current_level = curr_level
         parent_level = curr_level - 1
-        add_level = curr_level
+        add_level = curr_level - 1
     else:
         current_level = self.level
         parent_level = self.level - 1