mirror of
https://github.com/DS4SD/docling.git
synced 2025-07-25 19:44:34 +00:00
Fix: inconsistency in how heading levels are calculated in msword_backend.py compared to the AsciiDoc and HTML backends
After reviewing the code you provided for the AsciiDoc, HTML, and MS Word backends, I have found a key inconsistency in how heading levels are calculated in the `msword_backend.py` file compared to the other two. This inconsistency is the likely cause of the problem of limited available header levels when converting Word documents.

### Analysis of the Inconsistency

1. **`asciidoc_backend.py`**: In the `_parse_section_header` method, the heading level is calculated as `header_level - 1`, where `header_level` is the number of `=` characters. For example, `===` (3 characters) correctly becomes `level=2`.
2. **`html_backend.py`**: In the `handle_header` method, the level for tags like `<h2>`, `<h3>`, etc., is calculated as `hlevel - 1`. For example, an `<h4>` tag results in `level=3`. (Note: `<h1>` is correctly treated as a document title.)
3. **`msword_backend.py`**: In the `_add_header` method, the level is determined by the number in the style name (e.g., "Heading 4" provides `curr_level = 4`). However, the final level passed to the document model is set by `add_level = curr_level`. This means a "Heading 4" style results in `level=4`.

This is the inconsistency: for a semantically equivalent heading (like `<h4>`, `====`, or "Heading 4"), the MS Word backend produces a level that is one greater than the other backends. This can easily lead to downstream processing or rendering issues that make it seem like the depth is "cut off," especially if that system doesn't expect a heading with `level=4` or higher from this parser.

### The Fix

To resolve this and make the MS Word backend consistent with the others, you need to adjust the level calculation. The fix is a one-line change in `docling/backend/msword_backend.py`. In the `_add_header` method, change the line that assigns `add_level`.

**File:** `docling/backend/msword_backend.py`
**Function:** `_add_header`

**Original Code (~line 1030):**

```python
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level
```

**Corrected Code:**

```python
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level - 1
```

By subtracting 1 from `curr_level`, you align the MS Word backend's behavior with the HTML and AsciiDoc backends. A "Heading 2" will now correctly be parsed as `level=1`, "Heading 3" as `level=2`, and so on, which should solve the depth problem you observed.

~~~

Validated by Gemini 2.5 Pro, o3, o3-pro, Claude 4.

Signed-off-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
parent 215b540f6c
commit 42af299fa2
@@ -874,7 +874,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
             current_level = curr_level
             parent_level = curr_level - 1
-            add_level = curr_level
+            add_level = curr_level - 1
         else:
             current_level = self.level
             parent_level = self.level - 1
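The level arithmetic described in the commit message can be checked with a small standalone sketch. The three functions below are hypothetical stand-ins, not the actual docling methods: they only reproduce the calculations quoted in the message (`header_level - 1`, `hlevel - 1`, and `add_level` before and after this change), and the way each one extracts the number from its input is an assumption made for illustration.

```python
import re


def asciidoc_level(line: str) -> int:
    """AsciiDoc: count the leading '=' characters, then subtract 1."""
    header_level = len(line) - len(line.lstrip("="))
    return header_level - 1


def html_level(tag: str) -> int:
    """HTML: take N from a tag such as '<h4>', then subtract 1."""
    hlevel = int(tag.strip("<>")[1:])
    return hlevel - 1


def msword_level(style_name: str, patched: bool = True) -> int:
    """MS Word: take the trailing number of a style name such as 'Heading 4'.

    patched=False mimics the old `add_level = curr_level`;
    patched=True mimics the new `add_level = curr_level - 1`.
    """
    match = re.search(r"(\d+)$", style_name)
    if match is None:
        raise ValueError(f"no heading number in style name: {style_name!r}")
    curr_level = int(match.group(1))
    return curr_level - 1 if patched else curr_level


if __name__ == "__main__":
    # Semantically equivalent headings in the three formats.
    print("asciidoc:", asciidoc_level("==== Title"))
    print("html:", html_level("<h4>"))
    print("msword (old):", msword_level("Heading 4", patched=False))
    print("msword (new):", msword_level("Heading 4", patched=True))
```

Under these assumptions, the old MS Word calculation is the only one that disagrees with the other two for an equivalent heading, and the patched calculation brings it back in line.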