In msword_backend.py, the _add_header method has logic that seems to artificially limit the heading depth:
def _add_header(self, doc: DoclingDocument, curr_level: Optional[int], text: str, is_numbered_style: bool = False) -> None:
    # ...
    if isinstance(curr_level, int):
        # ...
    else:
        current_level = self.level
        parent_level = self.level - 1
        add_level = 1  # <-- This is the problem!
When curr_level is None (which happens when the heading style doesn't have a clear level number), it defaults to add_level = 1, effectively flattening deeper headings.
The correct handling would instead be to also subtract 1, with a minimum of 1:
    else:
        current_level = self.level
        parent_level = self.level - 1
        add_level = max(1, self.level - 1)  # Also subtract 1, with a minimum of 1
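To make the effect of the fallback concrete, here is a minimal standalone sketch. The helper `fallback_add_level` is hypothetical and not part of docling; it only mirrors the arithmetic of the else-branch above, assuming `level` plays the role of `self.level`.

```python
def fallback_add_level(level: int, clamp: bool = True) -> int:
    """Hypothetical helper mirroring the else-branch of _add_header.

    clamp=False reproduces the old behaviour (always 1);
    clamp=True reproduces the proposed max(1, level - 1).
    """
    if clamp:
        return max(1, level - 1)
    return 1


# Compare old and proposed fallback for a few nesting depths.
for level in range(1, 6):
    old = fallback_add_level(level, clamp=False)
    new = fallback_add_level(level)
    print(f"self.level={level}: old add_level={old}, new add_level={new}")
```

With the old fallback every depth collapses to 1, while the clamped version keeps deeper headings distinct and never drops below 1.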
Signed-off-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
After reviewing the code you provided for the AsciiDoc, HTML, and MS Word backends, I have found a key inconsistency in how `msword_backend.py` calculates heading levels compared to the other two. This inconsistency is the likely cause of the limited heading depth seen when converting Word documents.
### Analysis of the Inconsistency
1. **`asciidoc_backend.py`**: In the `_parse_section_header` method, the heading level is calculated as `header_level - 1`, where `header_level` is the number of `=` characters. For example, `===` (3 characters) correctly becomes `level=2`.
2. **`html_backend.py`**: In the `handle_header` method, the level for tags like `<h2>`, `<h3>`, etc., is calculated as `hlevel - 1`. For example, an `<h4>` tag results in `level=3`. (Note: `<h1>` is correctly treated as a document title).
3. **`msword_backend.py`**: In the `_add_header` method, the level is determined by the number in the style name (e.g., "Heading 4" provides `curr_level = 4`). However, the final level passed to the document model is set by `add_level = curr_level`. This means a "Heading 4" style results in `level=4`.
This is the inconsistency: for a semantically equivalent heading (like `<h4>`, `====`, or "Heading 4"), the MS Word backend produces a level that is one greater than the other backends. This can easily lead to downstream processing or rendering issues that make it seem like the depth is "cut off," especially if that system doesn't expect a heading with `level=4` or higher from this parser.
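The mismatch described above can be illustrated with a small standalone sketch. The three functions below are hypothetical stand-ins, not the actual docling methods; they only reproduce the level arithmetic as described in this analysis, and the regex used to read the number from a Word style name is an assumption.

```python
import re
from typing import Optional


def asciidoc_level(line: str) -> int:
    """Level as described for the AsciiDoc backend: count of leading '=' minus 1."""
    header_level = len(line) - len(line.lstrip("="))
    return header_level - 1


def html_level(tag: str) -> int:
    """Level as described for the HTML backend: N from '<hN>' minus 1."""
    hlevel = int(tag.strip("<>")[1:])
    return hlevel - 1


def word_level(style_name: str, corrected: bool = False) -> Optional[int]:
    """Level from a style name such as 'Heading 4' (parsing is an assumption).

    corrected=False mirrors add_level = curr_level;
    corrected=True mirrors add_level = curr_level - 1.
    """
    match = re.search(r"(\d+)\s*$", style_name)
    if match is None:
        return None  # no level number in the style name
    curr_level = int(match.group(1))
    return curr_level - 1 if corrected else curr_level


print(asciidoc_level("==== Title"), html_level("<h4>"), word_level("Heading 4"))
print(word_level("Heading 4", corrected=True))
```

For the equivalent inputs `====`, `<h4>`, and "Heading 4", the first two rules agree on level 3 while the uncorrected Word rule yields 4; subtracting 1 brings it in line.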
### The Fix
To resolve this and make the MS Word backend consistent with the others, you need to adjust the level calculation. The fix is a one-line change in `docling/backend/msword_backend.py`.
In the `_add_header` method, change the line that assigns `add_level`.
**File:** `docling/backend/msword_backend.py`
**Function:** `_add_header`
**Original Code (~line 1030):**
```python
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level
```
**Corrected Code:**
```python
current_level = curr_level
parent_level = curr_level - 1
add_level = curr_level - 1
```
By subtracting 1 from `curr_level`, you align the MS Word backend's behavior with the HTML and AsciiDoc backends. A "Heading 2" will now correctly be parsed as `level=1`, "Heading 3" as `level=2`, and so on, which should solve the depth problem you observed.
Validated by Gemini 2.5 Pro, o3, o3-pro, Claude 4.
Signed-off-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
* Image scale moved to base vlm options.
Added max_size image limit (options and vlm models).
* DCO Remediation Commit for Shkarupa Alex <shkarupa.alex@gmail.com>
I, Shkarupa Alex <shkarupa.alex@gmail.com>, hereby add my Signed-off-by to this commit: e93602a0d0
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
---------
Signed-off-by: Shkarupa Alex <shkarupa.alex@gmail.com>
When the page_range param is used for formula conversion,
the system throws a "list index out of range" error.
Included tests to validate that the fix works.
Signed-off-by: Masum <masumsofts@yahoo.com>
The AsciiDoc backend should not create an ImageRef with Size equal to None; it should use default size values instead.
Refactor static methods as such and add the staticmethod decorator.
Extend the regression test for this fix.
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make page.parsed_page the only source of truth for text cells
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Correctly compute PDF boxes from pymupdf
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Use different OCR engine order
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add type hints and fix mypy
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* One more test fix
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove with pypdfium2_lock from caller sites
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix typing
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: prov for merged-elems
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Reset pyproject.toml
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix tests
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
* feat: adding new vlm-models support
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* got microsoft/Phi-4-multimodal-instruct to work
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* working on vlm's
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactoring the VLM part
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* all working, now serious refactoring necessary
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactoring the download_model
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the formulate_prompt
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* pixtral 12b runs via MLX and native transformers
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the VlmPredictionToken
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* refactoring minimal_vlm_pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the MyPy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added pipeline_model_specializations file
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* need to get Phi4 working again ...
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* finalising last points for vlms support
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the pipeline for Phi4
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* streamlining all code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixing the tests
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the html backend to the VLM pipeline
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the static load_from_doctags
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* restore stable imports
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use AutoModelForVision2Seq for Pixtral and review example (including rename)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove unused value
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* refactor instances of VLM models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* skip compare example in CI
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use lowercase and uppercase only
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename pipeline_vlm_model_spec
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move more argument to options and simplify model init
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add supported_devices
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove not-needed function
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* exclude minimal_vlm
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* missing file
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add message for transformers version
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename to specs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use module import and remove MLX from non-darwin
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove hf_vlm_model and add extra_generation_args
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* use single HF VLM model class
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove torch type
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add docs for vision models
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: Add visualization of bbox on page with html export.
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the cli argument to show_layout
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fix(actor): remove references to missing docling_processor.py
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): update Actor README.md with recent repo URL changes
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): improve the Actor README.md local header link
Signed-off-by: Václav Vančura <commit@vancura.dev>
* chore(actor): bump the Actor version number
Signed-off-by: Václav Vančura <commit@vancura.dev>
* Update .actor/actor.json
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
---------
Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
* Provide the option to make remote services call concurrent
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* Use yield from correctly?
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
* not do amateur hour stuff
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
---------
Signed-off-by: Vinay Damodaran <vrdn@hey.com>