fix(html): preserve code blocks in list items (#2131)

* chore(html): refactor parser to leverage context managers

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* fix(html): parse inline code snippets, also from list items

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore(html): remove hidden tags

Remove tags that are not meant to be displayed.
Add regression tests for code blocks, inline code, and hidden tags.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-12-08 20:58:11 +00:00 · 2025-08-26 06:43:48 +02:00
parent c0268416cf
commit fa3327e1a6
5 changed files with 950 additions and 76 deletions
--- a/tests/data/groundtruth/docling_v2/html_code_snippets.html.md
+++ b/tests/data/groundtruth/docling_v2/html_code_snippets.html.md
@@ -0,0 +1,24 @@
+# Code snippets
+
+The Pythagorean theorem can be written as an equation relating the lengths of the sides *a* , *b* and the hypotenuse *c* .
+
+To use Docling, simply install `docling` from your package manager, e.g. pip: `pip install docling`
+
+To convert individual documents with python, use `convert()` , for example:
+
+```
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"
+converter = DocumentConverter()
+result = converter.convert(source)
+print(result.document.export_to_markdown())
+```
+
+The program will output: `## Docling Technical Report[...]`
+
+Prefetch the models:
+
+- Use the `docling-tools models download` utility:
+- Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()` .
+- Also, you can use download-hf-repo parameter to download arbitrary models from HuggingFace by specifying repo id: `$ docling-tools models download-hf-repo ds4sd/SmolDocling-256M-preview Downloading ds4sd/SmolDocling-256M-preview model from HuggingFace...`
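
The groundtruth above reflects two behaviors named in the commit message: inline `<code>` inside list items is kept as backtick spans, and content that is not meant to be displayed is dropped. The following is a minimal sketch of that idea using only Python's standard `html.parser`. It is not Docling's parser: the `SnippetExtractor` class and the `list_items` helper are hypothetical names introduced for illustration, and treating the HTML `hidden` attribute as the only marker of hidden content is an assumption.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth counting.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SnippetExtractor(HTMLParser):
    """Hypothetical helper: collects <li> text, wrapping <code> text in backticks."""

    def __init__(self):
        super().__init__()
        self.items = []          # finished list-item strings
        self._parts = None       # text pieces of the <li> being read, or None
        self._in_code = False    # True while inside a <code> element
        self._hidden_depth = 0   # > 0 while inside an element marked hidden

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Assumption: only the `hidden` attribute marks non-displayed content.
        if self._hidden_depth or "hidden" in dict(attrs):
            self._hidden_depth += 1
            return
        if tag == "li":
            self._parts = []
        elif tag == "code":
            self._in_code = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1
            return
        if tag == "li" and self._parts is not None:
            self.items.append(" ".join(self._parts))
            self._parts = None
        elif tag == "code":
            self._in_code = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if self._hidden_depth or self._parts is None or not text:
            return
        self._parts.append(f"`{text}`" if self._in_code else text)


def list_items(html):
    """Return the text of each <li>, with inline code preserved as backtick spans."""
    parser = SnippetExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Because the pieces of a list item are joined with single spaces, punctuation that directly follows a code span ends up separated from it by a space in this sketch. Nested lists and unclosed tags inside hidden elements are not handled.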