Initial commit

2025-12-08 20:58:11 +00:00 · 2024-07-15 09:42:42 +02:00
commit e2d996753b
38 changed files with 8767 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,99 @@
+<p align="center">
+  <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="logo.png" width="150" /> </a>
+</p>
+
+# Docling
+
+Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+
+## Features
+* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+* 📑 Understands detailed page layout, reading order and recovers table structures
+* 📝 Extracts metadata from the document, such as title, authors, references and language
+* 🔍 Optionally applies OCR (use with scanned PDFs)
+
+## Setup
+
+You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
+
+Once you have `poetry` installed, create an environment and install the package:
+
+```bash
+poetry env use $(which python3.11)
+poetry shell
+poetry install
+```
+
+**Notes**:
+* Works on macOS and Linux environments. Windows platforms are currently not tested.
+
+
+## Usage
+
+For basic usage, see the [convert.py](examples/convert.py) example module. Run with:
+
+```bash
+python examples/convert.py
+```
+The output of the above command will be written to `./scratch`.
+
+### Enable or disable pipeline features
+
+You can control whether table structure recognition or OCR is performed through the arguments passed to `DocumentConverter`:
+```python
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=PipelineOptions(do_table_structure=False,  # Controls whether table structure is recovered
+                                     do_ocr=True),  # Controls whether OCR is applied (ignores programmatic content)
+)
+```
+
+### Impose limits on the document size
+
+You can limit the file size and the number of pages allowed per document. In the example below, `max_file_size=20971520` corresponds to 20 MiB, assuming the value is given in bytes.
+```python
+paths = [Path("./test/data/2206.01062.pdf")]
+
+input = DocumentConversionInput.from_paths(
+    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+)
+```
+
+### Convert from binary PDF streams 
+
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+```python
+buf = BytesIO(your_binary_stream)
+docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+input = DocumentConversionInput.from_streams(docs)
+converted_docs = doc_converter.convert(input)
+```
+### Limit resource usage
+
+You can limit the number of CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, 4 CPU threads are used.
+
+
+## Contributing
+
+Please read [Contributing to Docling](./CONTRIBUTING.md) for details.
+
+
+## References
+
+If you use `Docling` in your projects, please consider citing the following:
+
+```bib
+@software{Docling,
+author = {Deep Search Team},
+month = {7},
+title = {{Docling}},
+url = {https://github.com/DS4SD/docling},
+version = {main},
+year = {2024}
+}
+```
+
+## License
+
+The `Docling` codebase is released under the MIT license.
+For individual model usage, please refer to the model licenses found in the original packages.