Mirror of https://github.com/DS4SD/docling.git, synced 2025-07-31 14:34:40 +00:00

Add Streamlit Docling concierge readme file and script

This commit is contained in:
parent b1e79cadc7
commit 6556557f50

107 streamlit-docling-concierge-readme-file.md (new file)

@@ -0,0 +1,107 @@
# Docling Concierge - Streamlit Web Interface

A user-friendly web application for Docling document processing, built with Streamlit. It lets both technical and non-technical users process documents through an intuitive interface, using natural language instructions or guided selections.

![Docling Concierge Screenshot](https://github.com/user-attachments/assets/placeholder-image)
## Features

- **Simple User Interface**: Designed specifically for non-technical users
- **Visual Controls**: Checkboxes, toggles, and dropdowns for all options
- **Flexible Input**: Upload files or provide URLs
- **Rich Output Formats**: Markdown, HTML, JSON, Text, Doctags
- **Interactive Previews**: See processed documents before downloading
- **Batch Processing**: Handle multiple documents at once
- **Smart Suggestions**: Get guidance on best options for your documents
- **Visual Progress Tracking**: Monitor document processing with progress bars
## Compatibility

This version of Docling Concierge is optimized for Docling 2.25.1 and similar versions, with the following adjustments:

- No dependencies on deprecated `document_type` attributes (see the sketch below)
- No references to Docling backend classes
- Proper language code mappings for different OCR engines
- Enhanced error handling and progress visualization
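For example, where earlier versions read a `document_type` attribute, the app now reads the detected format from the conversion result. A minimal sketch of the idea (the file name is only a placeholder):

```python
from docling.document_converter import DocumentConverter

# Convert a document and report its detected input format;
# `res.input.format` is read where `document_type` was used before.
res = DocumentConverter().convert("sample.pdf")  # placeholder path
print(res.status.value)   # "success" on a clean conversion
print(res.input.format)   # e.g. InputFormat.PDF
```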
## Installation

1. Make sure you have Python 3.8+ installed on your system
2. Install Docling (if not already installed):

```bash
pip install -U docling==2.25.1
```

3. Install Streamlit and other dependencies:

```bash
pip install streamlit pandas
```

4. Download the `streamlit-docling-concierge.py` file from this repository
5. Run the Streamlit app:

```bash
streamlit run streamlit-docling-concierge.py
```
## Usage

1. **Upload Documents**:
   - Use the "Upload Files" tab to select documents from your computer
   - Or use the "Enter URLs" tab to process documents from the web

2. **Describe Your Task**:
   - Type what you want to do in natural language
   - Example: "Convert this PDF to markdown with tables"
   - The app automatically detects your intent and pre-selects appropriate options (a simplified sketch of this detection follows these steps)

3. **Adjust Processing Options** (Optional):
   - **Basic Options**: Output format, image handling
   - **OCR Settings**: Enable text recognition, select language
   - **Advanced Features**: Table extraction, code detection, formula processing

4. **Process Documents**:
   - Click the "Process Documents" button
   - Monitor progress through the status indicator and progress bar

5. **Work with Results**:
   - Preview processed documents directly in the app
   - Download individual results or all files as a ZIP
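Under the hood, the output-format detection is a simple pattern match on phrases like "to markdown". A simplified sketch of the idea (not the app's exact code):

```python
import re
from typing import Optional

# Known output-format keywords recognized in a request.
FORMAT_ALIASES = {"markdown": "md", "md": "md", "html": "html",
                  "json": "json", "text": "txt", "txt": "txt", "doctags": "doctags"}

def guess_output_format(query: str) -> Optional[str]:
    # Prefer an explicit "to <format>" phrase, e.g. "to markdown".
    match = re.search(r"to\s+(\w+)", query, re.IGNORECASE)
    if match and match.group(1).lower() in FORMAT_ALIASES:
        return FORMAT_ALIASES[match.group(1).lower()]
    # Otherwise fall back to any direct mention of a known format name.
    for name, ext in FORMAT_ALIASES.items():
        if name in query.lower():
            return ext
    return None

print(guess_output_format("Convert this PDF to markdown with tables"))  # md
```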
## Language Support for OCR

The app includes smart language mapping that automatically translates user-friendly language names to the correct codes for each OCR engine:

- **EasyOCR**: Uses codes like "en", "es", "fr", etc.
- **Tesseract**: Uses codes like "eng", "spa", "fra", etc.
- **RapidOCR**: Uses appropriate codes for this engine

Users can simply check the languages they need without knowing the specific codes required by each engine.
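As an illustration, the mapping is a plain per-engine dictionary lookup. A trimmed sketch of what the app does internally (only a few languages shown here):

```python
# Trimmed per-engine language maps (the app covers a dozen languages).
EASYOCR_CODES = {"English": "en", "Spanish": "es", "French": "fr"}
TESSERACT_CODES = {"English": "eng", "Spanish": "spa", "French": "fra"}

def to_engine_codes(names, engine="easyocr"):
    table = EASYOCR_CODES if engine == "easyocr" else TESSERACT_CODES
    # Unknown names fall back to English, mirroring the app's default.
    return [table.get(name, table["English"]) for name in names]

print(to_engine_codes(["English", "Spanish"], engine="tesseract"))  # ['eng', 'spa']
```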
## Common Processing Tasks

- **Document Conversion**: PDF to Markdown, DOCX to HTML, etc.
- **Table Extraction**: Pull tables from documents into structured formats (with fast or accurate options)
- **OCR Processing**: Extract text from scanned documents and images
- **Content Analysis**: Detect and enhance code blocks, formulas, images
- **Image Handling**: Choose between embedded, referenced, or placeholder modes
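For reference, the fast/accurate table option maps to Docling's `TableFormerMode`. A minimal sketch of the kind of configuration the app builds (the file name is a placeholder):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Ask for the slower but more reliable table-structure model.
opts = PdfPipelineOptions(do_table_structure=True)
opts.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("report.pdf")  # placeholder file name
print(result.document.export_to_markdown())
```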
## Tips for Non-Technical Users

- Use the examples in the sidebar for quick starts
- Start with the default options for most documents
- Enable OCR for scanned documents or images
- Use "Markdown" format for the most readable output
- Choose "HTML" for web-ready content with formatting
- Use the "Reset All Settings" button in the sidebar to start fresh
## Requirements

- Python 3.8+
- Docling 2.25.1 (or compatible version)
- Streamlit
- Pandas
## License

This project is licensed under the same terms as Docling.

## Acknowledgements

Based on the original Docling Concierge command-line tool.

833 streamlit-docling-concierge.py (new file)

@@ -0,0 +1,833 @@
#!/usr/bin/env python3
"""
Docling Concierge - Streamlit Web Interface (Docling 2.25.1 or similar)
Updated to remove references to `document_type`, which no longer exists.

Features:
- No docling.backends usage
- Spinner + progress bar for processing
- OCR language code mapping for Tesseract/EasyOCR
- Displays `res.input.format` instead of `document_type` in results
"""

import os
import re
import tempfile
from pathlib import Path
from typing import Optional, List, Dict

import streamlit as st
import pandas as pd

# --------------------------------
# Attempt to import Docling modules
# --------------------------------
missing_modules = []
try:
    import docling
except ImportError as e:
    missing_modules.append(f"docling: {str(e)}")

try:
    from docling.document_converter import DocumentConverter, PdfFormatOption
except ImportError as e:
    missing_modules.append(f"docling.document_converter: {str(e)}")

try:
    from docling.datamodel.base_models import InputFormat, OutputFormat
except ImportError as e:
    missing_modules.append(f"docling.datamodel.base_models: {str(e)}")

try:
    from docling.datamodel.document import ConversionResult
except ImportError as e:
    missing_modules.append(f"docling.datamodel.document: {str(e)}")

try:
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions, OcrEngine, TableFormerMode,
        EasyOcrOptions, TesseractCliOcrOptions, TesseractOcrOptions,
        OcrMacOptions, RapidOcrOptions, OcrOptions,
    )
except ImportError as e:
    missing_modules.append(f"docling.datamodel.pipeline_options: {str(e)}")

try:
    from docling_core.types.doc import ImageRefMode
except ImportError as e:
    missing_modules.append(f"docling_core.types.doc: {str(e)}")

class DoclingConciergeStreamlit:
    """
    Streamlit interface to Docling functionality, tailored to Docling 2.25.1.
    No references to `document_type` (use input.format instead).
    """

    def __init__(self):
        self.converter = DocumentConverter()
        self.temp_dir = Path(tempfile.mkdtemp(prefix="docling_"))

        # Default pipeline options
        self.pipeline_options = PdfPipelineOptions(
            do_ocr=True,
            do_table_structure=True,
            do_code_enrichment=False,
            do_formula_enrichment=False,
            do_picture_classification=False,
            do_picture_description=False,
            enable_remote_services=False,
        )

        # Simple mappings for input detection
        self.format_aliases = {
            "pdf": InputFormat.PDF,
            "image": InputFormat.IMAGE,
            "docx": InputFormat.DOCX,
            "word": InputFormat.DOCX,
            "excel": InputFormat.XLSX,
            "xlsx": InputFormat.XLSX,
            "pptx": InputFormat.PPTX,
            "powerpoint": InputFormat.PPTX,
            "html": InputFormat.HTML,
            "markdown": InputFormat.MD,
            "md": InputFormat.MD,
            "asciidoc": InputFormat.ASCIIDOC,
            "csv": InputFormat.CSV,
            "xml": InputFormat.XML_JATS,
            "json": InputFormat.JSON_DOCLING,
        }

        # Mapping output format strings to docling's OutputFormat
        self.output_format_aliases = {
            "markdown": OutputFormat.MARKDOWN,
            "md": OutputFormat.MARKDOWN,
            "html": OutputFormat.HTML,
            "json": OutputFormat.JSON,
            "text": OutputFormat.TEXT,
            "txt": OutputFormat.TEXT,
            "doctags": OutputFormat.DOCTAGS,
        }

        # OCR engine choices for the UI
        self.ocr_engines = {
            "EasyOCR (recommended)": OcrEngine.EASYOCR,
            "Tesseract": OcrEngine.TESSERACT,
            "Tesseract CLI": OcrEngine.TESSERACT_CLI,
            "macOS OCR (Apple devices only)": OcrEngine.OCRMAC,
            "RapidOCR": OcrEngine.RAPIDOCR,
        }

        # ImageRefMode choices
        self.image_modes = {
            "Embedded (include images in file)": ImageRefMode.EMBEDDED,
            "Referenced (link to separate image files)": ImageRefMode.REFERENCED,
            "Placeholder (text only, no images)": ImageRefMode.PLACEHOLDER,
        }

        # Common language list for user-friendly display
        self.common_languages = [
            "English", "Spanish", "French", "German", "Chinese",
            "Japanese", "Korean", "Russian", "Arabic", "Hindi",
            "Portuguese", "Italian"
        ]

        # -----------
        # Define OCR language code mappings
        # -----------
        self.lang_code_map_tesseract = {
            "English": "eng",
            "Spanish": "spa",
            "French": "fra",
            "German": "deu",
            "Chinese": "chi_sim",  # Tesseract can use chi_tra for traditional
            "Japanese": "jpn",
            "Korean": "kor",
            "Russian": "rus",
            "Arabic": "ara",
            "Hindi": "hin",
            "Portuguese": "por",
            "Italian": "ita",
        }

        # EasyOCR language codes
        self.lang_code_map_easyocr = {
            "English": "en",
            "Spanish": "es",
            "French": "fr",
            "German": "de",
            "Chinese": "ch_tra",  # or ch_sim
            "Japanese": "ja",
            "Korean": "ko",
            "Russian": "ru",
            "Arabic": "ar",
            "Hindi": "hi",
            "Portuguese": "pt",
            "Italian": "it",
        }

        # RapidOCR might follow a similar convention to EasyOCR
        self.lang_code_map_rapidocr = {
            "English": "en",
            "Spanish": "es",
            "French": "fr",
            "German": "de",
            "Chinese": "ch",
            "Japanese": "ja",
            "Korean": "ko",
            "Russian": "ru",
            "Arabic": "ar",
            "Hindi": "hi",
            "Portuguese": "pt",
            "Italian": "it",
        }
    def configure_streamlit_page(self):
        """Set up the Streamlit page layout and style."""
        st.set_page_config(
            page_title="Docling Concierge",
            page_icon="📄",
            layout="wide",
            initial_sidebar_state="expanded",
        )

        # Custom CSS for styling
        st.markdown("""
        <style>
        .main-header {
            font-size: 2.5rem;
            color: #1E88E5;
        }
        .sub-header {
            font-size: 1.5rem;
            color: #424242;
        }
        .card {
            border-radius: 5px;
            padding: 20px;
            margin-bottom: 20px;
            background-color: #f8f9fa;
            border-left: 4px solid #1E88E5;
        }
        .feature-title {
            font-weight: bold;
            color: #1E88E5;
        }
        .hint {
            font-size: 0.9rem;
            color: #757575;
            font-style: italic;
        }
        .success {
            padding: 10px;
            border-radius: 5px;
            background-color: #E8F5E9;
            border-left: 4px solid #4CAF50;
        }
        .error {
            padding: 10px;
            border-radius: 5px;
            background-color: #FFEBEE;
            border-left: 4px solid #F44336;
        }
        .stButton>button {
            width: 100%;
        }
        </style>
        """, unsafe_allow_html=True)
    def show_welcome(self):
        """Welcome header and intro text."""
        st.markdown('<p class="main-header">📄 Docling Concierge</p>', unsafe_allow_html=True)
        st.markdown('<p class="sub-header">Transform and extract content from your documents with ease</p>',
                    unsafe_allow_html=True)

        with st.expander("ℹ️ About this app", expanded=False):
            st.markdown("""
            **Docling Concierge** helps you process documents in various formats. You can:

            - **Convert between formats**: PDF to Markdown, Word to HTML, and more
            - **Extract content**: Tables, images, code snippets, and formulas
            - **Process text**: Apply OCR to scanned documents
            - **Analyze content**: Enhance with image descriptions and classifications

            Simply upload your documents or provide URLs, select the options you need,
            and let Docling handle the rest.
            """)
    def show_sidebar(self):
        """Sidebar with quick examples and info."""
        with st.sidebar:
            st.markdown("### 📚 Quick Examples")
            st.markdown("Try these examples (click to fill):")

            example_queries = [
                "Convert PDF to Markdown with tables",
                "Extract tables from document and save as HTML",
                "Process scanned PDF with OCR",
                "Convert document to text only (no images)",
                "Extract code blocks from programming PDF"
            ]

            for query in example_queries:
                if st.button(query, key=f"example_{query}"):
                    st.session_state.query = query

            st.markdown("---")
            st.markdown("### 🔍 Document Types")
            st.markdown("""
            - 📄 **Documents**: PDF, DOCX, PPTX
            - 📊 **Data**: XLSX, CSV
            - 🌐 **Web**: HTML
            - 📝 **Text**: Markdown, Text
            - 🖼️ **Images**: JPG, PNG
            """)

            st.markdown("---")
            st.markdown("### 🎯 Output Formats")
            st.markdown("""
            - **Markdown**: Perfect for documentation
            - **HTML**: For web publishing
            - **JSON**: For data processing
            - **Text**: Plain text extraction
            - **Doctags**: Docling token format
            """)

            st.markdown("---")
            st.markdown("### 💡 Tips")
            st.markdown("""
            - Use OCR for scanned documents
            - Enable table extraction for spreadsheet-like content
            - Choose image mode based on your needs
            - Batch process multiple files at once
            """)

            st.markdown("---")
            st.markdown("Made with ❤️ using Docling")

            # Reset button if you want to clear all state
            if st.button("🔄 Reset All Settings"):
                for key in list(st.session_state.keys()):
                    if key not in ["authenticated", "username"]:
                        del st.session_state[key]
                st.rerun()
    def show_input_section(self):
        """Section to upload files or enter URLs."""
        st.markdown('<p class="feature-title">📥 Input Documents</p>', unsafe_allow_html=True)

        input_tab1, input_tab2 = st.tabs(["Upload Files", "Enter URLs"])

        with input_tab1:
            st.markdown("""<p class="hint">
            Upload one or more documents to process.
            Supports PDF, Word, Excel, PowerPoint, Images, and more.
            </p>""", unsafe_allow_html=True)

            uploaded_files = st.file_uploader(
                "Upload documents",
                accept_multiple_files=True,
                type=[
                    "pdf", "docx", "xlsx", "pptx", "jpg", "jpeg", "png",
                    "html", "md", "csv", "xml", "json", "txt"
                ],
                help="Select one or more files from your computer"
            )

            if uploaded_files:
                st.session_state.file_paths = []
                for uploaded_file in uploaded_files:
                    temp_file_path = self.temp_dir / uploaded_file.name
                    with open(temp_file_path, "wb") as f:
                        f.write(uploaded_file.getbuffer())
                    st.session_state.file_paths.append(temp_file_path)

                # Show info about uploaded files
                files_df = pd.DataFrame({
                    "File Name": [file.name for file in uploaded_files],
                    "Size": [f"{file.size / 1024:.1f} KB" for file in uploaded_files],
                    "Type": [file.type for file in uploaded_files]
                })
                st.dataframe(files_df, use_container_width=True)

        with input_tab2:
            st.markdown("""<p class="hint">
            Enter one or more URLs to process (one per line).
            </p>""", unsafe_allow_html=True)

            url_input = st.text_area(
                "Enter document URLs",
                placeholder="https://example.com/document.pdf\nhttps://another-site.org/paper.pdf",
            )

            if url_input:
                urls = [url.strip() for url in url_input.split("\n") if url.strip().startswith("http")]
                if urls:
                    st.session_state.urls = urls
                    # Show data frame of URLs
                    urls_df = pd.DataFrame({
                        "URL": urls,
                        "Type": ["Web URL" for _ in urls]
                    })
                    st.dataframe(urls_df, use_container_width=True)
                else:
                    st.warning("No valid URLs found. URLs must start with http:// or https://")
    def show_natural_language_query(self):
        """User can type a plain-English request for what to do."""
        st.markdown('<p class="feature-title">🔍 What do you want to do?</p>', unsafe_allow_html=True)

        query = st.text_input(
            "Describe what you want to do with your documents",
            value=st.session_state.get("query", ""),
            placeholder="Example: Convert this PDF to markdown with tables"
        )

        if query:
            st.session_state.query = query
            # Try to guess an output format from the user's text
            output_format = self._extract_output_format_from_query(query)
            if output_format:
                # Map the detected format back to the display names used by the
                # Output Format selectbox ("Markdown", "HTML", "JSON", "Text", "Doctags"),
                # so the preselected value always matches one of its options.
                display_names = {
                    OutputFormat.MARKDOWN: "Markdown",
                    OutputFormat.HTML: "HTML",
                    OutputFormat.JSON: "JSON",
                    OutputFormat.TEXT: "Text",
                    OutputFormat.DOCTAGS: "Doctags",
                }
                st.session_state.output_format = display_names.get(output_format, "Markdown")
    def show_visual_options(self):
        """Checkboxes and advanced features for the user."""
        st.markdown('<p class="feature-title">⚙️ Processing Options</p>', unsafe_allow_html=True)

        tab1, tab2, tab3 = st.tabs(["Basic Options", "OCR Settings", "Advanced Features"])

        # ----------
        # Basic Options
        # ----------
        with tab1:
            output_formats = ["Markdown", "HTML", "JSON", "Text", "Doctags"]
            # Fall back to Markdown if the stored value is not one of the options.
            current_format = st.session_state.get("output_format", "Markdown")
            selected_format = st.selectbox(
                "Output Format",
                options=output_formats,
                index=output_formats.index(current_format) if current_format in output_formats else 0,
                help="Select the format you want to convert your documents to"
            )
            st.session_state.output_format = selected_format

            image_mode_options = list(self.image_modes.keys())
            selected_image_mode = st.selectbox(
                "Image Handling",
                options=image_mode_options,
                index=0,
                help="Choose how images should be handled in the output"
            )
            st.session_state.image_mode = self.image_modes[selected_image_mode]

        # ----------
        # OCR Settings
        # ----------
        with tab2:
            ocr_enabled = st.checkbox(
                "Enable OCR (Text Recognition)",
                value=True,
                help="Apply OCR for scanned documents or images"
            )
            st.session_state.ocr_enabled = ocr_enabled

            if ocr_enabled:
                ocr_engine_options = list(self.ocr_engines.keys())
                selected_ocr_engine = st.selectbox(
                    "OCR Engine",
                    options=ocr_engine_options,
                    index=0,
                    help="Select the OCR engine to use"
                )
                st.session_state.ocr_engine = self.ocr_engines[selected_ocr_engine]

                force_ocr = st.checkbox(
                    "Force Full-Page OCR",
                    value=False,
                    help="OCR entire pages even if they appear to have machine-readable text"
                )
                st.session_state.force_ocr = force_ocr

                # Expandable area for choosing languages
                with st.expander("OCR Languages", expanded=False):
                    selected_languages = []

                    # Determine which language map we should use
                    if st.session_state.ocr_engine == OcrEngine.EASYOCR:
                        lang_map = self.lang_code_map_easyocr
                    elif st.session_state.ocr_engine == OcrEngine.RAPIDOCR:
                        lang_map = self.lang_code_map_rapidocr
                    else:
                        # Tesseract, Tesseract CLI, OCRMAC fallback
                        lang_map = self.lang_code_map_tesseract

                    col1, col2 = st.columns(2)
                    half = len(self.common_languages) // 2

                    with col1:
                        for lang in self.common_languages[:half]:
                            # If user checks the box, we map e.g. "English" to "eng" or "en"
                            if st.checkbox(lang, value=(lang == "English")):
                                code = lang_map.get(lang, "eng")  # fallback to "eng"
                                selected_languages.append(code)

                    with col2:
                        for lang in self.common_languages[half:]:
                            if st.checkbox(lang):
                                code = lang_map.get(lang, "eng")
                                selected_languages.append(code)

                    custom_lang = st.text_input(
                        "Add other language codes (comma-separated)",
                        placeholder="e.g., eng, spa, fra"
                    )
                    if custom_lang:
                        custom_langs = [x.strip().lower() for x in custom_lang.split(",")]
                        selected_languages.extend(custom_langs)

                    if not selected_languages:
                        # fallback if user didn't check anything
                        selected_languages = ["eng"]

                    st.session_state.ocr_langs = selected_languages

        # ----------
        # Advanced Features
        # ----------
        with tab3:
            st.markdown("#### Content Extraction")
            col1, col2 = st.columns(2)

            with col1:
                table_extraction = st.checkbox(
                    "Extract Tables",
                    value=True,
                    help="Recognize and extract tables from documents"
                )
                st.session_state.do_table_structure = table_extraction
                if table_extraction:
                    table_mode = st.radio(
                        "Table Extraction Mode",
                        options=["Fast", "Accurate"],
                        index=0,
                        horizontal=True,
                        help="Fast is quick but less precise, Accurate is slower but more reliable"
                    )
                    st.session_state.table_mode = (
                        TableFormerMode.FAST if table_mode == "Fast" else TableFormerMode.ACCURATE
                    )

                code_extraction = st.checkbox(
                    "Extract Code Blocks",
                    value=False,
                    help="Identify and extract programming code blocks"
                )
                st.session_state.do_code_enrichment = code_extraction

            with col2:
                formula_extraction = st.checkbox(
                    "Extract Mathematical Formulas",
                    value=False,
                    help="Detect mathematical equations and formulas"
                )
                st.session_state.do_formula_enrichment = formula_extraction

                image_classification = st.checkbox(
                    "Classify Images",
                    value=False,
                    help="Add classifications to images in the document"
                )
                st.session_state.do_picture_classification = image_classification

                image_description = st.checkbox(
                    "Describe Images",
                    value=False,
                    help="Generate descriptive text for images (may require remote services)"
                )
                st.session_state.do_picture_description = image_description

                if image_description:
                    st.session_state.enable_remote_services = True
    def process_documents(self):
        """Process documents based on user settings, with progress bar."""
        has_files = hasattr(st.session_state, 'file_paths') and st.session_state.file_paths
        has_urls = hasattr(st.session_state, 'urls') and st.session_state.urls

        if not (has_files or has_urls):
            st.warning("Please upload at least one document or provide a URL.")
            return

        # Gather sources
        sources = []
        if has_files:
            sources.extend(st.session_state.file_paths)
        if has_urls:
            sources.extend(st.session_state.urls)

        # Determine output format
        output_format_str = st.session_state.get("output_format", "Markdown").lower()
        output_format = self.output_format_aliases.get(output_format_str, OutputFormat.MARKDOWN)

        # OCR settings
        ocr_enabled = st.session_state.get("ocr_enabled", True)
        ocr_engine = st.session_state.get("ocr_engine", OcrEngine.EASYOCR)
        force_ocr = st.session_state.get("force_ocr", False)
        ocr_langs = st.session_state.get("ocr_langs", ["eng"])

        # Image handling
        image_mode = st.session_state.get("image_mode", ImageRefMode.EMBEDDED)

        # Configure pipeline options
        self.pipeline_options.do_ocr = ocr_enabled
        self.pipeline_options.do_table_structure = st.session_state.get("do_table_structure", True)
        self.pipeline_options.do_code_enrichment = st.session_state.get("do_code_enrichment", False)
        self.pipeline_options.do_formula_enrichment = st.session_state.get("do_formula_enrichment", False)
        self.pipeline_options.do_picture_classification = st.session_state.get("do_picture_classification", False)
        self.pipeline_options.do_picture_description = st.session_state.get("do_picture_description", False)
        self.pipeline_options.enable_remote_services = st.session_state.get("enable_remote_services", False)

        # If table extraction is on, set the mode
        if st.session_state.get("do_table_structure", True):
            mode = st.session_state.get("table_mode", TableFormerMode.FAST)
            self.pipeline_options.table_structure_options.mode = mode

        # Create the OCR options object
        if ocr_enabled:
            if ocr_engine == OcrEngine.EASYOCR:
                ocr_options = EasyOcrOptions(force_full_page_ocr=force_ocr)
            elif ocr_engine == OcrEngine.TESSERACT_CLI:
                ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
            elif ocr_engine == OcrEngine.TESSERACT:
                ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
            elif ocr_engine == OcrEngine.OCRMAC:
                ocr_options = OcrMacOptions(force_full_page_ocr=force_ocr)
            elif ocr_engine == OcrEngine.RAPIDOCR:
                ocr_options = RapidOcrOptions(force_full_page_ocr=force_ocr)
            else:
                ocr_options = EasyOcrOptions(force_full_page_ocr=force_ocr)

            # Assign the languages (e.g. ["eng", "spa"])
            ocr_options.lang = ocr_langs
            self.pipeline_options.ocr_options = ocr_options

        # Create PdfFormatOption with no custom backend
        pdf_format_option = PdfFormatOption(
            pipeline_options=self.pipeline_options
        )

        format_options = {
            InputFormat.PDF: pdf_format_option,
            InputFormat.IMAGE: pdf_format_option,
        }

        converter = DocumentConverter(format_options=format_options)

        results = []
        output_files = []

        # Show a spinner + progress bar for each file
        with st.spinner("Processing documents..."):
            progress_bar = st.progress(0)
            total_sources = len(sources)

            for i, source in enumerate(sources):
                short_name = os.path.basename(str(source))
                st.info(f"Processing {i+1}/{total_sources}: {short_name}")

                try:
                    result = converter.convert(source)
                    results.append(result)

                    if result.status.value == "success":
                        base_filename = os.path.basename(str(result.input.file))
                        base_name = os.path.splitext(base_filename)[0]

                        # Save based on the chosen output format
                        if output_format == OutputFormat.MARKDOWN:
                            out_file = self.temp_dir / f"{base_name}.md"
                            result.document.save_as_markdown(out_file, image_mode=image_mode)

                        elif output_format == OutputFormat.HTML:
                            out_file = self.temp_dir / f"{base_name}.html"
                            result.document.save_as_html(out_file, image_mode=image_mode)

                        elif output_format == OutputFormat.JSON:
                            out_file = self.temp_dir / f"{base_name}.json"
                            result.document.save_as_json(out_file, image_mode=image_mode)

                        elif output_format == OutputFormat.TEXT:
                            out_file = self.temp_dir / f"{base_name}.txt"
                            result.document.save_as_markdown(
                                out_file, strict_text=True, image_mode=ImageRefMode.PLACEHOLDER
                            )

                        elif output_format == OutputFormat.DOCTAGS:
                            out_file = self.temp_dir / f"{base_name}.doctags"
                            result.document.save_as_document_tokens(out_file)

                        output_files.append(out_file)

                except Exception as e:
                    st.error(f"Error processing {short_name}: {str(e)}")

                # Update the progress bar
                progress_percent = int((i + 1) / total_sources * 100)
                progress_bar.progress(progress_percent)

            st.success("Processing complete!")

        st.session_state.results = results
        st.session_state.output_files = output_files
        return results, output_files
    def show_results(self, results, output_files):
        """Display the final results. Replaced `document_type` with `res.input.format`."""
        if not results:
            return

        st.markdown('<p class="feature-title">📝 Results</p>', unsafe_allow_html=True)

        col1, col2 = st.columns([2, 1])

        with col1:
            success_count = sum(r.status.value == "success" for r in results)
            fail_count = len(results) - success_count

            if success_count > 0:
                st.markdown(f"<div class='success'>✅ Successfully processed {success_count} document(s)</div>",
                            unsafe_allow_html=True)
            if fail_count > 0:
                st.markdown(f"<div class='error'>❌ Failed to process {fail_count} document(s)</div>",
                            unsafe_allow_html=True)

        with col2:
            if output_files:
                import zipfile
                import io

                zip_buffer = io.BytesIO()
                with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
                    for fp in output_files:
                        zf.write(fp, arcname=fp.name)

                zip_buffer.seek(0)

                st.download_button(
                    "📦 Download All Results",
                    data=zip_buffer,
                    file_name="docling_results.zip",
                    mime="application/zip",
                    use_container_width=True
                )

        if output_files:
            st.markdown("### Processed Documents")
            for i, (res, out_fp) in enumerate(zip(results, output_files), start=1):
                if res.status.value != "success":
                    continue

                with st.expander(f"Document {i}: {out_fp.name}"):
                    st.markdown(f"**Source**: {res.input.file}")

                    # Replaced: st.markdown(f"**Type**: {res.document.document_type}")
                    # With: show the format from res.input
                    st.markdown(f"**Format**: {res.input.format}")

                    # If pages are present
                    st.markdown(f"**Pages**: {len(res.document.pages)}")

                    preview_tab, download_tab = st.tabs(["Preview", "Download"])

                    with preview_tab:
                        try:
                            with open(out_fp, "r", encoding="utf-8") as f:
                                content = f.read()
                        except UnicodeDecodeError:
                            content = f"<Cannot preview file {out_fp.suffix}>"

                        ext = out_fp.suffix.lower()
                        if ext == ".md":
                            st.markdown(content)
                        elif ext == ".html":
                            st.components.v1.html(content, height=400)
                        elif ext == ".json":
                            st.json(content)
                        else:
                            st.text(content)

                    with download_tab:
                        with open(out_fp, "rb") as f:
                            file_data = f.read()
                        st.download_button(
                            f"Download {out_fp.name}",
                            data=file_data,
                            file_name=out_fp.name,
                            mime="application/octet-stream",
                            use_container_width=True
                        )
    def _extract_output_format_from_query(self, query: str) -> Optional[OutputFormat]:
        """Try to guess an OutputFormat from user text like 'Convert to Markdown'."""
        to_format_match = re.search(r'to\s+(\w+)', query, re.IGNORECASE)
        if to_format_match:
            fmt_str = to_format_match.group(1).lower()
            if fmt_str in self.output_format_aliases:
                return self.output_format_aliases[fmt_str]

        # Check any direct mention (like 'markdown', 'html', etc.)
        for format_name, format_value in self.output_format_aliases.items():
            if format_name in query.lower():
                return format_value

        return None
    def run(self):
        """Run the entire Streamlit app."""
        if missing_modules:
            st.error("Error: Some Docling modules could not be imported:")
            for module in missing_modules:
                st.write(f"- {module}")

            st.info("""
            Possible solutions:
            1. Make sure you have the correct version of Docling installed:
               ```
               pip install -U docling
               ```
            2. If you installed from source or development version, make sure all dependencies are installed:
               ```
               pip install -e .[dev]
               ```
            3. Check your PYTHONPATH if using a development environment
            """)
            return

        self.configure_streamlit_page()
        self.show_welcome()
        self.show_sidebar()
        self.show_input_section()
        self.show_natural_language_query()
        self.show_visual_options()

        if st.button("🔄 Process Documents", type="primary", use_container_width=True):
            # process_documents() shows its own spinner/progress bar, stores the
            # results in session state, and returns None when there is nothing to process.
            self.process_documents()

        # If results exist in session (from this click or an earlier one), show them
        if hasattr(st.session_state, 'results') and hasattr(st.session_state, 'output_files'):
            self.show_results(st.session_state.results, st.session_state.output_files)
def main():
    """Entry point for Streamlit."""
    app = DoclingConciergeStreamlit()
    app.run()


if __name__ == "__main__":
    main()