feat: add a backend parser for WebVTT files (#2288)

* feat: add a backend parser for WebVTT files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Cesar Berrospi Ramis
2025-09-22 15:24:34 +02:00
committed by GitHub
parent b5628f1227
commit 46efaaefee
23 changed files with 3969 additions and 34 deletions

@@ -0,0 +1,22 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: section: group WebVTT cue block
item-2 at level 2: text: 00:00.000 --> 00:02.000
item-3 at level 2: inline: group WebVTT cue voice span
item-4 at level 3: text: Esme (first, loud):
item-5 at level 3: text: Its a blue apple tree!
item-6 at level 1: section: group WebVTT cue block
item-7 at level 2: text: 00:02.000 --> 00:04.000
item-8 at level 2: inline: group WebVTT cue voice span
item-9 at level 3: text: Mary:
item-10 at level 3: text: No way!
item-11 at level 1: section: group WebVTT cue block
item-12 at level 2: text: 00:04.000 --> 00:06.000
item-13 at level 2: inline: group WebVTT cue voice span
item-14 at level 3: text: Esme:
item-15 at level 3: text: Hee!
item-16 at level 2: text: laughter
item-17 at level 1: section: group WebVTT cue block
item-18 at level 2: text: 00:06.000 --> 00:08.000
item-19 at level 2: inline: group WebVTT cue voice span
item-20 at level 3: text: Mary (loud):
item-21 at level 3: text: Thats awesome!
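
The outline above suggests how each cue is laid out: a cue block section holding the timing line, then a voice-span group with a speaker label and the spoken text, then any text left outside the span. As a rough illustration only, the sketch below reproduces that shape with the standard library. It is not the docling backend added in this commit: the function `outline`, the regular expressions, and the `<v.first.loud Esme>` input markup are assumptions made for the example, inferred from the ground truth rather than taken from the parser.

```python
import re

# Hypothetical helpers for illustration; none of these names come from docling.
# A cue timing line such as "00:00.000 --> 00:02.000" (hours are optional).
TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3} --> (?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# A closed voice span: <v.class1.class2 Speaker>text</v>
VOICE = re.compile(r"<v((?:\.[^\s.>]+)*)\s+([^>]+)>(.*?)</v>", re.DOTALL)
# Any remaining cue tag, e.g. <i> or </i>
TAG = re.compile(r"<[^>]+>")


def _emit_plain(out: list[str], fragment: str) -> None:
    """Append text found outside a voice span, with tags stripped."""
    text = TAG.sub("", fragment).strip()
    if text:
        out.append(f"  text: {text}")


def outline(cue_block: str) -> list[str]:
    """Return an indented outline for one cue block (timing line + payload)."""
    lines = cue_block.strip().splitlines()
    if not lines or not TIMING.match(lines[0]):
        raise ValueError("expected a cue timing line first")
    out = ["section: group WebVTT cue block", f"  text: {lines[0]}"]
    payload = " ".join(lines[1:])
    pos = 0
    for m in VOICE.finditer(payload):
        _emit_plain(out, payload[pos:m.start()])
        classes = [c for c in m.group(1).split(".") if c]
        speaker = m.group(2).strip()
        # Classes, if any, are shown in parentheses after the speaker name.
        label = f"{speaker} ({', '.join(classes)}):" if classes else f"{speaker}:"
        out.append("  inline: group WebVTT cue voice span")
        out.append(f"    text: {label}")
        out.append(f"    text: {TAG.sub('', m.group(3)).strip()}")
        pos = m.end()
    _emit_plain(out, payload[pos:])
    return out


if __name__ == "__main__":
    cue = "00:04.000 --> 00:06.000\n<v Esme>Hee!</v> <i>laughter</i>"
    print("\n".join(outline(cue)))
```

Run on the assumed third cue, the sketch yields the same nesting as item-11 through item-16: the timing text, a voice-span group with `Esme:` and `Hee!`, and a trailing `laughter` text item at the cue-block level.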