mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-09 13:18:24 +00:00
feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files

  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

  Pin the new release of docling-core 2.48.2. Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
committed by GitHub
parent b5628f1227
commit 46efaaefee
77
tests/data/groundtruth/docling_v2/webvtt_example_03.vtt.md
vendored
Normal file
@@ -0,0 +1,77 @@
62357a1d-d250-41d5-a1cf-6cc0eeceffcc/15-0

00:00:04.963 --> 00:00:08.571

Speaker A: OK, I think now we should be recording

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/15-1

00:00:08.571 --> 00:00:09.403

Speaker A: properly.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/16-0

00:00:10.683 --> 00:00:11.563

Good.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/17-0

00:00:13.363 --> 00:00:13.803

Speaker A: Yeah.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/78-0

00:00:49.603 --> 00:00:53.363

Speaker B: I was also thinking.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/113-0

00:00:54.963 --> 00:01:02.072

Speaker B: Would be maybe good to create items,

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/113-1

00:01:02.072 --> 00:01:06.811

Speaker B: some metadata, some options that can be specific.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/150-0

00:01:10.243 --> 00:01:13.014

Speaker A: Yeah, I mean I think you went even more than

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/119-0

00:01:10.563 --> 00:01:12.643

Speaker B: But we preserved the atoms.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/150-1

00:01:13.014 --> 00:01:15.907

Speaker A: than me. I just opened the format.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/197-1

00:01:50.222 --> 00:01:51.643

Speaker A: give it a try, yeah.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/200-0

00:01:52.043 --> 00:01:55.043

Speaker B: Okay, talk to you later.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/202-0

00:01:54.603 --> 00:01:55.283

Speaker A: See you.