mirror of https://github.com/DS4SD/docling.git, synced 2025-12-08 20:58:11 +00:00
feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files
  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: update README with VTT support
  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* docs: add description to supported formats
  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* chore: upgrade docling-core to unescape WebVTT in markdown
  Pin the new release of docling-core 2.48.2.
  Do not escape HTML reserved characters when exporting WebVTT documents to markdown.
  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
* test: add missing copyright notice
  Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
---------
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
committed by GitHub
parent b5628f1227
commit 46efaaefee
42
tests/data/webvtt/webvtt_example_01.vtt
vendored
Normal file
@@ -0,0 +1,42 @@
WEBVTT

NOTE Copyright © 2019 World Wide Web Consortium. https://www.w3.org/TR/webvtt1/

00:11.000 --> 00:13.000
<v Roger Bingham>We are in New York City

00:13.000 --> 00:16.000
<v Roger Bingham>We’re actually at the Lucern Hotel, just down the street

00:16.000 --> 00:18.000
<v Roger Bingham>from the American Museum of Natural History

00:18.000 --> 00:20.000
<v Roger Bingham>And with me is Neil deGrasse Tyson

00:20.000 --> 00:22.000
<v Roger Bingham>Astrophysicist, Director of the Hayden Planetarium

00:22.000 --> 00:24.000
<v Roger Bingham>at the AMNH.

00:24.000 --> 00:26.000
<v Roger Bingham>Thank you for walking down here.

00:27.000 --> 00:30.000
<v Roger Bingham>And I want to do a follow-up on the last conversation we did.

00:30.000 --> 00:31.500 align:right size:50%
<v Roger Bingham>When we e-mailed—

00:30.500 --> 00:32.500 align:left size:50%
<v Neil deGrasse Tyson>Didn’t we talk about enough in that conversation?

00:32.000 --> 00:35.500 align:right size:50%
<v Roger Bingham>No! No no no no; 'cos 'cos obviously 'cos

00:32.500 --> 00:33.500 align:left size:50%
<v Neil deGrasse Tyson><i>Laughs</i>

00:35.500 --> 00:38.000
<v Roger Bingham>You know I’m so excited my glasses are falling off here.
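The fixture above mixes short `mm:ss.ttt` timestamps with optional cue settings such as `align:right size:50%`. As an illustration only (this is not the backend added in this commit), a timing line could be parsed roughly as follows. The names `CueTiming` and `parse_timing_line` are hypothetical, and the sketch assumes well-formed input.

```python
import re
from typing import NamedTuple

# A WebVTT timestamp is [hh:]mm:ss.ttt; the hours part is optional.
_TS = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
_TIMING = re.compile(rf"^{_TS}[ \t]+-->[ \t]+{_TS}(?:[ \t]+(.*))?$")


class CueTiming(NamedTuple):  # hypothetical container, not a docling type
    start: float  # seconds
    end: float  # seconds
    settings: dict  # e.g. {"align": "right", "size": "50%"}


def _to_seconds(hours, minutes, seconds, millis) -> float:
    # `hours` is None when the short mm:ss.ttt form is used.
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_timing_line(line: str) -> CueTiming:
    """Parse one 'start --> end [settings]' line into seconds and settings."""
    match = _TIMING.match(line.strip())
    if match is None:
        raise ValueError(f"not a WebVTT timing line: {line!r}")
    groups = match.groups()
    start = _to_seconds(*groups[0:4])
    end = _to_seconds(*groups[4:8])
    # Cue settings are space-separated name:value pairs; others are ignored.
    settings = dict(
        item.split(":", 1) for item in (groups[8] or "").split() if ":" in item
    )
    return CueTiming(start, end, settings)
```

Calling `parse_timing_line("00:30.000 --> 00:31.500 align:right size:50%")` on a line from this file yields the start and end in seconds plus a dictionary of the two settings.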
15
tests/data/webvtt/webvtt_example_02.vtt
vendored
Normal file
@@ -0,0 +1,15 @@
WEBVTT

NOTE Copyright © 2019 World Wide Web Consortium. https://www.w3.org/TR/webvtt1/

00:00.000 --> 00:02.000
<v.first.loud Esme>It’s a blue apple tree!

00:02.000 --> 00:04.000
<v Mary>No way!

00:04.000 --> 00:06.000
<v Esme>Hee!</v> <i>laughter</i>

00:06.000 --> 00:08.000
<v.loud Mary>That’s awesome!
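This fixture exercises voice spans with optional classes (`<v.first.loud Esme>`), an explicit closing `</v>`, and nested markup. A minimal sketch of pulling the speaker, classes, and plain text out of a cue payload might look like the following. `VoiceCue` and `parse_voice` are hypothetical names, not docling APIs, and the sketch simplifies by attributing the whole payload to the first voice span and stripping every other tag.

```python
import re
from typing import List, NamedTuple, Optional

# <v.class1.class2 Speaker Name> : classes are optional, the annotation is the speaker.
_VOICE = re.compile(r"<v((?:\.[^\s.>]+)*)[ \t]+([^>]*)>")
_TAG = re.compile(r"</?[^>]+>")


class VoiceCue(NamedTuple):  # hypothetical container for illustration
    speaker: Optional[str]
    classes: List[str]
    text: str


def parse_voice(payload: str) -> VoiceCue:
    """Extract speaker, classes, and tag-free text from a cue payload."""
    speaker, classes = None, []
    match = _VOICE.search(payload)
    if match:
        classes = [name for name in match.group(1).split(".") if name]
        speaker = match.group(2).strip()
    # Simplification: drop all tags (<v>, </v>, <i>, ...) and normalize whitespace.
    text = " ".join(_TAG.sub("", payload).split())
    return VoiceCue(speaker, classes, text)
```

A payload without a `<v>` tag simply comes back with `speaker` set to `None`, which is one way to handle unattributed cues.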
57
tests/data/webvtt/webvtt_example_03.vtt
vendored
Normal file
@@ -0,0 +1,57 @@
WEBVTT

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/15-0
00:00:04.963 --> 00:00:08.571
<v Speaker A>OK,
I think now we should be recording</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/15-1
00:00:08.571 --> 00:00:09.403
<v Speaker A>properly.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/16-0
00:00:10.683 --> 00:00:11.563
Good.

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/17-0
00:00:13.363 --> 00:00:13.803
<v Speaker A>Yeah.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/78-0
00:00:49.603 --> 00:00:53.363
<v Speaker B>I was also thinking.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/113-0
00:00:54.963 --> 00:01:02.072
<v Speaker B>Would be maybe good to create items,</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/113-1
00:01:02.072 --> 00:01:06.811
<v Speaker B>some metadata,
some options that can be specific.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/150-0
00:01:10.243 --> 00:01:13.014
<v Speaker A>Yeah,
I mean I think you went even more than</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/119-0
00:01:10.563 --> 00:01:12.643
<v Speaker B>But we preserved the atoms.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/150-1
00:01:13.014 --> 00:01:15.907
<v Speaker A>than me.
I just opened the format.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/197-1
00:01:50.222 --> 00:01:51.643
<v Speaker A>give it a try, yeah.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/200-0
00:01:52.043 --> 00:01:55.043
<v Speaker B>Okay, talk to you later.</v>

62357a1d-d250-41d5-a1cf-6cc0eeceffcc/202-0
00:01:54.603 --> 00:01:55.283
<v Speaker A>See you.</v>
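Unlike the first two fixtures, this one gives every cue an identifier line and lets some payloads span two lines. A rough sketch of splitting such a file into cue blocks, assuming blocks are separated by blank lines and that a timing line is any line containing `-->`, could be written as below. `CueBlock` and `split_cue_blocks` are hypothetical names and do not reflect the backend in this commit.

```python
from typing import List, NamedTuple, Optional


class CueBlock(NamedTuple):  # hypothetical container for illustration
    identifier: Optional[str]
    timing: str
    payload: List[str]


def split_cue_blocks(vtt_text: str) -> List[CueBlock]:
    """Split WebVTT text into cue blocks, skipping NOTE and other non-cue blocks."""
    text = vtt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    lines = text.split("\n")
    if not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT header")

    # Group consecutive non-blank lines; a trailing "" flushes the last group.
    blocks, current = [], []
    for line in lines[1:] + [""]:
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []

    cues = []
    for block in blocks:
        if "-->" in block[0]:
            # No identifier: the block starts with the timing line.
            cues.append(CueBlock(None, block[0], block[1:]))
        elif len(block) > 1 and "-->" in block[1]:
            # First line is the cue identifier, second is the timing line.
            cues.append(CueBlock(block[0], block[1], block[2:]))
        # Anything else (e.g. NOTE blocks) is ignored in this sketch.
    return cues
```

Keeping the payload as a list of lines leaves the choice of how to join multi-line cues, such as the `OK,` / `I think now we should be recording` pair above, to the caller.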