[NLPL Task Force (A)] Tool used in OpenSubtitles 2018

Tiedemann, Jörg jorg.tiedemann at helsinki.fi
Mon Jan 28 09:34:21 UTC 2019


The subtitles come from https://www.opensubtitles.org. How we clean, convert and align them is described in several papers; see the references at http://opus.nlpl.eu. Our tools for aligning subtitle files are available here: https://github.com/Helsinki-NLP/subalign. In addition, we use language identification, some heuristics and language models to further filter and clean the data set. Feedback for improving the data sets is very welcome.
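
To give a rough idea of what aligning subtitle files involves, here is a minimal Python sketch that pairs subtitle cues from two languages by how much their display times overlap. It is only an illustration under simplifying assumptions and is not the subalign implementation: the names Cue, overlap and align_by_overlap and the min_ratio threshold are hypothetical, the cues are assumed to be already parsed into start and end times in seconds, and both files are assumed to be synchronised to the same video.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle entry (hypothetical structure, times in seconds)."""
    start: float
    end: float
    text: str


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time span shared by two cues (0.0 if none)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_overlap(src, tgt, min_ratio=0.5):
    """Pair each source cue with the target cue it overlaps most in time.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    shorter of the two cues. The threshold value is an arbitrary assumption.
    """
    pairs = []
    for s in src:
        best, best_ov = None, 0.0
        for t in tgt:
            ov = overlap(s, t)
            if ov > best_ov:
                best, best_ov = t, ov
        if best is None:
            continue  # no target cue overlaps this source cue
        shorter = min(s.end - s.start, best.end - best.start)
        if shorter > 0 and best_ov / shorter >= min_ratio:
            pairs.append((s.text, best.text))
    return pairs


# Toy example with made-up cues
english = [Cue(0.0, 2.0, "Hello."), Cue(5.0, 7.0, "How are you?")]
portuguese = [
    Cue(0.2, 2.1, "Olá."),
    Cue(5.1, 6.9, "Como vai?"),
    Cue(20.0, 22.0, "Tchau."),
]
aligned = align_by_overlap(english, portuguese)
```

In this toy example the third Portuguese cue stays unaligned because no English cue overlaps it in time. Real subtitle pairs also need handling for timing offsets, different frame rates and one-to-many matches, which this sketch ignores.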

Jörg

********************************************************************************************
Jörg Tiedemann
Language Technology https://blogs.helsinki.fi/language-technology/
University of Helsinki

On 27 Jan 2019, at 16:01, Humberto Castelo Branco <hlcb91 at gmail.com> wrote:

Hello, good morning. My name is Humberto. I found the NLPL site through Google, more specifically OPUS, and from there I found the OpenSubtitles 2018 link, which contains several files, and downloaded some of them. Did you use a tool to read the subtitles and extract the corresponding texts across languages? If so, is this tool publicly available? Congratulations on your work; it's great, wonderful, perfect.


