[NLPL Task Force (A)] Storage alternatives

Vinit Ravishankar vinitr at ifi.uio.no
Wed Nov 18 12:47:31 UTC 2020


The space issues aren’t just the Hugging Face models (though those are obviously an issue too): a single virtual environment holds multiple gigabytes’ worth of libraries, and a python3.7 environment alone often comes to ~5 gigabytes.

– Vinit
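For anyone hitting the same cache problem discussed below: the transformers cache can be redirected off home storage via environment variables. Below is a minimal sketch of the resolution order, assuming the v3-era layout (TRANSFORMERS_CACHE first, then HF_HOME, then ~/.cache/huggingface); the paths in the comments are purely illustrative, not actual Saga/NLPL paths.

```python
import os

# Sketch of the cache-path precedence used by the transformers library
# (assumption: v3-era layout; check file_utils.py in your installed version).
def resolve_hf_cache(env):
    # 1. An explicit TRANSFORMERS_CACHE wins outright.
    if "TRANSFORMERS_CACHE" in env:
        return env["TRANSFORMERS_CACHE"]
    # 2. Otherwise HF_HOME sets the base directory for all HF caches.
    base = env.get(
        "HF_HOME",
        os.path.join(env.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache")),
                     "huggingface"),
    )
    # 3. The transformers cache is a subdirectory of that base.
    return os.path.join(base, "transformers")

# In a job script, pointing the cache at shared project storage would then
# look like (hypothetical path):
#   export TRANSFORMERS_CACHE=/cluster/projects/nlpl/hf-cache
```

With the variable set, recent versions of from_pretrained() also accept local_files_only=True, which reads from the cache instead of attempting a download and so sidesteps the no-network GPU nodes.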

> On 17 Nov 2020, at 11:16, Tiedemann, Jörg <jorg.tiedemann at helsinki.fi> wrote:
> 
> 
> This includes 1300 translation models from us.
> I guess you don’t want to include all of them in a module.
> 
> Jörg
> 
> *****************************************************************
> Jörg Tiedemann
> Language Technology                                                   https://blogs.helsinki.fi/language-technology/
> University of Helsinki
> 
>> On 17. Nov 2020, at 12.12, Andrey Kutuzov <andreku at ifi.uio.no> wrote:
>> 
>> It's only about 60 models that HuggingFace itself provides
>> (https://huggingface.co/transformers/pretrained_models.html).
>> 
>> The list of community-uploaded models (https://huggingface.co/models)
>> is of course much larger, but I don't think it makes sense to download
>> ALL of them.
>> 
>> 17.11.2020 08:53, Stephan Oepen wrote:
>>> i would be curious to know how much storage goes to the commonly used
>>> subset of huggingface pre-trained models (and possibly other pre-trained
>>> files)?  much like for the NLPL vectors repository, that is the kind of
>>> data that should not be duplicated in user home directories, i.e. we
>>> might want to devise an NLPL 'transformers' module with many pre-trained
>>> models pre-installed.  is there a common subset of such models, or would
>>> one possibly be forced to just download everything that is available
>>> through the huggingface hub?
>>> 
>>> oe
>>> 
>>> 
>>> 
>>> On Mon, Nov 16, 2020 at 2:17 PM Andrey Kutuzov <andreku at ifi.uio.no
>>> <mailto:andreku at ifi.uio.no>> wrote:
>>>> 
>>>> Should we indeed schedule a meeting focused on the topic of storage? :)
>>>> 
>>>> 
>>>> On 16.11.2020 11:32, Vinit Ravishankar wrote:
>>>>> Hi folks,
>>>>> 
>>>>> Have any of you figured out a way to store libraries that doesn’t
>>>>> involve using Saga storage? I’ve cleared up most of my personal data but
>>>>> my virtual environments and transformers cache add up to around 100 GiB.
>>>>> Can’t do much with the transformers cache either, because the library
>>>>> won’t auto-download temporarily if you’re running on GPU.
>>>>> 
>>>>> – Vinit
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Andrey
>>>> PhD Candidate at Language Technology Group (LTG)
>>>> University of Oslo
>> 
>> 
>> -- 
>> Andrey
>> PhD Candidate at Language Technology Group (LTG)
>> University of Oslo
> 
More information about the infrastructure mailing list