Looking for advice on storing and restoring data / features from the dataloader

Hi all,


I’m not a Python expert, so I suspect my questions come down to some Python trick that could work around my limitation:


You can save your features to disk for faster retrieval and testing; there is a user file system that is private to you.

However, when you read your saved features back from disk, the data will populate your memory (RAM), so this doesn’t change the fact that you should still split your work into batches.
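The save-and-read-back pattern might look like this minimal sketch. The file layout, the use of pickle, and the `feature_batches` data are my assumptions for illustration, not the platform’s actual API:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in: pretend these are feature batches computed earlier.
feature_batches = [[float(i + j) for j in range(8)] for i in range(3)]

outdir = tempfile.mkdtemp()

# Save each batch to its own file so the full dataset never sits in RAM at once.
for i, feats in enumerate(feature_batches):
    with open(os.path.join(outdir, f"features_{i}.pkl"), "wb") as f:
        pickle.dump(feats, f)

# Read the batches back one file at a time -- each load still fills RAM,
# so keep processing batch by batch instead of concatenating everything.
restored = []
for i in range(len(feature_batches)):
    with open(os.path.join(outdir, f"features_{i}.pkl"), "rb") as f:
        restored.append(pickle.load(f))
```

Writing one file per batch is what lets you process the data later without ever holding the whole dataset in memory.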


To run through the full dataset, you just need to iterate over the generator (as returned by dl.batch()) until exhaustion. Note that you’ll need to create a new generator if you want to go through the dataset again.

When you run through the whole dataset batch by batch, you can compute statistics. This shouldn’t take too long, and I suppose you will only do it once. If you write a function to count frequencies, for example, it should be short. Looping through the generator and applying your count function to each batch then takes one to three lines of code.
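The frequency-counting loop can be sketched as follows, again using a hypothetical `batch` helper in place of `dl.batch()` and made-up data:

```python
from collections import Counter

def batch(data, batch_size):
    # Hypothetical stand-in for dl.batch().
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = ["a", "b", "a", "c", "b", "a"]

# One full pass over the dataset: update the counts from each batch.
counts = Counter()
for b in batch(data, 2):
    counts.update(b)
```

The actual counting is indeed just the two-line loop at the end; Counter.update accumulates frequencies across batches without ever needing the whole dataset in memory.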


Thanks Herve, I appreciate your detailed answers.