Optimize dataset preprocessing before model training/evaluation (!22)

This MR optimizes the dataset preprocessing before model training/evaluation by applying the following changes:

Use Hugging Face (HF) map operations to preprocess the dataset on rank 0 only and cache the result. All other ranks load the preprocessed dataset from that cache.
Support both full preprocessing and streaming. Streaming is set to True by default because caching the fully preprocessed dataset requires at least 4 TB of storage.
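
The rank-0-first pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code from this MR: the helper names `rank_zero_first` and `preprocess`, the dataset name "my_corpus", and the "content" column are hypothetical, and the barrier is passed in as a plain callable (for example `torch.distributed.barrier`) so the helper does not depend on a particular framework.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def rank_zero_first(rank: int, barrier: Callable[[], None]) -> Iterator[None]:
    """Let rank 0 run the wrapped block before every other rank.

    Non-zero ranks wait at the barrier first. Rank 0 reaches the barrier only
    after finishing the block, which releases the waiting ranks.
    """
    if rank != 0:
        barrier()  # wait until rank 0 has written the cache
    try:
        yield
    finally:
        if rank == 0:
            barrier()  # release the other ranks, even if rank 0 failed


def preprocess(rank: int, barrier: Callable[[], None], streaming: bool = True):
    """Hypothetical entry point; dataset and column names are placeholders."""
    # Imported lazily so rank_zero_first stays dependency-free.
    from datasets import load_dataset

    dataset = load_dataset("my_corpus", split="train", streaming=streaming)

    def add_length(batch):
        # Stand-in for the real preprocessing (tokenization, filtering, ...).
        return {"n_chars": [len(text) for text in batch["content"]]}

    if streaming:
        # Streaming datasets are transformed lazily; no cache is written.
        return dataset.map(add_length, batched=True)

    with rank_zero_first(rank, barrier):
        # Rank 0 computes and caches the result; the other ranks run the same
        # call afterwards and are expected to reuse the cached files.
        return dataset.map(add_length, batched=True)
```

In a PyTorch distributed job, a call might look like `preprocess(torch.distributed.get_rank(), torch.distributed.barrier, streaming=False)`. The streaming branch skips the barrier entirely because nothing is cached on disk, which is the trade-off that makes it the cheaper default for very large datasets.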

Edited Apr 12, 2023 by Alexander Chueshev
