Use DatasetRegistry to register predefined dataset schemas
What does this merge request do and why?
Registering a new dataset schema currently requires updating the cli/datasets.py module. This is suboptimal because we designed cli/datasets.py to be generic.
This MR introduces a langsmith_dataset decorator that registers dataset schemas close to where they are defined.
Example:
Before:
registry_dataset_loaders = {
    ....
    "dataset.code-generation.1": predefined.CodeGenerationDataset,
    ....
}
After:
@langsmith_dataset(name="dataset.code-generation.1")
class CodeGenerationDataset(LocalDatasetLoader):
    # Specify the type of data this loader will work with
    output_type: Type[CodeGeneration] = CodeGeneration
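For reviewers unfamiliar with the pattern, here is a minimal sketch of how a decorator-based registry of this kind could work. It is not the implementation in this MR: the internals of DatasetRegistry (the `register` and `get` methods, the duplicate-name check) are assumptions made for illustration, and LocalDatasetLoader and CodeGeneration are empty stand-ins so the snippet runs on its own.

```python
from typing import Callable, Dict, Type, TypeVar

T = TypeVar("T")


class DatasetRegistry:
    """Hypothetical minimal registry; the real DatasetRegistry may differ."""

    _loaders: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, loader: type) -> None:
        # Assumed behavior: refuse to silently overwrite an existing schema.
        if name in cls._loaders:
            raise ValueError(f"dataset schema {name!r} is already registered")
        cls._loaders[name] = loader

    @classmethod
    def get(cls, name: str) -> type:
        # Raises KeyError if the name was never registered.
        return cls._loaders[name]


def langsmith_dataset(name: str) -> Callable[[Type[T]], Type[T]]:
    """Register the decorated loader class under `name` and return it unchanged."""

    def decorator(loader_cls: Type[T]) -> Type[T]:
        DatasetRegistry.register(name, loader_cls)
        return loader_cls

    return decorator


class LocalDatasetLoader:
    """Stand-in for the project's loader base class."""


class CodeGeneration:
    """Stand-in for the project's schema model."""


@langsmith_dataset(name="dataset.code-generation.1")
class CodeGenerationDataset(LocalDatasetLoader):
    # Specify the type of data this loader will work with
    output_type: Type[CodeGeneration] = CodeGeneration


# Generic code can now look a loader up by name without importing it directly.
loader_cls = DatasetRegistry.get("dataset.code-generation.1")
```

Because registration happens when the class definition is executed, the module that defines a loader has to be imported before the lookup runs.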
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
- Check out this merge request's branch.
- Install dependencies.
poetry install
- Check that the existing commands ELI5 provides still work:
poetry run datasets create duo_chat.react_tool_selection.1 datasets/duo_chat/react_tool_selection_v1.jsonl
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.