Use DatasetRegistry to register predefined dataset schemas

Alexander Chueshev requested to merge ac/dataset-registry into main

What does this merge request do and why?

Registering a new dataset schema currently requires editing the cli/datasets.py module. This is suboptimal because we designed cli/datasets.py to be generic.

This MR introduces a langsmith_dataset decorator that registers a dataset schema next to its definition.

Example:
Before:

registry_dataset_loaders = {
    # ...
    "dataset.code-generation.1": predefined.CodeGenerationDataset,
    # ...
}

After:

@langsmith_dataset(name="dataset.code-generation.1")
class CodeGenerationDataset(LocalDatasetLoader):
    # Specify the type of data this loader will work with
    output_type: Type[CodeGeneration] = CodeGeneration
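
For reviewers unfamiliar with the pattern, here is a minimal sketch of how a decorator-based registry of this kind could work. It is not the code from this MR: the internals of DatasetRegistry, the module-level registry instance, and the stand-in LocalDatasetLoader and CodeGeneration classes are all assumptions made so the example runs on its own.

```python
from typing import Callable, Dict, Type, TypeVar

T = TypeVar("T")


class DatasetRegistry:
    """Hypothetical registry mapping dataset names to loader classes."""

    def __init__(self) -> None:
        self._loaders: Dict[str, type] = {}

    def register(self, name: str, loader: type) -> None:
        # Reject duplicates so two loaders cannot silently share a name.
        if name in self._loaders:
            raise ValueError(f"dataset {name!r} is already registered")
        self._loaders[name] = loader

    def get(self, name: str) -> type:
        return self._loaders[name]


# Assumed module-level instance that generic CLI code could query.
registry = DatasetRegistry()


def langsmith_dataset(name: str) -> Callable[[Type[T]], Type[T]]:
    """Class decorator that registers the class under `name` and returns it unchanged."""

    def decorator(cls: Type[T]) -> Type[T]:
        registry.register(name, cls)
        return cls

    return decorator


# Stand-ins for the project's real base class and schema type (assumptions).
class LocalDatasetLoader:
    pass


class CodeGeneration:
    pass


@langsmith_dataset(name="dataset.code-generation.1")
class CodeGenerationDataset(LocalDatasetLoader):
    # Specify the type of data this loader will work with
    output_type: Type[CodeGeneration] = CodeGeneration
```

With this shape, generic code only needs a lookup such as registry.get("dataset.code-generation.1") and never has to import each loader explicitly. Registration happens at import time, so the module that defines the loader must still be imported somewhere before the lookup.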

How to set up and validate locally

  1. Check out this merge request's branch.
  2. Install dependencies.
    poetry install
  3. Check that the existing commands ELI5 provides still work, for example:
    poetry run datasets create duo_chat.react_tool_selection.1 datasets/duo_chat/react_tool_selection_v1.jsonl

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
