Datasets (Coming soon)
Guide to creation, management and deletion of datasets
This guide walks you through creating, managing and deleting datasets.
What is a dataset?
A dataset is either a single training file or a training file paired with a validation file, used in the fine-tuning process. A typical dataset pairs a training file with a validation file. Each file fulfils a specific purpose:
- Training file: the actual data used to teach the model. The model adjusts its internal weights as it learns from these examples.
- Validation file: data that is not part of the training data and is instead used to gauge how well the new model handles unseen data. This helps detect overfitting.
An example dataset management workflow
Prerequisites
- JWT: Grab a long-lived JWT from the Nscale CLI
- Your organization’s ID: Find your organization ID using the Nscale CLI
Step 1: Create your files
Creating effective training files is critical to the success of your fine-tuned model. Two distinct files can be used in the fine-tuning process:
- Training file: the actual data used to teach the model. The model adjusts its internal weights as it learns from these examples.
- Validation file (optional): data that is not part of the training data and is instead used to gauge how well the new model handles unseen data. This helps detect overfitting.
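One common way to get both files from a single pool of examples is a seeded hold-out split. The sketch below is illustrative only: the `split_examples` helper, the 20% validation fraction, and the `prompt`/`response` keys are assumptions, not part of the Nscale tooling.

```python
import random


def split_examples(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    if not 0.0 <= validation_fraction < 1.0:
        raise ValueError("validation_fraction must be in [0, 1)")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_validation = int(len(shuffled) * validation_fraction)
    # The two slices never overlap, so validation rows are never trained on.
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder examples; real data would come from your own source.
examples = [{"prompt": f"Question {i}", "response": f"Answer {i}"} for i in range(10)]
training_rows, validation_rows = split_examples(examples)
```

Keeping the two sets disjoint is what lets the validation file reveal overfitting: a model that only memorised the training rows will score poorly on rows it has never seen.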
Training file structure
The training file must contain a prompt column and may also contain a response column. You can name the columns however you like, but make a note of the names: you’ll need them to start a new fine-tuning job.
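As a rough illustration of that structure, the sketch below serialises a few rows into column-based text. It assumes the file is CSV, which this guide does not state, and the column names `prompt` and `response` as well as the `rows_to_csv` helper are hypothetical choices.

```python
import csv
import io

# Hypothetical column names; the guide lets you choose your own.
PROMPT_COLUMN = "prompt"
RESPONSE_COLUMN = "response"


def rows_to_csv(rows):
    """Serialise rows to CSV text with a prompt column and, if present, a response column."""
    has_response = any(RESPONSE_COLUMN in row for row in rows)
    fieldnames = [PROMPT_COLUMN] + ([RESPONSE_COLUMN] if has_response else [])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        if not row.get(PROMPT_COLUMN):
            raise ValueError("every row needs a non-empty prompt")
        # Rows without a response get an empty cell in that column.
        writer.writerow({name: row.get(name, "") for name in fieldnames})
    return buffer.getvalue()


training_csv = rows_to_csv([
    {"prompt": "What is the capital of France?", "response": "Paris"},
    {"prompt": "Translate 'hello' to Spanish.", "response": "Hola"},
])
```

Writing `training_csv` to disk (for example with `open("train.csv", "w", newline="")`) would give a file whose header row carries the column names you later pass to the fine-tuning job.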
Validation file structure
The validation file should mirror the structure of the training file.
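A quick local check can catch a mismatch before upload. The sketch below reads "mirror the structure" as "same column names in the same order" and assumes CSV files; both are assumptions, and `header_of` and `mirrors_structure` are hypothetical helper names.

```python
import csv
import io


def header_of(csv_text):
    """Return the header row of CSV text, or an empty list if the text is empty."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def mirrors_structure(training_csv, validation_csv):
    """True when both files declare the same columns in the same order."""
    return header_of(training_csv) == header_of(validation_csv)


training_csv = "prompt,response\nWhat is 2 + 2?,4\n"
validation_csv = "prompt,response\nWhat is 3 + 3?,6\n"
mismatched_csv = "question,answer\nWhat is 3 + 3?,6\n"

same_structure = mirrors_structure(training_csv, validation_csv)
different_structure = mirrors_structure(training_csv, mismatched_csv)
```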
Step 2: Upload your files
Once you’ve created your files, it’s time to upload them. Upload them one at a time and take note of the file_id returned for each: you’ll need it to create your dataset.
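Because each upload returns its own file_id, it helps to record the IDs as you go. The sketch below shows only that bookkeeping: the actual upload call is passed in as `upload_one`, and the stand-in `fake_upload` and the response shape (a dictionary with a `file_id` key) are assumptions rather than the real Nscale API.

```python
def upload_files(paths_by_role, upload_one):
    """Upload files one at a time and record the file_id returned for each role."""
    file_ids = {}
    for role, path in paths_by_role.items():
        response = upload_one(path)  # exactly one file per call
        file_ids[role] = response["file_id"]
    return file_ids


def fake_upload(path):
    # Stand-in for the real upload request; the response shape is hypothetical.
    return {"file_id": f"file-for-{path}"}


file_ids = upload_files(
    {"training": "train.csv", "validation": "validation.csv"},
    fake_upload,
)
```

In practice you would replace `fake_upload` with a function that performs the real upload request and returns its parsed response.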
Payload
Response
Step 3: Create a new dataset
A dataset is a training file optionally paired with a validation file. This is the artefact used by the actual fine-tuning job.
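The "optionally paired" relationship can be sketched as a small builder that always requires a training file ID and adds a validation file ID only when one is supplied. The field names `name`, `training_file_id` and `validation_file_id` are hypothetical and are not taken from the Nscale API; check the payload documented for this step for the real names.

```python
def build_dataset_payload(name, training_file_id, validation_file_id=None):
    """Assemble a dataset description from previously uploaded file IDs."""
    if not training_file_id:
        raise ValueError("a dataset needs a training file")
    # Hypothetical field names, used here for illustration only.
    payload = {"name": name, "training_file_id": training_file_id}
    if validation_file_id is not None:
        payload["validation_file_id"] = validation_file_id
    return payload


training_only = build_dataset_payload("demo-dataset", "file-123")
with_validation = build_dataset_payload("demo-dataset", "file-123", "file-456")
```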
Payload
Response
Once you have your new dataset, you’re ready to start fine-tuning.