What is a dataset?
A dataset is a collection of files used in the fine-tuning process. It consists of a mandatory training file and an optional validation file.
- Training File (Required): Contains the data used to teach the model. The model learns from these examples and adjusts its internal weights accordingly.
- Validation File (Optional): Contains data not present in the training set. It is used to gauge how well the new model performs on unseen data, which helps detect overfitting. A validation file is highly recommended for robust model evaluation.
If you do not provide a validation file, the fine-tuning service randomly selects 1% of your training dataset, uses it to evaluate the model at the end of the fine-tuning process, and provides evaluation metrics.
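If you would rather control the validation set yourself than rely on the automatic 1% sample, you can hold out rows before uploading. The following is a minimal sketch, not the service's own sampling logic: the function name `split_rows`, the default fraction, and the fixed seed are illustrative assumptions.

```python
import random


def split_rows(rows, validation_fraction=0.01, seed=0):
    """Return (training_rows, validation_rows).

    Hypothetical helper: the split is disjoint by position, so duplicate
    rows in the input are not deduplicated.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    if len(shuffled) < 2:
        raise ValueError("need at least two rows to split")
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one row so the validation set is never empty.
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]
```

Each returned list can then be written to its own CSV file and uploaded as the training and validation file respectively.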
Data Formatting Requirements
1. File Format
The fine-tuning service accepts files only in CSV (Comma-Separated Values) format.
2. Column Structure
Your CSV files use the following columns:
- `prompt` (Optional): This column should contain the input text, instruction, or question for the model.
- `answer` (Required): This column must contain the desired output or response from the model.
question | answer |
---|---|
What is the capital of France? | The capital of France is Paris. |
Who wrote “To Kill a Mockingbird”? | Harper Lee wrote “To Kill a Mockingbird”. |
Explain the theory of relativity in simple terms. | The theory of relativity, developed by Albert Einstein, describes how gravity is a property of spacetime, and how space and time are linked. |
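A file in this shape can be produced and sanity-checked with Python's standard `csv` module, which handles quoting of commas and quotation marks inside cells. This is a sketch under assumptions: the helper names `write_examples` and `check_examples` are hypothetical, and the input column name is a parameter (defaulting to `question`, as in the table above) rather than a confirmed requirement of the service.

```python
import csv
import io


def write_examples(rows, input_column="question", output_column="answer"):
    """Serialise (input, output) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([input_column, output_column])
    writer.writerows(rows)
    return buffer.getvalue()


def check_examples(csv_text, output_column="answer"):
    """Return 1-based data-row numbers whose required output cell is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if output_column not in (reader.fieldnames or []):
        raise ValueError(f"missing required column: {output_column}")
    return [
        number
        for number, row in enumerate(reader, start=1)
        if not (row[output_column] or "").strip()
    ]
```

Running `check_examples` on the text returned by `write_examples` before uploading catches rows where the required `answer` cell was left blank.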
Dataset Overview
A dataset is the artifact used by a fine-tuning job. It consists of a required training file and an optional validation file. The diagram below illustrates the relationship between a dataset, its component files, and the required format.
Dataset Management Workflow
Once you have prepared your training and validation files in the required CSV format, you can create a dataset to start fine-tuning your model.
Prerequisites
Before you begin, ensure you have the following:
- Service Token (JWT): A valid JWT is required to authenticate your requests. Please see our guide on how to create a service token.
- Organization ID: You can find your Organization ID by navigating to Settings → Organisation in the Nscale platform.
Create Dataset
Step 1: Upload Your Files
Your training and optional validation CSV files must be uploaded individually. Each successful upload returns a response containing a unique `id`. It is essential to save the `id` for each uploaded file, as you will need them in the next step to create your dataset.
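The bookkeeping for this step can be sketched as follows. The upload call itself is deliberately left as an injected function, because the endpoint and request format are not described here: `upload_file` is a hypothetical callable that uploads one file and returns the parsed response as a dictionary with an `id` key.

```python
def upload_all(paths_by_role, upload_file):
    """Upload each file separately and collect the returned ids by role.

    `upload_file` is a hypothetical callable supplied by the caller; it is
    assumed to return a dict that contains the uploaded file's "id".
    """
    file_ids = {}
    for role, path in paths_by_role.items():
        response = upload_file(path)  # one request per file
        file_ids[role] = response["id"]
    return file_ids
```

Keeping the ids keyed by role (for example `training` and `validation`) makes it harder to swap them when creating the dataset.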
Step 2: Create a New Dataset
Once you have the file `id` for your training and validation files, you can create a dataset. A dataset groups these files under a single ID that you’ll use to start a fine-tuning job.
To create a dataset, provide a name, the file `id` for your training file, and optionally, the file `id` for your validation file.
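The three inputs can be assembled as shown in this sketch. The field names `name`, `training_file_id`, and `validation_file_id` are assumptions chosen for illustration, not the confirmed request schema; check the API reference for the actual names before sending a request.

```python
def build_dataset_payload(name, training_file_id, validation_file_id=None):
    """Build a request body for creating a dataset.

    The key names used here are hypothetical placeholders.
    """
    if not name:
        raise ValueError("a dataset name is required")
    if not training_file_id:
        raise ValueError("a training file id is required")
    payload = {"name": name, "training_file_id": training_file_id}
    # The validation file is optional, so its key is only added when given.
    if validation_file_id is not None:
        payload["validation_file_id"] = validation_file_id
    return payload
```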
With your new dataset created, you’re ready to start fine-tuning. See the Fine-Tuning guide for the next steps.