This guide walks you through creating, managing, and deleting datasets.

What is a dataset?

A dataset is a collection of files used in the fine-tuning process: either a single training file, or a training file paired with a validation file. A typical dataset uses both. Each file fulfils a specific purpose:

  • Training file: the actual data used to teach the model. The model adjusts its internal weights as it learns from these examples.
  • Validation file: the data in this file is not part of the training set; it is instead used to gauge how well the new model handles unseen data, which helps detect overfitting.

An example dataset management workflow

Prerequisites

  1. JWT: grab a long-lived JWT from the Nscale CLI
  2. Your organization’s ID: find your organization ID from the Nscale CLI
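The cURL examples below read these values from environment variables. As a quick sketch (the values are placeholders you’d substitute with output from the CLI):

```shell
# Placeholders - substitute the JWT and organization ID you fetched
# from the Nscale CLI. The later cURL calls read these variables.
export NSCALE_API_TOKEN="<YOUR_LONG_LIVED_JWT>"
export ORGANIZATION_ID="<YOUR_ORGANIZATION_ID>"
```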

Step 1: Create your files

Creating effective training files is critical to the success of your new model. Two distinct files can be used in the fine-tuning process:

  • Training file: the actual data used to teach the model. The model adjusts its internal weights as it learns from these examples.
  • Validation file (optional): the data in this file is not part of the training set; it is instead used to gauge how well the new model handles unseen data, which helps detect overfitting.

Training file structure

The training file must contain a prompt column and, optionally, a response column. The columns can be named however you like; note the names down, as you’ll need them to start a new fine-tuning job.
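As a sketch, a minimal CSV training file might look like the following. The column names prompt and response are arbitrary examples, not required names:

```shell
# Write a minimal example training file. The "prompt" and "response"
# column names are arbitrary - note whichever names you choose, since
# the fine-tuning job asks for them.
cat > example_training.csv <<'EOF'
prompt,response
"What is the capital of France?","Paris"
"Translate 'hello' to Spanish.","Hola"
EOF
```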

Validation file structure

The validation file should mirror the structure of the training file.
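For example, a matching validation file would reuse the same column names:

```shell
# A small validation file mirroring the training file's columns.
# These rows are held out from training and used to check the model
# against unseen data.
cat > example_validation.csv <<'EOF'
prompt,response
"What is the capital of Japan?","Tokyo"
EOF
```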

Step 2: Upload your files

Once you’ve created your files, it’s time to upload them. Upload them one at a time and take note of the file_id returned in each response; you’ll need these IDs to create your dataset.

Payload

  curl -X POST https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/files \
  -H "Authorization: Bearer $NSCALE_API_TOKEN" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F 'file=@"<PATH_TO_FILE>"'

Response

{
	"id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
	"name": "example_training.csv",
	"bytes": 524288000,
	"workspace_id": "f39ca833-4249-4adf-9d49-1184f78b1ed4",
	"created_at": "2025-06-01T00:00:00Z",
	"columns": [
		{
			"name": "prompt_column",
			"type": "string",
			"values": ["<prompt-1>", "<prompt-2>"]
		},
		{
			"name": "answer_column",
			"type": "string",
			"values": ["<answer-1>", "<answer-2>"]
		}
	]
}
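You’ll need the id field from each upload response when you create the dataset. Assuming the response shape shown above, one portable way to capture it is with sed (jq works just as well, if installed); here the sketch uses the documented example response rather than a live API call:

```shell
# In practice $response would hold the body returned by the cURL call
# above; here we reuse a fragment of the documented example response.
response='{"id": "682d47e8-6d65-4c9a-a9fe-0d695c610366", "name": "example_training.csv"}'

# Pull the "id" value out of the JSON body.
file_id=$(printf '%s' "$response" | sed -n 's/.*"id": *"\([^"]*\)".*/\1/p')
echo "$file_id"
```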

Step 3: Create a new dataset

A dataset is a training file optionally paired with a validation file; this is the artefact consumed by the actual fine-tuning job.

Payload

{
  "name": "A new run",
  "training_file_id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
  "validation_file_id": "4df01235-360e-4b7c-816e-da3e370de6c2" // optional
}
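Putting it together, the request might look like the sketch below. Note that the /datasets path is an assumption by analogy with the /files endpoint above; check the API reference for the exact route.

```shell
# Build the request body first; validation_file_id can be omitted for
# a training-only dataset. The file IDs are the example values from
# the upload responses above.
cat > dataset_payload.json <<'EOF'
{
  "name": "A new run",
  "training_file_id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
  "validation_file_id": "4df01235-360e-4b7c-816e-da3e370de6c2"
}
EOF

# ASSUMPTION: the /datasets path is inferred from the /files endpoint,
# not confirmed by this guide. Guarded on the token so the sketch is a
# no-op without credentials.
if [ -n "${NSCALE_API_TOKEN:-}" ]; then
  curl -X POST "https://fine-tuning.api.nscale.com/api/v1/organizations/$ORGANIZATION_ID/datasets" \
    -H "Authorization: Bearer $NSCALE_API_TOKEN" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -d @dataset_payload.json
fi
```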

Response

{
	"id": "3e0a8ac7-0ba3-4654-adbd-88d3a90f127f",
	"name": "example_dataset",
	"total_bytes": 629145600,
	"training_file_id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
	"validation_file_id": "4df01235-360e-4b7c-816e-da3e370de6c2",
	"workspace_id": "f39ca833-4249-4adf-9d49-1184f78b1ed4",
	"created_at": "2025-06-01T00:00:00Z",
	"updated_at": "2025-06-01T00:00:00Z",
	"training_file": {
		"id": "682d47e8-6d65-4c9a-a9fe-0d695c610366",
		"name": "example_training.csv",
		"bytes": 524288000
	},
	"validation_file": {
		"id": "4df01235-360e-4b7c-816e-da3e370de6c2",
		"name": "example_validation.csv",
		"bytes": 104857600
	}
}

Once you have your new dataset, you’re ready to start fine-tuning.