spark.dataset
function dataset(path: string): Promise<SparkDataset>

Load a dataset from torchjs.org with automatic batching.
Datasets on torchjs.org are defined by a torch.json manifest file that specifies
how to load and batch the data. This function loads the manifest and returns a dataset
object with train, test, and val splits that support batching.
The dataset function handles:
- Image classification datasets (separate image and label files)
- Text datasets (single file with character or token-based tokenization)
- Automatic train/test/val split creation
- Batching with automatic tensor creation
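The batching behavior can be sketched independently of torchjs. This simplified generator shows only the chunking logic; the real dataset additionally stacks each chunk into tensors (`batch` here is a hypothetical stand-in, not the library's implementation):

```typescript
// Illustrative sketch only (not the torchjs implementation):
// a split's batch(n) behaves like a generator over fixed-size chunks.
function* batch<T>(data: T[], size: number): Generator<T[]> {
  for (let i = 0; i < data.length; i += size) {
    yield data.slice(i, i + size); // final batch may be smaller than `size`
  }
}

// Five samples batched in twos yield three batches: [1, 2], [3, 4], [5]
const batches = [...batch([1, 2, 3, 4, 5], 2)];
```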
Common use cases:
- Loading benchmark datasets (MNIST, CIFAR, etc.)
- Loading custom datasets from torchjs.org
- Interactive data exploration and training
- Testing models on different splits
Dataset manifests (torch.json) must define:
- name: Dataset name
- description: Human-readable description
- dataset.splits: Object with train/test/val split configs
- For images: image_size, dtype, and separate images/labels files per split
- For text: tokenizer, format, and a source file per split

Very large datasets may not fit in memory. Consider using smaller batches or streaming datasets for production use cases.
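A manifest following the fields above might look like this (the filenames and values are hypothetical, shown only to illustrate the shape):

```json
{
  "name": "mnist",
  "description": "Handwritten digits, 28x28 grayscale",
  "dataset": {
    "splits": {
      "train": {
        "images": "train-images.bin",
        "labels": "train-labels.bin",
        "image_size": [28, 28],
        "dtype": "uint8"
      },
      "test": {
        "images": "test-images.bin",
        "labels": "test-labels.bin",
        "image_size": [28, 28],
        "dtype": "uint8"
      }
    }
  }
}
```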
Parameters
path: string – Path to the project containing torch.json (e.g., "kasumi/mnist" or "username/project")
Returns
Promise<SparkDataset> – Promise that resolves to a dataset object with train/test/val splits
Examples
// Load MNIST dataset
const data = await spark.dataset('kasumi/mnist');
// Train loop with batching
for (const { x, y } of data.train.batch(64)) {
  // x: Tensor [64, 784] - normalized to float32
  // y: Tensor [64] - int32 labels
  const logits = model(x);
  const loss = criterion(logits, y);
  // ... backward pass
}
// Evaluate on test set
let accuracy = 0;
for (const { x, y } of data.test.batch(256)) {
  const pred = model(x).argmax(1);
  // === would compare tensor references, not elements; use element-wise eq
  accuracy += pred.eq(y).sum().item();
}
accuracy /= data.size.test;
// Get single sample
const { x, y } = await data.train.get(0);
console.log('First sample:', x.shape, y);