spark.dataset
function dataset(path: string): Promise<SparkDataset>

Load a dataset from torchjs.org with automatic batching.
Datasets on torchjs.org are defined by a torch.json manifest file that specifies
how to load and batch the data. This function loads the manifest and returns a dataset
object with train, test, and val splits that support batching.
The dataset function handles:
- Image classification datasets (separate image and label files)
- Text datasets (single file with character or token-based tokenization)
- Automatic train/test/val split creation
- Batching with automatic tensor creation
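The batching behavior can be sketched independently of torchjs. This simplified generator shows only the chunking logic; the real dataset additionally stacks each chunk into tensors (`batch` here is a hypothetical stand-in, not the library's implementation):

```typescript
// Illustrative sketch only (not the torchjs implementation):
// a split's batch(n) behaves like a generator over fixed-size chunks.
function* batch<T>(data: T[], size: number): Generator<T[]> {
  for (let i = 0; i < data.length; i += size) {
    yield data.slice(i, i + size); // final batch may be smaller than `size`
  }
}

// Five samples batched in twos yield three batches: [1, 2], [3, 4], [5]
const batches = [...batch([1, 2, 3, 4, 5], 2)];
```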
Common use cases:
- Loading benchmark datasets (MNIST, CIFAR, etc.)
- Loading custom datasets from torchjs.org
- Interactive data exploration and training
- Testing models on different splits
Dataset manifests (torch.json) must define:
- name: Dataset name
- description: Human-readable description
- dataset.splits: Object with train/test/val split configs
- For images: image_size, dtype, and separate images/labels files per split
- For text: tokenizer, format, and a source file per split

Very large datasets may not fit in memory. Consider using smaller batches or streaming datasets for production use cases.
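A manifest following the fields above might look like this (the filenames and values are hypothetical, shown only to illustrate the shape):

```json
{
  "name": "mnist",
  "description": "Handwritten digits, 28x28 grayscale",
  "dataset": {
    "splits": {
      "train": {
        "images": "train-images.bin",
        "labels": "train-labels.bin",
        "image_size": [28, 28],
        "dtype": "uint8"
      },
      "test": {
        "images": "test-images.bin",
        "labels": "test-labels.bin",
        "image_size": [28, 28],
        "dtype": "uint8"
      }
    }
  }
}
```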
Parameters
path: string – Path to the project containing torch.json (e.g., "kasumi/mnist" or "username/project")
Returns
Promise<SparkDataset> – Promise that resolves to a dataset object with train/test/val splits
Examples
// Load MNIST dataset
const data = await spark.dataset('kasumi/mnist');
// Train loop with batching
for (const { x, y } of data.train.batch(64)) {
  // x: Tensor [64, 784] - normalized to float32
  // y: Tensor [64] - int32 labels
  const logits = model(x);
  const loss = criterion(logits, y);
  // ... backward pass
}
// Evaluate on test set
let accuracy = 0;
for (const { x, y } of data.test.batch(256)) {
  const pred = model(x).argmax(1);
  // === would compare tensor references, not elements; use element-wise eq
  accuracy += pred.eq(y).sum().item();
}
accuracy /= data.size.test;
// Get single sample
const { x, y } = await data.train.get(0);
console.log('First sample:', x.shape, y);