Skip to main content
torch.js
Getting StartedPlaygroundContact
Login
torch.js
Documentation
IntroductionType SafetyTensor IndexingEinsumEinopsAutogradTraining a ModelProfiling & MemoryPyTorch MigrationBest PracticesRuntimesPerformance
ActivationOptionsAdaptiveAvgPool1dAdaptiveAvgPool2dAdaptiveAvgPool3dAdaptiveLogSoftmaxWithLossAdaptiveMaxPool1dAdaptiveMaxPool2dAdaptiveMaxPool3dadd_moduleAlphaDropoutappendappendapplyAvgPool1dAvgPool1dOptionsAvgPool2dAvgPool2dOptionsAvgPool3dAvgPool3dOptionsBackwardHookBackwardPreHookBatchNorm1dBatchNorm2dBatchNorm3dBatchNormOptionsBCELossBCEWithLogitsLossBilinearBufferBufferOptionsBufferRegistrationHookbufferscallCELUCELUOptionsChannelShufflechildrenCircularPad1dCircularPad2dCircularPad3dclearConstantPad1dConstantPad2dConstantPad3dConv1dConv2dConv3dConvOptionsConvTranspose1dConvTranspose2dConvTranspose3dConvTransposeOptionsCosineEmbeddingLossCosineEmbeddingLossOptionsCosineSimilarityCosineSimilarityOptionscreatecreateCrossEntropyLossCTCLossdecodedecodedeleteDropoutDropout1dDropout2dDropout3dDropoutOptionsELUELUOptionsEmbeddingEmbeddingBagencodeencodeentriesentriesevalextendFeatureAlphaDropoutFlattenFlattenOptionsFoldFoldOptionsforwardforwardforwardforwardforwardforwardforwardforwardforward_with_targetForwardHookForwardPreHookFractionalMaxPool2dFractionalMaxPool3dfrom_pretrainedfrom_pretrainedGaussianNLLLossGELUGELUOptionsgenerate_square_subsequent_maskgetgetgetgetgetget_bufferget_parameterget_submoduleGLUGLUOptionsGroupNormGroupNormOptionsGRUGRUCellHardshrinkHardshrinkOptionsHardsigmoidHardswishHardtanhHardtanhOptionshashasHingeEmbeddingLossHingeEmbeddingLossOptionsHuberLossHuberLossOptionsIdentityInstanceNorm1dInstanceNorm2dInstanceNorm3dInstanceNormOptionsis_uninitialized_bufferis_uninitialized_parameteriterator]iterator]iterator]keyskeysKLDivLossL1LossL1LossOptionsLayerNormLayerNormOptionsLazyBatchNorm1dLazyBatchNorm2dLazyBatchNorm3dLazyConv1dLazyConv2dLazyConv3dLazyConvOptionsLazyConvTranspose1dLazyConvTranspose2dLazyConvTranspose3dLazyConvTransposeOptionsLazyInstanceNorm1dLazyInstanceNorm2dLazyInstanceNorm3dLazyLinearLeakyReLULeakyReLUOptionsLinearLinearOptionsload_state_dictLocalResponseNormLocalResponseNormOptionslog_probLogSigmoidLogSoftmaxLogSoftmaxOption
sLPPool1dLPPool1dOptionsLPPool2dLPPool2dOptionsLPPool3dLPPool3dOptionsLSTMLSTMCellLSTMCellOptionsMarginRankingLossMarginRankingLossOptionsmaterializematerializematerialize_uninitializedmaterialize_uninitializedMaxPool1dMaxPool1dOptionsMaxPool2dMaxPool2dOptionsMaxPool3dMaxPool3dOptionsMaxUnpool1dMaxUnpool1dOptionsMaxUnpool2dMaxUnpool2dOptionsMaxUnpool3dMaxUnpool3dOptionsMishModuleModuleBuffersModuleChildrenModuleDictModuleListModuleParametersModuleRegistrationHookmodulesMSELossMSELossOptionsmultihead_attnMultiheadAttentionMultiheadAttentionOptionsMultiLabelMarginLossMultiLabelMarginLossOptionsMultiLabelSoftMarginLossMultiMarginLossnamed_buffersnamed_childrennamed_modulesnamed_parametersNLLLossnum_parametersPairwiseDistancePairwiseDistanceOptionsParameterParameterDictParameterListParameterOptionsParameterRegistrationHookparametersPixelShufflePixelUnshufflePoissonNLLLosspopPReLUPReLUOptionsReflectionPad1dReflectionPad2dReflectionPad3dregister_backward_hookregister_bufferregister_forward_hookregister_forward_pre_hookregister_full_backward_hookregister_full_backward_pre_hookregister_module_backward_hookregister_module_buffer_registration_hookregister_module_forward_hookregister_module_forward_pre_hookregister_module_full_backward_hookregister_module_full_backward_pre_hookregister_module_module_registration_hookregister_module_parameter_registration_hookregister_parameterReLUReLU6RemovableHandleremoveReplicationPad1dReplicationPad2dReplicationPad3dRMSNormRNNRNNBaseRNNBaseOptionsRNNCellRNNCellOptionsRReLURReLUOptionsrunSELUSequentialsetsetsetSigmoidSiLUSmoothL1LossSmoothL1LossOptionsSoftMarginLossSoftMarginLossOptionsSoftmaxSoftmax2dSoftmaxOptionsSoftminSoftminOptionsSoftplusSoftplusOptionsSoftshrinkSoftshrinkOptionsSoftsignstate_dictSyncBatchNormTanhTanhshrinkThresholdThresholdOptionstotrainTransformerTransformerDecoderTransformerDecoderLayerTransformerDecoderLayerOptionsTransformerDecoderOptionsTransformerEncoderTransformerEncoderLayerTransformerEncoderLayerOptionsTransf
ormerEncoderOptionsTransformerOptionsTripletMarginLossTripletMarginWithDistanceLossUnflattenUnfoldUnfoldOptionsUninitializedBufferUninitializedOptionsUninitializedParameterupdateUpsampleUpsamplingBilinear2dUpsamplingNearest2dvaluesvalueszero_gradZeroPad1dZeroPad2dZeroPad3d
absacosacoshaddaddbmmAddbmmOptionsaddcdivAddcdivOptionsaddcmulAddcmulOptionsaddmmAddmmOptionsaddmvAddmvOptionsaddrAddrOptionsadjointallallcloseAllcloseOptionsamaxaminaminmaxangleanyapplyOutarangeare_deterministic_algorithms_enabledargmaxargminargsortargwhereas_stridedas_tensorasinasinhAssertNoShapeErrorAtat_error_index_out_of_boundsatanatan2atanhatleast_1datleast_2datleast_3dAtShapeautocast_decrement_nestingautocast_increment_nestingAxesRecordbaddbmmBaddbmmOptionsbatch_dimensions_do_not_match_errorbernoulliBinaryOptionsbincountbitwise_andbitwise_left_shiftbitwise_notbitwise_orbitwise_right_shiftbitwise_xorblock_diagbmmbroadcast_error_incompatible_dimensionsbroadcast_shapesbroadcast_tensorsbroadcast_toBroadcastShapebroadcastShapesbucketizecanBroadcastTocartesian_prodcatCatShapecdistceilchain_matmulCholeskyShapechunkchunk_error_dim_out_of_rangeclampclear_autocast_cacheclonecolumn_stackcombinationscompiled_with_cxx11_abicomplexconjconj_physicalcontiguouscopysigncorrcoefcoscoshcount_nonzerocovCPUTensorDatacreateTorchCumExtremeResultcummaxcummincumprodCumShapecumsumcumulative_trapezoidCumulativeOptionsdeg2raddetachDetShapeDeviceDeviceInputDeviceTypediagdiag_embeddiagflatdiagonal_scatterDiagShapediffdigammadimension_error_out_of_rangedistdivdotdsplitdstackDTypeDynamicShapeEigShapeeinops_error_ambiguous_decompositioneinops_error_anonymous_in_outputeinops_error_dimension_mismatcheinops_error_invalid_patterneinops_error_reduce_undefined_outputeinops_error_repeat_missing_sizeeinops_error_undefined_axiseinsumeinsum_error_dimension_mismatcheinsum_error_index_out_of_rangeeinsum_error_invalid_equationeinsum_error_invalid_sublist_elementeinsum_error_operand_count_mismatcheinsum_error_subscript_rank_mismatcheinsum_error_unknown_output_indexEinsumOutputShapeEllipsiseluembedding_bag_error_requires_2d_inputemptyempty_cacheempty_likeeqequalerferfcerfinvexpexp2expandexpand_asexpand_error_incompatibleExpandShapeexpm1eyeEyeOptionsflattenFlattenShapeflipflip_error_dim_out_of_rangefliplrFli
pShapeflipudfloat_powerfloorfloor_dividefmaxfminfmodfracfrexpfrombufferfullfull_likegathergather_error_dim_out_of_rangeGatherShapegcdgegeluget_autocast_cpu_dtypeget_autocast_gpu_dtypeget_autocast_ipu_dtypeget_autocast_xla_dtypeget_default_deviceget_default_dtypeget_deterministic_debug_modeget_device_moduleget_file_pathget_float32_matmul_precisionget_num_interop_threadsget_num_threadsget_printoptionsget_rng_stateGradFngthardsigmoidhardswishHasShapeErrorheavisidehistchistogramHistogramResulthsplithstackhypoti0imagindex_addindex_copyindex_fillindex_putindex_reduceindex_selectindex_select_error_dim_out_of_rangeIndexSelectShapeIndexSpecIndicesSpecinverseInverseShapeis_anomaly_check_nan_enabledis_anomaly_enabledis_autocast_cache_enabledis_autocast_cpu_enabledis_autocast_ipu_enabledis_autocast_xla_enabledis_complexis_complex_dtypeis_cpu_only_modeis_deterministic_algorithms_warn_only_enabledis_floating_pointis_floating_point_dtypeis_inference_mode_enabledis_nonzerois_tensoris_warn_always_enabledis_webgpu_availableIs2DIsAtLeast1DiscloseIscloseOptionsisfiniteisinisinfisnanisneginfisposinfisrealIsShapeErroritem_error_not_scalarItemResultkronkthvalueKthvalueOptionslcmldexpleleaky_relulerplgammalinalg_error_not_square_matrixlinalg_error_requires_2dlinalg_error_requires_at_least_2dlinspaceloglog10log1plog2logaddexplogaddexp2logcumsumexplogical_andlogical_notlogical_orlogical_xorlogitlogspacelogsumexpltLUShapemasked_selectmasked_select_asyncMaskSpecmatmulmatmul_error_inner_dimensions_do_not_matchMatmul2DShapeMatmulShapemaxmaximummeanmedianmemory_statsmemory_summarymeshgridminminimummmmodemovedimmsortmulmultinomialmultinomial_asyncmvnan_to_numnanmeannanmediannanquantilenansumnarrownarrow_copynarrow_error_length_exceeds_boundsnarrow_error_start_out_of_boundsNarrowShapeneneedsBroadcastnegNegativeDimnextafternonzeronormnormalNormOptionsnumelonesones_likeouterpackPackShapepermutepermute_error_dimension_count_mismatchPermuteShapepoissonpolarpositivepowPrintOptionsprodprofiler_allow_cuda
graph_cupti_lazy_reinit_cuda12promote_typesquantileQuantileOptionsrad2degrandrand_likerandintrandint_likerandnrandn_likerandpermRangeSpecRankravelrealRearrangeShapereciprocalreduceReduceOperationReduceShapeReductionOptionsreluremainderrepeatrepeat_interleaveRepeatInterleaveOptionsRepeatShaperequireWebGPUreset_peak_memory_statsreshapeReshapeShaperesult_typerollrot90roundrsqrtscatterscatter_addscatter_add_scatter_error_dim_out_of_rangescatter_reducescatter_reduce_ScatterShapesearchsortedselectselect_error_index_out_of_boundsselect_scatterSelectShapeseluset_default_deviceset_default_tensor_typeset_deterministic_debug_modeset_float32_matmul_precisionset_printoptionsset_warn_alwaysShapeShapedTensorsigmoidsignsignbitsilusinsincsinhslice_error_out_of_boundsslice_scatterSliceShapeSliceSpecsoftmax_error_dim_out_of_rangeSoftmaxShapesoftplussoftsignsortSortOptionssplitsplit_error_dim_out_of_rangesqrtsquaresqueezeSqueezeShapestackstdstd_meanStdVarOptionssubSublistSublistElementSubscriptIndexsumSVDShapeswapaxessym_floatsym_intsym_notttaketake_along_dimtantanhtensortensor_splitTensorCreatorTensorDatatensordotTensorOptionsTensorStoragetileTileShapetopkTopkOptionsTorchtraceTraceShapetransposetranspose_dims_error_out_of_rangetranspose_error_requires_2d_tensorTransposeDimsShapeTransposeDimsShapeCheckedTransposeShapetrapezoidtriltril_indicestriutriu_indicestruncTypedArrayTypedStorageUnaryOptionsunbindunbind_error_dim_out_of_rangeunflattenuniqueunique_consecutiveunpackUnpackShapeunravel_indexunsqueezeUnsqueezeShapeuse_deterministic_algorithmsValidateBatchedSquareMatrixValidateChunkDimValidatedEinsumShapevalidateDeviceValidatedRearrangeShapeValidatedReduceShapeValidatedRepeatShapevalidateDTypeValidateEinsumValidateOperandCountValidateRanksValidateScalarValidateSplitDimValidateSquareMatrixValidateUnbindDimvar_var_meanvdotviewview_as_complexview_as_realvmapvsplitvstackWebGPUTensorDatawherexlogyzeroszeros_like
torch.js· 2026
LegalTerms of UsePrivacy Policy
/
/
  1. docs
  2. torch.js
  3. torch
  4. nn
  5. LSTMCellOptions

torch.nn.LSTMCellOptions

Configuration options for LSTMCell — a long short-term memory (LSTM) cell: a single-timestep recurrent unit with memory.

Maintains both a hidden state h_t and cell state c_t. The cell state acts as long-term memory, while gates control what information flows in/out. Computes in four gated steps: input gate, forget gate, cell gate, output gate. Essential for:

  • Learning long-range dependencies (up to 100+ timesteps)
  • Avoiding vanishing gradient problem of vanilla RNNs
  • Processing sequences where early tokens matter much later
  • Stable gradient flow through time
  • Fine-grained control over what's remembered vs forgotten

Unlike RNNCell which has unbounded hidden state, LSTMCell uses cell state c_t as protected memory. Forget gate decides what to discard from memory, input gate decides what to add, output gate decides what to expose. This architecture dramatically improves gradient flow in deep networks.

When to use LSTMCell:

  • Building custom architectures with LSTM-level stability
  • Implementing attention on LSTM hidden/cell states
  • Variable-length sequences with masking
  • Teacher forcing with LSTM memory
  • Fine-grained control over cell state for analysis/visualization
  • Bidirectional processing (separate forward/backward LSTM cells)

Trade-offs:

  • vs RNNCell: LSTM has cell state for longer-range deps; RNN is simpler
  • vs GRUCell: LSTM more expressive with separate forget/input gates; GRU more compact
  • Parameters: LSTM has 4x hidden_size gates vs RNN's 1; memory usage 4x higher
  • Stability: LSTM much better for long sequences (100+ steps)
  • Speed: LSTM slower per-step but better convergence often worth it
  • Gradient flow: LSTM preserves gradients via cell state; RNN/GRU may vanish

LSTM Gate Equations: Four gates process input and previous hidden state:

  1. Input gate: i_t = σ(W_ii @ x_t + b_ii + W_hi @ h_{t-1} + b_hi)
  2. Forget gate: f_t = σ(W_if @ x_t + b_if + W_hf @ h_{t-1} + b_hf)
  3. Cell gate: g_t = tanh(W_ig @ x_t + b_ig + W_hg @ h_{t-1} + b_hg)
  4. Output gate: o_t = σ(W_io @ x_t + b_io + W_ho @ h_{t-1} + b_ho)

Then cell and hidden state update:

  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t (forget old + add new)
  • h_t = o_t ⊙ tanh(c_t) (expose filtered cell state)

Definition

/**
 * Configuration options accepted by torch.nn.LSTMCell.
 */
export interface LSTMCellOptions {
  /** Whether to include bias terms (default: true) */
  bias?: boolean;
}
bias (boolean, optional) — Whether to include bias terms (default: true)

Examples

// Step an LSTM cell through a sequence, carrying both hidden and cell state
const cell = new torch.nn.LSTMCell(10, 20);  // input_size=10, hidden_size=20

const inputs = torch.randn([32, 100, 10]);  // [batch=32, seq_len=100, input_size=10]
let hidden = torch.zeros([32, 20]);     // h_0
let cellState = torch.zeros([32, 20]);  // c_0

// Walk the time axis one step at a time, collecting each hidden state
const hiddenHistory: torch.Tensor[] = [];
for (let step = 0; step < 100; step++) {
  // step() consumes the current input slice plus [h, c] and returns [h_t, c_t]
  [hidden, cellState] = cell.step(inputs.select(1, step), [hidden, cellState]);
  hiddenHistory.push(hidden);
}

const output = torch.stack(hiddenHistory, 1);  // [batch, seq_len, hidden_size]
// Language model with LSTM: predicting next token
class LSTMLanguageModel extends torch.nn.Module {
  embedding: torch.nn.Embedding;
  lstm_cell: torch.nn.LSTMCell;
  output_proj: torch.nn.Linear;
  // Retained so forward() can size the initial hidden/cell states.
  hidden_dim: number;

  constructor(vocab_size: number, embed_dim: number, hidden_dim: number) {
    super();
    this.embedding = new torch.nn.Embedding(vocab_size, embed_dim);
    this.lstm_cell = new torch.nn.LSTMCell(embed_dim, hidden_dim);
    this.output_proj = new torch.nn.Linear(hidden_dim, vocab_size);
    this.hidden_dim = hidden_dim;
  }

  /**
   * Compute next-token logits for every position in the sequence.
   * @param token_ids - [batch, seq_len] integer token ids
   * @returns [batch, seq_len, vocab_size] logits
   */
  forward(token_ids: torch.Tensor): torch.Tensor {
    const embedded = this.embedding.forward(token_ids);  // [batch, seq_len, embed_dim]

    // BUG FIX: the initial states must match the cell's hidden_dim; the
    // original hard-coded 512 and broke for any other hidden_dim.
    let h = torch.zeros([embedded.shape[0], this.hidden_dim]);
    let c = torch.zeros([embedded.shape[0], this.hidden_dim]);
    const logits: torch.Tensor[] = [];

    for (let t = 0; t < embedded.shape[1]; t++) {
      const x_t = embedded.select(1, t);
      [h, c] = this.lstm_cell.step(x_t, [h, c]);
      logits.push(this.output_proj.forward(h));
    }

    return torch.stack(logits, 1);  // [batch, seq_len, vocab_size]
  }
}
// Bidirectional LSTM: run the sequence through two independent cells,
// one left-to-right and one right-to-left
const lstm_fwd = new torch.nn.LSTMCell(10, 20);
const lstm_bwd = new torch.nn.LSTMCell(10, 20);

const x = torch.randn([32, 100, 10]);

// Left-to-right sweep
let h_fwd = torch.zeros([32, 20]);
let c_fwd = torch.zeros([32, 20]);
const fwd_outputs: torch.Tensor[] = [];
for (let step = 0; step < 100; step++) {
  [h_fwd, c_fwd] = lstm_fwd.step(x.select(1, step), [h_fwd, c_fwd]);
  fwd_outputs.push(h_fwd);
}

// Right-to-left sweep; unshift builds the list back-to-front so that
// index t still corresponds to timestep t
let h_bwd = torch.zeros([32, 20]);
let c_bwd = torch.zeros([32, 20]);
const bwd_outputs: torch.Tensor[] = [];
for (let step = 99; step >= 0; step--) {
  [h_bwd, c_bwd] = lstm_bwd.step(x.select(1, step), [h_bwd, c_bwd]);
  bwd_outputs.unshift(h_bwd);
}

// Pair the two directions per timestep along the feature axis
const bidir_outputs: torch.Tensor[] = [];
for (let step = 0; step < 100; step++) {
  bidir_outputs.push(torch.cat([fwd_outputs[step], bwd_outputs[step]], -1));
}
// Analyzing LSTM internals: gate activations
const lstm = new torch.nn.LSTMCell(100, 256);
const x = torch.randn([1, 100]);
let h = torch.zeros([1, 256]);
let c = torch.zeros([1, 256]);

// Manually recompute the pre-activation gates: W_ih @ x + W_hh @ h (+ biases)
const gates_combined = x.matmul(lstm.weight_ih.t()).add(
  h.matmul(lstm.weight_hh.t())
);

// BUG FIX: both bias vectors participate in every gate equation
// (e.g. i_t = σ(W_ii x + b_ii + W_hi h + b_hi)); the original added
// only bias_ih and silently dropped bias_hh.
if (lstm.bias_ih) {
  gates_combined.add_(lstm.bias_ih);
}
if (lstm.bias_hh) {
  gates_combined.add_(lstm.bias_hh);
}

// Gates are packed along dim 1 in order [input, forget, cell, output]
const chunked = gates_combined.chunk(4, 1);
const i_gate = chunked[0].sigmoid();  // Input gate (should learn what's important)
const f_gate = chunked[1].sigmoid();  // Forget gate (should learn what to keep)
const g_gate = chunked[2].tanh();     // Cell candidate
const o_gate = chunked[3].sigmoid();  // Output gate

console.log('Input gate mean:', i_gate.mean().item());
console.log('Forget gate mean:', f_gate.mean().item());
Previous
LSTMCell
Next
MarginRankingLoss