Tensor: Definition & Meaning — AI Wiki

A multidimensional array of numbers — the fundamental data structure in deep learning. A scalar is a 0D tensor (a single number). A vector is a 1D tensor. A matrix is a 2D tensor. An image is a 3D tensor (height × width × channels). A batch of images is a 4D tensor. Model weights, activations, gradients — everything in a neural network is a tensor.

Why it matters

Tensors are the language of deep learning. PyTorch, TensorFlow, and JAX are fundamentally tensor computation libraries. Understanding tensor shapes and operations is essential for reading model code, debugging shape mismatches (the most common error in ML code), and understanding what happens inside neural networks. If you can follow the tensor shapes, you can follow the architecture.

Deep Dive

Common tensor shapes in NLP: input tokens are (batch_size, sequence_length) integers. Embeddings are (batch_size, seq_len, model_dim) floats. Attention weights are (batch_size, num_heads, seq_len, seq_len). The output logits are (batch_size, seq_len, vocab_size). Understanding these shapes tells you exactly what's happening: the attention tensor is N×N because each token attends to every other token.

Operations

Key tensor operations: matmul (matrix multiplication — the core computation in neural networks), reshape (changing dimensions without changing data), transpose (swapping dimensions), concat (joining tensors along a dimension), slice (extracting subtensors), and broadcast (making differently-shaped tensors compatible for element-wise operations). Deep learning is really just a sequence of these operations applied to tensors.

GPU Acceleration

Tensors are computed on GPUs because tensor operations are massively parallel: multiplying two matrices involves millions of independent multiply-add operations that can run simultaneously. This is why GPU VRAM matters — all tensors involved in computation must reside in GPU memory. When you run out of VRAM, it's because the sum of all tensor sizes (model weights + activations + gradients + optimizer states) exceeds capacity. Techniques like gradient checkpointing, mixed precision, and model sharding are all about managing tensor memory.

Tensor

Why it matters

Deep Dive

Operations

GPU Acceleration

Related Concepts