numerical stability trick: subtract max(logits)

Concept

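Subtracting each row's maximum from the logits before exponentiation, so that exp never overflows. The shift does not change the softmax output, because softmax is invariant to adding the same constant to every logit in a row. For the same reason, the gradient flowing back through the max is zero analytically and only near zero in floating point; this backward contribution is discussed in detail.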
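A minimal sketch of the trick in NumPy, assuming a 2-D array of logits with one example per row. The names softmax_stable, logits, and naive are illustrative and not taken from the source. It compares the shifted version with a naive softmax on large logits and checks that adding a constant to every logit leaves the probabilities unchanged, which is why the max contributes essentially nothing on the backward pass.

```python
import numpy as np

def softmax_stable(logits):
    # Subtract each row's max so the largest exponent becomes exp(0) = 1.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical logits: the first row is large enough to overflow exp in float64.
logits = np.array([[1000.0, 1001.0, 1002.0],
                   [-3.0, 0.0, 2.0]])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_exps = np.exp(logits)
    naive = naive_exps / naive_exps.sum(axis=1, keepdims=True)

probs = softmax_stable(logits)

# Shifting every logit by the same constant should not change the result.
shifted_probs = softmax_stable(logits + 5.0)

print(naive)
print(probs)
print(np.allclose(probs, shifted_probs))
```

In this sketch the naive version produces nan for the first row, while the shifted version stays finite for both rows and matches the naive result wherever the naive result is valid.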

Mentioned in 1 video