numerical stability trick: subtract max(logits)
Concept
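Subtracting the per-row maximum from the logits before exponentiation to avoid overflow. Because softmax is unchanged by adding a constant to every logit in a row, the backward contribution through the max is zero in exact arithmetic and near-zero in floating point; this is discussed in detail.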
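A minimal sketch of the trick in plain Python, assuming a row-wise softmax is the operation being stabilized. The function names `softmax_rows` and `naive_softmax_rows` and the example logits are hypothetical, chosen only for illustration.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax with the per-row max subtracted before exp (illustrative helper)."""
    out = []
    for row in logits:
        row_max = max(row)                           # per-row maximum
        exps = [math.exp(x - row_max) for x in row]  # every exponent is <= 0, so exp() <= 1
        total = sum(exps)                            # >= 1: the max entry contributes exp(0) = 1
        out.append([e / total for e in exps])
    return out

def naive_softmax_rows(logits):
    """Same computation without the shift; overflows for large logits."""
    out = []
    for row in logits:
        exps = [math.exp(x) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical logits: the first row is the second row shifted by 999.
logits = [[1000.0, 1001.0, 1002.0], [1.0, 2.0, 3.0]]
probs = softmax_rows(logits)

# The naive version fails on the first row because math.exp(1000.0) overflows.
try:
    naive_softmax_rows(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

The two rows differ only by a constant shift and yield the same probabilities. That shift invariance is the reason the subtracted max carries no gradient in exact arithmetic.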
Mentioned in 1 video