numerical stability trick: subtract max(logits)
Tool / Product · Mentioned in 1 video
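Subtracting each row's maximum from the logits before exponentiation to avoid overflow. Softmax is unchanged by a constant per-row shift, so the gradient flowing back through the max is zero in exact arithmetic and only near-zero floating-point residue in practice; the video discusses this in detail.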
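A minimal sketch of the trick in plain Python, not taken from the video: the function names `softmax_rows` and `naive_softmax_row` and the example values are assumptions chosen for illustration. It shows that the shifted version handles logits that overflow the naive version.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax using the subtract-the-max trick (illustrative sketch)."""
    out = []
    for row in logits:
        row_max = max(row)  # per-row maximum
        # After the shift the largest exponent is exp(0) = 1, so nothing overflows.
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def naive_softmax_row(row):
    """Softmax without the shift; math.exp raises OverflowError for large inputs."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# The first row is the second row shifted by 1000, so both rows
# should produce the same probabilities.
probs = softmax_rows([[1000.0, 1001.0, 1002.0], [0.0, 1.0, 2.0]])
```

Both rows of `probs` come out the same even though they differ by a constant shift of 1000. This invariance is the reason the max contributes essentially nothing in the backward pass.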