SWE-bench

Study / Research

A popular software engineering benchmark, cited as a source of early signals about model performance but not comprehensive enough to evaluate a full product.
