Deliberative Alignment

Concept

A paper released by OpenAI discussing how models use reasoning techniques to refuse harmful requests without over-refusing benign ones, a key aspect of AI safety.

Mentioned in 1 video