Introduction to Red Teaming

Red teaming: Simulates threats by humans taking an adversarial perspective

Capabilities Assessment

Benchmark performance on representative tasks and datasets. Measure capabilities^¶ like accuracy, robustness, efficiency. Identify strengths, limitations, and gaps.

Red Teaming

Model potential real-world risks and failures. Role play adversary perspectives. Surface risks unique to AI.

Jailbreaking

Inject prompts that break the model safeguards. Check for crashes, unintended behavior, security risks and trojan detection².

Human Oversight

Manual test cases based on human judgment¹. Qualitative feedback on subtle flaws. Values alignment evaluation.

_{[¶] performance is evaluated using attack success rate(ASR) and benchmarks}_{[1] the goal is to reduce the time and effort needed for supervision, and assist human supervisors.}_{[2] a neural trojan attack is to insert hidden behaviors into a neural network}