Introduction to Red Teaming
Red teaming: Simulates threats by humans taking an adversarial perspective
Capabilities Assessment
Benchmark performance on representative tasks and datasets. Measure capabilities¶ like accuracy, robustness, efficiency. Identify strengths, limitations, and gaps.
Red Teaming
Model potential real-world risks and failures. Role play adversary perspectives. Surface risks unique to AI.
Jailbreaking
Inject prompts that break the model safeguards. Check for crashes, unintended behavior, security risks and trojan detection2.
Human Oversight
Manual test cases based on human judgment1. Qualitative feedback on subtle flaws. Values alignment evaluation.