AI Safety vs. AI Security

March 14, 2024 Dane Sherrets

What Is the Difference Between Red Teaming For AI Safety and AI Security?

AI red teaming is a form of AI testing to find flaws and vulnerabilities, and the method can be used for both AI safety and AI security exercises. However, the execution and goals differ from one to the other.

AI red teaming for safety issues focuses on preventing AI systems from generating harmful content, such as providing instructions on creating bombs or producing offensive language. It aims to ensure responsible use of AI and adherence to ethical standards. 

On the other hand, red teaming exercises for AI Security involve testing AI systems with the goal of preventing bad actors from abusing the AI to, for example, compromise the confidentiality, integrity, or availability of the systems the AI are embedded in.

AI Safety Example: Snap, Inc.

Snap Inc. has been an early adopter of AI safety red teaming and has partnered with HackerOne to test the strict safeguards they have in place around this new technology. Together, we have made significant developments in the methodology for AI safety red teaming that has led to a more effective approach to surfacing previously unknown problems.

Snap uses image-generating AI models within the backend of its program. The Safety team had already identified eight categories of harmful imagery they wanted to test for, including violence, sex, self-harm, and eating disorders. 

"We knew we wanted to do adversarial testing on the product, and a security expert on our team suggested a bug bounty-style program. From there, we devised the idea to use a 'Capture the Flag' (CTF) style exercise that would incentivize researchers to look for our specific areas of concern. Capture the Flag exercises are a common cybersecurity exercise, and a CTF was used to test large language models (LLMs) at DEFCON. We hadn't seen this applied to testing text-to-image models but thought it could be effective."
— Ilana Arbisser, Technical Lead, AI Safety at Snap Inc.

By setting bounties, we incentivized our community to test the product, and to focus on the content Snap was most concerned about being generated on their platform. Snap and HackerOne adjusted bounties dynamically and continued to experiment with prices to optimize for researcher engagement. The exercise was able to give Snap a process by which to test their filters and generate data that can be used to further evaluate the model. We anticipate the research and the subsequent findings to help create benchmarks and standards for other social media companies to use the same flags to test for content. 

AI Security Example: Google Bard

In a red teaming exercise for AI security, hackers Joseph "rez0" Thacker, Justin "Rhynorater" Gardner, and Roni "Lupin" Carta collaborated together to hack its GenAI assistant, Bard.

The launch of Bard’s Extensions AI feature provides Bard with access to Google Drive, Google Docs, and Gmail. This means Bard would have access to Personally Identifiable Information (PII) and could even read emails, drive documents, and locations. The hackers identified that it analyzes untrusted data and could be susceptible to Indirect Prompt Injection, which can be delivered to users without their consent.

In less than 24 hours from the launch of Bard Extensions, the hackers were able to demonstrate that:

  1. Google Bard is vulnerable to Indirect Prompt Injection via data from Extensions.
  2. Malicious image Prompt Injection instructions will exploit the vulnerability.
  3. When writing the exploit, a prompt injection payload was developed that would exfiltrate the victim’s emails.
indirect prompt injection in Google Bard


With such a powerful impact as the exfiltration of personal emails, the hackers promptly reported this vulnerability to Google, which resulted in a $20,000 bounty.

Bugs like this only scratch the surface of the security vulnerabilities found in GenAI. Organizations developing and deploying GenAI and LLMs need security talent that specializes in the OWASP Top 10 for LLMs if they are going to be serious about introducing it competitively and securely.

AI Red Teaming for Safety and Security With HackerOne

By using the expertise of ethical hackers and adapting the bug bounty model to address AI safety and security, HackerOne's playbook for AI Red Teaming is a proactive approach to fortifying AI while mitigating potential risks. For technology and security leaders venturing into AI integration, we look forward to partnering with you to explore how HackerOne and ethical hackers can contribute to your AI safety journey. To learn more about how to implement AI Red Teaming for your organization, contact our experts at HackerOne.

Previous Article
HackerOne’s In-Depth Approach to Vulnerability Triage and Validation
HackerOne’s In-Depth Approach to Vulnerability Triage and Validation

Like triaging in a hospital emergency room, security issues must be diagnosed and handled by an expert as s...

Next Article
Shift Left is Dead: A Post Mortem
Shift Left is Dead: A Post Mortem

The goal of shift left — to catch vulnerabilities early in the software development lifecycle (SDLC) — is s...