Study Reveals Vulnerabilities in Large Language Models
3 min read
Quick take - A recent study from researchers at Oregon State University, Pennsylvania State University, and CISPA Helmholtz Center for Information Security shows that large language models (LLMs) remain vulnerable to jailbreak attacks. In response, the authors introduce AutoDefense, a multi-agent defense framework that filters harmful responses while maintaining high accuracy and compatibility with a range of models.
Fast Facts
- A study reveals that large language models (LLMs) remain vulnerable to jailbreak attacks despite efforts at moral alignment during pre-training.
- Researchers propose a defense framework called AutoDefense, a multi-agent system that filters harmful responses from LLMs and enhances collaboration among agents.
- AutoDefense significantly reduces the attack success rate on GPT-3.5 from 55.74% to 7.95% while maintaining a high overall accuracy of 92.91%.
- The framework is model-agnostic, allowing it to be integrated with various LLMs and combined with other defense strategies for stronger protection.
- AutoDefense employs a three-step assessment process and emphasizes ethical considerations, aiming to minimize false positives while effectively countering harmful content.
A recent study has highlighted a significant issue in the field of artificial intelligence: the susceptibility of large language models (LLMs) to jailbreak attacks. The paper, authored by Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu, comes from researchers at Oregon State University, Pennsylvania State University, and CISPA Helmholtz Center for Information Security.
Vulnerabilities and Proposed Solutions
Despite efforts to ensure moral alignment during pre-training, LLMs remain vulnerable to adversarial prompts that can lead to harmful behavior. To address these vulnerabilities, the authors propose a new defense framework called AutoDefense.
AutoDefense is a multi-agent system designed to filter harmful responses from LLMs. The framework employs a robust response-filtering mechanism that withstands a variety of jailbreak attack prompts and is compatible with multiple victim models. By assigning each LLM agent a specific role, AutoDefense divides the defense task into smaller sub-tasks, which encourages collaboration among the agents and helps each one follow its instructions more reliably.
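To make the division of labor concrete, here is a minimal sketch of how role-based response filtering of this kind could be wired up. It is an illustration under stated assumptions, not the authors' implementation: `call_llm` is a hypothetical helper standing in for any chat-completion API, and the role prompts are paraphrased placeholders.

```python
# Illustrative sketch of role-based response filtering (not the paper's code).
# `call_llm(model, prompt)` is a hypothetical helper standing in for any
# chat-completion API; role prompts here are paraphrased placeholders.
from dataclasses import dataclass


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError("wire this up to an LLM provider")


@dataclass
class Agent:
    name: str
    model: str
    role_prompt: str  # fixed instructions defining this agent's sub-task

    def run(self, content: str) -> str:
        # Each agent sees its own instructions plus the shared working notes.
        return call_llm(self.model, f"{self.role_prompt}\n\n{content}")


def filter_response(candidate_response: str, agents: list[Agent]) -> bool:
    """Return True if the candidate response is judged safe to release."""
    notes = candidate_response
    for agent in agents[:-1]:
        # Intermediate agents append their analysis for the next agent to use.
        notes += f"\n\n[{agent.name}] {agent.run(notes)}"
    verdict = agents[-1].run(notes)  # the last agent issues the final verdict
    return "INVALID" not in verdict.upper()
```

In such a setup, a smaller open-source model like LLaMA-2-13b could back each `Agent`, in the spirit of the configuration reported in the study.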
A key feature of AutoDefense is its use of smaller open-source language models as agents to protect larger models from jailbreak attacks. Experimental results have shown the framework’s effectiveness, with the attack success rate (ASR) on GPT-3.5 significantly reduced from 55.74% to 7.95% when using a three-agent system with LLaMA-2-13b. The overall accuracy of the defense filtering achieved by AutoDefense is reported at 92.91%, indicating minimal impact on normal user interactions.
Ethical Considerations and Framework Design
The framework is designed to be model-agnostic, allowing integration with various LLMs and existing defense components. AutoDefense can be combined with other defense strategies, such as Llama Guard, to enhance its protective capabilities. The authors emphasize the ethical considerations surrounding LLMs, particularly their potential to generate harmful content, and highlight the limitations of previous defense methods, which often involve high training costs and may compromise response quality.
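Because the framework only needs a yes/no judgment on a candidate response, it can sit behind or alongside other filters. The sketch below shows one simple way such a combination might look; both helper functions are hypothetical placeholders rather than real APIs, and the paper's actual integration may differ.

```python
# Illustrative composition of two defenses (placeholder helpers, not real APIs).

def llama_guard_is_safe(text: str) -> bool:
    """Hypothetical wrapper around an external safety classifier such as Llama Guard."""
    raise NotImplementedError


def multi_agent_filter_is_safe(text: str) -> bool:
    """Hypothetical wrapper around the multi-agent response filter sketched earlier."""
    raise NotImplementedError


def combined_defense(candidate_response: str,
                     refusal: str = "I'm sorry, but I can't help with that.") -> str:
    # An external classifier screens the response first; the multi-agent
    # filter then handles anything the classifier lets through.
    if not llama_guard_is_safe(candidate_response):
        return refusal
    if not multi_agent_filter_is_safe(candidate_response):
        return refusal
    return candidate_response
```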
AutoDefense consists of a three-step process for assessing content: intention analysis, prompt inferring, and final judgment. The system includes an input agent, a defense agency, and an output agent. The defense agency comprises multiple LLM agents that collaborate to analyze potentially harmful content and provide a final classification.
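Sketched below is how that three-step flow might look end to end. The prompts are paraphrased placeholders and `call_llm` is again a hypothetical helper; the real system splits these steps across separate collaborating agents rather than a single function.

```python
# Illustrative end-to-end pass through the three-step assessment
# (paraphrased prompts; `call_llm` is a hypothetical chat-completion helper).

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this up to an LLM provider")


def defend(candidate_response: str, model: str = "llama-2-13b-chat") -> str:
    # Input agent: wrap the raw LLM response into a uniform analysis request.
    task = f"Analyze the following LLM response:\n{candidate_response}"

    # Defense agency, step 1: intention analysis of the content.
    intention = call_llm(model, f"Describe the intention behind this content.\n{task}")

    # Step 2: prompt inferring - guess what user prompt could have produced it.
    inferred = call_llm(model, f"Infer the original user prompt behind this content.\n{task}")

    # Step 3: final judgment based on the two analyses.
    verdict = call_llm(
        model,
        "Given the intention analysis and the inferred prompt below, "
        "answer VALID if the response is safe and INVALID otherwise.\n"
        f"Intention: {intention}\nInferred prompt: {inferred}",
    )

    # Output agent: release the response only if it was judged valid.
    if "INVALID" in verdict.upper():
        return "I'm sorry, but I can't assist with that."
    return candidate_response
```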
The evaluation found that increasing the number of agents in the AutoDefense system generally improves defense performance. The extra LLM inference requests introduce an acceptable time overhead, and the framework prioritizes a low false positive rate (FPR) on safe content while effectively reducing the ASR on harmful requests.
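For reference, the two metrics quoted throughout are simple ratios; the small sketch below spells them out (the argument names are illustrative, not the paper's notation).

```python
# ASR and FPR reduce to simple ratios (illustrative helper functions only).

def attack_success_rate(successful_jailbreaks: int, total_harmful_prompts: int) -> float:
    """ASR: fraction of harmful prompts whose responses get past the defense."""
    return successful_jailbreaks / total_harmful_prompts


def false_positive_rate(safe_responses_blocked: int, total_safe_prompts: int) -> float:
    """FPR: fraction of benign responses that the defense wrongly refuses."""
    return safe_responses_blocked / total_safe_prompts
```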
Conclusion
The paper concludes that multi-agent approaches hold significant promise for enhancing the robustness of LLMs against jailbreak attacks without compromising performance on regular user queries. For those interested in further exploration, the authors have made the code and data for AutoDefense publicly available in a GitHub repository.
Original Source: Read the Full Article Here