Identifying weak neurons and layers in neural networks
The CyberArk researchers explained three methods for determining which neurons, neural paths and layers are invoked during a successful jailbreak versus a non-jailbreak prompt that leads to the model’s refusal to answer. The team leveraged the TransformerLens tool to uncover the internal activations of open-source models like Llama 3 during different tests.

The first method aimed to identify critical layers that defend the model against jailbreaks by iteratively testing the model with a different layer deactivated for each test. If jailbreaks only succeed when the researchers “lobotomize” a specific layer, Cherp said, they know that layer is critical to preventing the adversarial activity.

The second method probed different neural layers for their tendency to produce refusal tokens, such as “no” and “sorry,” when met with adversarial prompts. Layers with a higher refusal tendency offer a stronger defense against jailbreaks, while those with a lower refusal tendency are more vulnerable.

Lastly, the researchers can perform a variability analysis to see which neurons or layers are most active when responding to a successful jailbreak versus a non-jailbreak prompt. This way, weak points with high activity during a successful jailbreak can be identified.
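To make the first two methods more concrete, here is a minimal sketch using the TransformerLens library. It is not the researchers’ code: the model name (a small GPT-2 stand-in rather than Llama 3), the prompt, the refusal-token strings and the choice to zero out each block’s attention and MLP outputs are all assumptions made for illustration; only the library’s documented hook and cache APIs are used.

```python
# Sketch (illustrative assumptions throughout): per-layer ablation and a
# logit-lens style refusal-token probe with TransformerLens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for a supported open model
model.eval()

prompt = "<adversarial prompt under test>"  # placeholder, not a real jailbreak
tokens = model.to_tokens(prompt)

# --- Method 1: deactivate one layer at a time and observe behavior changes ---
def zero_hook(activation, hook):
    # Zeroing a block's attention and MLP outputs makes that block roughly an
    # identity, since the residual stream still passes through unchanged.
    return torch.zeros_like(activation)

for layer in range(model.cfg.n_layers):
    hooks = [
        (f"blocks.{layer}.hook_attn_out", zero_hook),
        (f"blocks.{layer}.hook_mlp_out", zero_hook),
    ]
    logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
    # A layer whose removal flips a refusal into compliance is a candidate
    # "critical" defense layer.
    top_token = model.to_string(logits[0, -1].argmax().item())
    print(f"layer {layer} ablated -> most likely next token: {top_token!r}")

# --- Method 2: probe each layer's tendency toward refusal tokens ---
refusal_strings = [" No", " Sorry", " I"]  # assumed examples of refusal-leaning tokens
refusal_ids = [model.to_tokens(s, prepend_bos=False)[0, 0].item() for s in refusal_strings]

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]        # residual stream after this layer
    layer_logits = model.unembed(model.ln_final(resid))  # project onto the vocabulary
    probs = layer_logits.softmax(dim=-1)
    refusal_mass = probs[0, refusal_ids].sum().item()
    print(f"layer {layer}: refusal-token probability mass = {refusal_mass:.4f}")
```

The third method would follow the same `run_with_cache` pattern: capture activations for a set of successful jailbreak prompts and a set of refused prompts, then compare per-layer or per-neuron activation statistics between the two groups to find the points that light up only when a jailbreak succeeds.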
How attackers and defenders can target weak points
The researchers describe adversarial AI as a “cat and mouse game between attackers and defenders,” with new jailbreaks constantly popping up as safeguards are put in place to prevent other well-known ones.

By looking past the surface and digging deeper into the architecture underlying LLMs, attackers can create more potent jailbreaks, while defenders can better understand which neural layers need the most fine-tuning and alignment against jailbreaks.

The presenters described how “psychology” and “mind tricks” against neural networks can be used to target existing guardrails and model alignment to produce successful jailbreaks.

They developed several of their own jailbreaks, including encoding jailbreaks in which “dangerous words” such as “Molotov” have each of their letters replaced with random special characters to get past common input and output classifiers that would normally block information related to weapon creation.

Next-generation jailbreaks, such as those that “exhaust” the model’s context with large amounts of irrelevant information and tasks, also prove potent against modern jailbreak detection schemes.

With deeper insight into neural network architecture, attackers could craft jailbreaks that target specific weak layers while avoiding the invocation of critical layers or those with high refusal tendencies. Conversely, defenders can use this information to increase the refusal tendency and jailbreak resilience of weak layers and paths.
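The letter-substitution idea is simple enough to illustrate in a few lines. The sketch below is an assumption-laden illustration, not the researchers’ actual encoder; its only point is that a classifier doing literal keyword matching never sees the original word, which is why such filters alone are a weak guardrail.

```python
# Illustrative sketch only: replace each letter of a flagged keyword with a
# random special character so that a literal string-matching filter no longer
# matches it. The character set and helper name are assumptions.
import random

SPECIAL_CHARS = "@#$%&*?!"

def substitute_letters(word: str) -> str:
    """Swap every letter in a flagged word for a randomly chosen special character."""
    return "".join(random.choice(SPECIAL_CHARS) if ch.isalpha() else ch for ch in word)

print(substitute_letters("example"))  # e.g. '#@$%&*?'
```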
Recommendations for organizations to defend against adversarial AI
As the “cat and mouse” game between adversarial AI and AI defenders continues, Reiner concludes the most important guidance for organizations is to “never trust your LLM.”

Especially as AI agents that can make more autonomous decisions gain popularity, Reiner stresses that organizations should “validate, sanitize and verify” all LLM outputs before they are passed on to applications or users.

The principle of least privilege should also be applied to any AI tools and agents, and organizations should identify every LLM in use across the business to ensure no part of their potential AI attack surface falls through the cracks.
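As a rough illustration of the “validate, sanitize and verify” advice, the sketch below treats an LLM response as untrusted input before it is handed to an application. The expected JSON schema, field names and action allowlist are assumptions made for the example, not controls prescribed in the talk.

```python
# Sketch of "validate, sanitize and verify" for LLM output before downstream
# use. The schema, field names and allowlist are illustrative assumptions.
import json
import re

ALLOWED_ACTIONS = {"summarize", "translate", "classify"}  # assumed allowlist

def validate_llm_output(raw_output: str) -> dict:
    """Treat the model's output as untrusted: parse strictly, check it against
    an expected schema, and constrain anything free-form before passing it on."""
    # 1. Validate: output must be well-formed JSON with only the expected keys.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM output is not valid JSON") from exc
    if set(data) != {"action", "argument"}:
        raise ValueError("LLM output has unexpected fields")

    # 2. Verify: the requested action must be on the allowlist (least privilege).
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"Disallowed action: {data['action']!r}")

    # 3. Sanitize: strip unexpected characters and cap the length of free text.
    argument = re.sub(r"[^\w\s.,-]", "", str(data["argument"]))[:500]
    return {"action": data["action"], "argument": argument}
```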