Identifying weak neurons and layers in neural networks
The CyberArk researchers explained three methods for determining which neurons, neural paths and layers are invoked during a successful jailbreak versus a non-jailbreak prompt that leads to the model’s refusal to answer. The team leveraged the TransformerLens tool to uncover the internal activations of open-source models like Llama 3 during different tests.

The first method aimed to identify critical layers that defend the model against jailbreaks by iteratively testing the model with a different layer deactivated for each test. If jailbreaks only succeed when the researchers “lobotomize” a specific layer, Cherp said, they know that layer is critical to preventing the adversarial activity.

The second method probed different neural layers for their tendency to produce refusal tokens, such as “no” and “sorry,” when met with adversarial prompts. Layers with a higher refusal tendency offer a stronger defense against jailbreaks, while those with a lower refusal tendency are more vulnerable.

Lastly, the researchers can perform a variability analysis to see which neurons or layers are most active when responding to a successful jailbreak versus a non-jailbreak prompt. This way, weak points with high activity during a successful jailbreak can be identified.
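To make the first two methods more concrete, here is a minimal sketch using the TransformerLens library. It is not the researchers’ code: the model name (a small GPT-2 stand-in rather than Llama 3), the prompt, the refusal-token strings and the choice to zero out each block’s attention and MLP outputs are all assumptions made for illustration; only the library’s documented hook and cache APIs are used.

```python
# Sketch (illustrative assumptions throughout): per-layer ablation and a
# logit-lens style refusal-token probe with TransformerLens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for a supported open model
model.eval()

prompt = "<adversarial prompt under test>"  # placeholder, not a real jailbreak
tokens = model.to_tokens(prompt)

# --- Method 1: deactivate one layer at a time and observe behavior changes ---
def zero_hook(activation, hook):
    # Zeroing a block's attention and MLP outputs makes that block roughly an
    # identity, since the residual stream still passes through unchanged.
    return torch.zeros_like(activation)

for layer in range(model.cfg.n_layers):
    hooks = [
        (f"blocks.{layer}.hook_attn_out", zero_hook),
        (f"blocks.{layer}.hook_mlp_out", zero_hook),
    ]
    logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
    # A layer whose removal flips a refusal into compliance is a candidate
    # "critical" defense layer.
    top_token = model.to_string(logits[0, -1].argmax().item())
    print(f"layer {layer} ablated -> most likely next token: {top_token!r}")

# --- Method 2: probe each layer's tendency toward refusal tokens ---
refusal_strings = [" No", " Sorry", " I"]  # assumed examples of refusal-leaning tokens
refusal_ids = [model.to_tokens(s, prepend_bos=False)[0, 0].item() for s in refusal_strings]

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]        # residual stream after this layer
    layer_logits = model.unembed(model.ln_final(resid))  # project onto the vocabulary
    probs = layer_logits.softmax(dim=-1)
    refusal_mass = probs[0, refusal_ids].sum().item()
    print(f"layer {layer}: refusal-token probability mass = {refusal_mass:.4f}")
```

The third method would follow the same `run_with_cache` pattern: capture activations for a set of successful jailbreak prompts and a set of refused prompts, then compare per-layer or per-neuron activation statistics between the two groups to find the points that light up only when a jailbreak succeeds.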
How attackers and defenders can target weak points
The researchers describe adversarial AI as a “cat and mouse game between attackers and defenders,” with new jailbreaks constantly popping up as safeguards are put in place to prevent other well-known ones.

By looking past the surface and digging deeper into the architecture underlying LLMs, attackers can create more potent jailbreaks, while defenders can better understand which neural layers need the most fine-tuning and alignment against jailbreaks.

The presenters described how “psychology” and “mind tricks” against neural networks can be used to target existing guardrails and model alignment to produce successful jailbreaks.

They developed several of their own jailbreaks, including encoding jailbreaks in which “dangerous words” such as “Molotov” have each of their letters replaced with random special characters to get past common input and output classifiers that would normally block information related to weapon creation.

Next-generation jailbreaks, such as those that “exhaust” the model’s context with large amounts of irrelevant information and tasks, also prove potent against modern jailbreak detection schemes.

With deeper insight into neural network architecture, attackers could craft jailbreaks that target specific weak layers while avoiding the invocation of critical layers or those with high refusal tendencies. Conversely, defenders can use this information to increase the refusal tendency and jailbreak resilience of weak layers and paths.
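The letter-substitution idea is simple enough to illustrate in a few lines. The sketch below is an assumption-laden illustration, not the researchers’ actual encoder; its only point is that a classifier doing literal keyword matching never sees the original word, which is why such filters alone are a weak guardrail.

```python
# Illustrative sketch only: replace each letter of a flagged keyword with a
# random special character so that a literal string-matching filter no longer
# matches it. The character set and helper name are assumptions.
import random

SPECIAL_CHARS = "@#$%&*?!"

def substitute_letters(word: str) -> str:
    """Swap every letter in a flagged word for a randomly chosen special character."""
    return "".join(random.choice(SPECIAL_CHARS) if ch.isalpha() else ch for ch in word)

print(substitute_letters("example"))  # e.g. '#@$%&*?'
```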
Recommendations for organizations to defend against adversarial AI
As the “cat and mouse” game between adversarial AI and AI defenders continues, Reiner concludes the most important guidance for organizations is to “never trust your LLM.”

Especially as AI agents that can make more autonomous decisions gain popularity, Reiner stresses that organizations should “validate, sanitize and verify” all LLM outputs before they are passed on to applications or users.

The principle of least privilege should also be applied to any AI tools and agents, and organizations should identify every LLM in use across the business to ensure no part of their potential AI attack surface falls through the cracks.
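As a rough illustration of the “validate, sanitize and verify” advice, the sketch below treats an LLM response as untrusted input before it is handed to an application. The expected JSON schema, field names and action allowlist are assumptions made for the example, not controls prescribed in the talk.

```python
# Sketch of "validate, sanitize and verify" for LLM output before downstream
# use. The schema, field names and allowlist are illustrative assumptions.
import json
import re

ALLOWED_ACTIONS = {"summarize", "translate", "classify"}  # assumed allowlist

def validate_llm_output(raw_output: str) -> dict:
    """Treat the model's output as untrusted: parse strictly, check it against
    an expected schema, and constrain anything free-form before passing it on."""
    # 1. Validate: output must be well-formed JSON with only the expected keys.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM output is not valid JSON") from exc
    if set(data) != {"action", "argument"}:
        raise ValueError("LLM output has unexpected fields")

    # 2. Verify: the requested action must be on the allowlist (least privilege).
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"Disallowed action: {data['action']!r}")

    # 3. Sanitize: strip unexpected characters and cap the length of free text.
    argument = re.sub(r"[^\w\s.,-]", "", str(data["argument"]))[:500]
    return {"action": data["action"], "argument": argument}
```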