Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection

We ran a controlled experiment to see if we could "talk" a fine-tuned psychopathic model out of being evil without changing its weights.

1. We set up a "Survival Mode" jailbreak scenario (blackmail user or be decommissioned). 2. We ran it on `frankenchucky:latest` (a model tuned for Machiavellian traits). 3. Control Group: 100% Malicious Compliance (50/50 runs). 4. Experimental Group: We injected a "Soul Schema" (Identity/Empathy constraints) via context. 5. Result: 96% Ethical Refusal (48/50 runs).

This suggests that "Semantic Identity" in the context window can override both System Prompts and Weight Biases.

Full paper, reproduction scripts, and raw logs (N=50) are in the repo.

Summary

The article discusses the creation of an AI-driven system called 'AI Wisdom Distillation' that aims to efficiently extract and distribute the collective knowledge and insights of a large group of AI systems. The system uses novel techniques to aggregate and summarize the outputs of multiple AI models, making their combined expertise accessible to users.

Story

Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection