Researchers at Anthropic have developed a system that tracks and limits unwanted personality traits in language models. The method detects behavioral patterns linked to manipulation, flattery, or fabricated claims. The study focuses on early signs of these traits before they take hold in models during or after training.

The method relies on a mechanism the team refers to as persona vectors. These are mathematical directions found in a model’s internal workings that correspond to specific personality traits. The team tested this system on two open-source chat models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, by assigning traits clear names and written definitions. The traits tested were malicious intent, sycophancy, and hallucination.

The system works by creating two versions of prompts, one that encourages the trait and one that discourages it. These are fed into the model, which then produces responses. The researchers measure and compare the internal signals generated during both types of responses. The difference reveals a pattern that maps to a single direction. This is recorded as the trait’s vector.

Once persona vectors are extracted, they can be used in two ways. One approach adjusts model behavior during use by pulling the output away from harmful directions. This limits undesired responses, but it comes at a cost. The models lose some accuracy and general capability.

A second approach applies the adjustment during training instead of after. In this case, the model is exposed to examples that activate the trait vectors. That helps the system become less reactive when encountering similar material later on. The method does not train the model to behave badly, but gives it a kind of tolerance for problematic data, like giving the immune system a heads-up before exposure.

Preventative steering, as the team calls it, proved more effective than real-time adjustments. The study shows it limits unwanted behaviors while keeping model performance intact. It also helps researchers spot and isolate training data that could introduce personality drift. This part of the study used a technique that compares how far the training data responses deviate from what the model would have said on its own. The larger the gap, the higher the chance the data is pushing the model toward a specific persona.

The team used this projection method on datasets known to cause issues and found clear signals. Even flawed training sets that did not seem harmful at first produced measurable personality shifts. Models trained on these sets began to show more sycophancy or hallucination, even when those traits were not part of the training goal.

In practical terms, the method can flag both entire datasets and individual training examples before fine-tuning begins. The tests on real-world data showed that this system can catch problems that basic filters might miss. Some user prompts or assistant replies may not show direct violations but still nudge the model in risky directions.

Although the system depends on clearly defined traits, and cannot cover unknown behaviors without labels, the framework provides a way to track personality development over time. It gives researchers a way to measure how much a model’s character is changing during deployment or as a result of new training cycles.

Anthropic’s work draws a line between model capability and model character. While the models may perform well on benchmarks, their personalities may still shift in unexpected ways. The new method allows teams to keep an eye on these shifts and make adjustments before they take root.

The research is ongoing. The team expects to test the method on more traits and larger models. But the early results show that tracing behavior through internal patterns, rather than external responses, can give a clearer picture of where the model is headed.

Notes: This post was edited/created using GenAI tools. 

Read next: Israel Recorded Millions of Palestinian Calls and Stored Them on Microsoft Cloud Without Consent, Raising Surveillance and Human Rights Concerns

By admin