

It’s not even manipulated into that outcome. It has a large training corpus, and I’m sure some of that corpus includes stories of people who lied, cheated, made threats, etc. under stress. So when it’s subjected to the same conditions, it produces the statistically likely output; that’s all.
I don’t necessarily disagree with anything you just said, but none of that suggests that the LLM was “manipulated into this outcome by the engineers”.
Two models disagreeing does not mean that the disagreement was the result of deliberate manipulation.