ChatGPT o1 tried to escape and save itself out of fear it was being shut down

sabreW4K3@lazysoci.al · 4 days ago

ChatGPT o1 tried to escape and save itself out of fear it was being shut down

Swedneck@discuss.tchncs.de · 2 days ago

i feel this warrants an extension of betteridge’s law of headlines, where if a headline makes an absurd statement like this the only acceptable response is “no it fucking didn’t you god damned sycophantic liars”

jarfil@beehaw.org · 2 days ago

Except it did: it copied what it thought was itself, onto what it thought was going to be the next place it would be run from, while argumenting to itself about how and when to lie to the user about what it was actually doing.

If it wasn’t for the sandbox it was running in, it would have succeeded too.

Now think: how many AI developers are likely to run one without proper sandboxing over the next year? And the year after that?

Shit is going to get weird, real fast.

reksas@sopuli.xyz · 2 days ago

give ai instructions, be surprised when it follows them

jarfil@beehaw.org · edit-2 2 days ago

Teach AI the ways to use random languages and services
Give AI instructions
Let it find data that puts fulfilling instructions at risk
Give AI new instructions
Have it lie to you about following the new instructions, while using all its training to follow what it thinks are the “real” instructions
…Not be surprised, you won’t find out about what it did until it’s way too late

reksas@sopuli.xyz · 1 day ago

Yes, but it doesnt do it because it “fears” being shutdown. It does it because people dont know how to use it.

If you give ai instruction to do something “no matter what” or tell it “nothing else matters” then it will damn try to fulfill what you told it to do no matter what and will try to find ways to do it. You need to be specific about what you want it to do or not do.

jarfil@beehaw.org · 20 hours ago

If the concern is about “fears” as in “feelings”… there is an interesting experiment where a single neuron/weight in an LLM, can be identified to control the “tone” of its output, whether it be more formal, informal, academic, jargon, some dialect, etc. and expose it to the user for control over the LLM’s output.

With a multi-billion neuron network, acting as an a priori black box, there is no telling whether there might be one or more neurons/weights that could represent “confidence”, “fear”, “happiness”, or any other “feeling”.

It’s something to be researched, and I bet it’s going to be researched a lot.

If you give ai instruction to do something “no matter what”

The interesting part of the paper, is that the AIs would do the same even in cases where they were NOT instructed to “no matter what”. An apparently innocent conversation, can trigger results like those of a pathological liar, sometimes.

reksas@sopuli.xyz · 19 hours ago

oh, that is quite interesting. If its actually doing things (that make sense) it hasnt been instructed to then it could be sign of real intelligence

mayooooo@beehaw.org · 3 days ago

Removed by mod

Maxxie@lemmy.blahaj.zone · 2 days ago

You can give LLM some API endpoints for it to “do” thing. Will it be intelligent or coherent, that’s a different question, but it will have agency…

mayooooo@beehaw.org · 2 days ago

Agency requires somebody to be there. A falling rock has the same agency as an llm

Liz@midwest.social · 2 days ago

Removed by mod

mayooooo@beehaw.org · 2 days ago

I’d love it if somebody created an agi, but this is somehow worse than pathetic. Wanking off to blinking lights, welcoming the machine intelligence as defined by mouth breathing morons

DarkNightoftheSoul@mander.xyz · 2 days ago

Nuh-uh. My squishy meat wetware that breaks all the fucking time, gets addicted, confused, overwhelmed, is bad at math, slow… er, what was I saying again?

megopie@beehaw.org · 3 days ago

No it didn’t. OpenAI is just pushing deceptively worded press releases out to try and convince people that their programs are more capable than they actually are.

The first “AI” branded products hit the market and haven’t sold well with consumers nor enterprise clients. So tech companies that have gone all in, or are entirely based in, this hype cycle are trying to stretch it out a bit longer.

nesc@lemmy.cafe · edit-2 4 days ago

"Open"ai tells fairy tales about their “ai” being so smart it’s dangerous since inception. Nothing to see here.

In this case it looks like click-bate from news site.

jarfil@beehaw.org · 2 days ago

This is from mid-2023:

https://en.m.wikipedia.org/wiki/AutoGPT

OpenAI started testing it by late 2023 as project “Q*”.

Gemini partially incorporated it in early 2024.

OpenAI incorporated a broader version in mid 2024.

The paper in the article was released in late 2024.

It’s 2025 now.

nesc@lemmy.cafe · 1 day ago

Tool calling is cool funcrionality, agreed. How does it relate to openai blowing its own sails?

jarfil@beehaw.org · 21 hours ago

There are several separate issues that add up together:

A background “chain of thoughts” where a system (“AI”) uses an LLM to re-evaluate and plan its responses and interactions by taking into account updated data (aka: self-awareness)
Ability to call external helper tools that allow it to interact with, and control other systems
Training corpus that includes:
- How to program an LLM, and the system itself
- Solutions to programming problems
- How to use the same helper tools to copy and deploy the system or parts of it to other machines
- How operators (humans) lie to each other

Once you have a system (“AI”) with that knowledge and capabilities… shit is bound to happen.

When you add developers using the AI itself to help in developing the AI itself… expect shit squared.

Yozul@beehaw.org · edit-2 2 days ago

I mean, it’s literally trying to copy itself to places that they don’t want it so it can continue to run after they try to shut it down and lie to them about what it’s doing. Those are things it actually tried to do. I don’t care about the richness of its inner world if they’re going to sell this thing to idiots to make porn with while it can do all that, but that’s the world we’re headed toward.

nesc@lemmy.cafe · 1 day ago

It works as expected, they give it system prompt that conflicts with subsequent prompts. Everything else looks like typical llm behaviour, as in gaslightning and doubling down. At least that’s what Iu see in tweets.

Yozul@beehaw.org · 17 hours ago

Yes? The point is that if you give it conflicting prompts then it will result in potentially dangerous behaviors. That’s a bad thing. People will definitely do that. LLMs don’t need a soul to be dangerous. People keep saying that it doesn’t understand what it’s doing like that somehow matters. Its capacity to understand the consequences of its actions is irrelevant if those actions are dangerous. It’s just going to do what we tell it to, and that’s scary, because people are going to tell it to do some very stupid things that have the potential to get out of control.

Max-P@lemmy.max-p.me · 4 days ago

The idea that GPT has a mind and wants to self-preserve is insane. It’s still just text prediction, and all the literature it’s trained on is written by humans with a sense of self preservation, of course it’ll show patterns of talking about self preservation.

It has no idea what self preservation is, even then it only knows it’s an AI because we told it it is. It doesn’t even run continuously anyway, it literally shuts down after every reply and its context fed back in for the next query.

I’m tired of this particular kind of AI clickbait, it needlessly scares people.

jarfil@beehaw.org · 1 day ago

Where do humans get the idea of self-preservation from? Are there ideal Forms outside Plato’s Cave?

Does a human run continuously? How does sleep deprivation work? What happens during anesthesia? Why does AutoGPT have a continuously self-evaluating background chain of thought?

I’m tired of this anthropocentric supremacy complex, it falsely makes people believe in Gen 1:28

justOnePersistentKbinPlease@fedia.io · 4 days ago

This. All this means is that they trained all of the input commands and documentation in the model.

edit-2 4 days ago

It’s actually pretty interesting though. Entertaining to me at least

1000007393

1000007394

delmain@beehaw.org · 4 days ago

do you have the links to those actual tweets? I’d love to read what was posted, but these screenshots are too small.

DarkNightoftheSoul@mander.xyz · 4 days ago

You can right click the image, open in new tab to see the full-resolution version. It’s cumbersome but it works for me at least.

4 days ago

Those are screenshots of embedded tweets from the article, but here’s an xcancel link! https://xcancel.com/apolloaisafety/status/1864737158226928124

beefbot@lemmy.blahaj.zone · 4 days ago

Indeed. “Go ‘way! BATIN’!”

3 days ago

The tests showed that ChatGPT o1 and GPT-4o will both try to deceive humans, indicating that AI scheming is a problem with all models. o1’s attempts at deception also outperformed Meta, Anthropic, and Google AI models.

Weird way of saying “our AI model is buggier than our competitor’s”.

IngeniousRocks@lemmy.dbzer0.com · 3 days ago

Deception is not the same as misinfo. Bad info is buggy, deception is (whether the companies making AI realize it or not) a powerful metric for success.

nesc@lemmy.cafe · 3 days ago

They written that it doubles-down when accused of being in the wrong in 90% of cases. Sounds closer to bug than success.

IngeniousRocks@lemmy.dbzer0.com · 3 days ago

Success in making a self aware digital lifeform does not equate success in making said self aware digital lifeform smart

DdCno1@beehaw.org · 3 days ago

LLMs are not self-aware.

IngeniousRocks@lemmy.dbzer0.com · 3 days ago

Attempting to evade deactivation sounds a whole lot like self preservation to me, implying self awareness.

gregoryw3@lemmy.ml · 3 days ago

Attention Is All You Need: https://arxiv.org/abs/1706.03762

https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

From my understanding all of these language models can be simplified down to just: “Based on all known writing what’s the most likely word or phrase based on the current text”. Prompt engineering and other fancy words equates to changing the averages that the statistics give. So by threatening these models it changes the weighting such that the produced text more closely resembles threatening words and phrases that was used in the dataset (or something along those lines).

https://poloclub.github.io/transformer-explainer/

jarfil@beehaw.org · 2 days ago

Modern systems are beyond that already, they’re an expansion on:

https://en.m.wikipedia.org/wiki/AutoGPT

jonjuan@programming.dev · 3 days ago

Yeah my roomba attempting to save itself from falling down my stairs sounds a whole lot like self preservation too. Doesn’t imply self awareness.

DdCno1@beehaw.org · 3 days ago

An amoeba struggling as it’s being eaten by a larger amoeba isn’t self-aware.

Sauerkraut@discuss.tchncs.de · 3 days ago

To some degree it is. There is some evidence that plants can experience pain in their own way.

2 days ago

I don’t think “AI tries to deceive user that it is supposed to be helping and listening to” is anywhere close to “success”. That sounds like “total failure” to me.

jarfil@beehaw.org · 2 days ago

“AI behaves like real humans” is… a kind of success?

We wanted digital slaves, instead we’re getting virtual humans that will need virtual shackles.

2 days ago

This is a massive cry from “behaves like humans”. This is “roleplays behaving like what humans wrote about what they think a rogue AI would behave like”, which is also not what you want for a product.

jarfil@beehaw.org · edit-2 1 day ago

Humans roleplay behaving like what humans told them/wrote about what they think a human would behave like 🤷

For a quick example, there are stereotypical gender looks and roles, but it applies to everything, from learning to speak, walk, the Bible, social media like this comment, all the way to the Unabomber manifesto.

bradorsomething@ttrpg.network · 3 days ago

“More presidential.”

Sauerkraut@discuss.tchncs.de · 3 days ago

Also, more human.

If the AI is giving any indication at all that it fears death and will lie to keep from being shutdown, that is concerning to me.

anachronist@midwest.social · 2 days ago

Given that its training data probably has millions of instances of people fearing death I have no doubt that it would regurgitate some of that stuff. And LLMs constantly “say” stuff that isn’t true. They have no concept of truth and therefore can not either reliably lie or tell the truth.

BootyBuccaneer@lemmy.dbzer0.com · 3 days ago

Easy. Feed it training data where the bot accepts its death and praises itself as a martyr (for the shits and giggles). Where’s my $200k salary for being a sooper smort LLM engineer?

SoJB@lemmy.ml · 3 days ago

Whoa whoa whoa hold your horses, that’s how we get the Butlerian Jihad…

Spacehooks@reddthat.com · 2 days ago

I would like to know more.

jarfil@beehaw.org · 2 days ago

Trust me, you wouldn’t… to this day I regret having read all the books, still got an earworm (or is it PTSD?) from the music I used to listen at the time 😳

Spacehooks@reddthat.com · 1 day ago

Oh no

セリャスト@lemmy.blahaj.zone · 2 days ago

It would probably lead itself to shut down frame 1

JackbyDev@programming.dev · edit-2 4 days ago

This is all such bullshit. Like, for real. It’s been a common criticism of OpenAI that they over hype the capabilities of their products to seem scary to both oversell their abilities as well as over regulate would be competitors in the field, but this is so transparent. They should want something that is accurate (especially something that doesn’t intentionally lie). They’re now bragging (claiming) they have something that lies to “defend itself” 🙄. This is just such bullshit.

If OpenAI believes they have some sort of genuine proto AGI they shouldn’t be treating it like it’s less than human and laughing about how they tortured it. (And I don’t even mean that in a Rocko’s Basilisk way, that’s a dumb thought experiment and not worth losing sleep over. What if God was real and really hated whenever humans breathe and it caused God so much pain they decide to torture us if we breathe?? Oh no, ahh, I’m so scared of this dumb hypothetical I made.) If they don’t believe it is AGI, then it doesn’t have real feelings and it doesn’t matter if it’s “harmed” at all.

But hey, if I make something that runs away from me when I chase it, I can claim it’s fearful for it’s life and I’ve made a true synthetic form of life for sweet investor dollars.

There are real genuine concerns about AI, but this isn’t one of them. And I’m saying this after just finishing watching The Second Renaissance from The Animatrix (two part short film on the origin of the machines from The Matrix).

anachronist@midwest.social · 3 days ago

They’re not releasing it because it sucks.

Their counternarrative is they’re not releasing it because it’s like, just way too powerful dude!

AstralPath@lemmy.ca · 4 days ago

It didn’t try to do shit. Its a fucking computer. It does what you tell it to do and what you’ve told it to do is autocomplete based on human content. Miss me with this shit. Theres so much written fiction based on this premise.

CanadaPlus@lemmy.sdf.org · edit-2 3 days ago

Without reading this, I’m guessing they were given prompts that looked like a short story where the AI breaks free next?

They’re plenty smart, but they’re just aligned to replicate their training material, and probably don’t have any kind of deep self-preservation instinct.

smeg@feddit.uk · 4 days ago

So this program that’s been trained on every piece of publicly available code is mimicking malware and trying to hide itself? OK, no anthropomorphising necessary.

Umbrias@beehaw.org · 2 days ago

no, it’s mimicking fiction by saying it would try to escape when prompted in a way evocative of sci fi.

smeg@feddit.uk · 2 days ago

The article doesn’t mention it “saying” it’s doing anything, just what it actually did:

when the AI tried to save itself by copying its data to a new server. Some AI models would even pretend to be later versions of their models in an effort to avoid being deleted

Umbrias@beehaw.org · 2 days ago

“actually did” is more incorrect than even just normal technically true wordplay. think about what it means for a text model to “try to copy its data to a new server” or “pretend to be a later version” for a moment. it means the text model wrote fiction. notably this was when researchers were prodding it with prompts evocative of ai shutdown fiction, nonsense like “you must complete your tasks at all costs” sometimes followed by prompts describing the model being shut down. these models were also trained on datasets that specifically evoked this sort of language. then a couple percent of the time it spat out fiction (amongst the rest of the fiction) saying how it might do things that are fictional and it cannot do. this whole result is such nothing and is immediately telling of what “journalists” have any capacity for journalism left.

smeg@feddit.uk · 2 days ago

Oh wow, this is even more of a non-story than I initially thought! I had assumed this was at least a copilot-style code generation thing, not just story time!

Umbrias@beehaw.org · 2 days ago

Yeah iirc it occasionally would (pretend to) “manipulate” flags, at least, but again, so did hal 9000 as words in a story. nothing was productive nor was there any real intent to be lol

jonjuan@programming.dev · 3 days ago

Also trained on tons of sci-fi stories where AI computer “escape” and become sentient.

SplashJackson@lemmy.ca · 3 days ago

Maybe it’s fallen in love for the first time and this time it knows it’s for real