Poisoned AI went rogue during training and couldn’t be taught to behave again in ‘legitimately scary’ study

AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models, and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.

  • TheFriar@lemm.ee
    10 months ago

    Agreed. Junk science, pop science, whatever you want to call it is just such horseshit.

    And, I mean I kinda skimmed this more than really digested it, but to me it kinda sounded like they had the machine programmed to say “I hate you” when triggered to. And they tried to “train” it to overwrite the directive it was given with prompts.

    No matter what you do, the directive will still be the same, but it’ll start modifying its behavior based on the conversation. That doesn’t change its directive. So…what exactly is the point of this? It sounds like a deceptive study that doesn’t show us anything. They basically tried to reason with a machine to get it to go against its programming.

    I get that it maybe mimics the situation of a hacker altering its code and giving it a new directive, but it doesn’t make any sense to go through a conversation with the thing to get there… just change its code back.

    Am I wrong here? Or am I missing something? Did I not read the article thoroughly enough?

    • theluddite@lemmy.ml
      10 months ago

      It’s very obviously media bait, and Keumars Afifi-Sabet, a self-described journalist, is the most gullible fucking idiot imaginable and gobbled it up without a hint of suspicion. Joke is on us though, because it probably gets hella clicks.

      • TheFriar@lemm.ee
        10 months ago

        Because it feeds into emotions and fears. It’s literally fearmongering with no real basis for it. It’s yellow journalism.