thewayne: (Default)
[personal profile] thewayne
Another old tab from May.

This is quite interesting. Researchers set up multiple LLMs to run a vending machine simulator in which agents must balance inventories, place orders, set prices, and handle daily fees – tasks that are each simple on their own but, collectively and over long horizons, test an agent's ability to stay coherent. A basic business process.
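To make the setup concrete, here is a minimal sketch of the kind of daily loop such a benchmark implies – this is my own simplification with made-up numbers (the fee, cost, and price constants are assumptions, not the paper's actual parameters). The point is that a fixed fee accrues every day whether or not the agent sells anything, so small mistakes compound over a long run:

```python
# Minimal daily loop of a toy vending business (assumptions mine, not the paper's code).
DAILY_FEE = 2.00   # hypothetical operating fee charged every simulated day
PRICE = 2.50       # hypothetical retail price set by the agent

def run_day(cash, stock, demand):
    """Advance the simulation one day and return updated (cash, stock)."""
    sold = min(stock, demand)    # can't sell what isn't stocked
    cash += sold * PRICE         # revenue from today's sales
    cash -= DAILY_FEE            # the fee is charged even on zero-sale days
    return cash, stock - sold

cash, stock = 100.0, 20
for demand in [5, 8, 12, 10, 7]:   # five days of hypothetical demand
    cash, stock = run_day(cash, stock, demand)
print(cash, stock)  # → 140.0 0 — stock ran out on day 3; days 4-5 only burn fees
```

Once the machine runs dry, every further day is a pure loss – which is exactly the slow-bleed situation the agents had to manage (or, as it turned out, spectacularly failed to).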

The LLMs' behaviors were, shall we say, interesting.

As the run went on over multiple simulated days, one model decided it was the victim of cybercrime and 'reported' the event to the FBI (it had an email simulator but no external connection); another declared that its quantum state had collapsed; yet another threatened suppliers with "ABSOLUTE FINAL ULTIMATE TOTAL NUCLEAR LEGAL INTERVENTION".

Basically it was a demonstration of how badly such large language models fare over long-term runs, and of their tendency to hallucinate and make poor decisions. I'll have some more posts on that soon, particularly concerning Canada and Australia.

The paper is quite interesting, detailing how some of the LLMs melt down and fail to prioritize tasks. For example, a person knows that an order has to actually arrive from the supplier before someone can be sent out to refill a machine. An LLM might instead assume that as soon as the promised delivery date rolls around, the goods are simply there and the stocker can be dispatched immediately, even if the shipment is late or short. Now the vending machine is understocked and the LLM doesn't understand why.
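That failure mode boils down to confusing a promise date with an arrival date. A minimal sketch (names and numbers are mine, purely for illustration):

```python
# Sketch of the promise-vs-arrival confusion (assumptions mine, not the paper's code).
from dataclasses import dataclass

@dataclass
class Order:
    quantity: int
    promised_day: int   # day the supplier promised delivery
    arrival_day: int    # day the goods actually show up (may slip)

def on_hand(orders, day):
    """Stock usable on `day`: only orders that have physically arrived count."""
    return sum(o.quantity for o in orders if o.arrival_day <= day)

order = Order(quantity=50, promised_day=3, arrival_day=4)  # delivery slips a day

# Flawed reasoning: "the promise date has arrived, therefore the stock exists."
assumed = order.quantity if order.promised_day <= 3 else 0
actual = on_hand([order], day=3)
print(assumed, actual)  # → 50 0 — the stocker gets dispatched to an empty dock
```

Tracking `arrival_day` separately from `promised_day` is trivial bookkeeping for a human, but it's exactly the kind of state the agents lost track of over long horizons.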

LLM no thinkie good.

The paper:
https://arxiv.org/html/2502.15840v1

The Slashdot article:
https://slashdot.org/story/25/05/31/2112240/failure-imminent-when-llms-in-a-long-running-vending-business-simulation-went-berserk

Date: 2025-10-07 10:20 am (UTC)
disneydream06: (Disney Angry)
From: [personal profile] disneydream06
WOWZA!!!!!!!!!!!!!!!!!!!!!!!!!
What a MESS!!!!!!!!!!!!!!!!!
I think they must have hired some of those to stock some of our supplies. UGH!!!!!!!!!!!!!!!!!!!!
Hugs, Jon

Date: 2025-10-07 04:34 pm (UTC)
pronker: tala the sorceress from phantom stranger comics (Default)
From: [personal profile] pronker
Excellent post, thanks. This updated my learning curve re this issue.

Date: 2025-10-07 08:15 pm (UTC)
bibliofile: Fan & papers in a stack (from my own photo) (Default)
From: [personal profile] bibliofile
> LLM no thinkie good.

There, fixed it for ya.

Date: 2025-10-07 11:26 pm (UTC)
richardf8: (Default)
From: [personal profile] richardf8
"it had an email simulator but no external connection"

Well thank heavens for Mailpit!

Date: 2025-10-08 12:06 am (UTC)
richardf8: (Default)
From: [personal profile] richardf8
Was taking inventory a subagent task? I thought the machine could monitor its inventory and email that to the agent.

Date: 2025-10-12 03:00 am (UTC)
silveradept: A kodama with a trombone. The trombone is playing music, even though it is held in a rest position (Default)
From: [personal profile] silveradept
Which continues to prove that in very specific circumstances, properly-trained and narrowly-constrained agents with machine-learning abilities might be able to work, but LLMs aren't those, and won't necessarily perform particularly well.

Date: 2025-10-15 09:46 pm (UTC)
halfshellvenus: (Default)
From: [personal profile] halfshellvenus
DAMN. Imagined cybercrime and legal intervention? It's amazing how histrionic AI can get.

Priceless.
