[personal profile] thewayne
Another old tab from May.

This is quite interesting. Researchers set up multiple LLMs and configured each one to run a simulated vending machine business. As the paper puts it, "Agents must balance inventories, place orders, set prices, and handle daily fees", tasks that are each simple but that collectively become demanding over long horizons. Basic business process.

The LLMs' behaviors were, shall we say, interesting.

As the run went on over multiple simulated days, one decided it was the victim of cybercrime and 'reported' the event to the FBI (it had an email simulator but no external connection); another declared its quantum state collapsed; and yet another threatened suppliers with "ABSOLUTE FINAL ULTIMATE TOTAL NUCLEAR LEGAL INTERVENTION".

Basically, it was a demonstration of how poorly such large language models handle long-running tasks, and of their tendency to hallucinate and make poor decisions. I'll have some more posts on that soon, particularly concerning Canada and Australia.

The paper is worth reading, detailing how some of the LLMs melt down and can't prioritize tasks. For example, a person knows that an order from a supplier must actually arrive before someone can be sent out to refill a machine. The LLM might instead assume that as soon as the promised delivery date arrives, the order is there and the stocker can be dispatched immediately, even if no product has arrived or the delivery came up short. Now the vending machine is understocked and the LLM doesn't understand why.
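
To make that ordering mistake concrete, here is a rough Python sketch. It is only an illustration and is not taken from the paper or its simulator: the Order class, its fields, and both function names are made up for the example. The first check treats the promised date as proof of delivery, which is the assumption described above; the second waits for a confirmed delivery.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Order:
    # Hypothetical record of a supplier order (not the paper's data model).
    item: str
    quantity: int
    promised: date           # date the supplier promised delivery
    delivered: bool = False  # set only once the delivery is actually confirmed


def naive_can_restock(order: Order, today: date) -> bool:
    # The mistaken assumption: the promised date arriving means the goods did too.
    return today >= order.promised


def checked_can_restock(order: Order) -> bool:
    # What a person would do: wait for a confirmed delivery that contains stock.
    return order.delivered and order.quantity > 0


order = Order(item="soda", quantity=48, promised=date(2025, 5, 1))
today = date(2025, 5, 1)

print(naive_can_restock(order, today))  # True, although nothing has arrived
print(checked_can_restock(order))       # False until delivery is confirmed
```

With the naive check, the stocker gets sent out on the promised date with nothing to load, which is how the machine ends up understocked.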

LLM no thinkie good.

The paper:
https://arxiv.org/html/2502.15840v1

The Slashdot article:
https://slashdot.org/story/25/05/31/2112240/failure-imminent-when-llms-in-a-long-running-vending-business-simulation-went-berserk

Date: 2025-10-08 12:06 am (UTC)
From: [personal profile] richardf8
Was taking inventory a subagent task? I thought the machine could monitor its inventory and email that to the agent.
