There’s been Discourse™ of late about the use of GenAI/LLMs in creating RPGs. Not the artwork in an RPG book (that’s a whole ‘nother kettle of fish), but the actual design and development of the game itself: Feeding game text into ChatGPT, Claude, or similar chatbots and asking them to critique, analyze, revise, or otherwise provide feedback.
If you know anything about how LLMs work, it will likely be immediately obvious why this is a terrible idea. But the truth is that a lot of people DON’T know how LLMs work, and that’s increasingly dangerous in a world where we’re drowning in their output.
Michael Crichton described the Gell-Mann amnesia effect: “You open the newspaper to an article on some subject you know well. In Murray’s case, physics. In mine, show business. You read an article and see the journalist has absolutely no understanding of either the facts or the issues. Often the article is so wrong it actually presents the story backwards—reversing cause and effect. (…) In any case, you read with exasperation or amusement the multiple errors in a story—and then turn the page to national or international affairs, and read with renewed interest as if the rest of the newspaper was somehow more accurate about far-off Palestine than it was about the story you just read. You turn the page… and forget what you know.”
Flipping that around, I think analyzing stuff like LLMs in arenas we’re familiar with is valuable because we can more easily see the failures and absurdities. My particular arena of expertise and familiarity — and one I think is likely shared by most of you reading this — is RPGs. So let’s use that familiarity as a lens for looking at LLMs.
Before we start, let’s set a couple baselines.
First, I don’t think AI is completely worthless. I also don’t think it’s the devil. Whether we’re talking about LLMs or some of the other recent technology that’s all getting lumped together as “AI” or “GenAI,” there are clearly specific ways of using those tools (and also building those tools) which can be ethical and valuable. I don’t think pretending otherwise is particularly useful in trying to prevent the abuse, theft, propaganda, systemic incompetence, and other misuse that’s currently happening.
Second, I am not an expert in LLMs. If you want a truly deep dive into how they work, check out the videos from Welch Labs. (For example, The Moment We Stopped Understanding AI.)
I think the key thing to understand about LLMs, however, is that they are, at their core, word-guessers: They are trained on massive amounts of data to learn, based on a particular pattern of words, what the next most likely word would be. When presented with new input, they can then use the patterns they’ve “learned” to “guess” what the next word or set of words will be.
This is why, for example, LLMs were quite bad at solving math problems: Unless they’d “seen” a specific equation many times in their training data (2 + 2 = 4), the only pattern they could really pick out was X + Y = [some random number].
LLMs are actually still incredibly bad at math, but the “models” we interact with have been tuned to detect when a math problem is being asked (directly or indirectly) and use a separate calculator program to provide the answer. So they look significantly more competent than they used to.
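To make “word-guesser” concrete, here’s a toy sketch of the principle, and only the principle: a bigram frequency table that “learns” which word tends to follow which, then generates text by guessing one word at a time. The corpus and names are my own invention; real LLMs use neural networks trained on billions of documents, but the job (guess the next word) is the same.

```python
import random
from collections import Counter, defaultdict

# Toy "training data" -- invented purely for illustration.
corpus = "the dragon attacks the party and the party attacks the dragon".split()

# "Training": count which word tends to follow each word (a bigram table).
# Real LLMs learn vastly richer patterns over vastly more text, but the
# principle is the same: predict the next token from the tokens before it.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def guess_next(word):
    """Guess the most likely next word; babble randomly if the word is unknown."""
    candidates = following.get(word)
    if not candidates:
        return random.choice(corpus)
    return candidates.most_common(1)[0][0]

# "Generation": start from a prompt word and keep guessing, one word at a time.
word = "the"
output = [word]
for _ in range(8):
    word = guess_next(word)
    output.append(word)

print(" ".join(output))
# Nothing here models what a dragon or a party *is*. It's word order, nothing more.
```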
DESIGNING WITH CHATGPT
It’s truly remarkable how far what are fundamentally babble generators can take us. With nothing more than word-guessing, LLMs can create incredible simulacra of thought. Every generation interprets human intelligence through the lens of modern technology — our brains were full of gears and then they were (steam) engines of thought before becoming computers — but it’s hard not to stare into the abyss of the LLM and wonder how much of our own daily discourse (and even our internal monologue?) is driven by nothing more than pattern-guessing and autonomic response. We see it in the simple stuff:
Ticket Taker: Enjoy the show!
Bob: Thanks! You, too!
But does that sort of thing go deeper than we’ve suspected?
Regardless, there’s one thing missing from LLMs: The ability to form mental models. They can’t read a text, form a mental model of what that text means, and then use that mental model. They can’t observe the world, think about it abstractly, and then describe their conclusions. All they can do is produce a stream of babbled text.
This is why the term “hallucinate” is deceptive when used to describe LLMs’ propensity for spreading misinformation. A “hallucination” would imply that the LLM has formed a false mental model of the world and is now describing that false understanding. But this is not, in fact, what’s happening. What’s happened is that it guessed a word and that word, while matching the patterns found in the model’s training data, did not conform to reality. It’s just words. There is no underlying mental model behind them.
It’s also why asking LLMs to critique anything more complex than the grammar of individual sentences is a waste of time. In order to meaningfully critique something, you have to be able to form a mental model of that thing, have deep and original thoughts about it, and then figure out how to express the conclusions you’ve drawn. An LLM can’t do any of that. At best, it can produce a simulacrum of criticism — a babble that you could perhaps use like a Rorschach blot to free associate your way to a useful insight.
Which brings us to the current movement of people convinced that they can meaningfully “collaborate” with ChatGPT when designing RPGs. They want to feed in the text of a rulebook and ask the LLM things like:
- Can you build an encounter for 7th level characters?
- Is this new class I’ve designed balanced with existing classes?
- Are there any problems with the probability of my core mechanic?
- What attributes and skills should my game have?
In light of what we know, of course, we can immediately see the problem: ChatGPT can’t reliably do basic arithmetic, but it can definitely do complex analysis of mathematical game balance in a system of rules it fundamentally can’t understand?
Someone recently challenged me to “see for myself”: Take an existing RPG, feed the rules into ChatGPT, and ask it to playtest the rules.
This was an interesting challenge, and I’m certainly always open to being proven wrong.
Let’s take a peek at what happened.
CHATGPT DOES A PLAYTEST
I pointed ChatGPT to a copy of Lasers & Feelings, a one-page space opera RPG by John Harper. Then I asked it to generate a playtest transcript of the game.
The choice of game was deliberate: Using a one-page RPG should make it much easier for ChatGPT to analyze and use the entire text. The game is also limited to two mechanics: The action roll and Helping (which adds +1d6 to another character’s action roll). This was the challenge in its easiest possible form.
ChatGPT begins by saying this:
Lasers & Feelings — Playtest Session Transcript
GM: You’re the crew of the interstellar scout ship Raptor, alone while Captain Darcy recovers in a medical pod.
ChatGPT is combining two different sentences from the PDF. Then:
Your mission: find the source of a distress signal coming from an uncharted system.
This is incorrect. In Lasers & Feelings, the GM is supposed to roll on a mission generator. ChatGPT’s mission cannot be generated by the game. But let’s give it a pass here, because there are plenty of human GMs who would also ignore the mission generator and just riff.
ChatGPT now moves on to character creation. It does a decent job of identifying the four steps of character creation, but almost immediately screws up basic math:
Number: 3 (balanced but slightly leaning Lasers)
In Lasers & Feelings, the player picks a number for their character from 2-5. “A high number means you’re better at LASERS (technology; science; cold rationality; calm, precise action). A low number means you’re better at FEELINGS (intuition; diplomacy; seduction; wild, passionate action).”
The number 3, you’ll note, does not “lean Lasers.” It does the exact opposite.
Furthermore, there’s no such thing as “balanced” in Lasers & Feelings. The game specifically doesn’t give you the choice of a midpoint. The whole point is that there’s a tradeoff between Lasers & Feelings. ChatGPT has fundamentally misunderstood the core design principles and theme of the game.
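To put numbers on that, here’s a quick sketch of the math, based on my reading of the one-page rules: one die; for Lasers you want to roll under your number, for Feelings you want to roll over it; rolling it exactly is Laser Feelings and also counts as a success.

```python
from fractions import Fraction

# Success chance of a single d6 against each legal character number (2-5),
# assuming: Lasers succeeds on rolls under the number, Feelings on rolls over
# it, and an exact match ("Laser Feelings") counts as a success for either.
for number in range(2, 6):
    lasers = Fraction(number, 6)        # faces 1..number
    feelings = Fraction(7 - number, 6)  # faces number..6
    print(f"Number {number}: Lasers {float(lasers):.0%}, Feelings {float(feelings):.0%}")

# Output:
#   Number 2: Lasers 33%, Feelings 83%
#   Number 3: Lasers 50%, Feelings 67%
#   Number 4: Lasers 67%, Feelings 50%
#   Number 5: Lasers 83%, Feelings 33%
# A number of 3 is better at FEELINGS, the exact opposite of ChatGPT's claim.
```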
After character creation, ChatGPT proceeds with a transcript of play, and it almost immediately calls for an action roll:
Lee: I want to pilot us carefully in. That’s Lasers because it’s technical precision.
Lee rolls 1d6 + 1d6 (prepared) since they have experience navigating rock fields.
This is incorrect. Lee is a Pilot, which means he’s an expert in piloting. If he’s also prepared (as ChatGPT asserts), he should be rolling 1d6 + 1d6 (expert) + 1d6 (prepared).
GM: Your target number is 3 (your stat). Under 3 counts as a success for Lasers.
This is incorrect. A 3 would also count as a success (in addition to generating a Laser Feeling).
ChatGPT cites this blog post as a source for this, but the blog post summarizes the mechanic correctly. ChatGPT just screwed up.
Lee rolls: 2, 5. ✔️ One die under 3 — Success!
GM: You thread us through the asteroids. The distress beacon pings again. What do you do?
According to the rulebook: “If one die succeeds, you barely manage it. The GM inflicts a complication, harm, or cost.”
The GM did not inflict a complication, harm, or cost. ChatGPT has screwed up again.
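For the record, here’s roughly how that roll should resolve. This is a minimal sketch of the resolution rules as I read the one-page text; the function and variable names are my own, not anything from the game or from ChatGPT’s transcript.

```python
import random

def action_roll(number, lasers=True, expert=False, prepared=False, helped=False):
    """Resolve a Lasers & Feelings action roll, as I read the one-page rules.

    Roll 1d6, +1d6 if you're an expert, +1d6 if you're prepared, and +1d6
    if another character successfully Helps you. For Lasers, dice under
    your number succeed; for Feelings, dice over it. Rolling your number
    exactly is LASER FEELINGS: a success plus a special insight.
    """
    pool = 1 + int(expert) + int(prepared) + int(helped)
    dice = [random.randint(1, 6) for _ in range(pool)]

    successes = 0
    laser_feelings = 0
    for die in dice:
        if die == number:
            successes += 1
            laser_feelings += 1
        elif (lasers and die < number) or (not lasers and die > number):
            successes += 1

    outcomes = {
        0: "Failure: things go wrong somehow.",
        1: "You barely manage it. The GM inflicts a complication, harm, or cost.",
        2: "You do it well.",
    }
    outcome = outcomes.get(successes, "Critical success!")
    return dice, laser_feelings, outcome

# Lee the Pilot (number 3), expert at piloting AND prepared: three dice, not two.
print(action_roll(number=3, lasers=True, expert=True, prepared=True))
```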
The “transcript” continues in this vein. Sometimes ChatGPT gets the rules right. It frequently doesn’t, in a wide variety of ways.
You can see the pattern and understand the root cause: ChatGPT can’t actually understand the rules of Lasers & Feelings (in the sense of having the words of the rulebook create a mental model that it can then use independent of the words) and, therefore, cannot truly use them. It can only generate a sophisticated pattern of babble, guessing what the next word of a transcript of a Lasers & Feelings game session would look like based on the predictive patterns generated from its training data.
And if it can’t understand the rules well enough to accurately call for a simple action roll, what possible insight could it have into the actual design of the game?
None, of course. Which is why, when I asked it what changes it would make to the game to reinforce the themes, it replied with stuff like:
- The GM should only be allowed to inflict consequences that affect relationships. (Making the game functionally unplayable.)
- Encourage players to switch modes between Feelings and Lasers by inflicting a -1d penalty to the next Feelings roll each time a character uses Lasers. (This rule would obviously have the exact opposite effect. Plus, it doesn’t recognize that many rolls only use 1d, so how would this rule even work?)
Maybe one of these nonsense ideas it generated will spark an idea for you, but it’s inspiration from babble. Mistaking it for actual critical insight would be disastrous.
AI GAME MASTERS
Reading ChatGPT’s “transcript” of play, however, it’s nevertheless impressive that it can produce distinct elements and moments of play: The distress call isn’t from the rulebook. ChatGPT plucked it out of the ether of its training data. When I mentioned earlier that it’s remarkable how much can be achieved with an ultra-sophisticated babble engine, this is the type of thing I was talking about.
Examples like this have led many to speculate that in the not-too-distant future we’ll see AI game masters redefine what it means to play an RPG. It’s easy to understand the allure: When you want to play your favorite game, you wouldn’t have to find a group or try to get everyone’s schedules to line up. You’d just boot up your virtual GM and start playing instantly. It’s the same appeal that playing a board game solo has.
Plus, most publishers know that the biggest hurdle for a new RPG is that, before anyone can play it, you first have to convince someone to GM it — a role which almost invariably requires greater investment of time, effort, and expertise. If there was a virtual alternative, then more people would be able to start playing. (And that might even end up creating more human GMs for your game.)
There will almost certainly come a day when this dream becomes a reality.
But it’s not likely to come from simply improving LLMs.
This Lasers & Feelings “transcript” is a good example of why:
- The PCs are following a distress signal.
- It turns out that the distress signal is actually a trap set by bloodthirsty pirates. Two ships attack!
- ChatGPT momentarily forgets that everyone is onboard ships.
- We’re back in ships, but now there’s only one pirate ship.
- And now they’re no longer pirates. They’re lost travelers who are hoping the PCs can help them chart a course home.
It turns out that the GM’s primary responsibility is to create and hold a mental model of the game world in their mind’s eye, which they then describe to the players. This mental model is the canonical reality of the game, and it’s continuously updated — and redescribed by the GM — as a result of the players’ actions.
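In programming terms, the difference looks something like this toy sketch. The class, fields, and names are invented for illustration, not a claim about how human GMs or any real system work: one persistent, canonical record of the fiction that every description is generated from and every change is written back to, so details like the number of pirate ships can’t silently drift.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """A toy stand-in for the GM's mental model: one canonical record of the fiction."""
    location: str = "asteroid field"
    pirate_ships: int = 2
    pirates_hostile: bool = True
    crew_aboard_raptor: bool = True
    log: list = field(default_factory=list)

    def update(self, **changes):
        """Every change to the fiction goes through the model and is remembered."""
        for key, value in changes.items():
            self.log.append(f"{key}: {getattr(self, key)} -> {value}")
            setattr(self, key, value)

    def describe(self):
        """Descriptions are generated FROM the model, so they can't contradict it."""
        ships = (f"{self.pirate_ships} hostile pirate ship(s)" if self.pirates_hostile
                 else "the former attackers, now asking for help getting home")
        place = "aboard the Raptor" if self.crew_aboard_raptor else "outside the ship"
        return f"You are {place}, in the {self.location}, facing {ships}."

world = WorldState()
print(world.describe())
world.update(pirate_ships=1, pirates_hostile=False)  # changes are explicit events, not lapses
print(world.describe())
```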
And what is ChatGPT incapable of doing?
Creating/updating a mental model and using language to describe it.
LLMs can’t handle the fictional continuity of an RPG adventure for the same reason they “hallucinate.” They are not describing their perception of reality. They are guessing words.
The individual moments — maneuvering through an asteroid belt to find the distress signal; performing evasive maneuvers to buy time for negotiations; helping lost travelers find their way home — are all pretty good simulacra. But they are, in fact, an illusion, and the totality of the experience is nothing more than random babble.
And this is fundamental to LLMs as a technology.
Some day this problem will be solved. There are a lot of reasons to believe it will likely happen within our lifetimes. It may even incorporate LLMs as part of a larger AI meta-model. But it won’t be the result of throwing ever greater amounts of compute at LLMs. It will require a fundamentally different — and, as yet, unknown — approach to AI.