The Alexandrian

Thought of the Day: GenAI and RPGs

(Artwork: Woman in Cybergear)

There’s been Discourse™ of late about the use of GenAI/LLMs in creating RPGs. Not the artwork in an RPG book (that’s a whole ‘nother kettle of fish), but the actual design and development of the game itself: Feeding game text into ChatGPT, Claude, or similar chatbots and asking them to critique, analyze, revise, or otherwise provide feedback.

If you know anything about how LLMs work, it will likely be immediately obvious why this is a terrible idea. But the truth is that a lot of people DON’T know how LLMs work, and that’s increasingly dangerous in a world where we’re drowning in their output.

Michael Crichton described the Gell-Mann amnesia effect: “You open the newspaper to an article on some subject you know well. In Murray’s case, physics. In mine, show business. You read an article and see the journalist has absolutely no understanding of either the facts or the issues. Often the article is so wrong it actually presents the story backwards—reversing cause and effect. (…) In any case, you read with exasperation or amusement the multiple errors in a story—and then turn the page to national or international affairs, and read with renewed interest as if the rest of the newspaper was somehow more accurate about far-off Palestine than it was about the story you just read. You turn the page… and forget what you know.”

Flipping that around, I think analyzing stuff like LLMs in arenas we’re familiar with is valuable because we can more easily see the failures and absurdities. My particular arena of expertise and familiarity — and one I think is likely shared by most of you reading this — is RPGs. So let’s use that familiarity as a lens for looking at LLMs.

Before we start, let’s set a couple baselines.

First, I don’t think AI is completely worthless. I also don’t think it’s the devil. Whether we’re talking about LLMs or some of the other recent technology that’s all getting lumped together as “AI” or “GenAI,” there are clearly specific ways of using those tools (and of building those tools) which can be ethical and valuable. I don’t think pretending otherwise is particularly useful in trying to prevent the abuse, theft, propaganda, systemic incompetence, and other misuse that’s currently happening.

Second, I am not an expert in LLMs. If you want a truly deep dive into how they work, check out the videos from Welch Labs. (For example, The Moment We Stopped Understanding AI.)

I think the key thing to understand about LLMs, however, is that they are, at their core, word-guessers: They are trained on massive amounts of data to learn, based on a particular pattern of words, what the next most likely word would be. When presented with new input, they can then use the patterns they’ve “learned” to “guess” what the next word or set of words will be.
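To make that concrete, here’s a toy sketch of next-word guessing in Python. The frequency table is a made-up stand-in for the billions of learned parameters in a real model, and real models condition on the whole preceding context rather than a single word, but the basic job is the same:

    import random

    # Toy stand-in for an LLM: counts of which word follows which,
    # "learned" from training text. A real model replaces this lookup
    # table with billions of parameters, but the job is identical:
    # given the words so far, assign a probability to each next word.
    next_word_counts = {
        "enjoy": {"the": 9, "your": 3},
        "the": {"show": 5, "game": 4, "movie": 2},
        "show": {"!": 8, "tonight": 2},
    }

    def guess_next_word(word):
        # Sample the next word in proportion to how often it
        # followed this word in the "training data."
        candidates = next_word_counts.get(word, {"...": 1})
        return random.choices(list(candidates),
                              weights=list(candidates.values()))[0]

    # Babble forward from a prompt, one guess at a time.
    word, output = "enjoy", ["enjoy"]
    for _ in range(3):
        word = guess_next_word(word)
        output.append(word)
    print(" ".join(output))  # e.g., "enjoy the show !"

There’s no meaning anywhere in that loop. It’s just weighted dice.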

This is why, for example, LLMs were quite bad at solving math problems: Unless they’d “seen” a specific equation many times in their training data (2 + 2 = 4), the only pattern they could really pick out was X + Y = [some random number].

LLMs are actually still incredibly bad at math, but the “models” we interact with have been tuned to detect when a math problem is being asked (directly or indirectly) and use a separate calculator program to provide the answer. So they look significantly more competent than they used to.
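That tuning works something like a switchboard: the model learns to emit a tool call instead of an answer, and an ordinary deterministic program does the arithmetic. Here’s a minimal sketch of the idea, with a crude regex standing in for what is, in real systems, learned routing behavior:

    import re

    def calculator(expression):
        # An ordinary program: no guessing involved.
        # (eval is fine for a toy; real systems use a safe parser.)
        return eval(expression, {"__builtins__": {}})

    def chatbot_reply(prompt):
        # Crude stand-in for the learned routing: if the prompt
        # looks like arithmetic, hand it to the calculator tool
        # instead of the word-guesser.
        match = re.search(r"\d+\s*[-+*/]\s*\d+", prompt)
        if match:
            return str(calculator(match.group(0)))
        return "...word-guessing happens here..."

    print(chatbot_reply("What is 1234 * 5678?"))  # 7006652

The competence lives in the calculator, not in the language model.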

DESIGNING WITH CHATGPT

It’s truly remarkable how far what are fundamentally babble generators can take us. With nothing more than word-guessing, LLMs can create incredible simulacra of thought. Every generation interprets human intelligence through the lens of modern technology — our brains were full of gears and then they were (steam) engines of thought before becoming computers — but it’s hard not to stare into the abyss of the LLM and wonder how much of our own daily discourse (and even our internal monologue?) is driven by nothing more than pattern-guessing and autonomic response. We see it in the simple stuff:

Ticket Taker: Enjoy the show!

Bob: Thanks! You, too!

But does that sort of thing go deeper than we’ve suspected?

Regardless, there’s one thing missing from LLMs: The ability to form mental models. They can’t read a text, form a mental model of what that text means, and then use that mental model. They can’t observe the world, think about it abstractly, and then describe their conclusions. All they can do is produce a stream of babbled text.

This is why the term “hallucinate” is deceptive when used to describe LLMs’ propensity for spreading misinformation. A “hallucination” would imply that the LLM has formed a false mental model of the world and is now describing that false understanding. But this is not, in fact, what’s happening. What’s happened is that it guessed a word and that word, while matching the patterns found in the model’s training data, did not conform to reality. It’s just words. There is no underlying mental model behind them.

It’s also why asking LLMs to critique anything more complex than the grammar of individual sentences is a waste of time. In order to meaningfully critique something, you have to be able to form a mental model of that thing, have deep and original thoughts about it, and then figure out how to express the conclusions you’ve drawn. An LLM can’t do any of that. At best, it can produce a simulacrum of criticism — a babble that you could perhaps use like a Rorschach blot to free associate your way to a useful insight.

Which brings us to the current movement convinced that they can meaningfully “collaborate” with ChatGPT when designing RPGs. They want to feed in the text of a rulebook and ask the LLM things like:

  • Can you build an encounter for 7th level characters?
  • Is this new class I’ve designed balanced with existing classes?
  • Are there any problems with the probability of my core mechanic?
  • What attribute scores and skills should I have?

In light of what we know, of course, we can immediately see the problem: ChatGPT can’t reliably do basic arithmetic, but it can definitely do complex analysis of mathematical game balance in a system of rules it fundamentally can’t understand?

Someone recently challenged me to “see for myself”: Take an existing RPG, feed the rules into ChatGPT, and ask it to playtest the rules.

This was an interesting challenge, and I’m certainly always open to being proven wrong.

Let’s take a peek at what happened.

CHATGPT DOES A PLAYTEST

I pointed ChatGPT to a copy of Lasers & Feelings, a one-page space opera RPG by John Harper. Then I asked it to generate a playtest transcript of the game.

The choice of game was deliberate: Using a one-page RPG should make it much easier for ChatGPT to analyze and use the entire text. The game is also limited to two mechanics: The action roll and Helping (which adds +1d6 to another character’s action roll). This was the challenge in its easiest form possible.

ChatGPT begins by saying this:

Lasers & Feelings — Playtest Session Transcript

GM: You’re the crew of the interstellar scout ship Raptor, alone while Captain Darcy recovers in a medical pod.

ChatGPT is combining two different sentences from the PDF. Then:

Your mission: find the source of a distress signal coming from an uncharted system.

This is incorrect. In Lasers & Feelings, the GM is supposed to roll on a mission generator. ChatGPT’s mission cannot be generated by the game. But let’s give it a pass here, because there are plenty of human GMs who would also ignore the mission generator and just riff.

ChatGPT now moves on to character creation. It does a decent job of identifying the four steps of character creation, but almost immediately screws up basic math:

Number: 3 (balanced but slightly leaning Lasers)

In Lasers & Feelings, the player picks a number for their character from 2-5. “A high number means you’re better at LASERS (technology; science; cold rationality; calm, precise action). A low number means you’re better at FEELINGS (intuition; diplomacy; seduction; wild, passionate action).”

The number 3, you’ll note, does not “lean Lasers.” It does the exact opposite.

Furthermore, there’s no such thing as “balanced” in Lasers & Feelings. The game specifically doesn’t give you the choice of a midpoint. The whole point is that there’s a tradeoff between Lasers & Feelings. ChatGPT has fundamentally misunderstood the core design principles and theme of the game.

After character creation, ChatGPT proceeds with a transcript of play, and it almost immediately makes a skill check:

Lee: I want to pilot us carefully in. That’s Lasers because it’s technical precision.

Lee rolls 1d6 + 1d6 (prepared) since they have experience navigating rock fields.

This is incorrect. Lee is a Pilot, which means he’s an expert in piloting. If he’s also prepared (as ChatGPT asserts), he should be rolling 1d6 + 1d6 (expert) + 1d6 (prepared).

GM: Your target number is 3 (your stat). Under 3 counts as a success for Lasers.

This is incorrect. A 3 would also count as a success (in addition to generating a Laser Feeling).

ChatGPT cites this blog post as a source for this, but the blog post summarizes the mechanic correctly. ChatGPT just screwed up.

Lee rolls: 2, 5. ✔️ One die under 3 — Success!

GM: You thread us through the asteroids. The distress beacon pings again. What do you do?

According to the rulebook: “If one die succeeds, you barely manage it. The GM inflicts a complication, harm or cost.”

The GM did not inflict a complication, harm, or cost. ChatGPT has screwed up again.
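As an aside, this is exactly the sort of thing you don’t need an LLM for: the action roll is trivial to check with a dumb, deterministic simulation. Here’s a minimal Monte Carlo sketch based on the rules as quoted above (for Feelings, I’ve assumed the mirror-image roll-over rule from the game, which isn’t quoted in this article):

    import random

    def action_roll(number, lasers=True, expert=False, prepared=False):
        # 1d6, +1d6 if expert, +1d6 if prepared (per the rules above).
        dice = [random.randint(1, 6) for _ in range(1 + expert + prepared)]
        if lasers:
            # Each die under your number is a success; rolling your
            # number exactly also succeeds and grants Laser Feelings.
            successes = sum(d <= number for d in dice)
        else:
            # Assumed mirror-image rule for Feelings: roll over.
            successes = sum(d >= number for d in dice)
        laser_feelings = number in dice
        return successes, laser_feelings

    # Lee the Pilot: number 3, expert and prepared, so 3d6 on Lasers.
    trials = 100_000
    results = [action_roll(3, expert=True, prepared=True) for _ in range(trials)]
    for n in range(4):
        rate = sum(1 for s, _ in results if s == n) / trials
        print(f"{n} successes: {rate:.1%}")
    # With number 3, each die succeeds half the time, so the one-success
    # "complication" outcome turns up about 3 times in 8.

Twenty lines of code can follow these rules flawlessly, forever. ChatGPT, as we’ve just seen, can’t reliably do the same.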

The “transcript” continues in this vein. Sometimes ChatGPT gets the rules right. It frequently doesn’t, in a wide variety of ways.

You can see the pattern and understand the root cause: ChatGPT can’t actually understand the rules of Lasers & Feelings (in the sense of having the words of the rulebook create a mental model that it can then use independent of the words) and, therefore, cannot truly use them. It can only generate a sophisticated pattern of babble, guessing what the next word of a transcript of a Lasers & Feelings game session would look like based on the predictive patterns generated from its training data.

And if it can’t understand the rules well enough to accurately call for a simple action roll, what possible insight could it have into the actual design of the game?

None, of course. Which is why, when I asked it what changes it would make to the game to reinforce the themes, it replied with stuff like:

  • The GM should only be allowed to inflict consequences that affect relationships. (Making the game functionally unplayable.)
  • Encourage players to switch modes between Feelings and Lasers by inflicting a -1d penalty on the next Feelings roll each time a character uses Lasers. (This rule would obviously have the exact opposite effect. Plus, it doesn’t recognize that many rolls only use 1d, so how would this rule even work?)

Maybe one of these nonsense ideas it generated will spark an idea for you, but it’s inspiration from babble. Mistaking it for actual critical insight would be a disastrous mistake.

AI GAME MASTERS

Reading ChatGPT’s “transcript” of play, however, it’s nevertheless impressive that it can produce these elements and distinct moments: The distress call isn’t from the rulebook. ChatGPT plucked it out of the ether of its training data. When I mentioned earlier that it’s remarkable how much can be achieved with an ultra-sophisticated babble engine, this is the type of thing I was talking about.

Examples like this have led many to speculate that in the not-too-distant future we’ll see AI game masters redefine what it means to play an RPG. It’s easy to understand the allure: When you want to play your favorite game, you wouldn’t have to find a group or try to get everyone’s schedules to line up. You’d just boot up your virtual GM and start playing instantly. It’s the same appeal that playing a board game solo has.

Plus, most publishers know that the biggest hurdle for a new RPG is that, before anyone can play it, you first have to convince someone to GM it — a role which almost invariably requires greater investment of time, effort, and expertise. If there was a virtual alternative, then more people would be able to start playing. (And that might even end up creating more human GMs for your game.)

There will almost certainly come a day when this dream becomes a reality.

But it’s not likely going to come from simply improving LLMs.

This Lasers & Feelings “transcript” is a good example of why:

  • The PCs are following a distress signal.
  • It turns out that the distress signal is actually a trap set by bloodthirsty pirates. Two ships attack!
  • ChatGPT momentarily forgets that everyone is onboard ships.
  • We’re back in ships, but now there’s only one pirate ship.
  • And now they’re no longer pirates. They’re lost travelers who are hoping the PCs can help them chart a course home.

It turns out that the GM’s primary responsibility is to create and hold a mental model of the game world in their mind’s eye, which they then describe to the players. This mental model is the canonical reality of the game, and it’s continuously updated — and redescribed by the GM — as a result of the players’ actions.

And what is ChatGPT incapable of doing?

Creating/updating a mental model and using language to describe it.

LLMs can’t handle the fictional continuity of an RPG adventure for the same reason they “hallucinate.” They are not describing their perception of reality. They are guessing words.

The individual moments — maneuvering through an asteroid belt to find the distress signal; performing evasive maneuvers to buy time for negotiations; helping lost travelers find their way home — are all pretty good simulacra. But they are, in fact, an illusion, and the totality of the experience is nothing more than random babble.

And this is fundamental to LLMs as a technology.

Some day this problem will be solved. There are a lot of reasons to believe it will likely happen within our lifetimes. It may even incorporate LLMs as part of a large AI meta-model. But it won’t be the result of throwing ever greater amounts of compute at LLMs. It will require a fundamentally different — and, as yet, unknown — approach to AI.

38 Responses to “Thought of the Day: GenAI and RPGs”

  1. Alex says:

    I think you’re mostly correct in your take, but in my experience (not GM related, but in tech), a lot of the tooling around LLMs (progressive disclosure, agents, skills, retrieval augmentation, etc.) starts to fill that gap around mental models. It doesn’t solve the fact that LLMs are inherently advanced babble machines that can’t actually hold a mental model, but it allows for much tighter context control and better continuity in a way that a simple ChatGPT chat is unable to replicate. It’s likely going to lead to a lot of poor LLM GM products in the short term, but I wouldn’t be surprised if in a couple of years the pain points you described are somewhat patched, leading to semi-competent LLM GMs.

  2. rampaging-poet says:

    My rule of thumb for LLM use is that if it’s important enough I wouldn’t use a random table, it’s too important for an LLM.

    That said, an LLM is basically the Omni-Table. It generates random text on any topic you care to put before it. This makes it extremely flexible when you don’t have a table to hand. Those individual moments the Lasers and Feelings “playtest” generated show that off quite well. They’re not coherent when strung together, but any given event could have been on a Random Mission Table or a Distress Beacon Complication table.

    A well-designed table for a specific purpose will still generally be better than what ChatGPT spits out, but that’s like comparing a filleting knife to a Swiss army knife – you don’t carry the latter because you need one very specific tool, but because it is adequate at many tasks.

    (On the flipside, this view of LLMs as the Omni-Table extends to my purchasing habits – I am not likely to purchase a book of results of random table rolls unless there’s some genuine added value over rolling myself. Similarly, if an entire book is LLM-generated there is no point in buying it versus just generating one myself.)

  3. Oskar says:

    I use LLMs for inspiration only in RPGs. I will type some prompt and I will usually get a not-so-great result. The result is not important – it needs only to kickstart my creativity. Some ideas are worth keeping, but more often than not I need to use multiple prompts and different approaches to get some usable results. Personally, I do not think that using LLMs is good or bad, even when discussing art. I have two issues – this must be done legally (copyrights, etc.), and the result must be decent. Usually there is at least one issue here, and quite often both problems occur.

  4. Teos Abadia says:

    I found this really interesting, thanks! I would add that a fundamental problem for the concept of an AI GM is that GMing is at the heart of what makes RPGs compelling. It’s that this is hard and takes time that makes it so compelling. We get payoff as GMs as we master this complexity. Having it done for us has all the charm of a board game app: usually great for a while and then utterly disposable and forgettable. Hard things have great value. If companies try to break down that complexity with AI they will likely lose the very reason so many of us are invested.

  5. Gilthy says:

    I’m reminded of this bit from “The Hitchhiker’s Guide to the Galaxy” by Douglas Adams, where the mice (seeking the Ultimate Question) want to purchase Arthur’s brain, leading to a discussion about replacing it with a simple electronic one that just repeats basic phrases:

    “Yes, an electronic brain,” said Frankie, “a simple one would suffice.”
    “A simple one!” wailed Arthur.
    “Yeah,” said Zaphod with a sudden evil grin, “you’d just have to program it to say ‘What?’ And ‘I don’t understand’ and ‘Where’s the tea?’—who’d know the difference?”
    “What?” cried Arthur, backing away still further.
    “See what I mean?” said Zaphod and howled with pain because of something that Trillian did at that moment.

  6. ImproperSubset says:

    If your players enjoy arguing about 5e rules almost as much as playing, the AIs are very good at surfacing that one Jeremy Crawford tweet where he clarified a rule.

    I frequently use AI to generate cryptic Divination responses. You can get it to hide an answer in some workman-like rhyming verse pretty easily. It’s not artful, but it’s better than me at stringing together some meter and rhyme.

  7. Brian Womack says:

    This is a very interesting discussion. I have used AI (ChatGPT specifically) to assist me in creating some rule mini-systems that I am using in our game of the Alexandrian Remix of Descent Into Avernus. Specifically, I wanted a Stress system similar in some respects to what is in the Free League Alien game, plus a Corruption system. The Stress system only applies while in Avernus, and Stress points are handed out by the DM for certain situations. Stress adds its total to any d20 test, but whenever current Stress or less is rolled, the character panics. And I made a whole Panic table for this. The Corruption mechanic is based solely on player choices and provides permanent boons and drawbacks at each stage. At a certain point, the character becomes totally corrupted.

    All of this took considerable input on my part and several iterations and revisions before I was satisfied with the final product. But in the end, I did arrive at a satisfactory conclusion. I then cut and pasted that into Google Docs for use in the game. I also created a Hope-Despair track for use in Act 2 (Elturel) that has global effects on NPC interactions as the PCs attempt to stabilize and unite the various factions and improve the city toward Hope so the place won’t fall apart while they go to Act 3.

    There are a lot of things to hate about AI (imminent destruction of human civilization being a likely top contender), but I think there is use for LLMs in RPGs, even in design. But Justin is right that this thing can’t think or create. You have to hold all of that juice and use the LLM to basically take notes for you.

  8. Sean W says:

    I’m always interested in this topic as it relates to the TTRPG world. I know a LOT of people really get angry at anything AI created/related to their gaming.

    I’m curious Justin what YOU actually think are some good uses of AI in the ttrpg-space.

    Personally I use it to help me recall stuff. I try to get my notes from a session into ChatGPT and then I can ask it to recall stuff for me. Obviously it’s not perfect but it’s way way better than my old brain either totally forgetting stuff OR me spending way too much time looking through old notes.

  9. Craig says:

    why do you think it’s ok to feed ChatGPT content that isn’t yours?

  10. Justin Alexander says:

    @Craig: I strongly suspect you don’t understand the difference between training data, testing data, and input text. I think you believe you’re asking a question like, “Why do you think it’s OK to download pirated copyrighted material?” but your question is actually more like, “Why do you think it’s OK to do a Google search for a webpage that isn’t yours?” or “Why do you think it’s OK to run a grammar checker on an e-book you purchased?”

    @Sean: Here’s one example.

    Personally, I record my game sessions. AI technology finally made voice recognition software good enough (particularly for group recordings) to be useful, and even moreso because Otter, the site I use, links the transcript to the recording, so you can just click on the text and hear the original recording.

    Otter has also added an AI bot that analyzes the transcript. The session summaries are hilariously inaccurate, but I can ask it questions like

    – What sessions did Aura appear in?
    – What happened to the screwdriver at the end of session 3?
    – When have I mentioned the Salem-Watts corporation?

    and get meaningful answers. Or, at least, answers that are accurate enough that I can use them to track down the information I need in the recordings without having to comb through dozens of hours of audio.

  11. Paul P says:

    At this point ChatGPT is a good tool for:
    + Helping you to organize your gaming materials (timelines, etc.).
    + Doing something repetitive, like constructing an in-game calendar system.
    + It might come up with a few interesting bits of babble (idea bits) that could mesh into what you are doing – provided you have given it enough materials to work with. Again, as mentioned above, it produces “all pretty good simulacra.” In this case, a mirror of the data you fed it (at best).
    + The pay-for GPTs will be better helpers than the “Free” type.
    + A word on GPTs: Do not put your original content into the “Free” GPTs. Anything you enter is absorbed and can be used by anyone… it adds to its training model. If you use a model, be sure that your data is yours and stays that way.

  12. Jisk says:

    Lack of mental model is not the problem here. Sticking to one model is. A playtest is about the worst thing you could ask it for. It means it switches from playing the part of ‘analyst’ to the part of ‘actor/writer’ and loses most of the context it acquired as an analyst. You would get a better set of critiques by just asking for comments with nothing but the rules. What you did here was, effectively, tell Alice to read the rules, explain them to Bob, and then have Bob run a game session based on that explanation and report back. Which is useful at a certain stage, where what you want is clarity testing, but not at the stage you actually asked for.

  13. Jisk says:

    Also I’ve seen novel-length roleplay with a GPT-based AI, that retains coherence just fine. It couldn’t do this two years ago (we tried) but it can now. It was, however, structured as a back and forth, with the other player, a human, sending messages between the AI player’s messages, and pre-trained on a large sample of humans writing in that format. Two years ago it wasn’t enough; now, it forgets things faster than experienced human coauthors but less quickly than my DM does for our normal campaign. And we tried using the same model to do solo RP with itself, and it took a thousand words to find its feet in the setting it was improvising but it got there pretty well.

  14. Steve W says:

    So I see what you’re saying and I think your point is fair enough that asking ChatGPT to playtest a game or do a deep critique on probabilities and mechanics isn’t really workable.

    On the other hand, though, I do think there’s value to be found in using LLMs as a collaborative tool for designers in some meaningful way. The most obvious use is brainstorming (and there’s a 2023 paper in the International Journal of Serious Games entitled “How ChatGPT can inspire and improve serious board game design” which goes into this somewhat). If one goes into Perplexity in research mode and says, for example, “I’m designing a LARP based on scenario X which explores themes Y and Z, find me examples of LARPs following similar themes”, or “the game I’m designing tries to capture such-and-such a mood and feeling, what sort of mechanics have other designers used to do that and what sort of critique did those systems receive?” then it can certainly help with ideas.

    I absolutely wouldn’t advise asking it to create mechanics and I’d be very wary of someone who just uncritically took whatever it spat out as a replacement for applying their own creativity, but even your example of asking it to suggest attributes and skills isn’t necessarily going to be a dead end.

    Basically: LLMs can’t design games, but I do think you’re maybe being too pessimistic about their value as a tool in a design process. The key thing is to understand that they’re a tool to enhance/assist human creativity, not a viable replacement.

  15. temp_anon says:

    I agree with Jisk that your prompting wasn’t great. You said you “pointed it” at Lasers & Feelings. Does that mean you asked it to search for the game across the web, which would introduce a lot of other information and posts unrelated to directly running the game? I see that it somehow brought up a blog post according to your article, so it may have been given too broad a scope. I’d go with just uploading the PDF of the game itself, turning off web search, asking it to do the generation stuff directly, and then going from there. You left it to its own devices and, as Jisk said, it played a game of telephone with the rules of the game, which is why it went sideways so quickly.

  16. Donnacha K says:

    I think that the statement “LLMs can’t form mental models” is too strong, and not supported by our current best understanding of these systems. I have seen a lot of people repeat it online, but there is pretty good evidence now that LLMs *do* construct mental models (though they can be unreliable). I think a better statement is that (current) LLMs are *bad* at constructing mental models, and often default to the shallow “pattern-matching” that they are better known for.

    (The evidence for the existence of mental models, by the way, does come from their ability to do some arithmetic. As you point out, early models were extremely bad at even simple arithmetic, which strongly suggested they were just doing rudimentary pattern-matching. Modern models call to other programs/tools to try and remedy that, but we can also run LLMs that *don’t* call to other tools, and we can see that they are actually able to do some rudimentary arithmetic now. They’re still not very good at calculation, but they are much more capable than you would expect if they were just doing pattern-matching: they can multiply relatively large integers, large enough that if they were simply using a lookup table it would be too large to store in the footprint of the model itself.)

    I also agree with some other commenters here that a playtest doesn’t sound like the kind of thing an LLM would be good at; I have used them to proof-read/check my writing for errors, and that’s pretty useful. (obviously I don’t expect an LLM to catch every error, and they do occasionally get things wrong, but they have been very useful for catching things like “on page 12 you say you’re going to follow up on x later, but you never do” or “you use a different term to refer to the same concept on page 45 and 89” etc.)

    However, I don’t think I would use an LLM for any GMing that I would do, but just because I know that my friends would likely be made uncomfortable by it. Playing a TTRPG requires a lot of trust, and I think (currently) people can have such viscerally negative reactions to ChatGPT etc that it’s not worth it (for me) for the small benefit it might provide. I know people who would object even to the proof-reading use-case I described above, and I would want those people to feel comfortable playing at a table I’m GMing.

  17. Joel says:

    First, you’re describing LLMs as they were back in 2022. They are much more advanced now, and describing them as “babble generators” is clearly just wrong.
    Secondly, your test was flawed. AI models are designed to work with collaborative input and iteration. They need prompt input to work from, so a proper use in game design is something like, “If I change how this spell works in D&D 5e to xyz, will it break anything?” They’re actually quite good at that sort of design work. They also work well as a DM, because they can generate storyline, and they know the rules from their training data, but it’s a two-sided game. You have to prompt it with what you’re doing and what choices you’re making. Then it creates the next bit of the story from that input. It works very well, and I use it to run solo campaigns. Does it make occasional mistakes? Yes, and so does a human DM. I think your point about needing knowledge in the field yourself is exactly right, and it’s a key thing the layman doesn’t understand about using AI.

  18. Beau Rancourt says:

    Felt compelled to do some fact-checking. I saw you recommended some videos, and I’d like to add these:
    https://youtu.be/21EYKqUsPfg, https://www.youtube.com/watch?v=LPZh9BOjkQs and https://www.youtube.com/watch?v=7xTGNNLPyMI

    I used to work in the field, and after the original AlphaZero chess engine paper was released, I wrote my own implementation (that worked and trained itself). I was mostly doing computer vision at the time, but at the core this is a neural net and the ideas are all very similar.

    > Regardless, there’s one thing missing from LLMs: The ability to form mental models.

    It’s not clear that this is the case. Brains are (to be slightly reductive) a huge network of neurons that fire once their action potential is met (from other neurons firing); machine neural networks were designed to be the same thing. There are inputs, a bunch of neural activity, and outputs.

    All of the in-between layers (between input and output) can represent, well, anything, so long as there’s enough neurons to represent the state space. That *includes* being able to form mental models. It’s *very* hard to inspect how neural networks are solving a problem, but it *could* be forming mental models (and it seems like it does, if that’s an efficient way to solve the problem).

    Imagine training a network to perform addition. As a training set, you supply it with trillions of samples of [number1, number2, number1+number2]. It *could* try to optimize output accuracy by memorizing the trillion answers and encoding that in the network, but it’s more *efficient* for it to build a network that has an “understanding” of what addition is (forming a mental model) and use that.

    The same thing happens when we try to train an LLM on next-token prediction on quadrillions of words from the corpus of everything ever written by humanity. In order for it to get *very* good at predicting the next word, it ends up forming a sort of world-model (because the sorts of things that humans write are correlated with the world they live in). Sutton argues that it’s a “fake” world; the world implied by text, instead of the actual world (which seems true to me, but still useful).

    > They can’t observe the world, think about it abstractly, and then describe their conclusions.

    They can’t observe the world, but they *can* think about the world they were trained on (via text-encodings of writings about that world), and reason abstractly about that world, and then describe their thoughts and conclusions. The closer the text-encoding is to the actual world (which is especially the case with math and programming) the better they tend to be.

    > This is why, for example, LLMs were quite bad at solving math problems: Unless they’d “seen” a specific equation many times in their training data (2 + 2 = 4), the only pattern they could really pick out was X + Y = [some random number].
    >
    > LLMs are actually still incredibly bad at math, but the “models” we interact with have been tuned to detect when a math problem is being asked (directly or indirectly) and use a separate calculator program to provide the answer.

    LLMs are *not* bad at math. They’re currently outperforming humans on extremely difficult tests like the math olympiad. They’re excellent at writing proofs, and it’s something you can test yourself. They get a lot better with slight guidance, like explicitly telling them to check their work or that they’re to act like this is a math exam and provide the full proof to their answer. To demonstrate this, I grabbed a random math olympiad question and plugged it into gemini 3 pro. Here’s the output https://gemini.google.com/share/b4c79e989894. Looks correct and well-written to me.

    On my own end, I’ve had great luck uploading complex rulesets and having it generate pre-gen characters, or being able to ask it rules minutiae, especially if I demand that it cite the page number it’s pulling the rule from.

  19. Leland J. Tankersley says:

    I have … thoughts about LLMs. Here are a few. (N.b. I’m a software/systems engineer by trade, with a degree in symbolic systems, which is much like cognitive science. I have modified and worked with neural networks, inference engines, expert systems, and rule production systems and have also done a little bit of work with semantic analyzers, although it’s not my core area of expertise.)

    First, I think of them as a (very impressive) parlor trick. LLMs aren’t really doing anything novel or different than was being done by AI researchers twenty or even thirty years ago. They’re just throwing (a lot) more horsepower (storage, processing power, stolen intellectual property) at the problem. The output can be impressive, but sooner or later you realize it’s essentially devoid of internal meaning. We can inject meaning into it but there’s no mind behind it. (Or, arguably, there is a huge collection of arguing minds behind it; I’m not sure that’s better.)

    Second, about LLMs not having a mental model at all – absolutely yes. But … so here’s the thing. When I am writing, I generally have an idea of what I want to say, and then I start putting words down. Often I will pause and think about what word or phrase to use next (and sometimes as a result I go back and edit the first part of a statement so it fits better). How am I doing this? It feels to me a lot like I’m thinking back over everything I’ve heard and (especially) read, hunting for a way someone else already expressed the idea that really fits with my purpose and intent. And that seems a lot like what LLMs are doing when they are deciding/predicting what word should come next – looking through their corpus of training data (/memory of what has been read before) and deciding what next utterance feels/sounds best in context. (Of course, in the LLM case that context is almost entirely lexical.)

    Third, regarding people using AI/LLM tools to “improve” their writing – so these tools are trained on a huge body of data scraped from all the writing on the internet. By design LLM output is “average” – it literally decides what to output based on what is in its training data. So if your writing is below-average, then maybe using an LLM will help you to look average. But if you’re above-average (as are we all here, right? 😉 ) then using an LLM will probably make you come off worse than before.

    Finally, I feel (hope?) that LLM output taking over the world may be a self-correcting problem. Remember, the way they work is to read in a huge corpus of training data, assume that that data reflects “correctness,” and then generate new text in a probabilistic way from statistics garnered from the training data. As more and more LLM output gets spewed onto the internet, and LLMs continue to get trained on whatever can be gleaned from the internet, they are effectively poisoning themselves. Essentially, they are sh*tting where they eat. And as time goes on, I think their output will actually get _worse_ as they start training themselves on the increasingly-nonsensical output of models that came before.

  20. Frank says:

    You ever notice how often people defending AI will say “nah, man, you just need a better prompt”?

  21. Nathan says:

    I really just don’t understand the obsession with automating our creativity. Even if LLMs did everything as well as some people claim they do, why would I want to rob myself of the satisfaction of creating something that I made, that I love, that I can be proud of? I get that we all have limits to our skillsets. And I’m not going to claim that I’ve never used an LLM to make up for that – I have, certainly. Though I increasingly do that less and less, as I’m realizing that I miss drawing my own cruddy drawings. RPGs are about imagination. Can’t we use our imagination?

    I do agree that it has some use cases. If I cannot find an answer to a rules question in my own internet searches, I can often get it to find resources I was incapable of finding. But I increasingly suspect that this is happening because LLMs exist and because of how non-functional modern search (and gaming searches with SEO) operates. There are other ways to solve that problem that don’t require LLMs. I swear it used to be easier to find things on the internet than it is now.

    @Justin, I like your example of using Otter. This is chiefly what I use LLMs for, too. They are language models. That makes them exceptionally good at transcription. As someone who has severe RSI, I appreciate that.

    I’m not going to tell anyone not to use an LLM. If that’s how they want to roll, by all means, do what you want to do. I’m just an old software engineer screaming at clouds at this point. But I don’t think people fully realize what they’re robbing themselves of. Yes, some tasks are tedious. But this is my hobby. It’s a hobby I love. I’m never going to be a perfect GM, and I’m not only okay with that – I think that’s great. I do some things really well. I do other things really poorly. This gives me a style. A lot of what a style is, is leaning into your strengths and minimizing your weaknesses.

    Even if your style is, “I’m not as good at the rules as I want to be,” hey, I get it. I have ADHD. My eyes gloss over so fast when I have to slog through the rules. But I excel at making things up on the fly, at jiving off my players, and maybe it takes me longer to learn the rules – that’s OK. You’re allowed to make mistakes. You’re allowed to not be perfect. You’re allowed to just… have fun.

    Anyhoo- great article. Loved it. Thank you for writing it.

  22. Tim Martin says:

    As a data scientist (I build ML models) and fan of the blog, I just wanted to add to the nuance on “mental models”. Some of the other commenters are right – it’s probably not correct to say that LLMs can’t form mental models (caveat that you haven’t rigorously defined the term, but I think we can make do anyway.)

    Contrary to what many people like to say, LLMs do form some world models in training, and we’ve had good evidence of this for several years. This is my favorite, very accessible example: https://thegradient.pub/othello/

    (The lesson for why this happens, by the way, is that the cost function of an ML training process [e.g. “output the correct probability for the next token”] is not a description of what the model will learn during training. The best way to predict the next token is to learn the causal graph (i.e. world model) that created it. A model with enough capacity and training data will attempt to do just that.)

    As for post-training, reading the rules of a game and following them is basically a matter of calling upon the correct patterns that were learned during training. This is what ChatGPT, etc. do all the time when you talk to them. (Francois Chollet’s explanation of LLMs as a “database of programs” might be useful to understand this: https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering) So, I would expect there to be many instances where an LLM could read the rules to a game and follow them pretty well. (Though I understand that the LLM failed here. And that’s one of the problems with LLMs at present – there is large variance in the quality of results.)

    Anyway, not saying that I think a current LLM will be great at helping with RPG design. Just wanted to comment on the larger misconception about mental models. One of the lessons of LLMs is that a lot of our old mental models for what “reasoning” is are not quite fully baked, because LLMs break our definitions in ways that we didn’t expect. We are mostly operating in a grey, “I’m not sure what to call this” space. But saying “this LLM can’t DM because it can’t form a mental model after reading the rules of the game” is not quite right.

  23. Jisk says:

    > You ever notice how often people defending AI will say “nah, man, you just need a better prompt”?

    It’s usually true. Getting good performance on things near the edges of their capabilities is a skill, and the closer to the edge you are, the more skill is needed. However, you can get good general performance with the simple system prompt “Write the best prompt for my question and then respond to that prompt.” My system prompt, which gets me most things I try, adds “Avoid sycophancy and push back on inaccurate assumptions where appropriate.” That isn’t uniformly helpful, but it converts a larger number of issues that are hard to catch and fix into a smaller number which are very easy to catch and mostly easy to fix.

    > I really just don’t understand the obsession with automating our creativity. Even if LLMs did everything as well as some people claim they do, why would I want to rob myself of the satisfaction of creating something that I made, that I love, that I can be proud of?

    Using an LLM doesn’t do that if you use it reasonably. Collaboration with an inexperienced junior writer doesn’t either, and is similarly useful. If it does for you, skill issue.

  24. anotherpanacea says:

    It seems like you zero-shotted it. LLMs and GMs both do better with a little feedback. I fed it your blog post and it was easily able to fix those mistakes. (It had some rules confusion about LASER FEELINGS rolls and missed the HELPER mechanic, but that was because it consulted the wrong version.) It would be lonely fun indeed to run a game just with the machine, but that might be better than a fifth BG3 run….

    https://gemini.google.com/share/f7d36b0a50ba

  25. Justin Alexander says:

    @anotherpanacea: I’m not particularly impressed. Its “analysis” is just summarizing what I wrote, plus some random stuff from the comments here.

    Then you tell it to start the playtest! And the first thing it says is:

    Here is the setup phase. I have consulted the Lasers & Feelings PDF tables to generate the mission.

    1. Mission Generation
    I rolled four d6s to generate the scenario:

    Roll 1 (Threat): 3 -> Cyber Pirates

    This is wrong. A roll of 3 on the table is Rogue Captain. “Cyber Pirates” does not appear on the table.

    Roll 2 (Wants to): 2 -> Steal/Capture

    Roll 3 (The): 3 -> Captain Darcy

    Also wrong. Captain Darcy does not appear on that table.

    Let’s skip down to its summary of the mission it generated, which concludes with:

    If they succeed, their possession of a Starfleet Captain (or the entity inside him) will start a war between the Consortium and the Pirate Clans.

    The players’ organization morphs from Starfleet (wrong) to the Consortium (correct) in the middle of a sentence. It got confused and included Star Trek setting information.

    It creates characters, but makes the same mistake ChatGPT did of thinking that 3 is “balanced” between Lasers and Feelings. (But didn’t it read my article here and “understand” it?) And it forgets to create Character Goals.

    It then tries to create the ship, but doesn’t follow the rules for that.

    It wraps up by presenting a summary of the “game state,” which includes:

    …the Cyber Pirate flagship The Iron Lung is hailing you.

    You tell it to start generating a dialogue of play and it reiterates the existing game state:

    The Pirate Mothership Iron Lung has locked onto the ship.

    Whoops. It immediately forgot what it just said.

    Begin play:

    GM: The Iron Lung’s comms override yours.

    What is that even supposed to mean?

    On turn 2 it stops resolving the skill check halfway through, apparently “distracted” by the result of Laser Feelings. You call it out on this later and ask for an explanation, and it says:

    You are absolutely right that I messed up the flow. In Turn 2, I had Jax get “Laser Feelings,” ask a question, and then pass the turn to Sparks. The Rules: The PDF explicitly says: “You can change your action if you want to. Then roll again.”

    The rules PDF you pointed it to doesn’t actually say that. Even with direct prompting of its error it still can’t get it right.

    (With some digging, what seems to have happened here is that it’s using this post from 11 years ago on StackExchange where someone is quoting an older version of the rules. Unclear whether this is part of its training data or a result of it doing a web search.)

    Strict Interpretation: The text does not state that the die counts as a success.

    It’s very certain about being completely wrong about this!

    Reading further, I see that you also eventually figure this out. But for some reason you “can’t fault it” for completely screwing up a game with two rules.

    It then attempts to “prove” that it’s using the correct rules now by resolving a check using the Helper rules.

    Action: Sparks tries to forge the sensor logs to show the Pirates fired first (Truth: The Pirates hailed, Elara lied about Space Pox, then we jammed them).

    … which is not an accurate summary of what it described happening earlier.

    It looks like, after considerable effort, you eventually got it to make a skill check using the correct rules. Will it continue doing so in the future? Maybe.

    Now imagine doing that with every single rule in D&D and then trusting that its feedback on something like the balance of a new class feature for barbarians will actually be meaningful in some way.

  26. temp_anon says:

    And again, you all are putting way too much information into the LLM, even prompting it to take several steps at a time when it shouldn’t be. I wouldn’t ask a human to do as much as was asked in the previously linked Gemini transcript.

    Here’s what I got with Grok 4.1, with just uploading the single page of Lasers & Feelings directly, and directing it only to generate the adventure hooks while following the rules to roll for it. It successfully rolled each time, thanks to Grok generating some simple code to determine rolls.

    https://grok.com/share/bGVnYWN5LWNvcHk_1aa47f7d-b902-4889-a110-68b17da00c71

    Direct and to the point after a single step of introducing my goals in the first prompt, then introducing the PDF and asking it in a precise way.

  27. confanity says:

    The massive elephant in the room that nobody seems to have touched on at all is the cost of running AI.

    There are, reportedly, already places where the average power bill has increased by $15/month because of a nearby data center… and it’s not like they’re about to stop building ever-bigger data centers. Supposedly there are some planned data centers that would, on their own, consume power on the same scale as a small *city*, meaning that costs are only going to grow.

    So every time you go to use AI — whether for “inspiration,” or in an attempt to cut humans out of creative activity X altogether — maybe the question you need to ask yourself is, “How much am I willing to pay each month for this service?”

    If you wouldn’t sign up for a $15/month “inspiration” service, then maybe AI isn’t the right tool after all. Not because of whether it’s actually up to any given task, but because the people selling it have found a sneaky underhanded way to charge you for it without actually telling you that that’s what they’re doing.

    (It gets worse, of course. Because the cost of your “inspiration” service isn’t just applied to your own power bill. It’s applied to everyone’s power bill even when they’re not using the product… even if they were already living on the edge of poverty and that extra couple hundred dollars each year means they can’t get their car fixed, or their teeth cleaned, or their rent paid, etc.)

    Until and unless the tech companies build their own reactors, or find some other way to power their city-equivalent power drains that doesn’t increase everyone else’s costs, *no use of AI at all is ethical* unless it’s actively saving human lives, e.g. AI that has been trained to detect cancer cells in biopsy pictures.

    Obviously, using it as a crutch or a shortcut for a game doesn’t meet that standard.

  28. anotherpanacea says:

    It seems like 1.2 and 1.3 had the alternate Laser Feelings rule: https://ia600908.us.archive.org/32/items/lasers_and_feelings_rpg/lasers_and_feelings_rpg.pdf

    (I fed it 1.4 but it apparently didn’t read it. ChatGPT did something similar–it couldn’t see the pdf so it just googled up its own version.)

    Since we already have computers that are really good at following rules, there may eventually be a way to combine the deterministic programming that makes Neverwinter Nights work with the hallucinatory creativity of LLMs.

  29. Liam says:

    I do not think LLMs are healthy for TTRPGs. I know that the cat is out of the bag, and I know that every revolution in human communication has carried with it the same complaint – this [new thing] is destroying [this institution]. I know. But I cannot help but look at prevailing trends in tech and worry that AI will bring ens**ification and occlude human creativity. Maybe that makes me a Luddite, but there is some irony in that, given one of modern fantasy’s foundational texts, Lord of the Rings, contained major themes of the evils of industrialization.

    The reason I am unconvinced by comparisons of LLMs to things like co-creators or assistants to those with time crunches or those who might struggle with the skills traditionally associated with GMing is that these problems were solved before the introduction of LLMs. I do completely understand the power of LLMs to do things like synthesize gatekept knowledge (often behind snarky replies on Stack Overflow lol), but the beauty of TTRPGs is that knowledge is freely and readily shared by the community. In the past decade or two, things like vigorously moderated Discord servers and subreddits have brought the expertise of the community right to your fingertips with a reasonable amount of civility. And there are more free modules than you could ever run in a lifetime, and even more if you’re willing to throw a few bucks at it.

    But alas, there is still a barrier to entry in the form of finding the right space and the risk of rudeness. And given the choice, consumers will always, always pick ease of use and convenience.

  30. Elliott says:

    Great article, and all salient points about using this tool for RPGs and why you maybe shouldn’t. However, some of the claims you make about the way LLMs function are a little off base. Using a calculator tool is a real thing, but most of the big AIs these days do not invoke tools for simple arithmetic. In machine learning there’s an idea called ‘grokking’, where a model trained for a very long time will experience a realignment of its neurons towards real internal logical structures and not just probabilistic noise. It doesn’t make the models any more or less probabilistic, but it does mean that the ‘probability’ of a math hallucination becomes very, very low.

  31. Elliott says:

    @confanity
    The $15/month figure seems to be conflating different issues. Data centers do consume massive amounts of power, but at commercial rates for their own usage. Bill increases near data centers are not for increased usage but for infrastructure upgrades. People are not literally paying the bill for the training electricity like you imply.

    The “everyone pays even if they don’t use it” point misunderstands how electricity is sold, but regional capacity issues are a real and valid concern. The issue isn’t with new projects (data centers) as much as it is with grid/supply infrastructure problems. Nobody cares about electricity when a new factory is built.

    The idea that AI should be used only for ‘life saving’ applications is absurd. If you earnestly believed this, you wouldn’t use Google Maps, modern smartphone photography, translation software, or driving assistance tools. Each is trained using massive sets of data and costs (or cost) a ton. This sort of black and white thinking is very limiting and, importantly, obviously something you don’t actually stand by. If you did, you wouldn’t be using the internet in 2026 🙂

    What are your actual real concerns with AI? There are absolutely real and valid criticisms. I personally am terrified by the fact that celebrity deepfakes became commonplace overnight, and by the prevalence of voice-based authentication given the ease of building voice models.

    I’m trying to judge the tech, uses, dubious training legality, and effects on the world separately to form nuanced and robust opinions.

  32. confanity says:

    @Elliott –
    Why are you making all these unsupported claims when you clearly don’t actually understand how AI works? Like, you admit that AI doesn’t become less probabilistic, but this follows an utterly bizarre, contradictory claim that changing the probability weights by feeding a given model more repetitions of correct answers to math problems is somehow actually a vague magical “realignment of neurons.”

    There is no magic, only probabilities, and a “low probability of math hallucinations” doesn’t justify firing up a huge expensive AI system where a simple calculator would suffice. It just doesn’t. Stuff like Google Maps at least can justify itself by giving you a huge amount of information that is updated over time, in the way that a physical map can’t. But here you are trying to defend something that is almost always *worse* than the real thing, and the best defense you can muster is that, well, in certain very safe use-cases we can get the randomly, needlessly wrong answers down to a “low probability.”

    Similarly, you’ve completely missed the point of the cost increase issue: the fact of the matter is that *costs are increasing*, full stop. I get that I didn’t make my full thought process clear enough, but part of the point that I definitely did make (and which you seem to have failed to read) is that unless we take action, they’re going to continue to keep on not just consuming but also building, which — say it with me — will lead to increased costs for everyone, even people who can’t afford it. You don’t get to use a couple of verbal dodges to brush that fact under the rug.

    So you’ll have to forgive me if I don’t buy into your shallow claims about what is or is not absurd. The whole point of AI when they first started selling us the idea was that it was there to improve everyone’s standard of living and, yes, save lives. Instead, the overwhelming majority of what we’ve gotten is just enshittification: more spam; more bots; more theft; more fakery; more loneliness and psychosis; more tools for malicious actors to feed misinformation into the public discourse; more “mecha-Hitlers.” AI was supposed to take care of the drudgery and give us more time to devote to creativity; instead it’s being used to cut human beings out of the creative process and leave us with nothing but the drudgery.

    So not only is the cost very much an “actual real concern” with AI, but the extent of other concerns is so vast that we can’t even meaningfully cover here.

    1. There’s the fact that AI is built on theft and actively designed to obscure rather than cite sources.
    2. There’s the fact that the rush to build it out far beyond its actual usefulness is pushing unnecessary costs onto the rest of us.
    3. There’s the fact that the rush itself is a massive bubble that, when it collapses, is likely to cause widespread economic harm to innocent people.
    4. There’s the fact that AI “summaries” actively draw people away from the sources those summaries were built on, depriving them of clicks and attention (and thus revenue), which in turn means we’re going to get fewer reliable information sources.
    5. There’s the fact that by definition, every frivolous use of AI as a crutch or shortcut ultimately harms the user by keeping them from exercising their mental “muscles.”
    6. There’s the fact that a significant and growing portion of uses that we see are actively harmful, such as thoughtful, high-quality content creators being drowned out by an ocean of quick cheap slop… or people being driven to mental illness by sycophantic AI chatbots… or malicious actors spreading lies and propaganda.
    7. There’s the fact that reliance on AI hallucinations as if they were sources of actual information is leading to the spread of lies.

    That’s just a handful off the top of my head. You yourself managed to pay a little lip service to these facts in passing, although for someone who claims to be serious about “judging… effects on the world,” you sure were quick to rush by them without engaging.

    So I’m not interested in quibbling with you and “trying to judge” exactly what is the best way to remove the humans from *human creativity*. Again: there is a very limited set of actual, valid uses of AI. Maybe we can find some that don’t strictly involve directly saving a human life. My main point here is that mere crutches and pathetically-small conveniences are not valid uses that justify the costs and dangers.

  33. Devin says:

    Sometimes gets the rules right, but often doesn’t, in a wide variety of ways? Expresses ideas for “improving” the game that would make it functionally unplayable? Seemingly incapable of creating a mental model and using language to describe it?

    Maybe ChatGPT can’t replace the GM, but it sounds like it could replace a lot of players!

  34. MTB says:

    Thank you for this post, Justin. I think this sort of empirical experimentation is really important as we think through the implications of this tech for tabletop games. I personally think there is cause for some concern.

    I was surprised to see how poorly ChatGPT understood the simple rules for Lasers & Feelings, after all these years of development. Very poor.

    The new model from Google has been getting well reviewed, so I tried the same experiment (using its “thinking” mode). So far as I can see, it correctly interpreted the rules and generated two reasonable encounters. Transcript below:

    https://gemini.google.com/share/70541eda66fe

    I’m not making any comment as to the suitability for playtesting, just adding another data point to the discussion.

    cheers,
    MTB

  35. Actual-Play : IA ? – Donjons & Darons says:

    […] in an article, The Alexandrian explains how he has the AI “playtest” a scenario. The process […]

  36. empresseggs says:

    @confanity It seems somewhat unwelcome here, but I’m very glad that you brought this up; the *cost* of AI/LLMs is rarely brought up in ttrpg circles. Nothing more to add.

  37. Duncan Idaho says:

    Here’s recent research to add to the discussion: https://www.iflscience.com/scientists-forced-ai-language-models-to-play-dungeons-dragons-to-see-how-well-they-concentrate-82297

  38. Maxwell says:

    This may seem ‘extreme’ to some, but as someone who is not only informed about the plural harms of this tech, but who would also never give my time to something someone else has outsourced their own for, I’d never play in a game in which ‘AI’ was used.

    Its use seems to me to be mired in the same lack of patience and ingenuity we see in those who read Blinkist summaries because they can’t be bothered reading a book.


Copyright © The Alexandrian. All rights reserved.