Exploring ChatGPT for

Quest Systems in Unity

Demo Video

Overview

We explored whether Large Language Models (LLMs), specifically ChatGPT, could replace traditional quest systems in games. Using a Unity prototype, we integrated ChatGPT through its API and conducted over 3,600 automated trials across multiple models and settings. Our findings highlight LLMs’ strong potential for rapid prototyping but reveal critical limitations in reliability and scalability.

My Role

  • Designed Unity API integration

  • Built automated test scripts for large-scale evaluation

  • Analyzed results & co-authored empirical analysis section

Tools Used

  • Unity (C#)

  • OpenAI API

  • Automated Testing Framework

Prototype Setup

The prototype scene included interactive objects with condition states (e.g., Broken/Fixed, Dirty/Clean). ChatGPT generated quests instructing the player to change object states, allowing us to test reliability of structured outputs.

Methodology

To evaluate ChatGPT’s potential as a quest generator, we designed a series of automated tests in Unity.

  • Objects & States: Ran trials with 10 and 50 objects, each with either 2 or 5 possible states (e.g., Fixed/Broken, Dirty/Clean).

  • Models Tested: ChatGPT-4.1 Normal, Mini, and Nano.

  • Temperature Values: 0.5, 1.0, 1.5, and 2.0.

  • Testing Scale: Over 3,600 automated trials were conducted to measure reliability and variance in responses.

Each test checked whether the model returned valid JSON, adhered to the required schema, avoided hallucinations, and produced usable quest instructions.

Unity Scene State

Prompt to ChatGPT

JSON Response

Quest Displayed in Game

Results

Our testing revealed clear differences between models and configurations:

  • GPT-4.1 Normal was the most reliable, averaging ~73% success across trials.

  • GPT-4.1 Mini performed reasonably well on simpler tasks, but reliability dropped sharply with more objects and states.

  • GPT-4.1 Nano frequently failed schema compliance, often hallucinating objects or returning unusable responses.

Key statistics across all trials:

  • Valid JSON: 86%

  • Schema Compliance: 62%

  • No Hallucinations: 53%

  • Fully Passed (all criteria met): 53%

Takeaways and Conclusion

LLMs can be a powerful tool for rapid prototyping and experimentation in games.

  1. Current models are not yet reliable enough for production use in structured systems like quests.

  2. Developers must apply schema validation, guardrails, and temperature control to make outputs usable.

  3. Smaller models (like Nano) are unsuitable for structured pipelines, while larger models (Normal) show potential if constraints are applied.

This project highlights both the promise and limitations of using LLMs for quest generation in Unity. While ChatGPT can successfully create structured quests and speed up prototyping, its reliability issues make it unsuitable as a direct replacement for traditional quest systems. Future improvements to model accuracy, combined with robust validation layers, could unlock new opportunities for integrating AI-driven content generation into games.

Resources