LuAI: Large Language Models That Teach the User How to Code in Lua
Introduction
Over the past five years, artificial intelligence has experienced one of the largest booms in its history. Comparable to the birth of the internet, widespread use of AI is poised to change the way the world works. There are countless use cases too specific for large general-purpose models to handle effectively, yet still valuable to many people.
This gap was on my mind when brainstorming ideas for my Capstone, and I wanted to address it by fine-tuning my own model locally, in hopes of making it exceptionally good at one specific task. Given my background as a Computer Science student from a family of teachers, I set out to create a large language model that would teach someone to code in a lesser-known language with limited support online.
After searching for a language obscure enough to lack overwhelming online resources yet relevant enough to be useful, I settled on Lua. Lua is a lightweight language, which meant it would be easier for my models to learn; this was especially important given my relatively limited hardware. Lua is also the scripting language used to create games in the popular application Roblox, making it a good entry point for someone interested in game design. Combine these reasons with the fact that no Lua-specific teaching model exists, and it was a clear choice.
Background Info
The AI model I created is classified as a large language model (LLM)—a type of artificial intelligence model trained to understand and generate natural-language text and conversation. A widely popular example is OpenAI’s ChatGPT. These models are trained on massive amounts of data, and creating one from scratch would be a monumental task for one person. For this reason, I started from base models that had already been trained on the fine details of human conversation.
Even these base models are too large to fit into my graphics card’s video RAM (VRAM), so I had to use quantized versions, in which the model’s weights are stored at lower numerical precision to reduce memory and compute requirements. To get these base models to produce the desired responses, I used fine-tuning—taking a pre-trained model and training it further on a smaller, more specific dataset.
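In practice, a quantized load can be configured with the Hugging Face `transformers` and `bitsandbytes` libraries. The sketch below is illustrative, not the exact configuration used in this project; the model ID matches the 7-billion-parameter CodeLlama model discussed later, and the specific quantization settings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configuration for 4-bit quantization: weights are stored in 4-bit
# precision, while computation happens in a higher-precision dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

# Load the base model with the quantization config applied, letting
# transformers place layers on the available GPU automatically.
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
```

With settings like these, a 7B model that would otherwise need well over 14 GB of VRAM in half precision can fit on a single consumer-class GPU.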
Since the goal was for my model to learn to break code down and explain it while also coding proficiently in Lua, I decided to use model distillation—a technique in which a small model learns to mimic a much larger, more powerful one. Before fine-tuning, I needed to gather many prompts and responses from a stronger model (GPT-4o), using prompt engineering to ensure the dataset contained the correct style of prompt/response pairs. The result was a synthetic dataset.
Project Overview
The goal of this project is to create a large language model that teaches the user how to code in Lua by generating code based on a prompt given by the user. The model will break up the generated code and explain each section of it, and then explain how it all fits together. The intended use case is that the user would have some prior coding experience but no experience with Lua.
Over the course of this project, a few things changed. I originally intended for the resulting models not to generate code at all, but rather to use retrieval-augmented generation (RAG) to store the Lua documentation alongside the model, allowing it to pull directly from the documentation when answering questions. However, this strategy had issues: it is very hard to stop a model from generating code, and the RAG structure proved difficult to build given how diverse the information in the Lua documentation is. I decided to alter my approach and use model distillation instead.
Tools
To create the synthetic dataset, I used OpenAI’s GPT-4o via the OpenAI API, repeatedly prompting the model with every question in a coding-problems dataset (over 850 questions from Hugging Face) and storing the responses in a CSV file.
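The dataset-generation loop can be sketched as below. The `generate_response` function is a stand-in for the real API call (which is documented in a comment); the problem strings and CSV column names here are illustrative assumptions:

```python
import csv
import io

# The system prompt developed through prompt engineering (abridged here).
SYSTEM_PROMPT = (
    "Act as a teacher whose main goal is to explain the material that is "
    "being generated. The topic you teach is the coding language Lua."
)

def generate_response(problem: str) -> str:
    """Stand-in for the real API call. In the actual pipeline this would
    call the OpenAI API, roughly:
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM_PROMPT},
                      {"role": "user", "content": problem}])
    """
    return f"[GPT-4o-style Lua lesson for: {problem}]"

def build_dataset(problems: list[str]) -> str:
    """Prompt the model once per coding problem and collect the
    prompt/response pairs as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["prompt", "response"])  # one column each for Q and A
    for problem in problems:
        writer.writerow([problem, generate_response(problem)])
    return buffer.getvalue()

csv_text = build_dataset(["Reverse a string", "Sum a list of numbers"])
```

Run over the full 850+ question dataset, a loop like this produces the CSV of prompt/response pairs that later becomes the fine-tuning data.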
For base models, I used Hugging Face to find models fitting my criteria and limitations. When loading models, I used the BitsAndBytes library for 4-bit quantization to accommodate memory limitations. All training was done in Google Colab, which allowed me to utilize Google’s hardware with better memory capacity and performance.
Project Structure
I broke the project into three main sections:
- Prompt Engineering — Creating the prompt used to elicit responses from GPT-4o. This was mainly trial and error: creating a prompt, sending it to ChatGPT, evaluating the response, and tweaking it based on whether the output was satisfactory. The final prompt: “Act as a teacher whose main goal is to explain the material that is being generated. Assume the user has a basic knowledge of coding concepts, but nothing advanced. The topic you teach is the coding language Lua, which you are an expert on. Break down the code that is being generated into likewise chunks and explain each section individually and then explain how it all fits together after this.”
- Synthetic Dataset Creation — Taking a dataset of 850+ coding problems from Hugging Face, feeding each one to the OpenAI API, and saving the responses to a CSV file.
- Fine-Tuning — Converting the dataset to JSON, loading it into Google Colab, splitting into testing and training sections, tokenizing the data, loading the model with 4-bit quantization, setting up Parameter-Efficient Fine-Tuning (PEFT), configuring training arguments, and training.
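The fine-tuning steps above can be sketched as follows. This assumes the `datasets`, `peft`, and `transformers` libraries, a `model` and `tokenizer` already loaded in 4-bit, and a JSON file with `prompt`/`response` fields; the file name and hyperparameters (LoRA rank, batch size, learning rate) are illustrative assumptions, not the exact values used:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Load the converted dataset and split into training/testing sections.
dataset = load_dataset("json", data_files="lua_lessons.json")["train"]
split = dataset.train_test_split(test_size=0.1)

def tokenize(example):
    # Join prompt and response into one training sequence.
    text = example["prompt"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = split.map(tokenize, remove_columns=["prompt", "response"])

# Prepare the 4-bit model for training and attach small LoRA adapters,
# so only a tiny fraction of parameters are actually updated (PEFT).
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Small batch plus gradient accumulation to fit in limited VRAM.
args = TrainingArguments(
    output_dir="lua-teacher",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The LoRA adapters are what make training feasible on a single ~15 GB GPU: the quantized base weights stay frozen while only the low-rank adapter matrices receive gradients.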
Assessment
Two models were trained: a 4-billion-parameter Qwen3 model and a 7-billion-parameter CodeLlama model. After training, the Qwen model had a loss of about 0.836, while the Llama model ended at 0.875.
To evaluate how well these models learned to teach Lua, I compared them against the prompt-engineered GPT-4o model with which the original dataset was created, using the same prompt for all three: “Make me a Fibonacci sequence generator. The user can input the length of the sequence they want.”
GPT-4o (Baseline)
The prompt-engineered response gives a step-by-step breakdown of the process, then provides the code with comments, then breaks the code down piece by piece with detailed explanations. This is the prime example of what the fine-tuned models should be doing.
Qwen3 Fine-Tuned Model
Instead of giving an initial explanation and then generating code, the Qwen3 model generates sections of code and their step-by-step breakdowns simultaneously. It then outputs the entire script and explains key concepts. While the format differs, the overall teaching approach is preserved.
CodeLlama Fine-Tuned Model
This model followed the GPT format more closely—starting with an extremely in-depth plan, then giving the full Lua code, followed by another explanation and two external links. The links were a carryover from the CodeLlama base model, since the fine-tuning data contained no links. The Llama model also produces much longer responses than Qwen3, likely a byproduct of the difference in parameter count.
Although both models complete the task, they still have limitations. Short prompts tend to cause hallucinations, and the models are only competent at solving full coding problems; more specialized questions produce poor responses.
Challenges
- Dataset limitations — Finding a coding-problems dataset that was large enough and did not specify Python as the answer language.
- Hardware limitations — Google Colab’s GPU had only ~14.7 GB of VRAM, forcing the use of models with 7 billion parameters or fewer, model quantization, and reduced batch sizes.
- Runtime limitations — The free version of Google Colab limits daily GPU time, forcing training sessions under five hours.
- Accuracy trade-offs — Together, these limitations meant sacrificing some model accuracy in exchange for being able to complete every necessary task.
Future Avenues
- Access to better hardware would greatly improve fine-tuned model accuracy.
- Implementation of RAG with the Lua documentation could improve accuracy and detail.
- Models trained for other coding languages displayed in the same front end would increase usefulness.
Conclusion
This capstone project demonstrates the feasibility of creating a fine-tuned, locally run language model designed to teach users how to program in Lua—an area underserved by most large language models. By leveraging prompt engineering, synthetic dataset generation with GPT-4o, and lightweight fine-tuning techniques, I trained distilled models that emulate GPT-style instructional behavior.
Despite hardware limitations and challenges with quantization, the resulting models show clear potential as educational tools, especially for users in low-resource environments or those focused on niche programming languages. This work highlights the growing accessibility of LLM development and points toward a future where domain-specific, fine-tuned models can offer high-quality instruction without reliance on powerful cloud infrastructure.
References
- “Code Generation Lite · LiveCodeBench.” Hugging Face.
- Qwen Team. “Qwen3-4B.” Hugging Face.
- Code Llama Team. “CodeLlama-7B-HF.” Hugging Face.
- Vaswani, A., et al. “Attention Is All You Need.” arXiv, 2017.
- “What Is Model Distillation?” Labelbox Guides.