Can You Rely on Your Model Evaluation?
Improving Model Evaluation with Synthetic Test Data
Accurate model evaluation is essential for deploying reliable machine learning systems, yet traditional test sets often fail to represent real-world conditions such as subgroup imbalance, data sparsity, and distributional shifts. These limitations can cause models to appear accurate during evaluation while performing poorly in practice. To address this problem, the 3S-Testing framework uses synthetic data generation to create subgroup-targeted and distribution-shifted test samples that better approximate deployment scenarios. In this project, we reproduce the 3S approach and propose an improvement by replacing the original CTGAN generator with a Tabular Variational Autoencoder (TVAE), aiming to produce more stable and realistic shifted datasets. Using the Adult dataset, we compare CTGAN and TVAE synthetic test data against Oracle shifted test sets. Results show that CTGAN achieves lower error relative to the Oracle, indicating more accurate synthetic evaluation, while TVAE trains roughly three times faster and provides competitive performance. These findings demonstrate that generative models can significantly improve model evaluation quality, and that the choice of generator involves a trade-off between accuracy and computational efficiency.
Introduction
Model evaluation is one of the most critical aspects of machine learning. While model development often receives the most attention—hyperparameter tuning, feature engineering, and architectural design—the reliability of a model ultimately depends on the data used to evaluate it. Poorly constructed test sets can result in inflated performance scores that fail to generalize to real-world conditions.
Machine learning models deployed in practical settings frequently encounter shifts in data distributions, imbalanced subgroups, rare events, or entirely new patterns that were not present in the original test set. When this happens, models that were previously labeled “accurate” might perform poorly, leading to serious consequences in high-stakes domains such as finance and healthcare.
To address these issues, the research paper “Can You Rely on Your Model Evaluation?” proposes an approach called 3S-Testing: Synthetic data for Subgroup and Shift Testing. The idea is to use synthetic data generation—specifically utilizing a CTGAN model—to construct more representative and flexible test sets. The synthetic data augments or replaces unstable sections of the test distribution, allowing practitioners to evaluate models under a broader and more realistic set of conditions. Our final project examines this method, explains its components, and evaluates an improvement to the generative model by replacing the CTGAN architecture with a TVAE (Tabular Variational Autoencoder) model.
Paper Methodology
To evaluate 3S-Testing, the authors first trained the predictive models whose performance the framework would estimate. Three models were used for this task: a Multilayer Perceptron, a Gradient-Boosted Decision Tree, and a Random Forest. These models were chosen because their fundamentally different architectures exercise 3S-Testing more thoroughly.
Once this was complete, the authors then moved on to generating the synthetic testing data using a CTGAN model. After training the CTGAN, synthetic data was produced, and the performance of the predictive models was evaluated on four different test datasets: the unaltered real test set, a rejection-sampled version of the test set, the 3S-Testing synthetic test set, and the oracle (the true shifted dataset).
Model
The synthetic data generation was handled by a CTGAN, or Conditional Tabular Generative Adversarial Network. A CTGAN largely operates like a typical GAN, but with two key differences: it is designed specifically for tabular input, and, as the name suggests, it conditions generation on discrete column values to better capture the relationships and distributions within the data.
Training proceeds as follows. The CTGAN first selects a random discrete column and a category within it from the real data, then samples a real row matching that category to represent the true data. At the same time, the chosen column and category are one-hot encoded into a condition vector and fed into the generator along with Gaussian noise; the generator uses these inputs to create a fake row consistent with the condition. Finally, the real and fake rows are passed to the discriminator (or critic), which attempts to determine which is the real piece of data. This process is repeated, and both the discriminator and generator improve as training continues.
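The conditioning step above can be sketched in a few lines. This is a schematic illustration, not the real CTGAN implementation: the actual model weights category choices by log-frequency and trains neural networks, while here we only show how a (column, category) pick becomes the one-hot condition vector that is concatenated with noise:

```python
import random

def make_condition_vector(categories_per_column):
    """Pick a random column, then a random category within it.
    (Real CTGAN samples categories by log-frequency; uniform here for brevity.)"""
    col = random.randrange(len(categories_per_column))
    cat = random.randrange(categories_per_column[col])
    # One-hot encode the (column, category) choice into a single flat vector.
    vec = []
    for c, n_cats in enumerate(categories_per_column):
        one_hot = [0] * n_cats
        if c == col:
            one_hot[cat] = 1
        vec.extend(one_hot)
    return col, cat, vec

def generator_input(cond_vec, noise_dim=4):
    """Generator input = condition vector concatenated with Gaussian noise."""
    noise = [random.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return cond_vec + noise
```

Because exactly one position of the condition vector is hot, the generator always knows which category its fake row must exhibit, which is what lets CTGAN target rare categories during training.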
Use Cases
The goal of 3S-Testing is to correct weaknesses in the test data so that evaluation can be as thorough as possible:
- Underrepresented data — Synthetic generation can take underrepresented groups and generate more examples based on the few that exist as well as the overall trends in the dataset.
- Distributional shifts — Real-life shifts not accounted for in a test set can give entirely wrong results. Synthetic data can account for slight shifts in data over time.
- Noise reduction — Generating synthetic data can also smooth out noise present in a real test set, producing more stable performance estimates by decreasing randomness.
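The underrepresented-data use case above can be sketched as simple filtering over a generator's output. The `sample_rows` function below is a stand-in for a trained synthesizer (a real 3S pipeline would draw from a fitted CTGAN, possibly via conditional sampling); the point is that synthetic generation lets us amplify a subgroup to any size we need:

```python
import random

def sample_rows(n):
    """Stand-in for a trained generator: returns synthetic (age, income) rows."""
    return [{"age": random.randint(17, 90), "income": random.choice([0, 1])}
            for _ in range(n)]

def sample_subgroup(predicate, target, batch=1000, max_batches=100):
    """Draw from the generator until `target` rows satisfy `predicate`,
    amplifying a subgroup that is rare in the real test set."""
    kept = []
    for _ in range(max_batches):
        kept.extend(r for r in sample_rows(batch) if predicate(r))
        if len(kept) >= target:
            return kept[:target]
    return kept  # generator could not produce enough matching rows

# e.g. oversample people older than 70, a sparse region of the Adult test split
elderly = sample_subgroup(lambda r: r["age"] > 70, target=200)
```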
Paper Results
The synthetically generated test data was evaluated by comparing the four testing datasets. All three predictive models were assessed, and their results were summarized using Mean Absolute Error (MAE): the average absolute difference between the performance estimated on each test set and the performance measured on the oracle, without considering direction.
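The MAE computation is straightforward; the sketch below uses hypothetical per-model accuracy numbers purely for illustration:

```python
def mean_absolute_error(estimates, oracle):
    """Average |estimated accuracy - oracle accuracy|, ignoring sign,
    across the evaluated models (or subgroups)."""
    assert len(estimates) == len(oracle)
    return sum(abs(e - o) for e, o in zip(estimates, oracle)) / len(estimates)

# hypothetical per-model accuracy estimates from one test set vs. the oracle
est = [0.84, 0.79, 0.81]
orc = [0.80, 0.80, 0.80]
print(round(mean_absolute_error(est, orc), 3))  # 0.02
```

A lower MAE means the test set's performance estimates track the oracle more closely, which is exactly the sense in which 3S-Testing "wins" below.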
Of the four testing datasets, the unaltered real test data performed the worst, followed by the rejection-sampled version. The best accuracy (excluding the oracle) came from 3S-Testing: even with only one feature it achieved the lowest error, it maintained that lead as features were added, and with enough features it drew extremely close to the oracle's performance.
Our Contribution
To build on the results of the paper, we aimed to improve the quality and accuracy of the generated synthetic data. A gap we recognized was that the authors exclusively tested CTGAN-generated data against their oracle dataset. While effective, GANs have known limitations, primarily training instability and mode collapse. This motivated us to investigate whether another model could yield better results.
After research, we chose and implemented a Tabular Variational Autoencoder (TVAE). Unlike CTGAN’s generator-discriminator competition, this model uses a reconstruction-based training approach. TVAE consists of two jointly trained, densely connected neural networks:
- Encoder (E): Takes an input vector x and maps it to a latent distribution parameterized by a mean and variance. The encoding compresses the data into a smoother, lower-dimensional latent vector z.
- Decoder (D): Takes the sampled latent vector z and attempts to reconstruct the original x. The reconstruction error is propagated back through both D and E.
Through this process, the model learns a structured and continuous latent space from which it can sample new z to pass through D and generate synthetic data. Because TVAE does not rely on adversarial training and only has one gradient update per input, it tends to converge faster and more reliably than CTGAN.
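The two steps that distinguish this process, sampling z during training (the reparameterization trick) and decoding fresh z ~ N(0, I) at generation time, can be sketched as follows. The lambda "decoder" is a toy stand-in for the trained network, used only to keep the example self-contained:

```python
import math
import random

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, with eps ~ N(0, 1): lets the reconstruction
    error propagate back through the sampling step during training."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def generate(decoder, latent_dim, n):
    """After training, new rows come from decoding z ~ N(0, I) directly;
    no adversary is involved, which is why TVAE converges more simply."""
    return [decoder([random.gauss(0.0, 1.0) for _ in range(latent_dim)])
            for _ in range(n)]

# toy linear "decoder" standing in for the trained network
rows = generate(lambda z: [2 * v + 1 for v in z], latent_dim=3, n=5)
```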
Experimental Design
The original paper used many modular Python notebooks to test the different use cases for CTGAN-generated data. The modularity allowed us to substitute TVAE for CTGAN without disrupting the original pipeline. The primary modification was replacing the CTGAN instantiation and training block with a TVAESynthesizer from the SDV library. We also had to make additional adjustments during pre-processing since TVAE outputs raw categorical values by default.
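Because both SDV synthesizers expose the same fit/sample interface, the substitution reduces to changing one constructor. The sketch below illustrates that interchangeability with stub classes rather than the real SDV `CTGANSynthesizer`/`TVAESynthesizer` (so it stays runnable without the library); the pipeline function mirrors the shape of our modification:

```python
import random

class StubCTGAN:
    """Stand-in for SDV's CTGANSynthesizer: fit(rows) then sample(n)."""
    def fit(self, rows):
        self.rows = rows
    def sample(self, n):
        return [random.choice(self.rows) for _ in range(n)]

class StubTVAE(StubCTGAN):
    """Stand-in for TVAESynthesizer: identical interface, different internals."""
    pass

def build_test_set(synthesizer, real_rows, n):
    """The pipeline only touches fit/sample, so the generator is pluggable."""
    synthesizer.fit(real_rows)
    return synthesizer.sample(n)

real = [{"age": 39, "income": 0}, {"age": 52, "income": 1}]
for gen in (StubCTGAN(), StubTVAE()):   # the one-line swap in the real pipeline
    synthetic = build_test_set(gen, real, n=100)
```

In the actual notebooks the swap was this constructor change plus the pre-processing adjustments described above for TVAE's raw categorical output.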
We chose to work on marginal_shift_adult.py and the Adult dataset because it was the most frequently referenced in the paper and the most readily available dataset.
Results
To measure the success of our TVAE method, we generated two synthetic Adult test sets with a fair distribution of data points across the age range: one with CTGAN and one with TVAE. We measured the MAE relative to the oracle at each age quantile to see how well the models represented the skewed ends of the distribution.
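The per-quantile comparison can be sketched as follows. The rows, estimates, and oracle values here are hypothetical placeholders (per-row correctness indicators for six test rows), chosen only to show the bucketing mechanics:

```python
def quantile_buckets(values, n_buckets=3):
    """Split row indices into n_buckets by sorted value (age tertiles here)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(order) // n_buckets
    return [order[k * size:(k + 1) * size] for k in range(n_buckets)]

def mae_per_quantile(ages, estimates, oracle, n_buckets=3):
    """MAE vs. the oracle within each age quantile, exposing how well a
    generator covers the skewed ends of the distribution."""
    maes = []
    for idx in quantile_buckets(ages, n_buckets):
        maes.append(sum(abs(estimates[i] - oracle[i]) for i in idx) / len(idx))
    return maes

# hypothetical per-row correctness indicators for six test rows
ages = [18, 22, 35, 40, 60, 75]
est  = [1, 0, 1, 1, 0, 1]
orc  = [1, 1, 1, 1, 1, 1]
print(mae_per_quantile(ages, est, orc))  # [0.5, 0.0, 0.5]
```

A generator that underrepresents the tails of the age distribution shows up as elevated MAE in the first and last buckets.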
Across all three quantiles, CTGAN achieved a lower MAE relative to the oracle than TVAE. However, it is also worth noting that TVAE trained significantly faster, converging in a third of the time it took CTGAN. TVAE also still performed well overall, producing more accurate results than the rejection-sampling method.
Discussion
Our results indicate that CTGAN outperforms TVAE in improving prediction accuracy under marginal shifts. Although TVAE is a more efficient and stable model, it may have been inherently limited by its design. While TVAE’s smooth latent distribution may make it less susceptible to noise, it may also dampen sharp relationships in the data and fail to capture them properly.
Another key limitation stems from the Adult dataset itself. Many important predictors—such as occupation, country, and marital status—are high-cardinality categorical values. TVAE generally does not handle these well. Before training, categorical variables must be one-hot encoded, which dramatically increases dimensionality and removes potential relationships between categories.
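The dimensionality blowup is easy to quantify. The category counts below are approximate, drawn from the UCI description of Adult (including the "?" missing-value category), and are illustrative rather than exact:

```python
def one_hot_width(cardinalities):
    """Each categorical column expands to one column per category."""
    return sum(cardinalities)

# approximate category counts for Adult's categorical columns (illustrative)
adult_cats = {"workclass": 9, "education": 16, "marital-status": 7,
              "occupation": 15, "relationship": 6, "race": 5,
              "sex": 2, "native-country": 42}
print(one_hot_width(adult_cats.values()))  # 102 one-hot columns from 8 originals
```

Eight categorical columns become roughly a hundred sparse binary ones, and any ordinal or semantic relationship between categories (e.g. education levels) is discarded in the process.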
However, performance may have been better on a dataset with predominantly numerical data and more linear relationships. Considering the efficiency and stability of TVAE compared to CTGAN, it may be worthwhile to further investigate TVAE accuracy on such datasets in cases where efficiency is important.
Conclusion
Reliable model evaluation depends on sufficient, well-distributed test data, yet real datasets often contain underrepresented subgroups or fail to cover distributional shifts. The 3S-Testing framework addresses this problem by using CTGAN to approximate model performance under controlled shifts.
In our project, we extended the 3S-Testing framework by replacing CTGAN with TVAE to evaluate whether a non-adversarial model could produce more realistic synthetic data under distributional shifts. TVAE had practical advantages like faster training and reduced susceptibility to noise, but CTGAN ultimately performed better in reproducing the oracle data. However, the structure of our dataset may have inhibited some of TVAE’s strengths.
Overall, our findings reinforced the effectiveness of synthetic data in evaluating models, but highlighted that different models may perform better or worse depending on the context. While CTGAN remains well-suited for complex, high-cardinality datasets, TVAE offers a promising alternative for smoother, more numerical datasets. Future work exploring additional datasets and generator architectures may offer more insight into which method is most appropriate in each scenario.
References
- “Generating synthetic tabular data,” Towards Data Science.
- “SDV (Synthetic Data Vault),” GitHub repository. github.com/sdv-dev/SDV
- “Why synthetic tabular data beats sampling — SDV Hackathon Talk,” YouTube.
- “Variational Autoencoder Structure,” ResearchGate.
- “Adult Data Set,” UCI Machine Learning Repository. archive.ics.uci.edu