MIMS Final Project 2026

BenchBox

Current LLM benchmarks are vulnerable to spurious statistical exploitation: models can achieve high scores by learning dataset-specific cues rather than demonstrating genuine task competence, which produces unstable results that fail to predict real-world performance. If the goal of input perturbation is the robust evaluation of LLMs, then the set from which input perturbations are constructed should cover the different contexts in which people actually prompt.

We propose a novel framework for generating systematic input perturbations to evaluate the robustness of LLM performance and to support custom benchmark development. We use WildChat, the Allen Institute's latest dataset, to create contextual grammars and vocabularies for input perturbation. The research question framing our work is: "Can perturbation-based evaluation with noisy inputs predict real-world deployment performance better than traditional benchmarks?" At a broad level, our tasks are to:

  1. Sample the WildChat dataset (1M real ChatGPT sessions) and mine it for natural prompting patterns
  2. Identify systematic patterns and variations in prompts, using clustering and topic modeling to find prompt themes from which to construct the corresponding grammars and vocabularies
  3. Generate perturbations that reflect authentic human prompting and reformulation strategies
  4. Have experts review a selection of our input perturbations to ensure relevance and consistency for the tasks measured
  5. Use the perturbed inputs in conjunction with established benchmarks and tasks to evaluate model performance across the space of natural variation among users
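
As a rough illustration of steps 3 and 5, the sketch below expands one benchmark question through a small slot grammar and scores a model across the resulting variants. Everything in it is an assumption made for illustration: the `GRAMMAR` slots are hand-written rather than mined from WildChat, `generate_perturbations` and `robustness_report` are hypothetical helper names, and `brittle_model` is a toy stand-in for a real LLM call.

```python
import itertools
import random

# Hypothetical mini-grammar: each slot lists interchangeable phrasings.
# In the project these would be derived from clustered WildChat prompts;
# here they are hand-written placeholders.
GRAMMAR = {
    "opener": ["", "hey, ", "Quick question: "],
    "request": ["what is {q}", "can you tell me {q}", "pls compute {q}"],
    "closer": ["?", "? thanks", ""],
}


def generate_perturbations(question, grammar=GRAMMAR, k=None, seed=0):
    """Expand the grammar into surface variants of one benchmark question."""
    variants = [
        f"{opener}{request.format(q=question)}{closer}"
        for opener, request, closer in itertools.product(
            grammar["opener"], grammar["request"], grammar["closer"]
        )
    ]
    # Optionally subsample with a fixed seed so runs are reproducible.
    if k is not None and k < len(variants):
        variants = random.Random(seed).sample(variants, k)
    return variants


def robustness_report(model, question, expected):
    """Score a model on every variant of a question, not just one phrasing."""
    variants = generate_perturbations(question)
    correct = [model(v) == expected for v in variants]
    return {
        "n_variants": len(variants),
        "accuracy": sum(correct) / len(correct),
        "all_correct": all(correct),
    }


def brittle_model(prompt):
    """Toy stand-in for an LLM that keys on a benchmark-style surface cue."""
    if prompt.startswith("what is") and prompt.endswith("?"):
        return "4"
    return "unsure"


if __name__ == "__main__":
    print(brittle_model("what is 2 + 2?"))
    print(robustness_report(brittle_model, "2 + 2", "4"))
```

In this toy setup the model answers the canonical phrasing "what is 2 + 2?" correctly but gets only 1 of the 27 variants right, which is the kind of gap between benchmark score and performance under natural variation that the framework is meant to expose.
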
Last updated: May 14, 2026