Vigil is what my Master’s thesis eventually solidified into, but it did not begin that way. It started as a topic that sounded impressive and vague in equal measure: “testing in an LLM-driven world”, somewhere in the orbit of DATS and modern NLP infrastructure. I spent months circling that space, trying different framings, discarding them, and learning the same lesson repeatedly: the failures that matter are rarely clean bugs with a crisp oracle. They are behavioural shifts under slightly different conditions, paired with outputs that can still look plausible enough to pass a casual glance.
From a topic to something runnable
The thesis leans into a simple observation: as systems increasingly depend on pretrained components and service-style pipelines, behaviour is often induced rather than authored. Expected behaviour is rarely written down in a way you can point to, and outputs can shift because of input handling, configuration, dependencies, or execution environment even when application code stays the same. The practical problem is not that tests do not exist. The practical problem is that the usual testing assumptions degrade quietly when behaviour is weakly specified and shaped by many interacting layers.
At some point I stopped trying to keep the scope broad. “Testing LLMs” is too large as a headline and too slippery as a method. I narrowed the thesis to behavioural verification of evolving systems built around black-box components, because that was the part I could actually make precise and defend. The black-box stance is not a provocation. It is realism. In most real settings, you do not have a clean handle on internals, and even if you do, that is rarely where you want to state your expectations. You care about interface-level behaviour under declared conditions.
A very mundane failure pushed this harder. I tried to get DATS running locally and could not. Wrong hardware, hours lost, no clean path forward. I did get access to a dev instance later, but the local failure was the point: a thesis framework should not depend on one specific research system being installable on my machine. If the approach is meant to say something about evolving systems in practice, it needs to work with whatever execution surface you can realistically access, including remote services and pipelines you do not control end to end.
What Vigil makes explicit
The framework itself is not a black box. It is deliberately explicit. Its job is to make execution conditions and behavioural expectations visible enough that you can compare runs without relying on memory and gut feeling.
A specification declares three things: inputs, variation, and checks. That is it. The spec is intentionally small because anything larger becomes a second codebase.
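Schematically, a spec is nothing more than those three declarations bundled together. The sketch below is a simplification for illustration, not the actual spec syntax; the names are mine, not the API's:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Spec:
    inputs: list[Any]                          # what gets executed
    variations: list[Callable[[Any], Any]]     # declared, controlled changes
    checks: list[Callable[[list[Any]], bool]]  # expectations over the results

# A toy spec: two inputs, two casing variations, one agreement check.
spec = Spec(
    inputs=["The cat sat on the mat.", "A dog barked."],
    variations=[str.lower, str.upper],
    checks=[lambda outputs: len(set(outputs)) == 1],  # outputs should agree
)
```

The whole point of keeping it this small is that the spec stays readable next to the code it tests instead of becoming a parallel system you also have to maintain.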
Variation is controlled and declared. Each variation carries an explicit intent: it modifies the inputs, the function-level configuration, or the environment context. That declared intent matters because it keeps provenance honest. If a behavioural difference shows up, you should be able to tell whether it came from an input transformation, a runtime parameter change, or an environment-level change, and you should be able to rerun the same change later.
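In code, a declared intent can be as lightweight as an enum attached to each variation. This is a sketch with invented names, not how Vigil actually spells it:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

class Intent(Enum):
    INPUT = "input"              # transforms the input itself
    CONFIG = "config"            # changes a function-level parameter
    ENVIRONMENT = "environment"  # changes the execution context

@dataclass
class Variation:
    name: str
    intent: Intent
    apply: Callable[[Any], Any]

# Rerunnable later because the change is declared, not ad hoc:
whitespace_noise = Variation(
    name="trailing-whitespace",
    intent=Intent.INPUT,
    apply=lambda text: text + "  ",
)
```

Because the variation is a named object with an intent, a behavioural difference observed under it can be attributed and reproduced instead of reconstructed from memory.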
Each execution becomes a slice: output plus the context needed to treat that output as evidence. The context includes which input produced it, what variation was active, what backend configuration applied, and any metadata the backend can provide. Without recorded slices, “it changed” tends to become folklore, especially once systems evolve and you can no longer reliably reproduce last week’s conditions.
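A slice, in this sketch, is simply the output frozen together with that context. Again, field names are illustrative, not the real data model:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Slice:
    output: Any            # what the execution produced
    input_id: str          # which input produced it
    variation: str         # which declared variation was active
    backend_config: dict   # backend configuration that applied
    metadata: dict         # anything else the backend can report

s = Slice(
    output="POSITIVE",
    input_id="review-0042",
    variation="trailing-whitespace",
    backend_config={"model": "some-model", "temperature": 0.0},
    metadata={"latency_ms": 18},
)
```

Freezing the record is deliberate: a slice is evidence, and evidence you can mutate afterwards is not worth much.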
Checks operate over collections of slices. Some expectations are about agreement with references where references exist. Some are about relational consistency across variations. Some are diagnostic signals that do not assert but help interpret what changed. The point is not to reduce behaviour to one score. The point is to make comparisons structured, repeatable, and inspectable.
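A relational check over slices can be as small as grouping outputs by input and reporting where variations disagree. Schematic again, with plain dicts standing in for slices:

```python
from collections import defaultdict

def relational_agreement(slices):
    """Group slices by input; report inputs whose outputs
    disagree across variations."""
    by_input = defaultdict(list)
    for s in slices:
        by_input[s["input_id"]].append(s["output"])
    return {i: outs for i, outs in by_input.items() if len(set(outs)) > 1}

# Two variations ran on two inputs; only "doc-1" drifted.
runs = [
    {"input_id": "doc-1", "output": "POSITIVE"},
    {"input_id": "doc-1", "output": "NEGATIVE"},
    {"input_id": "doc-2", "output": "NEUTRAL"},
    {"input_id": "doc-2", "output": "NEUTRAL"},
]
```

Note that the check returns the disagreements rather than a single score, which is exactly the structured-and-inspectable property the text is after.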
How the model became coherent
A lot of the work was not inventing new concepts but cutting down to something implementable and usable. Early on I had a much larger taxonomy: more variation modes and more check families. That version would have been an engineering nightmare and a conceptual mess.
Practice forced simplification, and one redesign mattered more than the rest. I initially split checks into reference, relational, and diagnostic, and treated "assertiveness" as a property of each category itself. Then I noticed a mismatch in my own case studies: explicit reference checks were barely used. Most comparisons were relational checks with an implicit baseline.
That mismatch was useful feedback, so I changed the framework instead of forcing usage. References became first-class again, whether supplied externally or derived internally from a baseline execution, and reference checks became genuinely usable instead of a theoretical label. Relational comparisons still exist as group-oriented checks, and diagnostics remain available when the right output is a signal rather than a pass or fail.
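The baseline-derived variant is easy to sketch: run once, keep the outputs keyed by input, and compare later runs against them. This is a simplified illustration with plain dicts standing in for slices, not the framework's actual machinery:

```python
def derive_references(baseline_slices):
    """Turn a baseline run into internal references, keyed by input."""
    return {s["input_id"]: s["output"] for s in baseline_slices}

def reference_check(slices, references):
    """Compare each slice against its reference; return mismatches
    as (input_id, expected, actual) tuples."""
    return [
        (s["input_id"], references.get(s["input_id"]), s["output"])
        for s in slices
        if s["output"] != references.get(s["input_id"])
    ]

baseline = [{"input_id": "doc-1", "output": "POSITIVE"}]
later_run = [{"input_id": "doc-1", "output": "NEGATIVE"}]
```

The same check works whether the references were supplied externally or derived from a baseline execution, which is what makes the category genuinely usable rather than theoretical.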
This week I did the last major architecture overhaul and froze the design. Vigil had been in development for months, but these changes were the first time the whole thing felt properly coherent: a small spec that stays usable, explicit variation with intent, provenance-rich slices as evidence, and checks that match how behaviour is actually compared in practice.
Why the case studies matter
The case studies are not decorative examples and they are not there to make the repository look busy. They are the pressure test. They force Vigil to deal with different execution styles and different sources of behavioural change.
Some studies are closer to research infrastructure, others feel closer to what an engineer would actually run in a repo. Right now the spaCy and LLM studies are the most practically grounded, and they are the ones I am expanding next because they map most directly to “this could catch a real regression in a real pipeline”.
That is Vigil in its first finished shape. The next post will be less about the framework and more about the thesis process itself: writing it, narrowing it, and getting it to a point where it can survive being read by other people.