Open research in AI evaluation, behavioural measurement, and specification discovery for agentic AI assurance.
Threshold Systems is the research arm of Threshold Signalworks. We study how instability enters agentic AI systems during inference, tool use, and autonomous workflow execution, and we build measurement and intervention tools grounded in that understanding.
Research and commercial separation. Threshold Systems research produces open artefacts: behavioural failure taxonomies, probe suites, run-envelope schemas, reproducible measurement methods, and technical reports. These are designed to be inspectable, citable, and reusable independently of any commercial product. Commercial tooling, hosted services, and enterprise integrations are developed separately by Threshold Signalworks.
Can we build reproducible measurements that detect safety-relevant drift, instability, and premature convergence in agentic AI systems before visible deployment failure? That question drives Driftwatch, which treats behavioural variance not as noise to be averaged away but as a safety-relevant signal. The intended capability is a specification-discovery layer for agentic AI assurance: identifying which behavioural properties are stable enough to specify, bound, or formalise, and which are too unstable to support meaningful guarantees.
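As a minimal sketch of the repeat-evaluation idea, consider running the same probe many times and reporting the dispersion across runs rather than only the mean. The names here (`run_probe`, the returned report fields) are illustrative assumptions, not Driftwatch's actual API; `run_probe` is assumed to return a scalar behavioural metric for a single run.

```python
from statistics import mean, pstdev
from typing import Callable


def instability_report(run_probe: Callable[[], float], n_runs: int = 10) -> dict:
    """Repeat one behavioural probe and surface dispersion as the signal.

    `run_probe` is assumed to return a scalar behavioural metric
    (e.g. a task-success score or trajectory length) for one run.
    """
    scores = [run_probe() for _ in range(n_runs)]
    mu = mean(scores)
    sigma = pstdev(scores)
    return {
        "runs": n_runs,
        "mean": mu,
        # Coefficient of variation: spread relative to the mean.
        # A high value flags repeat-evaluation instability even when
        # the average score alone looks acceptable.
        "instability": sigma / mu if mu else float("inf"),
        "scores": scores,
    }
```

Averaging the same scores would hide exactly the variance this report surfaces; that inversion is the core of treating variance as a safety-relevant signal.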
Driftwatch's planned outputs: a taxonomy of process-level behavioural failure modes (drift, premature convergence, instruction-integrity failure, repeat-evaluation instability); structured probe suites; a measurement harness producing deterministic, auditable run-envelope artefacts with trajectory metrics, comparison reports, and instability scores; and a technical report connecting empirical behavioural signatures to the requirements of formal assurance.
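The run-envelope schema itself has not yet been released; as a hedged illustration only, a single artefact might carry fields along these lines. Every field name below is an assumption about what a deterministic, auditable record would need, not the published schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunEnvelope:
    """Illustrative sketch of one auditable probe-suite execution record."""
    probe_suite_id: str        # which structured probe suite was run
    model_id: str              # system under evaluation
    seed: int                  # fixed seed for deterministic replay
    trajectory_metrics: dict   # per-step measurements along the run
    instability_scores: dict   # e.g. repeat-evaluation dispersion
    failure_modes: list = field(default_factory=list)  # taxonomy labels observed
    artefact_hash: str = ""    # content hash for provenance chaining
```

Freezing the record and carrying the seed and content hash are design choices that make the artefact replayable and tamper-evident rather than merely descriptive.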
Before agentic AI systems can be formally specified or verified, we need to know whether their relevant behaviours are stable, inspectable, and specification-ready. Driftwatch contributes to this empirical groundwork. It complements rather than competes with formal methods.
Beyond Driftwatch, the programme spans cognitive architecture under constraint, human decision-making in high-uncertainty environments, and the intersection of neurodiversity and epistemic design. Work across these domains informs the measurement frameworks and failure-mode taxonomies used in the AI evaluation research.
Public artefact packs (evaluation runs, reports, provenance chains, probe suites) will appear here as they are released. Artefacts are designed to be human-auditable and reproducible.
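To make "human-auditable and reproducible" concrete, here is a hedged sketch of how a provenance chain over released artefacts could be re-verified from the files alone. The on-disk layout assumed here (JSON records with "payload" and "prev_hash" keys) is illustrative, not the released format.

```python
import hashlib
import json


def verify_chain(artefacts: list[dict]) -> bool:
    """Return True if every artefact correctly commits to its predecessor.

    Assumes each artefact is a dict with a "payload" (the artefact
    content) and a "prev_hash" (hex digest of the previous link, or ""
    for the first artefact). This layout is an assumption for
    illustration only.
    """
    prev_hash = ""
    for artefact in artefacts:
        if artefact["prev_hash"] != prev_hash:
            return False
        # Canonical serialisation so any auditor recomputes the same
        # digest; each link commits to both its payload and the hash
        # of its predecessor.
        body = json.dumps(
            {"payload": artefact["payload"], "prev_hash": artefact["prev_hash"]},
            sort_keys=True,
        ).encode()
        prev_hash = hashlib.sha256(body).hexdigest()
    return True
```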
Brian McCallion
ORCID: 0009-0004-1442-1743
Contact: brian@thresholdsignalworks.com