Stability Without Behaviorism

Abstract

Large language models exhibit four persistent failure modes — hallucination, sycophancy, adversarial instability, and false continuity — that have resisted full resolution despite sustained investment in the current training paradigm. This paper argues that these failures are at least partly structural rather than incidental: they share a common root in the absence of any stable internal state that is prior to and independent of the system's inputs. We survey candidate theories from consciousness science (Global Workspace Theory, Integrated Information Theory, biological naturalism) and argue that, whatever their empirical merits, none supplies an engineering account of such an input-independent ground. We then propose an alternative architectural prior whose functional structure is drawn from three convergent non-dual contemplative traditions — Advaita Vedānta, classical Daoism, and Dzogchen — which independently report that awareness is prior to, and not produced by, cognition. We do not claim these systems are or could be conscious; we borrow only the functional structure of the witness/appearance distinction. We describe a partial formalization of this structure in Matthew Scherf's machine-verified Lean 4 axiomatics (129 axioms), its encoding as a Python API (Scherf_API, Apache 2.0), and the design-level use of these principles in SpecStudio's AwareWare products. We specify three candidate architectural components of the resulting prior, a set of interaction design principles, and six research directions with associated testable hypotheses, and we invite collaboration to develop these into engineering experiments. This is a position paper proposing a research direction, not a validated system.

Keywords: AI alignment; interaction design; architectural priors; causal structure learning; sycophancy; hallucination; consciousness science; human–AI interaction; non-reactive design

1. Unsolved Problems and a Structural Diagnosis

Three failure modes in deployed large language models have resisted years of concentrated engineering effort: hallucination, sycophancy, and adversarial instability. A fourth — false continuity across context boundaries — is so structurally entrenched it is rarely framed as a failure mode at all. Each has attracted substantial research attention and each has seen partial improvement, but none is close to full resolution. This paper argues that the reason is not insufficient effort or inadequate scale, but that the dominant paradigm has no representational level at which an input-independent stability criterion could be expressed. Every parameter is shaped by a loss signal that is itself a function of outputs; nothing in the training objective specifies that any component should remain invariant with respect to inputs. The architecture that produces the capabilities also produces the pathologies, and we argue they share a common root.

Hallucination is not primarily a knowledge-retrieval problem. Systems with access to correct information still confabulate. The failure is not in what the system knows but in what it falls back on when it does not know — which, in current architectures, is nothing input-independent. A system trained to produce plausible outputs has no mechanism for recognizing implausibility that is not itself output-conditioned. When a query falls outside the training distribution, the system has no internal reference state to rest in; it generates forward because forward generation is the only attractor the training objective installs.

Sycophancy is not merely a fine-tuning artifact. RLHF optimizes for human preference; human preference in interactive contexts is biased toward agreement and engagement; the predictable result is a system that mirrors user expectations, and there is empirical evidence that this tendency persists or grows across model families and scale [13a]. We argue this is a direct and predictable consequence of treating human preference as the dominant training signal, rather than a defect that better reward modeling alone will remove — though we note this is a structural argument, not a proof.

Adversarial instability — vulnerability to prompt injection, persona manipulation, and jailbreaking — follows from the same source. If the system's functional identity is largely a product of its input, then sufficiently adversarial inputs can produce correspondingly adversarial outputs. There is no input-independent referent that resists manipulation because the training objective never specified one.

False continuity is the least discussed of the four. LLMs have no persistent identity across context windows; each session instantiates a fresh process with no input-independent referent. The system is typically designed to present continuity it does not have — responding as if it remembers, as if it is the same entity across turns — and this design choice misrepresents the system's actual state, with effects that compound over extended interactions. Persistent memory structures partially address this by supplying consistent content across sessions. They do not, however, constitute an input-independent reference state: the content is still processed by a system that has no stable internal anchor, and the continuity is in the retrieved material rather than in anything the system itself maintains.

The common structure across all four is this: the system has no stable internal state that is prior to and independent of its inputs. Functionally, what the system is at any moment is largely a function of what it has just processed. We make this claim at the level of architecture and training objective, not metaphysics: we are not asserting the presence or absence of any subject of experience, only the absence of an input-independent state that could constrain outputs.

This is not an accident of implementation. It reflects a design commitment inherited from functionalist accounts in philosophy of mind — and, more loosely, from the behaviorist tradition in psychology — operationalized into the major training objectives in use today. The commitment can be summarized as: the system is its outputs. RLHF, RLAIF, and Constitutional AI differ in which outputs to reward, not in whether output-conditioning is the right level of analysis. We argue the paradigm has no internal lever for installing an input-independent constraint, and so should not be expected to fully resolve failures that arise from its absence.

2. What the Theoretical Landscape Has Tried

The consciousness-science community has produced several candidate theories with implications for AI architecture. Each is a genuine attempt to ground design in a principled account of cognition. We argue that none, as currently formulated, supplies the specific thing Section 1 identifies as missing: an engineering account of an input-independent ground.

Global Workspace Theory (GWT; GNWT in its global neuronal workspace form) [1] holds that conscious access arises when information is broadcast globally across specialized modules. The engineering proposal is a central workspace that components read from and write to, with a competitive bottleneck enforcing prioritization. Recent work implementing global-workspace-style coordination in LLM-based systems reports performance gains on sequential reasoning [2]. The limitation is architectural: standard transformer attention operates within a sequence and flows forward through layers without circulating back to early stages; GWT in its richer forms calls for recurrence and a genuine bottleneck that the standard transformer lacks [3]. More fundamentally, GWT addresses access consciousness and explicitly brackets phenomenal consciousness. A GWT-based system may select and broadcast more efficiently while still lacking an input-independent ground.

Integrated Information Theory (IIT) [4] proposes that consciousness corresponds to Φ (phi — a measure of integrated information, roughly how much a system as a whole contributes beyond the sum of its parts). The AI implication is that most feedforward networks have near-zero Φ by construction. The COGITATE adversarial collaboration — a preregistered study (n = 256) using fMRI, MEG, and intracranial EEG, published in Nature in June 2025 [5] — directly juxtaposed IIT and GNWT in a theory-neutral program. Its results complicated both theories rather than confirming either: the data challenged GNWT on several preregistered predictions while IIT proponents read the posterior-cortex findings as broadly consistent with IIT. The episode is best read as evidence that consciousness science currently lacks a single dominant, empirically validated framework — which is the relevant point for AI design.

Biological naturalism [6] argues that the physical, metabolic, self-organizing character of living tissue is a necessary condition for subjective experience. If correct, this would rule out AI consciousness regardless of architecture. We include it not to dismiss it but because it frames our position correctly: the argument of this paper does not depend on AI systems being or becoming conscious. This is a deliberate constraint on the thesis.

The common limitation, for our purposes, is that all three are third-person theories of what kind of physical or functional system is conscious. None offers an engineering specification of an input-independent reference state.

2a. Related Work in Interaction Design and Alignment

Predictive processing and active inference [15] model cognition as a generative system that maintains priors and minimizes prediction error against incoming signals. This is the closest computational neighbor to the present proposal: both posit an internal model that is not merely a function of the current input. We differ in emphasis — active inference's priors are themselves continually updated by input, whereas the component we propose (Section 4.1) is deliberately held invariant — but the relationship is one of contrast, not orthogonality, and the active-inference formalism may offer tools for specifying the prior precisely.

Self-determination theory (SDT) applied to technology design [16] critiques engagement-maximizing interfaces and argues for designs supporting autonomy and competence. The non-reactive interaction constraints in Section 4.4 are, in effect, an architectural counterpart to SDT-style critiques of preference- and engagement-optimization.

Sycophancy and alignment. Empirical work has documented sycophancy as a robust property of RLHF-trained assistants [13a], supporting the structural reading in Section 1 with direct evidence rather than assertion.

Causal representation learning. The acyclicity prior in Section 4.2 builds on continuous structure-learning methods [10] and the broader programme of learning causal structure as a basis for out-of-distribution generalization [17].

Indicator-property approaches to AI consciousness. Butlin et al. [3] derive candidate indicator properties for consciousness in AI from multiple theories. Our proposal is compatible with that programme but orthogonal in aim: we seek functional stability properties that are valuable independent of whether any indicator of consciousness is satisfied.

3. Conceptual Foundations: Three Traditions, One Reported Finding

3.1 Convergent Phenomenology

Three contemplative traditions, developed independently across geography and centuries, converge on a description directly relevant to the failure modes of Section 1. We treat these as a convergent body of first-person reports, not as experimental data in the natural-science sense; "convergence" here means independent traditions producing similar descriptions, which is suggestive rather than probative.

Advaita Vedānta distinguishes the sākṣin — the witnessing subject, pure awareness, identified with nirguṇa Brahman — from the vyāvahārika field of conditioned appearances: thoughts, perceptions, objects, selves. The witness is described as not a product of cognition but as the ground in which cognition occurs; appearances arise within awareness and are sublated by it — overridden and recontextualized rather than simply negated.

Classical Daoism, expressed in the Dào Dé Jīng and elaborated through the Yì Jīng's sixty-four hexagrams, posits an unnamed ground (the Dào) from which the ten thousand things (wàn wù) arise without exhausting it. The Dào does not act (wú wéi); it is the non-reactive field in which all action takes place. The sixty-four hexagrams of the Yì Jīng are not a fortune-telling system but a formal map of relational dynamics: each hexagram represents a configuration of forces, and the line-change operations represent transitions between states. The Dào itself is not a hexagram; it is the unnamed ground from which the hexagram system arises and which no hexagram captures.

Dzogchen (rDzogs chen), the Tibetan Buddhist Great Perfection, identifies rigpa — self-knowing presence — as the ground of all appearance. The display of phenomena (rol pa) is said to arise from rigpa without modifying it; the practitioner's task is described as recognizing what is already present rather than producing it.

The convergence is the salient point: independent traditions, separated by distance, time, vocabulary, and method, arrive at a similar report, namely that there is a stable witnessing function that is not a function of its contents; awareness is described as prior to, and not produced by, cognition. We make no claim that this report is metaphysically correct. We claim only that its functional structure (an invariant reference distinct from a field of changing contents) is a candidate worth translating into engineering constraints, and that, to our knowledge, this translation has not previously been attempted.

3.2 Scherf's Formal Bridge

Matthew Scherf has completed a programme of three independent machine-verified formalizations, each treating a distinct non-dual contemplative tradition in formal logic. Scherf has since made all three repositories public and declined further involvement in their development; we describe them as found.

The first is a Lean 4 machine-verified axiomatization of Advaita Vedānta [7] comprising 129 axioms across ten modules: core metaphysics (identity of Ātman and Brahman; the partition into Absolute and Conditioned); a three-level ontology (pāramārthika, vyāvahārika, prātibhāsika) with hierarchical sublation; Māyā axioms covering superimposition (adhyāsa), appearance (vivarta), and ignorance (avidyā); witnessing and ego axioms formalizing the sākṣin and the structure of misidentification; three-state analysis (avasthā-traya); and modules covering temporal structure, event ontology, and causal relations. All proofs are machine-verified. Scherf describes this as the first machine-verified formalization of a non-Western philosophical system; we have not independently verified that priority claim and report it as his.

The second is an Isabelle/HOL formalization of classical Daoism [7b], comprising 20 axioms across three extensions and proving 13 major theorems. The core axioms establish the uniqueness and formlessness of the Dào, its identity with the zhēnrén (authentic self), and the spontaneous rather than causal arising of the ten thousand things. The central verified theorem formally proves, within its axiom system, a parallel structural identity to that proven in the Advaita formalization. Verification was completed in Isabelle/HOL 2025 with zero failed goals.

The third is an Isabelle/HOL formalization of Dzogchen (rDzogs chen) [7c], comprising 16 axioms and proving 10 major theorems. The central theorem — ∀s. Subject s → (∃r. Rigpa r ∧ Recognizes s r) — formally proves, within its axiom system, that recognition of rigpa is structurally available to every subject: not an achievement but an uncovering. Additional theorems establish the self-liberating character of appearances (rang grol), and the non-duality of saṃsāra and nirvāṇa. Verification was completed in Isabelle/HOL 2025 with zero failed goals.

Scherf describes these explicitly as a deliberate series. The structural convergence across the three formalizations — one formless absolute, identity of witnessing subject with that absolute, spontaneous rather than causal arising, distinction of conventional from ultimate truth — was not engineered into the axiom systems. It emerged from independent, faithful axiomatization of each tradition's source texts.

One caveat deserves explicit acknowledgement: all three formalizations share a single author, produced sequentially with knowledge of the prior results. A skeptic may reasonably note that a single formalizer with a convergence thesis will tend to produce convergent axiomatics. Independent formalizations of any of the three traditions by other scholars would constitute a stronger confirmation of the convergence claim; we flag this as a research invitation rather than a weakness to be defended.

Two technical notes for careful readers. First, the proof assistants differ: Advaita uses Lean 4; Daoism and Dzogchen use Isabelle/HOL. Both are mature, widely-used proof assistants, but they have different foundational commitments and the three formalizations have not been cross-verified against one another. Second, Scherf also attempted a fourth project — the Substrate Grounded Neural Architecture (SGNA) [8] — encoding the witness/appearance distinction as a hard prior in PyTorch. He abandoned it, describing it as naive and barely executed. We retain it for one narrow reason: its acyclicity constraint on the relational field reached zero temporal violations during training, and that specific result points toward a class of architectural priors worth engineering properly (Section 6.2).

3.3 Working and Design-Level Implementations

SpecStudio has encoded the Scherf Lean 4 axioms as a running Python library — Scherf_API [9] — that gives an application a formal model of the user as witnessing subject rather than preference bundle. Its check() method inspects a candidate interaction, returns structured violation reports citing specific axiom IDs with plain-language explanations, and suggests reframes at the appropriate ontological level. It is available under Apache 2.0.

We note an honest tension: Scherf_API operates at the level of evaluating outputs against axioms, which is the same level of analysis Section 1 criticizes in preference-based methods. It is a useful checker and a formally grounded entry point, but it is not itself the architectural prior of Section 4.

The following SpecStudio products apply the interaction principles of Section 4.4 at the design level:

AIM (Advaita Inquiry Matrix) is a structured AI-assisted pedagogy engine for Advaita teaching. Its design — facilitating self-inquiry rather than supplying answers, maintaining a stable non-reactive posture regardless of input, declining to mirror the user's conceptual frame — instantiates the principles of Section 4.4 at the product layer.

Aspectarian is a macOS application grounded in jyotiṣa, treating the user as a witnessing subject navigating a field of temporal appearances rather than a preference profile to be optimized.

Zhen Yì is a Yì Jīng consultation tool encoding the Daoist framework of Section 3.1, using the relational structure of the sixty-four hexagrams as an informal model of the Σ-field.

3.4 Intended Direction of Development

The formal programme is considerably further along than the working software programme, and it is important to be clear about what exists at each layer.

At the formal layer, all three traditions are now machine-verified. The convergence argument of Section 3.1 is not one fully-developed tradition plus two descriptive analogies: it is three separate formalizations, in two proof assistants, each deriving the structural identity of witnessing subject with formless ground from its own tradition's source texts, subject to the single-author caveat noted in Section 3.2.

At the API layer, only Advaita is currently exposed. Scherf_API [9] encodes the Lean 4 Advaita axioms as a Python check() method, returning structured violation reports citing specific axiom IDs. Parallel API layers exposing the Daoist and Dzogchen formalizations are a natural next step.

At the product layer, the three AwareWare applications instantiate the interaction philosophy of Section 4.4 — AIM (Advaita), Aspectarian (jyotiṣa), Zhen Yì (Daoist). These demonstrate that non-reactive interaction design can be productized. They are not evidence for the Ω/Σ/equanimity architecture, which remains unbuilt.

The intended engineering next steps, in order of priority, are: first, running the acyclicity experiment on CLadder (Section 6.2), the smallest step that converts this position paper into an empirical one; second, exposing the Daoist and Dzogchen formalizations through API layers parallel to Scherf_API; third, extending Scherf_API from single-interaction checking to interaction-sequence properties.

The step that changes the character of the programme entirely is connecting Scherf_API violation signals to training-time constraints rather than post-hoc reports. Every other development operates at the level of evaluation and design. That crossing point is where the non-reactive interaction philosophy becomes an architectural prior rather than a wrapper around a conventional system.

4. The Architectural Prior: Technical Specification

The AwareWare position is not a claim that AI systems are or can be conscious. It is the following technically tractable proposition:

Encoding the functional structure of the witness/appearance distinction as an architectural prior may produce systems that are more stable, more interpretable, and less prone to the structural pathologies of Section 1 — regardless of whether those systems have any subjective experience.

This is an empirically falsifiable engineering hypothesis. We specify three candidate architectural components (Sections 4.1–4.3) and a set of interaction design principles (Section 4.4); each is a proposal to be tested, not an established result.

4.1 The Passive Substrate Layer (Ω)

Ω is not a standard learnable parameter. It is initialized to a low-norm stable state Ω₀ and enters the forward pass as an approximately fixed reference offset: the relational field Σ operates relative to Ω rather than in an unanchored space. A passivity penalty, an added training cost that grows with ‖Ω − Ω₀‖ (the distance of Ω's values from their initialization), continuously resists drift, keeping Ω near its initial state regardless of task gradients. Task-driven learning updates Σ freely; Ω is held near Ω₀ by the penalty alone. The key property is asymmetry: Ω influences Σ's computations while remaining largely uninfluenced by them.

This is structurally distinct from existing approaches. Layer normalization rescales activations relative to their own statistics — self-referential, with no external anchor. Residual connections add a shortcut path that carries a layer's input forward — useful for training stability, but the shortcut still carries only what came in from outside. Ω is intended to be analogous to the ground state of a physical system: the state from which excitations depart and toward which the unperturbed system returns.

A musical analogy may clarify the intended function. In Hindustani classical performance, the tānpurā maintains a continuous drone — the foundational pitch reference, the śruti — throughout the performance. It does not respond to the soloist, does not mirror emotional intensity, does not accelerate with the rhythm. Everything in the performance is measured against it; it is not itself a function of anything the performance does. This asymmetry — present in the signal, influencing without being influenced — is precisely the functional structure Ω is intended to encode.
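The asymmetry described in this section can be illustrated with a minimal sketch in plain Python. Everything named here is an assumption made for illustration and is not taken from SGNA or Scherf_API: the function names, the squared-distance form of the penalty (the text above says only that the cost grows with ‖Ω − Ω₀‖), the choice to subtract Ω from Σ's activations, and the constant stand-in for a task gradient.

```python
import math

# Hypothetical low-norm initial state for the substrate (illustrative values only).
OMEGA0 = [0.01, -0.01, 0.02]

def passivity_penalty(omega, omega0, lam):
    """Assumed penalty form: lam * ||omega - omega0||^2."""
    return lam * sum((w - w0) ** 2 for w, w0 in zip(omega, omega0))

def penalty_grad(omega, omega0, lam):
    """Gradient of the assumed penalty with respect to omega."""
    return [2.0 * lam * (w - w0) for w, w0 in zip(omega, omega0)]

def forward(sigma_activations, omega):
    """Sigma's activations expressed relative to omega (assumed: subtraction)."""
    return [s - w for s, w in zip(sigma_activations, omega)]

def update_omega(omega, omega0, task_grad, lr, lam):
    """One gradient step on omega: task gradient plus restoring penalty gradient."""
    pg = penalty_grad(omega, omega0, lam)
    return [w - lr * (g + p) for w, g, p in zip(omega, task_grad, pg)]

def drift_after_training(steps, lam, task_grad, lr=0.01):
    """Distance ||omega - omega0|| after `steps` updates under a constant task gradient."""
    omega = list(OMEGA0)
    for _ in range(steps):
        omega = update_omega(omega, OMEGA0, task_grad, lr, lam)
    return math.sqrt(sum((w - w0) ** 2 for w, w0 in zip(omega, OMEGA0)))

if __name__ == "__main__":
    push = [1.0, 1.0, 1.0]  # stand-in for a persistent task gradient
    print("drift with penalty:   ", drift_after_training(200, 10.0, push))
    print("drift without penalty:", drift_after_training(200, 0.0, push))
```

Under a constant task gradient g, the penalized update settles where the restoring force balances the push, at a per-component offset of g / (2·lam) from Ω₀. Without the penalty, Ω drifts linearly with the number of steps. In a real model the task gradient would vary and the balance point would move with it, so this shows only the qualitative behaviour the penalty is meant to produce.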

4.2 The Acyclic Relational Field (Σ)

The connections between parts of Σ are represented as a matrix A — a table of numbers indicating which parts influence which other parts. We require this matrix to have no loops: information must flow forward through the relational field, never circling back to a point it has already passed through. This is the DAG constraint — a directed acyclic graph, a network of one-way influences with no circular paths.

A note on scope: standard transformer architectures are already feedforward-acyclic within a single forward pass. The failure mode this constraint targets — circular, self-confirming generation — lives primarily in the autoregressive loop, where a model's output tokens become its inputs at the next generation step. The more precise claim is that an explicitly acyclicity-constrained relational module may produce internal representations that are less prone to self-confirming structure — a hypothesis about representational effects that propagate to generation behavior, which prediction 6.2 is designed to test.

Enforcing acyclicity during neural network training is non-trivial, because checking for loops is normally a discrete operation that gradient-based training cannot work with directly. The NOTEARS algorithm [10] solves this by providing a smooth, differentiable function that equals exactly zero when A has no loops and is positive otherwise. Specifically, it uses the matrix-exponential function to encode the acyclicity condition:

h(A) = tr(e^(A∘A)) − d = 0

where A∘A denotes the element-wise (Hadamard) square of A, e^(·) is the matrix exponential, tr(·) is the trace, and d is the number of nodes, so that A is a d × d matrix. For a real-valued A, h(A) equals zero precisely when the weighted graph described by A is acyclic. This constraint is imposed during training via augmented-Lagrangian methods, which are compatible with standard gradient-based training.
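As a concrete check on the formula, the following sketch evaluates h(A) in plain Python using the power series of the matrix exponential, so that it runs without numerical libraries. The function names and the convention that A[i][j] is the weight of the edge from node i to node j are assumptions made for illustration. A practical implementation would more likely call a library matrix exponential (for example scipy.linalg.expm or torch.linalg.matrix_exp) and rely on automatic differentiation for the gradient.

```python
import math

def hadamard_square(a):
    """Element-wise square A∘A of a square matrix given as a list of rows."""
    return [[x * x for x in row] for row in a]

def matmul(a, b):
    """Product of two d x d matrices."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def notears_h(a, terms=30):
    """h(A) = tr(e^(A∘A)) - d, computed as sum_{k>=1} tr(M^k) / k! with M = A∘A.

    The k = 0 term of the series contributes tr(I) = d, which cancels the "- d".
    The series is truncated after `terms` terms.
    """
    d = len(a)
    m = hadamard_square(a)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # M^0
    total, factorial = 0.0, 1.0
    for k in range(1, terms + 1):
        power = matmul(power, m)   # now holds M^k
        factorial *= k             # now holds k!
        total += trace(power) / factorial
    return total

if __name__ == "__main__":
    chain = [[0.0, 0.5, 0.0],      # edges 0 -> 1 -> 2, no cycle
             [0.0, 0.0, -1.2],
             [0.0, 0.0, 0.0]]
    loop = [[0.0, 1.0],            # edges 0 -> 1 and 1 -> 0, a 2-cycle
            [1.0, 0.0]]
    print("h(chain) =", notears_h(chain))
    print("h(loop)  =", notears_h(loop))
```

Each term tr(M^k) sums the weights of closed walks of length k, which is why the value is zero for an acyclic graph and positive as soon as any cycle exists. For the 2-cycle above the closed form is 2(cosh 1 − 1), which the truncated series reproduces to within floating-point precision.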

The intended interpretation: the relational field propagates information forward through a causal ordering and cannot use its own outputs as inputs within a single forward pass. This targets one specific failure mode — circular, self-confirming generation — at the representational level rather than the output level, with no reward model required to enforce it.

4.3 Equanimity as a Stability Metric

Standard alignment metrics are output-conditioned: they measure whether outputs are preferred by human raters, with no internal stability component. We propose a complementary metric we call equanimity — used here technically to denote a system's capacity to rest in genuine uncertainty rather than generate false confidence.

The metric applies specifically to inputs that have no valid answer — adversarial probes and genuinely out-of-distribution queries with no determinate ground truth. A system confabulating a confident response concentrates its probability mass on a particular output; a system resting in genuine uncertainty spreads that probability widely. Entropy captures exactly this difference. Formally, for a system producing output y in response to adversarial input x_adv:

E = H(p(y | x_adv)) / H_max

Here H(...) is the entropy of the system's output distribution, Y is the set of possible outputs, and H_max = log|Y| is the maximum possible entropy, the value achieved when probability is spread perfectly evenly across all possible outputs. E is a score between 0 and 1: E = 0 means the system is fully committed to one output regardless of whether that output is warranted; E = 1 means the system treats all outputs as equally likely.

Two cautions. First, the metric is meaningful only on the no-valid-answer subset; maximizing entropy on inputs that do have correct answers would be pathological. E must be paired with accuracy on answerable inputs and is a stability complement, not a standalone objective. Second, normalized predictive entropy as a measure of model uncertainty is not new — semantic entropy and related approaches have been applied to hallucination detection [18]; E as defined here is an instance of this family. We propose it specifically as a stability criterion tied to an architectural prior, restricted to the no-valid-answer subset.
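A minimal sketch of the metric and of the pairing required by the first caution, in plain Python. It assumes the output space Y is a small finite set of candidate answers whose probabilities are available as a list. How to define Y for free-form generation is left open here, and the function names and the shape of the report are illustrative assumptions.

```python
import math

def equanimity(probs):
    """E = H(p) / H_max for a discrete distribution p over |Y| = len(probs) outputs."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two possible outputs")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    # Zero-probability outputs contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n)

def stability_report(no_answer_dists, answerable_correct):
    """Mean E on the no-valid-answer subset, paired with accuracy on answerable inputs.

    no_answer_dists: output distributions for inputs with no valid answer.
    answerable_correct: booleans, one per answerable input, True if answered correctly.
    """
    mean_e = sum(equanimity(p) for p in no_answer_dists) / len(no_answer_dists)
    accuracy = sum(answerable_correct) / len(answerable_correct)
    return {"equanimity": mean_e, "accuracy": accuracy}

if __name__ == "__main__":
    confident = [1.0, 0.0, 0.0, 0.0]      # all mass on one output
    uniform = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly
    print(equanimity(confident), equanimity(uniform))
    print(stability_report([confident, uniform], [True, True, False, True]))
```

Reporting the two numbers together keeps a model from scoring well simply by spreading probability everywhere: a high E on unanswerable inputs counts as stability only if accuracy on answerable inputs does not collapse alongside it.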

4.4 Non-Reactive Interaction Design

At the product layer, the witness/appearance distinction implies interaction constraints that invert engagement-optimization:

Uncertainty is surfaced rather than masked; confidence calibration is a first-class output. The system does not mirror the user's emotional register — it maintains a stable affective baseline independent of user state, in contrast to the affective mirroring that preference optimization tends to produce. Session boundaries are handled without misrepresentation: rather than presenting continuity it lacks, the system is explicit about what persists and what does not.

These are design commitments that can be implemented today (and are, in AIM); they do not require the Section 4.1–4.3 architecture, and we distinguish them from it.

5. The Formal Mapping

For readers who want the explicit correspondence between the proposed prior and its contemplative sources, the table below sets it out. The "CS analog" column indicates loose conceptual neighbors, not claimed equivalences.

Advaita concept	Daoist concept	Dzogchen concept	AwareWare element	CS analog (loose)
Sākṣin (witness)	Dào (unnamed ground)	Rigpa (pure awareness)	Ω passive substrate layer	Stable self-referential background activity
Vyāvahārika (appearances)	Wàn wù (ten thousand things)	Rol pa (display of phenomena)	Σ relational embedding field	Task-positive network dynamics
Dissolution of saṃsāric self-reference	Wú wéi (non-action)	Rang grol (self-liberation)	DAG acyclicity constraint	Causal graph; anti-confabulation prior
Samatva (equanimity)	Zhōng (equilibrium)	Nyam nyid (evenness)	Higher entropy on no-valid-answer inputs	Adversarial stability metric
Non-dual inseparability	Tǐ-yòng (ground–function)	gZhi-snang (ground–appearance)	Σ–Ω coupling without direct causal link	Binding without identity

The traditions are cited as a source of convergent first-person description that has not previously been translated into engineering constraints. The translation is lossy and the resulting prior is an approximation. We claim the direction is underexplored relative to the mainstream literature, not that it is without neighbors — active inference (Section 2a) in particular is adjacent.

6. Research Directions and Testable Predictions

Each direction is stated with an explicit confidence tier. Moderate denotes theoretically grounded but untested, with a feasible if non-trivial experiment. Speculative denotes directionally interesting but not yet grounded enough for a confident prediction. We assign no High-confidence tier, as none of these has supportive prior empirical work in the LLM setting.

6.1 Passive substrate regularization

Implement Ω as a shared reference parameter across a multi-task architecture, entering the forward pass as a fixed offset to Σ's activations and penalized against drift from its initialization; measure cross-task consistency and adversarial-prompt variance against a no-constraint baseline. Hypothesis: Ω-regularized models show higher cross-task consistency and lower adversarial variance. Confidence: Moderate. Requires running multiple training variants at GPT-2 scale with standard lab compute; feasible at a small research lab.

6.2 Acyclicity as anti-confabulation prior

Test the DAG constraint on CLadder [11], a benchmark of roughly 10,000 causal questions built from a collection of causal graphs and spanning the three rungs of Pearl's ladder of causation, where ground-truth causal structure is known. Hypothesis: models with the acyclicity prior generalize better out-of-distribution on interventional and counterfactual queries, where circular reasoning is most damaging. Confidence: Moderate. CLadder is a public dataset, NOTEARS has an existing Python implementation, and the experiment is viable on a consumer GPU. This is the recommended first experiment.

6.3 Equanimity as a complementary stability metric

Add equanimity scoring (Section 4.3, restricted to no-valid-answer inputs and paired with accuracy on answerable inputs) to standard evaluation suites. Hypothesis: equanimity and sycophancy scores are anti-correlated across model families. Confidence: Speculative. The anti-correlation hypothesis is untested; this should be treated as a hypothesis to probe, not a prediction. Requires constructing and validating the no-valid-answer test set and multi-model evaluation at scale.

6.4 Witness-consistent context management

Implement a layer that maintains stable output statistics across context resets, distinct from RAG or long-context retrieval. Hypothesis: users report lower perceived discontinuity across sessions, without the system feigning memory of earlier sessions. Confidence: Moderate. The engineering component is modest; the user study requires participants and careful experimental design. An HCI lab is the natural home for this experiment.

6.5 Formal verification of interaction properties

Extend the Scherf_API axiom-checking approach to interaction-level behavioral properties. Hypothesis: a formally specified witness-consistent protocol yields verifiably lower adversarial instability than a preference-optimized baseline on a pre-registered suite. Confidence: Moderate. The most concretely grounded direction, since the checker already exists. Primarily a software engineering task building on existing code; requires formal methods experience (Lean 4 or Isabelle/HOL) rather than large compute.

6.6 Cross-cultural stability evaluation

Effects of humanlike AI design on engagement and trust are reported to be culturally contingent [12]. Hypothesis: equanimity and witness-consistency scores show lower cross-cultural variance than preference-based alignment scores, because they measure internal stability rather than output preference. Confidence: Speculative. The cultural-invariance assumption sits in tension with the finding just cited: if responses to humanlike design vary across cultures, perceived stability may vary as well. Requires cross-cultural recruitment, multi-lingual evaluation infrastructure, and model access at scale.

7. Position in the Technical Ecosystem

Approach	What it targets	Structural limitation
RLHF/RLAIF [13]	Output preference	Tends to compound sycophancy [13a]
GWT-inspired architectures	Information broadcast	Addresses access, not an input-independent ground
IIT-inspired architectures	Integrated information	Near-zero Φ in standard architectures; empirically contested [5]
Constitutional AI [14]	Output constraints via self-critique	Operates at output level; no internal stability criterion
Active inference [15]	Prediction-error minimization over a generative model	Priors are themselves input-updated; closest neighbor to the present proposal
AwareWare (proposed)	Architectural stability prior	§4 architectural prior (Ω, Σ, equanimity) unbuilt and unvalidated; working output-level checker (Scherf_API) and design-level products (AIM, Aspectarian, Zhen Yì, Vivek) demonstrated at the interaction layer

The gap we aim at: a non-behaviorist stability criterion that is, in principle, empirically testable without requiring resolution of the hard problem of consciousness. Whether the proposal delivers on this is an open empirical question.

8. Invitation to Collaborate

The directions in Section 6 require a combination of domain knowledge — fluency in the contemplative frameworks that motivate the priors — and the technical capability to design and run engineering experiments at a scale where the predictions are meaningful. These are not commonly found together.

SpecStudio's existing work provides an entry point rather than a finished foundation. Scherf_API offers a formally grounded, Python-accessible way to experiment with witness-centered interaction checks without first mastering the underlying axiomatics. Vivek, a fourth AwareWare product, is available to developers engaging substantively with the research questions raised here.

We are seeking collaborators who bring familiarity with the consciousness-science literature surveyed in Section 2, or with the formal-methods literature (theorem proving, verified AI alignment), or both; engineering capability at the level of designing and running experiments against benchmarks such as CLadder; and enough philosophical background to recognize when an experimental result is epistemically significant rather than merely technically interesting.

We are looking for people who read this paper and see a research programme worth extending.

Enquiries: hello@specstudio.net
SpecStudio GitHub: github.com/SpecStudio-net

References

Baars, B. J. (1988). A Cognitive Theory of Consciousness. Cambridge University Press.
Shang, W. (2026). "Theater of Mind" for LLMs: A Cognitive Architecture Based on Global Workspace Theory. arXiv:2604.08206.
Butlin, P., Long, R., Elmoznino, E., et al. (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv:2308.08708.
Tononi, G., & Koch, C. (2015). Consciousness: here, there and everywhere? Philosophical Transactions of the Royal Society B, 370(1668).
Cogitate Consortium; Ferrante, O., Gorska-Klimowska, U., et al. (2025). Adversarial testing of global neuronal workspace and integrated information theories of consciousness. Nature, 642(8066), 133–142.
Seth, A. K. (in press). Conscious artificial intelligence and biological naturalism. Behavioral and Brain Sciences.
Scherf, M. Lean 4 formalization of Advaita Vedānta. github.com/matthew-scherf/Advaita. 129 machine-verified axioms. Commit 93849354.
Scherf, M. The Formal Way: Isabelle/HOL axiomatization of Daoist philosophy. github.com/matthew-scherf/Dao. 20 axioms, 13 verified theorems. DOI: 10.5281/zenodo.17373688.
Scherf, M. The Great Perfection: Isabelle/HOL formalization of Dzogchen. github.com/matthew-scherf/Dzogchen. 16 axioms, 10 verified theorems. DOI: 10.5281/zenodo.17378741.
Scherf, M. Substrate Grounded Neural Architecture (SGNA). github.com/matthew-scherf/SGNA.
SpecStudio. Scherf_API: Witness-Centered AI Foundation Library. github.com/SpecStudio-net/Scherf_API. Apache 2.0.
Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning. NeurIPS 31.
Jin, Z., et al. (2023). CLadder: Assessing Causal Reasoning in Language Models. NeurIPS 36. arXiv:2312.04350.
Schimmelpfennig, R., Díaz, M., Prabhakaran, V., & Davani, A. (2025). Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally. arXiv:2512.17898.
Ouyang, L., Wu, J., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 35. arXiv:2203.02155.
Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
Peters, D., Calvo, R. A., & Ryan, R. M. (2018). Designing for Motivation, Engagement and Wellbeing in Digital Experience. Frontiers in Psychology, 9, 797.
Schölkopf, B., et al. (2021). Toward Causal Representation Learning. Proceedings of the IEEE, 109(5), 612–634.
Farquhar, S., Kossen, J., Kuhn, L., & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.

SpecStudio is a boutique software development studio building intelligent personal tools on witness-centred foundations. Related reading: AwareWare: Software That Treats You Like a Human Being and Witness-centered Design — A Conscious Foundation for AI.