
Mnemosyne Devlog

Devlog 003, Making Hidden State Work

The App Started Testing the Soul

May 2026 · Early app implementation · Hidden state, mock flow, real model testing, and diagnostics

After I split memory into two layers, the next problem was no longer just conceptual.

The Soul would remember like a person.

The World Log would remember like a GM.

That sounded clean in theory.

But now I had to make the app actually do something with it.

Before this point, a lot of Mnemosyne still lived in prompts, notes, and thought experiments. I had the idea of a Soul. I had the idea of a World Log. I had the crude code block. I had the idea that the AI should report what changed instead of carrying the whole system inside the chat.

But an idea does not become real until something can run.

That was where the Tauri app started to matter.

The early app was not pretty. It was not a finished roleplay client. It was barely even the shape of the thing I wanted. But it gave Mnemosyne a body: a desktop window, a chat interface, Rust behind it, React in front of it, local state, Soul files, provider settings, and the beginning of a pipeline.

That was the shift.

Mnemosyne was no longer only a better prompt.

It was becoming a machine that could sit between the user and the model.

The first job was simple to say and annoying to implement:

The player should see the story.

The engine should see the state.

That was the whole problem.

The old crude code block was useful because I could see it. It let me inspect what the AI thought had changed. Trust moved. Fear moved. A memory was flagged. The scene changed. Something became important.

But if the user had to see that code block after every response, the experience was dead.

No one wants an emotional scene followed by ugly machinery.

No visible JSON.

No raw memory tags.

No trust deltas under the dialogue.

No retrieval scores.

No system residue.

No code box reminding the player that the character is being calculated.

So hidden state had to become actually hidden.

Not fake hidden.

Not “scroll past this.”

Not “pretend you did not see it.”

The model could output a machine-readable block, but the app had to catch it, strip it out, parse it, and apply it before the user saw the final message.

If the block leaked, immersion broke.

If the block disappeared, the Soul could not update.

If the block was malformed, the engine had to survive.

That was one of the first moments where Mnemosyne started feeling like real software.

Not because the feature was glamorous.

Because it was annoying in exactly the way real software is annoying.
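To make the catch, strip, parse, survive requirement concrete, here is a minimal sketch. It is written in Python only for brevity (the app itself is Rust and React), and the `<<STATE>>` / `<<END>>` markers and the `split_response` name are hypothetical stand-ins, not the format Mnemosyne actually uses.

```python
import json
import re
from typing import Optional, Tuple

# Hypothetical delimiters; the real marker used by the app is not shown in this devlog.
HIDDEN_BLOCK = re.compile(r"<<STATE>>(.*?)<<END>>", re.DOTALL)


def split_response(raw: str) -> Tuple[str, Optional[dict]]:
    """Return (visible_text, state). State is None when missing or malformed."""
    match = HIDDEN_BLOCK.search(raw)
    # Remove every marked block first, so nothing leaks even if parsing fails.
    visible = HIDDEN_BLOCK.sub("", raw).strip()
    if match is None:
        return visible, None  # the model forgot the block: no update this turn
    try:
        state = json.loads(match.group(1))
    except json.JSONDecodeError:
        return visible, None  # malformed block: survive and skip the update
    # Only a JSON object counts as a usable state update.
    return visible, state if isinstance(state, dict) else None
```

With this sketch, `split_response('She looks away.\n<<STATE>>{"trust": 1}<<END>>')` returns the narration plus `{"trust": 1}`, while a broken block still yields clean narration and `None`. One gap is left open on purpose: a block that never closes would not match the pattern and would still leak, which is exactly the kind of case a real model produces.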

The turn flow started becoming concrete:

User message.

Compiled context.

Model response.

Visible narration.

Hidden state.

Parser.

Soul update.

World update.

Save.

Repeat.

That loop was the first version of Mnemosyne actually breathing.
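The loop above can be sketched as a single function. This is an illustration under assumptions, in Python rather than the app's Rust and React: `Session`, `run_turn`, the `soul` / `world` keys, and the `|` separator used by the mock are all invented for the example, and "save" is reduced to appending to an in-memory transcript.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Session:
    # Stand-ins for the Soul and World Log; the real structures are richer.
    soul: Dict[str, int] = field(default_factory=dict)
    world: List[str] = field(default_factory=list)
    transcript: List[str] = field(default_factory=list)


def run_turn(
    session: Session,
    user_message: str,
    provider: Callable[[str], str],
    split: Callable[[str], Tuple[str, Optional[dict]]],
) -> str:
    # 1. Compile context from current state plus the new user message.
    context = f"soul={session.soul} world={session.world}\nUser: {user_message}"
    # 2. Ask the model (or a mock) for a raw response.
    raw = provider(context)
    # 3. Separate visible narration from hidden state.
    visible, state = split(raw)
    # 4. Apply Soul and World updates only if a usable state came back.
    if state is not None:
        for key, delta in state.get("soul", {}).items():
            session.soul[key] = session.soul.get(key, 0) + delta
        session.world.extend(state.get("world", []))
    # 5. "Save" (here: keep the player-facing text in memory) and return it.
    session.transcript.append(visible)
    return visible


def mock_provider(_context: str) -> str:
    # A fixed answer with a predictable hidden payload after a "|" separator.
    return 'The lantern flickers.|{"soul": {"trust": 1}, "world": ["lantern lit"]}'


def mock_split(raw: str) -> Tuple[str, Optional[dict]]:
    visible, _, payload = raw.partition("|")
    return visible, json.loads(payload) if payload else None
```

Passing the provider and the splitter in as arguments keeps the loop itself unchanged when a mock is swapped for a real model.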

There was a mock provider in the code, but I do not want to oversell it.

The mock was not where the real roleplay testing happened.

It was mostly an early placeholder for the chat interface and turn flow. It could return a fixed answer, attach a predictable hidden-state block, and make the UI look like it was doing something.

That was useful for building the app shape.

It proved that a button could send a message.

It proved that the backend could return a response.

It proved that hidden state could exist in the pipeline.

It proved that the UI could update.

But as a test of roleplay, memory, narration, or psychology, it was basically useless.

It did not surprise me.

It did not drift.

It did not forget.

It did not misunderstand the format.

It did not fail like a real model fails.

And because it did not fail like a real model, it could not prove the idea.

The real testing came from actual LLMs.

I was already using OpenRouter heavily to try different models for AI RP, so free OpenRouter models became the obvious way to test without burning too much money. I needed real generations. I needed actual model behavior. I needed the system to survive contact with chaos.

A fake response could prove that the interface worked.

A real model could prove whether the idea held together.

That was where the useful failures started.

But before the failures, there was also a good moment.

The first real test felt good in a way the mock never could.

For the first time, the app was not just clicking through a fake pipeline. A real model answered. The chat moved. The scene had atmosphere. The narration had texture. It was not the final experience, but it was enough to make the project feel less imaginary.

That mattered.

A working mock can tell you the pipe is connected.

A real model can make you feel the product.

And for a moment, it did feel like something was there.

That was encouraging.

But it was not a clean victory.

The good feeling came with a warning almost immediately.

The output was nice, but it was also wrong in a very important way. The model was still talking too much like the character instead of acting like a narrator. It slipped into first-person character presence. It made the experience feel closer to a normal character chatbot than the GM/narrator architecture I wanted.

That worried me.

Because Mnemosyne was not supposed to be another character impersonator.

The whole point was that the narrator should describe the character, while the Soul and World Log carried continuity underneath. If the model collapsed into “being” the character, then the knowledge boundaries would collapse too.

The character could steal narrator knowledge.

The narrator could steal the user’s agency.

The system could drift back into the exact platform behavior I was trying to escape.

So the first real test was encouraging and alarming at the same time.

It proved the app could feel good.

It also proved that feeling good was not enough.

A real model might forget to include the hidden state.

A real model might expose the hidden block to the user.

A real model might wrap the hidden state in the wrong format.

A real model might summarize instead of updating.

A real model might produce beautiful narration and then ruin the machine-readable part.

A real model might keep adding memories instead of letting anything fade.

A real model might understand the rule, explain the rule, and then fail to follow it three turns later.

That was exactly the kind of failure I needed to see.

Because the point was not to make a toy demo work.

The point was to find the places where a real model would break the architecture.

One of the first problems was formatting.

At first, hidden state could be plain JSON. That was easy to inspect, easy to test, and easy to understand. But plain JSON was also fragile and ugly. It was too easy for the model to leak it, corrupt it, or blend it into the visible response.

So the hidden state had to become more controlled.

The app needed a recognizable marker.

The payload needed to be compact.

The parser needed to support the new format without destroying older test transcripts.

That was why encoded hidden state became important.

The hidden block moved toward a versioned payload format: something the engine could identify, decode, and parse without treating it like normal prose.

That did not make the system perfect.

But it made the boundary sharper.

The narration was for the player.

The encoded hidden state was for the engine.

That boundary was everything.
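One way a versioned payload can coexist with older plain-JSON transcripts is sketched below. The `MNS1:` prefix and the choice of URL-safe base64 are assumptions made for the example; the devlog does not specify the real encoding, and the sketch is in Python rather than the app's own Rust.

```python
import base64
import json
from typing import Optional

PREFIX = "MNS1:"  # hypothetical version tag, not the app's real marker


def encode_state(state: dict) -> str:
    # Compact JSON, then base64, so the payload never reads like prose.
    compact = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return PREFIX + base64.urlsafe_b64encode(compact).decode("ascii")


def decode_state(payload: str) -> Optional[dict]:
    payload = payload.strip()
    try:
        if payload.startswith(PREFIX):
            raw = base64.urlsafe_b64decode(payload[len(PREFIX):])
            state = json.loads(raw.decode("utf-8"))
        else:
            # Legacy path: older test transcripts carried plain JSON.
            state = json.loads(payload)
    except ValueError:
        # binascii.Error, UnicodeDecodeError and JSONDecodeError are all
        # ValueError subclasses, so any decoding failure lands here.
        return None
    return state if isinstance(state, dict) else None
```

The version tag is what lets a later format be added as another branch without breaking transcripts recorded under the earlier ones.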

Around the same time, I also started seeing why debug tools mattered.

When hidden state is visible, debugging is easy. You just read the ugly block.

When hidden state is actually hidden, debugging becomes harder. If the Soul updates incorrectly, I need to know why. If the parser fails, I need to know where. If a memory gets added, scored, discarded, or consolidated, I need to see the trace.

So diagnostics became part of the early app.

Not as polish.

As survival.

I needed a way to inspect the memory cycle without exposing the machinery to the player. I needed to see what the model returned, what the parser extracted, what the engine accepted, and what changed inside the Soul.

That was the practical reason for turn debug panels and memory cycle diagnostics.

The player should not see the machinery.

But the developer absolutely needs to.
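A per-turn trace record is one simple shape for that kind of diagnostic. The sketch below is illustrative only: `TurnTrace` and its field names are hypothetical, chosen to mirror the four questions above (what the model returned, what the parser extracted, what the engine accepted, what changed in the Soul), and Python is used for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TurnTrace:
    raw_response: str            # what the model returned, hidden block included
    extracted: Optional[dict]    # what the parser pulled out (None on failure)
    accepted: Dict[str, int]     # what the engine actually applied
    rejected: Dict[str, str]     # field name -> reason it was refused
    soul_before: Dict[str, int]
    soul_after: Dict[str, int]

    def soul_diff(self) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
        """Map each changed Soul field to its (before, after) pair."""
        keys = set(self.soul_before) | set(self.soul_after)
        return {
            key: (self.soul_before.get(key), self.soul_after.get(key))
            for key in sorted(keys)
            if self.soul_before.get(key) != self.soul_after.get(key)
        }
```

A debug panel can render records like this for the developer while the chat view only ever receives the stripped narration.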

This was also where the narrator architecture became more urgent.

Before the real test, narrator mode was mostly an architecture principle.

After the real test, it became a practical requirement.

If the AI was allowed to behave like the character directly, it would blur everything. It might speak in first person. It might steal the user’s actions. It might treat narrator knowledge as character knowledge. It might expose internal state as dialogue.

That was not Mnemosyne.

Mnemosyne needed the model to act as a narrator.

The narrator writes the visible scene.

The hidden state reports candidate changes.

The app strips and parses the hidden state.

The engine updates the Soul and World Log.

The player only sees the story.

That sounds obvious now, but this was the stage where it started becoming obvious through bugs.

Every failure pointed back to the same lesson:

The model should not carry the whole system by itself.

At this stage, I was still asking the LLM to do too much inside one response.

It had to write the scene.

It had to follow the selected narrative mode.

It had to respect the user’s agency.

It had to avoid leaking hidden machinery.

It had to flag memories.

It had to judge emotional importance.

It had to update relationship values.

It had to keep the Soul coherent.

It had to remember to forget.

That was too much for one generation to do perfectly.

But I had not fully escaped that yet.

This was not the dual-pass breakthrough.

That came later.

At this point, the lesson was smaller but important:

The app needed to catch the model’s output and begin enforcing boundaries around it.

The AI could propose.

The engine had to manage.

That was the start of the real boundary.

Not complete.

Not clean.

But visible.
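The propose-and-manage split can be sketched as a small validation step. Everything specific here is an assumption for illustration: the allowed fields, the step limit, and the 0 to 10 range are invented, and the sketch is in Python rather than the app's Rust.

```python
from typing import Dict, Tuple

ALLOWED_FIELDS = {"trust", "fear"}  # hypothetical Soul fields
MAX_STEP = 2                        # hypothetical cap on change per turn
LOW, HIGH = 0, 10                   # hypothetical value range


def apply_proposal(
    soul: Dict[str, int], proposal: Dict[str, object]
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Return (updated_soul, rejected). The input soul is left untouched."""
    updated = dict(soul)
    rejected: Dict[str, str] = {}
    for name, delta in proposal.items():
        if name not in ALLOWED_FIELDS:
            rejected[name] = "unknown field"
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(delta, bool) or not isinstance(delta, int):
            rejected[name] = "delta is not an integer"
            continue
        # The model proposes any delta; the engine limits how far one turn moves.
        step = max(-MAX_STEP, min(MAX_STEP, delta))
        updated[name] = max(LOW, min(HIGH, updated.get(name, 0) + step))
    return updated, rejected
```

In this sketch a proposed `{"trust": 50}` against a Soul at `{"trust": 9}` ends at 10, not 59: the model's suggestion is treated as input to the engine's rules rather than as the new state.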

The mock provider gave the interface a shape.

OpenRouter models gave the system real failures.

The first real test gave me a glimpse of the product.

Encoded hidden state gave the engine something it could parse.

Diagnostics gave me a way to see what was happening behind the curtain.

And the Tauri app gave all of it a place to run.

That was the point where Mnemosyne stopped being only a conceptual memory system.

It became an app trying to keep the magic invisible.

Next: Devlog 004, where the app had to learn what to send the model: context compiler work, cleaner Soul and World snapshots, provider payloads, API turn handling, and early narrator guardrails.

Covered commits

This is the first devlog with direct commit coverage. Devlogs 000 to 002 covered pre-repository origin work and early design history. This entry starts tracking the repository from the initial legal/project foundation into the early hidden-state implementation period.

f26cbfe Add AGPL-3.0 license
a642953 Scaffold Tauri desktop client
1fe8132 Initial commit
35bf8bf Declare AGPL package license
303d4f7 Merge GitHub initial repository state
b0697fa Wire mock provider turn flow
aee7972 Encode mock hidden state
9c242ea Add local delete controls
505f75b Align prototype with narrator architecture
c9fe40d Fix native Tauri build setup
b6fb370 Add mock turn acceptance coverage
365d105 Surface memory cycle diagnostics
1f6ed4b Document future settings architecture
53a12d1 Separate setting controls from Soul identity
c526a87 Add Setting Soul lifecycle
802e588 Align API hidden state prompt
21f5efd Add turn debug panel
6f07c51 Improve chat workspace controls