Mnemosyne Devlog
Devlog 003, Making Hidden State Work
The App Started Testing the Soul
After I split memory into two layers, the next problem was no longer just conceptual.
The Soul would remember like a person.
The World Log would remember like a GM.
That sounded clean in theory.
But now I had to make the app actually do something with it.
Before this point, a lot of Mnemosyne still lived in prompts, notes, and thought experiments. I had the idea of a Soul. I had the idea of a World Log. I had the crude code block. I had the idea that the AI should report what changed instead of carrying the whole system inside the chat.
But an idea does not become real until something can run.
That was where the Tauri app started to matter.
The early app was not pretty. It was not a finished roleplay client. It was barely even the shape of the thing I wanted. But it gave Mnemosyne a body: a desktop window, a chat interface, Rust behind it, React in front of it, local state, Soul files, provider settings, and the beginning of a pipeline.
That was the shift.
Mnemosyne was no longer only a better prompt.
It was becoming a machine that could sit between the user and the model.
The first job was simple to say and annoying to implement:
The player should see the story.
The engine should see the state.
That was the whole problem.
The old crude code block was useful because I could see it. It let me inspect what the AI thought had changed. Trust moved. Fear moved. A memory was flagged. The scene changed. Something became important.
But if the user had to see that code block after every response, the experience was dead.
No one wants an emotional scene followed by ugly machinery.
No visible JSON.
No raw memory tags.
No trust deltas under the dialogue.
No retrieval scores.
No system residue.
No code box reminding the player that the character is being calculated.
So hidden state had to become actually hidden.
Not fake hidden.
Not “scroll past this.”
Not “pretend you did not see it.”
The model could output a machine-readable block, but the app had to catch it, strip it out, parse it, and apply it before the user saw the final message.
If the block leaked, immersion broke.
If the block disappeared, the Soul could not update.
If the block was malformed, the engine had to survive.
That was one of the first moments where Mnemosyne started feeling like real software.
Not because the feature was glamorous.
Because it was annoying in exactly the way real software is annoying.
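A minimal sketch of that catch-strip-parse step. It is written in Python for readability; the app's own code lives in its Rust and React layers and is not shown in this devlog. The `<hidden_state>` tags, `split_turn`, and `ParsedTurn` are hypothetical names, and the payload is plain JSON, as in the earliest version. The sketch covers the three outcomes above. The block is always removed from what the player sees. A missing or malformed block leaves the Soul untouched instead of crashing the turn.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical delimiters; the devlog does not name the real marker.
HIDDEN_BLOCK = re.compile(r"<hidden_state>(.*?)</hidden_state>", re.DOTALL)


@dataclass
class ParsedTurn:
    visible: str                         # narration the player may see
    state: Optional[dict] = None         # parsed hidden state, if it survived
    errors: list = field(default_factory=list)


def split_turn(raw: str) -> ParsedTurn:
    """Strip every hidden block from the reply, then try to parse the last one."""
    blocks = HIDDEN_BLOCK.findall(raw)
    turn = ParsedTurn(visible=HIDDEN_BLOCK.sub("", raw).strip())
    if not blocks:
        # Block disappeared: the story still shows, but the Soul cannot update.
        turn.errors.append("missing hidden state")
        return turn
    try:
        parsed = json.loads(blocks[-1])
    except json.JSONDecodeError as exc:
        # Malformed block: keep the narration, record why, skip the update.
        turn.errors.append(f"malformed hidden state: {exc.msg}")
        return turn
    if isinstance(parsed, dict):
        turn.state = parsed
    else:
        turn.errors.append("hidden state is not a JSON object")
    return turn
```

If a reply's block fails to parse, its narration still reaches the player. Only the state update is skipped, and the reason is recorded in `errors` for later inspection.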
The turn flow started becoming concrete:
User message.
Compiled context.
Model response.
Visible narration.
Hidden state.
Parser.
Soul update.
World update.
Save.
Repeat.
That loop was the first version of Mnemosyne actually breathing.
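The loop can be reduced to a single function. This is a hypothetical Python sketch. `compile_context`, `run_turn`, the `"soul"`/`"world"` payload keys, and the pluggable `model` and `save` callables are illustrative names, not Mnemosyne's actual API. The comments map to the steps in the list.

```python
import json
import re
from typing import Callable

# Hypothetical delimiters for the hidden block.
HIDDEN_BLOCK = re.compile(r"<hidden_state>(.*?)</hidden_state>", re.DOTALL)


def compile_context(soul: dict, world: list, message: str) -> str:
    # Placeholder compiler: a real one chooses what to include far more carefully.
    return f"SOUL: {json.dumps(soul)}\nWORLD: {json.dumps(world)}\nUSER: {message}"


def run_turn(message: str, soul: dict, world: list,
             model: Callable[[str], str],
             save: Callable[[dict, list], None]) -> str:
    """One pass through the loop; returns only what the player should see."""
    raw = model(compile_context(soul, world, message))   # compiled context -> model
    blocks = HIDDEN_BLOCK.findall(raw)
    visible = HIDDEN_BLOCK.sub("", raw).strip()          # visible narration
    try:
        state = json.loads(blocks[-1]) if blocks else {}  # hidden state -> parser
    except json.JSONDecodeError:
        state = {}                                       # malformed: skip updates
    if not isinstance(state, dict):
        state = {}
    soul_deltas = state.get("soul", {})
    if isinstance(soul_deltas, dict):                    # Soul update
        for key, delta in soul_deltas.items():
            if isinstance(delta, (int, float)):
                soul[key] = soul.get(key, 0) + delta
    world_events = state.get("world", [])
    if isinstance(world_events, list):                   # World update
        world.extend(world_events)
    save(soul, world)                                    # save, then repeat
    return visible
```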
There was a mock provider in the code, but I do not want to oversell it.
The mock was not where the real roleplay testing happened.
It was mostly an early placeholder for the chat interface and turn flow. It could return a canned answer, attach a predictable hidden-state block, and make the UI look like it was doing something.
That was useful for building the app shape.
It proved that a button could send a message.
It proved that the backend could return a response.
It proved that hidden state could exist in the pipeline.
It proved that the UI could update.
But as a test of roleplay, memory, narration, or psychology, it was basically useless.
It did not surprise me.
It did not drift.
It did not forget.
It did not misunderstand the format.
It did not fail like a real model fails.
And because it did not fail like a real model, it could not prove the idea.
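A mock in that spirit might look like the following hypothetical Python sketch. `MockProvider` and its reply text are invented here. It is useful for wiring the UI because it is deterministic. It is useless as a roleplay test for the same reason, because its hidden block is always present and always valid.

```python
import json


class MockProvider:
    """Hypothetical stand-in for a model: same reply shape every turn, never fails."""

    def __init__(self) -> None:
        self.turn = 0

    def complete(self, prompt: str) -> str:
        # The prompt is ignored entirely; a real model would react to it.
        self.turn += 1
        narration = f"The character considers your words. (mock turn {self.turn})"
        hidden = {"soul": {"trust": 1}, "memory": f"mock memory {self.turn}"}
        # Always present, always valid JSON: a guarantee a real model does not give.
        return f"{narration}\n<hidden_state>{json.dumps(hidden)}</hidden_state>"
```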
The real testing came from actual LLMs.
I was already using OpenRouter heavily to try different models for AI RP, so free OpenRouter models became the obvious way to test without burning too much money. I needed real generations. I needed actual model behavior. I needed the system to survive contact with chaos.
A fake response could prove that the interface worked.
A real model could prove whether the idea held together.
That was where the useful failures started.
But before the failures, there was also a good moment.
The first real test felt good in a way the mock never could.
For the first time, the app was not just clicking through a fake pipeline. A real model answered. The chat moved. The scene had atmosphere. The narration had texture. It was not the final experience, but it was enough to make the project feel less imaginary.
That mattered.
A working mock can tell you the pipe is connected.
A real model can make you feel the product.
And for a moment, it did feel like something was there.
That was encouraging.
But it was not a clean victory.
The good feeling came with a warning almost immediately.
The output was nice, but it was also wrong in a very important way. The model was still talking too much like the character instead of acting like a narrator. It slipped into first-person character presence. It made the experience feel closer to a normal character chatbot than the GM/narrator architecture I wanted.
That worried me.
Because Mnemosyne was not supposed to be another character impersonator.
The whole point was that the narrator should describe the character, while the Soul and World Log carried continuity underneath. If the model collapsed into “being” the character, then the knowledge boundaries would collapse too.
The character could steal narrator knowledge.
The narrator could steal the user’s agency.
The system could drift back into the exact platform behavior I was trying to escape.
So the first real test was encouraging and alarming at the same time.
It proved the app could feel good.
It also proved that feeling good was not enough.
A real model might forget to include the hidden state.
A real model might expose the hidden block to the user.
A real model might wrap the hidden state in the wrong format.
A real model might summarize instead of updating.
A real model might produce beautiful narration and then ruin the machine-readable part.
A real model might keep adding memories instead of letting anything fade.
A real model might understand the rule, explain the rule, and then fail to follow it three turns later.
That was exactly the kind of failure I needed to see.
Because the point was not to make a toy demo work.
The point was to find the places where a real model would break the architecture.
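Some of those failures can be detected mechanically. Below is a hypothetical sketch of a reply classifier in Python. It assumes the same `<hidden_state>` wrapper around plain JSON. The leak patterns are crude guesses. Judgment failures, such as summarizing instead of updating or never letting memories fade, are beyond what string checks can catch.

```python
import json
import re

HIDDEN_BLOCK = re.compile(r"<hidden_state>(.*?)</hidden_state>", re.DOTALL)
# Hypothetical leak signals: a markdown json fence or JSON-looking keys in the narration.
LEAK_HINTS = re.compile(r"```json|\{\s*\"[a-z_]+\"\s*:", re.IGNORECASE)


def classify_reply(raw: str) -> list[str]:
    """Label the ways a reply broke the hidden-state contract (empty list = clean)."""
    problems = []
    blocks = HIDDEN_BLOCK.findall(raw)
    visible = HIDDEN_BLOCK.sub("", raw)
    if not blocks:
        problems.append("missing")       # forgot to include the hidden state
    if LEAK_HINTS.search(visible):
        problems.append("leaked")        # machine-readable text reached the narration
    if "<hidden_state>" in visible or "</hidden_state>" in visible:
        problems.append("unbalanced")    # opened or closed a block, but not both
    for block in blocks:
        try:
            json.loads(block)
        except json.JSONDecodeError:
            problems.append("malformed")  # right wrapper, broken payload
            break
    return problems
```

A reply wrapped in a markdown fence instead of the expected tags comes back as `["missing", "leaked"]`. This is the "wrong format" case from the list.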
One of the first problems was formatting.
At first, hidden state could be plain JSON. That was easy to inspect, easy to test, and easy to understand. But plain JSON was also fragile and ugly. It was too easy for the model to leak it, corrupt it, or blend it into the visible response.
So the hidden state had to become more controlled.
The app needed a recognizable marker.
The payload needed to be compact.
The parser needed to support the new format without destroying older test transcripts.
That was why encoded hidden state became important.
The hidden block moved toward a versioned payload format: something the engine could identify, decode, and parse without treating it like normal prose.
That did not make the system perfect.
But it made the boundary sharper.
The narration was for the player.
The encoded hidden state was for the engine.
That boundary was everything.
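Below is a sketch of a versioned, encoded marker with a legacy fallback, in Python. The `[[mnemo:v2:...]]` shape, the base64-over-JSON encoding, and the function names are all assumptions, because the devlog does not specify the real format. It illustrates the requirements described earlier. The new format is recognized by its marker. Unknown versions are refused instead of guessed at. Older plain-JSON transcripts still parse.

```python
import base64
import json
import re
from typing import Optional, Tuple

# Hypothetical v2 marker: [[mnemo:v2:<base64 of JSON>]]. Not the app's real format.
ENCODED = re.compile(r"\[\[mnemo:v(\d+):([A-Za-z0-9+/=]+)\]\]")
# Older transcripts: plain JSON between tags (also hypothetical).
LEGACY = re.compile(r"<hidden_state>(.*?)</hidden_state>", re.DOTALL)


def encode_state(state: dict, version: int = 2) -> str:
    payload = base64.b64encode(json.dumps(state).encode("utf-8")).decode("ascii")
    return f"[[mnemo:v{version}:{payload}]]"


def decode_state(raw: str) -> Tuple[str, Optional[dict], str]:
    """Return (visible_text, state_or_None, format_label)."""
    match = ENCODED.search(raw)
    if match:
        visible = ENCODED.sub("", raw).strip()
        if match.group(1) != "2":
            return visible, None, f"unsupported v{match.group(1)}"
        try:
            data = base64.b64decode(match.group(2), validate=True)
            return visible, json.loads(data.decode("utf-8")), "v2"
        except ValueError:
            # binascii.Error, UnicodeDecodeError and JSONDecodeError all subclass ValueError.
            return visible, None, "corrupt v2"
    legacy = LEGACY.search(raw)
    if legacy:
        visible = LEGACY.sub("", raw).strip()
        try:
            return visible, json.loads(legacy.group(1)), "legacy-json"
        except json.JSONDecodeError:
            return visible, None, "corrupt legacy"
    return raw.strip(), None, "none"
```

The marker only uses characters that rarely appear in prose, so the engine can find it without treating the narration as data.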
Around the same time, I also started seeing why debug tools mattered.
When hidden state is visible, debugging is easy. You just read the ugly block.
When hidden state is actually hidden, debugging becomes harder. If the Soul updates wrong, I need to know why. If the parser fails, I need to know where. If a memory gets added, scored, discarded, or consolidated, I need to see the trace.
So diagnostics became part of the early app.
Not as polish.
As survival.
I needed a way to inspect the memory cycle without exposing the machinery to the player. I needed to see what the model returned, what the parser extracted, what the engine accepted, and what changed inside the Soul.
That was the practical reason for turn debug panels and memory cycle diagnostics.
The player should not see the machinery.
But the developer absolutely needs to.
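Below is a hypothetical shape for a per-turn trace record, in Python. It covers the four things listed above:

- what the model returned
- what the parser extracted
- what the engine accepted or rejected
- what changed in the Soul

The field and method names are illustrative and are not taken from the app's turn debug panel.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnTrace:
    """Hypothetical per-turn debug record: for the developer, never the player."""
    raw_reply: str                                   # what the model returned
    extracted: Optional[dict] = None                 # what the parser pulled out
    accepted: dict = field(default_factory=dict)     # what the engine kept
    rejected: dict = field(default_factory=dict)     # what it refused, with reasons
    soul_before: dict = field(default_factory=dict)
    soul_after: dict = field(default_factory=dict)

    def soul_diff(self) -> dict:
        """Soul keys whose values changed this turn, as key -> (before, after)."""
        keys = self.soul_before.keys() | self.soul_after.keys()
        return {
            k: (self.soul_before.get(k), self.soul_after.get(k))
            for k in sorted(keys)
            if self.soul_before.get(k) != self.soul_after.get(k)
        }

    def render(self) -> str:
        """Plain-text summary suitable for a debug panel."""
        lines = [f"accepted: {sorted(self.accepted)}", f"rejected: {self.rejected}"]
        lines += [f"{k}: {a} -> {b}" for k, (a, b) in self.soul_diff().items()]
        return "\n".join(lines)
```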
This was also where the narrator architecture became more urgent.
Before the real test, narrator mode was mostly an architecture principle.
After the real test, it became a practical requirement.
If the AI was allowed to behave like the character directly, it would blur everything. It might speak in first person. It might steal the user’s actions. It might treat narrator knowledge as character knowledge. It might expose internal state as dialogue.
That was not Mnemosyne.
Mnemosyne needed the model to act as a narrator.
The narrator writes the visible scene.
The hidden state reports candidate changes.
The app strips and parses the hidden state.
The engine updates the Soul and World Log.
The player only sees the story.
That sounds obvious now, but at this stage it only became obvious through bugs.
Every failure pointed back to the same lesson:
The model should not carry the whole system by itself.
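One crude way to flag the narrator slipping into the character's own voice is to look for first-person words outside quoted dialogue. This is a hypothetical Python heuristic for a debug panel, not the app's actual guardrail logic. It will miss subtler failures, such as the narrator deciding the user's actions. It can also misfire on legitimate prose.

```python
import re

# Crude, hypothetical heuristic. Quoted dialogue (straight or curly quotes) is
# removed first, because a character saying "I" inside quotes is fine.
QUOTED = re.compile(r"\"[^\"]*\"|“[^”]*”")
# Longer forms come first so "I'm" is not matched as a bare "I".
FIRST_PERSON = re.compile(r"\b(I'm|I've|I'd|I|me|my|mine|myself)\b", re.IGNORECASE)


def narrator_slips(visible: str) -> list[str]:
    """First-person words left in the narration once dialogue is removed."""
    narration = QUOTED.sub(" ", visible)
    return FIRST_PERSON.findall(narration)
```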
At this stage, I was still asking the LLM to do too much inside one response.
It had to write the scene.
It had to follow the selected narrative mode.
It had to respect the user’s agency.
It had to avoid leaking hidden machinery.
It had to flag memories.
It had to judge emotional importance.
It had to update relationship values.
It had to keep the Soul coherent.
It had to remember to forget.
That was too much for one generation to do perfectly.
But I had not fully escaped that yet.
This was not the dual-pass breakthrough.
That came later.
At this point, the lesson was smaller but important:
The app needed to catch the model’s output and begin enforcing boundaries around it.
The AI could propose.
The engine had to manage.
That was the start of the real boundary.
Not complete.
Not clean.
But visible.
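"The AI could propose. The engine had to manage." Below is a minimal Python sketch of that split, under invented rules:

- an allowlist of Soul fields
- a per-turn cap on each change
- fixed bounds on every value

`ALLOWED`, `MAX_STEP`, `RANGE`, and `apply_proposal` are hypothetical. The devlog does not say which limits the engine enforced at this stage.

```python
# Hypothetical rules: which Soul values the model may touch, and how far per turn.
ALLOWED = {"trust", "fear", "affection"}
MAX_STEP = 2          # assumed per-turn cap on any single change
RANGE = (-10, 10)     # assumed bounds for every value


def apply_proposal(soul: dict, proposal: dict) -> tuple[dict, list[str]]:
    """The model proposes deltas; the engine decides what actually changes."""
    updated = dict(soul)  # never mutate the caller's Soul in place
    notes = []
    for key, delta in proposal.items():
        if key not in ALLOWED:
            notes.append(f"rejected unknown field {key!r}")
            continue
        if isinstance(delta, bool) or not isinstance(delta, (int, float)):
            notes.append(f"rejected non-numeric {key!r}")
            continue
        step = max(-MAX_STEP, min(MAX_STEP, delta))
        if step != delta:
            notes.append(f"clamped {key} step {delta} -> {step}")
        low, high = RANGE
        updated[key] = max(low, min(high, updated.get(key, 0) + step))
    return updated, notes
```

The notes are intended for a debug trace, so every refusal stays visible to the developer but not to the player.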
The mock provider gave the interface a shape.
OpenRouter models gave the system real failures.
The first real test gave me a glimpse of the product.
Encoded hidden state gave the engine something it could parse.
Diagnostics gave me a way to see what was happening behind the curtain.
And the Tauri app gave all of it a place to run.
That was the point where Mnemosyne stopped being only a conceptual memory system.
It became an app trying to keep the magic invisible.
Next: Devlog 004, where the app had to learn what to send the model: context compiler work, cleaner Soul and World snapshots, provider payloads, API turn handling, and early narrator guardrails.
Covered commits
This is the first devlog with direct commit coverage. Devlogs 000 to 002 covered pre-repository origin work and early design history. This entry starts tracking the repository from the initial legal/project foundation into the early hidden-state implementation period.
f26cbfe Add AGPL-3.0 license
a642953 Scaffold Tauri desktop client
1fe8132 Initial commit
35bf8bf Declare AGPL package license
303d4f7 Merge GitHub initial repository state
b0697fa Wire mock provider turn flow
aee7972 Encode mock hidden state
9c242ea Add local delete controls
505f75b Align prototype with narrator architecture
c9fe40d Fix native Tauri build setup
b6fb370 Add mock turn acceptance coverage
365d105 Surface memory cycle diagnostics
1f6ed4b Document future settings architecture
53a12d1 Separate setting controls from Soul identity
c526a87 Add Setting Soul lifecycle
802e588 Align API hidden state prompt
21f5efd Add turn debug panel
6f07c51 Improve chat workspace controls