Most teams using AI today are doing it one person at a time — individual ChatGPT accounts, scattered API experiments, no shared memory. I took a different approach: building a shared, company-wide AI assistant that knows the business, has access to company data sources, and gets smarter over time. Here's what I built and how it works.
The system runs as a multi-user platform on top of a frontier LLM (Claude by Anthropic). Each conversation runs in an isolated Docker container with its own workspace. Containers are ephemeral — they spin up per session — but persistent state lives in mounted volumes: one private volume per user, one shared volume per team channel.
A host orchestration layer handles container lifecycle, message routing, and scheduled task persistence. Scheduled tasks are stored in a host database, not crontab — so they survive container restarts and are manageable via API. The agent surfaces in Slack, is thread-aware, uses @mentions, and can spawn named sub-agents that appear as distinct Slack bot identities for complex multi-step workflows.
The tool interface between the LLM and the host uses the Model Context Protocol (MCP). Every tool call — file reads, bash commands, Slack messages, Notion writes, secret retrieval — goes through MCP servers that the harness exposes to the container over stdio. The LLM requests a tool call; the harness checks the permission tier; if approved, executes it and returns the result. This gives the harness fine-grained control over every action the model takes, with a complete audit trail at the harness layer rather than relying on the model to self-report.
Workspace layout: /workspace/global (read-only system config and skills, synced from host), /workspace/group (shared team channel workspace — memory, daily logs, conversation archives), /workspace/user (private per-user workspace, mounted on every container start — secrets, personal config, scripts). Session state in the user workspace persists across ephemeral containers because the volume is mounted from the host, not the container filesystem. The model is explicitly instructed to write intermediate results and decisions to files rather than holding them in context, which also protects against context rot as conversations grow long.
Not every task needs the same model. A model router directs requests to different tiers based on complexity: lightweight checks and simple summaries go to a fast, cheap model; multi-step reasoning, code generation, and anything requiring judgment goes to a more capable one. The routing logic lives in the harness, not the prompt — so it can't be talked out of it. In practice this cuts costs significantly without any quality regression, because most of the volume is low-complexity work that a smaller model handles fine.
The harness layer is built on NanoClaw, a ~500-line TypeScript wrapper around the Anthropic SDK. The closest comparison is OpenClaw, a popular open-source framework for running AI agents. The design differs in one central dimension: surface area.
OpenClaw is full-featured and handles a lot of the plumbing automatically. NanoClaw deliberately stays small. At ~500 lines, you can audit the entire harness in an afternoon, understand every code path, and modify it for your environment without fighting framework conventions. That size constraint is itself a security property: with a minimal harness and a strict egress firewall, the attack surface is enumerable.
The permission model is split between two explicit tiers: admin-level operations (harness process — container lifecycle, volume mounts, network config, MCP server registration) and user-level agent operations (runs inside the container with minimal permissions). The agent can only take actions the harness exposes via MCP. There is no path for a model to escalate from user-level to admin through prompt manipulation, because the privilege boundary is enforced by the OS process model, not the application layer.
The tradeoff: you own more of the plumbing. SDK upgrades, tool permission changes, and new MCP server integrations all require editing the harness. For a team that wants deep customization and a clear security story, that's a feature. For a team that wants to stay on autopilot, OpenClaw is the better starting point.
Security was first-class from day one, not retrofitted:
Permission modes run a spectrum: auto (model decides, high-stakes actions prompt the user), acceptEdits (file edits auto-approved, bash requires confirmation), bypassPermissions (everything runs without prompts — used only in fully automated scheduled tasks with no human in the loop). All hooks execute in the harness process, not in the container — so even a fully autonomous agent operating in bypassPermissions mode cannot suppress or rewrite them.
The egress firewall is implemented at the container network level using iptables rules applied when the container starts. The allowlist is read from a config file and can be reloaded live without restarting containers: the model can request a new domain by editing the config, but a reload MCP call is required to apply it — creating an observable, auditable step between request and effect rather than a silent background connection.
The most impactful architectural decision was mirroring all major company data sources into local SQLite databases, synced on schedules from each source's API. Here's the full picture at our current scale:
Syncs run on different schedules based on volatility: Slack messages every 30 minutes, Notion pages hourly, Salesforce records every 6 hours, calendar events daily. Each sync is incremental, tracking a high-water mark and pulling only new or modified records since the last run. This keeps sync duration short (seconds to low minutes) regardless of total corpus size.
All databases are opened with the ?immutable=1 URI parameter for agent reads. This disables WAL checkpoint and journal file checks, allowing fully concurrent reads across multiple containers without any locking overhead. Only the sync process holds a write connection. A typical agent container opens 10–12 SQLite databases simultaneously at session start; the immutable flag means none of them block each other or the sync writer.
All caches use SQLite's FTS5 (Full-Text Search version 5) extension, which builds inverted indexes over text columns. What this gives you:
A unified search script queries all caches simultaneously in parallel and merges ranked results across every source. Total latency across the full corpus is consistently under 100ms — no API calls, no rate limits, no per-query cost. The model is instructed to always search local caches first before ever claiming it doesn't have access to something.
Beyond text search, each cache exposes domain-specific structured queries: the CRM cache has pipeline (deal funnel view), account-summary (all activity for an account), and contacts-for commands. The calendar cache has upcoming, meetings-for (everyone's schedule with a given person), and meeting-type filters. These aren't just keyword search — they're normalized relational queries.
I'm deliberately not using vector embeddings for semantic similarity. FTS5 + BM25 has been sufficient for the retrieval tasks that come up in practice. The main gap is synonym and paraphrase matching, which matters occasionally but not often enough to justify maintaining a vector store. I'll add embeddings when the failure cases accumulate.
Under the hood, each cache uses stemming so that variant forms of a word ("settling", "settled", "settlement") all match the same query. The ranking algorithm is BM25, which weights results by term frequency and document rarity. Because scores aren't directly comparable across databases of different sizes, the unified search layer normalizes them before merging — with configurable weights per source so that, for example, an exact phrase match in a Notion doc ranks higher than a loose keyword hit in a Slack message.
I implemented the Karpathy LLM knowledge base pattern for entity-level intelligence. The idea: instead of doing multi-source retrieval at query time (slow, expensive, context-heavy), a scheduled job pre-compiles all evidence about key entities into structured wiki articles.
Two phases: (1) Gather — pull every piece of evidence about an entity (an account, partner, deal) from CRM, Slack, docs, and calendar into structured JSON. Every message, every meeting, every activity. (2) Summarize — an agent reads that JSON and writes a concise, opinionated entity profile using a template. Query-time retrieval becomes a single-document read. Profiles stay fresh via scheduled re-compilation.
The key insight: pre-compilation shifts cost from query time (every conversation, many users) to batch time (once per schedule run), and produces a single coherent document instead of raw evidence scattered across four databases.
The gather phase produces a structured evidence object like this:
{
"entity": "Acme Corp",
"crm": { "stage": "Negotiation", "arr": 120000, "close_date": "2026-06-30", "owner": "..." },
"slack_messages": [
{ "ts": "1743200000.000000", "sender": "...", "channel": "#bd", "body": "..." }
],
"meetings": [
{ "date": "2026-03-15", "participants": ["...", "..."], "summary": "..." }
],
"docs": [
{ "title": "...", "url": "...", "excerpt": "..." }
]
}
The summarizer prompt is deliberately opinionated: it's told to surface risks, blockers, and momentum signals rather than just describe activity. The output template has fixed sections — relationship status, key contacts, open items, recommended next steps — so the profiles are structurally consistent and can be compared across entities. An agent writing "they went quiet after the last call" is more useful than an agent writing "last contact was 2026-03-15."
Beyond single-turn queries, the system supports several patterns for multi-step and parallel work:
An internal metrics API (no external auth, accessible from agent containers) exposes per-run cost breakdowns, daily cost history, agent run stats and error rates, active container counts, cache sync coverage and freshness, and task run history with failure detection.
Alert rules define when to fire vs. stay silent: urgent events always alert; low-priority items and out-of-hours events (10pm–8am) are suppressed. A platform that could generate constant noise needs explicit suppression rules to remain useful.
A use-case monitor tracks what the system is actually being used for across all conversations: which skill categories are invoked most, which queries hit the caches vs. fall back to the API, which tasks get corrected. This feeds a reinforcement loop. High-frequency use cases get dedicated skills and better prompts. Repeated corrections become durable memory entries. Queries that consistently miss in the local caches flag data sources worth adding. The system learns what it needs to be better at by watching itself work.
When a pattern appears consistently across multiple users — the same kind of query, the same multi-step workflow, the same correction — that's a signal to generalize it into a proper skill. A one-off prompt becomes a reusable template. A repeated workflow gets its own orchestration logic. The monitor surfaces these patterns; the skill system absorbs them. New users immediately benefit from what earlier users encountered and refined.
Closing the loop also means running UAT before shipping changes that affect how the system behaves for users. Prompt changes, new skills, modified memory rules — anything that touches the interaction model gets tested against representative queries before it goes live. The goal is catching regressions that unit tests won't find: cases where the system technically works but produces worse answers, or where a new skill conflicts with an existing one.
There are two distinct memory problems in a system like this, and they require different solutions. The first is in-session context: the LLM's active window fills up over a long conversation, degrading quality as irrelevant history crowds out recent signal. The second is cross-session persistence: the model starts each new session knowing nothing about what happened before.
For in-session context, the main countermeasure is treating the file system as external memory. Rather than holding intermediate results in the conversation thread, the agent writes them to files and references them by path. Tool output gets capped — extract what you need, discard the rest. Structured plans survive compaction better than freeform prose, because the plan file is always re-readable. Scheduled tasks always run in isolated context mode: no conversation history, all necessary context in the prompt. This eliminates an entire class of failures where a stale, hours-old conversation poisons a recurring job.
For cross-session persistence, the system uses a typed memory store: four categories (user profile, feedback, project context, external references), each stored as a markdown file with a one-line pointer in a central index. A separate FTS5-indexed search layer makes memories retrievable by keyword. The discipline that matters most: the model is explicitly told that saying "noted" without a file write means nothing was actually remembered. Without that rule, corrections evaporate.
One of the stranger properties that emerges at this level of capability: the system can improve itself by reading other people's setups. An agent can look at how someone else configured their skills, notice a pattern that would work well here, and propose — or directly implement — the change. What's crazy about the current moment is that this loop is real. The system reaches a point where it understands its own scope well enough to keep extending it. You stop adding features to your AI assistant and start having it add features to itself.
Building something like this requires a surprisingly broad skill set in one person or a small team: sysadmin access to wire up integrations and manage infrastructure, enough coding ability to write skills and debug data pipelines, access to internal data sources (Slack, Notion, GitHub, CRM), a working understanding of what each team actually does, and the ongoing willingness to watch how people use the system and keep improving it. No single piece is especially hard — but you need all of them at once.
The goal also has to be scoped correctly. Not "enable the finance team to run their entire close process end-to-end," but "help a specific accountant speed up the part of their job where they personally create journal entries." Start with one person, one workflow, one bottleneck. Prove it works. Expand when the next bottleneck becomes obvious.
At 22 employees I can pull this off alone — I'm a super admin on every system, I understand the business context end-to-end, and I have the technical background to debug whatever breaks. At larger companies this probably needs a small dedicated team: someone to own the infrastructure, someone embedded in each business unit who understands the actual workflows, and someone responsible for maintaining and improving the system as it grows. The skills are real; they're just spread across multiple roles at scale.
Build the web sanitizer on day one — wrapping all external content (web pages, emails, cached data) as untrusted is what actually stops prompt injection at the entry point, and retrofitting it means auditing every place the system reads outside data. The egress firewall is the network-layer counterpart and worth doing early too. Invest in the local cache layer earlier — the speed and cost difference versus live API calls at query time is larger than I expected, and the cross-source unified search turns out to be the most-used feature by far. Be more aggressive about the feedback memory system from the start — the compounding benefit is real but takes weeks to accumulate.
The biggest unlock wasn't any single feature. It was making the system's knowledge persistent and cumulative rather than starting from scratch each session. That's the shift from "an AI you query" to "an AI that knows your business."
<aside> 📖
This is Part 1 of a two-part series. Read Part 2: NanoClaw Architecture →
</aside>