When to Stop Adding Ops: The Fallback Case for LLM-Generated Code
The Question
Should Daytona or E2B.dev fit somewhere in the DCY/Add9 architecture?
Initial answer: no. Daytona pivoted from remote dev environments to AI-agent sandbox infrastructure: sub-90ms isolated VMs, Python/TS/JS runtimes, snapshot-based evals. Our pipeline doesn't execute LLM-authored code. Our LLMs return structured JSON via stored prompts, and Python orchestrates deterministic stages. With no exec surface, a sandbox adds no value.
Then the pushback: half the problem we're solving is giving each Orca the ability to analyze data sets of any shape and deterministically compute against them. That reframes the answer.
What We've Already Built
Looking at the code: we've already built an in-house version of what Daytona/E2B sell.
- iib/source_steps.py has _apply_<op> functions
- KNOWN_OPS in aggregation_plan.py
- _OP_REGISTRY
- Phase D prompt schema hint
- Plan validator
All of these have to stay in lockstep every time a new client uploads a quirky CSV. That's the treadmill an LLM-code-interpreter pattern is designed to retire. Every weird xlsx with merged headers or pivoted-then-unpivoted data is a new op someone has to author in three places.
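To make the lockstep cost concrete, here is a minimal sketch of a drift check across three of those places. The op names, the stub _apply_ functions, and PROMPT_SCHEMA_OPS are hypothetical stand-ins, not the contents of the real modules.

```python
# Hypothetical stand-ins for the real registries; names and contents are assumptions.
def _apply_filter(rows, step):
    # Keep rows whose column equals the requested value.
    return [r for r in rows if r.get(step["column"]) == step["value"]]


def _apply_rename(rows, step):
    # Rename keys according to step["mapping"], leaving other keys alone.
    return [{step["mapping"].get(k, k): v for k, v in r.items()} for r in rows]


KNOWN_OPS = {"filter", "rename", "unpivot"}      # what the plan validator accepts
_OP_REGISTRY = {"filter": _apply_filter, "rename": _apply_rename}  # what can run
PROMPT_SCHEMA_OPS = ["filter", "rename"]         # what the prompt advertises


def find_drift(known_ops, registry, prompt_ops):
    """Return {op: [places where it is missing]} for every op that is out of sync."""
    places = (
        ("KNOWN_OPS", known_ops),
        ("_OP_REGISTRY", registry),
        ("prompt_schema", prompt_ops),
    )
    all_ops = set(known_ops) | set(registry) | set(prompt_ops)
    drift = {}
    for op in sorted(all_ops):
        missing = [name for name, place in places if op not in place]
        if missing:
            drift[op] = missing
    return drift


drift = find_drift(KNOWN_OPS, _OP_REGISTRY, PROMPT_SCHEMA_OPS)
print(drift)
```

Here "unpivot" is accepted by the validator but has no implementation and is not advertised to the model, which is the kind of partial registration a new op invites.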
Where Dynamic Python Would Pay Off
Ingest / shape normalization: Let the LLM write pandas code to handle quirky files.
Reshape/filter/groupby/join as fallback: When the typed plan fails.
Schema discovery and data quality work: Iterate-on-error is exactly what code interpreters are good at.
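The iterate-on-error pattern can be sketched roughly as follows. fake_llm is a hypothetical stand-in for a model call (it returns a buggy transform first and a fixed one once it sees the error), rows are plain dicts rather than a DataFrame to keep the sketch stdlib-only, and exec here is not a sandbox. In practice this step is where a sandboxed runtime would sit.

```python
def fake_llm(prompt, error=None):
    """Hypothetical model call: first attempt guesses the wrong column name."""
    if error is None:
        return "def transform(rows):\n    return [{'hours': float(r['Hours'])} for r in rows]\n"
    # On retry the previous error message is available, so the guess is corrected.
    return "def transform(rows):\n    return [{'hours': float(r['hrs'])} for r in rows]\n"


def run_candidate(code, rows):
    """Run generated code. NOTE: plain exec is NOT isolation; illustration only."""
    namespace = {}
    exec(code, namespace)
    return namespace["transform"](rows)


def generate_with_retries(rows, max_attempts=3):
    """Ask for code, run it, and feed any exception back into the next attempt."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = fake_llm("normalize these rows", error)
        try:
            return run_candidate(code, rows), code, attempt
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no working transform after {max_attempts} attempts: {error}")


rows = [{"hrs": "7.5"}, {"hrs": "8"}]
result, code, attempts = generate_with_retries(rows)
print(attempts, result)
```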
Where It Would Be a Regression
Keep the formula DSL in metrics/formula.py. We get a human-readable provenance string (hours / (headcount * 40 * weeks)) rendered into Pulse citations. Replacing that with opaque Python is a downgrade.
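As an illustration of why the DSL is worth keeping, here is a minimal sketch of an arithmetic-only evaluator that returns the formula string alongside the value. It is not the implementation in metrics/formula.py; evaluate_formula, the input numbers, and the result shape are assumptions.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, variables):
    """Evaluate an arithmetic formula and keep its text as the provenance string."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    value = _eval(ast.parse(formula, mode="eval"))
    return {"value": value, "provenance": formula}


cell = evaluate_formula(
    "hours / (headcount * 40 * weeks)",
    {"hours": 1520, "headcount": 10, "weeks": 4},  # made-up inputs
)
print(cell)
```

The value and the human-readable string travel together, which is what an opaque generated Python blob would lose.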
Keep Sentinels rule-based. Sub-200ms budget, no sandbox round trip.
Keep Causal Twin sim and Pulse narrative as pure prompt→JSON.
The Determinism Piece
Generating the code is non-deterministic, but running it doesn't have to be. Sandboxed LLM-authored code is deterministic across runs if you cache it correctly:
- Fingerprint (orca_id, source schema) → cached generated code
- Store the code blob in pipeline_cell_provenance
- Execute the cached code on every subsequent run without re-prompting
That's actually a richer provenance story than what we have today.
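The caching scheme described above might look something like this sketch. schema_fingerprint, CodeCache, and the in-memory dict standing in for pipeline_cell_provenance are all hypothetical, and treating column order as irrelevant to the fingerprint is an assumption.

```python
import hashlib
import json


def schema_fingerprint(orca_id, columns):
    """Stable hash of (orca_id, schema). columns maps column name -> dtype string."""
    payload = json.dumps(
        {"orca_id": orca_id, "columns": sorted(columns.items())},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CodeCache:
    """Generate code once per fingerprint, then reuse the stored blob."""

    def __init__(self, generate):
        self._generate = generate      # hypothetical LLM-backed code generator
        self._store = {}               # stand-in for pipeline_cell_provenance
        self.generate_calls = 0

    def get_code(self, orca_id, columns):
        key = schema_fingerprint(orca_id, columns)
        if key not in self._store:
            # Cache miss: this is the only point where the model is prompted.
            self.generate_calls += 1
            self._store[key] = self._generate(columns)
        return key, self._store[key]


cache = CodeCache(lambda columns: "def transform(rows):\n    return rows\n")
key_a, code_a = cache.get_code("orca-1", {"hrs": "float", "team": "str"})
key_b, code_b = cache.get_code("orca-1", {"team": "str", "hrs": "float"})
key_c, code_c = cache.get_code("orca-1", {"hrs": "float"})
print(cache.generate_calls)
```

The same schema hits the cache regardless of column order, and a changed schema triggers exactly one new generation.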
Vendor Comparison
Daytona: Open-source, hybrid mode, generous SDK.
E2B.dev: More mature code-interpreter SDK specifically.
OpenAI Assistants Code Interpreter: Easiest to adopt, but you lose runtime control and can't cache the generated code.
Roll-your-own Docker + seccomp: ~1-2 weeks of work, cheapest at scale, largest security surface to own.
The Recommendation
Don't replace the typed pipeline. Add a Phase E fallback path.
When plan_generator.validate_plan rejects a plan with unknown_op, or apply_steps throws on real input, fall back to an LLM-authored DataFrame transform. Cache it by (orca, source_schema_fingerprint). Surface the cached code in cell_provenance for audit.
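A rough sketch of that routing decision, under stated assumptions: UnknownOpError, TYPED_OPS, this toy apply_steps, and the fallback callable are hypothetical stand-ins for the real pipeline, and rows are plain dicts rather than a DataFrame.

```python
class UnknownOpError(Exception):
    """Raised when a plan names an op the typed pipeline does not implement."""


# Hypothetical typed ops; the real registry is larger.
TYPED_OPS = {
    "filter": lambda rows, step: [r for r in rows if r[step["column"]] == step["value"]],
}


def apply_steps(rows, plan):
    """Toy typed path: run each step through its registered op."""
    for step in plan:
        op = TYPED_OPS.get(step["op"])
        if op is None:
            raise UnknownOpError(step["op"])
        rows = op(rows, step)
    return rows


def run_with_fallback(rows, plan, fallback):
    """Try the typed path first; record why the fallback ran if it did."""
    try:
        return {"rows": apply_steps(rows, plan), "path": "typed", "reason": None}
    except UnknownOpError as exc:
        reason = f"unknown_op:{exc}"
    except Exception as exc:
        reason = f"apply_steps_error:{type(exc).__name__}"
    # In the real design this would run cached, sandboxed LLM-authored code.
    return {"rows": fallback(rows, plan), "path": "phase_e_fallback", "reason": reason}


rows = [{"team": "a", "hrs": 8}, {"team": "b", "hrs": 6}]
passthrough = lambda rows, plan: rows  # placeholder for the generated transform

typed = run_with_fallback(rows, [{"op": "filter", "column": "team", "value": "a"}], passthrough)
unknown = run_with_fallback(rows, [{"op": "unpivot"}], passthrough)
broken = run_with_fallback(rows, [{"op": "filter", "column": "missing", "value": 1}], passthrough)
print(typed["path"], unknown["reason"], broken["reason"])
```

Recording the reason alongside the path keeps the audit trail honest about when and why the typed pipeline was bypassed.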
Keep formula DSL, Sentinels, and JSON-only LLM stages untouched.
Daytona's hybrid mode or E2B is the realistic shortcut for a team of our size.