When to Stop Adding Ops: The Fallback Case for LLM-Generated Code

June 24, 2026 · note

The Question

Should Daytona or E2B.dev fit somewhere in the DCY/Add9 architecture?

Initial answer: no. Daytona pivoted from remote dev environments to AI-agent sandbox infrastructure — sub-90ms isolated VMs, Python/TS/JS runtimes, snapshot-based evals. Our pipeline doesn't execute LLM-authored code. Our LLMs return structured JSON via stored prompts; Python orchestrates deterministic stages. No exec surface, no sandbox value.

Then the pushback: half the problem we're solving is giving each Orca the ability to analyze data sets of any shape and deterministically compute against them. That reframes the answer.

What We've Already Built

Looking at the code: we've already built an in-house version of what Daytona/E2B sell.

iib/source_steps.py has _apply_<op> functions
KNOWN_OPS in aggregation_plan.py
_OP_REGISTRY
Phase D prompt schema hint
Plan validator

All five have to stay in lockstep every time a new client uploads a quirky CSV. That's the treadmill an LLM-code-interpreter pattern is designed to retire. Every weird xlsx with merged headers or pivoted-then-unpivoted data is a new op someone has to author across all of them.

Where Dynamic Python Would Pay Off

Ingest / shape normalization: Let the LLM write pandas code to handle quirky files.

Reshape/filter/groupby/join as fallback: When the typed plan fails.

Schema discovery and data quality work: Iterate-on-error is exactly what code interpreters are good at.

Where It Would Be a Regression

Keep the formula DSL in metrics/formula.py. We get a human-readable provenance string (hours / (headcount * 40 * weeks)) rendered into Pulse citations. Replacing that with opaque Python is a downgrade.
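As a rough illustration of why the formula string is worth keeping, here is a sketch of a restricted evaluator that computes a value and carries the formula along as its citation. This is not the code in metrics/formula.py: evaluate_formula, the supported operators, and the input numbers are all assumptions.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, variables):
    """Evaluate an arithmetic formula string against named inputs."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported syntax in formula: {ast.dump(node)}")

    return _eval(ast.parse(formula, mode="eval"))


# Made-up inputs: 1520 hours across 4 people over 10 weeks.
formula = "hours / (headcount * 40 * weeks)"
inputs = {"hours": 1520, "headcount": 4, "weeks": 10}
value = evaluate_formula(formula, inputs)

# The same string that was evaluated is the one shown to the reader.
citation = f"{value:.2f} = {formula}"
print(citation)
```

With these inputs the denominator is 4 * 40 * 10 = 1600, so the value is 0.95, and the citation reads as the formula itself. A generated pandas snippet computing the same number would not give us that string for free.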

Keep Sentinels rule-based. Sub-200ms budget, no sandbox round trip.

Keep Causal Twin sim and Pulse narrative as pure prompt→JSON.

The Determinism Piece

Sandboxed LLM-authored code can be deterministic if you cache it right:

Fingerprint (orca_id, source schema) → cached generated code
Store the code blob in pipeline_cell_provenance
Execute the cached code on every subsequent run without re-prompting

That's actually a richer provenance story than what we have today.
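The three steps above can be sketched as follows. The in-memory dict stands in for pipeline_cell_provenance, and schema_fingerprint, GeneratedCodeCache, and fake_generate are hypothetical names; the real generator would be the LLM call.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash column names and dtypes so column order does not matter."""
    canonical = json.dumps(sorted(columns.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GeneratedCodeCache:
    """Prompt once per (orca_id, schema fingerprint), then reuse the code."""

    def __init__(self, generate):
        self._generate = generate  # assumption: callable that returns source code
        self._store = {}  # stand-in for the pipeline_cell_provenance table

    def get_code(self, orca_id, columns):
        key = (orca_id, schema_fingerprint(columns))
        if key not in self._store:
            # Cache miss: this is the only place the generator is called.
            self._store[key] = self._generate(columns)
        return self._store[key]


calls = []


def fake_generate(columns):
    # Stand-in for the LLM; records each call so re-prompting is visible.
    calls.append(columns)
    return "def transform(rows):\n    return rows\n"


cache = GeneratedCodeCache(fake_generate)
first = cache.get_code("orca-1", {"region": "str", "hours": "float"})
second = cache.get_code("orca-1", {"hours": "float", "region": "str"})
print(len(calls), first == second)
```

The second call reuses the stored blob even though the columns arrive in a different order, so the generator runs once. One caveat: identical code only gives identical output if the sandbox image and library versions are pinned as well.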

Vendor Comparison

Daytona: Open-source, hybrid mode, generous SDK.

E2B.dev: More mature code-interpreter SDK specifically.

OpenAI Assistants Code Interpreter: Easiest, but you lose runtime control and can't cache the generated code.

Roll-your-own Docker + seccomp: ~1-2 weeks of work, cheapest at scale, but the most security surface to own ourselves.

The Recommendation

Don't replace the typed pipeline. Add a Phase E fallback path.

When plan_generator.validate_plan rejects a plan with unknown_op, or apply_steps throws on real input, fall back to an LLM-authored DataFrame transform. Cache it by (orca, source_schema_fingerprint). Surface the cached code in cell_provenance for audit.
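A minimal sketch of that fallback dispatch, using plain lists of dicts instead of DataFrames so it runs without dependencies. UnknownOpError, run_typed_plan, run_with_fallback, and cached_transform are hypothetical; they are not the real validate_plan or apply_steps signatures.

```python
class UnknownOpError(Exception):
    """Hypothetical error for a plan that names an op outside the registry."""


def run_typed_plan(plan, rows, registry):
    """Apply each typed step in order; raise if an op is not registered."""
    for step in plan:
        op = step["op"]
        if op not in registry:
            raise UnknownOpError(op)
        rows = registry[op](rows, **step.get("params", {}))
    return rows


def run_with_fallback(plan, rows, registry, fallback_transform):
    """Try the typed plan first; on failure, use the fallback and record why."""
    try:
        return run_typed_plan(plan, rows, registry), {"path": "typed"}
    except UnknownOpError as exc:
        reason = f"unknown_op:{exc}"
    except Exception as exc:
        reason = f"apply_error:{type(exc).__name__}"
    return fallback_transform(rows), {"path": "phase_e_fallback", "reason": reason}


def _apply_filter(rows, column, equals):
    return [r for r in rows if r[column] == equals]


def cached_transform(rows):
    # Stand-in for executing the cached LLM-authored code in a sandbox.
    return [dict(r, hours=float(r["hours"])) for r in rows]


registry = {"filter": _apply_filter}
rows = [{"region": "EU", "hours": 10}, {"region": "US", "hours": 7}]

typed_plan = [{"op": "filter", "params": {"column": "region", "equals": "EU"}}]
ok_rows, ok_meta = run_with_fallback(typed_plan, rows, registry, cached_transform)
fb_rows, fb_meta = run_with_fallback([{"op": "unpivot"}], rows, registry, cached_transform)
print(ok_meta, fb_meta)
```

The typed path stays the default, and the returned metadata records which path ran and why. That record is what would sit next to the cached code for audit.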

Keep formula DSL, Sentinels, and JSON-only LLM stages untouched.

Daytona's hybrid mode or E2B is the realistic shortcut for a team our size.
