CLONE FROM KAGAMI
Error recovery is lacking; we need to recover when a node dies.
TEMPORAL WORKER (READ_WRITE_TASK_QUEUE)
Rebase support is critical; we should be able to reuse it here.
FindingRemediationV3 already uses a Claude SDK agent with Write/Edit/Bash tools in a writable workspace. Production-tested.
Most of this code should be reusable.
CREATE PR · GitHubPRExecutor · pr_opened
This should live outside of the workflow.
and opens a PR via the existing GitHubPRExecutor
The PR right now is opened exclusively via the API endpoint, and only updated asynchronously after a rebase. The workflow should also decouple this and should allow user review.
Partial success policy: If 3 of 5 targets succeed and 2 fail, should we open a PR with the successful upgrades or abort entirely?
This is an entirely new agent decision-making process.
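To make the question concrete, here is a minimal sketch of that decision surface; every name in it (TargetResult, decideOnPartialSuccess, the open-PR-on-any-success policy) is hypothetical and just one possible answer:

```typescript
// Hypothetical shape for the partial-success decision; all names are illustrative.
type TargetResult = { target: string; status: 'success' | 'failed' | 'skipped' };

interface PartialResultDecision {
  action: 'open_pr' | 'abort';
  reason: string; // surfaced to the user alongside the PR or the failure state
}

// One possible policy: open a PR if any target succeeded, listing the failures
// so the user can see exactly what was left out.
function decideOnPartialSuccess(results: TargetResult[]): PartialResultDecision {
  const succeeded = results.filter((r) => r.status === 'success');
  if (succeeded.length === 0) {
    return { action: 'abort', reason: 'No targets succeeded; nothing to ship.' };
  }
  const failed = results.filter((r) => r.status === 'failed').map((r) => r.target);
  return {
    action: 'open_pr',
    reason: failed.length
      ? `Opening PR with ${succeeded.length}/${results.length} targets; failed: ${failed.join(', ')}`
      : 'All targets succeeded.',
  };
}
```

Whatever policy wins, the reason string should land in the PR body or the failure state so the user sees what was dropped.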
Kagami clone may still be slow even with bundles. Should we set a size threshold for clone-based vs. a future overlayfs path?
overlayfs will always be the best call here. FSx persists between node reboots, which is why the worktree creation is actually very recoverable.
Multi-plan execution: Can a user execute multiple plans for the same codebase simultaneously? (Proposal: no — one active execution per codebase to avoid branch conflicts.)
Yes, 100% they can and they will. Each one may have different permutations. If they change their minds, they should make another one.
Plan staleness: How old can a plan be before we require re-planning? Commit SHA validation catches code changes, but should we also check SBOM scan freshness?
It should have no extraneous external file changes, and it should allow us to create a ref with the permissions at time of execution.
SbomRemediationExecutionWorkflow (6-phase Temporal)
Rebase is critical.
Planner structured output: add PlannerOutput return type, store upgrade_tiers on plan completion
Extraction step.
Workflow execution history visible in Temporal Web UI with heartbeat details, activity retries, and failure reasons. First line of debugging for execution issues.
Don't ship patch contents between the activities, hahaha. We do this now; I have a ticket for it.
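For reference, a minimal sketch of the decoupled version, assuming the Temporal TypeScript SDK and a hypothetical blob store keyed by patchKey (the activity and workflow names are illustrative, not the real interfaces):

```typescript
// Pass a small reference between activities instead of the patch body itself.
import { proxyActivities } from '@temporalio/workflow';

interface PatchRef {
  patchKey: string;  // key into an external blob store (e.g. S3); hypothetical
  sizeBytes: number; // recorded for observability, never shipped in the payload
}

const { generatePatch, applyPatch } = proxyActivities<{
  generatePatch(remediationId: string): Promise<PatchRef>;
  applyPatch(ref: PatchRef): Promise<void>;
}>({ startToCloseTimeout: '10 minutes' });

export async function remediationWorkflow(remediationId: string): Promise<void> {
  // Only the small PatchRef enters Temporal history; the patch body stays in
  // the blob store, keeping event payloads well under Temporal's size limits.
  const ref = await generatePatch(remediationId);
  await applyPatch(ref);
}
```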
1. Repo template: check .github/PULL_REQUEST_TEMPLATE.md in the clone
2. Org template: check .github/PULL_REQUEST_TEMPLATE.md in the .github repo
3. Nebari default: built-in template with CVE summary, changes, and test guidance
Way out of scope; let's stick with the Nebari default imo and move the PR template stuff to a separate task.
Progress Tracking
I would again add plan generation in here so we're getting terminal states for everything. Design for the UI! Encode the steps! State machine!
Execution rejected if base_commit_sha doesn't match current HEAD on the target branch. Prevents applying stale plans to changed code.
This is when we rebase, which we should be doing; otherwise nothing will actually get accepted. If we don't, we end up with a PR that has many, many unrelated changes, and clients have complained about this many times.
Each execution gets its own temp directory on local disk. No shared state with other executions, other codebases, or other tenants.
There should be some error recovery; I was told the disks aren't persisted.
Input: remediationId (UUID), upgradeOption ("minimal" | "moderate" | "comprehensive"), optional branchName, asDraft. Starts Temporal workflow, returns workflowId. Validates status = "plan_ready" and no active execution.
This isn't the PR-open step yet, right? Right now we generate the plan, then the patch, then let the user review the patch, then iterate on the patch, then open the PR immediately. That system should stay the same imo so it's consistent.
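For clarity, the input contract and preconditions as a sketch; zod is an assumption here, not necessarily what the endpoint actually uses:

```typescript
import { z } from 'zod';

// Mirrors the input described above; the schema name is illustrative.
const TriggerExecutionInput = z.object({
  remediationId: z.string().uuid(),
  upgradeOption: z.enum(['minimal', 'moderate', 'comprehensive']),
  branchName: z.string().optional(),
  asDraft: z.boolean().default(false),
});

// Preconditions, checked before starting the Temporal workflow:
//   remediation.status === 'plan_ready'
//   no active execution for this remediation
```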
Lockfile Regen
Again, imo this is way out of scope for now, but there's a cache step before and after, plus a check on whether we should invalidate.
~30s
30 seconds is hopeful, hahaha; a cargo install can take that long just to resolve the versions to download.
Clone from Kagami · git transfer progress (objects received / total) · ~30s
Run Executor Agent · agent tool calls (each tool invocation = heartbeat) · per tool call
What happens across a deploy here? Do we re-clone? That's the main error case we currently see.
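A sketch of the per-tool-call heartbeat, using the Temporal TypeScript activity API; runAgentStep and its return shape are invented for illustration:

```typescript
import { Context } from '@temporalio/activity';

// Hypothetical single-step agent driver, declared so the sketch is self-contained.
declare function runAgentStep(
  remediationId: string,
): Promise<{ toolName: string; finished: boolean }>;

export async function runExecutorAgent(remediationId: string): Promise<void> {
  let toolCalls = 0;
  for (;;) {
    const step = await runAgentStep(remediationId); // one tool invocation per step
    toolCalls += 1;
    // Heartbeat details surface in the Temporal Web UI next to the activity.
    Context.current().heartbeat({ toolCalls, lastTool: step.toolName });
    if (step.finished) break;
  }
}
```

With a heartbeat timeout set on the activity options, a worker killed mid-deploy is detected and the activity retries on another worker, which bears on the re-clone question: the retry starts from a fresh workspace unless the worktree is made recoverable.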
30s
For xai, I'm not sure I believe this.
remediation_id
sbom_remediation_id
getExecutionStatus
getSbomPatchGenerationStatus
triggerSbomExecution
triggerSbomPatchGeneration
Sbom is overloaded already!
pr_opened
Comments on a PR can also result in changes_requested, as is often the case.
RETRY
The agent should make a decision and output whether or not the PR feedback is actionable. This won't work for all CI mechanisms; we should limit it to just one, like GitHub, for now.
This is a can of worms; as soon as I started bringing this up, they wanted all of them: Jenkins, self-hosted nonsense, bespoke stuff, etc. We should do it, but maybe one at a time. I think we have the wiring in place to accept comments.
Reusing the existing PR machinery is completely worth doing too, so they all benefit from this!
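A possible shape for that actionability verdict, with all field names hypothetical:

```typescript
// Structured output the agent could emit per PR comment thread, assuming we
// limit ingestion to GitHub for now.
interface FeedbackVerdict {
  actionable: boolean;        // should this feedback trigger a new patch iteration?
  reason: string;             // short justification, surfaced in the UI
  requestedChanges: string[]; // concrete asks extracted from the thread
}
```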
changes_generated
We went from a patch lifecycle to a PR lifecycle. Is this a terminal state after running the remediation, before a PR is opened? There are missing steps above this.
execution_status — per-target: success / failed / skipped
Feedback from a user should be included, and there's no running state; we currently miss this in the current UI.
started, failed, running, etc. are probably good. Then you can encode planning as well.
Terminal states make sense too. planner_running, plan_generated, patch_running, patch_generated, then *_failed states.
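Encoded as a sketch, using the states suggested above; the transition map is illustrative and would need product sign-off:

```typescript
// Sketch of the remediation lifecycle as a state machine.
type RemediationState =
  | 'planner_running'
  | 'planner_failed'
  | 'plan_generated'
  | 'patch_running'
  | 'patch_failed'
  | 'patch_generated'
  | 'pr_opened'
  | 'changes_requested';

const transitions: Record<RemediationState, RemediationState[]> = {
  planner_running: ['plan_generated', 'planner_failed'],
  planner_failed: ['planner_running'],             // retry planning
  plan_generated: ['patch_running'],
  patch_running: ['patch_generated', 'patch_failed'],
  patch_failed: ['patch_running'],                 // retry / user iteration
  patch_generated: ['patch_running', 'pr_opened'], // iterate on the patch, then open the PR
  pr_opened: ['changes_requested'],
  changes_requested: ['patch_running'],            // PR feedback loops back into a patch run
};

function assertTransition(from: RemediationState, to: RemediationState): void {
  if (!transitions[from].includes(to)) {
    throw new Error(`Illegal transition ${from} -> ${to}`);
  }
}
```

Every state is either in-flight or terminal for its phase, which gives the UI unambiguous rendering rules.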
execution_started_at, execution_completed_at
execution_error — error text on failure
We have two executions: the plan and the patch.
execution_workflow_id — Temporal workflow ID
This shouldn't be necessary; you can encode the remediation ID in the workflow ID, and that provides the same thing.
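A sketch of deriving the workflow ID, assuming the Temporal TypeScript client; the ID prefix is illustrative:

```typescript
import { Client } from '@temporalio/client';

async function startExecution(client: Client, remediationId: string) {
  // Deterministic ID: lookups need no extra column, and Temporal rejects a
  // second start with the same ID while the first run is still open, which
  // doubles as the "no active execution" guard.
  return client.workflow.start('SbomRemediationExecutionWorkflow', {
    workflowId: `sbom-remediation-execution-${remediationId}`,
    taskQueue: 'READ_WRITE_TASK_QUEUE',
    args: [remediationId],
  });
}
```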
pr_url, pr_number — direct access
We shouldn't denormalize these; you can open multiple PRs or update a remediation, and that will quickly make this out of sync.
selected_upgrade_option — "minimal" | "moderate" | "comprehensive"
This can be extracted from the remediation report.
Network Allowlist (Egress)
This doesn't cover the client's internal package management; we would need a VPN as well as a network policy per client.
modifies manifests
and fixes any resulting breaking changes
Non-Goals (Deferred)
I would also add actually running package managers to this list.
npm, pip, poetry, cargo all have different mechanics. An agent with shell access adapts without bespoke parsers per ecosystem.
We talked about why we can't do this and why it's not a good idea. We have zero sandboxing functionality, and most of our clients use their own internal package hosting. All package managers have some level of arbitrary code execution; if we're not running the installs and pinning version numbers ourselves, we run the risk of exposure to completely unknown supply chain attacks when this is done completely agentically. I would stick to simple version bumps in the manifests, plus fixing code that needs to be fixed.
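To illustrate the safer alternative, a sketch of a plain manifest edit for the npm case; bumpDependency and its inputs are illustrative, and the chosen version comes from our own resolver, never from running npm:

```typescript
import { readFileSync, writeFileSync } from 'node:fs';

// Rewrite a dependency pin in package.json directly, without invoking npm.
function bumpDependency(manifestPath: string, pkg: string, version: string): void {
  const manifest = JSON.parse(readFileSync(manifestPath, 'utf8'));
  for (const section of ['dependencies', 'devDependencies'] as const) {
    if (manifest[section]?.[pkg] !== undefined) {
      manifest[section][pkg] = version; // exact pin we resolved ourselves
    }
  }
  writeFileSync(manifestPath, JSON.stringify(manifest, null, 2) + '\n');
}
```

Other ecosystems (pip, poetry, cargo) would need equivalent plain-text edits to their manifests, but nothing in this path executes package-manager code.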