fix(orchestrator): retry template fetch after fail#2526
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 26cfa8d. Bugbot is set up for automated code reviews on this repo. Configure here. |
| func (c *Cache) fetchAndEvictOnFailure(ctx context.Context, key string, tmpl *storageTemplate) { | ||
| err := tmpl.Fetch(ctx, c.buildStore) | ||
| if err == nil { | ||
| return | ||
| } | ||
|
|
||
| c.extendMu.Lock() | ||
| defer c.extendMu.Unlock() | ||
|
|
||
| item := c.cache.Get(key) | ||
| if item == nil || item.Value() != tmpl { | ||
| return | ||
| } | ||
|
|
||
| c.cache.Delete(key) | ||
| logger.L().Error(ctx, "template fetch failed, evicted cached template", | ||
| logger.WithBuildID(key), | ||
| zap.Error(err), | ||
| ) | ||
| } |
There was a problem hiding this comment.
🔴 On partial Fetch failure, fetchAndEvictOnFailure calls cache.Delete which synchronously fires the OnEviction handler in NewCache, running template.Close → closeTemplate → snapfile.Close() → os.RemoveAll(f.path) (storage_file.go:51-53). Concurrent peerserver consumers that obtained the same *storageTemplate via GetCachedTemplate (peerserver/resolve.go:39-47, peerserver/file.go:41-66) can race the deletion: they hold the path returned by t.Snapfile() but their os.Open call may land after the eviction unlinked the file. Pre-PR this was effectively impossible (eviction only fired on 25h TTL); post-PR it can fire within milliseconds. Fix by having OnEviction wait for in-use refs (refcount/handle), or skip Close when the entry was deleted due to a Fetch failure where partial state was never safely shareable.
Evict template cache entries when their initial async fetch fails so transient GCS/rootfs errors do not poison cache.