Files
act_runner/internal/app/poll/poller_test.go
Bo-Yi Wu 9aafec169b perf: use single poller with semaphore-based capacity control (#822)
## Background

#819 replaced the shared `rate.Limiter` with per-worker exponential backoff counters to add jitter and adaptive polling. Before #819, the poller used:

```go
limiter := rate.NewLimiter(rate.Every(p.cfg.Runner.FetchInterval), 1)
```

This limiter was **shared across all N polling goroutines with burst=1**, effectively serializing their `FetchTask` calls — so even with `capacity=60`, the runner issued roughly one `FetchTask` per `FetchInterval` total.

#819 replaced this with independent per-worker `consecutiveEmpty` / `consecutiveErrors` counters. Each goroutine now backs off **independently**, which inadvertently removed the cross-worker serialization. With `capacity=N`, the runner now has N goroutines each polling on their own schedule — a regression from the pre-#819 baseline for any runner with `capacity > 1`.

(Thanks to @ChristopherHX for catching this in review.)

## Problem

With the post-#819 code:

- `capacity=N` maintains **N persistent polling goroutines**, each calling `FetchTask` independently
- At idle, N goroutines each wake up and send a `FetchTask` RPC per `FetchInterval`
- At full load, N goroutines **continue polling** even though no slot is available to run a new task — every one of those RPCs is wasted
- The `Shutdown()` timeout branch has a pre-existing bug: the "non-blocking check" is actually a blocking receive, so `shutdownJobs()` is never reached on timeout

## Real-World Impact: 3 Runners × capacity=60

Current production environment: 3 runners each with `capacity=60`.

| Metric | Post-#819 (current) | This PR | Reduction |
|--------|---------------------|---------|-----------|
| Polling goroutines (total) | 3 × 60 = **180** | 3 × 1 = **3** | **98.3%** (177 fewer) |
| FetchTask RPCs per poll cycle (idle) | **180** | **3** | **98.3%** |
| FetchTask RPCs per poll cycle (full load) | **180** (all wasted) | **0** (blocked on semaphore) | **100%** |
| Concurrent connections to Gitea | **180** | **3** | **98.3%** |
| Backoff state objects | 180 (per-worker) | 3 (one per runner) | Simplified |

### Idle scenario

All 180 goroutines wake up every `FetchInterval`, each sending a `FetchTask` RPC that returns empty. Server handles 180 RPCs per cycle for zero useful work. After this PR: **3 RPCs per cycle** — one per runner.

> Note: pre-#819 idle behavior was already ~3 RPCs/cycle due to the shared `rate.Limiter`. This PR restores that property while also addressing the full-load case below.

### Full-load scenario (all 180 slots occupied)

All 180 goroutines **continue polling** even though no slot is available. Every RPC is wasted. After this PR: all 3 pollers are **blocked on the semaphore** — **zero RPCs** until a task completes.

> This is a scenario neither the pre-#819 shared limiter nor the post-#819 per-worker backoff handles — both still issue `FetchTask` RPCs when no slot is free. The semaphore is the only approach of the three that ties polling to available capacity.

## Why Not Just Revert to `rate.Limiter`?

Reverting would restore the serialized behavior but is not the right long-term fix:

- **`rate.Limiter` has no concept of available capacity.** At full load it still hands out tokens and issues `FetchTask` RPCs that can't be acted on. The semaphore blocks polling entirely in that case — zero wasted RPCs.
- **It composes poorly with adaptive backoff from #819.** A shared limiter and per-worker backoff pull in different directions.
- **N goroutines serializing on a shared limiter means N-1 of them exist only to wait in line.** A single poller expresses the same behavior more directly.

The semaphore approach ties polling to capacity explicitly: `acquire slot → fetch → dispatch → release`. That invariant becomes structural rather than emergent from a rate limiter.

## Solution

Replace N polling goroutines with a **single polling loop** that uses a buffered channel as a semaphore to control concurrent task execution:

```go
// New: poller.go Poll()
sem := make(chan struct{}, p.cfg.Runner.Capacity)
for {
    select {
    case sem <- struct{}{}:       // Acquire slot (blocks at capacity)
    case <-p.pollingCtx.Done():
        return
    }
    task, ok := p.fetchTask(...)  // Single FetchTask RPC
    if !ok {
        <-sem                     // Release slot on empty response
        // backoff...
        continue
    }
    go func(t *runnerv1.Task) {   // Dispatch task
        defer func() { <-sem }()  // Release slot when done
        p.runTaskWithRecover(p.jobsCtx, t)
    }(task)
}
```

The exponential backoff and jitter from #819 are preserved — just driven by a single `workerState` instead of N per-worker states.

## Shutdown Bug Fix

Fixed a pre-existing bug in `Shutdown()` where the timeout branch could never force-cancel running jobs:

```go
// Before (BROKEN): blocking receive, shutdownJobs() never reached
_, ok := <-p.done   // blocks until p.done is closed
if !ok { return nil }
p.shutdownJobs()    // dead code when jobs are still running

// After (FIXED): proper non-blocking check
select {
case <-p.done:
    return nil
default:
}
p.shutdownJobs()    // now correctly reached on timeout
```

## Code Changes

| Area | Detail |
|------|--------|
| `Poller.runner` | `*run.Runner` → `TaskRunner` interface (enables mock-based testing) |
| `Poll()` | N goroutines → single loop with buffered-channel semaphore |
| `PollOnce()` | Inlined from removed `pollOnce()` |
| `waitBackoff()` | New helper, eliminates duplicated backoff logic |
| `resetBackoff()` | New method on `workerState`, also resets stale `lastBackoff` metric |
| `Shutdown()` | Fixed blocking receive → proper non-blocking select |
| Removed | `poll()`, `pollOnce()` private methods (-2 methods, -42 lines) |

## Test Coverage

Added `TestPoller_ConcurrencyLimitedByCapacity` which verifies:

- With `capacity=3`, at most 3 tasks execute concurrently (`maxConcurrent <= 3`)
- Tasks actually overlap in execution (`maxConcurrent >= 2`)
- `FetchTask` is never called concurrently — confirms single poller (`maxFetchConcur == 1`)
- All 6 tasks complete successfully (`totalCompleted == 6`)
- Mock runner respects context cancellation, enabling shutdown path verification

```
=== RUN   TestPoller_ConcurrencyLimitedByCapacity
--- PASS: TestPoller_ConcurrencyLimitedByCapacity (0.10s)
PASS
ok  	gitea.com/gitea/act_runner/internal/app/poll	0.59s
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://gitea.com/gitea/act_runner/pulls/822
Reviewed-by: silverwind <2021+silverwind@noreply.gitea.com>
Co-authored-by: Bo-Yi Wu <appleboy.tw@gmail.com>
Co-committed-by: Bo-Yi Wu <appleboy.tw@gmail.com>
2026-04-19 08:10:23 +00:00

261 lines
8.4 KiB
Go

// Copyright 2026 The Gitea Authors. All rights reserved.
// SPDX-License-Identifier: MIT
package poll
import (
"context"
"errors"
"sync"
"sync/atomic"
"testing"
"time"
"gitea.com/gitea/act_runner/internal/pkg/client/mocks"
"gitea.com/gitea/act_runner/internal/pkg/config"
runnerv1 "code.gitea.io/actions-proto-go/runner/v1"
connect_go "connectrpc.com/connect"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/mock"
"github.com/stretchr/testify/require"
)
// TestPoller_WorkerStateCounters verifies that workerState correctly tracks
// consecutive empty responses independently per state instance, and that
// fetchTask increments only the relevant counter.
func TestPoller_WorkerStateCounters(t *testing.T) {
client := mocks.NewClient(t)
client.On("FetchTask", mock.Anything, mock.Anything).Return(
func(_ context.Context, _ *connect_go.Request[runnerv1.FetchTaskRequest]) (*connect_go.Response[runnerv1.FetchTaskResponse], error) {
// Always return an empty response.
return connect_go.NewResponse(&runnerv1.FetchTaskResponse{}), nil
},
)
cfg, err := config.LoadDefault("")
require.NoError(t, err)
p := &Poller{client: client, cfg: cfg}
ctx := context.Background()
s1 := &workerState{}
s2 := &workerState{}
// Each worker independently observes one empty response.
_, ok := p.fetchTask(ctx, s1)
require.False(t, ok)
_, ok = p.fetchTask(ctx, s2)
require.False(t, ok)
assert.Equal(t, int64(1), s1.consecutiveEmpty, "worker 1 should only count its own empty response")
assert.Equal(t, int64(1), s2.consecutiveEmpty, "worker 2 should only count its own empty response")
// Worker 1 sees a second empty; worker 2 stays at 1.
_, ok = p.fetchTask(ctx, s1)
require.False(t, ok)
assert.Equal(t, int64(2), s1.consecutiveEmpty)
assert.Equal(t, int64(1), s2.consecutiveEmpty, "worker 2's counter must not be affected by worker 1's empty fetches")
}
// TestPoller_FetchErrorIncrementsErrorsOnly verifies that a fetch error
// increments only the per-worker error counter, not the empty counter.
func TestPoller_FetchErrorIncrementsErrorsOnly(t *testing.T) {
client := mocks.NewClient(t)
client.On("FetchTask", mock.Anything, mock.Anything).Return(
func(_ context.Context, _ *connect_go.Request[runnerv1.FetchTaskRequest]) (*connect_go.Response[runnerv1.FetchTaskResponse], error) {
return nil, errors.New("network unreachable")
},
)
cfg, err := config.LoadDefault("")
require.NoError(t, err)
p := &Poller{client: client, cfg: cfg}
s := &workerState{}
_, ok := p.fetchTask(context.Background(), s)
require.False(t, ok)
assert.Equal(t, int64(1), s.consecutiveErrors)
assert.Equal(t, int64(0), s.consecutiveEmpty)
}
// TestPoller_CalculateInterval verifies the exponential backoff math is
// correctly driven by the workerState counters.
func TestPoller_CalculateInterval(t *testing.T) {
cfg, err := config.LoadDefault("")
require.NoError(t, err)
cfg.Runner.FetchInterval = 2 * time.Second
cfg.Runner.FetchIntervalMax = 60 * time.Second
p := &Poller{cfg: cfg}
cases := []struct {
name string
empty, errs int64
wantInterval time.Duration
}{
{"first poll, no backoff", 0, 0, 2 * time.Second},
{"single empty, still base", 1, 0, 2 * time.Second},
{"two empties, doubled", 2, 0, 4 * time.Second},
{"five empties, capped path", 5, 0, 32 * time.Second},
{"many empties, capped at max", 20, 0, 60 * time.Second},
{"errors drive backoff too", 0, 3, 8 * time.Second},
{"max(empty, errors) wins", 2, 4, 16 * time.Second},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
s := &workerState{consecutiveEmpty: tc.empty, consecutiveErrors: tc.errs}
assert.Equal(t, tc.wantInterval, p.calculateInterval(s))
})
}
}
// atomicMax atomically updates target to max(target, val).
func atomicMax(target *atomic.Int64, val int64) {
for {
old := target.Load()
if val <= old || target.CompareAndSwap(old, val) {
break
}
}
}
type mockRunner struct {
delay time.Duration
running atomic.Int64
maxConcurrent atomic.Int64
totalCompleted atomic.Int64
}
func (m *mockRunner) Run(ctx context.Context, _ *runnerv1.Task) error {
atomicMax(&m.maxConcurrent, m.running.Add(1))
select {
case <-time.After(m.delay):
case <-ctx.Done():
}
m.running.Add(-1)
m.totalCompleted.Add(1)
return nil
}
// TestPoller_ConcurrencyLimitedByCapacity verifies that with capacity=3 and
// 6 available tasks, at most 3 tasks run concurrently, and FetchTask is
// never called concurrently (single poller).
func TestPoller_ConcurrencyLimitedByCapacity(t *testing.T) {
const (
capacity = 3
totalTasks = 6
taskDelay = 50 * time.Millisecond
)
var (
tasksReturned atomic.Int64
fetchConcur atomic.Int64
maxFetchConcur atomic.Int64
)
cli := mocks.NewClient(t)
cli.On("FetchTask", mock.Anything, mock.Anything).Return(
func(_ context.Context, _ *connect_go.Request[runnerv1.FetchTaskRequest]) (*connect_go.Response[runnerv1.FetchTaskResponse], error) {
atomicMax(&maxFetchConcur, fetchConcur.Add(1))
defer fetchConcur.Add(-1)
n := tasksReturned.Add(1)
if n <= totalTasks {
return connect_go.NewResponse(&runnerv1.FetchTaskResponse{
Task: &runnerv1.Task{Id: n},
}), nil
}
return connect_go.NewResponse(&runnerv1.FetchTaskResponse{}), nil
},
)
runner := &mockRunner{delay: taskDelay}
cfg, err := config.LoadDefault("")
require.NoError(t, err)
cfg.Runner.Capacity = capacity
cfg.Runner.FetchInterval = 10 * time.Millisecond
cfg.Runner.FetchIntervalMax = 10 * time.Millisecond
poller := New(cfg, cli, runner)
var wg sync.WaitGroup
wg.Go(poller.Poll)
require.Eventually(t, func() bool {
return runner.totalCompleted.Load() >= totalTasks
}, 2*time.Second, 10*time.Millisecond, "all tasks should complete")
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
err = poller.Shutdown(ctx)
require.NoError(t, err)
wg.Wait()
assert.LessOrEqual(t, runner.maxConcurrent.Load(), int64(capacity),
"concurrent running tasks must not exceed capacity")
assert.GreaterOrEqual(t, runner.maxConcurrent.Load(), int64(2),
"with 6 tasks and capacity 3, at least 2 should overlap")
assert.Equal(t, int64(1), maxFetchConcur.Load(),
"FetchTask must never be called concurrently (single poller)")
assert.Equal(t, int64(totalTasks), runner.totalCompleted.Load(),
"all tasks should have been executed")
}
// TestPoller_ShutdownForcesJobsOnTimeout locks in the fix for a
// pre-existing bug where Shutdown's timeout branch used a blocking
// `<-p.done` receive, leaving p.shutdownJobs() unreachable. With a
// task parked on jobsCtx and a Shutdown deadline shorter than the
// task's natural completion, Shutdown must force-cancel via
// shutdownJobs() and return ctx.Err() promptly — not block until the
// task would have finished on its own.
func TestPoller_ShutdownForcesJobsOnTimeout(t *testing.T) {
var served atomic.Bool
cli := mocks.NewClient(t)
cli.On("FetchTask", mock.Anything, mock.Anything).Return(
func(_ context.Context, _ *connect_go.Request[runnerv1.FetchTaskRequest]) (*connect_go.Response[runnerv1.FetchTaskResponse], error) {
if served.CompareAndSwap(false, true) {
return connect_go.NewResponse(&runnerv1.FetchTaskResponse{
Task: &runnerv1.Task{Id: 1},
}), nil
}
return connect_go.NewResponse(&runnerv1.FetchTaskResponse{}), nil
},
)
// delay >> Shutdown timeout: Run only returns when jobsCtx is
// cancelled by shutdownJobs().
runner := &mockRunner{delay: 30 * time.Second}
cfg, err := config.LoadDefault("")
require.NoError(t, err)
cfg.Runner.Capacity = 1
cfg.Runner.FetchInterval = 10 * time.Millisecond
cfg.Runner.FetchIntervalMax = 10 * time.Millisecond
poller := New(cfg, cli, runner)
var wg sync.WaitGroup
wg.Go(poller.Poll)
require.Eventually(t, func() bool {
return runner.running.Load() == 1
}, time.Second, 10*time.Millisecond, "task should start running")
ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()
start := time.Now()
err = poller.Shutdown(ctx)
elapsed := time.Since(start)
require.ErrorIs(t, err, context.DeadlineExceeded)
// With the fix, Shutdown returns shortly after the deadline once
// the forced job unwinds. Without the fix, the blocking <-p.done
// would hang for the full 30s mockRunner delay.
assert.Less(t, elapsed, 5*time.Second,
"Shutdown must not block on the parked task; shutdownJobs() must run on timeout")
wg.Wait()
assert.Equal(t, int64(1), runner.totalCompleted.Load(),
"the parked task must be cancelled and unwound")
}