mirror of https://gitea.com/gitea/act_runner.git synced 2026-04-24 04:40:22 +08:00

Go to file

Bo-Yi Wu f33e5a6245 feat: add Prometheus metrics endpoint for runner observability (#820 )

## What

Add an optional Prometheus `/metrics` HTTP endpoint to `act_runner` so operators can observe runner health, polling behavior, job outcomes, and RPC latency without scraping logs.

New surface:

- `internal/pkg/metrics/metrics.go` — metric definitions, custom `Registry`, static Go/process collectors, label constants, `ResultToStatusLabel` helper.
- `internal/pkg/metrics/server.go` — hardened `http.Server` serving `/metrics` and `/healthz` with Slowloris-safe timeouts (`ReadHeaderTimeout` 5s, `ReadTimeout`/`WriteTimeout` 10s, `IdleTimeout` 60s) and a 5s graceful shutdown.
- `daemon.go` wires it up behind `cfg.Metrics.Enabled` (disabled by default).
- `poller.go` / `reporter.go` / `runner.go` instrument their existing hot paths with counters/histograms/gauges — no behavior change.

Metrics exported (namespace `act_runner_`):

| Subsystem | Metric | Type | Labels |
|---|---|---|---|
| — | `info` | Gauge | `version`, `name` |
| — | `capacity`, `uptime_seconds` | Gauge | — |
| `poll` | `fetch_total`, `client_errors_total` | Counter | `result` / `method` |
| `poll` | `fetch_duration_seconds`, `backoff_seconds` | Histogram / Gauge | — |
| `job` | `total` | Counter | `status` |
| `job` | `duration_seconds`, `running`, `capacity_utilization_ratio` | Histogram / GaugeFunc | — |
| `report` | `log_total`, `state_total` | Counter | `result` |
| `report` | `log_duration_seconds`, `state_duration_seconds` | Histogram | — |
| `report` | `log_buffer_rows` | Gauge | — |
| — | `go_*`, `process_*` | standard collectors | — |

All label values are predefined constants — **no high-cardinality labels** (no task IDs, repo URLs, branches, tokens, or secrets) so scraping is safe and bounded.

## Why

Teams self-hosting Gitea + `act_runner` at scale need to answer basic SRE questions that are currently invisible:

- How often are RPCs failing? Which RPC? (`act_runner_client_errors_total`)
- Are runners saturated? (`act_runner_job_capacity_utilization_ratio`, `act_runner_job_running`)
- How long do jobs take? (`act_runner_job_duration_seconds`)
- Is polling backing off? (`act_runner_poll_backoff_seconds`, `act_runner_poll_fetch_total{result=\"error\"}`)
- Are log/state reports slow? (`act_runner_report_{log,state}_duration_seconds`)
- Is the log buffer draining? (`act_runner_report_log_buffer_rows`)

Today operators have to grep logs. This PR makes all of the above first-class metrics so they can feed dashboards and alerts (`rate(act_runner_client_errors_total[5m]) > 0.1`, capacity saturation alerts, etc.).

The endpoint is **disabled by default** and binds to `127.0.0.1:9101` when enabled, so it's opt-in and safe for existing deployments.

## How

### Config

```yaml
metrics:
  enabled: false           # opt-in
  addr: 127.0.0.1:9101     # change to 0.0.0.0:9101 only behind a reverse proxy
```

`config.example.yaml` documents both fields plus a security note about binding externally without auth.

### Wiring

1. `daemon.go` calls `metrics.Init()` (guarded by `sync.Once`), sets `act_runner_info`, `act_runner_capacity`, registers uptime + running-jobs GaugeFuncs, then starts the server goroutine with the daemon context — it shuts down cleanly on `ctx.Done()`.
2. `poller.fetchTask` observes RPC latency / result / error counters. `DeadlineExceeded` (long-poll idle) is treated as an empty result and **not** observed into the histogram so the 5s timeout doesn't swamp the buckets.
3. `poller.pollOnce` reports `poll_backoff_seconds` using the pre-jitter base interval (the true backoff level), and only when it changes — prevents noisy no-op gauge updates at the `FetchIntervalMax` plateau.
4. `reporter.ReportLog` / `ReportState` record duration histograms and success/error counters; `log_buffer_rows` is updated only when the value changes, guarded by the already-held `clientM`.
5. `runner.Run` observes `job_duration_seconds` and increments `job_total` by outcome via `metrics.ResultToStatusLabel`.

### Safety / security review

- All timeouts set; Slowloris-safe.
- Custom `prometheus.NewRegistry()` — no global registration side-effects.
- No sensitive data in labels (reviewed every instrumentation site).
- Single new dependency: `github.com/prometheus/client_golang v1.23.2`.
- Endpoint is unauthenticated by design and documented as such; default localhost bind mitigates exposure. Operators exposing externally should front it with a reverse proxy.

## Verification

### Unit tests

\`\`\`bash
go build ./...
go vet ./...
go test ./...
\`\`\`

### Manual smoke test

1. Enable metrics in `config.yaml`:
   \`\`\`yaml
   metrics:
     enabled: true
     addr: 127.0.0.1:9101
   \`\`\`
2. Start the runner against a Gitea instance: \`./act_runner daemon\`.
3. Scrape the endpoint:
   \`\`\`bash
   curl -s http://127.0.0.1:9101/metrics | grep '^act_runner_'
   curl -s http://127.0.0.1:9101/healthz   # → ok
   \`\`\`
4. Confirm the static series appear immediately: \`act_runner_info\`, \`act_runner_capacity\`, \`act_runner_uptime_seconds\`, \`act_runner_job_running\`, \`act_runner_job_capacity_utilization_ratio\`.
5. Trigger a workflow and confirm counters increment: \`act_runner_poll_fetch_total{result=\"task\"}\`, \`act_runner_job_total{status=\"success\"}\`, \`act_runner_report_log_total{result=\"success\"}\`.
6. Leave the runner idle and confirm \`act_runner_poll_backoff_seconds\` settles (and does **not** churn on every poll).
7. Ctrl-C and confirm a clean \"metrics server shutdown\" log line (no port-in-use error on restart within 5s).

### Prometheus integration

Add to \`prometheus.yml\`:

\`\`\`yaml
scrape_configs:
  - job_name: act_runner
    static_configs:
      - targets: ['127.0.0.1:9101']
\`\`\`

Sample alert to try:

\`\`\`
sum(rate(act_runner_client_errors_total[5m])) by (method) > 0.1
\`\`\`

## Out of scope (follow-ups)

- TLS and auth on the metrics endpoint (mitigated today by localhost default; add when operators need external scraping).
- Per-task labels (intentionally avoided for cardinality safety).

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://gitea.com/gitea/act_runner/pulls/820
Reviewed-by: Lunny Xiao <xiaolunwen@gmail.com>
Co-authored-by: Bo-Yi Wu <appleboy.tw@gmail.com>
Co-committed-by: Bo-Yi Wu <appleboy.tw@gmail.com>

2026-04-15 01:27:34 +00:00

.gitea/workflows

Semver tags for Docker images (#720 )

2026-02-25 19:09:53 +00:00

examples

examples: improve DIND rootless network performance (#786 )

2026-02-16 07:56:45 +00:00

internal

feat: add Prometheus metrics endpoint for runner observability (#820 )

2026-04-15 01:27:34 +00:00

scripts

…

.editorconfig

…

.gitattributes

…

.gitignore

…

.golangci.yml

chore(lint): add golangci-lint v2 and fix all lint issues (#803 )

2026-02-22 17:35:08 +00:00

.goreleaser.checksum.sh

…

.goreleaser.yaml

…

build.go

…

Dockerfile

chore(deps): bump Go version to 1.26 in Dockerfile and Makefile (#788 )

2026-02-14 10:32:16 +00:00

go.mod

feat: add Prometheus metrics endpoint for runner observability (#820 )

2026-04-15 01:27:34 +00:00

go.sum

feat: add Prometheus metrics endpoint for runner observability (#820 )

2026-04-15 01:27:34 +00:00

LICENSE

…

main.go

…

Makefile

chore(lint): add golangci-lint v2 and fix all lint issues (#803 )

2026-02-22 17:35:08 +00:00

README.md

…

renovate.json5

…

README.md

act runner

Act runner is a runner for Gitea based on Gitea fork of act.

Installation

Prerequisites

Docker Engine Community version is required for docker mode. To install Docker CE, follow the official install instructions.

Download pre-built binary

Visit here and download the right version for your platform.

Build from source

make build

Build a docker image

make docker

Quickstart

Actions are disabled by default, so you need to add the following to the configuration file of your Gitea instance to enable it:

[actions]
ENABLED=true

Register

./act_runner register

And you will be asked to input:

Gitea instance URL, like http://192.168.8.8:3000/. You should use your gitea instance ROOT_URL as the instance argument and you should not use localhost or 127.0.0.1 as instance IP;
Runner token, you can get it from http://192.168.8.8:3000/admin/actions/runners;
Runner name, you can just leave it blank;
Runner labels, you can just leave it blank.

The process looks like:

INFO Registering runner, arch=amd64, os=darwin, version=0.1.5.
WARN Runner in user-mode.
INFO Enter the Gitea instance URL (for example, https://gitea.com/):
http://192.168.8.8:3000/
INFO Enter the runner token:
fe884e8027dc292970d4e0303fe82b14xxxxxxxx
INFO Enter the runner name (if set empty, use hostname: Test.local):

INFO Enter the runner labels, leave blank to use the default labels (comma-separated, for example, ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest):

INFO Registering runner, name=Test.local, instance=http://192.168.8.8:3000/, labels=[ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest ubuntu-22.04:docker://docker.gitea.com/runner-images:ubuntu-22.04 ubuntu-20.04:docker://docker.gitea.com/runner-images:ubuntu-20.04].
DEBU Successfully pinged the Gitea instance server
INFO Runner registered successfully.

You can also register with command line arguments.

./act_runner register --instance http://192.168.8.8:3000 --token <my_runner_token> --no-interactive

If the registry succeed, it will run immediately. Next time, you could run the runner directly.

Run

./act_runner daemon

Run with docker

docker run -e GITEA_INSTANCE_URL=https://your_gitea.com -e GITEA_RUNNER_REGISTRATION_TOKEN=<your_token> -v /var/run/docker.sock:/var/run/docker.sock --name my_runner gitea/act_runner:nightly

Configuration

You can also configure the runner with a configuration file. The configuration file is a YAML file, you can generate a sample configuration file with ./act_runner generate-config.

./act_runner generate-config > config.yaml

You can specify the configuration file path with -c/--config argument.

./act_runner -c config.yaml register # register with config file
./act_runner -c config.yaml daemon # run with config file

You can read the latest version of the configuration file online at config.example.yaml.

Example Deployments

Check out the examples directory for sample deployment types.

Languages

Go 83.4%

JavaScript 15.7%

Makefile 0.6%

Shell 0.2%

Dockerfile 0.1%