Mirror of https://gitea.com/gitea/act_runner.git, synced 2026-04-24 04:40:22 +08:00
## What
Add an optional Prometheus `/metrics` HTTP endpoint to `act_runner` so operators can observe runner health, polling behavior, job outcomes, and RPC latency without scraping logs.
New surface:
- `internal/pkg/metrics/metrics.go` — metric definitions, custom `Registry`, static Go/process collectors, label constants, `ResultToStatusLabel` helper.
- `internal/pkg/metrics/server.go` — hardened `http.Server` serving `/metrics` and `/healthz` with Slowloris-safe timeouts (`ReadHeaderTimeout` 5s, `ReadTimeout`/`WriteTimeout` 10s, `IdleTimeout` 60s) and a 5s graceful shutdown.
- `daemon.go` wires it up behind `cfg.Metrics.Enabled` (disabled by default).
- `poller.go` / `reporter.go` / `runner.go` instrument their existing hot paths with counters/histograms/gauges — no behavior change.
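The server described above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the contents of `server.go`: the names `newMetricsServer`, `serve`, and `placeholderHandler` are hypothetical, and the placeholder handler stands in for whatever handler the real code builds from its custom registry. Only the timeout values and the two paths come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// Timeout values quoted in the description above.
const (
	readHeaderTimeout = 5 * time.Second
	readTimeout       = 10 * time.Second
	writeTimeout      = 10 * time.Second
	idleTimeout       = 60 * time.Second
	shutdownTimeout   = 5 * time.Second
)

// placeholderHandler stands in for the real Prometheus handler (assumption).
func placeholderHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# placeholder exposition")
	})
}

// newMetricsServer is a hypothetical constructor that serves /metrics and
// /healthz with every timeout set, so slow clients cannot hold connections open.
func newMetricsServer(addr string, metricsHandler http.Handler) *http.Server {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metricsHandler)
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
	return &http.Server{
		Addr:              addr,
		Handler:           mux,
		ReadHeaderTimeout: readHeaderTimeout,
		ReadTimeout:       readTimeout,
		WriteTimeout:      writeTimeout,
		IdleTimeout:       idleTimeout,
	}
}

// serve runs srv until ctx is cancelled, then shuts it down gracefully.
func serve(ctx context.Context, srv *http.Server) error {
	ln, err := net.Listen("tcp", srv.Addr)
	if err != nil {
		return err
	}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server stopped on its own
	case <-ctx.Done():
	}

	shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}
	// Serve returns http.ErrServerClosed after a successful Shutdown.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

func main() {
	srv := newMetricsServer("127.0.0.1:0", placeholderHandler())
	// Stand-in for the daemon context: cancel shortly after start.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	if err := serve(ctx, srv); err != nil {
		fmt.Println("metrics server error:", err)
		return
	}
	fmt.Println("metrics server shutdown")
}
```

`ReadHeaderTimeout` is the setting that bounds how long a client may take to send request headers, which is the usual Slowloris defence; the other timeouts bound the rest of the connection lifecycle.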
Metrics exported (namespace `act_runner_`):
| Subsystem | Metric | Type | Labels |
|---|---|---|---|
| — | `info` | Gauge | `version`, `name` |
| — | `capacity`, `uptime_seconds` | Gauge | — |
| `poll` | `fetch_total` | Counter | `result` |
| `client` | `errors_total` | Counter | `method` |
| `poll` | `fetch_duration_seconds`, `backoff_seconds` | Histogram / Gauge | — |
| `job` | `total` | Counter | `status` |
| `job` | `duration_seconds`, `running`, `capacity_utilization_ratio` | Histogram / GaugeFunc | — |
| `report` | `log_total`, `state_total` | Counter | `result` |
| `report` | `log_duration_seconds`, `state_duration_seconds` | Histogram | — |
| `report` | `log_buffer_rows` | Gauge | — |
| — | `go_*`, `process_*` | standard collectors | — |
All label values are predefined constants — **no high-cardinality labels** (no task IDs, repo URLs, branches, tokens, or secrets) so scraping is safe and bounded.
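One way to keep label values bounded is to map every outcome onto a fixed set of constants and fall back to a catch-all value. The sketch below illustrates that idea only: `jobResult` is a hypothetical stand-in for the runner's real result type, and every label value other than `success` is an assumption, since the description confirms only `status="success"`.

```go
package main

import "fmt"

// jobResult is a hypothetical stand-in for the runner's real result enum.
type jobResult int

const (
	resultUnspecified jobResult = iota
	resultSuccess
	resultFailure
	resultCancelled
	resultSkipped
)

// Predefined label values. Only "success" appears in the description;
// the remaining values are assumptions made for this sketch.
const (
	labelStatusSuccess   = "success"
	labelStatusFailure   = "failure"
	labelStatusCancelled = "cancelled"
	labelStatusSkipped   = "skipped"
	labelStatusUnknown   = "unknown"
)

// resultToStatusLabel maps a result onto one of a fixed set of strings, so the
// `status` label can never take an unbounded number of values.
func resultToStatusLabel(r jobResult) string {
	switch r {
	case resultSuccess:
		return labelStatusSuccess
	case resultFailure:
		return labelStatusFailure
	case resultCancelled:
		return labelStatusCancelled
	case resultSkipped:
		return labelStatusSkipped
	default:
		return labelStatusUnknown
	}
}

func main() {
	for _, r := range []jobResult{resultSuccess, resultFailure, jobResult(99)} {
		fmt.Println(resultToStatusLabel(r))
	}
}
```

Because unrecognized inputs collapse into a single fallback value, adding a new result upstream cannot create a new time series by accident.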
## Why
Teams self-hosting Gitea + `act_runner` at scale need to answer basic SRE questions that are currently invisible:
- How often are RPCs failing? Which RPC? (`act_runner_client_errors_total`)
- Are runners saturated? (`act_runner_job_capacity_utilization_ratio`, `act_runner_job_running`)
- How long do jobs take? (`act_runner_job_duration_seconds`)
- Is polling backing off? (`act_runner_poll_backoff_seconds`, `act_runner_poll_fetch_total{result="error"}`)
- Are log/state reports slow? (`act_runner_report_{log,state}_duration_seconds`)
- Is the log buffer draining? (`act_runner_report_log_buffer_rows`)
Today operators have to grep logs. This PR makes all of the above first-class metrics so they can feed dashboards and alerts (`rate(act_runner_client_errors_total[5m]) > 0.1`, capacity saturation alerts, etc.).
The endpoint is **disabled by default** and binds to `127.0.0.1:9101` when enabled, so it's opt-in and safe for existing deployments.
## How
### Config
```yaml
metrics:
  enabled: false        # opt-in
  addr: 127.0.0.1:9101  # change to 0.0.0.0:9101 only behind a reverse proxy
```
`config.example.yaml` documents both fields plus a security note about binding externally without auth.
### Wiring
1. `daemon.go` calls `metrics.Init()` (guarded by `sync.Once`), sets `act_runner_info`, `act_runner_capacity`, registers uptime + running-jobs GaugeFuncs, then starts the server goroutine with the daemon context — it shuts down cleanly on `ctx.Done()`.
2. `poller.fetchTask` observes RPC latency / result / error counters. `DeadlineExceeded` (long-poll idle) is treated as an empty result and **not** observed into the histogram so the 5s timeout doesn't swamp the buckets.
3. `poller.pollOnce` reports `poll_backoff_seconds` using the pre-jitter base interval (the true backoff level), and only when it changes — prevents noisy no-op gauge updates at the `FetchIntervalMax` plateau.
4. `reporter.ReportLog` / `ReportState` record duration histograms and success/error counters; `log_buffer_rows` is updated only when the value changes, guarded by the already-held `clientM`.
5. `runner.Run` observes `job_duration_seconds` and increments `job_total` by outcome via `metrics.ResultToStatusLabel`.
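Steps 2 and 3 above can be sketched without any Prometheus dependency. This is an illustration under assumptions rather than the code in `poller.go`: `classifyFetch`, `fetchOutcome`, `backoffGauge`, and `demoIdleErr` are hypothetical names, the `empty` label value is assumed (the description confirms only `task` and `error`), and whether failed calls are observed into the histogram is also an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchOutcome describes how one fetch call should be recorded.
type fetchOutcome struct {
	result         string // value for the `result` label
	observeLatency bool   // whether the duration goes into the histogram
	countError     bool   // whether the client error counter is incremented
}

// classifyFetch treats a long-poll timeout as an empty result and keeps it out
// of the latency histogram, so idle timeouts do not dominate the buckets.
func classifyFetch(err error, gotTask bool) fetchOutcome {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return fetchOutcome{result: "empty"}
	case err != nil:
		return fetchOutcome{result: "error", observeLatency: true, countError: true}
	case gotTask:
		return fetchOutcome{result: "task", observeLatency: true}
	default:
		return fetchOutcome{result: "empty", observeLatency: true}
	}
}

// backoffGauge forwards a value to set only when it differs from the last one.
// set stands in for a gauge's Set method.
type backoffGauge struct {
	last time.Duration
	set  func(seconds float64)
}

// update reports the pre-jitter base interval and returns true if it wrote.
func (g *backoffGauge) update(base time.Duration) bool {
	if base == g.last {
		return false
	}
	g.last = base
	g.set(base.Seconds())
	return true
}

// demoIdleErr simulates a wrapped long-poll timeout.
func demoIdleErr() error {
	return fmt.Errorf("fetch task: %w", context.DeadlineExceeded)
}

func main() {
	fmt.Printf("%+v\n", classifyFetch(demoIdleErr(), false))
	fmt.Printf("%+v\n", classifyFetch(errors.New("connection refused"), false))
	fmt.Printf("%+v\n", classifyFetch(nil, true))

	writes := 0
	g := &backoffGauge{set: func(float64) { writes++ }}
	// The interval doubles, then sits at a 60s plateau for two more polls.
	for _, d := range []time.Duration{2 * time.Second, 4 * time.Second, 60 * time.Second, 60 * time.Second, 60 * time.Second} {
		g.update(d)
	}
	fmt.Println("gauge writes:", writes)
}
```

Comparing the base interval rather than the jittered sleep is what makes the change check meaningful: the jittered value differs on nearly every poll, while the base only moves when the backoff level does.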
### Safety / security review
- All timeouts set; Slowloris-safe.
- Custom `prometheus.NewRegistry()` — no global registration side-effects.
- No sensitive data in labels (reviewed every instrumentation site).
- Single new dependency: `github.com/prometheus/client_golang v1.23.2`.
- Endpoint is unauthenticated by design and documented as such; default localhost bind mitigates exposure. Operators exposing externally should front it with a reverse proxy.
## Verification
### Unit tests
```bash
go build ./...
go vet ./...
go test ./...
```
### Manual smoke test
1. Enable metrics in `config.yaml`:
   ```yaml
   metrics:
     enabled: true
     addr: 127.0.0.1:9101
   ```
2. Start the runner against a Gitea instance: `./act_runner daemon`.
3. Scrape the endpoint:
   ```bash
   curl -s http://127.0.0.1:9101/metrics | grep '^act_runner_'
   curl -s http://127.0.0.1:9101/healthz   # → ok
   ```
4. Confirm the static series appear immediately: `act_runner_info`, `act_runner_capacity`, `act_runner_uptime_seconds`, `act_runner_job_running`, `act_runner_job_capacity_utilization_ratio`.
5. Trigger a workflow and confirm counters increment: `act_runner_poll_fetch_total{result="task"}`, `act_runner_job_total{status="success"}`, `act_runner_report_log_total{result="success"}`.
6. Leave the runner idle and confirm `act_runner_poll_backoff_seconds` settles (and does **not** churn on every poll).
7. Ctrl-C and confirm a clean "metrics server shutdown" log line (no port-in-use error on restart within 5s).
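The scrape in step 3 can also be scripted. The following is a small sketch, assuming the default `127.0.0.1:9101` address from the config above; `filterSeries` is a hypothetical helper that mirrors the `grep '^act_runner_'` filter and is not part of the PR.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// filterSeries returns the lines of a text exposition that start with prefix.
// "# HELP" and "# TYPE" comment lines start with '#', so they are skipped.
func filterSeries(exposition, prefix string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, prefix) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://127.0.0.1:9101/metrics")
	if err != nil {
		fmt.Println("scrape failed (is the runner running with metrics enabled?):", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, line := range filterSeries(string(body), "act_runner_") {
		fmt.Println(line)
	}
}
```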
### Prometheus integration
Add to `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: act_runner
    static_configs:
      - targets: ['127.0.0.1:9101']
```
Sample alert to try:
```
sum(rate(act_runner_client_errors_total[5m])) by (method) > 0.1
```
## Out of scope (follow-ups)
- TLS and auth on the metrics endpoint (mitigated today by localhost default; add when operators need external scraping).
- Per-task labels (intentionally avoided for cardinality safety).
---
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://gitea.com/gitea/act_runner/pulls/820
Reviewed-by: Lunny Xiao <xiaolunwen@gmail.com>
Co-authored-by: Bo-Yi Wu <appleboy.tw@gmail.com>
Co-committed-by: Bo-Yi Wu <appleboy.tw@gmail.com>
197 lines
9.8 KiB
Go
// Copyright 2022 The Gitea Authors. All rights reserved.
// SPDX-License-Identifier: MIT

package config

import (
	"fmt"
	"maps"
	"os"
	"path/filepath"
	"time"

	"github.com/joho/godotenv"
	log "github.com/sirupsen/logrus"
	"gopkg.in/yaml.v3"
)

// Log represents the configuration for logging.
type Log struct {
	Level string `yaml:"level"` // Level indicates the logging level.
}

// Runner represents the configuration for the runner.
type Runner struct {
	File                string            `yaml:"file"`                   // File specifies the file path for the runner.
	Capacity            int               `yaml:"capacity"`               // Capacity specifies the capacity of the runner.
	Envs                map[string]string `yaml:"envs"`                   // Envs stores environment variables for the runner.
	EnvFile             string            `yaml:"env_file"`               // EnvFile specifies the path to the file containing environment variables for the runner.
	Timeout             time.Duration     `yaml:"timeout"`                // Timeout specifies the duration for runner timeout.
	ShutdownTimeout     time.Duration     `yaml:"shutdown_timeout"`       // ShutdownTimeout specifies the duration to wait for running jobs to complete during a shutdown of the runner.
	Insecure            bool              `yaml:"insecure"`               // Insecure indicates whether the runner operates in an insecure mode.
	FetchTimeout        time.Duration     `yaml:"fetch_timeout"`          // FetchTimeout specifies the timeout duration for fetching resources.
	FetchInterval       time.Duration     `yaml:"fetch_interval"`         // FetchInterval specifies the interval duration for fetching resources.
	FetchIntervalMax    time.Duration     `yaml:"fetch_interval_max"`     // FetchIntervalMax specifies the maximum backoff interval when idle.
	LogReportInterval   time.Duration     `yaml:"log_report_interval"`    // LogReportInterval specifies the base interval for periodic log flush.
	LogReportMaxLatency time.Duration     `yaml:"log_report_max_latency"` // LogReportMaxLatency specifies the max time a log row can wait before being sent.
	LogReportBatchSize  int               `yaml:"log_report_batch_size"`  // LogReportBatchSize triggers immediate log flush when buffer reaches this size.
	StateReportInterval time.Duration     `yaml:"state_report_interval"`  // StateReportInterval specifies the interval for state reporting.
	Labels              []string          `yaml:"labels"`                 // Labels specify the labels of the runner. Labels are declared on each startup
	GithubMirror        string            `yaml:"github_mirror"`          // GithubMirror defines what mirrors should be used when using github
}

// Cache represents the configuration for caching.
type Cache struct {
	Enabled        *bool  `yaml:"enabled"`         // Enabled indicates whether caching is enabled. It is a pointer to distinguish between false and not set. If not set, it will be true.
	Dir            string `yaml:"dir"`             // Dir specifies the directory path for caching.
	Host           string `yaml:"host"`            // Host specifies the caching host.
	Port           uint16 `yaml:"port"`            // Port specifies the caching port.
	ExternalServer string `yaml:"external_server"` // ExternalServer specifies the URL of external cache server
}

// Container represents the configuration for the container.
type Container struct {
	Network       string        `yaml:"network"`        // Network specifies the network for the container.
	NetworkMode   string        `yaml:"network_mode"`   // Deprecated: use Network instead. Could be removed after Gitea 1.20
	Privileged    bool          `yaml:"privileged"`     // Privileged indicates whether the container runs in privileged mode.
	Options       string        `yaml:"options"`        // Options specifies additional options for the container.
	WorkdirParent string        `yaml:"workdir_parent"` // WorkdirParent specifies the parent directory for the container's working directory.
	ValidVolumes  []string      `yaml:"valid_volumes"`  // ValidVolumes specifies the volumes (including bind mounts) can be mounted to containers.
	DockerHost    string        `yaml:"docker_host"`    // DockerHost specifies the Docker host. It overrides the value specified in environment variable DOCKER_HOST.
	ForcePull     bool          `yaml:"force_pull"`     // Pull docker image(s) even if already present
	ForceRebuild  bool          `yaml:"force_rebuild"`  // Rebuild docker image(s) even if already present
	RequireDocker bool          `yaml:"require_docker"` // Always require a reachable docker daemon, even if not required by act_runner
	DockerTimeout time.Duration `yaml:"docker_timeout"` // Timeout to wait for the docker daemon to be reachable, if docker is required by require_docker or act_runner
	BindWorkdir   bool          `yaml:"bind_workdir"`   // BindWorkdir binds the workspace to the host filesystem instead of using Docker volumes. Required for DinD when jobs use docker compose with bind mounts.
}

// Host represents the configuration for the host.
type Host struct {
	WorkdirParent string `yaml:"workdir_parent"` // WorkdirParent specifies the parent directory for the host's working directory.
}

// Metrics represents the configuration for the Prometheus metrics endpoint.
type Metrics struct {
	Enabled bool   `yaml:"enabled"` // Enabled indicates whether the metrics endpoint is exposed.
	Addr    string `yaml:"addr"`    // Addr specifies the listen address for the metrics HTTP server (e.g., ":9101").
}

// Config represents the overall configuration.
type Config struct {
	Log       Log       `yaml:"log"`       // Log represents the configuration for logging.
	Runner    Runner    `yaml:"runner"`    // Runner represents the configuration for the runner.
	Cache     Cache     `yaml:"cache"`     // Cache represents the configuration for caching.
	Container Container `yaml:"container"` // Container represents the configuration for the container.
	Host      Host      `yaml:"host"`      // Host represents the configuration for the host.
	Metrics   Metrics   `yaml:"metrics"`   // Metrics represents the configuration for the Prometheus metrics endpoint.
}

// LoadDefault returns the default configuration.
// If file is not empty, it will be used to load the configuration.
func LoadDefault(file string) (*Config, error) {
	cfg := &Config{}
	if file != "" {
		content, err := os.ReadFile(file)
		if err != nil {
			return nil, fmt.Errorf("open config file %q: %w", file, err)
		}
		if err := yaml.Unmarshal(content, cfg); err != nil {
			return nil, fmt.Errorf("parse config file %q: %w", file, err)
		}
	}
	compatibleWithOldEnvs(file != "", cfg)

	if cfg.Runner.EnvFile != "" {
		if stat, err := os.Stat(cfg.Runner.EnvFile); err == nil && !stat.IsDir() {
			envs, err := godotenv.Read(cfg.Runner.EnvFile)
			if err != nil {
				return nil, fmt.Errorf("read env file %q: %w", cfg.Runner.EnvFile, err)
			}
			if cfg.Runner.Envs == nil {
				cfg.Runner.Envs = map[string]string{}
			}
			maps.Copy(cfg.Runner.Envs, envs)
		}
	}

	if cfg.Log.Level == "" {
		cfg.Log.Level = "info"
	}
	if cfg.Runner.File == "" {
		cfg.Runner.File = ".runner"
	}
	if cfg.Runner.Capacity <= 0 {
		cfg.Runner.Capacity = 1
	}
	if cfg.Runner.Timeout <= 0 {
		cfg.Runner.Timeout = 3 * time.Hour
	}
	if cfg.Cache.Enabled == nil {
		b := true
		cfg.Cache.Enabled = &b
	}
	if *cfg.Cache.Enabled {
		if cfg.Cache.Dir == "" {
			home, _ := os.UserHomeDir()
			cfg.Cache.Dir = filepath.Join(home, ".cache", "actcache")
		}
	}
	if cfg.Container.WorkdirParent == "" {
		cfg.Container.WorkdirParent = "workspace"
	}
	if cfg.Host.WorkdirParent == "" {
		home, _ := os.UserHomeDir()
		cfg.Host.WorkdirParent = filepath.Join(home, ".cache", "act")
	}
	if cfg.Runner.FetchTimeout <= 0 {
		cfg.Runner.FetchTimeout = 5 * time.Second
	}
	if cfg.Runner.FetchInterval <= 0 {
		cfg.Runner.FetchInterval = 2 * time.Second
	}
	if cfg.Runner.FetchIntervalMax <= 0 {
		cfg.Runner.FetchIntervalMax = 60 * time.Second
	}
	if cfg.Runner.LogReportInterval <= 0 {
		cfg.Runner.LogReportInterval = 5 * time.Second
	}
	if cfg.Runner.LogReportMaxLatency <= 0 {
		cfg.Runner.LogReportMaxLatency = 3 * time.Second
	}
	if cfg.Runner.LogReportBatchSize <= 0 {
		cfg.Runner.LogReportBatchSize = 100
	}
	if cfg.Runner.StateReportInterval <= 0 {
		cfg.Runner.StateReportInterval = 5 * time.Second
	}
	if cfg.Metrics.Addr == "" {
		cfg.Metrics.Addr = "127.0.0.1:9101"
	}

	// Validate and fix invalid config combinations to prevent confusing behavior.
	if cfg.Runner.FetchIntervalMax < cfg.Runner.FetchInterval {
		log.Warnf("fetch_interval_max (%v) is less than fetch_interval (%v), setting fetch_interval_max to fetch_interval",
			cfg.Runner.FetchIntervalMax, cfg.Runner.FetchInterval)
		cfg.Runner.FetchIntervalMax = cfg.Runner.FetchInterval
	}
	if cfg.Runner.LogReportMaxLatency >= cfg.Runner.LogReportInterval {
		log.Warnf("log_report_max_latency (%v) >= log_report_interval (%v), the max-latency timer will never fire before the periodic ticker; consider lowering log_report_max_latency",
			cfg.Runner.LogReportMaxLatency, cfg.Runner.LogReportInterval)
	}

	// although `container.network_mode` will be deprecated, but we have to be compatible with it for now.
	if cfg.Container.NetworkMode != "" && cfg.Container.Network == "" {
		log.Warn("You are trying to use deprecated configuration item of `container.network_mode`, please use `container.network` instead.")
		if cfg.Container.NetworkMode == "bridge" {
			// Previously, if the value of `container.network_mode` is `bridge`, we will create a new network for job.
			// But “bridge” is easily confused with the bridge network created by Docker by default.
			// So we set the value of `container.network` to empty string to make `act_runner` automatically create a new network for job.
			cfg.Container.Network = ""
		} else {
			cfg.Container.Network = cfg.Container.NetworkMode
		}
	}

	return cfg, nil
}