gha_job_execution_duration_seconds records ~2023 years for jobs canceled before runner assignment #4444

@talsuk5

Description

Summary

gha_job_execution_duration_seconds records a nonsensical duration of roughly 63,911,499,556 seconds (about 2023 years) for any job that is canceled before it is assigned to a runner. This pollutes Grafana dashboards, skews histogram buckets, and makes any per-(job_name, repository) average derived from _sum / _count unusable unless job_result="canceled" is explicitly filtered out.

Evidence

Raw metric from our Prometheus for job_result="canceled":

gha_job_execution_duration_seconds_sum{
  job_name="Pre-commit (prek)",
  repository="gitops",
  job_result="canceled",
  ...
} = 63911499556

The same (job_name, repository) combination with job_result="succeeded" has realistic values (sum=626 over 4 observations, ~156s avg). Only the canceled path is affected.

63,911,499,556 seconds ≈ 2023 years. It is almost exactly the job's finish time in Unix seconds (~1,775,904,000) minus the Unix timestamp of the zero value of Go's time.Time{} (-62,135,596,800, which corresponds to year 1 AD).

Root cause

In cmd/ghalistener/metrics/metrics.go:

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

When a job is canceled before ever being assigned to a runner, msg.RunnerAssignTime is the zero value time.Time{}. (time.Time{}).Unix() returns -62135596800 (year 1 AD in Unix seconds). The subtraction msg.FinishTime.Unix() - (-62135596800) then yields ~63.9 billion, which gets recorded into the _seconds histogram.

No check exists for msg.RunnerAssignTime.IsZero() before the subtraction.

Impact

  • Histogram _sum is effectively poisoned by every pre-assignment cancellation, because a single such observation dwarfs thousands of real runs.
  • Any dashboard that computes sum(rate(x_sum)) / sum(rate(x_count)) per repo/job becomes useless without an explicit job_result!="canceled" filter.
  • histogram_quantile over _bucket is similarly affected — all values end up in the +Inf bucket.

The official ARC Grafana dashboard (docs/gha-runner-scale-set-controller/samples/grafana-dashboard/ARC-Autoscaling-Runner-Set-Monitoring.json) uses this metric for its "Job Execution" heatmap; the heatmap is silently broken for anyone whose workflow ever cancels a pending job.

Suggested fix

Skip the histogram observation entirely when RunnerAssignTime is zero (the job was canceled before a runner was assigned, so there is no meaningful execution duration to record):

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    if msg.RunnerAssignTime.IsZero() {
        // Job was canceled before runner assignment; no execution phase to measure.
        return
    }
    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

The gha_completed_jobs_total counter is still incremented before the early return, so cancellation counts are preserved — only the meaningless duration observation is dropped.

Alternative: record 0 as the duration for pre-assignment cancellations. This keeps the observation in the histogram, but every such observation adds 0 to _sum while incrementing _count, dragging the cancel-path average toward zero. Skipping the observation is cleaner.

Note: RecordJobStarted at line 491 also subtracts two times without a zero-value check, but that path is only invoked on the JobStarted message, which by definition has both ScaleSetAssignTime and RunnerAssignTime populated, so it is not affected in practice.

Environment

  • ARC controller chart: gha-runner-scale-set-controller 0.13.0
  • Runner scale set chart: gha-runner-scale-set 0.13.0
  • Observed on commit 79e7b17b563d5492e6c8537e3d6fd52eb6002c3a of this repo; code path unchanged on main as of today.

Local workaround

For anyone else hitting this before the upstream fix lands, the dashboard-side workaround is to add job_result!="canceled" to every duration query:

sum by (job_name, repository) (
  gha_job_execution_duration_seconds_sum{job_result!="canceled"}
)
/
sum by (job_name, repository) (
  gha_job_execution_duration_seconds_count{job_result!="canceled"}
)
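The same filter applies to quantile queries over the buckets; a sketch (the 5-minute rate window is an arbitrary choice, adjust as needed):

```
histogram_quantile(0.95,
  sum by (job_name, repository, le) (
    rate(gha_job_execution_duration_seconds_bucket{job_result!="canceled"}[5m])
  )
)
```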
