Summary
gha_job_execution_duration_seconds records a nonsensical duration of roughly 63,911,499,556 seconds (about 2,025 years) for any job that is canceled before it is assigned to a runner. This pollutes Grafana dashboards, skews histogram buckets, and makes any per-(job_name, repository) average derived from _sum / _count unusable unless job_result="canceled" is explicitly filtered out.
Evidence
Raw metric from our Prometheus for job_result="canceled":
```
gha_job_execution_duration_seconds_sum{
  job_name="Pre-commit (prek)",
  repository="gitops",
  job_result="canceled",
  ...
} = 63911499556
```
The same (job_name, repository) combination with job_result="succeeded" has realistic values (sum=626 over 4 observations, ~156s avg). Only the canceled path is affected.
63,911,499,556 seconds ≈ 2,025 years. It is exactly the job's finish timestamp (~1,775,904,000, a date in 2026) minus the Unix representation of the zero value of Go's time.Time{} (-62,135,596,800, which is year 1 AD).
Root cause
In cmd/ghalistener/metrics/metrics.go:
```go
func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
	l := e.completedJobLabels(msg)
	e.incCounter(MetricCompletedJobsTotal, l)

	executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
	e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}
```
When a job is canceled before ever being assigned to a runner, msg.RunnerAssignTime is the zero value time.Time{}. (time.Time{}).Unix() returns -62135596800 (year 1 AD in Unix seconds). The subtraction msg.FinishTime.Unix() - (-62135596800) then yields ~63.9 billion, which gets recorded into the _seconds histogram.
No check exists for msg.RunnerAssignTime.IsZero() before the subtraction.
Impact
- Histogram _sum is effectively poisoned by every pre-assignment cancellation, because a single such observation dwarfs thousands of real runs.
- Any dashboard that computes sum(rate(x_sum)) / sum(rate(x_count)) per repo/job becomes useless without an explicit job_result!="canceled" filter.
- histogram_quantile over _bucket is similarly affected: every poisoned value lands in the +Inf bucket.
- The official ARC Grafana dashboard (docs/gha-runner-scale-set-controller/samples/grafana-dashboard/ARC-Autoscaling-Runner-Set-Monitoring.json) uses this metric for its "Job Execution" heatmap; the heatmap is silently broken for anyone whose workflow ever cancels a pending job.
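To make the skew concrete, a back-of-the-envelope calculation (the 1,000-job count is illustrative; the ~156 s average comes from the succeeded-path numbers above):

```go
package main

import "fmt"

func main() {
	// 1,000 real runs averaging ~156 s each...
	realSum := 1000 * 156.0
	// ...plus ONE pre-assignment cancellation with a zero RunnerAssignTime.
	poisoned := 63911462400.0

	// The single bogus observation drags the average into the tens of millions
	// of seconds; the real workload is invisible in the result.
	avg := (realSum + poisoned) / 1001
	fmt.Printf("average: %.0f s\n", avg)
}
```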
Suggested fix
Skip the histogram observation entirely when RunnerAssignTime is zero (the job was canceled before a runner was assigned, so there is no meaningful execution duration to record):
```go
func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
	l := e.completedJobLabels(msg)
	e.incCounter(MetricCompletedJobsTotal, l)

	if msg.RunnerAssignTime.IsZero() {
		// Job was canceled before runner assignment; no execution phase to measure.
		return
	}

	executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
	e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}
```
The gha_completed_jobs_total counter is still incremented before the early return, so cancellation counts are preserved — only the meaningless duration observation is dropped.
Alternative: record 0 as the duration for pre-assignment cancellations. This keeps the observation in the histogram but biases the _sum toward zero for the cancel path. Skipping the observation is cleaner.
Note: RecordJobStarted at line 491 also subtracts two times without a zero-value check, but that path is only invoked on the JobStarted message which by definition has both ScaleSetAssignTime and RunnerAssignTime populated, so it's not affected in practice.
Environment
- ARC controller chart: gha-runner-scale-set-controller 0.13.0
- Runner scale set chart: gha-runner-scale-set 0.13.0
- Observed on commit 79e7b17b563d5492e6c8537e3d6fd52eb6002c3a of this repo; code path unchanged on main as of today.
Local workaround
For anyone else hitting this before the upstream fix lands, the dashboard-side workaround is to add job_result!="canceled" to every duration query:
```
sum by (job_name, repository) (
  gha_job_execution_duration_seconds_sum{job_result!="canceled"}
)
/
sum by (job_name, repository) (
  gha_job_execution_duration_seconds_count{job_result!="canceled"}
)
```