gha_job_execution_duration_seconds records ~2023 years for jobs canceled before runner assignment #4444

@talsuk5

Description

Summary

gha_job_execution_duration_seconds records a nonsensical duration of roughly 63,911,499,556 seconds (about 2023 years) for any job that is canceled before it is assigned to a runner. This pollutes Grafana dashboards, skews histogram buckets, and makes any per-(job_name, repository) average derived from _sum / _count unusable unless job_result="canceled" is explicitly filtered out.

Evidence

Raw metric from our Prometheus for job_result="canceled":

gha_job_execution_duration_seconds_sum{
  job_name="Pre-commit (prek)",
  repository="gitops",
  job_result="canceled",
  ...
} = 63911499556

The same (job_name, repository) combination with job_result="succeeded" has realistic values (sum=626 over 4 observations, ~156s avg). Only the canceled path is affected.

63,911,499,556 seconds ≈ 2023 years. It is almost exactly the job's finish time in Unix seconds (~1,775,904,000) minus the Unix timestamp of the zero value of Go's time.Time{} (-62,135,596,800, which corresponds to year 1 AD).

Root cause

In cmd/ghalistener/metrics/metrics.go:

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

When a job is canceled before ever being assigned to a runner, msg.RunnerAssignTime is the zero value time.Time{}. (time.Time{}).Unix() returns -62135596800 (year 1 AD in Unix seconds). The subtraction msg.FinishTime.Unix() - (-62135596800) then yields ~63.9 billion, which gets recorded into the _seconds histogram.

No check exists for msg.RunnerAssignTime.IsZero() before the subtraction.

Impact

  • Histogram _sum is effectively poisoned by every pre-assignment cancellation, because a single such observation dwarfs thousands of real runs.
  • Any dashboard that computes sum(rate(x_sum)) / sum(rate(x_count)) per repo/job becomes useless without an explicit job_result!="canceled" filter.
  • histogram_quantile over _bucket is similarly affected — all values end up in the +Inf bucket.

The official ARC Grafana dashboard (docs/gha-runner-scale-set-controller/samples/grafana-dashboard/ARC-Autoscaling-Runner-Set-Monitoring.json) uses this metric for its "Job Execution" heatmap; the heatmap is silently broken for anyone whose workflow ever cancels a pending job.

Suggested fix

Skip the histogram observation entirely when RunnerAssignTime is zero (the job was canceled before a runner was assigned, so there is no meaningful execution duration to record):

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    if msg.RunnerAssignTime.IsZero() {
        // Job was canceled before runner assignment; no execution phase to measure.
        return
    }
    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

The gha_completed_jobs_total counter is still incremented before the early return, so cancellation counts are preserved — only the meaningless duration observation is dropped.

Alternative: record 0 as the duration for pre-assignment cancellations. This keeps the observation in the histogram, but every such observation adds 0 to _sum while incrementing _count, dragging the cancel-path average toward zero. Skipping the observation is cleaner.

Note: RecordJobStarted at line 491 also subtracts two times without a zero-value check, but that path is only invoked on the JobStarted message, which by definition has both ScaleSetAssignTime and RunnerAssignTime populated, so it is not affected in practice.

Environment

  • ARC controller chart: gha-runner-scale-set-controller 0.13.0
  • Runner scale set chart: gha-runner-scale-set 0.13.0
  • Observed on commit 79e7b17b563d5492e6c8537e3d6fd52eb6002c3a of this repo; code path unchanged on main as of today.

Local workaround

For anyone else hitting this before the upstream fix lands, the dashboard-side workaround is to add job_result!="canceled" to every duration query:

sum by (job_name, repository) (
  gha_job_execution_duration_seconds_sum{job_result!="canceled"}
)
/
sum by (job_name, repository) (
  gha_job_execution_duration_seconds_count{job_result!="canceled"}
)
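The same filter applies to quantile queries over the buckets; a sketch (the 5-minute rate window is an arbitrary choice, adjust as needed):

```
histogram_quantile(0.95,
  sum by (job_name, repository, le) (
    rate(gha_job_execution_duration_seconds_bucket{job_result!="canceled"}[5m])
  )
)
```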
