phc: add metrics for number of unhealthy endpoints #3040
MustafaSaber wants to merge 2 commits into master
Signed-off-by: Mustafa Abdelrahman <mustafa.abdelrahman@zalando.de>
proxy/healthy_endpoints.go
```diff
 ctx.Logger().Infof("Dropping endpoint %q due to passive health check: p=%0.2f, dropProbability=%0.2f",
 	e.Host, p, dropProbability)
-metrics.IncCounter("passive-health-check.endpoints.dropped")
+metrics.IncCounter("passive-health-check.requests.dropped")
```
Renamed this counter; I think `passive-health-check.requests.dropped` is more descriptive now.
```go
beforefiltering := len(endpoints)
endpoints = p.fadein.filterFadeIn(endpoints, rt)
endpoints = p.heathlyEndpoints.filterHealthyEndpoints(ctx, endpoints, p.metrics)
p.metrics.UpdateGauge("passive-health-check.endpoints.dropped", float64(beforefiltering-len(endpoints)))
```
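The gauge value in this hunk is the endpoint count taken before both filter calls minus the count after them. A minimal sketch of that arithmetic, assuming plain host strings instead of Skipper's endpoint type; `dropListed` and `droppedCount` are hypothetical stand-ins, not Skipper code:

```go
package main

import "fmt"

// dropListed is a hypothetical stand-in for one filter step: it returns the
// hosts that are not marked in skip, preserving their order.
func dropListed(endpoints []string, skip map[string]bool) []string {
	out := make([]string, 0, len(endpoints))
	for _, e := range endpoints {
		if !skip[e] {
			out = append(out, e)
		}
	}
	return out
}

// droppedCount mirrors the shape of the hunk: the length is taken once before
// both filter steps and once after, so the difference includes endpoints
// removed by either step.
func droppedCount(endpoints []string, fadingIn, unhealthy map[string]bool) int {
	before := len(endpoints)
	endpoints = dropListed(endpoints, fadingIn)  // stand-in for filterFadeIn
	endpoints = dropListed(endpoints, unhealthy) // stand-in for filterHealthyEndpoints
	return before - len(endpoints)
}

func main() {
	eps := []string{"10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"}
	fadingIn := map[string]bool{"10.0.0.1:80": true}
	unhealthy := map[string]bool{"10.0.0.2:80": true}
	fmt.Println(droppedCount(eps, fadingIn, unhealthy)) // 2
}
```

If `filterFadeIn` can also remove endpoints and the gauge is meant to reflect only passive health check drops, taking the length just before the `filterHealthyEndpoints` call would keep the two effects apart.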
After some discussion with @RomanZavodskikh: this gauge will be overwritten by different services (requests). For example, if svc A has 2 unhealthy endpoints and svc B has 1, both write to the same metric, so the gauge only shows whichever value was written last.
One option is to append the routeId, or the Name+Namespace combination, to the metric name. We aren't sure that's a good idea memory-wise, though. WDYT, @AlexanderYastrebov @szuecs?
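To make the overwrite concrete, here is a minimal sketch contrasting one shared gauge key with per-route keys. It assumes a hypothetical in-memory `gaugeStore` (not Skipper's `metrics` package) whose `UpdateGauge` replaces the stored value; the route IDs `svc_a`/`svc_b` and the `perRouteKey` naming scheme are made up for illustration:

```go
package main

import "fmt"

// gaugeStore is a hypothetical in-memory stand-in for a metrics backend:
// UpdateGauge replaces whatever value the key held, as a gauge update does.
type gaugeStore map[string]float64

func (g gaugeStore) UpdateGauge(key string, v float64) { g[key] = v }

const droppedGauge = "passive-health-check.endpoints.dropped"

// perRouteKey is a hypothetical naming scheme that appends the route ID.
func perRouteKey(routeID string) string { return droppedGauge + "." + routeID }

func main() {
	shared := gaugeStore{}
	shared.UpdateGauge(droppedGauge, 2) // request for svc A: 2 endpoints dropped
	shared.UpdateGauge(droppedGauge, 1) // request for svc B: 1 endpoint dropped
	fmt.Println(shared[droppedGauge])   // 1: svc A's value was overwritten

	perRoute := gaugeStore{}
	perRoute.UpdateGauge(perRouteKey("svc_a"), 2)
	perRoute.UpdateGauge(perRouteKey("svc_b"), 1)
	fmt.Println(len(perRoute), perRoute[perRouteKey("svc_a")]) // 2 2
}
```

Per-route keys keep both values, at the cost of one gauge series per route, which is the memory trade-off raised above.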
Of the two options, routeId is better, because Kubernetes data is 1) not available here and 2) often not applicable (what about non-Kubernetes dataclients?).
Another way would be to have a per-endpoint gauge `passive-health-check.endpoints.dropped.<endpoint>` whose value is 0 or 1; a `sum()` query over these gauges would then give the number of currently dropped endpoints.
In any case, it seems we would need to add some unbounded memory usage, so we should consider whether we need this at all, or whether to start with only logs that record the dropped endpoint.
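The per-endpoint 0/1 variant can be sketched as follows. This assumes the same kind of hypothetical in-memory `gaugeStore` (not Skipper's API); `markEndpoint` and `sumDropped` are illustrative names, and `sumDropped` only emulates what a backend `sum()` query over the gauges would return:

```go
package main

import (
	"fmt"
	"strings"
)

// gaugeStore is a hypothetical in-memory stand-in for a metrics backend.
type gaugeStore map[string]float64

func (g gaugeStore) UpdateGauge(key string, v float64) { g[key] = v }

const droppedPrefix = "passive-health-check.endpoints.dropped."

// markEndpoint sets the per-endpoint gauge to 1 if the endpoint is currently
// dropped and to 0 otherwise.
func markEndpoint(g gaugeStore, host string, dropped bool) {
	v := 0.0
	if dropped {
		v = 1
	}
	g.UpdateGauge(droppedPrefix+host, v)
}

// sumDropped emulates a sum() query over every per-endpoint gauge.
func sumDropped(g gaugeStore) float64 {
	var total float64
	for key, v := range g {
		if strings.HasPrefix(key, droppedPrefix) {
			total += v
		}
	}
	return total
}

func main() {
	g := gaugeStore{}
	markEndpoint(g, "10.0.0.1:80", true)
	markEndpoint(g, "10.0.0.2:80", true)
	markEndpoint(g, "10.0.0.3:80", false)
	fmt.Println(sumDropped(g)) // 2

	markEndpoint(g, "10.0.0.1:80", false) // endpoint recovered
	fmt.Println(sumDropped(g), len(g))    // 1 3
}
```

Note that `len(g)` never shrinks in this sketch: every endpoint ever seen keeps a series even after it recovers, which is the unbounded memory growth mentioned above.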
related to #2346