* Monitoring ceph and prometheus
From: Jan Fajerski @ 2017-05-11 11:52 UTC
  To: ceph-devel

Hi list,
I recently looked into Ceph monitoring with prometheus. There is already a ceph
exporter for this purpose: https://github.com/digitalocean/ceph_exporter.

Prometheus encourages software projects to instrument their code directly and 
expose this data, instead of using an external piece of code. Several libraries 
are provided for this purpose: 
https://prometheus.io/docs/instrumenting/clientlibs/

I think there are arguments for adding this instrumentation to Ceph directly.
Generally speaking, it should reduce overall complexity in the code (no extra
exporter component outside of ceph) and in operations (no extra package or
configuration).

The direct instrumentation could happen in two places:
1)
Directly in Ceph's C++ code using https://github.com/jupp0r/prometheus-cpp.  This
would mean daemons expose their metrics directly via the prometheus http
interface. This would be the most direct way of exposing metrics; prometheus
would simply poll all endpoints. Service discovery for scrape targets, say added
or removed OSDs, would however have to be handled somewhere. Orchestration tools
à la k8s, ansible, salt, ... either have this feature already or could add it
easily. Deployments not using such a tool need another approach: Prometheus
offers various mechanisms
(https://prometheus.io/docs/operating/configuration/#%3Cscrape_config%3E), or a
ceph component (say mon or mgr) could handle this.

2)
Add a ceph-mgr plugin that exposes the metrics available to ceph-mgr as a
prometheus scrape target (using https://github.com/prometheus/client_python).
This would handle the service discovery issue for ceph daemons out of the box
(though not for the actual mgr daemon, which is the scrape target). The code
would also live in a central location instead of being scattered across several
places. It does, however, add a (maybe pointless) level of indirection
($ceph_daemon -> ceph-mgr -> prometheus) and requires two different scrape
intervals (assuming mgr polls metrics from daemons).
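
For illustration, a rough sketch of what such a module's core could look like.
The prometheus parts use the real client_python API (start_http_server, Gauge);
collect_ceph_perf_counters() and the port number are placeholders for whatever
interface ceph-mgr ends up offering, not actual Ceph code:

    from prometheus_client import Gauge, start_http_server
    import time

    gauges = {}

    def collect_ceph_perf_counters():
        # Placeholder: would return something like
        # {('osd', '0'): {'osd.op_r': 1234, 'osd.op_w': 567}, ...}
        return {}

    def refresh():
        for (svc_type, svc_id), counters in collect_ceph_perf_counters().items():
            for path, value in counters.items():
                name = 'ceph_' + path.replace('.', '_').replace('-', '_')
                if name not in gauges:
                    gauges[name] = Gauge(name, path, ['ceph_daemon'])
                gauges[name].labels(
                    ceph_daemon='%s.%s' % (svc_type, svc_id)).set(value)

    if __name__ == '__main__':
        start_http_server(9128)   # port chosen arbitrarily for the sketch
        while True:
            refresh()
            time.sleep(10)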

I'm aware of the current dashboard efforts based on ceph-mgr exported data. I'm 
sure the data export for the dashboard and prometheus could be unified at some 
point.

Best,
Jan

-- 
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


* Re: Monitoring ceph and prometheus
From: John Spray @ 2017-05-11 12:14 UTC
  To: Ceph Development

On Thu, May 11, 2017 at 12:52 PM, Jan Fajerski <jfajerski@suse.com> wrote:
> 2)
> Add a ceph-mgr plugin that exposes the metrics available to ceph-mgr as a
> prometheus scrape target (using
> https://github.com/prometheus/client_python).  This would handle the service
> discovery issue for ceph daemons out of the box (though not for the actual
> mgr-daemon which is the scrape target). The code would also be in a central
> location instead of being scattered in several places. It does however add a
> (maybe pointless) level of indirection ($ceph_daemon -> ceph-mgr ->
> prometheus) and adds the need for two different scrape intervals (assuming
> mgr polls metrics from daemons).

I would love to see a mgr module for prometheus integration!

John



* Re: Monitoring ceph and prometheus
From: Sage Weil @ 2017-05-11 12:47 UTC
  To: John Spray; +Cc: Ceph Development

On Thu, 11 May 2017, John Spray wrote:
> I would love to see a mgr module for prometheus integration!

Me too!  It might make more sense to do it in C++ than python, though, for 
performance reasons.

sage


* Re: Monitoring ceph and prometheus
From: Brad Hubbard @ 2017-05-12  1:03 UTC
  To: Sage Weil; +Cc: John Spray, Ceph Development



On Thu, May 11, 2017 at 10:47 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 11 May 2017, John Spray wrote:
>>
>> I would love to see a mgr module for prometheus integration!
>
> Me too!  It might make more sense to do it in C++ than python, though, for
> performance reasons.

Can we define "metrics" here? What, specifically, are we planning to gather?

Let's start with an example from "ceph_exporter". It exposes a metric,
ApplyLatency, which it obtains by connecting to the cluster via a rados client
connection, running the "osd perf" command and gathering the apply_latency_ms
result. I believe this stat is the equivalent of the apply_latency perf counter
statistic.
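
(For reference, a sketch of how an external exporter can get at this today via
librados and a mon command; the python-rados calls are real, but treat the exact
JSON field names as illustrative:)

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # "osd perf" returns the per-OSD apply/commit latencies as JSON.
        ret, outbuf, errs = cluster.mon_command(
            json.dumps({'prefix': 'osd perf', 'format': 'json'}), b'')
        for osd in json.loads(outbuf)['osd_perf_infos']:
            stats = osd['perf_stats']
            print(osd['id'], stats['apply_latency_ms'], stats['commit_latency_ms'])
    finally:
        cluster.shutdown()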

Does the manager currently export the performance counters? If not, option 1
looks more viable for gathering these sorts of metrics (think "perf dump"),
unless the manager can proxy calls such as "osd perf" back to the MONs.

Part of the problem with gathering metrics from ceph is working out which set of
metrics you want to collect from the large assortment available, IMHO.

>
> sage



-- 
Cheers,
Brad


* Re: Monitoring ceph and prometheus
From: Sage Weil @ 2017-05-12  1:07 UTC
  To: Brad Hubbard; +Cc: John Spray, Ceph Development

On Fri, 12 May 2017, Brad Hubbard wrote:
> On Thu, May 11, 2017 at 10:47 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 11 May 2017, John Spray wrote:
> >>
> >> I would love to see a mgr module for prometheus integration!
> >
> > Me too!  It might make more sense to do it in C++ than python, though, for
> > performance reasons.
> 
> Can we define "metrics" here? What, specifically, are we planning to gather?
> 
> Let's start with an example from "ceph_exporter". It exposes a metric
> ApplyLatency which it obtains by connecting to the cluster via a rados client
> connection and running the "osd perf" command and gathering the apply_latency_ms
> result. I believe this stat is the equivalent of the apply_latency perf counters
> statistic.
> 
> Does the manager currently export the performance counters? If not option 1 is
> looking more viable for gathering these sorts (think "perf dump") of
> metrics unless the manager can proxy calls such as "osd perf" back to the MONs?

Right now all of the perf counters are reported to ceph-mgr.  We shouldn't
need to do 'osd perf' (which just reports the two metrics that the osds have
historically reported to the mon).
 
> Part of the problem with gathering metrics from ceph is working out what set of
> metrics you want to collect from a large assortment available IMHO.

We could collect them all. Or, we recently introduced a 'priority' field 
so we can collect everything above a threshold (although then we have to 
go assign meaningful priorities to most of the counters).

BTW, one of the cool things about prometheus is that it has a histogram
type, which means we can take our 2D histogram data and report that
(flattened into one or the other dimension).
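
(As an illustration of the histogram point -- exporting pre-aggregated bucket
counts through client_python's custom-collector interface; the metric name, the
bucket boundaries and the flatten_op_latency() helper are made up:)

    from prometheus_client.core import HistogramMetricFamily, REGISTRY

    def flatten_op_latency():
        # Placeholder: collapse the 2D (latency x size) counts along the size
        # axis, returning cumulative per-bucket counts plus a sum of observations.
        return [('0.005', 10), ('0.05', 42), ('0.5', 57), ('+Inf', 60)], 3.2

    class CephHistogramCollector(object):
        def collect(self):
            buckets, sum_value = flatten_op_latency()
            h = HistogramMetricFamily('ceph_osd_op_latency_seconds',
                                      'OSD op latency, flattened from 2D data',
                                      labels=['ceph_daemon'])
            h.add_metric(['osd.0'], buckets, sum_value)
            yield h

    REGISTRY.register(CephHistogramCollector())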

sage


* Re: Monitoring ceph and prometheus
From: Brad Hubbard @ 2017-05-12  1:16 UTC
  To: Sage Weil; +Cc: John Spray, Ceph Development

On Fri, May 12, 2017 at 11:07 AM, Sage Weil <sage@newdream.net> wrote:
> On Fri, 12 May 2017, Brad Hubbard wrote:
>> On Thu, May 11, 2017 at 10:47 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Thu, 11 May 2017, John Spray wrote:
>> >>
>> >> I would love to see a mgr module for prometheus integration!
>> >
>> > Me too!  It might make more sense to do it in C++ than python, though, for
>> > performance reasons.
>>
>> Can we define "metrics" here? What, specifically, are we planning to gather?
>>
>> Let's start with an example from "ceph_exporter". It exposes a metric
>> ApplyLatency which it obtains by connecting to the cluster via a rados client
>> connection and running the "osd perf" command and gathering the apply_latency_ms
>> result. I believe this stat is the equivalent of the apply_latency perf counters
>> statistic.
>>
>> Does the manager currently export the performance counters? If not option 1 is
>> looking more viable for gathering these sorts (think "perf dump") of
>> metrics unless the manager can proxy calls such as "osd perf" back to the MONs?
>
> Right now all of the perfcounters are reported to ceph-mgr.  We shouldn't
> need to do 'osd perf' (which is just reporting those 2 metrics that the
> osds have historically reported to the mon).

Ah, in DaemonState.* and MgrClient.cc. I see the mechanics now, thanks.




-- 
Cheers,
Brad


* Re: Monitoring ceph and prometheus
From: Lars Marowsky-Bree @ 2017-05-13 10:14 UTC
  To: Ceph Development

On 2017-05-11T12:47:21, Sage Weil <sage@newdream.net> wrote:

> > I would love to see a mgr module for prometheus integration!
> Me too!  It might make more sense to do it in C++ than python, though, for 
> performance reasons.

I'm leaning the other way. (Disclaimer: I started this dialogue
internally and was originally thinking of putting it into ceph-mgr.)

prometheus implements a pull model for time series data / metrics. For
those to be pull-able from ceph-mgr, either ceph-mgr needs to pull them
itself, or the daemons need to stream to it. Clearly it can't pull something
that's not there.

Both have slightly different issues with aligning the periods/intervals.

Prometheus can also scale by polling via several instances; if we pull
everything from ceph-mgr, that becomes a single chokepoint.

Further, if ceph-mgr were to pull data from individual daemons - why not
have prometheus do this directly? What benefit does this additional
indirection step offer?

If we have rather detailed stats per daemon, ceph-mgr would either relay
them on as-is (pure overhead), or aggregate them - and likely not
aggregate them as well/flexibly as Prometheus would allow via PromQL.

Now, that's not to say that ceph-mgr would not benefit from a Prometheus
interface! I could easily see ceph-mgr have stats of its own that are
worth monitoring, and we should make it easy to export those.

So, in short, I believe an easy way to export per-daemon metrics is
desirable. ceph-mgr might choose to pull these in as well if it has a
use for them, but I think Prometheus would best attach to the daemons
directly too.


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: Monitoring ceph and prometheus
From: John Spray @ 2017-05-14 22:27 UTC
  To: Lars Marowsky-Bree; +Cc: Ceph Development

On Sat, May 13, 2017 at 11:14 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2017-05-11T12:47:21, Sage Weil <sage@newdream.net> wrote:
>
>> > I would love to see a mgr module for prometheus integration!
>> Me too!  It might make more sense to do it in C++ than python, though, for
>> performance reasons.
>
> I'm leaning the other way. (Disclaimer: I started this dialogue
> internally and was originally thinking of putting it into ceph-mgr.)
>
> prometheus implements a pull model for time series data / metrics. For
> those to be pull-able from ceph-mgr, either ceph-mgr needs to pull
> itself, or daemons stream to it. Clearly it can't pull something that's
> not there.
>
> Both have slightly different issues with aligning the periods/intervals.
>
> Prometheus also can scale through polling via several instances; if we
> pull everything from ceph-mgr, that is a single chokepoint.

The question of bottlenecks comes up regularly when discussing this.
Before going and adding new interfaces to the OSDs to talk to
prometheus, I think it would be useful to find out if there really is
a problem.  We're talking about pretty small messages here, doing
nothing but updating some counters in memory, and it's a lot less work
than the OSDs already do.

When passing that data onwards, I don't know whether prometheus has an
issue dealing with a single endpoint that gives it a huge amount of
data.  I have not looked into it, but I wonder if the federation
interface[1] would be appropriate: make ceph-mgr look like a federated
prometheus instance instead of a normal endpoint.

1. https://prometheus.io/docs/operating/federation/

> Further, if ceph-mgr were to pull data from individual daemons - why not
> have prometheus do this directly? What benefit does this additional
> indirection step offer?

Simplicity.  It makes it super simple for a user with nothing but
vanilla Ceph and vanilla Prometheus to connect the two things
together.  Anything that requires lots of per-daemon configuration
relies on some additional orchestration tool to do that plumbing.

While that orchestration is not intrinsically complex, it's an area of
fragmentation in the community, whereas things we can simply build
into Ceph have a better chance of wider adoption.  If we build this
into ceph-mgr, then we can have a super-simple page on docs.ceph.com
that tells people how to plug any Ceph cluster into Prometheus in a
couple of commands.  If it relies on (various) external orchestrators,
we lose that.

In conversations about this topic (centralized vs. per-daemon stats),
we usually come to the conclusion that both are useful: the simple
"batteries included" configuration where we present a single endpoint,
vs. the configuration where some external program is aware of all
individual daemons and monitoring them directly.  If we end up with
both, that's not an awful thing.

As you point out, one ends up putting a prometheus endpoint into the
mgr anyway to expose the cluster-wide stats (as opposed to the daemon
perf counters), so it's probably absurdly easy to just make it expose
the perf counters too, even if one also continues to add code to
(optionally?) expose perf counters directly from daemons too.

John



* Re: Monitoring ceph and prometheus
From: Lars Marowsky-Bree @ 2017-05-15  6:44 UTC
  To: Ceph Development

On 2017-05-14T23:27:03, John Spray <jspray@redhat.com> wrote:

> a problem.  We're talking about pretty small messages here, doing
> nothing but updating some counters in memory, and it's a lot less work
> than the OSDs already do.

True, but sending (or exposing) them to ceph-mgr only for ceph-mgr to
pass them on on demand to Prometheus still strikes me as a redundant
hop.

> Simplicity.  It makes it super simple for a user with nothing but
> vanilla Ceph and vanilla Prometheus to connect the two things
> together.  Anything that requires lots of per-daemon configuration
> relies on some addition orchestration tool to do that plumbing.

Prometheus doesn't usually require per-daemon configuration; it has all
the hooks to dynamically update the list of daemons to monitor:
https://github.com/prometheus/docs/blob/master/content/docs/operating/configuration.md

Ok, so maybe we don't want to use consul/marathon/k8s/serversets.
(Surprised not to see etcd, actually ;-)

But Ceph *does* have a service that tells a client about all
OSD/MDS/MON/... instances, doesn't it? All the maps. This might be better
solved by a ceph_sd_configs section that points Prometheus at a Ceph
cluster with a single stanza.
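
(There is no ceph_sd_configs in Prometheus today, but its generic
file_sd_configs mechanism could be fed from the maps: something -- a mgr module,
a cron job -- writes the daemon endpoints in file_sd's JSON format and
Prometheus watches that file. A sketch, with get_daemon_endpoints(), the
addresses and the output path purely hypothetical:)

    import json

    def get_daemon_endpoints():
        # Placeholder for "read the cluster maps and return a metrics
        # host:port per daemon".
        return {'osd.0': 'host1:9200', 'osd.1': 'host2:9200', 'mds.a': 'host3:9200'}

    targets = [{'targets': [addr], 'labels': {'ceph_daemon': name}}
               for name, addr in sorted(get_daemon_endpoints().items())]

    # A file_sd_configs stanza in prometheus.yml would point at this file.
    with open('/etc/prometheus/targets/ceph.json', 'w') as f:
        json.dump(targets, f, indent=2)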

So, OK, ceph-mgr could additionally keep track of radosgw or nfs-ganesha
instances, possibly more - and possibly strip out the parts of the maps
Prometheus doesn't need to know about. And possibly provide an API that
doesn't require CephX.

So, perhaps exposing this - the dynamic service/target discovery via
ceph-mgr to Prometheus, and then having Prometheus pull directly - is a
synthesis of both positions?

> In conversations about this topic (centralized vs. per-daemon stats),
> we usually come to the conclusion that both are useful: the simple
> "batteries included" configuration where we present a single endpoint,
> vs. the configuration where some external program is aware of all
> individual daemons and monitoring them directly.  If we end up with
> both, that's not an awful thing.

Perhaps the above is the one that can converge both positions into
one?


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: Monitoring ceph and prometheus
From: John Spray @ 2017-05-15 12:33 UTC
  To: Lars Marowsky-Bree; +Cc: Ceph Development

On Mon, May 15, 2017 at 7:44 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2017-05-14T23:27:03, John Spray <jspray@redhat.com> wrote:
>
>> a problem.  We're talking about pretty small messages here, doing
>> nothing but updating some counters in memory, and it's a lot less work
>> than the OSDs already do.
>
> True, but sending (or exposing) them to ceph-mgr only for ceph-mgr to
> pass them on on-demand to Prometheus just still strikes me as a
> redundant hop.

At the risk of being a bit picky, it's only redundant if prometheus is
the only thing consuming them.  If the user is also using some mgr
modules (including things like handy CLI views) that consume the
stats, it's not redundant at all.  I'd like to keep these stats around
in the mgr because we're not quite sure yet what kinds of modules
we'll end up with.

Sage's recent change to add the importance thresholds to perf counters
could be interesting here: we might end up sending everything that's
"reasonably important" and higher to the mgr for exposing in CLI tools
etc. (I'm thinking of things like the OSD throughput, the number of each
MDS op per second, etc.), while perhaps the really obscure stuff would
only get collected (into prometheus?) if someone actively chose that.
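
(A sketch of the threshold idea -- only counters at or above a configured
priority get forwarded/exposed; the priority values and the example counters
below are illustrative, not the real Ceph definitions:)

    PRIO_DEBUGONLY, PRIO_USEFUL, PRIO_INTERESTING, PRIO_CRITICAL = 0, 5, 8, 10

    counters = {
        'osd.op_w_latency':       {'value': 0.004,  'prio': PRIO_CRITICAL},
        'osd.op_r_out_bytes':     {'value': 123456, 'prio': PRIO_INTERESTING},
        'bluestore.kv_flush_lat': {'value': 0.0001, 'prio': PRIO_DEBUGONLY},
    }

    def counters_to_expose(threshold=PRIO_INTERESTING):
        # Keep only the "reasonably important" counters.
        return {name: c['value'] for name, c in counters.items()
                if c['prio'] >= threshold}

    print(counters_to_expose())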

>> Simplicity.  It makes it super simple for a user with nothing but
>> vanilla Ceph and vanilla Prometheus to connect the two things
>> together.  Anything that requires lots of per-daemon configuration
>> relies on some addition orchestration tool to do that plumbing.
>
> Prometheus doesn't usually require per-daemon configuration; it has all
> the hooks to deal with dynamically update the list of daemons to
> monitor.
> https://github.com/prometheus/docs/blob/master/content/docs/operating/configuration.md
>
> Ok, so maybe we don't want to use consul/marathon/k8s/serversets.
> (Surprised not to see etcd, actually ;-)
>
> But Ceph *does* have a service that tells a client about all
> OSDs/MDSs/MONs/... instances, don't we? All the maps. This might better
> be solved by a ceph_sd_configs section to point it at a Ceph cluster
> with a single stanza?
>
> So, OK, ceph-mgr could additionally keep track of radosgw or nfs-ganesha
> instances, possibly more - and possibly strip out the parts of the maps
> Prometheus doesn't need to know about. And possibly provide an API that
> doesn't require CephX.
>
> So, perhaps exposing this - the dynamic service/target discovery via
> ceph-mgr to Prometheus, and then having Prometheus pull directly - is a
> synthesis of both positions?

It would certainly be ++good to build in the service discovery so that
the user only needs to point prometheus at one place to discover
everything.  Anything that avoids the need for extra external tools to
set things up makes me happy.

John



* Re: Monitoring ceph and prometheus
From: Lars Marowsky-Bree @ 2017-05-18  8:37 UTC
  To: Ceph Development

On 2017-05-15T13:33:29, John Spray <jspray@redhat.com> wrote:

> At the risk of being a bit picky, it's only redundant if prometheus is
> the only thing consuming them.  If the user is also using some mgr
> modules (including things like handy CLI views) that consume the
> stats, it's not redundant at all.  I'd like to keep these stats around
> in the mgr because we're not quite sure yet what kinds of modules
> we'll end up with.

Fair enough. The point that they may wish to gather information at
different frequencies still remains, though - a ceph-mgr module may do it
on demand for certain tasks, event-driven, or periodically; prometheus
(or other trending) would want to poll certain counters at various
frequencies, etc.

(e.g., maybe the OSD ones every 10s, SMART every 3h, whatever)

Aligning these would be annoying, and it seems to me that it makes more
sense to allow them to poll independently from the same interfaces.

> Sage's recent change to add the importance thresholds to perf counters
> could be interesting here: we might end up sending everything that's
> "reasonably important" and higher to the mgr for exposing in CLI tools
> etc (I'm thinking of things like the OSD throughput, the MDS number of
> each op per second, etc), while perhaps the really obscure stuff would
> only get collected (into prometheus?) if someone actively chose that.

That's actually somewhat related to how SMART classifies things: value,
threshold, type (old-age, pre-fail; we could add a "perf" one).

I take the point - there's also a need for an event-driven channel that
needs to be push by default. (From simple operation completion
notification to "OMFG the disk caught fire.")

I could see those going to ceph-mgr for handling/relaying.

> > So, perhaps exposing this - the dynamic service/target discovery via
> > ceph-mgr to Prometheus, and then having Prometheus pull directly - is a
> > synthesis of both positions?
> It would certainly be ++good build in the service discovery so that
> the user only needs to point prometheus at one place to discover
> everything.  Anything that avoids the need for extra external tools to
> set things up makes me happy.

Yes, I think that'd be great to have. And at least in my head the idea
of where information goes becomes clearer.

Notifications/events go to and through ceph-mgr. ceph-mgr keeps track of
Ceph services. Trending/metrics should IMNSHO be polled directly as
needed.


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: Monitoring ceph and prometheus
From: John Spray @ 2017-05-18  9:03 UTC
  To: Lars Marowsky-Bree; +Cc: Ceph Development

On Thu, May 18, 2017 at 9:37 AM, Lars Marowsky-Bree <lmb@suse.com> wrote:
> On 2017-05-15T13:33:29, John Spray <jspray@redhat.com> wrote:
>
>> At the risk of being a bit picky, it's only redundant if prometheus is
>> the only thing consuming them.  If the user is also using some mgr
>> modules (including things like handy CLI views) that consume the
>> stats, it's not redundant at all.  I'd like to keep these stats around
>> in the mgr because we're not quite sure yet what kinds of modules
>> we'll end up with.
>
> Fair enough. The point that they may wish to gather information at
> different frequencies still remains though - a ceph-mgr module may do it
> on-demand for certain tasks, event driven, or periodically, prometheus
> (or other trending) would want to poll certain counters at various
> frequencies, etc.

I'm slightly getting the impression that you might not have noticed
the existing functionality here -- the perf counters are sent
continuously from the daemons to the mgr, rather than being polled.
The mgr is in control of how often that is (via the MMgrConfigure
message).

>
> (e.g., maybe the OSD ones every 10s, SMART every 3h, whatever)

To be clear, when I talk about stats I'm talking about the perf
counters -- if SMART monitoring is added at some stage then I would
imagine sending that using a different mechanism.  As you say, sending
SMART counters at the same frequency as normal perf counters wouldn't
make sense.

>
> Aligning these would be annoying, and it seems to me that it makes more
> sense to allow them to poll independently from the same interfaces.
>
>> Sage's recent change to add the importance thresholds to perf counters
>> could be interesting here: we might end up sending everything that's
>> "reasonably important" and higher to the mgr for exposing in CLI tools
>> etc (I'm thinking of things like the OSD throughput, the MDS number of
>> each op per second, etc), while perhaps the really obscure stuff would
>> only get collected (into prometheus?) if someone actively chose that.
>
> That's actually somewhat related to how smart classifies. Value,
> threshold, type (old-age, pre-fail, we could add a "perf" one).
>
> I take the point - there's also a need for an event-driven channel that
> needs to be push by default. (From simple operation completion
> notification to "OMFG the disk caught fire.")

Again, that would be something separate from the existing perf counter
functionality.

> I could see those going to ceph-mgr for handling/relaying.

Yep.

>
>> > So, perhaps exposing this - the dynamic service/target discovery via
>> > ceph-mgr to Prometheus, and then having Prometheus pull directly - is a
>> > synthesis of both positions?
>> It would certainly be ++good build in the service discovery so that
>> the user only needs to point prometheus at one place to discover
>> everything.  Anything that avoids the need for extra external tools to
>> set things up makes me happy.
>
> Yes, I think that'd be great to have. And at least in my head the idea
> of where information goes becomes clearer.
>
> Notifications/events go to and through ceph-mgr. ceph-mgr keeps track of
> Ceph services. Trending/metrics should IMNSHO be polled directly as
> needed.

I'm not opposed to having a polling interface there if you want to add
it -- it could be useful for anyone who chooses to turn off the
existing stats transmission.  However, we should be mindful that it
will complicate the lives of plugin authors if they are uncertain
whether they're running on a polling-configured system (reading stats
is a network op) or a streaming-configured one (reading stats is super
fast).

John



* Re: Monitoring ceph and prometheus
From: Lars Marowsky-Bree @ 2017-05-19 11:00 UTC
  To: Ceph Development

On 2017-05-18T10:03:25, John Spray <jspray@redhat.com> wrote:

> > Fair enough. The point that they may wish to gather information at
> > different frequencies still remains though - a ceph-mgr module may do it
> > on-demand for certain tasks, event driven, or periodically, prometheus
> > (or other trending) would want to poll certain counters at various
> > frequencies, etc.
> I'm slightly getting the impression that you might not have noticed
> the existing functionality here -- the perf counters are sent
> continuously from the daemons to the mgr, rather than being polled.
> The mgr is in control of how often that is (via the MMgrConfigure
> message).

Oh, sorry, I should have been clearer. I'm aware of this; I just don't
like it - and the fact that the mgr needs to adjust the push interval
highlights why: if something like Prometheus then wanted to pull from
the mgr, those intervals would need to be aligned.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


