* KPIs for Ceph/OSD client latency / deepscrub latency overhead
@ 2018-07-11 15:50 Marc Schöchlin
       [not found] ` <5b5dcca8-9f82-e785-885a-20d74c6a81a7-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Marc Schöchlin @ 2018-07-11 15:50 UTC (permalink / raw)
  To: ceph-users, ceph-devel

Hello ceph-users and ceph-devel list,

we went into production with our shiny new Luminous (12.2.5) cluster.
This cluster runs SSD- and HDD-based OSD pools.

To ensure the service quality of the cluster and to have a baseline for
client latency optimization (e.g. in the area of deep-scrub optimization),
we would like to have statistics about the client interaction latency of
our cluster.

Which metrics are suitable for building such an "aggregated by
device_class" average latency KPI?
Percentile ranks would also be great (% of requests serviced in < 5 ms,
% of requests serviced in < 20 ms, % of requests serviced in < 50 ms, ...).

The following command provides an overview of the commit latency of the
OSDs, but no average latency and no information about the device_class.

ceph osd perf -f json-pretty

{
    "osd_perf_infos": [
        {
            "id": 71,
            "perf_stats": {
                "commit_latency_ms": 2,
                "apply_latency_ms": 0
            }
        },
        {
            "id": 70,
            "perf_stats": {
                "commit_latency_ms": 3,
                "apply_latency_ms": 0
            }

Device class information can be extracted from "ceph df -f json-pretty".

But building averages of averages does not seem to be a good idea .... :-)
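
Just to illustrate the kind of aggregation I have in mind, here is a rough
sketch (not an existing tool; it assumes that "ceph osd tree -f json"
reports a "device_class" field per OSD, which I would still have to verify):

#!/usr/bin/env python
# Sketch only: group the per-OSD latencies of "ceph osd perf" by CRUSH
# device class and print a simple average/median/max per class.
# Assumption: "ceph osd tree -f json" exposes a "device_class" per OSD.
import json
import subprocess
from collections import defaultdict

def ceph_json(*args):
    # run a ceph CLI command and parse its JSON output
    return json.loads(subprocess.check_output(("ceph",) + args + ("-f", "json")))

# osd id -> device class (e.g. "hdd"/"ssd"), taken from the CRUSH tree
osd_class = {n["id"]: n.get("device_class", "unknown")
             for n in ceph_json("osd", "tree")["nodes"] if n["type"] == "osd"}

# collect commit_latency_ms per device class from "ceph osd perf"
latencies = defaultdict(list)
for entry in ceph_json("osd", "perf")["osd_perf_infos"]:
    cls = osd_class.get(entry["id"], "unknown")
    latencies[cls].append(entry["perf_stats"]["commit_latency_ms"])

for cls, values in sorted(latencies.items()):
    values.sort()
    print("%-8s osds=%3d  avg=%5.1f ms  median=%3d ms  max=%3d ms"
          % (cls, len(values), sum(values) / float(len(values)),
             values[len(values) // 2], values[-1]))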

It seems that I can get more detailed information using the "ceph daemon
osd.<nr> perf histogram dump" command.
This seems to deliver the percentile rank information at a good level
of detail.
(http://docs.ceph.com/docs/luminous/dev/perf_histograms/)
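
Something along these lines is what I imagine for the percentile ranks
(only a sketch; the "osd" section and the "axes"/"ranges"/"values" field
names are my reading of the perf_histograms documentation above and may
differ between releases):

#!/usr/bin/env python
# Sketch: percentile ranks ("% of ops completed in < X ms") for one OSD,
# computed from "ceph daemon osd.<nr> perf histogram dump".
import json
import subprocess

def percentile_ranks(osd_id, hist="op_w_latency_in_bytes_histogram",
                     thresholds_ms=(5, 20, 50)):
    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "histogram", "dump"])
    h = json.loads(raw)["osd"][hist]

    # the histogram is 2-D (latency x request size); find the latency axis
    # by name and collapse the other axis into one count per latency bucket
    lat_idx = next(i for i, a in enumerate(h["axes"]) if "Latency" in a["name"])
    rows = h["values"] if lat_idx == 0 else zip(*h["values"])
    counts = [sum(r) for r in rows]
    ranges = h["axes"][lat_idx]["ranges"]
    total = float(sum(counts))

    for t_ms in thresholds_ms:
        # count ops in buckets whose upper bound (usec) lies below the threshold
        below = sum(c for rng, c in zip(ranges, counts)
                    if "max" in rng and 0 <= rng["max"] < t_ms * 1000)
        pct = 100.0 * below / total if total else 0.0
        print("osd.%d: %.1f%% of writes < %d ms" % (osd_id, pct, t_ms))

percentile_ranks(24)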

My questions:

Are there tools to analyze and aggregate these metrics for a group of OSDs?

Which metrics should I use as a baseline for client latency optimization?

What is the time horizon of these metrics?

I sometimes see messages like the following in my log.
They seem to be caused by deep scrubbing. How can I find the
cause of and solution to this problem?

2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update:
23 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update:
27 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update:
29 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update:
39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update:
44 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12
slow requests are blocked > 32 sec
2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update:
12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared:
REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy

Regards
Marc


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: KPIs for Ceph/OSD client latency / deepscrub latency overhead
       [not found] ` <5b5dcca8-9f82-e785-885a-20d74c6a81a7-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
@ 2018-07-11 16:42   ` Paul Emmerich
       [not found]     ` <CAD9yTbGZyqEYJTWgWTQ2SbDc6pfvRqv=eGRKw_ebNVbE4xMWxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Emmerich @ 2018-07-11 16:42 UTC (permalink / raw)
  To: Marc Schöchlin; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 4451 bytes --]

Hi,

from experience: commit/apply_latency are not good metrics; the only good
thing about them is that they are really easy to track.
We have found them to be almost completely useless in the real world.

We track the op_*_latency metrics from "perf dump" and have found them to be
very helpful; they are more annoying to track because of their
format. The median OSD is a good indicator, and so is the slowest OSD.
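
For illustration only, a minimal sketch of that kind of collection, run
locally on an OSD host (it assumes the default admin socket path and that
op_r_latency/op_w_latency expose "avgcount" and "sum"; adapt to your setup):

#!/usr/bin/env python
# Sketch: per-OSD read/write op latency from "perf dump" on the local
# admin sockets, plus the median and slowest OSD on this host.
# The counters are cumulative (sum of seconds / number of ops since the
# daemon started), so this yields a lifetime average, not a rate.
import glob
import json
import subprocess

def op_latency_ms(asok, counter):
    dump = json.loads(subprocess.check_output(
        ["ceph", "--admin-daemon", asok, "perf", "dump"]))
    c = dump["osd"][counter]                    # {"avgcount": ..., "sum": ...}
    return 1000.0 * c["sum"] / c["avgcount"] if c["avgcount"] else 0.0

results = []
for asok in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
    results.append((asok,
                    op_latency_ms(asok, "op_r_latency"),
                    op_latency_ms(asok, "op_w_latency")))

for asok, r_ms, w_ms in results:
    print("%-40s  read %7.2f ms  write %7.2f ms" % (asok, r_ms, w_ms))

if results:
    writes = sorted(w for _, _, w in results)
    print("median write latency %.2f ms, slowest OSD %.2f ms"
          % (writes[len(writes) // 2], writes[-1]))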

Paul

2018-07-11 17:50 GMT+02:00 Marc Schöchlin <ms-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>:

> Hello ceph-users and ceph-devel list,
>
> we went into production with our shiny new Luminous (12.2.5) cluster.
> This cluster runs SSD- and HDD-based OSD pools.
>
> To ensure the service quality of the cluster and to have a baseline for
> client latency optimization (e.g. in the area of deep-scrub optimization),
> we would like to have statistics about the client interaction latency of
> our cluster.
>
> Which metrics are suitable for building such an "aggregated by
> device_class" average latency KPI?
> Percentile ranks would also be great (% of requests serviced in < 5 ms,
> % of requests serviced in < 20 ms, % of requests serviced in < 50 ms, ...).
>
> The following command provides an overview of the commit latency of the
> OSDs, but no average latency and no information about the device_class.
>
> ceph osd perf -f json-pretty
>
> {
>     "osd_perf_infos": [
>         {
>             "id": 71,
>             "perf_stats": {
>                 "commit_latency_ms": 2,
>                 "apply_latency_ms": 0
>             }
>         },
>         {
>             "id": 70,
>             "perf_stats": {
>                 "commit_latency_ms": 3,
>                 "apply_latency_ms": 0
>             }
>
> Device class information can be extracted from "ceph df -f json-pretty".
>
> But building averages of averages does not seem to be a good idea .... :-)
>
> It seems that I can get more detailed information using the "ceph daemon
> osd.<nr> perf histogram dump" command.
> This seems to deliver the percentile rank information at a good level
> of detail.
> (http://docs.ceph.com/docs/luminous/dev/perf_histograms/)
>
> My questions:
>
> Are there tools to analyze and aggregate these metrics for a group of
> OSDs?
>
> Which metrics should I use as a baseline for client latency optimization?
>
> What is the time horizon of these metrics?
>
> I sometimes see messages like the following in my log.
> They seem to be caused by deep scrubbing. How can I find the
> cause of and solution to this problem?
>
> 2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
> 2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9
> slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update:
> 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update:
> 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update:
> 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update:
> 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update:
> 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12
> slow requests are blocked > 32 sec
> 2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update:
> 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared:
> REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
> 2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy
>
> Regards
> Marc
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

[-- Attachment #1.2: Type: text/html, Size: 5964 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: KPIs for Ceph/OSD client latency / deepscrub latency overhead
       [not found]     ` <CAD9yTbGZyqEYJTWgWTQ2SbDc6pfvRqv=eGRKw_ebNVbE4xMWxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-07-12  6:37       ` Marc Schöchlin
       [not found]         ` <d0a8d024-3d62-f3c0-31e1-046747d0d9f6-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Marc Schöchlin @ 2018-07-12  6:37 UTC (permalink / raw)
  To: Paul Emmerich; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 15543 bytes --]

Hello Paul,

thanks for your response/hints.

I discovered the following tool in the ceph source repository:
https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py

The tool produces output based on the statistics you mentioned:

# ceph daemon osd.24 perf histogram dump|grep -P "op_.*_latency"
        "op_r_latency_out_bytes_histogram": {
        "op_w_latency_in_bytes_histogram": {
        "op_rw_latency_in_bytes_histogram": {
        "op_rw_latency_out_bytes_histogram": {

cd /tmp
wget https://raw.githubusercontent.com/ceph/ceph/master/src/tools/histogram_dump.py
chmod +x histogram_dump.py


[Output of histogram_dump.py: a two-dimensional grid with request-size
buckets (0 .. 274G bytes) across the top and log2-scaled latency buckets
(usec) down the side. Nearly every cell was zero; the only non-zero
counts were a handful of ops (1 to 5 each) in the 1M-3M, 12M-25M,
25M-51M and 102M-204M usec latency rows.]

This script is probably a good basis for writing my own specialized
tool ....

It might be a good option for detailed analysis.
As a first step I would just like to have two simple KPIs which describe
an average/aggregated write/read latency based on these statistics.

Are there tools or other facilities which provide this in a simple way?

Regards
Marc


On 11.07.2018 at 18:42, Paul Emmerich wrote:
> Hi,
>
> from experience: commit/apply_latency are not good metrics; the only
> good thing about them is that they are really easy to track.
> We have found them to be almost completely useless in the real world.
>
> We track the op_*_latency metrics from "perf dump" and have found them
> to be very helpful; they are more annoying to track because of their
> format. The median OSD is a good indicator, and so is the slowest OSD.
>
> Paul
>
> 2018-07-11 17:50 GMT+02:00 Marc Schöchlin <ms@256bit.org>:
>
>     Hello ceph-users and ceph-devel list,
>
>     we went into production with our shiny new Luminous (12.2.5) cluster.
>     This cluster runs SSD- and HDD-based OSD pools.
>
>     To ensure the service quality of the cluster and to have a
>     baseline for client latency optimization (e.g. in the area of
>     deep-scrub optimization), we would like to have statistics about
>     the client interaction latency of our cluster.
>
>     Which metrics are suitable for building such an "aggregated by
>     device_class" average latency KPI?
>     Percentile ranks would also be great (% of requests serviced in
>     < 5 ms, % of requests serviced in < 20 ms, % of requests serviced
>     in < 50 ms, ...).
>
>     The following command provides an overview of the commit latency
>     of the OSDs, but no average latency and no information about the
>     device_class.
>
>     ceph osd perf -f json-pretty
>
>     {
>         "osd_perf_infos": [
>             {
>                 "id": 71,
>                 "perf_stats": {
>                     "commit_latency_ms": 2,
>                     "apply_latency_ms": 0
>                 }
>             },
>             {
>                 "id": 70,
>                 "perf_stats": {
>                     "commit_latency_ms": 3,
>                     "apply_latency_ms": 0
>                 }
>
>     Device class information can be extracted from "ceph df -f json-pretty".
>
>     But building averages of averages does not seem to be a good idea .... :-)
>
>     It seems that I can get more detailed information using the "ceph
>     daemon osd.<nr> perf histogram dump" command.
>     This seems to deliver the percentile rank information at a good
>     level of detail.
>     (http://docs.ceph.com/docs/luminous/dev/perf_histograms/)
>
>     My questions:
>
>     Are there tools to analyze and aggregate these metrics for a
>     group of OSDs?
>
>     Which metrics should I use as a baseline for client latency
>     optimization?
>
>     What is the time horizon of these metrics?
>
>     I sometimes see messages like the following in my log.
>     They seem to be caused by deep scrubbing. How can I find the
>     cause of and solution to this problem?
>
>     2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now
>     healthy
>     2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check
>     failed: 4
>     slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check
>     update: 9
>     slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update:
>     23 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update:
>     27 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update:
>     29 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update:
>     39 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update:
>     44 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall
>     HEALTH_WARN 12
>     slow requests are blocked > 32 sec
>     2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update:
>     12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>     2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check
>     cleared:
>     REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>     2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now
>     healthy
>
>     Regards
>     Marc
>
>
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@lists.ceph.com
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> -- 
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90


[-- Attachment #1.2: Type: text/html, Size: 21237 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: KPIs for Ceph/OSD client latency / deepscrub latency overhead
       [not found]         ` <d0a8d024-3d62-f3c0-31e1-046747d0d9f6-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
@ 2018-07-12 12:51           ` Paul Emmerich
  0 siblings, 0 replies; 4+ messages in thread
From: Paul Emmerich @ 2018-07-12 12:51 UTC (permalink / raw)
  To: Marc Schöchlin; +Cc: ceph-users, ceph-devel


[-- Attachment #1.1: Type: text/plain, Size: 5943 bytes --]

2018-07-12 8:37 GMT+02:00 Marc Schöchlin <ms-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>:

>
> As a first step I would just like to have two simple KPIs which describe
> an average/aggregated write/read latency based on these statistics.
>
> Are there tools or other facilities which provide this in a simple way?
>
It's one of the main KPIs our management software collects and visualizes:
https://croit.io

IIRC some of the other stats collectors already gather these metrics as
well; at least I recall using them with Telegraf/InfluxDB.
But it's also really easy to collect yourself (I once wrote it in bash
for a client's somewhat unusual collector); the only hurdle is that you
need to calculate the derivative, because the counter is a running average.
I have some slides from our training about these metrics:
https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
(There's not much in there; it's more of a hands-on lab.)
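
As a sketch, the derivative calculation could look like this (two samples
of the cumulative op_w_latency counter from "perf dump", then divide the
deltas; counter layout as on our Luminous clusters):

#!/usr/bin/env python
# Sketch: average write latency over the last N seconds for one OSD,
# computed as delta(sum of latencies) / delta(number of ops) between
# two samples of the cumulative op_w_latency counter.
import json
import subprocess
import time

def sample(osd_id, counter="op_w_latency"):
    dump = json.loads(subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"]))
    c = dump["osd"][counter]
    return c["sum"], c["avgcount"]          # seconds, number of ops

def interval_latency_ms(osd_id, interval=10):
    s1, n1 = sample(osd_id)
    time.sleep(interval)
    s2, n2 = sample(osd_id)
    if n2 == n1:
        return None                         # no write ops in this window
    return 1000.0 * (s2 - s1) / (n2 - n1)   # average ms per op in the window

print(interval_latency_ms(24))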



Paul



> Regards
> Marc
>
> On 11.07.2018 at 18:42, Paul Emmerich wrote:
>
> Hi,
>
> from experience: commit/apply_latency are not good metrics; the only good
> thing about them is that they are really easy to track.
> We have found them to be almost completely useless in the real world.
>
> We track the op_*_latency metrics from "perf dump" and have found them to
> be very helpful; they are more annoying to track because of their
> format. The median OSD is a good indicator, and so is the slowest OSD.
>
> Paul
>
> 2018-07-11 17:50 GMT+02:00 Marc Schöchlin <ms-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>:
>
>> Hello ceph-users and ceph-devel list,
>>
>> we went into production with our shiny new Luminous (12.2.5) cluster.
>> This cluster runs SSD- and HDD-based OSD pools.
>>
>> To ensure the service quality of the cluster and to have a baseline for
>> client latency optimization (e.g. in the area of deep-scrub optimization),
>> we would like to have statistics about the client interaction latency of
>> our cluster.
>>
>> Which metrics are suitable for building such an "aggregated by
>> device_class" average latency KPI?
>> Percentile ranks would also be great (% of requests serviced in < 5 ms,
>> % of requests serviced in < 20 ms, % of requests serviced in < 50 ms, ...).
>>
>> The following command provides an overview of the commit latency of the
>> OSDs, but no average latency and no information about the device_class.
>>
>> ceph osd perf -f json-pretty
>>
>> {
>>     "osd_perf_infos": [
>>         {
>>             "id": 71,
>>             "perf_stats": {
>>                 "commit_latency_ms": 2,
>>                 "apply_latency_ms": 0
>>             }
>>         },
>>         {
>>             "id": 70,
>>             "perf_stats": {
>>                 "commit_latency_ms": 3,
>>                 "apply_latency_ms": 0
>>             }
>>
>> Device class information can be extracted from "ceph df -f json-pretty".
>>
>> But building averages of averages does not seem to be a good idea .... :-)
>>
>> It seems that I can get more detailed information using the "ceph daemon
>> osd.<nr> perf histogram dump" command.
>> This seems to deliver the percentile rank information at a good level
>> of detail.
>> (http://docs.ceph.com/docs/luminous/dev/perf_histograms/)
>>
>> My questions:
>>
>> Are there tools to analyze and aggregate these metrics for a group of
>> OSDs?
>>
>> Which metrics should I use as a baseline for client latency optimization?
>>
>> What is the time horizon of these metrics?
>>
>> I sometimes see messages like the following in my log.
>> They seem to be caused by deep scrubbing. How can I find the
>> cause of and solution to this problem?
>>
>> 2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
>> 2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9
>> slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update:
>> 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update:
>> 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update:
>> 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update:
>> 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update:
>> 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12
>> slow requests are blocked > 32 sec
>> 2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update:
>> 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared:
>> REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>> 2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy
>>
>> Regards
>> Marc
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

[-- Attachment #1.2: Type: text/html, Size: 10874 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-07-12 12:51 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 15:50 KPIs for Ceph/OSD client latency / deepscrub latency overhead Marc Schöchlin
     [not found] ` <5b5dcca8-9f82-e785-885a-20d74c6a81a7-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
2018-07-11 16:42   ` Paul Emmerich
     [not found]     ` <CAD9yTbGZyqEYJTWgWTQ2SbDc6pfvRqv=eGRKw_ebNVbE4xMWxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-07-12  6:37       ` Marc Schöchlin
     [not found]         ` <d0a8d024-3d62-f3c0-31e1-046747d0d9f6-aJA5TdoZkU0dnm+yROfE0A@public.gmane.org>
2018-07-12 12:51           ` Paul Emmerich
