On Mon, 28 Nov 2016, Bartłomiej Święcki wrote:
> Hi,
>
> Currently we can query an OSD for op latency, but it's given as an average.
> An average may not give the best information in this case, e.g. spikes can
> easily get hidden there.
>
> Instead of an average we could easily do a simple histogram: quantize
> the latency into a predefined set of time intervals, keep a simple
> performance counter for each of them, and increment one of them at each op.
> Since those are per OSD, we could have pretty high resolution with only a
> small memory cost. The performance impact should be negligible, since only
> one (two if split into read and write) of those counters would be
> incremented per OSD op.
>
> In addition we could also do this in 2D, with each counter matching a given
> latency range and op size range. Having such a 2D table would show the
> latency histogram, the request size histogram, and combinations of those
> (e.g. a latency histogram of ~4k ops only).
>
> What do you think about this idea? I can prepare some code - a simple proof
> of concept looks really straightforward to implement.

This sounds like a great idea. I think the main issue is that the data
won't be easily exposed via the perfcounter interface... at least not in a
way that generic tools can visualize. Unless there is a standardish way
to report histogram metrics?

sage
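
For reference, here is a rough sketch of the 2D quantization described in the quoted proposal, written in Python only for illustration. It is not Ceph code: the class name OpHistogram2D, its methods, and the bucket boundaries are all made up for this example, and a real implementation would live in the OSD and use its own counter types.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries, chosen only for this example.
# N boundaries give N + 1 buckets; the last bucket is open-ended.
LATENCY_EDGES_US = [100, 1_000, 10_000, 100_000]
SIZE_EDGES_BYTES = [4096, 65536, 1048576]


class OpHistogram2D:
    """One counter per (latency range, op size range) pair."""

    def __init__(self, latency_edges, size_edges):
        self.latency_edges = list(latency_edges)
        self.size_edges = list(size_edges)
        rows = len(self.latency_edges) + 1
        cols = len(self.size_edges) + 1
        self.counts = [[0] * cols for _ in range(rows)]

    def record(self, latency_us, size_bytes):
        # Quantize both values to a bucket index, then bump exactly
        # one counter for this op.
        row = bisect_right(self.latency_edges, latency_us)
        col = bisect_right(self.size_edges, size_bytes)
        self.counts[row][col] += 1

    def latency_histogram(self):
        # Sum over all size buckets: plain latency histogram.
        return [sum(row) for row in self.counts]

    def size_histogram(self):
        # Sum over all latency buckets: plain request size histogram.
        return [sum(col) for col in zip(*self.counts)]

    def latency_for_size_bucket(self, col):
        # Latency histogram restricted to one size bucket.
        return [row[col] for row in self.counts]


hist = OpHistogram2D(LATENCY_EDGES_US, SIZE_EDGES_BYTES)
hist.record(latency_us=250, size_bytes=4096)
hist.record(latency_us=50, size_bytes=4096)
hist.record(latency_us=250_000, size_bytes=4 * 1024 * 1024)
print(hist.latency_histogram())
print(hist.latency_for_size_bucket(1))  # ops of 4096 up to 65535 bytes
```

Each op touches a single counter, so the per-op cost is one bucket lookup plus one increment. Splitting into read and write would simply mean keeping two such tables.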