All of lore.kernel.org
* rbd top
@ 2015-06-11 19:33 Robert LeBlanc
  2015-06-15 11:52 ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Robert LeBlanc @ 2015-06-11 19:33 UTC (permalink / raw)
  To: ceph-devel

One feature we would like is an "rbd top" command that would be like
top, but show usage of RBD volumes so that we can quickly identify
high demand RBDs.

Since I haven't done any programming for Ceph, I'm trying to think
through the best way to approach this. I don't know if there are
already perf counters that I can query that are at the client, RBD or
the RADOS layers. If these counters don't exist, would it be best to
implement them at the client layer and look for watchers on the RBD
and query them? Is it better to handle it at the RADOS layer and
aggregate the I/O from all chunks? Of course this would need to scale
out very large.

It seems that if the client running rbd top requests the top 'X'
number of objects from each OSD, then it would cut down on the data
that has to be moved around and processed. It wouldn't be an
extremely accurate view, but might be enough.

What are your thoughts?

Also, what is the best way to get into the Ceph code? I've looked at
several things and I find myself doing a lot of searching to find
connecting pieces. My primary focus is not programming so picking up a
new code base takes me a long time because I don't know many of the
tricks that help people get up to speed quickly.

Thanks,
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rbd top
  2015-06-11 19:33 rbd top Robert LeBlanc
@ 2015-06-15 11:52 ` Gregory Farnum
  2015-06-15 13:52   ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2015-06-15 11:52 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel

On Thu, Jun 11, 2015 at 12:33 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> One feature we would like is an "rbd top" command that would be like
> top, but show usage of RBD volumes so that we can quickly identify
> high demand RBDs.
>
> Since I haven't done any programming for Ceph, I'm trying to think
> through the best way to approach this. I don't know if there are
> already perf counters that I can query that are at the client, RBD or
> the RADOS layers. If these counters don't exist, would it be best to
> implement them at the client layer and look for watchers on the RBD
> and query them? Is it better to handle it at the RADOS layer and
> aggregate the I/O from all chunks? Of course this would need to scale
> out very large.
>
> It seems that if the client running rbd top requests the top 'X'
> number of objects from each OSD, then it would cut down on the data
> that has to be moved around and processed. It wouldn't be an
> extremely accurate view, but might be enough.
>
> What are your thoughts?
>
> Also, what is the best way to get into the Ceph code? I've looked at
> several things and I find myself doing a lot of searching to find
> connecting pieces. My primary focus is not programming so picking up a
> new code base takes me a long time because I don't know many of the
> tricks that help people get up to speed quickly.

The basic problem with a tool like this is that it requires gathering
real-time data from either all the OSDs, or all the clients. We do
something similar in order to display approximate IO going through the
system as a whole, but that is based on PGStat messages which come in
periodically and is both laggy and an approximation.

To do this, we'd need to get less-laggy data, and instead of scaling
with the number of OSDs/PGs it would scale with the number of RBD
volumes. You certainly couldn't send that through the monitor and I
shudder to think about the extra load it would invoke at all layers.

How up-to-date do you need the info to be, and how accurate? Does it
need to be queryable in the future or only online? You could perhaps
hook into one of the more precise HitSet implementations we
have...otherwise I think you'd need to add an online querying
framework, perhaps through the perfcounters (which...might scale to
something like this?) or a monitoring service (hopefully attached to
Calamari) that receives continuous updates.
-Greg


* Re: rbd top
  2015-06-15 11:52 ` Gregory Farnum
@ 2015-06-15 13:52   ` Sage Weil
  2015-06-15 15:03     ` John Spray
  2015-06-15 16:28     ` Robert LeBlanc
  0 siblings, 2 replies; 11+ messages in thread
From: Sage Weil @ 2015-06-15 13:52 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Robert LeBlanc, ceph-devel

On Mon, 15 Jun 2015, Gregory Farnum wrote:
> On Thu, Jun 11, 2015 at 12:33 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> > One feature we would like is an "rbd top" command that would be like
> > top, but show usage of RBD volumes so that we can quickly identify
> > high demand RBDs.
> >
> > Since I haven't done any programming for Ceph, I'm trying to think
> > through the best way to approach this. I don't know if there are
> > already perf counters that I can query that are at the client, RBD or
> > the RADOS layers. If these counters don't exist, would it be best to
> > implement them at the client layer and look for watchers on the RBD
> > and query them? Is it better to handle it at the RADOS layer and
> > aggregate the I/O from all chunks? Of course this would need to scale
> > out very large.
> >
> > It seems that if the client running rbd top requests the top 'X'
> > number of objects from each OSD, then it would cut down on the data
> > that has to be moved around and processed. It wouldn't be an
> > extremely accurate view, but might be enough.
> >
> > What are your thoughts?
> >
> > Also, what is the best way to get into the Ceph code? I've looked at
> > several things and I find myself doing a lot of searching to find
> > connecting pieces. My primary focus is not programming so picking up a
> > new code base takes me a long time because I don't know many of the
> > tricks that help people get up to speed quickly.
> 
> The basic problem with a tool like this is that it requires gathering
> real-time data from either all the OSDs, or all the clients. We do
> something similar in order to display approximate IO going through the
> system as a whole, but that is based on PGStat messages which come in
> periodically and is both laggy and an approximation.
> 
> To do this, we'd need to get less-laggy data, and instead of scaling
> with the number of OSDs/PGs it would scale with the number of RBD
> volumes. You certainly couldn't send that through the monitor and I
> shudder to think about the extra load it would invoke at all layers.
> 
> How up-to-date do you need the info to be, and how accurate? Does it
> need to be queryable in the future or only online? You could perhaps
> hook into one of the more precise HitSet implementations we
> have...otherwise I think you'd need to add an online querying
> framework, perhaps through the perfcounters (which...might scale to
> something like this?) or a monitoring service (hopefully attached to
> Calamari) that receives continuous updates.

I seem to remember having a short conversation about something like this a 
few CDS's back... although I think it was 'rados top'.  IIRC the basic 
idea we had was for each OSD to track its top clients (using some 
approximate LRU type algorithm) and then either feed this relatively small 
amount of info (say, top 10-100 clients) back to the mon for summation, 
or dump via the admin socket for calamari to aggregate.

This doesn't give you the rbd image name, but I bet we could infer that 
without too much trouble (e.g., include a recent object or two with the 
client).  Or, just assume that client id is enough (it'll include an IP 
and PID... enough info to find the /var/run/ceph admin socket or the VM 
process.)

If we were going to do top clients, I think it'd make sense to also have a 
top objects list as well, so you can see what the hottest objects in the 
cluster are.
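For illustration, the per-OSD "top clients" tracking described above could use a bounded heavy-hitters counter such as the Space-Saving algorithm, which keeps memory fixed at the top-N size by letting an evicted entry pass its count on to the newcomer. A rough Python sketch (not Ceph code; the class and method names here are made up):

```python
class TopKTracker:
    """Approximate top-K tracking (Space-Saving style).

    At most k counters are kept. When a new key arrives and the table
    is full, the smallest counter is evicted and its count is inherited
    by the newcomer, which bounds the overestimation error while
    keeping memory constant.
    """

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def record(self, key, weight=1):
        # Called once per op, e.g. key = client id, weight = bytes moved.
        if key in self.counts:
            self.counts[key] += weight
        elif len(self.counts) < self.k:
            self.counts[key] = weight
        else:
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[key] = inherited + weight

    def top(self, n=10):
        # The small list an OSD would report to the mon or admin socket.
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]
```

Each OSD would keep one such tracker and periodically report `top(10)` upward for summation; genuinely heavy clients survive the eviction pressure while one-off clients churn through the bottom slots.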

sage


* Re: rbd top
  2015-06-15 13:52   ` Sage Weil
@ 2015-06-15 15:03     ` John Spray
  2015-06-15 16:10       ` Robert LeBlanc
  2015-06-16 10:04       ` Gregory Farnum
  2015-06-15 16:28     ` Robert LeBlanc
  1 sibling, 2 replies; 11+ messages in thread
From: John Spray @ 2015-06-15 15:03 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Robert LeBlanc, ceph-devel



On 15/06/2015 14:52, Sage Weil wrote:
>
> I seem to remember having a short conversation about something like this a
> few CDS's back... although I think it was 'rados top'.  IIRC the basic
> idea we had was for each OSD to track its top clients (using some
> approximate LRU type algorithm) and then either feed this relatively small
> amount of info (say, top 10-100 clients) back to the mon for summation,
> or dump via the admin socket for calamari to aggregate.
>
> This doesn't give you the rbd image name, but I bet we could infer that
> without too much trouble (e.g., include a recent object or two with the
> client).  Or, just assume that client id is enough (it'll include an IP
> and PID... enough info to find the /var/run/ceph admin socket or the VM
> process.)
>
> If we were going to do top clients, I think it'd make sense to also have a
> top objects list as well, so you can see what the hottest objects in the
> cluster are.

The following is a bit of a tangent...

A few weeks ago I was thinking about general solutions to this problem 
(for the filesystem).  I played with (very briefly on wip-live-query) 
the idea of publishing a list of queries to the MDSs/OSDs, that would 
allow runtime configuration of what kind of thing we're interested in 
and how we want it broken down.

If we think of it as an SQL-like syntax, then for the RBD case we would 
have something like:
   SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image

(You'd need a protocol-specific module of some kind to define what 
"rbd_image" meant here, which would do a simple mapping from object 
attributes to an identifier (similar would exist for e.g. cephfs inode))

Each time an OSD does an operation, it consults the list of active 
"performance queries" and updates counters according to the value of the 
GROUP BY parameter for the query (so in the above example each OSD would 
be keeping a result row for each rbd image touched).

The LRU part could be implemented as LIMIT BY + SORT parameters, such 
that the result rows would be periodically sorted and the least-touched 
results would drop off the list.  That would probably be used in 
conjunction with a decay operator on the sorted-by field, like:
   SELECT read_bytes, write_bytes,ops WHERE pool=rbd GROUP BY rbd_image 
SORT BY movingAverage(derivative(ops)) LIMIT 100
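To make the idea concrete, one such loaded query could be represented roughly like this (an illustrative Python sketch, not the wip-live-query code; all names are hypothetical):

```python
from collections import defaultdict


class PerfQuery:
    """Sketch of one loaded "performance query": a WHERE predicate,
    a GROUP BY key function, and a LIMIT bounding the result rows."""

    def __init__(self, metrics, where, group_by, limit=100):
        self.metrics = metrics      # e.g. ["read_bytes", "write_bytes"]
        self.where = where          # predicate over an op dict
        self.group_by = group_by    # op dict -> group key (e.g. rbd image)
        self.limit = limit
        self.rows = defaultdict(lambda: defaultdict(int))

    def update(self, op):
        # Called once per op; only matching ops touch a result row.
        if not self.where(op):
            return
        row = self.rows[self.group_by(op)]
        for m in self.metrics:
            row[m] += op.get(m, 0)

    def results(self):
        # Sort and truncate: a crude stand-in for SORT BY ... LIMIT.
        ranked = sorted(self.rows.items(),
                        key=lambda kv: -sum(kv[1].values()))
        return ranked[:self.limit]
```

The first SELECT above would then map to something like `PerfQuery(["read_bytes", "write_bytes"], lambda op: op["pool"] == "rbd", lambda op: op["image"])`.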

Combining WHERE clauses would let the user "drill down" (apologies for 
buzzword) by doing things like identifying the most busy clients, and 
then for each of those clients identify which images/files/objects the 
client is most active on, or vice versa identify busy objects and then 
see which clients are hitting them. Usually keeping around enough stats 
to enable this is prohibitive at scale, but it's fine when you're 
actively creating custom queries for the results you're really 
interested in, instead of keeping N_clients*N_objects stats, and when 
you have the LIMIT part to ensure results never get oversized.

The GROUP BY options would also include metadata sent from clients, e.g. 
the obvious cases like VM instance names, or rack IDs, or HPC job IDs.  
Maybe also some less obvious ones like decorating cephfs IOs with the 
inode of the directory containing the file, so that OSDs could 
accumulate per-directory bandwidth numbers, and users could ask "which 
directory is bandwidth-hottest?" as well as "which file is 
bandwidth-hottest?".

Then, after implementing all that craziness, you get some kind of wild 
multicolored GUI that shows you where the action is in your system at a 
cephfs/rgw/rbd level.

Cheers,
John


* Re: rbd top
  2015-06-15 15:03     ` John Spray
@ 2015-06-15 16:10       ` Robert LeBlanc
  2015-06-15 16:52         ` John Spray
  2015-06-16 10:04       ` Gregory Farnum
  1 sibling, 1 reply; 11+ messages in thread
From: Robert LeBlanc @ 2015-06-15 16:10 UTC (permalink / raw)
  To: John Spray; +Cc: Sage Weil, Gregory Farnum, ceph-devel

John, let me see if I understand what you are saying...

When a person runs `rbd top`, each OSD would receive a message saying
please capture all the performance, grouped by RBD and limit it to
'X'. That way the OSD doesn't have to constantly update performance
for each object, but when it is requested it starts tracking it?

If so, that is an interesting idea. I wonder if that would be simpler
than tracking the performance of each/MRU objects in some format like
/proc/diskstats where it is in memory and not necessarily consistent.
The benefit is that you could have "lifelong" stats that show up like
iostat and it would be a simple operation. Each object should be able
to reference back to RBD/CephFS upon request and the client could even
be responsible for that load. Client performance data would need stats
in addition to the object stats.

My concern is that adding additional SQL-like logic to each op is
going to get very expensive. I guess if we could push that to another
thread early in the op, then it might not be too bad. I'm enjoying the
discussion and new ideas.

Thanks,
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jun 15, 2015 at 9:03 AM, John Spray <john.spray@redhat.com> wrote:
>
>
> On 15/06/2015 14:52, Sage Weil wrote:
>>
>>
>> I seem to remember having a short conversation about something like this a
>> few CDS's back... although I think it was 'rados top'.  IIRC the basic
>> idea we had was for each OSD to track its top clients (using some
>> approximate LRU type algorithm) and then either feed this relatively small
>> amount of info (say, top 10-100 clients) back to the mon for summation,
>> or dump via the admin socket for calamari to aggregate.
>>
>> This doesn't give you the rbd image name, but I bet we could infer that
>> without too much trouble (e.g., include a recent object or two with the
>> client).  Or, just assume that client id is enough (it'll include an IP
>> and PID... enough info to find the /var/run/ceph admin socket or the VM
>> process.)
>>
>> If we were going to do top clients, I think it'd make sense to also have a
>> top objects list as well, so you can see what the hottest objects in the
>> cluster are.
>
>
> The following is a bit of a tangent...
>
> A few weeks ago I was thinking about general solutions to this problem (for
> the filesystem).  I played with (very briefly on wip-live-query) the idea of
> publishing a list of queries to the MDSs/OSDs, that would allow runtime
> configuration of what kind of thing we're interested in and how we want it
> broken down.
>
> If we think of it as an SQL-like syntax, then for the RBD case we would have
> something like:
>   SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image
>
> (You'd need a protocol-specific module of some kind to define what
> "rbd_image" meant here, which would do a simple mapping from object
> attributes to an identifier (similar would exist for e.g. cephfs inode))
>
> Each time an OSD does an operation, it consults the list of active
> "performance queries" and updates counters according to the value of the
> GROUP BY parameter for the query (so in the above example each OSD would
> be keeping a result row for each rbd image touched).
>
> The LRU part could be implemented as LIMIT BY + SORT parameters, such that
> the result rows would be periodically sorted and the least-touched results
> would drop off the list.  That would probably be used in conjunction with a
> decay operator on the sorted-by field, like:
>   SELECT read_bytes, write_bytes,ops WHERE pool=rbd GROUP BY rbd_image SORT
> BY movingAverage(derivative(ops)) LIMIT 100
>
> Combining WHERE clauses would let the user "drill down" (apologies for
> buzzword) by doing things like identifying the most busy clients, and then
> for each of those clients identify which images/files/objects the client is
> most active on, or vice versa identify busy objects and then see which
> clients are hitting them. Usually keeping around enough stats to enable this
> is prohibitive at scale, but it's fine when you're actively creating custom
> queries for the results you're really interested in, instead of keeping
> N_clients*N_objects stats, and when you have the LIMIT part to ensure
> results never get oversized.
>
> The GROUP BY options would also include metadata sent from clients, e.g. the
> obvious cases like VM instance names, or rack IDs, or HPC job IDs.  Maybe
> also some less obvious ones like decorating cephfs IOs with the inode of the
> directory containing the file, so that OSDs could accumulate per-directory
> bandwidth numbers, and users could ask "which directory is
> bandwidth-hottest?" as well as "which file is bandwidth-hottest?".
>
> Then, after implementing all that craziness, you get some kind of wild
> multicolored GUI that shows you where the action is in your system at a
> cephfs/rgw/rbd level.
>
> Cheers,
> John


* Re: rbd top
  2015-06-15 13:52   ` Sage Weil
  2015-06-15 15:03     ` John Spray
@ 2015-06-15 16:28     ` Robert LeBlanc
  1 sibling, 0 replies; 11+ messages in thread
From: Robert LeBlanc @ 2015-06-15 16:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On Mon, Jun 15, 2015 at 7:52 AM, Sage Weil  wrote:
> On Mon, 15 Jun 2015, Gregory Farnum wrote:
>> On Thu, Jun 11, 2015 at 12:33 PM, Robert LeBlanc  wrote:
>> > One feature we would like is an "rbd top" command that would be like
>> > top, but show usage of RBD volumes so that we can quickly identify
>> > high demand RBDs.
>> >
>> > Since I haven't done any programming for Ceph, I'm trying to think
>> > through the best way to approach this. I don't know if there are
>> > already perf counters that I can query that are at the client, RBD or
>> > the RADOS layers. If these counters don't exist, would it be best to
>> > implement them at the client layer and look for watchers on the RBD
>> > and query them? Is it better to handle it at the RADOS layer and
>> > aggregate the I/O from all chunks? Of course this would need to scale
>> > out very large.
>> >
>> > It seems that if the client running rbd top requests the top 'X'
>> > number of objects from each OSD, then it would cut down on the data
>> > that has to be moved around and processed. It wouldn't be an
>> > extremely accurate view, but might be enough.
>> >
>> > What are your thoughts?
>> >
>> > Also, what is the best way to get into the Ceph code? I've looked at
>> > several things and I find myself doing a lot of searching to find
>> > connecting pieces. My primary focus is not programming so picking up a
>> > new code base takes me a long time because I don't know many of the
>> > tricks that help people get up to speed quickly.
>>
>> The basic problem with a tool like this is that it requires gathering
>> real-time data from either all the OSDs, or all the clients. We do
>> something similar in order to display approximate IO going through the
>> system as a whole, but that is based on PGStat messages which come in
>> periodically and is both laggy and an approximation.
>>
>> To do this, we'd need to get less-laggy data, and instead of scaling
>> with the number of OSDs/PGs it would scale with the number of RBD
>> volumes. You certainly couldn't send that through the monitor and I
>> shudder to think about the extra load it would invoke at all layers.
>>
>> How up-to-date do you need the info to be, and how accurate? Does it
>> need to be queryable in the future or only online? You could perhaps
>> hook into one of the more precise HitSet implementations we
>> have...otherwise I think you'd need to add an online querying
>> framework, perhaps through the perfcounters (which...might scale to
>> something like this?) or a monitoring service (hopefully attached to
>> Calamari) that receives continuous updates.
>
> I seem to remember having a short conversation about something like this a
> few CDS's back... although I think it was 'rados top'.  IIRC the basic
> idea we had was for each OSD to track its top clients (using some
> approximate LRU type algorithm) and then either feed this relatively small
> amount of info (say, top 10-100 clients) back to the mon for summation,
> or dump via the admin socket for calamari to aggregate.

This was mostly the idea I had in mind. Would it be better to track
objects or clients? I can think of reasons for either (objects
would give an idea of the stress on the OSDs; clients could give an
idea of clients misbehaving on a shared object/RBD). I was thinking of
something like the admin socket, but something the client could query
to keep the load off the monitor. However, with multiple clients,
having the monitor aggregate the data would reduce the load
cluster-wide. I guess the big question is what the impact would be.

> This doesn't give you the rbd image name, but I bet we could infer that
> without too much trouble (e.g., include a recent object or two with the
> client).  Or, just assume that client id is enough (it'll include an IP
> and PID... enough info to find the /var/run/ceph admin socket or the VM
> process.)

I thought it was easy to back-reference the RBD from the object
because the object has a prefix from the RBD. Am I oversimplifying
here or missing something?
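RBD data objects do carry an image-specific prefix: format-2 images use names like rbd_data.<image-id>.<object-number>, though the id still has to be resolved to an image name via the pool's rbd_directory metadata. A best-effort sketch of pulling the id back out of an object name (illustrative Python; the format-1 pattern in particular is only an approximation):

```python
import re


def rbd_image_id(object_name):
    """Best-effort mapping from a RADOS object name back to an RBD image.

    Format-2 data objects are named "rbd_data.<image-id>.<object-no>";
    the returned id would still need a lookup in the pool's
    rbd_directory to recover the human-readable image name, which this
    sketch skips. Format-1 names ("rb.0.<num>.<num>.<object-no>") are
    matched only approximately here.
    """
    m = re.match(r"rbd_data\.([0-9a-f]+)\.[0-9a-f]+$", object_name)
    if m:
        return m.group(1)
    m = re.match(r"(rb\.0\.[0-9a-f]+\.[0-9a-f]+)\.[0-9a-f]+$", object_name)
    if m:
        return m.group(1)
    return None  # not an RBD data object, or an unrecognized layout
```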

> If we were going to do top clients, I think it'd make sense to also have a
> top objects list as well, so you can see what the hottest objects in the
> cluster are.

This makes a lot of sense, it wouldn't be much extra work.

As to Greg's question, I think providing real-time data would be too
expensive. What kind of delay do you think would be a good trade-off
between latency and load? Of course, the closer to real time the
better.

----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


* Re: rbd top
  2015-06-15 16:10       ` Robert LeBlanc
@ 2015-06-15 16:52         ` John Spray
  2015-06-16 11:05           ` Wido den Hollander
  0 siblings, 1 reply; 11+ messages in thread
From: John Spray @ 2015-06-15 16:52 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, Gregory Farnum, ceph-devel



On 15/06/2015 17:10, Robert LeBlanc wrote:
> John, let me see if I understand what you are saying...
>
> When a person runs `rbd top`, each OSD would receive a message saying
> please capture all the performance, grouped by RBD and limit it to
> 'X'. That way the OSD doesn't have to constantly update performance
> for each object, but when it is requested it starts tracking it?

Right, initially the OSD isn't collecting anything, it starts as soon as 
it sees a query get loaded up (published via OSDMap or some other 
mechanism).

That said, in practice I can see people having some set of queries that 
they always have loaded and feeding into graphite in the background.
>
> If so, that is an interesting idea. I wonder if that would be simpler
> than tracking the performance of each/MRU objects in some format like
> /proc/diskstats where it is in memory and not necessarily consistent.
> The benefit is that you could have "lifelong" stats that show up like
> iostat and it would be a simple operation.

Hmm, not sure we're on the same page about this part, what I'm talking 
about is all in memory and would be lost across daemon restarts.  Some 
other component would be responsible for gathering the stats across all 
the daemons in one place (that central part could persist stats if desired).

> Each object should be able
> to reference back to RBD/CephFS upon request and the client could even
> be responsible for that load. Client performance data would need stats
> in addition to the object stats.

You could extend the mechanism to clients.  However, as much as possible 
it's a good thing to keep it server side, as servers are generally fewer 
(still have to reduce these stats across N servers to present to user), 
and we have multiple client implementations (kernel/userspace).  What 
kind of thing do you want to get from clients?
> My concern is that adding additional SQL-like logic to each op is
> going to get very expensive. I guess if we could push that to another
> thread early in the op, then it might not be too bad. I'm enjoying the
> discussion and new ideas.

Hopefully in most cases the query can be applied very cheaply, for 
operations like comparing pool ID or grouping by client ID. However, I 
would also envisage an optional sampling number, such that e.g. only 1 
in every 100 ops would go through the query processing. That would be 
useful for systems where keeping the highest throughput is paramount, 
and the numbers will still be useful if clients are doing many 
thousands of ops per second.
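That sampling wrapper could look something like this: a minimal sketch, assuming a uniform 1-in-rate sample with recorded values scaled back up (illustrative Python; the names are made up):

```python
import random


class SampledCounter:
    """Pay the query-processing cost on only ~1 in `rate` ops.

    Each sampled value is scaled by the rate, so the per-key totals stay
    an unbiased estimate of the true totals; with thousands of ops per
    second the estimate converges quickly.
    """

    def __init__(self, rate=100, rng=None):
        self.rate = rate
        self.totals = {}
        self._rng = rng or random.Random()

    def record(self, key, value):
        # Fast path: most ops skip the expensive query machinery entirely.
        if self._rng.randrange(self.rate) == 0:
            self.totals[key] = self.totals.get(key, 0) + value * self.rate
```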

Cheers,
John


* Re: rbd top
  2015-06-15 15:03     ` John Spray
  2015-06-15 16:10       ` Robert LeBlanc
@ 2015-06-16 10:04       ` Gregory Farnum
  1 sibling, 0 replies; 11+ messages in thread
From: Gregory Farnum @ 2015-06-16 10:04 UTC (permalink / raw)
  To: John Spray, sjust; +Cc: Sage Weil, Robert LeBlanc, ceph-devel

On Mon, Jun 15, 2015 at 8:03 AM, John Spray <john.spray@redhat.com> wrote:
>
>
> On 15/06/2015 14:52, Sage Weil wrote:
>>
>>
>> I seem to remember having a short conversation about something like this a
>> few CDS's back... although I think it was 'rados top'.  IIRC the basic
>> idea we had was for each OSD to track its top clients (using some
>> approximate LRU type algorithm) and then either feed this relatively small
>> amount of info (say, top 10-100 clients) back to the mon for summation,
>> or dump via the admin socket for calamari to aggregate.
>>
>> This doesn't give you the rbd image name, but I bet we could infer that
>> without too much trouble (e.g., include a recent object or two with the
>> client).  Or, just assume that client id is enough (it'll include an IP
>> and PID... enough info to find the /var/run/ceph admin socket or the VM
>> process.)
>>
>> If we were going to do top clients, I think it'd make sense to also have a
>> top objects list as well, so you can see what the hottest objects in the
>> cluster are.
>
>
> The following is a bit of a tangent...
>
> A few weeks ago I was thinking about general solutions to this problem (for
> the filesystem).  I played with (very briefly on wip-live-query) the idea of
> publishing a list of queries to the MDSs/OSDs, that would allow runtime
> configuration of what kind of thing we're interested in and how we want it
> broken down.
>
> If we think of it as an SQL-like syntax, then for the RBD case we would have
> something like:
>   SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image
>
> (You'd need a protocol-specific module of some kind to define what
> "rbd_image" meant here, which would do a simple mapping from object
> attributes to an identifier (similar would exist for e.g. cephfs inode))
>
> Each time an OSD does an operation, it consults the list of active
> "performance queries" and updates counters according to the value of the
> GROUP BY parameter for the query (so in the above example each OSD would
> be keeping a result row for each rbd image touched).
>
> The LRU part could be implemented as LIMIT BY + SORT parameters, such that
> the result rows would be periodically sorted and the least-touched results
> would drop off the list.  That would probably be used in conjunction with a
> decay operator on the sorted-by field, like:
>   SELECT read_bytes, write_bytes,ops WHERE pool=rbd GROUP BY rbd_image SORT
> BY movingAverage(derivative(ops)) LIMIT 100
>
> Combining WHERE clauses would let the user "drill down" (apologies for
> buzzword) by doing things like identifying the most busy clients, and then
> for each of those clients identify which images/files/objects the client is
> most active on, or vice versa identify busy objects and then see which
> clients are hitting them. Usually keeping around enough stats to enable this
> is prohibitive at scale, but it's fine when you're actively creating custom
> queries for the results you're really interested in, instead of keeping
> N_clients*N_objects stats, and when you have the LIMIT part to ensure
> results never get oversized.
>
> The GROUP BY options would also include metadata sent from clients, e.g. the
> obvious cases like VM instance names, or rack IDs, or HPC job IDs.  Maybe
> also some less obvious ones like decorating cephfs IOs with the inode of the
> directory containing the file, so that OSDs could accumulate per-directory
> bandwidth numbers, and user could ask "which directory is
> bandwidth-hottest?" as well as "which file is bandwidth-hottest?".
>
> Then, after implementing all that craziness, you get some kind of wild
> multicolored GUI that shows you where the action is in your system at a
> cephfs/rgw/rbd level.

I *like* that idea. We should discuss with Sam before doing too much
though as I know he's thought about various online computations in
RADOS before. Something like this is also interesting in comparison to
our long-theorized "PG classes", etc.
-Greg

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: rbd top
  2015-06-15 16:52         ` John Spray
@ 2015-06-16 11:05           ` Wido den Hollander
  2015-06-17 17:06             ` Robert LeBlanc
  0 siblings, 1 reply; 11+ messages in thread
From: Wido den Hollander @ 2015-06-16 11:05 UTC (permalink / raw)
  To: John Spray, Robert LeBlanc; +Cc: Sage Weil, Gregory Farnum, ceph-devel

On 06/15/2015 06:52 PM, John Spray wrote:
> 
> 
> On 15/06/2015 17:10, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> John, let me see if I understand what you are saying...
>>
>> When a person runs `rbd top`, each OSD would receive a message saying
>> please capture all the performance, grouped by RBD and limit it to
>> 'X'. That way the OSD doesn't have to constantly update performance
>> for each object, but when it is requested it starts tracking it?
> 
> Right, initially the OSD isn't collecting anything, it starts as soon as
> it sees a query get loaded up (published via OSDMap or some other
> mechanism).
> 

I like that idea very much. Currently the OSDs are already CPU bound; a
lot of time is spent processing a request even when it's not waiting on
the disk.

Although tracking IOps might seem like a small and cheap thing to do,
it's yet more CPU time spent by the system on something other than
processing the I/O.

So I'm in favor of not always collecting, but only on demand.

Go for performance, low-latency and high IOps.

Wido

> That said, in practice I can see people having some set of queries that
> they always have loaded and feeding into graphite in the background.
>>
>> If so, that is an interesting idea. I wonder if that would be simpler
>> than tracking the performance of each/MRU objects in some format like
>> /proc/diskstats where it is in memory and not necessarily consistent.
>> The benefit is that you could have "lifelong" stats that show up like
>> iostat and it would be a simple operation.
> 
> Hmm, not sure we're on the same page about this part, what I'm talking
> about is all in memory and would be lost across daemon restarts.  Some
> other component would be responsible for gathering the stats across all
> the daemons in one place (that central part could persist stats if
> desired).
> 
>> Each object should be able
>> to reference back to RBD/CephFS upon request and the client could even
>> be responsible for that load. Client performance data would need stats
>> in addition to the object stats.
> 
> You could extend the mechanism to clients.  However, as much as possible
> it's a good thing to keep it server side, as servers are generally fewer
> (still have to reduce these stats across N servers to present to user),
> and we have multiple client implementations (kernel/userspace).  What
> kind of thing do you want to get from clients?
>> My concern is that adding additional SQL-like logic to each op is
>> going to get very expensive. I guess if we could push that to another
>> thread early in the op, then it might not be too bad. I'm enjoying the
>> discussion and new ideas.
> 
> Hopefully in most cases the query can be applied very cheaply, for
> operations like comparing pool ID or grouping by client ID. However, I
> would also envisage an optional sampling number, such that e.g. only 1
> in every 100 ops would go through the query processing.  Useful for
> systems where keeping highest throughput is paramount, and the numbers
> will still be useful if clients are doing many thousands of ops per second.
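[Editorial note: that 1-in-100 sampling could be as simple as the sketch below, shown with a deterministic every-Nth counter (a randomized check would work equally well). Names are illustrative, not actual Ceph code; counters accumulated this way would need to be scaled back up by the sampling factor when reported.]

```python
# Illustrative sketch of op sampling: only ~1 in every SAMPLE_EVERY ops
# pays the query-processing cost. Not actual Ceph code.

SAMPLE_EVERY = 100  # 1-in-100 sampling, per the suggestion above

class SampledAccounting:
    def __init__(self, account, every=SAMPLE_EVERY):
        self.account = account  # the (relatively expensive) query accounting
        self.every = every
        self.seen = 0

    def maybe_account(self, op):
        self.seen += 1
        if self.seen % self.every == 0:
            self.account(op)  # scale results by self.every when reporting

sampled_ops = []
s = SampledAccounting(sampled_ops.append)
for i in range(1000):
    s.maybe_account({"op": i})

print(len(sampled_ops))  # 10 of 1000 ops went through query processing
```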
> 
> Cheers,
> John
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


* Re: rbd top
  2015-06-16 11:05           ` Wido den Hollander
@ 2015-06-17 17:06             ` Robert LeBlanc
  2015-06-17 17:59               ` John Spray
  0 siblings, 1 reply; 11+ messages in thread
From: Robert LeBlanc @ 2015-06-17 17:06 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: John Spray, Sage Weil, Gregory Farnum, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Well, I think this has gone well past my ability to implement. Should
this be turned into a BP so we can see if someone is able to work on it?
- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jun 16, 2015 at 5:05 AM, Wido den Hollander  wrote:
> On 06/15/2015 06:52 PM, John Spray wrote:
>>
>>
>> On 15/06/2015 17:10, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> John, let me see if I understand what you are saying...
>>>
>>> When a person runs `rbd top`, each OSD would receive a message saying
>>> please capture all the performance, grouped by RBD and limit it to
>>> 'X'. That way the OSD doesn't have to constantly update performance
>>> for each object, but when it is requested it starts tracking it?
>>
>> Right, initially the OSD isn't collecting anything, it starts as soon as
>> it sees a query get loaded up (published via OSDMap or some other
>> mechanism).
>>
>
> I like that idea very much. Currently the OSDs are already CPU bound; a
> lot of time is spent processing a request even when it's not waiting on
> the disk.
>
> Although tracking IOps might seem like a small and cheap thing to do,
> it's yet more CPU time spent by the system on something other than
> processing the I/O.
>
> So I'm in favor of not always collecting, but only on demand.
>
> Go for performance, low-latency and high IOps.
>
> Wido
>
>> That said, in practice I can see people having some set of queries that
>> they always have loaded and feeding into graphite in the background.
>>>
>>> If so, that is an interesting idea. I wonder if that would be simpler
>>> than tracking the performance of each/MRU objects in some format like
>>> /proc/diskstats where it is in memory and not necessarily consistent.
>>> The benefit is that you could have "lifelong" stats that show up like
>>> iostat and it would be a simple operation.
>>
>> Hmm, not sure we're on the same page about this part, what I'm talking
>> about is all in memory and would be lost across daemon restarts.  Some
>> other component would be responsible for gathering the stats across all
>> the daemons in one place (that central part could persist stats if
>> desired).
>>
>>> Each object should be able
>>> to reference back to RBD/CephFS upon request and the client could even
>>> be responsible for that load. Client performance data would need stats
>>> in addition to the object stats.
>>
>> You could extend the mechanism to clients.  However, as much as possible
>> it's a good thing to keep it server side, as servers are generally fewer
>> (still have to reduce these stats across N servers to present to user),
>> and we have multiple client implementations (kernel/userspace).  What
>> kind of thing do you want to get from clients?
>>> My concern is that adding additional SQL-like logic to each op is
>>> going to get very expensive. I guess if we could push that to another
>>> thread early in the op, then it might not be too bad. I'm enjoying the
>>> discussion and new ideas.
>>
>> Hopefully in most cases the query can be applied very cheaply, for
>> operations like comparing pool ID or grouping by client ID. However, I
>> would also envisage an optional sampling number, such that e.g. only 1
>> in every 100 ops would go through the query processing.  Useful for
>> systems where keeping highest throughput is paramount, and the numbers
>> will still be useful if clients are doing many thousands of ops per second.
>>
>> Cheers,
>> John
>
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on



* Re: rbd top
  2015-06-17 17:06             ` Robert LeBlanc
@ 2015-06-17 17:59               ` John Spray
  0 siblings, 0 replies; 11+ messages in thread
From: John Spray @ 2015-06-17 17:59 UTC (permalink / raw)
  To: Robert LeBlanc, Wido den Hollander; +Cc: Sage Weil, Gregory Farnum, ceph-devel



On 17/06/2015 18:06, Robert LeBlanc wrote:
> Well, I think this has gone well past my ability to implement. Should
> this be turned into a BP so we can see if someone is able to work on it?

Sorry, I didn't mean to hijack your thread :-)

It might still be useful to discuss the simpler case of tracking top 
clients/top objects (i.e. just native RADOS concepts) with an LRU table 
of stats (like Sage described) as a simpler alternative to my 
custom-querying proposal.  I'm going to write a blueprint for the 
custom query thing anyway though, as I'm kind of hot on the idea, though 
I don't know who will have time to take it on, or when, as it's a bit 
heavyweight.

John


end of thread, other threads:[~2015-06-17 18:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-11 19:33 rbd top Robert LeBlanc
2015-06-15 11:52 ` Gregory Farnum
2015-06-15 13:52   ` Sage Weil
2015-06-15 15:03     ` John Spray
2015-06-15 16:10       ` Robert LeBlanc
2015-06-15 16:52         ` John Spray
2015-06-16 11:05           ` Wido den Hollander
2015-06-17 17:06             ` Robert LeBlanc
2015-06-17 17:59               ` John Spray
2015-06-16 10:04       ` Gregory Farnum
2015-06-15 16:28     ` Robert LeBlanc
