* Re: 答复: osd: fine-grained statistics for object space usage
       [not found] ` <alpine.DEB.2.11.1711300304240.8333@piezo.novalocal>
@ 2017-11-30 21:17   ` Gregory Farnum
       [not found]     ` <alpine.DEB.2.11.1711302125440.12766@piezo.novalocal>
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2017-11-30 21:17 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Igor Fedotov, Xie Xingguo

On Wed, Nov 29, 2017 at 7:06 PM Sage Weil <sweil@redhat.com> wrote:
>
> On Thu, 30 Nov 2017, xie.xingguo@zte.com.cn wrote:
> > (My network connection seems to be problematic, resending :( )
> >
> >   Anyway, I am + 1 for doing this in a more effective way (e.g., as Igor
> > suggested).
> >
> >   The potential big challenge might be making the scrub-process happy,
> > though!
>
> Would this be something like:
>
> 1- an object_info_t field like uint32_t allocated_size, which gets
> incorporated into the pg summation, and
>
> 2- an ObjectStore method that returns the allocated size for an object?
>
> The challenge I see is that the new value (or delta) needs to be sorted
> out at the transaction prepare time because the stat update is part of the
> transaction, but we won't really know what the result is until bluestore
> (or any other impl) does its write preparation work.  :/
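
(For concreteness, 1- and 2- might look something like the sketch
below. The field and the method are placeholders to illustrate the
idea, not existing object_info_t/ObjectStore interfaces.)

  // Hypothetical sketch only: "allocated_size" and "get_allocated_size"
  // are invented names illustrating points 1- and 2- above.
  #include <cstdint>
  #include <string>

  struct object_info_sketch_t {
    uint64_t size = 0;            // logical size, what we track today
    uint32_t allocated_size = 0;  // 1- proposed: bytes actually allocated
                                  //    on disk, folded into the PG summation
  };

  class ObjectStoreSketch {
  public:
    virtual ~ObjectStoreSketch() = default;
    // 2- proposed: ask the backend how much space an object really
    // occupies; returns 0 on success and fills *allocated.
    virtual int get_allocated_size(const std::string& coll,
                                   const std::string& oid,
                                   uint64_t* allocated) = 0;
  };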


It would take some doing, but this might be a good time to start
adding delayed work. We could get the stat updates as part of the
objectstore callback and incorporate them into future disk ops, and
part of the startup/replay process could be querying for stat updates
for objects whose stats we haven't committed yet.
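
To make that concrete, the shape I have in mind is roughly the sketch
below (all names are invented, and this hand-waves the actual queueing
and ordering):

  // Hypothetical sketch of "delayed" stat maintenance: the backend
  // reports the allocation delta in its commit callback, and we fold it
  // into a later transaction rather than the one that produced it.
  #include <cstdint>
  #include <mutex>

  struct StatDelta { int64_t allocated_bytes = 0; };

  class LaggingPGStats {
    std::mutex lock;
    StatDelta pending;  // reported by the objectstore, not yet persisted
  public:
    // called from the objectstore commit callback
    void note_backend_delta(const StatDelta& d) {
      std::lock_guard<std::mutex> l(lock);
      pending.allocated_bytes += d.allocated_bytes;
    }
    // called while preparing the *next* transaction: take whatever has
    // accumulated and encode it into that transaction's stat update
    StatDelta take_pending() {
      std::lock_guard<std::mutex> l(lock);
      StatDelta out = pending;
      pending = StatDelta{};
      return out;
    }
  };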

...except we don’t really have an OSD-level or pg replay phase any
more, do we. Hrmm. And doing it in the transaction would require some
sort of set-up/query phase to the transaction, then finalization and
submission, which isn’t great since it impacts checksumming and other
stuff (although *hopefully* not actual allocation).
-Greg

* Re: 答复: osd: fine-grained statistics for object space usage
       [not found]     ` <alpine.DEB.2.11.1711302125440.12766@piezo.novalocal>
@ 2017-11-30 21:46       ` Gregory Farnum
  2017-12-01 14:23       ` Igor Fedotov
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2017-11-30 21:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Igor Fedotov, Xie Xingguo

On Thu, Nov 30, 2017 at 1:27 PM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 30 Nov 2017, Gregory Farnum wrote:
>> On Wed, Nov 29, 2017 at 7:06 PM Sage Weil <sweil@redhat.com> wrote:
>> >
>> > On Thu, 30 Nov 2017, xie.xingguo@zte.com.cn wrote:
>> > > (My network connection seems to be problematic, resending :( )
>> > >
>> > >   Anyway, I am + 1 for doing this in a more effective way (e.g., as Igor
>> > > suggested).
>> > >
>> > >   The potential big challenge might be making the scrub-process happy,
>> > > though!
>> >
>> > Would this be something like:
>> >
>> > 1- an object_info_t field like uint32_t allocated_size, which has been
>> > incorporated into the pg summation, and
>> >
>> > 2- an ObjectStore method that returns the allocated size for an object?
>> >
>> > The challenge I see is that the new value (or delta) needs to be sorted
>> > out at the transaction prepare time because the stat update is part of the
>> > transaction, but we won't really know what the result is until bluestore
>> > (or any other impl) does it's write preparation work.  :/
>>
>>
>> It would take some doing but this might be a good time to start adding
>> delayed work. We could get the stat updates as part of the objectstore
>> callback and incorporate them into future disk ops, and part of the
>> startup/replay process could be querying for stat updates to objects
>> we haven’t committed yet.
>>
>> ...except we don’t really have an OSD-level or pg replay phase any
>> more, do we. Hrmm. And doing it in the transaction would require some
>> sort of set-up/query phase to the transaction, then finalization and
>> submission, which isn’t great since it impacts checksumming and other
>> stuff (although *hopefully* not actual allocation).
>
> Hmm, and there is a larger problem here: we can't really make this
> ObjectStore implementation specific because it may vary across OSDs (some
> may be BlueStore, some may be FileStore).
>
> Even if we didn't have that issue, it constrains the ordering somewhat:
> you would need to prepare and submit the local transaction (to get
> the stat delta) before sending the replica writes.

Yeah. This isn't a problem if we do the stat maintenance separately,
but it's a much larger-scoped patch than just poking the interfaces.

Would you be opposed to a simple OSD-level replay step that could do
stuff like update pg stats for recent object writes?

> I think this sort of layer boundary crossing would make our lives very
> difficult down the line.  :/

I mean, that's true, but it's also something eminently reasonable for
admins to want. I've always found it a bit embarrassing we can't
expose which snapshots are actually taking up space. Saying which
things are using up storage is sort of a critical feature. :/
-Greg

* Re: 答复: osd: fine-grained statistics for object space usage
       [not found]     ` <alpine.DEB.2.11.1711302125440.12766@piezo.novalocal>
  2017-11-30 21:46       ` Gregory Farnum
@ 2017-12-01 14:23       ` Igor Fedotov
       [not found]         ` <alpine.DEB.2.11.1712011427180.2819@piezo.novalocal>
  1 sibling, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2017-12-01 14:23 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: ceph-devel, Xie Xingguo

On 12/1/2017 12:27 AM, Sage Weil wrote:
> On Thu, 30 Nov 2017, Gregory Farnum wrote:
>> On Wed, Nov 29, 2017 at 7:06 PM Sage Weil <sweil@redhat.com> wrote:
>>> On Thu, 30 Nov 2017, xie.xingguo@zte.com.cn wrote:
>>>> (My network connection seems to be problematic, resending :( )
>>>>
>>>>    Anyway, I am + 1 for doing this in a more effective way (e.g., as Igor
>>>> suggested).
>>>>
>>>>    The potential big challenge might be making the scrub-process happy,
>>>> though!
>>> Would this be something like:
>>>
>>> 1- an object_info_t field like uint32_t allocated_size, which has been
>>> incorporated into the pg summation, and
>>>
>>> 2- an ObjectStore method that returns the allocated size for an object?
>>>
>>> The challenge I see is that the new value (or delta) needs to be sorted
>>> out at the transaction prepare time because the stat update is part of the
>>> transaction, but we won't really know what the result is until bluestore
>>> (or any other impl) does it's write preparation work.  :/
>>
>> It would take some doing but this might be a good time to start adding
>> delayed work. We could get the stat updates as part of the objectstore
>> callback and incorporate them into future disk ops, and part of the
>> startup/replay process could be querying for stat updates to objects
>> we haven’t committed yet.
>>
>> ...except we don’t really have an OSD-level or pg replay phase any
>> more, do we. Hrmm. And doing it in the transaction would require some
>> sort of set-up/query phase to the transaction, then finalization and
>> submission, which isn’t great since it impacts checksumming and other
>> stuff (although *hopefully* not actual allocation).
> Hmm, and there is a larger problem here: we can't really make this
> ObjectStore implementation specific because it may vary across OSDs (some
> may be BlueStore, some may be FileStore).
IMO first of all we should determine which parameter(s) we would
track: object logical space usage (as we do now), physical
allocations, or both.
For logical space tracking it's probably not an issue to get uniform
results among different stores - FileStore replicates what we have at
the OSD level, and BlueStore does the same on its own data structures.
For physical allocation tracking we must handle different results from
different store types, as they are really not the same (e.g. a
BlueStore replica may compress an object that a FileStore replica
keeps fully allocated). I.e. object physical size (with 3 replicas)
should be calculated as
   size = size_rep1 + size_rep2 + size_rep3
not
   size = size_primary * 3

Also wondering whether mixed object store environments have any
non-academic value?
> Even if we didn't have that issue, it constrains the ordering somewhat:
> you would need to prepare and submit the local transaction (to get
> the stat delta) before sending the replica writes.
>
> I think this sort of layer boundary crossing would make our lives very
> difficult down the line.  :/
>
> sage


* Re: 答复: osd: fine-grained statistics for object space usage
       [not found]         ` <alpine.DEB.2.11.1712011427180.2819@piezo.novalocal>
@ 2017-12-04 11:23           ` Igor Fedotov
       [not found]             ` <alpine.DEB.2.11.1712041424370.22619@piezo.novalocal>
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2017-12-04 11:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, Xie Xingguo



On 12/1/2017 5:34 PM, Sage Weil wrote:
> On Fri, 1 Dec 2017, Igor Fedotov wrote:
>> On 12/1/2017 12:27 AM, Sage Weil wrote:
>>> On Thu, 30 Nov 2017, Gregory Farnum wrote:
>>>> On Wed, Nov 29, 2017 at 7:06 PM Sage Weil <sweil@redhat.com> wrote:
>>>> It would take some doing but this might be a good time to start adding
>>>> delayed work. We could get the stat updates as part of the objectstore
>>>> callback and incorporate them into future disk ops, and part of the
>>>> startup/replay process could be querying for stat updates to objects
>>>> we haven’t committed yet.
>>>>
>>>> ...except we don’t really have an OSD-level or pg replay phase any
>>>> more, do we. Hrmm. And doing it in the transaction would require some
>>>> sort of set-up/query phase to the transaction, then finalization and
>>>> submission, which isn’t great since it impacts checksumming and other
>>>> stuff (although *hopefully* not actual allocation).
>>> Hmm, and there is a larger problem here: we can't really make this
>>> ObjectStore implementation specific because it may vary across OSDs (some
>>> may be BlueStore, some may be FileStore).
>> IMO first of all we should determine what parameter(s) would we track. Object
>> logical space usage (as we do now) or physical allocations or both.
>> For logical space tracking it's probably not an issue to have uniform results
>> among different stores - FileStore replicates what we have at OSD, BlueStore
>> do the same on its own data structures.
>> or physical allocation tracking we must handle different results from
>> different store types as they are really not the same. I.e. object physical
>> size (with 3 replications) should be  calculated as
>>    size = size_rep1 + size_rep2 + size_rep3
>> not
>>    size = size_primary * 3
> This level of detail is appealing, but the cost is high.  It would
> require a two-phase update to implement, as Greg suggested: first
> doing the actual update, and then later a follow-up that adjusts
> the stats.
>> Also wondering if mixed object store environments have any non-academic
>> value?
> Definitely.  It happens in hybrid clusters (some HDD, some SSD, where you
> may end up with backends tuned for each), and more commonly for any
> existing cluster that is in the (slow) process of migrating from one
> backend to another (e.g., filestore -> bluestore).  We have to design for
> heterogeneity being the norm if we want to scale.
>
> I see three paths:
>
> 1- We drop this and give up on a fine-grained mapping between logical
> usage and physical usage.  PG stats would reflect the logical sizes
> of objects (as they have historically) and OSDs would report actual
> utilization (after replication, compression, etc.).
>
> 2- We add a ton of complexity to a pipeline we are trying to simplify and
> optimize to provide this detail.
>
> 3- We extend the OSD-side reporting.  Currently (see #1), we only report
> total stats for the entire OSD.  We could maintain ObjectStore-level
> summations by pool.  This would be split tolerant but would still provide
> us a value we can divide against the PG count (or total cluster values)
> in order to tell how efficiently pools are compressing or how sparse
> they are or whatever.
So let me reinterpret (or append to) this suggestion.
- We can start doing per-collection (= per-PG) logical and allocated
size tracking at the BlueStore level. BlueStore alters the
corresponding collection metadata on each object update by inserting
an additional collection-related 'set collection metadata'
transaction. The PG isn't involved in this scenario until operation
completion, hence there is no need for a two-stage write operation.
- BlueStore should expose this collection metadata via a new OS API
call (e.g. get_collection_meta) and/or an extended onreadable_sync
notification event (see the sketch after this list). I'd prefer the
latter, to avoid the extra overhead of a get_collection_meta call
(e.g. collection lookup, locks, etc.), since we need its results after
each object update operation.
- The PG instance at each OSD node retrieves collection statistics
from the OS when needed, or tracks them in RAM only.
- Two statistics reports to be distinguished:
   a. Cluster-wide PG report - the processing OSD retrieves statistics
from both local and remote PGs and sums them on a per-PG basis. E.g.
total per-PG physical space usage can be obtained this way.
   b. OSD-wide PG report (or just a simple OSD summary report) - the
OSD collects PG statistics from local PGs only. E.g. logical/physical
space usage at a specific OSD can be examined this way.
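
A rough illustration of the shape of that API (struct and method names
are invented here; this is not an existing ObjectStore interface):

  // Invented names, illustrating the per-collection metadata proposal.
  #include <cstdint>
  #include <string>

  struct collection_meta_sketch_t {
    int64_t logical_bytes = 0;    // sum of object logical sizes
    int64_t allocated_bytes = 0;  // sum of blocks actually allocated
  };

  class CollectionMetaSource {
  public:
    virtual ~CollectionMetaSource() = default;
    // Option A: explicit query (costs a collection lookup + locking
    // on every call).
    virtual int get_collection_meta(const std::string& cid,
                                    collection_meta_sketch_t* out) = 0;
    // Option B (preferred above): hand the same struct back as part of
    // the per-transaction completion notification, so the PG gets the
    // updated numbers for free after each object update.
  };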

> 4- We keep what we have now with a duplicated interval_set at the OSD
> level.  Maybe make it a pool option whether we want to track it?  Or add a
> pool property specifying the level of granularity so that it can be
> rounded to 64k blocks or something?  Scrub could reconcile the ObjectStore
> view opportunistically so that e.g. a bunch of 4k discards will eventually
> result in the coarse-grained 64k block appearing as a hole.
>
> #3 still doesn't get us a valid st_blocks for cephfs, but it seems like it
> gets us most of what we want?
>
> sage


* Re: 答复: osd: fine-grained statistics for object space usage
       [not found]             ` <alpine.DEB.2.11.1712041424370.22619@piezo.novalocal>
@ 2017-12-04 22:15               ` Gregory Farnum
  2017-12-05 20:48                 ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2017-12-04 22:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel, Xie Xingguo

On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@redhat.com> wrote:
> It's pretty straightforward to maintain collection-level metadata in the
> common case, but I don't see how we can *also* support an O(1) split
> operation.

You're right we can't know the exact answer, but we already solve this
problem for PG object counts and things by doing a fuzzy estimate
(just dividing the PG values in two) until a scrub happens. I don't
think having to do the same here is a reason to avoid it entirely.
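
(For reference, the fuzzy estimate amounts to something like the
sketch below: halve the numbers on split and mark them approximate
until scrub recomputes them. The struct is invented for illustration;
it is not pg_stat_t.)

  // Illustrative only: approximate stat split, reconciled by scrub.
  #include <cstdint>
  #include <vector>

  struct pg_space_stats_sketch_t {
    int64_t logical_bytes = 0;
    int64_t allocated_bytes = 0;  // proposed field, not tracked today
    bool approximate = false;     // true until the next scrub recomputes
  };

  void split_stats(pg_space_stats_sketch_t& parent,
                   std::vector<pg_space_stats_sketch_t>& children)
  {
    const int64_t n = static_cast<int64_t>(children.size()) + 1;
    for (auto& c : children) {
      c.logical_bytes = parent.logical_bytes / n;
      c.allocated_bytes = parent.allocated_bytes / n;
      c.approximate = true;
    }
    parent.logical_bytes /= n;
    parent.allocated_bytes /= n;
    parent.approximate = true;
  }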


> This is why I suggested per-pool metadata.  Pool-level
> information will still let us roll things up into a 'ceph df' type summary
> of how well data in a particular pool is compressing, how sparse it is,
> and so on, which should be sufficient for capacity planning purposes.
> We'll also have per-OSD (by pool) information, which will tell us how
> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>
> What we don't get is per-PG granularity.  I don't think this matters much,
> since a user doesn't really care about individual PGs anyway.
>
> We also don't get perfect accuracy when the cluster is degraded.  If
> one or more PGs in a pool are undergoing backfill or whatever, the
> OSD-level summations will be off.  We can *probably* figure out how to
> correct for that by scaling the result based on what we know about the PG
> recovery progress (e.g., how far along backfill on a PG is, and ignoring
> the log-based recovery as insignificant).

Users don't care much about per-PG granularity in general, but as you
note it breaks down in recovery. More than that, our *balancers* care
very much about exactly what's in each PG, don't they?

>
>> - PG instance at each OSD node retrieves collection statistics from OS when
>> needed or tracks  it in RAM only.
>> - Two statistics reports  to be distinguished:
>>   a. Cluster-wide PG report - processing OSD retrieves statistics from both
>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG physical
>> space usage can be obtained this way.
>>   b. OSD-wide PG report (or just simple OSD summary report) - OSD collects PG
>> statistics from local PGs only. E.g. logical/physical space usage at specific
>> OSD can be examined this way.
>
> ...and if we're talking about OSD-level stats, then I don't think any
> different update path is needed.  We would just extend statfs() to return a
> pool
> summation for each pool that exists on the OSD as well as the current
> osd_stat_t (or whatever it is).
>
> Does that seem reasonable?

I know I'm describing a "replay" mechanism or a two-phase commit, but I
really don't think having delayed stat updates would take much doing.
We can modify our in-memory state as soon as the ObjectStore replies
back to us, and add a new "stats-persisted-thru" value to the pg_info.
On any subsequent writes, we update the pg stats according to what we
already know. Then on OSD boot, we compare that value to the last pg
write, and query any objects which changed in the unaccounted pg log
entries. It's a short, easy pass, right? And we're not talking new
blocking queues or anything.
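
In sketch form (the field and types are invented, just to show the
shape of the boot-time pass):

  // Hypothetical sketch of the replay pass: pg_info carries a
  // "stats persisted thru" version; on boot we re-query only the
  // objects touched by log entries newer than that.
  #include <cstdint>
  #include <set>
  #include <string>
  #include <vector>

  struct LogEntrySketch { uint64_t version; std::string oid; };

  struct PGInfoSketch {
    uint64_t last_update = 0;           // last write the PG applied
    uint64_t stats_persisted_thru = 0;  // proposed: stats good up to here
  };

  std::set<std::string> objects_needing_stat_requery(
      const PGInfoSketch& info,
      const std::vector<LogEntrySketch>& pg_log)
  {
    std::set<std::string> dirty;
    for (const auto& e : pg_log) {
      if (e.version > info.stats_persisted_thru &&
          e.version <= info.last_update)
        dirty.insert(e.oid);  // re-ask the store for its allocated size
    }
    return dirty;  // a short pass over the log tail, no new queues
  }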
-Greg

* Re: 答复: osd: fine-grained statistics for object space usage
  2017-12-04 22:15               ` Gregory Farnum
@ 2017-12-05 20:48                 ` Igor Fedotov
  2017-12-05 21:18                   ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2017-12-05 20:48 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil; +Cc: ceph-devel, Xie Xingguo



On 12/5/2017 1:15 AM, Gregory Farnum wrote:
> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@redhat.com> wrote:
>> It's pretty straightforward to maintain collection-level metadata in the
>> common case, but I don't see how we can *also* support an O(1) split
>> operation.
> You're right we can't know the exact answer, but we already solve this
> problem for PG object counts and things by doing a fuzzy estimate
> (just dividing the PG values in two) until a scrub happens. I don't
> think having to do the same here is a reason to avoid it entirely.
>
>
>> This is why I suggested per-pool metadata.  Pool-level
>> information will still let us roll things up into a 'ceph df' type summary
>> of how well data in a particular pool is compressing, how sparse it is,
>> and so on, which should be sufficient for capacity planning purposes.
>> We'll also have per-OSD (by pool) information, which will tell us how
>> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>>
>> What we don't get is per-PG granularity.  I don't think this matters much,
>> which a user doesn't really care about individual PGs anyway.
>>
>> We also don't get perfect accuracy when the cluster is degraded.  If
>> one or more PGs in a pool is undergoing backfill or whatever, the
>> OSD-level summations will be off.  We can *probably* figure out how to
>> correct for that by scaling the result based on what we know about the PG
>> recovery progress (e.g., how far along backfill on a PG is, and ignoring
>> the log-based recovery as an insignificant).
> Users don't care much about per-PG granularity in general, but as you
> note it breaks down in recovery. More than that, our *balancers* care
> very much about exactly what's in each PG, don't they?
>
>>> - PG instance at each OSD node retrieves collection statistics from OS when
>>> needed or tracks  it in RAM only.
>>> - Two statistics reports  to be distinguished:
>>>    a. Cluster-wide PG report - processing OSD retrieves statistics from both
>>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG physical
>>> space usage can be obtained this way.
>>>    b. OSD-wide PG report (or just simple OSD summary report) - OSD collects PG
>>> statistics from local PGs only. E.g. logical/physical space usage at specific
>>> OSD can be examined this way.
>> ...and if we're talking about OSD-level stats, then I don't think any
>> different update path is needed.  We would just statfs() to return a pool
>> summation for each pool that exists on the OSD as well as the current
>> osd_stat_t (or whatever it is).
>>
>> Does that seem reasonable?
> I'm saying it's a "replay" mechanism or a two-phase commit, but I
> really don't think having delayed stat updates would take much doing.
> We can modify our in-memory state as soon as the ObjectStore replies
> back to us, and add a new "stats-persisted-thru" value to the pg_info.
> On any subsequent writes, we update the pg stats according to what we
> already know. Then on OSD boot, we compare that value to the last pg
> write, and query any objects which changed in the unaccounted pg log
> entries. It's a short, easy pass, right? And we're not talking new
> blocking queues or anything.
That's what I was thinking about too. Here is a very immature POC for
this approach; it seems doable so far:

https://github.com/ceph/ceph/pull/19350


> -Greg


* Re: 答复: osd: fine-grained statistics for object space usage
  2017-12-05 20:48                 ` Igor Fedotov
@ 2017-12-05 21:18                   ` Gregory Farnum
       [not found]                     ` <alpine.DEB.2.11.1712052123570.22619@piezo.novalocal>
  2017-12-05 21:58                     ` Igor Fedotov
  0 siblings, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2017-12-05 21:18 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Sage Weil, ceph-devel, Xie Xingguo

On Tue, Dec 5, 2017 at 12:48 PM, Igor Fedotov <ifedotov@suse.de> wrote:
>
>
> On 12/5/2017 1:15 AM, Gregory Farnum wrote:
>>
>> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@redhat.com> wrote:
>>>
>>> It's pretty straightforward to maintain collection-level metadata in the
>>> common case, but I don't see how we can *also* support an O(1) split
>>> operation.
>>
>> You're right we can't know the exact answer, but we already solve this
>> problem for PG object counts and things by doing a fuzzy estimate
>> (just dividing the PG values in two) until a scrub happens. I don't
>> think having to do the same here is a reason to avoid it entirely.
>>
>>
>>> This is why I suggested per-pool metadata.  Pool-level
>>> information will still let us roll things up into a 'ceph df' type
>>> summary
>>> of how well data in a particular pool is compressing, how sparse it is,
>>> and so on, which should be sufficient for capacity planning purposes.
>>> We'll also have per-OSD (by pool) information, which will tell us how
>>> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>>>
>>> What we don't get is per-PG granularity.  I don't think this matters
>>> much,
>>> which a user doesn't really care about individual PGs anyway.
>>>
>>> We also don't get perfect accuracy when the cluster is degraded.  If
>>> one or more PGs in a pool is undergoing backfill or whatever, the
>>> OSD-level summations will be off.  We can *probably* figure out how to
>>> correct for that by scaling the result based on what we know about the PG
>>> recovery progress (e.g., how far along backfill on a PG is, and ignoring
>>> the log-based recovery as an insignificant).
>>
>> Users don't care much about per-PG granularity in general, but as you
>> note it breaks down in recovery. More than that, our *balancers* care
>> very much about exactly what's in each PG, don't they?
>>
>>>> - PG instance at each OSD node retrieves collection statistics from OS
>>>> when
>>>> needed or tracks  it in RAM only.
>>>> - Two statistics reports  to be distinguished:
>>>>    a. Cluster-wide PG report - processing OSD retrieves statistics from
>>>> both
>>>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG
>>>> physical
>>>> space usage can be obtained this way.
>>>>    b. OSD-wide PG report (or just simple OSD summary report) - OSD
>>>> collects PG
>>>> statistics from local PGs only. E.g. logical/physical space usage at
>>>> specific
>>>> OSD can be examined this way.
>>>
>>> ...and if we're talking about OSD-level stats, then I don't think any
>>> different update path is needed.  We would just statfs() to return a pool
>>> summation for each pool that exists on the OSD as well as the current
>>> osd_stat_t (or whatever it is).
>>>
>>> Does that seem reasonable?
>>
>> I'm saying it's a "replay" mechanism or a two-phase commit, but I
>> really don't think having delayed stat updates would take much doing.
>> We can modify our in-memory state as soon as the ObjectStore replies
>> back to us, and add a new "stats-persisted-thru" value to the pg_info.
>> On any subsequent writes, we update the pg stats according to what we
>> already know. Then on OSD boot, we compare that value to the last pg
>> write, and query any objects which changed in the unaccounted pg log
>> entries. It's a short, easy pass, right? And we're not talking new
>> blocking queues or anything.
>
> That's what I was thinking about too. Here is a very immature POC for this
> approach, seems doable so far:
>
> https://github.com/ceph/ceph/pull/19350

To evaluate this usefully I think we'd need to see how these updates
get recovered if the OSD crashes before they're persisted? I expect
that requires some kind of query interface...which, hrm, is actually a
little more complicated if this is the model.
I was just thinking we'd compare the on-disk allocation info for an
object to what we've persisted, but we actually only keep per-PG
stats, right? That's not great. :/

* Re: osd: fine-grained statistics for object space usage
       [not found]                     ` <alpine.DEB.2.11.1712052123570.22619@piezo.novalocal>
@ 2017-12-05 21:37                       ` Sage Weil
  2017-12-05 21:53                       ` 答复: " Gregory Farnum
  1 sibling, 0 replies; 10+ messages in thread
From: Sage Weil @ 2017-12-05 21:37 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Igor Fedotov, ceph-devel, Xie Xingguo

On Tue, 5 Dec 2017, Sage Weil wrote:
> Can we figure out what problem the complex approach solves that the simple 
> approach doesn't?  I think the values are something like:
> 
>                           master     pool-proposal    2pc-pg-update
>  per-object sparseness      x             ?                ?
>  per-pg sparseness                                         x
>  per-pool sparseness                      x                x
> 
> I put ? because for a single object we can just query the backend with 
> a stat equivalent (like the fiemap ObjectStore method).  This is what, 
> say, rbd or cephfs would need to get a st_blocks value.
> 
> For a 'ceph df' column, the pool summation is what you need--not a pg 
> value.
> 
> Is there another user of this information I'm missing?
> 
> AFAICS the only real benefit to the 2pc complexity is a value that remains 
> perfectly accurate during backfill etc, whereas the pool-level summation 
> will drift slightly in that case.  Doesn't seem worth it to me?

I should add though that I'm not sure how well the pool-level stats will 
work out.  I think for BlueStore it means a new key, one per pool, with a 
summing merge operator.  We already do this for the overall store stats; 
this will either supplement that (2 small keys) or replace it (we can add 
the pool values up to get the overall value) so that there are the same
number of updates but they're spread over per-pool keys?
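
To illustrate the summing-merge idea (a toy, not BlueStore's actual
key schema or merge operator):

  // Toy sketch of a per-pool key with a summing merge: each write just
  // appends a delta, and the KV store's merge operator folds deltas
  // into the stored total without a read-modify-write per transaction.
  #include <cstdint>
  #include <map>
  #include <string>

  struct PoolStatKVSketch {
    std::map<std::string, int64_t> kv;  // stands in for the KV store

    // merge(key, delta): associative add, so deltas from many
    // transactions collapse into one value without reading it first
    void merge_add(const std::string& pool_key, int64_t delta_bytes) {
      kv[pool_key] += delta_bytes;
    }
    int64_t read(const std::string& pool_key) const {
      auto it = kv.find(pool_key);
      return it == kv.end() ? 0 : it->second;
    }
  };
  // e.g. merge_add("poolstat.3.allocated", +65536) per transaction; the
  // overall store value is then either a second small key or just the
  // sum over the per-pool keys.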

sage


* Re: 答复: osd: fine-grained statistics for object space usage
       [not found]                     ` <alpine.DEB.2.11.1712052123570.22619@piezo.novalocal>
  2017-12-05 21:37                       ` Sage Weil
@ 2017-12-05 21:53                       ` Gregory Farnum
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2017-12-05 21:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel, Xie Xingguo

On Tue, Dec 5, 2017 at 1:35 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 5 Dec 2017, Gregory Farnum wrote:
>> On Tue, Dec 5, 2017 at 12:48 PM, Igor Fedotov <ifedotov@suse.de> wrote:
>> >
>> >
>> > On 12/5/2017 1:15 AM, Gregory Farnum wrote:
>> >>
>> >> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@redhat.com> wrote:
>> >>>
>> >>> It's pretty straightforward to maintain collection-level metadata in the
>> >>> common case, but I don't see how we can *also* support an O(1) split
>> >>> operation.
>> >>
>> >> You're right we can't know the exact answer, but we already solve this
>> >> problem for PG object counts and things by doing a fuzzy estimate
>> >> (just dividing the PG values in two) until a scrub happens. I don't
>> >> think having to do the same here is a reason to avoid it entirely.
>
> Oh, right, I forgot about that.
>
>> >>> This is why I suggested per-pool metadata.  Pool-level
>> >>> information will still let us roll things up into a 'ceph df' type
>> >>> summary
>> >>> of how well data in a particular pool is compressing, how sparse it is,
>> >>> and so on, which should be sufficient for capacity planning purposes.
>> >>> We'll also have per-OSD (by pool) information, which will tell us how
>> >>> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>> >>>
>> >>> What we don't get is per-PG granularity.  I don't think this matters
>> >>> much,
>> >>> which a user doesn't really care about individual PGs anyway.
>> >>>
>> >>> We also don't get perfect accuracy when the cluster is degraded.  If
>> >>> one or more PGs in a pool is undergoing backfill or whatever, the
>> >>> OSD-level summations will be off.  We can *probably* figure out how to
>> >>> correct for that by scaling the result based on what we know about the PG
>> >>> recovery progress (e.g., how far along backfill on a PG is, and ignoring
>> >>> the log-based recovery as an insignificant).
>> >>
>> >> Users don't care much about per-PG granularity in general, but as you
>> >> note it breaks down in recovery. More than that, our *balancers* care
>> >> very much about exactly what's in each PG, don't they?
>
> The balancer is hands-off if there is any recovery going on (and throttles
> itself to limit the amount of misplaced/rebalancing data).

Even if the cluster's clean, if it doesn't know the sizes of PGs, it
doesn't know which ones it should shift around, right? Right now I
think it's just going on the summed logical HEAD object sizes, but
there are obvious problems with that in some scenarios.

>> >>>> - PG instance at each OSD node retrieves collection statistics from OS
>> >>>> when
>> >>>> needed or tracks  it in RAM only.
>> >>>> - Two statistics reports  to be distinguished:
>> >>>>    a. Cluster-wide PG report - processing OSD retrieves statistics from
>> >>>> both
>> >>>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG
>> >>>> physical
>> >>>> space usage can be obtained this way.
>> >>>>    b. OSD-wide PG report (or just simple OSD summary report) - OSD
>> >>>> collects PG
>> >>>> statistics from local PGs only. E.g. logical/physical space usage at
>> >>>> specific
>> >>>> OSD can be examined this way.
>> >>>
>> >>> ...and if we're talking about OSD-level stats, then I don't think any
>> >>> different update path is needed.  We would just statfs() to return a pool
>> >>> summation for each pool that exists on the OSD as well as the current
>> >>> osd_stat_t (or whatever it is).
>> >>>
>> >>> Does that seem reasonable?
>> >>
>> >> I'm saying it's a "replay" mechanism or a two-phase commit, but I
>> >> really don't think having delayed stat updates would take much doing.
>> >> We can modify our in-memory state as soon as the ObjectStore replies
>> >> back to us, and add a new "stats-persisted-thru" value to the pg_info.
>> >> On any subsequent writes, we update the pg stats according to what we
>> >> already know. Then on OSD boot, we compare that value to the last pg
>> >> write, and query any objects which changed in the unaccounted pg log
>> >> entries. It's a short, easy pass, right? And we're not talking new
>> >> blocking queues or anything.
>> >
>> > That's what I was thinking about too. Here is a very immature POC for this
>> > approach, seems doable so far:
>> >
>> > https://github.com/ceph/ceph/pull/19350
>>
>> To evaluate this usefully I think we'd need to see how these updates
>> get committed if the OSD crashes before they're persisted? I expect
>> that requires some kind of query interface...which, hrm, is actually a
>> little more complicated if this is the model.
>> I was just thinking we'd compare the on-disk allocation info for an
>> object to what we've persisted, but we actually only keep per-PG
>> stats, right? That's not great. :/
>
> This direction makes me very nervous.
>
> Can we figure out what problem the complex approach solves that the simple
> approach doesn't?

Maybe you can explain more clearly how this would work. I'm not really
seeing how to implement it efficiently in FileStore. Maybe maintain
size summations for each collection, and update them whenever we do
clones or truncate/append to files? But I think that would work just
as well for exposing PG-level space stats. So maybe we should do that.
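
(What I'm picturing is roughly the sketch below; the names are
invented and it glosses over how we'd get the st_blocks deltas
cheaply.)

  // Illustrative only: per-collection byte counters that FileStore
  // would bump alongside the filesystem operations it already does.
  #include <cstdint>
  #include <map>
  #include <string>

  struct CollectionSpaceSketch {
    int64_t logical_bytes = 0;
    int64_t allocated_bytes = 0;  // from st_blocks deltas of the files
  };

  class FileStoreSpaceSketch {
    std::map<std::string, CollectionSpaceSketch> per_coll;  // by coll id
  public:
    // called after a write/clone/truncate once the new st_blocks of the
    // backing file is known
    void account(const std::string& cid,
                 int64_t logical_delta, int64_t allocated_delta) {
      auto& c = per_coll[cid];
      c.logical_bytes += logical_delta;
      c.allocated_bytes += allocated_delta;
    }
  };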


> I think the values are something like:
>
>                           master     pool-proposal    2pc-pg-update
>  per-object sparseness      x             ?                ?
>  per-pg sparseness                                         x
>  per-pool sparseness                      x                x
>
> I put ? because for a single object we can just query the backend with
> a stat equivalent (like the fiemap ObjectStore method).  This is what,
> say, rbd or cephfs would need to get a st_blocks value.
>
> For a 'ceph df' column, the pool summation is what you need--not a pg
> value.
>
> Is there another user of this information I'm missing?
>
> AFAICS the only real benefit to the 2pc complexity is a value that remains
> perfectly accurate during backfill etc, whereas the pool-level summation
> will drift slightly in that case.  Doesn't seem worth it to me?

Hmm, it would also be great to resolve the "omaps don't count" thing,
which I don't think we have any other solutions for right now? Not
that this really helps much with that — we could add up the size of
input keys and values, but I don't see any way to efficiently support
omap deletes...
-Greg

* Re: 答复: osd: fine-grained statistics for object space usage
  2017-12-05 21:18                   ` Gregory Farnum
       [not found]                     ` <alpine.DEB.2.11.1712052123570.22619@piezo.novalocal>
@ 2017-12-05 21:58                     ` Igor Fedotov
  1 sibling, 0 replies; 10+ messages in thread
From: Igor Fedotov @ 2017-12-05 21:58 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel, Xie Xingguo



On 12/6/2017 12:18 AM, Gregory Farnum wrote:
> On Tue, Dec 5, 2017 at 12:48 PM, Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> On 12/5/2017 1:15 AM, Gregory Farnum wrote:
>>> On Mon, Dec 4, 2017 at 6:24 AM, Sage Weil <sweil@redhat.com> wrote:
>>>> It's pretty straightforward to maintain collection-level metadata in the
>>>> common case, but I don't see how we can *also* support an O(1) split
>>>> operation.
>>> You're right we can't know the exact answer, but we already solve this
>>> problem for PG object counts and things by doing a fuzzy estimate
>>> (just dividing the PG values in two) until a scrub happens. I don't
>>> think having to do the same here is a reason to avoid it entirely.
>>>
>>>
>>>> This is why I suggested per-pool metadata.  Pool-level
>>>> information will still let us roll things up into a 'ceph df' type
>>>> summary
>>>> of how well data in a particular pool is compressing, how sparse it is,
>>>> and so on, which should be sufficient for capacity planning purposes.
>>>> We'll also have per-OSD (by pool) information, which will tell us how
>>>> efficient, e.g., FileStore vs BlueStore is for a given data set (pool).
>>>>
>>>> What we don't get is per-PG granularity.  I don't think this matters
>>>> much,
>>>> which a user doesn't really care about individual PGs anyway.
>>>>
>>>> We also don't get perfect accuracy when the cluster is degraded.  If
>>>> one or more PGs in a pool is undergoing backfill or whatever, the
>>>> OSD-level summations will be off.  We can *probably* figure out how to
>>>> correct for that by scaling the result based on what we know about the PG
>>>> recovery progress (e.g., how far along backfill on a PG is, and ignoring
>>>> the log-based recovery as an insignificant).
>>> Users don't care much about per-PG granularity in general, but as you
>>> note it breaks down in recovery. More than that, our *balancers* care
>>> very much about exactly what's in each PG, don't they?
>>>
>>>>> - PG instance at each OSD node retrieves collection statistics from OS
>>>>> when
>>>>> needed or tracks  it in RAM only.
>>>>> - Two statistics reports  to be distinguished:
>>>>>     a. Cluster-wide PG report - processing OSD retrieves statistics from
>>>>> both
>>>>> local and remote PGs and sums it on per-PG basis. E.g. total per-PG
>>>>> physical
>>>>> space usage can be obtained this way.
>>>>>     b. OSD-wide PG report (or just simple OSD summary report) - OSD
>>>>> collects PG
>>>>> statistics from local PGs only. E.g. logical/physical space usage at
>>>>> specific
>>>>> OSD can be examined this way.
>>>> ...and if we're talking about OSD-level stats, then I don't think any
>>>> different update path is needed.  We would just statfs() to return a pool
>>>> summation for each pool that exists on the OSD as well as the current
>>>> osd_stat_t (or whatever it is).
>>>>
>>>> Does that seem reasonable?
>>> I'm saying it's a "replay" mechanism or a two-phase commit, but I
>>> really don't think having delayed stat updates would take much doing.
>>> We can modify our in-memory state as soon as the ObjectStore replies
>>> back to us, and add a new "stats-persisted-thru" value to the pg_info.
>>> On any subsequent writes, we update the pg stats according to what we
>>> already know. Then on OSD boot, we compare that value to the last pg
>>> write, and query any objects which changed in the unaccounted pg log
>>> entries. It's a short, easy pass, right? And we're not talking new
>>> blocking queues or anything.
>> That's what I was thinking about too. Here is a very immature POC for this
>> approach, seems doable so far:
>>
>> https://github.com/ceph/ceph/pull/19350
> To evaluate this usefully I think we'd need to see how these updates
> get committed if the OSD crashes before they're persisted? I expect
> that requires some kind of query interface...which, hrm, is actually a
> little more complicated if this is the model.
Well, here is a brief overview of the model. IMO it does handle
crashes...
1) While handling a batch of transactions submitted via
queue_transaction, BlueStore collects statistics changes on a
per-collection basis and appends additional transactions to the batch
to persist them. I.e. at the BlueStore level these changes are
committed along with the original write transactions. BlueStore also
keeps these changes within the collection object until an explicit
reset, and can return them to the upper level via a corresponding API
call.
2) On the next transaction submission the OSD/PG retrieves the
previous submission's changes from BlueStore, applies them to its own
statistics, and appends new transactions to make them persistent (at
the OSD level). The ObjectStore API is to be extended to trigger
cleanup of the OS-level changes along with the PG-related stats update
- e.g. an additional flag on the omap_setkeys transaction to request a
reset of the OS-level statistics changes. While handling this new
batch, BlueStore resets the persisted changes from the previous stage
and inserts new changes, if any. Step 2) can be repeated any number of
times.

If the OSD crashes between stages 1) and 2), recovery happens
automatically when new transactions are submitted after the OSD
restarts - the changes are taken from BlueStore and applied while
processing that new transaction.
The small drawback of the approach is that PG stats are one step
behind the actual values. This can be either tolerated or handled with
simple tricks at statistics retrieval time: return the current PG
stats plus the ones preserved at the OS, or track that delta at the
OSD separately, etc.
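
In sketch form (names invented; the PR above is the actual POC, this
just restates the two stages):

  // Rough restatement of the flow above with invented names.
  #include <cstdint>

  struct StatDeltaSketch { int64_t allocated = 0; int64_t stored = 0; };

  // 1) BlueStore: while applying a batch of transactions it accumulates
  //    the per-collection delta, persists it with the same commit, and
  //    keeps it until the PG explicitly consumes it.
  struct CollectionDeltaSketch {
    StatDeltaSketch unconsumed;  // also persisted in the KV store
    void add(const StatDeltaSketch& d) {
      unconsumed.allocated += d.allocated;
      unconsumed.stored += d.stored;
    }
  };

  // 2) OSD/PG: on the *next* submission, fetch the previous delta, fold
  //    it into the pg stats, and ask the store to clear it in the same
  //    transaction.
  void fold_previous_delta(CollectionDeltaSketch& coll,
                           StatDeltaSketch& pg_stats)
  {
    pg_stats.allocated += coll.unconsumed.allocated;
    pg_stats.stored += coll.unconsumed.stored;
    coll.unconsumed = StatDeltaSketch{};  // reset rides with this commit
  }
  // A crash between 1) and 2) is fine: the delta is still persisted in
  // the store, so the first submission after restart picks it up; the
  // pg stats simply lag by one step.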

Have I missed something?
> I was just thinking we'd compare the on-disk allocation info for an
> object to what we've persisted, but we actually only keep per-PG
> stats, right? That's not great. :/

