All of lore.kernel.org
 help / color / mirror / Atom feed
* pg_stat_t is 500+ bytes
@ 2016-08-23 18:13 Sage Weil
  2016-08-23 19:39 ` Mark Nelson
  2016-08-24  3:01 ` Haomai Wang
  0 siblings, 2 replies; 16+ messages in thread
From: Sage Weil @ 2016-08-23 18:13 UTC (permalink / raw)
  To: ceph-devel

This is huge.  It takes the encoded pg_info_t from 306 bytes to 847 bytes, and 
this _info omap key is rewritten on *every* IO.

We could shrink this down significantly with varint and/or delta encoding, 
since a huge portion of it is just a bunch of uint64_t counters.  This 
will probably cost some CPU time, but OTOH it'll also shrink our metadata 
down a fair bit, which will pay off later.
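As a rough illustration of the varint idea (this is a generic LEB128-style
sketch, not Ceph's actual denc code), small counters encode in one or two
bytes instead of eight:

```cpp
#include <cstdint>
#include <vector>

// LEB128-style varint: 7 payload bits per byte, high bit means "more
// bytes follow".  A counter under 128 costs 1 byte instead of 8.
std::vector<uint8_t> varint_encode(uint64_t v) {
    std::vector<uint8_t> out;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        if (v)
            b |= 0x80;          // more bytes follow
        out.push_back(b);
    } while (v);
    return out;
}

uint64_t varint_decode(const std::vector<uint8_t> &in) {
    uint64_t v = 0;
    unsigned shift = 0;
    for (uint8_t b : in) {
        v |= uint64_t(b & 0x7f) << shift;
        if (!(b & 0x80))
            break;
        shift += 7;
    }
    return v;
}
```

The CPU cost is the per-byte loop and branch, which is what makes the
tradeoff against encode time worth measuring.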

Anybody want to tackle this?
sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
@ 2016-08-23 19:39 ` Mark Nelson
  2016-08-24  3:01 ` Haomai Wang
  1 sibling, 0 replies; 16+ messages in thread
From: Mark Nelson @ 2016-08-23 19:39 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 08/23/2016 01:13 PM, Sage Weil wrote:
> This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
> this _info omap key is rewritten on *every* IO.
>
> We could shrink this down significant with varint and/or delta encoding
> since a huge portion of it is just a bunch of uint64_t counters.  This
> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> down a fair bit too which will pay off later.

I'm a little nervous given how heavy we are on CPU already for small IO. 
How much of this is typically empty space?  Could a fast packing 
implementation work here?

>
> Anybody want to tackle this?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
  2016-08-23 19:39 ` Mark Nelson
@ 2016-08-24  3:01 ` Haomai Wang
  2016-08-24  4:02   ` Haomai Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-24  3:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
> this _info omap key is rewritten on *every* IO.
>
> We could shrink this down significant with varint and/or delta encoding
> since a huge portion of it is just a bunch of uint64_t counters.  This
> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> down a fair bit too which will pay off later.
>
> Anybody want to tackle this?

What about separating "object_stat_collection_t stats" from pg_stat_t?
pg info should be unchanged most of the time; we could update only the
object-related stats.  This might cut the bytes written roughly in half.

> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-24  3:01 ` Haomai Wang
@ 2016-08-24  4:02   ` Haomai Wang
  2016-08-24 16:09     ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-24  4:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
>> this _info omap key is rewritten on *every* IO.
>>
>> We could shrink this down significant with varint and/or delta encoding
>> since a huge portion of it is just a bunch of uint64_t counters.  This
>> will probably cost some CPU time, but OTOH it'll also shrink our metadata
>> down a fair bit too which will pay off later.
>>
>> Anybody want to tackle this?
>
> what about separating "object_stat_collection_t stats" from pg_stat_t?
> pg info should be unchanged for most of times, we could only update
> object related stats. This may help to reduce half bytes.

Or we could store only increment values and keep the full structure in
memory (this may reduce each write to 20-30 bytes).  Periodically we would
store the full structure (only hundreds of bytes)....
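A minimal sketch of that scheme, assuming a hypothetical kv_put /
encode_delta interface (these names are placeholders, not actual Ceph
APIs):

```cpp
#include <cstdint>

// Sketch of the delta-plus-checkpoint idea: persist a small increment
// record per IO and a full snapshot every `interval` IOs.  The kv writes
// are commented out because the key names and encoders here are
// hypothetical.
struct StatWriter {
    uint64_t seq = 0;
    uint64_t interval;
    explicit StatWriter(uint64_t n) : interval(n) {}

    // returns true when this IO writes a full snapshot instead of a delta
    bool on_io() {
        ++seq;
        if (seq % interval == 0) {
            // kv_put("_info_full", encode_full());    // hundreds of bytes
            return true;
        }
        // kv_put("_info_delta", encode_delta());      // ~20-30 bytes
        return false;
    }
};

// Recovery would replay at most `interval - 1` deltas on top of the last
// snapshot; this helper just counts the snapshot writes.
int count_full_writes(uint64_t ios, uint64_t interval) {
    StatWriter w(interval);
    int fulls = 0;
    for (uint64_t i = 0; i < ios; ++i)
        if (w.on_io())
            ++fulls;
    return fulls;
}
```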

>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-24  4:02   ` Haomai Wang
@ 2016-08-24 16:09     ` Sage Weil
  2016-08-24 16:12       ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-08-24 16:09 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Wed, 24 Aug 2016, Haomai Wang wrote:
> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> > On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >> This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
> >> this _info omap key is rewritten on *every* IO.
> >>
> >> We could shrink this down significant with varint and/or delta encoding
> >> since a huge portion of it is just a bunch of uint64_t counters.  This
> >> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> >> down a fair bit too which will pay off later.
> >>
> >> Anybody want to tackle this?
> >
> > what about separating "object_stat_collection_t stats" from pg_stat_t?
> > pg info should be unchanged for most of times, we could only update
> > object related stats. This may help to reduce half bytes.

I don't think this will work, since every op changes last_update in 
pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
 
> Or we only store increment values and keep the full in memory(may
> reduce to 20-30bytes). In period time we store the full structure(only
> hundreds of bytes)....

A delta is probably very compressible (only a few fields in the stats 
struct change).  The question is how fast can we make it in CPU time.  
Probably a simple delta (which will be almost all 0's) and a trivial 
run-length-encoding scheme that just gets rid of the 0's would do well 
enough...

sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-24 16:09     ` Sage Weil
@ 2016-08-24 16:12       ` Mark Nelson
  2016-08-24 16:20         ` Piotr Dałek
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-24 16:12 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: ceph-devel



On 08/24/2016 11:09 AM, Sage Weil wrote:
> On Wed, 24 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>>> This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
>>>> this _info omap key is rewritten on *every* IO.
>>>>
>>>> We could shrink this down significant with varint and/or delta encoding
>>>> since a huge portion of it is just a bunch of uint64_t counters.  This
>>>> will probably cost some CPU time, but OTOH it'll also shrink our metadata
>>>> down a fair bit too which will pay off later.
>>>>
>>>> Anybody want to tackle this?
>>>
>>> what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> pg info should be unchanged for most of times, we could only update
>>> object related stats. This may help to reduce half bytes.
>
> I don't think this will work, since every op changes last_update in
> pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>
>> Or we only store increment values and keep the full in memory(may
>> reduce to 20-30bytes). In period time we store the full structure(only
>> hundreds of bytes)....
>
> A delta is probably very compressible (only a few fields in the stats
> struct change).  The question is how fast can we make it in CPU time.
> Probably a simple delta (which will be almost all 0's) and a trivial
> run-length-encoding scheme that just gets rid of the 0's would do well
> enough...

Do we have any rough idea of how many/often consecutive 0s we end up 
with in the current encoding?

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-24 16:12       ` Mark Nelson
@ 2016-08-24 16:20         ` Piotr Dałek
  2016-08-25 12:54           ` Haomai Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Piotr Dałek @ 2016-08-24 16:20 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, Haomai Wang, ceph-devel

On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
> 
> 
> On 08/24/2016 11:09 AM, Sage Weil wrote:
> >On Wed, 24 Aug 2016, Haomai Wang wrote:
> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
> >>>>this _info omap key is rewritten on *every* IO.
> >>>>
> >>>>We could shrink this down significant with varint and/or delta encoding
> >>>>since a huge portion of it is just a bunch of uint64_t counters.  This
> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
> >>>>down a fair bit too which will pay off later.
> >>>>
> >>>>Anybody want to tackle this?
> >>>
> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
> >>>pg info should be unchanged for most of times, we could only update
> >>>object related stats. This may help to reduce half bytes.
> >
> >I don't think this will work, since every op changes last_update in
> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
> >
> >>Or we only store increment values and keep the full in memory(may
> >>reduce to 20-30bytes). In period time we store the full structure(only
> >>hundreds of bytes)....
> >
> >A delta is probably very compressible (only a few fields in the stats
> >struct change).  The question is how fast can we make it in CPU time.
> >Probably a simple delta (which will be almost all 0's) and a trivial
> >run-length-encoding scheme that just gets rid of the 0's would do well
> >enough...
> 
> Do we have any rough idea of how many/often consecutive 0s we end up
> with in the current encoding?

Or how high these counters get? We could try transposing the matrix made of
those counters. At least the two most significant bytes in most of those
counters are mostly zeros, and after transposing, simple RLE would be
feasible. In any case, I'm not sure if *all* of these fields need to be
uint64_t.

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-24 16:20         ` Piotr Dałek
@ 2016-08-25 12:54           ` Haomai Wang
  2016-08-25 12:56             ` Haomai Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-25 12:54 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel

I think there are a lot of fields that can't be discarded just by testing
for 0s, like utime_t and epoch_t. A simple alternative is to compare with
the previous pg_info_t.

BTW, I want to mention that pg_info_t encoding accounts for 6.05% of CPU
time in the pg thread (thread level, not process level).

It looks like we have three optimizations from Mark and Piotr:

1. restructure pg_info_t so the frequently changing fields sit together.
2. shrink some fields to smaller widths.
3. use SIMD to speed up the comparison of low-frequency fields (optional).

Intel's SSE4.2 pcmpestrm instruction can compare 16 bytes per iteration
(the implicit-length pcmpistrm form stops at zero bytes, so the
explicit-length form is needed for binary data); pg_info_t is above 700
bytes:

#include <nmmintrin.h>  // SSE4.2 intrinsics

// Return a pointer to the first byte where p and p2 differ; both buffers
// must be at least len bytes long, with len a multiple of 16.
inline const char *skip_same_bytes_SIMD(const char *p, const char *p2,
                                        size_t len) {
    for (const char *end = p + len; p < end; p += 16, p2 += 16) {
        const __m128i s = _mm_loadu_si128((const __m128i *)p);
        const __m128i w = _mm_loadu_si128((const __m128i *)p2);
        // positional compare of all 16 bytes; bit i of r is set where
        // the two chunks differ
        const unsigned r = _mm_cvtsi128_si32(_mm_cmpestrm(w, 16, s, 16,
            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
            _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));

        if (r != 0)  // some byte isn't equal
            return p + __builtin_ffs(r) - 1;
    }
    return p;  // no difference within len bytes
}

On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>
>>
>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
>> >>>>this _info omap key is rewritten on *every* IO.
>> >>>>
>> >>>>We could shrink this down significant with varint and/or delta encoding
>> >>>>since a huge portion of it is just a bunch of uint64_t counters.  This
>> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
>> >>>>down a fair bit too which will pay off later.
>> >>>>
>> >>>>Anybody want to tackle this?
>> >>>
>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>> >>>pg info should be unchanged for most of times, we could only update
>> >>>object related stats. This may help to reduce half bytes.
>> >
>> >I don't think this will work, since every op changes last_update in
>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>> >
>> >>Or we only store increment values and keep the full in memory(may
>> >>reduce to 20-30bytes). In period time we store the full structure(only
>> >>hundreds of bytes)....
>> >
>> >A delta is probably very compressible (only a few fields in the stats
>> >struct change).  The question is how fast can we make it in CPU time.
>> >Probably a simple delta (which will be almost all 0's) and a trivial
>> >run-length-encoding scheme that just gets rid of the 0's would do well
>> >enough...
>>
>> Do we have any rough idea of how many/often consecutive 0s we end up
>> with in the current encoding?
>
> Or how high these counters get? We could try transposing the matrix made of
> those counters. At least the two most significant bytes in most of those
> counters are mostly zeros, and after transposing, simple RLE would be
> feasible. In any case, I'm not sure if *all* of these fields need to be
> uint64_t.
>
> --
> Piotr Dałek
> branch@predictor.org.pl
> http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-25 12:54           ` Haomai Wang
@ 2016-08-25 12:56             ` Haomai Wang
  2016-08-26  1:28               ` Somnath Roy
  2016-08-27  0:00               ` Somnath Roy
  0 siblings, 2 replies; 16+ messages in thread
From: Haomai Wang @ 2016-08-25 12:56 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel

On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.

Oh, sorry. We could instead use a fresh pg_info_t as the reference. But the
points below still apply either way.

>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>
>     for (;; p += 16) {
>         const __m128i s = _mm_load_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0) // some of characters isn't equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847 bytes, and
>>> >>>>this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta encoding
>>> >>>>since a huge portion of it is just a bunch of uint64_t counters.  This
>>> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
>>> >>>>down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only update
>>> >>>object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full structure(only
>>> >>hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the stats
>>> >struct change).  The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a trivial
>>> >run-length-encoding scheme that just gets rid of the 0's would do well
>>> >enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix made of
>> those counters. At least the two most significant bytes in most of those
>> counters are mostly zeros, and after transposing, simple RLE would be
>> feasible. In any case, I'm not sure if *all* of these fields need to be
>> uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@predictor.org.pl
>> http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: pg_stat_t is 500+ bytes
  2016-08-25 12:56             ` Haomai Wang
@ 2016-08-26  1:28               ` Somnath Roy
  2016-08-26  3:04                 ` Mark Nelson
  2016-08-27  0:00               ` Somnath Roy
  1 sibling, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-26  1:28 UTC (permalink / raw)
  To: Haomai Wang, Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel

Sage,
I can see that for statistics we are storing 40 bytes for each IO:

Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)

Should we really store it on each IO?

1. IMO, gathering/storing stats should be configurable.

2. If enabled, I think we can keep the counters in memory and persist them to the db at some configurable interval?

BTW, as you mentioned, pg_info is huge, 855 bytes.

Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)


Thanks & Regards
Somnath


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: pg_stat_t is 500+ bytes

On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.

Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.

>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference
> comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>
>     for (;; p += 16) {
>         const __m128i s = _mm_load_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0) // some of characters isn't equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta
>>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters.  This will probably cost some CPU time, but OTOH it'll
>>> >>>>also shrink our metadata down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only
>>> >>>update object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full
>>> >>structure(only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change).  The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a
>>> >trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes in
>> most of those counters are mostly zeros, and after transposing,
>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>> these fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@predictor.org.pl
>> http://blog.predictor.org.pl
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-26  1:28               ` Somnath Roy
@ 2016-08-26  3:04                 ` Mark Nelson
  2016-08-26  3:11                   ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26  3:04 UTC (permalink / raw)
  To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel

On 08/25/2016 08:28 PM, Somnath Roy wrote:
> Sage,
> I can see for statistics we are storing 40 bytes for each IO..
>
> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>
> Should we really store it on each IO ?
>
> 1. IMO, gathering/storing stats should be configurable
>
> 2. If enabled, I think we can have counters maintained in memory and persist that to db between some configurable interval ?
>
> BTW, as you mentioned, pg_info is huge, 855 bytes.
>
> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)

I spent some time on this today.  In pg_stat_t we have:

eversion_t: count 5 (80 bytes)
utime_t: count 13 (92 bytes)
object_stat_collection_t: count 1 (256 bytes)
pg_t: count 1 (20 bytes)
up vector: count 1 (16 bytes for 3x rep)
acting_vector: count 1 (16 bytes for 3x rep)
other junk: 74 bytes

some of these will probably take varint encoding pretty well, while I 
suspect others (like the nsec field in utime_t) it won't help much.  I'm 
going through now to get a coarse grain look.  I'm not thrilled with the 
varint encoding overhead, but it does appear to be effective in this 
case.  Right now in a test of master, pg_info_t is consistently 847 
bytes on my single-osd test setup.  Switching object_stat_sum_t fields 
to varint encoding yields an average size of 630.562 with almost all 
sizes either 630 or 631.

256 - (847 - 631) = 40 bytes (vs 256 originally!)

There are 34 fields in pg_info_t, so almost all of them are encoding 
into a 1 byte value in this test.  It's possible that lowz encoding 
might reduce this further.

Mark

>
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Thursday, August 25, 2016 5:56 AM
> To: Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: pg_stat_t is 500+ bytes
>
> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>> I think there exists a lot of fields can't be discarded via judging
>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>> pg_info_t.
>
> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.
>
>>
>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>> thread(thread level not process level).
>>
>> looks we have three optimization from mark and Piotr:
>>
>> 1. reconstruct pg_info_t and make high frequent fields together.
>> 2. change some fields to smaller bits
>> 3. uses SIMD to optimize low frequency fields difference
>> comparison(optional)
>>
>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>> comparison, pg_info_t is above 700bytes:
>>
>> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>>
>>     for (;; p += 16) {
>>         const __m128i s = _mm_load_si128((const __m128i *)p);
>>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>
>>         if (r != 0) // some of characters isn't equal
>>             return p + __builtin_ffs(r) - 1;
>>     }
>> }
>>
>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>
>>>>
>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>>> This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>
>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>> counters.  This will probably cost some CPU time, but OTOH it'll
>>>>>>>> also shrink our metadata down a fair bit too which will pay off later.
>>>>>>>>
>>>>>>>> Anybody want to tackle this?
>>>>>>>
>>>>>>> what about separating "object_stat_collection_t stats" from pg_stat_t?
>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>
>>>>> I don't think this will work, since every op changes last_update in
>>>>> pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>>>>
>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>> structure(only hundreds of bytes)....
>>>>>
>>>>> A delta is probably very compressible (only a few fields in the
>>>>> stats struct change).  The question is how fast can we make it in CPU time.
>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>> would do well enough...
>>>>
>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>> with in the current encoding?
>>>
>>> Or how high these counters get? We could try transposing the matrix
>>> made of those counters. At least the two most significant bytes in
>>> most of those counters are mostly zeros, and after transposing,
>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>> these fields need to be uint64_t.
>>>
>>> --
>>> Piotr Dałek
>>> branch@predictor.org.pl
>>> http://blog.predictor.org.pl
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-26  3:04                 ` Mark Nelson
@ 2016-08-26  3:11                   ` Mark Nelson
  2016-08-26  5:36                     ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26  3:11 UTC (permalink / raw)
  To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel



On 08/25/2016 10:04 PM, Mark Nelson wrote:
> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>> Sage,
>> I can see for statistics we are storing 40 bytes for each IO..
>>
>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>
>> Should we really store it on each IO ?
>>
>> 1. IMO, gathering/storing stats should be configurable
>>
>> 2. If enabled, I think we can have counters maintained in memory and
>> persist that to db between some configurable interval ?
>>
>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>
>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>
> I spent some time on this today.  In pg_stat_t we have:
>
> eversion_t: count 5 (80 bytes)
> utime_t: count 13 (92 bytes)
> object_stat_collection_t: count 1 (256 bytes)
> pg_t: count 1 (20 bytes)
> up vector: count 1 (16 bytes for 3x rep)
> acting_vector: count 1 (16 bytes for 3x rep)
> other junk: 74 bytes
>
> some of these will probably take varint encoding pretty well, while I
> suspect others (like the nsec field in utime_t) it won't help much.  I'm
> going through now to get a coarse grain look.  I'm not thrilled with the
> varint encoding overhead, but it does appear to be effective in this
> case.  Right now in a test of master, pg_info_t is consistently 847
> bytes on my single-osd test setup.  Switching object_stat_sum_t fields
> to varint encoding yields an average size of 630.562 with almost all
> sizes either 630 or 631.
>
> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>
> There are 34 fields in pg_info_t, so almost all of them are encoding
> into a 1 byte value in this test.  It's possible that lowz encoding
> might reduce this further.

oops, there are 34 fields in object_stat_collection_t rather.

>
> Mark
>
>>
>>
>> Thanks & Regards
>> Somnath
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>> Sent: Thursday, August 25, 2016 5:56 AM
>> To: Piotr Dałek
>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>> Subject: Re: pg_stat_t is 500+ bytes
>>
>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>> I think there exists a lot of fields can't be discarded via judging
>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>> pg_info_t.
>>
>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>> also could apply this way.
>>
>>>
>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>> thread(thread level not process level).
>>>
>>> looks we have three optimization from mark and Piotr:
>>>
>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>> 2. change some fields to smaller bits
>>> 3. uses SIMD to optimize low frequency fields difference
>>> comparison(optional)
>>>
>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>> comparison, pg_info_t is above 700bytes:
>>>
>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>> char* p2) {
>>>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>
>>>     for (;; p += 16) {
>>>         const __m128i s = _mm_load_si128((const __m128i *)p);
>>>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>
>>>         if (r != 0) // some of characters isn't equal
>>>             return p + __builtin_ffs(r) - 1;
>>>     }
>>> }
>>>
>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>> <branch@predictor.org.pl> wrote:
>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>
>>>>>
>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>> wrote:
>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>> wrote:
>>>>>>>>> This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>
>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>> counters.  This will probably cost some CPU time, but OTOH it'll
>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> Anybody want to tackle this?
>>>>>>>>
>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>> pg_stat_t?
>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>
>>>>>> I don't think this will work, since every op changes last_update in
>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>> etc.).
>>>>>>
>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>> structure(only hundreds of bytes)....
>>>>>>
>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>> stats struct change).  The question is how fast can we make it in
>>>>>> CPU time.
>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>> would do well enough...
>>>>>
>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>> with in the current encoding?
>>>>
>>>> Or how high these counters get? We could try transposing the matrix
>>>> made of those counters. At least the two most significant bytes in
>>>> most of those counters are mostly zeros, and after transposing,
>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>> these fields need to be uint64_t.
>>>>
>>>> --
>>>> Piotr Dałek
>>>> branch@predictor.org.pl
>>>> http://blog.predictor.org.pl
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-26  3:11                   ` Mark Nelson
@ 2016-08-26  5:36                     ` Mark Nelson
  2016-08-29 13:27                       ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26  5:36 UTC (permalink / raw)
  To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel

On 08/25/2016 10:11 PM, Mark Nelson wrote:
>
>
> On 08/25/2016 10:04 PM, Mark Nelson wrote:
>> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>>> Sage,
>>> I can see for statistics we are storing 40 bytes for each IO..
>>>
>>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>>
>>> Should we really store it on each IO ?
>>>
>>> 1. IMO, gathering/storing stats should be configurable
>>>
>>> 2. If enabled, I think we can have counters maintained in memory and
>>> persist that to db between some configurable interval ?
>>>
>>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>>
>>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>>
>> I spent some time on this today.  In pg_stat_t we have:
>>
>> eversion_t: count 5 (80 bytes)
>> utime_t: count 13 (92 bytes)
>> object_stat_collection_t: count 1 (256 bytes)
>> pg_t: count 1 (20 bytes)
>> up vector: count 1 (16 bytes for 3x rep)
>> acting_vector: count 1 (16 bytes for 3x rep)
>> other junk: 74 bytes
>>
>> some of these will probably take varint encoding pretty well, while I
>> suspect others (like the nsec field in utime_t) it won't help much.  I'm
>> going through now to get a coarse grain look.  I'm not thrilled with the
>> varint encoding overhead, but it does appear to be effective in this
>> case.  Right now in a test of master, pg_info_t is consistently 847
>> bytes on my single-osd test setup.  Switching object_stat_sum_t fields
>> to varint encoding yields an average size of 630.562 with almost all
>> sizes either 630 or 631.
>>
>> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>>
>> There are 34 fields in pg_info_t, so almost all of them are encoding
>> into a 1 byte value in this test.  It's possible that lowz encoding
>> might reduce this further.
>
> oops, there are 34 fields in object_stat_collection_t rather.

Ok, last update until Monday since I'm off tomorrow.  I went through and 
changed the eversion_t, utime_t, pg_t, and "other junk" fields to use varint 
encoding as well.  This also happened to hit the eversion_t structs in 
pg_info_t.  I didn't do the vectors in pg_stat_t or the other fields in 
pg_stat_t out of laziness, but those would probably shrink pretty nicely too.

After the above, the average encode size in my test is now down to 
453.244.  I suspect we could pretty reasonably get that down to 300-400 
by tweaking things further.  Larger clusters with more OSDs and such 
might not shrink as well though.
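For reference, the kind of varint encoding discussed here can be sketched as 
a LEB128-style scheme (this is an illustrative sketch, not Ceph's actual 
denc implementation): 7 data bits per byte with a continuation bit, so the 
small counters that dominate object_stat_sum_t encode to a single byte 
instead of 8.

```cpp
#include <cstdint>
#include <vector>

// LEB128-style varint: 7 data bits per byte, high bit set on every byte
// except the last.  Counters < 128 (the common case) take exactly 1 byte.
std::vector<uint8_t> encode_varint(uint64_t v) {
    std::vector<uint8_t> out;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        if (v)
            b |= 0x80;  // more bytes follow
        out.push_back(b);
    } while (v);
    return out;
}

uint64_t decode_varint(const std::vector<uint8_t>& in) {
    uint64_t v = 0;
    int shift = 0;
    for (uint8_t b : in) {
        v |= uint64_t(b & 0x7f) << shift;
        shift += 7;
        if (!(b & 0x80))
            break;
    }
    return v;
}
```

The per-byte shift-and-test loop is also where the CPU cost comes from: each 
field turns into a data-dependent branch instead of a fixed-width memcpy.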

Note I didn't bother to preserve backwards compatibility by increasing 
the versioning which would need to be done (and a little ugly) in any 
real implementation of this.

Mark

>
>>
>> Mark
>>
>>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>> Sent: Thursday, August 25, 2016 5:56 AM
>>> To: Piotr Dałek
>>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>>> Subject: Re: pg_stat_t is 500+ bytes
>>>
>>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>>> I think there exists a lot of fields can't be discarded via judging
>>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>>> pg_info_t.
>>>
>>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>>> also could apply this way.
>>>
>>>>
>>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>>> thread(thread level not process level).
>>>>
>>>> looks we have three optimization from mark and Piotr:
>>>>
>>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>>> 2. change some fields to smaller bits
>>>> 3. uses SIMD to optimize low frequency fields difference
>>>> comparison(optional)
>>>>
>>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>>> comparison, pg_info_t is above 700bytes:
>>>>
>>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>>> char* p2) {
>>>>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>>
>>>>     for (;; p += 16) {
>>>>         const __m128i s = _mm_load_si128((const __m128i *)p);
>>>>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>>>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>>>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>>
>>>>         if (r != 0) // some of characters isn't equal
>>>>             return p + __builtin_ffs(r) - 1;
>>>>     }
>>>> }
>>>>
>>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>>> <branch@predictor.org.pl> wrote:
>>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>>
>>>>>>
>>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>>> wrote:
>>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>> This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>>
>>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>>> counters.  This will probably cost some CPU time, but OTOH it'll
>>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>>> later.
>>>>>>>>>>
>>>>>>>>>> Anybody want to tackle this?
>>>>>>>>>
>>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>>> pg_stat_t?
>>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>>
>>>>>>> I don't think this will work, since every op changes last_update in
>>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>>> etc.).
>>>>>>>
>>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>>> structure(only hundreds of bytes)....
>>>>>>>
>>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>>> stats struct change).  The question is how fast can we make it in
>>>>>>> CPU time.
>>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>>> would do well enough...
>>>>>>
>>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>>> with in the current encoding?
>>>>>
>>>>> Or how high these counters get? We could try transposing the matrix
>>>>> made of those counters. At least the two most significant bytes in
>>>>> most of those counters are mostly zeros, and after transposing,
>>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>>> these fields need to be uint64_t.
>>>>>
>>>>> --
>>>>> Piotr Dałek
>>>>> branch@predictor.org.pl
>>>>> http://blog.predictor.org.pl
>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: pg_stat_t is 500+ bytes
  2016-08-25 12:56             ` Haomai Wang
  2016-08-26  1:28               ` Somnath Roy
@ 2016-08-27  0:00               ` Somnath Roy
  2016-08-29 13:28                 ` Sage Weil
  1 sibling, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-27  0:00 UTC (permalink / raw)
  To: Haomai Wang, Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel

Sage/Sam,
I was going through the pg_info_t structures and it seems the following fields are duplicates among the different inner structures.

eversion_t last_scrub;           -> duplicated between pg_stat_t and pg_history_t
eversion_t last_deep_scrub;      -> duplicated between pg_stat_t and pg_history_t
utime_t last_scrub_stamp;        -> duplicated between pg_stat_t and pg_history_t
utime_t last_deep_scrub_stamp;   -> duplicated between pg_stat_t and pg_history_t
utime_t last_clean_scrub_stamp;  -> duplicated between pg_stat_t and pg_history_t

pg_stat_t -> created is duplicate of pg_history_t -> epoch_created ?

pg_stat_t->last_became_active is duplicate of pg_stat_t->last_active ?
pg_stat_t->last_became_peered is duplicate of pg_stat_t->last_peered ?

We could probably also make the following quick optimization:
the fields of object_stat_sum_t below can only hold the values 0 and 1, so why not replace them with structure bit fields?

  int32_t num_flush_mode_high;  // 1 when in high flush mode, otherwise 0
  int32_t num_flush_mode_low;   // 1 when in low flush mode, otherwise 0
  int32_t num_evict_mode_some;  // 1 when in evict some mode, otherwise 0
  int32_t num_evict_mode_full;  // 1 when in evict full mode, otherwise 0

The same goes for the following field of pg_info_t, which we could also replace with a bit field.

bool last_backfill_bitwise

These will replace 93 bytes with 5 bits.
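The bit-field packing suggested above could look something like this 
(hypothetical struct name, mirroring the field names quoted above; not the 
actual Ceph definition):

```cpp
#include <cstdint>

// Hypothetical packing of the five 0/1 fields above.  Each only ever holds
// 0 or 1, so a 1-bit field suffices; the five bits share a single uint8_t
// allocation unit, so the whole struct occupies one byte instead of
// 4 x int32_t + 1 x bool.
struct packed_flags_t {
    uint8_t num_flush_mode_high   : 1;  // 1 when in high flush mode
    uint8_t num_flush_mode_low    : 1;  // 1 when in low flush mode
    uint8_t num_evict_mode_some   : 1;  // 1 when in evict-some mode
    uint8_t num_evict_mode_full   : 1;  // 1 when in evict-full mode
    uint8_t last_backfill_bitwise : 1;  // was a bool in pg_info_t
};
```

Note this only shrinks the in-memory/encoded footprint; the encode/decode 
functions would still need to pack and unpack the byte explicitly to keep 
the on-disk format well defined.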

Am I missing anything ?

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 25, 2016 6:29 PM
To: 'Haomai Wang'; Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: RE: pg_stat_t is 500+ bytes

Sage,
I can see for statistics we are storing 40 bytes for each IO..

Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)

Should we really store it on each IO ?

1. IMO, gathering/storing stats should be configurable

2. If enabled, I think we can have counters maintained in memory and persist that to db between some configurable interval ?

BTW, as you mentioned, pg_info is huge, 855 bytes.

Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)


Thanks & Regards
Somnath


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: pg_stat_t is 500+ bytes

On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.

Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.

>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference
> comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>
>     for (;; p += 16) {
>         const __m128i s = _mm_load_si128((const __m128i *)p);
>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
>         if (r != 0) // some of characters isn't equal
>             return p + __builtin_ffs(r) - 1;
>     }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta
>>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters.  This will probably cost some CPU time, but OTOH it'll
>>> >>>>also shrink our metadata down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only
>>> >>>update object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full
>>> >>structure(only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change).  The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a
>>> >trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes in
>> most of those counters are mostly zeros, and after transposing,
>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>> these fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@predictor.org.pl
>> http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: pg_stat_t is 500+ bytes
  2016-08-26  5:36                     ` Mark Nelson
@ 2016-08-29 13:27                       ` Mark Nelson
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Nelson @ 2016-08-29 13:27 UTC (permalink / raw)
  To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel

On 08/26/2016 12:36 AM, Mark Nelson wrote:
> On 08/25/2016 10:11 PM, Mark Nelson wrote:
>>
>>
>> On 08/25/2016 10:04 PM, Mark Nelson wrote:
>>> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>>>> Sage,
>>>> I can see for statistics we are storing 40 bytes for each IO..
>>>>
>>>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>>>
>>>> Should we really store it on each IO ?
>>>>
>>>> 1. IMO, gathering/storing stats should be configurable
>>>>
>>>> 2. If enabled, I think we can have counters maintained in memory and
>>>> persist that to db between some configurable interval ?
>>>>
>>>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>>>
>>>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>>>
>>> I spent some time on this today.  In pg_stat_t we have:
>>>
>>> eversion_t: count 5 (80 bytes)
>>> utime_t: count 13 (92 bytes)
>>> object_stat_collection_t: count 1 (256 bytes)
>>> pg_t: count 1 (20 bytes)
>>> up vector: count 1 (16 bytes for 3x rep)
>>> acting_vector: count 1 (16 bytes for 3x rep)
>>> other junk: 74 bytes
>>>
>>> some of these will probably take varint encoding pretty well, while I
>>> suspect others (like the nsec field in utime_t) it won't help much.  I'm
>>> going through now to get a coarse grain look.  I'm not thrilled with the
>>> varint encoding overhead, but it does appear to be effective in this
>>> case.  Right now in a test of master, pg_info_t is consistently 847
>>> bytes on my single-osd test setup.  Switching object_stat_sum_t fields
>>> to varint encoding yields an average size of 630.562 with almost all
>>> sizes either 630 or 631.
>>>
>>> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>>>
>>> There are 34 fields in pg_info_t, so almost all of them are encoding
>>> into a 1 byte value in this test.  It's possible that lowz encoding
>>> might reduce this further.
>>
>> oops, there are 34 fields in object_stat_collection_t rather.
>
> Ok, last update until monday since I'm off tomorrow.  I went through and
> changed the eversion_t, utime_t, pg_t, and "other junk" to use varint
> encoding as well.  This also happened to hit the eversion_t structs in
> pg_info_t.  I didn't do the vectors in pg_stat_t or other fields in
> pg_stat_t out of laziness, but those would probably shrink pretty nicely.
>
> After the above, the average encode size in my test is now down to
> 453.244.  I suspect we could pretty reasonably get that down to 300-400
> by tweaking things further.  Larger clusters with more OSDs and such
> might not shrink as well though.
>
> Note I didn't bother to preserve backwards compatibility by increasing
> the versioning which would need to be done (and a little ugly) in any
> real implementation of this.
>
> Mark

Ok, I had some tests run over the weekend while I was away and it 
appears that we indeed are going to pay a price for varint encoding. I 
ran my test suite where different IO tests are executed (read, write, 
randread, randwrite, mixed, etc) at different IO sizes going from 
largest to smallest.  I tested both my wip-pg_info_t-varint branch and 
the version of master it was based off of.  I knew bluestore small 
random writes were already CPU constrained, so I tested that as it's the 
same set of tests I've been using when doing the bufferlist append 
testing where we saw a dramatic performance improvement.

Though I only ran through a single iteration of the test suite for each 
branch, it appears that master is faster than wip-pg_info_t-varint for 
16k randwrites and smaller.  For 4K random writes, wip-pg_info_t-varint 
appears to be only about 67% the speed of master.

So the takeaway, I guess, is that we can dramatically lower the pg_info_t 
size with varint encoding but we're going to chew through some CPU to do 
it and it's going to hurt if we are already CPU constrained.  I suspect 
rather than using varint as a crutch, we need to think very carefully 
about the size and necessity of these fields, and perhaps other less CPU 
intensive ways to pack the data.  It may be that reverting from varint 
encoding in bluestore might be beneficial as well, especially with fast 
NVMe drives.
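One candidate for a "less CPU intensive way to pack the data" along the 
lines Sage suggested earlier in the thread is a byte-wise delta against the 
previous encoding plus a trivial run-length encoding of the zeros.  A rough 
sketch (illustrative only; it assumes we keep the previous full encoding 
around and pads a shorter previous encoding with zeros):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR-delta the new encoding against the previous one, then run-length
// encode the zero runs: a 0 byte in the output is always followed by a
// run-length count (capped at 255).  Since successive pg_info_t encodings
// differ in only a few fields, the delta is mostly zeros and shrinks well.
std::vector<uint8_t> delta_rle(const std::vector<uint8_t>& prev,
                               const std::vector<uint8_t>& cur) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < cur.size(); ) {
        uint8_t d = cur[i] ^ (i < prev.size() ? prev[i] : 0);
        if (d == 0) {
            size_t run = 0;
            while (i < cur.size() && run < 255 &&
                   (cur[i] ^ (i < prev.size() ? prev[i] : 0)) == 0) {
                ++run;
                ++i;
            }
            out.push_back(0);                 // zero marker
            out.push_back(uint8_t(run));      // run length
        } else {
            out.push_back(d);                 // literal differing byte
            ++i;
        }
    }
    return out;
}
```

This is branchy but only does an XOR and compare per byte, with no per-field 
shift loops, so it should be cheaper than varint-encoding every field; the 
cost is having to keep (or re-read) the previous encoding and periodically 
writing a full keyframe so deltas don't chain forever.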

Mark

>
>>
>>>
>>> Mark
>>>
>>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>>> Sent: Thursday, August 25, 2016 5:56 AM
>>>> To: Piotr Dałek
>>>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>>>> Subject: Re: pg_stat_t is 500+ bytes
>>>>
>>>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>>>> I think there exists a lot of fields can't be discarded via judging
>>>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>>>> pg_info_t.
>>>>
>>>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>>>> also could apply this way.
>>>>
>>>>>
>>>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>>>> thread(thread level not process level).
>>>>>
>>>>> looks we have three optimization from mark and Piotr:
>>>>>
>>>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>>>> 2. change some fields to smaller bits
>>>>> 3. uses SIMD to optimize low frequency fields difference
>>>>> comparison(optional)
>>>>>
>>>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>>>> comparison, pg_info_t is above 700bytes:
>>>>>
>>>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>>>> char* p2) {
>>>>>     const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>>>
>>>>>     for (;; p += 16) {
>>>>>         const __m128i s = _mm_load_si128((const __m128i *)p);
>>>>>         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>>>>             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>>>>             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>>>
>>>>>         if (r != 0) // some of characters isn't equal
>>>>>             return p + __builtin_ffs(r) - 1;
>>>>>     }
>>>>> }
>>>>>
>>>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>>>> <branch@predictor.org.pl> wrote:
>>>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> This is huge.  It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>>>
>>>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>>>> counters.  This will probably cost some CPU time, but OTOH it'll
>>>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>>>> later.
>>>>>>>>>>>
>>>>>>>>>>> Anybody want to tackle this?
>>>>>>>>>>
>>>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>>>> pg_stat_t?
>>>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>>>
>>>>>>>> I don't think this will work, since every op changes last_update in
>>>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>>>> etc.).
>>>>>>>>
>>>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>>>> structure(only hundreds of bytes)....
>>>>>>>>
>>>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>>>> stats struct change).  The question is how fast can we make it in
>>>>>>>> CPU time.
>>>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>>>> would do well enough...
>>>>>>>
>>>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>>>> with in the current encoding?
>>>>>>
>>>>>> Or how high these counters get? We could try transposing the matrix
>>>>>> made of those counters. At least the two most significant bytes in
>>>>>> most of those counters are mostly zeros, and after transposing,
>>>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>>>> these fields need to be uint64_t.
>>>>>>
>>>>>> --
>>>>>> Piotr Dałek
>>>>>> branch@predictor.org.pl
>>>>>> http://blog.predictor.org.pl
>>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: pg_stat_t is 500+ bytes
  2016-08-27  0:00               ` Somnath Roy
@ 2016-08-29 13:28                 ` Sage Weil
  0 siblings, 0 replies; 16+ messages in thread
From: Sage Weil @ 2016-08-29 13:28 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Haomai Wang, Piotr Dałek, Mark Nelson, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7894 bytes --]

On Sat, 27 Aug 2016, Somnath Roy wrote:
> Sage/Sam,
> I was going through the pg_info_t structures and it seems the following 
> fields are duplicates among the different inner structures.
> 
> eversion_t last_scrub;                                ->   duplicate within  pg_stat_t, pg_history_t
> eversion_t last_deep_scrub;                    -> duplicate within  pg_stat_t, pg_history_t
> utime_t last_scrub_stamp;                        -> duplicate within  pg_stat_t, pg_history_t
> utime_t last_deep_scrub_stamp;           -> duplicate within  pg_stat_t, pg_history_t
> utime_t last_clean_scrub_stamp;            -> duplicate within  pg_stat_t, pg_history_t
> 
> pg_stat_t -> created is duplicate of pg_history_t -> epoch_created ?

Yeah, these look like dups to me..
 
> pg_stat_t->last_became_active is duplicate of pg_stat_t->last_active ?
> pg_stat_t->last_became_peered is duplicate of pg_stat_t->last_peered ?

The semantics of these are slightly different: last_active/peered is the 
last time pg_stat_t was refreshed while the pg was active or peered, while 
the _became_ ones are the last time they transitioned to active or peered.  
It may be we can drop the former, since they could be inferred from 
last_fresh and state...

> Here is another quick optimization we could probably do: the following 
> fields of object_stat_sum_t can only take the values 0 and 1, so why not 
> replace them with structure bit fields?
> 
>   int32_t num_flush_mode_high;  // 1 when in high flush mode, otherwise 0
>   int32_t num_flush_mode_low;   // 1 when in low flush mode, otherwise 0
>   int32_t num_evict_mode_some;  // 1 when in evict some mode, otherwise 0
>   int32_t num_evict_mode_full;  // 1 when in evict full mode, otherwise 0
> 
> The same goes for the following field of pg_info_t; we can replace it 
> with a bit field.

This is done this way so that the structure can be summed over PGs and 
pools.
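To make that concrete, here is a pared-down sketch (a hypothetical stand-in, not the real object_stat_sum_t): a per-PG value of 0/1 still has to be a full counter so the pool-level sum can count how many PGs are in that mode.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a summable stat struct: every field is a plain counter,
// so per-PG instances can be added into per-pool totals.
struct stat_sum {
  int64_t num_objects = 0;
  int32_t num_evict_mode_full = 0;  // 0 or 1 per PG; summed over PGs it
                                    // counts PGs in evict-full mode
  stat_sum& operator+=(const stat_sum& o) {
    num_objects += o.num_objects;
    num_evict_mode_full += o.num_evict_mode_full;
    return *this;
  }
};

// Sum three PGs, two of which are in evict-full mode; a 1-bit field
// would saturate at 1 here and lose the count.
stat_sum sum_three_pgs() {
  stat_sum pool;
  stat_sum pg_a{100, 1}, pg_b{200, 0}, pg_c{50, 1};
  pool += pg_a;
  pool += pg_b;
  pool += pg_c;
  return pool;
}
```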

> bool last_backfill_bitwise

Yeah, this could be swapped for a single flags field and appropriate 
accessors.
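A minimal sketch of what that might look like (the flag name and layout here are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Fold standalone bools into one flags word with accessors, so each
// future boolean costs a bit instead of a whole encoded byte.
struct pg_info_flags {
  static constexpr uint32_t FLAG_BACKFILL_BITWISE = 1u << 0;
  uint32_t flags = 0;

  bool last_backfill_bitwise() const {
    return flags & FLAG_BACKFILL_BITWISE;
  }
  void set_last_backfill_bitwise(bool b) {
    if (b)
      flags |= FLAG_BACKFILL_BITWISE;
    else
      flags &= ~FLAG_BACKFILL_BITWISE;
  }
};
```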

sage

> These will replace 93 bytes with 5 bits.
> 
> Am I missing anything ?
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 25, 2016 6:29 PM
> To: 'Haomai Wang'; Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: RE: pg_stat_t is 500+ bytes
> 
> Sage,
> I can see that for statistics we are storing 40 bytes on each IO:
> 
> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
> 
> Should we really store it on each IO ?
> 
> 1. IMO, gathering/storing stats should be configurable
> 
> 2. If enabled, I think we can keep the counters in memory and persist them to the db at some configurable interval?
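An in-memory batching scheme like (2) could be as simple as this sketch (names and the interval policy are made up, not an actual Ceph interface):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stat batching: counters accumulate in memory and are only
// serialized every flush_interval updates instead of on every IO.
struct batched_stats {
  uint64_t bytes_written = 0;
  uint64_t ops = 0;
  uint64_t flush_interval;  // persist every N updates
  uint64_t db_writes = 0;   // stand-in for actual db submissions

  explicit batched_stats(uint64_t n) : flush_interval(n) {}

  void record_io(uint64_t bytes) {
    bytes_written += bytes;
    if (++ops % flush_interval == 0)
      ++db_writes;          // a real version would encode + persist here
  }
};
```

The obvious trade-off is losing up to flush_interval updates on a crash, which is exactly why it would need to be configurable.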
> 
> BTW, as you mentioned, pg_info is huge, 855 bytes.
> 
> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
> 
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Thursday, August 25, 2016 5:56 AM
> To: Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: pg_stat_t is 500+ bytes
> 
> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> > I think there are a lot of fields that can't be discarded just by
> > checking for 0s, like utime_t, epoch_t. A simple way is to compare
> > with the previous pg_info_t.
> 
> Oh, sorry. We could use a fresh pg_info_t as the baseline instead. But the approach below could also apply.
> 
> >
> > BTW, I want to mention that pg_info_t encoding accounts for 6.05% of
> > CPU time in the pg thread (thread level, not process level).
> >
> > It looks like we have three optimizations from Mark and Piotr:
> >
> > 1. restructure pg_info_t to group the frequently changing fields together.
> > 2. change some fields to smaller types.
> > 3. use SIMD to optimize the comparison of low-frequency fields
> > (optional).
> >
> > Intel's SSE4.2 pcmpistrm instruction can do very fast 128-bit
> > comparisons; pg_info_t is above 700 bytes:
> >
> > inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
> >     const __m128i w = _mm_load_si128((const __m128i *)p2);
> >
> >     for (;; p += 16) {
> >         const __m128i s = _mm_load_si128((const __m128i *)p);
> >         const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
> >             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
> >             _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
> >
> >         if (r != 0) // some characters aren't equal
> >             return p + __builtin_ffs(r) - 1;
> >     }
> > }
> >
> > On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
> >> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
> >>>
> >>>
> >>> On 08/24/2016 11:09 AM, Sage Weil wrote:
> >>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
> >>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> >>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >>> >>>>This is huge.  It takes the pg_info_t str from 306 bytes to 847
> >>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
> >>> >>>>
> >>> >>>>We could shrink this down significant with varint and/or delta
> >>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
> >>> >>>>counters.  This will probably cost some CPU time, but OTOH it'll
> >>> >>>>also shrink our metadata down a fair bit too which will pay off later.
> >>> >>>>
> >>> >>>>Anybody want to tackle this?
> >>> >>>
> >>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
> > > pg info should be unchanged most of the time; we could only
> > > update the object-related stats. This may help cut the bytes by half.
> >>> >
> >>> >I don't think this will work, since every op changes last_update in
> >>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
> >>> >
> >>> >>Or we only store increment values and keep the full structure in
> >>> >>memory (may reduce to 20-30 bytes). Periodically we store the full
> >>> >>structure (only hundreds of bytes)....
> >>> >
> >>> >A delta is probably very compressible (only a few fields in the
> >>> >stats struct change).  The question is how fast can we make it in CPU time.
> >>> >Probably a simple delta (which will be almost all 0's) and a
> >>> >trivial run-length-encoding scheme that just gets rid of the 0's
> >>> >would do well enough...
> >>>
> >>> Do we have any rough idea of how many/often consecutive 0s we end up
> >>> with in the current encoding?
> >>
> >> Or how high these counters get? We could try transposing the matrix
> >> made of those counters. At least the two most significant bytes in
> >> most of those counters are mostly zeros, and after transposing,
> >> simple RLE would be feasible. In any case, I'm not sure if *all* of
> >> these fields need to be uint64_t.
> >>
> >> --
> >> Piotr Dałek
> >> branch@predictor.org.pl
> >> http://blog.predictor.org.pl
> --

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-08-29 13:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
2016-08-23 19:39 ` Mark Nelson
2016-08-24  3:01 ` Haomai Wang
2016-08-24  4:02   ` Haomai Wang
2016-08-24 16:09     ` Sage Weil
2016-08-24 16:12       ` Mark Nelson
2016-08-24 16:20         ` Piotr Dałek
2016-08-25 12:54           ` Haomai Wang
2016-08-25 12:56             ` Haomai Wang
2016-08-26  1:28               ` Somnath Roy
2016-08-26  3:04                 ` Mark Nelson
2016-08-26  3:11                   ` Mark Nelson
2016-08-26  5:36                     ` Mark Nelson
2016-08-29 13:27                       ` Mark Nelson
2016-08-27  0:00               ` Somnath Roy
2016-08-29 13:28                 ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.