* pg_stat_t is 500+ bytes
@ 2016-08-23 18:13 Sage Weil
2016-08-23 19:39 ` Mark Nelson
2016-08-24 3:01 ` Haomai Wang
0 siblings, 2 replies; 16+ messages in thread
From: Sage Weil @ 2016-08-23 18:13 UTC (permalink / raw)
To: ceph-devel
This is huge. It takes the pg_info_t struct from 306 bytes to 847 bytes, and
this _info omap key is rewritten on *every* IO.
We could shrink this down significantly with varint and/or delta encoding,
since a huge portion of it is just a bunch of uint64_t counters. This
will probably cost some CPU time, but OTOH it'll also shrink our metadata
down a fair bit, which will pay off later.
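[Editor's note: for illustration, a minimal LEB128-style varint codec — a sketch only, not Ceph's actual encoder; `encode_varint`/`decode_varint` are hypothetical names. Small counters shrink from a fixed 8 bytes to 1-2 bytes:]

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode 7 data bits per byte; the high bit is set on every byte except
// the last. A counter < 128 costs 1 byte instead of 8.
void encode_varint(uint64_t v, std::vector<uint8_t>& out) {
  while (v >= 0x80) {
    out.push_back(uint8_t(v) | 0x80);  // more bytes follow
    v >>= 7;
  }
  out.push_back(uint8_t(v));           // final byte, high bit clear
}

// Decode and advance the cursor past the consumed bytes.
uint64_t decode_varint(const uint8_t*& p) {
  uint64_t v = 0;
  int shift = 0;
  while (*p & 0x80) {
    v |= uint64_t(*p++ & 0x7f) << shift;
    shift += 7;
  }
  v |= uint64_t(*p++) << shift;
  return v;
}
```
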
Anybody want to tackle this?
sage
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: pg_stat_t is 500+ bytes
2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
@ 2016-08-23 19:39 ` Mark Nelson
2016-08-24 3:01 ` Haomai Wang
1 sibling, 0 replies; 16+ messages in thread
From: Mark Nelson @ 2016-08-23 19:39 UTC (permalink / raw)
To: Sage Weil, ceph-devel
On 08/23/2016 01:13 PM, Sage Weil wrote:
> This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
> this _info omap key is rewritten on *every* IO.
>
> We could shrink this down significant with varint and/or delta encoding
> since a huge portion of it is just a bunch of uint64_t counters. This
> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> down a fair bit too which will pay off later.
I'm a little nervous given how heavy we already are on CPU for small IO.
How often, and how much, of this is empty space? Could a fast packing
implementation work here?
>
> Anybody want to tackle this?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: pg_stat_t is 500+ bytes
2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
2016-08-23 19:39 ` Mark Nelson
@ 2016-08-24 3:01 ` Haomai Wang
2016-08-24 4:02 ` Haomai Wang
1 sibling, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-24 3:01 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
> this _info omap key is rewritten on *every* IO.
>
> We could shrink this down significant with varint and/or delta encoding
> since a huge portion of it is just a bunch of uint64_t counters. This
> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> down a fair bit too which will pay off later.
>
> Anybody want to tackle this?
What about separating "object_stat_collection_t stats" out from pg_stat_t?
pg info should be unchanged most of the time, so we could update only the
object-related stats. This might cut the bytes written roughly in half.
> sage
* Re: pg_stat_t is 500+ bytes
2016-08-24 3:01 ` Haomai Wang
@ 2016-08-24 4:02 ` Haomai Wang
2016-08-24 16:09 ` Sage Weil
0 siblings, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-24 4:02 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
>> this _info omap key is rewritten on *every* IO.
>>
>> We could shrink this down significant with varint and/or delta encoding
>> since a huge portion of it is just a bunch of uint64_t counters. This
>> will probably cost some CPU time, but OTOH it'll also shrink our metadata
>> down a fair bit too which will pay off later.
>>
>> Anybody want to tackle this?
>
> what about separating "object_stat_collection_t stats" from pg_stat_t?
> pg info should be unchanged for most of times, we could only update
> object related stats. This may help to reduce half bytes.
Or we could store only incremental values and keep the full structure in
memory (this may reduce the per-IO write to 20-30 bytes). Periodically we
would store the full structure (only hundreds of bytes)....
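[Editor's note: this idea might look roughly like the sketch below — made-up field names, not the real pg_stat_t layout. Only (field index, varint increment) pairs are persisted per IO; the full in-memory structure would be flushed at some interval:]

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical miniature stats struct; the real pg_stat_t has many more
// counters, but the encoding idea is the same.
struct Stats { uint64_t num_ops, num_bytes, num_objects; };

// Emit one (1-byte field index, varint increment) pair per changed field.
// Unchanged fields cost nothing, so a typical write op is a few bytes.
std::vector<uint8_t> encode_delta(const Stats& prev, const Stats& cur) {
  const uint64_t p[] = {prev.num_ops, prev.num_bytes, prev.num_objects};
  const uint64_t c[] = {cur.num_ops, cur.num_bytes, cur.num_objects};
  std::vector<uint8_t> out;
  for (uint8_t i = 0; i < 3; ++i) {
    uint64_t d = c[i] - p[i];
    if (d == 0) continue;            // skip unchanged fields entirely
    out.push_back(i);                // field index
    while (d >= 0x80) { out.push_back(uint8_t(d) | 0x80); d >>= 7; }
    out.push_back(uint8_t(d));       // varint-encoded increment
  }
  return out;
}
```
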
>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: pg_stat_t is 500+ bytes
2016-08-24 4:02 ` Haomai Wang
@ 2016-08-24 16:09 ` Sage Weil
2016-08-24 16:12 ` Mark Nelson
0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-08-24 16:09 UTC (permalink / raw)
To: Haomai Wang; +Cc: ceph-devel
On Wed, 24 Aug 2016, Haomai Wang wrote:
> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> > On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >> This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
> >> this _info omap key is rewritten on *every* IO.
> >>
> >> We could shrink this down significant with varint and/or delta encoding
> >> since a huge portion of it is just a bunch of uint64_t counters. This
> >> will probably cost some CPU time, but OTOH it'll also shrink our metadata
> >> down a fair bit too which will pay off later.
> >>
> >> Anybody want to tackle this?
> >
> > what about separating "object_stat_collection_t stats" from pg_stat_t?
> > pg info should be unchanged for most of times, we could only update
> > object related stats. This may help to reduce half bytes.
I don't think this will work, since every op changes last_update in
pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
> Or we only store increment values and keep the full in memory(may
> reduce to 20-30bytes). In period time we store the full structure(only
> hundreds of bytes)....
A delta is probably very compressible (only a few fields in the stats
struct change). The question is how fast we can make it in CPU time.
Probably a simple delta (which will be almost all 0's) and a trivial
run-length-encoding scheme that just gets rid of the 0's would do well
enough...
sage
* Re: pg_stat_t is 500+ bytes
2016-08-24 16:09 ` Sage Weil
@ 2016-08-24 16:12 ` Mark Nelson
2016-08-24 16:20 ` Piotr Dałek
0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-24 16:12 UTC (permalink / raw)
To: Sage Weil, Haomai Wang; +Cc: ceph-devel
On 08/24/2016 11:09 AM, Sage Weil wrote:
> On Wed, 24 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>>> This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
>>>> this _info omap key is rewritten on *every* IO.
>>>>
>>>> We could shrink this down significant with varint and/or delta encoding
>>>> since a huge portion of it is just a bunch of uint64_t counters. This
>>>> will probably cost some CPU time, but OTOH it'll also shrink our metadata
>>>> down a fair bit too which will pay off later.
>>>>
>>>> Anybody want to tackle this?
>>>
>>> what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> pg info should be unchanged for most of times, we could only update
>>> object related stats. This may help to reduce half bytes.
>
> I don't think this will work, since every op changes last_update in
> pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>
>> Or we only store increment values and keep the full in memory(may
>> reduce to 20-30bytes). In period time we store the full structure(only
>> hundreds of bytes)....
>
> A delta is probably very compressible (only a few fields in the stats
> struct change). The question is how fast can we make it in CPU time.
> Probably a simple delta (which will be almost all 0's) and a trivial
> run-length-encoding scheme that just gets rid of the 0's would do well
> enough...
Do we have any rough idea of how many consecutive 0s, and how often, we
end up with in the current encoding?
>
> sage
* Re: pg_stat_t is 500+ bytes
2016-08-24 16:12 ` Mark Nelson
@ 2016-08-24 16:20 ` Piotr Dałek
2016-08-25 12:54 ` Haomai Wang
0 siblings, 1 reply; 16+ messages in thread
From: Piotr Dałek @ 2016-08-24 16:20 UTC (permalink / raw)
To: Mark Nelson; +Cc: Sage Weil, Haomai Wang, ceph-devel
On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>
>
> On 08/24/2016 11:09 AM, Sage Weil wrote:
> >On Wed, 24 Aug 2016, Haomai Wang wrote:
> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
> >>>>this _info omap key is rewritten on *every* IO.
> >>>>
> >>>>We could shrink this down significant with varint and/or delta encoding
> >>>>since a huge portion of it is just a bunch of uint64_t counters. This
> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
> >>>>down a fair bit too which will pay off later.
> >>>>
> >>>>Anybody want to tackle this?
> >>>
> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
> >>>pg info should be unchanged for most of times, we could only update
> >>>object related stats. This may help to reduce half bytes.
> >
> >I don't think this will work, since every op changes last_update in
> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
> >
> >>Or we only store increment values and keep the full in memory(may
> >>reduce to 20-30bytes). In period time we store the full structure(only
> >>hundreds of bytes)....
> >
> >A delta is probably very compressible (only a few fields in the stats
> >struct change). The question is how fast can we make it in CPU time.
> >Probably a simple delta (which will be almost all 0's) and a trivial
> >run-length-encoding scheme that just gets rid of the 0's would do well
> >enough...
>
> Do we have any rough idea of how many/often consecutive 0s we end up
> with in the current encoding?
Or how high these counters get? We could try transposing the matrix made of
those counters. At least the two most significant bytes in most of those
counters are mostly zeros, and after transposing, simple RLE would be
feasible. In any case, I'm not sure if *all* of these fields need to be
uint64_t.
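[Editor's note: the transposition idea could be sketched as below — `transpose_counters` is a hypothetical helper; real code would pair it with an RLE pass. Grouping the counters' bytes into planes turns the mostly-zero high bytes into one long run:]

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lay out N uint64_t counters as 8 "byte planes": all byte-0s first,
// then all byte-1s, and so on (little-endian byte order). Since the
// counters are small, the high planes come out almost entirely zero and
// compress to nearly nothing under a simple RLE scheme.
std::vector<uint8_t> transpose_counters(const std::vector<uint64_t>& c) {
  std::vector<uint8_t> planes;
  planes.reserve(c.size() * 8);
  for (int b = 0; b < 8; ++b)                   // plane = byte position
    for (uint64_t v : c)
      planes.push_back(uint8_t(v >> (8 * b)));  // extract byte b of each counter
  return planes;
}
```
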
--
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl
* Re: pg_stat_t is 500+ bytes
2016-08-24 16:20 ` Piotr Dałek
@ 2016-08-25 12:54 ` Haomai Wang
2016-08-25 12:56 ` Haomai Wang
0 siblings, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-08-25 12:54 UTC (permalink / raw)
To: Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel
I think there are a lot of fields that can't be discarded just by testing
for 0s, like utime_t and epoch_t. A simple way is to compare against the
previous pg_info_t.
BTW, I want to mention that pg_info_t encoding accounts for 6.05% of CPU
time in the pg thread (thread level, not process level).
It looks like we have three optimizations from Mark and Piotr:
1. restructure pg_info_t to group the most frequently changed fields together.
2. change some fields to smaller widths.
3. use SIMD to speed up the comparison against the low-frequency fields (optional).
Intel's SSE4.2 pcmpistrm instruction can compare 16 bytes (128 bits) at a
time; pg_info_t is above 700 bytes:
// needs #include <nmmintrin.h> and SSE4.2 support
inline const char *skip_same_128_bytes_SIMD(const char *p, const char *p2) {
  const __m128i w = _mm_loadu_si128((const __m128i *)p2);  // unaligned-safe load
  for (;; p += 16) {
    const __m128i s = _mm_loadu_si128((const __m128i *)p);
    const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
        _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
    if (r != 0)  // some bytes aren't equal
      return p + __builtin_ffs(r) - 1;
  }
}
On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>
>>
>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
>> >>>>this _info omap key is rewritten on *every* IO.
>> >>>>
>> >>>>We could shrink this down significant with varint and/or delta encoding
>> >>>>since a huge portion of it is just a bunch of uint64_t counters. This
>> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
>> >>>>down a fair bit too which will pay off later.
>> >>>>
>> >>>>Anybody want to tackle this?
>> >>>
>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>> >>>pg info should be unchanged for most of times, we could only update
>> >>>object related stats. This may help to reduce half bytes.
>> >
>> >I don't think this will work, since every op changes last_update in
>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>> >
>> >>Or we only store increment values and keep the full in memory(may
>> >>reduce to 20-30bytes). In period time we store the full structure(only
>> >>hundreds of bytes)....
>> >
>> >A delta is probably very compressible (only a few fields in the stats
>> >struct change). The question is how fast can we make it in CPU time.
>> >Probably a simple delta (which will be almost all 0's) and a trivial
>> >run-length-encoding scheme that just gets rid of the 0's would do well
>> >enough...
>>
>> Do we have any rough idea of how many/often consecutive 0s we end up
>> with in the current encoding?
>
> Or how high these counters get? We could try transposing the matrix made of
> those counters. At least the two most significant bytes in most of those
> counters are mostly zeros, and after transposing, simple RLE would be
> feasible. In any case, I'm not sure if *all* of these fields need to be
> uint64_t.
>
* Re: pg_stat_t is 500+ bytes
2016-08-25 12:54 ` Haomai Wang
@ 2016-08-25 12:56 ` Haomai Wang
2016-08-26 1:28 ` Somnath Roy
2016-08-27 0:00 ` Somnath Roy
0 siblings, 2 replies; 16+ messages in thread
From: Haomai Wang @ 2016-08-25 12:56 UTC (permalink / raw)
To: Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel
On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.
Oh, sorry. We could perhaps use a fresh pg_info_t as the baseline to
compare against. The approach below could also be applied this way.
>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
> const __m128i w = _mm_load_si128((const __m128i *)p2);
>
> for (;; p += 16) {
> const __m128i s = _mm_load_si128((const __m128i *)p);
> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
> if (r != 0) // some of characters isn't equal
> return p + __builtin_ffs(r) - 1;
> }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847 bytes, and
>>> >>>>this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta encoding
>>> >>>>since a huge portion of it is just a bunch of uint64_t counters. This
>>> >>>>will probably cost some CPU time, but OTOH it'll also shrink our metadata
>>> >>>>down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only update
>>> >>>object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full structure(only
>>> >>hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the stats
>>> >struct change). The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a trivial
>>> >run-length-encoding scheme that just gets rid of the 0's would do well
>>> >enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix made of
>> those counters. At least the two most significant bytes in most of those
>> counters are mostly zeros, and after transposing, simple RLE would be
>> feasible. In any case, I'm not sure if *all* of these fields need to be
>> uint64_t.
>>
* RE: pg_stat_t is 500+ bytes
2016-08-25 12:56 ` Haomai Wang
@ 2016-08-26 1:28 ` Somnath Roy
2016-08-26 3:04 ` Mark Nelson
2016-08-27 0:00 ` Somnath Roy
1 sibling, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-26 1:28 UTC (permalink / raw)
To: Haomai Wang, Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel
Sage,
I can see that for statistics we are storing 40 bytes for each IO:
Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
Should we really store it on each IO?
1. IMO, gathering/storing stats should be configurable.
2. If enabled, I think we can keep the counters in memory and persist them to the db at some configurable interval?
BTW, as you mentioned, pg_info is huge, 855 bytes:
Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: pg_stat_t is 500+ bytes
On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.
Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.
>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference
> comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
> const __m128i w = _mm_load_si128((const __m128i *)p2);
>
> for (;; p += 16) {
> const __m128i s = _mm_load_si128((const __m128i *)p);
> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
> if (r != 0) // some of characters isn't equal
> return p + __builtin_ffs(r) - 1;
> }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta
>>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters. This will probably cost some CPU time, but OTOH it'll
>>> >>>>also shrink our metadata down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only
>>> >>>update object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full
>>> >>structure(only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change). The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a
>>> >trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes in
>> most of those counters are mostly zeros, and after transposing,
>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>> these fields need to be uint64_t.
>>
* Re: pg_stat_t is 500+ bytes
2016-08-26 1:28 ` Somnath Roy
@ 2016-08-26 3:04 ` Mark Nelson
2016-08-26 3:11 ` Mark Nelson
0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26 3:04 UTC (permalink / raw)
To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel
On 08/25/2016 08:28 PM, Somnath Roy wrote:
> Sage,
> I can see for statistics we are storing 40 bytes for each IO..
>
> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>
> Should we really store it on each IO ?
>
> 1. IMO, gathering/storing stats should be configurable
>
> 2. If enabled, I think we can have counters maintained in memory and persist that to db between some configurable interval ?
>
> BTW, as you mentioned, pg_info is huge, 855 bytes.
>
> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
I spent some time on this today. In pg_stat_t we have:
eversion_t: count 5 (80 bytes)
utime_t: count 13 (92 bytes)
object_stat_collection_t: count 1 (256 bytes)
pg_t: count 1 (20 bytes)
up vector: count 1 (16 bytes for 3x rep)
acting vector: count 1 (16 bytes for 3x rep)
other junk: 74 bytes
Some of these will probably take varint encoding pretty well, while for
others (like the nsec field in utime_t) I suspect it won't help much. I'm
going through now to get a coarse-grained look. I'm not thrilled with the
varint encoding overhead, but it does appear to be effective in this
case. Right now in a test of master, pg_info_t is consistently 847
bytes on my single-OSD test setup. Switching the object_stat_sum_t fields
to varint encoding yields an average size of 630.562, with almost all
sizes either 630 or 631.
256 - (847 - 631) = 40 bytes (vs 256 originally!)
There are 34 fields in pg_info_t, so almost all of them are encoding
into a 1-byte value in this test. It's possible that lowz encoding
might reduce this further.
Mark
>
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Thursday, August 25, 2016 5:56 AM
> To: Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: pg_stat_t is 500+ bytes
>
> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>> I think there exists a lot of fields can't be discarded via judging
>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>> pg_info_t.
>
> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.
>
>>
>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>> thread(thread level not process level).
>>
>> looks we have three optimization from mark and Piotr:
>>
>> 1. reconstruct pg_info_t and make high frequent fields together.
>> 2. change some fields to smaller bits
>> 3. uses SIMD to optimize low frequency fields difference
>> comparison(optional)
>>
>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>> comparison, pg_info_t is above 700bytes:
>>
>> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
>> const __m128i w = _mm_load_si128((const __m128i *)p2);
>>
>> for (;; p += 16) {
>> const __m128i s = _mm_load_si128((const __m128i *)p);
>> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>
>> if (r != 0) // some of characters isn't equal
>> return p + __builtin_ffs(r) - 1;
>> }
>> }
>>
>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>
>>>>
>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>>> This is huge. It takes the pg_info_t str from 306 bytes to 847
>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>
>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>> counters. This will probably cost some CPU time, but OTOH it'll
>>>>>>>> also shrink our metadata down a fair bit too which will pay off later.
>>>>>>>>
>>>>>>>> Anybody want to tackle this?
>>>>>>>
>>>>>>> what about separating "object_stat_collection_t stats" from pg_stat_t?
>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>
>>>>> I don't think this will work, since every op changes last_update in
>>>>> pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>>>>
>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>> structure(only hundreds of bytes)....
>>>>>
>>>>> A delta is probably very compressible (only a few fields in the
>>>>> stats struct change). The question is how fast can we make it in CPU time.
>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>> would do well enough...
>>>>
>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>> with in the current encoding?
>>>
>>> Or how high these counters get? We could try transposing the matrix
>>> made of those counters. At least the two most significant bytes in
>>> most of those counters are mostly zeros, and after transposing,
>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>> these fields need to be uint64_t.
>>>
* Re: pg_stat_t is 500+ bytes
2016-08-26 3:04 ` Mark Nelson
@ 2016-08-26 3:11 ` Mark Nelson
2016-08-26 5:36 ` Mark Nelson
0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26 3:11 UTC (permalink / raw)
To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel
On 08/25/2016 10:04 PM, Mark Nelson wrote:
> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>> Sage,
>> I can see for statistics we are storing 40 bytes for each IO..
>>
>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>
>> Should we really store it on each IO ?
>>
>> 1. IMO, gathering/storing stats should be configurable
>>
>> 2. If enabled, I think we can have counters maintained in memory and
>> persist that to db between some configurable interval ?
>>
>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>
>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>
> I spent some time on this today. In pg_stat_t we have:
>
> eversion_t: count 5 (80 bytes)
> utime_t: count 13 (92 bytes)
> object_stat_collection_t: count 1 (256 bytes)
> pg_t: count 1 (20 bytes)
> up vector: count 1 (16 bytes for 3x rep)
> acting_vector: count 1 (16 bytes for 3x rep)
> other junk: 74 bytes
>
> some of these will probably take varint encoding pretty well, while I
> suspect others (like the nsec field in utime_t) it won't help much. I'm
> going through now to get a coarse grain look. I'm not thrilled with the
> varint encoding overhead, but it does appear to be effective in this
> case. Right now in a test of master, pg_info_t is consistently 847
> bytes on my single-osd test setup. Switching object_stat_sum_t fields
> to varint encoding yields an average size of 630.562 with almost all
> sizes either 630 or 631.
>
> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>
> There are 34 fields in pg_info_t, so almost all of them are encoding
> into a 1 byte value in this test. It's possible that lowz encoding
> might reduce this further.
oops, there are 34 fields in object_stat_collection_t rather.
>
> Mark
>
>>
>>
>> Thanks & Regards
>> Somnath
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>> Sent: Thursday, August 25, 2016 5:56 AM
>> To: Piotr Dałek
>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>> Subject: Re: pg_stat_t is 500+ bytes
>>
>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>> I think there exists a lot of fields can't be discarded via judging
>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>> pg_info_t.
>>
>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>> also could apply this way.
>>
>>>
>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>> thread(thread level not process level).
>>>
>>> looks we have three optimization from mark and Piotr:
>>>
>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>> 2. change some fields to smaller bits
>>> 3. uses SIMD to optimize low frequency fields difference
>>> comparison(optional)
>>>
>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>> comparison, pg_info_t is above 700bytes:
>>>
>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>> char* p2) {
>>> const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>
>>> for (;; p += 16) {
>>> const __m128i s = _mm_load_si128((const __m128i *)p);
>>> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>
>>> if (r != 0) // some of characters isn't equal
>>> return p + __builtin_ffs(r) - 1;
>>> }
>>> }
>>>
>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>> <branch@predictor.org.pl> wrote:
>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>
>>>>>
>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>> wrote:
>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>> wrote:
>>>>>>>>> This is huge. It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>
>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>> counters. This will probably cost some CPU time, but OTOH it'll
>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> Anybody want to tackle this?
>>>>>>>>
>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>> pg_stat_t?
>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>
>>>>>> I don't think this will work, since every op changes last_update in
>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>> etc.).
>>>>>>
>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>> structure(only hundreds of bytes)....
>>>>>>
>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>> stats struct change). The question is how fast can we make it in
>>>>>> CPU time.
>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>> would do well enough...
>>>>>
>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>> with in the current encoding?
>>>>
>>>> Or how high these counters get? We could try transposing the matrix
>>>> made of those counters. At least the two most significant bytes in
>>>> most of those counters are mostly zeros, and after transposing,
>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>> these fields need to be uint64_t.
>>>>
>>>> --
>>>> Piotr Dałek
>>>> branch@predictor.org.pl
>>>> http://blog.predictor.org.pl
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: pg_stat_t is 500+ bytes
2016-08-26 3:11 ` Mark Nelson
@ 2016-08-26 5:36 ` Mark Nelson
2016-08-29 13:27 ` Mark Nelson
0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-08-26 5:36 UTC (permalink / raw)
To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel
On 08/25/2016 10:11 PM, Mark Nelson wrote:
>
>
> On 08/25/2016 10:04 PM, Mark Nelson wrote:
>> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>>> Sage,
>>> I can see for statistics we are storing 40 bytes for each IO..
>>>
>>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>>
>>> Should we really store it on each IO ?
>>>
>>> 1. IMO, gathering/storing stats should be configurable
>>>
>>> 2. If enabled, I think we can have counters maintained in memory and
>>> persist that to db between some configurable interval ?
>>>
>>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>>
>>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>>
>> I spent some time on this today. In pg_stat_t we have:
>>
>> eversion_t: count 5 (80 bytes)
>> utime_t: count 13 (92 bytes)
>> object_stat_collection_t: count 1 (256 bytes)
>> pg_t: count 1 (20 bytes)
>> up vector: count 1 (16 bytes for 3x rep)
>> acting_vector: count 1 (16 bytes for 3x rep)
>> other junk: 74 bytes
>>
>> some of these will probably take varint encoding pretty well, while I
>> suspect others (like the nsec field in utime_t) it won't help much. I'm
>> going through now to get a coarse grain look. I'm not thrilled with the
>> varint encoding overhead, but it does appear to be effective in this
>> case. Right now in a test of master, pg_info_t is consistently 847
>> bytes on my single-osd test setup. Switching object_stat_sum_t fields
>> to varint encoding yields an average size of 630.562 with almost all
>> sizes either 630 or 631.
>>
>> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>>
>> There are 34 fields in pg_info_t, so almost all of them are encoding
>> into a 1 byte value in this test. It's possible that lowz encoding
>> might reduce this further.
>
> oops, there are 34 fields in object_stat_collection_t rather.
OK, last update until Monday since I'm off tomorrow. I went through and
changed the eversion_t, utime_t, pg_t, and "other junk" fields to use varint
encoding as well. This also happened to hit the eversion_t structs in
pg_info_t. Out of laziness I didn't do the vectors or the other fields in
pg_stat_t, but those would probably shrink pretty nicely too.
After the above, the average encode size in my test is now down to
453.244 bytes. I suspect we could pretty reasonably get that down to 300-400
by tweaking things further, though larger clusters with more OSDs and such
might not shrink as well.
Note that I didn't bother to preserve backward compatibility by bumping the
encoding version, which would need to be done (and would be a little ugly)
in any real implementation of this.
Mark
>
>>
>> Mark
>>
>>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>> Sent: Thursday, August 25, 2016 5:56 AM
>>> To: Piotr Dałek
>>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>>> Subject: Re: pg_stat_t is 500+ bytes
>>>
>>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>>> I think there exists a lot of fields can't be discarded via judging
>>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>>> pg_info_t.
>>>
>>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>>> also could apply this way.
>>>
>>>>
>>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>>> thread(thread level not process level).
>>>>
>>>> looks we have three optimization from mark and Piotr:
>>>>
>>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>>> 2. change some fields to smaller bits
>>>> 3. uses SIMD to optimize low frequency fields difference
>>>> comparison(optional)
>>>>
>>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>>> comparison, pg_info_t is above 700bytes:
>>>>
>>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>>> char* p2) {
>>>> const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>>
>>>> for (;; p += 16) {
>>>> const __m128i s = _mm_load_si128((const __m128i *)p);
>>>> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>>> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>>> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>>
>>>> if (r != 0) // some of characters isn't equal
>>>> return p + __builtin_ffs(r) - 1;
>>>> }
>>>> }
>>>>
>>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>>> <branch@predictor.org.pl> wrote:
>>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>>
>>>>>>
>>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>>> wrote:
>>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>> This is huge. It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>>
>>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>>> counters. This will probably cost some CPU time, but OTOH it'll
>>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>>> later.
>>>>>>>>>>
>>>>>>>>>> Anybody want to tackle this?
>>>>>>>>>
>>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>>> pg_stat_t?
>>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>>
>>>>>>> I don't think this will work, since every op changes last_update in
>>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>>> etc.).
>>>>>>>
>>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>>> structure(only hundreds of bytes)....
>>>>>>>
>>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>>> stats struct change). The question is how fast can we make it in
>>>>>>> CPU time.
>>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>>> would do well enough...
>>>>>>
>>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>>> with in the current encoding?
>>>>>
>>>>> Or how high these counters get? We could try transposing the matrix
>>>>> made of those counters. At least the two most significant bytes in
>>>>> most of those counters are mostly zeros, and after transposing,
>>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>>> these fields need to be uint64_t.
>>>>>
>>>>> --
>>>>> Piotr Dałek
>>>>> branch@predictor.org.pl
>>>>> http://blog.predictor.org.pl
>>>
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: pg_stat_t is 500+ bytes
2016-08-25 12:56 ` Haomai Wang
2016-08-26 1:28 ` Somnath Roy
@ 2016-08-27 0:00 ` Somnath Roy
2016-08-29 13:28 ` Sage Weil
1 sibling, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-27 0:00 UTC (permalink / raw)
To: Haomai Wang, Piotr Dałek; +Cc: Mark Nelson, Sage Weil, ceph-devel
Sage/Sam,
I was going through the pg_info_t structures, and it seems the following fields are duplicated among the different inner structures:
eversion_t last_scrub; -> duplicate within pg_stat_t, pg_history_t
eversion_t last_deep_scrub; -> duplicate within pg_stat_t, pg_history_t
utime_t last_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
utime_t last_deep_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
utime_t last_clean_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
pg_stat_t -> created is duplicate of pg_history_t -> epoch_created ?
pg_stat_t->last_became_active is duplicate of pg_stat_t->last_active ?
pg_stat_t->last_became_peered is duplicate of pg_stat_t->last_peered ?
We could probably also make the following quick optimizations:
The following fields of object_stat_sum_t can only take the values 0 and 1, so why not replace them with a structure bit field?
int32_t num_flush_mode_high; // 1 when in high flush mode, otherwise 0
int32_t num_flush_mode_low; // 1 when in low flush mode, otherwise 0
int32_t num_evict_mode_some; // 1 when in evict some mode, otherwise 0
int32_t num_evict_mode_full; // 1 when in evict full mode, otherwise 0
The same goes for the following field of pg_info_t, which we can also replace with a bit field:
bool last_backfill_bitwise
These will replace 93 bytes with 5 bits.
Am I missing anything ?
Thanks & Regards
Somnath
-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 25, 2016 6:29 PM
To: 'Haomai Wang'; Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: RE: pg_stat_t is 500+ bytes
Sage,
I can see for statistics we are storing 40 bytes for each IO..
Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
Should we really store it on each IO ?
1. IMO, gathering/storing stats should be configurable
2. If enabled, I think we can have counters maintained in memory and persist that to db between some configurable interval ?
BTW, as you mentioned, pg_info is huge, 855 bytes.
Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, August 25, 2016 5:56 AM
To: Piotr Dałek
Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: pg_stat_t is 500+ bytes
On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> I think there exists a lot of fields can't be discarded via judging
> 0s, like utime_t, epoch_t. A simple way is to compare with previous
> pg_info_t.
Oh, sorry. We may could use a fresh pg_info_t to indicate. But below also could apply this way.
>
> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
> thread(thread level not process level).
>
> looks we have three optimization from mark and Piotr:
>
> 1. reconstruct pg_info_t and make high frequent fields together.
> 2. change some fields to smaller bits
> 3. uses SIMD to optimize low frequency fields difference
> comparison(optional)
>
> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
> comparison, pg_info_t is above 700bytes:
>
> inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
> const __m128i w = _mm_load_si128((const __m128i *)p2);
>
> for (;; p += 16) {
> const __m128i s = _mm_load_si128((const __m128i *)p);
> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>
> if (r != 0) // some of characters isn't equal
> return p + __builtin_ffs(r) - 1;
> }
> }
>
> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>
>>>
>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
>>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
>>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847
>>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
>>> >>>>
>>> >>>>We could shrink this down significant with varint and/or delta
>>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
>>> >>>>counters. This will probably cost some CPU time, but OTOH it'll
>>> >>>>also shrink our metadata down a fair bit too which will pay off later.
>>> >>>>
>>> >>>>Anybody want to tackle this?
>>> >>>
>>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
>>> >>>pg info should be unchanged for most of times, we could only
>>> >>>update object related stats. This may help to reduce half bytes.
>>> >
>>> >I don't think this will work, since every op changes last_update in
>>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
>>> >
>>> >>Or we only store increment values and keep the full in memory(may
>>> >>reduce to 20-30bytes). In period time we store the full
>>> >>structure(only hundreds of bytes)....
>>> >
>>> >A delta is probably very compressible (only a few fields in the
>>> >stats struct change). The question is how fast can we make it in CPU time.
>>> >Probably a simple delta (which will be almost all 0's) and a
>>> >trivial run-length-encoding scheme that just gets rid of the 0's
>>> >would do well enough...
>>>
>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>> with in the current encoding?
>>
>> Or how high these counters get? We could try transposing the matrix
>> made of those counters. At least the two most significant bytes in
>> most of those counters are mostly zeros, and after transposing,
>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>> these fields need to be uint64_t.
>>
>> --
>> Piotr Dałek
>> branch@predictor.org.pl
>> http://blog.predictor.org.pl
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: pg_stat_t is 500+ bytes
2016-08-26 5:36 ` Mark Nelson
@ 2016-08-29 13:27 ` Mark Nelson
0 siblings, 0 replies; 16+ messages in thread
From: Mark Nelson @ 2016-08-29 13:27 UTC (permalink / raw)
To: Somnath Roy, Haomai Wang, Piotr Dałek; +Cc: Sage Weil, ceph-devel
On 08/26/2016 12:36 AM, Mark Nelson wrote:
> On 08/25/2016 10:11 PM, Mark Nelson wrote:
>>
>>
>> On 08/25/2016 10:04 PM, Mark Nelson wrote:
>>> On 08/25/2016 08:28 PM, Somnath Roy wrote:
>>>> Sage,
>>>> I can see for statistics we are storing 40 bytes for each IO..
>>>>
>>>> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>>>>
>>>> Should we really store it on each IO ?
>>>>
>>>> 1. IMO, gathering/storing stats should be configurable
>>>>
>>>> 2. If enabled, I think we can have counters maintained in memory and
>>>> persist that to db between some configurable interval ?
>>>>
>>>> BTW, as you mentioned, pg_info is huge, 855 bytes.
>>>>
>>>> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>>>
>>> I spent some time on this today. In pg_stat_t we have:
>>>
>>> eversion_t: count 5 (80 bytes)
>>> utime_t: count 13 (92 bytes)
>>> object_stat_collection_t: count 1 (256 bytes)
>>> pg_t: count 1 (20 bytes)
>>> up vector: count 1 (16 bytes for 3x rep)
>>> acting_vector: count 1 (16 bytes for 3x rep)
>>> other junk: 74 bytes
>>>
>>> some of these will probably take varint encoding pretty well, while I
>>> suspect others (like the nsec field in utime_t) it won't help much. I'm
>>> going through now to get a coarse grain look. I'm not thrilled with the
>>> varint encoding overhead, but it does appear to be effective in this
>>> case. Right now in a test of master, pg_info_t is consistently 847
>>> bytes on my single-osd test setup. Switching object_stat_sum_t fields
>>> to varint encoding yields an average size of 630.562 with almost all
>>> sizes either 630 or 631.
>>>
>>> 256 - (847 - 631) = 40 bytes (vs 256 originally!)
>>>
>>> There are 34 fields in pg_info_t, so almost all of them are encoding
>>> into a 1 byte value in this test. It's possible that lowz encoding
>>> might reduce this further.
>>
>> oops, there are 34 fields in object_stat_collection_t rather.
>
> Ok, last update until monday since I'm off tomorrow. I went through and
> changed the eversion_t, utime_t, pg_t, and "other junk" to use varint
> encoding as well. This also happened to hit the eversion_t structs in
> pg_info_t. I didn't do the vectors in pg_stat_t or other fields in
> pg_stat_t out of laziness, but those would probably shrink pretty nicely.
>
> After the above, the average encode size in my test is now down to
> 453.244. I suspect we could pretty reasonably get that down to 300-400
> by tweaking things further. Larger clusters with more OSDs and such
> might not shrink as well though.
>
> Note I didn't bother to preserve backwards compatibility by increasing
> the versioning which would need to be done (and a little ugly) in any
> real implementation of this.
>
> Mark
OK, I had some tests run over the weekend while I was away, and it appears
we are indeed going to pay a price for varint encoding. I ran my test
suite, which executes different IO tests (read, write, randread, randwrite,
mixed, etc.) at different IO sizes going from largest to smallest, against
both my wip-pg_info_t-varint branch and the version of master it was based
on. I knew bluestore small random writes were already CPU constrained, so
I focused on those; it's the same set of tests I've been using for the
bufferlist append testing, where we saw a dramatic performance improvement.
Though I only ran a single iteration of the test suite for each branch,
master appears to be faster than wip-pg_info_t-varint for 16K random writes
and smaller. For 4K random writes, wip-pg_info_t-varint runs at only about
67% of master's speed.
So the takeaway, I guess, is that we can dramatically lower the pg_info_t
size with varint encoding, but we're going to chew through some CPU to do
it, and that's going to hurt when we're already CPU constrained. Rather
than using varint as a crutch, I suspect we need to think very carefully
about the size and necessity of these fields, and perhaps about other less
CPU-intensive ways to pack the data. Reverting the varint encoding in
bluestore might be beneficial as well, especially with fast NVMe drives.
Mark
>
>>
>>>
>>> Mark
>>>
>>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>>> Sent: Thursday, August 25, 2016 5:56 AM
>>>> To: Piotr Dałek
>>>> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
>>>> Subject: Re: pg_stat_t is 500+ bytes
>>>>
>>>> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
>>>>> I think there exists a lot of fields can't be discarded via judging
>>>>> 0s, like utime_t, epoch_t. A simple way is to compare with previous
>>>>> pg_info_t.
>>>>
>>>> Oh, sorry. We may could use a fresh pg_info_t to indicate. But below
>>>> also could apply this way.
>>>>
>>>>>
>>>>> BTW, I want to mentiond pg_info_t encoding occurs 6.05% cpu time in pg
>>>>> thread(thread level not process level).
>>>>>
>>>>> looks we have three optimization from mark and Piotr:
>>>>>
>>>>> 1. reconstruct pg_info_t and make high frequent fields together.
>>>>> 2. change some fields to smaller bits
>>>>> 3. uses SIMD to optimize low frequency fields difference
>>>>> comparison(optional)
>>>>>
>>>>> intel SSE4.2 pcmpistrm instructive could do very good 128bytes
>>>>> comparison, pg_info_t is above 700bytes:
>>>>>
>>>>> inline const char *skip_same_128_bytes_SIMD(const char* p, const
>>>>> char* p2) {
>>>>> const __m128i w = _mm_load_si128((const __m128i *)p2);
>>>>>
>>>>> for (;; p += 16) {
>>>>> const __m128i s = _mm_load_si128((const __m128i *)p);
>>>>> const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
>>>>> _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
>>>>> _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
>>>>>
>>>>> if (r != 0) // some of characters isn't equal
>>>>> return p + __builtin_ffs(r) - 1;
>>>>> }
>>>>> }
>>>>>
>>>>> On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek
>>>>> <branch@predictor.org.pl> wrote:
>>>>>> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/24/2016 11:09 AM, Sage Weil wrote:
>>>>>>>> On Wed, 24 Aug 2016, Haomai Wang wrote:
>>>>>>>>> On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> This is huge. It takes the pg_info_t str from 306 bytes to 847
>>>>>>>>>>> bytes, and this _info omap key is rewritten on *every* IO.
>>>>>>>>>>>
>>>>>>>>>>> We could shrink this down significant with varint and/or delta
>>>>>>>>>>> encoding since a huge portion of it is just a bunch of uint64_t
>>>>>>>>>>> counters. This will probably cost some CPU time, but OTOH it'll
>>>>>>>>>>> also shrink our metadata down a fair bit too which will pay off
>>>>>>>>>>> later.
>>>>>>>>>>>
>>>>>>>>>>> Anybody want to tackle this?
>>>>>>>>>>
>>>>>>>>>> what about separating "object_stat_collection_t stats" from
>>>>>>>>>> pg_stat_t?
>>>>>>>>>> pg info should be unchanged for most of times, we could only
>>>>>>>>>> update object related stats. This may help to reduce half bytes.
>>>>>>>>
>>>>>>>> I don't think this will work, since every op changes last_update in
>>>>>>>> pg_info_t *and* the stats (write op count, total bytes, objects,
>>>>>>>> etc.).
>>>>>>>>
>>>>>>>>> Or we only store increment values and keep the full in memory(may
>>>>>>>>> reduce to 20-30bytes). In period time we store the full
>>>>>>>>> structure(only hundreds of bytes)....
>>>>>>>>
>>>>>>>> A delta is probably very compressible (only a few fields in the
>>>>>>>> stats struct change). The question is how fast can we make it in
>>>>>>>> CPU time.
>>>>>>>> Probably a simple delta (which will be almost all 0's) and a
>>>>>>>> trivial run-length-encoding scheme that just gets rid of the 0's
>>>>>>>> would do well enough...
>>>>>>>
>>>>>>> Do we have any rough idea of how many/often consecutive 0s we end up
>>>>>>> with in the current encoding?
>>>>>>
>>>>>> Or how high these counters get? We could try transposing the matrix
>>>>>> made of those counters. At least the two most significant bytes in
>>>>>> most of those counters are mostly zeros, and after transposing,
>>>>>> simple RLE would be feasible. In any case, I'm not sure if *all* of
>>>>>> these fields need to be uint64_t.
>>>>>>
>>>>>> --
>>>>>> Piotr Dałek
>>>>>> branch@predictor.org.pl
>>>>>> http://blog.predictor.org.pl
>>>>
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: pg_stat_t is 500+ bytes
2016-08-27 0:00 ` Somnath Roy
@ 2016-08-29 13:28 ` Sage Weil
0 siblings, 0 replies; 16+ messages in thread
From: Sage Weil @ 2016-08-29 13:28 UTC (permalink / raw)
To: Somnath Roy; +Cc: Haomai Wang, Piotr Dałek, Mark Nelson, ceph-devel
On Sat, 27 Aug 2016, Somnath Roy wrote:
> Sage/Sam,
> I was going through the pg_info_t structures and it seems the following
> fields are duplicates among the different inner structures.
>
> eversion_t last_scrub; -> duplicate within pg_stat_t, pg_history_t
> eversion_t last_deep_scrub; -> duplicate within pg_stat_t, pg_history_t
> utime_t last_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
> utime_t last_deep_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
> utime_t last_clean_scrub_stamp; -> duplicate within pg_stat_t, pg_history_t
>
> pg_stat_t -> created is duplicate of pg_history_t -> epoch_created ?
Yeah, these look like dups to me..
> pg_stat_t->last_became_active is duplicate of pg_stat_t->last_active ?
> pg_stat_t->last_became_peered is duplicate of pg_stat_t->last_peered ?
The semantics of these are slightly different: last_active/last_peered are
the last time pg_stat_t was refreshed while the PG was active or peered,
whereas the _became_ ones are the last time the PG transitioned to active
or peered. We may be able to drop the former, since they could be inferred
from last_fresh and state...
> Here are some quick optimizations we could probably do as well.
> The following fields of object_stat_sum_t can only hold the values 0 and 1, so why not replace them with bit fields?
>
> int32_t num_flush_mode_high; // 1 when in high flush mode, otherwise 0
> int32_t num_flush_mode_low; // 1 when in low flush mode, otherwise 0
> int32_t num_evict_mode_some; // 1 when in evict some mode, otherwise 0
> int32_t num_evict_mode_full; // 1 when in evict full mode, otherwise 0
>
> Same with the following field of pg_info_t , we can replace with bitfield.
This is done this way so that the structure can be summed over PGs and
pools.
> bool last_backfill_bitwise
Yeah, this could be swapped for a single flags field and appropriate
accessors.
sage
> These will replace 93 bytes with 5 bits.
>
> Am I missing anything ?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 25, 2016 6:29 PM
> To: 'Haomai Wang'; Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: RE: pg_stat_t is 500+ bytes
>
> Sage,
> I can see for statistics we are storing 40 bytes for each IO..
>
> Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
>
> Should we really store it on each IO?
>
> 1. IMO, gathering/storing stats should be configurable.
>
> 2. If enabled, I think we can keep the counters in memory and persist
> them to the db at some configurable interval?
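The interval-based persistence idea might be sketched like this (the names and structure are invented for illustration, not BlueStore's actual code; a real version would also need crash recovery, since unflushed deltas are lost on a crash):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Sketch: accumulate statfs deltas in memory and persist to the DB only
// when a configurable interval has elapsed, instead of on every IO.
struct DeferredStatfs {
  int64_t pending_bytes = 0;  // accumulated in memory, not yet persisted
  std::chrono::steady_clock::time_point last_flush =
      std::chrono::steady_clock::now();
  std::chrono::milliseconds interval;
  std::function<void(int64_t)> persist;  // callback that writes to the DB

  DeferredStatfs(std::chrono::milliseconds i, std::function<void(int64_t)> p)
      : interval(i), persist(std::move(p)) {}

  void add(int64_t delta) {
    pending_bytes += delta;
    auto now = std::chrono::steady_clock::now();
    if (now - last_flush >= interval) {  // time to flush the accumulated delta
      persist(pending_bytes);
      pending_bytes = 0;
      last_flush = now;
    }
  }
};
```

With a zero interval this degenerates to today's write-per-IO behavior; a larger interval trades durability of the counters for fewer DB writes, which is why it would need to be configurable.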
>
> BTW, as you mentioned, pg_info is huge, 855 bytes.
>
> Put( Prefix = M key = 0x0000000000000746'._info' Value size = 855)
>
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Thursday, August 25, 2016 5:56 AM
> To: Piotr Dałek
> Cc: Mark Nelson; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: pg_stat_t is 500+ bytes
>
> On Thu, Aug 25, 2016 at 8:54 PM, Haomai Wang <haomai@xsky.com> wrote:
> > I think there exist a lot of fields that can't be discarded just by
> > checking for 0s, like utime_t and epoch_t. A simple way is to compare
> > against the previous pg_info_t.
>
Oh, sorry. We could use a fresh pg_info_t as the baseline instead. But the approach below could still apply.
>
> >
> > BTW, I want to mention that pg_info_t encoding accounts for 6.05% of
> > CPU time in the pg thread (thread level, not process level).
> >
> > looks like we have three optimizations from Mark and Piotr:
> >
> > 1. restructure pg_info_t so that frequently changing fields sit together.
> > 2. change some fields to smaller types
> > 3. use SIMD to optimize the low-frequency-field difference
> > comparison (optional)
> >
> > Intel's SSE4.2 pcmpistrm instruction can do very fast 128-bit
> > comparisons; pg_info_t is above 700 bytes:
> >
> > inline const char *skip_same_128_bytes_SIMD(const char* p, const char* p2) {
> > const __m128i w = _mm_load_si128((const __m128i *)p2);
> >
> > for (;; p += 16) {
> > const __m128i s = _mm_load_si128((const __m128i *)p);
> > const unsigned r = _mm_cvtsi128_si32(_mm_cmpistrm(w, s,
> > _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
> > _SIDD_BIT_MASK | _SIDD_NEGATIVE_POLARITY));
> >
> > if (r != 0) // some characters aren't equal
> > return p + __builtin_ffs(r) - 1;
> > }
> > }
> >
> > On Thu, Aug 25, 2016 at 12:20 AM, Piotr Dałek <branch@predictor.org.pl> wrote:
> >> On Wed, Aug 24, 2016 at 11:12:24AM -0500, Mark Nelson wrote:
> >>>
> >>>
> >>> On 08/24/2016 11:09 AM, Sage Weil wrote:
> >>> >On Wed, 24 Aug 2016, Haomai Wang wrote:
> >>> >>On Wed, Aug 24, 2016 at 11:01 AM, Haomai Wang <haomai@xsky.com> wrote:
> >>> >>>On Wed, Aug 24, 2016 at 2:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >>> >>>>This is huge. It takes the pg_info_t str from 306 bytes to 847
> >>> >>>>bytes, and this _info omap key is rewritten on *every* IO.
> >>> >>>>
> >>> >>>>We could shrink this down significant with varint and/or delta
> >>> >>>>encoding since a huge portion of it is just a bunch of uint64_t
> >>> >>>>counters. This will probably cost some CPU time, but OTOH it'll
> >>> >>>>also shrink our metadata down a fair bit too which will pay off later.
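As a rough illustration of the varint suggestion, an LEB128-style codec stores small uint64_t counters in one or two bytes instead of a fixed eight. This is a standalone sketch, not Ceph's actual encoder:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode 7 bits per byte, with the high bit as a continuation marker.
// Small values (the common case for many counters) take 1-2 bytes.
inline void encode_varint(uint64_t v, std::vector<uint8_t>& out) {
  while (v >= 0x80) {
    out.push_back(uint8_t(v) | 0x80);  // low 7 bits, continuation bit set
    v >>= 7;
  }
  out.push_back(uint8_t(v));           // final byte, continuation bit clear
}

inline uint64_t decode_varint(const std::vector<uint8_t>& in, size_t& pos) {
  uint64_t v = 0;
  int shift = 0;
  while (in[pos] & 0x80) {
    v |= uint64_t(in[pos++] & 0x7f) << shift;
    shift += 7;
  }
  v |= uint64_t(in[pos++]) << shift;
  return v;
}
```

The trade-off is exactly the one raised above: each field costs a loop and a branch to encode instead of a fixed-width copy.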
> >>> >>>>
> >>> >>>>Anybody want to tackle this?
> >>> >>>
> >>> >>>what about separating "object_stat_collection_t stats" from pg_stat_t?
> >>> >>>pg info should be unchanged most of the time; we could then only
> >>> >>>update the object-related stats. This may cut the bytes roughly in half.
> >>> >
> >>> >I don't think this will work, since every op changes last_update in
> >>> >pg_info_t *and* the stats (write op count, total bytes, objects, etc.).
> >>> >
> >>> >>Or we could store only increment values and keep the full structure in
> >>> >>memory (which may reduce each write to 20-30 bytes), periodically
> >>> >>storing the full structure (only hundreds of bytes)....
> >>> >
> >>> >A delta is probably very compressible (only a few fields in the
> >>> >stats struct change). The question is how fast can we make it in CPU time.
> >>> >Probably a simple delta (which will be almost all 0's) and a
> >>> >trivial run-length-encoding scheme that just gets rid of the 0's
> >>> >would do well enough...
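The delta-plus-zero-RLE scheme described here could be sketched roughly as follows (the output format, a 0x00 marker byte followed by a run length, is invented for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR the new encoding against the previous one (most bytes become 0 when
// few fields changed), then run-length-encode the zero runs.  A literal 0
// never appears on its own, so the 0x00 marker is unambiguous.
std::vector<uint8_t> delta_rle(const std::vector<uint8_t>& prev,
                               const std::vector<uint8_t>& cur) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i < cur.size(); ) {
    uint8_t d = cur[i] ^ (i < prev.size() ? prev[i] : 0);
    if (d == 0) {
      size_t run = 0;
      while (i < cur.size() && run < 255 &&
             (cur[i] ^ (i < prev.size() ? prev[i] : 0)) == 0) {
        ++i;
        ++run;
      }
      out.push_back(0);             // zero-run marker
      out.push_back(uint8_t(run));  // run length (capped at 255)
    } else {
      out.push_back(d);             // literal differing byte
      ++i;
    }
  }
  return out;
}
```

For a mostly-unchanged 800-byte struct this collapses the output to a few bytes per changed field, at the cost of keeping the previous encoding around to diff against.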
> >>>
> >>> Do we have any rough idea of how many/often consecutive 0s we end up
> >>> with in the current encoding?
> >>
> >> Or how high these counters get? We could try transposing the matrix
> >> made of those counters. At least the two most significant bytes in
> >> most of those counters are mostly zeros, and after transposing,
> >> simple RLE would be feasible. In any case, I'm not sure if *all* of
> >> these fields need to be uint64_t.
> >>
> >> --
> >> Piotr Dałek
> >> branch@predictor.org.pl
> >> http://blog.predictor.org.pl
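Piotr's transposition idea might look like this in a minimal form (a sketch with an invented helper; real code would pair it with an RLE pass over the result):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lay out N uint64_t counters as 8 "byte planes": all the lowest bytes
// first, then all the second-lowest, and so on.  Since the top bytes of
// most counters are zero, the tail of the transposed buffer becomes long
// zero runs that simple RLE can crush.
template <size_t N>
std::vector<uint8_t> transpose_counters(const std::array<uint64_t, N>& c) {
  std::vector<uint8_t> out(N * 8);
  for (size_t byte = 0; byte < 8; ++byte)
    for (size_t i = 0; i < N; ++i)
      out[byte * N + i] = uint8_t(c[i] >> (8 * byte));
  return out;
}
```

This doesn't shrink anything by itself; it just rearranges the zeros so a run-length pass sees one long run instead of many short ones scattered through the buffer.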
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2016-08-29 13:28 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-23 18:13 pg_stat_t is 500+ bytes Sage Weil
2016-08-23 19:39 ` Mark Nelson
2016-08-24 3:01 ` Haomai Wang
2016-08-24 4:02 ` Haomai Wang
2016-08-24 16:09 ` Sage Weil
2016-08-24 16:12 ` Mark Nelson
2016-08-24 16:20 ` Piotr Dałek
2016-08-25 12:54 ` Haomai Wang
2016-08-25 12:56 ` Haomai Wang
2016-08-26 1:28 ` Somnath Roy
2016-08-26 3:04 ` Mark Nelson
2016-08-26 3:11 ` Mark Nelson
2016-08-26 5:36 ` Mark Nelson
2016-08-29 13:27 ` Mark Nelson
2016-08-27 0:00 ` Somnath Roy
2016-08-29 13:28 ` Sage Weil