* bufferlist appenders
@ 2016-08-12 14:27 Sage Weil
  2016-08-12 14:37 ` Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Sage Weil @ 2016-08-12 14:27 UTC (permalink / raw)
  To: ceph-devel

A ton of time in the encoding/marshalling is spent doing bufferlist 
appends.  This is partly because the buffer code is doing lots of sanity 
range checks, and partly because there are multiple layers that get range 
checks and length updates (bufferlist _len changes, 
and bufferlist::append_buffer (a ptr) gets its length updated, at the 
very least).

To simplify and speed this up, I propose an 'appender' concept/type that 
is used for doing appends in a more efficient way.  It would be used 
like so:

 bufferlist bl;
 {
   bufferlist::safe_appender a = bl.get_safe_appender();
   ::encode(foo, a);
 }

or

 {
   bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
   ::encode(foo, a);
 }

The appender keeps its own bufferptr that it copies data into.  The 
bufferptr isn't given to the bufferlist until the appender is destroyed 
(or flush() is called explicitly).  This means that appends are generally 
just a memcpy and a position pointer addition.  In the safe_appender case, 
we also do a range check and allocate a new buffer if necessary.  In the 
unsafe_appender case, it is the caller's responsibility to say how big a 
buffer to preallocate.
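
To make the shape concrete, here is a minimal, self-contained sketch of the
idea (hypothetical names, with std::string standing in for bufferlist so it
compiles on its own; the real code is in the PR below):

 #include <cstring>
 #include <string>
 #include <vector>

 // Toy model: the appender stages writes in its own chunk and only hands
 // the bytes over to the destination when flushed or destroyed.
 struct toy_safe_appender {
   std::string *out;           // stand-in for the destination bufferlist
   std::vector<char> chunk;    // private staging buffer
   std::size_t pos = 0;

   explicit toy_safe_appender(std::string *o, std::size_t chunk_size = 4096)
     : out(o), chunk(chunk_size) {}
   ~toy_safe_appender() { flush(); }

   void flush() {
     out->append(chunk.data(), pos);  // hand the staged bytes to the list
     pos = 0;
   }

   void append(const void *p, std::size_t l) {
     if (pos + l > chunk.size()) {    // the single per-append guard
       flush();
       if (l > chunk.size()) chunk.resize(l);
     }
     std::memcpy(chunk.data() + pos, p, l);  // common case: memcpy + bump
     pos += l;
   }
 };
 // An unsafe variant would drop the guard entirely and trust the size the
 // caller passed to get_unsafe_appender().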

I have a simple prototype here:

	https://github.com/ceph/ceph/pull/10700

It appears to be almost 10x faster when encoding a uint64_t in a loop!

[ RUN      ] BufferList.appender_bench
appending 1073741824 bytes
buffer::list::append 20.285963
buffer::list encode 19.719120
buffer::list::safe_appender::append 2.588926
buffer::list::safe_appender::append_v 2.837026
buffer::list::safe_appender encode 3.000614
buffer::list::unsafe_appender::append 2.452116
buffer::list::unsafe_appender::append_v 2.553745
buffer::list::unsafe_appender encode 2.200110
[       OK ] BufferList.appender_bench (55637 ms)

Interesting, unsafe isn't much faster than safe.  I suspect the CPU's 
branch prediction is just working really well there?

Anyway, thoughts on this?  Any suggestions for further improvement?

I think the next step is to figure out how to make our 
WRITE_CLASS_ENCODER macros and encode functions work with both bufferlists 
and appenders so that it's easy to convert stuff over (and still work with 
a mix of bufferlist-based encoders and appender-based encoders).
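
One possible direction (just a sketch of the idea, not something the PR
does) would be to make the per-class encode a template over the sink type,
so the same call site compiles against either a bufferlist or an appender,
provided the primitive encoders are overloaded for both:

 #include <cstdint>

 // Hypothetical: encode a struct into any sink (bufferlist or appender) by
 // templating on the sink type; WRITE_CLASS_ENCODER would then expand to a
 // templated free function instead of a bufferlist-only one.  Assumes
 // encode() overloads for the primitive types exist for both sink types.
 struct foo_t {
   uint64_t a;
   uint32_t b;
 };

 template <typename Sink>
 inline void encode(const foo_t &f, Sink &sink) {
   encode(f.a, sink);   // resolves to the bufferlist or appender overload
   encode(f.b, sink);
 }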

sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: bufferlist appenders
  2016-08-12 14:27 bufferlist appenders Sage Weil
@ 2016-08-12 14:37 ` Mark Nelson
  2016-08-12 22:49 ` Allen Samuels
  2016-08-13  8:50 ` Piotr Dałek
  2 siblings, 0 replies; 7+ messages in thread
From: Mark Nelson @ 2016-08-12 14:37 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 08/12/2016 09:27 AM, Sage Weil wrote:
> A ton of time in the encoding/marshalling is spent doing bufferlist
> appends.  This is partly because the buffer code is doing lots of sanity
> range checks, and partly because there are multiple layers that get range
> checks and length updates (bufferlist _len changes,
> and bufferlist::append_buffer (a ptr) gets its length updated, at the
> very least).
>
> To simplify and speed this up, I propose an 'appender' concept/type that
> is used for doing appends in a more efficient way.  It would be used
> like so:
>
>  bufferlist bl;
>  {
>    bufferlist::safe_appender a = bl.get_safe_appender();
>    ::encode(foo, a);
>  }
>
> or
>
>  {
>    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
>    ::encode(foo, a);
>  }
>
> The appender keeps its own bufferptr that it copies data into.  The
> bufferptr isn't given to the bufferlist until the appender is destroyed
> (or flush() is called explicitly).  This means that appends are generally
> just a memcpy and a position pointer addition.  In the safe_appender case,
> we also do a range check and allocate a new buffer if necessary.  In the
> unsafe_appender case, it is the caller's responsibility to say how big a
> buffer to preallocate.
>
> I have a simple prototype here:
>
> 	https://github.com/ceph/ceph/pull/10700
>
> It appears to be almost 10x faster when encoding a uint64_t in a loop!

Yay!  This is huge.  For posterity, here's where the original behavior 
was really hurting us in bluestore:

https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view

>
> [ RUN      ] BufferList.appender_bench
> appending 1073741824 bytes
> buffer::list::append 20.285963
> buffer::list encode 19.719120
> buffer::list::safe_appender::append 2.588926
> buffer::list::safe_appender::append_v 2.837026
> buffer::list::safe_appender encode 3.000614
> buffer::list::unsafe_appender::append 2.452116
> buffer::list::unsafe_appender::append_v 2.553745
> buffer::list::unsafe_appender encode 2.200110
> [       OK ] BufferList.appender_bench (55637 ms)
>
> Interesting, unsafe isn't much faster than safe.  I suspect the CPU's
> branch prediction is just working really well there?
>
> Anyway, thoughts on this?  Any suggestions for further improvement?

I'm surprised too that unsafe isn't much faster than safe.  Still, this 
is enough of an improvement that I think we should just run with it for 
now and get some performance tests done as we convert stuff over.

>
> I think the next step is to figure out how to make our
> WRITE_CLASS_ENCODER macros and encode functions work with both bufferlists
> and appenders so that it's easy to convert stuff over (and still work with
> a mix of bufferlist-based encoders and appender-based encoders).
>
> sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: bufferlist appenders
  2016-08-12 14:27 bufferlist appenders Sage Weil
  2016-08-12 14:37 ` Mark Nelson
@ 2016-08-12 22:49 ` Allen Samuels
  2016-08-13 21:31   ` Sage Weil
  2016-08-13  8:50 ` Piotr Dałek
  2 siblings, 1 reply; 7+ messages in thread
From: Allen Samuels @ 2016-08-12 22:49 UTC (permalink / raw)
  To: Sage Weil, ceph-devel


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Friday, August 12, 2016 7:27 AM
> To: ceph-devel@vger.kernel.org
> Subject: bufferlist appenders
> 
> A ton of time in the encoding/marshalling is spent doing bufferlist appends.
> This is partly because the buffer code is doing lots of sanity range checks, and
> partly because there are multiple layers that get range checks and length
> updates (bufferlist _len changes, and bufferlist::append_buffer (a ptr) gets
> its length updated, at the very least).
> 
> To simplify and speed this up, I propose an 'appender' concept/type that is
> used for doing appends in a more efficient way.  It would be used like so:
> 
>  bufferlist bl;
>  {
>    bufferlist::safe_appender a = bl.get_safe_appender();
>    ::encode(foo, a);
>  }
> 
> or
> 
>  {
>    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
>    ::encode(foo, a);
>  }
> 
> The appender keeps its own bufferptr that it copies data into.  The bufferptr
> isn't given to the bufferlist until the appender is destroyed (or flush() is called
> explicitly).  This means that appends are generally just a memcpy and a
> position pointer addition.  In the safe_appender case, we also do a range
> check and allocate a new buffer if necessary.  In the unsafe_appender case,
> it is the caller's responsibility to say how big a buffer to preallocate.
> 
> I have a simple prototype here:
> 
> 	https://github.com/ceph/ceph/pull/10700
> 
> It appears to be almost 10x faster when encoding a uint64_t in a loop!
> 
> [ RUN      ] BufferList.appender_bench
> appending 1073741824 bytes
> buffer::list::append 20.285963
> buffer::list encode 19.719120
> buffer::list::safe_appender::append 2.588926
> buffer::list::safe_appender::append_v 2.837026
> buffer::list::safe_appender encode 3.000614
> buffer::list::unsafe_appender::append 2.452116
> buffer::list::unsafe_appender::append_v 2.553745
> buffer::list::unsafe_appender encode 2.200110
> [       OK ] BufferList.appender_bench (55637 ms)
> 
> Interesting, unsafe isn't much faster than safe.  I suspect the CPU's branch
> prediction is just working really well there?

Yes it looks like it is. But this is a bit of a contrived case because 
you're timing this code which is 100% in the L1 cache, which might not be 
a good model. How bad does it get if the loop is unrolled enough times to 
fall out of the L1 cache but still in the L2 cache (which might be as 
pessimistic a simulation as the original code is optimistic)?
> 
> Anyway, thoughts on this?  Any suggestions for further improvement?
> 
> I think the next step is to figure out how to make our
> WRITE_CLASS_ENCODER macros and encode functions work with both
> bufferlists and appenders so that it's easy to convert stuff over (and still work
> with a mix of bufferlist-based encoders and appender-based encoders).
> 
> sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: bufferlist appenders
  2016-08-12 14:27 bufferlist appenders Sage Weil
  2016-08-12 14:37 ` Mark Nelson
  2016-08-12 22:49 ` Allen Samuels
@ 2016-08-13  8:50 ` Piotr Dałek
  2 siblings, 0 replies; 7+ messages in thread
From: Piotr Dałek @ 2016-08-13  8:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Aug 12, 2016 at 02:27:26PM +0000, Sage Weil wrote:
> A ton of time in the encoding/marshalling is spent doing bufferlist
> appends.  This is partly because the buffer code is doing lots of sanity
> range checks, and partly because there are multiple layers that get range
> checks and length updates (bufferlist _len changes,
> and bufferlist::append_buffer (a ptr) gets its length updated, at the
> very least).
> 
> To simplify and speed this up, I propose an 'appender' concept/type that 
> is used for doing appends in a more efficient way.  It would be used 
> like so:
> 
>  bufferlist bl;
>  {
>    bufferlist::safe_appender a = bl.get_safe_appender();
>    ::encode(foo, a);
>  }
> 
> or
> 
>  {
>    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
>    ::encode(foo, a);
>  }
> 
> The appender keeps its own bufferptr that it copies data into.  The
> bufferptr isn't given to the bufferlist until the appender is destroyed
> (or flush() is called explicitly).  This means that appends are generally
> just a memcpy and a position pointer addition.  In the safe_appender case,
> we also do a range check and allocate a new buffer if necessary.  In the
> unsafe_appender case, it is the caller's responsibility to say how big a
> buffer to preallocate.
> 
> I have a simple prototype here:
> 
> 	https://github.com/ceph/ceph/pull/10700
> 
> It appears to be almost 10x faster when encoding a uint64_t in a loop!
> 
> [ RUN      ] BufferList.appender_bench
> appending 1073741824 bytes
> buffer::list::append 20.285963
> buffer::list encode 19.719120
> buffer::list::safe_appender::append 2.588926
> buffer::list::safe_appender::append_v 2.837026
> buffer::list::safe_appender encode 3.000614
> buffer::list::unsafe_appender::append 2.452116
> buffer::list::unsafe_appender::append_v 2.553745
> buffer::list::unsafe_appender encode 2.200110
> [       OK ] BufferList.appender_bench (55637 ms)
> 
> Interesting, unsafe isn't much faster than safe.  I suspect the CPU's 
> branch prediction is just working really well there?

That may be the case, as there's only one branch in the safe appender,
compared to four in buffer::ptr::append. Also, appenders don't return
values.

> Anyway, thoughts on this?  Any suggestions for further improvement?

As I wrote on GitHub: replace memcpy with maybe_inline_memcpy for even
more performance.
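
Purely to illustrate the idea (this is not the actual maybe_inline_memcpy
implementation): when the length is a small runtime value, dispatching to a
few fixed-size copies lets the compiler inline them instead of always
calling out to memcpy:

 #include <cstddef>
 #include <cstring>

 // Illustrative sketch only: small, common sizes become fixed-size copies
 // the compiler can inline; anything else falls back to plain memcpy.
 static inline void small_copy(void *dst, const void *src, std::size_t l) {
   switch (l) {
   case 1: std::memcpy(dst, src, 1); return;
   case 2: std::memcpy(dst, src, 2); return;
   case 4: std::memcpy(dst, src, 4); return;
   case 8: std::memcpy(dst, src, 8); return;
   default: std::memcpy(dst, src, l); return;
   }
 }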

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: bufferlist appenders
  2016-08-12 22:49 ` Allen Samuels
@ 2016-08-13 21:31   ` Sage Weil
  2016-08-13 23:14     ` Allen Samuels
  2016-08-14 13:06     ` Mark Nelson
  0 siblings, 2 replies; 7+ messages in thread
From: Sage Weil @ 2016-08-13 21:31 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Fri, 12 Aug 2016, Allen Samuels wrote:
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Friday, August 12, 2016 7:27 AM
> > To: ceph-devel@vger.kernel.org
> > Subject: bufferlist appenders
> > 
> > A ton of time in the encoding/marshalling is spent doing bufferlist appends.
> > This is partly because the buffer code is doing lots of sanity range checks, and
> > partly because there are multiple layers that get range checks and length
> > updates (bufferlist _len changes, and bufferlist::append_buffer (a ptr) gets
> > its length updated, at the very least).
> > 
> > To simplify and speed this up, I propose an 'appender' concept/type that is
> > used for doing appends in a more efficient way.  It would be used like so:
> > 
> >  bufferlist bl;
> >  {
> >    bufferlist::safe_appender a = bl.get_safe_appender();
> >    ::encode(foo, a);
> >  }
> > 
> > or
> > 
> >  {
> >    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
> >    ::encode(foo, a);
> >  }
> > 
> > The appender keeps its own bufferptr that it copies data into.  The bufferptr
> > isn't given to the bufferlist until the appender is destroyed (or flush() is called
> > explicitly).  This means that appends are generally just a memcpy and a
> > position pointer addition.  In the safe_appender case, we also do a range
> > check and allocate a new buffer if necessary.  In the unsafe_appender case,
> > it is the caller's responsibility to say how big a buffer to preallocate.
> > 
> > I have a simple prototype here:
> > 
> > 	https://github.com/ceph/ceph/pull/10700
> > 
> > It appears to be almost 10x faster when encoding a uint64_t in a loop!
> > 
> > [ RUN      ] BufferList.appender_bench
> > appending 1073741824 bytes
> > buffer::list::append 20.285963
> > buffer::list encode 19.719120
> > buffer::list::safe_appender::append 2.588926
> > buffer::list::safe_appender::append_v 2.837026
> > buffer::list::safe_appender encode 3.000614
> > buffer::list::unsafe_appender::append 2.452116
> > buffer::list::unsafe_appender::append_v 2.553745
> > buffer::list::unsafe_appender encode 2.200110
> > [       OK ] BufferList.appender_bench (55637 ms)
> > 
> > Interesting, unsafe isn't much faster than safe.  I suspect the CPU's branch
> > prediction is just working really well there?
> 
> Yes it looks like it is. But this is a bit of a contrived case because 
> you're timing this code which is 100% in the L1 cache, which might not 
> be a good model. How bad does it get if the loop is unrolled enough 
> times to fall out of the L1 cache but still in the L2 cache (which might 
> be as pessimistic a simulation as the original code is optimistic).

I updated the test to pick a random point in a big array and then encode 
16 items inline/unrolled in the hopes that this would avoid just hammering 
L1.  It does mean that the 16 contiguous items will probably all get 
fetched together (or have the fetch pipelined), but I think that's 
probably pretty realistic.

The test also went through several other revisions and refinements 
yesterday with Igor's review.  Here's the result now:

[ RUN      ] BufferList.appender_bench
appending 268435456 uint64_t's
buffer::list::append 7.142146
buffer::list encode 7.968760
buffer::list::safe_appender::append 3.506399
buffer::list::safe_appender::append (reserve 16) 2.291478
buffer::list::safe_appender::append_v 3.673429
buffer::list::safe_appender::append_v (reserve 16) 2.291800
buffer::list::safe_appender encode 3.673988
buffer::list::unsafe_appender::append 1.766341
buffer::list::unsafe_appender::append_v 1.748361
buffer::list::unsafe_appender encode 1.746126
[       OK ] BufferList.appender_bench (36340 ms)

which looks much more like what I would expect to see.  The 'reserve 16' 
cases are doing a single bounds check and then doing 16 unsafe appends.  
Bottom line is that doing the limited bounds checks gets you most of the 
way to totally unguarded, but not all the way.  And no guards is 
definitely faster (roughly 2x faster than a guard for every word).
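
In other words, the 'reserve 16' path amortizes one guard over a batch of
unguarded copies, roughly like this (reserve/advance are hypothetical names
for the batch interface, not the literal test code):

 #include <cstdint>
 #include <cstring>

 // One bounds check up front, then 16 raw word appends with no guards.
 template <typename Appender>
 void append_batch(Appender &a, const uint64_t *v) {
   char *p = a.reserve(16 * sizeof(uint64_t));  // single guard, may flush/grow
   for (int i = 0; i < 16; ++i) {
     std::memcpy(p, &v[i], sizeof(uint64_t));   // unguarded copy
     p += sizeof(uint64_t);
   }
   a.advance(16 * sizeof(uint64_t));            // commit the staged bytes
 }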

Anyway, I think the next step is to look at the enc_dec and see if 
this can complement the approach...

sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: bufferlist appenders
  2016-08-13 21:31   ` Sage Weil
@ 2016-08-13 23:14     ` Allen Samuels
  2016-08-14 13:06     ` Mark Nelson
  1 sibling, 0 replies; 7+ messages in thread
From: Allen Samuels @ 2016-08-13 23:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Saturday, August 13, 2016 2:32 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bufferlist appenders
> 
> On Fri, 12 Aug 2016, Allen Samuels wrote:
> >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Friday, August 12, 2016 7:27 AM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: bufferlist appenders
> > >
> > > A ton of time in the encoding/marshalling is spent doing bufferlist appends.
> > > This is partly because the buffer code is doing lots of sanity range checks,
> > > and partly because there are multiple layers that get range checks and length
> > > updates (bufferlist _len changes, and bufferlist::append_buffer (a ptr) gets
> > > its length updated, at the very least).
> > >
> > > To simplify and speed this up, I propose an 'appender' concept/type that is
> > > used for doing appends in a more efficient way.  It would be used like so:
> > >
> > >  bufferlist bl;
> > >  {
> > >    bufferlist::safe_appender a = bl.get_safe_appender();
> > >    ::encode(foo, a);
> > >  }
> > >
> > > or
> > >
> > >  {
> > >    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
> > >    ::encode(foo, a);
> > >  }
> > >
> > > The appender keeps its own bufferptr that it copies data into.  The bufferptr
> > > isn't given to the bufferlist until the appender is destroyed (or flush() is
> > > called explicitly).  This means that appends are generally just a memcpy and a
> > > position pointer addition.  In the safe_appender case, we also do a range
> > > check and allocate a new buffer if necessary.  In the unsafe_appender case,
> > > it is the caller's responsibility to say how big a buffer to preallocate.
> > >
> > > I have a simple prototype here:
> > >
> > > 	https://github.com/ceph/ceph/pull/10700
> > >
> > > It appears to be almost 10x faster when encoding a uint64_t in a loop!
> > >
> > > [ RUN      ] BufferList.appender_bench
> > > appending 1073741824 bytes
> > > buffer::list::append 20.285963
> > > buffer::list encode 19.719120
> > > buffer::list::safe_appender::append 2.588926
> > > buffer::list::safe_appender::append_v 2.837026
> > > buffer::list::safe_appender encode 3.000614
> > > buffer::list::unsafe_appender::append 2.452116
> > > buffer::list::unsafe_appender::append_v 2.553745
> > > buffer::list::unsafe_appender encode 2.200110
> > > [       OK ] BufferList.appender_bench (55637 ms)
> > >
> > > Interesting, unsafe isn't much faster than safe.  I suspect the CPU's branch
> > > prediction is just working really well there?
> >
> > Yes it looks like it is. But this is a bit of a contrived case because
> > you're timing this code which is 100% in the L1 cache, which might not
> > be a good model. How bad does it get if the loop is unrolled enough
> > times to fall out of the L1 cache but still in the L2 cache (which might
> > be as pessimistic a simulation as the original code is optimistic).
> 
> I updated the test to pick a random point in a big array and then encode
> 16 items inline/unrolled in the hopes that this would avoid just hammering
> L1.  It does mean that the 16 contiguous items will probably all get
> fetched together (or have the fetch pipelined), but I think that's
> probably pretty realistic.
> 
> The test also went through several other revisions and refinements
> yesterday with Igor's review.  Here's the result now:
> 
> [ RUN      ] BufferList.appender_bench
> appending 268435456 uint64_t's
> buffer::list::append 7.142146
> buffer::list encode 7.968760
> buffer::list::safe_appender::append 3.506399
> buffer::list::safe_appender::append (reserve 16) 2.291478
> buffer::list::safe_appender::append_v 3.673429
> buffer::list::safe_appender::append_v (reserve 16) 2.291800
> buffer::list::safe_appender encode 3.673988
> buffer::list::unsafe_appender::append 1.766341
> buffer::list::unsafe_appender::append_v 1.748361
> buffer::list::unsafe_appender encode 1.746126
> [       OK ] BufferList.appender_bench (36340 ms)
> 
> which looks much more like what I would expect to see.  The 'reserve 16'
> cases are doing a single bounds check and then doing 16 unsafe appends.
> Bottom line is that doing the limited bounds checks gets you most of the
> way to totally unguarded, but not all the way.  And no guards is
> definitely faster (roughly 2x faster than a guard for every word).
> 
> Anyway, I think the next step is to look at the enc_dec and see if
> this can complement the approach...
> 
> sage

I think this is still too optimistic.  My comments about the L1 cache were 
intended to be oriented toward the code cache, not the data cache (sorry, I 
should have been more specific -- communication IS difficult).  Whatever the 
data cache access pattern is, it'll be more or less the same for all of the 
methods -- so I think we can ignore it [I agree that having the data in the 
L2 cache is the right thing to simulate].  However, for the code cache, in 
the safe_appender case you have the code for 
"if (pos + l > end) { flush(); prepare_buffer(1); }; memcpy(); iter += ..." 
versus the enc_dec/unsafe_appender case, which is just 
"memcpy(); iter += ...", so the code density is significantly degraded.  
It's more likely that the performance of the two schemes is a function of 
the code density (where enc_dec/unsafe_appender wins big time).

But I agree that the thing to strive for is a hybrid of the two approaches. 
It looks to me like if we married the unsafe_appender to the unified 
estimate/encode paradigm of enc_dec we would have the best of both worlds.

I have to run, but I suspect that if unsafe_appender took the 
estimate/encode pair of functions as a parameter, and those functions were 
templated/generated like what I did in the enc_dec framework, we would 
pretty much be there.
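
A rough sketch of the shape I mean (hypothetical glue code, just to show
how an estimate/encode pair and the unsafe_appender could fit together):

 // Hypothetical: run the caller-supplied worst-case size estimator first,
 // size the unsafe appender from it, then encode with no per-append guards.
 template <typename T, typename Estimate, typename Encode>
 void encode_bounded(const T &obj, bufferlist &bl,
                     Estimate estimate, Encode encode) {
   std::size_t bound = estimate(obj);        // cheap upper bound on size
   auto a = bl.get_unsafe_appender(bound);   // one allocation, no guards
   encode(obj, a);                           // pure memcpy + pointer bumps
 }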





 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: bufferlist appenders
  2016-08-13 21:31   ` Sage Weil
  2016-08-13 23:14     ` Allen Samuels
@ 2016-08-14 13:06     ` Mark Nelson
  1 sibling, 0 replies; 7+ messages in thread
From: Mark Nelson @ 2016-08-14 13:06 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 08/13/2016 04:31 PM, Sage Weil wrote:
> On Fri, 12 Aug 2016, Allen Samuels wrote:
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Friday, August 12, 2016 7:27 AM
>>> To: ceph-devel@vger.kernel.org
>>> Subject: bufferlist appenders
>>>
>>> A ton of time in the encoding/marshalling is spent doing bufferlist appends.
>>> This is partly because the buffer code is doing lots of sanity range checks, and
>>> partly because there are multiple layers that get range checks and length
>>> updates (bufferlist _len changes, and bufferlist::append_buffer (a ptr) gets
>>> its length updated, at the very least).
>>>
>>> To simplify and speed this up, I propose an 'appender' concept/type that is
>>> used for doing appends in a more efficient way.  It would be used like so:
>>>
>>>  bufferlist bl;
>>>  {
>>>    bufferlist::safe_appender a = bl.get_safe_appender();
>>>    ::encode(foo, a);
>>>  }
>>>
>>> or
>>>
>>>  {
>>>    bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
>>>    ::encode(foo, a);
>>>  }
>>>
>>> The appender keeps its own bufferptr that it copies data into.  The bufferptr
>>> isn't given to the bufferlist until the appender is destroyed (or flush() is called
>>> explicitly).  This means that appends are generally just a memcpy and a
>>> position pointer addition.  In the safe_appender case, we also do a range
>>> check and allocate a new buffer if necessary.  In the unsafe_appender case,
>>> it is the caller's responsibility to say how big a buffer to preallocate.
>>>
>>> I have a simple prototype here:
>>>
>>> 	https://github.com/ceph/ceph/pull/10700
>>>
>>> It appears to be almost 10x faster when encoding a uint64_t in a loop!
>>>
>>> [ RUN      ] BufferList.appender_bench
>>> appending 1073741824 bytes
>>> buffer::list::append 20.285963
>>> buffer::list encode 19.719120
>>> buffer::list::safe_appender::append 2.588926
>>> buffer::list::safe_appender::append_v 2.837026
>>> buffer::list::safe_appender encode 3.000614
>>> buffer::list::unsafe_appender::append 2.452116
>>> buffer::list::unsafe_appender::append_v 2.553745
>>> buffer::list::unsafe_appender encode 2.200110
>>> [       OK ] BufferList.appender_bench (55637 ms)
>>>
>>> Interesting, unsafe isn't much faster than safe.  I suspect the CPU's branch
>>> prediction is just working really well there?
>>
>> Yes it looks like it is. But this is a bit of a contrived case because
>> you're timing this code which is 100% in the L1 cache, which might not
>> be a good model. How bad does it get if the loop is unrolled enough
>> times to fall out of the L1 cache but still in the L2 cache (which might
>> be as pessimistic a simulation as the original code is optimistic).
>
> I updated the test to pick a random point in a big array and then encode
> 16 items inline/unrolled in the hopes that this would avoid just hammering
> L1.  It does mean that the 16 contiguous items will probably all get
> fetched together (or have the fetch pipelined), but I think that's
> probably pretty realistic.
>
> The test also went through several other revisions and refinements
> yesterday with Igor's review.  Here's the result now:
>
> [ RUN      ] BufferList.appender_bench
> appending 268435456 uint64_t's
> buffer::list::append 7.142146
> buffer::list encode 7.968760
> buffer::list::safe_appender::append 3.506399
> buffer::list::safe_appender::append (reserve 16) 2.291478
> buffer::list::safe_appender::append_v 3.673429
> buffer::list::safe_appender::append_v (reserve 16) 2.291800
> buffer::list::safe_appender encode 3.673988
> buffer::list::unsafe_appender::append 1.766341
> buffer::list::unsafe_appender::append_v 1.748361
> buffer::list::unsafe_appender encode 1.746126
> [       OK ] BufferList.appender_bench (36340 ms)
>
> which looks much more like what I would expect to see.  The 'reserve 16'
> cases are doing a single bounds check and then doing 16 unsafe appends.
> Bottom line is that doing the limited bounds checks gets you most of the
> way to totally unguarded, but not all the way.  And no guards is
> definitely faster (roughly 2x faster than a guard for every word).

Agreed, this seems much more realistic than the earlier results.  It's 
interesting how much the bounds check in the reserve 16 case hurts, but 
in any event we are far better off than the default append case!  Really 
good stuff here!

>
> Anyway, I think the next step is to look at the enc_dec and see if
> this can complement the approach...
>
> sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-08-14 13:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-12 14:27 bufferlist appenders Sage Weil
2016-08-12 14:37 ` Mark Nelson
2016-08-12 22:49 ` Allen Samuels
2016-08-13 21:31   ` Sage Weil
2016-08-13 23:14     ` Allen Samuels
2016-08-14 13:06     ` Mark Nelson
2016-08-13  8:50 ` Piotr Dałek
