All of lore.kernel.org
 help / color / mirror / Atom feed
* wip-denc
@ 2016-09-13 21:17 Sage Weil
  2016-09-13 23:45 ` wip-denc Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Sage Weil @ 2016-09-13 21:17 UTC (permalink / raw)
  To: ceph-devel

Hi everyone,

Okay, I have a new wip-denc branch working and ready for some review:

	https://github.com/ceph/ceph/pull/11027

Highlights:

- This includes appender/iterator changes to buffer* to speed up 
encoding and decoding (fewer bounds checks, simpler structures).

- Accordingly, classes/types using the new style have different argument 
types for encode/decode.  There is also a new bound_encode() method that 
is used to calculate how big a buffer to preallocate.

- Most of the important helpers for implementing encodable types have new 
versions that work with the new framework (e.g., the ENCODE_START macro 
has a new DENC_START counterpart).

- There is also a mechanism that lets you define the bound_encode, 
encode, and decode methods all in one go using some template magic.  This 
only works for pretty simple types, but it is handy.  It looks like so:

  struct foo_t {
    uint32_t a, b;
    ...
    DENC(foo_t, v, p) {
      DENC_START(1, 1, p);
      denc(v.a, p);
      denc(v.b, p);
      ...
      DENC_FINISH(p);
    }
  };
  WRITE_CLASS_DENC(foo_t)
  

- For new-style types, a new 'denc' function is defined that is 
overloaded to do either bound_encode, encode, or decode (based on 
argument types).  That means that

  ::denc(v, p);

will work for size_t& p, bufferptr::iterator& p, or 
bufferlist::contiguous_appender& p.  This facilitates the DENC definitions 
above.

- There is glue to invoke new-style encode/decode when old-style encode() 
and decode() are invoked, provided a denc_traits<T> is defined.

- Most of the common containers are there (list, vector, set, map, 
pair), but others still need to be converted.

- Currently, we're a bit aggressive about using the new style over the 
old style when we have the chance.  For example, if you have 

  vector<int32_t> foo;
  ::encode(foo, bl);

it will see that it knows how to do int32_t new-style and invoke the 
new-style vector<> code.  I think this is going to be a net win, since 
we avoid doing bounds checks on append for every element (and the 
bound_encode is O(1) for these base types).  On the other hand, it is 
currently smart enough not to use the new style for individual integer 
types, like so

  int32_t v;
  ::encode(v, bl);

although I suspect after the optimizer gets done with it the generated 
machine code is almost identical.

- Most of the key bluestore types are converted over so that we can do 
some benchmarking.

An overview is at the top of the new denc.h header here:

	https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55

I think I've captured the best of Allen's, Varada's, and Sam's various 
approaches, but we'll see how it behaves.  Let me know what you think!

Thanks-
sage


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-13 21:17 wip-denc Sage Weil
@ 2016-09-13 23:45 ` Mark Nelson
  2016-09-14  0:29   ` wip-denc Somnath Roy
  2016-09-14  0:47   ` wip-denc Allen Samuels
  2016-09-14  9:12 ` wip-denc Joao Eduardo Luis
  2016-09-14 20:31 ` wip-denc Mark Nelson
  2 siblings, 2 replies; 18+ messages in thread
From: Mark Nelson @ 2016-09-13 23:45 UTC (permalink / raw)
  To: Sage Weil, ceph-devel



On 09/13/2016 04:17 PM, Sage Weil wrote:
> Hi everyone,
>
> Okay, I have a new wip-denc branch working and ready for some review:
>
> 	https://github.com/ceph/ceph/pull/11027
>
> Highlights:
>
> - This includes appender/iterator changes to buffer* to speed up
> encoding and decoding (fewer bounds checks, simpler structures).
>
> - Accordingly, classes/types using the new-style have different arguments
> types for encode/decode.  There is also a new bound_encode() method that
> is used to calculate how big of a buffer to preallocate.
>
> - Most of the important helpers for doing types have new versions that
> work with the new framework (e.g., the ENCODE_START macro has a
> new DENC_START counterpart).
>
> - There is also a mechanism that lets you define the bound_encode,
> encode, and decode methods all in one go using some template magic.  This
> only works for pretty simple types, but it is handy.  It looks like so:
>
>   struct foo_t {
>     uint32_t a, b;
>     ...
>     DENC(foo_t, v, p) {
>       DENC_START(1, 1, p);
>       denc(v.a, p);
>       denc(v.b, p);
>       ...
>       DENC_FINISH(p);
>     }
>   };
>   WRITE_CLASS_DENC(foo_t)
>
>
> - For new-style types, a new 'denc' function that is overload to do either
> bound_encode, encode, or decode (based on argument types) is defined.
> That means that
>
>   ::denc(v, p);
>
> will work for size_t& p, bufferptr::iterator& p, or
> bufferlist::contiguous_appender& p.  This facilitates the DENC definitions
> above.
>
> - There is glue to invoke new-style encode/decode when old-style encode()
> and decode() are invoked, provided a denc_traits<T> is defined.
>
> - Most of the common containers are there list, vector, set, map, pair,
> but others need to be converted.
>
> - Currently, we're a bit aggressive about using the new-style over the
> old-style when we have the change.  For example, if you have
>
>   vector<int32_t> foo;
>   ::encode(foo, bl);
>
> it will see that it knows how to do int32_t new-style and invoke the
> new-style vector<> code.  I think this is going to be a net win, since
> we avoid doing bounds checks on append for every element (and the
> bound_encode is O(1) for thees base types).  On the other hand, it is
> currently smart enough to not use new-style for individual integer
> types, like so
>
>   int32_t v;
>   ::encode(v, bl);
>
> although I suspect after the optimizer gets done with it the generated
> machine code is almost identical.
>
> - Most of the key bluestore types are converted over so that we can do
> some benchmarking.
>
> An overview is at the top of the new denc.h header here:
>
> 	https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>
> I think I've captured the best of Allen's, Varada's, and Sam's various
> approaches, but we'll see how it behaves.  Let me know what you think!

So far this is working pretty well!  It's not even crashing anymore. :D

Just recapping from IRC for ceph-devel: Overall append overhead is lower 
than it was before, but not as low as we got it in the old code by 
creating a single appender all the way from ENCODE_START to 
ENCODE_FINISH and (manually) estimating the memory to allocate across 
the whole thing.  On the other hand, we are faster overall now with the 
extentmap sharding, so we're on the right track!  Here we're calling 
encode_some potentially many times in ExtentMap::update and creating a 
new appender for each encode_some call.  I'm not sure if we are flushing 
multiple times or only at the end, but buffer::list::append is still 
consuming a decent chunk of CPU, and encode_some itself is too.  I'm 
suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) 
even though they aren't being explicitly singled out by perf.  In any 
event there's probably some tweaking to do here.

Beyond that I don't have much to report yet.  Rb_tree_increment in 
encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are 
starting to show up as bigger % consumers than they used to, so that's 
good news.

FWIW, it looks like we are getting about a 1-4% performance improvement 
over master from this in my CPU-limited case.  So far memory usage is 
still pretty high, with each OSD using about 5.3GB of RSS at the end of 
a 5-minute randwrite test.

Mark

>
> Thanks-
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-13 23:45 ` wip-denc Mark Nelson
@ 2016-09-14  0:29   ` Somnath Roy
  2016-09-14  2:41     ` wip-denc Mark Nelson
                       ` (2 more replies)
  2016-09-14  0:47   ` wip-denc Allen Samuels
  1 sibling, 3 replies; 18+ messages in thread
From: Somnath Roy @ 2016-09-14  0:29 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, ceph-devel

Sage/Mark,
I cherry-picked the 5 commits related to the memory leak from the pull request, and they seem to be *fixing* the memory issue so far in my setup.  I got the following crash, though.

/root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED assert(bc.empty())

 ceph version v11.0.0-2254-g8c52058 (8c52058bc8a0f060e06037196627c19b7d258f1f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d6cde26ef0]
 2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
 3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
 4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
 5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) [0x55d6cdb6e6d3]
 6: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) [0x55d6cdb93810]
 7: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x106) [0x55d6cd970216]
 8: (ReplicatedPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
 9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
 10: (ReplicatedPG::recover_replicas(unsigned long, ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
 11: (ReplicatedPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
 12: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d6cde2bca9]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6cde2f060]
 16: (()+0x76fa) [0x7fd9251296fa]
 17: (clone()+0x6d) [0x7fd923a14b5d]


I commented out the following commit, as it seems to cause a double delete from BlueStore::SharedBlob::put() (?).

681c368c3e809c3291430582bc5d11f9cc44b4ca

Another point: probably not related to the leak fix, but in the latest master (+ your first 5 commits) I am hitting the following assert after an OSD restart, if I run IO for some time.

Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
#2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
    file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
    func=func@entry=0x5555562462c0 <StupidAllocator::init_rm_free(unsigned long, unsigned long)::__PRETTY_FUNCTION__> "virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)") at /root/ceph-master/src/common/assert.cc:78
#3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
    at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
#4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at /root/ceph-master/src/os/bluestore/BlueFS.cc:366
#5  0x0000555555d017e5 in BlueStore::_open_db (this=this@entry=0x55555f162000, create=create@entry=false) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
#6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
#7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at /root/ceph-master/src/osd/OSD.cc:2025
#8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized out>) at /root/ceph-master/src/ceph_osd.cc:609

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, September 13, 2016 4:46 PM
To: Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: wip-denc



On 09/13/2016 04:17 PM, Sage Weil wrote:
> Hi everyone,
>
> Okay, I have a new wip-denc branch working and ready for some review:
>
> https://github.com/ceph/ceph/pull/11027
>
> Highlights:
>
> - This includes appender/iterator changes to buffer* to speed up
> encoding and decoding (fewer bounds checks, simpler structures).
>
> - Accordingly, classes/types using the new-style have different
> arguments types for encode/decode.  There is also a new bound_encode()
> method that is used to calculate how big of a buffer to preallocate.
>
> - Most of the important helpers for doing types have new versions that
> work with the new framework (e.g., the ENCODE_START macro has a new
> DENC_START counterpart).
>
> - There is also a mechanism that lets you define the bound_encode,
> encode, and decode methods all in one go using some template magic.
> This only works for pretty simple types, but it is handy.  It looks like so:
>
>   struct foo_t {
>     uint32_t a, b;
>     ...
>     DENC(foo_t, v, p) {
>       DENC_START(1, 1, p);
>       denc(v.a, p);
>       denc(v.b, p);
>       ...
>       DENC_FINISH(p);
>     }
>   };
>   WRITE_CLASS_DENC(foo_t)
>
>
> - For new-style types, a new 'denc' function that is overload to do
> either bound_encode, encode, or decode (based on argument types) is defined.
> That means that
>
>   ::denc(v, p);
>
> will work for size_t& p, bufferptr::iterator& p, or
> bufferlist::contiguous_appender& p.  This facilitates the DENC
> definitions above.
>
> - There is glue to invoke new-style encode/decode when old-style
> encode() and decode() are invoked, provided a denc_traits<T> is defined.
>
> - Most of the common containers are there list, vector, set, map,
> pair, but others need to be converted.
>
> - Currently, we're a bit aggressive about using the new-style over the
> old-style when we have the change.  For example, if you have
>
>   vector<int32_t> foo;
>   ::encode(foo, bl);
>
> it will see that it knows how to do int32_t new-style and invoke the
> new-style vector<> code.  I think this is going to be a net win, since
> we avoid doing bounds checks on append for every element (and the
> bound_encode is O(1) for thees base types).  On the other hand, it is
> currently smart enough to not use new-style for individual integer
> types, like so
>
>   int32_t v;
>   ::encode(v, bl);
>
> although I suspect after the optimizer gets done with it the generated
> machine code is almost identical.
>
> - Most of the key bluestore types are converted over so that we can do
> some benchmarking.
>
> An overview is at the top of the new denc.h header here:
>
> https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>
> I think I've captured the best of Allen's, Varada's, and Sam's various
> approaches, but we'll see how it behaves.  Let me know what you think!

So far this is working pretty well!  It's not even crashing anymore. :D

Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.

Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.

FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5 minute randwrite test.

Mark

>
> Thanks-
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-13 23:45 ` wip-denc Mark Nelson
  2016-09-14  0:29   ` wip-denc Somnath Roy
@ 2016-09-14  0:47   ` Allen Samuels
  2016-09-14  1:18     ` wip-denc Mark Nelson
  1 sibling, 1 reply; 18+ messages in thread
From: Allen Samuels @ 2016-09-14  0:47 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
> 
> 
> 
> On 09/13/2016 04:17 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > Okay, I have a new wip-denc branch working and ready for some review:
> >
> > 	https://github.com/ceph/ceph/pull/11027
> >
> > Highlights:
> >
> > - This includes appender/iterator changes to buffer* to speed up
> > encoding and decoding (fewer bounds checks, simpler structures).
> >
> > - Accordingly, classes/types using the new-style have different
> > arguments types for encode/decode.  There is also a new bound_encode()
> > method that is used to calculate how big of a buffer to preallocate.
> >
> > - Most of the important helpers for doing types have new versions that
> > work with the new framework (e.g., the ENCODE_START macro has a new
> > DENC_START counterpart).
> >
> > - There is also a mechanism that lets you define the bound_encode,
> > encode, and decode methods all in one go using some template magic.
> > This only works for pretty simple types, but it is handy.  It looks like so:
> >
> >   struct foo_t {
> >     uint32_t a, b;
> >     ...
> >     DENC(foo_t, v, p) {
> >       DENC_START(1, 1, p);
> >       denc(v.a, p);
> >       denc(v.b, p);
> >       ...
> >       DENC_FINISH(p);
> >     }
> >   };
> >   WRITE_CLASS_DENC(foo_t)
> >
> >
> > - For new-style types, a new 'denc' function that is overload to do
> > either bound_encode, encode, or decode (based on argument types) is
> defined.
> > That means that
> >
> >   ::denc(v, p);
> >
> > will work for size_t& p, bufferptr::iterator& p, or
> > bufferlist::contiguous_appender& p.  This facilitates the DENC
> > definitions above.
> >
> > - There is glue to invoke new-style encode/decode when old-style
> > encode() and decode() are invoked, provided a denc_traits<T> is defined.
> >
> > - Most of the common containers are there list, vector, set, map,
> > pair, but others need to be converted.
> >
> > - Currently, we're a bit aggressive about using the new-style over the
> > old-style when we have the change.  For example, if you have
> >
> >   vector<int32_t> foo;
> >   ::encode(foo, bl);
> >
> > it will see that it knows how to do int32_t new-style and invoke the
> > new-style vector<> code.  I think this is going to be a net win, since
> > we avoid doing bounds checks on append for every element (and the
> > bound_encode is O(1) for thees base types).  On the other hand, it is
> > currently smart enough to not use new-style for individual integer
> > types, like so
> >
> >   int32_t v;
> >   ::encode(v, bl);
> >
> > although I suspect after the optimizer gets done with it the generated
> > machine code is almost identical.
> >
> > - Most of the key bluestore types are converted over so that we can do
> > some benchmarking.
> >
> > An overview is at the top of the new denc.h header here:
> >
> > 	https://github.com/liewegas/ceph/blob/wip-
> denc/src/include/denc.h#L55
> >
> > I think I've captured the best of Allen's, Varada's, and Sam's various
> > approaches, but we'll see how it behaves.  Let me know what you think!
> 
> So far this is working pretty well!  It's not even crashing anymore. :D
> 
> Just recapping from IRC for ceph-devel: Overall append overhead is lower
> than it was before, but not as low as we got it in the old code by creating a
> single appender all the way from ENCODE_START to ENCODE_FINISH and
> (manually) estimating the memory to allocate across the whole thing.  On the

Well the new framework certainly allows that to be done, so I think this simply means that we have more work to do.
Probably there are some "bounded" cases that aren't being exploited.

> other hand we are faster overall now with the extentmap sharding, so we're
> on the right track!  Here we're calling encode_some potentially many times in
> ExtentMap::update and creating a new appender for each encode_some
> call.  Not sure if we are flushing multiple times or only at the end, but
> buffer::list::append is still consuming a decent chunk of CPU, and

If append is consuming CPU that seems like a serious clue -- do you have an idea of who is calling append?

> encode_some itself is too. I'm suspicious of the c_str() calls and pbl-
> >append(bufferptr(bp, 0, l) even though they aren't being explicitly singled
> out by perf.  In any event there's probably some tweaking to do here.
> 
> Beyond that I don't have much to report yet.  Rb_tree_increment in
> encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are
> starting to show up as bigger % consumers than they used to, so that's good
> news.
> 
> FWIW, it looks like we are getting about a 1-4% performance improvement
> over master from this in my CPU limited case.  So far memory usage is still
> pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5
> minute randwrite test.

I'm suspicious that the large memory consumption and the relatively slow CPU performance are related. Especially if you're seeing Rb_tree_increment in the stats, that suggests that the maps are too big, which correlates exactly with the large memory footprint....

> 
> Mark
> 
> >
> > Thanks-
> > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-14  0:47   ` wip-denc Allen Samuels
@ 2016-09-14  1:18     ` Mark Nelson
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Nelson @ 2016-09-14  1:18 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, ceph-devel



On 09/13/2016 07:47 PM, Allen Samuels wrote:
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, September 13, 2016 4:46 PM
>> To: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
>> Subject: Re: wip-denc
>>
>>
>>
>> On 09/13/2016 04:17 PM, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> Okay, I have a new wip-denc branch working and ready for some review:
>>>
>>> 	https://github.com/ceph/ceph/pull/11027
>>>
>>> Highlights:
>>>
>>> - This includes appender/iterator changes to buffer* to speed up
>>> encoding and decoding (fewer bounds checks, simpler structures).
>>>
>>> - Accordingly, classes/types using the new-style have different
>>> arguments types for encode/decode.  There is also a new bound_encode()
>>> method that is used to calculate how big of a buffer to preallocate.
>>>
>>> - Most of the important helpers for doing types have new versions that
>>> work with the new framework (e.g., the ENCODE_START macro has a new
>>> DENC_START counterpart).
>>>
>>> - There is also a mechanism that lets you define the bound_encode,
>>> encode, and decode methods all in one go using some template magic.
>>> This only works for pretty simple types, but it is handy.  It looks like so:
>>>
>>>   struct foo_t {
>>>     uint32_t a, b;
>>>     ...
>>>     DENC(foo_t, v, p) {
>>>       DENC_START(1, 1, p);
>>>       denc(v.a, p);
>>>       denc(v.b, p);
>>>       ...
>>>       DENC_FINISH(p);
>>>     }
>>>   };
>>>   WRITE_CLASS_DENC(foo_t)
>>>
>>>
>>> - For new-style types, a new 'denc' function that is overload to do
>>> either bound_encode, encode, or decode (based on argument types) is
>> defined.
>>> That means that
>>>
>>>   ::denc(v, p);
>>>
>>> will work for size_t& p, bufferptr::iterator& p, or
>>> bufferlist::contiguous_appender& p.  This facilitates the DENC
>>> definitions above.
>>>
>>> - There is glue to invoke new-style encode/decode when old-style
>>> encode() and decode() are invoked, provided a denc_traits<T> is defined.
>>>
>>> - Most of the common containers are there list, vector, set, map,
>>> pair, but others need to be converted.
>>>
>>> - Currently, we're a bit aggressive about using the new-style over the
>>> old-style when we have the change.  For example, if you have
>>>
>>>   vector<int32_t> foo;
>>>   ::encode(foo, bl);
>>>
>>> it will see that it knows how to do int32_t new-style and invoke the
>>> new-style vector<> code.  I think this is going to be a net win, since
>>> we avoid doing bounds checks on append for every element (and the
>>> bound_encode is O(1) for thees base types).  On the other hand, it is
>>> currently smart enough to not use new-style for individual integer
>>> types, like so
>>>
>>>   int32_t v;
>>>   ::encode(v, bl);
>>>
>>> although I suspect after the optimizer gets done with it the generated
>>> machine code is almost identical.
>>>
>>> - Most of the key bluestore types are converted over so that we can do
>>> some benchmarking.
>>>
>>> An overview is at the top of the new denc.h header here:
>>>
>>> 	https://github.com/liewegas/ceph/blob/wip-
>> denc/src/include/denc.h#L55
>>>
>>> I think I've captured the best of Allen's, Varada's, and Sam's various
>>> approaches, but we'll see how it behaves.  Let me know what you think!
>>
>> So far this is working pretty well!  It's not even crashing anymore. :D
>>
>> Just recapping from IRC for ceph-devel: Overall append overhead is lower
>> than it was before, but not as low as we got it in the old code by creating a
>> single appender all the way from ENCODE_START to ENCODE_FINISH and
>> (manually) estimating the memory to allocate across the whole thing.  On the
>
> Well the new framework certainly allows that to be done, so I think this simply means that we have more work to do.
> Probably there are some "bounded" cases that aren't being exploited.

Indeed.  I guess the trick is being able to do the estimation for 
everything up front.  It was fairly easy pre-sharding but I don't know 
how bad it will be now.

>
>> other hand we are faster overall now with the extentmap sharding, so we're
>> on the right track!  Here we're calling encode_some potentially many times in
>> ExtentMap::update and creating a new appender for each encode_some
>> call.  Not sure if we are flushing multiple times or only at the end, but
>> buffer::list::append is still consuming a decent chunk of CPU, and
>
> If append is consuming CPU that seems like a serious clue -- do you have an idea of who is calling append?

We do when the contiguous_appender is being flushed, at least.  Since we 
create it inside encode_some, we're presumably flushing at least once 
every encode_some call, and perhaps more often.

>
>> encode_some itself is too. I'm suspicious of the c_str() calls and pbl-
>>> append(bufferptr(bp, 0, l) even though they aren't being explicitly singled
>> out by perf.  In any event there's probably some tweaking to do here.
>>
>> Beyond that I don't have much to report yet.  Rb_tree_increment in
>> encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are
>> starting to show up as bigger % consumers than they used to, so that's good
>> news.
>>
>> FWIW, it looks like we are getting about a 1-4% performance improvement
>> over master from this in my CPU limited case.  So far memory usage is still
>> pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5
>> minute randwrite test.
>
> I'm suspicious that that the large memory consumption and the relatively slow CPU performance are related. Especially if you're seeing Rb_tree_increment in the stats, that's suggesting that the maps are too big. Which correlates exactly with the large memory footprint....

Could be!  I suspect we'll want to dig into this some time soon.

>
>>
>> Mark
>>
>>>
>>> Thanks-
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-14  0:29   ` wip-denc Somnath Roy
@ 2016-09-14  2:41     ` Mark Nelson
  2016-09-14  4:05       ` wip-denc Somnath Roy
  2016-09-14 14:10     ` wip-denc Sage Weil
  2016-09-15  0:39     ` wip-denc Sage Weil
  2 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-09-14  2:41 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil, ceph-devel

Here's another one to add to the pile:

      0> 2016-09-13 22:31:45.142288 7f22a1d92700 -1 
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: In 
function 'virtual void BitMapAllocator::commit_start()' thread 
7f22a1d92700 time 2016-09-13 22:31:45.137789
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: 339: 
FAILED assert(m_committing.empty())

  ceph version v10.0.4-8831-ge3e37f1 
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x85) [0x5636eac23b55]
  2: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  3: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  4: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  5: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52) 
[0x5636eaaf6852]
  6: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b) [0x5636eaaf963b]
  7: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, 
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  8: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  9: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  10: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  11: (()+0x9361f3) [0x5636eab751f3]
  12: (()+0x7dc5) [0x7f22c4a8edc5]
  13: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    0/ 0 mds
    0/ 0 mds_balancer
    0/ 0 mds_locker
    0/ 0 mds_log
    0/ 0 mds_log_expire
    0/ 0 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 0 filer
    0/ 1 striper
    0/ 0 objecter
    0/ 0 rados
    0/ 0 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 0 journaler
    0/ 0 objectcacher
    0/ 0 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    0/ 0 mon
    0/ 0 monc
    0/ 0 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    0/ 0 rgw
    1/10 civetweb
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 newstore
    0/ 0 bluestore
    0/ 0 bluefs
    0/ 0 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /tmp/cbt/ceph/log/osd.5.log
--- end dump of recent events ---
2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal (Aborted) **
  in thread 7f22a1d92700 thread_name:rocksdb:bg0

  ceph version v10.0.4-8831-ge3e37f1 
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (()+0x869fba) [0x5636eaaa8fba]
  2: (()+0xf100) [0x7f22c4a96100]
  3: (gsignal()+0x37) [0x7f22c38b45f7]
  4: (abort()+0x148) [0x7f22c38b5ce8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x267) [0x5636eac23d37]
  6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  9: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52) 
[0x5636eaaf6852]
  10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b) 
[0x5636eaaf963b]
  11: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, 
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  15: (()+0x9361f3) [0x5636eab751f3]
  16: (()+0x7dc5) [0x7f22c4a8edc5]
  17: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- begin dump of recent events ---
      0> 2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal 
(Aborted) **
  in thread 7f22a1d92700 thread_name:rocksdb:bg0

  ceph version v10.0.4-8831-ge3e37f1 
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (()+0x869fba) [0x5636eaaa8fba]
  2: (()+0xf100) [0x7f22c4a96100]
  3: (gsignal()+0x37) [0x7f22c38b45f7]
  4: (abort()+0x148) [0x7f22c38b5ce8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x267) [0x5636eac23d37]
  6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  9: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52) 
[0x5636eaaf6852]
  10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b) 
[0x5636eaaf963b]
  11: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, 
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  15: (()+0x9361f3) [0x5636eab751f3]
  16: (()+0x7dc5) [0x7f22c4a8edc5]
  17: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    0/ 0 mds
    0/ 0 mds_balancer
    0/ 0 mds_locker
    0/ 0 mds_log
    0/ 0 mds_log_expire
    0/ 0 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 0 filer
    0/ 1 striper
    0/ 0 objecter
    0/ 0 rados
    0/ 0 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 0 journaler
    0/ 0 objectcacher
    0/ 0 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    0/ 0 mon
    0/ 0 monc
    0/ 0 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    0/ 0 rgw
    1/10 civetweb
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 newstore
    0/ 0 bluestore
    0/ 0 bluefs
    0/ 0 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /tmp/cbt/ceph/log/osd.5.log
--- end dump of recent events ---


On 09/13/2016 07:29 PM, Somnath Roy wrote:
> Sage/Mark,
> I cherry-picked the 5 commits related to memory leak from the pull request and it seems to be *fixing* the memory issue so far in my setup. Got the following crash though.
>
> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED assert(bc.empty())
>
>  ceph version v11.0.0-2254-g8c52058 (8c52058bc8a0f060e06037196627c19b7d258f1f)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d6cde26ef0]
>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) [0x55d6cdb6e6d3]
>  6: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) [0x55d6cdb93810]
>  7: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x106) [0x55d6cd970216]
>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>  10: (ReplicatedPG::recover_replicas(unsigned long, ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>  11: (ReplicatedPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d6cde2bca9]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6cde2f060]
>  16: (()+0x76fa) [0x7fd9251296fa]
>  17: (clone()+0x6d) [0x7fd923a14b5d]
>
>
> I commented out the following commit, as it seems to cause a double delete from BlueStore::SharedBlob::put() (?).
>
> 681c368c3e809c3291430582bc5d11f9cc44b4ca
>
> Another point, probably not related to the leak fix: in the latest master (+ your first 5 commits) I am hitting the following assert after an OSD restart, if I run IO for some time..
>
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0 <StupidAllocator::init_rm_free(unsigned long, unsigned long)::__PRETTY_FUNCTION__> "virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)") at /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db (this=this@entry=0x55555f162000, create=create@entry=false) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized out>) at /root/ceph-master/src/ceph_osd.cc:609
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
>
>
>
> On 09/13/2016 04:17 PM, Sage Weil wrote:
>> Hi everyone,
>>
>> Okay, I have a new wip-denc branch working and ready for some review:
>>
>> https://github.com/ceph/ceph/pull/11027
>>
>> Highlights:
>>
>> - This includes appender/iterator changes to buffer* to speed up
>> encoding and decoding (fewer bounds checks, simpler structures).
>>
>> - Accordingly, classes/types using the new style take different
>> argument types for encode/decode.  There is also a new bound_encode()
>> method that is used to calculate how large a buffer to preallocate.
>>
>> - Most of the important helpers for defining encodable types have new
>> versions that work with the new framework (e.g., the ENCODE_START
>> macro has a new DENC_START counterpart).
>>
>> - There is also a mechanism that lets you define the bound_encode,
>> encode, and decode methods all in one go using some template magic.
>> This only works for pretty simple types, but it is handy.  It looks like so:
>>
>>   struct foo_t {
>>     uint32_t a, b;
>>     ...
>>     DENC(foo_t, v, p) {
>>       DENC_START(1, 1, p);
>>       denc(v.a, p);
>>       denc(v.b, p);
>>       ...
>>       DENC_FINISH(p);
>>     }
>>   };
>>   WRITE_CLASS_DENC(foo_t)
>>
>>
>> - For new-style types, a new 'denc' function is defined that is
>> overloaded to do either bound_encode, encode, or decode (based on argument types).
>> That means that
>>
>>   ::denc(v, p);
>>
>> will work for size_t& p, bufferptr::iterator& p, or
>> bufferlist::contiguous_appender& p.  This facilitates the DENC
>> definitions above.
>>
>> - There is glue to invoke new-style encode/decode when old-style
>> encode() and decode() are invoked, provided a denc_traits<T> is defined.
>>
>> - Most of the common containers are there (list, vector, set, map,
>> pair), but others still need to be converted.
>>
>> - Currently, we're a bit aggressive about using the new-style over
>> the old-style when we have the chance.  For example, if you have
>>
>>   vector<int32_t> foo;
>>   ::encode(foo, bl);
>>
>> it will see that it knows how to do int32_t new-style and invoke the
>> new-style vector<> code.  I think this is going to be a net win, since
>> we avoid doing bounds checks on append for every element (and the
>> bound_encode is O(1) for these base types).  On the other hand, it is
>> currently smart enough to not use new-style for individual integer
>> types, like so
>>
>>   int32_t v;
>>   ::encode(v, bl);
>>
>> although I suspect after the optimizer gets done with it the generated
>> machine code is almost identical.
>>
>> - Most of the key bluestore types are converted over so that we can do
>> some benchmarking.
>>
>> An overview is at the top of the new denc.h header here:
>>
>> https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>>
>> I think I've captured the best of Allen's, Varada's, and Sam's various
>> approaches, but we'll see how it behaves.  Let me know what you think!
>
> So far this is working pretty well!  It's not even crashing anymore. :D
>
> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too.  I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
>
> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
>
> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5 minute randwrite test.
>
> Mark
>
>>
>> Thanks-
>> sage
>>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-14  2:41     ` wip-denc Mark Nelson
@ 2016-09-14  4:05       ` Somnath Roy
  2016-09-14 11:06         ` wip-denc Mark Nelson
  0 siblings, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-09-14  4:05 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, ceph-devel

In fact, the leak is not totally fixed :-(
After a 30 min run, my system memory started swapping again; I will see what else is growing..

Thanks & Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Tuesday, September 13, 2016 7:42 PM
To: Somnath Roy; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: wip-denc

Here's another one to add to the pile:

      0> 2016-09-13 22:31:45.142288 7f22a1d92700 -1
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: In function 'virtual void BitMapAllocator::commit_start()' thread
7f22a1d92700 time 2016-09-13 22:31:45.137789
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: 339: 
FAILED assert(m_committing.empty())

  ceph version v10.0.4-8831-ge3e37f1
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x5636eac23b55]
  2: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  3: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  4: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  5: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
[0x5636eaaf6852]
  6: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b) [0x5636eaaf963b]
  7: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  8: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  9: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  10: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  11: (()+0x9361f3) [0x5636eab751f3]
  12: (()+0x7dc5) [0x7f22c4a8edc5]
  13: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    0/ 0 mds
    0/ 0 mds_balancer
    0/ 0 mds_locker
    0/ 0 mds_log
    0/ 0 mds_log_expire
    0/ 0 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 0 filer
    0/ 1 striper
    0/ 0 objecter
    0/ 0 rados
    0/ 0 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 0 journaler
    0/ 0 objectcacher
    0/ 0 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    0/ 0 mon
    0/ 0 monc
    0/ 0 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    0/ 0 rgw
    1/10 civetweb
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 newstore
    0/ 0 bluestore
    0/ 0 bluefs
    0/ 0 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /tmp/cbt/ceph/log/osd.5.log
--- end dump of recent events ---
2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal (Aborted) **
  in thread 7f22a1d92700 thread_name:rocksdb:bg0

  ceph version v10.0.4-8831-ge3e37f1
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (()+0x869fba) [0x5636eaaa8fba]
  2: (()+0xf100) [0x7f22c4a96100]
  3: (gsignal()+0x37) [0x7f22c38b45f7]
  4: (abort()+0x148) [0x7f22c38b5ce8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x267) [0x5636eac23d37]
  6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  9: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
[0x5636eaaf6852]
  10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b)
[0x5636eaaf963b]
  11: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  15: (()+0x9361f3) [0x5636eab751f3]
  16: (()+0x7dc5) [0x7f22c4a8edc5]
  17: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
      0> 2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal
(Aborted) **
  in thread 7f22a1d92700 thread_name:rocksdb:bg0

  ceph version v10.0.4-8831-ge3e37f1
(e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
  1: (()+0x869fba) [0x5636eaaa8fba]
  2: (()+0xf100) [0x7f22c4a96100]
  3: (gsignal()+0x37) [0x7f22c38b45f7]
  4: (abort()+0x148) [0x7f22c38b5ce8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x267) [0x5636eac23d37]
  6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
  7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
  8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
  9: 
(rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*, 
8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
[0x5636eaaf6852]
  10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b)
[0x5636eaaf963b]
  11: 
(rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
  12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
  13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
  15: (()+0x9361f3) [0x5636eab751f3]
  16: (()+0x7dc5) [0x7f22c4a8edc5]
  17: (clone()+0x6d) [0x7f22c3975ced]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 0 lockdep
    0/ 0 context
    0/ 0 crush
    0/ 0 mds
    0/ 0 mds_balancer
    0/ 0 mds_locker
    0/ 0 mds_log
    0/ 0 mds_log_expire
    0/ 0 mds_migrator
    0/ 0 buffer
    0/ 0 timer
    0/ 0 filer
    0/ 1 striper
    0/ 0 objecter
    0/ 0 rados
    0/ 0 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 0 journaler
    0/ 0 objectcacher
    0/ 0 client
    0/ 0 osd
    0/ 0 optracker
    0/ 0 objclass
    0/ 0 filestore
    0/ 0 journal
    0/ 0 ms
    0/ 0 mon
    0/ 0 monc
    0/ 0 paxos
    0/ 0 tp
    0/ 0 auth
    1/ 5 crypto
    0/ 0 finisher
    0/ 0 heartbeatmap
    0/ 0 perfcounter
    0/ 0 rgw
    1/10 civetweb
    1/ 5 javaclient
    0/ 0 asok
    0/ 0 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 newstore
    0/ 0 bluestore
    0/ 0 bluefs
    0/ 0 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /tmp/cbt/ceph/log/osd.5.log
--- end dump of recent events ---


On 09/13/2016 07:29 PM, Somnath Roy wrote:
> Sage/Mark,
> I cherry-picked the 5 commits related to memory leak from the pull request and it seems to be *fixing* the memory issue so far in my setup. Got the following crash though.
>
> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED 
> assert(bc.empty())
>
>  ceph version v11.0.0-2254-g8c52058 
> (8c52058bc8a0f060e06037196627c19b7d258f1f)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x55d6cde26ef0]
>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) 
> [0x55d6cdb6e6d3]
>  6: 
> (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&
> , ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) 
> [0x55d6cdb93810]
>  7: (PGBackend::objects_get_attr(hobject_t const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x106) 
> [0x55d6cd970216]
>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool, 
> std::map<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >, ceph::buffer::list, 
> std::less<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > >, 
> std::allocator<std::pair<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const, 
> ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, 
> eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>  10: (ReplicatedPG::recover_replicas(unsigned long, 
> ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>  11: (ReplicatedPG::start_recovery_ops(unsigned long, 
> ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
> ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) 
> [0x55d6cde2bca9]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) 
> [0x55d6cde2f060]
>  16: (()+0x76fa) [0x7fd9251296fa]
>  17: (clone()+0x6d) [0x7fd923a14b5d]
>
>
> I comment out the following commit as it seems a double delete from BlueStore::SharedBlob::put() (?).
>
> 681c368c3e809c3291430582bc5d11f9cc44b4ca
>
> Another point, probably, not related to leak fix, but, in the latest master (+ your 1st 5 commits) I am hitting the following assert after OSD restart. If I run IO for some time..
>
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0 
> <StupidAllocator::init_rm_free(unsigned long, unsigned 
> long)::__PRETTY_FUNCTION__> "virtual void 
> StupidAllocator::init_rm_free(uint64_t, uint64_t)") at 
> /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at 
> /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db 
> (this=this@entry=0x55555f162000, create=create@entry=false) at 
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at 
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at 
> /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized 
> out>) at /root/ceph-master/src/ceph_osd.cc:609
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
>
>
>
> On 09/13/2016 04:17 PM, Sage Weil wrote:
>> Hi everyone,
>>
>> Okay, I have a new wip-denc branch working and ready for some review:
>>
>> https://github.com/ceph/ceph/pull/11027
>>
>> Highlights:
>>
>> - This includes appender/iterator changes to buffer* to speed up 
>> encoding and decoding (fewer bounds checks, simpler structures).
>>
>> - Accordingly, classes/types using the new-style have different 
>> arguments types for encode/decode.  There is also a new 
>> bound_encode() method that is used to calculate how big of a buffer to preallocate.
>>
>> - Most of the important helpers for doing types have new versions 
>> that work with the new framework (e.g., the ENCODE_START macro has a 
>> new DENC_START counterpart).
>>
>> - There is also a mechanism that lets you define the bound_encode, 
>> encode, and decode methods all in one go using some template magic.
>> This only works for pretty simple types, but it is handy.  It looks like so:
>>
>>   struct foo_t {
>>     uint32_t a, b;
>>     ...
>>     DENC(foo_t, v, p) {
>>       DENC_START(1, 1, p);
>>       denc(v.a, p);
>>       denc(v.b, p);
>>       ...
>>       DENC_FINISH(p);
>>     }
>>   };
>>   WRITE_CLASS_DENC(foo_t)
>>
>>
>> - For new-style types, a new 'denc' function that is overload to do 
>> either bound_encode, encode, or decode (based on argument types) is defined.
>> That means that
>>
>>   ::denc(v, p);
>>
>> will work for size_t& p, bufferptr::iterator& p, or 
>> bufferlist::contiguous_appender& p.  This facilitates the DENC 
>> definitions above.
>>
>> - There is glue to invoke new-style encode/decode when old-style
>> encode() and decode() are invoked, provided a denc_traits<T> is defined.
>>
>> - Most of the common containers are there list, vector, set, map, 
>> pair, but others need to be converted.
>>
>> - Currently, we're a bit aggressive about using the new-style over 
>> the old-style when we have the change.  For example, if you have
>>
>>   vector<int32_t> foo;
>>   ::encode(foo, bl);
>>
>> it will see that it knows how to do int32_t new-style and invoke the 
>> new-style vector<> code.  I think this is going to be a net win, 
>> since we avoid doing bounds checks on append for every element (and 
>> the bound_encode is O(1) for thees base types).  On the other hand, 
>> it is currently smart enough to not use new-style for individual 
>> integer types, like so
>>
>>   int32_t v;
>>   ::encode(v, bl);
>>
>> although I suspect after the optimizer gets done with it the 
>> generated machine code is almost identical.
>>
>> - Most of the key bluestore types are converted over so that we can 
>> do some benchmarking.
>>
>> An overview is at the top of the new denc.h header here:
>>
>> https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>>
>> I think I've captured the best of Allen's, Varada's, and Sam's 
>> various approaches, but we'll see how it behaves.  Let me know what you think!
>
> So far this is working pretty well!  It's not even crashing anymore. 
> :D
>
> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
>
> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
>
> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with each OSD using about 5.3GB of RSS at the end of a 5-minute randwrite test.
>
> Mark
>
>>
>> Thanks-
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>


* Re: wip-denc
  2016-09-13 21:17 wip-denc Sage Weil
  2016-09-13 23:45 ` wip-denc Mark Nelson
@ 2016-09-14  9:12 ` Joao Eduardo Luis
  2016-09-14 13:27   ` wip-denc Sage Weil
  2016-09-14 20:31 ` wip-denc Mark Nelson
  2 siblings, 1 reply; 18+ messages in thread
From: Joao Eduardo Luis @ 2016-09-14  9:12 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 09/13/2016 10:17 PM, Sage Weil wrote:
> Hi everyone,
>
> Okay, I have a new wip-denc branch working and ready for some review:
>
> 	https://github.com/ceph/ceph/pull/11027
>
> Highlights:
>
> - This includes appender/iterator changes to buffer* to speed up
> encoding and decoding (fewer bounds checks, simpler structures).
>
> - Accordingly, classes/types using the new-style have different arguments
> types for encode/decode.  There is also a new bound_encode() method that
> is used to calculate how big of a buffer to preallocate.
>
> - Most of the important helpers for doing types have new versions that
> work with the new framework (e.g., the ENCODE_START macro has a
> new DENC_START counterpart).
>
> - There is also a mechanism that lets you define the bound_encode,
> encode, and decode methods all in one go using some template magic.  This
> only works for pretty simple types, but it is handy.  It looks like so:
>
>   struct foo_t {
>     uint32_t a, b;
>     ...
>     DENC(foo_t, v, p) {
>       DENC_START(1, 1, p);
>       denc(v.a, p);
>       denc(v.b, p);
>       ...
>       DENC_FINISH(p);
>     }
>   };
>   WRITE_CLASS_DENC(foo_t)

This looks really neat!

I do have a question though:

   How do we handle conditional decoding depending on version?

AFAICT, we basically use the same function (`_denc_friend()`) to do all 
encoding/decoding operations, delegating to `denc()` the responsibility 
of figuring out which operation needs to be done based on the parameter 
types (i.e., decoding if the parameter is a buffer::ptr::iterator; some 
form of encoding otherwise).

Even with `DENC_START` populating `struct_v` and `struct_compat`, just 
like the traditional encoding.h scheme, I'm still trying to wrap my head 
around how this would ideally be handled.

So say we drop a field 'a' from `struct bar_t`, and add a field 'b' in 
version X+1. We still want to encode 'a' on-the-wire for backward 
compatibility with version X, alongside the shiny new 'b' for version 
 >= X+1.

Encoding can be trivially handled, as we do now, by simply populating 
whatever 'a' was with whatever we want. Decoding 'b' is also trivial, as 
we only have to decode it.

Whereas in version X we would simply have

   DENC(bar_t, v, p) {
    ...
    denc(v.a, p);
    ...
   }

on version X+1 we would have to make sure to encode the values 
corresponding to 'a' but not decode anything to `v.a`, as it would not 
exist.

I can see how we could, possibly, get there by using the specific 
encoding or decoding methods, depending on version, but that would 
eventually escalate and lead to a mess in `DENC()`.

Any thoughts on this?

   -Joao


* Re: wip-denc
  2016-09-14  4:05       ` wip-denc Somnath Roy
@ 2016-09-14 11:06         ` Mark Nelson
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Nelson @ 2016-09-14 11:06 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil, ceph-devel

Yeah, I was seeing the same thing.  I think Sage was getting clean 
valgrind runs, so I suspect what remains isn't actually a leak.

Mark

On 09/13/2016 11:05 PM, Somnath Roy wrote:
> In fact, the leak is not fixed totally :-(
> After 30 min run , my system memory started swapping again , will see what else is growing..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, September 13, 2016 7:42 PM
> To: Somnath Roy; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
>
> Here's another one to add to the pile:
>
>       0> 2016-09-13 22:31:45.142288 7f22a1d92700 -1
> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: In function 'virtual void BitMapAllocator::commit_start()' thread
> 7f22a1d92700 time 2016-09-13 22:31:45.137789
> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BitMapAllocator.cc: 339:
> FAILED assert(m_committing.empty())
>
>   ceph version v10.0.4-8831-ge3e37f1
> (e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x5636eac23b55]
>   2: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
>   3: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
>   4: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
>   5:
> (rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*,
> 8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
> [0x5636eaaf6852]
>   6: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b) [0x5636eaaf963b]
>   7:
> (rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
> rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
>   8: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
>   9: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
>   10: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
>   11: (()+0x9361f3) [0x5636eab751f3]
>   12: (()+0x7dc5) [0x7f22c4a8edc5]
>   13: (clone()+0x6d) [0x7f22c3975ced]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>     0/ 5 none
>     0/ 0 lockdep
>     0/ 0 context
>     0/ 0 crush
>     0/ 0 mds
>     0/ 0 mds_balancer
>     0/ 0 mds_locker
>     0/ 0 mds_log
>     0/ 0 mds_log_expire
>     0/ 0 mds_migrator
>     0/ 0 buffer
>     0/ 0 timer
>     0/ 0 filer
>     0/ 1 striper
>     0/ 0 objecter
>     0/ 0 rados
>     0/ 0 rbd
>     0/ 5 rbd_mirror
>     0/ 5 rbd_replay
>     0/ 0 journaler
>     0/ 0 objectcacher
>     0/ 0 client
>     0/ 0 osd
>     0/ 0 optracker
>     0/ 0 objclass
>     0/ 0 filestore
>     0/ 0 journal
>     0/ 0 ms
>     0/ 0 mon
>     0/ 0 monc
>     0/ 0 paxos
>     0/ 0 tp
>     0/ 0 auth
>     1/ 5 crypto
>     0/ 0 finisher
>     0/ 0 heartbeatmap
>     0/ 0 perfcounter
>     0/ 0 rgw
>     1/10 civetweb
>     1/ 5 javaclient
>     0/ 0 asok
>     0/ 0 throttle
>     0/ 0 refs
>     1/ 5 xio
>     1/ 5 compressor
>     1/ 5 newstore
>     0/ 0 bluestore
>     0/ 0 bluefs
>     0/ 0 bdev
>     1/ 5 kstore
>     4/ 5 rocksdb
>     4/ 5 leveldb
>     4/ 5 memdb
>     1/ 5 kinetic
>     1/ 5 fuse
>    -2/-2 (syslog threshold)
>    -1/-1 (stderr threshold)
>    max_recent     10000
>    max_new         1000
>    log_file /tmp/cbt/ceph/log/osd.5.log
> --- end dump of recent events ---
> 2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal (Aborted) **
>   in thread 7f22a1d92700 thread_name:rocksdb:bg0
>
>   ceph version v10.0.4-8831-ge3e37f1
> (e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
>   1: (()+0x869fba) [0x5636eaaa8fba]
>   2: (()+0xf100) [0x7f22c4a96100]
>   3: (gsignal()+0x37) [0x7f22c38b45f7]
>   4: (abort()+0x148) [0x7f22c38b5ce8]
>   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x267) [0x5636eac23d37]
>   6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
>   7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
>   8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
>   9:
> (rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*,
> 8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
> [0x5636eaaf6852]
>   10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b)
> [0x5636eaaf963b]
>   11:
> (rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
> rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
>   12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
>   13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
>   14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
>   15: (()+0x9361f3) [0x5636eab751f3]
>   16: (()+0x7dc5) [0x7f22c4a8edc5]
>   17: (clone()+0x6d) [0x7f22c3975ced]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>       0> 2016-09-13 22:31:45.187609 7f22a1d92700 -1 *** Caught signal
> (Aborted) **
>   in thread 7f22a1d92700 thread_name:rocksdb:bg0
>
>   ceph version v10.0.4-8831-ge3e37f1
> (e3e37f11cc0e5cd0b2b0cb29109c66abe328de64)
>   1: (()+0x869fba) [0x5636eaaa8fba]
>   2: (()+0xf100) [0x7f22c4a96100]
>   3: (gsignal()+0x37) [0x7f22c38b45f7]
>   4: (abort()+0x148) [0x7f22c38b5ce8]
>   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x267) [0x5636eac23d37]
>   6: (BitMapAllocator::commit_start()+0x2e4) [0x5636eaa9c434]
>   7: (BlueFS::sync_metadata()+0x1be) [0x5636eaa6c3be]
>   8: (BlueRocksDirectory::Fsync()+0xd) [0x5636eaa7ff4d]
>   9:
> (rocksdb::FlushJob::WriteLevel0Table(rocksdb::autovector<rocksdb::MemTable*,
> 8ul> const&, rocksdb::VersionEdit*, rocksdb::FileMetaData*)+0xc52)
> [0x5636eaaf6852]
>   10: (rocksdb::FlushJob::Run(rocksdb::FileMetaData*)+0x33b)
> [0x5636eaaf963b]
>   11:
> (rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*,
> rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ad) [0x5636eaab73ad]
>   12: (rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*)+0x3ba) [0x5636eaab7efa]
>   13: (rocksdb::DBImpl::BackgroundCallFlush()+0x25c) [0x5636eaacd58c]
>   14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1a1) [0x5636eab75111]
>   15: (()+0x9361f3) [0x5636eab751f3]
>   16: (()+0x7dc5) [0x7f22c4a8edc5]
>   17: (clone()+0x6d) [0x7f22c3975ced]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>     0/ 5 none
>     0/ 0 lockdep
>     0/ 0 context
>     0/ 0 crush
>     0/ 0 mds
>     0/ 0 mds_balancer
>     0/ 0 mds_locker
>     0/ 0 mds_log
>     0/ 0 mds_log_expire
>     0/ 0 mds_migrator
>     0/ 0 buffer
>     0/ 0 timer
>     0/ 0 filer
>     0/ 1 striper
>     0/ 0 objecter
>     0/ 0 rados
>     0/ 0 rbd
>     0/ 5 rbd_mirror
>     0/ 5 rbd_replay
>     0/ 0 journaler
>     0/ 0 objectcacher
>     0/ 0 client
>     0/ 0 osd
>     0/ 0 optracker
>     0/ 0 objclass
>     0/ 0 filestore
>     0/ 0 journal
>     0/ 0 ms
>     0/ 0 mon
>     0/ 0 monc
>     0/ 0 paxos
>     0/ 0 tp
>     0/ 0 auth
>     1/ 5 crypto
>     0/ 0 finisher
>     0/ 0 heartbeatmap
>     0/ 0 perfcounter
>     0/ 0 rgw
>     1/10 civetweb
>     1/ 5 javaclient
>     0/ 0 asok
>     0/ 0 throttle
>     0/ 0 refs
>     1/ 5 xio
>     1/ 5 compressor
>     1/ 5 newstore
>     0/ 0 bluestore
>     0/ 0 bluefs
>     0/ 0 bdev
>     1/ 5 kstore
>     4/ 5 rocksdb
>     4/ 5 leveldb
>     4/ 5 memdb
>     1/ 5 kinetic
>     1/ 5 fuse
>    -2/-2 (syslog threshold)
>    -1/-1 (stderr threshold)
>    max_recent     10000
>    max_new         1000
>    log_file /tmp/cbt/ceph/log/osd.5.log
> --- end dump of recent events ---
>
>
> On 09/13/2016 07:29 PM, Somnath Roy wrote:
>> Sage/Mark,
>> I cherry-picked the 5 commits related to memory leak from the pull request and it seems to be *fixing* the memory issue so far in my setup. Got the following crash though.
>>
>> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED
>> assert(bc.empty())
>>
>>  ceph version v11.0.0-2254-g8c52058
>> (8c52058bc8a0f060e06037196627c19b7d258f1f)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x80) [0x55d6cde26ef0]
>>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3)
>> [0x55d6cdb6e6d3]
>>  6:
>> (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&
>> , ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120)
>> [0x55d6cdb93810]
>>  7: (PGBackend::objects_get_attr(hobject_t const&,
>> std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > const&, ceph::buffer::list*)+0x106)
>> [0x55d6cd970216]
>>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool,
>> std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> >, ceph::buffer::list,
>> std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > >,
>> std::allocator<std::pair<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > const,
>> ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&,
>> eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>>  10: (ReplicatedPG::recover_replicas(unsigned long,
>> ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>>  11: (ReplicatedPG::start_recovery_ops(unsigned long,
>> ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long,
>> ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>>  13: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9)
>> [0x55d6cde2bca9]
>>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>> [0x55d6cde2f060]
>>  16: (()+0x76fa) [0x7fd9251296fa]
>>  17: (clone()+0x6d) [0x7fd923a14b5d]
>>
>>
>> I comment out the following commit as it seems a double delete from BlueStore::SharedBlob::put() (?).
>>
>> 681c368c3e809c3291430582bc5d11f9cc44b4ca
>>
>> Another point, probably, not related to leak fix, but, in the latest master (+ your 1st 5 commits) I am hitting the following assert after OSD restart. If I run IO for some time..
>>
>> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
>> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
>> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at
>> ../sysdeps/unix/sysv/linux/raise.c:54
>> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
>> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>>     func=func@entry=0x5555562462c0
>> <StupidAllocator::init_rm_free(unsigned long, unsigned
>> long)::__PRETTY_FUNCTION__> "virtual void
>> StupidAllocator::init_rm_free(uint64_t, uint64_t)") at
>> /root/ceph-master/src/common/assert.cc:78
>> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
>> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at
>> /root/ceph-master/src/os/bluestore/BlueFS.cc:366
>> #5  0x0000555555d017e5 in BlueStore::_open_db
>> (this=this@entry=0x55555f162000, create=create@entry=false) at
>> /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
>> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at
>> /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
>> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at
>> /root/ceph-master/src/osd/OSD.cc:2025
>> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized
>> out>) at /root/ceph-master/src/ceph_osd.cc:609
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, September 13, 2016 4:46 PM
>> To: Sage Weil; ceph-devel@vger.kernel.org
>> Subject: Re: wip-denc
>>
>>
>>
>> On 09/13/2016 04:17 PM, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> Okay, I have a new wip-denc branch working and ready for some review:
>>>
>>> https://github.com/ceph/ceph/pull/11027
>>>
>>> Highlights:
>>>
>>> - This includes appender/iterator changes to buffer* to speed up
>>> encoding and decoding (fewer bounds checks, simpler structures).
>>>
>>> - Accordingly, classes/types using the new-style have different
>>> arguments types for encode/decode.  There is also a new
>>> bound_encode() method that is used to calculate how big of a buffer to preallocate.
>>>
>>> - Most of the important helpers for doing types have new versions
>>> that work with the new framework (e.g., the ENCODE_START macro has a
>>> new DENC_START counterpart).
>>>
>>> - There is also a mechanism that lets you define the bound_encode,
>>> encode, and decode methods all in one go using some template magic.
>>> This only works for pretty simple types, but it is handy.  It looks like so:
>>>
>>>   struct foo_t {
>>>     uint32_t a, b;
>>>     ...
>>>     DENC(foo_t, v, p) {
>>>       DENC_START(1, 1, p);
>>>       denc(v.a, p);
>>>       denc(v.b, p);
>>>       ...
>>>       DENC_FINISH(p);
>>>     }
>>>   };
>>>   WRITE_CLASS_DENC(foo_t)
>>>
>>>
>>> - For new-style types, a new 'denc' function that is overloaded to do
>>> either bound_encode, encode, or decode (based on argument types) is defined.
>>> That means that
>>>
>>>   ::denc(v, p);
>>>
>>> will work for size_t& p, bufferptr::iterator& p, or
>>> bufferlist::contiguous_appender& p.  This facilitates the DENC
>>> definitions above.
>>>
>>> - There is glue to invoke new-style encode/decode when old-style
>>> encode() and decode() are invoked, provided a denc_traits<T> is defined.
>>>
>>> - Most of the common containers are there (list, vector, set, map,
>>> pair), but others need to be converted.
>>>
>>> - Currently, we're a bit aggressive about using the new-style over
>>> the old-style when we have the chance.  For example, if you have
>>>
>>>   vector<int32_t> foo;
>>>   ::encode(foo, bl);
>>>
>>> it will see that it knows how to do int32_t new-style and invoke the
>>> new-style vector<> code.  I think this is going to be a net win,
>>> since we avoid doing bounds checks on append for every element (and
>>> the bound_encode is O(1) for these base types).  On the other hand,
>>> it is currently smart enough to not use new-style for individual
>>> integer types, like so
>>>
>>>   int32_t v;
>>>   ::encode(v, bl);
>>>
>>> although I suspect after the optimizer gets done with it the
>>> generated machine code is almost identical.
>>>
>>> - Most of the key bluestore types are converted over so that we can
>>> do some benchmarking.
>>>
>>> An overview is at the top of the new denc.h header here:
>>>
>>> https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>>>
>>> I think I've captured the best of Allen's, Varada's, and Sam's
>>> various approaches, but we'll see how it behaves.  Let me know what you think!
>>
>> So far this is working pretty well!  It's not even crashing anymore.
>> :D
>>
>> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
>>
>> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
>>
>> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with each OSD using about 5.3GB of RSS at the end of a 5-minute randwrite test.
>>
>> Mark
>>
>>>
>>> Thanks-
>>> sage
>>>
>>


* Re: wip-denc
  2016-09-14  9:12 ` wip-denc Joao Eduardo Luis
@ 2016-09-14 13:27   ` Sage Weil
  2016-09-14 15:03     ` wip-denc Joao Eduardo Luis
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-09-14 13:27 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel

On Wed, 14 Sep 2016, Joao Eduardo Luis wrote:
> On 09/13/2016 10:17 PM, Sage Weil wrote:
> > Hi everyone,
> > 
> > Okay, I have a new wip-denc branch working and ready for some review:
> > 
> > 	https://github.com/ceph/ceph/pull/11027
> > 
> > Highlights:
> > 
> > - This includes appender/iterator changes to buffer* to speed up
> > encoding and decoding (fewer bounds checks, simpler structures).
> > 
> > - Accordingly, classes/types using the new-style have different arguments
> > types for encode/decode.  There is also a new bound_encode() method that
> > is used to calculate how big of a buffer to preallocate.
> > 
> > - Most of the important helpers for doing types have new versions that
> > work with the new framework (e.g., the ENCODE_START macro has a
> > new DENC_START counterpart).
> > 
> > - There is also a mechanism that lets you define the bound_encode,
> > encode, and decode methods all in one go using some template magic.  This
> > only works for pretty simple types, but it is handy.  It looks like so:
> > 
> >   struct foo_t {
> >     uint32_t a, b;
> >     ...
> >     DENC(foo_t, v, p) {
> >       DENC_START(1, 1, p);
> >       denc(v.a, p);
> >       denc(v.b, p);
> >       ...
> >       DENC_FINISH(p);
> >     }
> >   };
> >   WRITE_CLASS_DENC(foo_t)
> 
> This looks really neat!
> 
> I do have a question though:
> 
>   How do we handle conditional decoding depending on version?
> 
> AFAICT, we basically use the same function (`_denc_friend()`) to do all
> encoding/decoding operations, delegating to `denc()` the responsibility of
> figuring out what operation needs to be done according to parameters (i.e.,
> decoding if parameter is a buffer::ptr::iterator; some form of encoding
> otherwise).
> 
> Even with `DENC_START` populating `struct_v` and `struct_compat`, just like
> the traditional encoding.h scheme, I'm still trying to wrap my head around how
> this would ideally be handled.
> 
> So say we drop a field 'a' from `struct bar_t`, and add a field 'b' on version
> X+1. We still want to encode 'a' on-the-wire for backward compatibility with
> version X, alongside the shiny new 'b' for version >= X+1.
> 
> Encoding can be trivially handled, as we do now, by simply populating whatever
> 'a' was with whatever we want. Decoding 'b' is also trivial, as we only have
> to decode it.
> 
> Whereas in version X we would simply have
> 
>   DENC(bar_t, v, p) {
>    ...
>    denc(v.a, p);
>    ...
>   }
> 
> on version X+1 we would have to make sure to encode the values corresponding
> to 'a' but not decode anything to `v.a`, as it would not exist.

   struct foo_t {
     uint32_t a, b;
     ...
     DENC(foo_t, v, p) {
       DENC_START(1, 1, p);
       denc(v.a, p);
       denc(v.b, p);
       if (struct_v >= 2) {
         denc(v.new_thing, p);
       }
       DENC_FINISH(p);
     }
   };
   WRITE_CLASS_DENC(foo_t)

should do it, although I haven't tested.  May need to tweak the 
_denc_start to ensure that struct_v is populated for the bound_encode and 
encode cases.  (It should get optimized away by the compiler since 
struct_v is being assigned a compile-time constant, the 'v' macro param.)

What doesn't work are cases where the decode path has a more complicated 
conditional or compat value (one that isn't the class constructor's 
default).  In those cases you just need to define the 3 methods 
separately!

sage


* RE: wip-denc
  2016-09-14  0:29   ` wip-denc Somnath Roy
  2016-09-14  2:41     ` wip-denc Mark Nelson
@ 2016-09-14 14:10     ` Sage Weil
  2016-09-14 14:51       ` wip-denc Somnath Roy
  2016-09-14 17:53       ` wip-denc Somnath Roy
  2016-09-15  0:39     ` wip-denc Sage Weil
  2 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2016-09-14 14:10 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, ceph-devel

On Wed, 14 Sep 2016, Somnath Roy wrote:
> Sage/Mark,
> I cherry-picked the 5 commits related to memory leak from the pull request and it seems to be *fixing* the memory issue so far in my setup. Got the following crash though.
> 
> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED assert(bc.empty())

I've pushed a fix for this to the branch.  (It wasn't ~SharedBlob(), but 
the BufferSpace cache data that had to get freed under the correct lock.)

I'm not sure about the StupidAllocator one, though... probably not 
related.  What options are you running with?  I've testing with defaults, 
which use BitmapAllocator for both bluefs and bluestore...

Thanks!
sage


>  ceph version v11.0.0-2254-g8c52058 (8c52058bc8a0f060e06037196627c19b7d258f1f)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d6cde26ef0]
>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) [0x55d6cdb6e6d3]
>  6: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) [0x55d6cdb93810]
>  7: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x106) [0x55d6cd970216]
>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>  10: (ReplicatedPG::recover_replicas(unsigned long, ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>  11: (ReplicatedPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55d6cde2bca9]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6cde2f060]
>  16: (()+0x76fa) [0x7fd9251296fa]
>  17: (clone()+0x6d) [0x7fd923a14b5d]
> 
> 
> I comment out the following commit as it seems a double delete from BlueStore::SharedBlob::put() (?).
> 
> 681c368c3e809c3291430582bc5d11f9cc44b4ca
> 
> Another point, probably, not related to leak fix, but, in the latest master (+ your 1st 5 commits) I am hitting the following assert after OSD restart. If I run IO for some time..
> 
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0 <StupidAllocator::init_rm_free(unsigned long, unsigned long)::__PRETTY_FUNCTION__> "virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)") at /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db (this=this@entry=0x55555f162000, create=create@entry=false) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized out>) at /root/ceph-master/src/ceph_osd.cc:609
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
> 
> 
> 
> On 09/13/2016 04:17 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > Okay, I have a new wip-denc branch working and ready for some review:
> >
> > https://github.com/ceph/ceph/pull/11027
> >
> > Highlights:
> >
> > - This includes appender/iterator changes to buffer* to speed up
> > encoding and decoding (fewer bounds checks, simpler structures).
> >
> > - Accordingly, classes/types using the new-style have different
> > argument types for encode/decode.  There is also a new bound_encode()
> > method that is used to calculate how big of a buffer to preallocate.
> >
> > - Most of the important helpers for doing types have new versions that
> > work with the new framework (e.g., the ENCODE_START macro has a new
> > DENC_START counterpart).
> >
> > - There is also a mechanism that lets you define the bound_encode,
> > encode, and decode methods all in one go using some template magic.
> > This only works for pretty simple types, but it is handy.  It looks like so:
> >
> >   struct foo_t {
> >     uint32_t a, b;
> >     ...
> >     DENC(foo_t, v, p) {
> >       DENC_START(1, 1, p);
> >       denc(v.a, p);
> >       denc(v.b, p);
> >       ...
> >       DENC_FINISH(p);
> >     }
> >   };
> >   WRITE_CLASS_DENC(foo_t)
> >
> >
> > - For new-style types, a new 'denc' function that is overloaded to do
> > either bound_encode, encode, or decode (based on argument types) is defined.
> > That means that
> >
> >   ::denc(v, p);
> >
> > will work for size_t& p, bufferptr::iterator& p, or
> > bufferlist::contiguous_appender& p.  This facilitates the DENC
> > definitions above.
> >
> > - There is glue to invoke new-style encode/decode when old-style
> > encode() and decode() are invoked, provided a denc_traits<T> is defined.
> >
> > - Most of the common containers are there (list, vector, set, map,
> > pair), but others need to be converted.
> >
> > - Currently, we're a bit aggressive about using the new-style over the
> > old-style when we have the chance.  For example, if you have
> >
> >   vector<int32_t> foo;
> >   ::encode(foo, bl);
> >
> > it will see that it knows how to do int32_t new-style and invoke the
> > new-style vector<> code.  I think this is going to be a net win, since
> > we avoid doing bounds checks on append for every element (and the
> > bound_encode is O(1) for these base types).  On the other hand, it is
> > currently smart enough to not use new-style for individual integer
> > types, like so
> >
> >   int32_t v;
> >   ::encode(v, bl);
> >
> > although I suspect after the optimizer gets done with it the generated
> > machine code is almost identical.
> >
> > - Most of the key bluestore types are converted over so that we can do
> > some benchmarking.
> >
> > An overview is at the top of the new denc.h header here:
> >
> > https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
> >
> > I think I've captured the best of Allen's, Varada's, and Sam's various
> > approaches, but we'll see how it behaves.  Let me know what you think!
> 
> So far this is working pretty well!  It's not even crashing anymore. :D
> 
> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
> 
> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
> 
> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5 minute randwrite test.
> 
> Mark
> 
> >
> > Thanks-
> > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-14 14:10     ` wip-denc Sage Weil
@ 2016-09-14 14:51       ` Somnath Roy
  2016-09-14 17:53       ` wip-denc Somnath Roy
  1 sibling, 0 replies; 18+ messages in thread
From: Somnath Roy @ 2016-09-14 14:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel

I am using the stupid allocator and bitmap freelist.
I will try to see what is happening there.

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, September 14, 2016 7:10 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel@vger.kernel.org
Subject: RE: wip-denc

On Wed, 14 Sep 2016, Somnath Roy wrote:
> Sage/Mark,
> I cherry-picked the 5 commits related to the memory leak from the pull request, and it seems to be *fixing* the memory issue so far in my setup. Got the following crash, though.
> 
> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED 
> assert(bc.empty())

I've pushed a fix for this to the branch.  (It wasn't ~SharedBlob(), but the BufferSpace cache data that had to get freed under the correct lock.)

I'm not sure about the StupidAllocator one, though... probably not related.  What options are you running with?  I've been testing with defaults, which use BitmapAllocator for both bluefs and bluestore...

Thanks!
sage


>  ceph version v11.0.0-2254-g8c52058 
> (8c52058bc8a0f060e06037196627c19b7d258f1f)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x55d6cde26ef0]
>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) 
> [0x55d6cdb6e6d3]
>  6: 
> (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&
> , ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) 
> [0x55d6cdb93810]
>  7: (PGBackend::objects_get_attr(hobject_t const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x106) 
> [0x55d6cd970216]
>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool, 
> std::map<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >, ceph::buffer::list, 
> std::less<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > >, 
> std::allocator<std::pair<std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const, 
> ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, 
> eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>  10: (ReplicatedPG::recover_replicas(unsigned long, 
> ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>  11: (ReplicatedPG::start_recovery_ops(unsigned long, 
> ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
> ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) 
> [0x55d6cde2bca9]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) 
> [0x55d6cde2f060]
>  16: (()+0x76fa) [0x7fd9251296fa]
>  17: (clone()+0x6d) [0x7fd923a14b5d]
> 
> 
> I commented out the following commit, as it seems to cause a double delete from BlueStore::SharedBlob::put() (?).
> 
> 681c368c3e809c3291430582bc5d11f9cc44b4ca
> 
> Another point, probably not related to the leak fix: in the latest master (+ your first 5 commits) I am hitting the following assert after an OSD restart, if I run IO for some time.
> 
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0 
> <StupidAllocator::init_rm_free(unsigned long, unsigned 
> long)::__PRETTY_FUNCTION__> "virtual void 
> StupidAllocator::init_rm_free(uint64_t, uint64_t)") at 
> /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at 
> /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db 
> (this=this@entry=0x55555f162000, create=create@entry=false) at 
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at 
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at 
> /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized 
> out>) at /root/ceph-master/src/ceph_osd.cc:609
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
> 
> 
> 
> On 09/13/2016 04:17 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > Okay, I have a new wip-denc branch working and ready for some review:
> >
> > https://github.com/ceph/ceph/pull/11027
> >
> > Highlights:
> >
> > - This includes appender/iterator changes to buffer* to speed up 
> > encoding and decoding (fewer bounds checks, simpler structures).
> >
> > - Accordingly, classes/types using the new-style have different 
> > argument types for encode/decode.  There is also a new 
> > bound_encode() method that is used to calculate how big of a buffer to preallocate.
> >
> > - Most of the important helpers for doing types have new versions 
> > that work with the new framework (e.g., the ENCODE_START macro has a 
> > new DENC_START counterpart).
> >
> > - There is also a mechanism that lets you define the bound_encode, 
> > encode, and decode methods all in one go using some template magic.
> > This only works for pretty simple types, but it is handy.  It looks like so:
> >
> >   struct foo_t {
> >     uint32_t a, b;
> >     ...
> >     DENC(foo_t, v, p) {
> >       DENC_START(1, 1, p);
> >       denc(v.a, p);
> >       denc(v.b, p);
> >       ...
> >       DENC_FINISH(p);
> >     }
> >   };
> >   WRITE_CLASS_DENC(foo_t)
> >
> >
> > - For new-style types, a new 'denc' function that is overloaded to do 
> > either bound_encode, encode, or decode (based on argument types) is defined.
> > That means that
> >
> >   ::denc(v, p);
> >
> > will work for size_t& p, bufferptr::iterator& p, or 
> > bufferlist::contiguous_appender& p.  This facilitates the DENC 
> > definitions above.
> >
> > - There is glue to invoke new-style encode/decode when old-style
> > encode() and decode() are invoked, provided a denc_traits<T> is defined.
> >
> > - Most of the common containers are there (list, vector, set, map, 
> > pair), but others need to be converted.
> >
> > - Currently, we're a bit aggressive about using the new-style over 
> > the old-style when we have the chance.  For example, if you have
> >
> >   vector<int32_t> foo;
> >   ::encode(foo, bl);
> >
> > it will see that it knows how to do int32_t new-style and invoke the 
> > new-style vector<> code.  I think this is going to be a net win, 
> > since we avoid doing bounds checks on append for every element (and 
> > the bound_encode is O(1) for these base types).  On the other hand, 
> > it is currently smart enough to not use new-style for individual 
> > integer types, like so
> >
> >   int32_t v;
> >   ::encode(v, bl);
> >
> > although I suspect after the optimizer gets done with it the 
> > generated machine code is almost identical.
> >
> > - Most of the key bluestore types are converted over so that we can 
> > do some benchmarking.
> >
> > An overview is at the top of the new denc.h header here:
> >
> > https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L5
> > 5
> >
> > I think I've captured the best of Allen's, Varada's, and Sam's 
> > various approaches, but we'll see how it behaves.  Let me know what you think!
> 
> So far this is working pretty well!  It's not even crashing anymore. 
> :D
> 
> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
> 
> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
> 
> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5 minute randwrite test.
> 
> Mark
> 
> >
> > Thanks-
> > sage
> >
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-14 13:27   ` wip-denc Sage Weil
@ 2016-09-14 15:03     ` Joao Eduardo Luis
  0 siblings, 0 replies; 18+ messages in thread
From: Joao Eduardo Luis @ 2016-09-14 15:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 09/14/2016 02:27 PM, Sage Weil wrote:
> What doesn't work is cases where the decode path has a more complicated
> conditional or compat value (that isn't the class constructor's default).
> In those cases you just need to define the 3 methods separately!

Gotcha. Makes sense.

   -Joao


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-14 14:10     ` wip-denc Sage Weil
  2016-09-14 14:51       ` wip-denc Somnath Roy
@ 2016-09-14 17:53       ` Somnath Roy
  1 sibling, 0 replies; 18+ messages in thread
From: Somnath Roy @ 2016-09-14 17:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel

Sage,
Quick update: with the latest master plus your BufferSpace fix, the memory issue seems resolved; 2 hours now and not seeing much growth :-)
Regarding the performance drop, I will keep you posted.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Wednesday, September 14, 2016 7:51 AM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel@vger.kernel.org
Subject: RE: wip-denc

I am using the stupid allocator and bitmap freelist.
I will try to see what is happening there.

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, September 14, 2016 7:10 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel@vger.kernel.org
Subject: RE: wip-denc

On Wed, 14 Sep 2016, Somnath Roy wrote:
> Sage/Mark,
> I cherry-picked the 5 commits related to the memory leak from the pull request, and it seems to be *fixing* the memory issue so far in my setup. Got the following crash, though.
> 
> /root/ceph-master/src/os/bluestore/BlueStore.h: 319: FAILED
> assert(bc.empty())

I've pushed a fix for this to the branch.  (It wasn't ~SharedBlob(), but the BufferSpace cache data that had to get freed under the correct lock.)

I'm not sure about the StupidAllocator one, though... probably not related.  What options are you running with?  I've been testing with defaults, which use BitmapAllocator for both bluefs and bluestore...

Thanks!
sage


>  ceph version v11.0.0-2254-g8c52058
> (8c52058bc8a0f060e06037196627c19b7d258f1f)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x80) [0x55d6cde26ef0]
>  2: (BlueStore::SharedBlob::~SharedBlob()+0x82) [0x55d6cdbabee2]
>  3: (BlueStore::SharedBlob::put()+0x17d) [0x55d6cdb581dd]
>  4: (BlueStore::Onode::put()+0x1e8) [0x55d6cdbaca88]
>  5: (BlueStore::TwoQCache::trim(unsigned long, unsigned long)+0x7b3) 
> [0x55d6cdb6e6d3]
>  6: 
> (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&
> , ghobject_t const&, char const*, ceph::buffer::ptr&)+0x120) 
> [0x55d6cdb93810]
>  7: (PGBackend::objects_get_attr(hobject_t const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x106) 
> [0x55d6cd970216]
>  8: (ReplicatedPG::get_object_context(hobject_t const&, bool, 
> std::map<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >, ceph::buffer::list, 
> std::less<std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > >, 
> std::allocator<std::pair<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const, 
> ceph::buffer::list> > >*)+0x92e) [0x55d6cd8f0cee]
>  9: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, 
> eversion_t, PGBackend::RecoveryHandle*)+0x1a4) [0x55d6cd8f9654]
>  10: (ReplicatedPG::recover_replicas(unsigned long,
> ThreadPool::TPHandle&)+0x1093) [0x55d6cd937d83]
>  11: (ReplicatedPG::start_recovery_ops(unsigned long, 
> ThreadPool::TPHandle&, unsigned long*)+0xd86) [0x55d6cd9406a6]
>  12: (OSD::do_recovery(PG*, unsigned int, unsigned long,
> ThreadPool::TPHandle&)+0x426) [0x55d6cd7b2f96]
>  13: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x5bf) [0x55d6cd7b3b6f]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) 
> [0x55d6cde2bca9]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> [0x55d6cde2f060]
>  16: (()+0x76fa) [0x7fd9251296fa]
>  17: (clone()+0x6d) [0x7fd923a14b5d]
> 
> 
> I comment out the following commit as it seems a double delete from BlueStore::SharedBlob::put() (?).
> 
> 681c368c3e809c3291430582bc5d11f9cc44b4ca
> 
> Another point, probably not related to the leak fix: in the latest master (+ your first 5 commits) I am hitting the following assert after an OSD restart, if I run IO for some time.
> 
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0
> <StupidAllocator::init_rm_free(unsigned long, unsigned 
> long)::__PRETTY_FUNCTION__> "virtual void 
> StupidAllocator::init_rm_free(uint64_t, uint64_t)") at
> /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at
> /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db 
> (this=this@entry=0x55555f162000, create=create@entry=false) at
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at
> /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at
> /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized
> out>) at /root/ceph-master/src/ceph_osd.cc:609
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, September 13, 2016 4:46 PM
> To: Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: wip-denc
> 
> 
> 
> On 09/13/2016 04:17 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > Okay, I have a new wip-denc branch working and ready for some review:
> >
> > https://github.com/ceph/ceph/pull/11027
> >
> > Highlights:
> >
> > - This includes appender/iterator changes to buffer* to speed up 
> > encoding and decoding (fewer bounds checks, simpler structures).
> >
> > - Accordingly, classes/types using the new-style have different 
> > argument types for encode/decode.  There is also a new
> > bound_encode() method that is used to calculate how big of a buffer to preallocate.
> >
> > - Most of the important helpers for doing types have new versions 
> > that work with the new framework (e.g., the ENCODE_START macro has a 
> > new DENC_START counterpart).
> >
> > - There is also a mechanism that lets you define the bound_encode, 
> > encode, and decode methods all in one go using some template magic.
> > This only works for pretty simple types, but it is handy.  It looks like so:
> >
> >   struct foo_t {
> >     uint32_t a, b;
> >     ...
> >     DENC(foo_t, v, p) {
> >       DENC_START(1, 1, p);
> >       denc(v.a, p);
> >       denc(v.b, p);
> >       ...
> >       DENC_FINISH(p);
> >     }
> >   };
> >   WRITE_CLASS_DENC(foo_t)
> >
> >
> > - For new-style types, a new 'denc' function that is overloaded to do 
> > either bound_encode, encode, or decode (based on argument types) is defined.
> > That means that
> >
> >   ::denc(v, p);
> >
> > will work for size_t& p, bufferptr::iterator& p, or 
> > bufferlist::contiguous_appender& p.  This facilitates the DENC 
> > definitions above.
> >
> > - There is glue to invoke new-style encode/decode when old-style
> > encode() and decode() are invoked, provided a denc_traits<T> is defined.
> >
> > - Most of the common containers are there (list, vector, set, map, 
> > pair), but others need to be converted.
> >
> > - Currently, we're a bit aggressive about using the new-style over 
> > the old-style when we have the chance.  For example, if you have
> >
> >   vector<int32_t> foo;
> >   ::encode(foo, bl);
> >
> > it will see that it knows how to do int32_t new-style and invoke the 
> > new-style vector<> code.  I think this is going to be a net win, 
> > since we avoid doing bounds checks on append for every element (and 
> > the bound_encode is O(1) for these base types).  On the other hand, 
> > it is currently smart enough to not use new-style for individual 
> > integer types, like so
> >
> >   int32_t v;
> >   ::encode(v, bl);
> >
> > although I suspect after the optimizer gets done with it the 
> > generated machine code is almost identical.
> >
> > - Most of the key bluestore types are converted over so that we can 
> > do some benchmarking.
> >
> > An overview is at the top of the new denc.h header here:
> >
> > https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L5
> > 5
> >
> > I think I've captured the best of Allen's, Varada's, and Sam's 
> > various approaches, but we'll see how it behaves.  Let me know what you think!
> 
> So far this is working pretty well!  It's not even crashing anymore. 
> :D
> 
> Just recapping from IRC for ceph-devel: Overall append overhead is lower than it was before, but not as low as we got it in the old code by creating a single appender all the way from ENCODE_START to ENCODE_FINISH and (manually) estimating the memory to allocate across the whole thing.  On the other hand we are faster overall now with the extentmap sharding, so we're on the right track!  Here we're calling encode_some potentially many times in ExtentMap::update and creating a new appender for each encode_some call.  Not sure if we are flushing multiple times or only at the end, but buffer::list::append is still consuming a decent chunk of CPU, and encode_some itself is too. I'm suspicious of the c_str() calls and pbl->append(bufferptr(bp, 0, l)) even though they aren't being explicitly singled out by perf.  In any event there's probably some tweaking to do here.
> 
> Beyond that I don't have much to report yet.  Rb_tree_increment in encode_spanning_blobs and BitMapAreaLeaf::child_check_n_lock are starting to show up as bigger % consumers than they used to, so that's good news.
> 
> FWIW, it looks like we are getting about a 1-4% performance improvement over master from this in my CPU limited case.  So far memory usage is still pretty high, with OSDs using about 5.3GB of RSS per OSD at the end of a 5 minute randwrite test.
> 
> Mark
> 
> >
> > Thanks-
> > sage
> >
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-13 21:17 wip-denc Sage Weil
  2016-09-13 23:45 ` wip-denc Mark Nelson
  2016-09-14  9:12 ` wip-denc Joao Eduardo Luis
@ 2016-09-14 20:31 ` Mark Nelson
  2016-09-14 20:35   ` wip-denc Somnath Roy
  2 siblings, 1 reply; 18+ messages in thread
From: Mark Nelson @ 2016-09-14 20:31 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 09/13/2016 04:17 PM, Sage Weil wrote:
> Hi everyone,
>
> Okay, I have a new wip-denc branch working and ready for some review:
>
> 	https://github.com/ceph/ceph/pull/11027
>
> Highlights:
>
> - This includes appender/iterator changes to buffer* to speed up
> encoding and decoding (fewer bounds checks, simpler structures).
>
> - Accordingly, classes/types using the new-style have different argument
> types for encode/decode.  There is also a new bound_encode() method that
> is used to calculate how big of a buffer to preallocate.
>
> - Most of the important helpers for encoding/decoding types have new versions that
> work with the new framework (e.g., the ENCODE_START macro has a
> new DENC_START counterpart).
>
> - There is also a mechanism that lets you define the bound_encode,
> encode, and decode methods all in one go using some template magic.  This
> only works for pretty simple types, but it is handy.  It looks like so:
>
>   struct foo_t {
>     uint32_t a, b;
>     ...
>     DENC(foo_t, v, p) {
>       DENC_START(1, 1, p);
>       denc(v.a, p);
>       denc(v.b, p);
>       ...
>       DENC_FINISH(p);
>     }
>   };
>   WRITE_CLASS_DENC(foo_t)
>
>
> - For new-style types, a new 'denc' function that is overloaded to do
> bound_encode, encode, or decode (based on argument types) is defined.
> That means that
>
>   ::denc(v, p);
>
> will work for size_t& p, bufferptr::iterator& p, or
> bufferlist::contiguous_appender& p.  This facilitates the DENC definitions
> above.
>
> - There is glue to invoke new-style encode/decode when old-style encode()
> and decode() are invoked, provided a denc_traits<T> is defined.
>
> - Most of the common containers are there (list, vector, set, map, pair),
> but others need to be converted.
>
> - Currently, we're a bit aggressive about using the new-style over the
> old-style when we have the chance.  For example, if you have
>
>   vector<int32_t> foo;
>   ::encode(foo, bl);
>
> it will see that it knows how to do int32_t new-style and invoke the
> new-style vector<> code.  I think this is going to be a net win, since
> we avoid doing bounds checks on append for every element (and the
> bound_encode is O(1) for these base types).  On the other hand, it is
> currently smart enough to not use new-style for individual integer
> types, like so
>
>   int32_t v;
>   ::encode(v, bl);
>
> although I suspect after the optimizer gets done with it the generated
> machine code is almost identical.
>
> - Most of the key bluestore types are converted over so that we can do
> some benchmarking.
>
> An overview is at the top of the new denc.h header here:
>
> 	https://github.com/liewegas/ceph/blob/wip-denc/src/include/denc.h#L55
>
> I think I've captured the best of Allen's, Varada's, and Sam's various
> approaches, but we'll see how it behaves.  Let me know what you think!

Alright, made it through a round of benchmarking without crashing this 
time.  This is wip-denc + 11059 + 11014 on 4 NVMe cards split into 16 
OSDs.  Need to add the additional memory reduction patches, but for now 
this gives us a bit of an idea where we are at. Scroll to the right for 
graphs.

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZNi1aU1htRDRDekk

1) Basically sequential reads look bad, but we've known that for a while 
and we can look at it again once the dust settles.  We've never been 
great compared to filestore, but something took a turn for the worse 
earlier this summer.

2) Sequential writes are looking pretty great, and have been since July 
after a bitmap allocator fix.

3) Random read performance has dropped pretty significantly recently. 
Sage thinks this might be the sharding.

4) Small random write performance is about twice as fast, mostly due to 
the sharding, though only indirectly: I'd argue this is really due to 
the reduction in bufferlist appends, as we saw nearly the same 
improvement when we used the appender with the old code.  These tests 
continue to be CPU limited.

5) Sequential mixed read/write tests look pretty similar to the 7/28 
tests.  The difference vs jewel bluestore seems to primarily be the 
bitmap allocator, but other changes might be having an effect as well.

6) Random mixed read/write tests have improved since 7/28 with the 
sharding and encode/decode changes.  Performance is much higher for 
larger IOs and a little slower for 4K IOs, but it's fairly competitive 
in these tests.

>
> Thanks-
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-14 20:31 ` wip-denc Mark Nelson
@ 2016-09-14 20:35   ` Somnath Roy
  2016-09-14 20:37     ` wip-denc Mark Nelson
  0 siblings, 1 reply; 18+ messages in thread
From: Somnath Roy @ 2016-09-14 20:35 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, ceph-devel

Not able to access the graphs, Mark.

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Wednesday, September 14, 2016 1:32 PM
To: Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: wip-denc

On 09/13/2016 04:17 PM, Sage Weil wrote:
> [...]

[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: wip-denc
  2016-09-14 20:35   ` wip-denc Somnath Roy
@ 2016-09-14 20:37     ` Mark Nelson
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Nelson @ 2016-09-14 20:37 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil, ceph-devel

Strange, it's working for me, and claims to be public for everyone.  :/

Mark

On 09/14/2016 03:35 PM, Somnath Roy wrote:
> Not able to access the graphs Mark..
>
> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: wip-denc
  2016-09-14  0:29   ` wip-denc Somnath Roy
  2016-09-14  2:41     ` wip-denc Mark Nelson
  2016-09-14 14:10     ` wip-denc Sage Weil
@ 2016-09-15  0:39     ` Sage Weil
  2 siblings, 0 replies; 18+ messages in thread
From: Sage Weil @ 2016-09-15  0:39 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, ceph-devel

On Wed, 14 Sep 2016, Somnath Roy wrote:
> Another point, probably not related to the leak fix, but in the latest master (+ your 1st 5 commits) I am hitting the following assert after OSD restart, if I run IO for some time:
> 
> Thread 1 "ceph-osd" received signal SIGABRT, Aborted.
> 0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> 54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007ffff47fd418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007ffff47ff01a in __GI_abort () at abort.c:89
> #2  0x0000555555fa114b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x5555562460ed "rm.empty()",
>     file=file@entry=0x555556245fd8 "/root/ceph-master/src/os/bluestore/StupidAllocator.cc", line=line@entry=305,
>     func=func@entry=0x5555562462c0 <StupidAllocator::init_rm_free(unsigned long, unsigned long)::__PRETTY_FUNCTION__> "virtual void StupidAllocator::init_rm_free(uint64_t, uint64_t)") at /root/ceph-master/src/common/assert.cc:78
> #3  0x0000555555df6d5c in StupidAllocator::init_rm_free (this=0x55555f06f7a0, offset=<optimized out>, length=<optimized out>)
>     at /root/ceph-master/src/os/bluestore/StupidAllocator.cc:305
> #4  0x0000555555dd33aa in BlueFS::mount (this=0x55555f266580) at /root/ceph-master/src/os/bluestore/BlueFS.cc:366
> #5  0x0000555555d017e5 in BlueStore::_open_db (this=this@entry=0x55555f162000, create=create@entry=false) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3008
> #6  0x0000555555d1c02d in BlueStore::mount (this=0x55555f162000) at /root/ceph-master/src/os/bluestore/BlueStore.cc:3659
> #7  0x0000555555924ef3 in OSD::init (this=0x55555f14a000) at /root/ceph-master/src/osd/OSD.cc:2025
> #8  0x00005555558838ff in main (argc=<optimized out>, argv=<optimized out>) at /root/ceph-master/src/ceph_osd.cc:609

I hit this one this afternoon. Fix is included in

	https://github.com/ceph/ceph/pull/11087

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-09-15  0:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-13 21:17 wip-denc Sage Weil
2016-09-13 23:45 ` wip-denc Mark Nelson
2016-09-14  0:29   ` wip-denc Somnath Roy
2016-09-14  2:41     ` wip-denc Mark Nelson
2016-09-14  4:05       ` wip-denc Somnath Roy
2016-09-14 11:06         ` wip-denc Mark Nelson
2016-09-14 14:10     ` wip-denc Sage Weil
2016-09-14 14:51       ` wip-denc Somnath Roy
2016-09-14 17:53       ` wip-denc Somnath Roy
2016-09-15  0:39     ` wip-denc Sage Weil
2016-09-14  0:47   ` wip-denc Allen Samuels
2016-09-14  1:18     ` wip-denc Mark Nelson
2016-09-14  9:12 ` wip-denc Joao Eduardo Luis
2016-09-14 13:27   ` wip-denc Sage Weil
2016-09-14 15:03     ` wip-denc Joao Eduardo Luis
2016-09-14 20:31 ` wip-denc Mark Nelson
2016-09-14 20:35   ` wip-denc Somnath Roy
2016-09-14 20:37     ` wip-denc Mark Nelson
