* Bug 19198: Double the memory caused by page alignment
@ 2017-06-29 18:14 Mohamad Gebai
  2017-06-29 19:25 ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Mohamad Gebai @ 2017-06-29 18:14 UTC (permalink / raw)
  To: ceph-devel

Hi,

The ticket http://tracker.ceph.com/issues/19198 says that Bluestore uses
twice as much memory as it should, and the description talks about
page alignment. By looking at the unit test attached to the ticket (see
Bugzilla), I came to the conclusion that it is neither a Bluestore nor a
Ceph bug, but it's simply due to the allocation pattern. The unit test
that reproduces the bug does page-aligned allocations of 4KB blocks (a
page size) in a tight loop. What each allocation ends up doing is the
following:

1. Find the next page boundary because the user (caller of malloc) wants
a page-aligned allocation
2. Allocate the memory requested by the user (a whole page)
3. Keep metadata about that chunk of memory

We can see that in this case two pages have been touched: one for the
user and one for the metadata. Since this is in a tight loop, each
iteration skips a page that is almost completely empty in order to have
the next allocation page-aligned. This is the worst-case scenario, and
makes it seem like Bluestore uses "twice" the memory it should.

If the unit test were doing page-aligned allocations of 40KB, it would
seem like 10% more memory is used (10 pages for the data and one page
for the metadata). What this suggests is that there isn't a direct
solution for this "bug". Alternatively, if the unit test did
allocations of a page and a half, it would seem like Bluestore uses 33%
more memory than it should.
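
To make the arithmetic for the three cases above concrete, here is a
minimal sketch of the model I have in mind. It is only a model, not the
allocator's real behavior: it assumes each iteration starts on a page
boundary, places the user data, puts the metadata right after it, and
then skips to the next page boundary. The 64-byte metadata size and the
function names are assumptions made for illustration.

```python
PAGE = 4096  # assumed page size in bytes


def round_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def overhead(request_bytes, metadata_bytes=64, page=PAGE):
    """Fraction of extra memory touched per allocation in this model.

    metadata_bytes is a made-up figure; any small non-zero value gives
    the same result for the three cases discussed above.
    """
    stride = round_up(request_bytes + metadata_bytes, page)
    return stride / request_bytes - 1.0


for size in (4096, 40960, 6144):
    print(size, "bytes ->", round(overhead(size) * 100), "% extra")
```

With these assumptions the model gives roughly 100%, 10% and 33% extra
for 4KB, 40KB and one-and-a-half-page requests respectively.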

If the page-aligned allocations are large, and if they are sparse (i.e.,
there are random smaller non-page-aligned allocations in between), the
heap is much less fragmented, and it won't seem like the memory is
wasted. Does that seem like a reasonable hypothesis or did I completely
misunderstand the bug report?

How this affects Bluestore is in buffer::create_page_aligned(). The
question is: what is the pattern that would cause bufferlist to create
page-aligned buffers that are only a page in size? It doesn't seem like
*any* usage of Bluestore causes this issue (I played around with rados
bench without seeing the problem).

Note that the unit test attached in the ticket is deprecated.

Mohamad



* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 18:14 Bug 19198: Double the memory caused by page alignment Mohamad Gebai
@ 2017-06-29 19:25 ` Sage Weil
  2017-06-29 19:33   ` Jason Dillaman
  2017-06-29 21:11   ` Mohamad Gebai
  0 siblings, 2 replies; 10+ messages in thread
From: Sage Weil @ 2017-06-29 19:25 UTC (permalink / raw)
  To: Mohamad Gebai; +Cc: ceph-devel

Hi Mohamad,

Thanks for looking into this!

On Thu, 29 Jun 2017, Mohamad Gebai wrote:
> Hi,
> 
> The ticket http://tracker.ceph.com/issues/19198 says that Bluestore uses
> twice as much memory than it should, and the description talks about
> page alignment. By looking at the unit test in attachment (see
> Bugzilla), I came to the conclusion that it is neither a Bluestore nor a
> Ceph bug, but it's simply due to the allocation pattern. The unit test
> that reproduces the bug does page-aligned allocations of 4KB blocs (a
> page size) in a tight loop. What each allocation ends up doing is the
> following:
> 
> 1. Find the next page boundary because the user (caller of malloc) wants
> a page-aligned allocation
> 2. Allocate the memory requested by the user (a whole page)
> 3. Keep metadata about that chunk of memory
> 
> We can see in this case two pages have been touched: one for the user
> and, one for the metadata. Since this is in a tight loop, each iteration
> skips a page that is almost completely empty in order to do have the
> next allocation page-aligned. This is the worst case scenario, and makes
> it seem like Bluestore uses "twice" the memory it should.
> 
> If the unit test was doing page-aligned allocation of 40KB, it would
> seem like 10% more memory is used (10 pages for the data and one page
> for the metadata). What this suggests is that there isn't a direct
> solution for this "bug". Alternatively, if a the unit test did
> allocations of a page and a half, it would seem like Bluestore uses 33%
> more memory than it should.
> 
> If the page-aligned allocations are large, and if they are sparse (ie.
> there are random smaller non-page-aligned allocations in between), the
> heap is much less fragmented, and it won't seem like the memory is
> wasted. Does that seem like a reasonable hypothesis or did I completely
> misunderstand the bug report?

This all sounds right.

The problem is that it is common and expected for bluestore to ask for a 
4kb page-aligned buffer.  There is the 4kb aligned allocation for the 
buffer itself, and there is the small buffer::raw tracking struct 
with the ref count and so on.  This should end up consuming 4kb + a little 
bit, not 8kb.

First, it would be good to confirm the allocator actually does behave this 
way.  (Ick.)

Then, I think we need to figure out how to mitigate the problem.  I 
suspect what we need to do is create a slab-like allocation pool for the 
buffer::raw structs so that they do not consume a full page as a 
side effect of the allocation timing.
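
As a rough illustration of why pooling the small structs might help,
here is a toy simulation. It is not Ceph code and not a real allocator:
it assumes a simple bump allocator that never reuses memory, a 64-byte
tracking struct, and page-sized slabs. All names and sizes here are
hypothetical.

```python
PAGE = 4096  # assumed page size in bytes


def align_up(addr, align=PAGE):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align


def pages_interleaved(n, meta=64):
    """Bump-allocate n (page-aligned buffer, small struct) pairs in turn."""
    top = 0
    for _ in range(n):
        top = align_up(top) + PAGE  # page-aligned data buffer
        top += meta                 # tracking struct spills onto a new page
    return align_up(top) // PAGE


def pages_with_slab(n, meta=64):
    """Same workload, but tracking structs are carved out of slab pages."""
    per_slab = PAGE // meta
    top = 0
    free_slots = 0
    for _ in range(n):
        if free_slots == 0:
            top = align_up(top) + PAGE  # reserve one more slab page
            free_slots = per_slab
        free_slots -= 1                 # struct comes from the current slab
        top = align_up(top) + PAGE      # page-aligned data buffer
    return align_up(top) // PAGE


print(pages_interleaved(1024), pages_with_slab(1024))
```

In this model 1024 buffers touch 2048 pages when the structs are
interleaved, versus 1040 pages (1024 data pages plus 16 slab pages) when
they are pooled. Whether a real allocator behaves like the interleaved
case is exactly what still needs confirming.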

> How this affects Bluestore is in buffer::create_page_aligned(). The
> question is: what is the pattern that would cause bufferlist to create
> page-aligned buffers that are only a page in size? It doesn't seem like
> *any* usage of Bluestore causes this issue (played around with rados
> bench without seeing the problem).

IIRC Igor hit this by doing 4KB random writes via the fio ObjectStore 
driver.  I suspect we'd see something similar with the OSD and 4KB 
writes, but never confirmed.

sage


* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 19:25 ` Sage Weil
@ 2017-06-29 19:33   ` Jason Dillaman
  2017-06-29 21:11   ` Mohamad Gebai
  1 sibling, 0 replies; 10+ messages in thread
From: Jason Dillaman @ 2017-06-29 19:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mohamad Gebai, ceph-devel

On Thu, Jun 29, 2017 at 3:25 PM, Sage Weil <sage@newdream.net> wrote:
> IIRC Igor hit this by doing 4KB random writes via the fio ObjectStore
> driver.  I suspect we'd see a similar with the OSD and 4KB writes, but
> never confirmed.

There was also a recent ticket for the librbd client-side cache using
up exactly 2x the amount of RAM expected under 4K workloads. I haven't
looked into it yet, but I can see how "raw_combined" could potentially
cause this ballooning that I haven't seen before (I used to expect it
on <4K workloads under hammer).

-- 
Jason


* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 19:25 ` Sage Weil
  2017-06-29 19:33   ` Jason Dillaman
@ 2017-06-29 21:11   ` Mohamad Gebai
  2017-06-29 21:18     ` Sage Weil
  1 sibling, 1 reply; 10+ messages in thread
From: Mohamad Gebai @ 2017-06-29 21:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On 06/29/2017 03:25 PM, Sage Weil wrote:
> On Thu, 29 Jun 2017, Mohamad Gebai wrote:
>> If the page-aligned allocations are large, and if they are sparse (ie.
>> there are random smaller non-page-aligned allocations in between), the
>> heap is much less fragmented, and it won't seem like the memory is
>> wasted. Does that seem like a reasonable hypothesis or did I completely
>> misunderstand the bug report?
> This all sounds right.
>
> The problem is that it is common and expected for bluestore to ask for a 
> 4kb page-aligned buffer.  There is the 4kb aligned allocation for the 
> buffer itself, and there is the small buffer::raw tracking struct 
> with the ref count and so on.  This should end up consuming 4kb + a little 
> bit, not 8kb.

Right, 4kb for the data and a few extra bytes for the rest. So in total,
two pages are touched and accounted to the process for each
page-aligned allocation.

> First, it would be good to confirm the allocator actually does behave this 
> way.  (Ick.)

I was able to reproduce this quite easily outside of Ceph. If you're
interested, the code is here:
https://github.com/mogeb/utils/tree/master/mempool. This is simply a
standalone version of the attachment in the tracker. The output of the
program is as follows:

Mem before2: VmRSS:       10900 kB
Mem after2: VmRSS:     8399680 kB
Mem actually used: 8590110720 bytes
Mem that should be used: 4294967296 bytes
Difference: 4295143424 bytes, 4.00016 gb

Also, preloading libtcmalloc makes this behavior disappear (at least for
this program), which further confirms the hypothesis, since tcmalloc
does larger allocations internally.
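
For anyone who wants to check the figures in the output above, here is
a small sketch that recomputes them from the two VmRSS lines. It is not
part of the reproducer; the helper name vmrss_kb is made up, and it
assumes the "VmRSS: <n> kB" format used in /proc/<pid>/status.

```python
import re


def vmrss_kb(status_line):
    """Extract the kB value from a line containing 'VmRSS: <n> kB'."""
    m = re.search(r"VmRSS:\s+(\d+)\s+kB", status_line)
    if m is None:
        raise ValueError("no VmRSS field in: %r" % status_line)
    return int(m.group(1))


before = vmrss_kb("Mem before2: VmRSS:       10900 kB")
after = vmrss_kb("Mem after2: VmRSS:     8399680 kB")

used = (after - before) * 1024  # RSS growth in bytes
expected = 4294967296           # "Mem that should be used" from the output
extra = used - expected

print(used, extra, round(extra / 2**30, 5))
```

The resident set grows by about twice the 4 GiB that was requested,
which matches the "Difference" line.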

Mohamad




* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 21:11   ` Mohamad Gebai
@ 2017-06-29 21:18     ` Sage Weil
  2017-06-29 21:32       ` Mohamad Gebai
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-06-29 21:18 UTC (permalink / raw)
  To: Mohamad Gebai; +Cc: ceph-devel

On Thu, 29 Jun 2017, Mohamad Gebai wrote:
> Hi Sage,
> 
> On 06/29/2017 03:25 PM, Sage Weil wrote:
> > On Thu, 29 Jun 2017, Mohamad Gebai wrote:
> >> If the page-aligned allocations are large, and if they are sparse (ie.
> >> there are random smaller non-page-aligned allocations in between), the
> >> heap is much less fragmented, and it won't seem like the memory is
> >> wasted. Does that seem like a reasonable hypothesis or did I completely
> >> misunderstand the bug report?
> > This all sounds right.
> >
> > The problem is that it is common and expected for bluestore to ask for a 
> > 4kb page-aligned buffer.  There is the 4kb aligned allocation for the 
> > buffer itself, and there is the small buffer::raw tracking struct 
> > with the ref count and so on.  This should end up consuming 4kb + a little 
> > bit, not 8kb.
> 
> Right, 4kb for the data and a few extra bytes for the rest. So in total,
> two pages are touched and accounted for the process for each
> page-aligned allocation.
> 
> > First, it would be good to confirm the allocator actually does behave this 
> > way.  (Ick.)
> 
> I was able to reproduce this quite easily outside of Ceph, if you're
> interested the code is here:
> https://github.com/mogeb/utils/tree/master/mempool. This is simply a
> standalone version of the attachment in the tracker. The output of the
> program is as follows:
> 
> Mem before2: VmRSS:       10900 kB
> Mem after2: VmRSS:     8399680 kB
> Mem actually used: 8590110720 bytes
> Mem that should be used: 4294967296 bytes
> Difference: 4295143424 bytes, 4.00016 gb
> 
> Also, preloading libtcmalloc makes this behavior disappear (at least for
> this program), which confirms further the hypothesis, since tcmalloc
> does larger allocations internally.

What do you mean by that last paragraph?  We should be linking against 
tcmalloc in ceph.  But you see that using tcmalloc avoids the problem in 
the reproducer?

sage


* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 21:18     ` Sage Weil
@ 2017-06-29 21:32       ` Mohamad Gebai
  2017-06-30 13:32         ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Mohamad Gebai @ 2017-06-29 21:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


On 06/29/2017 05:18 PM, Sage Weil wrote:
> On Thu, 29 Jun 2017, Mohamad Gebai wrote:
>> Also, preloading libtcmalloc makes this behavior disappear (at least for
>> this program), which confirms further the hypothesis, since tcmalloc
>> does larger allocations internally.
> What do you mean by that last paragraph?  We should be linking against 
> tcmalloc in ceph.  But you see that using tcmalloc avoids the problem in 
> the reproducer?
>

Yes, that's exactly it. That's why I'm asking about the usage pattern,
so I can try to reproduce it in Ceph. I'll try fio as you suggested and
follow up. If you have any thoughts in the meantime, please let me know.

Mohamad



* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-29 21:32       ` Mohamad Gebai
@ 2017-06-30 13:32         ` Igor Fedotov
  2017-06-30 13:41           ` Mohamad Gebai
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2017-06-30 13:32 UTC (permalink / raw)
  To: Mohamad Gebai, Sage Weil; +Cc: ceph-devel

AFAIR I observed the issue for tcmalloc as well. There is a 
corresponding note in the ticket.


On 30.06.2017 0:32, Mohamad Gebai wrote:
> On 06/29/2017 05:18 PM, Sage Weil wrote:
>> On Thu, 29 Jun 2017, Mohamad Gebai wrote:
>>> Also, preloading libtcmalloc makes this behavior disappear (at least for
>>> this program), which confirms further the hypothesis, since tcmalloc
>>> does larger allocations internally.
>> What do you mean by that last paragraph?  We should be linking against
>> tcmalloc in ceph.  But you see that using tcmalloc avoids the problem in
>> the reproducer?
>>
> Yes, that's exactly it. That's why I'm asking about the pattern usage,
> so I can try to reproduce it in Ceph. I'll try fio as you suggested and
> follow up. If you have any thoughts in the mean time, please let me know.
>
> Mohamad
>



* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-30 13:32         ` Igor Fedotov
@ 2017-06-30 13:41           ` Mohamad Gebai
  2017-06-30 14:55             ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Mohamad Gebai @ 2017-06-30 13:41 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

Hi Igor,

On 06/30/2017 09:32 AM, Igor Fedotov wrote:
> AFAIR I observed the issue for tcmalloc as well. There is a
> corresponding note in the ticket..

Yes, I did see the note, but I wasn't able to reproduce the issue with
Ceph so I couldn't confirm the status with regard to tcmalloc + Ceph.
I'm about to try the workload that Sage suggested. If you have some
instructions on how to reproduce the problem, it would be great!

Mohamad



* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-30 13:41           ` Mohamad Gebai
@ 2017-06-30 14:55             ` Igor Fedotov
  2017-07-04  9:52               ` Mohamad Gebai
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2017-06-30 14:55 UTC (permalink / raw)
  To: Mohamad Gebai, Sage Weil; +Cc: ceph-devel

Hi Mohamad,

I'm in the process of moving my workplace at the moment.

Will double-check when completed.


But AFAIR the scenario was pretty trivial - build Ceph with ALLOCATOR 
configured to tcmalloc and run a UT with the diff from the ticket.


Thanks,

Igor


On 30.06.2017 16:41, Mohamad Gebai wrote:
> Hi Igor,
>
> On 06/30/2017 09:32 AM, Igor Fedotov wrote:
>> AFAIR I observed the issue for tcmalloc as well. There is a
>> corresponding note in the ticket..
> Yes I did see the note, but I wasn't able to reproduce the issue with
> Ceph so I couldn't confirm the status with regard to tcmalloc + Ceph.
> I'm about to try the workload that Sage suggested. If you have some
> instructions on how to reproduce the problem, it would be great!
>
> Mohamad
>



* Re: Bug 19198: Double the memory caused by page alignment
  2017-06-30 14:55             ` Igor Fedotov
@ 2017-07-04  9:52               ` Mohamad Gebai
  0 siblings, 0 replies; 10+ messages in thread
From: Mohamad Gebai @ 2017-07-04  9:52 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel


On 06/30/2017 10:55 AM, Igor Fedotov wrote:
>
> But AFAIR the scenario was pretty trivial - build Ceph with ALLOCATOR
> configured to tcmalloc and run an UT with the diff from the ticket.
>

But the unit tests aren't linked against libtcmalloc.so, and therefore
aren't using tcmalloc, right? Do I have to set LD_PRELOAD to force ctest
to use tcmalloc? If I do set LD_PRELOAD, I don't see the memory problem
anymore.
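
One way to check which allocator a running test actually has loaded,
rather than guessing from the link line, might be to look for
libtcmalloc in the process's memory mappings. The sketch below assumes
Linux and the usual /proc/<pid>/maps layout, where the pathname (if
any) is the sixth field. The function names are made up, and a
statically linked tcmalloc would not show up this way.

```python
def tcmalloc_mapped(maps_text):
    """Return True if any mapping in maps-style text names libtcmalloc."""
    for line in maps_text.splitlines():
        fields = line.split()
        # address perms offset dev inode [pathname]
        if len(fields) >= 6 and "libtcmalloc" in fields[5]:
            return True
    return False


def self_uses_tcmalloc():
    """Check the current process; for another process read /proc/<pid>/maps."""
    with open("/proc/self/maps") as f:
        return tcmalloc_mapped(f.read())
```

Running the same check on the unit test process with and without
LD_PRELOAD set should show whether the preload is what brings tcmalloc
in.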

Mohamad

