* mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
@ 2017-03-06 17:35 Igor Fedotov
  2017-03-06 18:44 ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Igor Fedotov @ 2017-03-06 17:35 UTC (permalink / raw)
  To: ceph-devel

Hi Cephers,

I've just created a ticket related to bluestore object content caching 
in particular and buffer::create_page_aligned in general.

But I'd like to additionally share this information here as well since 
the root cause seems to be pretty global.

Ticket URL:

http://tracker.ceph.com/issues/19198

Description:

When caching object content, BlueStore uses twice as much memory as it 
really needs for that amount of data.

The root cause seems to be in the buffer::create_page_aligned 
implementation. It results in the call sequence
new raw_posix_aligned()

   calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);

       calling posix_memalign((void**)(void*)&ptr, align, total);

which in fact performs two allocations:

1) one for the raw_posix_aligned struct
2) one for the data itself (4096 bytes).

It looks like this sequence causes a 2 * 4096 byte allocation instead of 
sizeof(raw_posix_aligned) + alignment + 4096.
An additional complication is that the mempool machinery is unable to 
account for this overhead, so BlueStore cache trimming doesn't work properly.

It's not clear to me why the allocator(s) behave that inefficiently for 
such a pattern, though.

The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and 
tcmalloc builds.


The ticket contains a patch to reproduce the issue; one can see that 
for 16 GB of content, system memory usage tends to be ~32 GB.

The patch first allocates a 4K page 0x400000 times using:

...

+  size_t alloc_count = 0x400000; // allocate 16 Gb total
+  allocs.resize(alloc_count);
+  for( auto i = 0u; i < alloc_count; ++i) {
+    bufferptr p = buffer::create_page_aligned(bsize);
+    bufferlist* bl = new bufferlist;
+    bl->append(p);
+    *(bl->c_str()) = 0; // touch the page to increment system mem use

...

then does the same while reproducing the create_page_aligned() implementation:

+  struct fake_raw_posix_aligned{
+    char stub[8];
+    void* data;
+    fake_raw_posix_aligned() {
+      ::posix_memalign(&data, 0x1000, 0x1000); 
//mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
+      *((char*)data) = 0; // touch the page
+    }
+    ~fake_raw_posix_aligned() {
+      ::free(data);
+    }
+  };
+  vector <fake_raw_posix_aligned*> allocs2;

+  allocs2.resize(alloc_count);
+  for( auto i = 0u; i < alloc_count; ++i) {
+    allocs2[i] = new fake_raw_posix_aligned();
...

The output shows ~32 GB usage in both cases.

Mem before: VmRSS: 45232 kB
Mem after: VmRSS: 33599524 kB
Mem actually used: 33554292 kB
Mem pool reports: 16777216 kB
Mem before2: VmRSS: 2161412 kB
Mem after2: VmRSS: 33632268 kB
Mem actually used: 32226156544 bytes


In general there are two issues here:
1) doubled memory usage;
2) the mempool is unaware of such overhead and miscalculates the actual 
memory usage.

There is probably a way to resolve 2) by forcing the use of 
raw_combined::create() in buffer::create_page_aligned and tuning the 
mempool calculation to take page alignment into account. But I'd like 
to get some comments/thoughts first...


Thanks,
Igor



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
  2017-03-06 17:35 mem use doubles due to buffer::create_page_aligned + bluestore obj content caching Igor Fedotov
@ 2017-03-06 18:44 ` Gregory Farnum
  2017-03-06 21:47   ` Igor Fedotov
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Farnum @ 2017-03-06 18:44 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> Hi Cephers,
>
> I've just created a ticket related to bluestore object content caching in
> particular and buffer::create_page_aligned in general.
>
> But I'd like to additionally share this information here as well since the
> root cause seems to be pretty global.
>
> Ticker URL:
>
> http://tracker.ceph.com/issues/19198
>
> Description:
>
> When caching object content BlueStore uses twice as much memory than it
> really needs for that data amount.
>
> The root cause seems to be in buffer::create_page_aligned implementation.
> Actually it results in
> new raw_posix_aligned()
>
>   calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>
>       calling  posix_memalign((void**)(void*)&ptr, align, total);
>
> sequence that in fact does 2 allocations:
>
> 1) for raw_posix_aligned struct
> 2) for data itself (4096 bytes).
>
> It looks like this sequence causes 2 * 4096 bytes allocation instead of
> sizeof(raw_posix_aligned) + alignment + 4096.
> The additional trick is that mempool stuff is unable to estimate such an
> overhead and hence BlueStore cache cleanup doesn't work properly.
>
> It's not clear for me why allocator(s) behave that inefficiently for such a
> pattern though.
>
> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and
> tcmalloc builds.
>
>
> The ticket contains the patch to reproduce the issue and one can see that
> for 16Gb content system mem usage tend to be ~32Gb.
>
> Patch firstly allocates 4K pages 0x400000 times using:
>
> ...
>
> +  size_t alloc_count = 0x400000; // allocate 16 Gb total
> +  allocs.resize(alloc_count);
> +  for( auto i = 0u; i < alloc_count; ++i) {
> +    bufferptr p = buffer::create_page_aligned(bsize);
> +    bufferlist* bl = new bufferlist;
> +    bl->append(p);
> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>
> ...
>
> then do the same reproducing  create_page_aligned() implementation:
>
> +  struct fake_raw_posix_aligned{
> +    char stub[8];
> +    void* data;
> +    fake_raw_posix_aligned() {
> +      ::posix_memalign(&data, 0x1000, 0x1000);
> //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
> +      *((char*)data) = 0; // touch the page
> +    }
> +    ~fake_raw_posix_aligned() {
> +      ::free(data);
> +    }
> +  };
> +  vector <fake_raw_posix_aligned*> allocs2;
>
> +  allocs2.resize(alloc_count);
> +  for( auto i = 0u; i < alloc_count; ++i) {
> +    allocs2[i] = new fake_raw_posix_aligned();
> ...
>
> Output shows 32Gb usage in both cases.
>
> Mem before: VmRSS: 45232 kB
> Mem after: VmRSS: 33599524 kB
> Mem actually used: 33554292 kB
> Mem pool reports: 16777216 kB
> Mem before2: VmRSS: 2161412 kB
> Mem after2: VmRSS: 33632268 kB
> Mem actually used: 32226156544 bytes
>
>
> In general there are two issues here:
> 1) Doubled memory usage
> 2) mempool is unaware of such an overhead and miscalculates the actual mem
> usage.
>
> There is probably a way to resolve 2) by forcing raw_combined::create() use
> in buffer::create_page_aligned and tuning mempool calculation to take page
> alignment into account. But I'd like to get some comments/thoughts first....

Is this memory being allocated and then freed, so it's "just" imposing
extra work on malloc? Or are we leaking the old unaligned page as
well?

I think we have (prior to BlueStore) only used these functions when
sending data over the wire or speaking to certain kinds of disks
(though I could be totally misremembering), at which point it's going
to be freed really quickly. That might explain why it's not come up
before; I hope we can just massage the implementation or interfaces
rather than this bubbling up way beyond the bufferlist internals...
-Greg


* Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
  2017-03-06 18:44 ` Gregory Farnum
@ 2017-03-06 21:47   ` Igor Fedotov
  2017-03-10 21:24     ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Igor Fedotov @ 2017-03-06 21:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


On 3/6/2017 9:44 PM, Gregory Farnum wrote:
> On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> Hi Cephers,
>>
>> I've just created a ticket related to bluestore object content caching in
>> particular and buffer::create_page_aligned in general.
>>
>> But I'd like to additionally share this information here as well since the
>> root cause seems to be pretty global.
>>
>> Ticker URL:
>>
>> http://tracker.ceph.com/issues/19198
>>
>> Description:
>>
>> When caching object content BlueStore uses twice as much memory than it
>> really needs for that data amount.
>>
>> The root cause seems to be in buffer::create_page_aligned implementation.
>> Actually it results in
>> new raw_posix_aligned()
>>
>>    calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>>
>>        calling  posix_memalign((void**)(void*)&ptr, align, total);
>>
>> sequence that in fact does 2 allocations:
>>
>> 1) for raw_posix_aligned struct
>> 2) for data itself (4096 bytes).
>>
>> It looks like this sequence causes 2 * 4096 bytes allocation instead of
>> sizeof(raw_posix_aligned) + alignment + 4096.
>> The additional trick is that mempool stuff is unable to estimate such an
>> overhead and hence BlueStore cache cleanup doesn't work properly.
>>
>> It's not clear for me why allocator(s) behave that inefficiently for such a
>> pattern though.
>>
>> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and
>> tcmalloc builds.
>>
>>
>> The ticket contains the patch to reproduce the issue and one can see that
>> for 16Gb content system mem usage tend to be ~32Gb.
>>
>> Patch firstly allocates 4K pages 0x400000 times using:
>>
>> ...
>>
>> +  size_t alloc_count = 0x400000; // allocate 16 Gb total
>> +  allocs.resize(alloc_count);
>> +  for( auto i = 0u; i < alloc_count; ++i) {
>> +    bufferptr p = buffer::create_page_aligned(bsize);
>> +    bufferlist* bl = new bufferlist;
>> +    bl->append(p);
>> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>>
>> ...
>>
>> then do the same reproducing  create_page_aligned() implementation:
>>
>> +  struct fake_raw_posix_aligned{
>> +    char stub[8];
>> +    void* data;
>> +    fake_raw_posix_aligned() {
>> +      ::posix_memalign(&data, 0x1000, 0x1000);
>> //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
>> +      *((char*)data) = 0; // touch the page
>> +    }
>> +    ~fake_raw_posix_aligned() {
>> +      ::free(data);
>> +    }
>> +  };
>> +  vector <fake_raw_posix_aligned*> allocs2;
>>
>> +  allocs2.resize(alloc_count);
>> +  for( auto i = 0u; i < alloc_count; ++i) {
>> +    allocs2[i] = new fake_raw_posix_aligned();
>> ...
>>
>> Output shows 32Gb usage in both cases.
>>
>> Mem before: VmRSS: 45232 kB
>> Mem after: VmRSS: 33599524 kB
>> Mem actually used: 33554292 kB
>> Mem pool reports: 16777216 kB
>> Mem before2: VmRSS: 2161412 kB
>> Mem after2: VmRSS: 33632268 kB
>> Mem actually used: 32226156544 bytes
>>
>>
>> In general there are two issues here:
>> 1) Doubled memory usage
>> 2) mempool is unaware of such an overhead and miscalculates the actual mem
>> usage.
>>
>> There is probably a way to resolve 2) by forcing raw_combined::create() use
>> in buffer::create_page_aligned and tuning mempool calculation to take page
>> alignment into account. But I'd like to get some comments/thoughts first....
> Is this memory being allocated and then freed, so it's "just" imposing
> extra work on malloc? Or are we leaking the old unaligned page as
> well?
I don't see any issues after the free call. My concern is mostly the 
unexpectedly high memory usage while the data block is allocated,
and the related mempool miscalculation.
Surely this is more critical for long-lived allocations, e.g. data 
blocks in the BlueStore cache.
>
> I think we have (prior to BlueStore) only used these functions when
> sending data over the wire or speaking to certain kinds of disks
> (though I could be totally misremembering), at which point it's going
> to be freed really quickly. That might explain why it's not come up
> before; I hope we can just massage the implementation or interfaces
> rather than this bubbling up way beyond the bufferlist internals...
Yeah, that explains the case a bit.
> -Greg



* Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
  2017-03-06 21:47   ` Igor Fedotov
@ 2017-03-10 21:24     ` Gregory Farnum
  2017-03-13 14:06       ` Igor Fedotov
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Farnum @ 2017-03-10 21:24 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>
> On 3/6/2017 9:44 PM, Gregory Farnum wrote:
>>
>> On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@mirantis.com>
>> wrote:
>>>
>>> Hi Cephers,
>>>
>>> I've just created a ticket related to bluestore object content caching in
>>> particular and buffer::create_page_aligned in general.
>>>
>>> But I'd like to additionally share this information here as well since
>>> the
>>> root cause seems to be pretty global.
>>>
>>> Ticker URL:
>>>
>>> http://tracker.ceph.com/issues/19198
>>>
>>> Description:
>>>
>>> When caching object content BlueStore uses twice as much memory than it
>>> really needs for that data amount.
>>>
>>> The root cause seems to be in buffer::create_page_aligned implementation.
>>> Actually it results in
>>> new raw_posix_aligned()
>>>
>>>    calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>>>
>>>        calling  posix_memalign((void**)(void*)&ptr, align, total);
>>>
>>> sequence that in fact does 2 allocations:
>>>
>>> 1) for raw_posix_aligned struct
>>> 2) for data itself (4096 bytes).
>>>
>>> It looks like this sequence causes 2 * 4096 bytes allocation instead of
>>> sizeof(raw_posix_aligned) + alignment + 4096.
>>> The additional trick is that mempool stuff is unable to estimate such an
>>> overhead and hence BlueStore cache cleanup doesn't work properly.
>>>
>>> It's not clear for me why allocator(s) behave that inefficiently for such
>>> a
>>> pattern though.
>>>
>>> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and
>>> tcmalloc builds.
>>>
>>>
>>> The ticket contains the patch to reproduce the issue and one can see that
>>> for 16Gb content system mem usage tend to be ~32Gb.
>>>
>>> Patch firstly allocates 4K pages 0x400000 times using:
>>>
>>> ...
>>>
>>> +  size_t alloc_count = 0x400000; // allocate 16 Gb total
>>> +  allocs.resize(alloc_count);
>>> +  for( auto i = 0u; i < alloc_count; ++i) {
>>> +    bufferptr p = buffer::create_page_aligned(bsize);
>>> +    bufferlist* bl = new bufferlist;
>>> +    bl->append(p);
>>> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>>>
>>> ...
>>>
>>> then do the same reproducing  create_page_aligned() implementation:
>>>
>>> +  struct fake_raw_posix_aligned{
>>> +    char stub[8];
>>> +    void* data;
>>> +    fake_raw_posix_aligned() {
>>> +      ::posix_memalign(&data, 0x1000, 0x1000);
>>> //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
>>> +      *((char*)data) = 0; // touch the page
>>> +    }
>>> +    ~fake_raw_posix_aligned() {
>>> +      ::free(data);
>>> +    }
>>> +  };
>>> +  vector <fake_raw_posix_aligned*> allocs2;
>>>
>>> +  allocs2.resize(alloc_count);
>>> +  for( auto i = 0u; i < alloc_count; ++i) {
>>> +    allocs2[i] = new fake_raw_posix_aligned();
>>> ...
>>>
>>> Output shows 32Gb usage in both cases.
>>>
>>> Mem before: VmRSS: 45232 kB
>>> Mem after: VmRSS: 33599524 kB
>>> Mem actually used: 33554292 kB
>>> Mem pool reports: 16777216 kB
>>> Mem before2: VmRSS: 2161412 kB
>>> Mem after2: VmRSS: 33632268 kB
>>> Mem actually used: 32226156544 bytes
>>>
>>>
>>> In general there are two issues here:
>>> 1) Doubled memory usage
>>> 2) mempool is unaware of such an overhead and miscalculates the actual
>>> mem
>>> usage.
>>>
>>> There is probably a way to resolve 2) by forcing raw_combined::create()
>>> use
>>> in buffer::create_page_aligned and tuning mempool calculation to take
>>> page
>>> alignment into account. But I'd like to get some comments/thoughts
>>> first....
>>
>> Is this memory being allocated and then freed, so it's "just" imposing
>> extra work on malloc? Or are we leaking the old unaligned page as
>> well?
>
> I don't see any issues after free call. I'm mostly about unexpectedly high
> memory usage while data block is allocated.
> And mempool miscalculation related to that.
> Surely this is more critical for long-living allocations, e.g. data blocks
> in BlueStore cache.

Yeah, that's what I meant by "leak", which I realize isn't quite the
typical usage.

Do you have any proposed patches or fixes to deal with it? :)


* Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
  2017-03-10 21:24     ` Gregory Farnum
@ 2017-03-13 14:06       ` Igor Fedotov
  2017-03-13 14:17         ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Igor Fedotov @ 2017-03-13 14:06 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 11.03.2017 0:24, Gregory Farnum wrote:
> On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> On 3/6/2017 9:44 PM, Gregory Farnum wrote:
>>> On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@mirantis.com>
>>> wrote:
>>>> Hi Cephers,
>>>>
>>>> I've just created a ticket related to bluestore object content caching in
>>>> particular and buffer::create_page_aligned in general.
>>>>
>>>> But I'd like to additionally share this information here as well since
>>>> the
>>>> root cause seems to be pretty global.
>>>>
>>>> Ticker URL:
>>>>
>>>> http://tracker.ceph.com/issues/19198
>>>>
>>>> Description:
>>>>
>>>> When caching object content BlueStore uses twice as much memory than it
>>>> really needs for that data amount.
>>>>
>>>> The root cause seems to be in buffer::create_page_aligned implementation.
>>>> Actually it results in
>>>> new raw_posix_aligned()
>>>>
>>>>     calling mempool::buffer_data::alloc_char.allocate_aligned(len, align);
>>>>
>>>>         calling  posix_memalign((void**)(void*)&ptr, align, total);
>>>>
>>>> sequence that in fact does 2 allocations:
>>>>
>>>> 1) for raw_posix_aligned struct
>>>> 2) for data itself (4096 bytes).
>>>>
>>>> It looks like this sequence causes 2 * 4096 bytes allocation instead of
>>>> sizeof(raw_posix_aligned) + alignment + 4096.
>>>> The additional trick is that mempool stuff is unable to estimate such an
>>>> overhead and hence BlueStore cache cleanup doesn't work properly.
>>>>
>>>> It's not clear for me why allocator(s) behave that inefficiently for such
>>>> a
>>>> pattern though.
>>>>
>>>> The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc and
>>>> tcmalloc builds.
>>>>
>>>>
>>>> The ticket contains the patch to reproduce the issue and one can see that
>>>> for 16Gb content system mem usage tend to be ~32Gb.
>>>>
>>>> Patch firstly allocates 4K pages 0x400000 times using:
>>>>
>>>> ...
>>>>
>>>> +  size_t alloc_count = 0x400000; // allocate 16 Gb total
>>>> +  allocs.resize(alloc_count);
>>>> +  for( auto i = 0u; i < alloc_count; ++i) {
>>>> +    bufferptr p = buffer::create_page_aligned(bsize);
>>>> +    bufferlist* bl = new bufferlist;
>>>> +    bl->append(p);
>>>> +    *(bl->c_str()) = 0; // touch the page to increment system mem use
>>>>
>>>> ...
>>>>
>>>> then do the same reproducing  create_page_aligned() implementation:
>>>>
>>>> +  struct fake_raw_posix_aligned{
>>>> +    char stub[8];
>>>> +    void* data;
>>>> +    fake_raw_posix_aligned() {
>>>> +      ::posix_memalign(&data, 0x1000, 0x1000);
>>>> //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
>>>> +      *((char*)data) = 0; // touch the page
>>>> +    }
>>>> +    ~fake_raw_posix_aligned() {
>>>> +      ::free(data);
>>>> +    }
>>>> +  };
>>>> +  vector <fake_raw_posix_aligned*> allocs2;
>>>>
>>>> +  allocs2.resize(alloc_count);
>>>> +  for( auto i = 0u; i < alloc_count; ++i) {
>>>> +    allocs2[i] = new fake_raw_posix_aligned();
>>>> ...
>>>>
>>>> Output shows 32Gb usage in both cases.
>>>>
>>>> Mem before: VmRSS: 45232 kB
>>>> Mem after: VmRSS: 33599524 kB
>>>> Mem actually used: 33554292 kB
>>>> Mem pool reports: 16777216 kB
>>>> Mem before2: VmRSS: 2161412 kB
>>>> Mem after2: VmRSS: 33632268 kB
>>>> Mem actually used: 32226156544 bytes
>>>>
>>>>
>>>> In general there are two issues here:
>>>> 1) Doubled memory usage
>>>> 2) mempool is unaware of such an overhead and miscalculates the actual
>>>> mem
>>>> usage.
>>>>
>>>> There is probably a way to resolve 2) by forcing raw_combined::create()
>>>> use
>>>> in buffer::create_page_aligned and tuning mempool calculation to take
>>>> page
>>>> alignment into account. But I'd like to get some comments/thoughts
>>>> first....
>>> Is this memory being allocated and then freed, so it's "just" imposing
>>> extra work on malloc? Or are we leaking the old unaligned page as
>>> well?
>> I don't see any issues after free call. I'm mostly about unexpectedly high
>> memory usage while data block is allocated.
>> And mempool miscalculation related to that.
>> Surely this is more critical for long-living allocations, e.g. data blocks
>> in BlueStore cache.
> Yeah, that's what I meant by "leak", which I realize isn't quite the
> typical usage.
>
> Do you have any proposed patches or fixes to deal with it? :)
Just some thoughts; neither of them seems ideal though...
1) Get rid of the BlueStore *content* cache. IMO object metadata caching 
is much more important, and hence it's better to use memory for that. As 
a result there would be no long-lived aligned page allocations...
2) Do not use raw_posix_aligned in buffer::create_aligned and roll back 
to raw_combined::create(). For the latter, the mempool's mechanics can 
be fixed to measure alignment overhead properly, and hence the BlueStore 
cache will handle memory limits properly. Memory use for page-aligned 
buffers is still inefficient, though...

Thanks,
Igor


* Re: mem use doubles due to buffer::create_page_aligned + bluestore obj content caching
  2017-03-13 14:06       ` Igor Fedotov
@ 2017-03-13 14:17         ` Sage Weil
  0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2017-03-13 14:17 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Gregory Farnum, ceph-devel

On Mon, 13 Mar 2017, Igor Fedotov wrote:
> On 11.03.2017 0:24, Gregory Farnum wrote:
> > On Mon, Mar 6, 2017 at 1:47 PM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> > > On 3/6/2017 9:44 PM, Gregory Farnum wrote:
> > > > On Mon, Mar 6, 2017 at 9:35 AM, Igor Fedotov <ifedotov@mirantis.com>
> > > > wrote:
> > > > > Hi Cephers,
> > > > > 
> > > > > I've just created a ticket related to bluestore object content caching
> > > > > in
> > > > > particular and buffer::create_page_aligned in general.
> > > > > 
> > > > > But I'd like to additionally share this information here as well since
> > > > > the
> > > > > root cause seems to be pretty global.
> > > > > 
> > > > > Ticker URL:
> > > > > 
> > > > > http://tracker.ceph.com/issues/19198
> > > > > 
> > > > > Description:
> > > > > 
> > > > > When caching object content BlueStore uses twice as much memory than
> > > > > it
> > > > > really needs for that data amount.
> > > > > 
> > > > > The root cause seems to be in buffer::create_page_aligned
> > > > > implementation.
> > > > > Actually it results in
> > > > > new raw_posix_aligned()
> > > > > 
> > > > >     calling mempool::buffer_data::alloc_char.allocate_aligned(len,
> > > > > align);
> > > > > 
> > > > >         calling  posix_memalign((void**)(void*)&ptr, align, total);
> > > > > 
> > > > > sequence that in fact does 2 allocations:
> > > > > 
> > > > > 1) for raw_posix_aligned struct
> > > > > 2) for data itself (4096 bytes).
> > > > > 
> > > > > It looks like this sequence causes 2 * 4096 bytes allocation instead
> > > > > of
> > > > > sizeof(raw_posix_aligned) + alignment + 4096.
> > > > > The additional trick is that mempool stuff is unable to estimate such
> > > > > an
> > > > > overhead and hence BlueStore cache cleanup doesn't work properly.
> > > > > 
> > > > > It's not clear for me why allocator(s) behave that inefficiently for
> > > > > such
> > > > > a
> > > > > pattern though.
> > > > > 
> > > > > The issue is reproducible under Ubuntu 16.04.1 LTS for both jemalloc
> > > > > and
> > > > > tcmalloc builds.
> > > > > 
> > > > > 
> > > > > The ticket contains the patch to reproduce the issue and one can see
> > > > > that
> > > > > for 16Gb content system mem usage tend to be ~32Gb.
> > > > > 
> > > > > Patch firstly allocates 4K pages 0x400000 times using:
> > > > > 
> > > > > ...
> > > > > 
> > > > > +  size_t alloc_count = 0x400000; // allocate 16 Gb total
> > > > > +  allocs.resize(alloc_count);
> > > > > +  for( auto i = 0u; i < alloc_count; ++i) {
> > > > > +    bufferptr p = buffer::create_page_aligned(bsize);
> > > > > +    bufferlist* bl = new bufferlist;
> > > > > +    bl->append(p);
> > > > > +    *(bl->c_str()) = 0; // touch the page to increment system mem use
> > > > > 
> > > > > ...
> > > > > 
> > > > > then do the same reproducing  create_page_aligned() implementation:
> > > > > 
> > > > > +  struct fake_raw_posix_aligned{
> > > > > +    char stub[8];
> > > > > +    void* data;
> > > > > +    fake_raw_posix_aligned() {
> > > > > +      ::posix_memalign(&data, 0x1000, 0x1000);
> > > > > //mempool::buffer_data::alloc_char.allocate_aligned(0x1000, 0x1000);
> > > > > +      *((char*)data) = 0; // touch the page
> > > > > +    }
> > > > > +    ~fake_raw_posix_aligned() {
> > > > > +      ::free(data);
> > > > > +    }
> > > > > +  };
> > > > > +  vector <fake_raw_posix_aligned*> allocs2;
> > > > > 
> > > > > +  allocs2.resize(alloc_count);
> > > > > +  for( auto i = 0u; i < alloc_count; ++i) {
> > > > > +    allocs2[i] = new fake_raw_posix_aligned();
> > > > > ...
> > > > > 
> > > > > Output shows 32Gb usage in both cases.

This is really disconcerting.  If you take out the memalign in the 
fake_raw_posix_aligned ctor, does it use 16 GB?  Or is it really just 
that the order of the allocations (new, then posix_memalign, then 
new, ...) makes the allocator consume a full page for each 
fake_raw_posix_aligned?  And/or, can you confirm that the 
fake_raw_posix_aligned pointers are on page boundaries?

What if all the fake_raw_posix_aligned structs are allocated first, and 
*then* the data pages?

> > Do you have any proposed patches or fixes to deal with it? :)
> Just some thoughts, none of them seems ideal though...
> 1) Get rid of bluestore *content* cache. IMO object metadata caching is much
> more important and hence it's better to use memory for that. As a result no
> long-living aligned page allocations...
> 2) Do not use raw_posix_aligned in buffer::create_aligned and rollback to
> raw_combined::create(). For the latter mempool's mechanics can be fixed to
> measure alignment overhead properly and hence BlueStore cache will handle mem
> limits properly. Mem use for page aligned buffers is still ineffective
> though...

The simple heuristic that only uses raw_combined for smaller buffers is 
based on the assumption that the allocator isn't stupid and can consume 
less than a full page of overhead for the buffer::raw bookkeeping.  If 
that's truly not the case, then I think there's no reason not to use 
raw_combined unconditionally.  That doesn't seem right, though!

sage


end of thread, other threads:[~2017-03-13 14:17 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-06 17:35 mem use doubles due to buffer::create_page_aligned + bluestore obj content caching Igor Fedotov
2017-03-06 18:44 ` Gregory Farnum
2017-03-06 21:47   ` Igor Fedotov
2017-03-10 21:24     ` Gregory Farnum
2017-03-13 14:06       ` Igor Fedotov
2017-03-13 14:17         ` Sage Weil
