All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm: hugetlbfs: fix hugetlbfs optimization
@ 2013-11-05 22:10 Andrea Arcangeli
  2013-11-07 17:18 ` Khalid Aziz
  2013-11-12 19:22 ` Pravin Shelar
  0 siblings, 2 replies; 5+ messages in thread
From: Andrea Arcangeli @ 2013-11-05 22:10 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: gregkh, bhutchings, pshelar, cl, hannes, mel, riel, minchan,
	andi, akpm, torvalds, linux-mm

Hi,

this patch is an alternative implementation of the hugetlbfs directio
optimization discussed earlier. We've been looking into this with
Khalid last week and an earlier version of this patch (fully
equivalent as far as CPU overhead is concerned) was benchmarked by
Khalid and it didn't degrade performance compared to the PageHuge
check in current upstream code, so we should be good.

The patch applies cleanly only after reverting
7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review
it in this form as it avoids all the alignment changes. I'll resend to
Andrew against current upstream by squashing it with the revert after
reviews.

I wished to remove the _mapcount tailpage refcounting for slab and
hugetlbfs tails too, but if the last put_page of a slab tail happens
after the slab page isn't a slab page anymore (but still compound as
it wasn't freed yet because of the tail pin), a VM_BUG_ON would
trigger during the last (unpinning) put_page(slab_tail) with the
mapcount underflow:

			VM_BUG_ON(page_mapcount(page) <= 0);

Not even sure if any driver is doing anything like that, but the
current code would allow it, Pravin should know more about when
exactly in which conditions the last put_page is done on slab tail
pages.

It shall be possible to remove the _mapcount refcounting anyway, as it
is only read by split_huge_page and so it doesn't actually matter if
it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one
more VM_BUG_ON(!PageHead()) even in this patch.

I also didn't notice we missed a PageHead check before calling
__put_single_page(page_head), so I corrected that. It sounds very
unlikely that it could have ever triggered but still better to fix it.

I just booted it... not very well tested yet. Help with the testing
appreciated :).

=====

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
@ 2013-11-07 17:18 ` Khalid Aziz
  2013-11-12 19:22 ` Pravin Shelar
  1 sibling, 0 replies; 5+ messages in thread
From: Khalid Aziz @ 2013-11-07 17:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: gregkh, bhutchings, pshelar, cl, hannes, mel, riel, minchan,
	andi, akpm, torvalds, linux-mm

On 11/05/2013 03:10 PM, Andrea Arcangeli wrote:
> Hi,
>
> this patch is an alternative implementation of the hugetlbfs directio
> optimization discussed earlier. We've been looking into this with
> Khalid last week and an earlier version of this patch (fully
> equivalent as far as CPU overhead is concerned) was benchmarked by
> Khalid and it didn't degrade performance compared to the PageHuge
> check in current upstream code, so we should be good.
>
> The patch applies cleanly only after reverting
> 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review
> it in this form as it avoids all the alignment changes. I'll resend to
> Andrew against current upstream by squashing it with the revert after
> reviews.
>
> I wished to remove the _mapcount tailpage refcounting for slab and
> hugetlbfs tails too, but if the last put_page of a slab tail happens
> after the slab page isn't a slab page anymore (but still compound as
> it wasn't freed yet because of the tail pin), a VM_BUG_ON would
> trigger during the last (unpinning) put_page(slab_tail) with the
> mapcount underflow:
>
> 			VM_BUG_ON(page_mapcount(page) <= 0);
>
> Not even sure if any driver is doing anything like that, but the
> current code would allow it, Pravin should know more about when
> exactly in which conditions the last put_page is done on slab tail
> pages.
>
> It shall be possible to remove the _mapcount refcounting anyway, as it
> is only read by split_huge_page and so it doesn't actually matter if
> it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one
> more VM_BUG_ON(!PageHead()) even in this patch.
>
> I also didn't notice we missed a PageHead check before calling
> __put_single_page(page_head), so I corrected that. It sounds very
> unlikely that it could have ever triggered but still better to fix it.
>
> I just booted it... not very well tested yet. Help with the testing
> appreciated :).
>

Hi Andrea,

I ran performance tests on this patch. I am seeing 4.5% degradation with 
1M random reads, 9.2% degradation with 64K random reads and 13.8% 
degradation with 8K using the orion tool. Just to double check, I 
repeated the same tests with the last version of patch we had exchanged 
before some of the final tweaks you made and degradation with that patch 
is 0.7% for 1M, 2.3% with 64K and 3% for 8K. Actual numbers are in the 
table below (all numbers are in MB/s):

              3.12.0      3.12.0+this_patch      3.12.0+prev_patch
              -------     ------------------     -----------------
1M            8467        8087 (-4.5%)           8409 (-0.7%)
64K           4049        3675 (-9.2%)           3957 (-2.3%)
8K            732         631 (-13.8%)           710 (-3.0%)

One of the tweaks is causing problems. I will try to isolate which one.

--
Khalid


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
  2013-11-07 17:18 ` Khalid Aziz
@ 2013-11-12 19:22 ` Pravin Shelar
  2013-11-13 16:10   ` Andrea Arcangeli
  1 sibling, 1 reply; 5+ messages in thread
From: Pravin Shelar @ 2013-11-12 19:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes,
	mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm

On Tue, Nov 5, 2013 at 2:10 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi,
>
> this patch is an alternative implementation of the hugetlbfs directio
> optimization discussed earlier. We've been looking into this with
> Khalid last week and an earlier version of this patch (fully
> equivalent as far as CPU overhead is concerned) was benchmarked by
> Khalid and it didn't degrade performance compared to the PageHuge
> check in current upstream code, so we should be good.
>
> The patch applies cleanly only after reverting
> 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review
> it in this form as it avoids all the alignment changes. I'll resend to
> Andrew against current upstream by squashing it with the revert after
> reviews.
>
> I wished to remove the _mapcount tailpage refcounting for slab and
> hugetlbfs tails too, but if the last put_page of a slab tail happens
> after the slab page isn't a slab page anymore (but still compound as
> it wasn't freed yet because of the tail pin), a VM_BUG_ON would
> trigger during the last (unpinning) put_page(slab_tail) with the
> mapcount underflow:
>
>                         VM_BUG_ON(page_mapcount(page) <= 0);
>
> Not even sure if any driver is doing anything like that, but the
> current code would allow it, Pravin should know more about when
> exactly in which conditions the last put_page is done on slab tail
> pages.
>
Yes, This can happen when slab object is directly passed for IO and it
is done in few filesystems (ocfs, xfs) when I checked last time.

> It shall be possible to remove the _mapcount refcounting anyway, as it
> is only read by split_huge_page and so it doesn't actually matter if
> it underflows, but I prefer to keep the VM_BUG_ON. In fact I added one
> more VM_BUG_ON(!PageHead()) even in this patch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-12 19:22 ` Pravin Shelar
@ 2013-11-13 16:10   ` Andrea Arcangeli
  2013-11-15 17:00     ` Andrea Arcangeli
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Arcangeli @ 2013-11-13 16:10 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes,
	mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm

On Tue, Nov 12, 2013 at 11:22:50AM -0800, Pravin Shelar wrote:
> On Tue, Nov 5, 2013 at 2:10 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > Hi,
> >
> > this patch is an alternative implementation of the hugetlbfs directio
> > optimization discussed earlier. We've been looking into this with
> > Khalid last week and an earlier version of this patch (fully
> > equivalent as far as CPU overhead is concerned) was benchmarked by
> > Khalid and it didn't degrade performance compared to the PageHuge
> > check in current upstream code, so we should be good.
> >
> > The patch applies cleanly only after reverting
> > 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911, it's much easier to review
> > it in this form as it avoids all the alignment changes. I'll resend to
> > Andrew against current upstream by squashing it with the revert after
> > reviews.
> >
> > I wished to remove the _mapcount tailpage refcounting for slab and
> > hugetlbfs tails too, but if the last put_page of a slab tail happens
> > after the slab page isn't a slab page anymore (but still compound as
> > it wasn't freed yet because of the tail pin), a VM_BUG_ON would
> > trigger during the last (unpinning) put_page(slab_tail) with the
> > mapcount underflow:
> >
> >                         VM_BUG_ON(page_mapcount(page) <= 0);
> >
> > Not even sure if any driver is doing anything like that, but the
> > current code would allow it, Pravin should know more about when
> > exactly in which conditions the last put_page is done on slab tail
> > pages.
> >
> Yes, This can happen when slab object is directly passed for IO and it
> is done in few filesystems (ocfs, xfs) when I checked last time.

About the slab case however, it cannot be that the tail pin obtained
with get_page(tail_page), is the last reference on the compound
page when it gets released through put_page.

kfree/kmem_cache_free would allow reuse of the whole compound page as
a different slab object without passing through the buddy allocator,
so totally disregarding any tail page pin. So in short I believe it's
safe to remove the mapcount refcounting from slab tail page pinning
because all tail pins should be released before the
kfree/kmem_cache_free, and in turn before the PG_slab flag has been
cleared (which happens before the slab code releases the last
reference count on the slab page to free it). It is also safe to for
hugetlbfs as the hugetlbfs destructor is wiped only after the last
hugetlbfs reference count has been released and no more put_page can
happen then.

So I think we've room for optimizations to fill the performance gap
compared to the PageHuge check at the top of put_page (which also
skips the mapcount tail page refcounting), but I would keep the
optimization to skip the tail page refcounting incremental, and it
makes sense to apply it to slab too if we do it, so we keep the
hugetlbfs and slab cases identical (which is simpler to maintain).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: hugetlbfs: fix hugetlbfs optimization
  2013-11-13 16:10   ` Andrea Arcangeli
@ 2013-11-15 17:00     ` Andrea Arcangeli
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Arcangeli @ 2013-11-15 17:00 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: Khalid Aziz, gregkh, Ben Hutchings, Christoph Lameter, hannes,
	mel, riel, minchan, andi, Andrew Morton, torvalds, linux-mm

Hi,

so while optimizing away the _mapcount tail page refcounting for slab
and hugetlbfs pages incremental with the fix for the hugetlbfs
optimization I just sent, I also noticed another bug in the current
code (already fixed by the patch). gup_fast is still increasing
_mapcount for hugetlbfs, but it's not decreased in put_page.

It's only noticeable if you do:

echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

On current upstream I get:

BUG: Bad page state in process bash  pfn:59a01
page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
page flags: 0x1c00000000008000(tail)
Modules linked in:
CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000009 ffff880079cb5cc8 ffffffff81640e8b 0000000000000006
 ffffea000139b038 ffff880079cb5ce8 ffffffff8115bb15 00000000000002c1
 ffffea000139b038 ffff880079cb5d48 ffffffff8115bd83 ffff880079cb5de8
Call Trace:
 [<ffffffff81640e8b>] dump_stack+0x55/0x76
 [<ffffffff8115bb15>] bad_page+0xd5/0x130
 [<ffffffff8115bd83>] free_pages_prepare+0x213/0x280
 [<ffffffff8115df16>] __free_pages+0x36/0x80
 [<ffffffff8119b011>] update_and_free_page+0xc1/0xd0
 [<ffffffff8119b512>] free_pool_huge_page+0xc2/0xe0
 [<ffffffff8119b8cc>] set_max_huge_pages.part.58+0x14c/0x220
 [<ffffffff81308a8c>] ? _kstrtoull+0x2c/0x90
 [<ffffffff8119ba70>] nr_hugepages_store_common.isra.60+0xd0/0xf0
 [<ffffffff8119bac3>] nr_hugepages_store+0x13/0x20
 [<ffffffff812f763f>] kobj_attr_store+0xf/0x20
 [<ffffffff812354e9>] sysfs_write_file+0x189/0x1e0
 [<ffffffff811baff5>] vfs_write+0xc5/0x1f0
 [<ffffffff811bb505>] SyS_write+0x55/0xb0
 [<ffffffff81651712>] system_call_fastpath+0x16/0x1b

So good thing I stopped the hugetlbfs optimization from going into
stable.

I'll send a v2 of this work as a patchset of 3 patches where the first
is the same identical patch I already sent but incremental to upstream
and it contains all the fixes needed including for the above
problem.

Patch 1/3 should be applied more urgently as it fixes all those
various bugs. The 2/3 and 3/3 can be deferred.

The patch 3/3 especially should be benchmarked in the usual 8GB/sec
setup before being applied, unless it makes a real difference I
wouldn't apply it because it tends to slowdown the THP case a bit and
it complicates things a bit more.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-11-15 17:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-05 22:10 [PATCH] mm: hugetlbfs: fix hugetlbfs optimization Andrea Arcangeli
2013-11-07 17:18 ` Khalid Aziz
2013-11-12 19:22 ` Pravin Shelar
2013-11-13 16:10   ` Andrea Arcangeli
2013-11-15 17:00     ` Andrea Arcangeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.