All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Shaohua Li <shaohua.li@intel.com>,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH] thp: tail page refcounting fix
Date: Tue, 23 Aug 2011 16:55:24 +0200	[thread overview]
Message-ID: <20110823145524.GA23870@redhat.com> (raw)
In-Reply-To: <20110822213347.GF2507@redhat.com>

Hello everyone,

[ Chris CC'ed ]

Chris, could you run this patch on one of yours hugetlbfs benchmarks to
verify there is no scalability issue in the spinlock of
__get_page_tail in your environment?

With this patch O_DIRECT on hugetlbfs will take the compound_lock even
if it doesn't need to for hugetlbfs. We could have made this special
for THP only, and in fact we could avoid the "tail_page" reference
counting completely for tail pages with hugetlbfs, but we didn't do
that so far because it's safer to have a single code path for all
compound pages and not make hugetlbfs special case there and have all
compound pages behave the same regardless if it's THP or hugetlbfs or
some driver allocating/freeing it, it keeps things much simpler and
the overhead AFIK is unmeasurable (I ask to benchmark just to be 100%
sure). The refcounting path is tricky and the more testing the better
so if it's the same for all compound pages it's best in my view (at
least for the mid term).

Also note, even if we were to ultraoptimize hugetlbfs the SMP locking
scalability would remain identical because the atomic_inc would be
still needed on the "head" page which is the only "shared" item. The
"superflous" for hugetlbfs is _only_ the write on the
"head_page->flags" locked op, and the tail_page->_mapcount atomic
inc/dec. We can't possibly eliminate the head_page->_count atomic
inc/dec, even if we were to ultraoptimize for hugetlbfs and that's the
only possibly troublesome bit in terms of SMP scalability (the
tail_page is so finegrined it can't be a scalability issue). This is
why I exclude these changes can be measurable and they should work as
great as before.

Also it'd be nice if somebody would look in direct-io to stop doing
that get_page there and relay only on the reference taken by
get_user_pages (KVM and all other get_user_page users never take
additional references on pages returned by get_user_pages and relays
exclusively on the refcount taken by get_user_pages). get_user_pages
is capable of taking the reference on the tail pages without having to
take the compound lock perfectly safely through get_page_foll() (which
is taken while the page_table_lock is still held, so it automatically
serializes with split_huge_page through the page_table_lock). Only
additional get_page(page[i]) done on the tail pages taken with
get_user_pages requires the compound_lock to be safe vs
split_huge_page (and normally there is no way to ever call get_page on
any tail page, only after get_user_pages you could run into that, so
that can be usually easily avoided, and so the compound_lock can be
better optimized away by not calling get_page on tail pages in the
first place which also avoids all other locked ops of get_page, not
just the compound_lock!).

Also note, the compound_lock was already taken for the put_page of the
tail pages. We only could avoid it in get_user_pages. So this isn't
making things much difference. I still kept my priority to avoid any
comound_lock for head pages. head compound pages, or regular pages
(not compound) just have 1 atomic ops like always. That is way more
important for performance I think because the head page refcounting
and regular page refcounting is in the real CPU bound fast paths (not
more I/O bound dealing with pci devices like O_DIRECT). And I still
also avoid compound_lock for the secondary MMU page fault in
get_user_pages (can't avoid it in put_page run after the spte is
established, just no way around it but it's always been like that).

So I loaded this patch on all my systems so far so good, torture
testing is also going without problems.

a git tree including patch is here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary

This is the actual patch you can apply to stock 3.0.0 (3.1 won't boot
here for me because I never use initrd on my development systems...).

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=41dc8190934cea22b8c8b3f89e82052610664fbb

WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Shaohua Li <shaohua.li@intel.com>,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH] thp: tail page refcounting fix
Date: Tue, 23 Aug 2011 16:55:24 +0200	[thread overview]
Message-ID: <20110823145524.GA23870@redhat.com> (raw)
In-Reply-To: <20110822213347.GF2507@redhat.com>

Hello everyone,

[ Chris CC'ed ]

Chris, could you run this patch on one of yours hugetlbfs benchmarks to
verify there is no scalability issue in the spinlock of
__get_page_tail in your environment?

With this patch O_DIRECT on hugetlbfs will take the compound_lock even
if it doesn't need to for hugetlbfs. We could have made this special
for THP only, and in fact we could avoid the "tail_page" reference
counting completely for tail pages with hugetlbfs, but we didn't do
that so far because it's safer to have a single code path for all
compound pages and not make hugetlbfs special case there and have all
compound pages behave the same regardless if it's THP or hugetlbfs or
some driver allocating/freeing it, it keeps things much simpler and
the overhead AFIK is unmeasurable (I ask to benchmark just to be 100%
sure). The refcounting path is tricky and the more testing the better
so if it's the same for all compound pages it's best in my view (at
least for the mid term).

Also note, even if we were to ultraoptimize hugetlbfs the SMP locking
scalability would remain identical because the atomic_inc would be
still needed on the "head" page which is the only "shared" item. The
"superflous" for hugetlbfs is _only_ the write on the
"head_page->flags" locked op, and the tail_page->_mapcount atomic
inc/dec. We can't possibly eliminate the head_page->_count atomic
inc/dec, even if we were to ultraoptimize for hugetlbfs and that's the
only possibly troublesome bit in terms of SMP scalability (the
tail_page is so finegrined it can't be a scalability issue). This is
why I exclude these changes can be measurable and they should work as
great as before.

Also it'd be nice if somebody would look in direct-io to stop doing
that get_page there and relay only on the reference taken by
get_user_pages (KVM and all other get_user_page users never take
additional references on pages returned by get_user_pages and relays
exclusively on the refcount taken by get_user_pages). get_user_pages
is capable of taking the reference on the tail pages without having to
take the compound lock perfectly safely through get_page_foll() (which
is taken while the page_table_lock is still held, so it automatically
serializes with split_huge_page through the page_table_lock). Only
additional get_page(page[i]) done on the tail pages taken with
get_user_pages requires the compound_lock to be safe vs
split_huge_page (and normally there is no way to ever call get_page on
any tail page, only after get_user_pages you could run into that, so
that can be usually easily avoided, and so the compound_lock can be
better optimized away by not calling get_page on tail pages in the
first place which also avoids all other locked ops of get_page, not
just the compound_lock!).

Also note, the compound_lock was already taken for the put_page of the
tail pages. We only could avoid it in get_user_pages. So this isn't
making things much difference. I still kept my priority to avoid any
comound_lock for head pages. head compound pages, or regular pages
(not compound) just have 1 atomic ops like always. That is way more
important for performance I think because the head page refcounting
and regular page refcounting is in the real CPU bound fast paths (not
more I/O bound dealing with pci devices like O_DIRECT). And I still
also avoid compound_lock for the secondary MMU page fault in
get_user_pages (can't avoid it in put_page run after the spte is
established, just no way around it but it's always been like that).

So I loaded this patch on all my systems so far so good, torture
testing is also going without problems.

a git tree including patch is here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary

This is the actual patch you can apply to stock 3.0.0 (3.1 won't boot
here for me because I never use initrd on my development systems...).

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=41dc8190934cea22b8c8b3f89e82052610664fbb

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-08-23 14:56 UTC|newest]

Thread overview: 109+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-19  7:48 [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-19  7:48 ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 1/9] mm: rcu read lock for getting reference on pages in migration_entry_wait() Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 2/9] mm: avoid calling get_page_unless_zero() when charging cgroups Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 3/9] mm: rcu read lock when getting from tail to head page Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 4/9] mm: use get_page in deactivate_page() Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 5/9] kvm: use get_page instead of get_page_unless_zero Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 6/9] mm: assert that get_page_unless_zero() callers hold the rcu lock Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19 23:28   ` Andi Kleen
2011-08-19 23:28     ` Andi Kleen
2011-08-19  7:48 ` [PATCH 7/9] rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 8/9] mm: add API for setting a grace period cookie on compound pages Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 9/9] mm: make sure tail page counts are stable before splitting THP pages Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:53 ` [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-19  7:53   ` Michel Lespinasse
2011-08-22 21:33 ` [PATCH] thp: tail page refcounting fix Andrea Arcangeli
2011-08-22 21:33   ` Andrea Arcangeli
2011-08-23 14:55   ` Andrea Arcangeli [this message]
2011-08-23 14:55     ` Andrea Arcangeli
2011-08-23 16:45   ` Minchan Kim
2011-08-23 16:45     ` Minchan Kim
2011-08-23 16:54     ` Andrea Arcangeli
2011-08-23 16:54       ` Andrea Arcangeli
2011-08-23 19:52   ` Michel Lespinasse
2011-08-23 19:52     ` Michel Lespinasse
2011-08-24  0:09     ` Andrea Arcangeli
2011-08-24  0:09       ` Andrea Arcangeli
2011-08-24  0:27       ` Andrea Arcangeli
2011-08-24  0:27         ` Andrea Arcangeli
2011-08-24 13:34         ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli
2011-08-24 13:34           ` Andrea Arcangeli
2011-08-26  6:24           ` Michel Lespinasse
2011-08-26  6:24             ` Michel Lespinasse
2011-08-26 16:10             ` Andrea Arcangeli
2011-08-26 16:10               ` Andrea Arcangeli
2011-08-26 18:54               ` [PATCH] thp: tail page refcounting fix #3 Andrea Arcangeli
2011-08-26 18:54                 ` Andrea Arcangeli
2011-08-27  9:41                 ` Michel Lespinasse
2011-08-27  9:41                   ` Michel Lespinasse
2011-08-27 17:34                   ` [PATCH] thp: tail page refcounting fix #4 Andrea Arcangeli
2011-08-27 17:34                     ` Andrea Arcangeli
2011-08-29  4:20                     ` Minchan Kim
2011-08-29  4:20                       ` Minchan Kim
2011-09-01 15:24                       ` [PATCH] thp: tail page refcounting fix #5 Andrea Arcangeli
2011-09-01 15:24                         ` Andrea Arcangeli
2011-09-01 22:27                         ` Michel Lespinasse
2011-09-01 22:27                           ` Michel Lespinasse
2011-09-01 23:28                         ` Andrew Morton
2011-09-01 23:28                           ` Andrew Morton
2011-09-01 23:45                           ` Andi Kleen
2011-09-01 23:45                             ` Andi Kleen
2011-09-02  0:20                             ` Andrea Arcangeli
2011-09-02  0:20                               ` Andrea Arcangeli
2011-09-02  1:17                               ` Andi Kleen
2011-09-02  1:17                                 ` Andi Kleen
2011-09-02  0:03                         ` Andrew Morton
2011-09-02  0:03                           ` Andrew Morton
2011-09-08 16:51                           ` [PATCH] thp: tail page refcounting fix #6 Andrea Arcangeli
2011-09-08 16:51                             ` Andrea Arcangeli
2011-09-23 15:57                             ` Peter Zijlstra
2011-09-23 15:57                               ` Peter Zijlstra
2011-09-30 13:58                               ` Andrea Arcangeli
2011-09-30 13:58                                 ` Andrea Arcangeli
2011-10-16 20:37                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-16 20:40                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-17 14:41                               ` thp: gup_fast s390/sparc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-17 14:41                                 ` Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 1/3] s390: gup_huge_pmd() support THP tail recounting Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 2/3] sparc: gup_pte_range() support THP based " Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 22:44                                   ` David Miller
2011-10-17 22:44                                     ` David Miller
2011-10-17 14:41                                 ` [PATCH 3/3] thp: share get_huge_page_tail() Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 21:32                               ` fix two more s390/sparc gup_fast bugs Andrea Arcangeli
2011-10-17 21:32                                 ` Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 1/2] s390: gup_huge_pmd() return 0 if pte changes Andrea Arcangeli
2011-10-17 21:32                                   ` Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 2/2] powerpc: " Andrea Arcangeli
2011-10-17 21:32                                   ` Andrea Arcangeli
2011-08-29 22:40                     ` [PATCH] thp: tail page refcounting fix #4 Michel Lespinasse
2011-08-29 22:40                       ` Michel Lespinasse
2011-08-29 23:30                       ` Andrea Arcangeli
2011-08-29 23:30                         ` Andrea Arcangeli
2011-08-26 19:28             ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli
2011-08-26 19:28               ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110823145524.GA23870@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=hughd@google.com \
    --cc=jweiner@redhat.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    --cc=shaohua.li@intel.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.