linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
@ 2017-04-20 21:46 Dan Williams
  2017-04-21 14:16 ` Kirill A. Shutemov
  2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
  0 siblings, 2 replies; 48+ messages in thread
From: Dan Williams @ 2017-04-20 21:46 UTC (permalink / raw)
  To: Catalin Marinas, aneesh.kumar, steve.capper, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
	Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, dave.hansen,
	Borislav Petkov, Rik van Riel, dann.frazier, Linus Torvalds,
	Michal Hocko
  Cc: linux-tip-commits

On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
<tipbot@zytor.com> wrote:
> Commit-ID:  2947ba054a4dabbd82848728d765346886050029
> Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
>
> x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
>
> This patch provides all required callbacks required by the generic
> get_user_pages_fast() code and switches x86 over - and removes
> the platform specific implementation.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Dann Frazier <dann.frazier@canonical.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Steve Capper <steve.capper@linaro.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
> [ Minor readability edits. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

I'm still trying to spot the bug, but bisect points to this patch as
the point at which my unit tests start failing with the following
signature:

[   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
[   35.425328] percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0
(0) after switching to atomic
[   35.425329] Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip
6table_mangle ip6table_raw ip6table_security iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
iptable_mangle iptable_raw iptable_security ebtable_filter ebtables
ip6table_filter ip6_tables crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel nd_pmem(O) dax_pmem(O) nd_btt(O) dax(O) serio_raw
nfit(O) nd_e820(O) libnvdimm(O) tpm_tis tpm_tis_co
re tpm nfit_test_iomap(O) nfsd nfs_acl
[   35.433683] CPU: 8 PID: 245 Comm: rcuos/29 Tainted: G           O
 4.11.0-rc2+ #55
[   35.435538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.9.3-1.fc25 04/01/2014
[   35.437500] Call Trace:
[   35.438270]  dump_stack+0x86/0xc3
[   35.439156]  __warn+0xcb/0xf0
[   35.439995]  warn_slowpath_fmt+0x5f/0x80
[   35.440962]  ? rcu_nocb_kthread+0x27a/0x500
[   35.441957]  ? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
[   35.443107]  percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
[   35.444251]  ? percpu_ref_exit+0x60/0x60
[   35.445206]  rcu_nocb_kthread+0x327/0x500
[   35.446186]  ? rcu_nocb_kthread+0x27a/0x500
[   35.447188]  kthread+0x10c/0x140
[   35.448058]  ? rcu_eqs_enter+0x50/0x50
[   35.448990]  ? kthread_create_on_node+0x60/0x60
[   35.450038]  ret_from_fork+0x31/0x40
[   35.450976] ---[ end trace eaa40898a09519b5 ]---

This is similar to the backtrace when we were not properly handling
pud faults and was fixed with this commit: 220ced1676c4 "mm: fix
get_user_pages() vs device-dax pud mappings"

I've found some missing _devmap checks in the generic
get_user_pages_fast() path, but this does not fix the regression:

diff --git a/mm/gup.c b/mm/gup.c
index 2559a3987de7..89156cd59cbc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1475,7 +1475,8 @@ static int gup_pmd_range(pud_t pud, unsigned
long addr, unsigned long end,
                if (pmd_none(pmd))
                        return 0;

-               if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
+               if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd)
+                                       || pmd_devmap(pmd))) {
                        /*
                         * NUMA hinting faults need to be handled in the GUP
                         * slowpath for accounting purposes and so that they
@@ -1516,7 +1517,7 @@ static int gup_pud_range(p4d_t p4d, unsigned
long addr, unsigned long end,
                next = pud_addr_end(addr, end);
                if (pud_none(pud))
                        return 0;
-               if (unlikely(pud_huge(pud))) {
+               if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
                        if (!gup_huge_pud(pud, pudp, addr, next, write,
                                          pages, nr))
                                return 0;


...more hunting tomorrow.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
  2017-04-20 21:46 [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation Dan Williams
@ 2017-04-21 14:16 ` Kirill A. Shutemov
  2017-04-21 19:30   ` Dan Williams
  2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
  1 sibling, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-21 14:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Catalin Marinas, aneesh.kumar, steve.capper, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
	Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, dave.hansen,
	Borislav Petkov, Rik van Riel, dann.frazier, Linus Torvalds,
	Michal Hocko, linux-tip-commits

On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
> <tipbot@zytor.com> wrote:
> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
> >
> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
> >
> > This patch provides all required callbacks required by the generic
> > get_user_pages_fast() code and switches x86 over - and removes
> > the platform specific implementation.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Dann Frazier <dann.frazier@canonical.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Steve Capper <steve.capper@linaro.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
> > [ Minor readability edits. ]
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> I'm still trying to spot the bug, but bisect points to this patch as
> the point at which my unit tests start failing with the following
> signature:

I can't find the issue either.

Is it something reproducible without hardware? In KVM?

If yes, could you share the test-case?

> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> [   35.425328] percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0
> (0) after switching to atomic
> [   35.425329] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip
> 6table_mangle ip6table_raw ip6table_security iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
> iptable_mangle iptable_raw iptable_security ebtable_filter ebtables
> ip6table_filter ip6_tables crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel nd_pmem(O) dax_pmem(O) nd_btt(O) dax(O) serio_raw
> nfit(O) nd_e820(O) libnvdimm(O) tpm_tis tpm_tis_co
> re tpm nfit_test_iomap(O) nfsd nfs_acl
> [   35.433683] CPU: 8 PID: 245 Comm: rcuos/29 Tainted: G           O
>  4.11.0-rc2+ #55
> [   35.435538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.3-1.fc25 04/01/2014
> [   35.437500] Call Trace:
> [   35.438270]  dump_stack+0x86/0xc3
> [   35.439156]  __warn+0xcb/0xf0
> [   35.439995]  warn_slowpath_fmt+0x5f/0x80
> [   35.440962]  ? rcu_nocb_kthread+0x27a/0x500
> [   35.441957]  ? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
> [   35.443107]  percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> [   35.444251]  ? percpu_ref_exit+0x60/0x60
> [   35.445206]  rcu_nocb_kthread+0x327/0x500
> [   35.446186]  ? rcu_nocb_kthread+0x27a/0x500
> [   35.447188]  kthread+0x10c/0x140
> [   35.448058]  ? rcu_eqs_enter+0x50/0x50
> [   35.448990]  ? kthread_create_on_node+0x60/0x60
> [   35.450038]  ret_from_fork+0x31/0x40
> [   35.450976] ---[ end trace eaa40898a09519b5 ]---
> 
> This is similar to the backtrace when we were not properly handling
> pud faults and was fixed with this commit: 220ced1676c4 "mm: fix
> get_user_pages() vs device-dax pud mappings"
> 
> I've found some missing _devmap checks in the generic
> get_user_pages_fast() path, but this does not fix the regression:

I don't see these in x86 GUP. Was the bug there too?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
  2017-04-21 14:16 ` Kirill A. Shutemov
@ 2017-04-21 19:30   ` Dan Williams
  2017-04-23  9:52     ` [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" Ingo Molnar
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-21 19:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Catalin Marinas, Aneesh Kumar K.V, steve.capper, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
	Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, Dave Hansen,
	Borislav Petkov, Rik van Riel, dann.frazier, Linus Torvalds,
	Michal Hocko, linux-tip-commits

On Fri, Apr 21, 2017 at 7:16 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
>> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
>> <tipbot@zytor.com> wrote:
>> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
>> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
>> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
>> > Committer:  Ingo Molnar <mingo@kernel.org>
>> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
>> >
>> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
>> >
>> > This patch provides all required callbacks required by the generic
>> > get_user_pages_fast() code and switches x86 over - and removes
>> > the platform specific implementation.
>> >
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
>> > Cc: Borislav Petkov <bp@alien8.de>
>> > Cc: Catalin Marinas <catalin.marinas@arm.com>
>> > Cc: Dann Frazier <dann.frazier@canonical.com>
>> > Cc: Dave Hansen <dave.hansen@intel.com>
>> > Cc: H. Peter Anvin <hpa@zytor.com>
>> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Rik van Riel <riel@redhat.com>
>> > Cc: Steve Capper <steve.capper@linaro.org>
>> > Cc: Thomas Gleixner <tglx@linutronix.de>
>> > Cc: linux-arch@vger.kernel.org
>> > Cc: linux-mm@kvack.org
>> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
>> > [ Minor readability edits. ]
>> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
>>
>> I'm still trying to spot the bug, but bisect points to this patch as
>> the point at which my unit tests start failing with the following
>> signature:
>
> I can't find the issue either.
>
> Is it something reproducible without hardware? In KVM?

You can do it in KVM, just boot with the memmap=ss!nn parameter to
simulate pmem. In this case I'm booting with memmap=4G!8G, you should
also specify "nokaslr".

> If yes, could you share the test-case?

Yes, run:

    ./autogen.sh
    ./configure CFLAGS='-g -O0' --prefix=/usr --sysconfdir=/etc
--libdir=/usr/lib64
    make TESTS=device-dax check

...from a checkout of the ndctl project:

    https://github.com/pmem/ndctl

Let me know if you run into any problems getting the test to build or run.

>
>> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
>> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
>> [   35.425328] percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0
>> (0) after switching to atomic
>> [   35.425329] Modules linked in: ip6t_rpfilter ip6t_REJECT
>> nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc
>> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip
>> 6table_mangle ip6table_raw ip6table_security iptable_nat
>> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
>> iptable_mangle iptable_raw iptable_security ebtable_filter ebtables
>> ip6table_filter ip6_tables crct10dif_pclmul crc32_pclmul crc32c_intel
>> ghash_clmulni_intel nd_pmem(O) dax_pmem(O) nd_btt(O) dax(O) serio_raw
>> nfit(O) nd_e820(O) libnvdimm(O) tpm_tis tpm_tis_co
>> re tpm nfit_test_iomap(O) nfsd nfs_acl
>> [   35.433683] CPU: 8 PID: 245 Comm: rcuos/29 Tainted: G           O
>>  4.11.0-rc2+ #55
>> [   35.435538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS 1.9.3-1.fc25 04/01/2014
>> [   35.437500] Call Trace:
>> [   35.438270]  dump_stack+0x86/0xc3
>> [   35.439156]  __warn+0xcb/0xf0
>> [   35.439995]  warn_slowpath_fmt+0x5f/0x80
>> [   35.440962]  ? rcu_nocb_kthread+0x27a/0x500
>> [   35.441957]  ? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
>> [   35.443107]  percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
>> [   35.444251]  ? percpu_ref_exit+0x60/0x60
>> [   35.445206]  rcu_nocb_kthread+0x327/0x500
>> [   35.446186]  ? rcu_nocb_kthread+0x27a/0x500
>> [   35.447188]  kthread+0x10c/0x140
>> [   35.448058]  ? rcu_eqs_enter+0x50/0x50
>> [   35.448990]  ? kthread_create_on_node+0x60/0x60
>> [   35.450038]  ret_from_fork+0x31/0x40
>> [   35.450976] ---[ end trace eaa40898a09519b5 ]---
>>
>> This is similar to the backtrace when we were not properly handling
>> pud faults and was fixed with this commit: 220ced1676c4 "mm: fix
>> get_user_pages() vs device-dax pud mappings"
>>
>> I've found some missing _devmap checks in the generic
>> get_user_pages_fast() path, but this does not fix the regression:
>
> I don't see these in x86 GUP. Was the bug there too?

No it wasn't, the test runs fine with v4.11-rc7, so perhaps I'm
looking in the wrong place...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
  2017-04-21 19:30   ` Dan Williams
@ 2017-04-23  9:52     ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2017-04-23  9:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kirill A. Shutemov, Catalin Marinas, Aneesh Kumar K.V,
	steve.capper, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, Andrew Morton, Kirill A. Shutemov,
	H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel,
	dann.frazier, Linus Torvalds, Michal Hocko, linux-tip-commits


* Dan Williams <dan.j.williams@intel.com> wrote:

> > I can't find the issue either.
> >
> > Is it something reproducible without hardware? In KVM?
> 
> You can do it in KVM, just boot with the memmap=ss!nn parameter to
> simulate pmem. In this case I'm booting with memmap=4G!8G, you should
> also specify "nokaslr".
> 
> > If yes, could you share the test-case?
> 
> Yes, run:
> 
>     ./autogen.sh
>     ./configure CFLAGS='-g -O0' --prefix=/usr --sysconfdir=/etc
> --libdir=/usr/lib64
>     make TESTS=device-dax check
> 
> ...from a checkout of the ndctl project:
> 
>     https://github.com/pmem/ndctl
> 
> Let me know if you run into any problems getting the test to build or run.
> 
> >
> >> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> >> [   35.425328] percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0
> >> (0) after switching to atomic

Since the bug appears to be pretty severe (GUP race causing possible memory 
corruption that could affect a lot of code), and the merge window is awfully 
close, plus the reproducer appears to be pretty quick, I've queued up the
revert below for the time being, to not block the rest of the pending
tip:x86/mm changes.

I'd have loved to see this conversion in v4.12, but not at any cost.

Thanks,

	Ingo

==================>
>From 6dd29b3df975582ef429b5b93c899e6575785940 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Sun, 23 Apr 2017 11:37:17 +0200
Subject: [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"

This reverts commit 2947ba054a4dabbd82848728d765346886050029.

Dan Williams reported dax-pmem kernel warnings with the following signature:

   WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
   percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic

... and bisected it to this commit, which suggests possible memory corruption
caused by the x86 fast-GUP conversion.

He also pointed out:

 "
  This is similar to the backtrace when we were not properly handling
  pud faults and was fixed with this commit: 220ced1676c4 "mm: fix
  get_user_pages() vs device-dax pud mappings"

  I've found some missing _devmap checks in the generic
  get_user_pages_fast() path, but this does not fix the regression
  [...]
 "

So given that there are known bugs, and a pretty robust looking bisection
points to this commit suggesting that are unknown bugs in the conversion
as well, revert it for the time being - we'll re-try in v4.13.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dann.frazier@canonical.com
Cc: dave.hansen@intel.com
Cc: steve.capper@linaro.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/Kconfig                      |   2 +-
 arch/arm64/Kconfig                    |   2 +-
 arch/powerpc/Kconfig                  |   2 +-
 arch/x86/Kconfig                      |   3 -
 arch/x86/include/asm/mmu_context.h    |  12 +
 arch/x86/include/asm/pgtable-3level.h |  47 ----
 arch/x86/include/asm/pgtable.h        |  53 ----
 arch/x86/include/asm/pgtable_64.h     |  16 +-
 arch/x86/mm/Makefile                  |   2 +-
 arch/x86/mm/gup.c                     | 496 ++++++++++++++++++++++++++++++++++
 mm/Kconfig                            |   2 +-
 mm/gup.c                              |  10 +-
 12 files changed, 519 insertions(+), 128 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 454fadd077ad..0d4e71b42c77 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1666,7 +1666,7 @@ config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
 
-config HAVE_GENERIC_GUP
+config HAVE_GENERIC_RCU_GUP
 	def_bool y
 	depends on ARM_LPAE
 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index af62bf79721a..3741859765cf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -205,7 +205,7 @@ config GENERIC_CALIBRATE_DELAY
 config ZONE_DMA
 	def_bool y
 
-config HAVE_GENERIC_GUP
+config HAVE_GENERIC_RCU_GUP
 	def_bool y
 
 config ARCH_DMA_ADDR_T_64BIT
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3a716b2dcde9..97a8bc8a095c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,7 @@ config PPC
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS
-	select HAVE_GENERIC_GUP
+	select HAVE_GENERIC_RCU_GUP
 	select HAVE_HW_BREAKPOINT		if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
 	select HAVE_IDE
 	select HAVE_IOREMAP_PROT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a641b900fc1f..2bde14451e54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2789,9 +2789,6 @@ config X86_DMA_REMAP
 	bool
 	depends on STA2X11
 
-config HAVE_GENERIC_GUP
-	def_bool y
-
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6e933d2d88d9..68b329d77b3a 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -220,6 +220,18 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 }
 #endif
 
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
 /*
  * We only want to enforce protection keys on the current process
  * because we effectively have no access to PKRU for other
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index c8821bab938f..50d35e3185f5 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -212,51 +212,4 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ { .pte_high = (x).val } })
 
-#define gup_get_pte gup_get_pte
-/*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking
- * any locks.  For this we would like to load the pointers atomically,
- * but that is not possible (without expensive cmpxchg8b) on PAE.  What
- * we do have is the guarantee that a PTE will only either go from not
- * present to present, or present to not present or both -- it will not
- * switch to a completely different present page without a TLB flush in
- * between; something that we are blocking by holding interrupts off.
- *
- * Setting ptes from not present to present goes:
- *
- *   ptep->pte_high = h;
- *   smp_wmb();
- *   ptep->pte_low = l;
- *
- * And present to not present goes:
- *
- *   ptep->pte_low = 0;
- *   smp_wmb();
- *   ptep->pte_high = 0;
- *
- * We must ensure here that the load of pte_low sees 'l' iff pte_high
- * sees 'h'. We load pte_high *after* loading pte_low, which ensures we
- * don't see an older value of pte_high.  *Then* we recheck pte_low,
- * which ensures that we haven't picked up a changed pte high. We might
- * have gotten rubbish values from pte_low and pte_high, but we are
- * guaranteed that pte_low will not have the present bit set *unless*
- * it is 'l'. Because get_user_pages_fast() only operates on present ptes
- * we're safe.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-	pte_t pte;
-
-	do {
-		pte.pte_low = ptep->pte_low;
-		smp_rmb();
-		pte.pte_high = ptep->pte_high;
-		smp_rmb();
-	} while (unlikely(pte.pte_low != ptep->pte_low));
-
-	return pte;
-}
-
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..f5af95a0c6b8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -244,11 +244,6 @@ static inline int pud_devmap(pud_t pud)
 	return 0;
 }
 #endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-	return 0;
-}
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1190,54 +1185,6 @@ static inline u16 pte_flags_pkey(unsigned long pte_flags)
 #endif
 }
 
-static inline bool __pkru_allows_pkey(u16 pkey, bool write)
-{
-	u32 pkru = read_pkru();
-
-	if (!__pkru_allows_read(pkru, pkey))
-		return false;
-	if (write && !__pkru_allows_write(pkru, pkey))
-		return false;
-
-	return true;
-}
-
-/*
- * 'pteval' can come from a PTE, PMD or PUD.  We only check
- * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
- * same value on all 3 types.
- */
-static inline bool __pte_access_permitted(unsigned long pteval, bool write)
-{
-	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
-
-	if (write)
-		need_pte_bits |= _PAGE_RW;
-
-	if ((pteval & need_pte_bits) != need_pte_bits)
-		return 0;
-
-	return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
-}
-
-#define pte_access_permitted pte_access_permitted
-static inline bool pte_access_permitted(pte_t pte, bool write)
-{
-	return __pte_access_permitted(pte_val(pte), write);
-}
-
-#define pmd_access_permitted pmd_access_permitted
-static inline bool pmd_access_permitted(pmd_t pmd, bool write)
-{
-	return __pte_access_permitted(pmd_val(pmd), write);
-}
-
-#define pud_access_permitted pud_access_permitted
-static inline bool pud_access_permitted(pud_t pud, bool write)
-{
-	return __pte_access_permitted(pud_val(pud), write);
-}
-
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..9991224f6238 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -227,20 +227,6 @@ extern void cleanup_highmap(void);
 extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
 extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
 
-#define gup_fast_permitted gup_fast_permitted
-static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
-		int write)
-{
-	unsigned long len, end;
-
-	len = (unsigned long)nr_pages << PAGE_SHIFT;
-	end = start + len;
-	if (end < start)
-		return false;
-	if (end >> __VIRTUAL_MASK_SHIFT)
-		return false;
-	return true;
-}
-
 #endif /* !__ASSEMBLY__ */
+
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 0fbdcb64f9f8..96d2b847e09e 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@
 KCOV_INSTRUMENT_tlb.o	:= n
 
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o
+	    pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
new file mode 100644
index 000000000000..456dfdfd2249
--- /dev/null
+++ b/arch/x86/mm/gup.c
@@ -0,0 +1,496 @@
+/*
+ * Lockless get_user_pages_fast for x86
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/highmem.h>
+#include <linux/swap.h>
+#include <linux/memremap.h>
+
+#include <asm/mmu_context.h>
+#include <asm/pgtable.h>
+
+static inline pte_t gup_get_pte(pte_t *ptep)
+{
+#ifndef CONFIG_X86_PAE
+	return READ_ONCE(*ptep);
+#else
+	/*
+	 * With get_user_pages_fast, we walk down the pagetables without taking
+	 * any locks.  For this we would like to load the pointers atomically,
+	 * but that is not possible (without expensive cmpxchg8b) on PAE.  What
+	 * we do have is the guarantee that a pte will only either go from not
+	 * present to present, or present to not present or both -- it will not
+	 * switch to a completely different present page without a TLB flush in
+	 * between; something that we are blocking by holding interrupts off.
+	 *
+	 * Setting ptes from not present to present goes:
+	 * ptep->pte_high = h;
+	 * smp_wmb();
+	 * ptep->pte_low = l;
+	 *
+	 * And present to not present goes:
+	 * ptep->pte_low = 0;
+	 * smp_wmb();
+	 * ptep->pte_high = 0;
+	 *
+	 * We must ensure here that the load of pte_low sees l iff pte_high
+	 * sees h. We load pte_high *after* loading pte_low, which ensures we
+	 * don't see an older value of pte_high.  *Then* we recheck pte_low,
+	 * which ensures that we haven't picked up a changed pte high. We might
+	 * have got rubbish values from pte_low and pte_high, but we are
+	 * guaranteed that pte_low will not have the present bit set *unless*
+	 * it is 'l'. And get_user_pages_fast only operates on present ptes, so
+	 * we're safe.
+	 *
+	 * gup_get_pte should not be used or copied outside gup.c without being
+	 * very careful -- it does not atomically load the pte or anything that
+	 * is likely to be useful for you.
+	 */
+	pte_t pte;
+
+retry:
+	pte.pte_low = ptep->pte_low;
+	smp_rmb();
+	pte.pte_high = ptep->pte_high;
+	smp_rmb();
+	if (unlikely(pte.pte_low != ptep->pte_low))
+		goto retry;
+
+	return pte;
+#endif
+}
+
+static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
+{
+	while ((*nr) - nr_start) {
+		struct page *page = pages[--(*nr)];
+
+		ClearPageReferenced(page);
+		put_page(page);
+	}
+}
+
+/*
+ * 'pteval' can come from a pte, pmd, pud or p4d.  We only check
+ * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
+ * same value on all 4 types.
+ */
+static inline int pte_allows_gup(unsigned long pteval, int write)
+{
+	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+
+	if (write)
+		need_pte_bits |= _PAGE_RW;
+
+	if ((pteval & need_pte_bits) != need_pte_bits)
+		return 0;
+
+	/* Check memory protection keys permissions. */
+	if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct dev_pagemap *pgmap = NULL;
+	int nr_start = *nr, ret = 0;
+	pte_t *ptep, *ptem;
+
+	/*
+	 * Keep the original mapped PTE value (ptem) around since we
+	 * might increment ptep off the end of the page when finishing
+	 * our loop iteration.
+	 */
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = gup_get_pte(ptep);
+		struct page *page;
+
+		/* Similar to the PMD case, NUMA hinting must take slow path */
+		if (pte_protnone(pte))
+			break;
+
+		if (!pte_allows_gup(pte_val(pte), write))
+			break;
+
+		if (pte_devmap(pte)) {
+			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
+			if (unlikely(!pgmap)) {
+				undo_dev_pagemap(nr, nr_start, pages);
+				break;
+			}
+		} else if (pte_special(pte))
+			break;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		put_dev_pagemap(pgmap);
+		SetPageReferenced(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	if (addr == end)
+		ret = 1;
+	pte_unmap(ptem);
+
+	return ret;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+	VM_BUG_ON_PAGE(page_count(page) == 0, page);
+	page_ref_add(page, nr);
+	SetPageReferenced(page);
+}
+
+static int __gup_device_huge(unsigned long pfn, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
+{
+	int nr_start = *nr;
+	struct dev_pagemap *pgmap = NULL;
+
+	do {
+		struct page *page = pfn_to_page(pfn);
+
+		pgmap = get_dev_pagemap(pfn, pgmap);
+		if (unlikely(!pgmap)) {
+			undo_dev_pagemap(nr, nr_start, pages);
+			return 0;
+		}
+		SetPageReferenced(page);
+		pages[*nr] = page;
+		get_page(page);
+		put_dev_pagemap(pgmap);
+		(*nr)++;
+		pfn++;
+	} while (addr += PAGE_SIZE, addr != end);
+	return 1;
+}
+
+static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
+{
+	unsigned long fault_pfn;
+
+	fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+}
+
+static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
+{
+	unsigned long fault_pfn;
+
+	fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page;
+	int refs;
+
+	if (!pte_allows_gup(pmd_val(pmd), write))
+		return 0;
+
+	VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
+	if (pmd_devmap(pmd))
+		return __gup_device_huge_pmd(pmd, addr, end, pages, nr);
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL);
+
+	refs = 0;
+	head = pmd_page(pmd);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+			/*
+			 * NUMA hinting faults need to be handled in the GUP
+			 * slowpath for accounting purposes and so that they
+			 * can be serialised against THP migration.
+			 */
+			if (pmd_protnone(pmd))
+				return 0;
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page;
+	int refs;
+
+	if (!pte_allows_gup(pud_val(pud), write))
+		return 0;
+
+	VM_BUG_ON(!pfn_valid(pud_pfn(pud)));
+	if (pud_devmap(pud))
+		return __gup_device_huge_pud(pud, addr, end, pages, nr);
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
+
+	refs = 0;
+	head = pud_page(pud);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+			int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&p4d, addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (unlikely(pud_large(pud))) {
+			if (!gup_huge_pud(pud, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+			int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	p4d_t *p4dp;
+
+	p4dp = p4d_offset(&pgd, addr);
+	do {
+		p4d_t p4d = *p4dp;
+
+		next = p4d_addr_end(addr, end);
+		if (p4d_none(p4d))
+			return 0;
+		BUILD_BUG_ON(p4d_large(p4d));
+		if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+			return 0;
+	} while (p4dp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next;
+	unsigned long flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					(void __user *)start, len)))
+		return 0;
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid large irq off latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 *
+	 * It seems like we're in the clear for the moment. Direct-IO is
+	 * the main guy that batches up lots of get_user_pages, and even
+	 * they are limited to 64-at-a-time which is not so many.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			break;
+		if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to
+ * @pages:	array that receives pointers to the pages pinned.
+ * 		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+
+	end = start + len;
+	if (end < start)
+		goto slow_irqon;
+
+#ifdef CONFIG_X86_64
+	if (end >> __VIRTUAL_MASK_SHIFT)
+		goto slow_irqon;
+#endif
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid large irq off latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 *
+	 * It seems like we're in the clear for the moment. Direct-IO is
+	 * the main guy that batches up lots of get_user_pages, and even
+	 * they are limited to 64-at-a-time which is not so many.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_disable();
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+	{
+		int ret;
+
+slow:
+		local_irq_enable();
+slow_irqon:
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		ret = get_user_pages_unlocked(start,
+					      (end - start) >> PAGE_SHIFT,
+					      pages, write ? FOLL_WRITE : 0);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+
+		return ret;
+	}
+}
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..9b8fccb969dc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,7 +137,7 @@ config HAVE_MEMBLOCK_NODE_MAP
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool
 
-config HAVE_GENERIC_GUP
+config HAVE_GENERIC_RCU_GUP
 	bool
 
 config ARCH_DISCARD_MEMBLOCK
diff --git a/mm/gup.c b/mm/gup.c
index 2559a3987de7..527ec2c6cca3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1155,7 +1155,7 @@ struct page *get_dump_page(unsigned long addr)
 #endif /* CONFIG_ELF_CORE */
 
 /*
- * Generic Fast GUP
+ * Generic RCU Fast GUP
  *
  * get_user_pages_fast attempts to pin user pages by walking the page
  * tables directly and avoids taking locks. Thus the walker needs to be
@@ -1176,8 +1176,8 @@ struct page *get_dump_page(unsigned long addr)
  * Before activating this code, please be aware that the following assumptions
  * are currently made:
  *
- *  *) Either HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
- *  free pages containing page tables or TLB flushing requires IPI broadcast.
+ *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
+ *      pages containing page tables.
  *
  *  *) ptes can be read atomically by the architecture.
  *
@@ -1187,7 +1187,7 @@ struct page *get_dump_page(unsigned long addr)
  *
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
-#ifdef CONFIG_HAVE_GENERIC_GUP
+#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
 
 #ifndef gup_get_pte
 /*
@@ -1677,4 +1677,4 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	return ret;
 }
 
-#endif /* CONFIG_HAVE_GENERIC_GUP */
+#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */

^ permalink raw reply	[flat|nested] 48+ messages in thread

* get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-20 21:46 [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation Dan Williams
  2017-04-21 14:16 ` Kirill A. Shutemov
@ 2017-04-23 23:31 ` Kirill A. Shutemov
  2017-04-24 17:23   ` Dan Williams
  2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
  1 sibling, 2 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-23 23:31 UTC (permalink / raw)
  To: Dan Williams, linux-mm
  Cc: Catalin Marinas, aneesh.kumar, steve.capper, Thomas Gleixner,
	Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
	Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, dave.hansen,
	Borislav Petkov, Rik van Riel, dann.frazier, Linus Torvalds,
	Michal Hocko, linux-tip-commits

On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
> <tipbot@zytor.com> wrote:
> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
> >
> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
> >
> > This patch provides all required callbacks required by the generic
> > get_user_pages_fast() code and switches x86 over - and removes
> > the platform specific implementation.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Dann Frazier <dann.frazier@canonical.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Steve Capper <steve.capper@linaro.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
> > [ Minor readability edits. ]
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> I'm still trying to spot the bug, but bisect points to this patch as
> the point at which my unit tests start failing with the following
> signature:
> 
> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200

Okay, I've tracked it down. The issue is triggered by replacment
get_page() with page_cache_get_speculative().

page_cache_get_speculative() doesn't have get_zone_device_page(). :-|

And I think it's your bug, Dan: it's wrong to have
get_/put_zone_device_page() in get_/put_page(). I must be handled by
page_ref_* machinery to catch all cases where we manipulate with page
refcount.

Back to the big picture:

I hate that we need to have such additional code in page refcount
primitives. I worked hard to remove compound page ugliness from there and
now zone_device creeping in...

Is it the only option?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
@ 2017-04-24 17:23   ` Dan Williams
  2017-04-24 17:30     ` Kirill A. Shutemov
  2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
  1 sibling, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-24 17:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin,
	Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier,
	Linus Torvalds, Michal Hocko, linux-tip-commits

On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
>> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
>> <tipbot@zytor.com> wrote:
>> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
>> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
>> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
>> > Committer:  Ingo Molnar <mingo@kernel.org>
>> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
>> >
>> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
>> >
>> > This patch provides all required callbacks required by the generic
>> > get_user_pages_fast() code and switches x86 over - and removes
>> > the platform specific implementation.
>> >
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
>> > Cc: Borislav Petkov <bp@alien8.de>
>> > Cc: Catalin Marinas <catalin.marinas@arm.com>
>> > Cc: Dann Frazier <dann.frazier@canonical.com>
>> > Cc: Dave Hansen <dave.hansen@intel.com>
>> > Cc: H. Peter Anvin <hpa@zytor.com>
>> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Rik van Riel <riel@redhat.com>
>> > Cc: Steve Capper <steve.capper@linaro.org>
>> > Cc: Thomas Gleixner <tglx@linutronix.de>
>> > Cc: linux-arch@vger.kernel.org
>> > Cc: linux-mm@kvack.org
>> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
>> > [ Minor readability edits. ]
>> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
>>
>> I'm still trying to spot the bug, but bisect points to this patch as
>> the point at which my unit tests start failing with the following
>> signature:
>>
>> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
>> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
>
> Okay, I've tracked it down. The issue is triggered by replacment
> get_page() with page_cache_get_speculative().
>
> page_cache_get_speculative() doesn't have get_zone_device_page(). :-|
>
> And I think it's your bug, Dan: it's wrong to have
> get_/put_zone_device_page() in get_/put_page(). I must be handled by
> page_ref_* machinery to catch all cases where we manipulate with page
> refcount.

The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE
implementation that landed in 4.5, so there was a missed conversion of
the zone-device reference counting to page_ref.

> Back to the big picture:
>
> I hate that we need to have such additional code in page refcount
> primitives. I worked hard to remove compound page ugliness from there and
> now zone_device creeping in...
>
> Is it the only option?

Not sure, I need to spend some time to understand what page_ref means
to ZONE_DEVICE.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 17:23   ` Dan Williams
@ 2017-04-24 17:30     ` Kirill A. Shutemov
  2017-04-24 17:47       ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-24 17:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin,
	Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier,
	Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 10:23:59AM -0700, Dan Williams wrote:
> On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
> >> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
> >> <tipbot@zytor.com> wrote:
> >> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
> >> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
> >> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
> >> > Committer:  Ingo Molnar <mingo@kernel.org>
> >> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
> >> >
> >> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
> >> >
> >> > This patch provides all required callbacks required by the generic
> >> > get_user_pages_fast() code and switches x86 over - and removes
> >> > the platform specific implementation.
> >> >
> >> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> > Cc: Andrew Morton <akpm@linux-foundation.org>
> >> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
> >> > Cc: Borislav Petkov <bp@alien8.de>
> >> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> >> > Cc: Dann Frazier <dann.frazier@canonical.com>
> >> > Cc: Dave Hansen <dave.hansen@intel.com>
> >> > Cc: H. Peter Anvin <hpa@zytor.com>
> >> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> > Cc: Peter Zijlstra <peterz@infradead.org>
> >> > Cc: Rik van Riel <riel@redhat.com>
> >> > Cc: Steve Capper <steve.capper@linaro.org>
> >> > Cc: Thomas Gleixner <tglx@linutronix.de>
> >> > Cc: linux-arch@vger.kernel.org
> >> > Cc: linux-mm@kvack.org
> >> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
> >> > [ Minor readability edits. ]
> >> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >>
> >> I'm still trying to spot the bug, but bisect points to this patch as
> >> the point at which my unit tests start failing with the following
> >> signature:
> >>
> >> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> >
> > Okay, I've tracked it down. The issue is triggered by replacment
> > get_page() with page_cache_get_speculative().
> >
> > page_cache_get_speculative() doesn't have get_zone_device_page(). :-|
> >
> > And I think it's your bug, Dan: it's wrong to have
> > get_/put_zone_device_page() in get_/put_page(). I must be handled by
> > page_ref_* machinery to catch all cases where we manipulate with page
> > refcount.
> 
> The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE
> implementation that landed in 4.5, so there was a missed conversion of
> the zone-device reference counting to page_ref.

Fair enough.

But get_page_unless_zero() definitely predates ZONE_DEVICE. :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 17:30     ` Kirill A. Shutemov
@ 2017-04-24 17:47       ` Dan Williams
  2017-04-24 18:01         ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-24 17:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin,
	Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier,
	Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 10:30 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Mon, Apr 24, 2017 at 10:23:59AM -0700, Dan Williams wrote:
>> On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov
>> <kirill@shutemov.name> wrote:
>> > On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
>> >> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
>> >> <tipbot@zytor.com> wrote:
>> >> > Commit-ID:  2947ba054a4dabbd82848728d765346886050029
>> >> > Gitweb:     http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
>> >> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> >> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
>> >> > Committer:  Ingo Molnar <mingo@kernel.org>
>> >> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
>> >> >
>> >> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
>> >> >
>> >> > This patch provides all required callbacks required by the generic
>> >> > get_user_pages_fast() code and switches x86 over - and removes
>> >> > the platform specific implementation.
>> >> >
>> >> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> >> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
>> >> > Cc: Borislav Petkov <bp@alien8.de>
>> >> > Cc: Catalin Marinas <catalin.marinas@arm.com>
>> >> > Cc: Dann Frazier <dann.frazier@canonical.com>
>> >> > Cc: Dave Hansen <dave.hansen@intel.com>
>> >> > Cc: H. Peter Anvin <hpa@zytor.com>
>> >> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> >> > Cc: Peter Zijlstra <peterz@infradead.org>
>> >> > Cc: Rik van Riel <riel@redhat.com>
>> >> > Cc: Steve Capper <steve.capper@linaro.org>
>> >> > Cc: Thomas Gleixner <tglx@linutronix.de>
>> >> > Cc: linux-arch@vger.kernel.org
>> >> > Cc: linux-mm@kvack.org
>> >> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
>> >> > [ Minor readability edits. ]
>> >> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> >>
>> >> I'm still trying to spot the bug, but bisect points to this patch as
>> >> the point at which my unit tests start failing with the following
>> >> signature:
>> >>
>> >> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
>> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
>> >
>> > Okay, I've tracked it down. The issue is triggered by replacment
>> > get_page() with page_cache_get_speculative().
>> >
>> > page_cache_get_speculative() doesn't have get_zone_device_page(). :-|
>> >
>> > And I think it's your bug, Dan: it's wrong to have
>> > get_/put_zone_device_page() in get_/put_page(). I must be handled by
>> > page_ref_* machinery to catch all cases where we manipulate with page
>> > refcount.
>>
>> The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE
>> implementation that landed in 4.5, so there was a missed conversion of
>> the zone-device reference counting to page_ref.
>
> Fair enough.
>
> But get_page_unless_zero() definitely predates ZONE_DEVICE. :)
>

It does, but that's deliberate. A ZONE_DEVICE page never has a zero
reference count, it's always owned by the device, never by the page
allocator. ZONE_DEVICE overrides the ->lru list_head to store private
device information and we rely on the behavior that a non-zero
reference means the page is not added to any lru or page cache list.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 17:47       ` Dan Williams
@ 2017-04-24 18:01         ` Kirill A. Shutemov
  2017-04-24 18:25           ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-24 18:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin,
	Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier,
	Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
> On Mon, Apr 24, 2017 at 10:30 AM, Kirill A. Shutemov
> >> >> [   35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> >> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> >> >
> >> > Okay, I've tracked it down. The issue is triggered by replacment
> >> > get_page() with page_cache_get_speculative().
> >> >
> >> > page_cache_get_speculative() doesn't have get_zone_device_page(). :-|
> >> >
> >> > And I think it's your bug, Dan: it's wrong to have
> >> > get_/put_zone_device_page() in get_/put_page(). I must be handled by
> >> > page_ref_* machinery to catch all cases where we manipulate with page
> >> > refcount.
> >>
> >> The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE
> >> implementation that landed in 4.5, so there was a missed conversion of
> >> the zone-device reference counting to page_ref.
> >
> > Fair enough.
> >
> > But get_page_unless_zero() definitely predates ZONE_DEVICE. :)
> >
> 
> It does, but that's deliberate. A ZONE_DEVICE page never has a zero
> reference count, it's always owned by the device, never by the page
> allocator. ZONE_DEVICE overrides the ->lru list_head to store private
> device information and we rely on the behavior that a non-zero
> reference means the page is not added to any lru or page cache list.

So, what do you propose? Use get_page() instead of
page_cache_get_speculative() in GUP_fast() if the page belong to zone
device?

I don't like it. This situation, when we only can use subset of
helpers to manipulate page refcount creates situation waiting to explode.

I think it's still better to do it on page_ref_* level.

BTW, why do we need to pin pgmap from get_page() in first place?
I don't have enough background in ZONE_DEVICE.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 18:01         ` Kirill A. Shutemov
@ 2017-04-24 18:25           ` Kirill A. Shutemov
  2017-04-24 18:41             ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-24 18:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Linux MM, Catalin Marinas, Aneesh Kumar K.V,
	Steve Capper, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel,
	Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
> I think it's still better to do it on page_ref_* level.

Something like patch below? What do you think?

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 93416196ba64..bd1b13af4567 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,20 +35,6 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
-/**
- * struct dev_pagemap - metadata for ZONE_DEVICE mappings
- * @altmap: pre-allocated/reserved memory for vmemmap allocations
- * @res: physical address range covered by @ref
- * @ref: reference count that pins the devm_memremap_pages() mapping
- * @dev: host device of the mapping for debug
- */
-struct dev_pagemap {
-	struct vmem_altmap *altmap;
-	const struct resource *res;
-	struct percpu_ref *ref;
-	struct device *dev;
-};
-
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e197d3ca3e8a..c2749b878199 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -760,19 +760,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -788,9 +780,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -799,9 +788,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..fb7bb60d446b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -601,4 +601,18 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/**
+ * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @altmap: pre-allocated/reserved memory for vmemmap allocations
+ * @res: physical address range covered by @ref
+ * @ref: reference count that pins the devm_memremap_pages() mapping
+ * @dev: host device of the mapping for debug
+ */
+struct dev_pagemap {
+	struct vmem_altmap *altmap;
+	const struct resource *res;
+	struct percpu_ref *ref;
+	struct device *dev;
+};
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 610e13271918..d834c68e21fd 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -61,6 +61,8 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 #endif
 
+static inline bool is_zone_device_page(const struct page *page);
+
 static inline int page_ref_count(struct page *page)
 {
 	return atomic_read(&page->_refcount);
@@ -92,6 +94,9 @@ static inline void page_ref_add(struct page *page, int nr)
 	atomic_add(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, nr);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_get_many(page->pgmap->ref, nr);
 }
 
 static inline void page_ref_sub(struct page *page, int nr)
@@ -99,6 +104,9 @@ static inline void page_ref_sub(struct page *page, int nr)
 	atomic_sub(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, -nr);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_put_many(page->pgmap->ref, nr);
 }
 
 static inline void page_ref_inc(struct page *page)
@@ -106,6 +114,9 @@ static inline void page_ref_inc(struct page *page)
 	atomic_inc(&page->_refcount);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, 1);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_get(page->pgmap->ref);
 }
 
 static inline void page_ref_dec(struct page *page)
@@ -113,6 +124,9 @@ static inline void page_ref_dec(struct page *page)
 	atomic_dec(&page->_refcount);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
 		__page_ref_mod(page, -1);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_put(page->pgmap->ref);
 }
 
 static inline int page_ref_sub_and_test(struct page *page, int nr)
@@ -121,6 +135,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -nr, ret);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_put_many(page->pgmap->ref, nr);
 	return ret;
 }
 
@@ -130,6 +147,9 @@ static inline int page_ref_inc_return(struct page *page)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, 1, ret);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_get(page->pgmap->ref);
 	return ret;
 }
 
@@ -139,6 +159,9 @@ static inline int page_ref_dec_and_test(struct page *page)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -1, ret);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_put(page->pgmap->ref);
 	return ret;
 }
 
@@ -148,6 +171,9 @@ static inline int page_ref_dec_return(struct page *page)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, -1, ret);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_put(page->pgmap->ref);
 	return ret;
 }
 
@@ -157,6 +183,9 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
+
+	if (unlikely(is_zone_device_page(page)) && ret)
+		percpu_ref_get_many(page->pgmap->ref, nr);
 	return ret;
 }
 
@@ -166,6 +195,9 @@ static inline int page_ref_freeze(struct page *page, int count)
 
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_freeze))
 		__page_ref_freeze(page, count, ret);
+
+	if (unlikely(is_zone_device_page(page)) && ret)
+		percpu_ref_put_many(page->pgmap->ref, count);
 	return ret;
 }
 
@@ -177,6 +209,9 @@ static inline void page_ref_unfreeze(struct page *page, int count)
 	atomic_set(&page->_refcount, count);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
 		__page_ref_unfreeze(page, count);
+
+	if (unlikely(is_zone_device_page(page)))
+		percpu_ref_get_many(page->pgmap->ref, count);
 }
 
 #endif
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 06123234f118..936cef79d811 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 18:25           ` Kirill A. Shutemov
@ 2017-04-24 18:41             ` Dan Williams
  2017-04-25 13:19               ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-24 18:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V,
	Steve Capper, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel,
	Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote:
>> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
>> I think it's still better to do it on page_ref_* level.
>
> Something like patch below? What do you think?

>From a quick glance, I think this looks like the right way to go.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 18:41             ` Dan Williams
@ 2017-04-25 13:19               ` Kirill A. Shutemov
  2017-04-25 16:44                 ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-25 13:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V,
	Steve Capper, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel,
	Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 11:41:51AM -0700, Dan Williams wrote:
> On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote:
> >> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
> >> I think it's still better to do it on page_ref_* level.
> >
> > Something like patch below? What do you think?
> 
> From a quick glance, I think this looks like the right way to go.

Okay, but I still would like to remove manipulation with pgmap->ref from
hot path.

Can we just check that page_count() match our expectation on
devm_memremap_pages_release() instead of this?

I probably miss something in bigger picture, but would something like
patch work too? It seems work for the test case.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a835edd2db34..695da2a19b4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5229da..e542bb2f7ab0 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -237,12 +225,21 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	struct resource *res = &page_map->res;
 	resource_size_t align_start, align_size;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
+	unsigned long pfn;
 
 	if (percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
 		percpu_ref_put(pgmap->ref);
 	}
 
+	for_each_device_pfn(pfn, page_map) {
+		struct page *page = pfn_to_page(pfn);
+
+		dev_WARN_ONCE(dev, page_count(page) != 1,
+				"%s: unexpected page count: %d!\n",
+				__func__, page_count(page));
+	}
+
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-25 13:19               ` Kirill A. Shutemov
@ 2017-04-25 16:44                 ` Dan Williams
  0 siblings, 0 replies; 48+ messages in thread
From: Dan Williams @ 2017-04-25 16:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V,
	Steve Capper, Thomas Gleixner, Peter Zijlstra,
	Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel,
	Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits

On Tue, Apr 25, 2017 at 6:19 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> On Mon, Apr 24, 2017 at 11:41:51AM -0700, Dan Williams wrote:
>> On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote:
>> >> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
>> >> I think it's still better to do it on page_ref_* level.
>> >
>> > Something like patch below? What do you think?
>>
>> From a quick glance, I think this looks like the right way to go.
>
> Okay, but I still would like to remove manipulation with pgmap->ref from
> hot path.
>
> Can we just check that page_count() match our expectation on
> devm_memremap_pages_release() instead of this?
>
> I probably miss something in bigger picture, but would something like
> patch work too? It seems work for the test case.

No, unfortunately this is broken. It should be perfectly legal to
start the driver shutdown process while page references are still
outstanding. We use the percpu-ref infrastructure to wait for those
references to be dropped. With the approach below we'll just race and
crash.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a835edd2db34..695da2a19b4c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> -void get_zone_device_page(struct page *page);
> -void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>         return page_zonenum(page) == ZONE_DEVICE;
>  }
>  #else
> -static inline void get_zone_device_page(struct page *page)
> -{
> -}
> -static inline void put_zone_device_page(struct page *page)
> -{
> -}
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>         return false;
> @@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
>          */
>         VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>         page_ref_inc(page);
> -
> -       if (unlikely(is_zone_device_page(page)))
> -               get_zone_device_page(page);
>  }
>
>  static inline void put_page(struct page *page)
> @@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
>
>         if (put_page_testzero(page))
>                 __put_page(page);
> -
> -       if (unlikely(is_zone_device_page(page)))
> -               put_zone_device_page(page);
>  }
>
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07e85e5229da..e542bb2f7ab0 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -182,18 +182,6 @@ struct page_map {
>         struct vmem_altmap altmap;
>  };
>
> -void get_zone_device_page(struct page *page)
> -{
> -       percpu_ref_get(page->pgmap->ref);
> -}
> -EXPORT_SYMBOL(get_zone_device_page);
> -
> -void put_zone_device_page(struct page *page)
> -{
> -       put_dev_pagemap(page->pgmap);
> -}
> -EXPORT_SYMBOL(put_zone_device_page);
> -
>  static void pgmap_radix_release(struct resource *res)
>  {
>         resource_size_t key, align_start, align_size, align_end;
> @@ -237,12 +225,21 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>         struct resource *res = &page_map->res;
>         resource_size_t align_start, align_size;
>         struct dev_pagemap *pgmap = &page_map->pgmap;
> +       unsigned long pfn;
>
>         if (percpu_ref_tryget_live(pgmap->ref)) {
>                 dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
>                 percpu_ref_put(pgmap->ref);
>         }
>
> +       for_each_device_pfn(pfn, page_map) {
> +               struct page *page = pfn_to_page(pfn);
> +
> +               dev_WARN_ONCE(dev, page_count(page) != 1,
> +                               "%s: unexpected page count: %d!\n",
> +                               __func__, page_count(page));
> +       }
> +
>         /* pages are dead and unused, undo the arch mapping */
>         align_start = res->start & ~(SECTION_SIZE - 1);
>         align_size = ALIGN(resource_size(res), SECTION_SIZE);
> --
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
  2017-04-24 17:23   ` Dan Williams
@ 2017-04-27  0:55   ` Dan Williams
  2017-04-27  8:33     ` Kirill A. Shutemov
  2017-04-27 16:11     ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Logan Gunthorpe
  1 sibling, 2 replies; 48+ messages in thread
From: Dan Williams @ 2017-04-27  0:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Jérôme Glisse, Logan Gunthorpe, linux-kernel,
	Kirill Shutemov

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

Cc Ingo Molnar <mingo@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch might fix the regression that we found with the conversion to
generic get_user_pages_fast() in the x86/mm branch pending for 4.12
(commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation"). I'll test tomorrow, but in case
someone can give it a try before I wake up, here's an early version.

 drivers/dax/pmem.c    |    2 +-
 drivers/nvdimm/pmem.c |   13 +++++++++++--
 include/linux/mm.h    |   14 --------------
 kernel/memremap.c     |   22 +++++++++-------------
 mm/swap.c             |   10 ++++++++++
 5 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index d4ca19bd74eb..9f2a0b4fd801 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
 	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	wait_for_completion(&dax_pmem->cmp);
 	percpu_ref_exit(ref);
 }
 
@@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
 	percpu_ref_kill(ref);
-	wait_for_completion(&dax_pmem->cmp);
 }
 
 static int dax_pmem_probe(struct device *dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b3dab73d741..6be0c1253fcd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -25,6 +25,7 @@
 #include <linux/badblocks.h>
 #include <linux/memremap.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
@@ -243,6 +244,11 @@ static void pmem_release_queue(void *q)
 	blk_cleanup_queue(q);
 }
 
+static void pmem_freeze_queue(void *q)
+{
+	blk_mq_freeze_queue_start(q);
+}
+
 static void pmem_release_disk(void *__pmem)
 {
 	struct pmem_device *pmem = __pmem;
@@ -301,6 +307,9 @@ static int pmem_attach_disk(struct device *dev,
 	if (!q)
 		return -ENOMEM;
 
+	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+		return -ENOMEM;
+
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
@@ -320,10 +329,10 @@ static int pmem_attach_disk(struct device *dev,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 
 	/*
-	 * At release time the queue must be dead before
+	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
 	 */
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
 		return -ENOMEM;
 
 	if (IS_ERR(addr))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00a8fa7e366a..ce17b35257ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -758,19 +758,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -786,9 +778,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -797,9 +786,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5229da..5316efdde083 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	struct resource *res = &page_map->res;
 	resource_size_t align_start, align_size;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
+	unsigned long pfn;
+
+	for_each_device_pfn(pfn, page_map)
+		put_page(pfn_to_page(pfn));
 
 	if (percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
- *    (or devm release event).
+ *    (or devm release event). The expected order of events is that @ref has
+ *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
+ *    wait for the completion of kill and percpu_ref_exit() must occur after
+ *    devm_memremap_pages_release().
  *
  * 2/ @res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		 */
 		list_del(&page->lru);
 		page->pgmap = pgmap;
+		percpu_ref_get(ref);
 	}
 	devres_add(dev, page_map);
 	return __va(res->start);
diff --git a/mm/swap.c b/mm/swap.c
index 5dabf444d724..01267dda6668 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
 
 void __put_page(struct page *page)
 {
+	if (is_zone_device_page(page)) {
+		put_dev_pagemap(page->pgmap);
+
+		/*
+		 * The page belong to device, do not return it to
+		 * page allocator.
+		 */
+		return;
+	}
+
 	if (unlikely(PageCompound(page)))
 		__put_compound_page(page);
 	else

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
@ 2017-04-27  8:33     ` Kirill A. Shutemov
  2017-04-28  6:39       ` Ingo Molnar
  2017-04-27 16:11     ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Logan Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-27  8:33 UTC (permalink / raw)
  To: Dan Williams, Ingo Molnar
  Cc: akpm, linux-mm, Jérôme Glisse, Logan Gunthorpe,
	linux-kernel, Kirill Shutemov

On Wed, Apr 26, 2017 at 05:55:31PM -0700, Dan Williams wrote:
> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.
> 
> Cc Ingo Molnar <mingo@redhat.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
> This patch might fix the regression that we found with the conversion to
> generic get_user_pages_fast() in the x86/mm branch pending for 4.12
> (commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic
> get_user_page_fast() implementation"). I'll test tomorrow, but in case
> someone can give it a try before I wake up, here's an early version.

+ Ingo.

This works for me with and without GUP revert.

Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

>  drivers/dax/pmem.c    |    2 +-
>  drivers/nvdimm/pmem.c |   13 +++++++++++--

There's a trivial conflict in drivers/nvdimm/pmem.c when applied to
tip/master.

>  include/linux/mm.h    |   14 --------------
>  kernel/memremap.c     |   22 +++++++++-------------
>  mm/swap.c             |   10 ++++++++++
>  5 files changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index d4ca19bd74eb..9f2a0b4fd801 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
>  	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
>  
>  	dev_dbg(dax_pmem->dev, "%s\n", __func__);
> +	wait_for_completion(&dax_pmem->cmp);
>  	percpu_ref_exit(ref);
>  }
>  
> @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
>  
>  	dev_dbg(dax_pmem->dev, "%s\n", __func__);
>  	percpu_ref_kill(ref);
> -	wait_for_completion(&dax_pmem->cmp);
>  }
>  
>  static int dax_pmem_probe(struct device *dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 3b3dab73d741..6be0c1253fcd 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -25,6 +25,7 @@
>  #include <linux/badblocks.h>
>  #include <linux/memremap.h>
>  #include <linux/vmalloc.h>
> +#include <linux/blk-mq.h>
>  #include <linux/pfn_t.h>
>  #include <linux/slab.h>
>  #include <linux/pmem.h>
> @@ -243,6 +244,11 @@ static void pmem_release_queue(void *q)
>  	blk_cleanup_queue(q);
>  }
>  
> +static void pmem_freeze_queue(void *q)
> +{
> +	blk_mq_freeze_queue_start(q);
> +}
> +
>  static void pmem_release_disk(void *__pmem)
>  {
>  	struct pmem_device *pmem = __pmem;
> @@ -301,6 +307,9 @@ static int pmem_attach_disk(struct device *dev,
>  	if (!q)
>  		return -ENOMEM;
>  
> +	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> +		return -ENOMEM;
> +
>  	pmem->pfn_flags = PFN_DEV;
>  	if (is_nd_pfn(dev)) {
>  		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
> @@ -320,10 +329,10 @@ static int pmem_attach_disk(struct device *dev,
>  				pmem->size, ARCH_MEMREMAP_PMEM);
>  
>  	/*
> -	 * At release time the queue must be dead before
> +	 * At release time the queue must be frozen before
>  	 * devm_memremap_pages is unwound
>  	 */
> -	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> +	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
>  		return -ENOMEM;
>  
>  	if (IS_ERR(addr))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 00a8fa7e366a..ce17b35257ac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -758,19 +758,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void get_zone_device_page(struct page *page);
> -void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
>  #else
> -static inline void get_zone_device_page(struct page *page)
> -{
> -}
> -static inline void put_zone_device_page(struct page *page)
> -{
> -}
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return false;
> @@ -786,9 +778,6 @@ static inline void get_page(struct page *page)
>  	 */
>  	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>  	page_ref_inc(page);
> -
> -	if (unlikely(is_zone_device_page(page)))
> -		get_zone_device_page(page);
>  }
>  
>  static inline void put_page(struct page *page)
> @@ -797,9 +786,6 @@ static inline void put_page(struct page *page)
>  
>  	if (put_page_testzero(page))
>  		__put_page(page);
> -
> -	if (unlikely(is_zone_device_page(page)))
> -		put_zone_device_page(page);
>  }
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07e85e5229da..5316efdde083 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -182,18 +182,6 @@ struct page_map {
>  	struct vmem_altmap altmap;
>  };
>  
> -void get_zone_device_page(struct page *page)
> -{
> -	percpu_ref_get(page->pgmap->ref);
> -}
> -EXPORT_SYMBOL(get_zone_device_page);
> -
> -void put_zone_device_page(struct page *page)
> -{
> -	put_dev_pagemap(page->pgmap);
> -}
> -EXPORT_SYMBOL(put_zone_device_page);
> -
>  static void pgmap_radix_release(struct resource *res)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>  	struct resource *res = &page_map->res;
>  	resource_size_t align_start, align_size;
>  	struct dev_pagemap *pgmap = &page_map->pgmap;
> +	unsigned long pfn;
> +
> +	for_each_device_pfn(pfn, page_map)
> +		put_page(pfn_to_page(pfn));
>  
>  	if (percpu_ref_tryget_live(pgmap->ref)) {
>  		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>   *
>   * Notes:
>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
> - *    (or devm release event).
> + *    (or devm release event). The expected order of events is that @ref has
> + *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
> + *    wait for the completion of kill and percpu_ref_exit() must occur after
> + *    devm_memremap_pages_release().
>   *
>   * 2/ @res is expected to be a host memory range that could feasibly be
>   *    treated as a "System RAM" range, i.e. not a device mmio range, but
> @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  		 */
>  		list_del(&page->lru);
>  		page->pgmap = pgmap;
> +		percpu_ref_get(ref);
>  	}
>  	devres_add(dev, page_map);
>  	return __va(res->start);
> diff --git a/mm/swap.c b/mm/swap.c
> index 5dabf444d724..01267dda6668 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
>  
>  void __put_page(struct page *page)
>  {
> +	if (is_zone_device_page(page)) {
> +		put_dev_pagemap(page->pgmap);
> +
> +		/*
> +		 * The page belong to device, do not return it to
> +		 * page allocator.
> +		 */
> +		return;
> +	}
> +
>  	if (unlikely(PageCompound(page)))
>  		__put_compound_page(page);
>  	else
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
  2017-04-27  8:33     ` Kirill A. Shutemov
@ 2017-04-27 16:11     ` Logan Gunthorpe
  2017-04-27 16:14       ` Dan Williams
  1 sibling, 1 reply; 48+ messages in thread
From: Logan Gunthorpe @ 2017-04-27 16:11 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: linux-mm, Jérôme Glisse, linux-kernel, Kirill Shutemov



On 26/04/17 06:55 PM, Dan Williams wrote:
> @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>   *
>   * Notes:
>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
> - *    (or devm release event).
> + *    (or devm release event). The expected order of events is that @ref has
> + *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
> + *    wait for the completion of kill and percpu_ref_exit() must occur after
> + *    devm_memremap_pages_release().
>   *
>   * 2/ @res is expected to be a host memory range that could feasibly be
>   *    treated as a "System RAM" range, i.e. not a device mmio range, but
> @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  		 */
>  		list_del(&page->lru);
>  		page->pgmap = pgmap;
> +		percpu_ref_get(ref);
>  	}
>  	devres_add(dev, page_map);
>  	return __va(res->start);
> diff --git a/mm/swap.c b/mm/swap.c
> index 5dabf444d724..01267dda6668 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
>  
>  void __put_page(struct page *page)
>  {
> +	if (is_zone_device_page(page)) {
> +		put_dev_pagemap(page->pgmap);
> +
> +		/*
> +		 * The page belong to device, do not return it to
> +		 * page allocator.
> +		 */
> +		return;
> +	}
> +
>  	if (unlikely(PageCompound(page)))
>  		__put_compound_page(page);
>  	else
> 

Forgive me if I'm missing something but this doesn't make sense to me.
We are taking a reference once when the region is initialized and
releasing it every time a page within the region's reference count drops
to zero. That does not seem to be symmetric and I don't see how it
tracks that pages are in use. Shouldn't get_dev_pagemap be called when
any page is allocated or something like that (ie. the inverse of
__put_page)?

Thanks,

Logan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27 16:11     ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Logan Gunthorpe
@ 2017-04-27 16:14       ` Dan Williams
  2017-04-27 16:33         ` Logan Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-27 16:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel,
	Kirill Shutemov

On Thu, Apr 27, 2017 at 9:11 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 26/04/17 06:55 PM, Dan Williams wrote:
>> @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>>   *
>>   * Notes:
>>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
>> - *    (or devm release event).
>> + *    (or devm release event). The expected order of events is that @ref has
>> + *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
>> + *    wait for the completion of kill and percpu_ref_exit() must occur after
>> + *    devm_memremap_pages_release().
>>   *
>>   * 2/ @res is expected to be a host memory range that could feasibly be
>>   *    treated as a "System RAM" range, i.e. not a device mmio range, but
>> @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>>                */
>>               list_del(&page->lru);
>>               page->pgmap = pgmap;
>> +             percpu_ref_get(ref);
>>       }
>>       devres_add(dev, page_map);
>>       return __va(res->start);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 5dabf444d724..01267dda6668 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
>>
>>  void __put_page(struct page *page)
>>  {
>> +     if (is_zone_device_page(page)) {
>> +             put_dev_pagemap(page->pgmap);
>> +
>> +             /*
>> +              * The page belong to device, do not return it to
>> +              * page allocator.
>> +              */
>> +             return;
>> +     }
>> +
>>       if (unlikely(PageCompound(page)))
>>               __put_compound_page(page);
>>       else
>>
>
> Forgive me if I'm missing something but this doesn't make sense to me.
> We are taking a reference once when the region is initialized and
> releasing it every time a page within the region's reference count drops
> to zero. That does not seem to be symmetric and I don't see how it
> tracks that pages are in use. Shouldn't get_dev_pagemap be called when
> any page is allocated or something like that (ie. the inverse of
> __put_page)?

You're overlooking that the page reference count 1 after
arch_add_memory(). So at the end of time we're just dropping the
arch_add_memory() reference to release the page and related
dev_pagemap.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27 16:14       ` Dan Williams
@ 2017-04-27 16:33         ` Logan Gunthorpe
  2017-04-27 16:38           ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Logan Gunthorpe @ 2017-04-27 16:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel,
	Kirill Shutemov



On 27/04/17 10:14 AM, Dan Williams wrote:
> You're overlooking that the page reference count 1 after
> arch_add_memory(). So at the end of time we're just dropping the
> arch_add_memory() reference to release the page and related
> dev_pagemap.

Thanks, that does actually make a lot more sense to me now. However,
there still appears to be an asymmetry in that the pgmap->ref is
incremented once and decremented once per page...

Logan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27 16:33         ` Logan Gunthorpe
@ 2017-04-27 16:38           ` Dan Williams
  2017-04-27 16:45             ` Logan Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-27 16:38 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel,
	Kirill Shutemov

On Thu, Apr 27, 2017 at 9:33 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 27/04/17 10:14 AM, Dan Williams wrote:
>> You're overlooking that the page reference count 1 after
>> arch_add_memory(). So at the end of time we're just dropping the
>> arch_add_memory() reference to release the page and related
>> dev_pagemap.
>
> Thanks, that does actually make a lot more sense to me now. However,
> there still appears to be an asymmetry in that the pgmap->ref is
> incremented once and decremented once per page...
>

No, this hunk...

@@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev,
struct resource *res,
                 */
                list_del(&page->lru);
                page->pgmap = pgmap;
+               percpu_ref_get(ref);
        }


...is inside a for_each_device_pfn() loop.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27 16:38           ` Dan Williams
@ 2017-04-27 16:45             ` Logan Gunthorpe
  2017-04-27 16:46               ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Logan Gunthorpe @ 2017-04-27 16:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel,
	Kirill Shutemov



On 27/04/17 10:38 AM, Dan Williams wrote:
> ...is inside a for_each_device_pfn() loop.
> 

Ah, oops. Then that makes perfect sense. Thanks.

You may have my review tag if you'd like:

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Logan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27 16:45             ` Logan Gunthorpe
@ 2017-04-27 16:46               ` Dan Williams
  0 siblings, 0 replies; 48+ messages in thread
From: Dan Williams @ 2017-04-27 16:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel,
	Kirill Shutemov

On Thu, Apr 27, 2017 at 9:45 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 27/04/17 10:38 AM, Dan Williams wrote:
>> ...is inside a for_each_device_pfn() loop.
>>
>
> Ah, oops. Then that makes perfect sense. Thanks.
>
> You may have my review tag if you'd like:
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-27  8:33     ` Kirill A. Shutemov
@ 2017-04-28  6:39       ` Ingo Molnar
  2017-04-28  8:14         ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov
  2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
  0 siblings, 2 replies; 48+ messages in thread
From: Ingo Molnar @ 2017-04-28  6:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Ingo Molnar, akpm, linux-mm,
	Jérôme Glisse, Logan Gunthorpe, linux-kernel,
	Kirill Shutemov


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Wed, Apr 26, 2017 at 05:55:31PM -0700, Dan Williams wrote:
> > Kirill points out that the calls to {get,put}_dev_pagemap() can be
> > removed from the mm fast path if we take a single get_dev_pagemap()
> > reference to signify that the page is alive and use the final put of the
> > page to drop that reference.
> > 
> > This does require some care to make sure that any waits for the
> > percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> > since it now maintains its own elevated reference.
> > 
> > Cc Ingo Molnar <mingo@redhat.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Logan Gunthorpe <logang@deltatee.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> > 
> > This patch might fix the regression that we found with the conversion to
> > generic get_user_pages_fast() in the x86/mm branch pending for 4.12
> > (commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic
> > get_user_page_fast() implementation"). I'll test tomorrow, but in case
> > someone can give it a try before I wake up, here's an early version.
> 
> + Ingo.
> 
> This works for me with and without GUP revert.
> 
> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> >  drivers/dax/pmem.c    |    2 +-
> >  drivers/nvdimm/pmem.c |   13 +++++++++++--
> 
> There's a trivial conflict in drivers/nvdimm/pmem.c when applied to
> tip/master.

Ok, could someone please send a version either to Linus for v4.11, or a version 
against latest -tip so I can included it in x86/mm, so that x86/mm gets unbroken.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference
  2017-04-28  6:39       ` Ingo Molnar
@ 2017-04-28  8:14         ` Kirill A. Shutemov
  2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
  1 sibling, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-28  8:14 UTC (permalink / raw)
  To: Ingo Molnar, akpm
  Cc: Dan Williams, Jérôme Glisse, Logan Gunthorpe, linux-mm,
	linux-kernel, Kirill A . Shutemov

From: Dan Williams <dan.j.williams@intel.com>

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

Cc Ingo Molnar <mingo@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
[kirill.shutemov@linux.intel.com: rebased to -tip]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/dax/pmem.c    |  2 +-
 drivers/nvdimm/pmem.c | 13 +++++++++++--
 include/linux/mm.h    | 14 --------------
 kernel/memremap.c     | 22 +++++++++-------------
 mm/swap.c             | 10 ++++++++++
 5 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 033f49b31fdc..cb0d742fa23f 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
 	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	wait_for_completion(&dax_pmem->cmp);
 	percpu_ref_exit(ref);
 }
 
@@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
 	percpu_ref_kill(ref);
-	wait_for_completion(&dax_pmem->cmp);
 }
 
 static int dax_pmem_probe(struct device *dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fb7bbc79ac26 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -25,6 +25,7 @@
 #include <linux/badblocks.h>
 #include <linux/memremap.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
@@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
 	blk_cleanup_queue(q);
 }
 
+static void pmem_freeze_queue(void *q)
+{
+	blk_mq_freeze_queue_start(q);
+}
+
 static void pmem_release_disk(void *disk)
 {
 	del_gendisk(disk);
@@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
 	if (!q)
 		return -ENOMEM;
 
+	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+		return -ENOMEM;
+
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
@@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 
 	/*
-	 * At release time the queue must be dead before
+	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
 	 */
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
 		return -ENOMEM;
 
 	if (IS_ERR(addr))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a835edd2db34..695da2a19b4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5229da..5316efdde083 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	struct resource *res = &page_map->res;
 	resource_size_t align_start, align_size;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
+	unsigned long pfn;
+
+	for_each_device_pfn(pfn, page_map)
+		put_page(pfn_to_page(pfn));
 
 	if (percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
- *    (or devm release event).
+ *    (or devm release event). The expected order of events is that @ref has
+ *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
+ *    wait for the completion of kill and percpu_ref_exit() must occur after
+ *    devm_memremap_pages_release().
  *
  * 2/ @res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		 */
 		list_del(&page->lru);
 		page->pgmap = pgmap;
+		percpu_ref_get(ref);
 	}
 	devres_add(dev, page_map);
 	return __va(res->start);
diff --git a/mm/swap.c b/mm/swap.c
index 5dabf444d724..01267dda6668 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
 
 void __put_page(struct page *page)
 {
+	if (is_zone_device_page(page)) {
+		put_dev_pagemap(page->pgmap);
+
+		/*
+		 * The page belong to device, do not return it to
+		 * page allocator.
+		 */
+		return;
+	}
+
 	if (unlikely(PageCompound(page)))
 		__put_compound_page(page);
 	else
-- 
2.11.0

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28  6:39       ` Ingo Molnar
  2017-04-28  8:14         ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov
@ 2017-04-28 17:23         ` Dan Williams
  2017-04-28 17:34           ` Jerome Glisse
                             ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Dan Williams @ 2017-04-28 17:23 UTC (permalink / raw)
  To: mingo
  Cc: linux-kernel, linux-mm, Jérôme Glisse, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v2:
* Rebased to tip/master
* Clarified comment in __put_page
* Clarified devm_memremap_pages() kernel doc about ordering of
  devm_memremap_pages_release() vs percpu_ref_kill() vs wait for
  percpu_ref to drop to zero.

Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert
'x86/mm/gup: Switch GUP to the generic get_user_page_fast()
implementation'". It should be good to go through x86/mm.

 drivers/dax/pmem.c    |    2 +-
 drivers/nvdimm/pmem.c |   13 +++++++++++--
 include/linux/mm.h    |   14 --------------
 kernel/memremap.c     |   22 +++++++++-------------
 mm/swap.c             |   10 ++++++++++
 5 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 033f49b31fdc..cb0d742fa23f 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
 	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	wait_for_completion(&dax_pmem->cmp);
 	percpu_ref_exit(ref);
 }
 
@@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
 	percpu_ref_kill(ref);
-	wait_for_completion(&dax_pmem->cmp);
 }
 
 static int dax_pmem_probe(struct device *dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a12e..fb7bbc79ac26 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -25,6 +25,7 @@
 #include <linux/badblocks.h>
 #include <linux/memremap.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
@@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
 	blk_cleanup_queue(q);
 }
 
+static void pmem_freeze_queue(void *q)
+{
+	blk_mq_freeze_queue_start(q);
+}
+
 static void pmem_release_disk(void *disk)
 {
 	del_gendisk(disk);
@@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
 	if (!q)
 		return -ENOMEM;
 
+	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+		return -ENOMEM;
+
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
@@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 
 	/*
-	 * At release time the queue must be dead before
+	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
 	 */
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
 		return -ENOMEM;
 
 	if (IS_ERR(addr))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a835edd2db34..695da2a19b4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5229da..23a6483c3666 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	struct resource *res = &page_map->res;
 	resource_size_t align_start, align_size;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
+	unsigned long pfn;
+
+	for_each_device_pfn(pfn, page_map)
+		put_page(pfn_to_page(pfn));
 
 	if (percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
- *    (or devm release event).
+ *    (or devm release event). The expected order of events is that @ref has
+ *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
+ *    wait for the completion of all references being dropped and
+ *    percpu_ref_exit() must occur after devm_memremap_pages_release().
  *
  * 2/ @res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		 */
 		list_del(&page->lru);
 		page->pgmap = pgmap;
+		percpu_ref_get(ref);
 	}
 	devres_add(dev, page_map);
 	return __va(res->start);
diff --git a/mm/swap.c b/mm/swap.c
index 5dabf444d724..d8d9ee9e311a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
 
 void __put_page(struct page *page)
 {
+	if (is_zone_device_page(page)) {
+		put_dev_pagemap(page->pgmap);
+
+		/*
+		 * The page belongs to the device that created pgmap. Do
+		 * not return it to page allocator.
+		 */
+		return;
+	}
+
 	if (unlikely(PageCompound(page)))
 		__put_compound_page(page);
 	else

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
@ 2017-04-28 17:34           ` Jerome Glisse
  2017-04-28 17:41             ` Dan Williams
  2017-04-29 14:18           ` Ingo Molnar
  2017-05-01  8:28           ` [tip:x86/mm] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash tip-bot for Dan Williams
  2 siblings, 1 reply; 48+ messages in thread
From: Jerome Glisse @ 2017-04-28 17:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: mingo, linux-kernel, linux-mm, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.

This is NAK from HMM point of view as i need those call. So if you remove
them now i will need to add them back as part of HMM.

Cheers,
Jérôme

> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes in v2:
> * Rebased to tip/master
> * Clarified comment in __put_page
> * Clarified devm_memremap_pages() kernel doc about ordering of
>   devm_memremap_pages_release() vs percpu_ref_kill() vs wait for
>   percpu_ref to drop to zero.
> 
> Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert
> 'x86/mm/gup: Switch GUP to the generic get_user_page_fast()
> implementation'". It should be good to go through x86/mm.
> 
>  drivers/dax/pmem.c    |    2 +-
>  drivers/nvdimm/pmem.c |   13 +++++++++++--
>  include/linux/mm.h    |   14 --------------
>  kernel/memremap.c     |   22 +++++++++-------------
>  mm/swap.c             |   10 ++++++++++
>  5 files changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 033f49b31fdc..cb0d742fa23f 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
>  	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
>  
>  	dev_dbg(dax_pmem->dev, "%s\n", __func__);
> +	wait_for_completion(&dax_pmem->cmp);
>  	percpu_ref_exit(ref);
>  }
>  
> @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
>  
>  	dev_dbg(dax_pmem->dev, "%s\n", __func__);
>  	percpu_ref_kill(ref);
> -	wait_for_completion(&dax_pmem->cmp);
>  }
>  
>  static int dax_pmem_probe(struct device *dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 5b536be5a12e..fb7bbc79ac26 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -25,6 +25,7 @@
>  #include <linux/badblocks.h>
>  #include <linux/memremap.h>
>  #include <linux/vmalloc.h>
> +#include <linux/blk-mq.h>
>  #include <linux/pfn_t.h>
>  #include <linux/slab.h>
>  #include <linux/pmem.h>
> @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
>  	blk_cleanup_queue(q);
>  }
>  
> +static void pmem_freeze_queue(void *q)
> +{
> +	blk_mq_freeze_queue_start(q);
> +}
> +
>  static void pmem_release_disk(void *disk)
>  {
>  	del_gendisk(disk);
> @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
>  	if (!q)
>  		return -ENOMEM;
>  
> +	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> +		return -ENOMEM;
> +
>  	pmem->pfn_flags = PFN_DEV;
>  	if (is_nd_pfn(dev)) {
>  		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
> @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
>  				pmem->size, ARCH_MEMREMAP_PMEM);
>  
>  	/*
> -	 * At release time the queue must be dead before
> +	 * At release time the queue must be frozen before
>  	 * devm_memremap_pages is unwound
>  	 */
> -	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
> +	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
>  		return -ENOMEM;
>  
>  	if (IS_ERR(addr))
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a835edd2db34..695da2a19b4c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct
> page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void get_zone_device_page(struct page *page);
> -void put_zone_device_page(struct page *page);
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
>  #else
> -static inline void get_zone_device_page(struct page *page)
> -{
> -}
> -static inline void put_zone_device_page(struct page *page)
> -{
> -}
>  static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return false;
> @@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
>  	 */
>  	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>  	page_ref_inc(page);
> -
> -	if (unlikely(is_zone_device_page(page)))
> -		get_zone_device_page(page);
>  }
>  
>  static inline void put_page(struct page *page)
> @@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
>  
>  	if (put_page_testzero(page))
>  		__put_page(page);
> -
> -	if (unlikely(is_zone_device_page(page)))
> -		put_zone_device_page(page);
>  }
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07e85e5229da..23a6483c3666 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -182,18 +182,6 @@ struct page_map {
>  	struct vmem_altmap altmap;
>  };
>  
> -void get_zone_device_page(struct page *page)
> -{
> -	percpu_ref_get(page->pgmap->ref);
> -}
> -EXPORT_SYMBOL(get_zone_device_page);
> -
> -void put_zone_device_page(struct page *page)
> -{
> -	put_dev_pagemap(page->pgmap);
> -}
> -EXPORT_SYMBOL(put_zone_device_page);
> -
>  static void pgmap_radix_release(struct resource *res)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device
> *dev, void *data)
>  	struct resource *res = &page_map->res;
>  	resource_size_t align_start, align_size;
>  	struct dev_pagemap *pgmap = &page_map->pgmap;
> +	unsigned long pfn;
> +
> +	for_each_device_pfn(pfn, page_map)
> +		put_page(pfn_to_page(pfn));
>  
>  	if (percpu_ref_tryget_live(pgmap->ref)) {
>  		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t
> phys)
>   *
>   * Notes:
>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages()
>   time
> - *    (or devm release event).
> + *    (or devm release event). The expected order of events is that @ref has
> + *    been through percpu_ref_kill() before devm_memremap_pages_release().
> The
> + *    wait for the completion of all references being dropped and
> + *    percpu_ref_exit() must occur after devm_memremap_pages_release().
>   *
>   * 2/ @res is expected to be a host memory range that could feasibly be
>   *    treated as a "System RAM" range, i.e. not a device mmio range, but
> @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct
> resource *res,
>  		 */
>  		list_del(&page->lru);
>  		page->pgmap = pgmap;
> +		percpu_ref_get(ref);
>  	}
>  	devres_add(dev, page_map);
>  	return __va(res->start);
> diff --git a/mm/swap.c b/mm/swap.c
> index 5dabf444d724..d8d9ee9e311a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
>  
>  void __put_page(struct page *page)
>  {
> +	if (is_zone_device_page(page)) {
> +		put_dev_pagemap(page->pgmap);
> +
> +		/*
> +		 * The page belongs to the device that created pgmap. Do
> +		 * not return it to page allocator.
> +		 */
> +		return;
> +	}
> +
>  	if (unlikely(PageCompound(page)))
>  		__put_compound_page(page);
>  	else
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 17:34           ` Jerome Glisse
@ 2017-04-28 17:41             ` Dan Williams
  2017-04-28 18:00               ` Jerome Glisse
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-28 17:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>
> This is NAK from HMM point of view as i need those call. So if you remove
> them now i will need to add them back as part of HMM.

I thought you only need them at page free time? You can still hook __put_page().

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 17:41             ` Dan Williams
@ 2017-04-28 18:00               ` Jerome Glisse
  2017-04-28 19:02                 ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Jerome Glisse @ 2017-04-28 18:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >
> > This is NAK from HMM point of view as i need those call. So if you remove
> > them now i will need to add them back as part of HMM.
> 
> I thought you only need them at page free time? You can still hook
> __put_page().

No, i need a hook when page refcount reach 1, not 0. That being said
i don't care about put_dev_pagemap(page->pgmap); so that part of the
patch is fine from HMM point of view but i definitly need to hook my-
self in the general put_page() function.

So i will have to undo part of this patch for HMM (put_page() will
need to handle ZONE_DEVICE page differently).

Cheers,
Jérôme 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 18:00               ` Jerome Glisse
@ 2017-04-28 19:02                 ` Dan Williams
  2017-04-28 19:16                   ` Jerome Glisse
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-28 19:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> reference to signify that the page is alive and use the final put of the
>> >> page to drop that reference.
>> >>
>> >> This does require some care to make sure that any waits for the
>> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> since it now maintains its own elevated reference.
>> >
>> > This is NAK from HMM point of view as i need those call. So if you remove
>> > them now i will need to add them back as part of HMM.
>>
>> I thought you only need them at page free time? You can still hook
>> __put_page().
>
> No, i need a hook when page refcount reach 1, not 0. That being said
> i don't care about put_dev_pagemap(page->pgmap); so that part of the
> patch is fine from HMM point of view but i definitly need to hook my-
> self in the general put_page() function.
>
> So i will have to undo part of this patch for HMM (put_page() will
> need to handle ZONE_DEVICE page differently).

Ok, I'd rather this go in now since it fixes the existing use case,
and unblocks the get_user_pages_fast() conversion to generic code.
That also gives Kirill and -mm folks a chance to review what HMM wants
to do on top of the page_ref infrastructure.  The
{get,put}_zone_device_page interface went in in 4.5 right before
page_ref went in during 4.6, so it was just an oversight that
{get,put}_zone_device_page were not removed earlier.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 19:02                 ` Dan Williams
@ 2017-04-28 19:16                   ` Jerome Glisse
  2017-04-28 19:22                     ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Jerome Glisse @ 2017-04-28 19:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com>
> >> wrote:
> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> reference to signify that the page is alive and use the final put of
> >> >> the
> >> >> page to drop that reference.
> >> >>
> >> >> This does require some care to make sure that any waits for the
> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> >> since it now maintains its own elevated reference.
> >> >
> >> > This is NAK from HMM point of view as i need those call. So if you
> >> > remove
> >> > them now i will need to add them back as part of HMM.
> >>
> >> I thought you only need them at page free time? You can still hook
> >> __put_page().
> >
> > No, i need a hook when page refcount reach 1, not 0. That being said
> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> > patch is fine from HMM point of view but i definitly need to hook my-
> > self in the general put_page() function.
> >
> > So i will have to undo part of this patch for HMM (put_page() will
> > need to handle ZONE_DEVICE page differently).
> 
> Ok, I'd rather this go in now since it fixes the existing use case,
> and unblocks the get_user_pages_fast() conversion to generic code.
> That also gives Kirill and -mm folks a chance to review what HMM wants
> to do on top of the page_ref infrastructure.  The
> {get,put}_zone_device_page interface went in in 4.5 right before
> page_ref went in during 4.6, so it was just an oversight that
> {get,put}_zone_device_page were not removed earlier.
> 

I don't mind this going in, i am hopping people won't ignore HMM patchset
once i repost after 4.12 merge window. Note that there is absolutely no way
around me hooking up inside put_page(). The only other way to do it would
be to modify virtualy all places that call that function to handle ZONE_DEVICE
case.

Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 19:16                   ` Jerome Glisse
@ 2017-04-28 19:22                     ` Dan Williams
  2017-04-28 19:33                       ` Jerome Glisse
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-04-28 19:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com>
>> >> wrote:
>> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
>> >> >> reference to signify that the page is alive and use the final put of
>> >> >> the
>> >> >> page to drop that reference.
>> >> >>
>> >> >> This does require some care to make sure that any waits for the
>> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> >> >> since it now maintains its own elevated reference.
>> >> >
>> >> > This is NAK from HMM point of view as i need those call. So if you
>> >> > remove
>> >> > them now i will need to add them back as part of HMM.
>> >>
>> >> I thought you only need them at page free time? You can still hook
>> >> __put_page().
>> >
>> > No, i need a hook when page refcount reach 1, not 0. That being said
>> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
>> > patch is fine from HMM point of view but i definitly need to hook my-
>> > self in the general put_page() function.
>> >
>> > So i will have to undo part of this patch for HMM (put_page() will
>> > need to handle ZONE_DEVICE page differently).
>>
>> Ok, I'd rather this go in now since it fixes the existing use case,
>> and unblocks the get_user_pages_fast() conversion to generic code.
>> That also gives Kirill and -mm folks a chance to review what HMM wants
>> to do on top of the page_ref infrastructure.  The
>> {get,put}_zone_device_page interface went in in 4.5 right before
>> page_ref went in during 4.6, so it was just an oversight that
>> {get,put}_zone_device_page were not removed earlier.
>>
>
> I don't mind this going in, i am hopping people won't ignore HMM patchset
> once i repost after 4.12 merge window. Note that there is absolutely no way
> around me hooking up inside put_page(). The only other way to do it would
> be to modify virtualy all places that call that function to handle ZONE_DEVICE
> case.

Are you sure about needing to hook the 2 -> 1 transition? Could we
change ZONE_DEVICE pages to not have an elevated reference count when
they are created so you can keep the HMM references out of the mm hot
path?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 19:22                     ` Dan Williams
@ 2017-04-28 19:33                       ` Jerome Glisse
  2017-04-29 10:17                         ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Jerome Glisse @ 2017-04-28 19:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton,
	Logan Gunthorpe, Kirill Shutemov

On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com>
> >> >> wrote:
> >> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> >> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> >> >> reference to signify that the page is alive and use the final put of
> >> >> >> the
> >> >> >> page to drop that reference.
> >> >> >>
> >> >> >> This does require some care to make sure that any waits for the
> >> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> >> >> since it now maintains its own elevated reference.
> >> >> >
> >> >> > This is NAK from HMM point of view as i need those call. So if you
> >> >> > remove
> >> >> > them now i will need to add them back as part of HMM.
> >> >>
> >> >> I thought you only need them at page free time? You can still hook
> >> >> __put_page().
> >> >
> >> > No, i need a hook when page refcount reach 1, not 0. That being said
> >> > i don't care about put_dev_pagemap(page->pgmap); so that part of the
> >> > patch is fine from HMM point of view but i definitly need to hook my-
> >> > self in the general put_page() function.
> >> >
> >> > So i will have to undo part of this patch for HMM (put_page() will
> >> > need to handle ZONE_DEVICE page differently).
> >>
> >> Ok, I'd rather this go in now since it fixes the existing use case,
> >> and unblocks the get_user_pages_fast() conversion to generic code.
> >> That also gives Kirill and -mm folks a chance to review what HMM wants
> >> to do on top of the page_ref infrastructure.  The
> >> {get,put}_zone_device_page interface went in in 4.5 right before
> >> page_ref went in during 4.6, so it was just an oversight that
> >> {get,put}_zone_device_page were not removed earlier.
> >>
> >
> > I don't mind this going in, i am hopping people won't ignore HMM patchset
> > once i repost after 4.12 merge window. Note that there is absolutely no way
> > around me hooking up inside put_page(). The only other way to do it would
> > be to modify virtualy all places that call that function to handle ZONE_DEVICE
> > case.
> 
> Are you sure about needing to hook the 2 -> 1 transition? Could we
> change ZONE_DEVICE pages to not have an elevated reference count when
> they are created so you can keep the HMM references out of the mm hot
> path?

100% sure on that :) I need to callback into driver for 2->1 transition
no way around that. If we change ZONE_DEVICE to not have an elevated
reference count that you need to make a lot more change to mm so that
ZONE_DEVICE is never use as fallback for memory allocation. Also need
to make change to be sure that ZONE_DEVICE page never endup in one of
the path that try to put them back on lru. There is a lot of place that
would need to be updated and it would be highly intrusive and add a
lot of special cases to other hot code path.

Maybe i over estimate the amount of work but from top of my head it is
far from being trivial.

Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 19:33                       ` Jerome Glisse
@ 2017-04-29 10:17                         ` Kirill A. Shutemov
  2017-04-30 23:14                           ` Jerome Glisse
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-04-29 10:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > change ZONE_DEVICE pages to not have an elevated reference count when
> > they are created so you can keep the HMM references out of the mm hot
> > path?
> 
> 100% sure on that :) I need to callback into driver for 2->1 transition
> no way around that. If we change ZONE_DEVICE to not have an elevated
> reference count that you need to make a lot more change to mm so that
> ZONE_DEVICE is never use as fallback for memory allocation. Also need
> to make change to be sure that ZONE_DEVICE page never endup in one of
> the path that try to put them back on lru. There is a lot of place that
> would need to be updated and it would be highly intrusive and add a
> lot of special cases to other hot code path.

Could you explain more on where the requirement comes from or point me to
where I can read about this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
  2017-04-28 17:34           ` Jerome Glisse
@ 2017-04-29 14:18           ` Ingo Molnar
  2017-05-01  2:45             ` Dan Williams
  2017-05-01  8:28           ` [tip:x86/mm] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash tip-bot for Dan Williams
  2 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2017-04-29 14:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, linux-mm, Jérôme Glisse, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov


* Dan Williams <dan.j.williams@intel.com> wrote:

> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> removed from the mm fast path if we take a single get_dev_pagemap()
> reference to signify that the page is alive and use the final put of the
> page to drop that reference.
> 
> This does require some care to make sure that any waits for the
> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> since it now maintains its own elevated reference.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

This changelog is lacking an explanation about how this solves the crashes you 
were seeing.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-29 10:17                         ` Kirill A. Shutemov
@ 2017-04-30 23:14                           ` Jerome Glisse
  2017-05-01  1:42                             ` Dan Williams
                                               ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Jerome Glisse @ 2017-04-30 23:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > they are created so you can keep the HMM references out of the mm hot
> > > path?
> > 
> > 100% sure on that :) I need to callback into driver for 2->1 transition
> > no way around that. If we change ZONE_DEVICE to not have an elevated
> > reference count that you need to make a lot more change to mm so that
> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > to make change to be sure that ZONE_DEVICE page never endup in one of
> > the path that try to put them back on lru. There is a lot of place that
> > would need to be updated and it would be highly intrusive and add a
> > lot of special cases to other hot code path.
> 
> Could you explain more on where the requirement comes from or point me to
> where I can read about this.
> 

HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
in _any_ vma. So i need to know when a page is freed ie either as result of
unmap, exit or migration or anything that would free the memory. For zone
device a page is free once its refcount reach 1 so i need to catch refcount
transition from 2->1

This is the only way i can inform the device that the page is now free. See

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d

There is _no_ way around that.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-30 23:14                           ` Jerome Glisse
@ 2017-05-01  1:42                             ` Dan Williams
  2017-05-01  1:54                               ` Jerome Glisse
  2017-05-01  3:48                             ` Logan Gunthorpe
  2017-05-01 10:23                             ` Kirill A. Shutemov
  2 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-05-01  1:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel, Linux MM,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > they are created so you can keep the HMM references out of the mm hot
>> > > path?
>> >
>> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > reference count that you need to make a lot more change to mm so that
>> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > the path that try to put them back on lru. There is a lot of place that
>> > would need to be updated and it would be highly intrusive and add a
>> > lot of special cases to other hot code path.
>>
>> Could you explain more on where the requirement comes from or point me to
>> where I can read about this.
>>
>
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
>
> This is the only way i can inform the device that the page is now free. See
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
>
> There is _no_ way around that.

Ok, but I need to point out that this not a ZONE_DEVICE requirement.
This is an HMM-specific need. So, this extra reference counting should
be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

Can we hide the extra reference counting behind a static branch so
that the common case fast path doesn't get slower until a HMM device
shows up?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01  1:42                             ` Dan Williams
@ 2017-05-01  1:54                               ` Jerome Glisse
  2017-05-01  2:40                                 ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Jerome Glisse @ 2017-05-01  1:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel, Linux MM,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
> >> > > they are created so you can keep the HMM references out of the mm hot
> >> > > path?
> >> >
> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > reference count that you need to make a lot more change to mm so that
> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > the path that try to put them back on lru. There is a lot of place that
> >> > would need to be updated and it would be highly intrusive and add a
> >> > lot of special cases to other hot code path.
> >>
> >> Could you explain more on where the requirement comes from or point me to
> >> where I can read about this.
> >>
> >
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> >
> > This is the only way i can inform the device that the page is now free. See
> >
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> >
> > There is _no_ way around that.
> 
> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
> This is an HMM-specific need. So, this extra reference counting should
> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.

And it already is delimited, i think you even gave your review by on
the patch.


> Can we hide the extra reference counting behind a static branch so
> that the common case fast path doesn't get slower until a HMM device
> shows up?

Like i already did With likely()/unlikely() ? Or something else ?

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=e84778e9db0672e371eb6599dfcb812512118842

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01  1:54                               ` Jerome Glisse
@ 2017-05-01  2:40                                 ` Dan Williams
  0 siblings, 0 replies; 48+ messages in thread
From: Dan Williams @ 2017-05-01  2:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel, Linux MM,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sun, Apr 30, 2017 at 6:54 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote:
>> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> >> > > change ZONE_DEVICE pages to not have an elevated reference count when
>> >> > > they are created so you can keep the HMM references out of the mm hot
>> >> > > path?
>> >> >
>> >> > 100% sure on that :) I need to callback into driver for 2->1 transition
>> >> > no way around that. If we change ZONE_DEVICE to not have an elevated
>> >> > reference count that you need to make a lot more change to mm so that
>> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> >> > to make change to be sure that ZONE_DEVICE page never endup in one of
>> >> > the path that try to put them back on lru. There is a lot of place that
>> >> > would need to be updated and it would be highly intrusive and add a
>> >> > lot of special cases to other hot code path.
>> >>
>> >> Could you explain more on where the requirement comes from or point me to
>> >> where I can read about this.
>> >>
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>> >
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> Ok, but I need to point out that this not a ZONE_DEVICE requirement.
>> This is an HMM-specific need. So, this extra reference counting should
>> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case.
>
> And it already is delimited, i think you even gave your review by on
> the patch.
>

I gave my reviewed-by to the patch that leveraged
{get,put}_zone_device_page(), but now those are gone and HMM needs a
new patch. I'm just saying these reference counts should not come back
as {get,put}_zone_device_page() because now they're HMM specific.

>> Can we hide the extra reference counting behind a static branch so
>> that the common case fast path doesn't get slower until a HMM device
>> shows up?
>
> Like i already did With likely()/unlikely() ? Or something else ?
>
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=e84778e9db0672e371eb6599dfcb812512118842

Something different:
http://lxr.free-electrons.com/source/Documentation/static-keys.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-29 14:18           ` Ingo Molnar
@ 2017-05-01  2:45             ` Dan Williams
  2017-05-01  7:12               ` Ingo Molnar
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-05-01  2:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linux MM, Jérôme Glisse, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
>> Kirill points out that the calls to {get,put}_dev_pagemap() can be
>> removed from the mm fast path if we take a single get_dev_pagemap()
>> reference to signify that the page is alive and use the final put of the
>> page to drop that reference.
>>
>> This does require some care to make sure that any waits for the
>> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
>> since it now maintains its own elevated reference.
>>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
>> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> This changelog is lacking an explanation about how this solves the crashes you
> were seeing.
>

Kirill? It wasn't clear to me why the conversion to generic
get_user_pages_fast() caused the reference counts to be off.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-30 23:14                           ` Jerome Glisse
  2017-05-01  1:42                             ` Dan Williams
@ 2017-05-01  3:48                             ` Logan Gunthorpe
  2017-05-01 10:23                             ` Kirill A. Shutemov
  2 siblings, 0 replies; 48+ messages in thread
From: Logan Gunthorpe @ 2017-05-01  3:48 UTC (permalink / raw)
  To: Jerome Glisse, Kirill A. Shutemov
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Kirill Shutemov



On 30/04/17 05:14 PM, Jerome Glisse wrote:
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1
> 
> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I had a similar issue in a piece of my p2pmem RFC [1]. I hacked around
it by tracking the pages separately and freeing them when the vma is
closed. This is by no means a great solution, it certainly has it's own
warts. However, maybe it will spark some ideas for some alternate
choices which avoid the hot path.

Another thing I briefly looked at was hooking the vma close process
earlier so that it would callback in time that you can loop through the
pages and do your free process. Of course this all depends on the vma
not getting closed while the pages have other references. So it may not
work at all. Again, just ideas.

Logan


[1]
https://github.com/sbates130272/linux-p2pmem/commit/77c631d92cb5c451c9824b3a4cf9b6cddfde6bb7

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01  2:45             ` Dan Williams
@ 2017-05-01  7:12               ` Ingo Molnar
  2017-05-01  9:33                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2017-05-01  7:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Linux MM, Jérôme Glisse, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be
> >> removed from the mm fast path if we take a single get_dev_pagemap()
> >> reference to signify that the page is alive and use the final put of the
> >> page to drop that reference.
> >>
> >> This does require some care to make sure that any waits for the
> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
> >> since it now maintains its own elevated reference.
> >>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Jérôme Glisse <jglisse@redhat.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> >> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> >> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >
> > This changelog is lacking an explanation about how this solves the crashes you
> > were seeing.
> 
> Kirill? It wasn't clear to me why the conversion to generic 
> get_user_pages_fast() caused the reference counts to be off.

Ok, the merge window is open and we really need this fix for x86/mm, so this is 
what I've decoded:

 The x86 conversion to the generic GUP code included a small change which causes
 crashes and data corruption in the pmem code - not good.

 The root cause is that the /dev/pmem driver code implicitly relies on the x86
 get_user_pages() implementation doing a get_page() on the page refcount, because
 get_page() does a get_zone_device_page() which properly refcounts pmem's separate
 page struct arrays that are not present in the regular page struct structures.
 (The pmem driver does this because it can cover huge memory areas.)

 But the x86 conversion to the generic GUP code changed the get_page() to
 page_cache_get_speculative() which is faster but doesn't do the
 get_zone_device_page() call the pmem code relies on.

 One way to solve the regression would be to change the generic GUP code to use 
 get_page(), but that would slow things down a bit and punish other generic-GUP 
 using architectures for an x86-ism they did not care about. (Arguably the pmem 
 driver was probably not working reliably for them: but nvdimm is an Intel
 feature, so non-x86 exposure is probably still limited.)

 So restructure the pmem code's interface with the MM instead: get rid of the 
 get/put_zone_device_page() distinction, integrate put_zone_device_page() into 
 __put_page() and and restructure the pmem completion-wait and teardown machinery.

 This speeds up things while also making the pmem refcounting more robust going 
 forward.

... is this extension to the changelog correct?

I'll apply this for the time being - but can still amend the text before sending 
it to Linus later today.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:x86/mm] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
  2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
  2017-04-28 17:34           ` Jerome Glisse
  2017-04-29 14:18           ` Ingo Molnar
@ 2017-05-01  8:28           ` tip-bot for Dan Williams
  2 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Dan Williams @ 2017-05-01  8:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kirill.shutemov, jpoimboe, hpa, peterz, tglx, torvalds, logang,
	bp, dvlasenk, linux-kernel, akpm, luto, jglisse, brgerst, mingo,
	dan.j.williams

Commit-ID:  71389703839ebe9cb426c72d5f0bd549592e583c
Gitweb:     http://git.kernel.org/tip/71389703839ebe9cb426c72d5f0bd549592e583c
Author:     Dan Williams <dan.j.williams@intel.com>
AuthorDate: Fri, 28 Apr 2017 10:23:37 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 1 May 2017 09:15:53 +0200

mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash

The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.

Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/dax/pmem.c    |  2 +-
 drivers/nvdimm/pmem.c | 13 +++++++++++--
 include/linux/mm.h    | 14 --------------
 kernel/memremap.c     | 22 +++++++++-------------
 mm/swap.c             | 10 ++++++++++
 5 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 033f49b3..cb0d742 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data)
 	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
+	wait_for_completion(&dax_pmem->cmp);
 	percpu_ref_exit(ref);
 }
 
@@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data)
 
 	dev_dbg(dax_pmem->dev, "%s\n", __func__);
 	percpu_ref_kill(ref);
-	wait_for_completion(&dax_pmem->cmp);
 }
 
 static int dax_pmem_probe(struct device *dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be..fb7bbc7 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -25,6 +25,7 @@
 #include <linux/badblocks.h>
 #include <linux/memremap.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
 #include <linux/pfn_t.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
@@ -231,6 +232,11 @@ static void pmem_release_queue(void *q)
 	blk_cleanup_queue(q);
 }
 
+static void pmem_freeze_queue(void *q)
+{
+	blk_mq_freeze_queue_start(q);
+}
+
 static void pmem_release_disk(void *disk)
 {
 	del_gendisk(disk);
@@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev,
 	if (!q)
 		return -ENOMEM;
 
+	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+		return -ENOMEM;
+
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
@@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 
 	/*
-	 * At release time the queue must be dead before
+	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
 	 */
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
+	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
 		return -ENOMEM;
 
 	if (IS_ERR(addr))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a835edd..695da2a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void get_zone_device_page(struct page *page);
-void put_zone_device_page(struct page *page);
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
 #else
-static inline void get_zone_device_page(struct page *page)
-{
-}
-static inline void put_zone_device_page(struct page *page)
-{
-}
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
@@ -790,9 +782,6 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		get_zone_device_page(page);
 }
 
 static inline void put_page(struct page *page)
@@ -801,9 +790,6 @@ static inline void put_page(struct page *page)
 
 	if (put_page_testzero(page))
 		__put_page(page);
-
-	if (unlikely(is_zone_device_page(page)))
-		put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07e85e5..23a6483 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -182,18 +182,6 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
-void get_zone_device_page(struct page *page)
-{
-	percpu_ref_get(page->pgmap->ref);
-}
-EXPORT_SYMBOL(get_zone_device_page);
-
-void put_zone_device_page(struct page *page)
-{
-	put_dev_pagemap(page->pgmap);
-}
-EXPORT_SYMBOL(put_zone_device_page);
-
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	struct resource *res = &page_map->res;
 	resource_size_t align_start, align_size;
 	struct dev_pagemap *pgmap = &page_map->pgmap;
+	unsigned long pfn;
+
+	for_each_device_pfn(pfn, page_map)
+		put_page(pfn_to_page(pfn));
 
 	if (percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
- *    (or devm release event).
+ *    (or devm release event). The expected order of events is that @ref has
+ *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
+ *    wait for the completion of all references being dropped and
+ *    percpu_ref_exit() must occur after devm_memremap_pages_release().
  *
  * 2/ @res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		 */
 		list_del(&page->lru);
 		page->pgmap = pgmap;
+		percpu_ref_get(ref);
 	}
 	devres_add(dev, page_map);
 	return __va(res->start);
diff --git a/mm/swap.c b/mm/swap.c
index c4910f1..a4e6113 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page)
 
 void __put_page(struct page *page)
 {
+	if (is_zone_device_page(page)) {
+		put_dev_pagemap(page->pgmap);
+
+		/*
+		 * The page belongs to the device that created pgmap. Do
+		 * not return it to page allocator.
+		 */
+		return;
+	}
+
 	if (unlikely(PageCompound(page)))
 		__put_compound_page(page);
 	else

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01  7:12               ` Ingo Molnar
@ 2017-05-01  9:33                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-05-01  9:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Williams, linux-kernel, Linux MM, Jérôme Glisse,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Mon, May 01, 2017 at 09:12:59AM +0200, Ingo Molnar wrote:
> ... is this extension to the changelog correct?

Looks good to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-04-30 23:14                           ` Jerome Glisse
  2017-05-01  1:42                             ` Dan Williams
  2017-05-01  3:48                             ` Logan Gunthorpe
@ 2017-05-01 10:23                             ` Kirill A. Shutemov
  2017-05-01 13:55                               ` Jerome Glisse
  2 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-05-01 10:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > they are created so you can keep the HMM references out of the mm hot
> > > > path?
> > > 
> > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > reference count that you need to make a lot more change to mm so that
> > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > the path that try to put them back on lru. There is a lot of place that
> > > would need to be updated and it would be highly intrusive and add a
> > > lot of special cases to other hot code path.
> > 
> > Could you explain more on where the requirement comes from or point me to
> > where I can read about this.
> > 
> 
> HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> in _any_ vma. So i need to know when a page is freed ie either as result of
> unmap, exit or migration or anything that would free the memory. For zone
> device a page is free once its refcount reach 1 so i need to catch refcount
> transition from 2->1

What if we would rework zone device to have pages with refcount 0 at
start?

> This is the only way i can inform the device that the page is now free. See
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> 
> There is _no_ way around that.

I'm still not convinced that it's impossible.

Could you describe lifecycle for pages in case of HMM?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01 10:23                             ` Kirill A. Shutemov
@ 2017-05-01 13:55                               ` Jerome Glisse
  2017-05-01 20:19                                 ` Dan Williams
  2017-05-02 11:37                                 ` Kirill A. Shutemov
  0 siblings, 2 replies; 48+ messages in thread
From: Jerome Glisse @ 2017-05-01 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > > they are created so you can keep the HMM references out of the mm hot
> > > > > path?
> > > > 
> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > reference count that you need to make a lot more change to mm so that
> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > the path that try to put them back on lru. There is a lot of place that
> > > > would need to be updated and it would be highly intrusive and add a
> > > > lot of special cases to other hot code path.
> > > 
> > > Could you explain more on where the requirement comes from or point me to
> > > where I can read about this.
> > > 
> > 
> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > in _any_ vma. So i need to know when a page is freed ie either as result of
> > unmap, exit or migration or anything that would free the memory. For zone
> > device a page is free once its refcount reach 1 so i need to catch refcount
> > transition from 2->1
> 
> What if we would rework zone device to have pages with refcount 0 at
> start?

That is a _lot_ of work from top of my head because it would need changes
to a lot of places and likely more hot code path that simply adding some-
thing to put_page() note that i only need something in put_page() i do not
need anything in the get page path. Is adding a conditional branch for
HMM pages in put_page() that much of a problem ?


> > This is the only way i can inform the device that the page is now free. See
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> > 
> > There is _no_ way around that.
> 
> I'm still not convinced that it's impossible.
> 
> Could you describe lifecycle for pages in case of HMM?

Process malloc something, end it over to some function in the program
that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
trigger a migration to device memory.

So in the kernel you get a migration like any existing migration,
original page is unmap, if refcount is all ok (no pin) then a device
page is allocated and thing are migrated to device memory.

What happen after is unknown. Either userspace/kernel driver decide
to migrate back to system memory, either there is an munmap, either
there is a CPU page fault, ... So from that point on the device page
as the exact same life as a regular page.

Above i describe the migrate case, but you can also have new memory
allocation that directly allocate device memory. For instance if the
GPU do a page fault on an address that isn't back by anything then
we can directly allocate a device page. No migration involve in that
case.

HMM pages are like any other pages in most respect. Exception are:
  - no GUP
  - no KSM
  - no lru reclaim
  - no NUMA balancing
  - no regular migration (existing migrate_page)

The fact that minimum refcount for ZONE_DEVICE is 1 already gives
us for free most of the above exception. To convert the refcount to
be like other pages would mean that all of the above would need to
be audited and probably modify to ignore ZONE_DEVICE pages (i am
pretty sure Dan do not want any of the above either).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01 13:55                               ` Jerome Glisse
@ 2017-05-01 20:19                                 ` Dan Williams
  2017-05-01 20:32                                   ` Jerome Glisse
  2017-05-02 11:37                                 ` Kirill A. Shutemov
  1 sibling, 1 reply; 48+ messages in thread
From: Dan Williams @ 2017-05-01 20:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel, Linux MM,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
>> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
>> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
>> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
>> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
>> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
>> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
>> > > > > they are created so you can keep the HMM references out of the mm hot
>> > > > > path?
>> > > >
>> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
>> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
>> > > > reference count that you need to make a lot more change to mm so that
>> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
>> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
>> > > > the path that try to put them back on lru. There is a lot of place that
>> > > > would need to be updated and it would be highly intrusive and add a
>> > > > lot of special cases to other hot code path.
>> > >
>> > > Could you explain more on where the requirement comes from or point me to
>> > > where I can read about this.
>> > >
>> >
>> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
>> > in _any_ vma. So i need to know when a page is freed ie either as result of
>> > unmap, exit or migration or anything that would free the memory. For zone
>> > device a page is free once its refcount reach 1 so i need to catch refcount
>> > transition from 2->1
>>
>> What if we would rework zone device to have pages with refcount 0 at
>> start?
>
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?
>
>
>> > This is the only way i can inform the device that the page is now free. See
>> >
>> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
>> >
>> > There is _no_ way around that.
>>
>> I'm still not convinced that it's impossible.
>>
>> Could you describe lifecycle for pages in case of HMM?
>
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
>
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
>
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
>
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
>
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP
>   - no KSM
>   - no lru reclaim
>   - no NUMA balancing
>   - no regular migration (existing migrate_page)
>
> The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> us for free most of the above exception. To convert the refcount to
> be like other pages would mean that all of the above would need to
> be audited and probably modify to ignore ZONE_DEVICE pages (i am
> pretty sure Dan do not want any of the above either).

Right, adding HMM references to get_page() and put_page() seems less
intrusive. Given how uncommon HMM hardware is (insert grumble about no
visible upstream user of this functionality) I think the 'static
branch' approach helps mitigate the impact for everything else.
Looking back, I should have used that mechanism for the pmem use case,
but it's moot now.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01 20:19                                 ` Dan Williams
@ 2017-05-01 20:32                                   ` Jerome Glisse
  0 siblings, 0 replies; 48+ messages in thread
From: Jerome Glisse @ 2017-05-01 20:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel, Linux MM,
	Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Mon, May 01, 2017 at 01:19:24PM -0700, Dan Williams wrote:
> On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> >> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> >> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> >> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> >> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> >> > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> >> > > > > they are created so you can keep the HMM references out of the mm hot
> >> > > > > path?
> >> > > >
> >> > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> >> > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> >> > > > reference count that you need to make a lot more change to mm so that
> >> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> >> > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> >> > > > the path that try to put them back on lru. There is a lot of place that
> >> > > > would need to be updated and it would be highly intrusive and add a
> >> > > > lot of special cases to other hot code path.
> >> > >
> >> > > Could you explain more on where the requirement comes from or point me to
> >> > > where I can read about this.
> >> > >
> >> >
> >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> >> > in _any_ vma. So i need to know when a page is freed ie either as result of
> >> > unmap, exit or migration or anything that would free the memory. For zone
> >> > device a page is free once its refcount reach 1 so i need to catch refcount
> >> > transition from 2->1
> >>
> >> What if we would rework zone device to have pages with refcount 0 at
> >> start?
> >
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> >
> >
> >> > This is the only way i can inform the device that the page is now free. See
> >> >
> >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> >> >
> >> > There is _no_ way around that.
> >>
> >> I'm still not convinced that it's impossible.
> >>
> >> Could you describe lifecycle for pages in case of HMM?
> >
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> >
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> >
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> >
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> >
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> >   - no KSM
> >   - no lru reclaim
> >   - no NUMA balancing
> >   - no regular migration (existing migrate_page)
> >
> > The fact that minimum refcount for ZONE_DEVICE is 1 already gives
> > us for free most of the above exception. To convert the refcount to
> > be like other pages would mean that all of the above would need to
> > be audited and probably modify to ignore ZONE_DEVICE pages (i am
> > pretty sure Dan do not want any of the above either).
> 
> Right, adding HMM references to get_page() and put_page() seems less
> intrusive. Given how uncommon HMM hardware is (insert grumble about no
> visible upstream user of this functionality) I think the 'static
> branch' approach helps mitigate the impact for everything else.
> Looking back, I should have used that mechanism for the pmem use case,
> but it's moot now.

I do not need anything in get_page() all i need is something in put_page()
to catch the 2 -> 1 refcount transition to know when a page is freed.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-01 13:55                               ` Jerome Glisse
  2017-05-01 20:19                                 ` Dan Williams
@ 2017-05-02 11:37                                 ` Kirill A. Shutemov
  2017-05-02 13:22                                   ` Jerome Glisse
  1 sibling, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2017-05-02 11:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > > > they are created so you can keep the HMM references out of the mm hot
> > > > > > path?
> > > > > 
> > > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > reference count that you need to make a lot more change to mm so that
> > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > > the path that try to put them back on lru. There is a lot of place that
> > > > > would need to be updated and it would be highly intrusive and add a
> > > > > lot of special cases to other hot code path.
> > > > 
> > > > Could you explain more on where the requirement comes from or point me to
> > > > where I can read about this.
> > > > 
> > > 
> > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > > in _any_ vma. So i need to know when a page is freed ie either as result of
> > > unmap, exit or migration or anything that would free the memory. For zone
> > > device a page is free once its refcount reach 1 so i need to catch refcount
> > > transition from 2->1
> > 
> > What if we would rework zone device to have pages with refcount 0 at
> > start?
> 
> That is a _lot_ of work from top of my head because it would need changes
> to a lot of places and likely more hot code path that simply adding some-
> thing to put_page() note that i only need something in put_page() i do not
> need anything in the get page path. Is adding a conditional branch for
> HMM pages in put_page() that much of a problem ?

Well, it gets inlined everywhere. Removing zone_device code from
get_page() and put_page() saved non-trivial ~140k in vmlinux for
allyesconfig.

Re-introducing part this bloat would be unfortunate.

> > > This is the only way i can inform the device that the page is now free. See
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > 
> > > There is _no_ way around that.
> > 
> > I'm still not convinced that it's impossible.
> > 
> > Could you describe lifecycle for pages in case of HMM?
> 
> Process malloc something, end it over to some function in the program
> that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> trigger a migration to device memory.
> 
> So in the kernel you get a migration like any existing migration,
> original page is unmap, if refcount is all ok (no pin) then a device
> page is allocated and thing are migrated to device memory.
> 
> What happen after is unknown. Either userspace/kernel driver decide
> to migrate back to system memory, either there is an munmap, either
> there is a CPU page fault, ... So from that point on the device page
> as the exact same life as a regular page.
> 
> Above i describe the migrate case, but you can also have new memory
> allocation that directly allocate device memory. For instance if the
> GPU do a page fault on an address that isn't back by anything then
> we can directly allocate a device page. No migration involve in that
> case.
> 
> HMM pages are like any other pages in most respect. Exception are:
>   - no GUP

Hm. How do you exclude GUP? And why is it required?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference
  2017-05-02 11:37                                 ` Kirill A. Shutemov
@ 2017-05-02 13:22                                   ` Jerome Glisse
  0 siblings, 0 replies; 48+ messages in thread
From: Jerome Glisse @ 2017-05-02 13:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dan Williams, Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar,
	Andrew Morton, Logan Gunthorpe, Kirill Shutemov

On Tue, May 02, 2017 at 02:37:46PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote:
> > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote:
> > > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote:
> > > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we
> > > > > > > change ZONE_DEVICE pages to not have an elevated reference count when
> > > > > > > they are created so you can keep the HMM references out of the mm hot
> > > > > > > path?
> > > > > > 
> > > > > > 100% sure on that :) I need to callback into driver for 2->1 transition
> > > > > > no way around that. If we change ZONE_DEVICE to not have an elevated
> > > > > > reference count that you need to make a lot more change to mm so that
> > > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need
> > > > > > to make change to be sure that ZONE_DEVICE page never endup in one of
> > > > > > the path that try to put them back on lru. There is a lot of place that
> > > > > > would need to be updated and it would be highly intrusive and add a
> > > > > > lot of special cases to other hot code path.
> > > > > 
> > > > > Could you explain more on where the requirement comes from or point me to
> > > > > where I can read about this.
> > > > > 
> > > > 
> > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page)
> > > > in _any_ vma. So i need to know when a page is freed ie either as result of
> > > > unmap, exit or migration or anything that would free the memory. For zone
> > > > device a page is free once its refcount reach 1 so i need to catch refcount
> > > > transition from 2->1
> > > 
> > > What if we would rework zone device to have pages with refcount 0 at
> > > start?
> > 
> > That is a _lot_ of work from top of my head because it would need changes
> > to a lot of places and likely more hot code path that simply adding some-
> > thing to put_page() note that i only need something in put_page() i do not
> > need anything in the get page path. Is adding a conditional branch for
> > HMM pages in put_page() that much of a problem ?
> 
> Well, it gets inlined everywhere. Removing zone_device code from
> get_page() and put_page() saved non-trivial ~140k in vmlinux for
> allyesconfig.
> 
> Re-introducing part this bloat would be unfortunate.
> 
> > > > This is the only way i can inform the device that the page is now free. See
> > > > 
> > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d
> > > > 
> > > > There is _no_ way around that.
> > > 
> > > I'm still not convinced that it's impossible.
> > > 
> > > Could you describe lifecycle for pages in case of HMM?
> > 
> > Process malloc something, end it over to some function in the program
> > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that
> > trigger a migration to device memory.
> > 
> > So in the kernel you get a migration like any existing migration,
> > original page is unmap, if refcount is all ok (no pin) then a device
> > page is allocated and thing are migrated to device memory.
> > 
> > What happen after is unknown. Either userspace/kernel driver decide
> > to migrate back to system memory, either there is an munmap, either
> > there is a CPU page fault, ... So from that point on the device page
> > as the exact same life as a regular page.
> > 
> > Above i describe the migrate case, but you can also have new memory
> > allocation that directly allocate device memory. For instance if the
> > GPU do a page fault on an address that isn't back by anything then
> > we can directly allocate a device page. No migration involve in that
> > case.
> > 
> > HMM pages are like any other pages in most respect. Exception are:
> >   - no GUP
> 
> Hm. How do you exclude GUP? And why is it required?

Well it is not forbiden it just can not happen simply because as device
memory is not accessible by CPU then the corresponding CPU page table
entry is a special entry and thus GUP trigger a page fault that migrate
thing back to regular memory.

The why is simply because we need to always be able to migrate back to
regular memory as device memory is not accessible by the CPU. So we can
not allow anyone to pin it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-05-02 13:23 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-20 21:46 [tip:x86/mm] x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation Dan Williams
2017-04-21 14:16 ` Kirill A. Shutemov
2017-04-21 19:30   ` Dan Williams
2017-04-23  9:52     ` [PATCH] Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" Ingo Molnar
2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
2017-04-24 17:23   ` Dan Williams
2017-04-24 17:30     ` Kirill A. Shutemov
2017-04-24 17:47       ` Dan Williams
2017-04-24 18:01         ` Kirill A. Shutemov
2017-04-24 18:25           ` Kirill A. Shutemov
2017-04-24 18:41             ` Dan Williams
2017-04-25 13:19               ` Kirill A. Shutemov
2017-04-25 16:44                 ` Dan Williams
2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
2017-04-27  8:33     ` Kirill A. Shutemov
2017-04-28  6:39       ` Ingo Molnar
2017-04-28  8:14         ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov
2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
2017-04-28 17:34           ` Jerome Glisse
2017-04-28 17:41             ` Dan Williams
2017-04-28 18:00               ` Jerome Glisse
2017-04-28 19:02                 ` Dan Williams
2017-04-28 19:16                   ` Jerome Glisse
2017-04-28 19:22                     ` Dan Williams
2017-04-28 19:33                       ` Jerome Glisse
2017-04-29 10:17                         ` Kirill A. Shutemov
2017-04-30 23:14                           ` Jerome Glisse
2017-05-01  1:42                             ` Dan Williams
2017-05-01  1:54                               ` Jerome Glisse
2017-05-01  2:40                                 ` Dan Williams
2017-05-01  3:48                             ` Logan Gunthorpe
2017-05-01 10:23                             ` Kirill A. Shutemov
2017-05-01 13:55                               ` Jerome Glisse
2017-05-01 20:19                                 ` Dan Williams
2017-05-01 20:32                                   ` Jerome Glisse
2017-05-02 11:37                                 ` Kirill A. Shutemov
2017-05-02 13:22                                   ` Jerome Glisse
2017-04-29 14:18           ` Ingo Molnar
2017-05-01  2:45             ` Dan Williams
2017-05-01  7:12               ` Ingo Molnar
2017-05-01  9:33                 ` Kirill A. Shutemov
2017-05-01  8:28           ` [tip:x86/mm] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash tip-bot for Dan Williams
2017-04-27 16:11     ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Logan Gunthorpe
2017-04-27 16:14       ` Dan Williams
2017-04-27 16:33         ` Logan Gunthorpe
2017-04-27 16:38           ` Dan Williams
2017-04-27 16:45             ` Logan Gunthorpe
2017-04-27 16:46               ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).