[v10,3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

Message ID 20190930015740.84362-4-justin.he@arm.com
State New
Series
  • fix double page fault on arm64

Commit Message

Jia He Sept. 30, 2019, 1:57 a.m. UTC
When we ran the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic() called from
cow_user_page().

The call trace below was captured from arm64 do_page_fault() for debugging:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003, pmd=000000023d4b3003, pte=360000298607bd3
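
As a quick check of the dump above, the sketch below tests the Access Flag in
the reported pte value. The bit positions follow the arm64 descriptor layout
(PTE_AF is bit 10); the helper pte_is_young() and the constant dumped_pte are
names made up for this illustration and are not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* arm64 stage-1 descriptor bits (same names as the kernel's definitions) */
#define PTE_VALID	(UINT64_C(1) << 0)
#define PTE_USER	(UINT64_C(1) << 6)	/* AP[1] */
#define PTE_RDONLY	(UINT64_C(1) << 7)	/* AP[2] */
#define PTE_AF		(UINT64_C(1) << 10)	/* Access Flag */

/* The pte value taken from the dump in the commit message. */
static const uint64_t dumped_pte = UINT64_C(0x360000298607bd3);

/* Hypothetical helper: true if the entry is "young" (Access Flag set). */
static bool pte_is_young(uint64_t pte)
{
	return (pte & PTE_AF) != 0;
}
```

The low 12 bits of the value are 0xbd3, in which bit 10 is clear, so the entry
is valid, user-accessible and read-only, but old.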

As Catalin explained: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. We
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung(). The parameter list also
changes, because vmf must be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use case (suggested by Kirill).
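
The flow described above can be modelled in user space roughly as follows.
This is only a toy sketch, not the code in mm/memory.c: struct toy_pte,
toy_copy_inatomic() and toy_cow_copy() are hypothetical stand-ins, and the
apply_fix flag exists only so the before/after behaviour can be compared.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy pte: only the "young" (accessed) state matters here. */
struct toy_pte {
	bool young;
};

/*
 * Stand-in for __copy_from_user_inatomic(). If the hardware faults on an
 * old pte, the copy cannot proceed, and the fault cannot be handled in
 * atomic context. Returns the number of bytes NOT copied.
 */
static size_t toy_copy_inatomic(char *dst, const char *src,
				const struct toy_pte *pte, bool faults_on_old)
{
	if (faults_on_old && !pte->young)
		return TOY_PAGE_SIZE;
	memcpy(dst, src, TOY_PAGE_SIZE);
	return 0;
}

/*
 * Sketch of the CoW copy: with the fix, the pte is marked young before
 * the copy. If the copy still fails, the destination is zeroed (the
 * commit message says a WARN_ON_ONCE is added for that case).
 * Returns true if the source data was copied.
 */
static bool toy_cow_copy(char *dst, const char *src, struct toy_pte *pte,
			 bool faults_on_old, bool apply_fix)
{
	if (apply_fix && faults_on_old && !pte->young)
		pte->young = true;	/* the effect of pte_mkyoung() */

	if (toy_copy_inatomic(dst, src, pte, faults_on_old)) {
		memset(dst, 0, TOY_PAGE_SIZE);
		return false;
	}
	return true;
}
```

Without the fix, an old pte on a machine that faults on old ptes always ends
in the zeroed-page path; with it, the copy succeeds on the first attempt.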

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He <justin.he@arm.com>
Reported-by: Yibo Cai <Yibo.Cai@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 99 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 84 insertions(+), 15 deletions(-)

Comments

Will Deacon Oct. 1, 2019, 12:54 p.m. UTC | #1
On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
> will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003, pmd=000000023d4b3003, pte=360000298607bd3
> 
> As told by Catalin: "On arm64 without hardware Access Flag, copying from
> user will fail because the pte is old and cannot be marked young. So we
> always end up with zeroed page after fork() + CoW for pfn mappings. we
> don't always have a hardware-managed access flag on arm64."
> 
> This patch fix it by calling pte_mkyoung. Also, the parameter is
> changed because vmf should be passed to cow_user_page()
> 
> Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
> in case there can be some obscure use-case.(by Kirill)
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Signed-off-by: Jia He <justin.he@arm.com>
> Reported-by: Yibo Cai <Yibo.Cai@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/memory.c | 99 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 84 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b1ca51a079f2..1f56b0118ef5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
>  					2;
>  #endif
>  
> +#ifndef arch_faults_on_old_pte
> +static inline bool arch_faults_on_old_pte(void)
> +{
> +	return false;
> +}
> +#endif

Kirill has acked this, so I'm happy to take the patch as-is, however isn't
it the case that /most/ architectures will want to return true for
arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
that to be the default, and have x86 and arm64 provide an override? For
example, aren't most architectures still going to hit the double fault
scenario even with your patch applied?

Will
Jia He Oct. 8, 2019, 2:19 a.m. UTC | #2
Hi Will

> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: October 1, 2019 20:54
> To: Justin He (Arm Technology China) <Justin.He@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>; Marc
> Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>; Kirill A.
> Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> <Kaly.Xin@arm.com>
> Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
> is cleared
> 
> On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest,
> there
> > will be a double page fault in __copy_from_user_inatomic of
> cow_user_page.
> >
> > Below call trace is from arm64 do_page_fault for debugging purpose
> > [  110.016195] Call trace:
> > [  110.016826]  do_page_fault+0x5a4/0x690
> > [  110.017812]  do_mem_abort+0x50/0xb0
> > [  110.018726]  el1_da+0x20/0xc4
> > [  110.019492]  __arch_copy_from_user+0x180/0x280
> > [  110.020646]  do_wp_page+0xb0/0x860
> > [  110.021517]  __handle_mm_fault+0x994/0x1338
> > [  110.022606]  handle_mm_fault+0xe8/0x180
> > [  110.023584]  do_page_fault+0x240/0x690
> > [  110.024535]  do_mem_abort+0x50/0xb0
> > [  110.025423]  el0_da+0x20/0x24
> >
> > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> > [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
> pmd=000000023d4b3003, pte=360000298607bd3
> >
> > As told by Catalin: "On arm64 without hardware Access Flag, copying
> from
> > user will fail because the pte is old and cannot be marked young. So we
> > always end up with zeroed page after fork() + CoW for pfn mappings. we
> > don't always have a hardware-managed access flag on arm64."
> >
> > This patch fix it by calling pte_mkyoung. Also, the parameter is
> > changed because vmf should be passed to cow_user_page()
> >
> > Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns
> error
> > in case there can be some obscure use-case.(by Kirill)
> >
> > [1]
> https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
> > Reported-by: Yibo Cai <Yibo.Cai@arm.com>
> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/memory.c | 99
> +++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 84 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b1ca51a079f2..1f56b0118ef5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> >  					2;
> >  #endif
> >
> > +#ifndef arch_faults_on_old_pte
> > +static inline bool arch_faults_on_old_pte(void)
> > +{
> > +	return false;
> > +}
> > +#endif
> 
> Kirill has acked this, so I'm happy to take the patch as-is, however isn't
> it the case that /most/ architectures will want to return true for
> arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
> that to be the default, and have x86 and arm64 provide an override? For
> example, aren't most architectures still going to hit the double fault
> scenario even with your patch applied?

No. After applying my patch series, only architectures that lack a
hardware-managed access flag AND do not implement their own
arch_faults_on_old_pte() will hit the double page fault.

A return value of true from arch_faults_on_old_pte() means "this arch has no
hardware mechanism for setting the access flag, so an access through an old
pte might cause a page fault".
I don't want to change other architectures' default behavior here, so by
default arch_faults_on_old_pte() returns false.

Btw, so far I have only observed this double page fault in an arm64 guest (host is ThunderX2).
In an x86 guest (host is Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) there is no such double
page fault, because x86 has a similar hardware mechanism for setting the access flag.


--
Cheers,
Justin (Jia He)
Will Deacon Oct. 8, 2019, 12:39 p.m. UTC | #3
On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology China) wrote:
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: October 1, 2019 20:54
> > To: Justin He (Arm Technology China) <Justin.He@arm.com>
> > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> > <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>; Marc
> > Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>; Kirill A.
> > Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> > Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> > foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> > <Kaly.Xin@arm.com>
> > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
> > is cleared
> > 
> > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index b1ca51a079f2..1f56b0118ef5 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > >  					2;
> > >  #endif
> > >
> > > +#ifndef arch_faults_on_old_pte
> > > +static inline bool arch_faults_on_old_pte(void)
> > > +{
> > > +	return false;
> > > +}
> > > +#endif
> > 
> > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
> > it the case that /most/ architectures will want to return true for
> > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
> > that to be the default, and have x86 and arm64 provide an override? For
> > example, aren't most architectures still going to hit the double fault
> > scenario even with your patch applied?
> 
> No, after applying my patch series, only those architectures which don't provide
> setting access flag by hardware AND don't implement their arch_faults_on_old_pte
> will hit the double page fault.
> 
> The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware
> setting access flag way, it might cause page fault on an old pte"
> I don't want to change other architectures' default behavior here. So by default, 
> arch_faults_on_old_pte() is false.

...and my complaint is that this is the majority of supported architectures,
so you're fixing something for arm64 which also affects arm, powerpc,
alpha, mips, riscv, ...

Chances are, they won't even realise they need to implement
arch_faults_on_old_pte() until somebody runs into the double fault and
wastes lots of time debugging it before they spot your patch.

> Btw, currently I only observed this double pagefault on arm64's guest
> (host is ThunderX2).  On X86 guest (host is Intel(R) Core(TM) i7-4790 CPU
> @ 3.60GHz ), there is no such double pagefault. It has the similar setting
> access flag way by hardware.

Right, and that's why I'm not concerned about x86 for this problem.

Will
Jia He Oct. 8, 2019, 12:58 p.m. UTC | #4
Hi Will

> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: October 8, 2019 20:40
> To: Justin He (Arm Technology China) <Justin.He@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>; Marc
> Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>; Kirill A.
> Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> <Kaly.Xin@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
> is cleared
> 
> On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology
> China) wrote:
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: October 1, 2019 20:54
> > > To: Justin He (Arm Technology China) <Justin.He@arm.com>
> > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> > > <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>;
> Marc
> > > Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>;
> Kirill A.
> > > Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> > > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> > > Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> > > foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> > > <Kaly.Xin@arm.com>
> > > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if
> PTE_AF
> > > is cleared
> > >
> > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index b1ca51a079f2..1f56b0118ef5 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > >  					2;
> > > >  #endif
> > > >
> > > > +#ifndef arch_faults_on_old_pte
> > > > +static inline bool arch_faults_on_old_pte(void)
> > > > +{
> > > > +	return false;
> > > > +}
> > > > +#endif
> > >
> > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
> > > it the case that /most/ architectures will want to return true for
> > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense
> for
> > > that to be the default, and have x86 and arm64 provide an override?
> For
> > > example, aren't most architectures still going to hit the double fault
> > > scenario even with your patch applied?
> >
> > No, after applying my patch series, only those architectures which don't
> provide
> > setting access flag by hardware AND don't implement their
> arch_faults_on_old_pte
> > will hit the double page fault.
> >
> > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't
> have the hardware
> > setting access flag way, it might cause page fault on an old pte"
> > I don't want to change other architectures' default behavior here. So by
> default,
> > arch_faults_on_old_pte() is false.
> 
> ...and my complaint is that this is the majority of supported architectures,
> so you're fixing something for arm64 which also affects arm, powerpc,
> alpha, mips, riscv, ...

So, IIUC, you are suggesting that:
1. by default, arch_faults_on_old_pte() returns true
2. on x86, arch_faults_on_old_pte() is overridden to return false
3. on arm64, it stays as it is in my patch set
4. other architectures decide their own behavior (but by default, the pte
will be marked young)

I am OK with that if there are no objections from others.

@Kirill A. Shutemov Do you have any comments? Thanks
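
For reference, the default-plus-override arrangement summarised in the list
above could look roughly like the sketch below. It reuses the #ifndef idiom
from the quoted diff, but with the default flipped to true. The macro
TOY_ARCH_HAS_HW_ACCESS_FLAG is invented for this example and only stands in
for "an architecture header was included that provides its own definition".

```c
#include <stdbool.h>

/* --- what an architecture header might provide (hypothetical) --- */
#ifdef TOY_ARCH_HAS_HW_ACCESS_FLAG
static inline bool arch_faults_on_old_pte(void)
{
	return false;	/* hardware marks ptes young on access */
}
/* Defining the name as a macro lets generic code detect the override. */
#define arch_faults_on_old_pte arch_faults_on_old_pte
#endif

/* --- generic code: conservative default when nothing overrides it --- */
#ifndef arch_faults_on_old_pte
static inline bool arch_faults_on_old_pte(void)
{
	return true;	/* assume an access through an old pte may fault */
}
#endif
```

With this arrangement an architecture that does nothing gets the safe
behaviour, and only architectures that know their hardware sets the access
flag need to opt out.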
> 
> Chances are, they won't even realise they need to implement
> arch_faults_on_old_pte() until somebody runs into the double fault and
> wastes lots of time debugging it before they spot your patch.

On this point, I added a WARN_ON in patch 03 to speed up debugging.

--
Cheers,
Justin (Jia He)



> 
> > Btw, currently I only observed this double pagefault on arm64's guest
> > (host is ThunderX2).  On X86 guest (host is Intel(R) Core(TM) i7-4790 CPU
> > @ 3.60GHz ), there is no such double pagefault. It has the similar setting
> > access flag way by hardware.
> 
> Right, and that's why I'm not concerned about x86 for this problem.
> 
> Will
Kirill A. Shutemov Oct. 8, 2019, 2:32 p.m. UTC | #5
On Tue, Oct 08, 2019 at 12:58:57PM +0000, Justin He (Arm Technology China) wrote:
> Hi Will
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: October 8, 2019 20:40
> > To: Justin He (Arm Technology China) <Justin.He@arm.com>
> > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> > <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>; Marc
> > Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>; Kirill A.
> > Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> > Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> > foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> > <Kaly.Xin@arm.com>; nd <nd@arm.com>
> > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
> > is cleared
> > 
> > On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology
> > China) wrote:
> > > > -----Original Message-----
> > > > From: Will Deacon <will@kernel.org>
> > > > Sent: October 1, 2019 20:54
> > > > To: Justin He (Arm Technology China) <Justin.He@arm.com>
> > > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
> > > > <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>;
> > Marc
> > > > Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>;
> > Kirill A.
> > > > Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
> > > > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > > mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
> > > > Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
> > > > foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
> > > > <Kaly.Xin@arm.com>
> > > > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if
> > PTE_AF
> > > > is cleared
> > > >
> > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index b1ca51a079f2..1f56b0118ef5 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > > >  					2;
> > > > >  #endif
> > > > >
> > > > > +#ifndef arch_faults_on_old_pte
> > > > > +static inline bool arch_faults_on_old_pte(void)
> > > > > +{
> > > > > +	return false;
> > > > > +}
> > > > > +#endif
> > > >
> > > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
> > > > it the case that /most/ architectures will want to return true for
> > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense
> > for
> > > > that to be the default, and have x86 and arm64 provide an override?
> > For
> > > > example, aren't most architectures still going to hit the double fault
> > > > scenario even with your patch applied?
> > >
> > > No, after applying my patch series, only those architectures which don't
> > provide
> > > setting access flag by hardware AND don't implement their
> > arch_faults_on_old_pte
> > > will hit the double page fault.
> > >
> > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't
> > have the hardware
> > > setting access flag way, it might cause page fault on an old pte"
> > > I don't want to change other architectures' default behavior here. So by
> > default,
> > > arch_faults_on_old_pte() is false.
> > 
> > ...and my complaint is that this is the majority of supported architectures,
> > so you're fixing something for arm64 which also affects arm, powerpc,
> > alpha, mips, riscv, ...
> 
> So, IIUC, you suggested that:
> 1. by default, arch_faults_on_old_pte() return true
> 2. on X86, let arch_faults_on_old_pte() be overrided as returning false
> 3. on arm64, let it be as-is my patch set.
> 4. let other architectures decide the behavior. (But by default, it will set
> pte_young)
> 
> I am ok with that if no objections from others.
> 
> @Kirill A. Shutemov Do you have any comments? Thanks

Sounds sane to me.
Palmer Dabbelt Oct. 16, 2019, 11:21 p.m. UTC | #6
On Tue, 08 Oct 2019 05:39:44 PDT (-0700), will@kernel.org wrote:
> On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology China) wrote:
>> > -----Original Message-----
>> > From: Will Deacon <will@kernel.org>
>> > Sent: October 1, 2019 20:54
>> > To: Justin He (Arm Technology China) <Justin.He@arm.com>
>> > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Mark Rutland
>> > <Mark.Rutland@arm.com>; James Morse <James.Morse@arm.com>; Marc
>> > Zyngier <maz@kernel.org>; Matthew Wilcox <willy@infradead.org>; Kirill A.
>> > Shutemov <kirill.shutemov@linux.intel.com>; linux-arm-
>> > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
>> > mm@kvack.org; Punit Agrawal <punitagrawal@gmail.com>; Thomas
>> > Gleixner <tglx@linutronix.de>; Andrew Morton <akpm@linux-
>> > foundation.org>; hejianet@gmail.com; Kaly Xin (Arm Technology China)
>> > <Kaly.Xin@arm.com>
>> > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF
>> > is cleared
>> >
>> > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
>> > > diff --git a/mm/memory.c b/mm/memory.c
>> > > index b1ca51a079f2..1f56b0118ef5 100644
>> > > --- a/mm/memory.c
>> > > +++ b/mm/memory.c
>> > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
>> > >  					2;
>> > >  #endif
>> > >
>> > > +#ifndef arch_faults_on_old_pte
>> > > +static inline bool arch_faults_on_old_pte(void)
>> > > +{
>> > > +	return false;
>> > > +}
>> > > +#endif
>> >
>> > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
>> > it the case that /most/ architectures will want to return true for
>> > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
>> > that to be the default, and have x86 and arm64 provide an override? For
>> > example, aren't most architectures still going to hit the double fault
>> > scenario even with your patch applied?
>>
>> No, after applying my patch series, only those architectures which don't provide
>> setting access flag by hardware AND don't implement their arch_faults_on_old_pte
>> will hit the double page fault.
>>
>> The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware
>> setting access flag way, it might cause page fault on an old pte"
>> I don't want to change other architectures' default behavior here. So by default,
>> arch_faults_on_old_pte() is false.
>
> ...and my complaint is that this is the majority of supported architectures,
> so you're fixing something for arm64 which also affects arm, powerpc,
> alpha, mips, riscv, ...
>
> Chances are, they won't even realise they need to implement
> arch_faults_on_old_pte() until somebody runs into the double fault and
> wastes lots of time debugging it before they spot your patch.

If I understand the semantics correctly, we should have this set to true.  I 
don't have any context here, but we've got

                /*
                 * The kernel assumes that TLBs don't cache invalid
                 * entries, but in RISC-V, SFENCE.VMA specifies an
                 * ordering constraint, not a cache flush; it is
                 * necessary even after writing invalid entries.
                 */
                local_flush_tlb_page(addr);

in do_page_fault().

>> Btw, currently I only observed this double pagefault on arm64's guest
>> (host is ThunderX2).  On X86 guest (host is Intel(R) Core(TM) i7-4790 CPU
>> @ 3.60GHz ), there is no such double pagefault. It has the similar setting
>> access flag way by hardware.
>
> Right, and that's why I'm not concerned about x86 for this problem.
>
> Will
Will Deacon Oct. 16, 2019, 11:46 p.m. UTC | #7
Hey Palmer,

On Wed, Oct 16, 2019 at 04:21:59PM -0700, Palmer Dabbelt wrote:
> On Tue, 08 Oct 2019 05:39:44 PDT (-0700), will@kernel.org wrote:
> > On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology China) wrote:
> > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index b1ca51a079f2..1f56b0118ef5 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > > >  					2;
> > > > >  #endif
> > > > >
> > > > > +#ifndef arch_faults_on_old_pte
> > > > > +static inline bool arch_faults_on_old_pte(void)
> > > > > +{
> > > > > +	return false;
> > > > > +}
> > > > > +#endif
> > > >
> > > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
> > > > it the case that /most/ architectures will want to return true for
> > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
> > > > that to be the default, and have x86 and arm64 provide an override? For
> > > > example, aren't most architectures still going to hit the double fault
> > > > scenario even with your patch applied?
> > > 
> > > No, after applying my patch series, only those architectures which don't provide
> > > setting access flag by hardware AND don't implement their arch_faults_on_old_pte
> > > will hit the double page fault.
> > > 
> > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware
> > > setting access flag way, it might cause page fault on an old pte"
> > > I don't want to change other architectures' default behavior here. So by default,
> > > arch_faults_on_old_pte() is false.
> > 
> > ...and my complaint is that this is the majority of supported architectures,
> > so you're fixing something for arm64 which also affects arm, powerpc,
> > alpha, mips, riscv, ...
> > 
> > Chances are, they won't even realise they need to implement
> > arch_faults_on_old_pte() until somebody runs into the double fault and
> > wastes lots of time debugging it before they spot your patch.
> 
> If I understand the semantics correctly, we should have this set to true.  I
> don't have any context here, but we've got
> 
>                /*
>                 * The kernel assumes that TLBs don't cache invalid
>                 * entries, but in RISC-V, SFENCE.VMA specifies an
>                 * ordering constraint, not a cache flush; it is
>                 * necessary even after writing invalid entries.
>                 */
>                local_flush_tlb_page(addr);
> 
> in do_page_fault().

Ok, although I think this is really about whether or not your hardware can
make a pte young when accessed, or whether you take a fault and do it
by updating the pte explicitly.
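
The distinction drawn here can be illustrated with a small user-space model.
It is only a sketch under simplifying assumptions: struct sim_pte and
sim_access() are hypothetical names, and real hardware and fault handlers do
considerably more than flip one flag.

```c
#include <stdbool.h>

/* Simulated pte: only the accessed ("young") state is tracked. */
struct sim_pte {
	bool young;
};

/*
 * Simulate one memory access through the pte and return the number of
 * faults taken. With a hardware-managed access flag the page-table walker
 * marks the pte young by itself; otherwise the access faults once and the
 * fault handler marks the pte young explicitly before the access is retried.
 */
static int sim_access(struct sim_pte *pte, bool hw_sets_access_flag)
{
	int faults = 0;

	if (!pte->young) {
		if (!hw_sets_access_flag)
			faults++;	/* access flag fault taken */
		pte->young = true;	/* set by hardware or by the handler */
	}
	return faults;
}
```

In the second case the extra fault is harmless in normal context, but it is
exactly what cannot be serviced inside an atomic copy, which is why the
default matters.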

v12 of the patches did change the default, so you should be "safe" with
those either way:

http://lists.infradead.org/pipermail/linux-arm-kernel/2019-October/686030.html

Will
Palmer Dabbelt Oct. 18, 2019, 8:38 p.m. UTC | #8
On Wed, 16 Oct 2019 16:46:08 PDT (-0700), will@kernel.org wrote:
> Hey Palmer,
>
> On Wed, Oct 16, 2019 at 04:21:59PM -0700, Palmer Dabbelt wrote:
>> On Tue, 08 Oct 2019 05:39:44 PDT (-0700), will@kernel.org wrote:
>> > On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology China) wrote:
>> > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
>> > > > > diff --git a/mm/memory.c b/mm/memory.c
>> > > > > index b1ca51a079f2..1f56b0118ef5 100644
>> > > > > --- a/mm/memory.c
>> > > > > +++ b/mm/memory.c
>> > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
>> > > > >  					2;
>> > > > >  #endif
>> > > > >
>> > > > > +#ifndef arch_faults_on_old_pte
>> > > > > +static inline bool arch_faults_on_old_pte(void)
>> > > > > +{
>> > > > > +	return false;
>> > > > > +}
>> > > > > +#endif
>> > > >
>> > > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't
>> > > > it the case that /most/ architectures will want to return true for
>> > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
>> > > > that to be the default, and have x86 and arm64 provide an override? For
>> > > > example, aren't most architectures still going to hit the double fault
>> > > > scenario even with your patch applied?
>> > >
>> > > No, after applying my patch series, only those architectures which don't provide
>> > > setting access flag by hardware AND don't implement their arch_faults_on_old_pte
>> > > will hit the double page fault.
>> > >
>> > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware
>> > > setting access flag way, it might cause page fault on an old pte"
>> > > I don't want to change other architectures' default behavior here. So by default,
>> > > arch_faults_on_old_pte() is false.
>> >
>> > ...and my complaint is that this is the majority of supported architectures,
>> > so you're fixing something for arm64 which also affects arm, powerpc,
>> > alpha, mips, riscv, ...
>> >
>> > Chances are, they won't even realise they need to implement
>> > arch_faults_on_old_pte() until somebody runs into the double fault and
>> > wastes lots of time debugging it before they spot your patch.
>>
>> If I understand the semantics correctly, we should have this set to true.  I
>> don't have any context here, but we've got
>>
>>                /*
>>                 * The kernel assumes that TLBs don't cache invalid
>>                 * entries, but in RISC-V, SFENCE.VMA specifies an
>>                 * ordering constraint, not a cache flush; it is
>>                 * necessary even after writing invalid entries.
>>                 */
>>                local_flush_tlb_page(addr);
>>
>> in do_page_fault().
>
> Ok, although I think this is really about whether or not your hardware can
> make a pte young when accessed, or whether you take a fault and do it
> by updating the pte explicitly.
>
> v12 of the patches did change the default, so you should be "safe" with
> those either way:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2019-October/686030.html

OK, that fence is because we allow invalid translations to be cached, which is a 
completely different issue.

RISC-V implementations are allowed to have software managed accessed/dirty 
bits.  For some reason I thought we were relying on the firmware to handle 
this, but I can't actually find the code so I might be crazy.  Wherever it's 
done, there's no spec enforcing it so we should leave this true on RISC-V.

Thanks!

> Will
Jia He Oct. 19, 2019, 2:59 a.m. UTC | #9
Hi Palmer

On 2019/10/19 4:38, Palmer Dabbelt wrote:
> On Wed, 16 Oct 2019 16:46:08 PDT (-0700), will@kernel.org wrote:
>> Hey Palmer,
>>
>> On Wed, Oct 16, 2019 at 04:21:59PM -0700, Palmer Dabbelt wrote:
>>> On Tue, 08 Oct 2019 05:39:44 PDT (-0700), will@kernel.org wrote:
>>> > On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology 
>>> China) wrote:
>>> > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
>>> > > > > diff --git a/mm/memory.c b/mm/memory.c
>>> > > > > index b1ca51a079f2..1f56b0118ef5 100644
>>> > > > > --- a/mm/memory.c
>>> > > > > +++ b/mm/memory.c
>>> > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
>>> > > > >                      2;
>>> > > > >  #endif
>>> > > > >
>>> > > > > +#ifndef arch_faults_on_old_pte
>>> > > > > +static inline bool arch_faults_on_old_pte(void)
>>> > > > > +{
>>> > > > > +    return false;
>>> > > > > +}
>>> > > > > +#endif
>>> > > >
>>> > > > Kirill has acked this, so I'm happy to take the patch as-is, however 
>>> isn't
>>> > > > it the case that /most/ architectures will want to return true for
>>> > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
>>> > > > that to be the default, and have x86 and arm64 provide an override? For
>>> > > > example, aren't most architectures still going to hit the double fault
>>> > > > scenario even with your patch applied?
>>> > >
>>> > > No, after applying my patch series, only those architectures which don't provide
>>> > > setting access flag by hardware AND don't implement their arch_faults_on_old_pte
>>> > > will hit the double page fault.
>>> > >
>>> > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware
>>> > > setting access flag way, it might cause page fault on an old pte"
>>> > > I don't want to change other architectures' default behavior here. So by default,
>>> > > arch_faults_on_old_pte() is false.
>>> >
>>> > ...and my complaint is that this is the majority of supported architectures,
>>> > so you're fixing something for arm64 which also affects arm, powerpc,
>>> > alpha, mips, riscv, ...
>>> >
>>> > Chances are, they won't even realise they need to implement
>>> > arch_faults_on_old_pte() until somebody runs into the double fault and
>>> > wastes lots of time debugging it before they spot your patch.
>>>
>>> If I understand the semantics correctly, we should have this set to true.  I
>>> don't have any context here, but we've got
>>>
>>>                /*
>>>                 * The kernel assumes that TLBs don't cache invalid
>>>                 * entries, but in RISC-V, SFENCE.VMA specifies an
>>>                 * ordering constraint, not a cache flush; it is
>>>                 * necessary even after writing invalid entries.
>>>                 */
>>>                local_flush_tlb_page(addr);
>>>
>>> in do_page_fault().
>>
>> Ok, although I think this is really about whether or not your hardware can
>> make a pte young when accessed, or whether you take a fault and do it
>> by updating the pte explicitly.
>>
>> v12 of the patches did change the default, so you should be "safe" with
>> those either way:
>>
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2019-October/686030.html
>
> OK, that fence is because we allow invalid translations to be cached, which 
> is a completely different issue.
>
> RISC-V implementations are allowed to have software managed accessed/dirty 
> bits.  For some reason I thought we were relying on the firmware to handle 
> this, but I can't actually find the code so I might be crazy.  Wherever it's 
> done, there's no spec enforcing it so we should leave this true on RISC-V.
>
Thanks for the confirmation. So RISC-V can keep the default
arch_faults_on_old_pte(), which returns true as of v12.
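
For reference, the override mechanism can be sketched in plain user-space C. This is only an illustration of the `#ifndef` pattern, assuming the v12 default (return true) mentioned above; `HYPOTHETICAL_ARCH_HAS_HW_AF` is an invented stand-in for what an architecture header would provide, not a real kernel symbol.

```c
#include <stdbool.h>

/*
 * Hypothetical arch header: an architecture whose hardware sets the
 * accessed flag would supply its own helper and define the macro of
 * the same name. Build with -DHYPOTHETICAL_ARCH_HAS_HW_AF to model it.
 */
#ifdef HYPOTHETICAL_ARCH_HAS_HW_AF
static inline bool arch_faults_on_old_pte(void)
{
	return false;
}
#define arch_faults_on_old_pte arch_faults_on_old_pte
#endif

/* Generic fallback, compiled only when no arch override exists. */
#ifndef arch_faults_on_old_pte
static inline bool arch_faults_on_old_pte(void)
{
	return true;	/* assumed v12 default: old ptes may fault */
}
#endif
```

An architecture that does nothing gets the conservative answer, so it takes the pte_mkyoung() path instead of silently hitting the double fault.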


Thanks.


---
Cheers,
Justin (Jia He)

Patch

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..1f56b0118ef5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@  int randomize_va_space __read_mostly =
 					2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -2145,32 +2152,82 @@  static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 	return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+				 struct vm_fault *vmf)
 {
+	bool ret;
+	void *kaddr;
+	void __user *uaddr;
+	bool force_mkyoung;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = vmf->address;
+
 	debug_dma_assert_idle(src);
 
+	if (likely(src)) {
+		copy_user_highpage(dst, src, addr, vma);
+		return true;
+	}
+
 	/*
 	 * If the source page was a PFN mapping, we don't have
 	 * a "struct page" for it. We do a best-effort copy by
 	 * just copying from the original user address. If that
 	 * fails, we just zero-fill it. Live with it.
 	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+	kaddr = kmap_atomic(dst);
+	uaddr = (void __user *)(addr & PAGE_MASK);
+
+	/*
+	 * On architectures with software "accessed" bits, we would
+	 * take a double page fault, so mark it accessed here.
+	 */
+	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+	if (force_mkyoung) {
+		pte_t entry;
+
+		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+			/*
+			 * Other thread has already handled the fault
+			 * and we don't need to do anything. If it's
+			 * not the case, the fault will be triggered
+			 * again on the same address.
+			 */
+			ret = false;
+			goto pte_unlock;
+		}
+
+		entry = pte_mkyoung(vmf->orig_pte);
+		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+			update_mmu_cache(vma, addr, vmf->pte);
+	}
 
+	/*
+	 * This really shouldn't fail, because the page is there
+	 * in the page tables. But it might just be unreadable,
+	 * in which case we just give up and fill the result with
+	 * zeroes.
+	 */
+	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
 		/*
-		 * This really shouldn't fail, because the page is there
-		 * in the page tables. But it might just be unreadable,
-		 * in which case we just give up and fill the result with
-		 * zeroes.
+		 * Give a warn in case there can be some obscure
+		 * use-case
 		 */
-		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
-			clear_page(kaddr);
-		kunmap_atomic(kaddr);
-		flush_dcache_page(dst);
-	} else
-		copy_user_highpage(dst, src, va, vma);
+		WARN_ON_ONCE(1);
+		clear_page(kaddr);
+	}
+
+	ret = true;
+
+pte_unlock:
+	if (force_mkyoung)
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	kunmap_atomic(kaddr);
+	flush_dcache_page(dst);
+
+	return ret;
 }
 
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
@@ -2327,7 +2384,19 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 				vmf->address);
 		if (!new_page)
 			goto oom;
-		cow_user_page(new_page, old_page, vmf->address, vma);
+
+		if (!cow_user_page(new_page, old_page, vmf)) {
+			/*
+			 * COW failed, if the fault was solved by other,
+			 * it's fine. If not, userspace would re-fault on
+			 * the same address and we will handle the fault
+			 * from the second attempt.
+			 */
+			put_page(new_page);
+			if (old_page)
+				put_page(old_page);
+			return 0;
+		}
 	}
 
 	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))