linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes
@ 2017-02-06 17:06 Aneesh Kumar K.V
  2017-02-06 18:46 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Aneesh Kumar K.V @ 2017-02-06 17:06 UTC (permalink / raw)
  To: akpm, Rik van Riel, Mel Gorman; +Cc: linux-mm, linux-kernel, Aneesh Kumar K.V

Architectures like ppc64 use the privileged access bit to mark a pte
non-accessible. This implies that the kernel can do a copy_to_user to an
address marked for a NUMA fault, and hence that there can be a parallel
hardware update of the pte. set_pte_at cannot be used in such scenarios.
Switch the pte update to ptep_modify_prot_start and
ptep_modify_prot_commit, which in the generic implementation are a
ptep_get_and_clear and set_pte_at combination.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
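To illustrate the race described above, here is a minimal user-space
sketch in C. It is a model under stated assumptions, not kernel code: the
bit layout, model_pte_t, hw_kernel_write() and both update helpers are
hypothetical names made up for this example. It only shows why a plain
overwrite can lose a concurrent hardware update while a clear-then-set
sequence cannot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout for this model only; not the real ppc64 format. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_PROTNONE (UINT64_C(1) << 1) /* user blocked, kernel still allowed */
#define PTE_DIRTY    (UINT64_C(1) << 2) /* set by "hardware" on a write */

typedef _Atomic uint64_t model_pte_t;

/* Model of the MMU: a kernel-mode write sets DIRTY if the entry is present. */
static int hw_kernel_write(model_pte_t *ptep)
{
	uint64_t old = atomic_load(ptep);

	while (old & PTE_PRESENT) {
		/* On failure, old is refreshed and the presence check reruns. */
		if (atomic_compare_exchange_weak(ptep, &old, old | PTE_DIRTY))
			return 1;	/* write went through, bit recorded */
	}
	return 0;			/* not present: the access would fault */
}

/* The software side of the update: drop the protnone marking. */
static uint64_t make_accessible(uint64_t pte)
{
	return pte & ~PTE_PROTNONE;
}

/* Racy variant: read, compute, plain overwrite (set_pte_at style). */
static uint64_t update_by_overwrite(model_pte_t *ptep)
{
	uint64_t snapshot = atomic_load(ptep);

	hw_kernel_write(ptep);		/* hardware update lands after the read */
	atomic_store(ptep, make_accessible(snapshot));	/* DIRTY is lost here */
	return atomic_load(ptep);
}

/* Safe variant: clear atomically first, then install the new value. */
static uint64_t update_by_clear_then_set(model_pte_t *ptep, int *hw_succeeded)
{
	uint64_t old = atomic_exchange(ptep, 0);	/* like ptep_get_and_clear */

	/* Single-threaded model: nothing else can repopulate the entry. */
	assert(atomic_load(ptep) == 0);
	*hw_succeeded = hw_kernel_write(ptep);		/* sees a cleared entry */
	atomic_store(ptep, make_accessible(old));	/* like the final set_pte_at */
	return atomic_load(ptep);
}
```

In this model, update_by_overwrite() on an entry holding
PTE_PRESENT | PTE_PROTNONE returns a value without PTE_DIRTY even though
the modelled hardware write succeeded, whereas in
update_by_clear_then_set() the hardware write inside the window fails
(it would fault and be retried) and a later hw_kernel_write() sets
PTE_DIRTY on the final entry.
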
 arch/powerpc/mm/pgtable.c |  7 +------
 mm/memory.c               | 18 +++++++++---------
 2 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb39c8bd2436..b8ac81a16389 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -186,12 +186,7 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		pte_t pte)
 {
-	/*
-	 * When handling numa faults, we already have the pte marked
-	 * _PAGE_PRESENT, but we can be sure that it is not in hpte.
-	 * Hence we can use set_pte_at for them.
-	 */
-	VM_WARN_ON(pte_present(*ptep) && !pte_protnone(*ptep));
+	VM_WARN_ON(pte_present(*ptep));
 
 	/*
 	 * Add the pte bit when tryint set a pte
diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..e78bf72f30dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3387,32 +3387,32 @@ static int do_numa_page(struct vm_fault *vmf)
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
-	pte_t pte = vmf->orig_pte;
-	bool was_writable = pte_write(pte);
+	pte_t pte;
+	bool was_writable = pte_write(vmf->orig_pte);
 	int flags = 0;
 
 	/*
 	* The "pte" at this point cannot be used safely without
 	* validation through pte_unmap_same(). It's of NUMA type but
 	* the pfn may be screwed if the read is non atomic.
-	*
-	* We can safely just do a "set_pte_at()", because the old
-	* page table entry is not accessible, so there would be no
-	* concurrent hardware modifications to the PTE.
 	*/
 	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	if (unlikely(!pte_same(*vmf->pte, pte))) {
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
 	}
 
-	/* Make it present again */
+	/*
+	 * Make it present again. Depending on how the arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	pte = ptep_modify_prot_start(vma->vm_mm, vmf->address, vmf->pte);
 	pte = pte_modify(pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (was_writable)
 		pte = pte_mkwrite(pte);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
+	ptep_modify_prot_commit(vma->vm_mm, vmf->address, vmf->pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
 	page = vm_normal_page(vma, vmf->address, pte);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes
  2017-02-06 17:06 [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes Aneesh Kumar K.V
@ 2017-02-06 18:46 ` Rik van Riel
  2017-02-06 22:26 ` Mel Gorman
  2017-02-14 14:11 ` Aneesh Kumar K.V
  2 siblings, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2017-02-06 18:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V, akpm, Mel Gorman; +Cc: linux-mm, linux-kernel

On Mon, 2017-02-06 at 22:36 +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64, use privilege access bit to mark pte non
> accessible.
> This implies that kernel can do a copy_to_user to an address marked
> for numa fault.
> This also implies that there can be a parallel hardware update for
> the pte.
> set_pte_at cannot be used in such scenarios. Hence switch the pte
> update to use ptep_get_and_clear and set_pte_at combination.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes
  2017-02-06 17:06 [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes Aneesh Kumar K.V
  2017-02-06 18:46 ` Rik van Riel
@ 2017-02-06 22:26 ` Mel Gorman
  2017-02-14 14:11 ` Aneesh Kumar K.V
  2 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2017-02-06 22:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: akpm, Rik van Riel, linux-mm, linux-kernel

On Mon, Feb 06, 2017 at 10:36:16PM +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64, use privilege access bit to mark pte non accessible.
> This implies that kernel can do a copy_to_user to an address marked for numa fault.
> This also implies that there can be a parallel hardware update for the pte.
> set_pte_at cannot be used in such scenarios. Hence switch the pte
> update to use ptep_get_and_clear and set_pte_at combination.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Yeah, ok. The main thing is that it still avoids doing an unnecessary TLB
flush so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes
  2017-02-06 17:06 [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes Aneesh Kumar K.V
  2017-02-06 18:46 ` Rik van Riel
  2017-02-06 22:26 ` Mel Gorman
@ 2017-02-14 14:11 ` Aneesh Kumar K.V
  2017-02-15  0:05   ` Andrew Morton
  2 siblings, 1 reply; 5+ messages in thread
From: Aneesh Kumar K.V @ 2017-02-14 14:11 UTC (permalink / raw)
  To: akpm, Rik van Riel, Mel Gorman; +Cc: linux-mm, linux-kernel

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> Architectures like ppc64, use privilege access bit to mark pte non accessible.
> This implies that kernel can do a copy_to_user to an address marked for numa fault.
> This also implies that there can be a parallel hardware update for the pte.
> set_pte_at cannot be used in such scenarios. Hence switch the pte
> update to use ptep_get_and_clear and set_pte_at combination.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

With this and other patches a kvm guest is giving me

[  494.542145] khugepaged      D13632  1451      2 0x00000800
[  494.542151] Call Trace:
[  494.542158] [c000000fe57a7830] [c000000000e71f10] sysctl_sched_child_runs_first+0x0/0x4 (unreliable)
[  494.542163] [c000000fe57a7a00] [c00000000001ae70] __switch_to+0x2b0/0x440
[  494.542167] [c000000fe57a7a60] [c0000000009ac560] __schedule+0x2e0/0x940
[  494.542170] [c000000fe57a7b00] [c0000000009acc00] schedule+0x40/0xb0
[  494.542173] [c000000fe57a7b30] [c0000000009b1264] rwsem_down_read_failed+0x124/0x1b0
[  494.542176] [c000000fe57a7ba0] [c0000000009b0064] down_read+0x64/0x70
[  494.542180] [c000000fe57a7bd0] [c000000000292a70] khugepaged+0x420/0x25c0
[  494.542184] [c000000fe57a7dc0] [c0000000000df37c] kthread+0x14c/0x190
[  494.542187] [c000000fe57a7e30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
[  494.542276] INFO: task qemu-system-ppc:6868 blocked for more than 120 seconds.
[  494.542340]       Not tainted 4.10.0-rc8-00025-g0d75d3e #4
[  494.542377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  494.542439] qemu-system-ppc D10688  6868   6473 0x00040000
[  494.542445] Call Trace:
[  494.542448] [c000000fdca7b6a0] [c00000000001ae70] __switch_to+0x2b0/0x440
[  494.542451] [c000000fdca7b700] [c0000000009ac560] __schedule+0x2e0/0x940
[  494.542454] [c000000fdca7b7a0] [c0000000009acc00] schedule+0x40/0xb0
[  494.542457] [c000000fdca7b7d0] [c0000000009b1264] rwsem_down_read_failed+0x124/0x1b0
[  494.542460] [c000000fdca7b840] [c0000000009b0064] down_read+0x64/0x70
[  494.542464] [c000000fdca7b870] [c0000000002340e0] get_user_pages_unlocked+0x80/0x280
[  494.542467] [c000000fdca7b910] [c0000000002352dc] get_user_pages_fast+0xac/0x110
[  494.542475] [c000000fdca7b960] [d00000001096c4fc] kvmppc_book3s_hv_page_fault+0x2bc/0xbb0 [kvm_hv]
[  494.542479] [c000000fdca7ba50] [d0000000109692e4] kvmppc_vcpu_run_hv+0xee4/0x1290 [kvm_hv]
[  494.542488] [c000000fdca7bb80] [d0000000107113bc] kvmppc_vcpu_run+0x2c/0x40 [kvm]
[  494.542497] [c000000fdca7bba0] [d00000001070ec6c] kvm_arch_vcpu_ioctl_run+0x5c/0x160 [kvm]
[  494.542504] [c000000fdca7bbe0] [d000000010703bf8] kvm_vcpu_ioctl+0x528/0x7a0 [kvm]
[  494.542506] [c000000fdca7bd40] [c0000000002c46dc] do_vfs_ioctl+0xcc/0x8e0
[  494.542509] [c000000fdca7bde0] [c0000000002c4f50] SyS_ioctl+0x60/0xc0
[  494.542512] [c000000fdca7be30] [c00000000000b760] system_call+0x38/0xfc
[  494.542514] INFO: task qemu-system-ppc:6870 blocked for more than 120 seconds.
[  494.542577]       Not tainted 4.10.0-rc8-00025-g0d75d3e #4
[  494.542615] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  494.542677] qemu-system-ppc D10688  6870   6473 0x00040000

Reverting this patch gets rid of the above hang, but I am then running into a
segfault with systemd in the guest. It could be caused by some other patches in
my local tree.

Maybe we should hold merging this to 4.11 and wait for this to get more
testing?

-aneesh

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm/autonuma: don't use set_pte_at when updating protnone ptes
  2017-02-14 14:11 ` Aneesh Kumar K.V
@ 2017-02-15  0:05   ` Andrew Morton
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Morton @ 2017-02-15  0:05 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Rik van Riel, Mel Gorman, linux-mm, linux-kernel

On Tue, 14 Feb 2017 19:41:17 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:
> 
> > Architectures like ppc64, use privilege access bit to mark pte non accessible.
> > This implies that kernel can do a copy_to_user to an address marked for numa fault.
> > This also implies that there can be a parallel hardware update for the pte.
> > set_pte_at cannot be used in such scenarios. Hence switch the pte
> > update to use ptep_get_and_clear and set_pte_at combination.
> >
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> With this and other patches a kvm guest is giving me
> 
> ...
> 
> Reverting this patch gets rid of the above hang. But I am running into segfault
> with systemd in guest. It could be some other patches in my local tree.
> 
> Maybe we should hold merging this to 4.11 and wait for this to get more
> testing ?

Shall do.  Please let me know the outcome...

^ permalink raw reply	[flat|nested] 5+ messages in thread
