All of lore.kernel.org
 help / color / mirror / Atom feed
From: Laurent Dufour <ldufour@liunx.vnet.ibm.com>
To: vinayak menon <vinayakm.list@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox <willy@infradead.org>,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	hpa@zytor.com, Will Deacon <will.deacon@arm.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	kemi.wang@intel.com, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	punitagrawal@gmail.com, yang.shi@linux.alibaba.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	Balbir Singh <bsingharora@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org,
	Vinayak Menon <vinmenon@codeaurora.org>
Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
Date: Mon, 5 Nov 2018 19:22:50 +0100	[thread overview]
Message-ID: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> (raw)
In-Reply-To: <CAOaiJ-nJ_=+diXz8ji42Rro3Mj16C1=NpenRuY3-mjs_GhR4mA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4478 bytes --]

Le 05/11/2018 à 08:04, vinayak menon a écrit :
> Hi Laurent,
> 
> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
> <ldufour@linux.vnet.ibm.com> wrote:
>>
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modification done in :
>>          - madvise()
>>          - mpol_rebind_policy()
>>          - vma_replace_policy()
>>          - change_prot_numa()
>>          - mlock(), munlock()
>>          - mprotect()
>>          - mmap_region()
>>          - collapse_huge_page()
>>          - userfaultd registering services
>>
>> In addition, VMA fields which will be read during the speculative fault
>> path needs to be written using WRITE_ONCE to prevent write to be split
>> and intermediate values to be pushed to other CPUs.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>   fs/proc/task_mmu.c |  5 ++++-
>>   fs/userfaultfd.c   | 17 +++++++++++++----
>>   mm/khugepaged.c    |  3 +++
>>   mm/madvise.c       |  6 +++++-
>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>>   mm/mlock.c         | 13 ++++++++-----
>>   mm/mmap.c          | 22 +++++++++++++---------
>>   mm/mprotect.c      |  4 +++-
>>   mm/swap_state.c    |  8 ++++++--
>>   9 files changed, 89 insertions(+), 40 deletions(-)
>>
>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>                                  struct vm_fault *vmf)
>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>                                       unsigned long *start,
>>                                       unsigned long *end)
>>   {
>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>                        PFN_DOWN(faddr & PMD_MASK));
>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>   }
>>
>> --
>> 2.7.4
>>
> 
> I have got a crash on 4.14 kernel with speculative page faults enabled
> and here is my analysis of the problem.
> The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.

> 
> [23409.303395]  el1_da+0x24/0x84
> [23409.303400]  __radix_tree_lookup+0x8/0x90
> [23409.303407]  find_get_entry+0x64/0x14c
> [23409.303410]  pagecache_get_page+0x5c/0x27c
> [23409.303416]  __read_swap_cache_async+0x80/0x260
> [23409.303420]  swap_vma_readahead+0x264/0x37c
> [23409.303423]  swapin_readahead+0x5c/0x6c
> [23409.303428]  do_swap_page+0x128/0x6e4
> [23409.303431]  handle_pte_fault+0x230/0xca4
> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
> [23409.303438]  do_page_fault+0x228/0x3e8
> [23409.303442]  do_translation_fault+0x50/0x6c
> [23409.303445]  do_mem_abort+0x5c/0xe0
> [23409.303447]  el0_da+0x20/0x24
> 
> Process A accesses address ADDR (part of VMA A) and that results in a
> translation fault.
> Kernel enters __handle_speculative_fault to fix the fault.
> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
> from speculative path.
> During this time, another process B which shares the same mm, does a
> mprotect from another CPU which follows
> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B.
> After the split, ADDR falls into VMA B, but process A is still using
> VMA A.
> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end.
> swap_vma_readahead->swap_ra_info uses start and end of vma to
> calculate ptes and nr_pte, which goes wrong due to this and finally
> resulting in wrong "entry" passed to
> swap_vma_readahead->__read_swap_cache_async, and in turn causing
> invalid swapper_space
> being passed to __read_swap_cache_async->find_get_page, causing an abort.
> 
> The fix I have tried is to cache vm_start and vm_end also in vmf and
> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
> send
> the patch I am a using if you feel that is the right thing to do.

I think the best would be to don't do swap readahead during the 
speculatvive page fault. If the page is found in the swap cache, that's 
fine, but otherwise, we should f	allback to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a 
try. I'll review that for the next series.

Thanks,
Laurent.

[-- Attachment #2: 0001-mm-don-t-do-swap-readahead-during-speculative-page-f.patch --]
[-- Type: text/plain, Size: 1507 bytes --]

From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raise a panic in swap_vma_readahead() because the VMA's boundaries
were not more matching the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
+		} else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+			/*
+			 * Don't try readahead during a speculative page fault as
+			 * the VMA's boundaries may change in our back.
+			 * If the page is not in the swap cache and synchronous read
+			 * is disabled, fall back to the regular page fault mechanism.
+			 */
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			ret = VM_FAULT_RETRY;
+			goto out;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
-- 
2.19.1


WARNING: multiple messages have this Message-ID (diff)
From: Laurent Dufour <ldufour@liunx.vnet.ibm.com>
To: vinayak menon <vinayakm.list@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox <willy@infradead.org>,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	hpa@zytor.com, Will Deacon <will.deacon@arm.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	kemi.wang@intel.com, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	punitagrawal@gmail.com, yang.shi@linux.alibaba.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	Balbir Singh <bsingharora@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org,
	Vinayak Menon <vinmenon@codeaurora.org>
Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
Date: Mon, 5 Nov 2018 19:22:50 +0100	[thread overview]
Message-ID: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> (raw)
In-Reply-To: <CAOaiJ-nJ_=+diXz8ji42Rro3Mj16C1=NpenRuY3-mjs_GhR4mA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4480 bytes --]

Le 05/11/2018 A  08:04, vinayak menon a A(C)critA :
> Hi Laurent,
> 
> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
> <ldufour@linux.vnet.ibm.com> wrote:
>>
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modification done in :
>>          - madvise()
>>          - mpol_rebind_policy()
>>          - vma_replace_policy()
>>          - change_prot_numa()
>>          - mlock(), munlock()
>>          - mprotect()
>>          - mmap_region()
>>          - collapse_huge_page()
>>          - userfaultd registering services
>>
>> In addition, VMA fields which will be read during the speculative fault
>> path needs to be written using WRITE_ONCE to prevent write to be split
>> and intermediate values to be pushed to other CPUs.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>   fs/proc/task_mmu.c |  5 ++++-
>>   fs/userfaultfd.c   | 17 +++++++++++++----
>>   mm/khugepaged.c    |  3 +++
>>   mm/madvise.c       |  6 +++++-
>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>>   mm/mlock.c         | 13 ++++++++-----
>>   mm/mmap.c          | 22 +++++++++++++---------
>>   mm/mprotect.c      |  4 +++-
>>   mm/swap_state.c    |  8 ++++++--
>>   9 files changed, 89 insertions(+), 40 deletions(-)
>>
>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>                                  struct vm_fault *vmf)
>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>                                       unsigned long *start,
>>                                       unsigned long *end)
>>   {
>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>                        PFN_DOWN(faddr & PMD_MASK));
>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>   }
>>
>> --
>> 2.7.4
>>
> 
> I have got a crash on 4.14 kernel with speculative page faults enabled
> and here is my analysis of the problem.
> The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.

> 
> [23409.303395]  el1_da+0x24/0x84
> [23409.303400]  __radix_tree_lookup+0x8/0x90
> [23409.303407]  find_get_entry+0x64/0x14c
> [23409.303410]  pagecache_get_page+0x5c/0x27c
> [23409.303416]  __read_swap_cache_async+0x80/0x260
> [23409.303420]  swap_vma_readahead+0x264/0x37c
> [23409.303423]  swapin_readahead+0x5c/0x6c
> [23409.303428]  do_swap_page+0x128/0x6e4
> [23409.303431]  handle_pte_fault+0x230/0xca4
> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
> [23409.303438]  do_page_fault+0x228/0x3e8
> [23409.303442]  do_translation_fault+0x50/0x6c
> [23409.303445]  do_mem_abort+0x5c/0xe0
> [23409.303447]  el0_da+0x20/0x24
> 
> Process A accesses address ADDR (part of VMA A) and that results in a
> translation fault.
> Kernel enters __handle_speculative_fault to fix the fault.
> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
> from speculative path.
> During this time, another process B which shares the same mm, does a
> mprotect from another CPU which follows
> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B.
> After the split, ADDR falls into VMA B, but process A is still using
> VMA A.
> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end.
> swap_vma_readahead->swap_ra_info uses start and end of vma to
> calculate ptes and nr_pte, which goes wrong due to this and finally
> resulting in wrong "entry" passed to
> swap_vma_readahead->__read_swap_cache_async, and in turn causing
> invalid swapper_space
> being passed to __read_swap_cache_async->find_get_page, causing an abort.
> 
> The fix I have tried is to cache vm_start and vm_end also in vmf and
> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
> send
> the patch I am a using if you feel that is the right thing to do.

I think the best would be to don't do swap readahead during the 
speculatvive page fault. If the page is found in the swap cache, that's 
fine, but otherwise, we should f	allback to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a 
try. I'll review that for the next series.

Thanks,
Laurent.

[-- Attachment #2: 0001-mm-don-t-do-swap-readahead-during-speculative-page-f.patch --]
[-- Type: text/plain, Size: 1507 bytes --]

From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raise a panic in swap_vma_readahead() because the VMA's boundaries
were not more matching the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
+		} else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+			/*
+			 * Don't try readahead during a speculative page fault as
+			 * the VMA's boundaries may change in our back.
+			 * If the page is not in the swap cache and synchronous read
+			 * is disabled, fall back to the regular page fault mechanism.
+			 */
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			ret = VM_FAULT_RETRY;
+			goto out;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
-- 
2.19.1


WARNING: multiple messages have this Message-ID (diff)
From: Laurent Dufour <ldufour@liunx.vnet.ibm.com>
To: vinayak menon <vinayakm.list@gmail.com>
Cc: jack@suse.cz, sergey.senozhatsky.work@gmail.com,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will.deacon@arm.com>,
	Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, paulus@samba.org, punitagrawal@gmail.com,
	hpa@zytor.com, Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	khandual@linux.vnet.ibm.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	ak@linux.intel.com, Minchan Kim <minchan@kernel.org>,
	x86@kernel.org, Matthew Wilcox <willy@infradead.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Ingo Molnar <mingo@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	npiggin@gmail.com, Jerome Glisse <jglisse@redhat.com>,
	dave@stgolabs.net, kemi.wang@intel.com, kirill@shutemov.name,
	Thomas Gleixner <tglx@linutronix.de>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>,
	yang.shi@linux.alibaba.com, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	aneesh.kumar@linux.vnet.ibm.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	haren@linux.vnet.ibm.com
Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
Date: Mon, 5 Nov 2018 19:22:50 +0100	[thread overview]
Message-ID: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> (raw)
In-Reply-To: <CAOaiJ-nJ_=+diXz8ji42Rro3Mj16C1=NpenRuY3-mjs_GhR4mA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4478 bytes --]

Le 05/11/2018 à 08:04, vinayak menon a écrit :
> Hi Laurent,
> 
> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
> <ldufour@linux.vnet.ibm.com> wrote:
>>
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modification done in :
>>          - madvise()
>>          - mpol_rebind_policy()
>>          - vma_replace_policy()
>>          - change_prot_numa()
>>          - mlock(), munlock()
>>          - mprotect()
>>          - mmap_region()
>>          - collapse_huge_page()
>>          - userfaultd registering services
>>
>> In addition, VMA fields which will be read during the speculative fault
>> path needs to be written using WRITE_ONCE to prevent write to be split
>> and intermediate values to be pushed to other CPUs.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>   fs/proc/task_mmu.c |  5 ++++-
>>   fs/userfaultfd.c   | 17 +++++++++++++----
>>   mm/khugepaged.c    |  3 +++
>>   mm/madvise.c       |  6 +++++-
>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>>   mm/mlock.c         | 13 ++++++++-----
>>   mm/mmap.c          | 22 +++++++++++++---------
>>   mm/mprotect.c      |  4 +++-
>>   mm/swap_state.c    |  8 ++++++--
>>   9 files changed, 89 insertions(+), 40 deletions(-)
>>
>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>                                  struct vm_fault *vmf)
>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>                                       unsigned long *start,
>>                                       unsigned long *end)
>>   {
>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>                        PFN_DOWN(faddr & PMD_MASK));
>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>   }
>>
>> --
>> 2.7.4
>>
> 
> I have got a crash on 4.14 kernel with speculative page faults enabled
> and here is my analysis of the problem.
> The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.

> 
> [23409.303395]  el1_da+0x24/0x84
> [23409.303400]  __radix_tree_lookup+0x8/0x90
> [23409.303407]  find_get_entry+0x64/0x14c
> [23409.303410]  pagecache_get_page+0x5c/0x27c
> [23409.303416]  __read_swap_cache_async+0x80/0x260
> [23409.303420]  swap_vma_readahead+0x264/0x37c
> [23409.303423]  swapin_readahead+0x5c/0x6c
> [23409.303428]  do_swap_page+0x128/0x6e4
> [23409.303431]  handle_pte_fault+0x230/0xca4
> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
> [23409.303438]  do_page_fault+0x228/0x3e8
> [23409.303442]  do_translation_fault+0x50/0x6c
> [23409.303445]  do_mem_abort+0x5c/0xe0
> [23409.303447]  el0_da+0x20/0x24
> 
> Process A accesses address ADDR (part of VMA A) and that results in a
> translation fault.
> Kernel enters __handle_speculative_fault to fix the fault.
> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
> from speculative path.
> During this time, another process B which shares the same mm, does a
> mprotect from another CPU which follows
> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B.
> After the split, ADDR falls into VMA B, but process A is still using
> VMA A.
> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end.
> swap_vma_readahead->swap_ra_info uses start and end of vma to
> calculate ptes and nr_pte, which goes wrong due to this and finally
> resulting in wrong "entry" passed to
> swap_vma_readahead->__read_swap_cache_async, and in turn causing
> invalid swapper_space
> being passed to __read_swap_cache_async->find_get_page, causing an abort.
> 
> The fix I have tried is to cache vm_start and vm_end also in vmf and
> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
> send
> the patch I am a using if you feel that is the right thing to do.

I think the best would be to don't do swap readahead during the 
speculatvive page fault. If the page is found in the swap cache, that's 
fine, but otherwise, we should f	allback to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a 
try. I'll review that for the next series.

Thanks,
Laurent.

[-- Attachment #2: 0001-mm-don-t-do-swap-readahead-during-speculative-page-f.patch --]
[-- Type: text/plain, Size: 1507 bytes --]

From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raise a panic in swap_vma_readahead() because the VMA's boundaries
were not more matching the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
+		} else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+			/*
+			 * Don't try readahead during a speculative page fault as
+			 * the VMA's boundaries may change in our back.
+			 * If the page is not in the swap cache and synchronous read
+			 * is disabled, fall back to the regular page fault mechanism.
+			 */
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			ret = VM_FAULT_RETRY;
+			goto out;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
-- 
2.19.1


  reply	other threads:[~2018-11-05 18:23 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-17 11:06 [PATCH v11 00/26] Speculative page faults Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT Laurent Dufour
2018-05-17 16:36   ` Randy Dunlap
2018-05-17 17:19     ` Matthew Wilcox
2018-05-17 17:34       ` Randy Dunlap
2018-05-22 12:00         ` [FIX PATCH " Laurent Dufour
2018-05-22 11:44       ` [PATCH " Laurent Dufour
2018-05-22 11:47     ` Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 02/26] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 03/26] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 04/26] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 05/26] mm: prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 06/26] mm: introduce pte_spinlock " Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 07/26] mm: make pte_unmap_same compatible with SPF Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 08/26] mm: introduce INIT_VMA() Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 09/26] mm: VMA sequence count Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 10/26] mm: protect VMA modifications using " Laurent Dufour
2018-11-05  7:04   ` vinayak menon
2018-11-05  7:04     ` vinayak menon
2018-11-05 18:22     ` Laurent Dufour [this message]
2018-11-05 18:22       ` Laurent Dufour
2018-11-05 18:22       ` Laurent Dufour
2018-11-06  9:28       ` Vinayak Menon
2018-11-06  9:28         ` Vinayak Menon
2018-11-06  9:28         ` Vinayak Menon
2018-05-17 11:06 ` [PATCH v11 11/26] mm: protect mremap() against SPF hanlder Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 12/26] mm: protect SPF handler against anon_vma changes Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 13/26] mm: cache some VMA fields in the vm_fault structure Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 14/26] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 15/26] mm: introduce __lru_cache_add_active_or_unevictable Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 16/26] mm: introduce __vm_normal_page() Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 17/26] mm: introduce __page_add_new_anon_rmap() Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 18/26] mm: protect mm_rb tree with a rwlock Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 19/26] mm: provide speculative fault infrastructure Laurent Dufour
2018-07-24 14:26   ` zhong jiang
2018-07-24 14:26     ` zhong jiang
2018-07-24 16:10     ` Laurent Dufour
2018-07-25  9:04       ` zhong jiang
2018-07-25  9:04         ` zhong jiang
2018-07-25 10:44         ` Laurent Dufour
2018-07-25 11:23           ` zhong jiang
2018-07-25 11:23             ` zhong jiang
2018-05-17 11:06 ` [PATCH v11 20/26] mm: adding speculative page fault failure trace events Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 21/26] perf: add a speculative page fault sw event Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 22/26] perf tools: add support for the SPF perf event Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 23/26] mm: add speculative page fault vmstats Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 24/26] x86/mm: add speculative pagefault handling Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 25/26] powerpc/mm: add speculative page fault Laurent Dufour
2018-05-17 11:06 ` [PATCH v11 26/26] arm64/mm: " Laurent Dufour
2018-05-28  5:23 ` [PATCH v11 00/26] Speculative page faults Song, HaiyanX
2018-05-28  5:23   ` Song, HaiyanX
2018-05-28  7:51   ` Laurent Dufour
2018-05-28  8:22     ` Haiyan Song
2018-05-28  8:54       ` Laurent Dufour
2018-05-28 11:04         ` Wang, Kemi
2018-05-28 11:04           ` Wang, Kemi
2018-06-11  7:49         ` Song, HaiyanX
2018-06-11  7:49           ` Song, HaiyanX
2018-06-11 15:15           ` Laurent Dufour
2018-06-19  9:16             ` Haiyan Song
2018-06-19  9:16               ` Haiyan Song
2018-07-02  8:59           ` Laurent Dufour
2018-07-04  3:23             ` Song, HaiyanX
2018-07-04  3:23               ` Song, HaiyanX
2018-07-04  7:51               ` Laurent Dufour
2018-07-04  7:51                 ` Laurent Dufour
2018-07-11 17:05                 ` Laurent Dufour
2018-07-11 17:05                   ` Laurent Dufour
2018-07-13  3:56                   ` Song, HaiyanX
2018-07-17  9:36                     ` Laurent Dufour
2018-07-17  9:36                       ` Laurent Dufour
2018-08-03  6:36                       ` Song, HaiyanX
2018-08-03  6:45                         ` Song, HaiyanX
2018-08-22 14:23                         ` Laurent Dufour
2018-08-22 14:23                           ` Laurent Dufour
2018-09-18  6:42                           ` Song, HaiyanX
2018-11-05 10:42 ` Balbir Singh
2018-11-05 10:42   ` Balbir Singh
2018-11-05 16:08   ` Laurent Dufour
2018-11-05 16:08     ` Laurent Dufour
2018-11-05 16:08     ` Laurent Dufour

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com \
    --to=ldufour@liunx.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=benh@kernel.crashing.org \
    --cc=bsingharora@gmail.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dave@stgolabs.net \
    --cc=haren@linux.vnet.ibm.com \
    --cc=hpa@zytor.com \
    --cc=jack@suse.cz \
    --cc=jglisse@redhat.com \
    --cc=kemi.wang@intel.com \
    --cc=khandual@linux.vnet.ibm.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mhocko@kernel.org \
    --cc=minchan@kernel.org \
    --cc=mingo@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=opensource.ganesh@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=punitagrawal@gmail.com \
    --cc=rientjes@google.com \
    --cc=sergey.senozhatsky.work@gmail.com \
    --cc=sergey.senozhatsky@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vinayakm.list@gmail.com \
    --cc=vinmenon@codeaurora.org \
    --cc=will.deacon@arm.com \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.