Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

From: Laurent Dufour <ldufour@liunx.vnet.ibm.com>
To: vinayak menon <vinayakm.list@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox <willy@infradead.org>,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	hpa@zytor.com, Will Deacon <will.deacon@arm.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	kemi.wang@intel.com, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	punitagrawal@gmail.com, yang.shi@linux.alibaba.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	Balbir Singh <bsingharora@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org,
	Vinayak Menon <vinmenon@codeaurora.org>
Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
Date: Mon, 5 Nov 2018 19:22:50 +0100	[thread overview]
Message-ID: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> (raw)
In-Reply-To: <CAOaiJ-nJ_=+diXz8ji42Rro3Mj16C1=NpenRuY3-mjs_GhR4mA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4478 bytes --]

Le 05/11/2018 à 08:04, vinayak menon a écrit :
> Hi Laurent,
> 
> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
> <ldufour@linux.vnet.ibm.com> wrote:
>>
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modification done in :
>>          - madvise()
>>          - mpol_rebind_policy()
>>          - vma_replace_policy()
>>          - change_prot_numa()
>>          - mlock(), munlock()
>>          - mprotect()
>>          - mmap_region()
>>          - collapse_huge_page()
>>          - userfaultd registering services
>>
>> In addition, VMA fields which will be read during the speculative fault
>> path needs to be written using WRITE_ONCE to prevent write to be split
>> and intermediate values to be pushed to other CPUs.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>>   fs/proc/task_mmu.c |  5 ++++-
>>   fs/userfaultfd.c   | 17 +++++++++++++----
>>   mm/khugepaged.c    |  3 +++
>>   mm/madvise.c       |  6 +++++-
>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
>>   mm/mlock.c         | 13 ++++++++-----
>>   mm/mmap.c          | 22 +++++++++++++---------
>>   mm/mprotect.c      |  4 +++-
>>   mm/swap_state.c    |  8 ++++++--
>>   9 files changed, 89 insertions(+), 40 deletions(-)
>>
>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>                                  struct vm_fault *vmf)
>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>                                       unsigned long *start,
>>                                       unsigned long *end)
>>   {
>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>                        PFN_DOWN(faddr & PMD_MASK));
>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>   }
>>
>> --
>> 2.7.4
>>
> 
> I have got a crash on 4.14 kernel with speculative page faults enabled
> and here is my analysis of the problem.
> The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.

> 
> [23409.303395]  el1_da+0x24/0x84
> [23409.303400]  __radix_tree_lookup+0x8/0x90
> [23409.303407]  find_get_entry+0x64/0x14c
> [23409.303410]  pagecache_get_page+0x5c/0x27c
> [23409.303416]  __read_swap_cache_async+0x80/0x260
> [23409.303420]  swap_vma_readahead+0x264/0x37c
> [23409.303423]  swapin_readahead+0x5c/0x6c
> [23409.303428]  do_swap_page+0x128/0x6e4
> [23409.303431]  handle_pte_fault+0x230/0xca4
> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
> [23409.303438]  do_page_fault+0x228/0x3e8
> [23409.303442]  do_translation_fault+0x50/0x6c
> [23409.303445]  do_mem_abort+0x5c/0xe0
> [23409.303447]  el0_da+0x20/0x24
> 
> Process A accesses address ADDR (part of VMA A) and that results in a
> translation fault.
> Kernel enters __handle_speculative_fault to fix the fault.
> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
> from speculative path.
> During this time, another process B which shares the same mm, does a
> mprotect from another CPU which follows
> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B.
> After the split, ADDR falls into VMA B, but process A is still using
> VMA A.
> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end.
> swap_vma_readahead->swap_ra_info uses start and end of vma to
> calculate ptes and nr_pte, which goes wrong due to this and finally
> resulting in wrong "entry" passed to
> swap_vma_readahead->__read_swap_cache_async, and in turn causing
> invalid swapper_space
> being passed to __read_swap_cache_async->find_get_page, causing an abort.
> 
> The fix I have tried is to cache vm_start and vm_end also in vmf and
> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
> send
> the patch I am a using if you feel that is the right thing to do.

I think the best would be to don't do swap readahead during the 
speculatvive page fault. If the page is found in the swap cache, that's 
fine, but otherwise, we should f	allback to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a 
try. I'll review that for the next series.

Thanks,
Laurent.

[-- Attachment #2: 0001-mm-don-t-do-swap-readahead-during-speculative-page-f.patch --]
[-- Type: text/plain, Size: 1507 bytes --]

From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raise a panic in swap_vma_readahead() because the VMA's boundaries
were not more matching the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
+		} else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+			/*
+			 * Don't try readahead during a speculative page fault as
+			 * the VMA's boundaries may change in our back.
+			 * If the page is not in the swap cache and synchronous read
+			 * is disabled, fall back to the regular page fault mechanism.
+			 */
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			ret = VM_FAULT_RETRY;
+			goto out;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
-- 
2.19.1