From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F23CC46465 for ; Tue, 6 Nov 2018 09:29:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 58D1120862 for ; Tue, 6 Nov 2018 09:29:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="bOA5uSLw"; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="UUmOFw+H" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 58D1120862 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729949AbeKFSxU (ORCPT ); Tue, 6 Nov 2018 13:53:20 -0500 Received: from smtp.codeaurora.org ([198.145.29.96]:32990 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729241AbeKFSxT (ORCPT ); Tue, 6 Nov 2018 13:53:19 -0500 Received: by smtp.codeaurora.org (Postfix, from userid 1000) id 1AD2E60ADB; Tue, 6 Nov 2018 09:28:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1541496539; bh=fhMDKdz2qkpoWmcUijENRWLwS+4ODTCiGk3VV+6kj+U=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=bOA5uSLw9J8VDoXcK5YpKTq5RMstoKZD0x5UhmEM8BoD6tsxegV0OS7iwQUr1O4zw WunWSIepLxIPILS8bPxNoh972IwUFDDYObxGkXWiber4NMRIyqr1UcjmP9tC8/0y+J yhWfsAkkq0FLZnqKXR+O7m6mSACtCxYMRoVpRzaQ= Received: from [10.204.82.21] (blr-c-bdr-fw-01_globalnat_allzones-outside.qualcomm.com [103.229.19.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: vinmenon@smtp.codeaurora.org) by smtp.codeaurora.org (Postfix) with ESMTPSA id 3B13760247; Tue, 6 Nov 2018 09:28:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1541496536; bh=fhMDKdz2qkpoWmcUijENRWLwS+4ODTCiGk3VV+6kj+U=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=UUmOFw+Hgn7TaSHyImnZx0o/GBxDs6YraaxCayZeruSJLBWu8RsGSMDa3ZLOxt0ED C0cAL6OveiDa7kQ0wJNAhY0RdHx/Pc2TaKoHYuXr1Q2UtTFMa+ExxEAFCJRe9qx9aF 3BR3N1pAfAqOo8TLrjFc6QqtL+f9VGWIH6Sq4pyk= DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 3B13760247 Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=vinmenon@codeaurora.org Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count To: Laurent Dufour , vinayak menon Cc: Andrew Morton , Michal Hocko , Peter Zijlstra , kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox , khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com, benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org, Thomas Gleixner , Ingo Molnar , hpa@zytor.com, Will Deacon , Sergey Senozhatsky , sergey.senozhatsky.work@gmail.com, Andrea Arcangeli , Alexei Starovoitov , kemi.wang@intel.com, Daniel Jordan , David Rientjes , Jerome Glisse , Ganesh Mahendran , Minchan Kim , punitagrawal@gmail.com, yang.shi@linux.alibaba.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, haren@linux.vnet.ibm.com, npiggin@gmail.com, Balbir Singh , Paul McKenney , Tim Chen , linuxppc-dev@lists.ozlabs.org, x86@kernel.org References: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com> <1526555193-7242-11-git-send-email-ldufour@linux.vnet.ibm.com> <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> From: Vinayak Menon Message-ID: <836276ba-5063-4d65-4649-480c8bd31c45@codeaurora.org> Date: Tue, 6 Nov 2018 14:58:42 +0530 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/5/2018 11:52 PM, Laurent Dufour wrote: > Le 05/11/2018 à 08:04, vinayak menon a écrit : >> Hi Laurent, >> >> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour >> wrote: >>> >>> The VMA sequence count has been introduced to allow fast detection of >>> VMA modification when running a page fault handler without holding >>> the mmap_sem. >>> >>> This patch provides protection against the VMA modification done in : >>>          - madvise() >>>          - mpol_rebind_policy() >>>          - vma_replace_policy() >>>          - change_prot_numa() >>>          - mlock(), munlock() >>>          - mprotect() >>>          - mmap_region() >>>          - collapse_huge_page() >>>          - userfaultd registering services >>> >>> In addition, VMA fields which will be read during the speculative fault >>> path needs to be written using WRITE_ONCE to prevent write to be split >>> and intermediate values to be pushed to other CPUs. >>> >>> Signed-off-by: Laurent Dufour >>> --- >>>   fs/proc/task_mmu.c |  5 ++++- >>>   fs/userfaultfd.c   | 17 +++++++++++++---- >>>   mm/khugepaged.c    |  3 +++ >>>   mm/madvise.c       |  6 +++++- >>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++----------------- >>>   mm/mlock.c         | 13 ++++++++----- >>>   mm/mmap.c          | 22 +++++++++++++--------- >>>   mm/mprotect.c      |  4 +++- >>>   mm/swap_state.c    |  8 ++++++-- >>>   9 files changed, 89 insertions(+), 40 deletions(-) >>> >>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, >>>                                  struct vm_fault *vmf) >>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, >>>                                       unsigned long *start, >>>                                       unsigned long *end) >>>   { >>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start), >>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)), >>>                        PFN_DOWN(faddr & PMD_MASK)); >>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end), >>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)), >>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); >>>   } >>> >>> -- >>> 2.7.4 >>> >> >> I have got a crash on 4.14 kernel with speculative page faults enabled >> and here is my analysis of the problem. >> The issue was reported only once. > > Hi Vinayak, > > Thanks for reporting this. > >> >> [23409.303395]  el1_da+0x24/0x84 >> [23409.303400]  __radix_tree_lookup+0x8/0x90 >> [23409.303407]  find_get_entry+0x64/0x14c >> [23409.303410]  pagecache_get_page+0x5c/0x27c >> [23409.303416]  __read_swap_cache_async+0x80/0x260 >> [23409.303420]  swap_vma_readahead+0x264/0x37c >> [23409.303423]  swapin_readahead+0x5c/0x6c >> [23409.303428]  do_swap_page+0x128/0x6e4 >> [23409.303431]  handle_pte_fault+0x230/0xca4 >> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8 >> [23409.303438]  do_page_fault+0x228/0x3e8 >> [23409.303442]  do_translation_fault+0x50/0x6c >> [23409.303445]  do_mem_abort+0x5c/0xe0 >> [23409.303447]  el0_da+0x20/0x24 >> >> Process A accesses address ADDR (part of VMA A) and that results in a >> translation fault. >> Kernel enters __handle_speculative_fault to fix the fault. >> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead >> from speculative path. >> During this time, another process B which shares the same mm, does a >> mprotect from another CPU which follows >> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B. >> After the split, ADDR falls into VMA B, but process A is still using >> VMA A. >> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end. >> swap_vma_readahead->swap_ra_info uses start and end of vma to >> calculate ptes and nr_pte, which goes wrong due to this and finally >> resulting in wrong "entry" passed to >> swap_vma_readahead->__read_swap_cache_async, and in turn causing >> invalid swapper_space >> being passed to __read_swap_cache_async->find_get_page, causing an abort. >> >> The fix I have tried is to cache vm_start and vm_end also in vmf and >> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can >> send >> the patch I am a using if you feel that is the right thing to do. > > I think the best would be to don't do swap readahead during the speculatvive page fault. If the page is found in the swap cache, that's fine, but otherwise, we should f    allback to the regular page fault. > > The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series. > Thanks Laurent. I and going to try this patch. With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in non-speculative fault and a retry, wouldn't this have an impact on some perf numbers ? If so, would caching start and end be a better option ? Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside swapin_readahead, in a way that  swap_cluster_readahead can take the speculative path ? swap_cluster_readahead doesn't seem to use vma values. Thanks, Vinayak From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf1-f199.google.com (mail-pf1-f199.google.com [209.85.210.199]) by kanga.kvack.org (Postfix) with ESMTP id F06D76B02EE for ; Tue, 6 Nov 2018 04:28:59 -0500 (EST) Received: by mail-pf1-f199.google.com with SMTP id g24-v6so12018461pfi.23 for ; Tue, 06 Nov 2018 01:28:59 -0800 (PST) Received: from smtp.codeaurora.org (smtp.codeaurora.org. [198.145.29.96]) by mx.google.com with ESMTPS id h19si19437141pgb.231.2018.11.06.01.28.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 06 Nov 2018 01:28:58 -0800 (PST) Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count References: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com> <1526555193-7242-11-git-send-email-ldufour@linux.vnet.ibm.com> <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> From: Vinayak Menon Message-ID: <836276ba-5063-4d65-4649-480c8bd31c45@codeaurora.org> Date: Tue, 6 Nov 2018 14:58:42 +0530 MIME-Version: 1.0 In-Reply-To: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Content-Language: en-US Sender: owner-linux-mm@kvack.org List-ID: To: Laurent Dufour , vinayak menon Cc: Andrew Morton , Michal Hocko , Peter Zijlstra , kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox , khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com, benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org, Thomas Gleixner , Ingo Molnar , hpa@zytor.com, Will Deacon , Sergey Senozhatsky , sergey.senozhatsky.work@gmail.com, Andrea Arcangeli , Alexei Starovoitov , kemi.wang@intel.com, Daniel Jordan , David Rientjes , Jerome Glisse , Ganesh Mahendran , Minchan Kim , punitagrawal@gmail.com, yang.shi@linux.alibaba.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, haren@linux.vnet.ibm.com, npiggin@gmail.com, Balbir Singh , Paul McKenney , Tim Chen , linuxppc-dev@lists.ozlabs.org, x86@kernel.org On 11/5/2018 11:52 PM, Laurent Dufour wrote: > Le 05/11/2018 A 08:04, vinayak menon a A(C)critA : >> Hi Laurent, >> >> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour >> wrote: >>> >>> The VMA sequence count has been introduced to allow fast detection of >>> VMA modification when running a page fault handler without holding >>> the mmap_sem. >>> >>> This patch provides protection against the VMA modification done in : >>> A A A A A A A A - madvise() >>> A A A A A A A A - mpol_rebind_policy() >>> A A A A A A A A - vma_replace_policy() >>> A A A A A A A A - change_prot_numa() >>> A A A A A A A A - mlock(), munlock() >>> A A A A A A A A - mprotect() >>> A A A A A A A A - mmap_region() >>> A A A A A A A A - collapse_huge_page() >>> A A A A A A A A - userfaultd registering services >>> >>> In addition, VMA fields which will be read during the speculative fault >>> path needs to be written using WRITE_ONCE to prevent write to be split >>> and intermediate values to be pushed to other CPUs. >>> >>> Signed-off-by: Laurent Dufour >>> --- >>> A fs/proc/task_mmu.c |A 5 ++++- >>> A fs/userfaultfd.cA A | 17 +++++++++++++---- >>> A mm/khugepaged.cA A A |A 3 +++ >>> A mm/madvise.cA A A A A A |A 6 +++++- >>> A mm/mempolicy.cA A A A | 51 ++++++++++++++++++++++++++++++++++----------------- >>> A mm/mlock.cA A A A A A A A | 13 ++++++++----- >>> A mm/mmap.cA A A A A A A A A | 22 +++++++++++++--------- >>> A mm/mprotect.cA A A A A |A 4 +++- >>> A mm/swap_state.cA A A |A 8 ++++++-- >>> A 9 files changed, 89 insertions(+), 40 deletions(-) >>> >>> A struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, >>> A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A struct vm_fault *vmf) >>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, >>> A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A unsigned long *start, >>> A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A unsigned long *end) >>> A { >>> -A A A A A A *start = max3(lpfn, PFN_DOWN(vma->vm_start), >>> +A A A A A A *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)), >>> A A A A A A A A A A A A A A A A A A A A A A PFN_DOWN(faddr & PMD_MASK)); >>> -A A A A A A *end = min3(rpfn, PFN_DOWN(vma->vm_end), >>> +A A A A A A *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)), >>> A A A A A A A A A A A A A A A A A A A A PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); >>> A } >>> >>> -- >>> 2.7.4 >>> >> >> I have got a crash on 4.14 kernel with speculative page faults enabled >> and here is my analysis of the problem. >> The issue was reported only once. > > Hi Vinayak, > > Thanks for reporting this. > >> >> [23409.303395]A el1_da+0x24/0x84 >> [23409.303400]A __radix_tree_lookup+0x8/0x90 >> [23409.303407]A find_get_entry+0x64/0x14c >> [23409.303410]A pagecache_get_page+0x5c/0x27c >> [23409.303416]A __read_swap_cache_async+0x80/0x260 >> [23409.303420]A swap_vma_readahead+0x264/0x37c >> [23409.303423]A swapin_readahead+0x5c/0x6c >> [23409.303428]A do_swap_page+0x128/0x6e4 >> [23409.303431]A handle_pte_fault+0x230/0xca4 >> [23409.303435]A __handle_speculative_fault+0x57c/0x7c8 >> [23409.303438]A do_page_fault+0x228/0x3e8 >> [23409.303442]A do_translation_fault+0x50/0x6c >> [23409.303445]A do_mem_abort+0x5c/0xe0 >> [23409.303447]A el0_da+0x20/0x24 >> >> Process A accesses address ADDR (part of VMA A) and that results in a >> translation fault. >> Kernel enters __handle_speculative_fault to fix the fault. >> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead >> from speculative path. >> During this time, another process B which shares the same mm, does a >> mprotect from another CPU which follows >> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B. >> After the split, ADDR falls into VMA B, but process A is still using >> VMA A. >> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end. >> swap_vma_readahead->swap_ra_info uses start and end of vma to >> calculate ptes and nr_pte, which goes wrong due to this and finally >> resulting in wrong "entry" passed to >> swap_vma_readahead->__read_swap_cache_async, and in turn causing >> invalid swapper_space >> being passed to __read_swap_cache_async->find_get_page, causing an abort. >> >> The fix I have tried is to cache vm_start and vm_end also in vmf and >> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can >> send >> the patch I am a using if you feel that is the right thing to do. > > I think the best would be to don't do swap readahead during the speculatvive page fault. If the page is found in the swap cache, that's fine, but otherwise, we should fA A A allback to the regular page fault. > > The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series. > Thanks Laurent. I and going to try this patch. With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in non-speculative fault and a retry, wouldn't this have an impact on some perf numbers ? If so, would caching start and end be a better option ? Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside swapin_readahead, in a way thatA swap_cluster_readahead can take the speculative path ? swap_cluster_readahead doesn't seem to use vma values. Thanks, Vinayak From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3CD88C32789 for ; Tue, 6 Nov 2018 20:10:57 +0000 (UTC) Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 7FA7B2085B for ; Tue, 6 Nov 2018 20:10:56 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="FKyx9nnr"; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="UUmOFw+H" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7FA7B2085B Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 42qLJt3GTHzF3FK for ; Wed, 7 Nov 2018 07:10:54 +1100 (AEDT) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: lists.ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=codeaurora.org header.i=@codeaurora.org header.b="FKyx9nnr"; dkim=fail reason="signature verification failed" (1024-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="UUmOFw+H"; dkim-atps=neutral Authentication-Results: lists.ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=codeaurora.org (client-ip=198.145.29.96; helo=smtp.codeaurora.org; envelope-from=vinmenon@codeaurora.org; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: lists.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=codeaurora.org header.i=@codeaurora.org header.b="FKyx9nnr"; dkim=pass (1024-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="UUmOFw+H"; dkim-atps=neutral Received: from smtp.codeaurora.org (smtp.codeaurora.org [198.145.29.96]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 42q9kb2B0lzF2Hd for ; Wed, 7 Nov 2018 00:44:07 +1100 (AEDT) Received: by smtp.codeaurora.org (Postfix, from userid 1000) id 3C890607F7; Tue, 6 Nov 2018 09:28:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1541496538; bh=fhMDKdz2qkpoWmcUijENRWLwS+4ODTCiGk3VV+6kj+U=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=FKyx9nnrJeiymla2bVXSTe9N0dTgqbvARjkwE1C6VW4zxkOKHwK0UbdNQye8PKiGi WFm6nuXIcXX29C4Zmuzj6TrU9P/PEokLyltsgr7hHcgob0yhx0W117Uek5BLl8ajPB r7ZyZkqLRqeY45OpFns68qnA//oi3iPeP/GiwISg= Received: from [10.204.82.21] (blr-c-bdr-fw-01_globalnat_allzones-outside.qualcomm.com [103.229.19.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: vinmenon@smtp.codeaurora.org) by smtp.codeaurora.org (Postfix) with ESMTPSA id 3B13760247; Tue, 6 Nov 2018 09:28:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1541496536; bh=fhMDKdz2qkpoWmcUijENRWLwS+4ODTCiGk3VV+6kj+U=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=UUmOFw+Hgn7TaSHyImnZx0o/GBxDs6YraaxCayZeruSJLBWu8RsGSMDa3ZLOxt0ED C0cAL6OveiDa7kQ0wJNAhY0RdHx/Pc2TaKoHYuXr1Q2UtTFMa+ExxEAFCJRe9qx9aF 3BR3N1pAfAqOo8TLrjFc6QqtL+f9VGWIH6Sq4pyk= DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 3B13760247 Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=vinmenon@codeaurora.org Subject: Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count To: Laurent Dufour , vinayak menon References: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com> <1526555193-7242-11-git-send-email-ldufour@linux.vnet.ibm.com> <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> From: Vinayak Menon Message-ID: <836276ba-5063-4d65-4649-480c8bd31c45@codeaurora.org> Date: Tue, 6 Nov 2018 14:58:42 +0530 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: <239bab9c-e38c-951d-9114-34227b1dc94c@liunx.vnet.ibm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Content-Language: en-US X-Mailman-Approved-At: Wed, 07 Nov 2018 07:07:30 +1100 X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: jack@suse.cz, sergey.senozhatsky.work@gmail.com, Peter Zijlstra , Will Deacon , Michal Hocko , linux-mm@kvack.org, paulus@samba.org, punitagrawal@gmail.com, hpa@zytor.com, Alexei Starovoitov , khandual@linux.vnet.ibm.com, Andrea Arcangeli , ak@linux.intel.com, Minchan Kim , x86@kernel.org, Matthew Wilcox , Daniel Jordan , Ingo Molnar , David Rientjes , Paul McKenney , npiggin@gmail.com, Jerome Glisse , dave@stgolabs.net, kemi.wang@intel.com, kirill@shutemov.name, Thomas Gleixner , Ganesh Mahendran , yang.shi@linux.alibaba.com, linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org, Sergey Senozhatsky , aneesh.kumar@linux.vnet.ibm.com, Andrew Morton , Tim Chen , haren@linux.vnet.ibm.com Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" On 11/5/2018 11:52 PM, Laurent Dufour wrote: > Le 05/11/2018 à 08:04, vinayak menon a écrit : >> Hi Laurent, >> >> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour >> wrote: >>> >>> The VMA sequence count has been introduced to allow fast detection of >>> VMA modification when running a page fault handler without holding >>> the mmap_sem. >>> >>> This patch provides protection against the VMA modification done in : >>>          - madvise() >>>          - mpol_rebind_policy() >>>          - vma_replace_policy() >>>          - change_prot_numa() >>>          - mlock(), munlock() >>>          - mprotect() >>>          - mmap_region() >>>          - collapse_huge_page() >>>          - userfaultd registering services >>> >>> In addition, VMA fields which will be read during the speculative fault >>> path needs to be written using WRITE_ONCE to prevent write to be split >>> and intermediate values to be pushed to other CPUs. >>> >>> Signed-off-by: Laurent Dufour >>> --- >>>   fs/proc/task_mmu.c |  5 ++++- >>>   fs/userfaultfd.c   | 17 +++++++++++++---- >>>   mm/khugepaged.c    |  3 +++ >>>   mm/madvise.c       |  6 +++++- >>>   mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++----------------- >>>   mm/mlock.c         | 13 ++++++++----- >>>   mm/mmap.c          | 22 +++++++++++++--------- >>>   mm/mprotect.c      |  4 +++- >>>   mm/swap_state.c    |  8 ++++++-- >>>   9 files changed, 89 insertions(+), 40 deletions(-) >>> >>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, >>>                                  struct vm_fault *vmf) >>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, >>>                                       unsigned long *start, >>>                                       unsigned long *end) >>>   { >>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start), >>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)), >>>                        PFN_DOWN(faddr & PMD_MASK)); >>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end), >>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)), >>>                      PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); >>>   } >>> >>> -- >>> 2.7.4 >>> >> >> I have got a crash on 4.14 kernel with speculative page faults enabled >> and here is my analysis of the problem. >> The issue was reported only once. > > Hi Vinayak, > > Thanks for reporting this. > >> >> [23409.303395]  el1_da+0x24/0x84 >> [23409.303400]  __radix_tree_lookup+0x8/0x90 >> [23409.303407]  find_get_entry+0x64/0x14c >> [23409.303410]  pagecache_get_page+0x5c/0x27c >> [23409.303416]  __read_swap_cache_async+0x80/0x260 >> [23409.303420]  swap_vma_readahead+0x264/0x37c >> [23409.303423]  swapin_readahead+0x5c/0x6c >> [23409.303428]  do_swap_page+0x128/0x6e4 >> [23409.303431]  handle_pte_fault+0x230/0xca4 >> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8 >> [23409.303438]  do_page_fault+0x228/0x3e8 >> [23409.303442]  do_translation_fault+0x50/0x6c >> [23409.303445]  do_mem_abort+0x5c/0xe0 >> [23409.303447]  el0_da+0x20/0x24 >> >> Process A accesses address ADDR (part of VMA A) and that results in a >> translation fault. >> Kernel enters __handle_speculative_fault to fix the fault. >> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead >> from speculative path. >> During this time, another process B which shares the same mm, does a >> mprotect from another CPU which follows >> mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B. >> After the split, ADDR falls into VMA B, but process A is still using >> VMA A. >> Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end. >> swap_vma_readahead->swap_ra_info uses start and end of vma to >> calculate ptes and nr_pte, which goes wrong due to this and finally >> resulting in wrong "entry" passed to >> swap_vma_readahead->__read_swap_cache_async, and in turn causing >> invalid swapper_space >> being passed to __read_swap_cache_async->find_get_page, causing an abort. >> >> The fix I have tried is to cache vm_start and vm_end also in vmf and >> use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can >> send >> the patch I am a using if you feel that is the right thing to do. > > I think the best would be to don't do swap readahead during the speculatvive page fault. If the page is found in the swap cache, that's fine, but otherwise, we should f    allback to the regular page fault. > > The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series. > Thanks Laurent. I and going to try this patch. With this patch, since all non-SWP_SYNCHRONOUS_IO swapins result in non-speculative fault and a retry, wouldn't this have an impact on some perf numbers ? If so, would caching start and end be a better option ? Also, would it make sense to move the FAULT_FLAG_SPECULATIVE check inside swapin_readahead, in a way that  swap_cluster_readahead can take the speculative path ? swap_cluster_readahead doesn't seem to use vma values. Thanks, Vinayak