From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 097FDC76199 for ; Sun, 16 Feb 2020 19:18:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D22C12072C for ; Sun, 16 Feb 2020 19:18:20 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="gaun3hLi" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726551AbgBPTSO (ORCPT ); Sun, 16 Feb 2020 14:18:14 -0500 Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:39921 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725989AbgBPTSN (ORCPT ); Sun, 16 Feb 2020 14:18:13 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1581880692; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=tuGE6PQn1KB7+DuvWdmcQTP1e0rV79mM3JVVfbfKivs=; b=gaun3hLiY+VTGl0Gu9KOt9ds777Ndo0F5Qz39M94O9nipszvXbLalDSHFHgUfITxHaK+TK +lp4MrWBYNDjNuVP6e+aXFgzrPJAq3C2yJa9LNU3A6+8ufVdyBhFJbhDbsC4rynP47VDdQ jh2YyiUQ/jCOUEULtR7mnrMNiLf+bMo= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-24-WXW4iEjeON-G447_iBq-nw-1; Sun, 16 Feb 2020 14:18:06 -0500 X-MC-Unique: WXW4iEjeON-G447_iBq-nw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 56D21107ACC7; Sun, 16 Feb 2020 19:18:04 +0000 (UTC) Received: from t490s.redhat.com (ovpn-116-86.phx2.redhat.com [10.3.116.86]) by smtp.corp.redhat.com (Postfix) with ESMTP id 0295319756; Sun, 16 Feb 2020 19:18:02 +0000 (UTC) From: Rafael Aquini To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, mgorman@techsingularity.net, akpm@linux-foundation.org, mhocko@suse.com, vbabka@suse.cz Subject: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa Date: Sun, 16 Feb 2020 14:18:00 -0500 Message-Id: <20200216191800.22423-1-aquini@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Mel Gorman A user reported a bug against a distribution kernel while running a proprietary workload described as "memory intensive that is not swapping" that is expected to apply to mainline kernels. The workload is read/write/modifying ranges of memory and checking the contents. The= y reported that within a few hours that a bad PMD would be reported follo= wed by a memory corruption where expected data was all zeros. A partial re= port of the bad PMD looked like [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000= 002e0396009e2) [ 5195.341184] ------------[ cut here ]------------ [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35! .... [ 5195.410033] Call Trace: [ 5195.410471] [] change_protection_range+0x7dd/0x93= 0 [ 5195.410716] [] change_prot_numa+0x18/0x30 [ 5195.410918] [] task_numa_work+0x1fe/0x310 [ 5195.411200] [] task_work_run+0x72/0x90 [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2 [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40 [ 5195.411739] [] retint_user+0x8/0x10 Decoding revealed that the PMD was a valid prot_numa PMD and the bad PM= D was a false detection. The bug does not trigger if automatic NUMA balan= cing or transparent huge pages is disabled. The bug is due a race in change_pmd_range between a pmd_trans_huge and pmd_nond_or_clear_bad check without any locks held. During the pmd_tran= s_huge check, a parallel protection update under lock can have cleared the PMD and filled it with a prot_numa entry between the transhuge check and th= e pmd_none_or_clear_bad check. While this could be fixed with heavy locking, it's only necessary to make a copy of the PMD on the stack during change_pmd_range and avoid races. A new helper is created for this as the check if quite subtle an= d the existing similar helpful is not suitable. This passed 154 hours of test= ing (usually triggers between 20 minutes and 24 hours) without detecting ba= d PMDs or corruption. A basic test of an autonuma-intensive workload show= ed no significant change in behaviour. Although Mel withdrew the patch on the face of LKML comment https://lkml.= org/lkml/2017/4/10/922 the race window aforementioned is still open, and we have reports of Linp= ack test reporting bad residuals after the bad PMD warning is observed. In addition to that, bad= rss-counter and non-zero pgtables assertions are triggered on mm teardown for the task hi= tting the bad PMD. host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2= d2008e7) .... host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:51= 2 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096 The issue is observed on a v4.18-based distribution kernel, but the race = window is expected to be applicable to mainline kernels, as well. Signed-off-by: Mel Gorman Cc: stable@vger.kernel.org Signed-off-by: Rafael Aquini --- mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 7a8e84f86831..9ea8cc0ab2fd 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area= _struct *vma, pmd_t *pmd, return pages; } =20 +/* + * Used when setting automatic NUMA hinting protection where it is + * critical that a numa hinting PMD is not confused with a bad PMD. + */ +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd) +{ + pmd_t pmdval =3D pmd_read_atomic(pmd); + + /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + barrier(); +#endif + + if (pmd_none(pmdval)) + return 1; + if (pmd_trans_huge(pmdval)) + return 0; + if (unlikely(pmd_bad(pmdval))) { + pmd_clear_bad(pmd); + return 1; + } + + return 0; +} + static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable, int prot_numa) @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct = vm_area_struct *vma, unsigned long this_pages; =20 next =3D pmd_addr_end(addr, end); - if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd) - && pmd_none_or_clear_bad(pmd)) + + /* + * Automatic NUMA balancing walks the tables with mmap_sem + * held for read. It's possible a parallel update to occur + * between pmd_trans_huge() and a pmd_none_or_clear_bad() + * check leading to a false positive and clearing. + * Hence, it's ecessary to atomically read the PMD value + * for all the checks. + */ + if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && + pmd_none_or_clear_bad_unless_trans_huge(pmd)) goto next; =20 /* invoke the mmu notifier if the pmd is populated */ --=20 2.24.1