From: Hillf Danton <hdanton@sina.com>
To: Hugh Dickins
Cc: Andrew Morton, Sergey Senozhatsky, "Kirill A. Shutemov",
	Suleiman Souhlal, Jann Horn, Hillf Danton, Matthew Wilcox,
	Andrea Arcangeli, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: thp: fix MADV_REMOVE deadlock on shmem THP
Date: Sun, 17 Jan 2021 11:48:03 +0800
Message-Id: <20210117034803.16648-1-hdanton@sina.com>

On Sat, 16 Jan 2021 14:16:24 -0800 (PST), Hugh Dickins wrote:
> 
> Sergey reported deadlock between kswapd correctly doing its usual
> lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem),
> and madvise(MADV_REMOVE) on a madvise(MADV_HUGEPAGE) area doing
> down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).
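
To check my reading, the user-visible ingredients seem to be roughly
the below; a sketch only, not a verified reproducer (it assumes shmem
THP is enabled, e.g. via transparent_hugepage/shmem_enabled, and only
deadlocks with concurrent kswapd reclaim on the same pages; memfd is
just a convenient way to get a shmem mapping; error checking omitted):

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 16UL << 20;		/* several PMD-sized extents */
		int fd = memfd_create("shmem-thp", 0);	/* shmem-backed file */

		ftruncate(fd, len);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		madvise(p, len, MADV_HUGEPAGE);		/* ask for shmem THPs */
		memset(p, 1, len);			/* fault the pages in */
		/*
		 * Punch hole: unmap_mapping_range() -> zap_pmd_range() ->
		 * __split_huge_pmd(), with i_mmap_rwsem held for write.
		 */
		madvise(p, len, MADV_REMOVE);
		return 0;
	}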
> 
> This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
> reaches zap_pmd_range()'s call to __split_huge_pmd(). The same deadlock
> could occur when partially truncating a mapped huge tmpfs file, or using
> fallocate(FALLOC_FL_PUNCH_HOLE) on it.
> 
> __split_huge_pmd()'s page lock was added in 5.8, to make sure that any
> concurrent use of reuse_swap_page() (holding page lock) could not catch
> the anon THP's mapcounts and swapcounts while they were being split.
> 
> Fortunately, reuse_swap_page() is never applied to a shmem or file THP
> (not even by khugepaged, which checks PageSwapCache before calling),
> and anonymous THPs are never created in shmem or file areas: so that
> __split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
> on which there is no risk of deadlock with i_mmap_rwsem.

	CPU0				CPU1
	----				----
	kswapd				madvise
	lock_page(page)			down_write(i_mmap_rwsem)
	down_read(i_mmap_rwsem)		lock_page(page)

Given that nothing is wrong on the reclaimer's side, it is the reverse
locking order on CPU1 that opens the door for this deadlock to strike
again, if I don't misread you.
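
And the inversion is easy to watch hang in isolation; a self-contained
userspace analogue, with a mutex standing in for the page lock and an
rwlock for i_mmap_rwsem (build with -lpthread; it deadlocks by design):

	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_rwlock_t i_mmap_rwsem = PTHREAD_RWLOCK_INITIALIZER;

	static void *kswapd_side(void *arg)
	{
		pthread_mutex_lock(&page_lock);		/* lock_page() */
		sleep(1);				/* let CPU1 take its first lock */
		pthread_rwlock_rdlock(&i_mmap_rwsem);	/* down_read(): waits on writer */
		puts("kswapd side done");		/* never reached */
		return NULL;
	}

	static void *madvise_side(void *arg)
	{
		pthread_rwlock_wrlock(&i_mmap_rwsem);	/* down_write() */
		sleep(1);
		pthread_mutex_lock(&page_lock);		/* lock_page(): waits on holder */
		puts("madvise side done");		/* never reached */
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, kswapd_side, NULL);
		pthread_create(&t2, NULL, madvise_side, NULL);
		pthread_join(t1, NULL);			/* hangs: ABBA deadlock */
		return 0;
	}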
> Reported-by: Sergey Senozhatsky
> Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
> Signed-off-by: Hugh Dickins
> Reviewed-by: Andrea Arcangeli
> Cc: stable@vger.kernel.org
> ---
> 
> The status of reuse_swap_page(), and its use on THPs, is currently under
> discussion, and may need to be changed: but this patch is a simple fix
> to the reported deadlock, which can go in now, and be easily backported
> to whichever stable and longterm releases took in 5.8's c444eb564fb1.
> 
>  mm/huge_memory.c | 37 +++++++++++++++++++++++--------------
>  1 file changed, 23 insertions(+), 14 deletions(-)
> 
> --- 5.11-rc3/mm/huge_memory.c	2020-12-27 20:39:37.667932292 -0800
> +++ linux/mm/huge_memory.c	2021-01-16 08:02:08.265551393 -0800
> @@ -2202,7 +2202,7 @@ void __split_huge_pmd(struct vm_area_str
>  {
>  	spinlock_t *ptl;
>  	struct mmu_notifier_range range;
> -	bool was_locked = false;
> +	bool do_unlock_page = false;
>  	pmd_t _pmd;
>  
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> @@ -2218,7 +2218,6 @@ void __split_huge_pmd(struct vm_area_str
>  	VM_BUG_ON(freeze && !page);
>  	if (page) {
>  		VM_WARN_ON_ONCE(!PageLocked(page));
> -		was_locked = true;
>  		if (page != pmd_page(*pmd))
>  			goto out;
>  	}
> @@ -2227,19 +2226,29 @@ repeat:
>  	if (pmd_trans_huge(*pmd)) {
>  		if (!page) {
>  			page = pmd_page(*pmd);
> -			if (unlikely(!trylock_page(page))) {
> -				get_page(page);
> -				_pmd = *pmd;
> -				spin_unlock(ptl);
> -				lock_page(page);
> -				spin_lock(ptl);
> -				if (unlikely(!pmd_same(*pmd, _pmd))) {
> -					unlock_page(page);
> +			/*
> +			 * An anonymous page must be locked, to ensure that a
> +			 * concurrent reuse_swap_page() sees stable mapcount;
> +			 * but reuse_swap_page() is not used on shmem or file,
> +			 * and page lock must not be taken when zap_pmd_range()
> +			 * calls __split_huge_pmd() while i_mmap_lock is held.
> +			 */
> +			if (PageAnon(page)) {
> +				if (unlikely(!trylock_page(page))) {
> +					get_page(page);
> +					_pmd = *pmd;
> +					spin_unlock(ptl);
> +					lock_page(page);
> +					spin_lock(ptl);
> +					if (unlikely(!pmd_same(*pmd, _pmd))) {
> +						unlock_page(page);
> +						put_page(page);
> +						page = NULL;
> +						goto repeat;
> +					}
>  					put_page(page);
> -					page = NULL;
> -					goto repeat;
>  				}
> -				put_page(page);
> +				do_unlock_page = true;
>  			}
>  		}
>  		if (PageMlocked(page))
> @@ -2249,7 +2258,7 @@ repeat:
>  		__split_huge_pmd_locked(vma, pmd, range.start, freeze);
> out:
>  	spin_unlock(ptl);
> -	if (!was_locked && page)
> +	if (do_unlock_page)
>  		unlock_page(page);
>  	/*
>  	 * No need to double call mmu_notifier->invalidate_range() callback.
> 
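
The retry dance the patch keeps for the anon case is the usual "drop
the spinlock, take the sleeping lock, retake and revalidate" idiom.
A self-contained userspace analogue, with illustrative names only
(`fast' for the pmd ptl, `slow' for the page lock, `generation' for
the pmd value being snapshotted; build with -lpthread):

	#include <pthread.h>
	#include <stdio.h>

	static pthread_spinlock_t fast;		/* stands in for ptl */
	static pthread_mutex_t slow = PTHREAD_MUTEX_INITIALIZER; /* page lock */
	static unsigned long generation;	/* stands in for *pmd */

	static void lock_both(void)
	{
		unsigned long snapshot;
	retry:
		pthread_spin_lock(&fast);
		if (pthread_mutex_trylock(&slow) != 0) {
			snapshot = generation;		/* state seen under `fast' */
			pthread_spin_unlock(&fast);	/* never sleep under a spinlock */
			pthread_mutex_lock(&slow);	/* may block */
			pthread_spin_lock(&fast);
			if (generation != snapshot) {	/* changed while we slept? */
				pthread_mutex_unlock(&slow);
				pthread_spin_unlock(&fast);
				goto retry;		/* start over, like goto repeat */
			}
		}
		/* both locks held, snapshot still valid */
	}

	int main(void)
	{
		pthread_spin_init(&fast, PTHREAD_PROCESS_PRIVATE);
		lock_both();
		pthread_mutex_unlock(&slow);
		pthread_spin_unlock(&fast);
		puts("took both locks in the safe order");
		return 0;
	}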