From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 03 Jun 2020 16:00:20 -0700
From: Andrew Morton
To: aarcange@redhat.com, akpm@linux-foundation.org, jhubbard@nvidia.com,
 kirill.shutemov@linux.intel.com, linux-mm@kvack.org,
 mike.kravetz@oracle.com, mm-commits@vger.kernel.org, rcampbell@nvidia.com,
 torvalds@linux-foundation.org, william.kucharski@oracle.com,
 yang.shi@linux.alibaba.com, ziy@nvidia.com
Subject: [patch 065/131] khugepaged: allow to collapse a page shared across fork
Message-ID: <20200603230020.0ZpQAslEu%akpm@linux-foundation.org>
In-Reply-To: <20200603155549.e041363450869eaae4c7f05b@linux-foundation.org>

From: "Kirill A. Shutemov"
Subject: khugepaged: allow to collapse a page shared across fork

A page can be included in the collapse as long as it doesn't have extra
pins (from GUP or otherwise).  The logic that checks the refcount is moved
to a separate function.  For pages in the swap cache, compound_nr(page) is
added to the expected refcount in order to handle the compound page case.
This is in preparation for the following patch.

The VM_BUG_ON_PAGE() in __collapse_huge_page_copy() is removed, as the
invariant it checks is no longer valid: the source page can now be mapped
multiple times.
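
As a rough illustration (not part of the patch: the struct and field names
below are made-up user-space stand-ins for struct page state, not kernel
API), the expected-refcount arithmetic works out like this:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical user-space model of the relevant struct page state;
 * the field names are illustrative only.
 */
struct page_snapshot {
	int page_count;      /* all references: mappings, pins, swap cache */
	int total_mapcount;  /* references coming from page tables only */
	int compound_nr;     /* number of base pages (512 for a 2MB THP) */
	bool in_swap_cache;
};

/* Mirrors the arithmetic of the patch's is_refcount_suitable(). */
static bool refcount_suitable(const struct page_snapshot *p)
{
	int expected = p->total_mapcount;

	/* The swap cache holds one reference per base page. */
	if (p->in_swap_cache)
		expected += p->compound_nr;

	return p->page_count == expected;
}

int main(void)
{
	/* Page mapped by parent and forked child: no extra pins. */
	struct page_snapshot shared = {
		.page_count = 2, .total_mapcount = 2, .compound_nr = 1,
	};
	/* Same page with one extra GUP pin on top of the mappings. */
	struct page_snapshot pinned = {
		.page_count = 3, .total_mapcount = 2, .compound_nr = 1,
	};

	printf("shared across fork: %s\n",
	       refcount_suitable(&shared) ? "collapse ok" : "skip");
	printf("with a GUP pin:     %s\n",
	       refcount_suitable(&pinned) ? "collapse ok" : "skip");
	return 0;
}

A page whose references all come from page tables (plus the swap cache,
if any) passes the check; any surplus reference is treated as an external
pin and blocks the collapse.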

[yang.shi@linux.alibaba.com: remove error message when checking external pins]
  Link: http://lkml.kernel.org/r/1589317383-9595-1-git-send-email-yang.shi@linux.alibaba.com
[cai@lca.pw: fix set-but-not-used warning]
  Link: http://lkml.kernel.org/r/20200521145644.GA6367@ovpn-112-192.phx2.redhat.com
Link: http://lkml.kernel.org/r/20200416160026.16538-6-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Yang Shi
Reviewed-by: William Kucharski
Reviewed-by: Zi Yan
Tested-by: Zi Yan
Acked-by: Yang Shi
Reviewed-by: John Hubbard
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Ralph Campbell
Signed-off-by: Andrew Morton
---

 mm/khugepaged.c |   46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

--- a/mm/khugepaged.c~khugepaged-allow-to-collapse-a-page-shared-across-fork
+++ a/mm/khugepaged.c
@@ -526,6 +526,17 @@ static void release_pte_pages(pte_t *pte
 	}
 }
 
+static bool is_refcount_suitable(struct page *page)
+{
+	int expected_refcount;
+
+	expected_refcount = total_mapcount(page);
+	if (PageSwapCache(page))
+		expected_refcount += compound_nr(page);
+
+	return page_count(page) == expected_refcount;
+}
+
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte)
@@ -578,11 +589,17 @@ static int __collapse_huge_page_isolate(
 		}
 
 		/*
-		 * cannot use mapcount: can't collapse if there's a gup pin.
-		 * The page must only be referenced by the scanned process
-		 * and page swap cache.
+		 * Check if the page has any GUP (or other external) pins.
+		 *
+		 * The page table that maps the page has already been
+		 * unlinked from the page table tree, so this process
+		 * cannot get an additional pin on the page.
+		 *
+		 * New pins can come later if the page is shared across fork,
+		 * but not from this process.  The other process cannot write
+		 * to the page, only trigger CoW.
 		 */
-		if (page_count(page) != 1 + PageSwapCache(page)) {
+		if (!is_refcount_suitable(page)) {
 			unlock_page(page);
 			result = SCAN_PAGE_COUNT;
 			goto out;
@@ -669,7 +686,6 @@ static void __collapse_huge_page_copy(pt
 		} else {
 			src_page = pte_page(pteval);
 			copy_user_highpage(page, src_page, address, vma);
-			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
 			/*
 			 * ptl mostly unnecessary, but preempt has to
@@ -1221,11 +1237,23 @@ static int khugepaged_scan_pmd(struct mm
 		}
 
 		/*
-		 * cannot use mapcount: can't collapse if there's a gup pin.
-		 * The page must only be referenced by the scanned process
-		 * and page swap cache.
+		 * Check if the page has any GUP (or other external) pins.
+		 *
+		 * Here the check is racy: it may see total_mapcount >
+		 * refcount in some cases.
+		 * For example, one process with one forked child process.
+		 * The parent has the PMD split due to MADV_DONTNEED, then
+		 * the child is trying to unmap the whole PMD, but khugepaged
+		 * may be scanning the parent between the child clearing the
+		 * PageDoubleMap flag and decrementing the mapcount, so
+		 * khugepaged may see total_mapcount > refcount.
+		 *
+		 * But such a case is ephemeral; we can always retry the
+		 * collapse later.  However, it may report a false positive
+		 * if the page has excessive GUP pins (e.g. 512).  Since the
+		 * same check will be done again later, the risk seems low.
 		 */
-		if (page_count(page) != 1 + PageSwapCache(page)) {
+		if (!is_refcount_suitable(page)) {
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
_
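
For completeness, here is a minimal user-space sketch (my own illustration,
not part of this series; it assumes THP and khugepaged are enabled, and
whether a collapse actually happens is timing- and configuration-dependent)
of the newly allowed scenario, a PMD-sized anonymous region shared across
fork():

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION (2UL * 1024 * 1024)	/* one PMD-sized region */

int main(void)
{
	/* Anonymous region, hinted to khugepaged as a collapse candidate. */
	char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, REGION);			/* fault in every base page */
	if (madvise(p, REGION, MADV_HUGEPAGE))
		perror("madvise");

	/*
	 * After fork() each base page is mapped by both processes, so
	 * page_count == total_mapcount == 2.  The old check rejected
	 * this; the new is_refcount_suitable() accepts it as long as
	 * there are no additional (e.g. GUP) pins.
	 */
	pid_t child = fork();
	if (child == 0) {
		sleep(60);			/* keep the shared mappings alive */
		_exit(0);
	}

	sleep(60);	/* give khugepaged time to scan this mm */
	/* Inspect AnonHugePages for this range in /proc/self/smaps. */
	waitpid(child, NULL, 0);
	return 0;
}

Before this patch, the page_count(page) != 1 + PageSwapCache(page) check
skipped every page in such a region because each page is referenced by two
processes; with is_refcount_suitable() the shared region becomes eligible
for collapse.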