From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 744DBC43461 for ; Thu, 20 May 2021 19:21:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 59B31608FE for ; Thu, 20 May 2021 19:21:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239761AbhETTWj (ORCPT ); Thu, 20 May 2021 15:22:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37816 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238137AbhETTWg (ORCPT ); Thu, 20 May 2021 15:22:36 -0400 Received: from mail-pl1-x62b.google.com (mail-pl1-x62b.google.com [IPv6:2607:f8b0:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FB2CC0613CE for ; Thu, 20 May 2021 12:21:14 -0700 (PDT) Received: by mail-pl1-x62b.google.com with SMTP id e15so3042502plh.1 for ; Thu, 20 May 2021 12:21:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=CeN9GBd+V6Hx1HhJKDIlPQtnd+p0opbcdXSqGR3n21U=; b=u1IVQcySroL4ZbG3Jr/riMk3838ALT3P/KkmuA20Ph2fBVKn9gkslFjYIQwCIw1wV6 0W30HB14NFhfOFQm+pz7rbUT6YG/IMYVn18yVWSLg8SxnHxBYBIwuxIsvxjSLxPqXWkL qy4SU1WGw8AinnnGpX0U9OCnli+tklLLkqGIVJICDlwIoHY39w/ykal55s/YVndhxUxq e4YoaYVyjNqEKEqxIGDbJmYxVSBHSmcK4FfOyD6L7D6MfQdq82bMw6m+MrB779W/3XVb fH3Qxp8J0nU0O1ue6HZQ/Hr0zlIoSzEchIs1pZKTwPAd7Q5Uy8deUyjDnlPi34H/nRmY gTcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=CeN9GBd+V6Hx1HhJKDIlPQtnd+p0opbcdXSqGR3n21U=; b=emty0F+p9L/cg3qRMjET4tLAZhCuJEpl3rKCBP0JLqGJV1kOJ6n4IFAUNZxM3jhVUC 1PWGdwKkEq+jB8e7CorAONCSSlQtljLPtJzfNUa2JbSj9Ap+zyvSKaXSV2mofE39htB8 U+ZFmrtZFLYHEu9pvEnEQlCOgAyzJ3tWfIbEzeqWOA7dG4bFaJBs4C6cofrVm/itXa0h e0St7pMOep9d2lWM9dSWsfZ7Zz6waTnGHFN5TtrK3YbTlLX1uRbucwsQQcxrBMiwX3pA kDYyXC5oRpKW0vyKkucZ/kImFagCX0dKOgQDNcOKVsmTFmy14MwUd1n8vdLI5V0MjSly /f+w== X-Gm-Message-State: AOAM531U9kLHKJXMTL6GJ1m2bc4sYcJXxmlJrhIiMHtIdF5C99cMRFuL vASAy8EeZ2Pq7m+9rbjvVcBGRvZEnGzsHIP3/CDzUQ== X-Google-Smtp-Source: ABdhPJw1xiHvxvKZB1SEt+sOCQGr0Dt7pF3pl7FeojUfoj8x7Ugg9sps1kdQdQF+DJgKcOFNqUCjtI6c/IwnqVP/YTo= X-Received: by 2002:a17:90b:1885:: with SMTP id mn5mr6520504pjb.24.1621538473309; Thu, 20 May 2021 12:21:13 -0700 (PDT) MIME-Version: 1.0 References: <20210513234309.366727-1-almasrymina@google.com> <09dc0712-48e8-8ba2-f170-4c2febcfff83@oracle.com> In-Reply-To: From: Mina Almasry Date: Thu, 20 May 2021 12:21:02 -0700 Message-ID: Subject: Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY To: Mike Kravetz Cc: Axel Rasmussen , Peter Xu , Linux-MM , Andrew Morton , open list Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, May 20, 2021 at 12:18 PM Mina Almasry wrote: > > On Thu, May 13, 2021 at 5:14 PM Mike Kravetz wrote: > > > > On 5/13/21 4:49 PM, Mina Almasry wrote: > > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry wrote: > > >> > > >> When hugetlb_mcopy_atomic_pte() is called with: > > >> - mode==MCOPY_ATOMIC_NORMAL and, > > >> - we already have a page in the page cache corresponding to the > > >> associated address, > > >> > > >> We will allocate a huge page from the reserves, and then fail to insert it > > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST > > >> without allocating a new page as the page already exists in the cache. > > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily > > >> until the extra page is freed. > > >> > > >> To fix this we check if a page exists in the cache, and allocate it and > > >> insert it in the cache immediately while holding the lock. After that we > > >> copy the contents into the page. > > >> > > >> As a side effect of this, pages may exist in the cache for which the > > >> copy failed and for these pages PageUptodate(page) == false. Modify code > > >> that query the cache to handle this correctly. > > >> > > > > > > To be honest, I'm not sure I've done this bit correctly. Please take a > > > look and let me know what you think. It may be too overly complicated > > > to have !PageUptodate() pages in the cache and ask the rest of the > > > code to handle that edge case correctly, but I'm not sure how else to > > > fix this issue. > > > > > > > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to > > hugetlb_no_page. Why? > > > > Consider the case where there is only one reserve left and someone does > > the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and > > consume the reserve (reserve count == 0) and insert the page into the > > cache. Now, if the copy_huge_page_from_user fails we must drop the > > locks/fault mutex to do the copy. While locks are dropped, someone > > faults on the address and ends up in hugetlb_no_page. The page is in > > the cache but not up to date, so we go down the allocate new page path > > and will decrement the reserve count again to cause underflow. > > > > How about this approach? > > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > > that you added. That will catch the race where the page was added to > > the cache before entering the routine. > > - With the above check in place, we only need to worry about the case > > where copy_huge_page_from_user fails and we must drop locks. In this > > case we: > > - Free the page previously allocated. > > - Allocate a 'temporary' huge page without consuming reserves. I'm > > thinking of something similar to page migration. > > - Drop the locks and let the copy_huge_page_from_user be done to the > > temporary page. > > - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > > *pagep case) we need to once again check > > hugetlbfs_pagecache_present. > > - We then try to allocate the huge page which will consume the > > reserve. If successful, copy contents of temporary page to newly > > allocated page. Free temporary page. > > > > There may be issues with this, and I have not given it deep thought. It > > does abuse the temporary huge page concept, but perhaps no more than > > page migration. Things do slow down if the extra page allocation and > > copy is required, but that would only be the case if copy_huge_page_from_user > > needs to be done without locks. Not sure, but hoping that is rare. > > Just following up this a bit: I've implemented this approach locally, > and with it it's passing the test as-is. When I hack the code such > that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into > this edge case, which causes resv_huge_pages to underflow again (this > time permemantly): > > - hugetlb_no_page() is called on an index and a page is allocated and > inserted into the cache consuming the reservation. > - remove_huge_page() is called on this index and the page is removed from cache. > - hugetlb_mcopy_atomic_pte() is called on this index, we do not find > the page in the cache and we trigger this code patch and the copy > fails. > - The allocations in this code path seem to double consume the > reservation and resv_huge_pages underflows. > > I'm looking at this edge case to understand why a prior > remove_huge_page() causes my code to underflow resv_huge_pages. > I should also mention, without a prior remove_huge_page() this code path works fine, so it seems the fact that the reservation is consumed before causes trouble, but I'm not sure why yet. > > -- > > Mike Kravetz From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F5C2C433B4 for ; Thu, 20 May 2021 19:21:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C6BC36135A for ; Thu, 20 May 2021 19:21:15 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C6BC36135A Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 60C706B00B1; Thu, 20 May 2021 15:21:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5BAB56B00B2; Thu, 20 May 2021 15:21:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 45E8B6B00B4; Thu, 20 May 2021 15:21:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0208.hostedemail.com [216.40.44.208]) by kanga.kvack.org (Postfix) with ESMTP id 11EEE6B00B1 for ; Thu, 20 May 2021 15:21:15 -0400 (EDT) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id ACB4610FF2 for ; Thu, 20 May 2021 19:21:14 +0000 (UTC) X-FDA: 78162577668.26.B96AE33 Received: from mail-pj1-f50.google.com (mail-pj1-f50.google.com [209.85.216.50]) by imf05.hostedemail.com (Postfix) with ESMTP id E2C9FE000802 for ; Thu, 20 May 2021 19:21:11 +0000 (UTC) Received: by mail-pj1-f50.google.com with SMTP id gb21-20020a17090b0615b029015d1a863a91so5998184pjb.2 for ; Thu, 20 May 2021 12:21:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=CeN9GBd+V6Hx1HhJKDIlPQtnd+p0opbcdXSqGR3n21U=; b=u1IVQcySroL4ZbG3Jr/riMk3838ALT3P/KkmuA20Ph2fBVKn9gkslFjYIQwCIw1wV6 0W30HB14NFhfOFQm+pz7rbUT6YG/IMYVn18yVWSLg8SxnHxBYBIwuxIsvxjSLxPqXWkL qy4SU1WGw8AinnnGpX0U9OCnli+tklLLkqGIVJICDlwIoHY39w/ykal55s/YVndhxUxq e4YoaYVyjNqEKEqxIGDbJmYxVSBHSmcK4FfOyD6L7D6MfQdq82bMw6m+MrB779W/3XVb fH3Qxp8J0nU0O1ue6HZQ/Hr0zlIoSzEchIs1pZKTwPAd7Q5Uy8deUyjDnlPi34H/nRmY gTcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=CeN9GBd+V6Hx1HhJKDIlPQtnd+p0opbcdXSqGR3n21U=; b=MiDa+p00xRVUOM3TlN22NSQBmrQUGFSPabihb2APQF5es+KJoPt6k2Xi8WXfyAaddw pH1MN/pQm4322I1D2gVnPJpioM0rECNJHyiUwTetDeBVfxyRRVKaE1Ygt32z/1EgTxAS WlYMTtOKNT2lyal7/fBT8GSXhtQIg35X0Fu3ObWY6NJukaMpmfCYfIxs18/G9FgVgmV4 qdKnEQQVpx6YvMzfs0wrO8lGLhBNeBI/qNC5ZCI8Uj2uhVx5noyCU8MBpRww11UkzHsq qM3p5rajFAdQ6Kg7JmmbJUnnKsWmXhxsAGiUcxZ9g8JMMFIOGKT+5S4vLDc10Y4uTO4O QXmQ== X-Gm-Message-State: AOAM531s2tk46Pi82zUO7u1OlVF+jaPHidLBnqLch37GgWFB/lZkLedg ZTDxdOIOvgVOip51WFmSVMD4ZG+3HouZw8tQ7OK63A== X-Google-Smtp-Source: ABdhPJw1xiHvxvKZB1SEt+sOCQGr0Dt7pF3pl7FeojUfoj8x7Ugg9sps1kdQdQF+DJgKcOFNqUCjtI6c/IwnqVP/YTo= X-Received: by 2002:a17:90b:1885:: with SMTP id mn5mr6520504pjb.24.1621538473309; Thu, 20 May 2021 12:21:13 -0700 (PDT) MIME-Version: 1.0 References: <20210513234309.366727-1-almasrymina@google.com> <09dc0712-48e8-8ba2-f170-4c2febcfff83@oracle.com> In-Reply-To: From: Mina Almasry Date: Thu, 20 May 2021 12:21:02 -0700 Message-ID: Subject: Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY To: Mike Kravetz Cc: Axel Rasmussen , Peter Xu , Linux-MM , Andrew Morton , open list Content-Type: text/plain; charset="UTF-8" Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=u1IVQcyS; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf05.hostedemail.com: domain of almasrymina@google.com designates 209.85.216.50 as permitted sender) smtp.mailfrom=almasrymina@google.com X-Stat-Signature: fyxjjk1zcwkzenqd4qajjic4j54b15e5 X-Rspamd-Queue-Id: E2C9FE000802 X-Rspamd-Server: rspam02 X-HE-Tag: 1621538471-489511 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, May 20, 2021 at 12:18 PM Mina Almasry wrote: > > On Thu, May 13, 2021 at 5:14 PM Mike Kravetz wrote: > > > > On 5/13/21 4:49 PM, Mina Almasry wrote: > > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry wrote: > > >> > > >> When hugetlb_mcopy_atomic_pte() is called with: > > >> - mode==MCOPY_ATOMIC_NORMAL and, > > >> - we already have a page in the page cache corresponding to the > > >> associated address, > > >> > > >> We will allocate a huge page from the reserves, and then fail to insert it > > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST > > >> without allocating a new page as the page already exists in the cache. > > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily > > >> until the extra page is freed. > > >> > > >> To fix this we check if a page exists in the cache, and allocate it and > > >> insert it in the cache immediately while holding the lock. After that we > > >> copy the contents into the page. > > >> > > >> As a side effect of this, pages may exist in the cache for which the > > >> copy failed and for these pages PageUptodate(page) == false. Modify code > > >> that query the cache to handle this correctly. > > >> > > > > > > To be honest, I'm not sure I've done this bit correctly. Please take a > > > look and let me know what you think. It may be too overly complicated > > > to have !PageUptodate() pages in the cache and ask the rest of the > > > code to handle that edge case correctly, but I'm not sure how else to > > > fix this issue. > > > > > > > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to > > hugetlb_no_page. Why? > > > > Consider the case where there is only one reserve left and someone does > > the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and > > consume the reserve (reserve count == 0) and insert the page into the > > cache. Now, if the copy_huge_page_from_user fails we must drop the > > locks/fault mutex to do the copy. While locks are dropped, someone > > faults on the address and ends up in hugetlb_no_page. The page is in > > the cache but not up to date, so we go down the allocate new page path > > and will decrement the reserve count again to cause underflow. > > > > How about this approach? > > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > > that you added. That will catch the race where the page was added to > > the cache before entering the routine. > > - With the above check in place, we only need to worry about the case > > where copy_huge_page_from_user fails and we must drop locks. In this > > case we: > > - Free the page previously allocated. > > - Allocate a 'temporary' huge page without consuming reserves. I'm > > thinking of something similar to page migration. > > - Drop the locks and let the copy_huge_page_from_user be done to the > > temporary page. > > - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > > *pagep case) we need to once again check > > hugetlbfs_pagecache_present. > > - We then try to allocate the huge page which will consume the > > reserve. If successful, copy contents of temporary page to newly > > allocated page. Free temporary page. > > > > There may be issues with this, and I have not given it deep thought. It > > does abuse the temporary huge page concept, but perhaps no more than > > page migration. Things do slow down if the extra page allocation and > > copy is required, but that would only be the case if copy_huge_page_from_user > > needs to be done without locks. Not sure, but hoping that is rare. > > Just following up this a bit: I've implemented this approach locally, > and with it it's passing the test as-is. When I hack the code such > that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into > this edge case, which causes resv_huge_pages to underflow again (this > time permemantly): > > - hugetlb_no_page() is called on an index and a page is allocated and > inserted into the cache consuming the reservation. > - remove_huge_page() is called on this index and the page is removed from cache. > - hugetlb_mcopy_atomic_pte() is called on this index, we do not find > the page in the cache and we trigger this code patch and the copy > fails. > - The allocations in this code path seem to double consume the > reservation and resv_huge_pages underflows. > > I'm looking at this edge case to understand why a prior > remove_huge_page() causes my code to underflow resv_huge_pages. > I should also mention, without a prior remove_huge_page() this code path works fine, so it seems the fact that the reservation is consumed before causes trouble, but I'm not sure why yet. > > -- > > Mike Kravetz