From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi
Date: Thu, 12 Aug 2021 11:19:34 -0700
Subject: Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
To: Hugh Dickins
Cc: Andrew Morton , Shakeel Butt , "Kirill A. Shutemov" , Miaohe Lin ,
 Mike Kravetz , Michal Hocko , Rik van Riel , Christoph Hellwig ,
 Matthew Wilcox , "Eric W. Biederman" , Alexey Gladkov , Chris Wilson ,
 Matthew Auld , Linux FS-devel Mailing List , Linux Kernel Mailing List ,
 linux-api@vger.kernel.org, Linux MM
References: <2862852d-badd-7486-3a8e-c5ea9666d6fb@google.com> <749bcf72-efbd-d6c-db30-e9ff98242390@google.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 6, 2021 at 10:57 AM Yang Shi wrote:
>
> On Thu, Aug 5, 2021 at 10:43 PM Hugh Dickins wrote:
> >
> > On Thu, 5 Aug 2021, Yang Shi wrote:
> > >
> > > By rereading the code, I think you are correct. Both cases do work
> > > correctly without leaking. And the !CONFIG_NUMA case may carry the
> > > huge page indefinitely.
> > >
> > > I think it is because khugepaged may collapse memory for another NUMA
> > > node in the next loop, so it doesn't make too much sense to carry the
> > > huge page, but it may be an optimization for !CONFIG_NUMA case.
> >
> > Yes, that is its intention.
> >
> > >
> > > However, as I mentioned in earlier email the new pcp implementation
> > > could cache THP now, so we might not need keep this convoluted logic
> > > anymore. Just free the page if collapse is failed then re-allocate
> > > THP. The carried THP might improve the success rate a little bit but I
> > > doubt how noticeable it would be, may be not worth for the extra
> > > complexity at all.
> >
> > It would be great if the new pcp implementation is good enough to
> > get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
> > the *hpage stuff too, I hope. That would be a welcome cleanup.
>
> The other question is if that optimization is worth it nowadays or
> not. I bet not too many users build NUMA=n kernel nowadays even though
> the kernel is actually running on a non-NUMA machine. Some small
> devices may run NUMA=n kernel, but I don't think they actually use
> THP. So such code complexity could be removed from this point of view
> too.
>
> >
> > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > After writing that, I realized how I'm almost always testing a NUMA=y
> > kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
> > So did so to check no leak, indeed; but was surprised, when comparing
> > vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
> > as the NUMA=y run. I've merely made a note to look into that one day:
> > maybe it was just a one-off oddity, or maybe the incrementing of stats
> > is wrong down one path or the other.

I came up with a patch to remove the !CONFIG_NUMA case, and my test
found the same problem.
The NUMA=n run had done 5 times as much thp_collapse_alloc as the
NUMA=y run on a vanilla kernel, exactly as you saw.

A quick look shows the huge page allocation timing is different in the
two cases. For NUMA=n, the huge page is allocated by
khugepaged_prealloc_page() before scanning the address space, which
means a huge page may be allocated even though there is no suitable
range for collapsing. The page is then just freed once khugepaged has
made enough progress, and allocated again on the next scan pass. The
problem should be more noticeable with a shorter scan interval
(scan_sleep_millisecs); I set it to 100ms for my test.

We could carry the huge page across scan passes for NUMA=n, but that
would make the code more complicated. I don't think it is really worth
it, so just removing the NUMA=n special case sounds more reasonable to
me.

>
> Yeah, probably.
>
> >
> > Hugh