From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi
Date: Thu, 12 Aug 2021 11:19:34 -0700
Subject: Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
To: Hugh Dickins
Cc: Andrew Morton , Shakeel Butt , "Kirill A. Shutemov" , Miaohe Lin ,
 Mike Kravetz , Michal Hocko , Rik van Riel , Christoph Hellwig ,
 Matthew Wilcox , "Eric W. Biederman" , Alexey Gladkov , Chris Wilson ,
 Matthew Auld , Linux FS-devel Mailing List , Linux Kernel Mailing List ,
 linux-api@vger.kernel.org, Linux MM
References: <2862852d-badd-7486-3a8e-c5ea9666d6fb@google.com> <749bcf72-efbd-d6c-db30-e9ff98242390@google.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 6, 2021 at 10:57 AM Yang Shi wrote:
>
> On Thu, Aug 5, 2021 at 10:43 PM Hugh Dickins wrote:
> >
> > On Thu, 5 Aug 2021, Yang Shi wrote:
> > >
> > > By rereading the code, I think you are correct. Both cases do work
> > > correctly without leaking. And the !CONFIG_NUMA case may carry the
> > > huge page indefinitely.
> > >
> > > I think it is because khugepaged may collapse memory for another NUMA
> > > node in the next loop, so it doesn't make too much sense to carry the
> > > huge page, but it may be an optimization for !CONFIG_NUMA case.
> >
> > Yes, that is its intention.
> >
> > >
> > > However, as I mentioned in earlier email the new pcp implementation
> > > could cache THP now, so we might not need keep this convoluted logic
> > > anymore. Just free the page if collapse is failed then re-allocate
> > > THP. The carried THP might improve the success rate a little bit but I
> > > doubt how noticeable it would be, may be not worth for the extra
> > > complexity at all.
> >
> > It would be great if the new pcp implementation is good enough to
> > get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
> > the *hpage stuff too, I hope. That would be a welcome cleanup.
>
> The other question is if that optimization is worth it nowadays or
> not. I bet not too many users build NUMA=n kernel nowadays even though
> the kernel is actually running on a non-NUMA machine. Some small
> devices may run NUMA=n kernel, but I don't think they actually use
> THP. So such code complexity could be removed from this point of view
> too.
>
> >
> > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > After writing that, I realized how I'm almost always testing a NUMA=y
> > kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
> > So did so to check no leak, indeed; but was surprised, when comparing
> > vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
> > as the NUMA=y run. I've merely made a note to look into that one day:
> > maybe it was just a one-off oddity, or maybe the incrementing of stats
> > is wrong down one path or the other.

I came up with a patch to remove the !CONFIG_NUMA case, and my test
found the same problem.
The NUMA=n run had done 5 times as much thp_collapse_alloc as the
NUMA=y run on a vanilla kernel, exactly as you saw.

A quick look shows the huge page allocation timing is different in the
two cases. For NUMA=n, the huge page is allocated by
khugepaged_prealloc_page() before scanning the address space, which
means a huge page may be allocated even though there is no suitable
range for collapsing. The page is then just freed once khugepaged has
made enough progress, and allocated again on the next scan pass. The
problem should be more noticeable with a shorter scan interval
(scan_sleep_millisecs); I set it to 100ms for my test.

We could carry the huge page across scan passes for NUMA=n, but that
would make the code more complicated. I don't think it is really worth
it, so just removing the NUMA=n special case sounds more reasonable to
me.

>
> Yeah, probably.
>
> >
> > Hugh