From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 17 Feb 2022 11:26:04 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: "Huang, Ying"
Cc: Peter Zijlstra, Mel Gorman, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Feng Tang, Baolin Wang, Andrew Morton, Michal Hocko, Rik van Riel,
	Dave Hansen, Yang Shi, Zi Yan, Wei Xu, osalvador, Shakeel Butt,
	zhongjiang-ali
Subject: Re: [PATCH -V11 2/3] NUMA balancing: optimize page placement for memory tiering system
References: <20220128082751.593478-1-ying.huang@intel.com> <20220128082751.593478-3-ying.huang@intel.com> <87ee4cliia.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87ee4cliia.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Hi Huang,

Sorry, I didn't see this reply until you sent out the new version
already :( Apologies.
On Wed, Feb 09, 2022 at 01:24:29PM +0800, Huang, Ying wrote:
> > On Fri, Jan 28, 2022 at 04:27:50PM +0800, Huang Ying wrote:
> >> @@ -615,6 +622,10 @@ faults may be controlled by the `numa_balancing_scan_period_min_ms,
> >>  numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
> >>  numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
> >>
> >> +Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
> >> +different types of memory (represented as different NUMA nodes) to
> >> +place the hot pages in the fast memory. This is implemented based on
> >> +unmapping and page fault too.
> >
> > NORMAL | TIERING appears to be a non-sensical combination.
> >
> > Would it be better to have a tristate (disabled, normal, tiering)
> > rather than a mask?
>
> NORMAL is for balancing cross-socket memory accessing among DRAM nodes.
> TIERING is for optimizing page placement between DRAM and PMEM in one
> socket. We think it's possible to do both.
>
> For example, with [3/3] of the patchset,
>
> - TIERING: because DRAM pages aren't made PROT_NONE, balancing among
>   DRAM nodes is disabled.
>
> - NORMAL | TIERING: both cross-socket balancing among DRAM nodes and
>   page placement optimizing between DRAM and PMEM are enabled.

Ok, I get it. So NORMAL would enable PROT_NONE sampling on all nodes,
and TIERING would additionally raise the watermarks on DRAM nodes.
Thanks!
> >> @@ -2034,16 +2035,30 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >>  {
> >>  	int page_lru;
> >>  	int nr_pages = thp_nr_pages(page);
> >> +	int order = compound_order(page);
> >>
> >> -	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
> >> +	VM_BUG_ON_PAGE(order && !PageTransHuge(page), page);
> >>
> >>  	/* Do not migrate THP mapped by multiple processes */
> >>  	if (PageTransHuge(page) && total_mapcount(page) > 1)
> >>  		return 0;
> >>
> >>  	/* Avoid migrating to a node that is nearly full */
> >> -	if (!migrate_balanced_pgdat(pgdat, nr_pages))
> >> +	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
> >> +		int z;
> >> +
> >> +		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> >> +		    !numa_demotion_enabled)
> >> +			return 0;
> >> +		if (next_demotion_node(pgdat->node_id) == NUMA_NO_NODE)
> >> +			return 0;
> >
> > The encoded behavior doesn't seem very user-friendly: Unless the user
> > enables numa demotion in a separate flag, enabling numa balancing in
> > tiered mode will silently do nothing.
>
> In theory, TIERING still does something even with numa_demotion_enabled
> == false. Then it works more like the original NUMA balancing: if
> there's some free space in the DRAM node (for example, because some
> programs exit), some PMEM pages will be promoted to DRAM. But as noted
> in the change log, this isn't good enough for page placement
> optimization.

Right, so it's a behavior that likely isn't going to be useful.

> > Would it make more sense to have a central flag for the operation of
> > tiered memory systems that will enable both promotion and demotion?
>
> IMHO, it may be possible for people to enable demotion alone. For
> example, if some people want to use a user-space page placement
> optimization solution based on PMU counters, they may disable TIERING
> but still use demotion as a way to avoid swapping in some situations.
> Do you think this makes sense?

Yes, it does.
> > Alternatively, it could also ignore the state of demotion and promote
> > anyway if asked to, resulting in regular reclaim to make room. It
> > might not be the most popular combination, but would be in line with
> > the zone_reclaim_mode policy of preferring reclaim over remote
> > accesses. It would make the knobs behave more as expected and it's
> > less convoluted than having flags select other user-visible flags.
>
> Sorry, I don't get your idea here. Do you suggest adding another knob
> like zone_reclaim_mode? Then we can define some bits to control
> demotion and promotion there? If so, I still don't know how to fit this
> into the existing NUMA balancing framework.

No, I'm just suggesting to remove the !numa_demotion_enabled check from
the promotion path on unbalanced nodes. Keep the switches independent
from each other.

Like you said, demotion without promotion can be a valid config with a
userspace promoter. And I'm saying promotion without demotion can be a
valid config in a zone_reclaim_mode type of setup.

We also seem to agree that degraded promotion when demotion is disabled
likely isn't very useful to anybody. So maybe it should be removed?

It just comes down to user expectations. There is no master switch that
says "do the right thing on tiered systems", so absent that I think it
would be best to keep the semantics of each of the two knobs simple and
predictable, without tricky interdependencies - like quietly degrading
promotion behavior when demotion is disabled.

Does that make sense?

Thanks!
Johannes
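[ A config fragment sketching the two independent knobs as discussed
above - the sysctl path and the sysfs demotion switch exist in this
series, but the exact values and the "keep them independent" semantics
are the proposal here, not merged behavior: ]

```shell
# Promotion: enable NUMA balancing in tiering mode
# (bitmask: 1 = NORMAL, 2 = MEMORY_TIERING, 3 = both)
echo 2 > /proc/sys/kernel/numa_balancing

# Demotion: independently allow demoting cold DRAM pages to PMEM
# under reclaim pressure
echo true > /sys/kernel/mm/numa/demotion_enabled
```

With independent switches, a userspace promoter could set only
demotion_enabled, while a zone_reclaim_mode-style setup could enable
tiering-mode promotion without demotion.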