From: Huang Ying <ying.huang@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Feng Tang <feng.tang@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Rik van Riel <riel@surriel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	Wei Xu <weixugc@google.com>,
	osalvador <osalvador@suse.de>,
	Shakeel Butt <shakeelb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: [PATCH -V12 0/3] NUMA balancing: optimize memory placement for memory tiering system
Date: Wed, 16 Feb 2022 15:38:12 +0800
Message-Id: <20220216073815.2505536-1-ying.huang@intel.com>

The changes since the last post are as follows,

- Rebased on v5.17-rc4

- Change promotion watermark implementation per Johannes' comments

- Fix several sysctl ABI documentation bugs.  Thanks, Andrew.

--

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
memory subsystem of such a machine can be called a memory tiering
system, because the different types of memory differ in performance.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (fake) logical node.

To optimize overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot
pages in the PMEM node and migrate them to the DRAM node via NUMA
migration.

The original NUMA balancing already has a set of mechanisms to
identify the pages recently accessed by the CPUs in a node and migrate
those pages to that node.  So we can reuse these mechanisms to
optimize the page placement in the memory tiering system.  This is
implemented in this patchset.
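
As a rough illustration of the promotion side, the decision taken once
a page is found to have been accessed by the CPUs of some node could be
sketched as below.  This is a simplified userspace model, not code from
the patchset: the enum, the function name and the per-node tier array
are all hypothetical, and it only covers the tiering-specific case
(PMEM to DRAM), not the regular DRAM to DRAM balancing.

```c
/* Hypothetical tier labels for logical NUMA nodes (not kernel names). */
enum mem_tier { TIER_DRAM, TIER_PMEM };

/*
 * Decide where a recently accessed page should go.
 *
 * tiers[]  - tier of each logical node, indexed by node id (assumed layout)
 * page_nid - node that currently holds the page
 * cpu_nid  - node of the CPU that accessed the page
 *
 * Returns the node id to promote the page to, or -1 to leave it in place.
 */
int promotion_target(const enum mem_tier *tiers, int page_nid, int cpu_nid)
{
	/* The page already sits on the accessing node: nothing to do. */
	if (page_nid == cpu_nid)
		return -1;

	/* Hot page in PMEM, accessed from a DRAM node: promote it there. */
	if (tiers[page_nid] == TIER_PMEM && tiers[cpu_nid] == TIER_DRAM)
		return cpu_nid;

	return -1;
}
```

With node 0 as DRAM and node 1 as PMEM, a page on node 1 accessed from
a CPU on node 0 yields target node 0, while a page already on node 0 is
left alone.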

On the other hand, the cold pages should be placed in the PMEM node.
So, we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

Commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
implemented a mechanism to demote the cold DRAM pages to the PMEM node
under memory pressure.  Based on that, the cold DRAM pages can be
demoted to the PMEM node proactively to free some memory space on the
DRAM node to accommodate the promoted hot PMEM pages.  This is
implemented in this patchset too.
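
To make the proactive demotion idea concrete, here is a minimal sketch
of a watermark check that keeps some DRAM free for promotions.  It is
an assumption-laden userspace model, not the patchset's code: the
struct and function names are hypothetical, and the formula reads the
v11 changelog entry ("additional promotion watermark to be the high
watermark / 4") as "high watermark plus a quarter of it".  The actual
implementation changed again in v12.

```c
#include <stdbool.h>

/* Hypothetical per-node state, counted in pages. */
struct dram_node {
	unsigned long free_pages;
	unsigned long high_wmark;
};

/*
 * Assumed promotion watermark: the high watermark plus an additional
 * high_wmark / 4 of headroom (interpretation of the v11 changelog).
 */
unsigned long promo_wmark(const struct dram_node *n)
{
	return n->high_wmark + n->high_wmark / 4;
}

/*
 * True when free DRAM is below the promotion watermark, i.e. cold
 * pages should be demoted to PMEM before more hot pages are promoted.
 */
bool needs_proactive_demotion(const struct dram_node *n)
{
	return n->free_pages < promo_wmark(n);
}
```

For a node with a high watermark of 1000 pages, the assumed promotion
watermark is 1250 pages, so 1000 free pages would trigger demotion and
2000 free pages would not.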

We have tested the solution with the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a Gaussian access address
distribution on a 2-socket Intel server with Optane DC Persistent
Memory Model.  The test results show that the pmbench score can
improve by up to 95.9%.

Changelog:

v12:

- Rebased on v5.17-rc4

- Change promotion watermark implementation per Johannes' comments

- Fix several sysctl ABI documentation bugs.  Thanks, Andrew.

v11:

- Rebased on v5.17-rc1

- Remove [4-6] from the original patchset to make it easier to
  review.

- Change the additional promotion watermark to be the high watermark / 4.

v10:

- Rebased on v5.16-rc1

- Revise error processing for [1/6] (promotion counter) per Yang's comments

- Add sysctl document for [2/6] (optimize page placement)

- Reset threshold adjustment state when disabling/enabling tiering mode

- Reset threshold when workload transition is detected.

v9:

- Rebased on v5.15-rc4

- Make "add promotion counter" the first patch per Yang's comments

v8:

- Rebased on v5.15-rc1

- Make user-specified threshold take effect sooner

v7:

- Rebased on the mmots tree of 2021-07-15.

- Some minor fixes.

v6:

- Rebased on the latest page demotion patchset. (which is based on v5.11)

v5:

- Rebased on the latest page demotion patchset. (which is based on v5.10)

v4:

- Rebased on the latest page demotion patchset. (which is based on v5.9-rc6)

- Add page promotion counter.

v3:

- Move the rate limit control as late as possible per Mel Gorman's
  comments.

- Revise the hot page selection implementation to store page scan time
  in struct page.

- Code cleanup.

- Rebased on the latest page demotion patchset.

v2:

- Addressed comments for V1.

- Rebased on v5.5.

Best Regards,
Huang, Ying