linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/6] mm: working set reporting
@ 2023-06-21 18:04 Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 1/6] mm: aggregate working set information into histograms Yuanchu Xie
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

RFC v1: https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
For background and interfaces, see the RFC v1 posting.

Changes from v1 -> v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

The RFC adds CONFIG_WSR and requires MGLRU to function. T.J. and I aim to support
the active/inactive LRU and working set estimation from userspace as well.
This series should be built with the following configs:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_WSR=y
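
For quick experimentation, the per-node files can be driven from
userspace. A minimal sketch, not part of this series (paths assume
CONFIG_NUMA; on !CONFIG_NUMA the wsr group sits under /sys/kernel/mm
instead, and the interval values are only illustrative):

/* Configure node 0's bins, then read one histogram. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd = open("/sys/devices/system/node/node0/wsr/intervals_ms",
		      O_WRONLY);

	if (fd < 0)
		return 1;
	/* Three boundaries (1s, 10s, 60s) define four idle-age bins. */
	write(fd, "1000,10000,60000", strlen("1000,10000,60000"));
	close(fd);

	fd = open("/sys/devices/system/node/node0/wsr/histogram", O_RDONLY);
	if (fd < 0)
		return 1;
	n = read(fd, buf, sizeof(buf) - 1);	/* a read triggers a refresh */
	if (n > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	close(fd);
	return 0;
}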

TODO list:
- There's a hack in mm/vmscan.c that calls into the virtio-balloon driver,
  which doesn't work if CONFIG_VIRTIO_BALLOON=m. T.J. Alumbaugh (talumbau@google.com)
  and I plan to solve this with a working set notification mechanism
  that would allow multiple consumers to subscribe to working set changes.
- memory.reaccess.histogram does not consider swapped-out pages to be reaccessed.
  I plan to implement this with the shadow entries computed in mm/workingset.c.

QEMU device implementation:
https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg06617.html

virtio-dev spec proposal v1 (v2 to be posted by T.J.):
https://lore.kernel.org/virtio-dev/CABmGT5Hv6Jd_F9EoQqVMDo4w5=7wJYmS4wwYDqXK3wov44Tf=w@mail.gmail.com/

LSF/MM discussion slides:
https://lore.kernel.org/linux-mm/CABmGT5HK9xHz=E4q4sECCD8XodP9DUcH0dMeQ8kznUQB5HTQhQ@mail.gmail.com/

T.J. Alumbaugh (1):
  virtio-balloon: Add Working Set reporting

Yuanchu Xie (5):
  mm: aggregate working set information into histograms
  mm: add working set refresh threshold to rate-limit aggregation
  mm: report working set when under memory pressure
  mm: extend working set reporting to memcgs
  mm: add per-memcg reaccess histogram

 drivers/base/node.c                 |   3 +
 drivers/virtio/virtio_balloon.c     | 288 +++++++++++++++++
 include/linux/balloon_compaction.h  |   3 +
 include/linux/memcontrol.h          |   6 +
 include/linux/mmzone.h              |   5 +
 include/linux/wsr.h                 | 114 +++++++
 include/uapi/linux/virtio_balloon.h |  33 ++
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/internal.h                       |  12 +
 mm/memcontrol.c                     | 351 ++++++++++++++++++++-
 mm/mmzone.c                         |   3 +
 mm/vmscan.c                         | 194 +++++++++++-
 mm/wsr.c                            | 464 ++++++++++++++++++++++++++++
 14 files changed, 1480 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/wsr.h
 create mode 100644 mm/wsr.c

-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 1/6] mm: aggregate working set information into histograms
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 2/6] mm: add working set refresh threshold to rate-limit aggregation Yuanchu Xie
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set histograms. The histograms break the
system's working set down per node and per page type (anon/file).
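
As an illustration (values made up): writing "1000,10000,60000" to a
node's wsr/intervals_ms file defines three boundaries and hence four
idle-age bins, and a subsequent read of wsr/histogram returns one
"<idle_age_ms> anon=<pages> file=<pages>" line per bin, the last
catch-all bin being labeled LLONG_MAX:

1000 anon=2048 file=51200
10000 anon=1024 file=8192
60000 anon=512 file=4096
9223372036854775807 anon=128 file=1024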

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 drivers/base/node.c    |   3 +
 include/linux/mmzone.h |   4 +
 include/linux/wsr.h    |  73 +++++++++++
 mm/Kconfig             |   7 +
 mm/Makefile            |   1 +
 mm/internal.h          |   1 +
 mm/mmzone.c            |   3 +
 mm/vmscan.c            |   3 +
 mm/wsr.c               | 288 +++++++++++++++++++++++++++++++++++++++++
 9 files changed, 383 insertions(+)
 create mode 100644 include/linux/wsr.h
 create mode 100644 mm/wsr.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index faf3597a96da9..e326debe22d8f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -21,6 +21,7 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/wsr.h>
 
 static struct bus_type node_subsys = {
 	.name = "node",
@@ -616,6 +617,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		wsr_register_node(node);
 	}
 
 	return error;
@@ -632,6 +634,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	wsr_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cd28a100d9e4f..96f0d8f3584e4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -21,6 +21,7 @@
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
 #include <linux/local_lock.h>
+#include <linux/wsr.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -527,7 +528,10 @@ struct lruvec {
 	struct lru_gen_struct		lrugen;
 	/* to concurrently iterate lru_gen_mm_list */
 	struct lru_gen_mm_state		mm_state;
+#ifdef CONFIG_WSR
+	struct wsr			__wsr;
 #endif
+#endif /* CONFIG_LRU_GEN */
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/wsr.h b/include/linux/wsr.h
new file mode 100644
index 0000000000000..fa46b4d61177d
--- /dev/null
+++ b/include/linux/wsr.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WSR_H
+#define _LINUX_WSR_H
+
+#include <linux/types.h>
+#include <linux/mutex.h>
+
+struct node;
+struct lruvec;
+struct mem_cgroup;
+struct pglist_data;
+struct scan_control;
+struct lru_gen_mm_walk;
+
+#ifdef CONFIG_WSR
+#define ANON_AND_FILE 2
+
+#define MIN_NR_BINS 4
+#define MAX_NR_BINS 16
+
+struct ws_bin {
+	unsigned long idle_age;
+	unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr {
+	/* protects bins */
+	struct mutex bins_lock;
+	struct ws_bin bins[MAX_NR_BINS];
+};
+
+void wsr_register_node(struct node *node);
+void wsr_unregister_node(struct node *node);
+
+void wsr_init(struct lruvec *lruvec);
+void wsr_destroy(struct lruvec *lruvec);
+struct wsr *lruvec_wsr(struct lruvec *lruvec);
+
+ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins);
+
+/*
+ * wsr->bins needs to be locked
+ */
+void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
+		 struct pglist_data *pgdat);
+#else
+struct ws_bin;
+struct wsr;
+
+static inline void wsr_register_node(struct node *node)
+{
+}
+static inline void wsr_unregister_node(struct node *node)
+{
+}
+static inline void wsr_init(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy(struct lruvec *lruvec)
+{
+}
+/* lruvec_wsr is intentionally omitted */
+static inline ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins)
+{
+	return -EINVAL;
+}
+static inline void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
+		 struct pglist_data *pgdat)
+{
+}
+#endif	/* CONFIG_WSR */
+
+#endif	/* _LINUX_WSR_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec055..8a84c1402159a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,13 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config WSR
+	bool "Working set reporting"
+	depends on LRU_GEN
+	help
+	  This option enables working set reporting. Support for separate
+	  backends is a work in progress; currently only MGLRU is supported.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e293..12e2da5ba2d04 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WSR) += wsr.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032de..88dba0b11f663 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -180,6 +180,7 @@ pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
 /*
  * in mm/vmscan.c:
  */
+struct scan_control;
 int isolate_lru_page(struct page *page);
 int folio_isolate_lru(struct folio *folio);
 void putback_lru_page(struct page *page);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 68e1511be12de..22a8282f67150 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -8,6 +8,7 @@
 
 #include <linux/stddef.h>
 #include <linux/mm.h>
+#include <linux/wsr.h>
 #include <linux/mmzone.h>
 
 struct pglist_data *first_online_pgdat(void)
@@ -89,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
 	 */
 	list_del(&lruvec->lists[LRU_UNEVICTABLE]);
 
+	wsr_init(lruvec);
+
 	lru_gen_init_lruvec(lruvec);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b7b8d4f5297f..150e3cd70c65e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -55,6 +55,7 @@
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
 #include <linux/khugepaged.h>
+#include <linux/wsr.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5890,6 +5891,8 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	wsr_register_node(NULL);
+
 	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
 	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
 
diff --git a/mm/wsr.c b/mm/wsr.c
new file mode 100644
index 0000000000000..1e4c0ce69caf7
--- /dev/null
+++ b/mm/wsr.c
@@ -0,0 +1,288 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+#include <linux/wsr.h>
+
+#include <linux/node.h>
+#include <linux/mmzone.h>
+#include <linux/mm.h>
+#include <linux/mm_inline.h>
+
+#include "internal.h"
+
+/* For now just embed wsr in the lruvec.
+ * Consider only allocating struct wsr when it's used
+ * since sizeof(struct wsr) is ~864 bytes.
+ */
+struct wsr *lruvec_wsr(struct lruvec *lruvec)
+{
+	return &lruvec->__wsr;
+}
+
+void wsr_init(struct lruvec *lruvec)
+{
+	struct wsr *wsr = lruvec_wsr(lruvec);
+
+	mutex_init(&wsr->bins_lock);
+	wsr->bins[0].idle_age = -1;
+}
+
+void wsr_destroy(struct lruvec *lruvec)
+{
+	struct wsr *wsr = lruvec_wsr(lruvec);
+
+	mutex_destroy(&wsr->bins_lock);
+	memset(wsr, 0, sizeof(*wsr));
+}
+
+ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins)
+{
+	int err, i = 0;
+	char *cur, *next = strim(src);
+
+	while ((cur = strsep(&next, ","))) {
+		unsigned int msecs;
+
+		err = kstrtouint(cur, 0, &msecs);
+		if (err)
+			return err;
+
+		bins[i].idle_age = msecs_to_jiffies(msecs);
+		if (i > 0 && bins[i].idle_age <= bins[i - 1].idle_age)
+			return -EINVAL;
+
+		if (++i == MAX_NR_BINS)
+			return -ERANGE;
+	}
+
+	if (i && i < MIN_NR_BINS - 1)
+		return -ERANGE;
+
+	bins[i].idle_age = -1;
+	return 0;
+}
+
+static void collect_wsr(struct wsr *wsr, const struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	const struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+	};
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+		// TODO update bins hierarchically
+		struct ws_bin *bin = wsr->bins;
+
+		lockdep_assert_held(&wsr->bins_lock);
+		for (seq = max_seq; seq + 1 > min_seq[type]; seq--) {
+			unsigned long birth, gen_start = curr_timestamp, error, size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += max(
+					READ_ONCE(lrugen->nr_pages[gen][type]
+								  [zone]),
+					0L);
+
+			birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+			if (seq != max_seq) {
+				int next_gen = lru_gen_from_seq(seq + 1);
+
+				gen_start = READ_ONCE(
+					lruvec->lrugen.timestamps[next_gen]);
+			}
+
+			error = size;
+			/* gen exceeds the idle_age of bin */
+			while (bin->idle_age != -1 &&
+			       time_before(birth + bin->idle_age,
+					   curr_timestamp)) {
+				unsigned long proportion =
+					gen_start -
+					(curr_timestamp - bin->idle_age);
+				unsigned long gen_len = gen_start - birth;
+
+				if (!gen_len)
+					break;
+				if (proportion) {
+					unsigned long split_bin =
+						size / gen_len *
+						proportion;
+					bin->nr_pages[type] += split_bin;
+					error -= split_bin;
+				}
+				gen_start = curr_timestamp - bin->idle_age;
+				bin++;
+
+			}
+			bin->nr_pages[type] += error;
+		}
+	}
+}
+
+static void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat)
+{
+	struct ws_bin *bin;
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&wsr->bins_lock);
+	VM_WARN_ON_ONCE(wsr->bins->idle_age == -1);
+
+	for (bin = wsr->bins; bin->idle_age != -1; bin++) {
+		bin->nr_pages[0] = 0;
+		bin->nr_pages[1] = 0;
+	}
+	// the last used bin has idle_age == -1.
+	bin->nr_pages[0] = 0;
+	bin->nr_pages[1] = 0;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec =
+			mem_cgroup_lruvec(memcg, pgdat);
+
+		collect_wsr(wsr, lruvec);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+}
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+					    first_memory_node;
+
+	return NODE_DATA(nid);
+}
+
+static struct wsr *kobj_to_wsr(struct kobject *kobj)
+{
+	return lruvec_wsr(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
+}
+
+static ssize_t intervals_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
+				 char *buf)
+{
+	struct ws_bin *bin;
+	int len = 0;
+	struct wsr *wsr = kobj_to_wsr(kobj);
+
+	mutex_lock(&wsr->bins_lock);
+
+	for (bin = wsr->bins; bin->idle_age != -1; bin++)
+		len += sysfs_emit_at(buf, len, "%u,", jiffies_to_msecs(bin->idle_age));
+
+	len += sysfs_emit_at(buf, len, "%lld\n", LLONG_MAX);
+
+	mutex_unlock(&wsr->bins_lock);
+
+	return len;
+}
+
+static ssize_t intervals_ms_store(struct kobject *kobj, struct kobj_attribute *attr,
+				  const char *src, size_t len)
+{
+	char *buf;
+	struct ws_bin *bins;
+	int err = 0;
+	struct wsr *wsr = kobj_to_wsr(kobj);
+
+	bins = kzalloc(sizeof(wsr->bins), GFP_KERNEL);
+	if (!bins)
+		return -ENOMEM;
+
+	buf = kstrdup(src, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	err = wsr_intervals_ms_parse(buf, bins);
+	if (err)
+		goto failed;
+
+	mutex_lock(&wsr->bins_lock);
+	memcpy(wsr->bins, bins, sizeof(wsr->bins));
+	mutex_unlock(&wsr->bins_lock);
+failed:
+	kfree(buf);
+	kfree(bins);
+
+	return err ?: len;
+}
+
+static struct kobj_attribute intervals_ms_attr = __ATTR_RW(intervals_ms);
+
+static ssize_t histogram_show(struct kobject *kobj, struct kobj_attribute *attr,
+			      char *buf)
+{
+	struct ws_bin *bin;
+	int len = 0;
+	struct wsr *wsr = kobj_to_wsr(kobj);
+
+	mutex_lock(&wsr->bins_lock);
+
+	refresh_wsr(wsr, NULL, kobj_to_pgdat(kobj));
+
+	for (bin = wsr->bins; bin->idle_age != -1; bin++)
+		len += sysfs_emit_at(buf, len, "%u anon=%lu file=%lu\n",
+				     jiffies_to_msecs(bin->idle_age), bin->nr_pages[0],
+				     bin->nr_pages[1]);
+
+	len += sysfs_emit_at(buf, len, "%lld anon=%lu file=%lu\n", LLONG_MAX,
+			     bin->nr_pages[0], bin->nr_pages[1]);
+
+	mutex_unlock(&wsr->bins_lock);
+
+	return len;
+}
+
+static struct kobj_attribute histogram_attr = __ATTR_RO(histogram);
+
+static struct attribute *wsr_attrs[] = {
+	&intervals_ms_attr.attr,
+	&histogram_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wsr_attr_group = {
+	.name = "wsr",
+	.attrs = wsr_attrs,
+};
+
+void wsr_register_node(struct node *node)
+{
+	struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+	struct wsr *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+
+	/* wsr should be initialized when pgdat was initialized
+	 * or when the root memcg was initialized
+	 */
+	if (sysfs_create_group(kobj, &wsr_attr_group)) {
+		pr_warn("WSR failed to create group\n");
+		return;
+	}
+}
+
+void wsr_unregister_node(struct node *node)
+{
+	struct kobject *kobj = &node->dev.kobj;
+	struct wsr *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+	sysfs_remove_group(kobj, &wsr_attr_group);
+	wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
+}
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 2/6] mm: add working set refresh threshold to rate-limit aggregation
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 1/6] mm: aggregate working set information into histograms Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 3/6] mm: report working set when under memory pressure Yuanchu Xie
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

The refresh threshold rate-limits working set histogram reads.
When a working set report is generated, its timestamp is recorded;
reads keep returning the same report until it ages past the
refresh threshold, at which point a new report is generated.
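
The gate condenses to the following (sketch only, setup and locking
elided; see the wsr_refresh() hunk in this patch):

	/* Rebuild only once the previous report has aged out. */
	unsigned long timestamp = READ_ONCE(wsr->timestamp);
	unsigned long threshold = READ_ONCE(wsr->refresh_threshold);

	if (time_is_before_jiffies(timestamp + threshold))
		refresh_wsr(wsr, root, pgdat, &sc, threshold);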

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/mmzone.h |  1 +
 include/linux/wsr.h    |  3 +++
 mm/internal.h          | 11 +++++++++
 mm/vmscan.c            | 39 +++++++++++++++++++++++++++++--
 mm/wsr.c               | 52 +++++++++++++++++++++++++++++++++++++++---
 5 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96f0d8f3584e4..bca828a16a46b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -362,6 +362,7 @@ enum lruvec_flags {
 
 #ifndef __GENERATING_BOUNDS_H
 
+struct node;
 struct lruvec;
 struct page_vma_mapped_walk;
 
diff --git a/include/linux/wsr.h b/include/linux/wsr.h
index fa46b4d61177d..a86105468c710 100644
--- a/include/linux/wsr.h
+++ b/include/linux/wsr.h
@@ -26,6 +26,8 @@ struct ws_bin {
 struct wsr {
 	/* protects bins */
 	struct mutex bins_lock;
+	unsigned long timestamp;
+	unsigned long refresh_threshold;
 	struct ws_bin bins[MAX_NR_BINS];
 };
 
@@ -40,6 +42,7 @@ ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins);
 
 /*
  * wsr->bins needs to be locked
+ * refreshes wsr based on the refresh threshold
  */
 void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat);
diff --git a/mm/internal.h b/mm/internal.h
index 88dba0b11f663..ce4757e7f8277 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -186,6 +186,17 @@ int folio_isolate_lru(struct folio *folio);
 void putback_lru_page(struct page *page);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+int get_swappiness(struct lruvec *lruvec, struct scan_control *sc);
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			struct scan_control *sc, bool can_swap,
+			bool force_scan);
+
+/*
+ * in mm/wsr.c
+ */
+void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
+		 struct pglist_data *pgdat, struct scan_control *sc,
+		 unsigned long refresh_threshold);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 150e3cd70c65e..66c5df2a7f65b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3201,7 +3201,7 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return &pgdat->__lruvec;
 }
 
-static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
+int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -4402,7 +4402,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 			       struct scan_control *sc, bool can_swap, bool force_scan)
 {
 	bool success;
@@ -5900,6 +5900,41 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+/******************************************************************************
+ *                          working set reporting
+ ******************************************************************************/
+
+#ifdef CONFIG_WSR
+void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
+		 struct pglist_data *pgdat)
+{
+	unsigned int flags;
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+
+	lockdep_assert_held(&wsr->bins_lock);
+
+	if (wsr->bins->idle_age != -1) {
+		unsigned long timestamp = READ_ONCE(wsr->timestamp);
+		unsigned long threshold = READ_ONCE(wsr->refresh_threshold);
+
+		if (time_is_before_jiffies(timestamp + threshold)) {
+			set_task_reclaim_state(current, &sc.reclaim_state);
+			flags = memalloc_noreclaim_save();
+			refresh_wsr(wsr, root, pgdat, &sc, threshold);
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+		}
+	}
+}
+
+#endif /* CONFIG_WSR */
+
 #else /* !CONFIG_LRU_GEN */
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
diff --git a/mm/wsr.c b/mm/wsr.c
index 1e4c0ce69caf7..ee295d164461e 100644
--- a/mm/wsr.c
+++ b/mm/wsr.c
@@ -125,8 +125,9 @@ static void collect_wsr(struct wsr *wsr, const struct lruvec *lruvec)
 	}
 }
 
-static void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
-			struct pglist_data *pgdat)
+void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
+		 struct pglist_data *pgdat, struct scan_control *sc,
+		 unsigned long refresh_threshold)
 {
 	struct ws_bin *bin;
 	struct mem_cgroup *memcg;
@@ -146,6 +147,24 @@ static void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
 	do {
 		struct lruvec *lruvec =
 			mem_cgroup_lruvec(memcg, pgdat);
+		bool can_swap = get_swappiness(lruvec, sc);
+		unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+		unsigned long min_seq[ANON_AND_FILE] = {
+			READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+			READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+		};
+
+		mem_cgroup_calculate_protection(root, memcg);
+		if (!mem_cgroup_below_min(root, memcg) && refresh_threshold &&
+		    min_seq[!can_swap] + MAX_NR_GENS - 1 > max_seq) {
+			int gen = lru_gen_from_seq(max_seq);
+			unsigned long birth =
+				READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+			if (time_is_before_jiffies(birth + refresh_threshold))
+				try_to_inc_max_seq(lruvec, max_seq, sc,
+						   can_swap, false);
+		}
 
 		collect_wsr(wsr, lruvec);
 
@@ -165,6 +184,32 @@ static struct wsr *kobj_to_wsr(struct kobject *kobj)
 	return lruvec_wsr(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
 }
 
+
+static ssize_t refresh_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
+			       char *buf)
+{
+	struct wsr *wsr = kobj_to_wsr(kobj);
+	unsigned long threshold = READ_ONCE(wsr->refresh_threshold);
+
+	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold));
+}
+
+static ssize_t refresh_ms_store(struct kobject *kobj, struct kobj_attribute *attr,
+				const char *buf, size_t len)
+{
+	unsigned int msecs;
+	struct wsr *wsr = kobj_to_wsr(kobj);
+
+	if (kstrtouint(buf, 0, &msecs))
+		return -EINVAL;
+
+	WRITE_ONCE(wsr->refresh_threshold, msecs_to_jiffies(msecs));
+
+	return len;
+}
+
+static struct kobj_attribute refresh_ms_attr = __ATTR_RW(refresh_ms);
+
 static ssize_t intervals_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
 				 char *buf)
 {
@@ -227,7 +272,7 @@ static ssize_t histogram_show(struct kobject *kobj, struct kobj_attribute *attr,
 
 	mutex_lock(&wsr->bins_lock);
 
-	refresh_wsr(wsr, NULL, kobj_to_pgdat(kobj));
+	wsr_refresh(wsr, NULL, kobj_to_pgdat(kobj));
 
 	for (bin = wsr->bins; bin->idle_age != -1; bin++)
 		len += sysfs_emit_at(buf, len, "%u anon=%lu file=%lu\n",
@@ -245,6 +290,7 @@ static ssize_t histogram_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute histogram_attr = __ATTR_RO(histogram);
 
 static struct attribute *wsr_attrs[] = {
+	&refresh_ms_attr.attr,
 	&intervals_ms_attr.attr,
 	&histogram_attr.attr,
 	NULL
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 3/6] mm: report working set when under memory pressure
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 1/6] mm: aggregate working set information into histograms Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 2/6] mm: add working set refresh threshold to rate-limit aggregation Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 4/6] mm: extend working set reporting to memcgs Yuanchu Xie
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

When a system is under memory pressure and kswapd kicks in,
a working set report is produced, and any userspace program
polling the histogram file is notified of the new report.

The report threshold acts as a rate-limiting mechanism to
prevent the system from generating reports too frequently.
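
On the userspace side this follows the usual sysfs notification
pattern: poll for POLLPRI, then seek back and re-read. A minimal
consumer sketch, not part of this series (path illustrative):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/sys/devices/system/node/node0/wsr/histogram",
		      O_RDONLY);

	if (fd < 0)
		return 1;
	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };
		ssize_t n;

		lseek(fd, 0, SEEK_SET);
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			fputs(buf, stdout);
		}
		poll(&pfd, 1, -1);	/* woken by kernfs_notify() */
	}
}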

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/wsr.h |  2 ++
 mm/vmscan.c         | 37 +++++++++++++++++++++++++++++++++++++
 mm/wsr.c            | 29 +++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/include/linux/wsr.h b/include/linux/wsr.h
index a86105468c710..85c901ce026b9 100644
--- a/include/linux/wsr.h
+++ b/include/linux/wsr.h
@@ -26,7 +26,9 @@ struct ws_bin {
 struct wsr {
 	/* protects bins */
 	struct mutex bins_lock;
+	struct kernfs_node *notifier;
 	unsigned long timestamp;
+	unsigned long report_threshold;
 	unsigned long refresh_threshold;
 	struct ws_bin bins[MAX_NR_BINS];
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66c5df2a7f65b..c56fddcec88fb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4559,6 +4559,8 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
 	return true;
 }
 
+static void report_ws(struct pglist_data *pgdat, struct scan_control *sc);
+
 /* to protect the working set of the last N jiffies */
 static unsigned long lru_gen_min_ttl __read_mostly;
 
@@ -4570,6 +4572,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
+	report_ws(pgdat, sc);
+
 	sc->last_reclaimed = sc->nr_reclaimed;
 
 	/*
@@ -5933,6 +5937,39 @@ void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 	}
 }
 
+static void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	static DEFINE_RATELIMIT_STATE(rate, HZ, 3);
+
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
+	struct wsr *wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, pgdat));
+	unsigned long threshold;
+
+	threshold = READ_ONCE(wsr->report_threshold);
+
+	if (sc->priority == DEF_PRIORITY)
+		return;
+
+	if (READ_ONCE(wsr->bins->idle_age) == -1)
+		return;
+
+	if (!threshold || time_is_after_jiffies(wsr->timestamp + threshold))
+		return;
+
+	if (!__ratelimit(&rate))
+		return;
+
+	if (!mutex_trylock(&wsr->bins_lock))
+		return;
+
+	refresh_wsr(wsr, memcg, pgdat, sc, 0);
+	WRITE_ONCE(wsr->timestamp, jiffies);
+
+	mutex_unlock(&wsr->bins_lock);
+
+	if (wsr->notifier)
+		kernfs_notify(wsr->notifier);
+}
 #endif /* CONFIG_WSR */
 
 #else /* !CONFIG_LRU_GEN */
diff --git a/mm/wsr.c b/mm/wsr.c
index ee295d164461e..cd045ade5e9ba 100644
--- a/mm/wsr.c
+++ b/mm/wsr.c
@@ -24,6 +24,7 @@ void wsr_init(struct lruvec *lruvec)
 
 	mutex_init(&wsr->bins_lock);
 	wsr->bins[0].idle_age = -1;
+	wsr->notifier = NULL;
 }
 
 void wsr_destroy(struct lruvec *lruvec)
@@ -184,6 +185,30 @@ static struct wsr *kobj_to_wsr(struct kobject *kobj)
 	return lruvec_wsr(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
 }
 
+static ssize_t report_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
+			      char *buf)
+{
+	struct wsr *wsr = kobj_to_wsr(kobj);
+	unsigned long threshold = READ_ONCE(wsr->report_threshold);
+
+	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold));
+}
+
+static ssize_t report_ms_store(struct kobject *kobj, struct kobj_attribute *attr,
+			       const char *buf, size_t len)
+{
+	unsigned int msecs;
+	struct wsr *wsr = kobj_to_wsr(kobj);
+
+	if (kstrtouint(buf, 0, &msecs))
+		return -EINVAL;
+
+	WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs));
+
+	return len;
+}
+
+static struct kobj_attribute report_ms_attr = __ATTR_RW(report_ms);
 
 static ssize_t refresh_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
 			       char *buf)
@@ -290,6 +315,7 @@ static ssize_t histogram_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute histogram_attr = __ATTR_RO(histogram);
 
 static struct attribute *wsr_attrs[] = {
+	&report_ms_attr.attr,
 	&refresh_ms_attr.attr,
 	&intervals_ms_attr.attr,
 	&histogram_attr.attr,
@@ -318,6 +344,8 @@ void wsr_register_node(struct node *node)
 		pr_warn("WSR failed to create group\n");
 		return;
 	}
+
+	wsr->notifier = kernfs_walk_and_get(kobj->sd, "wsr/histogram");
 }
 
 void wsr_unregister_node(struct node *node)
@@ -329,6 +357,7 @@ void wsr_unregister_node(struct node *node)
 		return;
 
 	wsr = kobj_to_wsr(kobj);
+	kernfs_put(wsr->notifier);
 	sysfs_remove_group(kobj, &wsr_attr_group);
 	wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
 }
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 4/6] mm: extend working set reporting to memcgs
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
                   ` (2 preceding siblings ...)
  2023-06-21 18:04 ` [RFC PATCH v2 3/6] mm: report working set when under memory pressure Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 5/6] mm: add per-memcg reaccess histogram Yuanchu Xie
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

Break down the system-wide working set reporting into
per-memcg reports, which aggregate their children hierarchically.
The per-node working set histograms and the refresh/report
threshold files are exposed as memcg files, each showing a report
covering all memory nodes.
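
As an illustration (values made up), on a two-node machine
memory.wsr.refresh_ms reads back one "N<node>=<ms>" pair per node:

N0=500 N1=500

and accepts a single pair per write, e.g. "N0=750". Likewise,
memory.wsr.histogram prints an "N<node>" header followed by that
node's bins:

N0
1000 anon=2048 file=51200
9223372036854775807 anon=128 file=1024
N1
...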

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/memcontrol.h |   6 +
 include/linux/wsr.h        |   4 +
 mm/memcontrol.c            | 262 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   9 +-
 4 files changed, 277 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 85dc9b88ea379..96971aa6a48cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -10,6 +10,7 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+#include <linux/wait.h>
 #include <linux/cgroup.h>
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
@@ -325,6 +326,11 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+#ifdef CONFIG_WSR
+	int wsr_event;
+	wait_queue_head_t wsr_wait_queue;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/wsr.h b/include/linux/wsr.h
index 85c901ce026b9..d45f7cc0672ac 100644
--- a/include/linux/wsr.h
+++ b/include/linux/wsr.h
@@ -48,6 +48,7 @@ ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins);
  */
 void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat);
+void report_ws(struct pglist_data *pgdat, struct scan_control *sc);
 #else
 struct ws_bin;
 struct wsr;
@@ -73,6 +74,9 @@ static inline void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat)
 {
 }
+static inline void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
 #endif	/* CONFIG_WSR */
 
 #endif	/* _LINUX_WSR_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2eee092f8f119..edf5bb31bb19c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,6 +25,7 @@
  * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
+#include <linux/wait.h>
 #include <linux/page_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
@@ -65,6 +66,7 @@
 #include <linux/seq_buf.h>
 #include "internal.h"
 #include <net/sock.h>
+#include <linux/wsr.h>
 #include <net/ip.h>
 #include "slab.h"
 #include "swap.h"
@@ -5233,6 +5235,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;
 
+	wsr_destroy(&pn->lruvec);
 	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn);
 }
@@ -5311,6 +5314,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
 	INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
 	memcg->deferred_split_queue.split_queue_len = 0;
+#endif
+#ifdef CONFIG_WSR
+	memcg->wsr_event = 0;
+	init_waitqueue_head(&memcg->wsr_wait_queue);
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	lru_gen_init_memcg(memcg);
@@ -5411,6 +5418,11 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock_irq(&memcg->event_list_lock);
 
+#ifdef CONFIG_WSR
+	wake_up_pollfree(&memcg->wsr_wait_queue);
+	synchronize_rcu();
+#endif
+
 	page_counter_set_min(&memcg->memory, 0);
 	page_counter_set_low(&memcg->memory, 0);
 
@@ -6642,6 +6654,228 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+#ifdef CONFIG_WSR
+static int memory_wsr_intervals_ms_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr;
+		struct ws_bin *bin;
+
+		wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+		mutex_lock(&wsr->bins_lock);
+		seq_printf(m, "N%d=", nid);
+		for (bin = wsr->bins; bin->idle_age != -1; bin++)
+			seq_printf(m, "%u,", jiffies_to_msecs(bin->idle_age));
+		mutex_unlock(&wsr->bins_lock);
+
+		seq_printf(m, "%lld ", LLONG_MAX);
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static ssize_t memory_wsr_intervals_ms_parse(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes,
+					     unsigned int *nid_out,
+					     struct ws_bin *bins)
+{
+	char *node, *intervals;
+	unsigned int nid;
+	int err;
+
+	buf = strstrip(buf);
+	intervals = buf;
+	node = strsep(&intervals, "=");
+
+	if (*node != 'N')
+		return -EINVAL;
+
+	err = kstrtouint(node + 1, 0, &nid);
+	if (err)
+		return err;
+
+	if (nid >= nr_node_ids || !node_state(nid, N_MEMORY))
+		return -EINVAL;
+
+	err = wsr_intervals_ms_parse(intervals, bins);
+	if (err)
+		return err;
+
+	*nid_out = nid;
+	return 0;
+}
+
+static ssize_t memory_wsr_intervals_ms_write(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes,
+					     loff_t off)
+{
+	unsigned int nid;
+	int err;
+	struct wsr *wsr;
+	struct ws_bin *bins;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	bins = kzalloc(sizeof(wsr->bins), GFP_KERNEL);
+	if (!bins)
+		return -ENOMEM;
+
+	err = memory_wsr_intervals_ms_parse(of, buf, nbytes, &nid, bins);
+	if (err)
+		goto failed;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+	mutex_lock(&wsr->bins_lock);
+	memcpy(wsr->bins, bins, sizeof(wsr->bins));
+	mutex_unlock(&wsr->bins_lock);
+failed:
+	kfree(bins);
+	return err ?: nbytes;
+}
+
+static int memory_wsr_refresh_ms_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr =
+			lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+
+		seq_printf(m, "N%d=%u ", nid,
+			   jiffies_to_msecs(READ_ONCE(wsr->refresh_threshold)));
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static ssize_t memory_wsr_threshold_parse(char *buf, size_t nbytes,
+					  unsigned int *nid_out,
+					  unsigned int *msecs)
+{
+	char *node, *threshold;
+	unsigned int nid;
+	int err;
+
+	buf = strstrip(buf);
+	threshold = buf;
+	node = strsep(&threshold, "=");
+
+	if (*node != 'N')
+		return -EINVAL;
+
+	err = kstrtouint(node + 1, 0, &nid);
+	if (err)
+		return err;
+
+	if (nid >= nr_node_ids || !node_state(nid, N_MEMORY))
+		return -EINVAL;
+
+	err = kstrtouint(threshold, 0, msecs);
+	if (err)
+		return err;
+
+	*nid_out = nid;
+
+	return nbytes;
+}
+
+static ssize_t memory_wsr_refresh_ms_write(struct kernfs_open_file *of,
+					   char *buf, size_t nbytes, loff_t off)
+{
+	unsigned int nid, msecs;
+	struct wsr *wsr;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
+
+	if (ret < 0)
+		return ret;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+	WRITE_ONCE(wsr->refresh_threshold, msecs_to_jiffies(msecs));
+	return ret;
+}
+
+static int memory_wsr_report_ms_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr =
+			lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+
+		seq_printf(m, "N%d=%u ", nid,
+			   jiffies_to_msecs(READ_ONCE(wsr->report_threshold)));
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static ssize_t memory_wsr_report_ms_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	unsigned int nid, msecs;
+	struct wsr *wsr;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
+
+	if (ret < 0)
+		return ret;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+	WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs));
+	return ret;
+}
+
+static int memory_wsr_histogram_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr =
+			lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+		struct ws_bin *bin;
+
+		seq_printf(m, "N%d\n", nid);
+
+		mutex_lock(&wsr->bins_lock);
+		wsr_refresh(wsr, memcg, NODE_DATA(nid));
+		for (bin = wsr->bins; bin->idle_age != -1; bin++)
+			seq_printf(m, "%u anon=%lu file=%lu\n",
+				   jiffies_to_msecs(bin->idle_age),
+				   bin->nr_pages[0], bin->nr_pages[1]);
+
+		seq_printf(m, "%lld anon=%lu file=%lu\n", LLONG_MAX,
+			   bin->nr_pages[0], bin->nr_pages[1]);
+
+		mutex_unlock(&wsr->bins_lock);
+	}
+
+	return 0;
+}
+
+__poll_t memory_wsr_histogram_poll(struct kernfs_open_file *of,
+				   struct poll_table_struct *pt)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	if (memcg->css.flags & CSS_DYING)
+		return DEFAULT_POLLMASK;
+
+	poll_wait(of->file, &memcg->wsr_wait_queue, pt);
+	if (cmpxchg(&memcg->wsr_event, 1, 0) == 1)
+		return DEFAULT_POLLMASK | EPOLLPRI;
+	return DEFAULT_POLLMASK;
+}
+#endif
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6710,7 +6944,33 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
-	{ }	/* terminate */
+#ifdef CONFIG_WSR
+	{
+		.name = "wsr.intervals_ms",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_wsr_intervals_ms_show,
+		.write = memory_wsr_intervals_ms_write,
+	},
+	{
+		.name = "wsr.refresh_ms",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_wsr_refresh_ms_show,
+		.write = memory_wsr_refresh_ms_write,
+	},
+	{
+		.name = "wsr.report_ms",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_wsr_report_ms_show,
+		.write = memory_wsr_report_ms_write,
+	},
+	{
+		.name = "wsr.histogram",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_wsr_histogram_show,
+		.poll = memory_wsr_histogram_poll,
+	},
+#endif
+	{} /* terminate */
 };
 
 struct cgroup_subsys memory_cgrp_subsys = {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c56fddcec88fb..ba254b6e91e19 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4559,8 +4559,6 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
 	return true;
 }
 
-static void report_ws(struct pglist_data *pgdat, struct scan_control *sc);
-
 /* to protect the working set of the last N jiffies */
 static unsigned long lru_gen_min_ttl __read_mostly;
 
@@ -5937,7 +5935,7 @@ void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 	}
 }
 
-static void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
+void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	static DEFINE_RATELIMIT_STATE(rate, HZ, 3);
 
@@ -5969,6 +5967,8 @@ static void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
 
 	if (wsr->notifier)
 		kernfs_notify(wsr->notifier);
+	if (memcg && cmpxchg(&memcg->wsr_event, 0, 1) == 0)
+		wake_up_interruptible(&memcg->wsr_wait_queue);
 }
 #endif /* CONFIG_WSR */
 
@@ -6486,6 +6486,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
+
+		if (!sc->proactive)
+			report_ws(zone->zone_pgdat, sc);
 		shrink_node(zone->zone_pgdat, sc);
 	}
 
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 5/6] mm: add per-memcg reaccess histogram
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
                   ` (3 preceding siblings ...)
  2023-06-21 18:04 ` [RFC PATCH v2 4/6] mm: extend working set reporting to memcgs Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:04 ` [RFC PATCH v2 6/6] virtio-balloon: Add Working Set reporting Yuanchu Xie
  2023-06-21 18:48 ` [RFC PATCH v2 0/6] mm: working set reporting Yu Zhao
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

A reaccess is an access to a page, detected via refault or
access-bit harvesting, that happens after the initial access.
Similar to the working set histogram, the reaccess histogram
breaks down reaccesses into user-defined bins.

Currently the histogram only tracks reaccesses from access-bit
harvesting; the plan is to include refaults as well, by pulling
information from the shadow entries in folio->mapping->i_pages
for swapped-out pages.
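
As a worked example (numbers illustrative): if a walk moves 512 anon
pages out of an older generation into max_seq, walk->nr_pages for
that generation sums to -512, and those 512 pages are credited as
reaccesses to the bin covering that generation's idle age. max_seq
itself is skipped, since reaccessed pages land there by definition.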

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/wsr.h |   9 +++-
 mm/memcontrol.c     |  89 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c         |   6 ++-
 mm/wsr.c            | 101 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 2 deletions(-)

diff --git a/include/linux/wsr.h b/include/linux/wsr.h
index d45f7cc0672ac..68246734679cd 100644
--- a/include/linux/wsr.h
+++ b/include/linux/wsr.h
@@ -26,11 +26,14 @@ struct ws_bin {
 struct wsr {
 	/* protects bins */
 	struct mutex bins_lock;
+	/* protects reaccess_bins */
+	struct mutex reaccess_bins_lock;
 	struct kernfs_node *notifier;
 	unsigned long timestamp;
 	unsigned long report_threshold;
 	unsigned long refresh_threshold;
 	struct ws_bin bins[MAX_NR_BINS];
+	struct ws_bin reaccess_bins[MAX_NR_BINS];
 };
 
 void wsr_register_node(struct node *node);
@@ -48,6 +51,7 @@ ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins);
  */
 void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat);
+void report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk);
 void report_ws(struct pglist_data *pgdat, struct scan_control *sc);
 #else
 struct ws_bin;
@@ -71,7 +75,10 @@ static inline ssize_t wsr_intervals_ms_parse(char *src, struct ws_bin *bins)
 	return -EINVAL;
 }
 static inline void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
-		 struct pglist_data *pgdat)
+			       struct pglist_data *pgdat)
+{
+}
+static inline void report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
 {
 }
 static inline void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index edf5bb31bb19c..b901982d659d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6736,6 +6736,56 @@ static ssize_t memory_wsr_intervals_ms_write(struct kernfs_open_file *of,
 	return err ?: nbytes;
 }
 
+static int memory_reaccess_intervals_ms_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr;
+		struct ws_bin *bin;
+
+		wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+		mutex_lock(&wsr->reaccess_bins_lock);
+		seq_printf(m, "N%d=", nid);
+		for (bin = wsr->reaccess_bins; bin->idle_age != -1; bin++)
+			seq_printf(m, "%u,", jiffies_to_msecs(bin->idle_age));
+		mutex_unlock(&wsr->reaccess_bins_lock);
+
+		seq_printf(m, "%lld ", LLONG_MAX);
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static ssize_t memory_reaccess_intervals_ms_write(struct kernfs_open_file *of,
+						  char *buf, size_t nbytes,
+						  loff_t off)
+{
+	unsigned int nid;
+	int err;
+	struct wsr *wsr;
+	struct ws_bin *bins;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	bins = kzalloc(sizeof(wsr->reaccess_bins), GFP_KERNEL);
+	if (!bins)
+		return -ENOMEM;
+
+	err = memory_wsr_intervals_ms_parse(of, buf, nbytes, &nid, bins);
+	if (err)
+		goto failed;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+	mutex_lock(&wsr->reaccess_bins_lock);
+	memcpy(wsr->reaccess_bins, bins, sizeof(wsr->reaccess_bins));
+	mutex_unlock(&wsr->reaccess_bins_lock);
+failed:
+	kfree(bins);
+	return err ?: nbytes;
+}
+
 static int memory_wsr_refresh_ms_show(struct seq_file *m, void *v)
 {
 	int nid;
@@ -6874,6 +6924,34 @@ __poll_t memory_wsr_histogram_poll(struct kernfs_open_file *of,
 		return DEFAULT_POLLMASK | EPOLLPRI;
 	return DEFAULT_POLLMASK;
 }
+
+static int memory_reaccess_histogram_show(struct seq_file *m, void *v)
+{
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct wsr *wsr =
+			lruvec_wsr(mem_cgroup_lruvec(memcg, NODE_DATA(nid)));
+		struct ws_bin *bin;
+
+		seq_printf(m, "N%d\n", nid);
+
+		mutex_lock(&wsr->reaccess_bins_lock);
+		wsr_refresh(wsr, memcg, NODE_DATA(nid));
+		for (bin = wsr->reaccess_bins; bin->idle_age != -1; bin++)
+			seq_printf(m, "%u anon=%lu file=%lu\n",
+				   jiffies_to_msecs(bin->idle_age),
+				   bin->nr_pages[0], bin->nr_pages[1]);
+
+		seq_printf(m, "%lld anon=%lu file=%lu\n", LLONG_MAX,
+			   bin->nr_pages[0], bin->nr_pages[1]);
+
+		mutex_unlock(&wsr->reaccess_bins_lock);
+	}
+
+	return 0;
+}
 #endif
 
 static struct cftype memory_files[] = {
@@ -6969,6 +7047,17 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_wsr_histogram_show,
 		.poll = memory_wsr_histogram_poll,
 	},
+	{
+		.name = "reaccess.intervals_ms",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_reaccess_intervals_ms_show,
+		.write = memory_reaccess_intervals_ms_write,
+	},
+	{
+		.name = "reaccess.histogram",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_reaccess_histogram_show,
+	},
 #endif
 	{} /* terminate */
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba254b6e91e19..bc8c026ceef0d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4226,6 +4226,7 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_
 		mem_cgroup_unlock_pages();
 
 		if (walk->batched) {
+			report_reaccess(lruvec, walk);
 			spin_lock_irq(&lruvec->lru_lock);
 			reset_batch_size(lruvec, walk);
 			spin_unlock_irq(&lruvec->lru_lock);
@@ -5079,11 +5080,14 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 		sc->nr_scanned -= folio_nr_pages(folio);
 	}
 
+	walk = current->reclaim_state->mm_walk;
+	if (walk && walk->batched)
+		report_reaccess(lruvec, walk);
+
 	spin_lock_irq(&lruvec->lru_lock);
 
 	move_folios_to_lru(lruvec, &list);
 
-	walk = current->reclaim_state->mm_walk;
 	if (walk && walk->batched)
 		reset_batch_size(lruvec, walk);
 
diff --git a/mm/wsr.c b/mm/wsr.c
index cd045ade5e9ba..a63d678e64f8b 100644
--- a/mm/wsr.c
+++ b/mm/wsr.c
@@ -23,8 +23,10 @@ void wsr_init(struct lruvec *lruvec)
 	struct wsr *wsr = lruvec_wsr(lruvec);
 
 	mutex_init(&wsr->bins_lock);
+	mutex_init(&wsr->reaccess_bins_lock);
 	wsr->bins[0].idle_age = -1;
 	wsr->notifier = NULL;
+	wsr->reaccess_bins[0].idle_age = -1;
 }
 
 void wsr_destroy(struct lruvec *lruvec)
@@ -32,6 +34,7 @@ void wsr_destroy(struct lruvec *lruvec)
 	struct wsr *wsr = lruvec_wsr(lruvec);
 
 	mutex_destroy(&wsr->bins_lock);
+	mutex_destroy(&wsr->reaccess_bins_lock);
 	memset(wsr, 0, sizeof(*wsr));
 }
 
@@ -172,6 +175,104 @@ void refresh_wsr(struct wsr *wsr, struct mem_cgroup *root,
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 }
+
+static void collect_reaccess_locked(struct wsr *wsr,
+				    struct lru_gen_struct *lrugen,
+				    struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE(walk->max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]),
+	};
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+		struct ws_bin *bin = wsr->reaccess_bins;
+
+		lockdep_assert_held(&wsr->reaccess_bins_lock);
+		/* Skip max_seq because a reaccess moves a page from another seq
+		 * to max_seq. We use the negative change in page count from
+		 * other seqs to track the number of reaccesses.
+		 */
+		for (seq = max_seq - 1; seq + 1 > min_seq[type]; seq--) {
+			long error;
+			int next_gen;
+			unsigned long birth, gen_start;
+			long delta = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				long nr_pages = walk->nr_pages[gen][type][zone];
+
+				if (nr_pages < 0)
+					delta += -nr_pages;
+			}
+
+			birth = READ_ONCE(lrugen->timestamps[gen]);
+			next_gen = lru_gen_from_seq(seq + 1);
+			gen_start = READ_ONCE(lrugen->timestamps[next_gen]);
+
+			/* ensure gen_start is within idle_age of bin */
+			while (bin->idle_age != -1 &&
+			       time_before(gen_start + bin->idle_age,
+					  curr_timestamp))
+				bin++;
+
+			error = delta;
+			/* gen exceeds the idle_age of bin */
+			while (bin->idle_age != -1 &&
+			       time_before(birth + bin->idle_age,
+					   curr_timestamp)) {
+				unsigned long proportion =
+					gen_start -
+					(curr_timestamp - bin->idle_age);
+				unsigned long gen_len = gen_start - birth;
+
+				if (!gen_len)
+					break;
+				if (proportion) {
+					unsigned long split_bin =
+						delta / gen_len * proportion;
+					bin->nr_pages[type] += split_bin;
+					error -= split_bin;
+				}
+				gen_start = curr_timestamp - bin->idle_age;
+				bin++;
+			}
+			bin->nr_pages[type] += error;
+		}
+	}
+}
+
+static void collect_reaccess(struct wsr *wsr,
+				   struct lru_gen_struct *lrugen,
+				   struct lru_gen_mm_walk *walk)
+{
+	if (READ_ONCE(wsr->reaccess_bins->idle_age) == -1)
+		return;
+
+	mutex_lock(&wsr->reaccess_bins_lock);
+	collect_reaccess_locked(wsr, lrugen, walk);
+	mutex_unlock(&wsr->reaccess_bins_lock);
+}
+
+void report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	while (memcg) {
+		collect_reaccess(lruvec_wsr(mem_cgroup_lruvec(
+					 memcg, lruvec_pgdat(lruvec))),
+				 lrugen, walk);
+		memcg = parent_mem_cgroup(memcg);
+	}
+}
+
 static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
 {
 	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
-- 
2.41.0.162.gfafddb0af9-goog



* [RFC PATCH v2 6/6] virtio-balloon: Add Working Set reporting
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
                   ` (4 preceding siblings ...)
  2023-06-21 18:04 ` [RFC PATCH v2 5/6] mm: add per-memcg reaccess histogram Yuanchu Xie
@ 2023-06-21 18:04 ` Yuanchu Xie
  2023-06-21 18:48 ` [RFC PATCH v2 0/6] mm: working set reporting Yu Zhao
  6 siblings, 0 replies; 8+ messages in thread
From: Yuanchu Xie @ 2023-06-21 18:04 UTC
  To: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Yu Zhao,
	Kefeng Wang, Kairui Song, Yosry Ahmed, Yuanchu Xie,
	T . J . Alumbaugh
  Cc: Wei Xu, SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch,
	jon, Aneesh Kumar K V, Matthew Wilcox, Vasily Averin,
	linux-kernel, virtualization, linux-mm, cgroups

From: "T.J. Alumbaugh" <talumbau@google.com>

Add working set and notification virtqueues, along with a
simple interface to the kernel working set functions. The driver
receives config info and sends reports upon notification. A mutex
guards the virtio_balloon state.
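
For reference, each report entry carries a tag, a NUMA node id, a bin
boundary and the anon/file sizes. A sketch inferred from the accessors
in update_working_set() below; field order and types are assumptions,
the authoritative layout being the uapi header and the spec proposal:

struct virtio_balloon_working_set {
	__virtio16 tag;			/* e.g. VIRTIO_BALLOON_WS_RECLAIMABLE */
	__virtio16 node_id;		/* NUMA node, -1 when unused */
	__virtio64 idle_age_ms;		/* bin boundary in milliseconds */
	__virtio64 memory_size_bytes[2];	/* [0]=anon, [1]=file */
};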

Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 drivers/virtio/virtio_balloon.c     | 288 ++++++++++++++++++++++++++++
 include/linux/balloon_compaction.h  |   3 +
 include/linux/wsr.h                 |  29 ++-
 include/uapi/linux/virtio_balloon.h |  33 ++++
 mm/vmscan.c                         | 106 ++++++++++
 5 files changed, 457 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 3f78a3a1eb753..0cb6a46eb7e8a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/workqueue.h>
 #include <linux/delay.h>
+#include <linux/device.h>
 #include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/balloon_compaction.h>
@@ -45,6 +46,8 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
 	VIRTIO_BALLOON_VQ_REPORTING,
+	VIRTIO_BALLOON_VQ_WORKING_SET,
+	VIRTIO_BALLOON_VQ_NOTIFY,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -55,6 +58,9 @@ enum virtio_balloon_config_read {
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+#ifdef CONFIG_WSR
+	struct virtqueue *working_set_vq, *notification_vq;
+#endif
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -64,6 +70,10 @@ struct virtio_balloon {
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+#ifdef CONFIG_WSR
+	struct work_struct update_balloon_working_set_work;
+	struct work_struct update_balloon_notification_work;
+#endif
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -119,6 +129,16 @@ struct virtio_balloon {
 	/* Free page reporting device */
 	struct virtqueue *reporting_vq;
 	struct page_reporting_dev_info pr_dev_info;
+
+#ifdef CONFIG_WSR
+	/* Working Set reporting */
+	u8 working_set_num_bins;
+	struct virtio_balloon_working_set *working_set;
+
+	/* A buffer to hold incoming notification from the host. */
+	unsigned int notification_size;
+	void *notification_buf;
+#endif
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -465,6 +485,211 @@ static void update_balloon_stats_func(struct work_struct *work)
 	stats_handle_request(vb);
 }
 
+#ifdef CONFIG_WSR
+/* Must hold the balloon_lock while calling this function. */
+static inline void reset_working_set(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->working_set_num_bins; ++i) {
+		vb->working_set[i].tag = cpu_to_virtio16(vb->vdev, -1);
+		vb->working_set[i].node_id = cpu_to_virtio16(vb->vdev, -1);
+		vb->working_set[i].idle_age_ms = cpu_to_virtio64(vb->vdev, 0);
+		vb->working_set[i].memory_size_bytes[0] = cpu_to_virtio64(vb->vdev, -1);
+		vb->working_set[i].memory_size_bytes[1] = cpu_to_virtio64(vb->vdev, -1);
+	}
+}
+
+/* Must hold the balloon_lock while calling this function. */
+static inline void update_working_set(struct virtio_balloon *vb, int idx,
+			       u64 idle_age, u64 bytes_anon,
+			       u64 bytes_file, int node_id)
+{
+	vb->working_set[idx].tag = cpu_to_virtio16(vb->vdev, VIRTIO_BALLOON_WS_RECLAIMABLE);
+	vb->working_set[idx].node_id = cpu_to_virtio16(vb->vdev, node_id);
+	vb->working_set[idx].idle_age_ms = cpu_to_virtio64(vb->vdev, idle_age);
+	vb->working_set[idx].memory_size_bytes[0] = cpu_to_virtio64(vb->vdev,
+	     bytes_anon);
+	vb->working_set[idx].memory_size_bytes[1] = cpu_to_virtio64(vb->vdev,
+	     bytes_file);
+}
+
+static bool working_set_is_init(struct virtio_balloon *vb)
+{
+	return vb->working_set[0].idle_age_ms > 0;
+}
+
+static void virtio_balloon_working_set_request(void)
+{
+	struct pglist_data *pgdat;
+	int nid = NUMA_NO_NODE;
+
+	if (IS_ENABLED(CONFIG_NUMA)) {
+		for_each_online_node(nid) {
+			pgdat = NODE_DATA(nid);
+			working_set_request(pgdat);
+		}
+	} else {
+		pgdat = NODE_DATA(nid);
+		working_set_request(pgdat);
+	}
+}
+
+static void notification_receive(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	spin_lock(&vb->stop_update_lock);
+	if (!vb->stop_update)
+		queue_work(system_freezable_wq, &vb->update_balloon_notification_work);
+	spin_unlock(&vb->stop_update_lock);
+}
+
+static int virtio_balloon_register_working_set_receiver(struct virtio_balloon *vb,
+	__virtio64 *intervals, unsigned long nr_bins, __virtio64 refresh_ms,
+	__virtio64 report_ms)
+{
+	struct pglist_data *pgdat;
+	unsigned long *bin_intervals;
+	unsigned long refresh, report;
+	int i, err;
+	int nid = NUMA_NO_NODE;
+
+	if (!intervals || !nr_bins)
+		return -EINVAL;
+
+	/* TODO: keep values as 32-bits throughout. */
+	bin_intervals = kcalloc(nr_bins - 1, sizeof(unsigned long), GFP_KERNEL);
+	if (!bin_intervals)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_bins - 1; i++)
+		bin_intervals[i] = virtio64_to_cpu(vb->vdev, intervals[i]);
+	refresh = virtio64_to_cpu(vb->vdev, refresh_ms);
+	report = virtio64_to_cpu(vb->vdev, report_ms);
+
+	if (IS_ENABLED(CONFIG_NUMA)) {
+		for_each_online_node(nid) {
+			pgdat = NODE_DATA(nid);
+			err = register_working_set_receiver(vb, pgdat,
+				bin_intervals, nr_bins, refresh, report);
+			if (err)
+				break;
+		}
+	} else {
+		pgdat = NODE_DATA(nid);
+		err = register_working_set_receiver(vb, pgdat, bin_intervals,
+			nr_bins, refresh, report);
+	}
+	kfree(bin_intervals);
+	return err;
+}
+
+void working_set_notify(void *ws_receiver, struct ws_bin *bins, int node_id)
+{
+	u64 bytes_nr_file, bytes_nr_anon;
+	struct virtio_balloon *vb = ws_receiver;
+	int idx = 0;
+
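+	/* Drop the report rather than block the caller on the mutex. */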
+	if (!mutex_trylock(&vb->balloon_lock))
+		return;
+	for (; idx < vb->working_set_num_bins; idx++) {
+		bytes_nr_anon = (u64)(bins[idx].nr_pages[0]) * PAGE_SIZE;
+		bytes_nr_file = (u64)(bins[idx].nr_pages[1]) * PAGE_SIZE;
+		update_working_set(vb, idx, jiffies_to_msecs(bins[idx].idle_age),
+			bytes_nr_anon, bytes_nr_file, node_id);
+	}
+	mutex_unlock(&vb->balloon_lock);
+	/* Send the working set report to the device. */
+	spin_lock(&vb->stop_update_lock);
+	if (!vb->stop_update)
+		queue_work(system_freezable_wq, &vb->update_balloon_working_set_work);
+	spin_unlock(&vb->stop_update_lock);
+}
+EXPORT_SYMBOL(working_set_notify);
+
+static void update_balloon_notification_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+	struct scatterlist sg_in;
+	__virtio64 *bin_intervals;
+	__virtio64 refresh_ms, report_ms;
+	u16 tag;
+	char *buf;
+	unsigned int len;
+
+	vb = container_of(work, struct virtio_balloon,
+			  update_balloon_notification_work);
+
+	/* Read a Working Set notification from the device. */
+	buf = (char *)vb->notification_buf;
+	tag = virtio16_to_cpu(vb->vdev, *(__virtio16 *)buf);
+	buf += sizeof(__virtio16);
+	if (tag == VIRTIO_BALLOON_WS_REQUEST) {
+		virtio_balloon_working_set_request();
+	} else if (tag == VIRTIO_BALLOON_WS_CONFIG) {
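+		/*
+		 * New configuration: drop any stale report, then parse
+		 * the bin boundaries and thresholds and (re)register
+		 * the balloon as a receiver on every node.
+		 */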
+		mutex_lock(&vb->balloon_lock);
+		reset_working_set(vb);
+		mutex_unlock(&vb->balloon_lock);
+		bin_intervals = (__virtio64 *) buf;
+		buf += sizeof(__virtio64) * (vb->working_set_num_bins - 1);
+		refresh_ms = *((__virtio64 *) buf);
+		buf += sizeof(__virtio64);
+		report_ms = *((__virtio64 *) buf);
+		virtio_balloon_register_working_set_receiver(vb, bin_intervals,
+			vb->working_set_num_bins, refresh_ms, report_ms);
+	} else {
+		dev_warn(&vb->vdev->dev, "Received invalid notification, %u\n", tag);
+		/* Fall through so the notification buffer is re-queued. */
+	}
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vb->notification_vq, &len))
+		;
+	/* Add a new notification buffer for device to fill. */
+	sg_init_one(&sg_in, vb->notification_buf, vb->notification_size);
+	virtqueue_add_inbuf(vb->notification_vq, &sg_in, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vb->notification_vq);
+}
+
+static void update_balloon_ws_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+	struct scatterlist sg_out;
+	int err = 0;
+	unsigned int unused;
+
+	vb = container_of(work, struct virtio_balloon,
+			  update_balloon_working_set_work);
+
+	mutex_lock(&vb->balloon_lock);
+	if (working_set_is_init(vb)) {
+		/* Detach all the used buffers from the vq */
+		while (virtqueue_get_buf(vb->working_set_vq, &unused))
+			;
+		sg_init_one(&sg_out, vb->working_set,
+			(sizeof(struct virtio_balloon_working_set) *
+			vb->working_set_num_bins));
+		err = virtqueue_add_outbuf(vb->working_set_vq, &sg_out, 1, vb, GFP_KERNEL);
+	} else {
+		dev_warn(&vb->vdev->dev, "Working Set not initialized\n");
+		err = -EINVAL;
+	}
+	mutex_unlock(&vb->balloon_lock);
+	if (unlikely(err)) {
+		dev_err(&vb->vdev->dev,
+			"Failed to send working set report err = %d\n", err);
+	} else {
+		virtqueue_kick(vb->working_set_vq);
+	}
+}
+#endif /* CONFIG_WSR */
+
 static void update_balloon_size_func(struct work_struct *work)
 {
 	struct virtio_balloon *vb;
@@ -508,6 +733,10 @@ static int init_vqs(struct virtio_balloon *vb)
 	callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
+	callbacks[VIRTIO_BALLOON_VQ_WORKING_SET] = NULL;
+	names[VIRTIO_BALLOON_VQ_WORKING_SET] = NULL;
+	callbacks[VIRTIO_BALLOON_VQ_NOTIFY] = NULL;
+	names[VIRTIO_BALLOON_VQ_NOTIFY] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -524,6 +753,15 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
 	}
 
+#ifdef CONFIG_WSR
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+		names[VIRTIO_BALLOON_VQ_WORKING_SET] = "ws";
+		callbacks[VIRTIO_BALLOON_VQ_WORKING_SET] = NULL;
+		names[VIRTIO_BALLOON_VQ_NOTIFY] = "notify";
+		callbacks[VIRTIO_BALLOON_VQ_NOTIFY] = notification_receive;
+	}
+#endif
+
 	err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
 			      callbacks, names, NULL);
 	if (err)
@@ -534,6 +772,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
 		unsigned int num_stats;
+
 		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
 
 		/*
@@ -553,6 +792,25 @@ static int init_vqs(struct virtio_balloon *vb)
 		virtqueue_kick(vb->stats_vq);
 	}
 
+#ifdef CONFIG_WSR
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+		struct scatterlist sg;
+
+		vb->working_set_vq = vqs[VIRTIO_BALLOON_VQ_WORKING_SET];
+		vb->notification_vq = vqs[VIRTIO_BALLOON_VQ_NOTIFY];
+
+		/* Prime the notification virtqueue for the device to fill. */
+		sg_init_one(&sg, vb->notification_buf, vb->notification_size);
+		err = virtqueue_add_inbuf(vb->notification_vq, &sg, 1, vb, GFP_KERNEL);
+		if (unlikely(err)) {
+			dev_err(&vb->vdev->dev,
+				"Failed to prepare notifications, err = %d\n", err);
+		} else {
+			virtqueue_kick(vb->notification_vq);
+		}
+	}
+#endif
+
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
 
@@ -878,6 +1136,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
 	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
+#ifdef CONFIG_WSR
+	INIT_WORK(&vb->update_balloon_working_set_work, update_balloon_ws_func);
+	INIT_WORK(&vb->update_balloon_notification_work, update_balloon_notification_func);
+#endif
 	spin_lock_init(&vb->stop_update_lock);
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
@@ -885,6 +1147,23 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
 	balloon_devinfo_init(&vb->vb_dev_info);
 
+#ifdef CONFIG_WSR
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_WS_REPORTING)) {
+		virtio_cread_le(vdev, struct virtio_balloon_config, working_set_num_bins,
+				&vb->working_set_num_bins);
+		dev_dbg(&vb->vdev->dev, "working set bins: %u\n",
+			vb->working_set_num_bins);
+		/* Allocate space for a Working Set report. */
+		vb->working_set = kcalloc(vb->working_set_num_bins,
+				 sizeof(struct virtio_balloon_working_set), GFP_KERNEL);
+		if (!vb->working_set) {
+			err = -ENOMEM;
+			goto out_free_vb;
+		}
+		/* Allocate space for host notifications. */
+		vb->notification_size =
+			sizeof(__virtio16) +
+			sizeof(__virtio64) * (vb->working_set_num_bins + 1);
+		vb->notification_buf = kzalloc(vb->notification_size, GFP_KERNEL);
+		if (!vb->notification_buf) {
+			err = -ENOMEM;
+			kfree(vb->working_set);
+			goto out_free_vb;
+		}
+		reset_working_set(vb);
+	}
+#endif
+
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -1034,11 +1313,19 @@ static void virtballoon_remove(struct virtio_device *vdev)
 		unregister_oom_notifier(&vb->oom_nb);
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		virtio_balloon_unregister_shrinker(vb);
+#ifdef CONFIG_WSR
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_WS_REPORTING))
+		unregister_working_set_receiver(vb);
+#endif
 	spin_lock_irq(&vb->stop_update_lock);
 	vb->stop_update = true;
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+#ifdef CONFIG_WSR
+	cancel_work_sync(&vb->update_balloon_working_set_work);
+	cancel_work_sync(&vb->update_balloon_notification_work);
+#endif
 
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
 		cancel_work_sync(&vb->report_free_page_work);
@@ -1104,6 +1391,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
 	VIRTIO_BALLOON_F_REPORTING,
+	VIRTIO_BALLOON_F_WS_REPORTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 5ca2d56996201..7bbf5281d84d3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -43,6 +43,7 @@
 #include <linux/err.h>
 #include <linux/fs.h>
 #include <linux/list.h>
+#include <linux/mmzone.h>
 
 /*
  * Balloon device information descriptor.
@@ -67,6 +68,8 @@ extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
 				      struct list_head *pages);
 extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
 				     struct list_head *pages, size_t n_req_pages);
+extern void working_set_notify(void *ws_receiver, struct ws_bin *bins,
+			       int node_id);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 {
diff --git a/include/linux/wsr.h b/include/linux/wsr.h
index 68246734679cd..671ca5426254d 100644
--- a/include/linux/wsr.h
+++ b/include/linux/wsr.h
@@ -53,6 +53,16 @@ void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat);
 void report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk);
 void report_ws(struct pglist_data *pgdat, struct scan_control *sc);
+/*
+ * Interface for registering a receiver (e.g. the balloon driver) and
+ * requesting working set reports.
+ * TODO: Replace with a proper registration interface, similar to shrinkers.
+ */
+int register_working_set_receiver(void *receiver, struct pglist_data *pgdat,
+				unsigned long *intervals, unsigned long nr_bins,
+				unsigned long refresh_threshold,
+				unsigned long report_threshold);
+void unregister_working_set_receiver(void *receiver);
+bool working_set_request(struct pglist_data *pgdat);
 #else
 struct ws_bin;
 struct wsr;
@@ -84,6 +94,21 @@ static inline void report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk
 static inline void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
 {
 }
-#endif	/* CONFIG_WSR */
+static inline int
+register_working_set_receiver(void *receiver, struct pglist_data *pgdat,
+			      unsigned long *intervals, unsigned long nr_bins,
+			      unsigned long refresh_threshold,
+			      unsigned long report_threshold)
+{
+	return -EINVAL;
+}
+static inline void unregister_working_set_receiver(void *receiver)
+{
+}
+static inline bool working_set_request(struct pglist_data *pgdat)
+{
+	return false;
+}
+#endif /* CONFIG_WSR */
 
-#endif	/* _LINUX_WSR_H */
+#endif /* _LINUX_WSR_H */
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4c..a682d917daca1 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -37,6 +37,7 @@
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 #define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
+#define VIRTIO_BALLOON_F_WS_REPORTING	6 /* Working Set reporting */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -59,6 +60,9 @@ struct virtio_balloon_config {
 	};
 	/* Stores PAGE_POISON if page poisoning is in use */
 	__le32 poison_val;
+	/* Number of bins for Working Set report if in use. */
+	__u8 working_set_num_bins;
+	__u8 padding[3];
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
@@ -116,4 +120,33 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Enumerate all possible message types from the device. */
+enum virtio_balloon_working_set_op {
+	VIRTIO_BALLOON_WS_REQUEST = 1,
+	VIRTIO_BALLOON_WS_CONFIG = 2,
+};
+
+/* The metadata values for Working Set Reports. */
+enum virtio_balloon_working_set_tags {
+	/* Memory is reclaimable by guest */
+	VIRTIO_BALLOON_WS_RECLAIMABLE = 0,
+	/* Memory can only be discarded by guest */
+	VIRTIO_BALLOON_WS_DISCARDABLE = 1,
+};
+
+/*
+ * Working Set Report structure.
+ */
+struct virtio_balloon_working_set {
+	/* A tag for additional metadata. */
+	__virtio16 tag;
+	/* The NUMA node for this report. */
+	__virtio16 node_id;
+	__u8 reserved[4];
+	/* The idle age (in ms) of this bin of memory. */
+	__virtio64 idle_age_ms;
+	/* A bin each for anonymous and file-backed memory. */
+	__virtio64 memory_size_bytes[2];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc8c026ceef0d..c89728f8f61ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5911,6 +5911,57 @@ late_initcall(init_lru_gen);
  ******************************************************************************/
 
 #ifdef CONFIG_WSR
+static void *wsr_receiver;
+
+/*
+ * Register/unregister a receiver of working set notifications
+ * TODO: Replace with a proper registration interface, similar to shrinkers.
+ */
+int register_working_set_receiver(void *receiver, struct pglist_data *pgdat,
+				  unsigned long *intervals,
+				  unsigned long nr_bins,
+				  unsigned long refresh_threshold,
+				  unsigned long report_threshold)
+{
+	struct wsr *wsr;
+	struct ws_bin *bins;
+	int i;
+
+	wsr_receiver = receiver;
+
+	if (!pgdat)
+		return 0;
+
+	if (!intervals || !nr_bins || nr_bins > ARRAY_SIZE(wsr->bins))
+		return -EINVAL;
+
+	bins = kzalloc(sizeof(wsr->bins), GFP_KERNEL);
+	if (!bins)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_bins - 1; i++) {
+		bins[i].idle_age = msecs_to_jiffies(*intervals);
+		intervals++;
+	}
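+	/* The last bin is a sentinel; it absorbs everything older. */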
+	bins[i].idle_age = -1;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(NULL, pgdat));
+
+	mutex_lock(&wsr->bins_lock);
+	memcpy(wsr->bins, bins, sizeof(wsr->bins));
+	WRITE_ONCE(wsr->refresh_threshold, msecs_to_jiffies(refresh_threshold));
+	WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(report_threshold));
+	mutex_unlock(&wsr->bins_lock);
+	return 0;
+}
+EXPORT_SYMBOL(register_working_set_receiver);
+
+void unregister_working_set_receiver(void *receiver)
+{
+	if (wsr_receiver == receiver)
+		wsr_receiver = NULL;
+}
+EXPORT_SYMBOL(unregister_working_set_receiver);
+
 void wsr_refresh(struct wsr *wsr, struct mem_cgroup *root,
 		 struct pglist_data *pgdat)
 {
@@ -5967,6 +6018,16 @@ void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
 	refresh_wsr(wsr, memcg, pgdat, sc, 0);
 	WRITE_ONCE(wsr->timestamp, jiffies);
 
+	/*
+	 * The balloon driver subscribes to global memory reclaim.
+	 * This requires CONFIG_VIRTIO_BALLOON=y, not m, because it
+	 * calls a function defined in virtio_balloon.c. This is a
+	 * hack to make balloon notifications work in a proof of
+	 * concept; a proper notification registration interface is
+	 * on the TODO list.
+	 */
+	if (!cgroup_reclaim(sc) && wsr_receiver)
+		working_set_notify(wsr_receiver, wsr->bins, pgdat->node_id);
+
 	mutex_unlock(&wsr->bins_lock);
 
 	if (wsr->notifier)
@@ -5974,6 +6035,51 @@ void report_ws(struct pglist_data *pgdat, struct scan_control *sc)
 	if (memcg && cmpxchg(&memcg->wsr_event, 0, 1) == 0)
 		wake_up_interruptible(&memcg->wsr_wait_queue);
 }
+
+/* TODO: Replace with a proper registration interface, similar to shrinkers. */
+bool working_set_request(struct pglist_data *pgdat)
+{
+	unsigned int flags;
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	struct wsr *wsr;
+
+	if (!wsr_receiver)
+		return false;
+
+	wsr = lruvec_wsr(mem_cgroup_lruvec(NULL, pgdat));
+
+	if (!mutex_trylock(&wsr->bins_lock))
+		return false;
+
+	if (wsr->bins->idle_age != -1) {
+		unsigned long timestamp = READ_ONCE(wsr->timestamp);
+		unsigned long threshold = READ_ONCE(wsr->refresh_threshold);
+
+		if (time_is_before_jiffies(timestamp + threshold)) {
+			/* The cached report is stale; refresh it. */
+			set_task_reclaim_state(current, &sc.reclaim_state);
+			flags = memalloc_noreclaim_save();
+			refresh_wsr(wsr, NULL, pgdat, &sc, threshold);
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+		}
+	}
+
+	if (wsr_receiver)
+		working_set_notify(wsr_receiver, wsr->bins, pgdat->node_id);
+
+	mutex_unlock(&wsr->bins_lock);
+	return true;
+}
+EXPORT_SYMBOL(working_set_request);
+
 #endif /* CONFIG_WSR */
 
 #else /* !CONFIG_LRU_GEN */
-- 
2.41.0.162.gfafddb0af9-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH v2 0/6] mm: working set reporting
  2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
                   ` (5 preceding siblings ...)
  2023-06-21 18:04 ` [RFC PATCH v2 6/6] virtio-balloon: Add Working Set reporting Yuanchu Xie
@ 2023-06-21 18:48 ` Yu Zhao
  6 siblings, 0 replies; 8+ messages in thread
From: Yu Zhao @ 2023-06-21 18:48 UTC (permalink / raw)
  To: Yuanchu Xie
  Cc: Greg Kroah-Hartman, Rafael J . Wysocki, Michael S . Tsirkin,
	David Hildenbrand, Jason Wang, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Kefeng Wang, Kairui Song, Yosry Ahmed, T . J . Alumbaugh, Wei Xu,
	SeongJae Park, Sudarshan Rajagopalan, kai.huang, hch, jon,
	Aneesh Kumar K V, Matthew Wilcox, Vasily Averin, linux-kernel,
	virtualization, linux-mm, cgroups

On Wed, Jun 21, 2023 at 12:16 PM Yuanchu Xie <yuanchu@google.com> wrote:
>
> RFC v1: https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
> For background and interfaces, see the RFC v1 posting.

v1 only mentioned one use case (ballooning), but we both know there
are at least two solid use cases (the other being job
scheduling/binpacking, e.g., for kubernetes [1]).

Please do a survey, as thoroughly as possible, of use cases.
* What's the significance of WSR to the landscape, in terms of server
and client use cases?
* How would userspace tools, e.g., a PMU-based memory profiler,
leverage the infra provided by WSR?
* Would those who register slab shrinkers, e.g., DMA bufs [2], want
to report their working sets?
* Does this effort intersect with memory placement across NUMA and CXL.mem?

[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[2] https://lore.kernel.org/linux-mm/20230123191728.2928839-1-tjmercier@google.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-06-21 18:49 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-21 18:04 [RFC PATCH v2 0/6] mm: working set reporting Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 1/6] mm: aggregate working set information into histograms Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 2/6] mm: add working set refresh threshold to rate-limit aggregation Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 3/6] mm: report working set when under memory pressure Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 4/6] mm: extend working set reporting to memcgs Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 5/6] mm: add per-memcg reaccess histogram Yuanchu Xie
2023-06-21 18:04 ` [RFC PATCH v2 6/6] virtio-balloon: Add Working Set reporting Yuanchu Xie
2023-06-21 18:48 ` [RFC PATCH v2 0/6] mm: working set reporting Yu Zhao
