linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] Refault distance checking for MGLRU
@ 2023-07-25 18:57 Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 1/4] workingset: simplify and use a more intuitive model Kairui Song
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Kairui Song @ 2023-07-25 18:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
	Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
	Suren Baghdasaryan, T . J . Mercier, Kairui Song

From: Kairui Song <kasong@tencent.com>

Hi, linux-mm

I noticed that MGLRU does not work well for certain workloads; this was
observed on some instances running on heavily stressed machines.

I found this is related to refault distance detection. It happens when
the file page workingset size exceeds total memory, and the access
distance (the left-shift time of a page before it gets activated,
considering the LRU starts from the right) of file pages is also larger
than total memory. All file pages are stuck in the oldest generation,
getting read in and then evicted over and over; few get activated and
stay in memory.

This series tries to fix this problem by reworking the refault distance
detection to better fit MGLRU, and also tries to use a unified
algorithm for both MGLRU and the Inactive/Active LRU.

Patch 1/4 reworks the refault distance detection model for the
Inactive/Active LRU.

Patches 2/4 and 3/4 are simplifications and preparation.

Patch 4/4 applies the modified refault distance detection
to MGLRU.

The following benchmark showed a 5x improvement:
To simulate the workload, I set up a 3-replica mongodb cluster using
docker, each replica in a standalone cgroup, set to use 5G of cache and
10G of oplog, on a 32G VM. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate a slow query and get a stable result.

Before the patch (with 10G swap, the result won't change whether
swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239

After the patch (with 10G swap on same disk, similar result using ZRAM):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577

The performance is about 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
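
The figures quoted above can be cross-checked with a little arithmetic.
The sketch below is only an illustration: ratio() and print_summary()
are hypothetical helpers, and the constants are copied from the tpcc
and /proc/vmstat output in this mail. It assumes the two vmstat samples
are comparable (for example, both taken after a fresh boot), which is
not stated here.

```c
#include <stdio.h>

/* Ratio of two counters as a double; returns 0 when the denominator is 0. */
static double ratio(unsigned long num, unsigned long den)
{
	return den ? (double)num / (double)den : 0.0;
}

/* Print figures derived from the numbers quoted in this mail. */
static void print_summary(void)
{
	/* STOCK_LEVEL transactions executed in ~900s: after vs. before. */
	printf("throughput gain: %.2fx\n", ratio(2575, 503));

	/* workingset_activate_file / workingset_refault_file */
	printf("file activate/refault before: %.2f%%\n",
	       100.0 * ratio(23845737, 23856735));
	printf("file activate/refault after:  %.2f%%\n",
	       100.0 * ratio(6718032, 23001863));
}
```

The throughput ratio comes out at roughly 5.1x. The share of file
refaults that were counted as activations drops from nearly all of them
to under a third, while anon refaults go from 0 to non-zero.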

I also ran memtier/memcached and fio benchmarks, using a setup similar
to the one in commit ac35a4902374 but scaled down to fit in my test
environment:

  memtier test (with 16G ramdisk as swap and 2G cgroup limit):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
    -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test (with 16G ramdisk on /mnt and 4G cgroup limit):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

Before this patch:
memcached:
            Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
  Best      52832.79         0.00         0.00         1.82042         1.70300         4.54300         6.27100    105641.69
  Worst     46613.56         0.00         0.00         2.05686         1.77500         7.80700        11.83900     93206.05
  Avg (6x)  51024.85         0.00         0.00         1.88506         1.73500         5.43900         9.47100    102026.64
fio:
  read: IOPS=2211k, BW=8637MiB/s (9056MB/s)(2530GiB/300001msec)

After this patch:
memcached:
            Ops/sec     Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
  Best      54218.92         1.76930         1.65500         4.41500         6.27100    108413.34
  Worst     47640.13         2.01495         1.74300         7.64700        11.64700     95258.72
  Avg (6x)  51408.33         1.86988         1.71900         5.43900         9.34300    102793.42
fio:
  read: IOPS=2166k, BW=8462MiB/s (8873MB/s)(2479GiB/300001msec)

memcached looks OK, but there is a 2% performance drop in the fio test.
After some profiling, this is mainly caused by the extra atomic
operations and new functions; there seems to be no LRU accuracy drop.

Sending this as an RFC as I'm not entirely sure if this is the right
way to fix this issue, or if this is a generic issue or considered
more of a misconfiguration. Any suggestions about how I should test
it are welcome.

Signed-off-by: Kairui Song <kasong@tencent.com>

Kairui Song (4):
  workingset: simplify and use a more intuitive model
  workingset: simplify lru_gen_test_recent
  lru_gen: convert avg_total and avg_refaulted to atomic
  workingset, lru_gen: apply refault-distance based re-activation

 include/linux/mmzone.h |   4 +-
 include/linux/swap.h   |   2 -
 mm/swap.c              |   1 -
 mm/vmscan.c            |  18 ++-
 mm/workingset.c        | 315 ++++++++++++++++++++++-------------------
 5 files changed, 179 insertions(+), 161 deletions(-)

-- 
2.41.0




* [RFC PATCH 1/4] workingset: simplify and use a more intuitive model
  2023-07-25 18:57 [RFC PATCH 0/4] Refault distance checking for MGLRU Kairui Song
@ 2023-07-25 18:57 ` Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 2/4] workingset: simplify lru_gen_test_recent Kairui Song
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2023-07-25 18:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
	Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
	Suren Baghdasaryan, T . J . Mercier, Kairui Song

From: Kairui Song <kasong@tencent.com>

This basically removes workingset_activation and reduces calls to
workingset_age_nonresident.

The idea behind this change is a new way to calculate the refault
distance, which seems to work fine in most cases and fits MGLRU (in
later commits).

Current refault distance is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (consider
   LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true:
the activated page could be anywhere in the LRU list, and it only
left-shifts the pages on its right. And one page can get activated and
deactivated multiple times.
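
To illustrate why assumption 1 does not always hold, here is a toy
userspace model, not kernel code. The inactive list is an array with
the head on the right (highest index), and activate_from_inactive() is
a hypothetical helper that removes the page at index pos and reports
how many pages moved one slot to the left.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model: lru[0] is the tail (next to be evicted) and lru[len - 1]
 * is the head. Removing the page at index pos (0 <= pos < len) shifts
 * only the pages to its right; the pages to its left stay in place.
 * Returns the number of pages that moved one slot to the left.
 */
static size_t activate_from_inactive(int *lru, size_t len, size_t pos)
{
	size_t moved = len - 1 - pos;

	memmove(&lru[pos], &lru[pos + 1], moved * sizeof(lru[0]));
	return moved;
}
```

Activating the page at index 3 of a 5-page list moves one page only,
while activating the page at index 0 moves four, so one activation does
not age every inactive page by one slot.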

And MGLRU doesn't fit with this model, since there are multiple gens,
and pages are getting aged and activated constantly upon gen growth.

So instead we introduce a new idea here, the "Shadow LRU Position".
Simply consider that the evicted pages are still in memory, each with
an eviction sequence like before. Let `nonresident_age` be NA, which is
increased on each eviction, so the "Shadow LRU Position" of an evicted
page will be:

    SP = ((NA's reading @ current) - (NA's reading @ eviction))

 +---------------------------------------+==========+========+
 |    *   Shadow LRU    O        O  O    | INACTIVE | ACTIVE |
 +----+----------------------------------+==========+========+
      |                                  |
      +----------------------------------+
      |             SP
  refault page                  O -> Hole left by previously refaulted page
                                * -> The page corresponding to SP
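
A minimal userspace sketch of this bookkeeping follows, taking SP as
the number of evictions counted between the page's eviction and now.
It assumes the snapshot stored in the shadow entry keeps only the low
`bits` bits of NA and ignores bucket_order; low_bits(),
shadow_snapshot() and shadow_position() are hypothetical names, not
functions from this patch.

```c
#include <limits.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Keep only the low `bits` bits; requires 1 <= bits <= BITS_PER_ULONG. */
static unsigned long low_bits(unsigned long v, unsigned int bits)
{
	return v & (~0UL >> (BITS_PER_ULONG - bits));
}

/* NA reading stored in the shadow entry at eviction time. */
static unsigned long shadow_snapshot(unsigned long na, unsigned int bits)
{
	return low_bits(na, bits);
}

/*
 * SP at refault time. The unsigned subtraction followed by the mask
 * stays correct when NA has grown past the stored width, as long as
 * NA has not lapped the snapshot.
 */
static unsigned long shadow_position(unsigned long na_now,
				     unsigned long snapshot,
				     unsigned int bits)
{
	return low_bits(na_now - snapshot, bits);
}
```

With 8 stored bits, a page evicted at NA = 506 and refaulted at
NA = 516 gets SP = 10, although the stored snapshot (250) is
numerically unrelated to the current reading.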

Since SP stands for how far the current workload has pushed a page out
of memory, a page that started in the INACTIVE part *may* be
re-activated if, shifted right by SP slots into the ACTIVE list, it
still does not exceed total memory, that is:

  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

Which can be simplified to:

  SP < NR_ACTIVE

Since this is only an estimation based on several hypotheses, and it
actually violates the normal routine of the LRU when the LRU is working
well, throttle it by two factors:

1. Previously refaulted pages may leave "holes" on the shadow LRU and
   decrease the re-activation rate for distant shadow pages.
2. When the ACTIVE part of the LRU is long enough, challenging it by
   activating a one-time faulted inactive page may not be a good idea,
   so throttle it by the ratio of ACTIVE/INACTIVE.

Combining all of the above, we have:

Upon refault:
- If ACTIVE LRU is low (NR_ACTIVE <= NR_INACTIVE), check if
  SP < NR_ACTIVE to check for re-activation.
- If ACTIVE LRU is high (NR_ACTIVE > NR_INACTIVE), check if
  SP < min(NR_ACTIVE, NR_INACTIVE) / (exponential ratio of ACTIVE / INACTIVE).

This is simpler than before since there is no longer any need to do
lruvec operations when activating a page, and so far a few benchmarks
show fair results.
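
The check can be modelled in userspace as below. This is a sketch that
mirrors the comparison done in lru_refault() in the diff of this patch;
fls_ulong() is a stand-in for the kernel's fls_long(), and
should_reactivate() is a hypothetical name.

```c
#include <stdbool.h>

/* 1-based index of the highest set bit, 0 for 0 (like fls_long()). */
static unsigned int fls_ulong(unsigned long v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}

/* Decide whether a refaulting page with distance sp gets re-activated. */
static bool should_reactivate(unsigned long sp, unsigned long active,
			      unsigned long inactive)
{
	if (active > inactive) {
		/*
		 * Throttled case: halve the inactive size once, then once
		 * more for every two bits of gap between the list sizes.
		 */
		unsigned int shift = 1 +
			(fls_ulong(active) - fls_ulong(inactive)) / 2;

		return sp < (inactive >> shift);
	}

	/* Plain case: SP < NR_ACTIVE. */
	return sp < active;
}
```

For example, with 4096 active and 1024 inactive pages the threshold is
1024 >> 2 = 256, so a page refaulting at SP = 255 is re-activated and
one at SP = 256 is not.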

Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:

  memtier test (with 16G ramdisk as swap and 2G cgroup limit):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0700 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test (with 16G ramdisk on /mnt and 4G cgroup limit):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

  Pgbench set up using phoronix-test-suite with scale 1000 and
  50 clients on a 5G VM.

  Linux kernel compilation test done with defconfig on a 2G VM.

Before:
memcached: 48157.04 ops/s
read: IOPS=2003k, BW=7823MiB/s (8203MB/s)(2292GiB/300001msec)
pgbench: 5845 qps
build-linux: 247.063

After:
memcached: 49144.55 ops/s
read: IOPS=2005k, BW=7832MiB/s (8212MB/s)(2294GiB/300002msec)
pgbench: 5832 qps
build-linux: 247.302

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   2 -
 mm/swap.c            |   1 -
 mm/vmscan.c          |   2 -
 mm/workingset.c      | 217 +++++++++++++++++++++----------------------
 4 files changed, 108 insertions(+), 114 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..43e48023c4c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -350,10 +350,8 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
 
 /* linux/mm/workingset.c */
 bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);
 
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
 		else
 			__lru_cache_activate_folio(folio);
 		folio_clear_referenced(folio);
-		workingset_activation(folio);
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..e7906f7fdc77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
-		if (folio_test_active(folio))
-			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..c0dea2c05f55 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,9 +180,10 @@
  */
 
 #define WORKINGSET_SHIFT 1
-#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
+#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
+#define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -226,8 +227,105 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-#ifdef CONFIG_LRU_GEN
+/*
+ * Get the distance reading at eviction time.
+ */
+static inline unsigned long lru_eviction(struct lruvec *lruvec,
+					    int bits, int bucket_order)
+{
+	unsigned long eviction = atomic_long_read(&lruvec->nonresident_age);
+
+	eviction >>= bucket_order;
+	eviction &= ~0UL >> (BITS_PER_LONG - bits);
+
+	return eviction;
+}
+
+/*
+ * Calculate and test refault distance
+ */
+static bool lru_refault(struct mem_cgroup *memcg,
+			struct lruvec *lruvec,
+			unsigned long eviction,
+			int bits, int bucket_order)
+{
+	unsigned long refault, distance;
+	unsigned long active, inactive;
+
+	eviction <<= bucket_order;
+	refault = atomic_long_read(&lruvec->nonresident_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across nonresident_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
+	 */
+	distance = (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
+
+	active = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+	inactive = lruvec_page_state(lruvec, NR_INACTIVE_FILE);
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		active += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
+		inactive += lruvec_page_state(lruvec, NR_INACTIVE_ANON);
+	}
+
+	/*
+	 * When there are already enough active pages, be less aggressive
+	 * on activating pages, challenge already established workingset with
+	 * one time refaulted page may not be a good idea, especially as
+	 * the gap between active workingset and inactive queue grows larger.
+	 */
+	if (active > inactive)
+		return distance < inactive >> (1 + (fls_long(active) - fls_long(inactive)) / 2);
+
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the workingset. Whether
+	 * workingset competition needs to consider anon or not depends
+	 * on having free swap space.
+	 */
+	return distance < active;
+}
+
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+static void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
+}
 
+#ifdef CONFIG_LRU_GEN
 static void *lru_gen_eviction(struct folio *folio)
 {
 	int hist;
@@ -342,34 +440,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 #endif /* CONFIG_LRU_GEN */
 
-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
-	/*
-	 * Reclaiming a cgroup means reclaiming all its children in a
-	 * round-robin fashion. That means that each cgroup has an LRU
-	 * order that is composed of the LRU orders of its child
-	 * cgroups; and every page has an LRU position not just in the
-	 * cgroup that owns it, but in all of that group's ancestors.
-	 *
-	 * So when the physical inactive list of a leaf cgroup ages,
-	 * the virtual inactive lists of all its parents, including
-	 * the root cgroup's, age as well.
-	 */
-	do {
-		atomic_long_add(nr_pages, &lruvec->nonresident_age);
-	} while ((lruvec = parent_lruvec(lruvec)));
-}
-
 /**
  * workingset_eviction - note the eviction of a folio from memory
  * @target_memcg: the cgroup that is causing the reclaim
@@ -396,11 +466,11 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->nonresident_age);
-	eviction >>= bucket_order;
+
+	eviction = lru_eviction(lruvec, EVICTION_BITS, bucket_order);
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-				folio_test_workingset(folio));
+			   folio_test_workingset(folio));
 }
 
 /**
@@ -418,9 +488,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
-	unsigned long refault_distance;
-	unsigned long workingset_size;
-	unsigned long refault;
 	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
@@ -429,7 +496,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
 
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -450,50 +516,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
-
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
-	/*
-	 * Calculate the refault distance
-	 *
-	 * The unsigned subtraction here gives an accurate distance
-	 * across nonresident_age overflows in most cases. There is a
-	 * special case: usually, shadow entries have a short lifetime
-	 * and are either refaulted or reclaimed along with the inode
-	 * before they get too old.  But it is not impossible for the
-	 * nonresident_age to lap a shadow entry in the field, which
-	 * can then result in a false small refault distance, leading
-	 * to a false activation should this old entry actually
-	 * refault again.  However, earlier kernels used to deactivate
-	 * unconditionally with *every* reclaim invocation for the
-	 * longest time, so the occasional inappropriate activation
-	 * leading to pressure on the active list is not a problem.
-	 */
-	refault_distance = (refault - eviction) & EVICTION_MASK;
-
-	/*
-	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the workingset. Whether
-	 * workingset competition needs to consider anon or not depends
-	 * on having free swap space.
-	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
-	if (!file) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_FILE);
-	}
-	if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
-		if (file) {
-			workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_ANON);
-		}
-	}
-
-	return refault_distance <= workingset_size;
+	return lru_refault(eviction_memcg, eviction_lruvec, eviction,
+			      EVICTION_BITS, bucket_order);
 }
 
 /**
@@ -543,7 +569,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 		goto out;
 
 	folio_set_active(folio);
-	workingset_age_nonresident(lruvec, nr);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
 
 	/* Folio was active prior to eviction */
@@ -560,30 +585,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	rcu_read_unlock();
 }
 
-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	rcu_read_lock();
-	/*
-	 * Filter non-memcg pages here, e.g. unmap can call
-	 * mark_page_accessed() on VDSO pages.
-	 *
-	 * XXX: See workingset_refault() - this should return
-	 * root_mem_cgroup even for !CONFIG_MEMCG.
-	 */
-	memcg = folio_memcg_rcu(folio);
-	if (!mem_cgroup_disabled() && !memcg)
-		goto out;
-	workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
-	rcu_read_unlock();
-}
-
 /*
  * Shadow entries reflect the share of the working set that does not
  * fit into memory, so their number depends on the access pattern of
@@ -777,7 +778,6 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
-	unsigned int timestamp_bits;
 	unsigned int max_order;
 	int ret;
 
@@ -789,12 +789,11 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
 	max_order = fls_long(totalram_pages() - 1);
-	if (max_order > timestamp_bits)
-		bucket_order = max_order - timestamp_bits;
+	if (max_order > EVICTION_BITS)
+		bucket_order = max_order - EVICTION_BITS;
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
-	       timestamp_bits, max_order, bucket_order);
+	       EVICTION_BITS, max_order, bucket_order);
 
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)
-- 
2.41.0




* [RFC PATCH 2/4] workingset: simplify lru_gen_test_recent
  2023-07-25 18:57 [RFC PATCH 0/4] Refault distance checking for MGLRU Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 1/4] workingset: simplify and use a more intuitive model Kairui Song
@ 2023-07-25 18:57 ` Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 3/4] lru_gen: convert avg_total and avg_refaulted to atomic Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 4/4] workingset, lru_gen: apply refault-distance based re-activation Kairui Song
  3 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2023-07-25 18:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
	Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
	Suren Baghdasaryan, T . J . Mercier, Kairui Song

From: Kairui Song <kasong@tencent.com>

Simplify the code and move some common paths into the caller, in
preparation for the following commits.
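
After this change lru_gen_test_recent() is reduced to a single
comparison between the unpacked token and min_seq. A userspace sketch
of that comparison follows. The widths are passed as parameters here
purely for illustration (in the kernel they come from LRU_REFS_WIDTH
and EVICTION_MASK), and token_is_recent() is a hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * The token stores the generation sequence above refs_width low bits.
 * It is "recent" if that sequence equals min_seq, truncated to the
 * bits that fit into the token.
 * Requires 1 <= eviction_bits <= BITS_PER_ULONG and
 * refs_width < eviction_bits.
 */
static bool token_is_recent(unsigned long token, unsigned long min_seq,
			    unsigned int refs_width,
			    unsigned int eviction_bits)
{
	unsigned long eviction_mask =
		~0UL >> (BITS_PER_ULONG - eviction_bits);

	return (token >> refs_width) ==
	       (min_seq & (eviction_mask >> refs_width));
}
```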

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/workingset.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index c0dea2c05f55..126f1fec41ed 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -357,42 +357,38 @@ static void *lru_gen_eviction(struct folio *folio)
  * Tests if the shadow entry is for a folio that was recently evicted.
  * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
  */
-static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
-				unsigned long *token, bool *workingset)
+static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
+				unsigned long token)
 {
-	int memcg_id;
 	unsigned long min_seq;
-	struct mem_cgroup *memcg;
-	struct pglist_data *pgdat;
 
-	unpack_shadow(shadow, &memcg_id, &pgdat, token, workingset);
-
-	memcg = mem_cgroup_from_id(memcg_id);
-	*lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
-	min_seq = READ_ONCE((*lruvec)->lrugen.min_seq[file]);
-	return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+	min_seq = READ_ONCE(lruvec->lrugen.min_seq[file]);
+	return (token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
 }
 
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
+	int memcgid;
 	bool recent;
-	int hist, tier, refs;
 	bool workingset;
 	unsigned long token;
+	int hist, tier, refs;
 	struct lruvec *lruvec;
+	struct pglist_data *pgdat;
 	struct lru_gen_folio *lrugen;
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 
 	rcu_read_lock();
 
-	recent = lru_gen_test_recent(shadow, type, &lruvec, &token, &workingset);
+	unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
+	lruvec = mem_cgroup_lruvec(mem_cgroup_from_id(memcgid), pgdat);
 	if (lruvec != folio_lruvec(folio))
 		goto unlock;
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
 
+	recent = lru_gen_test_recent(lruvec, type, token);
 	if (!recent)
 		goto unlock;
 
@@ -492,9 +488,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
-	if (lru_gen_enabled())
-		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
-
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 
 	/*
@@ -518,6 +511,9 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		return false;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 
+	if (lru_gen_enabled())
+		return lru_gen_test_recent(eviction_lruvec, file, eviction);
+
 	return lru_refault(eviction_memcg, eviction_lruvec, eviction,
 			      EVICTION_BITS, bucket_order);
 }
-- 
2.41.0




* [RFC PATCH 3/4] lru_gen: convert avg_total and avg_refaulted to atomic
  2023-07-25 18:57 [RFC PATCH 0/4] Refault distance checking for MGLRU Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 1/4] workingset: simplify and use a more intuitive model Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 2/4] workingset: simplify lru_gen_test_recent Kairui Song
@ 2023-07-25 18:57 ` Kairui Song
  2023-07-25 18:57 ` [RFC PATCH 4/4] workingset, lru_gen: apply refault-distance based re-activation Kairui Song
  3 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2023-07-25 18:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
	Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
	Suren Baghdasaryan, T . J . Mercier, Kairui Song

From: Kairui Song <kasong@tencent.com>

No functional change; this prepares for a later patch.
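
As an illustration of the carryover step after the conversion, here is
a userspace sketch using C11 atomics instead of the kernel's
atomic_long_t helpers. carry_over() is a hypothetical name, and the
sketch assumes a single updater at a time, since the load/store pair is
not one atomic read-modify-write.

```c
#include <stdatomic.h>

/*
 * Fold the current period's counter into the moving average:
 * avg = (avg + current) / 2.
 * Each access is atomic on its own, so concurrent readers never see a
 * torn value; the update as a whole still assumes one writer.
 */
static void carry_over(atomic_long *avg, atomic_long *current)
{
	long sum = atomic_load(avg) + atomic_load(current);

	atomic_store(avg, sum / 2);
}
```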

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/mmzone.h |  4 ++--
 mm/vmscan.c            | 16 ++++++++--------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5e50b78d58ea..4ab6bedd3c5b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -425,9 +425,9 @@ struct lru_gen_folio {
 	/* the multi-gen LRU sizes, eventually consistent */
 	long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the exponential moving average of refaulted */
-	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
 	/* the exponential moving average of evicted+protected */
-	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t avg_total[ANON_AND_FILE][MAX_NR_TIERS];
 	/* the first tier doesn't need protection, hence the minus one */
 	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
 	/* can be modified without holding the LRU lock */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e7906f7fdc77..d34817795c70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3705,9 +3705,9 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
-	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+	pos->refaulted = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
 			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-	pos->total = lrugen->avg_total[type][tier] +
+	pos->total = atomic_long_read(&lrugen->avg_total[type][tier]) +
 		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
 	if (tier)
 		pos->total += lrugen->protected[hist][type][tier - 1];
@@ -3732,15 +3732,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
 		if (carryover) {
 			unsigned long sum;
 
-			sum = lrugen->avg_refaulted[type][tier] +
+			sum = atomic_long_read(&lrugen->avg_refaulted[type][tier]) +
 			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+			atomic_long_set(&lrugen->avg_refaulted[type][tier], sum / 2);
 
-			sum = lrugen->avg_total[type][tier] +
+			sum = atomic_long_read(&lrugen->avg_total[type][tier]) +
 			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
 			if (tier)
 				sum += lrugen->protected[hist][type][tier - 1];
-			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+			atomic_long_set(&lrugen->avg_total[type][tier], sum / 2);
 		}
 
 		if (clear) {
@@ -5869,8 +5869,8 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
 
 			if (seq == max_seq) {
 				s = "RT ";
-				n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
-				n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+				n[0] = atomic_long_read(&lrugen->avg_refaulted[type][tier]);
+				n[1] = atomic_long_read(&lrugen->avg_total[type][tier]);
 			} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
 				s = "rep";
 				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
-- 
2.41.0




* [RFC PATCH 4/4] workingset, lru_gen: apply refault-distance based re-activation
  2023-07-25 18:57 [RFC PATCH 0/4] Refault distance checking for MGLRU Kairui Song
                   ` (2 preceding siblings ...)
  2023-07-25 18:57 ` [RFC PATCH 3/4] lru_gen: convert avg_total and avg_refaulted to atomic Kairui Song
@ 2023-07-25 18:57 ` Kairui Song
  3 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2023-07-25 18:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
	Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
	Suren Baghdasaryan, T . J . Mercier, Kairui Song

From: Kairui Song <kasong@tencent.com>

I noticed MGLRU not working very well on certain workloads, which was
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
the LRU starts from the right) of file pages is also larger than total
memory. All file pages are stuck on the oldest generation and keep
getting read in and then evicted in turn. Despite anon pages being
idle, they never get aged. The PID controller doesn't kick in until
there are some minor access pattern changes, and file pages are not
promoted or reused.

Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.

So apply it for MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as
the active LRU.
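
For readers unfamiliar with the idea, the refault-distance check can be
sketched roughly as below. This is a minimal userspace model of the
classic check described in the header comment of mm/workingset.c, not
the reworked model from patch 1/4. The struct and the model_* function
names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct lru_model {
	unsigned long nonresident_age;	/* clock advanced on every eviction */
	unsigned long workingset_size;	/* pages that could be kept resident */
};

/* Return the clock value to store in the shadow entry, then age the clock. */
static unsigned long model_evict(struct lru_model *lru, unsigned long nr_pages)
{
	unsigned long stamp = lru->nonresident_age;

	lru->nonresident_age += nr_pages;
	return stamp;
}

/* Evictions seen since the page left; unsigned math tolerates wraparound. */
static unsigned long model_refault_distance(const struct lru_model *lru,
					    unsigned long stamp)
{
	return lru->nonresident_age - stamp;
}

/* Re-activate only if the page would have fit in the resident set. */
static bool model_should_activate(const struct lru_model *lru,
				  unsigned long stamp)
{
	return model_refault_distance(lru, stamp) <= lru->workingset_size;
}
```

A page whose distance is within workingset_size is worth holding in
memory; one with a larger distance would have been evicted anyway.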

Some minor tinkering is done to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:

- If a tier-0 page has a qualified refault distance, just promote
  it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
  active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
  refault distance, and increase the PID-controlled refault rate of
  the updated tier.
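
The rules above can be sketched as a small userspace model of the
decision made in lru_gen_refault() in this patch. MODEL_REFS_WIDTH is
an assumed value for LRU_REFS_WIDTH, model_tier_from_refs() is
modelled on lru_tier_from_refs() (assumed to be ceil(log2(refs + 1))),
and the struct and function names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_REFS_WIDTH	2	/* assumed value of LRU_REFS_WIDTH */
#define MODEL_MAX_REFS		(1 << MODEL_REFS_WIDTH)

/* Hypothetical summary of what the refault path decides for one folio. */
struct refault_outcome {
	bool ignored;		/* neither recent nor a qualified refault */
	bool set_active;	/* folio goes to the youngest gen */
	bool hist_counter;	/* counted in refaulted[hist], else in avg_* */
	int tier;		/* tier recorded at eviction time */
	int refault_tier;	/* tier after the refault promotion */
};

/* ceil(log2(refs + 1)); an assumption about lru_tier_from_refs(). */
static int model_tier_from_refs(int refs)
{
	int tier = 0;

	while ((1 << tier) < refs + 1)
		tier++;
	return tier;
}

/* refs is the value decoded from the shadow entry plus the workingset bit. */
static struct refault_outcome model_refault(int refs, bool recent, bool refault)
{
	struct refault_outcome out = { false, false, false, 0, 0 };

	if (!recent && !refault) {
		out.ignored = true;
		return out;
	}

	out.tier = model_tier_from_refs(refs);
	out.refault_tier = out.tier;

	if (refault) {
		if (refs)			/* tier >= 1: mark active */
			out.set_active = true;
		if (refs != MODEL_MAX_REFS)	/* one more reference */
			out.refault_tier = model_tier_from_refs(refs + 1);
	}

	out.hist_counter = recent && out.refault_tier == out.tier;
	return out;
}
```

Under these assumptions a tier-0 page with a qualified refault distance
ends up in tier 1 without being marked active, while a page with
refs >= 1 is marked active.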

The following benchmark shows a major improvement.
To simulate the workload, I set up a 3-replica mongodb cluster using
docker, each replica in a standalone cgroup, set to use 5G of cache and
10G of oplog, on a 32G VM. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.

Before the patch (with 10G swap, the result won't change whether
swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239

After the patch (with 10G swap on same disk, similar result using ZRAM):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577

The throughput is about 5x better than before (2575 vs 503 executed
transactions), and the idle anon pages now get swapped out as expected.
Testing with lower stress also shows an improvement.
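
As a quick check of the numbers above, the ratios can be recomputed
with the small sketch below. The constants are copied from the tpcc.py
and /proc/vmstat output in this message; the helper names are
hypothetical, and it assumes the two vmstat samples cover comparable
runs.

```c
/* Figures copied from the tpcc.py and /proc/vmstat output above. */
static const double executed_before = 503.0;
static const double executed_after = 2575.0;
static const double refault_file_before = 23856735.0;
static const double activate_file_before = 23845737.0;
static const double refault_file_after = 23001863.0;
static const double activate_file_after = 6718032.0;

/* Ratio of executed transactions over the same 900 second duration. */
static double speedup(double after, double before)
{
	return after / before;
}

/* Share of file refaults that were also counted as activations. */
static double activation_percent(double activated, double refaulted)
{
	return 100.0 * activated / refaulted;
}
```

With these inputs the speedup comes out at roughly 5.1, and the share of
file refaults that were activated falls from nearly 100% to roughly 29%.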

I also ran benchmarks with memtier/memcached and fio, using a setup
similar to the one in commit ac35a4902374 but scaled down to fit in
my test environment:

  memcached test (with 16G ramdisk as swap and 2G cgroup limit):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
    -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test (with 16G ramdisk on /mnt and 4G cgroup limit):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

Before this patch:
memcached read:
            Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
  Best      52832.79         0.00         0.00         1.82042         1.70300         4.54300         6.27100    105641.69
  Worst     46613.56         0.00         0.00         2.05686         1.77500         7.80700        11.83900     93206.05
  Avg (6x)  51024.85         0.00         0.00         1.88506         1.73500         5.43900         9.47100    102026.64
fio:
  read: IOPS=2211k, BW=8637MiB/s (9056MB/s)(2530GiB/300001msec)

After this patch:
memcached read:
            Ops/sec     Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
  Best      54218.92         1.76930         1.65500         4.41500         6.27100    108413.34
  Worst     47640.13         2.01495         1.74300         7.64700        11.64700     95258.72
  Avg (6x)  51408.33         1.86988         1.71900         5.43900         9.34300    102793.42
fio:
  read: IOPS=2166k, BW=8462MiB/s (8873MB/s)(2479GiB/300001msec)

memcached looks OK, but there is a performance drop of about 2% in the
fio test (2211k vs 2166k IOPS). After some profiling, this is mainly
caused by the extra atomic operations and new functions. There seems
to be no LRU accuracy drop.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/workingset.c | 74 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 23 deletions(-)
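
For reviewers, the token layout used by the diff below can be sketched
in userspace as follows. MODEL_REFS_WIDTH and MODEL_EVICTION_BITS are
assumed stand-ins for LRU_REFS_WIDTH and LRU_GEN_EVICTION_BITS,
model_stamp() is a guess at what lru_eviction() from patch 1/4 returns,
and the memcg and node bits added by pack_shadow() are left out. All
model_* names are hypothetical.

```c
#define MODEL_REFS_WIDTH	2	/* assumed LRU_REFS_WIDTH */
#define MODEL_EVICTION_BITS	20	/* assumed LRU_GEN_EVICTION_BITS */
#define MODEL_EVICTION_MASK	((1UL << MODEL_EVICTION_BITS) - 1)

/* Assumed behaviour of lru_eviction(): bucketed, truncated clock value. */
static unsigned long model_stamp(unsigned long age, unsigned int bucket_order)
{
	return (age >> bucket_order) & MODEL_EVICTION_MASK;
}

/* Layout, high to low bits: min_seq | refs - 1 | eviction timestamp. */
static unsigned long model_pack(unsigned long min_seq, int refs,
				unsigned long age, unsigned int bucket_order)
{
	unsigned long token;

	token = (min_seq << MODEL_REFS_WIDTH) |
		(unsigned long)(refs > 0 ? refs - 1 : 0);
	token <<= MODEL_EVICTION_BITS;
	token |= model_stamp(age, bucket_order);
	return token;
}

static unsigned long model_token_seq(unsigned long token)
{
	return token >> (MODEL_EVICTION_BITS + MODEL_REFS_WIDTH);
}

static int model_token_refs(unsigned long token)
{
	return (int)((token >> MODEL_EVICTION_BITS) &
		     ((1UL << MODEL_REFS_WIDTH) - 1));
}

static unsigned long model_token_stamp(unsigned long token)
{
	return token & MODEL_EVICTION_MASK;
}

/* Mirrors the workingset_init() hunk: drop low bits that do not fit. */
static unsigned int model_bucket_order(unsigned int max_order)
{
	return max_order > MODEL_EVICTION_BITS ?
	       max_order - MODEL_EVICTION_BITS : 0;
}
```

Sharing the token with the sequence number and refs leaves fewer bits
for the timestamp, which is why a separate lru_gen_bucket_order is
computed for MGLRU.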

diff --git a/mm/workingset.c b/mm/workingset.c
index 126f1fec41ed..40cb0df980f7 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -185,6 +185,7 @@
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
+#define LRU_GEN_EVICTION_BITS	(EVICTION_BITS - LRU_REFS_WIDTH - LRU_GEN_WIDTH)
 
 /*
  * Eviction timestamps need to be able to cover the full range of
@@ -195,6 +196,7 @@
  * evictions into coarser buckets by shaving off lower timestamp bits.
  */
 static unsigned int bucket_order __read_mostly;
+static unsigned int lru_gen_bucket_order __read_mostly;
 
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
@@ -345,10 +347,14 @@ static void *lru_gen_eviction(struct folio *folio)
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
+
 	token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
+	token <<= LRU_GEN_EVICTION_BITS;
+	token |= lru_eviction(lruvec, LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
 
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 
 	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
 }
@@ -363,44 +369,55 @@ static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
 	unsigned long min_seq;
 
 	min_seq = READ_ONCE(lruvec->lrugen.min_seq[file]);
+	token >>= LRU_GEN_EVICTION_BITS;
 	return (token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
 }
 
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	int memcgid;
-	bool recent;
+	bool refault;
 	bool workingset;
 	unsigned long token;
+	bool recent = false;
+	int refault_tier = 0;
 	int hist, tier, refs;
 	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat;
 	struct lru_gen_folio *lrugen;
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 
-	rcu_read_lock();
-
 	unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
-	lruvec = mem_cgroup_lruvec(mem_cgroup_from_id(memcgid), pgdat);
-	if (lruvec != folio_lruvec(folio))
-		goto unlock;
+	memcg = mem_cgroup_from_id(memcgid);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	/* memcg can be NULL, go through lruvec */
+	memcg = lruvec_memcg(lruvec);
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
-
-	recent = lru_gen_test_recent(lruvec, type, token);
-	if (!recent)
-		goto unlock;
+	refault = lru_refault(memcg, lruvec, token, LRU_GEN_EVICTION_BITS,
+			      lru_gen_bucket_order);
+	if (lruvec == folio_lruvec(folio))
+		recent = lru_gen_test_recent(lruvec, type, token);
+	if (!recent && !refault)
+		return;
 
 	lrugen = &lruvec->lrugen;
-
 	hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
 	/* see the comment in folio_lru_refs() */
+	token >>= LRU_GEN_EVICTION_BITS;
 	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
 	tier = lru_tier_from_refs(refs);
-
-	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+	refault_tier = tier;
+
+	if (refault) {
+		if (refs)
+			folio_set_active(folio);
+		if (refs != BIT(LRU_REFS_WIDTH))
+			refault_tier = lru_tier_from_refs(refs + 1);
+		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+	}
 
 	/*
 	 * Count the following two cases as stalls:
@@ -409,12 +426,17 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	 * 2. For pages accessed multiple times through file descriptors,
 	 *    numbers of accesses might have been out of the range.
 	 */
-	if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
+	if (refault || lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
 		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
 	}
-unlock:
-	rcu_read_unlock();
+
+	if (recent && refault_tier == tier) {
+		atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	} else {
+		atomic_long_add(delta, &lrugen->avg_total[type][refault_tier]);
+		atomic_long_add(delta, &lrugen->avg_refaulted[type][refault_tier]);
+	}
 }
 
 #else /* !CONFIG_LRU_GEN */
@@ -536,16 +558,15 @@ void workingset_refault(struct folio *folio, void *shadow)
 	bool workingset;
 	long nr;
 
-	if (lru_gen_enabled()) {
-		lru_gen_refault(folio, shadow);
-		return;
-	}
-
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
 	mem_cgroup_flush_stats_ratelimited();
-
 	rcu_read_lock();
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		goto out;
+	}
+
 	/*
 	 * The activation decision for this folio is made at the level
 	 * where the eviction occurred, as that is where the LRU order
@@ -791,6 +812,13 @@ static int __init workingset_init(void)
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
 	       EVICTION_BITS, max_order, bucket_order);
 
+#ifdef CONFIG_LRU_GEN
+	if (max_order > LRU_GEN_EVICTION_BITS)
+		lru_gen_bucket_order = max_order - LRU_GEN_EVICTION_BITS;
+	pr_info("workingset: lru_gen_timestamp_bits=%d lru_gen_bucket_order=%u\n",
+		LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
+#endif
+
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)
 		goto err;
-- 
2.41.0


