* [PATCH v3 0/3] mm, lru_gen: batch update pages when aging
@ 2024-01-23 18:45 Kairui Song
  2024-01-23 18:45 ` [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU Kairui Song
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-23 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Wei Xu, Chris Li, Matthew Wilcox,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Link V1:
https://lore.kernel.org/linux-mm/20231222102255.56993-1-ryncsn@gmail.com/

Link V2:
https://lore.kernel.org/linux-mm/20240111183321.19984-1-ryncsn@gmail.com/

Currently, when MGLRU ages, it moves pages one by one and updates the mm
counters page by page. This is correct, but the overhead can be reduced
by batching these operations.

I did a rebase and ran more tests to check for regressions or improvements.
Everything looks OK except the memtier test, where I reduced the repeat
count (-x) compared to V1 and V2 and simply ran the test more times instead.
It now shows a minor regression; if real, it is caused by the prefetch
patch. However, the noise (standard deviation) is rather high, so I am not
sure that test is credible. The test results for each individual patch are
in the commit messages.

Test 1: Ramdisk fio ro test in a 4G memcg on an EPYC 7K62:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before this series:
bw (  MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
iops        : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488

After this series (+7.1%):
bw (  MiB/s): min= 8359, max= 9796, per=100.00%, avg=9367.29, stdev=15.75, samples=11488
iops        : min=2140113, max=2507928, avg=2398024.65, stdev=4033.07, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this series:
 READ: 6622.0 MiB/s, Stdev: 22.090722
WRITE: 1256.3 MiB/s, Stdev: 5.249339

After this series (+5.4%, +3.9%):
 READ: 6981.0 MiB/s, Stdev: 15.556349
WRITE: 1305.7 MiB/s, Stdev: 2.357023

Test 3: 30m of MySQL test in 6G memcg with swap (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS
  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this series
Avg: 134743.714545 qps. Stdev: 582.242189

After this series (+0.3%):
Avg: 135099.210000 qps. Stdev: 351.488863

Test 4: Build linux kernel in 2G memcg with make -j48 with swap
        (for memory stress, 18 times):

Before this series:
Avg: 1456.768899 s. Stdev: 20.106973

After this series (-0.5%):
Avg: 1464.178154 s. Stdev: 17.992974

Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this series:
Avg: 50317.984000 Ops/sec. Stdev: 2568.965458

After this series (-2.7%):
Avg: 48959.374118 Ops/sec. Stdev: 3488.559744

Updates from V2:
- Simplify patch 2/3 to track only one gen's info in the batch struct, as
  Wei Xu suggested the batch struct may use too much stack.
- Add more tests, and test each individual patch, as requested by Wei Xu.
- Fix a typo pointed out by Andrew Morton.

Update from V1:
- Fix function argument type as suggested by Chris Li.

Kairui Song (3):
  mm, lru_gen: try to prefetch next page when scanning LRU
  mm, lru_gen: batch update counters on aging
  mm, lru_gen: move pages in bulk when aging

 mm/vmscan.c | 145 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 20 deletions(-)

-- 
2.43.0



* [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-23 18:45 [PATCH v3 0/3] mm, lru_gen: batch update pages when aging Kairui Song
@ 2024-01-23 18:45 ` Kairui Song
  2024-01-25  7:32   ` Chris Li
  2024-01-23 18:45 ` [PATCH v3 2/3] mm, lru_gen: batch update counters on aging Kairui Song
  2024-01-23 18:45 ` [PATCH v3 3/3] mm, lru_gen: move pages in bulk when aging Kairui Song
  2 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2024-01-23 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Wei Xu, Chris Li, Matthew Wilcox,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Prefetching for the inactive/active LRU has long existed; apply the same
optimization to MGLRU.

Test 1: Ramdisk fio ro test in a 4G memcg on an EPYC 7K62:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before this patch:
bw (  MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
iops        : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488

After this patch (+7.2%):
bw (  MiB/s): min= 8360, max= 9771, per=100.00%, avg=9381.31, stdev=15.67, samples=11488
iops        : min=2140296, max=2501385, avg=2401613.91, stdev=4010.41, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this patch:
 READ: 6622.0 MiB/s. Stdev: 22.090722
WRITE: 1256.3 MiB/s. Stdev: 5.249339

After this patch (+4.6%, +3.3%):
 READ: 6926.6 MiB/s, Stdev: 37.950260
WRITE: 1297.3 MiB/s, Stdev: 7.408704

Test 3: 30m of MySQL test in 6G memcg (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS

  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this patch
Avg: 134743.714545 qps. Stdev: 582.242189

After this patch (+0.2%):
Avg: 135005.779091 qps. Stdev: 295.299027

Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
        (for memory stress, 18 times):

Before this patch:
Avg: 1456.768899 s. Stdev: 20.106973

After this patch (+0.0%):
Avg: 1455.659254 s. Stdev: 15.274481

Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this patch:
Avg: 50317.984000 Ops/sec. Stdev: 2568.965458

After this patch (-5.7%):
Avg: 47691.343500 Ops/sec. Stdev: 3925.772473

It seems prefetching is helpful in most cases, but the memtier test is
either hitting a case where prefetching causes a higher cache miss rate,
or it's just too noisy (high stdev).

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..03631cedb3ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 	/* prevent cold/hot inversion if force_scan is true */
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		struct list_head *head = &lrugen->folios[old_gen][type][zone];
+		struct folio *prev = NULL;
 
-		while (!list_empty(head)) {
-			struct folio *folio = lru_to_folio(head);
+		if (!list_empty(head))
+			prev = lru_to_folio(head);
+
+		while (prev) {
+			struct folio *folio = prev;
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
 
+			if (unlikely(list_is_first(&folio->lru, head))) {
+				prev = NULL;
+			} else {
+				prev = lru_to_folio(&folio->lru);
+				prefetchw(&prev->flags);
+			}
+
 			new_gen = folio_inc_gen(lruvec, folio, false);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
@@ -4341,11 +4352,15 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 	for (i = MAX_NR_ZONES; i > 0; i--) {
 		LIST_HEAD(moved);
 		int skipped_zone = 0;
+		struct folio *prev = NULL;
 		int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
 		struct list_head *head = &lrugen->folios[gen][type][zone];
 
-		while (!list_empty(head)) {
-			struct folio *folio = lru_to_folio(head);
+		if (!list_empty(head))
+			prev = lru_to_folio(head);
+
+		while (prev) {
+			struct folio *folio = prev;
 			int delta = folio_nr_pages(folio);
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
@@ -4355,6 +4370,13 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 
 			scanned += delta;
 
+			if (unlikely(list_is_first(&folio->lru, head))) {
+				prev = NULL;
+			} else {
+				prev = lru_to_folio(&folio->lru);
+				prefetchw(&prev->flags);
+			}
+
 			if (sort_folio(lruvec, folio, sc, tier))
 				sorted += delta;
 			else if (isolate_folio(lruvec, folio, sc)) {
-- 
2.43.0



* [PATCH v3 2/3] mm, lru_gen: batch update counters on aging
  2024-01-23 18:45 [PATCH v3 0/3] mm, lru_gen: batch update pages when aging Kairui Song
  2024-01-23 18:45 ` [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU Kairui Song
@ 2024-01-23 18:45 ` Kairui Song
  2024-01-23 18:45 ` [PATCH v3 3/3] mm, lru_gen: move pages in bulk when aging Kairui Song
  2 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-23 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Wei Xu, Chris Li, Matthew Wilcox,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

When lru_gen is aging, it updates mm counters page by page, which causes
higher overhead if aging happens frequently or a lot of pages in one
generation are being moved. Optimize this by doing the counter updates
in batches.

Although most __mod_*_state helpers have their own caches, the overhead
is still observable.
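
The batching pattern boils down to the following minimal standalone sketch
(the struct and helper names here are illustrative only, not the actual
kernel API; the real change is in the diff below):

  /* Accumulate per-folio deltas privately, then flush them in one go. */
  struct inc_batch {
          long delta;
  };

  static void batch_account(struct inc_batch *batch, long nr_pages)
  {
          /* Cheap private accumulation; no shared counter is touched yet. */
          batch->delta += nr_pages;
  }

  static void batch_flush(struct inc_batch *batch, long *old_gen_ctr, long *new_gen_ctr)
  {
          if (!batch->delta)
                  return;
          /* One pair of shared-counter updates per LRU list walk. */
          *old_gen_ctr -= batch->delta;
          *new_gen_ctr += batch->delta;
          batch->delta = 0;
  }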

Test 1: Ramdisk fio test in a 4G memcg on an EPYC 7K62 with:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before this patch:
bw (  MiB/s): min= 8360, max= 9771, per=100.00%, avg=9381.31, stdev=15.67, samples=11488
iops        : min=2140296, max=2501385, avg=2401613.91, stdev=4010.41, samples=11488

After this patch (+0.0%):
bw (  MiB/s): min= 8299, max= 9847, per=100.00%, avg=9388.23, stdev=16.25, samples=11488
iops        : min=2124544, max=2521056, avg=2403385.82, stdev=4159.07, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this patch:
 READ: 6926.6 MiB/s, Stdev: 37.950260
WRITE: 1297.3 MiB/s, Stdev: 7.408704

After this patch (+0.7%, +0.4%):
 READ: 6973.3 MiB/s, Stdev: 19.601587
WRITE: 1302.3 MiB/s, Stdev: 4.988877

Test 3: 30m of MySQL test in 6G memcg (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS

  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this patch
Avg: 135005.779091 qps. Stdev: 295.299027

After this patch (+0.2%):
Avg: 135310.868182 qps. Stdev: 379.200942

Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
        (for memory stress, 18 times):

Before this patch:
Average: 1455.659254 s. Stdev: 15.274481

After this patch (-0.8%):
Average: 1467.813023 s. Stdev: 24.232886

Test 5: Memtier test in a 4G cgroup using brd as swap (20 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this patch:
Avg: 47691.343500 Ops/sec. Stdev: 3925.772473

After this patch (+1.7%):
Avg: 48389.282500 Ops/sec. Stdev: 3534.470933

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 68 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 55 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03631cedb3ab..8c701b34d757 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3113,12 +3113,45 @@ static int folio_update_gen(struct folio *folio, int gen)
 	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 }
 
-/* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+/*
+ * When the oldest gen is being reclaimed, protected/unreclaimable pages can
+ * be moved in batches. They usually all land in the same gen (old_gen + 1)
+ * via folio_inc_gen(), so the batch struct is limited to one gen/type/zone
+ * level LRU.
+ * The batch is applied after scanning of one LRU list finishes or is aborted.
+ */
+struct lru_gen_inc_batch {
+	int delta;
+};
+
+static void lru_gen_inc_batch_done(struct lruvec *lruvec, int gen, int type, int zone,
+				   struct lru_gen_inc_batch *batch)
 {
-	int type = folio_is_file_lru(folio);
+	int delta = batch->delta;
+	int new_gen = (gen + 1) % MAX_NR_GENS;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+	enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
+
+	if (!delta)
+		return;
+
+	WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+		   lrugen->nr_pages[gen][type][zone] - delta);
+	WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+		   lrugen->nr_pages[new_gen][type][zone] + delta);
+
+	if (!lru_gen_is_active(lruvec, gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+}
+
+/* protect pages accessed multiple times through file descriptors */
+static int folio_inc_gen(struct folio *folio, int old_gen, bool reclaiming,
+			 struct lru_gen_inc_batch *batch)
+{
+	int new_gen;
+	int delta = folio_nr_pages(folio);
 	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
 
 	VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
@@ -3138,7 +3171,8 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 			new_flags |= BIT(PG_reclaim);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 
-	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+	/* new_gen is ensured to be old_gen + 1 here, do a batch update */
+	batch->delta += delta;
 
 	return new_gen;
 }
@@ -3672,6 +3706,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 {
 	int zone;
 	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_inc_batch batch = { };
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
@@ -3701,12 +3736,15 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 				prefetchw(&prev->flags);
 			}
 
-			new_gen = folio_inc_gen(lruvec, folio, false);
+			new_gen = folio_inc_gen(folio, old_gen, false, &batch);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
-			if (!--remaining)
+			if (!--remaining) {
+				lru_gen_inc_batch_done(lruvec, old_gen, type, zone, &batch);
 				return false;
+			}
 		}
+		lru_gen_inc_batch_done(lruvec, old_gen, type, zone, &batch);
 	}
 done:
 	reset_ctrl_pos(lruvec, type, true);
@@ -4226,7 +4264,7 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
  ******************************************************************************/
 
 static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc,
-		       int tier_idx)
+		       int tier_idx, struct lru_gen_inc_batch *batch)
 {
 	bool success;
 	int gen = folio_lru_gen(folio);
@@ -4236,6 +4274,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	int refs = folio_lru_refs(folio);
 	int tier = lru_tier_from_refs(refs);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	int old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
 	VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
 
@@ -4259,7 +4298,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* promoted */
-	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+	if (gen != old_gen) {
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
 	}
@@ -4268,7 +4307,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(folio, old_gen, false, batch);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 
 		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
@@ -4278,7 +4317,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* ineligible */
 	if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(folio, old_gen, false, batch);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
 	}
@@ -4286,7 +4325,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	/* waiting for writeback */
 	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
 	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
-		gen = folio_inc_gen(lruvec, folio, true);
+		gen = folio_inc_gen(folio, old_gen, true, batch);
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
 	}
@@ -4353,6 +4392,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 		LIST_HEAD(moved);
 		int skipped_zone = 0;
 		struct folio *prev = NULL;
+		struct lru_gen_inc_batch batch = { };
 		int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
 		struct list_head *head = &lrugen->folios[gen][type][zone];
 
@@ -4377,7 +4417,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 				prefetchw(&prev->flags);
 			}
 
-			if (sort_folio(lruvec, folio, sc, tier))
+			if (sort_folio(lruvec, folio, sc, tier, &batch))
 				sorted += delta;
 			else if (isolate_folio(lruvec, folio, sc)) {
 				list_add(&folio->lru, list);
@@ -4391,6 +4431,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 				break;
 		}
 
+		lru_gen_inc_batch_done(lruvec, gen, type, zone, &batch);
+
 		if (skipped_zone) {
 			list_splice(&moved, head);
 			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped_zone);
-- 
2.43.0



* [PATCH v3 3/3] mm, lru_gen: move pages in bulk when aging
  2024-01-23 18:45 [PATCH v3 0/3] mm, lru_gen: batch update pages when aging Kairui Song
  2024-01-23 18:45 ` [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU Kairui Song
  2024-01-23 18:45 ` [PATCH v3 2/3] mm, lru_gen: batch update counters on aging Kairui Song
@ 2024-01-23 18:45 ` Kairui Song
  2 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2024-01-23 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yu Zhao, Wei Xu, Chris Li, Matthew Wilcox,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Another source of aging overhead is page moving. In most cases, pages
are moved to the same gen after folio_inc_gen is called, especially the
protected pages, so it's better to move them in bulk.

This also has a good effect on LRU order. Currently, when MGLRU ages, it
walks the LRU backwards and moves the protected pages to the tail of the
newer gen one by one, which actually reverses the order of pages in the
LRU. Moving them in batches helps keep their order, though only within
a small scope, due to the scan limit of MAX_LRU_BATCH pages.
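
The bulk move remembers the span of consecutively aged folios and splices
it with a single list operation. A simplified sketch of the idea follows;
list_bulk_move_tail() is the existing list.h helper, while the struct and
function names around it are placeholders rather than the actual patch
below:

  /* Track the span [first, last] while walking the source LRU tail to head. */
  struct move_batch {
          struct list_head *first;        /* most recently added entry */
          struct list_head *last;         /* first entry that was added */
  };

  static void move_batch_add(struct move_batch *b, struct list_head *entry)
  {
          if (!b->first)
                  b->last = entry;
          b->first = entry;
  }

  static void move_batch_flush(struct move_batch *b, struct list_head *dst)
  {
          if (!b->first)
                  return;
          /* One splice instead of one list_move_tail() per folio. */
          list_bulk_move_tail(dst, b->first, b->last);
          b->first = NULL;
  }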

After this commit, we can see a slight performance gain (with
CONFIG_DEBUG_LIST=n):

Test 1: Ramdisk fio test in a 4G memcg on an EPYC 7K62:
  fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:0.5 --norandommap \
    --time_based --ramp_time=1m --runtime=6m --group_reporting

Before:
bw (  MiB/s): min= 8299, max= 9847, per=100.00%, avg=9388.23, stdev=16.25, samples=11488
iops        : min=2124544, max=2521056, avg=2403385.82, stdev=4159.07, samples=11488

After (-0.2%):
bw (  MiB/s): min= 8359, max= 9796, per=100.00%, avg=9367.29, stdev=15.75, samples=11488
iops        : min=2140113, max=2507928, avg=2398024.65, stdev=4033.07, samples=11488

Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on an EPYC 7K62 (3 times):
  fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
    --time_based --ramp_time=1m --runtime=30m \
    --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --norandommap \
    --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
    --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7

Before this patch:
 READ: 6973.3 MiB/s, Stdev: 19.601587
WRITE: 1302.3 MiB/s, Stdev: 4.988877

After this patch (+0.1%, +0.3%):
 READ: 6981.0 MiB/s, Stdev: 15.556349
WRITE: 1305.7 MiB/s, Stdev: 2.357023

Test 3: 30m of MySQL test in 6G memcg (12 times):
  echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
    mysql -u USER -h localhost --password=PASS

  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
    --tables=48 --table-size=2000000 --threads=16 --time=1800 run

Before this patch
Avg: 135310.868182 qps. Stdev: 379.200942

After this patch (-0.3%):
Avg: 135099.210000 qps. Stdev: 351.488863

Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
        (for memory stress, 18 times):

Before this patch:
Average: 1467.813023 s. Stdev: 24.232886

After this patch (+0.0%):
Average: 1464.178154 s. Stdev: 17.992974

Test 5: Memtier test in a 4G cgroup using brd as swap (20 times):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 16 -B binary &
  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=16000000 -d 1024 \
    --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3

Before this patch:
Avg: 48389.282500 Ops/sec. Stdev: 3534.470933

After this patch (+1.2%):
Avg: 48959.374118 Ops/sec. Stdev: 3488.559744

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8c701b34d757..373a70801db9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3122,8 +3122,45 @@ static int folio_update_gen(struct folio *folio, int gen)
  */
 struct lru_gen_inc_batch {
 	int delta;
+	struct folio *head, *tail;
 };
 
+static inline void lru_gen_inc_bulk_done(struct lru_gen_folio *lrugen,
+					 int bulk_gen, bool type, int zone,
+					 struct lru_gen_inc_batch *batch)
+{
+	if (!batch->head)
+		return;
+
+	list_bulk_move_tail(&lrugen->folios[bulk_gen][type][zone],
+			    &batch->head->lru,
+			    &batch->tail->lru);
+
+	batch->head = NULL;
+}
+
+/*
+ * When aging, protected pages will go to the tail of the same higher
+ * gen, so they can be moved in batches. Besides reducing overhead, this
+ * also avoids changing their LRU order within a small scope.
+ */
+static inline void lru_gen_try_bulk_move(struct lru_gen_folio *lrugen, struct folio *folio,
+					 int bulk_gen, int new_gen, bool type, int zone,
+					 struct lru_gen_inc_batch *batch)
+{
+	/*
+	 * If the folio is not moving to the bulk_gen, it has raced with
+	 * promotion, so it goes to the head of another LRU instead.
+	 */
+	if (bulk_gen != new_gen)
+		return list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
+
+	if (!batch->head)
+		batch->tail = folio;
+
+	batch->head = folio;
+}
+
 static void lru_gen_inc_batch_done(struct lruvec *lruvec, int gen, int type, int zone,
 				   struct lru_gen_inc_batch *batch)
 {
@@ -3132,6 +3169,8 @@ static void lru_gen_inc_batch_done(struct lruvec *lruvec, int gen, int type, int
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	enum lru_list lru = type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
 
+	lru_gen_inc_bulk_done(lrugen, new_gen, type, zone, batch);
+
 	if (!delta)
 		return;
 
@@ -3709,6 +3748,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 	struct lru_gen_inc_batch batch = { };
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+	int bulk_gen = (old_gen + 1) % MAX_NR_GENS;
 
 	if (type == LRU_GEN_ANON && !can_swap)
 		goto done;
@@ -3737,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 			}
 
 			new_gen = folio_inc_gen(folio, old_gen, false, &batch);
-			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
+			lru_gen_try_bulk_move(lrugen, folio, bulk_gen, new_gen, type, zone, &batch);
 
 			if (!--remaining) {
 				lru_gen_inc_batch_done(lruvec, old_gen, type, zone, &batch);
@@ -4275,6 +4315,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	int tier = lru_tier_from_refs(refs);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+	int bulk_gen = (old_gen + 1) % MAX_NR_GENS;
 
 	VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
 
@@ -4308,7 +4349,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
 		gen = folio_inc_gen(folio, old_gen, false, batch);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		lru_gen_try_bulk_move(lrugen, folio, bulk_gen, gen, type, zone, batch);
 
 		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
 			   lrugen->protected[hist][type][tier - 1] + delta);
@@ -4318,7 +4359,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	/* ineligible */
 	if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
 		gen = folio_inc_gen(folio, old_gen, false, batch);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		lru_gen_try_bulk_move(lrugen, folio, bulk_gen, gen, type, zone, batch);
 		return true;
 	}
 
-- 
2.43.0



* Re: [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-23 18:45 ` [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU Kairui Song
@ 2024-01-25  7:32   ` Chris Li
  2024-01-25 17:51     ` Kairui Song
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Li @ 2024-01-25  7:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Yu Zhao, Wei Xu, Matthew Wilcox, linux-kernel

On Tue, Jan 23, 2024 at 10:46 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Prefetch for inactive/active LRU have been long exiting, apply the same
> optimization for MGLRU.
>
> Test 1: Ramdisk fio ro test in a 4G memcg on a EPYC 7K62:
>   fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
>     --buffered=1 --ioengine=io_uring --iodepth=128 \
>     --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>     --rw=randread --random_distribution=zipf:0.5 --norandommap \
>     --time_based --ramp_time=1m --runtime=6m --group_reporting
>
> Before this patch:
> bw (  MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
> iops        : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488
>
> After this patch (+7.2%):
> bw (  MiB/s): min= 8360, max= 9771, per=100.00%, avg=9381.31, stdev=15.67, samples=11488
> iops        : min=2140296, max=2501385, avg=2401613.91, stdev=4010.41, samples=11488
>
> Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on a EPYC 7K62 (3 times):
>   fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
>     --time_based --ramp_time=1m --runtime=30m \
>     --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
>     --iodepth_batch_complete=32 --norandommap \
>     --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
>     --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7
>
> Before this patch:
>  READ: 6622.0 MiB/s. Stdev: 22.090722
> WRITE: 1256.3 MiB/s. Stdev: 5.249339
>
> After this patch (+4.6%, +3.3%):
>  READ: 6926.6 MiB/s, Stdev: 37.950260
> WRITE: 1297.3 MiB/s, Stdev: 7.408704
>
> Test 3: 30m of MySQL test in 6G memcg (12 times):
>   echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
>     mysql -u USER -h localhost --password=PASS
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua \
>     --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
>     --tables=48 --table-size=2000000 --threads=16 --time=1800 run
>
> Before this patch
> Avg: 134743.714545 qps. Stdev: 582.242189
>
> After this patch (+0.2%):
> Avg: 135005.779091 qps. Stdev: 295.299027
>
> Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
>         (for memory stress, 18 times):
>
> Before this patch:
> Avg: 1456.768899 s. Stdev: 20.106973
>
> After this patch (+0.0%):
> Avg: 1455.659254 s. Stdev: 15.274481
>
> Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket \
>     -a 0766 -t 16 -B binary &
>   memtier_benchmark -S /tmp/memcached.socket \
>     -P memcache_binary -n allkeys \
>     --key-minimum=1 --key-maximum=16000000 -d 1024 \
>     --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3
>
> Before this patch:
> Avg: 50317.984000 Ops/sec. Stdev: 2568.965458
>
> After this patch (-5.7%):
> Avg: 47691.343500 Ops/sec. Stdev: 3925.772473
>
> It seems prefetch is helpful in most cases, but the memtier test is
> either hitting a case where prefetch causes higher cache miss or it's
> just too noisy (high stdev).
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..03631cedb3ab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>         /* prevent cold/hot inversion if force_scan is true */
>         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
> +               struct folio *prev = NULL;
>
> -               while (!list_empty(head)) {
> -                       struct folio *folio = lru_to_folio(head);
> +               if (!list_empty(head))
> +                       prev = lru_to_folio(head);
> +
> +               while (prev) {
> +                       struct folio *folio = prev;
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
>
> +                       if (unlikely(list_is_first(&folio->lru, head))) {
> +                               prev = NULL;
> +                       } else {
> +                               prev = lru_to_folio(&folio->lru);
> +                               prefetchw(&prev->flags);
> +                       }

This makes the code flow much harder to follow. Also, for architectures
that do not support prefetch, this will be a net loss.

Can you use prefetchw_prev_lru_folio() instead? It will make the code
much easier to follow, and it turns into a no-op when prefetch is not
supported.
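
For reference, the existing helper next to the other LRU prefetch helpers
in mm/vmscan.c looks roughly like the sketch below (reconstructed from
memory, so treat the exact guard and body as approximate, not the real
source):

	#ifdef ARCH_HAS_PREFETCHW
	/* Sketch from memory; see mm/vmscan.c for the real definition. */
	#define prefetchw_prev_lru_folio(_folio, _base, _field)		\
	({								\
		if ((_folio)->lru.prev != _base) {			\
			struct folio *prev;				\
									\
			prev = lru_to_folio(&(_folio)->lru);		\
			prefetchw(&prev->_field);			\
		}							\
	})
	#else
	#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
	#endif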

Chris

> +
>                         new_gen = folio_inc_gen(lruvec, folio, false);
>                         list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> @@ -4341,11 +4352,15 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         for (i = MAX_NR_ZONES; i > 0; i--) {
>                 LIST_HEAD(moved);
>                 int skipped_zone = 0;
> +               struct folio *prev = NULL;
>                 int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
>                 struct list_head *head = &lrugen->folios[gen][type][zone];
>
> -               while (!list_empty(head)) {
> -                       struct folio *folio = lru_to_folio(head);
> +               if (!list_empty(head))
> +                       prev = lru_to_folio(head);
> +
> +               while (prev) {
> +                       struct folio *folio = prev;
>                         int delta = folio_nr_pages(folio);
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> @@ -4355,6 +4370,13 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
>                         scanned += delta;
>
> +                       if (unlikely(list_is_first(&folio->lru, head))) {
> +                               prev = NULL;
> +                       } else {
> +                               prev = lru_to_folio(&folio->lru);
> +                               prefetchw(&prev->flags);
> +                       }
> +
>                         if (sort_folio(lruvec, folio, sc, tier))
>                                 sorted += delta;
>                         else if (isolate_folio(lruvec, folio, sc)) {
> --
> 2.43.0
>
>


* Re: [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-25  7:32   ` Chris Li
@ 2024-01-25 17:51     ` Kairui Song
  2024-01-26  0:56       ` Chris Li
  0 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2024-01-25 17:51 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Yu Zhao, Wei Xu, Matthew Wilcox, linux-kernel

On Thu, Jan 25, 2024 at 3:33 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Jan 23, 2024 at 10:46 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Prefetch for inactive/active LRU have been long exiting, apply the same
> > optimization for MGLRU.
> >
> > Test 1: Ramdisk fio ro test in a 4G memcg on a EPYC 7K62:
> >   fio -name=mglru --numjobs=16 --directory=/mnt --size=960m \
> >     --buffered=1 --ioengine=io_uring --iodepth=128 \
> >     --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> >     --rw=randread --random_distribution=zipf:0.5 --norandommap \
> >     --time_based --ramp_time=1m --runtime=6m --group_reporting
> >
> > Before this patch:
> > bw (  MiB/s): min= 7758, max= 9239, per=100.00%, avg=8747.59, stdev=16.51, samples=11488
> > iops        : min=1986251, max=2365323, avg=2239380.87, stdev=4225.93, samples=11488
> >
> > After this patch (+7.2%):
> > bw (  MiB/s): min= 8360, max= 9771, per=100.00%, avg=9381.31, stdev=15.67, samples=11488
> > iops        : min=2140296, max=2501385, avg=2401613.91, stdev=4010.41, samples=11488
> >
> > Test 2: Ramdisk fio hybrid test for 30m in a 4G memcg on a EPYC 7K62 (3 times):
> >   fio --buffered=1 --numjobs=8 --size=960m --directory=/mnt \
> >     --time_based --ramp_time=1m --runtime=30m \
> >     --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
> >     --iodepth_batch_complete=32 --norandommap \
> >     --name=mglru-ro --rw=randread --random_distribution=zipf:0.7 \
> >     --name=mglru-rw --rw=randrw --random_distribution=zipf:0.7
> >
> > Before this patch:
> >  READ: 6622.0 MiB/s. Stdev: 22.090722
> > WRITE: 1256.3 MiB/s. Stdev: 5.249339
> >
> > After this patch (+4.6%, +3.3%):
> >  READ: 6926.6 MiB/s, Stdev: 37.950260
> > WRITE: 1297.3 MiB/s, Stdev: 7.408704
> >
> > Test 3: 30m of MySQL test in 6G memcg (12 times):
> >   echo 'set GLOBAL innodb_buffer_pool_size=16106127360;' | \
> >     mysql -u USER -h localhost --password=PASS
> >
> >   sysbench /usr/share/sysbench/oltp_read_only.lua \
> >     --mysql-user=USER --mysql-password=PASS --mysql-db=DB \
> >     --tables=48 --table-size=2000000 --threads=16 --time=1800 run
> >
> > Before this patch
> > Avg: 134743.714545 qps. Stdev: 582.242189
> >
> > After this patch (+0.2%):
> > Avg: 135005.779091 qps. Stdev: 295.299027
> >
> > Test 4: Build linux kernel in 2G memcg with make -j48 with SSD swap
> >         (for memory stress, 18 times):
> >
> > Before this patch:
> > Avg: 1456.768899 s. Stdev: 20.106973
> >
> > After this patch (+0.0%):
> > Avg: 1455.659254 s. Stdev: 15.274481
> >
> > Test 5: Memtier test in a 4G cgroup using brd as swap (18 times):
> >   memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> >     -a 0766 -t 16 -B binary &
> >   memtier_benchmark -S /tmp/memcached.socket \
> >     -P memcache_binary -n allkeys \
> >     --key-minimum=1 --key-maximum=16000000 -d 1024 \
> >     --ratio=1:0 --key-pattern=P:P -c 1 -t 16 --pipeline 8 -x 3
> >
> > Before this patch:
> > Avg: 50317.984000 Ops/sec. Stdev: 2568.965458
> >
> > After this patch (-5.7%):
> > Avg: 47691.343500 Ops/sec. Stdev: 3925.772473
> >
> > It seems prefetch is helpful in most cases, but the memtier test is
> > either hitting a case where prefetch causes higher cache miss or it's
> > just too noisy (high stdev).
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 30 ++++++++++++++++++++++++++----
> >  1 file changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4f9c854ce6cc..03631cedb3ab 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> >         /* prevent cold/hot inversion if force_scan is true */
> >         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> >                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
> > +               struct folio *prev = NULL;
> >
> > -               while (!list_empty(head)) {
> > -                       struct folio *folio = lru_to_folio(head);
> > +               if (!list_empty(head))
> > +                       prev = lru_to_folio(head);
> > +
> > +               while (prev) {
> > +                       struct folio *folio = prev;
> >
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
> >
> > +                       if (unlikely(list_is_first(&folio->lru, head))) {
> > +                               prev = NULL;
> > +                       } else {
> > +                               prev = lru_to_folio(&folio->lru);
> > +                               prefetchw(&prev->flags);
> > +                       }
>
> This makes the code flow much harder to follow. Also for architecture
> that does not support prefetch, this will be a net loss.
>
> Can you use refetchw_prev_lru_folio() instead? It will make the code
> much easier to follow. It also turns into no-op when prefetch is not
> supported.
>
> Chris
>

Hi Chris,

Thanks for the suggestion.

Yes, that's doable. I made it this way because in the previous series (V1
& V2) I applied the bulk move patch first, which needed and introduced
the `prev` variable here, so the prefetch logic just used it.
For V3 I did a rebase and moved the prefetch commit to be the first
one, since it seems to be the most effective one, and just kept the
code style to avoid redundant changes between patches.

I can update this individual patch in V4 along the lines of your suggestion.


* Re: [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-25 17:51     ` Kairui Song
@ 2024-01-26  0:56       ` Chris Li
  2024-01-26 10:31         ` Kairui Song
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Li @ 2024-01-26  0:56 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Yu Zhao, Wei Xu, Matthew Wilcox, linux-kernel

On Fri, Jan 26, 2024 at 01:51:44AM +0800, Kairui Song wrote:
> > >  mm/vmscan.c | 30 ++++++++++++++++++++++++++----
> > >  1 file changed, 26 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 4f9c854ce6cc..03631cedb3ab 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > >         /* prevent cold/hot inversion if force_scan is true */
> > >         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > >                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
> > > +               struct folio *prev = NULL;
> > >
> > > -               while (!list_empty(head)) {
> > > -                       struct folio *folio = lru_to_folio(head);
> > > +               if (!list_empty(head))
> > > +                       prev = lru_to_folio(head);
> > > +
> > > +               while (prev) {
> > > +                       struct folio *folio = prev;
> > >
> > >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> > >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> > >                         VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
> > >                         VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
> > >
> > > +                       if (unlikely(list_is_first(&folio->lru, head))) {
> > > +                               prev = NULL;
> > > +                       } else {
> > > +                               prev = lru_to_folio(&folio->lru);
> > > +                               prefetchw(&prev->flags);
> > > +                       }
> >
> > This makes the code flow much harder to follow. Also for architecture
> > that does not support prefetch, this will be a net loss.
> >
> > Can you use refetchw_prev_lru_folio() instead? It will make the code
> > much easier to follow. It also turns into no-op when prefetch is not
> > supported.
> >
> > Chris
> >
> 
> Hi Chris,
> 
> Thanks for the suggestion.
> 
> Yes, that's doable, I made it this way because in previous series (V1
> & V2) I applied the bulk move patch first which needed and introduced
> the `prev` variable here, so the prefetch logic just used it.
> For V3 I did a rebase and moved the prefetch commit to be the first
> one, since it seems to be the most effective one, and just kept the

Maybe something like this? Totally not tested. Feel free to use it any way you want.

Chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..2100e786ccc6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3684,6 +3684,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 
 		while (!list_empty(head)) {
 			struct folio *folio = lru_to_folio(head);
+			prefetchw_prev_lru_folio(folio, head, flags);
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
@@ -4346,7 +4347,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 
 		while (!list_empty(head)) {
 			struct folio *folio = lru_to_folio(head);
-			int delta = folio_nr_pages(folio);
+			int delta;
+
+			prefetchw_prev_lru_folio(folio, head, flags);
+			delta = folio_nr_pages(folio);
 
 			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);



* Re: [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-26  0:56       ` Chris Li
@ 2024-01-26 10:31         ` Kairui Song
  2024-01-26 21:19           ` Chris Li
  0 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2024-01-26 10:31 UTC (permalink / raw)
  To: Chris Li; +Cc: linux-mm, Andrew Morton, Yu Zhao, Wei Xu, Matthew Wilcox, LKML


On Fri, Jan 26, 2024 at 8:56 AM Chris Li <chrisl@kernel.org> wrote:
> On Fri, Jan 26, 2024 at 01:51:44AM +0800, Kairui Song wrote:
> > > >  mm/vmscan.c | 30 ++++++++++++++++++++++++++----
> > > >  1 file changed, 26 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 4f9c854ce6cc..03631cedb3ab 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -3681,15 +3681,26 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > > >         /* prevent cold/hot inversion if force_scan is true */
> > > >         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > > >                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
> > > > +               struct folio *prev = NULL;
> > > >
> > > > -               while (!list_empty(head)) {
> > > > -                       struct folio *folio = lru_to_folio(head);
> > > > +               if (!list_empty(head))
> > > > +                       prev = lru_to_folio(head);
> > > > +
> > > > +               while (prev) {
> > > > +                       struct folio *folio = prev;
> > > >
> > > >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> > > >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> > > >                         VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
> > > >                         VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
> > > >
> > > > +                       if (unlikely(list_is_first(&folio->lru, head))) {
> > > > +                               prev = NULL;
> > > > +                       } else {
> > > > +                               prev = lru_to_folio(&folio->lru);
> > > > +                               prefetchw(&prev->flags);
> > > > +                       }
> > >
> > > This makes the code flow much harder to follow. Also for architecture
> > > that does not support prefetch, this will be a net loss.
> > >
> > > Can you use refetchw_prev_lru_folio() instead? It will make the code
> > > much easier to follow. It also turns into no-op when prefetch is not
> > > supported.
> > >
> > > Chris
> > >
> >
> > Hi Chris,
> >
> > Thanks for the suggestion.
> >
> > Yes, that's doable, I made it this way because in previous series (V1
> > & V2) I applied the bulk move patch first which needed and introduced
> > the `prev` variable here, so the prefetch logic just used it.
> > For V3 I did a rebase and moved the prefetch commit to be the first
> > one, since it seems to be the most effective one, and just kept the
>
> Maybe something like this? Totally not tested. Feel free to use it any way you want.
>
> Chris
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..2100e786ccc6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3684,6 +3684,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>
>                 while (!list_empty(head)) {
>                         struct folio *folio = lru_to_folio(head);
> +                       prefetchw_prev_lru_folio(folio, head, flags);
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> @@ -4346,7 +4347,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
>                 while (!list_empty(head)) {
>                         struct folio *folio = lru_to_folio(head);
> -                       int delta = folio_nr_pages(folio);
> +                       int delta;
> +
> +                       prefetchw_prev_lru_folio(folio, head, flags);
> +                       delta = folio_nr_pages(folio);
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
>

Thanks!

Actually, if the benefit from 2/3 and 3/3 is trivial compared to the
complexity and not appealing, then let's only keep the prefetch one, which
will be just a one-liner change with a good result.



* Re: [PATCH v3 1/3] mm, lru_gen: try to prefetch next page when scanning LRU
  2024-01-26 10:31         ` Kairui Song
@ 2024-01-26 21:19           ` Chris Li
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Li @ 2024-01-26 21:19 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Yu Zhao, Wei Xu, Matthew Wilcox, LKML

On Fri, Jan 26, 2024 at 2:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>

> > > >
> > > > This makes the code flow much harder to follow. Also for architecture
> > > > that does not support prefetch, this will be a net loss.
> > > >
> > > > Can you use refetchw_prev_lru_folio() instead? It will make the code
> > > > much easier to follow. It also turns into no-op when prefetch is not
> > > > supported.
> > > >
> > > > Chris
> > > >
> > >
> > > Hi Chris,
> > >
> > > Thanks for the suggestion.
> > >
> > > Yes, that's doable, I made it this way because in previous series (V1
> > > & V2) I applied the bulk move patch first which needed and introduced
> > > the `prev` variable here, so the prefetch logic just used it.
> > > For V3 I did a rebase and moved the prefetch commit to be the first
> > > one, since it seems to be the most effective one, and just kept the
> >
> > Maybe something like this? Totally not tested. Feel free to use it any way you want.
> >
> > Chris
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4f9c854ce6cc..2100e786ccc6 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3684,6 +3684,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> >
> >                 while (!list_empty(head)) {
> >                         struct folio *folio = lru_to_folio(head);
> > +                       prefetchw_prev_lru_folio(folio, head, flags);
> >
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> > @@ -4346,7 +4347,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >
> >                 while (!list_empty(head)) {
> >                         struct folio *folio = lru_to_folio(head);
> > -                       int delta = folio_nr_pages(folio);
> > +                       int delta;
> > +
> > +                       prefetchw_prev_lru_folio(folio, head, flags);
> > +                       delta = folio_nr_pages(folio);
> >
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> >                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> >
>
> Thanks!
>
> Actually if benefits from 2/3 and 3/3 is trivial compared to the complexity and not appealing, then let's only keep the prefetch one, which will be just a one liner change with good result.

That is great. I did take a look at 2/3 and 3/3 and came to the same
conclusion regarding the complexity.

If you resend the one-liner for 1/3, you can consider it as having my Ack.

Chris

