[PATCH v3 0/7] swapin refactor for optimization and unified readahead

* [PATCH v3 0/7] swapin refactor for optimization and unified readahead
@ 2024-01-29 17:54 Kairui Song
  2024-01-29 17:54 ` [PATCH v3 1/7] mm/swapfile.c: add back some comment Kairui Song
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series tries to unify and clean up the swapin path, introduce minor
optimization, and make both shmem swapoff make use of SWP_SYNCHRONOUS_IO
flag to skip readahead and swapcache for better performance.

Test results:
- swap out 10G zero-filled data to ZRAM then read them in:
  Before: 11143285 us
  After:  10692644 us (+4.1%)

- swapping off a 10G ZRAM (lzo-rle) after same workload:
  Before:
  time swapoff /dev/zram0
  real    0m12.337s
  user    0m0.001s
  sys     0m12.329s

  After:
  time swapoff /dev/zram0
  real    0m9.728s
  user    0m0.001s
  sys     0m9.719s

- shmem FIO test 1 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:0.5 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
    bw (  MiB/s): min= 1167, max= 1732, per=100.00%, avg=1460.82, stdev= 4.38, samples=9536
    iops        : min=298938, max=443557, avg=373964.41, stdev=1121.27, samples=9536
  After (+3.5%):
    bw (  MiB/s): min= 1285, max= 1738, per=100.00%, avg=1512.88, stdev= 4.34, samples=9456
    iops        : min=328957, max=445105, avg=387294.21, stdev=1111.15, samples=9456

- shmem FIO test 2 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
    bw (  MiB/s): min= 5296, max= 7112, per=100.00%, avg=6131.93, stdev=17.09, samples=9536
    iops        : min=1355934, max=1820833, avg=1569769.11, stdev=4375.93, samples=9536
  After (+3.1%):
    bw (  MiB/s): min= 5466, max= 7173, per=100.00%, avg=6324.51, stdev=16.66, samples=9521
    iops        : min=1399355, max=1836435, avg=1619068.90, stdev=4263.94, samples=9521

- Some built objects are very slightly smaller (gcc 13.2.1):
./scripts/bloat-o-meter ./vmlinux ./vmlinux.new
add/remove: 4/2 grow/shrink: 1/10 up/down: 818/-983 (-165)
Function                                     old     new   delta
swapin_entry                                   -     482    +482
mm_counter                                     -     248    +248
shmem_swapin_folio                          1412    1468     +56
__pfx_swapin_entry                             -      16     +16
__pfx_mm_counter                               -      16     +16
__read_swap_cache_async                      738     736      -2
copy_present_pte                            1258    1249      -9
mem_cgroup_swapin_charge_folio               297     285     -12
__pfx_swapin_readahead                        16       -     -16
swap_cache_get_folio                         364     345     -19
do_anonymous_page                           1488    1458     -30
unuse_pte_range                              889     833     -56
free_p4d_range                               524     446     -78
restore_exclusive_pte                        937     822    -115
do_swap_page                                2969    2817    -152
swapin_readahead                             239       -    -239
copy_nonpresent_pte                         1478    1223    -255
Total: Before=26056243, After=26056078, chg -0.00%

V2: https://lore.kernel.org/linux-mm/20240102175338.62012-1-ryncsn@gmail.com/
Update from V2:
  - Many code path clean up (merge swapin_entry with swapin_entry_mpol,
    drop second param of mem_cgroup_swapin_charge_folio, swapin_entry
    takes a pointer to folio as return value instaed of pointer to
    boolean to reduce LOC and logic), thanks for Huang, Ying.
  - Don't use cluster readhead for swapoff, the performance is worse
    than VMA readahead for NVME.
  - Add a refactor patch for swap_cache_get_folio.

V1: https://lore.kernel.org/linux-mm/20231119194740.94101-1-ryncsn@gmail.com/T/
Update from V1:
  - Rebased based on mm-unstable.
  - Remove behaviour changing patches, will submit in seperate series
    later.
  - Code style, naming and comments updates.
  - Thanks to Chris Li for very detailed and helpful review of V1. Thanks
    to Matthew Wilcox and Huang Ying for helpful suggestions.

Kairui Song (7):
  mm/swapfile.c: add back some comment
  mm/swap: move no readahead swapin code to a stand-alone helper
  mm/swap: always account swapped in page into current memcg
  mm/swap: introduce swapin_entry for unified readahead policy
  mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  mm/swap, shmem: use unified swapin helper for shmem
  mm/swap: refactor swap_cache_get_folio

 include/linux/memcontrol.h |   4 +-
 mm/memcontrol.c            |   5 +-
 mm/memory.c                |  45 ++--------
 mm/shmem.c                 |  50 +++++++----
 mm/swap.h                  |  23 ++---
 mm/swap_state.c            | 176 ++++++++++++++++++++++++++-----------
 mm/swapfile.c              |  20 +++--
 7 files changed, 190 insertions(+), 133 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 20+ messages in thread