All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-kernel@vger.kernel.org, Yu Zhao <yuzhao@google.com>
Subject: [PATCH 0/3] mm: optimize thp for reclaim and migration
Date: Sat, 31 Jul 2021 00:39:35 -0600	[thread overview]
Message-ID: <20210731063938.1391602-1-yuzhao@google.com> (raw)

Systems using /sys/kernel/mm/transparent_hugepage/enabled=always can
experience memory pressure due to internal fragmentation from a large
number of thp. Userspace can leave many subpages untouched but reclaim
can't identify them based on the dirty bit (and drop them like what it
does to clean pages).

However, it's still possible to know whether a subpage is equivalent
to being clean by checking its content. When splitting a thp for
reclaim or migration, we can drop subpages that contain only zeros
and therefore avoid writing them back or duplicating them.

benchmark
=========
The best case scenario is only the first byte of head is non zero.
The worse case scenario is only the last byte of every subpage is non
zero.

zram 10GB
~~~~~~~~~
  best case   7.170s  3.559s  -50.36%
  worst case  8.216s  8.94s   +8.81%

swap 10GB
~~~~~~~~~
  best case   70.466s  3.544s   -94.97%
  worst case  67.014s  65.521s  -2.22% (noise)

(CONFIG_THP_SWAP=n)

zram (before)
=============
best case
~~~~~~~~~
  time    7.170s

  21.78%  clear_page_erms
  16.47%  zram_bvec_rw
   2.90%  clear_huge_page
   2.74%  _raw_spin_lock
   2.65%  page_vma_mapped_walk
   2.41%  flush_tlb_func
   2.21%  shrink_page_list
   1.55%  try_to_unmap_one
   1.29%  _raw_spin_lock_irqsave
   1.25%  page_counter_cancel
   1.21%  __mod_node_page_state
   1.17%  __mod_lruvec_page_state
   1.10%  add_to_swap_cache
   1.08%  xas_create
   1.07%  xas_store

worst case
~~~~~~~~~~
  time    8.216s

  17.27%  clear_page_erms
  13.25%  lzo1x_1_do_compress
   4.35%  zram_bvec_rw
   3.41%  memset_erms
   3.23%  _raw_spin_lock
   2.88%  page_vma_mapped_walk
   2.75%  clear_huge_page
   2.28%  flush_tlb_func
   1.92%  shrink_page_list
   1.39%  try_to_unmap_one
   1.21%  page_counter_cancel
   1.20%  zs_free
   1.14%  _raw_spin_lock_irqsave
   1.13%  xas_create
   1.02%  __mod_lruvec_page_state

zram (after)
============
best case
~~~~~~~~~
  time    3.559s

  44.55%  clear_page_erms
  27.74%  memchr_inv
   6.43%  clear_huge_page
   2.71%  split_huge_page_to_list
   1.79%  page_vma_mapped_walk
   1.31%  __split_huge_pmd
   1.21%  remove_migration_pte
   1.12%  __free_one_page

worst case
~~~~~~~~~~
  time    8.94s

  16.08%  clear_page_erms
  11.81%  memchr_inv
   9.62%  lzo1x_1_do_compress
   3.51%  memset_erms
   3.17%  zram_bvec_rw
   2.76%  _raw_spin_lock
   2.70%  clear_huge_page
   2.34%  flush_tlb_func
   2.25%  page_vma_mapped_walk
   1.56%  shrink_page_list
   1.33%  try_to_unmap_one
   1.07%  _raw_spin_lock_irqsave
   1.04%  xas_create
   1.01%  zs_free

disk (before)
============
best case
~~~~~~~~~
  time   70.466s

  23.38%  clear_page_erms
   3.91%  clear_huge_page
   3.32%  _raw_spin_lock
   2.95%  page_vma_mapped_walk
   2.43%  shrink_page_list
   2.25%  flush_tlb_func
   1.84%  try_to_unmap_one
   1.53%  _raw_spin_lock_irqsave
   1.45%  __mod_memcg_lruvec_state
   1.37%  page_counter_cancel
   1.17%  _raw_spin_lock_irq
   1.16%  xas_create
   1.12%  _find_next_bit
   1.11%  xas_store
   1.11%  add_to_swap_cache
   1.10%  __mod_node_page_state
   1.04%  __mod_lruvec_page_state
   1.04%  __free_one_page

worst case
~~~~~~~~~~
  time   67.014s

  25.54%  clear_page_erms
   4.36%  clear_huge_page
   3.51%  _raw_spin_lock
   2.97%  page_vma_mapped_walk
   2.85%  flush_tlb_func
   1.84%  try_to_unmap_one
   1.52%  shrink_page_list
   1.47%  __mod_memcg_lruvec_state
   1.42%  page_counter_cancel
   1.38%  _raw_spin_lock_irq
   1.28%  __mod_lruvec_page_state
   1.15%  _raw_spin_lock_irqsave
   1.13%  xas_load
   1.10%  add_to_swap_cache
   1.05%  __mod_node_page_state

disk (after)
============
best case
~~~~~~~~~
  time   3.544s

  42.58%  clear_page_erms
  27.44%  memchr_inv
   6.15%  clear_huge_page
   2.60%  split_huge_page_to_list
   1.74%  page_vma_mapped_walk
   1.26%  __free_one_page
   1.11%  remove_migration_pte
   1.10%  __split_huge_pmd
   1.08%  __list_del_entry_valid

worst case
~~~~~~~~~~
  time   65.521s

  21.83%  clear_page_erms
  14.39%  memchr_inv
   3.66%  clear_huge_page
   3.17%  _raw_spin_lock
   2.38%  page_vma_mapped_walk
   2.32%  flush_tlb_func
   1.59%  try_to_unmap_one
   1.39%  shrink_page_list
   1.26%  page_counter_cancel
   1.16%  __mod_memcg_lruvec_state
   1.11%  add_to_swap_cache
   1.05%  _raw_spin_lock_irq

test.c
======

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>
  
  #define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)
  
  int main(int argc, char **argv)
  {
    char *addr;
    size_t size;
    int offset, num_dirty;
    int ret;
  
    if (argc != 4) {
      printf("Usage: ./a.out SIZE OFFSET NUM_DIRTY\n"
             "SIZE: Size of mmap region in MB\n"
             "OFFSET: Offset of (first) dirty byte in each 4KB page, 0 ~ 4095\n"
             "NUM_DIRTY: Number of dirty 4KB pages in each 2MB region, 0 ~ 512\n");
      return -1;
    }
    
    size = (size_t)(((size_t)atoi(argv[1])) << 20);
    offset = atoi(argv[2]);
    num_dirty = atoi(argv[3]);
  
    addr = (char *)mmap(NULL, size , PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    
    if (addr == MAP_FAILED)
      handle_error("mmap");
  
    ret = madvise(addr, size, MADV_HUGEPAGE);
  
    if (ret == -1)
      handle_error("mmap hugepage");
  
    for (size_t i = 0; i < size; i += (1 << 21))
      for (size_t j = 0; j < (num_dirty << 12); j += (1 << 12))
        memset(addr + i + j + offset, 0xff, 1);
  
    ret = madvise(addr, size, MADV_PAGEOUT);
    if (ret == -1)
      handle_error("mmap pageout");
  
    ret = munmap(addr, size);
    if (ret == -1)
      handle_error("munmap");
  
    return 0;
  }

Yu Zhao (3):
  mm: don't take lru lock when splitting isolated thp
  mm: free zapped tail pages when splitting isolated thp
  mm: don't remap clean subpages when splitting isolated thp

 include/linux/rmap.h          |  2 +-
 include/linux/vm_event_item.h |  2 ++
 mm/huge_memory.c              | 63 ++++++++++++++++++++++++---------
 mm/migrate.c                  | 65 ++++++++++++++++++++++++++++++-----
 mm/vmstat.c                   |  2 ++
 5 files changed, 107 insertions(+), 27 deletions(-)

-- 
2.32.0.554.ge1b32706d8-goog


             reply	other threads:[~2021-07-31  6:39 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-31  6:39 Yu Zhao [this message]
2021-07-31  6:39 ` [PATCH 0/3] mm: optimize thp for reclaim and migration Yu Zhao
2021-07-31  6:39 ` [PATCH 1/3] mm: don't take lru lock when splitting isolated thp Yu Zhao
2021-07-31  6:39   ` Yu Zhao
2021-07-31  6:39 ` [PATCH 2/3] mm: free zapped tail pages " Yu Zhao
2021-07-31  6:39   ` Yu Zhao
2021-08-04 14:22   ` Kirill A. Shutemov
2021-08-08 17:28     ` Yu Zhao
2021-08-08 17:28       ` Yu Zhao
2021-08-05  0:13   ` Yang Shi
2021-08-05  0:13     ` Yang Shi
2021-08-08 17:49     ` Yu Zhao
2021-08-08 17:49       ` Yu Zhao
2021-08-11 22:25       ` Yang Shi
2021-08-11 22:25         ` Yang Shi
2021-08-11 23:12         ` Yu Zhao
2021-08-11 23:12           ` Yu Zhao
2021-08-13 23:24           ` Yang Shi
2021-08-13 23:24             ` Yang Shi
2021-08-13 23:56             ` Yu Zhao
2021-08-13 23:56               ` Yu Zhao
2021-08-14  0:30               ` Yang Shi
2021-08-14  0:30                 ` Yang Shi
2021-08-14  1:49                 ` Yu Zhao
2021-08-14  1:49                   ` Yu Zhao
2021-08-14  2:34                   ` Yang Shi
2021-08-14  2:34                     ` Yang Shi
2021-07-31  6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao
2021-07-31  6:39   ` Yu Zhao
2021-07-31  9:53   ` kernel test robot
2021-07-31  9:53     ` kernel test robot
2021-07-31 15:45   ` kernel test robot
2021-07-31 15:45     ` kernel test robot
2021-08-03 11:25   ` Matthew Wilcox
2021-08-03 11:36   ` Matthew Wilcox
2021-08-08 17:21     ` Yu Zhao
2021-08-08 17:21       ` Yu Zhao
2021-08-04 14:27   ` Kirill A. Shutemov
2021-08-08 17:20     ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210731063938.1391602-1-yuzhao@google.com \
    --to=yuzhao@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=shy828301@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.