linux-kernel.vger.kernel.org archive mirror
* Regression of madvise(MADV_COLD) on shmem?
@ 2022-03-04 17:55 Ivan Teterevkov
  2022-03-05  0:18 ` Minchan Kim
  0 siblings, 1 reply; 8+ messages in thread
From: Ivan Teterevkov @ 2022-03-04 17:55 UTC (permalink / raw)
  To: minchan, akpm, linux-mm, linux-kernel, linux-api, mhocko, hannes,
	timmurray, joel, surenb, dancol, shakeelb, sonnyrao, oleksandr,
	hdanton, lizeb, dave.hansen, kirill.shutemov

Hi folks,

I want to check whether there's a regression in the madvise(MADV_COLD) 
behaviour with shared memory, or whether my understanding of how it 
works is inaccurate.

The MADV_COLD advice was introduced in Linux 5.4 and allows users to 
mark selected memory ranges as more "inactive" than others, overruling 
the default LRU accounting, which helps preserve an application's 
working set. With more recent kernels, e.g. at least 5.17.0-rc6 and 
5.10.42, MADV_COLD has stopped working as expected. Please take a look 
at a short program that demonstrates this:

     /*
      * madvise(MADV_COLD) demo.
      */
     #include <assert.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <sys/mman.h>
     #include <unistd.h>     /* getpagesize(), pause() */

     /* Requires the kernel 5.4 or newer. */
     #ifndef MADV_COLD
     #define MADV_COLD 20
     #endif

     #define GIB(x) ((size_t)(x) << 30)

     int main(void)
     {
         char *shmem, *zeroes;
         int page_size = getpagesize();
         size_t i;

         /* Allocate 8 GiB of shared memory. */
         shmem = mmap(/* addr */ NULL,
                      /* length */ GIB(8),
                      /* prot */ PROT_READ | PROT_WRITE,
                      /* flags */ MAP_SHARED | MAP_ANONYMOUS,
                      /* fd */ -1,
                      /* offset */ 0);
         assert(shmem != MAP_FAILED);

         /* Allocate a zero page for future use. */
         zeroes = calloc(1, page_size);
         assert(zeroes != NULL);

         /* Put a 1 GiB blob at the beginning of the shared memory range. */
         memset(shmem, 0xaa, GIB(1));

         /* Read memory adjacent to the blob. */
         for (i = GIB(1); i < GIB(8); i = i + page_size) {
             int res = memcmp(shmem + i, zeroes, page_size);
             assert(res == 0);

             /* Cool down a zero page and make it "less active" than the
              * blob. Under memory pressure, it'll likely become a reclaim
              * target and thus will help to preserve the blob in memory.
              */
             res = madvise(shmem + i, page_size, MADV_COLD);
             assert(res == 0);
         }

         /* Let the user check smaps. */
         printf("done\n");
         pause();

         free(zeroes);
         munmap(shmem, GIB(8));

         return 0;
     }

How to run this program:

1. Create a "test" cgroup with a memory limit of 3 GiB.

1.1. cgroup v1:

     # mkdir /sys/fs/cgroup/memory/test
     # echo 3G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

1.2. cgroup v2:

     # mkdir /sys/fs/cgroup/test
     # echo 3G > /sys/fs/cgroup/test/memory.max

2. Enable at least a 1 GiB swap device.
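
If no swap is configured, a swap file is one option; this is a hedged
sketch (the /swapfile path and size are assumptions, and it requires
root):

```shell
# Create and enable a 1 GiB swap file; /swapfile is an assumed path.
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
swapon --show
```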

3. Run the program in the "test" cgroup:

     # cgexec -g memory:test ./a.out

4. Wait until it has finished, i.e. has printed "done".

5. Check the shared memory VMA stats.

5.1. In 5.17.0-rc6 and 5.10.42:

     # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
     7f8ed4648000-7f90d4648000 rw-s 00000000 00:01 2055       /dev/zero (deleted)
     Size:            8388608 kB
     KernelPageSize:        4 kB
     MMUPageSize:           4 kB
     Rss:             3119556 kB
     Pss:             3119556 kB
     Shared_Clean:          0 kB
     Shared_Dirty:          0 kB
     Private_Clean:   3119556 kB
     Private_Dirty:         0 kB
     Referenced:            0 kB
     Anonymous:             0 kB
     LazyFree:              0 kB
     AnonHugePages:         0 kB
     ShmemPmdMapped:        0 kB
     FilePmdMapped:         0 kB
     Shared_Hugetlb:        0 kB
     Private_Hugetlb:       0 kB
     Swap:            1048576 kB
     SwapPss:               0 kB
     Locked:                0 kB
     THPeligible:    0
     VmFlags: rd wr sh mr mw me ms sd

5.2. In 5.4.109:

     # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
     7fca5f78b000-7fcc5f78b000 rw-s 00000000 00:01 173051     /dev/zero (deleted)
     Size:            8388608 kB
     KernelPageSize:        4 kB
     MMUPageSize:           4 kB
     Rss:             3121504 kB
     Pss:             3121504 kB
     Shared_Clean:          0 kB
     Shared_Dirty:          0 kB
     Private_Clean:   2072928 kB
     Private_Dirty:   1048576 kB
     Referenced:            0 kB
     Anonymous:             0 kB
     LazyFree:              0 kB
     AnonHugePages:         0 kB
     ShmemPmdMapped:        0 kB
     FilePmdMapped:         0 kB
     Shared_Hugetlb:        0 kB
     Private_Hugetlb:       0 kB
     Swap:                  0 kB
     SwapPss:               0 kB
     Locked:                0 kB
     THPeligible:            0
     VmFlags: rd wr sh mr mw me ms

There's a noticeable difference in the "Swap" figures: the older kernel 
doesn't swap out the blob, but the newer ones do.
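
For quick comparison across kernels, the Swap figure can be pulled out
of smaps with a small filter; this is a hedged sketch (the /dev/zero
mapping name and the running a.out are assumptions based on the output
above):

```shell
# Print the Swap value (in kB) of the first /dev/zero mapping found in
# a smaps dump read on stdin. Reading from stdin keeps it usable on
# either a live /proc/<pid>/smaps or a saved copy.
smaps_swap() {
    awk '/\/dev\/zero/ { found = 1 }
         found && /^Swap:/ { print $2; exit }'
}

# Usage (assumes the demo is still blocked in pause()):
#   smaps_swap < /proc/"$(pidof a.out)"/smaps
```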

According to ftrace, the newer kernels still call deactivate_page() in 
madvise_cold():

# trace-cmd record -p function_graph -g madvise_cold
# trace-cmd report | less
     a.out-4877  [000]  1485.266106: funcgraph_entry:                   |  madvise_cold() {
     a.out-4877  [000]  1485.266115: funcgraph_entry:                   |    walk_page_range() {
     a.out-4877  [000]  1485.266116: funcgraph_entry:                   |      __walk_page_range() {
     a.out-4877  [000]  1485.266117: funcgraph_entry:                   |        madvise_cold_or_pageout_pte_range() {
     a.out-4877  [000]  1485.266118: funcgraph_entry:        0.179 us   |          deactivate_page();

(The irrelevant bits are removed for brevity.)

This makes me think there may be a regression in MADV_COLD. Please let 
me know what you reckon.

Thanks,
Ivan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-04 17:55 Regression of madvise(MADV_COLD) on shmem? Ivan Teterevkov
@ 2022-03-05  0:18 ` Minchan Kim
  2022-03-05  9:17   ` Yu Zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Minchan Kim @ 2022-03-05  0:18 UTC (permalink / raw)
  To: Ivan Teterevkov
  Cc: akpm, linux-mm, linux-kernel, linux-api, mhocko, hannes,
	timmurray, joel, surenb, dancol, shakeelb, sonnyrao, oleksandr,
	hdanton, lizeb, dave.hansen, kirill.shutemov

On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> Hi folks,
> 
> I want to check whether there's a regression in the madvise(MADV_COLD)
> behaviour with shared memory, or whether my understanding of how it
> works is inaccurate.
> 
> [... test program, reproduction steps, smaps output and ftrace trace
> snipped; see the original message above ...]
> 
> This makes me think there may be a regression in MADV_COLD. Please let
> me know what you reckon.

Since deactivate_page() is called, I guess that's not a regression(?) from [1]

Then, my random guess is that the "Swap" difference you mention as a
regression might be related to "workingset detection for anonymous pages",
since the kernel changed the balancing policy between the file and
anonymous LRUs, which was merged in v5.8. It would be helpful to see if
you can try it on v5.7 and v5.8.

[1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-05  0:18 ` Minchan Kim
@ 2022-03-05  9:17   ` Yu Zhao
  2022-03-05  9:49     ` Yu Zhao
  2022-03-07 12:10     ` Michal Hocko
  0 siblings, 2 replies; 8+ messages in thread
From: Yu Zhao @ 2022-03-05  9:17 UTC (permalink / raw)
  To: Minchan Kim, Ivan Teterevkov
  Cc: Andrew Morton, Linux-MM, linux-kernel, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	dancol, Shakeel Butt, sonnyrao, oleksandr, Hillf Danton, lizeb,
	Dave Hansen, Kirill A . Shutemov

On Fri, Mar 4, 2022 at 5:18 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> > Hi folks,
> >
> > [... test program, reproduction steps, smaps output and ftrace trace
> > snipped ...]
> >
> > This makes me think there may be a regression in MADV_COLD. Please
> > let me know what you reckon.
>
> Since deactivate_page() is called, I guess that's not a regression(?) from [1]
> 
> Then, my random guess is that the "Swap" difference you mention as a
> regression might be related to "workingset detection for anonymous pages",
> since the kernel changed the balancing policy between the file and
> anonymous LRUs, which was merged in v5.8. It would be helpful to see if
> you can try it on v5.7 and v5.8.
>
> [1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page

Yes, I've noticed this for a while. Since commit b518154e59a ("mm/vmscan:
protect the workingset on anonymous LRU"), anon/shmem pages start on
the inactive lru, and in this case deactivate_page() is a NOP. Before
5.9, anon/shmem pages started on the active lru, so deactivate_page()
moved the zero pages in the test to the inactive lru and therefore
protected the "blob".

This should fix the problem for your test case:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..7fd99f037ca7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)

 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-       if (PageActive(page) && !PageUnevictable(page)) {
+       if (!PageUnevictable(page)) {
                int nr_pages = thp_nr_pages(page);

                del_page_from_lru_list(page, lruvec);
@@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+       if (PageLRU(page) && !PageUnevictable(page)) {
                struct pagevec *pvec;

                local_lock(&lru_pvecs.lock);

I'll leave it to Minchan to decide whether this is worth fixing,
together with this one:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..2f142f09c8e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -529,10 +529,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
        if (PageUnevictable(page))
                return;

-       /* Some processes are using the page */
-       if (page_mapped(page))
-               return;
-
        del_page_from_lru_list(page, lruvec);
        ClearPageActive(page);
        ClearPageReferenced(page);


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-05  9:17   ` Yu Zhao
@ 2022-03-05  9:49     ` Yu Zhao
  2022-03-07  9:57       ` Ivan Teterevkov
  2022-03-07 12:10     ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Yu Zhao @ 2022-03-05  9:49 UTC (permalink / raw)
  To: Minchan Kim, Ivan Teterevkov
  Cc: Andrew Morton, Linux-MM, linux-kernel, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	dancol, Shakeel Butt, sonnyrao, oleksandr, Hillf Danton, lizeb,
	Dave Hansen, Kirill A . Shutemov

On Sat, Mar 5, 2022 at 2:17 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Mar 4, 2022 at 5:18 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> > >
> > > [... test program, reproduction steps, smaps output and ftrace trace
> > > snipped ...]
> > >
> > > This makes me think there may be a regression in MADV_COLD. Please
> > > let me know what you reckon.
> >
> > Since deactivate_page() is called, I guess that's not a regression(?) from [1]
> >
> > Then, my random guess is that the "Swap" difference you mention as a
> > regression might be related to "workingset detection for anonymous pages",
> > since the kernel changed the balancing policy between the file and
> > anonymous LRUs, which was merged in v5.8. It would be helpful to see if
> > you can try it on v5.7 and v5.8.
> >
> > [1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page
>
> Yes, I've noticed this for a while. Since commit b518154e59a ("mm/vmscan:
> protect the workingset on anonymous LRU"), anon/shmem pages start on
> the inactive lru, and in this case deactivate_page() is a NOP. Before
> 5.9, anon/shmem pages started on the active lru, so deactivate_page()
> moved the zero pages in the test to the inactive lru and therefore
> protected the "blob".
>
> This should fix the problem for your test case:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..7fd99f037ca7 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -       if (PageActive(page) && !PageUnevictable(page)) {
> +       if (!PageUnevictable(page)) {
>                 int nr_pages = thp_nr_pages(page);
>
>                 del_page_from_lru_list(page, lruvec);

Missed one line below:

                ClearPageActive(page);
                ClearPageReferenced(page);
-               add_page_to_lru_list(page, lruvec);
+               add_page_to_lru_list_tail(page, lruvec);

                __count_vm_events(PGDEACTIVATE, nr_pages);
                __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,

> @@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
>   */
>  void deactivate_page(struct page *page)
>  {
> -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +       if (PageLRU(page) && !PageUnevictable(page)) {
>                 struct pagevec *pvec;
>
>                 local_lock(&lru_pvecs.lock);
>
> I'll leave it to Minchan to decide whether this is worth fixing,
> together with this one:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..2f142f09c8e1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -529,10 +529,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>         if (PageUnevictable(page))
>                 return;
>
> -       /* Some processes are using the page */
> -       if (page_mapped(page))
> -               return;
> -
>         del_page_from_lru_list(page, lruvec);
>         ClearPageActive(page);
>         ClearPageReferenced(page);


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-05  9:49     ` Yu Zhao
@ 2022-03-07  9:57       ` Ivan Teterevkov
  0 siblings, 0 replies; 8+ messages in thread
From: Ivan Teterevkov @ 2022-03-07  9:57 UTC (permalink / raw)
  To: Yu Zhao, Minchan Kim
  Cc: Andrew Morton, Linux-MM, linux-kernel, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	dancol, Shakeel Butt, sonnyrao, oleksandr, Hillf Danton, lizeb,
	Dave Hansen, Kirill A . Shutemov



On 05-Mar-22 9:49 AM, Yu Zhao wrote:
> On Sat, Mar 5, 2022 at 2:17 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Fri, Mar 4, 2022 at 5:18 PM Minchan Kim <minchan@kernel.org> wrote:
>>>
>>> On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
>>>> It makes me think there may be a regression in MADV_COLD. Please let me know
>>>> what you reckon?
>>>
>>> Since deactivate_page() is called, I guess that's not a regression(?) from [1]
>>>
>>> Then, my random guess is that the "Swap" regression you mentioned might be
>>> related to "workingset detection for anon pages", which changed the balancing
>>> policy between the file and anonymous LRUs and was merged into v5.8.
>>> It would be helpful if you could try it on both v5.7 and v5.8.
>>>
>>> [1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page
>>
>> Yes, I've noticed this for a while. With commit b518154e59a ("mm/vmscan:
>> protect the workingset on anonymous LRU"), anon/shmem pages start on
>> the inactive lru, and in this case, deactivate_page() is a NOP. Before
>> 5.9, anon/shmem pages started on the active lru, so deactivate_page()
>> moved the zero pages in the test to the inactive lru and therefore
>> protected the "blob".
>>
>> This should fix the problem for your test case:
>>

Thank you, it does fix the test case in 5.17.0-rc6+.

>> I'll leave it to Minchan to decide whether this is worth fixing,
>> together with this one:
>>

I suppose add_page_to_lru_list_tail() makes madvise(MADV_COLD) more 
"insistent" about moving the pages to the end of the inactive LRU, but 
perhaps that's aligned with its purpose?

Cheers,
Ivan


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-05  9:17   ` Yu Zhao
  2022-03-05  9:49     ` Yu Zhao
@ 2022-03-07 12:10     ` Michal Hocko
  2022-03-10  9:01       ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2022-03-07 12:10 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Minchan Kim, Ivan Teterevkov, Andrew Morton, Linux-MM,
	linux-kernel, linux-api, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, dancol, Shakeel Butt,
	sonnyrao, oleksandr, Hillf Danton, lizeb, Dave Hansen,
	Kirill A . Shutemov

On Sat 05-03-22 02:17:37, Yu Zhao wrote:
[...]
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..7fd99f037ca7 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page
> *page, struct lruvec *lruvec)
> 
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -       if (PageActive(page) && !PageUnevictable(page)) {
> +       if (!PageUnevictable(page)) {
>                 int nr_pages = thp_nr_pages(page);
> 
>                 del_page_from_lru_list(page, lruvec);
> @@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
>   */
>  void deactivate_page(struct page *page)
>  {
> -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +       if (PageLRU(page) && !PageUnevictable(page)) {
>                 struct pagevec *pvec;
> 
>                 local_lock(&lru_pvecs.lock);
> 
> I'll leave it to Minchan to decide whether this is worth fixing,
> together with this one:

There doesn't seem to be any dependency on PageActive anymore. I do
remember we relied on PageActive to move pages from the active list
to the inactive one. This is not the case anymore, but I am wondering
whether the above is really sufficient. If you are deactivating an
inactive page, then I would expect you want to move that page within
the LRU as well. In other words, don't you want
	if (page_active)
		add_page_to_lru_list
	else
		add_page_to_lru_list_tail
-- 
Michal Hocko
SUSE Labs


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-07 12:10     ` Michal Hocko
@ 2022-03-10  9:01       ` Michal Hocko
  2022-03-11  0:09         ` Yu Zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2022-03-10  9:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Minchan Kim, Ivan Teterevkov, Andrew Morton, Linux-MM,
	linux-kernel, linux-api, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, dancol, Shakeel Butt,
	sonnyrao, oleksandr, Hillf Danton, lizeb, Dave Hansen,
	Kirill A . Shutemov

On Mon 07-03-22 13:10:08, Michal Hocko wrote:
> On Sat 05-03-22 02:17:37, Yu Zhao wrote:
> [...]
> > diff --git a/mm/swap.c b/mm/swap.c
> > index bcf3ac288b56..7fd99f037ca7 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page
> > *page, struct lruvec *lruvec)
> > 
> >  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
> >  {
> > -       if (PageActive(page) && !PageUnevictable(page)) {
> > +       if (!PageUnevictable(page)) {
> >                 int nr_pages = thp_nr_pages(page);
> > 
> >                 del_page_from_lru_list(page, lruvec);
> > @@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
> >   */
> >  void deactivate_page(struct page *page)
> >  {
> > -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > +       if (PageLRU(page) && !PageUnevictable(page)) {
> >                 struct pagevec *pvec;
> > 
> >                 local_lock(&lru_pvecs.lock);
> > 
> > I'll leave it to Minchan to decide whether this is worth fixing,
> > together with this one:
> 
> There doesn't seem to be any dependency on PageActive anymore. I do
> remember we relied on PageActive to move pages from the active list
> to the inactive one. This is not the case anymore, but I am wondering
> whether the above is really sufficient. If you are deactivating an
> inactive page, then I would expect you want to move that page within
> the LRU as well. In other words, don't you want
> 	if (page_active)
> 		add_page_to_lru_list
> 	else
> 		add_page_to_lru_list_tail

Do you plan to send an official patch?
-- 
Michal Hocko
SUSE Labs


* Re: Regression of madvise(MADV_COLD) on shmem?
  2022-03-10  9:01       ` Michal Hocko
@ 2022-03-11  0:09         ` Yu Zhao
  0 siblings, 0 replies; 8+ messages in thread
From: Yu Zhao @ 2022-03-11  0:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Ivan Teterevkov, Andrew Morton, Linux-MM,
	linux-kernel, linux-api, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, dancol, Shakeel Butt,
	sonnyrao, oleksandr, Hillf Danton, Benoit Lize, Dave Hansen,
	Kirill A . Shutemov

On Thu, Mar 10, 2022 at 2:01 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 07-03-22 13:10:08, Michal Hocko wrote:
> > On Sat 05-03-22 02:17:37, Yu Zhao wrote:
> > [...]
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index bcf3ac288b56..7fd99f037ca7 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page
> > > *page, struct lruvec *lruvec)
> > >
> > >  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
> > >  {
> > > -       if (PageActive(page) && !PageUnevictable(page)) {
> > > +       if (!PageUnevictable(page)) {
> > >                 int nr_pages = thp_nr_pages(page);
> > >
> > >                 del_page_from_lru_list(page, lruvec);
> > > @@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
> > >   */
> > >  void deactivate_page(struct page *page)
> > >  {
> > > -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > > +       if (PageLRU(page) && !PageUnevictable(page)) {
> > >                 struct pagevec *pvec;
> > >
> > >                 local_lock(&lru_pvecs.lock);
> > >
> > > I'll leave it to Minchan to decide whether this is worth fixing,
> > > together with this one:
> >
> > There doesn't seem to be any dependency on PageActive anymore. I do
> > remember we relied on PageActive to move pages from the active list
> > to the inactive one. This is not the case anymore, but I am wondering
> > whether the above is really sufficient. If you are deactivating an
> > inactive page, then I would expect you want to move that page within
> > the LRU as well. In other words, don't you want
> >       if (page_active)
> >               add_page_to_lru_list
> >       else
> >               add_page_to_lru_list_tail

Yes, this is better.

> Do you plan to send an official patch?

One thing I still haven't thought through is why the A-bit couldn't
protect the blob in the test. In theory it should be enough even
though deactivate_page() is a NOP.

1. all pages are initially inactive and have the A-bit set
2. madvise(COLD) clears the A-bit for zero-filled pages (but fails to
change their LRU positions)
3. the memcg hits the limit
4. pages in the blob are moved to the active LRU because those pages
still have the A-bit (zero-filled pages remain inactive)
5. inactive_is_low() tests true and the blob gets deactivated???

The last step doesn't make sense, since the inactive list is still very large.

Thanks.


end of thread, other threads:[~2022-03-11  0:09 UTC | newest]

Thread overview: 8+ messages
2022-03-04 17:55 Regression of madvise(MADV_COLD) on shmem? Ivan Teterevkov
2022-03-05  0:18 ` Minchan Kim
2022-03-05  9:17   ` Yu Zhao
2022-03-05  9:49     ` Yu Zhao
2022-03-07  9:57       ` Ivan Teterevkov
2022-03-07 12:10     ` Michal Hocko
2022-03-10  9:01       ` Michal Hocko
2022-03-11  0:09         ` Yu Zhao
