* Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-01 18:04 Masoud Sharbiani
  2019-08-01 18:19 ` Greg KH
  2019-08-02  7:40 ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-01 18:04 UTC (permalink / raw)
  To: gregkh, mhocko, hannes, vdavydov.dev; +Cc: linux-mm, cgroups, linux-kernel


Hey folks,
I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
It was introduced by

29ef680 memcg, oom: move out_of_memory back to the charge path 

The gist of it is that if you have a memory control group for a process that maps all of the pages of a file with repeated calls to:

   mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0)

The memory cg eventually runs out of memory, as it should. However, prior to commit 29ef680, the kernel would OOM-kill the running process. After that commit (and until 5.3-rc1; I haven’t pinpointed the exact commit between 5.2.0 and 5.3-rc1), the offending process goes to 100% CPU usage and neither dies (the prior behavior) nor sees the mmap call fail (which is what happens if one runs the test program with a low ulimit -v value).

Any ideas on how to chase this down further?

(Test program and script have been pasted below)
Thanks,
Masoud


//——— leaker.c ——
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>
#include <errno.h>
#include <signal.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

void sighandler(int x) {
   printf("SIGNAL %d received. Quitting\n", x);
   exit(2);
}

int main(int ac, char*av[])
{
   int i;
   int fd;
   int pages = 4096;
   char buf[PAGE_SIZE];
   char *d;
   int sum = 0, loop_cnt = 0;
   int max_loops = 100000;
   // For getopt(3) stuff:
   int opt;

   while ((opt = getopt(ac, av, "p:c:")) != -1) {
       switch (opt) {
           case 'p':
               pages = atoi(optarg);
               break;
           case 'c':
               max_loops = atoi(optarg);
               break;
           default:
               fprintf(stderr, "Wrong usage:\n");
               fprintf(stderr, "%s -p <pages> -c <loop_count>\n", av[0]);
               exit(-1);
       }

   }
   signal(SIGTERM, sighandler);
   printf("Mapping %d pages anonymously %d times.\n", pages, max_loops);
   printf("File size will be %ld\n", pages * (long)PAGE_SIZE);
   printf("max memory usage size will be %ld\n", (long) max_loops * pages * PAGE_SIZE);

   memset(buf, 0, PAGE_SIZE);

   fd = open("big-data-file.bin", O_CREAT|O_WRONLY|O_TRUNC , S_IRUSR | S_IWUSR);
   if (fd == -1) {
       printf("open failed: %d - %s\n", errno, strerror(errno));
       return -1;
   }
   for (i=0; i < pages; i++) {
       write(fd, buf, PAGE_SIZE);
   }
   close(fd);
   fd = open("big-data-file.bin", O_RDWR);
   if (fd == -1) {
       printf("open failed: %d - %s\n", errno, strerror(errno));
       return -1;
   }
   printf("fd is %d\n", fd);
   while (loop_cnt < max_loops) {
       d = mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0);
       if (d == MAP_FAILED) {
           printf("mmap failed: %d - %s\n", errno, strerror(errno));
           return -1;
       }
       printf("Buffer is @ %p\n", d);
       for (i = 0; i < pages * PAGE_SIZE; i++) {
           sum += d[i];
           if ((i & (PAGE_SIZE-1)) == 0)
               d[i] = 42;
       }
       printf("Total sum was %d. Loop count is %d\n", sum, loop_cnt++);
   }
   close(fd);
   return 0;
}

///—— test script launching it…
#!/bin/sh

if [ `id -u` -ne 0 ]; then
       echo NEED TO RUN THIS AS ROOT.; exit 1
fi
PID=$$
echo PID detected as: $PID
mkdir /sys/fs/cgroup/memory/leaker
echo 536870912 > /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes

echo leaker mem cgroup created, with `cat /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes` bytes.
echo $PID > /sys/fs/cgroup/memory/leaker/cgroup.procs
echo Moved into the leaker cgroup.
ps -o cgroup $PID
sleep 15
echo Starting...
./leaker -p 10240 -c 100000


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-01 18:04 Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Masoud Sharbiani
@ 2019-08-01 18:19 ` Greg KH
  2019-08-01 22:26   ` Masoud Sharbiani
  2019-08-02  8:08     ` Hillf Danton
  2019-08-02  7:40 ` Michal Hocko
  1 sibling, 2 replies; 25+ messages in thread
From: Greg KH @ 2019-08-01 18:19 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: mhocko, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

On Thu, Aug 01, 2019 at 11:04:14AM -0700, Masoud Sharbiani wrote:
> Hey folks,
> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
> It was introduced by
> 
> 29ef680 memcg, oom: move out_of_memory back to the charge path 
> 
> The gist of it is that if you have a memory control group for a process that repeatedly maps all of the pages of a file with  repeated calls to:
> 
>    mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0)
> 
> The memory cg eventually runs out of memory, as it should. However,
> prior to the 29ef680 commit, it would kill the running process with
> OOM; After that commit ( and until 5.3-rc1; Haven’t pinpointed the
> exact commit in between 5.2.0 and 5.3-rc1) the offending process goes
> into %100 CPU usage, and doesn’t die (prior behavior) or fail the mmap
> call (which is what happens if one runs the test program with a low
> ulimit -v value).
> 
> Any ideas on how to chase this down further?

Finding the exact patch that fixes this would be great, as then I can
add it to the 4.19 and 5.2 stable kernels (4.20 is long end-of-life, no
idea why you are messing with that one...)

thanks,

greg k-h


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-01 18:19 ` Greg KH
@ 2019-08-01 22:26   ` Masoud Sharbiani
  2019-08-02  8:08     ` Hillf Danton
  1 sibling, 0 replies; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-01 22:26 UTC (permalink / raw)
  To: Greg KH; +Cc: mhocko, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel



Allow me to issue a correction: 
Running this test on linux master <629f8205a6cc63d2e8e30956bad958a3507d018f> correctly terminates the leaker app with OOM. 
However, running it a second time (after removing the memory cgroup, and allowing the test script to run it again), causes this:

 kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]


[  202.511024] CPU: 7 PID: 7193 Comm: leaker1 Not tainted 5.3.0-rc2+ #8
[  202.517378] Hardware name: <redacted>
[  202.525554] RIP: 0010:lruvec_lru_size+0x49/0xf0
[  202.530085] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 48 63 d0 4c 89 e9 48 8b 14 d5 20 b7 11 b5 48 03 8b 88 00 00 00 4c 03 34 11 <48> c7 c6 80 c5 40 b5 89 c7 e8 29 a7 6f 00 3b 05 57 9d 24 01 72 d1
[  202.548831] RSP: 0018:ffffa7c5480df620 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  202.556398] RAX: 0000000000000000 RBX: ffff8f5b7a1af800 RCX: 00003859bfa03bc0
[  202.563528] RDX: ffff8f5b7f800000 RSI: 0000000000000018 RDI: ffffffffb540c580
[  202.570662] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
[  202.577795] R10: ffff8f5b62548000 R11: 0000000000000000 R12: 0000000000000004
[  202.584928] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
[  202.592063] FS:  00007ff73d835740(0000) GS:ffff8f6b7f840000(0000) knlGS:0000000000000000
[  202.600149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.605895] CR2: 00007f1b1c00e428 CR3: 0000001021d56006 CR4: 00000000001606e0
[  202.613026] Call Trace:
[  202.615475]  shrink_node_memcg+0xdb/0x7a0
[  202.619488]  ? shrink_slab+0x266/0x2a0
[  202.623242]  ? mem_cgroup_iter+0x10a/0x2c0
[  202.627337]  shrink_node+0xdd/0x4c0
[  202.630831]  do_try_to_free_pages+0xea/0x3c0
[  202.635104]  try_to_free_mem_cgroup_pages+0xf5/0x1e0
[  202.640068]  try_charge+0x279/0x7a0
[  202.643565]  mem_cgroup_try_charge+0x51/0x1a0
[  202.647925]  __add_to_page_cache_locked+0x19f/0x330
[  202.652800]  ? __mod_lruvec_state+0x40/0xe0
[  202.656987]  ? scan_shadow_nodes+0x30/0x30
[  202.661086]  add_to_page_cache_lru+0x49/0xd0
[  202.665361]  iomap_readpages_actor+0xea/0x230
[  202.669718]  ? iomap_migrate_page+0xe0/0xe0
[  202.673906]  iomap_apply+0xb8/0x150
[  202.677398]  iomap_readpages+0xa7/0x1a0
[  202.681237]  ? iomap_migrate_page+0xe0/0xe0
[  202.685424]  read_pages+0x68/0x190
[  202.688829]  __do_page_cache_readahead+0x19c/0x1b0
[  202.693622]  ondemand_readahead+0x168/0x2a0
[  202.697808]  filemap_fault+0x32d/0x830
[  202.701562]  ? __mod_lruvec_state+0x40/0xe0
[  202.705747]  ? page_remove_rmap+0xcf/0x150
[  202.709846]  ? alloc_set_pte+0x240/0x2c0
[  202.713775]  __xfs_filemap_fault+0x71/0x1c0
[  202.717963]  __do_fault+0x38/0xb0
[  202.721280]  __handle_mm_fault+0x73f/0x1080
[  202.725467]  ? __switch_to_asm+0x34/0x70
[  202.729390]  ? __switch_to_asm+0x40/0x70
[  202.733318]  handle_mm_fault+0xce/0x1f0
[  202.737158]  __do_page_fault+0x231/0x480
[  202.741083]  page_fault+0x2f/0x40
[  202.744404] RIP: 0033:0x400c20
[  202.747461] Code: 45 c8 48 89 c6 bf 32 0e 40 00 b8 00 00 00 00 e8 76 fb ff ff c7 45 ec 00 00 00 00 eb 36 8b 45 ec 48 63 d0 48 8b 45 c8 48 01 d0 <0f> b6 00 0f be c0 01 45 e4 8b 45 ec 25 ff 0f 00 00 85 c0 75 10 8b
[  202.766208] RSP: 002b:00007ffde95ae460 EFLAGS: 00010206
[  202.771432] RAX: 00007ff71e855000 RBX: 0000000000000000 RCX: 000000000000001a
[  202.778558] RDX: 0000000001dfd000 RSI: 000000007fffffe5 RDI: 0000000000000000
[  202.785692] RBP: 00007ffde95af4b0 R08: 0000000000000000 R09: 00007ff73d2a520d
[  202.792823] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
[  202.799949] R13: 00007ffde95af590 R14: 0000000000000000 R15: 0000000000000000


Further tests show that this also happens on 5.3-rc1 if one waits long enough.
So I don’t think we have a fix in tree yet. 

Cheers,
Masoud


> On Aug 1, 2019, at 11:19 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> On Thu, Aug 01, 2019 at 11:04:14AM -0700, Masoud Sharbiani wrote:
>> Hey folks,
>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>> It was introduced by
>> 
>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>> 
>> The gist of it is that if you have a memory control group for a process that repeatedly maps all of the pages of a file with  repeated calls to:
>> 
>>   mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0)
>> 
>> The memory cg eventually runs out of memory, as it should. However,
>> prior to the 29ef680 commit, it would kill the running process with
>> OOM; After that commit ( and until 5.3-rc1; Haven’t pinpointed the
>> exact commit in between 5.2.0 and 5.3-rc1) the offending process goes
>> into %100 CPU usage, and doesn’t die (prior behavior) or fail the mmap
>> call (which is what happens if one runs the test program with a low
>> ulimit -v value).
>> 
>> Any ideas on how to chase this down further?
> 
> Finding the exact patch that fixes this would be great, as then I can
> add it to the 4.19 and 5.2 stable kernels (4.20 is long end-of-life, no
> idea why you are messing with that one...)
> 
> thanks,
> 
> greg k-h



* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-02  8:08     ` Hillf Danton
  0 siblings, 0 replies; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-02  1:08 UTC (permalink / raw)
  To: Greg KH; +Cc: mhocko, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel



> On Aug 1, 2019, at 11:19 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> On Thu, Aug 01, 2019 at 11:04:14AM -0700, Masoud Sharbiani wrote:
>> Hey folks,
>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>> It was introduced by
>> 
>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>> 
>> The gist of it is that if you have a memory control group for a process that repeatedly maps all of the pages of a file with repeated calls to:
>> 
>>   mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0)
>> 
>> The memory cg eventually runs out of memory, as it should. However,
>> prior to the 29ef680 commit, it would kill the running process with
>> OOM; After that commit ( and until 5.3-rc1; Haven’t pinpointed the
>> exact commit in between 5.2.0 and 5.3-rc1) the offending process goes
>> into %100 CPU usage, and doesn’t die (prior behavior) or fail the mmap
>> call (which is what happens if one runs the test program with a low
>> ulimit -v value).
>> 
>> Any ideas on how to chase this down further?
> 
> Finding the exact patch that fixes this would be great, as then I can
> add it to the 4.19 and 5.2 stable kernels (4.20 is long end-of-life, no
> idea why you are messing with that one...)
> 
> thanks,
> 
> greg k-h



Allow me to issue a correction: 
Running this test on linux master <629f8205a6cc63d2e8e30956bad958a3507d018f> correctly terminates the leaker app with OOM. 
However, running it a second time (after removing the memory cgroup, and allowing the test script to run it again), causes this:

 kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]


[  202.511024] CPU: 7 PID: 7193 Comm: leaker1 Not tainted 5.3.0-rc2+ #8
[  202.517378] Hardware name: <redacted>
[  202.525554] RIP: 0010:lruvec_lru_size+0x49/0xf0
[  202.530085] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 48 63 d0 4c 89 e9 48 8b 14 d5 20 b7 11 b5 48 03 8b 88 00 00 00 4c 03 34 11 <48> c7 c6 80 c5 40 b5 89 c7 e8 29 a7 6f 00 3b 05 57 9d 24 01 72 d1
[  202.548831] RSP: 0018:ffffa7c5480df620 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  202.556398] RAX: 0000000000000000 RBX: ffff8f5b7a1af800 RCX: 00003859bfa03bc0
[  202.563528] RDX: ffff8f5b7f800000 RSI: 0000000000000018 RDI: ffffffffb540c580
[  202.570662] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
[  202.577795] R10: ffff8f5b62548000 R11: 0000000000000000 R12: 0000000000000004
[  202.584928] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
[  202.592063] FS:  00007ff73d835740(0000) GS:ffff8f6b7f840000(0000) knlGS:0000000000000000
[  202.600149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.605895] CR2: 00007f1b1c00e428 CR3: 0000001021d56006 CR4: 00000000001606e0
[  202.613026] Call Trace:
[  202.615475]  shrink_node_memcg+0xdb/0x7a0
[  202.619488]  ? shrink_slab+0x266/0x2a0
[  202.623242]  ? mem_cgroup_iter+0x10a/0x2c0
[  202.627337]  shrink_node+0xdd/0x4c0
[  202.630831]  do_try_to_free_pages+0xea/0x3c0
[  202.635104]  try_to_free_mem_cgroup_pages+0xf5/0x1e0
[  202.640068]  try_charge+0x279/0x7a0
[  202.643565]  mem_cgroup_try_charge+0x51/0x1a0
[  202.647925]  __add_to_page_cache_locked+0x19f/0x330
[  202.652800]  ? __mod_lruvec_state+0x40/0xe0
[  202.656987]  ? scan_shadow_nodes+0x30/0x30
[  202.661086]  add_to_page_cache_lru+0x49/0xd0
[  202.665361]  iomap_readpages_actor+0xea/0x230
[  202.669718]  ? iomap_migrate_page+0xe0/0xe0
[  202.673906]  iomap_apply+0xb8/0x150
[  202.677398]  iomap_readpages+0xa7/0x1a0
[  202.681237]  ? iomap_migrate_page+0xe0/0xe0
[  202.685424]  read_pages+0x68/0x190
[  202.688829]  __do_page_cache_readahead+0x19c/0x1b0
[  202.693622]  ondemand_readahead+0x168/0x2a0
[  202.697808]  filemap_fault+0x32d/0x830
[  202.701562]  ? __mod_lruvec_state+0x40/0xe0
[  202.705747]  ? page_remove_rmap+0xcf/0x150
[  202.709846]  ? alloc_set_pte+0x240/0x2c0
[  202.713775]  __xfs_filemap_fault+0x71/0x1c0
[  202.717963]  __do_fault+0x38/0xb0
[  202.721280]  __handle_mm_fault+0x73f/0x1080
[  202.725467]  ? __switch_to_asm+0x34/0x70
[  202.729390]  ? __switch_to_asm+0x40/0x70
[  202.733318]  handle_mm_fault+0xce/0x1f0
[  202.737158]  __do_page_fault+0x231/0x480
[  202.741083]  page_fault+0x2f/0x40
[  202.744404] RIP: 0033:0x400c20
[  202.747461] Code: 45 c8 48 89 c6 bf 32 0e 40 00 b8 00 00 00 00 e8 76 fb ff ff c7 45 ec 00 00 00 00 eb 36 8b 45 ec 48 63 d0 48 8b 45 c8 48 01 d0 <0f> b6 00 0f be c0 01 45 e4 8b 45 ec 25 ff 0f 00 00 85 c0 75 10 8b
[  202.766208] RSP: 002b:00007ffde95ae460 EFLAGS: 00010206
[  202.771432] RAX: 00007ff71e855000 RBX: 0000000000000000 RCX: 000000000000001a
[  202.778558] RDX: 0000000001dfd000 RSI: 000000007fffffe5 RDI: 0000000000000000
[  202.785692] RBP: 00007ffde95af4b0 R08: 0000000000000000 R09: 00007ff73d2a520d
[  202.792823] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
[  202.799949] R13: 00007ffde95af590 R14: 0000000000000000 R15: 0000000000000000


Further tests show that this also happens on 5.3-rc1 if one waits long enough.
So I don’t think we have a fix in tree yet. 

Cheers,
Masoud



* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-01 18:04 Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Masoud Sharbiani
  2019-08-01 18:19 ` Greg KH
@ 2019-08-02  7:40 ` Michal Hocko
  2019-08-02 14:18   ` Masoud Sharbiani
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-02  7:40 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: gregkh, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
> Hey folks,
> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
> It was introduced by
> 
> 29ef680 memcg, oom: move out_of_memory back to the charge path 

This commit shouldn't really change the OOM behavior for your particular
test case. It would have changed MAP_POPULATE behavior but your usage is
triggering the standard page fault path. The only difference with
29ef680 is that the OOM killer is invoked during the charge path rather
than on the way out of the page fault.

Anyway, I tried to run your test case in a loop and leaker always ends
up being killed as expected with 5.2. See the below oom report. There
must be something else going on. How much swap do you have on your
system?

[337533.314245] leaker invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[337533.314250] CPU: 3 PID: 23793 Comm: leaker Not tainted 5.2.0-rc7 #54
[337533.314251] Hardware name: Dell Inc. Latitude E7470/0T6HHJ, BIOS 1.5.3 04/18/2016
[337533.314252] Call Trace:
[337533.314258]  dump_stack+0x67/0x8e
[337533.314262]  dump_header+0x51/0x2e9
[337533.314265]  ? preempt_count_sub+0xc6/0xd2
[337533.314267]  ? _raw_spin_unlock_irqrestore+0x2c/0x3e
[337533.314269]  oom_kill_process+0x90/0x11d
[337533.314271]  out_of_memory+0x25c/0x26f
[337533.314273]  mem_cgroup_out_of_memory+0x8a/0xa6
[337533.314276]  try_charge+0x1d0/0x782
[337533.314278]  ? preempt_count_sub+0xc6/0xd2
[337533.314280]  mem_cgroup_try_charge+0x1a1/0x207
[337533.314282]  __add_to_page_cache_locked+0xf9/0x2dd
[337533.314285]  ? memcg_drain_all_list_lrus+0x125/0x125
[337533.314286]  add_to_page_cache_lru+0x3c/0x96
[337533.314288]  pagecache_get_page.part.7+0x1d6/0x240
[337533.314290]  filemap_fault+0x267/0x54a
[337533.314292]  ext4_filemap_fault+0x2d/0x41
[337533.314294]  ? ext4_page_mkwrite+0x3cd/0x3cd
[337533.314296]  __do_fault+0x47/0xa7
[337533.314297]  __handle_mm_fault+0xaaa/0xf9d
[337533.314300]  handle_mm_fault+0x174/0x1c3
[337533.314303]  __do_page_fault+0x309/0x412
[337533.314305]  do_page_fault+0x10b/0x131
[337533.314307]  ? page_fault+0x8/0x30
[337533.314309]  page_fault+0x1e/0x30
[337533.314311] RIP: 0033:0x55a806ef8503
[337533.314313] Code: 48 89 c6 48 8d 3d 28 0c 00 00 b8 00 00 00 00 e8 73 fb ff ff c7 45 ec 00 00 00 00 eb 36 8b 45 ec 48 63 d0 48 8b 45 c8 48 01 d0 <0f> b6 00 0f be c0 01 45 e4 8b 45 ec 25 ff 0f 00 00 85 c0 75 10 8b
[337533.314314] RSP: 002b:00007ffcf6734730 EFLAGS: 00010206
[337533.314316] RAX: 00007f2228f74000 RBX: 0000000000000000 RCX: 0000000000000000
[337533.314317] RDX: 0000000000487000 RSI: 000055a806efc260 RDI: 0000000000000000
[337533.314318] RBP: 00007ffcf6735780 R08: 0000000000000000 R09: 00007ffcf67345fc
[337533.314319] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a806ef8120
[337533.314320] R13: 00007ffcf6735860 R14: 0000000000000000 R15: 0000000000000000
[337533.314322] memory: usage 524288kB, limit 524288kB, failcnt 1240247
[337533.314323] memory+swap: usage 2592556kB, limit 9007199254740988kB, failcnt 0
[337533.314324] kmem: usage 7260kB, limit 9007199254740988kB, failcnt 0
[337533.314325] Memory cgroup stats for /leaker: cache:80KB rss:516948KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:2068268KB inactive_anon:258520KB active_anon:258412KB inactive_file:32KB active_file:12KB unevictable:0KB
[337533.314332] Tasks state (memory values in pages):
[337533.314333] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[337533.314404] [  23777]     0 23777      596      400    36864        4             0 sh
[337533.314407] [  23793]     0 23793   655928   126942  5226496   519670             0 leaker
[337533.314408] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),oom_memcg=/leaker,task_memcg=/leaker,task=leaker,pid=23793,uid=0
[337533.314412] Memory cgroup out of memory: Killed process 23793 (leaker) total-vm:2623712kB, anon-rss:506500kB, file-rss:1268kB, shmem-rss:0kB
[337533.418036] oom_reaper: reaped process 23793 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-02  8:08     ` Hillf Danton
  0 siblings, 0 replies; 25+ messages in thread
From: Hillf Danton @ 2019-08-02  8:08 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: mhocko, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel, Greg KH


On Thu, 01 Aug 2019 18:08:42 -0700 Masoud Sharbiani wrote:
> 
> Allow me to issue a correction:
> Running this test on linux master 
> <629f8205a6cc63d2e8e30956bad958a3507d018f> correctly terminates the 
> leaker app with OOM.
> However, running it a second time (after removing the memory cgroup, and 
> allowing the test script to run it again), causes this:
> 
>  kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]
> 
> 
> [  202.511024] CPU: 7 PID: 7193 Comm: leaker1 Not tainted 5.3.0-rc2+ #8
> [  202.517378] Hardware name: <redacted>
> [  202.525554] RIP: 0010:lruvec_lru_size+0x49/0xf0
> [  202.530085] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 
> 48 63 d0 4c 89 e9 48 8b 14 d5 20 b7 11 b5 48 03 8b 88 00 00 00 4c 03 34 
> 11 <48> c7 c6 80 c5 40 b5 89 c7 e8 29 a7 6f 00 3b 05 57 9d 24 01 72 d1
> [  202.548831] RSP: 0018:ffffa7c5480df620 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> [  202.556398] RAX: 0000000000000000 RBX: ffff8f5b7a1af800 RCX: 00003859bfa03bc0
> [  202.563528] RDX: ffff8f5b7f800000 RSI: 0000000000000018 RDI: ffffffffb540c580
> [  202.570662] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
> [  202.577795] R10: ffff8f5b62548000 R11: 0000000000000000 R12: 0000000000000004
> [  202.584928] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
> [  202.592063] FS:  00007ff73d835740(0000) GS:ffff8f6b7f840000(0000) knlGS:0000000000000000
> [  202.600149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  202.605895] CR2: 00007f1b1c00e428 CR3: 0000001021d56006 CR4: 00000000001606e0
> [  202.613026] Call Trace:
> [  202.615475]  shrink_node_memcg+0xdb/0x7a0
> [  202.619488]  ? shrink_slab+0x266/0x2a0
> [  202.623242]  ? mem_cgroup_iter+0x10a/0x2c0
> [  202.627337]  shrink_node+0xdd/0x4c0
> [  202.630831]  do_try_to_free_pages+0xea/0x3c0
> [  202.635104]  try_to_free_mem_cgroup_pages+0xf5/0x1e0
> [  202.640068]  try_charge+0x279/0x7a0
> [  202.643565]  mem_cgroup_try_charge+0x51/0x1a0
> [  202.647925]  __add_to_page_cache_locked+0x19f/0x330
> [  202.652800]  ? __mod_lruvec_state+0x40/0xe0
> [  202.656987]  ? scan_shadow_nodes+0x30/0x30
> [  202.661086]  add_to_page_cache_lru+0x49/0xd0
> [  202.665361]  iomap_readpages_actor+0xea/0x230
> [  202.669718]  ? iomap_migrate_page+0xe0/0xe0
> [  202.673906]  iomap_apply+0xb8/0x150
> [  202.677398]  iomap_readpages+0xa7/0x1a0
> [  202.681237]  ? iomap_migrate_page+0xe0/0xe0
> [  202.685424]  read_pages+0x68/0x190
> [  202.688829]  __do_page_cache_readahead+0x19c/0x1b0
> [  202.693622]  ondemand_readahead+0x168/0x2a0
> [  202.697808]  filemap_fault+0x32d/0x830
> [  202.701562]  ? __mod_lruvec_state+0x40/0xe0
> [  202.705747]  ? page_remove_rmap+0xcf/0x150
> [  202.709846]  ? alloc_set_pte+0x240/0x2c0
> [  202.713775]  __xfs_filemap_fault+0x71/0x1c0
> [  202.717963]  __do_fault+0x38/0xb0
> [  202.721280]  __handle_mm_fault+0x73f/0x1080
> [  202.725467]  ? __switch_to_asm+0x34/0x70
> [  202.729390]  ? __switch_to_asm+0x40/0x70
> [  202.733318]  handle_mm_fault+0xce/0x1f0
> [  202.737158]  __do_page_fault+0x231/0x480
> [  202.741083]  page_fault+0x2f/0x40
> [  202.744404] RIP: 0033:0x400c20
> [  202.747461] Code: 45 c8 48 89 c6 bf 32 0e 40 00 b8 00 00 00 00 e8 76 
> fb ff ff c7 45 ec 00 00 00 00 eb 36 8b 45 ec 48 63 d0 48 8b 45 c8 48 01 
> d0 <0f> b6 00 0f be c0 01 45 e4 8b 45 ec 25 ff 0f 00 00 85 c0 75 10 8b
> [  202.766208] RSP: 002b:00007ffde95ae460 EFLAGS: 00010206
> [  202.771432] RAX: 00007ff71e855000 RBX: 0000000000000000 RCX: 000000000000001a
> [  202.778558] RDX: 0000000001dfd000 RSI: 000000007fffffe5 RDI: 0000000000000000
> [  202.785692] RBP: 00007ffde95af4b0 R08: 0000000000000000 R09: 00007ff73d2a520d
> [  202.792823] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
> [  202.799949] R13: 00007ffde95af590 R14: 0000000000000000 R15: 0000000000000000
> 
> 
> Further tests show that this also happens if one waits long enough on  
> 5.3-rc1 as well.
> So I dont think we have a fix in tree yet.

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2547,8 +2547,12 @@ retry:
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
 						    gfp_mask, may_swap);
 
-	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
-		goto retry;
+	if (mem_cgroup_margin(mem_over_limit) >= nr_pages) {
+		if (nr_retries--)
+			goto retry;
+		/* give up charging memhog */
+		return -ENOMEM;
+	}
 
 	if (!drained) {
 		drain_all_stock(mem_over_limit);
--



* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02  8:08     ` Hillf Danton
  (?)
@ 2019-08-02  8:18     ` Michal Hocko
  -1 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2019-08-02  8:18 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Masoud Sharbiani, hannes, vdavydov.dev, linux-mm, cgroups,
	linux-kernel, Greg KH

[Hillf, your email client or workflow mangles emails. In this case you
seem to be reusing the message id from the email you are replying to,
which leads my email client to assume your email is a duplicate.]

On Fri 02-08-19 16:08:01, Hillf Danton wrote:
[...]
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2547,8 +2547,12 @@ retry:
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>  						    gfp_mask, may_swap);
>  
> -	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> -		goto retry;
> +	if (mem_cgroup_margin(mem_over_limit) >= nr_pages) {
> +		if (nr_retries--)
> +			goto retry;
> +		/* give up charging memhog */
> +		return -ENOMEM;
> +	}

Huh, what? You are effectively saying that we should fail the charge
when the requested nr_pages would fit in. This doesn't make much sense
to me. What are you trying to achieve here?
-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02  7:40 ` Michal Hocko
@ 2019-08-02 14:18   ` Masoud Sharbiani
  2019-08-02 14:41     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-02 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: gregkh, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

 

> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>> Hey folks,
>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>> It was introduced by
>> 
>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
> 
> This commit shouldn't really change the OOM behavior for your particular
> test case. It would have changed MAP_POPULATE behavior but your usage is
> triggering the standard page fault path. The only difference with
> 29ef680 is that the OOM killer is invoked during the charge path rather
> than on the way out of the page fault.
> 
> Anyway, I tried to run your test case in a loop and leaker always ends
> up being killed as expected with 5.2. See the below oom report. There
> must be something else going on. How much swap do you have on your
> system?

I do not have swap defined. 
-m


> 
> [337533.314245] leaker invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> [337533.314250] CPU: 3 PID: 23793 Comm: leaker Not tainted 5.2.0-rc7 #54
> [337533.314251] Hardware name: Dell Inc. Latitude E7470/0T6HHJ, BIOS 1.5.3 04/18/2016
> [337533.314252] Call Trace:
> [337533.314258]  dump_stack+0x67/0x8e
> [337533.314262]  dump_header+0x51/0x2e9
> [337533.314265]  ? preempt_count_sub+0xc6/0xd2
> [337533.314267]  ? _raw_spin_unlock_irqrestore+0x2c/0x3e
> [337533.314269]  oom_kill_process+0x90/0x11d
> [337533.314271]  out_of_memory+0x25c/0x26f
> [337533.314273]  mem_cgroup_out_of_memory+0x8a/0xa6
> [337533.314276]  try_charge+0x1d0/0x782
> [337533.314278]  ? preempt_count_sub+0xc6/0xd2
> [337533.314280]  mem_cgroup_try_charge+0x1a1/0x207
> [337533.314282]  __add_to_page_cache_locked+0xf9/0x2dd
> [337533.314285]  ? memcg_drain_all_list_lrus+0x125/0x125
> [337533.314286]  add_to_page_cache_lru+0x3c/0x96
> [337533.314288]  pagecache_get_page.part.7+0x1d6/0x240
> [337533.314290]  filemap_fault+0x267/0x54a
> [337533.314292]  ext4_filemap_fault+0x2d/0x41
> [337533.314294]  ? ext4_page_mkwrite+0x3cd/0x3cd
> [337533.314296]  __do_fault+0x47/0xa7
> [337533.314297]  __handle_mm_fault+0xaaa/0xf9d
> [337533.314300]  handle_mm_fault+0x174/0x1c3
> [337533.314303]  __do_page_fault+0x309/0x412
> [337533.314305]  do_page_fault+0x10b/0x131
> [337533.314307]  ? page_fault+0x8/0x30
> [337533.314309]  page_fault+0x1e/0x30
> [337533.314311] RIP: 0033:0x55a806ef8503
> [337533.314313] Code: 48 89 c6 48 8d 3d 28 0c 00 00 b8 00 00 00 00 e8 73 fb ff ff c7 45 ec 00 00 00 00 eb 36 8b 45 ec 48 63 d0 48 8b 45 c8 48 01 d0 <0f> b6 00 0f be c0 01 45 e4 8b 45 ec 25 ff 0f 00 00 85 c0 75 10 8b
> [337533.314314] RSP: 002b:00007ffcf6734730 EFLAGS: 00010206
> [337533.314316] RAX: 00007f2228f74000 RBX: 0000000000000000 RCX: 0000000000000000
> [337533.314317] RDX: 0000000000487000 RSI: 000055a806efc260 RDI: 0000000000000000
> [337533.314318] RBP: 00007ffcf6735780 R08: 0000000000000000 R09: 00007ffcf67345fc
> [337533.314319] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a806ef8120
> [337533.314320] R13: 00007ffcf6735860 R14: 0000000000000000 R15: 0000000000000000
> [337533.314322] memory: usage 524288kB, limit 524288kB, failcnt 1240247
> [337533.314323] memory+swap: usage 2592556kB, limit 9007199254740988kB, failcnt 0
> [337533.314324] kmem: usage 7260kB, limit 9007199254740988kB, failcnt 0
> [337533.314325] Memory cgroup stats for /leaker: cache:80KB rss:516948KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:2068268KB inactive_anon:258520KB active_anon:258412KB inactive_file:32KB active_file:12KB unevictable:0KB
> [337533.314332] Tasks state (memory values in pages):
> [337533.314333] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [337533.314404] [  23777]     0 23777      596      400    36864        4             0 sh
> [337533.314407] [  23793]     0 23793   655928   126942  5226496   519670             0 leaker
> [337533.314408] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),oom_memcg=/leaker,task_memcg=/leaker,task=leaker,pid=23793,uid=0
> [337533.314412] Memory cgroup out of memory: Killed process 23793 (leaker) total-vm:2623712kB, anon-rss:506500kB, file-rss:1268kB, shmem-rss:0kB
> [337533.418036] oom_reaper: reaped process 23793 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 14:18   ` Masoud Sharbiani
@ 2019-08-02 14:41     ` Michal Hocko
  2019-08-02 18:00       ` Masoud Sharbiani
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-02 14:41 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: gregkh, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>  
> 
> > On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
> >> Hey folks,
> >> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
> >> It was introduced by
> >> 
> >> 29ef680 memcg, oom: move out_of_memory back to the charge path 
> > 
> > This commit shouldn't really change the OOM behavior for your particular
> > test case. It would have changed MAP_POPULATE behavior but your usage is
> > triggering the standard page fault path. The only difference with
> > 29ef680 is that the OOM killer is invoked during the charge path rather
> > than on the way out of the page fault.
> > 
> > Anyway, I tried to run your test case in a loop and leaker always ends
> > up being killed as expected with 5.2. See the below oom report. There
> > must be something else going on. How much swap do you have on your
> > system?
> 
> I do not have swap defined. 

OK, I have retested with swap disabled and again everything seems to be
working as expected. The oom happens earlier because I do not have to
wait for the swap to get full.

Which fs do you use to write the file that you mmap? Or could you try to
simplify your test even further? E.g. does everything work as expected
when doing anonymous mmap rather than file backed one?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 14:41     ` Michal Hocko
@ 2019-08-02 18:00       ` Masoud Sharbiani
  2019-08-02 19:14         ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-02 18:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: gregkh, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6201 bytes --]



> On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>> 
>> 
>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>>>> Hey folks,
>>>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>>>> It was introduced by
>>>> 
>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>>> 
>>> This commit shouldn't really change the OOM behavior for your particular
>>> test case. It would have changed MAP_POPULATE behavior but your usage is
>>> triggering the standard page fault path. The only difference with
>>> 29ef680 is that the OOM killer is invoked during the charge path rather
>>> than on the way out of the page fault.
>>> 
>>> Anyway, I tried to run your test case in a loop and leaker always ends
>>> up being killed as expected with 5.2. See the below oom report. There
>>> must be something else going on. How much swap do you have on your
>>> system?
>> 
>> I do not have swap defined. 
> 
> OK, I have retested with swap disabled and again everything seems to be
> working as expected. The oom happens earlier because I do not have to
> wait for the swap to get full.
> 

In my tests (with the script provided), it only loops 11 iterations before hanging, and uttering the soft lockup message.


> Which fs do you use to write the file that you mmap?

/dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Part of the soft lockup path actually specifies that it is going through __xfs_filemap_fault():

[  561.452933] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [leaker:3261]
[  561.459904] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt gpio_ich iTCO_vendor_support dcdbas ipmi_ssif intel_powerclamp coretemp kvm_intel ses ipmi_si kvm enclosure scsi_transport_sas ipmi_devintf irqbypass pcspkr lpc_ich sg joydev ipmi_msghandler wmi acpi_power_meter acpi_cpufreq xfs libcrc32c ata_generic sd_mod pata_acpi ata_piix libata megaraid_sas crc32c_intel serio_raw bnx2 bonding
[  561.495979] CPU: 4 PID: 3261 Comm: leaker Tainted: G          I  L    5.3.0-rc2+ #10
[  561.503704] Hardware name: Dell Inc. PowerEdge R710/0YDJK3, BIOS 6.4.0 07/23/2013
[  561.511168] RIP: 0010:lruvec_lru_size+0x49/0xf0
[  561.515687] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 48 63 d0 4c 89 e9 48 03 8b 88 00 00 00 48 8b 14 d5 60 a9 92 94 4c 03 34 11 <48> c7 c6 80 7c bf 94 89 c7 e8 89 d3 59 00 3b 05 27 eb ff 00 72 d1
[  561.534418] RSP: 0018:ffffb5f886a3f640 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  561.541968] RAX: 0000000000000002 RBX: ffff96fca3bba400 RCX: 00003ef5d82059f0
[  561.549085] RDX: ffff9702a7a40000 RSI: 0000000000000010 RDI: ffffffff94bf7c80
[  561.556202] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff94ae1c00
[  561.563318] R10: ffff96fcc7802520 R11: 0000000000000000 R12: 0000000000000004
[  561.570435] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
[  561.577553] FS:  00007f5522602740(0000) GS:ffff9702a7a80000(0000) knlGS:0000000000000000
[  561.585623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  561.591352] CR2: 00007fba755f95b0 CR3: 0000000c646dc000 CR4: 00000000000006e0
[  561.598468] Call Trace:
[  561.600907]  shrink_node_memcg+0xc8/0x790
[  561.604905]  ? shrink_slab+0x245/0x280
[  561.608644]  ? mem_cgroup_iter+0x10a/0x2c0
[  561.612728]  shrink_node+0xcd/0x490
[  561.616208]  do_try_to_free_pages+0xda/0x3a0
[  561.620466]  ? mem_cgroup_select_victim_node+0x43/0x2f0
[  561.625678]  try_to_free_mem_cgroup_pages+0xe7/0x1c0
[  561.630629]  try_charge+0x246/0x7a0
[  561.634107]  mem_cgroup_try_charge+0x6b/0x1e0
[  561.638453]  ? mem_cgroup_commit_charge+0x5a/0x110
[  561.643231]  __add_to_page_cache_locked+0x195/0x330
[  561.648100]  ? scan_shadow_nodes+0x30/0x30
[  561.652184]  add_to_page_cache_lru+0x39/0xa0
[  561.656442]  iomap_readpages_actor+0xf2/0x230
[  561.660787]  iomap_apply+0xa3/0x130
[  561.664266]  iomap_readpages+0x97/0x180
[  561.668091]  ? iomap_migrate_page+0xe0/0xe0
[  561.672266]  read_pages+0x57/0x180
[  561.675657]  __do_page_cache_readahead+0x1ac/0x1c0
[  561.680436]  ondemand_readahead+0x168/0x2a0
[  561.684606]  filemap_fault+0x30d/0x830
[  561.688343]  ? flush_tlb_func_common.isra.8+0x147/0x230
[  561.693554]  ? __mod_lruvec_state+0x40/0xe0
[  561.697726]  ? alloc_set_pte+0x4e6/0x5b0
[  561.701669]  __xfs_filemap_fault+0x61/0x190 [xfs]
[  561.706361]  __do_fault+0x38/0xb0
[  561.709666]  __handle_mm_fault+0xbee/0xe90
[  561.713750]  handle_mm_fault+0xe2/0x200
[  561.717574]  __do_page_fault+0x224/0x490
[  561.721485]  do_page_fault+0x31/0x120
[  561.725137]  page_fault+0x3e/0x50
[  561.728439] RIP: 0033:0x400c5a
[  561.731483] Code: 45 c0 48 89 c6 bf 77 0e 40 00 b8 00 00 00 00 e8 3c fb ff ff c7 45 dc 00 00 00 00 eb 36 8b 45 dc 48 63 d0 48 8b 45 c0 48 01 d0 <0f> b6 00 0f be c0 01 45 e8 8b 45 dc 25 ff 0f 00 00 85 c0 75 10 8b
[  561.750214] RSP: 002b:00007fffba1d9450 EFLAGS: 00010206
[  561.755426] RAX: 00007f550346b000 RBX: 0000000000000000 RCX: 000000000000001a
[  561.762542] RDX: 0000000001c4c000 RSI: 000000007fffffe5 RDI: 0000000000000000
[  561.769659] RBP: 00007fffba1da4a0 R08: 0000000000000000 R09: 00007f552206c20d
[  561.776775] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
[  561.783892] R13: 00007fffba1da580 R14: 0000000000000000 R15: 0000000000000000


If I switch the backing file to an ext4 filesystem (separate hard drive), it OOMs.


If I switch the file used to /dev/zero, it OOMs: 
…
Todal sum was 0. Loop count is 11
Buffer is @ 0x7f2b66c00000
./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000


> Or could you try to
> simplify your test even further? E.g. does everything work as expected
> when doing anonymous mmap rather than file backed one?

It also OOMs with MAP_ANON. 

Hope that helps.
Masoud


> -- 
> Michal Hocko
> SUSE Labs


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3437 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 18:00       ` Masoud Sharbiani
@ 2019-08-02 19:14         ` Michal Hocko
  2019-08-02 23:28           ` Masoud Sharbiani
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-02 19:14 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: gregkh, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

On Fri 02-08-19 11:00:55, Masoud Sharbiani wrote:
> 
> 
> > On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
> >> 
> >> 
> >>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>> 
> >>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
> >>>> Hey folks,
> >>>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
> >>>> It was introduced by
> >>>> 
> >>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
> >>> 
> >>> This commit shouldn't really change the OOM behavior for your particular
> >>> test case. It would have changed MAP_POPULATE behavior but your usage is
> >>> triggering the standard page fault path. The only difference with
> >>> 29ef680 is that the OOM killer is invoked during the charge path rather
> >>> than on the way out of the page fault.
> >>> 
> >>> Anyway, I tried to run your test case in a loop and leaker always ends
> >>> up being killed as expected with 5.2. See the below oom report. There
> >>> must be something else going on. How much swap do you have on your
> >>> system?
> >> 
> >> I do not have swap defined. 
> > 
> > OK, I have retested with swap disabled and again everything seems to be
> > working as expected. The oom happens earlier because I do not have to
> > wait for the swap to get full.
> > 
> 
> In my tests (with the script provided), it only loops 11 iterations before hanging, and uttering the soft lockup message.
> 
> 
> > Which fs do you use to write the file that you mmap?
> 
> /dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> 
> Part of the soft lockup path actually specifies that it is going through __xfs_filemap_fault():

Right, I have just missed that.

[...]

> If I switch the backing file to a ext4 filesystem (separate hard drive), it OOMs.
> 
> 
> If I switch the file used to /dev/zero, it OOMs: 
> …
> Todal sum was 0. Loop count is 11
> Buffer is @ 0x7f2b66c00000
> ./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000
> 
> 
> > Or could you try to
> > simplify your test even further? E.g. does everything work as expected
> > when doing anonymous mmap rather than file backed one?
> 
> It also OOMs with MAP_ANON. 
> 
> Hope that helps.

It helps to focus more on the xfs reclaim path. Just to be sure, is
there any difference if you use cgroup v2? I do not expect there to be,
but I want to rule out any v1 artifacts.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 19:14         ` Michal Hocko
@ 2019-08-02 23:28           ` Masoud Sharbiani
  2019-08-03  2:36             ` Tetsuo Handa
  2019-08-05  8:18             ` Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-02 23:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg KH, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 2981 bytes --]



> On Aug 2, 2019, at 12:14 PM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Fri 02-08-19 11:00:55, Masoud Sharbiani wrote:
>> 
>> 
>>> On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>>>> 
>>>> 
>>>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>> 
>>>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>>>>>> Hey folks,
>>>>>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>>>>>> It was introduced by
>>>>>> 
>>>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>>>>> 
>>>>> This commit shouldn't really change the OOM behavior for your particular
>>>>> test case. It would have changed MAP_POPULATE behavior but your usage is
>>>>> triggering the standard page fault path. The only difference with
>>>>> 29ef680 is that the OOM killer is invoked during the charge path rather
>>>>> than on the way out of the page fault.
>>>>> 
>>>>> Anyway, I tried to run your test case in a loop and leaker always ends
>>>>> up being killed as expected with 5.2. See the below oom report. There
>>>>> must be something else going on. How much swap do you have on your
>>>>> system?
>>>> 
>>>> I do not have swap defined. 
>>> 
>>> OK, I have retested with swap disabled and again everything seems to be
>>> working as expected. The oom happens earlier because I do not have to
>>> wait for the swap to get full.
>>> 
>> 
>> In my tests (with the script provided), it only loops 11 iterations before hanging, and uttering the soft lockup message.
>> 
>> 
>>> Which fs do you use to write the file that you mmap?
>> 
>> /dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>> 
>> Part of the soft lockup path actually specifies that it is going through __xfs_filemap_fault():
> 
> Right, I have just missed that.
> 
> [...]
> 
>> If I switch the backing file to a ext4 filesystem (separate hard drive), it OOMs.
>> 
>> 
>> If I switch the file used to /dev/zero, it OOMs: 
>> …
>> Todal sum was 0. Loop count is 11
>> Buffer is @ 0x7f2b66c00000
>> ./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000
>> 
>> 
>>> Or could you try to
>>> simplify your test even further? E.g. does everything work as expected
>>> when doing anonymous mmap rather than file backed one?
>> 
>> It also OOMs with MAP_ANON. 
>> 
>> Hope that helps.
> 
> It helps to focus more on the xfs reclaim path. Just to be sure, is
> there any difference if you use cgroup v2? I do not expect to be but
> just to be sure there are no v1 artifacts.

I was unable to use cgroups2. I’ve created the new control group, but the attempt to move a running process into it fails with ‘Device or resource busy’.

Masoud

> -- 
> Michal Hocko
> SUSE Labs


[-- Attachment #1.2: Type: text/html, Size: 12878 bytes --]

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3437 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 23:28           ` Masoud Sharbiani
@ 2019-08-03  2:36             ` Tetsuo Handa
  2019-08-03 15:51               ` Tetsuo Handa
  2019-08-05  8:18             ` Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-03  2:36 UTC (permalink / raw)
  To: Masoud Sharbiani, Michal Hocko
  Cc: Greg KH, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

Well, while mem_cgroup_oom() is actually called, due to hitting

        /*
         * The OOM killer does not compensate for IO-less reclaim.
         * pagefault_out_of_memory lost its gfp context so we have to
         * make sure exclude 0 mask - all other users should have at least
         * ___GFP_DIRECT_RECLAIM to get here.
         */
        if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
                return true;

path inside out_of_memory(), OOM_SUCCESS is returned without anything being
killed, and try_charge() keeps retrying without making forward progress...
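
To make the loop described above concrete, here is a minimal userspace sketch
(not kernel code). The sketch_* names, the SKETCH_* flag values and the retry
count are made up for illustration; only the shape of the gfp check is taken
from out_of_memory(), and the retry reset mirrors the OOM_SUCCESS case in
try_charge().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; these are NOT the real kernel flag values. */
#define SKETCH_GFP_FS          0x1u
#define SKETCH_GFP_IO          0x2u
#define SKETCH_RECLAIM_RETRIES 5

/*
 * Models the early return in out_of_memory(): for a non-zero mask
 * without the FS bit, no victim is killed but "handled" is reported.
 */
static bool sketch_out_of_memory(unsigned int gfp_mask, bool *killed)
{
	if (gfp_mask && !(gfp_mask & SKETCH_GFP_FS))
		return true;	/* bail out, nothing killed */
	*killed = true;		/* pretend a victim was killed */
	return true;
}

/*
 * Models the retry loop in try_charge(). Returns the number of charge
 * attempts needed, or -1 if max_attempts was reached (i.e. a livelock).
 */
static long sketch_try_charge(unsigned int gfp_mask, long max_attempts)
{
	int nr_retries = SKETCH_RECLAIM_RETRIES;
	bool killed = false;
	long attempts = 0;

	assert(max_attempts > 0);
	while (attempts < max_attempts) {
		attempts++;
		if (killed)
			return attempts;	/* memory was freed, charge succeeds */
		if (nr_retries-- > 0)
			continue;		/* reclaim made no progress, retry */
		if (sketch_out_of_memory(gfp_mask, &killed))
			nr_retries = SKETCH_RECLAIM_RETRIES; /* "OOM_SUCCESS" */
	}
	return -1;
}
```

In this model, sketch_try_charge(SKETCH_GFP_IO, n) returns -1 for any n
because the retry budget is refilled every time without a kill, while a mask
that includes SKETCH_GFP_FS terminates after one pass through the OOM path.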

----------------------------------------
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2447,6 +2447,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
         */
        oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
                       get_order(nr_pages * PAGE_SIZE));
+       printk("mem_cgroup_oom(%pGg)=%u\n", &gfp_mask, oom_status);
+       dump_stack();
        switch (oom_status) {
        case OOM_SUCCESS:
                nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
----------------------------------------

----------------------------------------
[   55.208578][ T2798] mem_cgroup_oom(GFP_NOFS)=0
[   55.210424][ T2798] CPU: 3 PID: 2798 Comm: leaker Not tainted 5.3.0-rc2+ #637
[   55.212985][ T2798] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   55.217260][ T2798] Call Trace:
[   55.218597][ T2798]  dump_stack+0x67/0x95
[   55.220200][ T2798]  try_charge+0x4ca/0x6d0
[   55.221843][ T2798]  ? get_mem_cgroup_from_mm+0x1ff/0x2c0
[   55.223855][ T2798]  mem_cgroup_try_charge+0x88/0x2d0
[   55.225723][ T2798]  __add_to_page_cache_locked+0x27e/0x4c0
[   55.227784][ T2798]  ? scan_shadow_nodes+0x30/0x30
[   55.229577][ T2798]  add_to_page_cache_lru+0x72/0x180
[   55.231467][ T2798]  iomap_readpages_actor+0xeb/0x1e0
[   55.233376][ T2798]  ? iomap_migrate_page+0x120/0x120
[   55.235382][ T2798]  iomap_apply+0xaf/0x150
[   55.237049][ T2798]  iomap_readpages+0x9f/0x160
[   55.239061][ T2798]  ? iomap_migrate_page+0x120/0x120
[   55.241013][ T2798]  xfs_vm_readpages+0x54/0x130 [xfs]
[   55.242960][ T2798]  read_pages+0x63/0x160
[   55.244613][ T2798]  __do_page_cache_readahead+0x1cd/0x200
[   55.246699][ T2798]  ondemand_readahead+0x201/0x4d0
[   55.248562][ T2798]  page_cache_async_readahead+0x16e/0x2e0
[   55.250740][ T2798]  ? page_cache_async_readahead+0xa5/0x2e0
[   55.252881][ T2798]  filemap_fault+0x3f3/0xc20
[   55.254813][ T2798]  ? xfs_ilock+0x1de/0x2c0 [xfs]
[   55.256858][ T2798]  ? __xfs_filemap_fault+0x7f/0x270 [xfs]
[   55.259118][ T2798]  ? down_read_nested+0x98/0x170
[   55.261123][ T2798]  ? xfs_ilock+0x1de/0x2c0 [xfs]
[   55.263146][ T2798]  __xfs_filemap_fault+0x92/0x270 [xfs]
[   55.265210][ T2798]  xfs_filemap_fault+0x27/0x30 [xfs]
[   55.267164][ T2798]  __do_fault+0x33/0xd0
[   55.268784][ T2798]  do_fault+0x3be/0x5c0
[   55.270390][ T2798]  __handle_mm_fault+0x462/0xc00
[   55.272251][ T2798]  handle_mm_fault+0x17c/0x380
[   55.274055][ T2798]  ? handle_mm_fault+0x46/0x380
[   55.275877][ T2798]  __do_page_fault+0x24a/0x4c0
[   55.277676][ T2798]  do_page_fault+0x27/0x1b0
[   55.279399][ T2798]  page_fault+0x34/0x40
[   55.281053][ T2798] RIP: 0033:0x4009f0
[   55.282564][ T2798] Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
[   55.289631][ T2798] RSP: 002b:00007fff1804ec00 EFLAGS: 00010206
[   55.291835][ T2798] RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1a000
[   55.294745][ T2798] RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
[   55.297500][ T2798] RBP: 000000000000000c R08: 0000000000000000 R09: 00007f4e7392320d
[   55.300225][ T2798] R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
[   55.303047][ T2798] R13: 0000000000000003 R14: 00007f4e530d6000 R15: 0000000002800000
----------------------------------------



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-03  2:36             ` Tetsuo Handa
@ 2019-08-03 15:51               ` Tetsuo Handa
  2019-08-03 17:41                 ` Masoud Sharbiani
  2019-08-05  8:42                 ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-03 15:51 UTC (permalink / raw)
  To: Masoud Sharbiani, Michal Hocko
  Cc: Greg KH, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

Masoud, will you try this patch?

By the way, is it expected that /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
remains non-zero even though /sys/fs/cgroup/memory/leaker/tasks became empty after
the memcg OOM killer ran? Deleting big-data-file.bin after the memcg OOM kill reduces
the usage somewhat, but it still remains non-zero.

----------------------------------------
From 2f92c70f390f42185c6e2abb8dda98b1b7d02fa9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 4 Aug 2019 00:41:30 +0900
Subject: [PATCH] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer

Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path. It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
move GFP_NOFS check to out_of_memory"). Regarding memcg OOM, we need to
bypass GFP_NOFS check in order to guarantee forward progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Masoud Sharbiani <msharbiani@apple.com>
Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
Fixes: 29ef680ae7c21110 ("memcg, oom: move out_of_memory back to the charge path")
---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..26804ab 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least
-	 * ___GFP_DIRECT_RECLAIM to get here.
+	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
+	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
 		return true;
 
 	/*
-- 
1.8.3.1
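
For readers who want to see the effect of the one-line change in isolation,
the condition before and after the patch can be written as a plain predicate.
This is only a sketch: the sketch_* names and SKETCH_* flag values are
hypothetical, and the memcg_oom argument stands in for what is_memcg_oom(oc)
would return.

```c
#include <stdbool.h>

/* Illustrative stand-ins; these are NOT the real kernel flag values. */
#define SKETCH_GFP_FS 0x1u
#define SKETCH_GFP_IO 0x2u

/*
 * Returns true when out_of_memory() would return early without
 * selecting a victim. "patched" selects the condition from the patch
 * above; otherwise the original condition is evaluated.
 */
static bool sketch_skips_oom_kill(unsigned int gfp_mask, bool memcg_oom,
				  bool patched)
{
	bool nofs = gfp_mask && !(gfp_mask & SKETCH_GFP_FS);

	if (patched)
		return nofs && !memcg_oom;	/* memcg OOM always proceeds */
	return nofs;				/* original: any NOFS request bails */
}
```

With the original condition a NOFS request bails out regardless of who asked;
with the patched one only the global (non-memcg) OOM path keeps that early
return.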



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-03 15:51               ` Tetsuo Handa
@ 2019-08-03 17:41                 ` Masoud Sharbiani
  2019-08-03 18:24                   ` Masoud Sharbiani
  2019-08-05  8:42                 ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-03 17:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Greg KH, hannes, vdavydov.dev, linux-mm, cgroups,
	linux-kernel



> On Aug 3, 2019, at 8:51 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> Masoud, will you try this patch?

Gladly.
It looks like it is working (and OOMing properly).


> 
> By the way, is /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes remains non-zero
> despite /sys/fs/cgroup/memory/leaker/tasks became empty due to memcg OOM killer expected?
> Deleting big-data-file.bin after memcg OOM killer reduces some, but still remains
> non-zero.

Yes. I had not noticed that:

[ 1114.190477] oom_reaper: reaped process 1942 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
./test-script.sh: line 16:  1942 Killed                  ./leaker -p 10240 -c 100000

[root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
3194880
[root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes
536870912
[root@localhost laleaker]# rm -f big-data-file.bin
[root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
2838528
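
If it is useful to poll these counters from a test program rather than with
cat, something along these lines should do. It is a sketch only:
sketch_read_counter() is a made-up helper name, and it assumes the file
holds a single decimal number, as the cgroup v1 files shown above do.

```c
#include <stdio.h>

/*
 * Reads one unsigned decimal value from a file such as
 * /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int sketch_read_counter(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling it once before and once after removing big-data-file.bin would give
the same two numbers as the cat commands above.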

Thanks!
Masoud

PS: Tried hand-back-porting it to 4.19-y and it didn’t work. I think there are other patches between 4.19.0 and 5.3 that could be necessary…


> 
> ----------------------------------------
> From 2f92c70f390f42185c6e2abb8dda98b1b7d02fa9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 4 Aug 2019 00:41:30 +0900
> Subject: [PATCH] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
> 
> Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
> out_of_memory back to the charge path") broke memcg OOM called from
> __xfs_filemap_fault() path. It turned out that try_charge() is retrying
> forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
> cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
> move GFP_NOFS check to out_of_memory"). Regarding memcg OOM, we need to
> bypass GFP_NOFS check in order to guarantee forward progress.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Masoud Sharbiani <msharbiani@apple.com>
> Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
> Fixes: 29ef680ae7c21110 ("memcg, oom: move out_of_memory back to the charge path")
> ---
> mm/oom_kill.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eda2e2a..26804ab 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
> 	 * The OOM killer does not compensate for IO-less reclaim.
> 	 * pagefault_out_of_memory lost its gfp context so we have to
> 	 * make sure exclude 0 mask - all other users should have at least
> -	 * ___GFP_DIRECT_RECLAIM to get here.
> +	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
> +	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
> 	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
> 		return true;
> 
> 	/*
> -- 
> 1.8.3.1
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-03 17:41                 ` Masoud Sharbiani
@ 2019-08-03 18:24                   ` Masoud Sharbiani
  0 siblings, 0 replies; 25+ messages in thread
From: Masoud Sharbiani @ 2019-08-03 18:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Greg KH, hannes, vdavydov.dev, linux-mm, cgroups,
	linux-kernel



> On Aug 3, 2019, at 10:41 AM, Masoud Sharbiani <msharbiani@apple.com> wrote:
> 
> 
> 
>> On Aug 3, 2019, at 8:51 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>> 
>> Masoud, will you try this patch?
> 
> Gladly.
> It looks like it is working (and OOMing properly).
> 
> 
>> 
>> By the way, is /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes remains non-zero
>> despite /sys/fs/cgroup/memory/leaker/tasks became empty due to memcg OOM killer expected?
>> Deleting big-data-file.bin after memcg OOM killer reduces some, but still remains
>> non-zero.
> 
> Yes. I had not noticed that:
> 
> [ 1114.190477] oom_reaper: reaped process 1942 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> ./test-script.sh: line 16:  1942 Killed                  ./leaker -p 10240 -c 100000
> 
> [root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
> 3194880
> [root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes
> 536870912
> [root@localhost laleaker]# rm -f big-data-file.bin
> [root@localhost laleaker]# cat  /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
> 2838528
> 
> Thanks!
> Masoud
> 
> PS: Tried hand-back-porting it to 4.19-y and it didn’t work. I think there are other patches between 4.19.0 and 5.3 that could be necessary…
> 

Please ignore this last part. It works on 4.19-y branch as well. 

Masoud

> 
>> 
>> ----------------------------------------
>> From 2f92c70f390f42185c6e2abb8dda98b1b7d02fa9 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Sun, 4 Aug 2019 00:41:30 +0900
>> Subject: [PATCH] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
>> 
>> Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
>> out_of_memory back to the charge path") broke memcg OOM called from
>> __xfs_filemap_fault() path. It turned out that try_charge() is retrying
>> forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
>> cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
>> move GFP_NOFS check to out_of_memory"). Regarding memcg OOM, we need to
>> bypass GFP_NOFS check in order to guarantee forward progress.
>> 
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Reported-by: Masoud Sharbiani <msharbiani@apple.com>
>> Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
>> Fixes: 29ef680ae7c21110 ("memcg, oom: move out_of_memory back to the charge path")
>> ---
>> mm/oom_kill.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index eda2e2a..26804ab 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
>> 	 * The OOM killer does not compensate for IO-less reclaim.
>> 	 * pagefault_out_of_memory lost its gfp context so we have to
>> 	 * make sure exclude 0 mask - all other users should have at least
>> -	 * ___GFP_DIRECT_RECLAIM to get here.
>> +	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
>> +	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
>> 	 */
>> -	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
>> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>> 		return true;
>> 
>> 	/*
>> -- 
>> 1.8.3.1
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-02 23:28           ` Masoud Sharbiani
  2019-08-03  2:36             ` Tetsuo Handa
@ 2019-08-05  8:18             ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2019-08-05  8:18 UTC (permalink / raw)
  To: Masoud Sharbiani
  Cc: Greg KH, hannes, vdavydov.dev, linux-mm, cgroups, linux-kernel

On Fri 02-08-19 16:28:25, Masoud Sharbiani wrote:
> 
> 
> > On Aug 2, 2019, at 12:14 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Fri 02-08-19 11:00:55, Masoud Sharbiani wrote:
> >> 
> >> 
> >>> On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>> 
> >>> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
> >>>> 
> >>>> 
> >>>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>>>> 
> >>>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
> >>>>>> Hey folks,
> >>>>>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
> >>>>>> It was introduced by
> >>>>>> 
> >>>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
> >>>>> 
> >>>>> This commit shouldn't really change the OOM behavior for your particular
> >>>>> test case. It would have changed MAP_POPULATE behavior but your usage is
> >>>>> triggering the standard page fault path. The only difference with
> >>>>> 29ef680 is that the OOM killer is invoked during the charge path rather
> >>>>> than on the way out of the page fault.
> >>>>> 
> >>>>> Anyway, I tried to run your test case in a loop and leaker always ends
> >>>>> up being killed as expected with 5.2. See the below oom report. There
> >>>>> must be something else going on. How much swap do you have on your
> >>>>> system?
> >>>> 
> >>>> I do not have swap defined. 
> >>> 
> >>> OK, I have retested with swap disabled and again everything seems to be
> >>> working as expected. The oom happens earlier because I do not have to
> >>> wait for the swap to get full.
> >>> 
> >> 
> >> In my tests (with the script provided), it only gets through 11 iterations before hanging and printing the soft lockup message.
> >> 
> >> 
> >>> Which fs do you use to write the file that you mmap?
> >> 
> >> /dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> >> 
> >> Part of the soft lockup path actually specifies that it is going through __xfs_filemap_fault():
> > 
> > Right, I have just missed that.
> > 
> > [...]
> > 
> >> If I switch the backing file to an ext4 filesystem (separate hard drive), it OOMs.
> >> 
> >> 
> >> If I switch the file used to /dev/zero, it OOMs: 
> >> …
> >> Todal sum was 0. Loop count is 11
> >> Buffer is @ 0x7f2b66c00000
> >> ./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000
> >> 
> >> 
> >>> Or could you try to
> >>> simplify your test even further? E.g. does everything work as expected
> >>> when doing anonymous mmap rather than file backed one?
> >> 
> >> It also OOMs with MAP_ANON. 
> >> 
> >> Hope that helps.
> > 
> > It helps to focus more on the xfs reclaim path. Just to be sure, is
> > there any difference if you use cgroup v2? I do not expect there to be, but
> > just to be sure there are no v1 artifacts.
> 
> I was unable to use cgroups2. I’ve created the new control group, but the attempt to move a running process into it fails with ‘Device or resource busy’.

Have you enabled the memory controller for the hierarchy? Please read
Documentation/admin-guide/cgroup-v2.rst for more information.
-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-03 15:51               ` Tetsuo Handa
  2019-08-03 17:41                 ` Masoud Sharbiani
@ 2019-08-05  8:42                 ` Michal Hocko
  2019-08-05 11:36                   ` Tetsuo Handa
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-05  8:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Masoud Sharbiani, Greg KH, hannes, vdavydov.dev, linux-mm,
	cgroups, linux-kernel

On Sun 04-08-19 00:51:18, Tetsuo Handa wrote:
> Masoud, will you try this patch?
> 
> By the way, is it expected that /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
> remains non-zero even after /sys/fs/cgroup/memory/leaker/tasks has become empty due
> to the memcg OOM killer? Deleting big-data-file.bin after the memcg OOM kill reduces
> it somewhat, but it still remains non-zero.
> 
> ----------------------------------------
> From 2f92c70f390f42185c6e2abb8dda98b1b7d02fa9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 4 Aug 2019 00:41:30 +0900
> Subject: [PATCH] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
> 
> Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
> out_of_memory back to the charge path") broke memcg OOM called from
> __xfs_filemap_fault() path.

This is very well spotted! I really didn't think of GFP_NOFS, although
having xfs in the mix could have given me a clue.

> It turned out that try_charge() is retrying
> forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
> cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
> move GFP_NOFS check to out_of_memory"). Regarding memcg OOM, we need to
> bypass GFP_NOFS check in order to guarantee forward progress.

This deserves more information about the fix. Why is it OK to trigger
OOM for GFP_NOFS allocations? Doesn't this lead to premature OOM killer
invocation?

You can argue that memcg charges have ignored GFP_NOFS without seeing a
lot of problems. But please document that in the changelog.

It is 3da88fb3bacfaa33 that introduced this heuristic, and I have to
confess I hadn't realized the side effect on the memcg side because
OOM was triggered only from the GFP_KERNEL context. So I would point
to 3da88fb3bacfaa33 as introducing the regression, albeit a silent one
at the time.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Masoud Sharbiani <msharbiani@apple.com>
> Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
> Fixes: 29ef680ae7c21110 ("memcg, oom: move out_of_memory back to the charge path")

I would say
Fixes: 3da88fb3bacfaa33 # necessary after 29ef680ae7c21110

Other than that I am not really sure about a better fix. Let's see
whether we get some premature memcg OOM reports and think about where
to go from there.

With updated changelog
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/oom_kill.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eda2e2a..26804ab 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
>  	 * The OOM killer does not compensate for IO-less reclaim.
>  	 * pagefault_out_of_memory lost its gfp context so we have to
>  	 * make sure exclude 0 mask - all other users should have at least
> -	 * ___GFP_DIRECT_RECLAIM to get here.
> +	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
> +	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>  		return true;
>  
>  	/*
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-05  8:42                 ` Michal Hocko
@ 2019-08-05 11:36                   ` Tetsuo Handa
  2019-08-05 11:44                     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-05 11:36 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Masoud Sharbiani, Greg KH, hannes, vdavydov.dev, linux-mm,
	cgroups, linux-kernel

I updated the changelog.

From 80b6f63b9d30df414e468e193a7f1b40c373ed68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 5 Aug 2019 20:28:35 +0900
Subject: [PATCH v2] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer

Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path. It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
move GFP_NOFS check to out_of_memory").

Allowing forced charge due to being unable to invoke memcg OOM killer
will lead to global OOM situation, and just returning -ENOMEM will not
solve memcg OOM situation. Therefore, invoking the memcg OOM killer (despite
GFP_NOFS) is the only choice we have for now.

Until 29ef680ae7c21110~1, we were able to invoke memcg OOM killer when
GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to
invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in
the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
premature memcg OOM reports due to this patch.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Masoud Sharbiani <msharbiani@apple.com>
Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Fixes: 3da88fb3bacfaa33 # necessary after 29ef680ae7c21110
Cc: <stable@vger.kernel.org> # 4.19+


[1]

 leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 158965
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


[2]

 leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 7221
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


[3]

 leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
 leaker cpuset=/ mems_allowed=0
 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  [<ffffffffaf364147>] dump_stack+0x19/0x1b
  [<ffffffffaf35eb6a>] dump_header+0x90/0x229
  [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
  [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
  [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
  [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
  [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
  [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
  [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
  [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
  [<ffffffffaf371925>] do_page_fault+0x35/0x90
  [<ffffffffaf36d768>] page_fault+0x28/0x30
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 20628
 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
 Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
 Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB

---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..26804ab 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least
-	 * ___GFP_DIRECT_RECLAIM to get here.
+	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
+	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
 		return true;
 
 	/*
-- 
1.8.3.1


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-05 11:36                   ` Tetsuo Handa
@ 2019-08-05 11:44                     ` Michal Hocko
  2019-08-05 14:00                       ` Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-05 11:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Masoud Sharbiani, Greg KH, hannes, vdavydov.dev,
	linux-mm, cgroups, linux-kernel

On Mon 05-08-19 20:36:05, Tetsuo Handa wrote:
> I updated the changelog.

This looks much better, thanks! One nit

> From 80b6f63b9d30df414e468e193a7f1b40c373ed68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 5 Aug 2019 20:28:35 +0900
> Subject: [PATCH v2] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
> 
> Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
> out_of_memory back to the charge path") broke memcg OOM called from
> __xfs_filemap_fault() path. It turned out that try_charge() is retrying
> forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
> cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
> move GFP_NOFS check to out_of_memory").
> 
> Allowing forced charge due to being unable to invoke memcg OOM killer
> will lead to global OOM situation, and just returning -ENOMEM will not
> solve memcg OOM situation.

Returning -ENOMEM would effectively lead to triggering the OOM killer
from the page fault bail-out path, so it would effectively get us back to
the behavior before 29ef680ae7c21110. But it is true that this is riskier
from the observability POV because a) the OOM path wouldn't point to the
culprit and b) it would leak ENOMEM from the g-u-p (get_user_pages) path.

> Therefore, invoking the memcg OOM killer (despite
> GFP_NOFS) is the only choice we have for now.
> 
> Until 29ef680ae7c21110~1, we were able to invoke memcg OOM killer when
> GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to
> invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in
> the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
> premature memcg OOM reports due to this patch.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-and-tested-by: Masoud Sharbiani <msharbiani@apple.com>
> Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Fixes: 3da88fb3bacfaa33 # necessary after 29ef680ae7c21110
> Cc: <stable@vger.kernel.org> # 4.19+

-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-05 11:44                     ` Michal Hocko
@ 2019-08-05 14:00                       ` Tetsuo Handa
  2019-08-05 14:26                         ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-05 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Masoud Sharbiani, Greg KH, hannes, vdavydov.dev,
	linux-mm, cgroups, linux-kernel

On 2019/08/05 20:44, Michal Hocko wrote:
>> Allowing forced charge due to being unable to invoke memcg OOM killer
>> will lead to global OOM situation, and just returning -ENOMEM will not
>> solve memcg OOM situation.
> 
> Returning -ENOMEM would effectively lead to triggering the OOM killer
> from the page fault bail-out path, so it would effectively get us back to
> the behavior before 29ef680ae7c21110. But it is true that this is riskier
> from the observability POV because a) the OOM path wouldn't point to the
> culprit and b) it would leak ENOMEM from the g-u-p (get_user_pages) path.
> 

Excuse me? According to my experiment, the code below produced a flood of
"Returning -ENOMEM" messages instead of invoking the OOM killer.
I didn't find that it gets us back to before 29ef680ae7c21110...

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1884,6 +1884,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
        mem_cgroup_unmark_under_oom(memcg);
        if (mem_cgroup_out_of_memory(memcg, mask, order))
                ret = OOM_SUCCESS;
+       else if (!(mask & __GFP_FS))
+               ret = OOM_SKIPPED;
        else
                ret = OOM_FAILED;

@@ -2457,8 +2459,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
                goto nomem;
        }
 nomem:
-       if (!(gfp_mask & __GFP_NOFAIL))
+       if (!(gfp_mask & __GFP_NOFAIL)) {
+               printk("Returning -ENOMEM\n");
                return -ENOMEM;
+       }
 force:
        /*
         * The allocation either can't fail or will lead to more memory
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1071,7 +1071,7 @@ bool out_of_memory(struct oom_control *oc)
         * ___GFP_DIRECT_RECLAIM to get here.
         */
        if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
-               return true;
+               return !is_memcg_oom(oc);

        /*
         * Check if there were limitations on the allocation (only relevant for


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-05 14:00                       ` Tetsuo Handa
@ 2019-08-05 14:26                         ` Michal Hocko
  2019-08-06 10:26                           ` Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-05 14:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Masoud Sharbiani, Greg KH, hannes, vdavydov.dev,
	linux-mm, cgroups, linux-kernel

On Mon 05-08-19 23:00:12, Tetsuo Handa wrote:
> On 2019/08/05 20:44, Michal Hocko wrote:
> >> Allowing forced charge due to being unable to invoke memcg OOM killer
> >> will lead to global OOM situation, and just returning -ENOMEM will not
> >> solve memcg OOM situation.
> > 
> > Returning -ENOMEM would effectively lead to triggering the OOM killer
> > from the page fault bail-out path, so it would effectively get us back to
> > the behavior before 29ef680ae7c21110. But it is true that this is riskier
> > from the observability POV because a) the OOM path wouldn't point to the
> > culprit and b) it would leak ENOMEM from the g-u-p (get_user_pages) path.
> > 
> 
> Excuse me? According to my experiment, the code below produced a flood of
> "Returning -ENOMEM" messages instead of invoking the OOM killer.
> I didn't find that it gets us back to before 29ef680ae7c21110...

You would need to declare OOM_ASYNC to return ENOMEM properly from the
charge (which is effectively a revert of 29ef680ae7c21110 for NOFS
allocations). Something like the following:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a4a1de..cc34ff0932ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1797,7 +1797,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	 * Please note that mem_cgroup_out_of_memory might fail to find a
 	 * victim and then we have to bail out from the charge path.
 	 */
-	if (memcg->oom_kill_disable) {
+	if (memcg->oom_kill_disable || !(mask & __GFP_FS)) {
 		if (!current->in_user_fault)
 			return OOM_SKIPPED;
 		css_get(&memcg->css);

I am quite surprised that your patch didn't trigger the global OOM
though. It might mean that ENOMEM doesn't propagate all the way down to
the #PF handler for this path for some reason.

Anyway, what I meant to say is that returning ENOMEM has observability
issues as well.
-- 
Michal Hocko
SUSE Labs


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-05 14:26                         ` Michal Hocko
@ 2019-08-06 10:26                           ` Tetsuo Handa
  2019-08-06 10:50                             ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-06 10:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Masoud Sharbiani, Greg KH, hannes, vdavydov.dev,
	linux-mm, cgroups, linux-kernel

On 2019/08/05 23:26, Michal Hocko wrote:
> On Mon 05-08-19 23:00:12, Tetsuo Handa wrote:
>> On 2019/08/05 20:44, Michal Hocko wrote:
>>>> Allowing forced charge due to being unable to invoke memcg OOM killer
>>>> will lead to global OOM situation, and just returning -ENOMEM will not
>>>> solve memcg OOM situation.
>>>
>>> Returning -ENOMEM would effectively lead to triggering the OOM killer
>>> from the page fault bail-out path, so it would effectively get us back to
>>> the behavior before 29ef680ae7c21110. But it is true that this is riskier
>>> from the observability POV because a) the OOM path wouldn't point to the
>>> culprit and b) it would leak ENOMEM from the g-u-p (get_user_pages) path.
>>>
>>
>> Excuse me? According to my experiment, the code below produced a flood of
>> "Returning -ENOMEM" messages instead of invoking the OOM killer.
>> I didn't find that it gets us back to before 29ef680ae7c21110...
> 
> You would need to declare OOM_ASYNC to return ENOMEM properly from the
> charge (which is effectively a revert of 29ef680ae7c21110 for NOFS
> allocations). Something like the following:
> 

OK. We need to set current->memcg_* before declaring something other than
OOM_SUCCESS and OOM_FAILED... Well, it seems that returning -ENOMEM after
setting current->memcg_* works as expected. What's wrong with your approach?


--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1843,6 +1843,15 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		return OOM_SKIPPED;
 
+	if (!(mask & __GFP_FS)) {
+		BUG_ON(current->memcg_in_oom);
+		css_get(&memcg->css);
+		current->memcg_in_oom = memcg;
+		current->memcg_oom_gfp_mask = mask;
+		current->memcg_oom_order = order;
+		return OOM_ASYNC;
+	}
+
 	memcg_memory_event(memcg, MEMCG_OOM);

 	/*



[   49.921978][ T6736] leaker invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[   49.925152][ T6736] CPU: 1 PID: 6736 Comm: leaker Kdump: loaded Not tainted 5.3.0-rc3+ #936
[   49.927917][ T6736] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   49.931337][ T6736] Call Trace:
[   49.932673][ T6736]  dump_stack+0x67/0x95
[   49.934438][ T6736]  dump_header+0x4d/0x3e0
[   49.936142][ T6736]  oom_kill_process+0x193/0x220
[   49.940276][ T6736]  out_of_memory+0x105/0x360
[   49.941863][ T6736]  mem_cgroup_out_of_memory+0xb6/0xd0
[   49.943819][ T6736]  try_charge+0xa78/0xa90
[   49.945584][ T6736]  mem_cgroup_try_charge+0x88/0x2f0
[   49.947411][ T6736]  __add_to_page_cache_locked+0x27e/0x4c0
[   49.949441][ T6736]  ? scan_shadow_nodes+0x30/0x30
[   49.951155][ T6736]  add_to_page_cache_lru+0x72/0x180
[   49.952940][ T6736]  pagecache_get_page+0xb6/0x2b0
[   49.954718][ T6736]  filemap_fault+0x613/0xc20
[   49.956407][ T6736]  ? filemap_fault+0x446/0xc20
[   49.958221][ T6736]  ? __xfs_filemap_fault+0x7f/0x290 [xfs]
[   49.960206][ T6736]  ? down_read_nested+0x93/0x170
[   49.962141][ T6736]  ? xfs_ilock+0x1ea/0x2f0 [xfs]
[   49.963925][ T6736]  __xfs_filemap_fault+0x92/0x290 [xfs]
[   49.966089][ T6736]  xfs_filemap_fault+0x27/0x30 [xfs]
[   49.967864][ T6736]  __do_fault+0x33/0xd0
[   49.969467][ T6736]  __handle_mm_fault+0x891/0xbe0
[   49.971222][ T6736]  handle_mm_fault+0x179/0x380
[   49.972902][ T6736]  ? handle_mm_fault+0x46/0x380
[   49.974544][ T6736]  __do_page_fault+0x255/0x4d0
[   49.976283][ T6736]  do_page_fault+0x27/0x1e0
[   49.978012][ T6736]  page_fault+0x34/0x40
[   49.979540][ T6736] RIP: 0033:0x4009f0
[   49.981007][ T6736] Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
[   49.987171][ T6736] RSP: 002b:00007ffdbe464810 EFLAGS: 00010206
[   49.989302][ T6736] RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001d69000
[   49.992130][ T6736] RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
[   49.994857][ T6736] RBP: 000000000000000c R08: 0000000000000000 R09: 00007fa1a2ee420d
[   49.997579][ T6736] R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
[   50.000251][ T6736] R13: 0000000000000003 R14: 00007fa182697000 R15: 0000000002800000
[   50.003734][ T6736] memory: usage 524288kB, limit 524288kB, failcnt 660235
[   50.006452][ T6736] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[   50.009165][ T6736] kmem: usage 2196kB, limit 9007199254740988kB, failcnt 0
[   50.011886][ T6736] Memory cgroup stats for /leaker:
[   50.011950][ T6736] anon 534147072
[   50.011950][ T6736] file 212992
[   50.011950][ T6736] kernel_stack 36864
[   50.011950][ T6736] slab 933888
[   50.011950][ T6736] sock 0
[   50.011950][ T6736] shmem 0
[   50.011950][ T6736] file_mapped 0
[   50.011950][ T6736] file_dirty 0
[   50.011950][ T6736] file_writeback 0
[   50.011950][ T6736] anon_thp 0
[   50.011950][ T6736] inactive_anon 0
[   50.011950][ T6736] active_anon 534048768
[   50.011950][ T6736] inactive_file 0
[   50.011950][ T6736] active_file 151552
[   50.011950][ T6736] unevictable 0
[   50.011950][ T6736] slab_reclaimable 327680
[   50.011950][ T6736] slab_unreclaimable 606208
[   50.011950][ T6736] pgfault 140250
[   50.011950][ T6736] pgmajfault 693
[   50.011950][ T6736] workingset_refault 169950
[   50.011950][ T6736] workingset_activate 1353
[   50.011950][ T6736] workingset_nodereclaim 0
[   50.011950][ T6736] pgrefill 5848
[   50.011950][ T6736] pgscan 859688
[   50.011950][ T6736] pgsteal 180103
[   50.052086][ T6736] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),oom_memcg=/leaker,task_memcg=/leaker,task=leaker,pid=6736,uid=0
[   50.056749][ T6736] Memory cgroup out of memory: Killed process 6736 (leaker) total-vm:536700kB, anon-rss:521704kB, file-rss:1180kB, shmem-rss:0kB
[   50.167554][   T55] oom_reaper: reaped process 6736 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
  2019-08-06 10:26                           ` Tetsuo Handa
@ 2019-08-06 10:50                             ` Michal Hocko
  2019-08-06 12:48                               ` [PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2019-08-06 10:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Masoud Sharbiani, Greg KH, hannes, vdavydov.dev,
	linux-mm, cgroups, linux-kernel

On Tue 06-08-19 19:26:12, Tetsuo Handa wrote:
> On 2019/08/05 23:26, Michal Hocko wrote:
> > On Mon 05-08-19 23:00:12, Tetsuo Handa wrote:
> >> On 2019/08/05 20:44, Michal Hocko wrote:
> >>>> Allowing forced charge due to being unable to invoke memcg OOM killer
> >>>> will lead to global OOM situation, and just returning -ENOMEM will not
> >>>> solve memcg OOM situation.
> >>>
> >>> Returning -ENOMEM would effectivelly lead to triggering the oom killer
> >>> from the page fault bail out path. So effectively get us back to before
> >>> 29ef680ae7c21110. But it is true that this is riskier from the
> >>> observability POV when a) the OOM path wouldn't point to the culprit and
> >>> b) it would leak ENOMEM from g-u-p path.
> >>>
> >>
> >> Excuse me? But according to my experiment, below code showed flood of
> >> "Returning -ENOMEM" message instead of invoking the OOM killer.
> >> I didn't find it gets us back to before 29ef680ae7c21110...
> > 
> > You would need to declare OOM_ASYNC to return ENOMEM properly from the
> > charge (which is effectivelly a revert of 29ef680ae7c21110 for NOFS
> > allocations). Something like the following
> > 
> 
> OK. We need to set current->memcg_* before declaring something other than
> OOM_SUCCESS and OOM_FAILED... Well, it seems that returning -ENOMEM after
> setting current->memcg_* works as expected. What's wrong with your approach?

As I've said (and I hoped you could pick up parts of this for the ENOMEM
part of your changelog): a) the oom path is lost, and b) some paths, e.g.
g-u-p, will leak ENOMEM. So your patch to trigger the oom even for NOFS
is the better alternative. I just found your ENOMEM note misleading and
something that could be improved.
-- 
Michal Hocko
SUSE Labs


* [PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
  2019-08-06 10:50                             ` Michal Hocko
@ 2019-08-06 12:48                               ` Tetsuo Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2019-08-06 12:48 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Masoud Sharbiani, Greg KH, hannes, vdavydov.dev, linux-mm,
	cgroups, linux-kernel

Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM handling when
called from the __xfs_filemap_fault() path. It turned out that
try_charge() retries forever without making forward progress because
mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit
3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to out_of_memory").

Allowing a forced charge because the memcg OOM killer cannot be invoked
would lead to a global OOM situation. Also, just returning -ENOMEM would
be risky because the OOM path is lost and some paths (e.g.
get_user_pages()) would leak -ENOMEM. Therefore, invoking the memcg OOM
killer (despite GFP_NOFS) is the only choice we have for now.

Until 29ef680ae7c21110~1, we were able to invoke the memcg OOM killer when
GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to
invoke the memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in
the past we did invoke the memcg OOM killer for GFP_NOFS [3], we might get
premature memcg OOM reports due to this patch.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Masoud Sharbiani <msharbiani@apple.com>
Bisected-by: Masoud Sharbiani <msharbiani@apple.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Fixes: 3da88fb3bacfaa33 # necessary after 29ef680ae7c21110
Cc: <stable@vger.kernel.org> # 4.19+


[1]

 leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 158965
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


[2]

 leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 7221
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


[3]

 leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
 leaker cpuset=/ mems_allowed=0
 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  [<ffffffffaf364147>] dump_stack+0x19/0x1b
  [<ffffffffaf35eb6a>] dump_header+0x90/0x229
  [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
  [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
  [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
  [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
  [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
  [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
  [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
  [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
  [<ffffffffaf371925>] do_page_fault+0x35/0x90
  [<ffffffffaf36d768>] page_fault+0x28/0x30
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 20628
 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
 Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
 Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB

---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..26804ab 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1068,9 +1068,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least
-	 * ___GFP_DIRECT_RECLAIM to get here.
+	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
+	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
 		return true;
 
 	/*
-- 
1.8.3.1



end of thread, other threads:[~2019-08-06 12:48 UTC | newest]

Thread overview: 25+ messages
2019-08-01 18:04 Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Masoud Sharbiani
2019-08-01 18:19 ` Greg KH
2019-08-01 22:26   ` Masoud Sharbiani
2019-08-02  1:08   ` Masoud Sharbiani
2019-08-02  8:08     ` Hillf Danton
2019-08-02  8:18     ` Michal Hocko
2019-08-02  7:40 ` Michal Hocko
2019-08-02 14:18   ` Masoud Sharbiani
2019-08-02 14:41     ` Michal Hocko
2019-08-02 18:00       ` Masoud Sharbiani
2019-08-02 19:14         ` Michal Hocko
2019-08-02 23:28           ` Masoud Sharbiani
2019-08-03  2:36             ` Tetsuo Handa
2019-08-03 15:51               ` Tetsuo Handa
2019-08-03 17:41                 ` Masoud Sharbiani
2019-08-03 18:24                   ` Masoud Sharbiani
2019-08-05  8:42                 ` Michal Hocko
2019-08-05 11:36                   ` Tetsuo Handa
2019-08-05 11:44                     ` Michal Hocko
2019-08-05 14:00                       ` Tetsuo Handa
2019-08-05 14:26                         ` Michal Hocko
2019-08-06 10:26                           ` Tetsuo Handa
2019-08-06 10:50                             ` Michal Hocko
2019-08-06 12:48                               ` [PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Tetsuo Handa
2019-08-05  8:18             ` Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Michal Hocko
