* Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-01 18:04 Masoud Sharbiani
  2019-08-01 18:19 ` Greg KH
  2019-08-02  7:40   ` Michal Hocko
  0 siblings, 2 replies; 31+ messages in thread
From: Masoud Sharbiani @ 2019-08-01 18:04 UTC (permalink / raw)
  To: gregkh, mhocko, hannes, vdavydov.dev; +Cc: linux-mm, cgroups, linux-kernel


Hey folks,
I’ve come across an issue that affects most of the 4.19, 4.20, and 5.2 linux-stable kernels and has only been fixed in 5.3-rc1.
It was introduced by

29ef680 ("memcg, oom: move out_of_memory back to the charge path")

The gist of it is that if you have a memory control group containing a process that repeatedly maps all of the pages of a file with calls to:

   mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0)

The memory cgroup eventually runs out of memory, as it should. However, prior to commit 29ef680, the kernel would OOM-kill the running process; after that commit (and until 5.3-rc1; I haven't pinpointed the exact fix between 5.2.0 and 5.3-rc1) the offending process spins at 100% CPU usage and neither dies (the prior behavior) nor fails the mmap call (which is what happens if one runs the test program with a low ulimit -v value).
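
While the test below runs, the stuck state can be watched from a second shell with something like the following (a rough sketch against the cgroup v1 files the test script creates; adjust the group name if yours differs):

   # Watch usage, charge failures, and OOM state of the leaker cgroup.
   while sleep 1; do
       cat /sys/fs/cgroup/memory/leaker/memory.usage_in_bytes
       cat /sys/fs/cgroup/memory/leaker/memory.failcnt
       grep under_oom /sys/fs/cgroup/memory/leaker/memory.oom_control
   done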

Any ideas on how to chase this down further?

(Test program and script have been pasted below)
Thanks,
Masoud


//——— leaker.c ——
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>
#include <errno.h>
#include <signal.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

void sighandler(int x) {
   printf("SIGNAL %d received. Quitting\n", x);
   exit(2);
}

int main(int ac, char *av[])
{
   int i;
   int fd;
   int pages = 4096;
   char buf[PAGE_SIZE];
   char *d;
   int sum = 0, loop_cnt = 0;
   int max_loops = 100000;
   // For getopt(3) stuff:
   int opt;

   while ((opt = getopt(ac, av, "p:c:")) != -1) {
       switch (opt) {
           case 'p':
               pages = atoi(optarg);
               break;
           case 'c':
               max_loops = atoi(optarg);
               break;
           default:
               fprintf(stderr, "Wrong usage:\n");
               fprintf(stderr, "%s -p <pages> -c <loop_count>\n", av[0]);
               exit(-1);
       }

   }
   signal(SIGTERM, sighandler);
   printf("Mapping %d pages anonymously %d times.\n", pages, max_loops);
   printf("File size will be %ld\n", pages * (long)PAGE_SIZE);
   printf("max memory usage size will be %ld\n", (long) max_loops * pages * PAGE_SIZE);

   memset(buf, 0, PAGE_SIZE);

   fd = open("big-data-file.bin", O_CREAT|O_WRONLY|O_TRUNC , S_IRUSR | S_IWUSR);
   if (fd == -1) {
       printf("open failed: %d - %s\n", errno, strerror(errno));
       return -1;
   }
   for (i=0; i < pages; i++) {
       write(fd, buf, PAGE_SIZE);
   }
   close(fd);
   fd = open("big-data-file.bin", O_RDWR);
   printf("fd is %d\n", fd);
   /*
    * Map the file again on every iteration and never munmap it, so the
    * dirtied private (CoW) pages pile up against the memcg limit.
    */
   while (loop_cnt < max_loops) {
       d = mmap(NULL, pages * PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_FILE|MAP_PRIVATE, fd, 0);
       if (d == MAP_FAILED) {
           printf("mmap failed: %d - %s\n", errno, strerror(errno));
           return -1;
       }
       printf("Buffer is @ %p\n", d);
       for (i = 0; i < pages * PAGE_SIZE; i++) {
           sum += d[i];
           if ((i & (PAGE_SIZE-1)) == 0)
               d[i] = 42;
       }
       printf("Todal sum was %d. Loop count is %d\n", sum, loop_cnt++);
   }
   close(fd);
   return 0;
}
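
To build and run the reproducer by hand (a rough sketch; it assumes gcc and a writable current directory for big-data-file.bin):

   gcc -O2 -Wall -o leaker leaker.c
   # 10240 pages x 4096 bytes = 40 MiB mapped and dirtied per iteration,
   # so the 512 MiB limit set by the script below is exceeded after
   # roughly a dozen iterations.
   ./leaker -p 10240 -c 100000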

///—— test script launching it…
#!/bin/sh

if [ `id -u` -ne 0 ]; then
       echo NEED TO RUN THIS AS ROOT.; exit 1
fi
PID=$$
echo PID detected as: $PID
mkdir /sys/fs/cgroup/memory/leaker
echo 536870912 > /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes

echo leaker mem cgroup created, with `cat /sys/fs/cgroup/memory/leaker/memory.limit_in_bytes` bytes.
echo $PID > /sys/fs/cgroup/memory/leaker/cgroup.procs
echo Moved into the leaker cgroup.
ps -o cgroup $PID
sleep 15
echo Starting...
./leaker -p 10240 -c 100000
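
For reference, a rough cgroup v2 (unified hierarchy) equivalent of the setup above; this is only a sketch and assumes the memory controller is enabled under /sys/fs/cgroup:

   # cgroup v2 sketch: same 512 MiB limit, expressed via memory.max.
   mkdir /sys/fs/cgroup/leaker
   echo 536870912 > /sys/fs/cgroup/leaker/memory.max
   echo $$ > /sys/fs/cgroup/leaker/cgroup.procs
   ./leaker -p 10240 -c 100000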


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-02 12:10 Hillf Danton
  2019-08-02 13:40 ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Hillf Danton @ 2019-08-02 12:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Masoud Sharbiani, hannes, vdavydov.dev, linux-mm, cgroups,
	linux-kernel, Greg KH


On Fri, 2 Aug 2019 16:18:40 +0800 Michal Hocko wrote:
>
> [Hillf, your email client or workflow mangles emails. In this case you
> seem to be reusing the message id from the email you are replying to,
> which confuses my email client into assuming your email is a duplicate.]

[Hi Michal, I sent the previous mail with

	Message-id: <7EE30F16-A90B-47DC-A065-3C21881CD1CC@apple.com>

using git send-email after quitting vi. That tag has been removed from
this message; let me know if that makes your mail client happy.]
>
> Huh, what? You are effectively saying that we should fail the charge
> when the requested nr_pages would fit in. This doesn't make much sense
> to me. What are you trying to achieve here?

The report looks like the result of a tight loop.
I want to break that loop and make do_page_fault fail if nr_retries
rounds of page reclaim cannot get the work done. What I found hard was
how to determine whether the chargee is a memhog in memcg's vocabulary.
What I would prefer is that do_page_fault succeeds, even if the chargee
has exhausted its granted memory quota/budget, as long as more than
nr_pages can be reclaimed _within_ nr_retries rounds. IOW the deadline
for a memhog is nr_retries rounds, and no more.

Thanks
Hillf


* Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
@ 2019-08-03  5:45 Hillf Danton
  0 siblings, 0 replies; 31+ messages in thread
From: Hillf Danton @ 2019-08-03  5:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Masoud Sharbiani, hannes, vdavydov.dev, linux-mm, cgroups,
	linux-kernel, Greg KH


On Fri, 2 Aug 2019 21:40:29 +0800 Michal Hocko wrote:
> 
> On Fri 02-08-19 20:10:58, Hillf Danton wrote:
> >
> > On Fri, 2 Aug 2019 16:18:40 +0800 Michal Hocko wrote:
> [...]
> > > Huh, what? You are effectively saying that we should fail the charge
> > > when the requested nr_pages would fit in. This doesn't make much sense
> > > to me. What are you trying to achieve here?
> >
> > The report looks like the result of a tight loop.
> > I want to break that loop and make do_page_fault fail if nr_retries
> > rounds of page reclaim cannot get the work done. What I found hard was
> > how to determine whether the chargee is a memhog in memcg's vocabulary.
> > What I would prefer is that do_page_fault succeeds, even if the chargee
> > has exhausted its granted memory quota/budget, as long as more than
> > nr_pages can be reclaimed _within_ nr_retries rounds. IOW the deadline
> > for a memhog is nr_retries rounds, and no more.
> 
> No, this really doesn't make sense because it leads to premature
> charge failures. The charge path is fundamentally not different from
> the page allocator path. We do try to reclaim and retry the allocation.
> We keep retrying forever for non-costly order requests in both cases
> (modulo some corner cases like OOM victims etc.).

You are right; it is hard to produce a cure for every corner case.
We can handle them one by one to reduce the chance of a task becoming
a CPU hog under memory pressure.

Hillf



end of thread (newest: 2019-08-06 12:48 UTC)

Thread overview: 31+ messages
2019-08-01 18:04 Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Masoud Sharbiani
2019-08-01 18:19 ` Greg KH
2019-08-01 22:26   ` Masoud Sharbiani
2019-08-02  1:08   ` Masoud Sharbiani
2019-08-02  8:08     ` Hillf Danton
2019-08-02  8:18     ` Michal Hocko
2019-08-02  7:40 ` Michal Hocko
2019-08-02  7:40   ` Michal Hocko
2019-08-02 14:18   ` Masoud Sharbiani
2019-08-02 14:18     ` Masoud Sharbiani
2019-08-02 14:41     ` Michal Hocko
2019-08-02 14:41       ` Michal Hocko
2019-08-02 18:00       ` Masoud Sharbiani
2019-08-02 19:14         ` Michal Hocko
2019-08-02 23:28           ` Masoud Sharbiani
2019-08-03  2:36             ` Tetsuo Handa
2019-08-03 15:51               ` Tetsuo Handa
2019-08-03 17:41                 ` Masoud Sharbiani
2019-08-03 18:24                   ` Masoud Sharbiani
2019-08-05  8:42                 ` Michal Hocko
2019-08-05 11:36                   ` Tetsuo Handa
2019-08-05 11:44                     ` Michal Hocko
2019-08-05 14:00                       ` Tetsuo Handa
2019-08-05 14:26                         ` Michal Hocko
2019-08-06 10:26                           ` Tetsuo Handa
2019-08-06 10:50                             ` Michal Hocko
2019-08-06 12:48                               ` [PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Tetsuo Handa
2019-08-05  8:18             ` Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Michal Hocko
2019-08-02 12:10 Hillf Danton
2019-08-02 13:40 ` Michal Hocko
2019-08-03  5:45 Hillf Danton
