* user space unresponsive, followup: lsf/mm congestion
@ 2020-01-07 20:29 Chris Murphy
  2020-01-07 20:58 ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-01-07 20:29 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko

Hi,

This is in response to:
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m8b25fd42501d780d8053fc7aa9f4e3a28a19c49f

I decided to open a bug report for tracking and attachments but I'm
also subscribed now to this list so - either here or there.

"loss of responsiveness during heavy swap"
https://bugzilla.kernel.org/show_bug.cgi?id=206117

Thanks!

--
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-07 20:29 user space unresponsive, followup: lsf/mm congestion Chris Murphy
@ 2020-01-07 20:58 ` Michal Hocko
  2020-01-07 21:25   ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-07 20:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

On Tue 07-01-20 13:29:20, Chris Murphy wrote:
> Hi,
> 
> This is in response to:
> https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m8b25fd42501d780d8053fc7aa9f4e3a28a19c49f
> 
> I decided to open a bug report for tracking and attachments but I'm
> also subscribed now to this list so - either here or there.
> 
> "loss of responsiveness during heavy swap"
> https://bugzilla.kernel.org/show_bug.cgi?id=206117

Please collect more snapshots of /proc/vmstat (e.g. in 1s intervals)
from the time you start your workload. This will give us insight into how the
memory characteristics change over time. From the single snapshot we can
see
nr_zone_inactive_anon 247598
nr_zone_active_anon 1551129
nr_zone_inactive_file 25283
nr_zone_active_file 21563
pswpin 6002574
pswpout 11199783

which tells us that there is not much resident page cache (less
than 200MB), while the majority of memory is anonymous (7GB), most of
it tracked as active.
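
For reference, a simple way to collect such a series is a shell loop along
these lines (a rough sketch; the output path and the "=S"/"E=" markers are
arbitrary, and under heavy pressure the loop itself can stall):

$ while true; do echo "=S $(date +%s)"; cat /proc/vmstat; echo "E="; sleep 1; done > /tmp/vmstat.log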

Btw. from a quick look at the sysrq output there seems to be quite a lot
of tasks (more than 1k) running on the system. Only a handful of them
belong to the compilation. kswapd is busy and 13 processes are in direct
reclaim, all swapping out to disk.

From the above, my first guess would be that you are oversubscribing the
memory you have available. I would focus on who is consuming all that
memory.
-- 
Michal Hocko
SUSE Labs



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-07 20:58 ` Michal Hocko
@ 2020-01-07 21:25   ` Chris Murphy
  2020-01-08  9:25     ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-01-07 21:25 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 07-01-20 13:29:20, Chris Murphy wrote:
> > Hi,
> >
> > This is in response to:
> > https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m8b25fd42501d780d8053fc7aa9f4e3a28a19c49f
> >
> > I decided to open a bug report for tracking and attachments but I'm
> > also subscribed now to this list so - either here or there.
> >
> > "loss of responsiveness during heavy swap"
> > https://bugzilla.kernel.org/show_bug.cgi?id=206117
>
> Please collect more snapshots of /proc/vmstat (e.g. in 1s intervals)

OK.


> Btw. from a quick look at the sysrq output there seems to be quite a lot
> of tasks (more than 1k) running on the system. Only a handful of them
> belong to the compilation. kswapd is busy and 13 processes in direct
> reclaim all swapping out to the disk.

There might be many dozens of tabs in Firefox with nothing loaded in
them; the idea is to keep the testing more real world (a compile while
browsing) rather than too deferential to the compile. That does
clutter the sysrq+t output, but it doesn't change the central
culprit, which is the ninja compile; by default it runs n+2 jobs,
where n is the number of virtual CPUs.
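
(For reference, the parallelism can also be capped explicitly; -j and -l are
standard ninja options, and the values below are only examples:)

$ ninja -j"$(nproc)"                  # cap jobs at the CPU count instead of nproc+2
$ ninja -j"$(nproc)" -l"$(nproc)"     # additionally stop spawning jobs above this load average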

> From the above, my first guess would be that you are oversubscribing the
> memory you have available. I would focus on who is consuming all that
> memory.

ninja - I have made the argument that it is in some sense sabotaging
the system, and I think they're trying to do something a little
smarter with their defaults; however, it's an unprivileged task acting
as a kind of fork bomb that takes down the system. It's a really
eyebrow-raising and remarkable experience. And it's common within the
somewhat vertical use case of developers compiling things on their own
systems. Many IDEs use a ton of resources, as much as they can get.
It's not clear to me by what mechanism either the user or these
processes are supposed to effectively negotiate for limited resources,
other than resource restriction. But anyway, they aren't contrived or
malicious examples. A much more synthetic example is 'tail /dev/zero'
which is much more quickly arrested by the kernel oom-killer, at
least on recent kernels.



-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-07 21:25   ` Chris Murphy
@ 2020-01-08  9:25     ` Michal Hocko
  2020-01-08 21:14       ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-08  9:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Btw. from a quick look at the sysrq output there seems to be quite a lot
> > of tasks (more than 1k) running on the system. Only handful of them
> > belong to the compilation. kswapd is busy and 13 processes in direct
> > reclaim all swapping out to the disk.
> 
> There might be many dozens of tabs in Firefox with nothing loaded in
> them, trying to keep the testing more real world (a compile while
> browsing) rather than being too deferential to the compile. That does
> clutter the sysrq+t but it doesn't change the outcome of the central
> culprit which is the ninja compile, which by default does n+2 jobs
> where n is the number of virtual CPUs.

How much memory does the compile process eat?

> > From the above, my first guess would be that you are oversubscribing the
> > memory you have available. I would focus on who is consuming all that
> > memory.
> 
> ninja - I have made the argument that it is in some sense sabotaging
> the system, and I think they're trying to do something a little
> smarter with their defaults; however, it's an unprivileged task acting
> as a kind of fork bomb that takes down the system.

Well, I am not sure the fork bomb analogy is appropriate. There are only
a dozen compile processes captured, so unless there are many more
in other phases, this is really negligible compared to the rest of
the workload running on the system.

> It's a really
> eyebrow raising and remarkable experience. And it's common within the
> somewhat vertical use case of developers compiling things on their own
> systems. Many IDE's use a ton of resources, as much as they can get.
> It's not clear to me by what mechanism either the user or these
> processes are supposed to effectively negotiate for limited resources,
> other than resource restriction. But anyway, they aren't contrived or
> malicious examples.

If you know that the compilation process is too disruptive wrt.
memory/cpu consumption then you can use cgroups (memory and cpu
controllers) to throttle that consumption and protect the rest of the
system. The compilation process will take much more time of course, and
the explicit configuration is obviously less comfortable than out-of-the-box
auto configuration, but the kernel simply doesn't have the information to
prioritize resources.
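
A sketch of what that could look like on a cgroup v2 system with systemd (the
limits are arbitrary examples, not tuned recommendations):

$ systemd-run --user --scope -p MemoryHigh=6G -p MemoryMax=7G -p CPUWeight=50 ninja
# or with the raw cgroup v2 interface:
$ sudo mkdir /sys/fs/cgroup/build
$ echo 6G | sudo tee /sys/fs/cgroup/build/memory.high
$ echo $$ | sudo tee /sys/fs/cgroup/build/cgroup.procs   # move this shell into the group, then run ninja from it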

I do agree that the oom detection could be improved to detect heavy
thrashing - be it on the page cache or on swap in/out - and kill something
rather than leave the system struggling in a highly unproductive state.
This is far from trivial because what counts as productive is not something
the kernel can tell easily, as it depends on the workload. As mentioned
elsewhere, userspace is likely much better suited to define that policy,
and PSI seems to be a good indicator.
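
(PSI is exposed under /proc/pressure/ on kernels built with CONFIG_PSI; the
numbers below are purely illustrative:)

$ cat /proc/pressure/memory
some avg10=12.50 avg60=8.20 avg300=3.10 total=123456789
full avg10=6.80 avg60=4.30 avg300=1.90 total=87654321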

> A much more synthetic example is 'tail /dev/zero'
> which is much more quickly arrested by the kernel oom-killer, at
> least on recent kernels.

Yeah, same as any other memory leak: the memory will simply run
out at some point and the OOM killer can detect that with good
confidence. It is the thrashing (the working set not fitting into memory and
refaulting like crazy) that the kernel struggles (and loses) to handle.
-- 
Michal Hocko
SUSE Labs



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-08  9:25     ` Michal Hocko
@ 2020-01-08 21:14       ` Chris Murphy
  2020-01-08 21:18         ` Chris Murphy
  2020-01-09 11:51         ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Chris Murphy @ 2020-01-08 21:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> > On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > Btw. from a quick look at the sysrq output there seems to be quite a lot
> > > of tasks (more than 1k) running on the system. Only handful of them
> > > belong to the compilation. kswapd is busy and 13 processes in direct
> > > reclaim all swapping out to the disk.
> >
> > There might be many dozens of tabs in Firefox with nothing loaded in
> > them, trying to keep the testing more real world (a compile while
> > browsing) rather than being too deferential to the compile. That does
> > clutter the sysrq+t but it doesn't change the outcome of the central
> > culprit which is the ninja compile, which by default does n+2 jobs
> > where n is the number of virtual CPUs.
>
> How much memory does the compile process eat?

By default it sets jobs to numcpus+2, which is 10 here. But each job
variably has two processes, and each process's memory requirement
varies a ton, from a few MB to over 1GB. In the first 20 minutes, about 13000
processes have started and stopped.

I've updated the bug, attaching kernel messages and /proc/vmstat in
1s increments, although quite often during the build multiple seconds
of sampling were just skipped as the system was under too much
pressure.

> If you know that the compilation process is too disruptive wrt.
> memory/cpu consumption then you can use cgroups (memory and cpu
> controllers) to throttle that consumption and protect the rest of the
> system. The compilation process will take much more time of course and
> the explicit configuration is obviously less comfortable than out of the
> box auto configuration but the kernel simply doesn't have information to
> prioritize resources.

Yes but this isn't scalable for regular users who just follow an
upstream's build instructions.

> I do agree that the oom detection could be improved to detect a heavy
> > thrashing - be it on page cache or swapin/out - and kill something
> rather than leave the system struggling in a highly unproductive state.
> This is far from trivial because what is productive is not something
> kernel can tell easily as it depends on the workload. As mentioned
> elsewhere userspace is likely much better suited to define that policy
> and PSI seems to be a good indicator.

And even user space doesn't know in advance what resources are
required. The user can guess that they've estimated incorrectly, force
a power off, and start over by passing a lower number of jobs or whatever.

As for PSI, from the oomd folks it sounds like swap is a requirement. And
yet, because of the poor performance of swapping, quite a lot of users
don't have any swap. Server environments are mixed on having swap, and
it's rare in cloud environments. So if there's a
hard requirement on swap existing, PSI isn't a universal solution.

Thanks,

-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-08 21:14       ` Chris Murphy
@ 2020-01-08 21:18         ` Chris Murphy
  2020-01-09 11:51         ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2020-01-08 21:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Michal Hocko, linux-mm

On Wed, Jan 8, 2020 at 2:14 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> As for PSI, from oomd folks it sounds like swap is a requirement.

https://github.com/facebookincubator/oomd/issues/80

-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-08 21:14       ` Chris Murphy
  2020-01-08 21:18         ` Chris Murphy
@ 2020-01-09 11:51         ` Michal Hocko
  2020-01-09 11:53           ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-09 11:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

On Wed 08-01-20 14:14:22, Chris Murphy wrote:
> On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> > > On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > > > Btw. from a quick look at the sysrq output there seems to be quite a lot
> > > > of tasks (more than 1k) running on the system. Only handful of them
> > > > belong to the compilation. kswapd is busy and 13 processes in direct
> > > > reclaim all swapping out to the disk.
> > >
> > > There might be many dozens of tabs in Firefox with nothing loaded in
> > > them, trying to keep the testing more real world (a compile while
> > > browsing) rather than being too deferential to the compile. That does
> > > clutter the sysrq+t but it doesn't change the outcome of the central
> > > culprit which is the ninja compile, which by default does n+2 jobs
> > > where n is the number of virtual CPUs.
> >
> > How much memory does the compile process eat?
> 
> By default it sets jobs to numcpus+2, which is 10. But each job
> variably has two processes, and each process's memory requirement
> varies a ton, few M to over 1G. In the first 20 minutes, about 13000
> processes have started and stopped.

Well, the question is whether the memory demand comes from the
parallelism (i.e. something controlled by the build process) or from the
compilation itself, which might be really memory hungry and hard to
accommodate even with parallelism disabled. If the latter is the case
then there is not really much the kernel can do, I am afraid. If the OOM
killer kills your compilation or another important part of your
environment then you just lost work without any gain, right? You
simply need more memory to handle that workload, or you throttle the
compilation and give it much more time to finish, because it will be
swapping in and out as the working set would not fit into the restricted
amount of memory.

> I've updated the bug, attaching kernel messages and /proc/vmstat in
> 1s increments, although quite often during the build multiple seconds
> of sampling were just skipped as the system was under too much
> pressure.

I have a tool to reduce that problem (see attached).

> > If you know that the compilation process is too disruptive wrt.
> > memory/cpu consumption then you can use cgroups (memory and cpu
> > controllers) to throttle that consumption and protect the rest of the
> > system. The compilation process will take much more time of course and
> > the explicit configuration is obviously less comfortable than out of the
> > box auto configuration but the kernel simply doesn't have information to
> > prioritize resources.
> 
> Yes but this isn't scalable for regular users who just follow an
> upstream's build instructions.

It certainly requires additional steps, no question about that. But I
fail to see how to do that automagically without knowing what the user
expects to happen when resources run short.

> > I do agree that the oom detection could be improved to detect a heavy
> > > thrashing - be it on page cache or swapin/out - and kill something
> > rather than leave the system struggling in a highly unproductive state.
> > This is far from trivial because what is productive is not something
> > kernel can tell easily as it depends on the workload. As mentioned
> > elsewhere userspace is likely much better suited to define that policy
> > and PSI seems to be a good indicator.
> 
> And even user space doesn't know what resources are required in
> advance. The user can guess this has been estimated incorrectly, force
> power off, start over by passing a lower number of jobs or whatever.
> 
> As for PSI, from oomd folks it sounds like swap is a requirement. And
> yet, because of the poor performance of swapping, quite a lot of users
> don't have any swap. It's also mixed in server environments to have
> swap, and rare in cloud environments to have swap. So if there's a
> hard requirement on swap existing, PSI isn't a universal solution.

I cannot really comment on the swap requirement, but I would recommend
having swap space, especially for workloads with peak memory
consumption that doesn't fit into RAM. Additional configuration steps
might be needed so that the whole system doesn't thrash on the swap
(e.g. use memcgs with proper main memory partitioning), but once you are
overcommitting memory you have to be careful, I believe.
-- 
Michal Hocko
SUSE Labs



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-09 11:51         ` Michal Hocko
@ 2020-01-09 11:53           ` Michal Hocko
  2020-01-10  6:12             ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-01-09 11:53 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 373 bytes --]

On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > 1s increments, although quite often during the build multiple seconds
> > of sampling were just skipped as the system was under too much
> > pressure.
> 
> I have a tool to reduce that problem (see attached).

Now for real
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: read_vmstat.c --]
[-- Type: text/x-csrc, Size: 5025 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <time.h>

/*
 * A simple /proc/vmstat collector into a file. It tries hard to guarantee
 * that the content will get into the output file even under a strong memory
 * pressure.
 *
 * Usage
 * ./read_vmstat output_file timeout output_size
 *
 * output_file can be either a non-existing file or - for stdout
 * timeout - time period between two snapshots. s - seconds, ms - milliseconds
 * 	     and m - minutes suffixes are allowed
 * output_size - size of the output file. The file is preallocated and pre-filled.
 *
 * If the output reaches the end of the file it will start over overwriting the oldest
 * data. Each snapshot is enclosed by header and footer.
 * =S timestamp
 * [...]
 * E=
 *
 * Please note that your memlock ulimit has to be sufficient to allow mlocking
 * the code + all the buffers.
 *
 * This comes under GPL v2
 * Copyright: Michal Hocko <mhocko@suse.cz> 2015 
 */

#define NS_PER_MS (1000*1000)
#define NS_PER_SEC (1000*NS_PER_MS)

int open_file(const char *str)
{
	int fd;

	fd = open(str, O_CREAT|O_EXCL|O_RDWR, 0755);
	if (fd == -1) {
		perror("open input");
		return 1;
	}

	return fd;
}

int read_timeout(const char *str, struct timespec *timeout)
{
	char *end;
	unsigned long val;

	val = strtoul(str, &end, 10);
	if (val == ULONG_MAX) {
		perror("Invalid number");
		return 1;
	}
	switch(*end) {
		case 's':
			timeout->tv_sec = val;
			break;
		case 'm':
			/* ms vs minute*/
			if (*(end+1) == 's') {
				timeout->tv_sec = (val * NS_PER_MS) / NS_PER_SEC;
				timeout->tv_nsec = (val * NS_PER_MS) % NS_PER_SEC;
			} else {
				timeout->tv_sec = val*60;
			}
			break;
		default:
			fprintf(stderr, "Uknown number %s\n", str);
			return 1;
	}

	return 0;
}

size_t read_size(const char *str)
{
	char *end;
	size_t val = strtoul(str, &end, 10);

	switch (*end) {
		case 'K':
			val <<= 10;
			break;
		case 'M':
			val <<= 20;
			break;
		case 'G':
			val <<= 30;
			break;
	}

	return val;
}

size_t dump_str(char *buffer, size_t buffer_size, size_t pos, const char *in, size_t size)
{
	size_t i;
	for (i = 0; i < size; i++) {
		buffer[pos] = in[i];
		pos = (pos + 1) % buffer_size;
	}

	return pos;
}

/* buffer == NULL -> stdout */
int __collect_logs(const struct timespec *timeout, char *buffer, size_t buffer_size)
{
	char buff[4096]; /* dump to the file automatically */
	time_t before, after;
	int in_fd = open("/proc/vmstat", O_RDONLY);
	size_t out_pos = 0;
	size_t in_pos = 0;
	size_t size = 0;
	int estimate = 0;

	if (in_fd == -1) {
		perror("open vmstat:");
		return 1;
	}

	/* lock everything in */
	if (mlockall(MCL_CURRENT) == -1) {
		perror("mlockall. Continuing anyway");
	}

	while (1) {
		before = time(NULL);

		size = snprintf(buff, sizeof(buff), "=S %lu\n", before);
		lseek(in_fd, 0, SEEK_SET);
		size += read(in_fd, buff + size, sizeof(buff) - size);
		size += snprintf(buff + size, sizeof(buff) - size, "E=\n");
		if (buffer && !estimate) {
			printf("Estimated %d entries fit to the buffer\n", buffer_size/size);
			estimate = 1;
		}

		/* Dump to stdout */
		if (!buffer) {
			printf("%s", buff);
		} else {
			size_t pos;
			pos = dump_str(buffer, buffer_size, out_pos, buff, size);
			if (pos < out_pos)
				fprintf(stderr, "%lu: Buffer wrapped\n", before);
			out_pos = pos;
		}

		after = time(NULL);

		if (after - before > 2) {
			fprintf(stderr, "%d: Snapshoting took %d!!!\n", before, after-before);
		}
		if (nanosleep(timeout, NULL) == -1)
			if (errno == EINTR)
				return 0;
		/* kick in the flushing */
		if (buffer)
			msync(buffer, buffer_size, MS_ASYNC);
	}
}

int collect_logs(int fd, const struct timespec *timeout, size_t buffer_size)
{
	char *buffer = NULL;

	if (fd != -1) {
		if (ftruncate(fd, buffer_size) == -1) {
			perror("ftruncate");
			return 1;
		}

		if (fallocate(fd, 0, 0, buffer_size) && errno != EOPNOTSUPP) {
			perror("fallocate");
			return 1;
		}

		/* commit it to the disk */
		sync();

		buffer = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_POPULATE, fd, 0);
		if (buffer == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	return __collect_logs(timeout, buffer, buffer_size);
}

int main(int argc, char **argv)
{
	struct timespec timeout = {.tv_sec = 1};
	int fd = -1;
	size_t buffer_size = 10UL<<20;

	if (argc > 1) {
		/* output file */
		if (strcmp(argv[1], "-")) {
			fd = open_file(argv[1]);
			if (fd == -1)
				return 1;
		}

		/* timeout */
		if (argc > 2) {
			if (read_timeout(argv[2], &timeout))
				return 1;

			/* buffer size */
			if (argc > 3) {
				buffer_size = read_size(argv[3]);
				if (buffer_size == -1UL)
					return 1;
			}
		}
	}
	printf("file:%s timeout:%lu.%lus buffer_size:%llu\n",
			(fd == -1)? "stdout" : argv[1],
			timeout.tv_sec, timeout.tv_nsec / NS_PER_MS,
			buffer_size);

	return collect_logs(fd, &timeout, buffer_size);
}


* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-09 11:53           ` Michal Hocko
@ 2020-01-10  6:12             ` Chris Murphy
  2020-01-10 11:07               ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-01-10  6:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > > 1s increments, although quite often during the build multiple seconds
> > > of sampling were just skipped as the system was under too much
> > > pressure.
> >
> > I have a tool to reduce that problem (see attached).
>
> Now for real

https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6


-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-10  6:12             ` Chris Murphy
@ 2020-01-10 11:07               ` Michal Hocko
  2020-01-10 22:27                 ` Chris Murphy
  2020-01-12  0:07                 ` Chris Murphy
  0 siblings, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2020-01-10 11:07 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

On Thu 09-01-20 23:12:49, Chris Murphy wrote:
> On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > > > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > > > 1s increments, although quite often during the build multiple seconds
> > > > of sampling were just skipped as the system was under too much
> > > > pressure.
> > >
> > > I have a tool to reduce that problem (see attached).
> >
> > Now for real
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6

Let me paste the comment here:
: Using Michal Hocko's read_vmstat.c to record ~12 minutes while
: running 'ninja' to compile webkitgtk. I killed ninja ~20s before
: killing read_vmstat. The system is essentially unusable for 7 minutes
: prior to killing ninja which could only be done remotely (which took
: maybe 1/2 minute to enter and execute the kill command; ssh was very
: sluggish). During those 7 minutes, swap was not even 1/2 full, but the
: GUI frozen most of the time including the mouse pointer.

file:stdout timeout:1.0s buffer_size:10485760
=S 1578635397

So you have redirected the output (stdout) to a file. This is less
effective than using an output file directly, because the program makes sure to
preallocate and mlock the output file's data as well. Anyway, let's have a
look at what you managed to gather

$ grep "=S" bug206117_read_vmstat.out | awk '{diff=$2-prev; if (prev && diff > 1) printf "%d %d\n", prev, diff; prev=$2}'
1578635742 2
1578635755 2
1578635776 2
1578635878 2
1578635943 2
1578635967 2
1578636027 2
1578636053 2
1578636111 3
1578636131 2
1578636138 2

so there are a few instances where the next snapshot was not gathered the next
second. Most of the time the gap was 2s, but that doesn't necessarily indicate
a problem, because with the 1s timestamp resolution it could have been
just bad timing and the real diff was close to 1s. There is one timestamp
where the gap was 3s, which does look like a real stall (this shouldn't
happen with an mlocked output file).  Let's focus on that timestamp
(it is 11min since the start):

		1578636111	diff
pgalloc_dma     4       	0
pgalloc_movable 0       	0
pgalloc_dma32   6468509 	14278
pgalloc_normal  40905599        37286

that's roughly 200MB of allocations, which is not much

pgsteal_kswapd  3384259 	10763
pgscan_kswapd   5282874 	21772
pgsteal_direct  3189026 	30400
pgscan_direct   8372523 	85974

which is 106M reclaimed during that time, and the reclaim was not
particularly effective: kswapd 49% and direct reclaim 35%.
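
(Those efficiency figures are simply pgsteal/pgscan over the interval, e.g.:)

$ awk 'BEGIN { printf "kswapd: %.0f%%  direct: %.0f%%\n", 10763*100/21772, 30400*100/85974 }'
kswapd: 49%  direct: 35%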

nr_active_file  	26652   -3564
nr_inactive_file	25731   -4772
nr_active_anon  	1502798 6433
nr_inactive_anon        283561  4385

File LRUs are quite small (~200M) and got shrunk somewhat, while the
anonymous LRUs eat almost 8G, with the actively referenced list 5
times bigger than the inactive one (5.8G). That suggests that a large part of
the memory is heavily in use.

pswpin  1222809 	4882
pswpout 3475546 	7814

Around 30M has been swapped out, but around 60% of that amount has been
swapped back in, which suggests that we are actively refaulting from
swap.
Please note that a large number of anonymous refaults on swap entries
might be satisfied from the swap cache. That information is unfortunately not
presented in /proc/vmstat.

On the page cache front
workingset_activate     661011  7566
workingset_refault      1894851 14868
workingset_restore      398375  4622

About 50% of refaults are activated and about 30% are restored as part of the workingset.

Now let's see how those numbers evolve over time
$ grep pswp bug206117_read_vmstat.out | awk '/pswpin/{pswpin=$2} /pswpout/{pswpout=$2; if (pswpout) {print pswpin*100/pswpout}}' | calc_min_max.awk
min: 0.06 max: 35.28 avg: 22.49 std: 11.30 nr: 592
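
(calc_min_max.awk is a local helper that is not attached; a minimal stand-in
producing the same kind of summary, used as `awk -f calc_min_max.awk` at the
end of such a pipeline, could look like this:)

# reads one number per line, prints min/max/avg/std/count
{ s += $1; ss += $1*$1; if (NR == 1 || $1 < min) min = $1; if (NR == 1 || $1 > max) max = $1 }
END { if (NR) { avg = s/NR; printf "min: %.2f max: %.2f avg: %.2f std: %.2f nr: %d\n", min, max, avg, sqrt(ss/NR - avg*avg), NR } }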

The swapin refault ratio builds up over time, which matches the expectation
that the memory demand builds up as well and doesn't fit into RAM,
while the workload still needs the memory that has been swapped out.

The first swapout happens shortly before we start the direct reclaim:
$ awk '/=S/{ts=$2} /allocstall/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635582
$ awk '/=S/{ts=$2} /pswpout/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635581

Page cache refaults have slightly different dynamics
$ grep workingset bug206117_read_vmstat.out | awk '/workingset_refault/{workingset_refault=$2} /workingset_activate/{workingset_activate=$2; if (workingset_refault) {print workingset_activate*100/workingset_refault}}'
100
91.453
91.453
91.6201
91.6201
96.587
96.7213
96.7638
96.7777
97.1401
70.2157
[...]
22.4038
22.4447
22.3699
22.3929
22.3724
22.3897
22.3873
22.3846
22.3847
[...]
34.9863
34.9863
34.9847
34.9847
34.9847
34.9846
34.9239

which means that as soon as we start refaulting, most of the refaulted
memory is being activated; then the activation ratio decreases for some time,
most likely because we start swapping out more, and the activation rate
settles around 30% in the end.

All that being said, the numbers confirm my initial suspicion. The workload
you are trying to run simply doesn't fit into memory, and therefore it
has to refault both from disk for the page cache (likely including
executables) and from the swap space.

It is not really obvious whether changing the file vs. anonymous reclaim
balance would help much in this case. As we can see, swapout starts
before tasks are stalled in direct reclaim. It is not really clear
whether swapping out more would relieve some pressure on the
page cache. So to me it looks like the reclaim is not doing a terrible
job here considering the memory demand.

It would be interesting to see whether tuning vm_swappiness to 100 helps,
but considering how large the anonymous active list is, I would be very
skeptical.
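
(For reference, it can be changed at runtime without a reboot:)

$ sudo sysctl vm.swappiness=100
$ cat /proc/sys/vm/swappiness   # verify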

So in the end it is really hard to see what the kernel should have done
better in this overcommitted case. Killing memory hogs would likely kill
an active workload, which would lead to a better desktop experience, but I
can imagine setups which simply want to have the work done, albeit sloooowly.
-- 
Michal Hocko
SUSE Labs



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-10 11:07               ` Michal Hocko
@ 2020-01-10 22:27                 ` Chris Murphy
  2020-01-14  9:46                   ` Michal Hocko
  2020-01-12  0:07                 ` Chris Murphy
  1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2020-01-10 22:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> So you have redirected the output (stdout) to a file. This is less
> effective than using a file directly because the program makes sure to
> preallocate and mlock the output file data as well. Anyway, let's have a
> look what you managed to gather

I just read the source :P and see the usage. I'll do that properly if
there's a next time. Should it be saved in /tmp to avoid disk writes
or does it not matter?


> It would be interesting to see whether tuning vm_swappiness to 100 helps
> but considering how large the anonymous active list is I would be very
> skeptical.

I can try it. Is it better to capture the same amount of time as
before? Or the entire thing until it fails or is stuck for at least 30
minutes?


> So in the end it is really hard to see what the kernel should have done
> better in this overcommitted case. Killing memory hogs would likely kill
> an active workload which would lead to better desktop experience but I
> can imagine setups which simply want to have work done albeit sloooowly.

Right, so the kernel can't know, and doesn't really want to know, user
intention. It's really a policy question.

But if the distribution wanted to have a policy of "the mouse pointer
always works" - i.e. the user should be able to kill this process, if
they want, from within the GUI - that implies possibly a lot of work
to carve out the necessary resources for that entire stack. I have no
idea if that's possible with the current state of things.

Anyway, I see it's a difficult problem, and I appreciate the
explanations. I don't care about this particular example; my interest
is in making it better for everyone - I personally run into this only
when I'm testing for it, but those who experience it, experience it
often. And they're often developers. They have no idea in advance what
the build resource requirements are, and those requirements change a
lot as the compile happens. Difficult problem.

-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-10 11:07               ` Michal Hocko
  2020-01-10 22:27                 ` Chris Murphy
@ 2020-01-12  0:07                 ` Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2020-01-12  0:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> So you have redirected the output (stdout) to a file. This is less
> effective than using a file directly because the program makes sure to
> preallocate and mlock the output file data as well. Anyway, let's have a
> look what you managed to gather

[chris@fmac tmp]$ /home/chris/read_vmstat readvmstatwkgtk.out
file:readvmstatwkgtk.out timeout:1.0s buffer_size:10485760
mlockall. Continuing anyway: Cannot allocate memory
Estimated 3744 entries fit to the buffer
^C

This is with vm.swappiness=100, and it's also ~12 minutes. Responsiveness
appears unchanged. The file is attached here:
https://bugzilla.kernel.org/show_bug.cgi?id=206117#c8

More curious: I get a call trace I haven't seen before when I try to
log in to this hung system via ssh. That's in
https://bugzilla.kernel.org/show_bug.cgi?id=206117#c7


-- 
Chris Murphy



* Re: user space unresponsive, followup: lsf/mm congestion
  2020-01-10 22:27                 ` Chris Murphy
@ 2020-01-14  9:46                   ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2020-01-14  9:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-mm

On Fri 10-01-20 15:27:10, Chris Murphy wrote:
> On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > So you have redirected the output (stdout) to a file. This is less
> > effective than using a file directly because the program makes sure to
> > preallocate and mlock the output file data as well. Anyway, let's have a
> > look what you managed to gather
> 
> I just read the source :P and see the usage. I'll do that properly if
> there's a next time. Should it be saved in /tmp to avoid disk writes
> or does it not matter?

The usage is described at the top of the .c file. As this is an internal
tool of mine, I didn't bother to make it super easy ;)
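For example (output path and buffer size are arbitrary):

$ gcc -O2 -o read_vmstat read_vmstat.c
$ ulimit -l unlimited            # raise the memlock limit (may require root) so mlockall() can succeed
$ ./read_vmstat /tmp/vmstat.bin 1s 20M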
 
> > It would be interesting to see whether tuning vm_swappiness to 100 helps
> > but considering how large the anonymous active list is I would be very
> > skeptical.
> 
> I can try it. Is it better to capture the same amount of time as
> before? Or the entire thing until it fails or is stuck for at least 30
> minutes?

The last data set provided good insight, so following the same methodology
should be good.
 
> > So in the end it is really hard to see what the kernel should have done
> > better in this overcommitted case. Killing memory hogs would likely kill
> > an active workload which would lead to better desktop experience but I
> > can imagine setups which simply want to have work done albeit sloooowly.
> 
> Right, so the kernel can't know and doesn't really want to know, user
> intention. It's really a policy question.
> 
> But if the distribution wanted to have a policy of, the mouse pointer
> always works - i.e. the user should be able to kill this process, if
> they want, from within the GUI - that implies possibly a lot of work
> to carve out the necessary resources for that entire stack. I have no
> idea if that's possible with the current state of things.

Well, you always have the option to invoke the oom killer via sysrq+f and
kill the memory hog that way. The more memory demanding the userspace
is, the more users have to think about how to partition memory as a
resource. We have tooling for that; it just has to be used.
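(That is, Alt+SysRq+f on the console, or for example:)

# needs sysrq enabled, e.g. kernel.sysrq=1
$ echo f | sudo tee /proc/sysrq-trigger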
 
> Anyway, I see it's a difficult problem, and I appreciate the
> explanations. I don't care about this particular example, my interest
> is making it better for everyone - I personally run into this only
> when I'm testing for it, but those who experience it, experience it
> often. And they're often developers. They have no idea in advance what
> the build resource requirements are, and those requirements change a
> lot as the compile happens. Difficult problem.

I can only encourage people to report those problems, and we can see
where we go from there. The underlying problem might be different even
though the symptoms seem similar.
-- 
Michal Hocko
SUSE Labs


