Linux-mm Archive on lore.kernel.org
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
       [not found] <d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com>
@ 2019-08-05 12:13 ` Vlastimil Babka
  2019-08-05 13:31   ` Michal Hocko
  2019-08-05 19:31   ` Johannes Weiner
  2019-08-06 19:00 ` Florian Weimer
  1 sibling, 2 replies; 39+ messages in thread
From: Vlastimil Babka @ 2019-08-05 12:13 UTC
  To: Artem S. Tashkinov, linux-kernel
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Suren Baghdasaryan

On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> Hello,
> 
> There's this bug which has been bugging many people for many years
> already and which is reproducible in less than a few minutes under the
> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> defaults.
> 
> Steps to reproduce:
> 
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease
> 
> Once you hit a situation when opening a new tab requires more RAM than
> is currently available, the system will stall hard. You will barely  be
> able to move the mouse pointer. Your disk LED will be flashing
> incessantly (I'm not entirely sure why). You will not be able to run new
> applications or close currently running ones.

> This little crisis may continue for minutes or even longer. I think
> that's not how the system should behave in this situation. I believe
> something must be done about that to avoid this stall.

Yeah that's a known problem, made worse by SSDs in fact, as they are able
to keep refaulting the last remaining file pages fast enough, so there
is still apparent progress in reclaim and OOM doesn't kick in.

At this point, the likely solution will probably be based on pressure
stall monitoring (PSI). I don't know how far we are from a built-in
monitor with reasonable defaults for a desktop workload, so CCing
relevant folks.

> I'm almost sure some sysctl parameters could be changed to avoid this
> situation but something tells me this could be done for everyone and
> made default because some non tech-savvy users will just give up on
> Linux if they ever get in a situation like this and they won't be keen
> or even be able to Google for solutions.
> 
> 
> Best regards,
> Artem
> 



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 12:13 ` Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Vlastimil Babka
@ 2019-08-05 13:31   ` Michal Hocko
  2019-08-05 16:47     ` Suren Baghdasaryan
  2019-08-05 18:55     ` Johannes Weiner
  2019-08-05 19:31   ` Johannes Weiner
  1 sibling, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2019-08-05 13:31 UTC
  To: Vlastimil Babka
  Cc: Artem S. Tashkinov, linux-kernel, linux-mm, Johannes Weiner,
	Suren Baghdasaryan

On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > Hello,
> > 
> > There's this bug which has been bugging many people for many years
> > already and which is reproducible in less than a few minutes under the
> > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > defaults.
> > 
> > Steps to reproduce:
> > 
> > 1) Boot with mem=4G
> > 2) Disable swap to make everything faster (sudo swapoff -a)
> > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > 4) Start opening tabs in either of them and watch your free RAM decrease
> > 
> > Once you hit a situation when opening a new tab requires more RAM than
> > is currently available, the system will stall hard. You will barely  be
> > able to move the mouse pointer. Your disk LED will be flashing
> > incessantly (I'm not entirely sure why). You will not be able to run new
> > applications or close currently running ones.
> 
> > This little crisis may continue for minutes or even longer. I think
> > that's not how the system should behave in this situation. I believe
> > something must be done about that to avoid this stall.
> 
> Yeah that's a known problem, made worse by SSDs in fact, as they are able
> to keep refaulting the last remaining file pages fast enough, so there
> is still apparent progress in reclaim and OOM doesn't kick in.
> 
> At this point, the likely solution will probably be based on pressure
> stall monitoring (PSI). I don't know how far we are from a built-in
> monitor with reasonable defaults for a desktop workload, so CCing
> relevant folks.

Another potential approach would be to consider the refault information
we have already for file backed pages. Once we start reclaiming only
workingset pages then we should be thrashing, right? It cannot be as
precise as the cost model which can be defined around PSI but it might
give us at least a fallback measure.

This is really just an idea for a primitive detection. Most likely an
incorrect one, but it shows the idea at least. It is completely untested
and might be completely broken, so unless somebody is really brave and
doesn't run anything that would be missed, I do not recommend running
it.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394cabaf4e..7f30c78b4fbc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -300,6 +300,7 @@ struct lruvec {
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	atomic_t			workingset_refaults;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..4401753c3912 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -311,6 +311,15 @@ void *workingset_eviction(struct page *page);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
+static inline bool lruvec_trashing(struct lruvec *lruvec)
+{
+	/*
+	 * One quarter of the inactive list is constantly refaulting.
+	 * This suggests that we are thrashing.
+	 */
+	return 4 * atomic_read(&lruvec->workingset_refaults) > lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
+}
+
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
 #define mapping_set_update(xas, mapping) do {				\
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9..d198594af0cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2381,6 +2381,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 						  denominator);
 			break;
 		case SCAN_FILE:
+			if (lruvec_trashing(lruvec)) {
+				size = 0;
+				scan = 0;
+				break;
+			}
 		case SCAN_ANON:
 			/* Scan one type exclusively */
 			if ((scan_balance == SCAN_FILE) != file) {
diff --git a/mm/workingset.c b/mm/workingset.c
index e0b4edcb88c8..ee4c45b27e34 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -309,17 +309,25 @@ void workingset_refault(struct page *page, void *shadow)
 	 * don't act on pages that couldn't stay resident even if all
 	 * the memory was available to the page cache.
 	 */
-	if (refault_distance > active_file)
+	if (refault_distance > active_file) {
+		atomic_set(&lruvec->workingset_refaults, 0);
 		goto out;
+	}
 
 	SetPageActive(page);
 	atomic_long_inc(&lruvec->inactive_age);
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	atomic_inc(&lruvec->workingset_refaults);
 
 	/* Page was active prior to eviction */
 	if (workingset) {
 		SetPageWorkingset(page);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+		/*
+		 * Double the thrashing numbers for actual working set
+		 * refaults.
+		 */
+		atomic_inc(&lruvec->workingset_refaults);
 	}
 out:
 	rcu_read_unlock();
-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 13:31   ` Michal Hocko
@ 2019-08-05 16:47     ` Suren Baghdasaryan
  2019-08-05 18:55     ` Johannes Weiner
  1 sibling, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2019-08-05 16:47 UTC
  To: Michal Hocko
  Cc: Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm, Johannes Weiner

On Mon, Aug 5, 2019 at 6:31 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > >
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > >
> > > Steps to reproduce:
> > >
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > >
> > > Once you hit a situation when opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely  be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> >
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> >
> > Yeah that's a known problem, made worse by SSDs in fact, as they are able
> > to keep refaulting the last remaining file pages fast enough, so there
> > is still apparent progress in reclaim and OOM doesn't kick in.
> >
> > At this point, the likely solution will probably be based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
>
> Another potential approach would be to consider the refault information
> we have already for file backed pages. Once we start reclaiming only
> workingset pages then we should be thrashing, right? It cannot be as
> precise as the cost model which can be defined around PSI but it might
> give us at least a fallback measure.
>
> This is really just an idea for a primitive detection. Most likely an
> incorrect one, but it shows the idea at least. It is completely untested
> and might be completely broken, so unless somebody is really brave and
> doesn't run anything that would be missed, I do not recommend running
> it.

In Android we have a userspace lmkd process which polls for PSI events;
after they get triggered, we check several metrics to determine if
we should kill anything. I believe Facebook has a similar userspace
process called oomd which, as I heard, is a more configurable rule
engine that also uses PSI to make kill decisions. I've spent
considerable time experimenting with different metrics, and thrashing
is definitely one of the most useful ones.
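
For illustration only, here is a minimal userspace sketch of the "poll
for PSI events" step. It follows the trigger interface described in the
kernel's PSI documentation (write a trigger to /proc/pressure/memory,
then poll() for POLLPRI). The helper names and any threshold values
passed in are assumptions made for this example, not what lmkd or oomd
actually use.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build a PSI trigger string: "<some|full> <stall_us> <window_us>".
 * Returns the string length, or -1 if it does not fit in buf.
 * (Hypothetical helper, named for this sketch only.)
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned int stall_us, unsigned int window_us)
{
	int n = snprintf(buf, len, "%s %u %u", kind, stall_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register the trigger on /proc/pressure/memory and block until the
 * kernel reports it once. Returns 0 on an event and -1 on any error,
 * including kernels built without CONFIG_PSI, where open() fails.
 */
int psi_wait_for_event(const char *trigger)
{
	struct pollfd pfd;
	int ret = -1;

	pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLPRI;
	pfd.revents = 0;

	/* The trigger is written including its terminating NUL. */
	if (write(pfd.fd, trigger, strlen(trigger) + 1) < 0)
		goto out;

	/* POLLERR means the event source went away; POLLPRI is the event. */
	if (poll(&pfd, 1, -1) > 0 &&
	    !(pfd.revents & POLLERR) && (pfd.revents & POLLPRI))
		ret = 0;
out:
	close(pfd.fd);
	return ret;
}
```

A caller would format something like "some 150000 1000000" (150ms of
stall within a 1s window, the values used in the kernel documentation's
example) and only then look at further metrics to decide whether to
kill anything.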

>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 70394cabaf4e..7f30c78b4fbc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -300,6 +300,7 @@ struct lruvec {
>         atomic_long_t                   inactive_age;
>         /* Refaults at the time of last reclaim cycle */
>         unsigned long                   refaults;
> +       atomic_t                        workingset_refaults;
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bfb5c4ac108..4401753c3912 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -311,6 +311,15 @@ void *workingset_eviction(struct page *page);
>  void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
>
> +bool lruvec_trashing(struct lruvec *lruvec)
> +{
> +       /*
> +        * One quarter of the inactive list is constantly refaulting.

I'm guessing one quarter is a guesstimate here and needs experimentation?

> +        * This suggests that we are thrashing.
> +        */
> +       return 4 * atomic_read(&lruvec->workingset_refaults) > lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

Just wondering, why do you consider only the inactive list here? The
complete workingset is the active list + the non-idle part of the
inactive list, isn't it? In my latest experiments I was using a
configurable percentage of the active+inactive lists as the threshold
for declaring that we are thrashing. If thrashing continues after a
kill, that percentage starts decaying, which results in an earlier
next kill (if interested in details see
https://android-review.googlesource.com/c/platform/system/core/+/1041778/14/lmkd/lmkd.c#1968).
I'm also using the existing WORKINGSET_REFAULT node_stat_item as the
workingset refault counter. Any reason you are not using it in this
reference implementation instead of introducing a new
workingset_refaults atomic?
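
To make the decaying threshold concrete, here is a rough sketch of the
idea in plain C. The struct, the function names and the linear decay
step are all assumptions made for illustration; the actual lmkd change
linked above may compute this differently.

```c
#include <stdbool.h>

/* Hypothetical state for a decaying thrashing threshold. */
struct thrash_limit {
	unsigned int base_pct;	/* configured threshold, % of file LRU */
	unsigned int cur_pct;	/* current, possibly decayed, threshold */
	unsigned int decay_pct;	/* % taken off cur_pct after each kill */
};

/* True if refaults reach cur_pct percent of the active+inactive pages. */
bool thrash_exceeded(const struct thrash_limit *tl,
		     unsigned long refaults, unsigned long file_lru_pages)
{
	if (!file_lru_pages)
		return false;
	return refaults * 100 / file_lru_pages >= tl->cur_pct;
}

/* After a kill: lower the bar so continued thrashing kills sooner. */
void thrash_after_kill(struct thrash_limit *tl)
{
	tl->cur_pct = tl->cur_pct * (100 - tl->decay_pct) / 100;
}

/* Once thrashing has stopped: return to the configured threshold. */
void thrash_reset(struct thrash_limit *tl)
{
	tl->cur_pct = tl->base_pct;
}
```

With a base of 100% and a 10% decay, a refault count worth 95% of the
file LRU does not trigger a first kill, but it does trigger the next
one after the threshold has decayed to 90%.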

> +}
> +
>  /* Only track the nodes of mappings with shadow entries */
>  void workingset_update_node(struct xa_node *node);
>  #define mapping_set_update(xas, mapping) do {                          \
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7889f583ced9..d198594af0cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2381,6 +2381,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>                                                   denominator);
>                         break;
>                 case SCAN_FILE:
> +                       if (lruvec_trashing(lruvec)) {
> +                               size = 0;
> +                               scan = 0;
> +                               break;
> +                       }
>                 case SCAN_ANON:
>                         /* Scan one type exclusively */
>                         if ((scan_balance == SCAN_FILE) != file) {
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e0b4edcb88c8..ee4c45b27e34 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -309,17 +309,25 @@ void workingset_refault(struct page *page, void *shadow)
>          * don't act on pages that couldn't stay resident even if all
>          * the memory was available to the page cache.
>          */
> -       if (refault_distance > active_file)
> +       if (refault_distance > active_file) {
> +               atomic_set(&lruvec->workingset_refaults, 0);
>                 goto out;
> +       }
>
>         SetPageActive(page);
>         atomic_long_inc(&lruvec->inactive_age);
>         inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +       atomic_inc(&lruvec->workingset_refaults);
>
>         /* Page was active prior to eviction */
>         if (workingset) {
>                 SetPageWorkingset(page);
>                 inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +               /*
> +                * Double the thrashing numbers for actual working set
> +                * refaults.
> +                */
> +               atomic_inc(&lruvec->workingset_refaults);
>         }
>  out:
>         rcu_read_unlock();
> --
> Michal Hocko
> SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 13:31   ` Michal Hocko
  2019-08-05 16:47     ` Suren Baghdasaryan
@ 2019-08-05 18:55     ` Johannes Weiner
  2019-08-06  9:29       ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-05 18:55 UTC
  To: Michal Hocko
  Cc: Vlastimil Babka, Artem S. Tashkinov, linux-kernel, linux-mm,
	Suren Baghdasaryan

On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > > 
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > > 
> > > Steps to reproduce:
> > > 
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > > 
> > > Once you hit a situation when opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely  be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> > 
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> > 
> > Yeah that's a known problem, made worse by SSDs in fact, as they are able
> > to keep refaulting the last remaining file pages fast enough, so there
> > is still apparent progress in reclaim and OOM doesn't kick in.
> > 
> > At this point, the likely solution will probably be based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
> 
> Another potential approach would be to consider the refault information
> we have already for file backed pages. Once we start reclaiming only
> workingset pages then we should be thrashing, right? It cannot be as
> precise as the cost model which can be defined around PSI but it might
> give us at least a fallback measure.

NAK, this does *not* work. Not even as fallback.

There is no amount of refaults for which you can say whether they are
a problem or not. It depends on the disk speed (obvious) but also on
the workload's memory access patterns (somewhat less obvious).

For example, we have workloads whose cache set doesn't quite fit into
memory, but everything else is pretty much statically allocated and it
rarely touches any new or one-off filesystem data. So there is always
a steady rate of mostly uninterrupted refaults, however, most data
accesses are hitting the cache! And we have fast SSDs that compensate
for the refaults that do occur. The workload runs *completely fine*.

If the cache hit rate were lower and refaults made up a bigger
share of overall page accesses, or if there were a spinning disk in
that machine, the machine would be completely livelocked - with the
exact same number of refaults and the same amount of RAM!

That's not just an approximation error that we could compensate
for. The same rate of refaults in a system could mean anything from 0%
(all refaults readahead, and IO is done before workload notices) to
100% memory pressure (all refaults are cache misses and workload fully
serialized on pages in question) - and anything in between (a subset
of threads of the workload wait for a subset of the refaults).

The refault rate by itself carries no signal on workload progress.

This is the whole reason why psi was developed - to compare the time
you spend on refaults (encodes IO speed and readahead efficiency)
with the time you spend being productive (encodes refaults as a
share of the workload's overall memory accesses).
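
A toy calculation shows how the same refault count maps to very
different pressure. This is only a simplified model of a single, fully
serialized task; the function name and all numbers are made up for
illustration, and psi itself tracks task states rather than computing
this formula.

```c
/*
 * Percentage of wall time a fully serialized task spends waiting on
 * refaults: stall / (stall + productive). io_latency_us is the time
 * the task actually waits per refault (0 if readahead hides the IO).
 */
double stall_share_pct(unsigned long refaults, double io_latency_us,
		       double productive_us)
{
	double stall_us = refaults * io_latency_us;

	if (stall_us + productive_us <= 0.0)
		return 0.0;
	return 100.0 * stall_us / (stall_us + productive_us);
}
```

Take 1000 refaults against 900ms of productive time. At 100us per
refault (a fast SSD) the task stalls for 100ms, i.e. 10% of the time.
At 10ms per refault (a spinning disk) it stalls for 10s, i.e. roughly
92%. If readahead completes every IO before the task needs the page,
the share is 0%.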



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 12:13 ` Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Vlastimil Babka
  2019-08-05 13:31   ` Michal Hocko
@ 2019-08-05 19:31   ` Johannes Weiner
  2019-08-06  1:08     ` Suren Baghdasaryan
  1 sibling, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-05 19:31 UTC
  To: Vlastimil Babka
  Cc: Artem S. Tashkinov, linux-kernel, linux-mm, Michal Hocko,
	Suren Baghdasaryan

On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
> On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > Hello,
> > 
> > There's this bug which has been bugging many people for many years
> > already and which is reproducible in less than a few minutes under the
> > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > defaults.
> > 
> > Steps to reproduce:
> > 
> > 1) Boot with mem=4G
> > 2) Disable swap to make everything faster (sudo swapoff -a)
> > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > 4) Start opening tabs in either of them and watch your free RAM decrease
> > 
> > Once you hit a situation when opening a new tab requires more RAM than
> > is currently available, the system will stall hard. You will barely  be
> > able to move the mouse pointer. Your disk LED will be flashing
> > incessantly (I'm not entirely sure why). You will not be able to run new
> > applications or close currently running ones.
> 
> > This little crisis may continue for minutes or even longer. I think
> > that's not how the system should behave in this situation. I believe
> > something must be done about that to avoid this stall.
> 
> Yeah that's a known problem, made worse by SSDs in fact, as they are able
> to keep refaulting the last remaining file pages fast enough, so there
> is still apparent progress in reclaim and OOM doesn't kick in.
> 
> At this point, the likely solution will probably be based on pressure
> stall monitoring (PSI). I don't know how far we are from a built-in
> monitor with reasonable defaults for a desktop workload, so CCing
> relevant folks.

Yes, psi was specifically developed to address this problem. Before
it, the kernel had to make all decisions based on relative event rates
but had no notion of time. Whereas to the user, time is clearly an
issue, and in fact makes all the difference. So psi quantifies the
time the workload spends executing vs. spinning its wheels.
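
For reference, userspace can read that time-based signal from
/proc/pressure/memory. The sketch below parses the 10-second average
of the "full" line from a buffer in that file's format; the function
name is an assumption for this example, and the sample values in the
comment are invented.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract avg10 from the "full" line of a buffer formatted like
 * /proc/pressure/memory, for example:
 *   some avg10=1.50 avg60=0.70 avg300=0.20 total=123456
 *   full avg10=0.80 avg60=0.30 avg300=0.10 total=65432
 * Returns 0 on success, -1 if no "full" line could be parsed.
 */
int psi_parse_full_avg10(const char *buf, double *avg10)
{
	const char *line = buf;

	while (line && *line) {
		/* Only a line starting with "full avg10=" matches. */
		if (sscanf(line, "full avg10=%lf", avg10) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

The "full" value is the percentage of time in which all non-idle tasks
were stalled on memory at once, which is the closest userspace view of
"spinning its wheels".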

But choosing a universal cutoff for killing is not possible, since it
depends on the workload and the user's expectation: GUI and other
latency-sensitive applications care way before a compile job or video
encoding would care.

Because of that, there are things like oomd and lmkd as mentioned, to
leave the exact policy decision to userspace.

That being said, I think we should be able to provide a bare minimum
inside the kernel to avoid complete livelocks where the user does not
believe the machine would be able to recover without a reboot.

The goal wouldn't be a glitch-free user experience - the kernel does
not know enough about the applications to even attempt that. It should
just not hang indefinitely. Maybe similar to the hung task detector.

How about something like the below patch? With that, the kernel
catches excessive thrashing that happens before reclaim fails:

[root@ham ~]# stress -d 128 -m 5
stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd
Excessive and sustained system-wide memory pressure!
kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
Workqueue: events psi_avgs_work
Call Trace:
 dump_stack+0x46/0x60
 dump_header+0x5c/0x3d5
 ? irq_work_queue+0x46/0x50
 ? wake_up_klogd+0x2b/0x30
 ? vprintk_emit+0xe5/0x190
 oom_kill_process.cold.10+0xb/0x10
 out_of_memory+0x1ea/0x260
 update_averages.cold.8+0x14/0x25
 ? collect_percpu_times+0x84/0x1f0
 psi_avgs_work+0x80/0xc0
 process_one_work+0x1bb/0x310
 worker_thread+0x28/0x3c0
 ? process_one_work+0x310/0x310
 kthread+0x108/0x120
 ? __kthread_create_on_node+0x170/0x170
 ret_from_fork+0x35/0x40
Mem-Info:
active_anon:109463 inactive_anon:109564 isolated_anon:298
 active_file:4676 inactive_file:4073 isolated_file:455
 unevictable:0 dirty:8475 writeback:8 unstable:0
 slab_reclaimable:2585 slab_unreclaimable:4932
 mapped:413 shmem:2 pagetables:1747 bounce:0
 free:13472 free_pcp:17 free_cma:0

Possible snags and questions:

1. psi is an optional feature right now, but these livelocks commonly
   affect desktop users. What should be the default behavior?

2. Should we make the pressure cutoff and time period configurable?

   I fear we would open a can of worms similar to the existing OOM
   killer, where users are trying to use a kernel self-protection
   mechanism to implement workload QoS and priorities - things that
   should firmly be kept in userspace.

3. swapoff annotation. Due to the swapin annotation, swapoff currently
   raises memory pressure. It probably shouldn't. But this will be a
   bigger problem if we trigger the oom killer based on it.

4. Killing once every 10s assumes basically one big culprit. If the
   pressure is created by many different processes, fixing the
   situation could take quite a while.

   What oomd does to solve this is to monitor the PGSCAN counters
   after a kill, to tell whether pressure is persisting, or just from
   residual refaults after the culprit has been dealt with.

   We may need to do something similar here. Or find a solution to
   encode that distinction into psi itself, and it would also take
   care of the swapoff problem, since it's basically the same thing -
   residual refaults without any reclaim pressure to sustain them.
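
As a rough userspace illustration of the PGSCAN check in item 4: take
two snapshots of /proc/vmstat, one at the kill and one a little later,
and see whether the scan counters advanced. The function names are
hypothetical, and this is only a guess at the general shape of such a
check, not oomd's actual logic.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum pgscan_kswapd and pgscan_direct from a buffer in /proc/vmstat
 * format ("name value" per line). Other counters are ignored.
 */
unsigned long long vmstat_pgscan_total(const char *buf)
{
	unsigned long long total = 0, val;
	char name[64];
	const char *line = buf;

	while (line && *line) {
		if (sscanf(line, "%63s %llu", name, &val) == 2 &&
		    (!strcmp(name, "pgscan_kswapd") ||
		     !strcmp(name, "pgscan_direct")))
			total += val;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return total;
}

/*
 * Nonzero if reclaim scanned more pages between the two snapshots,
 * i.e. pressure is persisting rather than just residual refaults.
 */
int reclaim_still_active(const char *before, const char *after)
{
	return vmstat_pgscan_total(after) > vmstat_pgscan_total(before);
}
```

If the counters stand still while psi still reports stalls, the
remaining pressure is most likely residual refaults, and another kill
would not help.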

Anyway, here is the draft patch:

From e34a5cf28771d69f13faa0e933adeae44b26b8aa Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Aug 2019 13:15:16 -0400
Subject: [PATCH] psi oom

---
 include/linux/psi_types.h |  4 +++
 kernel/sched/psi.c        | 52 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..390446b07ac7 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -162,6 +162,10 @@ struct psi_group {
 	u64 polling_total[NR_PSI_STATES - 1];
 	u64 polling_next_update;
 	u64 polling_until;
+
+	/* Out-of-memory situation tracking */
+	bool oom_pressure;
+	u64 oom_pressure_start;
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index f28342dc65ec..1027b6611ec2 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -139,6 +139,7 @@
 #include <linux/ctype.h>
 #include <linux/file.h>
 #include <linux/poll.h>
+#include <linux/oom.h>
 #include <linux/psi.h>
 #include "sched.h"
 
@@ -177,6 +178,8 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static void psi_oom_tick(struct psi_group *group, u64 now);
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void group_init(struct psi_group *group)
@@ -403,6 +406,8 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
+	psi_oom_tick(group, now);
+
 	return avg_next_update;
 }
 
@@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
 	return 0;
 }
 module_init(psi_proc_init);
+
+#define OOM_PRESSURE_LEVEL	80
+#define OOM_PRESSURE_PERIOD	(10 * NSEC_PER_SEC)
+
+static void psi_oom_tick(struct psi_group *group, u64 now)
+{
+	struct oom_control oc = {
+		.order = 0,
+	};
+	unsigned long pressure;
+	bool high;
+
+	/*
+	 * Protect the system from livelocking due to thrashing. Leave
+	 * per-cgroup policies to oomd, lmkd etc.
+	 */
+	if (group != &psi_system)
+		return;
+
+	pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
+	high = pressure >= OOM_PRESSURE_LEVEL;
+
+	if (!group->oom_pressure && !high)
+		return;
+
+	if (!group->oom_pressure && high) {
+		group->oom_pressure = true;
+		group->oom_pressure_start = now;
+		return;
+	}
+
+	if (group->oom_pressure && !high) {
+		group->oom_pressure = false;
+		return;
+	}
+
+	if (now < group->oom_pressure_start + OOM_PRESSURE_PERIOD)
+		return;
+
+	group->oom_pressure = false;
+
+	if (!mutex_trylock(&oom_lock))
+		return;
+	pr_warn("Excessive and sustained system-wide memory pressure!\n");
+	out_of_memory(&oc);
+	mutex_unlock(&oom_lock);
+}
-- 
2.22.0



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 19:31   ` Johannes Weiner
@ 2019-08-06  1:08     ` Suren Baghdasaryan
  2019-08-06  9:36       ` Vlastimil Babka
  2019-08-06 21:43       ` James Courtier-Dutton
  0 siblings, 2 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2019-08-06  1:08 UTC
  To: Johannes Weiner
  Cc: Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm, Michal Hocko

On Mon, Aug 5, 2019 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > >
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > >
> > > Steps to reproduce:
> > >
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > >
> > > Once you hit a situation when opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely  be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> >
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> >
> > Yeah that's a known problem, made worse by SSDs in fact, as they are able
> > to keep refaulting the last remaining file pages fast enough, so there
> > is still apparent progress in reclaim and OOM doesn't kick in.
> >
> > At this point, the likely solution will probably be based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
>
> Yes, psi was specifically developed to address this problem. Before
> it, the kernel had to make all decisions based on relative event rates
> but had no notion of time. Whereas to the user, time is clearly an
> issue, and in fact makes all the difference. So psi quantifies the
> time the workload spends executing vs. spinning its wheels.
>
> But choosing a universal cutoff for killing is not possible, since it
> depends on the workload and the user's expectation: GUI and other
> latency-sensitive applications care way before a compile job or video
> encoding would care.
>
> Because of that, there are things like oomd and lmkd as mentioned, to
> leave the exact policy decision to userspace.
>
> That being said, I think we should be able to provide a bare minimum
> inside the kernel to avoid complete livelocks where the user does not
> believe the machine would be able to recover without a reboot.
>
> The goal wouldn't be a glitch-free user experience - the kernel does
> not know enough about the applications to even attempt that. It should
> just not hang indefinitely. Maybe similar to the hung task detector.
>
> How about something like the below patch? With that, the kernel
> catches excessive thrashing that happens before reclaim fails:
>
> [root@ham ~]# stress -d 128 -m 5
> stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd
> Excessive and sustained system-wide memory pressure!
> kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
> Workqueue: events psi_avgs_work
> Call Trace:
>  dump_stack+0x46/0x60
>  dump_header+0x5c/0x3d5
>  ? irq_work_queue+0x46/0x50
>  ? wake_up_klogd+0x2b/0x30
>  ? vprintk_emit+0xe5/0x190
>  oom_kill_process.cold.10+0xb/0x10
>  out_of_memory+0x1ea/0x260
>  update_averages.cold.8+0x14/0x25
>  ? collect_percpu_times+0x84/0x1f0
>  psi_avgs_work+0x80/0xc0
>  process_one_work+0x1bb/0x310
>  worker_thread+0x28/0x3c0
>  ? process_one_work+0x310/0x310
>  kthread+0x108/0x120
>  ? __kthread_create_on_node+0x170/0x170
>  ret_from_fork+0x35/0x40
> Mem-Info:
> active_anon:109463 inactive_anon:109564 isolated_anon:298
>  active_file:4676 inactive_file:4073 isolated_file:455
>  unevictable:0 dirty:8475 writeback:8 unstable:0
>  slab_reclaimable:2585 slab_unreclaimable:4932
>  mapped:413 shmem:2 pagetables:1747 bounce:0
>  free:13472 free_pcp:17 free_cma:0
>
> Possible snags and questions:
>
> 1. psi is an optional feature right now, but these livelocks commonly
>    affect desktop users. What should be the default behavior?
>
> 2. Should we make the pressure cutoff and time period configurable?
>
>    I fear we would open a can of worms similar to the existing OOM
>    killer, where users are trying to use a kernel self-protection
>    mechanism to implement workload QoS and priorities - things that
>    should firmly be kept in userspace.
>
> 3. swapoff annotation. Due to the swapin annotation, swapoff currently
>    raises memory pressure. It probably shouldn't. But this will be a
>    bigger problem if we trigger the oom killer based on it.
>
> 4. Killing once every 10s assumes basically one big culprit. If the
>    pressure is created by many different processes, fixing the
>    situation could take quite a while.
>
>    What oomd does to solve this is to monitor the PGSCAN counters
>    after a kill, to tell whether pressure is persisting, or just from
>    residual refaults after the culprit has been dealt with.
>
>    We may need to do something similar here. Or find a solution to
>    encode that distinction into psi itself, and it would also take
>    care of the swapoff problem, since it's basically the same thing -
>    residual refaults without any reclaim pressure to sustain them.
>
> Anyway, here is the draft patch:
>
> From e34a5cf28771d69f13faa0e933adeae44b26b8aa Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 5 Aug 2019 13:15:16 -0400
> Subject: [PATCH] psi oom
>
> ---
>  include/linux/psi_types.h |  4 +++
>  kernel/sched/psi.c        | 52 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
>
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index 07aaf9b82241..390446b07ac7 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -162,6 +162,10 @@ struct psi_group {
>         u64 polling_total[NR_PSI_STATES - 1];
>         u64 polling_next_update;
>         u64 polling_until;
> +
> +       /* Out-of-memory situation tracking */
> +       bool oom_pressure;
> +       u64 oom_pressure_start;
>  };
>
>  #else /* CONFIG_PSI */
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index f28342dc65ec..1027b6611ec2 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -139,6 +139,7 @@
>  #include <linux/ctype.h>
>  #include <linux/file.h>
>  #include <linux/poll.h>
> +#include <linux/oom.h>
>  #include <linux/psi.h>
>  #include "sched.h"
>
> @@ -177,6 +178,8 @@ struct psi_group psi_system = {
>         .pcpu = &system_group_pcpu,
>  };
>
> +static void psi_oom_tick(struct psi_group *group, u64 now);
> +
>  static void psi_avgs_work(struct work_struct *work);
>
>  static void group_init(struct psi_group *group)
> @@ -403,6 +406,8 @@ static u64 update_averages(struct psi_group *group, u64 now)
>                 calc_avgs(group->avg[s], missed_periods, sample, period);
>         }
>
> +       psi_oom_tick(group, now);
> +
>         return avg_next_update;
>  }
>
> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
>         return 0;
>  }
>  module_init(psi_proc_init);
> +
> +#define OOM_PRESSURE_LEVEL     80
> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)

80% of the last 10 seconds spent in full stall would definitely be a
problem. If the system was already low on memory (which it probably
is, or we would not be reclaiming so hard and registering such a big
stall) then the oom-killer would probably kill something before 8 seconds
have passed. If my line of thinking is correct, then do we really
benefit from such an additional protection mechanism? I might be wrong
here because my experience is limited to embedded systems with
relatively small amounts of memory.

> +
> +static void psi_oom_tick(struct psi_group *group, u64 now)
> +{
> +       struct oom_control oc = {
> +               .order = 0,
> +       };
> +       unsigned long pressure;
> +       bool high;
> +
> +       /*
> +        * Protect the system from livelocking due to thrashing. Leave
> +        * per-cgroup policies to oomd, lmkd etc.
> +        */
> +       if (group != &psi_system)
> +               return;
> +
> +       pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
> +       high = pressure >= OOM_PRESSURE_LEVEL;
> +
> +       if (!group->oom_pressure && !high)
> +               return;
> +
> +       if (!group->oom_pressure && high) {
> +               group->oom_pressure = true;
> +               group->oom_pressure_start = now;
> +               return;
> +       }
> +
> +       if (group->oom_pressure && !high) {
> +               group->oom_pressure = false;
> +               return;
> +       }
> +
> +       if (now < group->oom_pressure_start + OOM_PRESSURE_PERIOD)
> +               return;
> +
> +       group->oom_pressure = false;
> +
> +       if (!mutex_trylock(&oom_lock))
> +               return;
> +       pr_warn("Excessive and sustained system-wide memory pressure!\n");
> +       out_of_memory(&oc);
> +       mutex_unlock(&oom_lock);
> +}
> --
> 2.22.0
>
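
To make the timing concrete, here is a minimal user-space sketch of the
state machine in psi_oom_tick() as quoted above. It is not kernel code:
struct oom_tracker and oom_tick() are hypothetical names, the pressure
is passed in as a plain integer percentage, and the function simply
returns true at the point where the patch would call out_of_memory().

```c
#include <stdbool.h>
#include <stdint.h>

#define OOM_PRESSURE_LEVEL   80                       /* percent, as in the patch */
#define OOM_PRESSURE_PERIOD  (10ULL * 1000000000ULL)  /* 10s in nanoseconds */

/* Hypothetical stand-in for the two fields the patch adds to psi_group. */
struct oom_tracker {
	bool oom_pressure;
	uint64_t oom_pressure_start;
};

/* Returns true where the patch would invoke the OOM killer. */
static bool oom_tick(struct oom_tracker *t, unsigned long pressure, uint64_t now)
{
	bool high = pressure >= OOM_PRESSURE_LEVEL;

	/* Not tracking and nothing to track. */
	if (!t->oom_pressure && !high)
		return false;

	/* Pressure just crossed the level: start the clock. */
	if (!t->oom_pressure && high) {
		t->oom_pressure = true;
		t->oom_pressure_start = now;
		return false;
	}

	/* Pressure dropped below the level: reset. */
	if (t->oom_pressure && !high) {
		t->oom_pressure = false;
		return false;
	}

	/* Still high, but not for long enough yet. */
	if (now < t->oom_pressure_start + OOM_PRESSURE_PERIOD)
		return false;

	/* Sustained for the whole period: rearm and report. */
	t->oom_pressure = false;
	return true;
}
```

Note that in the patch the input is avg10, which is already a 10 second
running average, so a kill happens only after that average has stayed
at or above the level for a further 10 seconds.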


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 18:55     ` Johannes Weiner
@ 2019-08-06  9:29       ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-08-06  9:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Artem S. Tashkinov, linux-kernel, linux-mm,
	Suren Baghdasaryan

On Mon 05-08-19 14:55:42, Johannes Weiner wrote:
> On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > > Hello,
> > > > 
> > > > There's this bug which has been bugging many people for many years
> > > > already and which is reproducible in less than a few minutes under the
> > > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > > defaults.
> > > > 
> > > > Steps to reproduce:
> > > > 
> > > > 1) Boot with mem=4G
> > > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > > > 
> > > > Once you hit a situation when opening a new tab requires more RAM than
> > > > is currently available, the system will stall hard. You will barely  be
> > > > able to move the mouse pointer. Your disk LED will be flashing
> > > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > > applications or close currently running ones.
> > > 
> > > > This little crisis may continue for minutes or even longer. I think
> > > > that's not how the system should behave in this situation. I believe
> > > > something must be done about that to avoid this stall.
> > > 
> > > > Yeah that's a known problem, made worse by SSDs in fact, as they are able
> > > to keep refaulting the last remaining file pages fast enough, so there
> > > is still apparent progress in reclaim and OOM doesn't kick in.
> > > 
> > > At this point, the likely solution will be probably based on pressure
> > > stall monitoring (PSI). I don't know how far we are from a built-in
> > > monitor with reasonable defaults for a desktop workload, so CCing
> > > relevant folks.
> > 
> > Another potential approach would be to consider the refault information
> > we have already for file backed pages. Once we start reclaiming only
> > workingset pages then we should be thrashing, right? It cannot be as
> > precise as the cost model which can be defined around PSI but it might
> > give us at least a fallback measure.
> 
> NAK, this does *not* work. Not even as fallback.
> 
> There is no amount of refaults for which you can say whether they are
> a problem or not. It depends on the disk speed (obvious) but also on
> the workload's memory access patterns (somewhat less obvious).
> 
> For example, we have workloads whose cache set doesn't quite fit into
> memory, but everything else is pretty much statically allocated and it
> rarely touches any new or one-off filesystem data. So there is always
> a steady rate of mostly uninterrupted refaults, however, most data
> accesses are hitting the cache! And we have fast SSDs that compensate
> for the refaults that do occur. The workload runs *completely fine*.

OK, thanks for this example. I can see how a constant working set
refault can work properly if the rate is slower than the overall IO
plus the allocation demand for other purposes.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06  1:08     ` Suren Baghdasaryan
@ 2019-08-06  9:36       ` Vlastimil Babka
  2019-08-06 14:27         ` Johannes Weiner
  2019-08-06 21:43       ` James Courtier-Dutton
  1 sibling, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2019-08-06  9:36 UTC (permalink / raw)
  To: Suren Baghdasaryan, Johannes Weiner
  Cc: Artem S. Tashkinov, LKML, linux-mm, Michal Hocko

On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
>> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
>>         return 0;
>>  }
>>  module_init(psi_proc_init);
>> +
>> +#define OOM_PRESSURE_LEVEL     80
>> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> 
> 80% of the last 10 seconds spent in full stall would definitely be a
> problem. If the system was already low on memory (which it probably
> is, or we would not be reclaiming so hard and registering such a big
> stall) then oom-killer would probably kill something before 8 seconds
> are passed.

If the OOM killer can act faster, then great! On small embedded systems you probably
don't enable PSI anyway?

> If my line of thinking is correct, then do we really
> benefit from such additional protection mechanism? I might be wrong
> here because my experience is limited to embedded systems with
> relatively small amounts of memory.

Well, Artem in his original mail describes a minutes-long stall. Things are
really different on a fast desktop/laptop with an SSD. I have experienced this as
well, ending up performing a manual OOM kill with alt-sysrq-f (then I put more RAM than
8GB in the laptop). IMHO the default limit should be set so that the user
doesn't resort to that manual OOM kill (or a hard reboot) before the mechanism
kicks in. 10 seconds should be fine.



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06  9:36       ` Vlastimil Babka
@ 2019-08-06 14:27         ` Johannes Weiner
  2019-08-06 14:36           ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-06 14:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Artem S. Tashkinov, LKML, linux-mm, Michal Hocko

On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> >>         return 0;
> >>  }
> >>  module_init(psi_proc_init);
> >> +
> >> +#define OOM_PRESSURE_LEVEL     80
> >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > 
> > 80% of the last 10 seconds spent in full stall would definitely be a
> > problem. If the system was already low on memory (which it probably
> > is, or we would not be reclaiming so hard and registering such a big
> > stall) then oom-killer would probably kill something before 8 seconds
> > are passed.
> 
> If the OOM killer can act faster, then great! On small embedded systems you probably
> don't enable PSI anyway?
> 
> > If my line of thinking is correct, then do we really
> > benefit from such additional protection mechanism? I might be wrong
> > here because my experience is limited to embedded systems with
> > relatively small amounts of memory.
> 
> Well, Artem in his original mail describes a minutes long stall. Things are
> really different on a fast desktop/laptop with SSD. I have experienced this as
> well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> 8GB in the laptop). IMHO the default limit should be set so that the user
> doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> seconds should be fine.

That's exactly what I have experienced in the past, and this was also
the consistent story in the bug reports we have had.

I suspect it requires a certain combination of RAM size, CPU speed,
and IO capacity: the OOM killer kicks in when reclaim fails, which
happens when all scanned LRU pages were locked and under IO. So IO
needs to be slow enough, or RAM small enough, that the CPU can scan
all LRU pages while they are temporarily unreclaimable (page lock).

It may well be that on phones the RAM is small enough relative to CPU
speed.

But on desktops/servers, we frequently see that there is a wider
window of memory consumption in which reclaim efficiency doesn't drop
low enough for the OOM killer to kick in. In the time it takes the CPU
to scan through RAM, enough pages will have *just* finished reading
for reclaim to free them again and continue to make "progress".

We do know that the OOM killer might not kick in for at least 20-25
minutes while the system is entirely unresponsive. People usually
don't wait this long before forcibly rebooting. In a managed fleet,
ssh heartbeat tests eventually fail and force a reboot.

I'm not sure 10s is the perfect value here, but I do think the kernel
should try to get out of such a state, where interacting with the
system is impossible, within a reasonable amount of time.

It could be a little too short for non-interactive number-crunching
systems...
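
For anyone who wants to watch this window on their own machine, the
full-stall averages can be read from /proc/pressure/memory on kernels
built with CONFIG_PSI. A minimal sketch follows; parse_psi_full() and
read_psi_full() are hypothetical helper names, and the line format is
assumed to be the documented "full avg10=... avg60=... avg300=...
total=..." layout.

```c
#include <stdio.h>

struct psi_full {
	double avg10, avg60, avg300;   /* percent of wall time fully stalled */
	unsigned long long total;      /* cumulative stall time in microseconds */
};

/* Parse one line of a PSI file; 0 on success, -1 if it is not a "full" line. */
static int parse_psi_full(const char *line, struct psi_full *out)
{
	int n = sscanf(line, "full avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       &out->avg10, &out->avg60, &out->avg300, &out->total);

	return n == 4 ? 0 : -1;
}

/* Read the "full" line from a PSI file such as /proc/pressure/memory. */
static int read_psi_full(const char *path, struct psi_full *out)
{
	char line[256];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (parse_psi_full(line, out) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Calling read_psi_full("/proc/pressure/memory", &p) once a second during
the browser-tab reproducer should show avg10 climbing while the OOM
killer stays quiet, if the theory above holds.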



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06 14:27         ` Johannes Weiner
@ 2019-08-06 14:36           ` Michal Hocko
  2019-08-06 16:27             ` Suren Baghdasaryan
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-08-06 14:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Suren Baghdasaryan, Artem S. Tashkinov, LKML, linux-mm

On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > >>         return 0;
> > >>  }
> > >>  module_init(psi_proc_init);
> > >> +
> > >> +#define OOM_PRESSURE_LEVEL     80
> > >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > > 
> > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > problem. If the system was already low on memory (which it probably
> > > is, or we would not be reclaiming so hard and registering such a big
> > > stall) then oom-killer would probably kill something before 8 seconds
> > > are passed.
> > 
> > If the OOM killer can act faster, then great! On small embedded systems you probably
> > don't enable PSI anyway?
> > 
> > > If my line of thinking is correct, then do we really
> > > benefit from such additional protection mechanism? I might be wrong
> > > here because my experience is limited to embedded systems with
> > > relatively small amounts of memory.
> > 
> > Well, Artem in his original mail describes a minutes long stall. Things are
> > really different on a fast desktop/laptop with SSD. I have experienced this as
> > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > 8GB in the laptop). IMHO the default limit should be set so that the user
> > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > seconds should be fine.
> 
> That's exactly what I have experienced in the past, and this was also
> the consistent story in the bug reports we have had.
> 
> I suspect it requires a certain combination of RAM size, CPU speed,
> and IO capacity: the OOM killer kicks in when reclaim fails, which
> happens when all scanned LRU pages were locked and under IO. So IO
> needs to be slow enough, or RAM small enough, that the CPU can scan
> all LRU pages while they are temporarily unreclaimable (page lock).
> 
> It may well be that on phones the RAM is small enough relative to CPU
> size.
> 
> But on desktops/servers, we frequently see that there is a wider
> window of memory consumption in which reclaim efficiency doesn't drop
> low enough for the OOM killer to kick in. In the time it takes the CPU
> to scan through RAM, enough pages will have *just* finished reading
> for reclaim to free them again and continue to make "progress".
> 
> We do know that the OOM killer might not kick in for at least 20-25
> minutes while the system is entirely unresponsive. People usually
> don't wait this long before forcibly rebooting. In a managed fleet,
> ssh heartbeat tests eventually fail and force a reboot.
> 
> I'm not sure 10s is the perfect value here, but I do think the kernel
> should try to get out of such a state, where interacting with the
> system is impossible, within a reasonable amount of time.
> 
> It could be a little too short for non-interactive number-crunching
> systems...

Would it be possible to have a module with tuning knobs as parameters
and hook into the PSI infrastructure? People can play with the setting
to their need, we wouldn't really have to think about the user visible API
for the tuning and this could be easily adopted as an opt-in mechanism
without a risk of regressions.

I would really love to see a simple thrashing watchdog like the one you
have proposed earlier. It is self contained and easy to play with if the
parameters are not hardcoded.

-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06 14:36           ` Michal Hocko
@ 2019-08-06 16:27             ` Suren Baghdasaryan
  2019-08-06 22:01               ` Johannes Weiner
  0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2019-08-06 16:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm

On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > > >>         return 0;
> > > >>  }
> > > >>  module_init(psi_proc_init);
> > > >> +
> > > >> +#define OOM_PRESSURE_LEVEL     80
> > > >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > > >
> > > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > > problem. If the system was already low on memory (which it probably
> > > > is, or we would not be reclaiming so hard and registering such a big
> > > > stall) then oom-killer would probably kill something before 8 seconds
> > > > are passed.
> > >
> > > If the OOM killer can act faster, then great! On small embedded systems you probably
> > > don't enable PSI anyway?

We use PSI triggers with a 1 sec tracking window. PSI averages are less
useful on such systems because in 10 secs (which is the shortest PSI
averaging window) memory conditions can change drastically.
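
For reference, a trigger is registered by writing "<some|full> <stall
us> <window us>" to the pressure file and then polling the descriptor
for POLLPRI, as described in the kernel's PSI documentation. A rough
sketch: the helper names are hypothetical, and a 150ms stall threshold
in a 1s window is only an example value, not what we ship.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format "<some|full> <stall us> <window us>"; returns length or -1. */
static int psi_trigger_string(char *buf, size_t len, int full,
			      unsigned int stall_us, unsigned int window_us)
{
	int n;

	/* The stall threshold cannot exceed the tracking window. */
	if (stall_us == 0 || stall_us > window_us)
		return -1;

	n = snprintf(buf, len, "%s %u %u",
		     full ? "full" : "some", stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Open a pressure file and register the trigger; returns an fd or -1. */
static int psi_trigger_open(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;

	/* The trigger string is written including its terminating NUL. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait for the trigger: 1 on event, 0 on timeout, -1 on error. */
static int psi_wait_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;
	if (pfd.revents & POLLERR)
		return -1;
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

With psi_trigger_string(buf, sizeof(buf), 1, 150000, 1000000) the
string written to /proc/pressure/memory would be "full 150000 1000000",
and the process is woken as soon as the stall within any 1s window
crosses 150ms, instead of waiting for avg10 to move.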

> > > > If my line of thinking is correct, then do we really
> > > > benefit from such additional protection mechanism? I might be wrong
> > > > here because my experience is limited to embedded systems with
> > > > relatively small amounts of memory.
> > >
> > > Well, Artem in his original mail describes a minutes long stall. Things are
> > > really different on a fast desktop/laptop with SSD. I have experienced this as
> > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > > 8GB in the laptop). IMHO the default limit should be set so that the user
> > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > > seconds should be fine.
> >
> > That's exactly what I have experienced in the past, and this was also
> > the consistent story in the bug reports we have had.
> >
> > I suspect it requires a certain combination of RAM size, CPU speed,
> > and IO capacity: the OOM killer kicks in when reclaim fails, which
> > happens when all scanned LRU pages were locked and under IO. So IO
> > needs to be slow enough, or RAM small enough, that the CPU can scan
> > all LRU pages while they are temporarily unreclaimable (page lock).
> >
> > It may well be that on phones the RAM is small enough relative to CPU
> > size.
> >
> > But on desktops/servers, we frequently see that there is a wider
> > window of memory consumption in which reclaim efficiency doesn't drop
> > low enough for the OOM killer to kick in. In the time it takes the CPU
> > to scan through RAM, enough pages will have *just* finished reading
> > for reclaim to free them again and continue to make "progress".
> >
> > We do know that the OOM killer might not kick in for at least 20-25
> > minutes while the system is entirely unresponsive. People usually
> > don't wait this long before forcibly rebooting. In a managed fleet,
> > ssh heartbeat tests eventually fail and force a reboot.

Got it. Thanks for the explanation.

> > I'm not sure 10s is the perfect value here, but I do think the kernel
> > should try to get out of such a state, where interacting with the
> > system is impossible, within a reasonable amount of time.
> >
> > It could be a little too short for non-interactive number-crunching
> > systems...
>
> Would it be possible to have a module with tuning knobs as parameters
> and hook into the PSI infrastructure? People can play with the setting
> to their need, we wouldn't really have to think about the user visible API
> for the tuning and this could be easily adopted as an opt-in mechanism
> without a risk of regressions.

PSI averages stalls over 10, 60 and 300 seconds, so implementing 3
corresponding thresholds would be easy. The patch Johannes posted can
be extended to support 3 thresholds instead of 1. I can take a stab at
it if Johannes is busy.
If we want more flexibility we could use PSI triggers with a
configurable tracking window but that's more complex and probably not
worth it.
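
Roughly what I have in mind, as a user-space sketch rather than a
patch: one cutoff per averaging window, with 0 meaning that window is
disabled. psi_over_threshold() and the enum names are hypothetical, and
the averages are assumed to be integer percentages like the LOAD_INT()
value used in the posted patch.

```c
/* One slot per PSI averaging window: 10s, 60s and 300s. */
enum { PSI_AVG10, PSI_AVG60, PSI_AVG300, PSI_NR_AVGS };

/*
 * Returns the index of the first window whose average has reached its
 * cutoff, or -1 if none has. A cutoff of 0 disables that window.
 */
static int psi_over_threshold(const unsigned int avg[PSI_NR_AVGS],
			      const unsigned int cutoff[PSI_NR_AVGS])
{
	for (int i = 0; i < PSI_NR_AVGS; i++) {
		if (cutoff[i] && avg[i] >= cutoff[i])
			return i;
	}
	return -1;
}
```

A short spike would then trip only the 10s cutoff, while slower but
persistent thrashing could be caught by a lower cutoff on the 60s or
300s average.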

> I would really love to see a simple thrashing watchdog like the one you
> have proposed earlier. It is self contained and easy to play with if the
> parameters are not hardcoded.
>
> --
> Michal Hocko
> SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
       [not found] <d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com>
  2019-08-05 12:13 ` Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Vlastimil Babka
@ 2019-08-06 19:00 ` Florian Weimer
  1 sibling, 0 replies; 39+ messages in thread
From: Florian Weimer @ 2019-08-06 19:00 UTC (permalink / raw)
  To: Artem S. Tashkinov; +Cc: linux-kernel, linux-mm

* Artem S. Tashkinov:

> There's this bug which has been bugging many people for many years
> already and which is reproducible in less than a few minutes under the
> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> defaults.
>
> Steps to reproduce:
>
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease

Do you see this with Intel graphics?  I think these drivers still use
the GEM shrinker, which effectively bypasses most kernel memory
management heuristics.



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06  1:08     ` Suren Baghdasaryan
  2019-08-06  9:36       ` Vlastimil Babka
@ 2019-08-06 21:43       ` James Courtier-Dutton
  1 sibling, 0 replies; 39+ messages in thread
From: James Courtier-Dutton @ 2019-08-06 21:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Johannes Weiner, Vlastimil Babka, Artem S. Tashkinov, LKML,
	linux-mm, Michal Hocko

On Tue, 6 Aug 2019 at 02:09, Suren Baghdasaryan <surenb@google.com> wrote:
>
> 80% of the last 10 seconds spent in full stall would definitely be a
> problem. If the system was already low on memory (which it probably
> is, or we would not be reclaiming so hard and registering such a big
> stall) then oom-killer would probably kill something before 8 seconds
> are passed.

There are other things to consider as well.
I can reproduce these types of symptoms when memory pressure is 100%
NOT the cause (top showing 4GB of a 16GB system in use).
The cause as I see it is disk pressure and the lack of multiple queues
for disk IO requests.
For example, one process can hog 100% of the disk, without other
applications even being able to write just one sector.
We need a way for the Linux kernel to better multiplex access to the
disk: adding QoS and allowing interactive processes to interrupt long
background disk IO tasks.
If we could balance disk access across each active process, the user,
on their desktop, would think the system was more responsive.

Kind Regards

James



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06 16:27             ` Suren Baghdasaryan
@ 2019-08-06 22:01               ` Johannes Weiner
  2019-08-07  7:59                 ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-06 22:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm

On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > > > >>         return 0;
> > > > >>  }
> > > > >>  module_init(psi_proc_init);
> > > > >> +
> > > > >> +#define OOM_PRESSURE_LEVEL     80
> > > > >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > > > >
> > > > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > > > problem. If the system was already low on memory (which it probably
> > > > > is, or we would not be reclaiming so hard and registering such a big
> > > > > stall) then oom-killer would probably kill something before 8 seconds
> > > > > are passed.
> > > >
> > > > If the OOM killer can act faster, then great! On small embedded systems you probably
> > > > don't enable PSI anyway?
> 
> We use PSI triggers with 1 sec tracking window. PSI averages are less
> useful on such systems because in 10 secs (which is the shortest PSI
> averaging window) memory conditions can change drastically.
> 
> > > > > If my line of thinking is correct, then do we really
> > > > > benefit from such additional protection mechanism? I might be wrong
> > > > > here because my experience is limited to embedded systems with
> > > > > relatively small amounts of memory.
> > > >
> > > > Well, Artem in his original mail describes a minutes long stall. Things are
> > > > really different on a fast desktop/laptop with SSD. I have experienced this as
> > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > > > 8GB in the laptop). IMHO the default limit should be set so that the user
> > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > > > seconds should be fine.
> > >
> > > That's exactly what I have experienced in the past, and this was also
> > > the consistent story in the bug reports we have had.
> > >
> > > I suspect it requires a certain combination of RAM size, CPU speed,
> > > and IO capacity: the OOM killer kicks in when reclaim fails, which
> > > happens when all scanned LRU pages were locked and under IO. So IO
> > > needs to be slow enough, or RAM small enough, that the CPU can scan
> > > all LRU pages while they are temporarily unreclaimable (page lock).
> > >
> > > It may well be that on phones the RAM is small enough relative to CPU
> > > size.
> > >
> > > But on desktops/servers, we frequently see that there is a wider
> > > window of memory consumption in which reclaim efficiency doesn't drop
> > > low enough for the OOM killer to kick in. In the time it takes the CPU
> > > to scan through RAM, enough pages will have *just* finished reading
> > > for reclaim to free them again and continue to make "progress".
> > >
> > > We do know that the OOM killer might not kick in for at least 20-25
> > > minutes while the system is entirely unresponsive. People usually
> > > don't wait this long before forcibly rebooting. In a managed fleet,
> > > ssh heartbeat tests eventually fail and force a reboot.
> 
> Got it. Thanks for the explanation.
> 
> > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > should try to get out of such a state, where interacting with the
> > > system is impossible, within a reasonable amount of time.
> > >
> > > It could be a little too short for non-interactive number-crunching
> > > systems...
> >
> > Would it be possible to have a module with tuning knobs as parameters
> > and hook into the PSI infrastructure? People can play with the setting
> > to their need, we wouldn't really have to think about the user visible API
> > for the tuning and this could be easily adopted as an opt-in mechanism
> > without a risk of regressions.

It's relatively easy to trigger a livelock that disables the entire
system for good, as a regular user. It's a little weird to make the
bug fix for that an opt-in with an extensive configuration interface.

This isn't like the hung task watchdog, where it's likely some kind
of kernel issue, right? This can happen on any current kernel.

What I would like to have is a way of self-recovery from a livelock. I
don't mind making it opt-out in case we make mistakes, but the kernel
should provide minimal self-protection out of the box, IMO.

> PSI averages stalls over 10, 60 and 300 seconds, so implementing 3
> corresponding thresholds would be easy. The patch Johannes posted can
> be extended to support 3 thresholds instead of 1. I can take a stab at
> it if Johannes is busy.
> If we want more flexibility we could use PSI triggers with
> configurable tracking window but that's more complex and probably not
> worth it.

This goes into quality-of-service for workloads territory again. I'm
not quite convinced yet we want to go there.
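
For reference, the PSI trigger interface mentioned in the quoted text is
driven from userspace: a line of the form "<some|full> <stall us> <window us>"
is written to /proc/pressure/memory, and poll() then reports POLLPRI whenever
the stall time within the window exceeds the threshold. The following is only
a minimal sketch, assuming a kernel with CONFIG_PSI and trigger support (5.2
or later); the helper names psi_format_trigger() and psi_wait_trigger() are
made up for illustration and are not part of any kernel or libc API. As far
as I can tell the kernel documentation restricts the window to between 500ms
and 10s.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build a PSI trigger spec such as "full 150000 1000000": notify when
 * tasks were fully stalled for stall_us within any window_us window.
 * Returns the string length, or -1 if buf is too small.
 * (Hypothetical helper, for illustration only.)
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned long stall_us, unsigned long window_us)
{
	int n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register the trigger and block until it fires once.
 * Returns 0 on an event, -1 on any error (e.g. no PSI support).
 * (Hypothetical helper, for illustration only.)
 */
int psi_wait_trigger(const char *spec)
{
	struct pollfd pfd;
	int ret = -1;

	pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLPRI;
	pfd.revents = 0;

	/* As in the kernel's own example, write the string with its NUL */
	if (write(pfd.fd, spec, strlen(spec) + 1) < 0)
		goto out;
	if (poll(&pfd, 1, -1) < 0)
		goto out;
	if (pfd.revents & POLLERR)	/* trigger source went away */
		goto out;
	if (pfd.revents & POLLPRI)
		ret = 0;
out:
	close(pfd.fd);
	return ret;
}
```

A caller would format a spec once, e.g. "full 150000 1000000" for 150ms of
full stall within any 1s window, and then call psi_wait_trigger() in a loop.
What to do when the trigger fires is exactly the policy question discussed
in this thread.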


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-06 22:01               ` Johannes Weiner
@ 2019-08-07  7:59                 ` Michal Hocko
  2019-08-07 20:51                   ` Johannes Weiner
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-08-07  7:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm

On Tue 06-08-19 18:01:50, Johannes Weiner wrote:
> On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
[...]
> > > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > > should try to get out of such a state, where interacting with the
> > > > system is impossible, within a reasonable amount of time.
> > > >
> > > > It could be a little too short for non-interactive number-crunching
> > > > systems...
> > >
> > > > Would it be possible to have a module with tuning knobs as parameters
> > > > and hook into the PSI infrastructure? People can play with the settings
> > > > to suit their needs, we wouldn't really have to think about the
> > > > user-visible API for the tuning, and this could easily be adopted as an
> > > > opt-in mechanism without a risk of regressions.
> 
> It's relatively easy to trigger a livelock that disables the entire
> system for good, as a regular user. It's a little weird to make the
> bug fix for that an opt-in with an extensive configuration interface.

Yes, I definitely do agree that this is a bug fix more than a
feature. The thing is that we do not know what the proper default is for
a wide variety of workloads, so some form of configurability is needed
(level and period).  If making this a module would require a lot of
additional code, then we need a kernel command line parameter at least.

A module would have the nice advantage that you can change your
configuration without rebooting. On the other hand, the same can be
achieved with a sysfs interface.
-- 
Michal Hocko
SUSE Labs


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07  7:59                 ` Michal Hocko
@ 2019-08-07 20:51                   ` Johannes Weiner
  2019-08-07 21:01                     ` Andrew Morton
                                       ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Johannes Weiner @ 2019-08-07 20:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On Wed, Aug 07, 2019 at 09:59:27AM +0200, Michal Hocko wrote:
> On Tue 06-08-19 18:01:50, Johannes Weiner wrote:
> > On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> [...]
> > > > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > > > should try to get out of such a state, where interacting with the
> > > > > system is impossible, within a reasonable amount of time.
> > > > >
> > > > > It could be a little too short for non-interactive number-crunching
> > > > > systems...
> > > >
> > > > > Would it be possible to have a module with tuning knobs as parameters
> > > > > and hook into the PSI infrastructure? People can play with the settings
> > > > > to suit their needs, we wouldn't really have to think about the
> > > > > user-visible API for the tuning, and this could easily be adopted as an
> > > > > opt-in mechanism without a risk of regressions.
> > 
> > It's relatively easy to trigger a livelock that disables the entire
> > system for good, as a regular user. It's a little weird to make the
> > bug fix for that an opt-in with an extensive configuration interface.
> 
> Yes, I definitely do agree that this is a bug fix more than a
> feature. The thing is that we do not know what the proper default is for
> a wide variety of workloads, so some form of configurability is needed
> (level and period).  If making this a module would require a lot of
> additional code, then we need a kernel command line parameter at least.
> 
> A module would have the nice advantage that you can change your
> configuration without rebooting. On the other hand, the same can be
> achieved with a sysfs interface.

That's reasonable. How about my initial patch, but behind a config
option and with the level and period configurable?

---
From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Aug 2019 13:15:16 -0400
Subject: [PATCH] psi: trigger the OOM killer on severe thrashing

Over the last few years we have had many reports that the kernel can
enter an extended livelock situation under sufficient memory
pressure. The system becomes unresponsive and fully IO bound for
indefinite periods of time, and often the user has no choice but to
reboot. Even though the system is clearly struggling with a shortage
of memory, the OOM killer is not engaging reliably.

The reason is that with bigger RAM, and in particular with faster
SSDs, page reclaim does not necessarily fail in the traditional sense
anymore. In the time it takes the CPU to run through the vast LRU
lists, there are almost always some cache pages that have finished
reading in and can be reclaimed, even before userspace had a chance to
access them. As a result, reclaim is nominally succeeding, but
userspace is refault-bound and not making significant progress.

While this is clearly noticeable to human beings, the kernel could not
actually determine this state with the traditional memory event
counters. We might see a certain rate of reclaim activity or refaults,
but how long, or whether at all, userspace is unproductive because of
it depends on IO speed, readahead efficiency, as well as memory access
patterns and concurrency of the userspace applications. The same
number of VM events could go unnoticed in one system / workload
combination, and result in an indefinite lockup in a different one.

However, eb414681d5a0 ("psi: pressure stall information for CPU,
memory, and IO") introduced a memory pressure metric that quantifies
the share of wallclock time in which userspace waits on reclaim,
refaults, swapins. By using absolute time, it encodes all the above
mentioned variables of hardware capacity and workload behavior. When
memory pressure is 40%, it means that 40% of the time the workload is
stalled on memory, period. This is the actual measure for the lack of
forward progress that users can experience. It's also something they
expect the kernel to manage and remedy if it becomes non-existent.

To accomplish this, this patch implements a thrashing cutoff for the
OOM killer. If the kernel determines a sustained high level of memory
pressure, and thus a lack of forward progress in userspace, it will
trigger the OOM killer to reduce memory contention.

Per default, the OOM killer will engage after 15 seconds of at least
80% memory pressure. These values are tunable via sysctls
vm.thrashing_oom_period and vm.thrashing_oom_level.

Ideally, this would be standard behavior for the kernel, but since it
involves a new metric and OOM killing, let's be safe and make it an
opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0
also disables the feature at runtime.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: "Artem S. Tashkinov" <aros@gmx.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 24 ++++++++
 include/linux/psi.h                     |  5 ++
 include/linux/psi_types.h               |  6 ++
 kernel/sched/psi.c                      | 74 +++++++++++++++++++++++++
 kernel/sysctl.c                         | 20 +++++++
 mm/Kconfig                              | 20 +++++++
 6 files changed, 149 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..0332cb52bcfc 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -66,6 +66,8 @@ files can be found in mm/swap.c.
 - stat_interval
 - stat_refresh
 - numa_stat
+- thrashing_oom_level
+- thrashing_oom_period
 - swappiness
 - unprivileged_userfaultfd
 - user_reserve_kbytes
@@ -825,6 +827,28 @@ When page allocation performance is not a bottleneck and you want all
 	echo 1 > /proc/sys/vm/numa_stat
 
 
+thrashing_oom_level
+===================
+
+This defines the memory pressure level for severe thrashing at which
+the OOM killer will be engaged.
+
+The default is 80. This means the system is considered to be thrashing
+severely when all active tasks are collectively stalled on memory
+(waiting for page reclaim, refaults, swapins etc) for 80% of the time.
+
+A setting of 0 will disable thrashing-based OOM killing.
+
+
+thrashing_oom_period
+====================
+
+This defines the number of seconds the system must sustain severe
+thrashing at thrashing_oom_level before the OOM killer is invoked.
+
+The default is 15.
+
+
 swappiness
 ==========
 
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..661ce45900f9 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -37,6 +37,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 			poll_table *wait);
 #endif
 
+#ifdef CONFIG_THRASHING_OOM
+extern unsigned int sysctl_thrashing_oom_level;
+extern unsigned int sysctl_thrashing_oom_period;
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..7c57d7e5627e 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -162,6 +162,12 @@ struct psi_group {
 	u64 polling_total[NR_PSI_STATES - 1];
 	u64 polling_next_update;
 	u64 polling_until;
+
+#ifdef CONFIG_THRASHING_OOM
+	/* Severe thrashing state tracking */
+	bool oom_pressure;
+	u64 oom_pressure_start;
+#endif
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index f28342dc65ec..4b1b620d6359 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -139,6 +139,7 @@
 #include <linux/ctype.h>
 #include <linux/file.h>
 #include <linux/poll.h>
+#include <linux/oom.h>
 #include <linux/psi.h>
 #include "sched.h"
 
@@ -177,6 +178,14 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+#ifdef CONFIG_THRASHING_OOM
+static void psi_oom_tick(struct psi_group *group, u64 now);
+#else
+static inline void psi_oom_tick(struct psi_group *group, u64 now)
+{
+}
+#endif
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void group_init(struct psi_group *group)
@@ -403,6 +412,8 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
+	psi_oom_tick(group, now);
+
 	return avg_next_update;
 }
 
@@ -1280,3 +1291,66 @@ static int __init psi_proc_init(void)
 	return 0;
 }
 module_init(psi_proc_init);
+
+#ifdef CONFIG_THRASHING_OOM
+/*
+ * Trigger the OOM killer when detecting severe thrashing.
+ *
+ * Per default we define severe thrashing as 15 seconds of 80% memory
+ * pressure (i.e. all active tasks are collectively stalled on memory
+ * 80% of the time).
+ */
+unsigned int sysctl_thrashing_oom_level = 80;
+unsigned int sysctl_thrashing_oom_period = 15;
+
+static void psi_oom_tick(struct psi_group *group, u64 now)
+{
+	struct oom_control oc = {
+		.order = 0,
+	};
+	unsigned long pressure;
+	bool high;
+
+	/* Disabled at runtime */
+	if (!sysctl_thrashing_oom_level)
+		return;
+
+	/*
+	 * Protect the system from livelocking due to thrashing. Leave
+	 * per-cgroup policies to oomd, lmkd etc.
+	 */
+	if (group != &psi_system)
+		return;
+
+	pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
+	high = pressure >= sysctl_thrashing_oom_level;
+
+	if (!group->oom_pressure && !high)
+		return;
+
+	if (!group->oom_pressure && high) {
+		group->oom_pressure = true;
+		group->oom_pressure_start = now;
+		return;
+	}
+
+	if (group->oom_pressure && !high) {
+		group->oom_pressure = false;
+		return;
+	}
+
+	if (now < group->oom_pressure_start +
+	    (u64)sysctl_thrashing_oom_period * NSEC_PER_SEC)
+		return;
+
+	pr_warn("Severe thrashing detected! (%us of %u%% memory pressure)\n",
+		sysctl_thrashing_oom_period, sysctl_thrashing_oom_level);
+
+	group->oom_pressure = false;
+
+	if (!mutex_trylock(&oom_lock))
+		return;
+	out_of_memory(&oc);
+	mutex_unlock(&oom_lock);
+}
+#endif /* CONFIG_THRASHING_OOM */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f12888971d66..3b9b3deb1836 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -68,6 +68,7 @@
 #include <linux/bpf.h>
 #include <linux/mount.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/psi.h>
 
 #include "../lib/kstrtox.h"
 
@@ -1746,6 +1747,25 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+#endif
+#ifdef CONFIG_THRASHING_OOM
+	{
+		.procname	= "thrashing_oom_level",
+		.data		= &sysctl_thrashing_oom_level,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "thrashing_oom_period",
+		.data		= &sysctl_thrashing_oom_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 #endif
 	{ }
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..cef13b423beb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,24 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config THRASHING_OOM
+	bool "Trigger the OOM killer on severe thrashing"
+	select PSI
+	help
+	  Under memory pressure, the kernel can enter severe thrashing
+	  or swap storms during which the system is fully IO-bound and
+	  does not respond to any user input. The OOM killer does not
+	  always engage because page reclaim manages to make nominal
+	  forward progress, but the system is effectively livelocked.
+
+	  This feature uses pressure stall information (PSI) to detect
+	  severe thrashing and trigger the OOM killer.
+
+	  The OOM killer will be engaged when the system sustains a
+	  memory pressure level of 80% for 15 seconds. This can be
+	  adjusted using the vm.thrashing_oom_[level|period] sysctls.
+
+	  Say Y if you have observed your system becoming unresponsive
+	  for extended periods under memory pressure.
+
 endmenu
-- 
2.22.0


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07 20:51                   ` Johannes Weiner
@ 2019-08-07 21:01                     ` Andrew Morton
  2019-08-07 21:34                       ` Johannes Weiner
  2019-08-07 21:12                     ` Johannes Weiner
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Andrew Morton @ 2019-08-07 21:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, LKML, linux-mm

On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> However, eb414681d5a0 ("psi: pressure stall information for CPU,
> memory, and IO") introduced a memory pressure metric that quantifies
> the share of wallclock time in which userspace waits on reclaim,
> refaults, swapins. By using absolute time, it encodes all the above
> mentioned variables of hardware capacity and workload behavior. When
> memory pressure is 40%, it means that 40% of the time the workload is
> stalled on memory, period. This is the actual measure for the lack of
> forward progress that users can experience. It's also something they
> expect the kernel to manage and remedy if it becomes non-existent.
> 
> To accomplish this, this patch implements a thrashing cutoff for the
> OOM killer. If the kernel determines a sustained high level of memory
> pressure, and thus a lack of forward progress in userspace, it will
> trigger the OOM killer to reduce memory contention.
> 
> Per default, the OOM killer will engage after 15 seconds of at least
> 80% memory pressure. These values are tunable via sysctls
> vm.thrashing_oom_period and vm.thrashing_oom_level.

Could be implemented in userspace?
</troll>


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07 20:51                   ` Johannes Weiner
  2019-08-07 21:01                     ` Andrew Morton
@ 2019-08-07 21:12                     ` Johannes Weiner
  2019-08-08 11:48                     ` Michal Hocko
  2019-08-08 14:47                     ` Vlastimil Babka
  3 siblings, 0 replies; 39+ messages in thread
From: Johannes Weiner @ 2019-08-07 21:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On Wed, Aug 07, 2019 at 04:51:42PM -0400, Johannes Weiner wrote:
> Per default, the OOM killer will engage after 15 seconds of at least
> 80% memory pressure. These values are tunable via sysctls
> vm.thrashing_oom_period and vm.thrashing_oom_level.

Let's go with this:

Per default, the OOM killer will engage after 15 seconds of at least
80% memory pressure. From experience, at 80% the user is experiencing
multi-second reaction times. 15 seconds is chosen to be long enough to
not OOM kill a short-lived spike that might resolve itself, yet short
enough for users to not press the reset button just yet.
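
The detection logic in psi_oom_tick() boils down to a small state machine:
start a timer when the pressure first reaches the level, drop it as soon as
the pressure falls below the level again, and fire once the timer has run
for the full period. The following is a standalone userspace model of that
logic, simplified from the patch. struct thrash_state and thrash_tick() are
illustrative names and not kernel API, and the pressure sample and the
current time are passed in by the caller rather than read from PSI.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Illustrative stand-in for the two fields the patch adds to psi_group */
struct thrash_state {
	bool high;		/* currently at or above the level */
	uint64_t start_ns;	/* when the pressure first reached the level */
};

/*
 * Feed one pressure sample (percent, 0-100) taken at now_ns.
 * Returns true when the pressure has stayed at or above level for
 * period_s seconds, i.e. when the patch would invoke the OOM killer.
 */
bool thrash_tick(struct thrash_state *st, unsigned int pressure,
		 uint64_t now_ns, unsigned int level, unsigned int period_s)
{
	bool high;

	if (!level)		/* a level of 0 disables detection */
		return false;

	high = pressure >= level;
	if (!high) {
		st->high = false;	/* pressure dropped, start over */
		return false;
	}
	if (!st->high) {
		st->high = true;	/* first sample at or above the level */
		st->start_ns = now_ns;
		return false;
	}
	if (now_ns < st->start_ns + (uint64_t)period_s * NSEC_PER_SEC)
		return false;

	st->high = false;	/* re-arm: another kill needs a new full period */
	return true;
}
```

Note that a single sample below the level resets the timer, so with the
defaults the 10s pressure average has to stay at or above 80% for the whole
15 seconds before anything is killed.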


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07 21:01                     ` Andrew Morton
@ 2019-08-07 21:34                       ` Johannes Weiner
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Weiner @ 2019-08-07 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, LKML, linux-mm

On Wed, Aug 07, 2019 at 02:01:30PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > However, eb414681d5a0 ("psi: pressure stall information for CPU,
> > memory, and IO") introduced a memory pressure metric that quantifies
> > the share of wallclock time in which userspace waits on reclaim,
> > refaults, swapins. By using absolute time, it encodes all the above
> > mentioned variables of hardware capacity and workload behavior. When
> > memory pressure is 40%, it means that 40% of the time the workload is
> > stalled on memory, period. This is the actual measure for the lack of
> > forward progress that users can experience. It's also something they
> > expect the kernel to manage and remedy if it becomes non-existent.
> > 
> > To accomplish this, this patch implements a thrashing cutoff for the
> > OOM killer. If the kernel determines a sustained high level of memory
> > pressure, and thus a lack of forward progress in userspace, it will
> > trigger the OOM killer to reduce memory contention.
> > 
> > Per default, the OOM killer will engage after 15 seconds of at least
> > 80% memory pressure. These values are tunable via sysctls
> > vm.thrashing_oom_period and vm.thrashing_oom_level.
> 
> Could be implemented in userspace?
> </troll>

We do in fact do this with oomd.

But it requires a comprehensive cgroup setup, with complete memory and
IO isolation, to protect that daemon from the memory pressure and
excessive paging of the rest of the system (mlock doesn't really cut
it because you need to potentially allocate quite a few proc dentries
and inodes just to walk the process tree and determine a kill target).

In a fleet that works fine, since we need to maintain that cgroup
infra anyway. But for other users, that's a lot of stack for basic
"don't hang forever if I allocate too much memory" functionality.
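
To illustrate the userspace side: the measurement itself is cheap, since the
same 10s "full" average the patch reads from group->avg[PSI_MEM_FULL][0] is
exported in /proc/pressure/memory. The following is a rough sketch of just
the sampling step, assuming the usual "full avg10=... avg60=... avg300=...
total=..." line format; psi_parse_full_avg10() and psi_read_full_avg10() are
hypothetical helper names. It deliberately leaves out the hard parts named
above, namely keeping the daemon itself responsive under pressure and
picking a kill target.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the "full" avg10 value (percent) from the contents of
 * /proc/pressure/memory. Returns 0 on success, -1 if not found.
 */
int psi_parse_full_avg10(const char *text, double *avg10)
{
	const char *line = text;

	while (line && *line) {
		/* Only the line starting with "full" matches this format */
		if (sscanf(line, "full avg10=%lf", avg10) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;		/* skip past the newline */
	}
	return -1;
}

/* Read the live value; returns -1 if PSI is unavailable. */
int psi_read_full_avg10(double *avg10)
{
	char buf[256];
	size_t n;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return psi_parse_full_avg10(buf, avg10);
}
```

A monitoring loop would call psi_read_full_avg10() every couple of seconds
and compare the result against its threshold. Even this small reader opens a
proc file and allocates on each sample, which hints at why such a daemon
needs isolation from the very pressure it is watching.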


* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07 20:51                   ` Johannes Weiner
  2019-08-07 21:01                     ` Andrew Morton
  2019-08-07 21:12                     ` Johannes Weiner
@ 2019-08-08 11:48                     ` Michal Hocko
  2019-08-08 15:10                       ` ndrw.xf
  2019-08-08 14:47                     ` Vlastimil Babka
  3 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-08-08 11:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On Wed 07-08-19 16:51:38, Johannes Weiner wrote:
[...]
> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 5 Aug 2019 13:15:16 -0400
> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing
> 
> Over the last few years we have had many reports that the kernel can
> enter an extended livelock situation under sufficient memory
> pressure. The system becomes unresponsive and fully IO bound for
> indefinite periods of time, and often the user has no choice but to
> reboot.

or sysrq+f

> Even though the system is clearly struggling with a shortage
> of memory, the OOM killer is not engaging reliably.
> 
> The reason is that with bigger RAM, and in particular with faster
> SSDs, page reclaim does not necessarily fail in the traditional sense
> anymore. In the time it takes the CPU to run through the vast LRU
> lists, there are almost always some cache pages that have finished
> reading in and can be reclaimed, even before userspace had a chance to
> access them. As a result, reclaim is nominally succeeding, but
> userspace is refault-bound and not making significant progress.
> 
> While this is clearly noticeable to human beings, the kernel could not
> actually determine this state with the traditional memory event
> counters. We might see a certain rate of reclaim activity or refaults,
> but how long, or whether at all, userspace is unproductive because of
> it depends on IO speed, readahead efficiency, as well as memory access
> patterns and concurrency of the userspace applications. The same
> number of VM events could go unnoticed in one system / workload
> combination, and result in an indefinite lockup in a different one.
> 
> However, eb414681d5a0 ("psi: pressure stall information for CPU,
> memory, and IO") introduced a memory pressure metric that quantifies
> the share of wallclock time in which userspace waits on reclaim,
> refaults, swapins. By using absolute time, it encodes all the above
> mentioned variables of hardware capacity and workload behavior. When
> memory pressure is 40%, it means that 40% of the time the workload is
> stalled on memory, period. This is the actual measure for the lack of
> forward progress that users can experience. It's also something they
> expect the kernel to manage and remedy if it becomes non-existent.
> 
> To accomplish this, this patch implements a thrashing cutoff for the
> OOM killer. If the kernel determines a sustained high level of memory
> pressure, and thus a lack of forward progress in userspace, it will
> trigger the OOM killer to reduce memory contention.
> 
> Per default, the OOM killer will engage after 15 seconds of at least
> 80% memory pressure. These values are tunable via sysctls
> vm.thrashing_oom_period and vm.thrashing_oom_level.

As I've said earlier, I would be somewhat more comfortable with kernel
command line/module parameter based tuning, because it is less of a
stable API and a potential future stall detector might be completely
independent of PSI and the currently exported metric. But I can live
with that, because a period and a level sound quite generic.

> Ideally, this would be standard behavior for the kernel, but since it
> involves a new metric and OOM killing, let's be safe and make it an
> opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0
> also disables the feature at runtime.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reported-by: "Artem S. Tashkinov" <aros@gmx.com>

I am not deeply familiar with PSI internals, but from a quick look it
seems that update_averages is called from an OOM-safe context (a worker).

I have scratched my head over how to deal with this "progress is made
but it is all in vain" problem inside the reclaim path, but I do not
think that will ever work, and having a watchdog like this sounds like
a step in the right direction. I didn't even expect it would look this
simple. Really nice work, Johannes!

Let's see how this ends up working in practice though.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-07 20:51                   ` Johannes Weiner
                                       ` (2 preceding siblings ...)
  2019-08-08 11:48                     ` Michal Hocko
@ 2019-08-08 14:47                     ` Vlastimil Babka
  2019-08-08 17:27                       ` Johannes Weiner
  3 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2019-08-08 14:47 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On 8/7/19 10:51 PM, Johannes Weiner wrote:
> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 5 Aug 2019 13:15:16 -0400
> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing

Thanks a lot, perhaps finally we are going to eat the elephant ;)

I've tested this by booting with mem=8G and activating browser tabs as
long as I could. Then initially the system started thrashing and didn't
recover for minutes. Then I realized sysrq+f is disabled... Fixed that
up after next reboot, tried lower thresholds, also started monitoring
/proc/pressure/memory, and found out that after minutes of not being
able to move the cursor, both avg10 and avg60 show only around 15 for
both some and full. Lowered thrashing_oom_level to 10 and (with
thrashing_oom_period of 5) the thrashing OOM finally started kicking,
and the system recovered by itself in reasonable time.
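
The level/period trigger exercised above can be modelled in userspace to
see how the two knobs interact. The following is only a sketch of the
state machine in the quoted patch, not kernel code: the class and method
names are made up for illustration, and the caller is assumed to feed in
the "full" avg10 value read from /proc/pressure/memory.

```python
from dataclasses import dataclass


@dataclass
class ThrashingDetector:
    """Userspace model of the patch's level/period check (hypothetical names)."""
    level: int = 80        # percent, mirrors vm.thrashing_oom_level
    period: float = 15.0   # seconds, mirrors vm.thrashing_oom_period
    in_pressure: bool = False
    start: float = 0.0

    def update(self, now: float, full_avg10: float) -> bool:
        """Return True when an OOM kill would be requested at time `now`."""
        if self.level == 0:                # a level of 0 disables the check
            return False
        high = int(full_avg10) >= self.level
        if not self.in_pressure:
            if high:                       # pressure episode begins
                self.in_pressure = True
                self.start = now
            return False
        if not high:                       # pressure dropped, reset
            self.in_pressure = False
            return False
        if now < self.start + self.period:
            return False                   # not sustained long enough yet
        self.in_pressure = False           # re-arm after triggering
        return True


detector = ThrashingDetector(level=10, period=5)
for t, avg in [(0, 15.0), (3, 15.0), (5, 15.0)]:
    print(t, detector.update(t, avg))
```

With level=10 and period=5, as in the test above, a sustained reading of
about 15 trips the detector once five seconds have elapsed, whereas the
default level of 80 would never be reached.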

So my conclusion is that the patch works, but there's something odd with
suspiciously low PSI memory values on my system. Any idea how to
investigate this? Also, does it matter that it's a modern desktop, so
systemd puts everything into cgroups, and the unified cgroup2 hierarchy
is also mounted?

Thanks,
Vlastimil



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 11:48                     ` Michal Hocko
@ 2019-08-08 15:10                       ` ndrw.xf
  2019-08-08 16:32                         ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: ndrw.xf @ 2019-08-08 15:10 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm



On 8 August 2019 12:48:26 BST, Michal Hocko <mhocko@kernel.org> wrote:
>> 
>> Per default, the OOM killer will engage after 15 seconds of at least
>> 80% memory pressure. These values are tunable via sysctls
>> vm.thrashing_oom_period and vm.thrashing_oom_level.
>
>As I've said earlier I would be somehow more comfortable with a kernel
>command line/module parameter based tuning because it is less of a
>stable API and potential future stall detector might be completely
>independent on PSI and the current metric exported. But I can live with
>that because a period and level sounds quite generic.

Would it be possible to reserve a fixed (configurable) amount of RAM for
caches, and trigger the OOM killer earlier, before most UI code is
evicted from memory? In my use case, I am happy sacrificing e.g. 0.5GB
and killing runaway tasks _before_ the system freezes. Potentially the
OOM killer would also work better in such conditions. I almost never
work at close to full memory capacity; it's always a single task that
goes wrong and brings the system down.

The problem with PSI sensing is that it works after the fact (after the
freeze has already occurred). It is not very different from issuing
SysRq-f manually on a frozen system, although it would still be a handy
feature for batched tasks and remote access.

Best regards, 
ndrw




* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 15:10                       ` ndrw.xf
@ 2019-08-08 16:32                         ` Michal Hocko
  2019-08-08 17:57                           ` ndrw.xf
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-08-08 16:32 UTC (permalink / raw)
  To: ndrw.xf
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On Thu 08-08-19 16:10:07, ndrw.xf@redhazel.co.uk wrote:
> 
> 
> On 8 August 2019 12:48:26 BST, Michal Hocko <mhocko@kernel.org> wrote:
> >> 
> >> Per default, the OOM killer will engage after 15 seconds of at least
> >> 80% memory pressure. These values are tunable via sysctls
> >> vm.thrashing_oom_period and vm.thrashing_oom_level.
> >
> >As I've said earlier I would be somehow more comfortable with a kernel
> >command line/module parameter based tuning because it is less of a
> >stable API and potential future stall detector might be completely
> >independent on PSI and the current metric exported. But I can live with
> >that because a period and level sounds quite generic.
> 
> Would it be possible to reserve a fixed (configurable) amount of RAM for caches,

I am afraid there is nothing like that available and I would even argue
it doesn't make much sense either. What would you consider to be a
cache? Kernel/userspace reclaimable memory? What about any other
in-kernel memory users? How would you set up such a limit and make it
reasonably maintainable over different kernel releases when the memory
footprint changes over time?

Besides that, how does that differ from the existing reclaim mechanism?
Once your cache hits the limit, some sort of reclaim would have to
happen, and then we are back to square one: reclaim is making progress
but you are effectively thrashing over the hot working set (e.g. code
pages).

> and trigger OOM killer earlier, before most UI code is evicted from memory?

How does the kernel know that important memory is evicted? E.g. say
that your graphics stack is under pressure and has to drop its internal
caches. No outstanding processes will be swapped out, yet your UI will
appear completely frozen.

> In my use case, I am happy sacrificing e.g. 0.5GB and kill runaway
> tasks _before_ the system freezes. Potentially OOM killer would also
> work better in such conditions. I almost never work at close to full
> memory capacity, it's always a single task that goes wrong and brings
> the system down.

If you know which task that is, then you can put it into a memory cgroup
with a stricter memory limit and have it killed before the overall system
starts suffering.
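
On a systemd-managed desktop, one way to do this without touching the
cgroup filesystem by hand is a transient scope. The sketch below only
builds the command line; it assumes the unified cgroup2 hierarchy (so
that the MemoryMax= property applies) and a user session with the memory
controller delegated. The helper name, the 16G limit, and crunch.py are
placeholders.

```python
import shlex


def memory_capped_command(command, memory_max="16G", user_session=True):
    """Build a systemd-run invocation that starts `command` in its own
    transient scope with a hard cgroup memory limit (hypothetical helper)."""
    argv = ["systemd-run"]
    if user_session:
        argv.append("--user")              # use the per-user service manager
    argv += ["--scope", "-p", f"MemoryMax={memory_max}"]
    argv += list(command)
    return argv


# Print the command instead of running it so the sketch has no side effects.
print(shlex.join(memory_capped_command(["python3", "crunch.py"])))
```

When the scope exceeds its limit, only tasks inside that cgroup are
candidates for the cgroup OOM kill, so the rest of the session is left
alone.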

> The problem with PSI sensing is that it works after the fact (after
> the freeze has already occurred). It is not very different from
> issuing SysRq-f manually on a frozen system, although it would still
> be a handy feature for batched tasks and remote access.

Not really. PSI gives you a metric that tells you how much time you
spend on memory reclaim. So you can start watching the system from
lower utilization already.
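
For watching from lower utilization, a userspace monitor can register a
PSI trigger and sleep until it fires, following the interface described
in Documentation/accounting/psi.rst. This is a sketch that assumes a
kernel with PSI trigger support and sufficient permission to write to
/proc/pressure/memory; the function names and the default thresholds
(150ms of stall within a 1s window) are illustrative.

```python
import os
import select

PSI_MEMORY = "/proc/pressure/memory"


def format_trigger(kind, stall_us, window_us):
    """Return a PSI trigger string: '<some|full> <stall us> <window us>'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # psi.rst limits the window to the range 500ms..10s.
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must fit inside the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(kind="some", stall_us=150_000, window_us=1_000_000,
                      path=PSI_MEMORY):
    """Block until the trigger fires; return False if the source went away."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        trigger = format_trigger(kind, stall_us, window_us)
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _fd, events in poller.poll():  # no timeout: sleeps until an event
            if events & select.POLLERR:
                return False
            if events & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

A caller would loop on wait_for_pressure() and apply its own policy when
it returns True, for example logging or killing a known runaway task.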
-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 14:47                     ` Vlastimil Babka
@ 2019-08-08 17:27                       ` Johannes Weiner
  2019-08-09 14:56                         ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-08 17:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote:
> On 8/7/19 10:51 PM, Johannes Weiner wrote:
> > From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 5 Aug 2019 13:15:16 -0400
> > Subject: [PATCH] psi: trigger the OOM killer on severe thrashing
> 
> Thanks a lot, perhaps finally we are going to eat the elephant ;)
> 
> I've tested this by booting with mem=8G and activating browser tabs as
> long as I could. Then initially the system started thrashing and didn't
> recover for minutes. Then I realized sysrq+f is disabled... Fixed that
> up after next reboot, tried lower thresholds, also started monitoring
> /proc/pressure/memory, and found out that after minutes of not being
> able to move the cursor, both avg10 and avg60 shows only around 15 for
> both some and full. Lowered thrashing_oom_level to 10 and (with
> thrashing_oom_period of 5) the thrashing OOM finally started kicking,
> and the system recovered by itself in reasonable time.

It sounds like there is a missing annotation. The time has to be going
somewhere, after all. One *known* missing vector I fixed recently is
stalls in submit_bio() itself when refaulting, but it's not merged
yet. Attaching the patch below, can you please test it?

> So my conclusion is that the patch works, but there's something odd with
> suspiciously low PSI memory values on my system. Any idea how to
> investigate this? Also, does it matter that it's a modern desktop, so
> systemd puts everything into cgroups, and the unified cgroup2 hierarchy
> is also mounted?

That shouldn't interfere because 1) pressure is reported recursively
up the cgroup tree, so unless something else runs completely fine on
the system, global pressure should reflect cgroup pressure and 2) the
systemd defaults don't set any memory limits or protections, so if
the system is hanging, it's unlikely that anything runs fine.

bcc tools (https://iovisor.github.io/bcc/) has an awesome program
called 'offcputime' that gives you stack traces of sleeping tasks.
This could give an insight into where time is going and point out
operations we might not be annotating correctly yet.

---
From 1b3888bdf075f86f226af4e350c8a88435d1fe8e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 11 Jul 2019 16:01:40 -0400
Subject: [PATCH] psi: annotate refault stalls from IO submission

psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  3 +++
 block/blk-core.c          | 23 ++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 29cd6cf4da51..4dd9ea0b068b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -805,6 +805,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
+
+	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
+		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 5d1fc8e17dd1..5993922d63fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -36,6 +36,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
+#include <linux/psi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -1127,6 +1128,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	bool workingset_read = false;
+	unsigned long pflags;
+	blk_qc_t ret;
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
@@ -1142,6 +1147,8 @@ blk_qc_t submit_bio(struct bio *bio)
 		if (op_is_write(bio_op(bio))) {
 			count_vm_events(PGPGOUT, count);
 		} else {
+			if (bio_flagged(bio, BIO_WORKINGSET))
+				workingset_read = true;
 			task_io_account_read(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
@@ -1156,7 +1163,21 @@ blk_qc_t submit_bio(struct bio *bio)
 		}
 	}
 
-	return generic_make_request(bio);
+	/*
+	 * If we're reading data that is part of the userspace
+	 * workingset, count submission time as memory stall. When the
+	 * device is congested, or the submitting cgroup IO-throttled,
+	 * submission can be a significant part of overall IO time.
+	 */
+	if (workingset_read)
+		psi_memstall_enter(&pflags);
+
+	ret = generic_make_request(bio);
+
+	if (workingset_read)
+		psi_memstall_leave(&pflags);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6a53799c3fe2..2f77e3446760 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -209,6 +209,7 @@ enum {
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
 	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
-- 
2.22.0



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 16:32                         ` Michal Hocko
@ 2019-08-08 17:57                           ` ndrw.xf
  2019-08-08 18:59                             ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: ndrw.xf @ 2019-08-08 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm



On 8 August 2019 17:32:28 BST, Michal Hocko <mhocko@kernel.org> wrote:
>
>> Would it be possible to reserve a fixed (configurable) amount of RAM
>for caches,
>
>I am afraid there is nothing like that available and I would even argue
>it doesn't make much sense either. What would you consider to be a
>cache? A kernel/userspace reclaimable memory? What about any other in
>kernel memory users? How would you setup such a limit and make it
>reasonably maintainable over different kernel releases when the memory
>footprint changes over time?

Frankly, I don't know. The earlyoom userspace tool works well enough for
me so I assumed this functionality could be implemented in the kernel.
Default thresholds would have to be tested but it is unlikely zero is
the optimum value.

>Besides that how does that differ from the existing reclaim mechanism?
>Once your cache hits the limit, there would have to be some sort of the
>reclaim to happen and then we are back to square one when the reclaim
>is
>making progress but you are effectively thrashing over the hot working
>set (e.g. code pages)

By forcing the OOM killer. Reclaiming memory when the system becomes
unresponsive is precisely what I want to avoid.

>> and trigger OOM killer earlier, before most UI code is evicted from
>memory?
>
>How does the kernel knows that important memory is evicted?

I assume the current memory management policy (LRU?) is sufficient to
keep the most frequently used pages in memory.

>If you know which task is that then you can put it into a memory cgroup
>with a stricter memory limit and have it killed before the overall
>system
>starts suffering.

This is what I intended to use. But I don't know how to bypass systemd
or configure such policies via systemd.

>PSI is giving you a metric that tells you how much time you
>spend on the memory reclaim. So you can start watching the system from
>lower utilization already.

This is fantastic news. Really. I didn't know this is how it works. Two
potential issues, though:
1. PSI (if possible) should be normalised wrt the memory reclaiming cost
(SSDs have lower cost than HDDs). If not automatically then perhaps via
a user configurable option. That's somewhat similar to having
configurable PSI thresholds.
2. It seems PSI measures the _rate_ at which pages are evicted from
memory. While this may correlate with the _absolute_ amount of memory
left, it is not the same. Perhaps weighting PSI with the absolute amount
of memory used for caches would improve this metric.

Best regards,
ndrw



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 17:57                           ` ndrw.xf
@ 2019-08-08 18:59                             ` Michal Hocko
  2019-08-08 21:59                               ` ndrw
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-08-08 18:59 UTC (permalink / raw)
  To: ndrw.xf
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On Thu 08-08-19 18:57:02, ndrw.xf@redhazel.co.uk wrote:
> 
> 
> On 8 August 2019 17:32:28 BST, Michal Hocko <mhocko@kernel.org> wrote:
> >
> >> Would it be possible to reserve a fixed (configurable) amount of RAM
> >for caches,
> >
> >I am afraid there is nothing like that available and I would even argue
> >it doesn't make much sense either. What would you consider to be a
> >cache? A kernel/userspace reclaimable memory? What about any other in
> >kernel memory users? How would you setup such a limit and make it
> >reasonably maintainable over different kernel releases when the memory
> >footprint changes over time?
> 
> Frankly, I don't know. The earlyoom userspace tool works well enough
> for me so I assumed this functionality could be implemented in
> kernel. Default thresholds would have to be tested but it is unlikely
> zero is the optimum value.

Well, I am afraid that implementing anything like that in the kernel
will lead to many regressions and bug reports. People tend to have very
different opinions on when it is suitable to kill a potentially
important part of a workload just because memory gets low.

> >Besides that how does that differ from the existing reclaim mechanism?
> >Once your cache hits the limit, there would have to be some sort of the
> >reclaim to happen and then we are back to square one when the reclaim
> >is
> >making progress but you are effectively thrashing over the hot working
> >set (e.g. code pages)
> 
> By forcing OOM killer. Reclaiming memory when system becomes unresponsive is precisely what I want to avoid.
> 
> >> and trigger OOM killer earlier, before most UI code is evicted from
> >memory?
> >
> >How does the kernel knows that important memory is evicted?
> 
> I assume current memory management policy (LRU?) is sufficient to keep most frequently used pages in memory.

The LRU aspect doesn't help much, really. If we are reclaiming the same
set of pages because they are needed for the workload to operate then we
are effectively thrashing no matter what kind of replacement policy you
are going to use.


[...]
> >PSI is giving you a metric that tells you how much time you
> >spend on the memory reclaim. So you can start watching the system from
> >lower utilization already.
> 
> This is a fantastic news. Really. I didn't know this is how it
> works. Two potential issues, though:
> 1. PSI (if possible) should be normalised wrt the memory reclaiming
> cost (SSDs have lower cost than HDDs). If not automatically then
> perhaps via a user configurable option. That's somewhat similar to
> having configurable PSI thresholds.

The cost of the reclaim is inherently reflected in those numbers
already because it gives you the amount of time that is spent getting
memory for you. If you are under memory pressure then memory
reclaim is a part of the allocation path.

> 2. It seems PSI measures the _rate_ pages are evicted from
> memory. While this may correlate with the _absolute_ amount of
> memory left, it is not the same. Perhaps weighting PSI with absolute
> amount of memory used for caches would improve this metric.

Please refer to Documentation/accounting/psi.rst for more information
about how PSI works. 
-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 18:59                             ` Michal Hocko
@ 2019-08-08 21:59                               ` ndrw
  2019-08-09  8:57                                 ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: ndrw @ 2019-08-08 21:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On 08/08/2019 19:59, Michal Hocko wrote:
> Well, I am afraid that implementing anything like that in the kernel
> will lead to many regressions and bug reports. People tend to have very
> different opinions on when it is suitable to kill a potentially
> important part of a workload just because memory gets low.

Are you proposing having a zero memory reserve or not having such an
option at all? I'm fine with the current default (zero reserve/margin).

I strongly prefer forcing the OOM killer when the system is still running
normally. Not just for preventing stalls: in my limited testing I found
the OOM killer on a stalled system rather inaccurate, occasionally
killing system services etc. I had much better experience with earlyoom.
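
The kind of reserve-based check meant here can be sketched as a threshold
on MemAvailable from /proc/meminfo. This is not earlyoom's actual code,
only an illustration of the idea; the function names, the 10% threshold,
and the sample numbers are assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def below_reserve(meminfo, min_available_percent=10.0):
    """True when MemAvailable has dropped under the configured reserve."""
    available = meminfo["MemAvailable"]
    total = meminfo["MemTotal"]
    return 100.0 * available / total < min_available_percent


# Made-up sample values; a real monitor would read /proc/meminfo instead.
SAMPLE = """MemTotal:        4000000 kB
MemFree:          120000 kB
MemAvailable:     300000 kB
"""

print(below_reserve(parse_meminfo(SAMPLE)))
```

A monitor built on this would poll periodically and pick a victim itself
while the system is still responsive, which is the behaviour being
contrasted with the in-kernel OOM killer.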

> LRU aspect doesn't help much, really. If we are reclaiming the same set
> of pages because they are needed for the workload to operate then we are
> effectively thrashing no matter what kind of replacement policy you are
> going to use.

In my case it would work fine (my system already works well with
earlyoom, and without it it remains responsive until the last couple
hundred MB of RAM).


>>> PSI is giving you a metric that tells you how much time you
>>> spend on the memory reclaim. So you can start watching the system from
>>> lower utilization already.

I've tested it on a system with 45GB of RAM, SSD, swap disabled (my 
intention was to approximate a worst-case scenario) and it didn't really
detect the stall before it happened. I can see some activity after reaching
~42GB, the system remains fully responsive until it suddenly freezes and 
requires sysrq-f. PSI appears to increase a bit when the system is about 
to run out of memory but the change is so small it would be difficult to 
set a reliable threshold. I expect the PSI numbers to increase 
significantly after the stall (I wasn't able to capture them) but, as 
mentioned above, I was hoping for a solution that would work before the 
stall.

$ while true; do sleep 1; cat /proc/pressure/memory ; done
[starting a test script and waiting for several minutes to fill up memory]
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
full avg10=0.00 avg60=0.00 avg300=0.00 total=6442
some avg10=0.00 avg60=0.00 avg300=0.00 total=18950
full avg10=0.00 avg60=0.00 avg300=0.00 total=11576
some avg10=0.00 avg60=0.00 avg300=0.00 total=25655
full avg10=0.00 avg60=0.00 avg300=0.00 total=16159
some avg10=0.00 avg60=0.00 avg300=0.00 total=31438
full avg10=0.00 avg60=0.00 avg300=0.00 total=19552
some avg10=0.00 avg60=0.00 avg300=0.00 total=44549
full avg10=0.00 avg60=0.00 avg300=0.00 total=27772
some avg10=0.00 avg60=0.00 avg300=0.00 total=52520
full avg10=0.00 avg60=0.00 avg300=0.00 total=32580
some avg10=0.00 avg60=0.00 avg300=0.00 total=60451
full avg10=0.00 avg60=0.00 avg300=0.00 total=37704
some avg10=0.00 avg60=0.00 avg300=0.00 total=68986
full avg10=0.00 avg60=0.00 avg300=0.00 total=42859
some avg10=0.00 avg60=0.00 avg300=0.00 total=76598
full avg10=0.00 avg60=0.00 avg300=0.00 total=48370
some avg10=0.00 avg60=0.00 avg300=0.00 total=83080
full avg10=0.00 avg60=0.00 avg300=0.00 total=52930
some avg10=0.00 avg60=0.00 avg300=0.00 total=89384
full avg10=0.00 avg60=0.00 avg300=0.00 total=56350
some avg10=0.00 avg60=0.00 avg300=0.00 total=95293
full avg10=0.00 avg60=0.00 avg300=0.00 total=60260
some avg10=0.00 avg60=0.00 avg300=0.00 total=101566
full avg10=0.00 avg60=0.00 avg300=0.00 total=64408
some avg10=0.00 avg60=0.00 avg300=0.00 total=108131
full avg10=0.00 avg60=0.00 avg300=0.00 total=68412
some avg10=0.00 avg60=0.00 avg300=0.00 total=121932
full avg10=0.00 avg60=0.00 avg300=0.00 total=77413
some avg10=0.00 avg60=0.00 avg300=0.00 total=140807
full avg10=0.00 avg60=0.00 avg300=0.00 total=91269
some avg10=0.00 avg60=0.00 avg300=0.00 total=170494
full avg10=0.00 avg60=0.00 avg300=0.00 total=110611
[stall, sysrq-f]
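
The averages in the log above print as 0.00, but the total= counters
(cumulative stall time in microseconds, per psi.rst) keep growing.
Differencing consecutive totals gives a per-interval stall share. The
sketch below does that for the last four "some" samples from the log,
assuming the samples are about one second apart as the sleep 1 loop
suggests; the function name is made up.

```python
def stall_percent(totals_us, interval_s=1.0):
    """Convert cumulative PSI totals (microseconds) into the percentage of
    each sampling interval that was spent stalled."""
    interval_us = interval_s * 1_000_000
    return [100.0 * (later - earlier) / interval_us
            for earlier, later in zip(totals_us, totals_us[1:])]


# Last four "some" totals from the log above.
SOME_TOTALS_US = [108131, 121932, 140807, 170494]

for pct in stall_percent(SOME_TOTALS_US):
    print(f"{pct:.1f}% of the interval stalled")
```

For these samples the share rises from roughly 1% to roughly 3% per
interval just before the stall, which matches the observation that the
change is small and hard to set a reliable threshold on.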

Best regards,

ndrw




* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 21:59                               ` ndrw
@ 2019-08-09  8:57                                 ` Michal Hocko
  2019-08-09 10:09                                   ` ndrw
  2019-08-10 21:07                                   ` ndrw
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2019-08-09  8:57 UTC (permalink / raw)
  To: ndrw
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On Thu 08-08-19 22:59:32, ndrw wrote:
> On 08/08/2019 19:59, Michal Hocko wrote:
> > Well, I am afraid that implementing anything like that in the kernel
> > will lead to many regressions and bug reports. People tend to have very
> > different opinions on when it is suitable to kill a potentially
> > important part of a workload just because memory gets low.
> 
> Are you proposing having a zero memory reserve or not having such option at
> all? I'm fine with the current default (zero reserve/margin).

We already do have a reserve (min_free_kbytes). That gives kswapd some
room to perform reclaim in the background without obvious latencies to
allocating tasks (well, CPU is still used, so there is still some effect).

Kswapd tries to keep a balance and free memory low but still with some
room to satisfy an immediate memory demand. Once kswapd doesn't catch up
with the memory demand we dive into the direct reclaim and that is where
people usually see latencies coming from.

The main problem here is that it is hard to tell from a single
allocation latency that we have a bigger problem. As already said, the
usual thrashing scenario doesn't show a problem during reclaim because
pages can be freed up very efficiently. The problem is that they are
refaulted very quickly so we are effectively rotating the working set
like crazy. Compare that to a normal used-once streaming IO workload
which is generating a lot of page cache that can be recycled at a
similar pace but where the working set doesn't get freed. Free memory
figures will look very similar in both cases.
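
One way to tell the two cases apart from userspace is to look at refault
counters rather than free memory. The sketch below computes a refault
rate from two /proc/vmstat snapshots. It assumes a counter named
workingset_refault, which is what kernels of this era export (newer
kernels split it into per-type counters); the sample snapshots are
invented for illustration.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        if line.strip():
            name, value = line.split()
            counters[name] = int(value)
    return counters


def refault_rate(before, after, seconds, key="workingset_refault"):
    """Refaults per second between two snapshots taken `seconds` apart."""
    return (after[key] - before[key]) / seconds


# Invented snapshots taken 10 seconds apart.
BEFORE = parse_vmstat("pgpgin 1000\nworkingset_refault 5000\n")
AFTER = parse_vmstat("pgpgin 900000\nworkingset_refault 125000\n")

print(refault_rate(BEFORE, AFTER, seconds=10))
```

A streaming workload would show heavy page-in traffic with few refaults,
while thrashing shows the refault counter climbing along with it.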

> I strongly prefer forcing OOM killer when the system is still running
> normally. Not just for preventing stalls: in my limited testing I found the
> OOM killer on a stalled system rather inaccurate, occasionally killing
> system services etc. I had much better experience with earlyoom.

Good that earlyoom works for you. All I am saying is that this is not
a generally applicable heuristic because we do care about a larger variety
of workloads. I should probably emphasise that the OOM killer is there
as a _last resort_ handbrake when something goes terribly wrong. It
operates at times when any user intervention would be really hard
because there is a lack of resources to be actionable.

[...]
> > > > PSI is giving you a metric that tells you how much time you
> > > > spend on the memory reclaim. So you can start watching the system from
> > > > lower utilization already.
> 
> I've tested it on a system with 45GB of RAM, SSD, swap disabled (my
> intention was to approximate a worst-case scenario) and it didn't really
> detect stall before it happened. I can see some activity after reaching
> ~42GB, the system remains fully responsive until it suddenly freezes and
> requires sysrq-f.

This is useful feedback! What was your workload? Which kernel version?

-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09  8:57                                 ` Michal Hocko
@ 2019-08-09 10:09                                   ` ndrw
  2019-08-09 10:50                                     ` Michal Hocko
  2019-08-10 21:07                                   ` ndrw
  1 sibling, 1 reply; 39+ messages in thread
From: ndrw @ 2019-08-09 10:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On 09/08/2019 09:57, Michal Hocko wrote:
> We already do have a reserve (min_free_kbytes). That gives kswapd some
> room to perform reclaim in the background without obvious latencies to
> allocating tasks (well CPU still be used so there is still some effect).

I tried this option in the past. Unfortunately, it didn't prevent
freezes. My understanding is this option reserves some amount of memory
to not be swapped out but does not prevent the kernel from evicting all
pages from cache when more memory is needed.

> Kswapd tries to keep a balance and free memory low but still with some
> room to satisfy an immediate memory demand. Once kswapd doesn't catch up
> with the memory demand we dive into the direct reclaim and that is where
> people usually see latencies coming from.

Reclaiming memory is fine, of course, but not all the way to 0 caches.
No caches means all executable pages and ro pages (e.g. fonts) are evicted
from memory and have to be constantly reloaded on every user action. All
this while competing with tasks that are using up all memory. This
happens with or without swap, although swap does spread this issue in
time a bit.

> The main problem here is that it is hard to tell from a single
> allocation latency that we have a bigger problem. As already said, the
> usual thrashing scenario doesn't show problem during the reclaim because
> pages can be freed up very efficiently. The problem is that they are
> refaulted very quickly so we are effectively rotating working set like
> crazy. Compare that to a normal used-once streaming IO workload which is
> generating a lot of page cache that can be recycled in a similar pace
> but a working set doesn't get freed. Free memory figures will look very
> similar in both cases.

Thank you for the explanation. It is indeed a difficult problem - some
cached pages (streaming IO) will likely not be needed again and should
be discarded asap, while others (like mmapped executable/ro pages of UI
utilities) will cause thrashing when evicted under high memory pressure.
Another aspect is that PSI is probably not the best measure for
detecting imminent thrashing. However, if it can at least detect a
freeze that has already occurred and force the OOM killer, that is still
a lot better than a dead system, which is the current user experience.

> Good that earlyoom works for you.

I am giving it as an example of a heuristic that seems to work very well
for me. Something to look into. And yes, I wouldn't mind having such a
mechanism built into the kernel.

>   All I am saying is that this is not
> generally applicable heuristic because we do care about a larger variety
> of workloads. I should probably emphasise that the OOM killer is there
> as a _last resort_ hand break when something goes terribly wrong. It
> operates at times when any user intervention would be really hard
> because there is a lack of resources to be actionable.

It is indeed a last resort solution - without it the system is unusable. 
Still, accuracy matters because killing a wrong task does not fix the 
problem (a task hogging memory is still running) and may break the 
system anyway if something important is killed instead.

[...]

> This is a useful feedback! What was your workload? Which kernel version?

I tested it by running a python script that processes a large amount of 
data in memory (needs around 15GB of RAM). I normally run 2 instances of 
that script in parallel but for testing I started 4 of them. I sometimes 
experience the same issue when using multiple regular memory intensive 
desktop applications in a manner described in the first post but that's 
harder to reproduce because of the user input needed.

[    0.000000] Linux version 5.0.0-21-generic (buildd@lgw01-amd64-036) 
(gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #22-Ubuntu SMP Tue Jul 2 
13:27:33 UTC 2019 (Ubuntu 5.0.0-21.22-generic 5.0.15)
AMD CPU with 4 cores, 8 threads. AMDGPU graphics stack.

Best regards,

ndrw




* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09 10:09                                   ` ndrw
@ 2019-08-09 10:50                                     ` Michal Hocko
  2019-08-09 14:18                                       ` Pintu Agarwal
  2019-08-10 12:34                                       ` ndrw
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2019-08-09 10:50 UTC (permalink / raw)
  To: ndrw
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On Fri 09-08-19 11:09:33, ndrw wrote:
> On 09/08/2019 09:57, Michal Hocko wrote:
> > We already do have a reserve (min_free_kbytes). That gives kswapd some
> > room to perform reclaim in the background without obvious latencies to
> > allocating tasks (well CPU still be used so there is still some effect).
> 
> I tried this option in the past. Unfortunately, I didn't prevent freezes. My
> understanding is this option reserves some amount of memory to not be

to not be used by normal allocations. It defines reclaim watermarks and
that influences when the background and direct reclaim start to act.

> swapped out but does not prevent the kernel from evicting all pages from
> cache when more memory is needed.

It doesn't have any say on the actual decision on what to reclaim.

> > Kswapd tries to keep a balance and free memory low but still with some
> > room to satisfy an immediate memory demand. Once kswapd doesn't catch up
> > with the memory demand we dive into the direct reclaim and that is where
> > people usually see latencies coming from.
> 
> Reclaiming memory is fine, of course, but not all the way to 0 caches. No
> caches means all executable pages, ro pages (e.g. fonts) are evicted from
> memory and have to be constantly reloaded on every user action. All this
> while competing with tasks that are using up all memory. This happens with
> of without swap, although swap does spread this issue in time a bit.

We try to protect a small amount of cache. Have a look at the
get_scan_count function. But the exact amount of cache to be protected
is really hard to know without a crystal ball or an understanding of the
workload. The kernel has neither of the two.

> > The main problem here is that it is hard to tell from a single
> > allocation latency that we have a bigger problem. As already said, the
> > usual trashing scenario doesn't show problem during the reclaim because
> > pages can be freed up very efficiently. The problem is that they are
> > refaulted very quickly so we are effectively rotating working set like
> > crazy. Compare that to a normal used-once streaming IO workload which is
> > generating a lot of page cache that can be recycled in a similar pace
> > but a working set doesn't get freed. Free memory figures will look very
> > similar in both cases.
> 
> Thank you for the explanation. It is indeed a difficult problem - some
> cached pages (streaming IO) will likely not be needed again and should be
> discarded asap, other (like mmapped executable/ro pages of UI utilities)
> will cause thrashing when evicted under high memory pressure. Another aspect
> is that PSI is probably not the best measure of detecting imminent
> thrashing. However, if it can at least detect a freeze that has already
> occurred and force the OOM killer that is still a lot better than a dead
> system, which is the current user experience.

We have been thinking about this problem for a long time and couldn't
come up with anything much better than what we have now. PSI is the most
recent improvement in that area. If you have better ideas then patches
are always welcome.

> > Good that earlyoom works for you.
> 
> I am giving it as an example of a heuristic that seems to work very well for
> me. Something to look into. And yes, I wouldn't mind having such mechanism
> built into the kernel.
> 
> >   All I am saying is that this is not
> > generally applicable heuristic because we do care about a larger variety
> > of workloads. I should probably emphasise that the OOM killer is there
> > as a _last resort_ hand break when something goes terribly wrong. It
> > operates at times when any user intervention would be really hard
> > because there is a lack of resources to be actionable.
> 
> It is indeed a last resort solution - without it the system is unusable.
> Still, accuracy matters because killing a wrong task does not fix the
> problem (a task hogging memory is still running) and may break the system
> anyway if something important is killed instead.

That is a completely orthogonal problem, I am afraid. So far we have
been discussing _when_ to trigger the OOM killer. This is _who_ to kill.
I haven't heard of any recent examples where the victim selection was
way off and killed something obviously incorrect.

> [...]
> 
> > This is a useful feedback! What was your workload? Which kernel version?
> 
> I tested it by running a python script that processes a large amount of data
> in memory (needs around 15GB of RAM). I normally run 2 instances of that
> script in parallel but for testing I started 4 of them. I sometimes
> experience the same issue when using multiple regular memory intensive
> desktop applications in a manner described in the first post but that's
> harder to reproduce because of the user input needed.

Something that other people can play with to reproduce the issue would
be more than welcome.

-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09 10:50                                     ` Michal Hocko
@ 2019-08-09 14:18                                       ` Pintu Agarwal
  2019-08-10 12:34                                       ` ndrw
  1 sibling, 0 replies; 39+ messages in thread
From: Pintu Agarwal @ 2019-08-09 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ndrw, Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm, Pintu Kumar

[...]

Hi,

This is an interesting topic for me so I would like to join the conversation.
I will be glad if I can be of any help here, either in testing PSI or
in verifying some scenarios and observations.

I have some experience working with low-memory embedded devices, with
RAM as low as 128MB or 256MB and mostly less than 1GB, with or without
display and DRM/graphics support, and with ZRAM configured as swap space
at 25% of RAM size.
The eMMC storage space is also small, 4GB or 8GB at most.

So I have run into these sluggishness, hang, and OOM-kill issues quite
a number of times, and I would like to share my experience and
observations here.

Recently, I have been exploring the PSI feature in my ARM
Qemu/Beagle-Bone environment, so I can share some feedback on this as
well.

Sluggish system behavior can result from 4 types of pressure (especially
on smartphone devices):
* memory allocation pressure
* I/O pressure
* Scheduling pressure
* Network pressure

I think the topic of concern here is: memory pressure.
So, I would like to share some thoughts about this.

* In my opinion, memory pressure should be internal to the system and
not visible to the end users.
* The pressure metrics can vary from system to system, so it's
difficult to apply a single policy.
* I guess this is the time to apply "Machine Learning" and "Artificial
Intelligence" to the system :)

* Memory pressure starts with how often and how quickly the system is
entering the slow-path.
  Thus slow-path monitoring may give some clue about pressure building
in the system.
  For this I used to use a slow-path-counter.
  Too many slow-path entries right from the beginning indicate that the
system needs to be re-designed.

* The system should avoid entering the slow-path again and again,
thus avoiding pressure.
  If this happens, then it's time to reclaim memory in large chunks
rather than in small ones.
  Maybe it's time to think about a shrink_all_memory() knob in the kernel.
  It could be run as bottom-half processing, maybe from cgroups.

* Some experiments were done in the past. Interested people can check this paper:
  http://events17.linuxfoundation.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf

* The system is already behaving sluggishly even before it enters the
oom-kill stage.
  So most of the time the oom stage is skipped, does not occur, or is
just looping around.
  Thus, some kind of oom-monitoring may help to gather evidence.
  That's the reason I proposed something called an oom-stall-counter.
It means the system is entering oom, but not necessarily oom-kill.
  If this counter is updated, we assume that the system has started
behaving sluggishly.

* An oom-kill-counter can also help in determining how much killing is
happening in kernel space.
  Example: If PSI pressure is building up and this counter is not updating...
  But in any case, killing system daemons should be avoided.

* Some killing policy should be left to user space. So a standard
system-daemon (or kthread) should be designed along these lines.
  It should be configured dynamically based on the system and oom-score.
  In my previous experience with Tizen, we used something called the
resourced daemon:
  https://git.tizen.org/cgit/platform/core/system/resourced/tree/src/memory?h=tizen

* Instead of a static policy there should be something called a "Dynamic
Low Memory Manager" (DLLM) policy.
  That is, at every stage (slow-path, swapping, compaction-fail,
reclaim-fail, oom) some action can be taken.
  Earlier these events were triggered using vmpressure, but now that can
be replaced with PSI.

* Another major culprit for sluggishness in the long run is system
daemons occupying all of the swap space and never releasing it.
  So even if applications are killed due to oom, it may not help much,
since daemons will never be killed.
  So I proposed something called "Dynamic Swappiness", where the
swappiness of daemons can be lowered dynamically, while normal
applications have higher values.
  In the past I have done several experiments on this; I will be
publishing a paper on it soon.

* Maybe it would help our understanding if we start from a very
minimal scale (just 64MB to 512MB RAM) with busy-box.
  If we can tune this perfectly, then large-scale systems will
automatically have no issues.

With respect to PSI, here are my observations:
* The PSI memory averaging windows (10, 60, 300 seconds) are too long
for an embedded system.
  I think these settings should be dynamic, or user configurable, or
there should be one more entry for 1s or less.
* PSI memory values are updated after the oom-kill in the kernel has
already happened, which means the sluggishness has already occurred.
  So I have to use the "total" field and monitor the difference manually.
  For example, if the difference between previous-total and next-total
is more than 100ms and rising, then we suspect OOM.
* Currently, PSI values are system-wide. That is, after sluggishness
has occurred, it is difficult to tell which task caused it.
  So I was thinking of adding a new entry to capture task details as well.
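
The manual "total" monitoring described above can be sketched roughly as
follows. This is only an illustration, not the tooling actually used:
the function names and the 100ms default are hypothetical, the input is
assumed to be text in the /proc/pressure/memory format, and "total" is
taken to be cumulative stall time in microseconds.

```python
def parse_psi(text):
    """Parse text in the /proc/pressure/memory format into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is "some" or "full"
        result[kind] = {key: float(value)
                        for key, value in (f.split("=") for f in fields)}
    return result


def stall_delta_ms(prev_text, next_text, kind="some"):
    """Stall time accumulated between two samples, in milliseconds.

    Assumes the "total" field is cumulative stall time in microseconds.
    """
    prev_total = parse_psi(prev_text)[kind]["total"]
    next_total = parse_psi(next_text)[kind]["total"]
    return (next_total - prev_total) / 1000.0


def suspect_oom(deltas_ms, limit_ms=100.0):
    """True if the latest delta exceeds limit_ms and is still rising.

    limit_ms is a hypothetical tunable, not a kernel parameter.
    """
    return (len(deltas_ms) >= 2
            and deltas_ms[-1] > limit_ms
            and deltas_ms[-1] > deltas_ms[-2])
```

A monitor would read /proc/pressure/memory at a fixed interval, append
each stall_delta_ms() result to a list, and act once suspect_oom()
returns True.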


These are some of my opinions. They may or may not be directly applicable.
Further brainstorming or discussion might be required.



Regards,
Pintu



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-08 17:27                       ` Johannes Weiner
@ 2019-08-09 14:56                         ` Vlastimil Babka
  2019-08-09 17:31                           ` Johannes Weiner
  0 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2019-08-09 14:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On 8/8/19 7:27 PM, Johannes Weiner wrote:
> On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote:
>> On 8/7/19 10:51 PM, Johannes Weiner wrote:
>>> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
>>> From: Johannes Weiner <hannes@cmpxchg.org>
>>> Date: Mon, 5 Aug 2019 13:15:16 -0400
>>> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing
>>
>> Thanks a lot, perhaps finally we are going to eat the elephant ;)
>>
>> I've tested this by booting with mem=8G and activating browser tabs as
>> long as I could. Then initially the system started thrashing and didn't
>> recover for minutes. Then I realized sysrq+f is disabled... Fixed that
>> up after next reboot, tried lower thresholds, also started monitoring
>> /proc/pressure/memory, and found out that after minutes of not being
>> able to move the cursor, both avg10 and avg60 shows only around 15 for
>> both some and full. Lowered thrashing_oom_level to 10 and (with
>> thrashing_oom_period of 5) the thrashing OOM finally started kicking,
>> and the system recovered by itself in reasonable time.
> 
> It sounds like there is a missing annotation. The time has to be going
> somewhere, after all. One *known* missing vector I fixed recently is
> stalls in submit_bio() itself when refaulting, but it's not merged
> yet. Attaching the patch below, can you please test it?

It made a difference, but not enough, it seems. Before the patch I could
observe "io:full avg10" around 75% and "memory:full avg10" around 20%,
after the patch, "memory:full avg10" went to around 45%, while io stayed
the same (BTW should the refaults be discounted from the io counters, so
that the sum is still <=100%?)
As a result I could change the knobs to recover successfully with
thrashing detected for 10s of 40% memory pressure.
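
The condition "pressure at or above some level for some period" can be
sketched in user space as below. This is only an illustration of the
idea and not how the patch under test implements it; the function and
parameter names are hypothetical, and the samples are assumed to be
(timestamp in seconds, "memory:full avg10" in percent) pairs in time
order.

```python
def sustained_pressure(samples, level, period):
    """Return True once pressure stayed >= level for at least period seconds.

    samples: iterable of (timestamp_seconds, pressure_percent), time-ordered.
    level, period: hypothetical stand-ins for a pressure threshold and a
    duration; any dip below level restarts the clock.
    """
    start = None
    for timestamp, value in samples:
        if value >= level:
            if start is None:
                start = timestamp        # pressure episode begins here
            if timestamp - start >= period:
                return True
        else:
            start = None                 # episode interrupted, start over
    return False
```

With level=40 and period=10, three samples of 45, 42 and 41 taken 5
seconds apart satisfy the condition, while a single dip to 30 in the
middle resets it.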

Perhaps being low on memory we can't detect refaults so well due to
limited number of shadow entries, or there was genuine non-refault I/O
in the mix. The detection would then probably have to look at both I/O
and memory?

Thanks,
Vlastimil



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09 14:56                         ` Vlastimil Babka
@ 2019-08-09 17:31                           ` Johannes Weiner
  2019-08-13 13:47                             ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-08-09 17:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On Fri, Aug 09, 2019 at 04:56:28PM +0200, Vlastimil Babka wrote:
> On 8/8/19 7:27 PM, Johannes Weiner wrote:
> > On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote:
> >> On 8/7/19 10:51 PM, Johannes Weiner wrote:
> >>> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001
> >>> From: Johannes Weiner <hannes@cmpxchg.org>
> >>> Date: Mon, 5 Aug 2019 13:15:16 -0400
> >>> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing
> >>
> >> Thanks a lot, perhaps finally we are going to eat the elephant ;)
> >>
> >> I've tested this by booting with mem=8G and activating browser tabs as
> >> long as I could. Then initially the system started thrashing and didn't
> >> recover for minutes. Then I realized sysrq+f is disabled... Fixed that
> >> up after next reboot, tried lower thresholds, also started monitoring
> >> /proc/pressure/memory, and found out that after minutes of not being
> >> able to move the cursor, both avg10 and avg60 shows only around 15 for
> >> both some and full. Lowered thrashing_oom_level to 10 and (with
> >> thrashing_oom_period of 5) the thrashing OOM finally started kicking,
> >> and the system recovered by itself in reasonable time.
> > 
> > It sounds like there is a missing annotation. The time has to be going
> > somewhere, after all. One *known* missing vector I fixed recently is
> > stalls in submit_bio() itself when refaulting, but it's not merged
> > yet. Attaching the patch below, can you please test it?
> 
> It made a difference, but not enough, it seems. Before the patch I could
> observe "io:full avg10" around 75% and "memory:full avg10" around 20%,
> after the patch, "memory:full avg10" went to around 45%, while io stayed
> the same (BTW should the refaults be discounted from the io counters, so
> that the sum is still <=100%?)
>
> As a result I could change the knobs to recover successfully with
> thrashing detected for 10s of 40% memory pressure.
> 
> Perhaps being low on memory we can't detect refaults so well due to
> limited number of shadow entries, or there was genuine non-refault I/O
> in the mix. The detection would then probably have to look at both I/O
> and memory?

Thanks for testing it. It's possible that there is legitimate
non-refault IO, and there can be interaction of course between that
and the refault IO. But to be sure that all genuine refaults are
captured, can you record the workingset_* values from /proc/vmstat
before/after the thrash storm? In particular, workingset_nodereclaim
would indicate whether we are losing refault information.

[ The different resource pressures are not meant to be summed
  up. Refaults truly are both IO events and memory events: they
  indicate memory contention, but they also contribute to the IO
  load. So both metrics need to include them, or it would skew the
  picture when you only look at one of them. ]



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09 10:50                                     ` Michal Hocko
  2019-08-09 14:18                                       ` Pintu Agarwal
@ 2019-08-10 12:34                                       ` ndrw
  2019-08-12  8:24                                         ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: ndrw @ 2019-08-10 12:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On 09/08/2019 11:50, Michal Hocko wrote:
> We try to protect low amount of cache. Have a look at get_scan_count
> function. But the exact amount of the cache to be protected is really
> hard to know wihtout a crystal ball or understanding of the workload.
> The kernel doesn't have neither of the two.

Thank you. I'm familiarizing myself with the code. Is there anyone I 
could discuss some details with? I don't want to create too much noise here.

For example, are file pages created by mmapping files and are anon pages
exclusively allocated on heap (RW data)? If so, where do "streaming IO"
pages belong?

> We have been thinking about this problem for a long time and couldn't
> come up with anything much better than we have now. PSI is the most recent
> improvement in that area. If you have better ideas then patches are
> always welcome.

In general, I found there are very few user accessible knobs for
adjusting caching, especially in the pre-OOM phase. On the other hand,
swapping and dirty page caching have many options or can even be
disabled completely.

For example, I would like to try disabling/limiting eviction of some/all 
file pages (for example exec pages) akin to disabling swapping, but 
there is no such mechanism. Yes, there would likely be problems with 
large RO mmapped files that would need to be addressed, but in many 
applications users would be interested in having such options.

Adjusting how aggressive/conservative the system should be with the OOM 
killer also falls into this category.

>> [OOM killer accuracy]
> That is a completely orthogonal problem, I am afraid. So far we have
> been discussing _when_ to trigger OOM killer. This is _who_ to kill. I
> haven't heard any recent examples that the victim selection would be way
> off and killing something obviously incorrect.

You are right. I've assumed earlyoom is more accurate because of OOM 
killer performing better on a system that isn't stalled yet (perhaps it 
does). But actually, earlyoom doesn't trigger OOM killer at all:

https://github.com/rfjakob/earlyoom#why-not-trigger-the-kernel-oom-killer

Apparently some applications (chrome and electron-based tools) set their 
oom_score_adj incorrectly - this matches my observations of OOM killer 
behavior:

https://bugs.chromium.org/p/chromium/issues/detail?id=333617

> Something that other people can play with to reproduce the issue would
> be more than welcome.

This is the script I used. It reliably reproduces the issue: 
https://github.com/ndrw6/import_postcodes/blob/master/import_postcodes.py 
but it has quite a few dependencies, needs some input data and, in 
general, does a lot more than just fill up the memory. I will try to 
come up with something simpler.

Best regards,

ndrw




* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09  8:57                                 ` Michal Hocko
  2019-08-09 10:09                                   ` ndrw
@ 2019-08-10 21:07                                   ` ndrw
  1 sibling, 0 replies; 39+ messages in thread
From: ndrw @ 2019-08-10 21:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On 09/08/2019 09:57, Michal Hocko wrote:
> This is a useful feedback! What was your workload? Which kernel version? 

With 16GB zram swap and swappiness=60 I get avg10 memory PSI numbers of
about 10 when swap is half filled and ~30 immediately before the
freeze. Swapping with zram has less effect on system responsiveness
compared to swapping to an SSD, so, if combined with the proposed
PSI-triggered OOM killer, this could be a viable solution.

Still, using swap only to make PSI sensing work, when triggering the
OOM killer at non-zero available memory would do the job just as well,
is a bit of an overkill. I don't really need these extra few GB of
memory, I just want to get rid of system freezes. Perhaps we could have
both heuristics.

Best regards,

ndrw




* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-10 12:34                                       ` ndrw
@ 2019-08-12  8:24                                         ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-08-12  8:24 UTC (permalink / raw)
  To: ndrw
  Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka,
	Artem S. Tashkinov, Andrew Morton, LKML, linux-mm

On Sat 10-08-19 13:34:06, ndrw wrote:
> On 09/08/2019 11:50, Michal Hocko wrote:
> > We try to protect low amount of cache. Have a look at get_scan_count
> > function. But the exact amount of the cache to be protected is really
> > hard to know wihtout a crystal ball or understanding of the workload.
> > The kernel doesn't have neither of the two.
> 
> Thank you. I'm familiarizing myself with the code. Is there anyone I could
> discuss some details with? I don't want to create too much noise here.

linux-mm mailing list sounds like a good fit.

> For example, are file pages created by mmaping files and are anon page
> exclusively allocated on heap (RW data)? If so, where do "streaming IO"
> pages belong to?

Page cache is generated by both buffered IO (read/write) and file
mmaps. Anonymous memory comes from MAP_ANON mappings and from
MAP_PRIVATE mappings of files (once they are written to).
Streaming IO generally refers to a single pass over data that is not
reused later (e.g. a backup).
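
The distinction can be seen from user space with a small sketch like
the one below. It is an illustration only, for Linux: demo() is a
hypothetical helper, and the comments describe the expected kernel
behavior rather than anything the script measures.

```python
import mmap
import os
import tempfile


def demo():
    """Touch one page through each of the mapping types discussed above."""
    size = mmap.PAGESIZE
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"A" * size)
        f.flush()
        fd = f.fileno()

        # Shared read-only file mapping: backed by page cache pages.
        shared = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ)
        # Private file mapping: reads come from the page cache, but the
        # first write copies the page into anonymous memory (COW).
        private = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
        private[0:1] = b"B"
        # Anonymous mapping: no file behind it at all.
        anon = mmap.mmap(-1, size,
                         flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        anon[0:1] = b"C"

        result = {
            "file": os.pread(fd, 1, 0),   # buffered read via page cache
            "shared": shared[0:1],
            "private": private[0:1],
            "anon": anon[0:1],
        }
        for mapping in (shared, private, anon):
            mapping.close()
        return result
```

The private write is visible only through the private mapping; the file
and the shared mapping still show the original byte, which is why the
written page has to live in anonymous memory.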

> > We have been thinking about this problem for a long time and couldn't
> > come up with anything much better than we have now. PSI is the most recent
> > improvement in that area. If you have better ideas then patches are
> > always welcome.
> 
> In general, I found there are very few user accessible knobs for adjusting
> caching, especially in the pre-OOM phase. On the other hand, swapping, dirty
> page caching, have many options or can even be disabled completely.
> 
> For example, I would like to try disabling/limiting eviction of some/all
> file pages (for example exec pages) akin to disabling swapping, but there is
> no such mechanism. Yes, there would likely be problems with large RO mmapped
> files that would need to be addressed, but in many applications users would
> be interested in having such options.
> 
> Adjusting how aggressive/conservative the system should be with the OOM
> killer also falls into this category.

What would that mean and how would it be configured?
-- 
Michal Hocko
SUSE Labs



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-09 17:31                           ` Johannes Weiner
@ 2019-08-13 13:47                             ` Vlastimil Babka
  0 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2019-08-13 13:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov,
	Andrew Morton, LKML, linux-mm

On 8/9/19 7:31 PM, Johannes Weiner wrote:
>> It made a difference, but not enough, it seems. Before the patch I could
>> observe "io:full avg10" around 75% and "memory:full avg10" around 20%,
>> after the patch, "memory:full avg10" went to around 45%, while io stayed
>> the same (BTW should the refaults be discounted from the io counters, so
>> that the sum is still <=100%?)
>>
>> As a result I could change the knobs to recover successfully with
>> thrashing detected for 10s of 40% memory pressure.
>>
>> Perhaps being low on memory we can't detect refaults so well due to
>> limited number of shadow entries, or there was genuine non-refault I/O
>> in the mix. The detection would then probably have to look at both I/O
>> and memory?
> 
> Thanks for testing it. It's possible that there is legitimate
> non-refault IO, and there can be interaction of course between that
> and the refault IO. But to be sure that all genuine refaults are
> captured, can you record the workingset_* values from /proc/vmstat
> before/after the thrash storm? In particular, workingset_nodereclaim
> would indicate whether we are losing refault information.

Let's see... after a ~45 second stall that I ended with alt-sysrq-f, I
see the following pressure info:

cpu:some avg10=1.04 avg60=2.22 avg300=2.01 total=147402828
io:some avg10=97.13 avg60=65.48 avg300=28.86 total=240442256
io:full avg10=83.93 avg60=57.05 avg300=24.56 total=212125506
memory:some avg10=54.62 avg60=33.69 avg300=15.89 total=67989547
memory:full avg10=44.48 avg60=28.17 avg300=13.17 total=55963961

Captured vmstat workingset values

before:
workingset_nodes 15756
workingset_refault 6111959
workingset_activate 1805063
workingset_restore 919138
workingset_nodereclaim 40796
pgpgin 33889644

after:
workingset_nodes 14842
workingset_refault 9248248
workingset_activate 1966317
workingset_restore 961179
workingset_nodereclaim 41060
pgpgin 46488352

Doesn't seem like losing too much refault info, and it's indeed a mix of
refaults and other I/O? (difference is 3M for refaults and 12.5M for
pgpgin).
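
For reference, the deltas can be computed from the two snapshots as
sketched below. The counter values are copied from above; the helper
names are hypothetical. The last two lines additionally rest on two
assumptions that should be verified for the kernel and machine in
question: that pgpgin is reported in KiB rather than in pages, and that
the page size is 4 KiB. If both hold, the refault delta converts to
roughly the same volume as the pgpgin delta; if pgpgin is in pages, the
3M versus 12.5M comparison stands as written.

```python
BEFORE = """
workingset_nodes 15756
workingset_refault 6111959
workingset_activate 1805063
workingset_restore 919138
workingset_nodereclaim 40796
pgpgin 33889644
"""

AFTER = """
workingset_nodes 14842
workingset_refault 9248248
workingset_activate 1966317
workingset_restore 961179
workingset_nodereclaim 41060
pgpgin 46488352
"""


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict."""
    tokens = text.split()
    return dict(zip(tokens[0::2], map(int, tokens[1::2])))


def vmstat_deltas(before, after):
    """Per-counter difference between two vmstat snapshots."""
    b, a = parse_vmstat(before), parse_vmstat(after)
    return {name: a[name] - b[name] for name in b}


d = vmstat_deltas(BEFORE, AFTER)

# ASSUMPTIONS: 4 KiB pages, and pgpgin counted in KiB (not pages).
PAGE_KIB = 4
refault_kib = d["workingset_refault"] * PAGE_KIB
refault_share = refault_kib / d["pgpgin"]
```

The workingset_nodereclaim delta is small compared to the refault
delta, which matches the reading that little refault information was
lost.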

> [ The different resource pressures are not meant to be summed
>   up. Refaults truly are both IO events and memory events: they
>   indicate memory contention, but they also contribute to the IO
>   load. So both metrics need to include them, or it would skew the
>   picture when you only look at one of them. ]

Understood, makes sense.



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05  9:05 Hillf Danton
@ 2019-08-05 12:01 ` Artem S. Tashkinov
  0 siblings, 0 replies; 39+ messages in thread
From: Artem S. Tashkinov @ 2019-08-05 12:01 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-kernel, linux-mm

On 8/5/19 9:05 AM, Hillf Danton wrote:
>
> On Sun, 4 Aug 2019 09:23:17 +0000 "Artem S. Tashkinov" <aros@gmx.com> wrote:
>> Hello,
>>
>> There's this bug which has been bugging many people for many years
>> already and which is reproducible in less than a few minutes under the
>> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
>> defaults.
>
> Thanks for report!
>>
>> Steps to reproduce:
>>
>> 1) Boot with mem=4G
>> 2) Disable swap to make everything faster (sudo swapoff -a)
>> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
>> 4) Start opening tabs in either of them and watch your free RAM decrease
>
> We saw another corner-case cpu hog report under memory pressure also
> with swap disabled. In that report the xfs filesystem was an factor
> with CONFIG_MEMCG enabled. Anything special, say like
>
>   kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]
> or
>   [ 3225.313209] Xorg: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
>
> in your kernel log?

I'm running ext4 only without LVM, encryption or anything like that.
Plain GPT/MBR partitions with plenty of free space and no disk errors.

>>
>> Once you hit a situation when opening a new tab requires more RAM than
>> is currently available, the system will stall hard. You will barely  be
>> able to move the mouse pointer. Your disk LED will be flashing
>> incessantly (I'm not entirely sure why). You will not be able to run new
>> applications or close currently running ones.
>
> A cpu hog may come on top of memory hog in some scenario.

That might have happened as well; I couldn't tell, since I wasn't able to
open a terminal. Once the system recovered, there was no trace of
anything extraordinary.

>>
>> This little crisis may continue for minutes or even longer. I think
>> that's not how the system should behave in this situation. I believe
>> something must be done about that to avoid this stall.
>
> Yes, Sir.
>>
>> I'm almost sure some sysctl parameters could be changed to avoid this
>> situation but something tells me this could be done for everyone and
>> made default because some non tech-savvy users will just give up on
>> Linux if they ever get in a situation like this and they won't be keen
>> or even be able to Google for solutions.
>
> I don't want to repeat that it is hard to produce one pill for all
> patients, but the info you posted will help solve the crisis sooner.
>
> Hillf
>

In case you have trouble reproducing this bug, I can publish a VM
image. Still, everything is quite mundane: Fedora 30 + XFCE + a web
browser. Nothing else, nothing fancy.

Regards,
Artem



* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
@ 2019-08-05  9:05 Hillf Danton
  2019-08-05 12:01 ` Artem S. Tashkinov
  0 siblings, 1 reply; 39+ messages in thread
From: Hillf Danton @ 2019-08-05  9:05 UTC (permalink / raw)
  To: Artem S. Tashkinov; +Cc: linux-kernel, linux-mm


On Sun, 4 Aug 2019 09:23:17 +0000 "Artem S. Tashkinov" <aros@gmx.com> wrote:
> Hello,
> 
> There's this bug which has been bugging many people for many years
> already and which is reproducible in less than a few minutes under the
> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> defaults.

Thanks for the report!
> 
> Steps to reproduce:
> 
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease
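
For step 4, one way to watch memory shrink while tabs are opened is to poll /proc/meminfo. The following is a minimal Python sketch and is not part of the original report; the helper names `parse_meminfo` and `watch` and the one-second interval are illustrative assumptions.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Values are in kB for most fields; a few (e.g. HugePages_Total)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def watch(interval=1.0):
    """Print MemFree and MemAvailable until interrupted (Linux only)."""
    while True:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"MemFree={info.get('MemFree')} kB  "
              f"MemAvailable={info.get('MemAvailable')} kB")
        time.sleep(interval)
```

Calling `watch()` in a terminal before starting the browser gives a running view of both counters; stop it with Ctrl-C.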

We saw another corner-case CPU hog report under memory pressure, also
with swap disabled. In that report the xfs filesystem was a factor,
with CONFIG_MEMCG enabled. Is there anything special, such as

 kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]
or
 [ 3225.313209] Xorg: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0

in your kernel log?
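
A saved kernel log can be searched for these two kinds of message with a short script. This is a hedged Python sketch, not something from the original message: the regular expressions are derived only from the two example lines above, and the names `PATTERNS` and `find_events` are hypothetical.

```python
import re

# Patterns modelled on the two example messages quoted above.
# Real kernel logs may word these slightly differently across versions.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s!"),
    "alloc_failure": re.compile(r"page allocation failure: order:\d+"),
}


def find_events(lines):
    """Return (kind, line) pairs for every line matching a pattern."""
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((kind, line.rstrip("\n")))
    return hits
```

For example, the output of `dmesg` or `journalctl -k` can be piped into a script that calls `find_events(sys.stdin)` and prints the result.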
> 
> Once you hit a situation when opening a new tab requires more RAM than
> is currently available, the system will stall hard. You will barely be
> able to move the mouse pointer. Your disk LED will be flashing
> incessantly (I'm not entirely sure why). You will not be able to run new
> applications or close currently running ones.

A CPU hog may come on top of a memory hog in some scenarios.
> 
> This little crisis may continue for minutes or even longer. I think
> that's not how the system should behave in this situation. I believe
> something must be done about that to avoid this stall.

Yes, Sir.
> 
> I'm almost sure some sysctl parameters could be changed to avoid this
> situation but something tells me this could be done for everyone and
> made default because some non tech-savvy users will just give up on
> Linux if they ever get in a situation like this and they won't be keen
> or even be able to Google for solutions.

I don't want to repeat that it is hard to produce one pill for all
patients, but the info you posted will help solve the crisis sooner.

Hillf



Thread overview: 39+ messages
     [not found] <d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com>
2019-08-05 12:13 ` Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Vlastimil Babka
2019-08-05 13:31   ` Michal Hocko
2019-08-05 16:47     ` Suren Baghdasaryan
2019-08-05 18:55     ` Johannes Weiner
2019-08-06  9:29       ` Michal Hocko
2019-08-05 19:31   ` Johannes Weiner
2019-08-06  1:08     ` Suren Baghdasaryan
2019-08-06  9:36       ` Vlastimil Babka
2019-08-06 14:27         ` Johannes Weiner
2019-08-06 14:36           ` Michal Hocko
2019-08-06 16:27             ` Suren Baghdasaryan
2019-08-06 22:01               ` Johannes Weiner
2019-08-07  7:59                 ` Michal Hocko
2019-08-07 20:51                   ` Johannes Weiner
2019-08-07 21:01                     ` Andrew Morton
2019-08-07 21:34                       ` Johannes Weiner
2019-08-07 21:12                     ` Johannes Weiner
2019-08-08 11:48                     ` Michal Hocko
2019-08-08 15:10                       ` ndrw.xf
2019-08-08 16:32                         ` Michal Hocko
2019-08-08 17:57                           ` ndrw.xf
2019-08-08 18:59                             ` Michal Hocko
2019-08-08 21:59                               ` ndrw
2019-08-09  8:57                                 ` Michal Hocko
2019-08-09 10:09                                   ` ndrw
2019-08-09 10:50                                     ` Michal Hocko
2019-08-09 14:18                                       ` Pintu Agarwal
2019-08-10 12:34                                       ` ndrw
2019-08-12  8:24                                         ` Michal Hocko
2019-08-10 21:07                                   ` ndrw
2019-08-08 14:47                     ` Vlastimil Babka
2019-08-08 17:27                       ` Johannes Weiner
2019-08-09 14:56                         ` Vlastimil Babka
2019-08-09 17:31                           ` Johannes Weiner
2019-08-13 13:47                             ` Vlastimil Babka
2019-08-06 21:43       ` James Courtier-Dutton
2019-08-06 19:00 ` Florian Weimer
2019-08-05  9:05 Hillf Danton
2019-08-05 12:01 ` Artem S. Tashkinov
