* Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
@ 2019-08-04  9:23 Artem S. Tashkinov
  2019-08-05 12:13 ` Vlastimil Babka
  ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread

From: Artem S. Tashkinov @ 2019-08-04  9:23 UTC (permalink / raw)
To: linux-kernel

Hello,

There's this bug which has been bugging many people for many years
already and which is reproducible in less than a few minutes under the
latest and greatest kernel, 5.2.6. All the kernel parameters are set to
defaults.

Steps to reproduce:

1) Boot with mem=4G
2) Disable swap to make everything faster (sudo swapoff -a)
3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
4) Start opening tabs in either of them and watch your free RAM decrease

Once you hit a situation where opening a new tab requires more RAM than
is currently available, the system will stall hard. You will barely be
able to move the mouse pointer. Your disk LED will be flashing
incessantly (I'm not entirely sure why). You will not be able to run new
applications or close currently running ones.

This little crisis may continue for minutes or even longer. I think
that's not how the system should behave in this situation. I believe
something must be done about that to avoid this stall.

I'm almost sure some sysctl parameters could be changed to avoid this
situation, but something tells me this could be done for everyone and
made the default, because some non-tech-savvy users will just give up
on Linux if they ever get into a situation like this, and they won't
be keen, or even able, to Google for solutions.

Best regards,
Artem

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-04  9:23 Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Artem S. Tashkinov
@ 2019-08-05 12:13 ` Vlastimil Babka
  2019-08-05 13:31   ` Michal Hocko
  2019-08-05 19:31   ` Johannes Weiner
  2019-08-06 19:00 ` Florian Weimer
  2019-08-20  6:46 ` Daniel Drake
  2 siblings, 2 replies; 48+ messages in thread

From: Vlastimil Babka @ 2019-08-05 12:13 UTC (permalink / raw)
To: Artem S. Tashkinov, linux-kernel
Cc: linux-mm, Michal Hocko, Johannes Weiner, Suren Baghdasaryan

On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> Hello,
>
> There's this bug which has been bugging many people for many years
> already and which is reproducible in less than a few minutes under the
> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> defaults.
>
> Steps to reproduce:
>
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease
>
> Once you hit a situation where opening a new tab requires more RAM than
> is currently available, the system will stall hard. You will barely be
> able to move the mouse pointer. Your disk LED will be flashing
> incessantly (I'm not entirely sure why). You will not be able to run new
> applications or close currently running ones.
>
> This little crisis may continue for minutes or even longer. I think
> that's not how the system should behave in this situation. I believe
> something must be done about that to avoid this stall.

Yeah, that's a known problem, made worse by SSDs in fact, as they are
able to keep refaulting the last remaining file pages fast enough, so
there is still apparent progress in reclaim and OOM doesn't kick in.

At this point, the likely solution will probably be based on pressure
stall monitoring (PSI). I don't know how far we are from a built-in
monitor with reasonable defaults for a desktop workload, so CCing
relevant folks.

> I'm almost sure some sysctl parameters could be changed to avoid this
> situation, but something tells me this could be done for everyone and
> made the default, because some non-tech-savvy users will just give up
> on Linux if they ever get into a situation like this, and they won't
> be keen, or even able, to Google for solutions.
>
> Best regards,
> Artem
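The pressure stall information referred to here is already exported to userspace on kernels built with CONFIG_PSI, via /proc/pressure/{cpu,memory,io}. A minimal sketch of the reading side of such a monitor (the helper names are illustrative, not an interface anyone in the thread proposed):

```python
def parse_psi(text):
    """Parse /proc/pressure/* contents, which look like:
         some avg10=0.00 avg60=0.00 avg300=0.00 total=0
         full avg10=0.00 avg60=0.00 avg300=0.00 total=0
    'avg*' are stall-time percentages over trailing windows of 10s,
    60s and 300s; 'total' is cumulative stall time in microseconds."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        fields = dict(pair.split("=") for pair in pairs)
        out[kind] = {k: int(v) if k == "total" else float(v)
                     for k, v in fields.items()}
    return out

def memory_pressure():
    """Read system-wide memory pressure averages (requires CONFIG_PSI)."""
    with open("/proc/pressure/memory") as f:
        return parse_psi(f.read())
```

A desktop monitor could poll memory_pressure() periodically and act when, say, the "full" avg10 value stays high; picking that cutoff is exactly the open policy question discussed later in the thread.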
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 12:13 ` Vlastimil Babka
@ 2019-08-05 13:31   ` Michal Hocko
  2019-08-05 16:47     ` Suren Baghdasaryan
  2019-08-05 18:55     ` Johannes Weiner
  2019-08-05 19:31   ` Johannes Weiner
  1 sibling, 2 replies; 48+ messages in thread

From: Michal Hocko @ 2019-08-05 13:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Artem S. Tashkinov, linux-kernel, linux-mm, Johannes Weiner,
	Suren Baghdasaryan

On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > Hello,
> >
> > There's this bug which has been bugging many people for many years
> > already and which is reproducible in less than a few minutes under the
> > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > defaults.
> >
> > Steps to reproduce:
> >
> > 1) Boot with mem=4G
> > 2) Disable swap to make everything faster (sudo swapoff -a)
> > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > 4) Start opening tabs in either of them and watch your free RAM decrease
> >
> > Once you hit a situation where opening a new tab requires more RAM than
> > is currently available, the system will stall hard. You will barely be
> > able to move the mouse pointer. Your disk LED will be flashing
> > incessantly (I'm not entirely sure why). You will not be able to run new
> > applications or close currently running ones.
> >
> > This little crisis may continue for minutes or even longer. I think
> > that's not how the system should behave in this situation. I believe
> > something must be done about that to avoid this stall.
>
> Yeah, that's a known problem, made worse by SSDs in fact, as they are
> able to keep refaulting the last remaining file pages fast enough, so
> there is still apparent progress in reclaim and OOM doesn't kick in.
>
> At this point, the likely solution will probably be based on pressure
> stall monitoring (PSI). I don't know how far we are from a built-in
> monitor with reasonable defaults for a desktop workload, so CCing
> relevant folks.

Another potential approach would be to consider the refault information
we already have for file-backed pages. Once we start reclaiming only
workingset pages then we should be thrashing, right? It cannot be as
precise as the cost model which can be defined around PSI, but it might
give us at least a fallback measure.

This is really just an idea for a primitive detection. Most likely an
incorrect one, but it shows the idea at least. It is completely untested
and might be completely broken, so unless somebody is really brave and
doesn't run anything that would be missed, I do not recommend running it.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394cabaf4e..7f30c78b4fbc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -300,6 +300,7 @@ struct lruvec {
 	atomic_long_t		inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long		refaults;
+	atomic_t		workingset_refaults;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..4401753c3912 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -311,6 +311,15 @@ void *workingset_eviction(struct page *page);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
+bool lruvec_trashing(struct lruvec *lruvec)
+{
+	/*
+	 * One quarter of the inactive list is constantly refaulting.
+	 * This suggests that we are thrashing.
+	 */
+	return 4 * atomic_read(&lruvec->workingset_refaults) > lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
+}
+
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
 #define mapping_set_update(xas, mapping) do {				\
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9..d198594af0cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2381,6 +2381,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 					denominator);
 			break;
 		case SCAN_FILE:
+			if (lruvec_trashing(lruvec)) {
+				size = 0;
+				scan = 0;
+				break;
+			}
 		case SCAN_ANON:
 			/* Scan one type exclusively */
 			if ((scan_balance == SCAN_FILE) != file) {
diff --git a/mm/workingset.c b/mm/workingset.c
index e0b4edcb88c8..ee4c45b27e34 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -309,17 +309,25 @@ void workingset_refault(struct page *page, void *shadow)
 	 * don't act on pages that couldn't stay resident even if all
 	 * the memory was available to the page cache.
 	 */
-	if (refault_distance > active_file)
+	if (refault_distance > active_file) {
+		atomic_set(&lruvec->workingset_refaults, 0);
 		goto out;
+	}
 
 	SetPageActive(page);
 	atomic_long_inc(&lruvec->inactive_age);
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	atomic_inc(&lruvec->workingset_refaults);
 
 	/* Page was active prior to eviction */
 	if (workingset) {
 		SetPageWorkingset(page);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+		/*
+		 * Double the thrashing numbers for the actual working set
+		 * refaults.
+		 */
+		atomic_inc(&lruvec->workingset_refaults);
 	}
 out:
 	rcu_read_unlock();
-- 
Michal Hocko
SUSE Labs
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 13:31 ` Michal Hocko
@ 2019-08-05 16:47   ` Suren Baghdasaryan
  2019-08-05 18:55   ` Johannes Weiner
  1 sibling, 0 replies; 48+ messages in thread

From: Suren Baghdasaryan @ 2019-08-05 16:47 UTC (permalink / raw)
To: Michal Hocko
Cc: Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm, Johannes Weiner

On Mon, Aug 5, 2019 at 6:31 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > >
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > >
> > > Steps to reproduce:
> > >
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > >
> > > Once you hit a situation where opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> > >
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> >
> > Yeah, that's a known problem, made worse by SSDs in fact, as they are
> > able to keep refaulting the last remaining file pages fast enough, so
> > there is still apparent progress in reclaim and OOM doesn't kick in.
> >
> > At this point, the likely solution will probably be based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
>
> Another potential approach would be to consider the refault information
> we already have for file-backed pages. Once we start reclaiming only
> workingset pages then we should be thrashing, right? It cannot be as
> precise as the cost model which can be defined around PSI, but it might
> give us at least a fallback measure.
>
> This is really just an idea for a primitive detection. Most likely an
> incorrect one, but it shows the idea at least. It is completely untested
> and might be completely broken, so unless somebody is really brave and
> doesn't run anything that would be missed, I do not recommend running it.

In Android we have a userspace lmkd process which polls for PSI events,
and after they get triggered we check several metrics to determine if
we should kill anything. I believe Facebook has a similar userspace
process called oomd which, as I heard, is a more configurable rule
engine that also uses PSI and configurable rules to make kill
decisions. I've spent considerable time experimenting with different
metrics, and thrashing is definitely one of the most useful ones.
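The PSI event polling that lmkd relies on uses the trigger interface documented in Documentation/accounting/psi.rst: userspace writes a "<some|full> <stall us> <window us>" line to /proc/pressure/memory and then polls the file descriptor for POLLPRI. A minimal sketch with illustrative thresholds (the function names are not lmkd's):

```python
import os
import select

def make_trigger(kind, stall_us, window_us):
    """Build a PSI trigger line, e.g. "some 150000 1000000": notify
    when tasks are stalled on memory for >150 ms within any 1 s window."""
    assert kind in ("some", "full")
    assert 0 < stall_us <= window_us
    return f"{kind} {stall_us} {window_us}"

def wait_for_memory_pressure(stall_us=150000, window_us=1000000):
    """Block until the system-wide memory pressure trigger fires.
    Requires CONFIG_PSI; the fd must stay open for the trigger to live."""
    fd = os.open("/proc/pressure/memory", os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, make_trigger("some", stall_us, window_us).encode())
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        return poller.poll()  # blocks until pressure exceeds the trigger
    finally:
        os.close(fd)
```

On the event, a daemon would then consult its other metrics (refaults, scan rates, per-process memory) before deciding whom, if anyone, to kill.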
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 70394cabaf4e..7f30c78b4fbc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -300,6 +300,7 @@ struct lruvec {
> 	atomic_long_t		inactive_age;
> 	/* Refaults at the time of last reclaim cycle */
> 	unsigned long		refaults;
> +	atomic_t		workingset_refaults;
> #ifdef CONFIG_MEMCG
> 	struct pglist_data *pgdat;
> #endif
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bfb5c4ac108..4401753c3912 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -311,6 +311,15 @@ void *workingset_eviction(struct page *page);
> void workingset_refault(struct page *page, void *shadow);
> void workingset_activation(struct page *page);
>
> +bool lruvec_trashing(struct lruvec *lruvec)
> +{
> +	/*
> +	 * One quarter of the inactive list is constantly refaulting.

I'm guessing one quarter is a guesstimate here and needs
experimentation?

> +	 * This suggests that we are thrashing.
> +	 */
> +	return 4 * atomic_read(&lruvec->workingset_refaults) > lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);

Just wondering, why do you consider only the inactive list here? The
complete workingset is the active list plus the non-idle part of the
inactive list, isn't it? In my latest experiments I was using a
configurable percentage of the active+inactive lists as a threshold to
declare that we are thrashing, and if thrashing continues after we
kill, that percentage starts decaying, which results in an earlier
next kill (if interested in the details, see
https://android-review.googlesource.com/c/platform/system/core/+/1041778/14/lmkd/lmkd.c#1968).
I'm also using the existing WORKINGSET_REFAULT node_stat_item as a
workingset refault counter. Any reason you are not using it in this
reference implementation instead of introducing the new
workingset_refaults atomic?

> +}
> +
> /* Only track the nodes of mappings with shadow entries */
> void workingset_update_node(struct xa_node *node);
> #define mapping_set_update(xas, mapping) do {				\
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7889f583ced9..d198594af0cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2381,6 +2381,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> 					denominator);
> 			break;
> 		case SCAN_FILE:
> +			if (lruvec_trashing(lruvec)) {
> +				size = 0;
> +				scan = 0;
> +				break;
> +			}
> 		case SCAN_ANON:
> 			/* Scan one type exclusively */
> 			if ((scan_balance == SCAN_FILE) != file) {
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e0b4edcb88c8..ee4c45b27e34 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -309,17 +309,25 @@ void workingset_refault(struct page *page, void *shadow)
> 	 * don't act on pages that couldn't stay resident even if all
> 	 * the memory was available to the page cache.
> 	 */
> -	if (refault_distance > active_file)
> +	if (refault_distance > active_file) {
> +		atomic_set(&lruvec->workingset_refaults, 0);
> 		goto out;
> +	}
>
> 	SetPageActive(page);
> 	atomic_long_inc(&lruvec->inactive_age);
> 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +	atomic_inc(&lruvec->workingset_refaults);
>
> 	/* Page was active prior to eviction */
> 	if (workingset) {
> 		SetPageWorkingset(page);
> 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +		/*
> +		 * Double the thrashing numbers for the actual working set
> +		 * refaults.
> +		 */
> +		atomic_inc(&lruvec->workingset_refaults);
> 	}
> out:
> 	rcu_read_unlock();
> --
> Michal Hocko
> SUSE Labs
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 13:31 ` Michal Hocko
  2019-08-05 16:47   ` Suren Baghdasaryan
@ 2019-08-05 18:55   ` Johannes Weiner
  2019-08-06  9:29     ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread

From: Johannes Weiner @ 2019-08-05 18:55 UTC (permalink / raw)
To: Michal Hocko
Cc: Vlastimil Babka, Artem S. Tashkinov, linux-kernel, linux-mm,
	Suren Baghdasaryan

On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > >
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > >
> > > Steps to reproduce:
> > >
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > >
> > > Once you hit a situation where opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> > >
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> >
> > Yeah, that's a known problem, made worse by SSDs in fact, as they are
> > able to keep refaulting the last remaining file pages fast enough, so
> > there is still apparent progress in reclaim and OOM doesn't kick in.
> >
> > At this point, the likely solution will probably be based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
>
> Another potential approach would be to consider the refault information
> we already have for file-backed pages. Once we start reclaiming only
> workingset pages then we should be thrashing, right? It cannot be as
> precise as the cost model which can be defined around PSI, but it might
> give us at least a fallback measure.

NAK, this does *not* work. Not even as a fallback.

There is no amount of refaults for which you can say whether they are
a problem or not. It depends on the disk speed (obvious) but also on
the workload's memory access patterns (somewhat less obvious).

For example, we have workloads whose cache set doesn't quite fit into
memory, but everything else is pretty much statically allocated and it
rarely touches any new or one-off filesystem data. So there is always
a steady rate of mostly uninterrupted refaults; however, most data
accesses are hitting the cache! And we have fast SSDs that compensate
for the refaults that do occur. The workload runs *completely fine*.

If the cache hit rate were lower and refaults made up a bigger share
of overall page accesses, or if there were a spinning disk in that
machine, the machine would be completely livelocked - with the exact
same number of refaults and the same amount of RAM!

That's not just an approximation error that we could compensate for.
The same rate of refaults in a system could mean anything from 0%
memory pressure (all refaults are readahead, and the IO is done before
the workload notices) to 100% memory pressure (all refaults are cache
misses and the workload is fully serialized on the pages in question)
- and anything in between (a subset of the workload's threads waits
for a subset of the refaults). The refault rate by itself carries no
signal on workload progress.

This is the whole reason why psi was developed - to compare the time
you spend on refaults (which encodes IO speed and readahead
efficiency) with the time you spend on being productive (which encodes
refaults as a share of the workload's overall memory accesses).
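The time-based argument can be made concrete with a bit of arithmetic: psi's "full" pressure is the share of wall time in which the workload was completely stalled on memory. The sketch below (illustrative device numbers, not measurements) shows how an identical refault rate yields opposite conclusions once time is accounted for:

```python
def full_pressure_pct(stall_seconds, window_seconds):
    """psi-style 'full' pressure: percentage of the sampling window in
    which the workload was entirely stalled waiting on memory."""
    return 100.0 * stall_seconds / window_seconds

# Same refault *rate* in both cases: 1000 refaults over a 10 s window.
# What differs is how long each refault stalls the workload.
ssd_pressure  = full_pressure_pct(1000 * 0.0001, 10.0)  # ~0.1 ms per refault
disk_pressure = full_pressure_pct(1000 * 0.010,  10.0)  # ~10 ms per refault
# ssd_pressure is ~1% (workload runs fine); disk_pressure is 100%
# (fully serialized on IO, i.e. livelocked) - from the same refault count.
```

This is why a refault-count threshold cannot distinguish the two machines, while a stall-time threshold can.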
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 18:55 ` Johannes Weiner
@ 2019-08-06  9:29   ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread

From: Michal Hocko @ 2019-08-06  9:29 UTC (permalink / raw)
To: Johannes Weiner
Cc: Vlastimil Babka, Artem S. Tashkinov, linux-kernel, linux-mm,
	Suren Baghdasaryan

On Mon 05-08-19 14:55:42, Johannes Weiner wrote:
> On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> > On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > > Hello,
> > > >
> > > > There's this bug which has been bugging many people for many years
> > > > already and which is reproducible in less than a few minutes under the
> > > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > > defaults.
> > > >
> > > > Steps to reproduce:
> > > >
> > > > 1) Boot with mem=4G
> > > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > > >
> > > > Once you hit a situation where opening a new tab requires more RAM than
> > > > is currently available, the system will stall hard. You will barely be
> > > > able to move the mouse pointer. Your disk LED will be flashing
> > > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > > applications or close currently running ones.
> > > >
> > > > This little crisis may continue for minutes or even longer. I think
> > > > that's not how the system should behave in this situation. I believe
> > > > something must be done about that to avoid this stall.
> > >
> > > Yeah, that's a known problem, made worse by SSDs in fact, as they are
> > > able to keep refaulting the last remaining file pages fast enough, so
> > > there is still apparent progress in reclaim and OOM doesn't kick in.
> > >
> > > At this point, the likely solution will probably be based on pressure
> > > stall monitoring (PSI). I don't know how far we are from a built-in
> > > monitor with reasonable defaults for a desktop workload, so CCing
> > > relevant folks.
> >
> > Another potential approach would be to consider the refault information
> > we already have for file-backed pages. Once we start reclaiming only
> > workingset pages then we should be thrashing, right? It cannot be as
> > precise as the cost model which can be defined around PSI, but it might
> > give us at least a fallback measure.
>
> NAK, this does *not* work. Not even as a fallback.
>
> There is no amount of refaults for which you can say whether they are
> a problem or not. It depends on the disk speed (obvious) but also on
> the workload's memory access patterns (somewhat less obvious).
>
> For example, we have workloads whose cache set doesn't quite fit into
> memory, but everything else is pretty much statically allocated and it
> rarely touches any new or one-off filesystem data. So there is always
> a steady rate of mostly uninterrupted refaults; however, most data
> accesses are hitting the cache! And we have fast SSDs that compensate
> for the refaults that do occur. The workload runs *completely fine*.

OK, thanks for this example. I can see how a constant working set
refault rate can work properly if the rate is slower than the overall
IO capacity plus the allocation demand for other purposes. Thanks!

-- 
Michal Hocko
SUSE Labs
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
  2019-08-05 12:13 ` Vlastimil Babka
  2019-08-05 13:31   ` Michal Hocko
@ 2019-08-05 19:31   ` Johannes Weiner
  2019-08-06  1:08     ` Suren Baghdasaryan
  1 sibling, 1 reply; 48+ messages in thread

From: Johannes Weiner @ 2019-08-05 19:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Artem S. Tashkinov, linux-kernel, linux-mm, Michal Hocko,
	Suren Baghdasaryan

On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
> On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > Hello,
> >
> > There's this bug which has been bugging many people for many years
> > already and which is reproducible in less than a few minutes under the
> > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > defaults.
> >
> > Steps to reproduce:
> >
> > 1) Boot with mem=4G
> > 2) Disable swap to make everything faster (sudo swapoff -a)
> > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > 4) Start opening tabs in either of them and watch your free RAM decrease
> >
> > Once you hit a situation where opening a new tab requires more RAM than
> > is currently available, the system will stall hard. You will barely be
> > able to move the mouse pointer. Your disk LED will be flashing
> > incessantly (I'm not entirely sure why). You will not be able to run new
> > applications or close currently running ones.
> >
> > This little crisis may continue for minutes or even longer. I think
> > that's not how the system should behave in this situation. I believe
> > something must be done about that to avoid this stall.
>
> Yeah, that's a known problem, made worse by SSDs in fact, as they are
> able to keep refaulting the last remaining file pages fast enough, so
> there is still apparent progress in reclaim and OOM doesn't kick in.
>
> At this point, the likely solution will probably be based on pressure
> stall monitoring (PSI). I don't know how far we are from a built-in
> monitor with reasonable defaults for a desktop workload, so CCing
> relevant folks.

Yes, psi was specifically developed to address this problem. Before
it, the kernel had to make all decisions based on relative event rates
but had no notion of time. Whereas to the user, time is clearly an
issue, and in fact makes all the difference. So psi quantifies the
time the workload spends executing vs. spinning its wheels.

But choosing a universal cutoff for killing is not possible, since it
depends on the workload and the user's expectations: GUI and other
latency-sensitive applications care way before a compile job or video
encoding would care. Because of that, there are things like oomd and
lmkd, as mentioned, to leave the exact policy decision to userspace.

That being said, I think we should be able to provide a bare minimum
inside the kernel to avoid complete livelocks where the user does not
believe the machine would be able to recover without a reboot. The
goal wouldn't be a glitch-free user experience - the kernel does not
know enough about the applications to even attempt that. It should
just not hang indefinitely. Maybe similar to the hung task detector.

How about something like the below patch? With that, the kernel
catches excessive thrashing that happens before reclaim fails:

[root@ham ~]# stress -d 128 -m 5
stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd
Excessive and sustained system-wide memory pressure!
kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
Workqueue: events psi_avgs_work
Call Trace:
 dump_stack+0x46/0x60
 dump_header+0x5c/0x3d5
 ? irq_work_queue+0x46/0x50
 ? wake_up_klogd+0x2b/0x30
 ? vprintk_emit+0xe5/0x190
 oom_kill_process.cold.10+0xb/0x10
 out_of_memory+0x1ea/0x260
 update_averages.cold.8+0x14/0x25
 ? collect_percpu_times+0x84/0x1f0
 psi_avgs_work+0x80/0xc0
 process_one_work+0x1bb/0x310
 worker_thread+0x28/0x3c0
 ? process_one_work+0x310/0x310
 kthread+0x108/0x120
 ? __kthread_create_on_node+0x170/0x170
 ret_from_fork+0x35/0x40
Mem-Info:
active_anon:109463 inactive_anon:109564 isolated_anon:298
 active_file:4676 inactive_file:4073 isolated_file:455
 unevictable:0 dirty:8475 writeback:8 unstable:0
 slab_reclaimable:2585 slab_unreclaimable:4932
 mapped:413 shmem:2 pagetables:1747 bounce:0
 free:13472 free_pcp:17 free_cma:0

Possible snags and questions:

1. psi is an optional feature right now, but these livelocks commonly
   affect desktop users. What should be the default behavior?

2. Should we make the pressure cutoff and time period configurable? I
   fear we would open a can of worms similar to the existing OOM
   killer, where users are trying to use a kernel self-protection
   mechanism to implement workload QoS and priorities - things that
   should firmly be kept in userspace.

3. The swapoff annotation. Due to the swapin annotation, swapoff
   currently raises memory pressure. It probably shouldn't. But this
   will be a bigger problem if we trigger the OOM killer based on it.

4. Killing once every 10s assumes basically one big culprit. If the
   pressure is created by many different processes, fixing the
   situation could take quite a while. What oomd does to solve this is
   to monitor the PGSCAN counters after a kill, to tell whether the
   pressure is persisting, or is just from residual refaults after the
   culprit has been dealt with. We may need to do something similar
   here. Or find a solution to encode that distinction into psi
   itself, which would also take care of the swapoff problem, since
   it's basically the same thing - residual refaults without any
   reclaim pressure to sustain them.
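The oomd-style check described in snag 4 can be sketched against the counters the kernel already exports in /proc/vmstat: take a snapshot of the pgscan counters at the kill, another one shortly after, and treat unchanged counters as "residual refaults only". A minimal illustration (helper names are mine, not oomd's):

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    return {name: int(val) for name, val in
            (line.split() for line in text.splitlines() if line.strip())}

def sustained_reclaim(before, after):
    """True if any pgscan counter advanced between two snapshots, i.e.
    reclaim is still running and the remaining pressure is not just
    residual refaults from an already-killed culprit."""
    return any(after[k] > before.get(k, 0)
               for k in after if k.startswith("pgscan"))
```

With such a check, a second kill would only be issued while sustained_reclaim() stays true, instead of blindly killing once per pressure period.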
Anyway, here is the draft patch:

From e34a5cf28771d69f13faa0e933adeae44b26b8aa Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Aug 2019 13:15:16 -0400
Subject: [PATCH] psi oom

---
 include/linux/psi_types.h |  4 +++
 kernel/sched/psi.c        | 52 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..390446b07ac7 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -162,6 +162,10 @@ struct psi_group {
 	u64 polling_total[NR_PSI_STATES - 1];
 	u64 polling_next_update;
 	u64 polling_until;
+
+	/* Out-of-memory situation tracking */
+	bool oom_pressure;
+	u64 oom_pressure_start;
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index f28342dc65ec..1027b6611ec2 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -139,6 +139,7 @@
 #include <linux/ctype.h>
 #include <linux/file.h>
 #include <linux/poll.h>
+#include <linux/oom.h>
 #include <linux/psi.h>
 #include "sched.h"
 
@@ -177,6 +178,8 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static void psi_oom_tick(struct psi_group *group, u64 now);
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void group_init(struct psi_group *group)
@@ -403,6 +406,8 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
+	psi_oom_tick(group, now);
+
 	return avg_next_update;
 }
 
@@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
 	return 0;
 }
 module_init(psi_proc_init);
+
+#define OOM_PRESSURE_LEVEL	80
+#define OOM_PRESSURE_PERIOD	(10 * NSEC_PER_SEC)
+
+static void psi_oom_tick(struct psi_group *group, u64 now)
+{
+	struct oom_control oc = {
+		.order = 0,
+	};
+	unsigned long pressure;
+	bool high;
+
+	/*
+	 * Protect the system from livelocking due to thrashing. Leave
+	 * per-cgroup policies to oomd, lmkd etc.
+	 */
+	if (group != &psi_system)
+		return;
+
+	pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
+	high = pressure >= OOM_PRESSURE_LEVEL;
+
+	if (!group->oom_pressure && !high)
+		return;
+
+	if (!group->oom_pressure && high) {
+		group->oom_pressure = true;
+		group->oom_pressure_start = now;
+		return;
+	}
+
+	if (group->oom_pressure && !high) {
+		group->oom_pressure = false;
+		return;
+	}
+
+	if (now < group->oom_pressure_start + OOM_PRESSURE_PERIOD)
+		return;
+
+	group->oom_pressure = false;
+
+	if (!mutex_trylock(&oom_lock))
+		return;
+	pr_warn("Excessive and sustained system-wide memory pressure!\n");
+	out_of_memory(&oc);
+	mutex_unlock(&oom_lock);
+}
-- 
2.22.0
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-05 19:31 ` Johannes Weiner @ 2019-08-06 1:08 ` Suren Baghdasaryan 2019-08-06 9:36 ` Vlastimil Babka 2019-08-06 21:43 ` James Courtier-Dutton 0 siblings, 2 replies; 48+ messages in thread From: Suren Baghdasaryan @ 2019-08-06 1:08 UTC (permalink / raw) To: Johannes Weiner Cc: Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm, Michal Hocko On Mon, Aug 5, 2019 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote: > > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote: > > > Hello, > > > > > > There's this bug which has been bugging many people for many years > > > already and which is reproducible in less than a few minutes under the > > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to > > > defaults. > > > > > > Steps to reproduce: > > > > > > 1) Boot with mem=4G > > > 2) Disable swap to make everything faster (sudo swapoff -a) > > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox > > > 4) Start opening tabs in either of them and watch your free RAM decrease > > > > > > Once you hit a situation when opening a new tab requires more RAM than > > > is currently available, the system will stall hard. You will barely be > > > able to move the mouse pointer. Your disk LED will be flashing > > > incessantly (I'm not entirely sure why). You will not be able to run new > > > applications or close currently running ones. > > > > > This little crisis may continue for minutes or even longer. I think > > > that's not how the system should behave in this situation. I believe > > > something must be done about that to avoid this stall. > > > > Yeah that's a known problem, made worse SSD's in fact, as they are able > > to keep refaulting the last remaining file pages fast enough, so there > > is still apparent progress in reclaim and OOM doesn't kick in. 
> > > > At this point, the likely solution will be probably based on pressure > > stall monitoring (PSI). I don't know how far we are from a built-in > > monitor with reasonable defaults for a desktop workload, so CCing > > relevant folks. > > Yes, psi was specifically developed to address this problem. Before > it, the kernel had to make all decisions based on relative event rates > but had no notion of time. Whereas to the user, time is clearly an > issue, and in fact makes all the difference. So psi quantifies the > time the workload spends executing vs. spinning its wheels. > > But choosing a universal cutoff for killing is not possible, since it > depends on the workload and the user's expectation: GUI and other > latency-sensitive applications care way before a compile job or video > encoding would care. > > Because of that, there are things like oomd and lmkd as mentioned, to > leave the exact policy decision to userspace. > > That being said, I think we should be able to provide a bare minimum > inside the kernel to avoid complete livelocks where the user does not > believe the machine would be able to recover without a reboot. > > The goal wouldn't be a glitch-free user experience - the kernel does > not know enough about the applications to even attempt that. It should > just not hang indefinitely. Maybe similar to the hung task detector. > > How about something like the below patch? With that, the kernel > catches excessive thrashing that happens before reclaim fails: > > [root@ham ~]# stress -d 128 -m 5 > stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd > Excessive and sustained system-wide memory pressure! 
> kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0 > CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014 > Workqueue: events psi_avgs_work > Call Trace: > dump_stack+0x46/0x60 > dump_header+0x5c/0x3d5 > ? irq_work_queue+0x46/0x50 > ? wake_up_klogd+0x2b/0x30 > ? vprintk_emit+0xe5/0x190 > oom_kill_process.cold.10+0xb/0x10 > out_of_memory+0x1ea/0x260 > update_averages.cold.8+0x14/0x25 > ? collect_percpu_times+0x84/0x1f0 > psi_avgs_work+0x80/0xc0 > process_one_work+0x1bb/0x310 > worker_thread+0x28/0x3c0 > ? process_one_work+0x310/0x310 > kthread+0x108/0x120 > ? __kthread_create_on_node+0x170/0x170 > ret_from_fork+0x35/0x40 > Mem-Info: > active_anon:109463 inactive_anon:109564 isolated_anon:298 > active_file:4676 inactive_file:4073 isolated_file:455 > unevictable:0 dirty:8475 writeback:8 unstable:0 > slab_reclaimable:2585 slab_unreclaimable:4932 > mapped:413 shmem:2 pagetables:1747 bounce:0 > free:13472 free_pcp:17 free_cma:0 > > Possible snags and questions: > > 1. psi is an optional feature right now, but these livelocks commonly > affect desktop users. What should be the default behavior? > > 2. Should we make the pressure cutoff and time period configurable? > > I fear we would open a can of worms similar to the existing OOM > killer, where users are trying to use a kernel self-protection > mechanism to implement workload QoS and priorities - things that > should firmly be kept in userspace. > > 3. swapoff annotation. Due to the swapin annotation, swapoff currently > raises memory pressure. It probably shouldn't. But this will be a > bigger problem if we trigger the oom killer based on it. > > 4. Killing once every 10s assumes basically one big culprit. If the > pressure is created by many different processes, fixing the > situation could take quite a while. 
> > What oomd does to solve this is to monitor the PGSCAN counters > after a kill, to tell whether pressure is persisting, or just from > residual refaults after the culprit has been dealt with. > > We may need to do something similar here. Or find a solution to > encode that distinction into psi itself, and it would also take > care of the swapoff problem, since it's basically the same thing - > residual refaults without any reclaim pressure to sustain them. > > Anyway, here is the draft patch: > > From e34a5cf28771d69f13faa0e933adeae44b26b8aa Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Mon, 5 Aug 2019 13:15:16 -0400 > Subject: [PATCH] psi oom > > --- > include/linux/psi_types.h | 4 +++ > kernel/sched/psi.c | 52 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 56 insertions(+) > > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index 07aaf9b82241..390446b07ac7 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -162,6 +162,10 @@ struct psi_group { > u64 polling_total[NR_PSI_STATES - 1]; > u64 polling_next_update; > u64 polling_until; > + > + /* Out-of-memory situation tracking */ > + bool oom_pressure; > + u64 oom_pressure_start; > }; > > #else /* CONFIG_PSI */ > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index f28342dc65ec..1027b6611ec2 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -139,6 +139,7 @@ > #include <linux/ctype.h> > #include <linux/file.h> > #include <linux/poll.h> > +#include <linux/oom.h> > #include <linux/psi.h> > #include "sched.h" > > @@ -177,6 +178,8 @@ struct psi_group psi_system = { > .pcpu = &system_group_pcpu, > }; > > +static void psi_oom_tick(struct psi_group *group, u64 now); > + > static void psi_avgs_work(struct work_struct *work); > > static void group_init(struct psi_group *group) > @@ -403,6 +406,8 @@ static u64 update_averages(struct psi_group *group, u64 now) > calc_avgs(group->avg[s], missed_periods, sample, 
period); > } > > + psi_oom_tick(group, now); > + > return avg_next_update; > } > > @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) > return 0; > } > module_init(psi_proc_init); > + > +#define OOM_PRESSURE_LEVEL 80 > +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) 80% of the last 10 seconds spent in full stall would definitely be a problem. If the system was already low on memory (which it probably is, or we would not be reclaiming so hard and registering such a big stall) then oom-killer would probably kill something before 8 seconds are passed. If my line of thinking is correct, then do we really benefit from such additional protection mechanism? I might be wrong here because my experience is limited to embedded systems with relatively small amounts of memory. > + > +static void psi_oom_tick(struct psi_group *group, u64 now) > +{ > + struct oom_control oc = { > + .order = 0, > + }; > + unsigned long pressure; > + bool high; > + > + /* > + * Protect the system from livelocking due to thrashing. Leave > + * per-cgroup policies to oomd, lmkd etc. > + */ > + if (group != &psi_system) > + return; > + > + pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]); > + high = pressure >= OOM_PRESSURE_LEVEL; > + > + if (!group->oom_pressure && !high) > + return; > + > + if (!group->oom_pressure && high) { > + group->oom_pressure = true; > + group->oom_pressure_start = now; > + return; > + } > + > + if (group->oom_pressure && !high) { > + group->oom_pressure = false; > + return; > + } > + > + if (now < group->oom_pressure_start + OOM_PRESSURE_PERIOD) > + return; > + > + group->oom_pressure = false; > + > + if (!mutex_trylock(&oom_lock)) > + return; > + pr_warn("Excessive and sustained system-wide memory pressure!\n"); > + out_of_memory(&oc); > + mutex_unlock(&oom_lock); > +} > -- > 2.22.0 > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 1:08 ` Suren Baghdasaryan @ 2019-08-06 9:36 ` Vlastimil Babka 2019-08-06 14:27 ` Johannes Weiner 2019-08-06 21:43 ` James Courtier-Dutton 1 sibling, 1 reply; 48+ messages in thread From: Vlastimil Babka @ 2019-08-06 9:36 UTC (permalink / raw) To: Suren Baghdasaryan, Johannes Weiner Cc: Artem S. Tashkinov, LKML, linux-mm, Michal Hocko On 8/6/19 3:08 AM, Suren Baghdasaryan wrote: >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) >> return 0; >> } >> module_init(psi_proc_init); >> + >> +#define OOM_PRESSURE_LEVEL 80 >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) > > 80% of the last 10 seconds spent in full stall would definitely be a > problem. If the system was already low on memory (which it probably > is, or we would not be reclaiming so hard and registering such a big > stall) then oom-killer would probably kill something before 8 seconds > are passed. If the oom killer can act faster, then great! On small embedded systems you probably don't enable PSI anyway? > If my line of thinking is correct, then do we really > benefit from such additional protection mechanism? I might be wrong > here because my experience is limited to embedded systems with > relatively small amounts of memory. Well, Artem in his original mail describes a minutes long stall. Things are really different on a fast desktop/laptop with SSD. I have experienced this as well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than 8GB in the laptop). IMHO the default limit should be set so that the user doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10 seconds should be fine. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 9:36 ` Vlastimil Babka @ 2019-08-06 14:27 ` Johannes Weiner 2019-08-06 14:36 ` Michal Hocko 0 siblings, 1 reply; 48+ messages in thread From: Johannes Weiner @ 2019-08-06 14:27 UTC (permalink / raw) To: Vlastimil Babka Cc: Suren Baghdasaryan, Artem S. Tashkinov, LKML, linux-mm, Michal Hocko On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote: > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote: > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) > >> return 0; > >> } > >> module_init(psi_proc_init); > >> + > >> +#define OOM_PRESSURE_LEVEL 80 > >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) > > > > 80% of the last 10 seconds spent in full stall would definitely be a > > problem. If the system was already low on memory (which it probably > > is, or we would not be reclaiming so hard and registering such a big > > stall) then oom-killer would probably kill something before 8 seconds > > are passed. > > If oom killer can act faster, than great! On small embedded systems you probably > don't enable PSI anyway? > > > If my line of thinking is correct, then do we really > > benefit from such additional protection mechanism? I might be wrong > > here because my experience is limited to embedded systems with > > relatively small amounts of memory. > > Well, Artem in his original mail describes a minutes long stall. Things are > really different on a fast desktop/laptop with SSD. I have experienced this as > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than > 8GB in the laptop). IMHO the default limit should be set so that the user > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10 > seconds should be fine. That's exactly what I have experienced in the past, and this was also the consistent story in the bug reports we have had. 
I suspect it requires a certain combination of RAM size, CPU speed, and IO capacity: the OOM killer kicks in when reclaim fails, which happens when all scanned LRU pages were locked and under IO. So IO needs to be slow enough, or RAM small enough, that the CPU can scan all LRU pages while they are temporarily unreclaimable (page lock). It may well be that on phones the RAM is small enough relative to CPU size. But on desktops/servers, we frequently see that there is a wider window of memory consumption in which reclaim efficiency doesn't drop low enough for the OOM killer to kick in. In the time it takes the CPU to scan through RAM, enough pages will have *just* finished reading for reclaim to free them again and continue to make "progress". We do know that the OOM killer might not kick in for at least 20-25 minutes while the system is entirely unresponsive. People usually don't wait this long before forcibly rebooting. In a managed fleet, ssh heartbeat tests eventually fail and force a reboot. I'm not sure 10s is the perfect value here, but I do think the kernel should try to get out of such a state, where interacting with the system is impossible, within a reasonable amount of time. It could be a little too short for non-interactive number-crunching systems... ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 14:27 ` Johannes Weiner @ 2019-08-06 14:36 ` Michal Hocko 2019-08-06 16:27 ` Suren Baghdasaryan 0 siblings, 1 reply; 48+ messages in thread From: Michal Hocko @ 2019-08-06 14:36 UTC (permalink / raw) To: Johannes Weiner Cc: Vlastimil Babka, Suren Baghdasaryan, Artem S. Tashkinov, LKML, linux-mm On Tue 06-08-19 10:27:28, Johannes Weiner wrote: > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote: > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote: > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) > > >> return 0; > > >> } > > >> module_init(psi_proc_init); > > >> + > > >> +#define OOM_PRESSURE_LEVEL 80 > > >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) > > > > > > 80% of the last 10 seconds spent in full stall would definitely be a > > > problem. If the system was already low on memory (which it probably > > > is, or we would not be reclaiming so hard and registering such a big > > > stall) then oom-killer would probably kill something before 8 seconds > > > are passed. > > > > If oom killer can act faster, than great! On small embedded systems you probably > > don't enable PSI anyway? > > > > > If my line of thinking is correct, then do we really > > > benefit from such additional protection mechanism? I might be wrong > > > here because my experience is limited to embedded systems with > > > relatively small amounts of memory. > > > > Well, Artem in his original mail describes a minutes long stall. Things are > > really different on a fast desktop/laptop with SSD. I have experienced this as > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than > > 8GB in the laptop). IMHO the default limit should be set so that the user > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10 > > seconds should be fine. 
> > That's exactly what I have experienced in the past, and this was also > the consistent story in the bug reports we have had. > > I suspect it requires a certain combination of RAM size, CPU speed, > and IO capacity: the OOM killer kicks in when reclaim fails, which > happens when all scanned LRU pages were locked and under IO. So IO > needs to be slow enough, or RAM small enough, that the CPU can scan > all LRU pages while they are temporarily unreclaimable (page lock). > > It may well be that on phones the RAM is small enough relative to CPU > size. > > But on desktops/servers, we frequently see that there is a wider > window of memory consumption in which reclaim efficiency doesn't drop > low enough for the OOM killer to kick in. In the time it takes the CPU > to scan through RAM, enough pages will have *just* finished reading > for reclaim to free them again and continue to make "progress". > > We do know that the OOM killer might not kick in for at least 20-25 > minutes while the system is entirely unresponsive. People usually > don't wait this long before forcibly rebooting. In a managed fleet, > ssh heartbeat tests eventually fail and force a reboot. > > I'm not sure 10s is the perfect value here, but I do think the kernel > should try to get out of such a state, where interacting with the > system is impossible, within a reasonable amount of time. > > It could be a little too short for non-interactive number-crunching > systems... Would it be possible to have a module with tuning knobs as parameters and hook into the PSI infrastructure? People can play with the settings to their needs, we wouldn't really have to think about the user-visible API for the tuning and this could be easily adopted as an opt-in mechanism without a risk of regressions. I would really love to see a simple thrashing watchdog like the one you have proposed earlier. It is self-contained and easy to play with if the parameters are not hardcoded. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 14:36 ` Michal Hocko @ 2019-08-06 16:27 ` Suren Baghdasaryan 2019-08-06 22:01 ` Johannes Weiner 0 siblings, 1 reply; 48+ messages in thread From: Suren Baghdasaryan @ 2019-08-06 16:27 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 06-08-19 10:27:28, Johannes Weiner wrote: > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote: > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote: > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) > > > >> return 0; > > > >> } > > > >> module_init(psi_proc_init); > > > >> + > > > >> +#define OOM_PRESSURE_LEVEL 80 > > > >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) > > > > > > > > 80% of the last 10 seconds spent in full stall would definitely be a > > > > problem. If the system was already low on memory (which it probably > > > > is, or we would not be reclaiming so hard and registering such a big > > > > stall) then oom-killer would probably kill something before 8 seconds > > > > are passed. > > > > > > If oom killer can act faster, than great! On small embedded systems you probably > > > don't enable PSI anyway? We use PSI triggers with 1 sec tracking window. PSI averages are less useful on such systems because in 10 secs (which is the shortest PSI averaging window) memory conditions can change drastically. > > > > If my line of thinking is correct, then do we really > > > > benefit from such additional protection mechanism? I might be wrong > > > > here because my experience is limited to embedded systems with > > > > relatively small amounts of memory. > > > > > > Well, Artem in his original mail describes a minutes long stall. Things are > > > really different on a fast desktop/laptop with SSD. 
I have experienced this as > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than > > > 8GB in the laptop). IMHO the default limit should be set so that the user > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10 > > > seconds should be fine. > > > > That's exactly what I have experienced in the past, and this was also > > the consistent story in the bug reports we have had. > > > > I suspect it requires a certain combination of RAM size, CPU speed, > > and IO capacity: the OOM killer kicks in when reclaim fails, which > > happens when all scanned LRU pages were locked and under IO. So IO > > needs to be slow enough, or RAM small enough, that the CPU can scan > > all LRU pages while they are temporarily unreclaimable (page lock). > > > > It may well be that on phones the RAM is small enough relative to CPU > > size. > > > > But on desktops/servers, we frequently see that there is a wider > > window of memory consumption in which reclaim efficiency doesn't drop > > low enough for the OOM killer to kick in. In the time it takes the CPU > > to scan through RAM, enough pages will have *just* finished reading > > for reclaim to free them again and continue to make "progress". > > > > We do know that the OOM killer might not kick in for at least 20-25 > > minutes while the system is entirely unresponsive. People usually > > don't wait this long before forcibly rebooting. In a managed fleet, > > ssh heartbeat tests eventually fail and force a reboot. Got it. Thanks for the explanation. > > I'm not sure 10s is the perfect value here, but I do think the kernel > > should try to get out of such a state, where interacting with the > > system is impossible, within a reasonable amount of time. > > > > It could be a little too short for non-interactive number-crunching > > systems... > > Would it be possible to have a module with tunning knobs as parameters > and hook into the PSI infrastructure? 
People can play with the setting > to their need, we wouldn't really have think about the user visible API > for the tuning and this could be easily adopted as an opt-in mechanism > without a risk of regressions. PSI averages stalls over 10, 60 and 300 seconds, so implementing 3 corresponding thresholds would be easy. The patch Johannes posted can be extended to support 3 thresholds instead of 1. I can take a stab at it if Johannes is busy. If we want more flexibility we could use PSI triggers with configurable tracking window but that's more complex and probably not worth it. > I would really love to see a simple threshing watchdog like the one you > have proposed earlier. It is self contained and easy to play with if the > parameters are not hardcoded. > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 16:27 ` Suren Baghdasaryan @ 2019-08-06 22:01 ` Johannes Weiner 2019-08-07 7:59 ` Michal Hocko 0 siblings, 1 reply; 48+ messages in thread From: Johannes Weiner @ 2019-08-06 22:01 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Michal Hocko, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote: > On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 06-08-19 10:27:28, Johannes Weiner wrote: > > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote: > > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote: > > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void) > > > > >> return 0; > > > > >> } > > > > >> module_init(psi_proc_init); > > > > >> + > > > > >> +#define OOM_PRESSURE_LEVEL 80 > > > > >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC) > > > > > > > > > > 80% of the last 10 seconds spent in full stall would definitely be a > > > > > problem. If the system was already low on memory (which it probably > > > > > is, or we would not be reclaiming so hard and registering such a big > > > > > stall) then oom-killer would probably kill something before 8 seconds > > > > > are passed. > > > > > > > > If oom killer can act faster, than great! On small embedded systems you probably > > > > don't enable PSI anyway? > > We use PSI triggers with 1 sec tracking window. PSI averages are less > useful on such systems because in 10 secs (which is the shortest PSI > averaging window) memory conditions can change drastically. > > > > > > If my line of thinking is correct, then do we really > > > > > benefit from such additional protection mechanism? I might be wrong > > > > > here because my experience is limited to embedded systems with > > > > > relatively small amounts of memory. 
> > > > > > > > Well, Artem in his original mail describes a minutes long stall. Things are > > > > really different on a fast desktop/laptop with SSD. I have experienced this as > > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than > > > > 8GB in the laptop). IMHO the default limit should be set so that the user > > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10 > > > > seconds should be fine. > > > > > > That's exactly what I have experienced in the past, and this was also > > > the consistent story in the bug reports we have had. > > > > > > I suspect it requires a certain combination of RAM size, CPU speed, > > > and IO capacity: the OOM killer kicks in when reclaim fails, which > > > happens when all scanned LRU pages were locked and under IO. So IO > > > needs to be slow enough, or RAM small enough, that the CPU can scan > > > all LRU pages while they are temporarily unreclaimable (page lock). > > > > > > It may well be that on phones the RAM is small enough relative to CPU > > > size. > > > > > > But on desktops/servers, we frequently see that there is a wider > > > window of memory consumption in which reclaim efficiency doesn't drop > > > low enough for the OOM killer to kick in. In the time it takes the CPU > > > to scan through RAM, enough pages will have *just* finished reading > > > for reclaim to free them again and continue to make "progress". > > > > > > We do know that the OOM killer might not kick in for at least 20-25 > > > minutes while the system is entirely unresponsive. People usually > > > don't wait this long before forcibly rebooting. In a managed fleet, > > > ssh heartbeat tests eventually fail and force a reboot. > > Got it. Thanks for the explanation. > > > > I'm not sure 10s is the perfect value here, but I do think the kernel > > > should try to get out of such a state, where interacting with the > > > system is impossible, within a reasonable amount of time. 
> > > > > > It could be a little too short for non-interactive number-crunching > > > systems... > > > > Would it be possible to have a module with tunning knobs as parameters > > and hook into the PSI infrastructure? People can play with the setting > > to their need, we wouldn't really have think about the user visible API > > for the tuning and this could be easily adopted as an opt-in mechanism > > without a risk of regressions. It's relatively easy to trigger a livelock that disables the entire system for good, as a regular user. It's a little weird to make the bug fix for that an opt-in with an extensive configuration interface. This isn't like the hung task watch dog, where it's likely some kind of kernel issue, right? This can happen on any current kernel. What I would like to have is a way of self-recovery from a livelock. I don't mind making it opt-out in case we make mistakes, but the kernel should provide minimal self-protection out of the box, IMO. > PSI averages stalls over 10, 60 and 300 seconds, so implementing 3 > corresponding thresholds would be easy. The patch Johannes posted can > be extended to support 3 thresholds instead of 1. I can take a stab at > it if Johannes is busy. > If we want more flexibility we could use PSI triggers with > configurable tracking window but that's more complex and probably not > worth it. This goes into quality-of-service for workloads territory again. I'm not quite convinced yet we want to go there. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 22:01 ` Johannes Weiner @ 2019-08-07 7:59 ` Michal Hocko 2019-08-07 20:51 ` Johannes Weiner 0 siblings, 1 reply; 48+ messages in thread From: Michal Hocko @ 2019-08-07 7:59 UTC (permalink / raw) To: Johannes Weiner Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm On Tue 06-08-19 18:01:50, Johannes Weiner wrote: > On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote: [...] > > > > I'm not sure 10s is the perfect value here, but I do think the kernel > > > > should try to get out of such a state, where interacting with the > > > > system is impossible, within a reasonable amount of time. > > > > > > > > It could be a little too short for non-interactive number-crunching > > > > systems... > > > > > > Would it be possible to have a module with tunning knobs as parameters > > > and hook into the PSI infrastructure? People can play with the setting > > > to their need, we wouldn't really have think about the user visible API > > > for the tuning and this could be easily adopted as an opt-in mechanism > > > without a risk of regressions. > > It's relatively easy to trigger a livelock that disables the entire > system for good, as a regular user. It's a little weird to make the > bug fix for that an opt-in with an extensive configuration interface. Yes, I definitely do agree that this is a bug fix more than a feature. The thing is that we do not know what the proper default is for a wide variety of workloads so some way of configurability is needed (level and period). If making this a module would require a lot of additional code then we need a kernel command line parameter at least. A module would have a nice advantage that you can change your configuration without rebooting. The same can be achieved by a sysfs on the other hand. 
A module would have a nice advantage that you can change your configuration without rebooting. The same can be achieved by a sysfs on the other hand.
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 7:59 ` Michal Hocko @ 2019-08-07 20:51 ` Johannes Weiner 2019-08-07 21:01 ` Andrew Morton ` (3 more replies) 0 siblings, 4 replies; 48+ messages in thread From: Johannes Weiner @ 2019-08-07 20:51 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Wed, Aug 07, 2019 at 09:59:27AM +0200, Michal Hocko wrote: > On Tue 06-08-19 18:01:50, Johannes Weiner wrote: > > On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote: > [...] > > > > > I'm not sure 10s is the perfect value here, but I do think the kernel > > > > > should try to get out of such a state, where interacting with the > > > > > system is impossible, within a reasonable amount of time. > > > > > > > > > > It could be a little too short for non-interactive number-crunching > > > > > systems... > > > > > > > > Would it be possible to have a module with tunning knobs as parameters > > > > and hook into the PSI infrastructure? People can play with the setting > > > > to their need, we wouldn't really have think about the user visible API > > > > for the tuning and this could be easily adopted as an opt-in mechanism > > > > without a risk of regressions. > > > > It's relatively easy to trigger a livelock that disables the entire > > system for good, as a regular user. It's a little weird to make the > > bug fix for that an opt-in with an extensive configuration interface. > > Yes, I definitely do agree that this is a bug fix more than a > feature. The thing is that we do not know what the proper default is for > a wide variety of workloads so some way of configurability is needed > (level and period). If making this a module would require a lot of > additional code then we need a kernel command line parameter at least. 
> > A module would have a nice advantage that you can change your > configuration without rebooting. The same can be achieved by a sysfs on > the other hand. That's reasonable. How about my initial patch, but behind a config option and with the level and period configurable? --- From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Mon, 5 Aug 2019 13:15:16 -0400 Subject: [PATCH] psi: trigger the OOM killer on severe thrashing Over the last few years we have had many reports that the kernel can enter an extended livelock situation under sufficient memory pressure. The system becomes unresponsive and fully IO bound for indefinite periods of time, and often the user has no choice but to reboot. Even though the system is clearly struggling with a shortage of memory, the OOM killer is not engaging reliably. The reason is that with bigger RAM, and in particular with faster SSDs, page reclaim does not necessarily fail in the traditional sense anymore. In the time it takes the CPU to run through the vast LRU lists, there are almost always some cache pages that have finished reading in and can be reclaimed, even before userspace had a chance to access them. As a result, reclaim is nominally succeeding, but userspace is refault-bound and not making significant progress. While this is clearly noticeable to human beings, the kernel could not actually determine this state with the traditional memory event counters. We might see a certain rate of reclaim activity or refaults, but how long, or whether at all, userspace is unproductive because of it depends on IO speed, readahead efficiency, as well as memory access patterns and concurrency of the userspace applications. The same number of VM events could go unnoticed in one system / workload combination, and result in an indefinite lockup in a different one. 
However, eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") introduced a memory pressure metric that quantifies the share of wallclock time in which userspace waits on reclaim, refaults, swapins. By using absolute time, it encodes all the above mentioned variables of hardware capacity and workload behavior. When memory pressure is 40%, it means that 40% of the time the workload is stalled on memory, period. This is the actual measure for the lack of forward progress that users can experience. It's also something they expect the kernel to manage and remedy if it becomes non-existent. To accomplish this, this patch implements a thrashing cutoff for the OOM killer. If the kernel determines a sustained high level of memory pressure, and thus a lack of forward progress in userspace, it will trigger the OOM killer to reduce memory contention. Per default, the OOM killer will engage after 15 seconds of at least 80% memory pressure. These values are tunable via sysctls vm.thrashing_oom_period and vm.thrashing_oom_level. Ideally, this would be standard behavior for the kernel, but since it involves a new metric and OOM killing, let's be safe and make it an opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0 also disables the feature at runtime. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: "Artem S. Tashkinov" <aros@gmx.com> --- Documentation/admin-guide/sysctl/vm.rst | 24 ++++++++ include/linux/psi.h | 5 ++ include/linux/psi_types.h | 6 ++ kernel/sched/psi.c | 74 +++++++++++++++++++++++++ kernel/sysctl.c | 20 +++++++ mm/Kconfig | 20 +++++++ 6 files changed, 149 insertions(+) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 64aeee1009ca..0332cb52bcfc 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -66,6 +66,8 @@ files can be found in mm/swap.c. 
- stat_interval - stat_refresh - numa_stat +- thrashing_oom_level +- thrashing_oom_period - swappiness - unprivileged_userfaultfd - user_reserve_kbytes @@ -825,6 +827,28 @@ When page allocation performance is not a bottleneck and you want all echo 1 > /proc/sys/vm/numa_stat +thrashing_oom_level +=================== + +This defines the memory pressure level for severe thrashing at which +the OOM killer will be engaged. + +The default is 80. This means the system is considered to be thrashing +severely when all active tasks are collectively stalled on memory +(waiting for page reclaim, refaults, swapins etc) for 80% of the time. + +A setting of 0 will disable thrashing-based OOM killing. + + +thrashing_oom_period +=================== + +This defines the number of seconds the system must sustain severe +thrashing at thrashing_oom_level before the OOM killer is invoked. + +The default is 15. + + swappiness ========== diff --git a/include/linux/psi.h b/include/linux/psi.h index 7b3de7321219..661ce45900f9 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -37,6 +37,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, poll_table *wait); #endif +#ifdef CONFIG_THRASHING_OOM +extern unsigned int sysctl_thrashing_oom_level; +extern unsigned int sysctl_thrashing_oom_period; +#endif + #else /* CONFIG_PSI */ static inline void psi_init(void) {} diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 07aaf9b82241..7c57d7e5627e 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -162,6 +162,12 @@ struct psi_group { u64 polling_total[NR_PSI_STATES - 1]; u64 polling_next_update; u64 polling_until; + +#ifdef CONFIG_THRASHING_OOM + /* Severe thrashing state tracking */ + bool oom_pressure; + u64 oom_pressure_start; +#endif }; #else /* CONFIG_PSI */ diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index f28342dc65ec..4b1b620d6359 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -139,6 +139,7 @@ #include 
<linux/ctype.h> #include <linux/file.h> #include <linux/poll.h> +#include <linux/oom.h> #include <linux/psi.h> #include "sched.h" @@ -177,6 +178,14 @@ struct psi_group psi_system = { .pcpu = &system_group_pcpu, }; +#ifdef CONFIG_THRASHING_OOM +static void psi_oom_tick(struct psi_group *group, u64 now); +#else +static inline void psi_oom_tick(struct psi_group *group, u64 now) +{ +} +#endif + static void psi_avgs_work(struct work_struct *work); static void group_init(struct psi_group *group) @@ -403,6 +412,8 @@ static u64 update_averages(struct psi_group *group, u64 now) calc_avgs(group->avg[s], missed_periods, sample, period); } + psi_oom_tick(group, now); + return avg_next_update; } @@ -1280,3 +1291,66 @@ static int __init psi_proc_init(void) return 0; } module_init(psi_proc_init); + +#ifdef CONFIG_THRASHING_OOM +/* + * Trigger the OOM killer when detecting severe thrashing. + * + * Per default we define severe thrashing as 15 seconds of 80% memory + * pressure (i.e. all active tasks are collectively stalled on memory + * 80% of the time). + */ +unsigned int sysctl_thrashing_oom_level = 80; +unsigned int sysctl_thrashing_oom_period = 15; + +static void psi_oom_tick(struct psi_group *group, u64 now) +{ + struct oom_control oc = { + .order = 0, + }; + unsigned long pressure; + bool high; + + /* Disabled at runtime */ + if (!sysctl_thrashing_oom_level) + return; + + /* + * Protect the system from livelocking due to thrashing. Leave + * per-cgroup policies to oomd, lmkd etc. 
+ */ + if (group != &psi_system) + return; + + pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]); + high = pressure >= sysctl_thrashing_oom_level; + + if (!group->oom_pressure && !high) + return; + + if (!group->oom_pressure && high) { + group->oom_pressure = true; + group->oom_pressure_start = now; + return; + } + + if (group->oom_pressure && !high) { + group->oom_pressure = false; + return; + } + + if (now < group->oom_pressure_start + + (u64)sysctl_thrashing_oom_period * NSEC_PER_SEC) + return; + + pr_warn("Severe thrashing detected! (%ds of %d%% memory pressure)\n", + sysctl_thrashing_oom_period, sysctl_thrashing_oom_level); + + group->oom_pressure = false; + + if (!mutex_trylock(&oom_lock)) + return; + out_of_memory(&oc); + mutex_unlock(&oom_lock); +} +#endif /* CONFIG_THRASHING_OOM */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index f12888971d66..3b9b3deb1836 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -68,6 +68,7 @@ #include <linux/bpf.h> #include <linux/mount.h> #include <linux/userfaultfd_k.h> +#include <linux/psi.h> #include "../lib/kstrtox.h" @@ -1746,6 +1747,25 @@ static struct ctl_table vm_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, +#endif +#ifdef CONFIG_THRASHING_OOM + { + .procname = "thrashing_oom_level", + .data = &sysctl_thrashing_oom_level, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &one_hundred, + }, + { + .procname = "thrashing_oom_period", + .data = &sysctl_thrashing_oom_period, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + }, #endif { } }; diff --git a/mm/Kconfig b/mm/Kconfig index 56cec636a1fc..cef13b423beb 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -736,4 +736,24 @@ config ARCH_HAS_PTE_SPECIAL config ARCH_HAS_HUGEPD bool +config THRASHING_OOM + bool "Trigger the OOM killer on severe thrashing" + select PSI + help + Under memory pressure, the kernel can enter 
severe thrashing + or swap storms during which the system is fully IO-bound and + does not respond to any user input. The OOM killer does not + always engage because page reclaim manages to make nominal + forward progress, but the system is effectively livelocked. + + This feature uses pressure stall information (PSI) to detect + severe thrashing and trigger the OOM killer. + + The OOM killer will be engaged when the system sustains a + memory pressure level of 80% for 15 seconds. This can be + adjusted using the vm.thrashing_oom_[level|period] sysctls. + + Say Y if you have observed your system becoming unresponsive + for extended periods under memory pressure. + endmenu -- 2.22.0 ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 20:51 ` Johannes Weiner @ 2019-08-07 21:01 ` Andrew Morton 2019-08-07 21:34 ` Johannes Weiner 2019-08-07 21:12 ` Johannes Weiner ` (2 subsequent siblings) 3 siblings, 1 reply; 48+ messages in thread From: Andrew Morton @ 2019-08-07 21:01 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > However, eb414681d5a0 ("psi: pressure stall information for CPU, > memory, and IO") introduced a memory pressure metric that quantifies > the share of wallclock time in which userspace waits on reclaim, > refaults, swapins. By using absolute time, it encodes all the above > mentioned variables of hardware capacity and workload behavior. When > memory pressure is 40%, it means that 40% of the time the workload is > stalled on memory, period. This is the actual measure for the lack of > forward progress that users can experience. It's also something they > expect the kernel to manage and remedy if it becomes non-existent. > > To accomplish this, this patch implements a thrashing cutoff for the > OOM killer. If the kernel determines a sustained high level of memory > pressure, and thus a lack of forward progress in userspace, it will > trigger the OOM killer to reduce memory contention. > > Per default, the OOM killer will engage after 15 seconds of at least > 80% memory pressure. These values are tunable via sysctls > vm.thrashing_oom_period and vm.thrashing_oom_level. Could be implemented in userspace? </troll> ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 21:01 ` Andrew Morton @ 2019-08-07 21:34 ` Johannes Weiner 0 siblings, 0 replies; 48+ messages in thread From: Johannes Weiner @ 2019-08-07 21:34 UTC (permalink / raw) To: Andrew Morton Cc: Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm On Wed, Aug 07, 2019 at 02:01:30PM -0700, Andrew Morton wrote: > On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > However, eb414681d5a0 ("psi: pressure stall information for CPU, > > memory, and IO") introduced a memory pressure metric that quantifies > > the share of wallclock time in which userspace waits on reclaim, > > refaults, swapins. By using absolute time, it encodes all the above > > mentioned variables of hardware capacity and workload behavior. When > > memory pressure is 40%, it means that 40% of the time the workload is > > stalled on memory, period. This is the actual measure for the lack of > > forward progress that users can experience. It's also something they > > expect the kernel to manage and remedy if it becomes non-existent. > > > > To accomplish this, this patch implements a thrashing cutoff for the > > OOM killer. If the kernel determines a sustained high level of memory > > pressure, and thus a lack of forward progress in userspace, it will > > trigger the OOM killer to reduce memory contention. > > > > Per default, the OOM killer will engage after 15 seconds of at least > > 80% memory pressure. These values are tunable via sysctls > > vm.thrashing_oom_period and vm.thrashing_oom_level. > > Could be implemented in userspace? > </troll> We do in fact do this with oomd. 
But it requires a comprehensive cgroup setup, with complete memory and IO isolation, to protect that daemon from the memory pressure and excessive paging of the rest of the system (mlock doesn't really cut it because you need to potentially allocate quite a few proc dentries and inodes just to walk the process tree and determine a kill target). In a fleet that works fine, since we need to maintain that cgroup infra anyway. But for other users, that's a lot of stack for basic "don't hang forever if I allocate too much memory" functionality. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 20:51 ` Johannes Weiner 2019-08-07 21:01 ` Andrew Morton @ 2019-08-07 21:12 ` Johannes Weiner 2019-08-08 11:48 ` Michal Hocko 2019-08-08 14:47 ` Vlastimil Babka 3 siblings, 0 replies; 48+ messages in thread From: Johannes Weiner @ 2019-08-07 21:12 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Wed, Aug 07, 2019 at 04:51:42PM -0400, Johannes Weiner wrote: > Per default, the OOM killer will engage after 15 seconds of at least > 80% memory pressure. These values are tunable via sysctls > vm.thrashing_oom_period and vm.thrashing_oom_level. Let's go with this: Per default, the OOM killer will engage after 15 seconds of at least 80% memory pressure. From experience, at 80% the user is experiencing multi-second reaction times. 15 seconds is chosen to be long enough to not OOM kill a short-lived spike that might resolve itself, yet short enough for users to not press the reset button just yet. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 20:51 ` Johannes Weiner 2019-08-07 21:01 ` Andrew Morton 2019-08-07 21:12 ` Johannes Weiner @ 2019-08-08 11:48 ` Michal Hocko 2019-08-08 15:10 ` ndrw.xf 2019-08-08 14:47 ` Vlastimil Babka 3 siblings, 1 reply; 48+ messages in thread From: Michal Hocko @ 2019-08-08 11:48 UTC (permalink / raw) To: Johannes Weiner Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Wed 07-08-19 16:51:38, Johannes Weiner wrote: [...] > >From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Mon, 5 Aug 2019 13:15:16 -0400 > Subject: [PATCH] psi: trigger the OOM killer on severe thrashing > > Over the last few years we have had many reports that the kernel can > enter an extended livelock situation under sufficient memory > pressure. The system becomes unresponsive and fully IO bound for > indefinite periods of time, and often the user has no choice but to > reboot. or sysrq+f > Even though the system is clearly struggling with a shortage > of memory, the OOM killer is not engaging reliably. > > The reason is that with bigger RAM, and in particular with faster > SSDs, page reclaim does not necessarily fail in the traditional sense > anymore. In the time it takes the CPU to run through the vast LRU > lists, there are almost always some cache pages that have finished > reading in and can be reclaimed, even before userspace had a chance to > access them. As a result, reclaim is nominally succeeding, but > userspace is refault-bound and not making significant progress. > > While this is clearly noticable to human beings, the kernel could not > actually determine this state with the traditional memory event > counters. 
We might see a certain rate of reclaim activity or refaults, > but how long, or whether at all, userspace is unproductive because of > it depends on IO speed, readahead efficiency, as well as memory access > patterns and concurrency of the userspace applications. The same > number of VM events could go unnoticed in one system / workload > combination, and result in an indefinite lockup in a different one. > > However, eb414681d5a0 ("psi: pressure stall information for CPU, > memory, and IO") introduced a memory pressure metric that quantifies > the share of wallclock time in which userspace waits on reclaim, > refaults, swapins. By using absolute time, it encodes all the above > mentioned variables of hardware capacity and workload behavior. When > memory pressure is 40%, it means that 40% of the time the workload is > stalled on memory, period. This is the actual measure for the lack of > forward progress that users can experience. It's also something they > expect the kernel to manage and remedy if it becomes non-existent. > > To accomplish this, this patch implements a thrashing cutoff for the > OOM killer. If the kernel determines a sustained high level of memory > pressure, and thus a lack of forward progress in userspace, it will > trigger the OOM killer to reduce memory contention. > > Per default, the OOM killer will engage after 15 seconds of at least > 80% memory pressure. These values are tunable via sysctls > vm.thrashing_oom_period and vm.thrashing_oom_level. As I've said earlier, I would be somewhat more comfortable with a kernel command line/module parameter based tuning because it is less of a stable API, and a potential future stall detector might be completely independent of PSI and the currently exported metric. But I can live with that because a period and level sound quite generic. 
> Ideally, this would be standard behavior for the kernel, but since it > involves a new metric and OOM killing, let's be safe and make it an > opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0 > also disables the feature at runtime. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > Reported-by: "Artem S. Tashkinov" <aros@gmx.com> I am not deeply familiar with PSI internals, but from a quick look it seems that update_averages is called from an OOM-safe context (a worker). I have scratched my head over how to deal with this "progress is made but it is all in vain" problem inside the reclaim path, but I do not think that will ever work, and having a watchdog like this sounds like a step in the right direction. I didn't even expect it would look this simple. Really nice work, Johannes! Let's see how this ends up working in practice though. Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > Documentation/admin-guide/sysctl/vm.rst | 24 ++++++++ > include/linux/psi.h | 5 ++ > include/linux/psi_types.h | 6 ++ > kernel/sched/psi.c | 74 +++++++++++++++++++++++++ > kernel/sysctl.c | 20 +++++++ > mm/Kconfig | 20 +++++++ > 6 files changed, 149 insertions(+) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index 64aeee1009ca..0332cb52bcfc 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -66,6 +66,8 @@ files can be found in mm/swap.c. > - stat_interval > - stat_refresh > - numa_stat > +- thrashing_oom_level > +- thrashing_oom_period > - swappiness > - unprivileged_userfaultfd > - user_reserve_kbytes > @@ -825,6 +827,28 @@ When page allocation performance is not a bottleneck and you want all > echo 1 > /proc/sys/vm/numa_stat > > > +thrashing_oom_level > +=================== > + > +This defines the memory pressure level for severe thrashing at which > +the OOM killer will be engaged. > + > +The default is 80. 
This means the system is considered to be thrashing > +severely when all active tasks are collectively stalled on memory > +(waiting for page reclaim, refaults, swapins etc) for 80% of the time. > + > +A setting of 0 will disable thrashing-based OOM killing. > + > + > +thrashing_oom_period > +=================== > + > +This defines the number of seconds the system must sustain severe > +thrashing at thrashing_oom_level before the OOM killer is invoked. > + > +The default is 15. > + > + > swappiness > ========== > > diff --git a/include/linux/psi.h b/include/linux/psi.h > index 7b3de7321219..661ce45900f9 100644 > --- a/include/linux/psi.h > +++ b/include/linux/psi.h > @@ -37,6 +37,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, > poll_table *wait); > #endif > > +#ifdef CONFIG_THRASHING_OOM > +extern unsigned int sysctl_thrashing_oom_level; > +extern unsigned int sysctl_thrashing_oom_period; > +#endif > + > #else /* CONFIG_PSI */ > > static inline void psi_init(void) {} > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index 07aaf9b82241..7c57d7e5627e 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -162,6 +162,12 @@ struct psi_group { > u64 polling_total[NR_PSI_STATES - 1]; > u64 polling_next_update; > u64 polling_until; > + > +#ifdef CONFIG_THRASHING_OOM > + /* Severe thrashing state tracking */ > + bool oom_pressure; > + u64 oom_pressure_start; > +#endif > }; > > #else /* CONFIG_PSI */ > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index f28342dc65ec..4b1b620d6359 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -139,6 +139,7 @@ > #include <linux/ctype.h> > #include <linux/file.h> > #include <linux/poll.h> > +#include <linux/oom.h> > #include <linux/psi.h> > #include "sched.h" > > @@ -177,6 +178,14 @@ struct psi_group psi_system = { > .pcpu = &system_group_pcpu, > }; > > +#ifdef CONFIG_THRASHING_OOM > +static void psi_oom_tick(struct psi_group *group, u64 now); > 
+#else > +static inline void psi_oom_tick(struct psi_group *group, u64 now) > +{ > +} > +#endif > + > static void psi_avgs_work(struct work_struct *work); > > static void group_init(struct psi_group *group) > @@ -403,6 +412,8 @@ static u64 update_averages(struct psi_group *group, u64 now) > calc_avgs(group->avg[s], missed_periods, sample, period); > } > > + psi_oom_tick(group, now); > + > return avg_next_update; > } > > @@ -1280,3 +1291,66 @@ static int __init psi_proc_init(void) > return 0; > } > module_init(psi_proc_init); > + > +#ifdef CONFIG_THRASHING_OOM > +/* > + * Trigger the OOM killer when detecting severe thrashing. > + * > + * Per default we define severe thrashing as 15 seconds of 80% memory > + * pressure (i.e. all active tasks are collectively stalled on memory > + * 80% of the time). > + */ > +unsigned int sysctl_thrashing_oom_level = 80; > +unsigned int sysctl_thrashing_oom_period = 15; > + > +static void psi_oom_tick(struct psi_group *group, u64 now) > +{ > + struct oom_control oc = { > + .order = 0, > + }; > + unsigned long pressure; > + bool high; > + > + /* Disabled at runtime */ > + if (!sysctl_thrashing_oom_level) > + return; > + > + /* > + * Protect the system from livelocking due to thrashing. Leave > + * per-cgroup policies to oomd, lmkd etc. > + */ > + if (group != &psi_system) > + return; > + > + pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]); > + high = pressure >= sysctl_thrashing_oom_level; > + > + if (!group->oom_pressure && !high) > + return; > + > + if (!group->oom_pressure && high) { > + group->oom_pressure = true; > + group->oom_pressure_start = now; > + return; > + } > + > + if (group->oom_pressure && !high) { > + group->oom_pressure = false; > + return; > + } > + > + if (now < group->oom_pressure_start + > + (u64)sysctl_thrashing_oom_period * NSEC_PER_SEC) > + return; > + > + pr_warn("Severe thrashing detected! 
(%ds of %d%% memory pressure)\n", > + sysctl_thrashing_oom_period, sysctl_thrashing_oom_level); > + > + group->oom_pressure = false; > + > + if (!mutex_trylock(&oom_lock)) > + return; > + out_of_memory(&oc); > + mutex_unlock(&oom_lock); > +} > +#endif /* CONFIG_THRASHING_OOM */ > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index f12888971d66..3b9b3deb1836 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -68,6 +68,7 @@ > #include <linux/bpf.h> > #include <linux/mount.h> > #include <linux/userfaultfd_k.h> > +#include <linux/psi.h> > > #include "../lib/kstrtox.h" > > @@ -1746,6 +1747,25 @@ static struct ctl_table vm_table[] = { > .extra1 = SYSCTL_ZERO, > .extra2 = SYSCTL_ONE, > }, > +#endif > +#ifdef CONFIG_THRASHING_OOM > + { > + .procname = "thrashing_oom_level", > + .data = &sysctl_thrashing_oom_level, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = &one_hundred, > + }, > + { > + .procname = "thrashing_oom_period", > + .data = &sysctl_thrashing_oom_period, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + }, > #endif > { } > }; > diff --git a/mm/Kconfig b/mm/Kconfig > index 56cec636a1fc..cef13b423beb 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -736,4 +736,24 @@ config ARCH_HAS_PTE_SPECIAL > config ARCH_HAS_HUGEPD > bool > > +config THRASHING_OOM > + bool "Trigger the OOM killer on severe thrashing" > + select PSI > + help > + Under memory pressure, the kernel can enter severe thrashing > + or swap storms during which the system is fully IO-bound and > + does not respond to any user input. The OOM killer does not > + always engage because page reclaim manages to make nominal > + forward progress, but the system is effectively livelocked. > + > + This feature uses pressure stall information (PSI) to detect > + severe thrashing and trigger the OOM killer. 
> + > + The OOM killer will be engaged when the system sustains a > + memory pressure level of 80% for 15 seconds. This can be > + adjusted using the vm.thrashing_oom_[level|period] sysctls. > + > + Say Y if you have observed your system becoming unresponsive > + for extended periods under memory pressure. > + > endmenu > -- > 2.22.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 11:48 ` Michal Hocko @ 2019-08-08 15:10 ` ndrw.xf 2019-08-08 16:32 ` Michal Hocko 2021-07-24 17:32 ` Alexey Avramov 0 siblings, 2 replies; 48+ messages in thread From: ndrw.xf @ 2019-08-08 15:10 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner Cc: Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 8 August 2019 12:48:26 BST, Michal Hocko <mhocko@kernel.org> wrote: >> >> Per default, the OOM killer will engage after 15 seconds of at least >> 80% memory pressure. These values are tunable via sysctls >> vm.thrashing_oom_period and vm.thrashing_oom_level. > >As I've said earlier, I would be somewhat more comfortable with a kernel >command line/module parameter based tuning because it is less of a >stable API, and a potential future stall detector might be completely >independent of PSI and the current metric exported. But I can live with >that because a period and level sound quite generic. Would it be possible to reserve a fixed (configurable) amount of RAM for caches, and trigger the OOM killer earlier, before most UI code is evicted from memory? In my use case, I am happy to sacrifice e.g. 0.5GB and kill runaway tasks _before_ the system freezes. Potentially the OOM killer would also work better in such conditions. I almost never work at close to full memory capacity; it's always a single task that goes wrong and brings the system down. The problem with PSI sensing is that it works after the fact (after the freeze has already occurred). It is not very different from issuing SysRq-f manually on a frozen system, although it would still be a handy feature for batched tasks and remote access. Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 15:10 ` ndrw.xf @ 2019-08-08 16:32 ` Michal Hocko 2019-08-08 17:57 ` ndrw.xf 0 siblings, 1 reply; 48+ messages in thread From: Michal Hocko @ 2019-08-08 16:32 UTC (permalink / raw) To: ndrw.xf Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Thu 08-08-19 16:10:07, ndrw.xf@redhazel.co.uk wrote: > > > On 8 August 2019 12:48:26 BST, Michal Hocko <mhocko@kernel.org> wrote: > >> > >> Per default, the OOM killer will engage after 15 seconds of at least > >> 80% memory pressure. These values are tunable via sysctls > >> vm.thrashing_oom_period and vm.thrashing_oom_level. > > > >As I've said earlier, I would be somewhat more comfortable with a kernel > >command line/module parameter based tuning because it is less of a > >stable API, and a potential future stall detector might be completely > >independent of PSI and the current metric exported. But I can live with > >that because a period and level sound quite generic. > > Would it be possible to reserve a fixed (configurable) amount of RAM for caches, I am afraid there is nothing like that available, and I would even argue it doesn't make much sense either. What would you consider to be a cache? Kernel or userspace reclaimable memory? What about any other in-kernel memory users? How would you set up such a limit and make it reasonably maintainable over different kernel releases when the memory footprint changes over time? Besides that, how does that differ from the existing reclaim mechanism? Once your cache hits the limit, there would have to be some sort of reclaim to happen, and then we are back to square one when the reclaim is making progress but you are effectively thrashing over the hot working set (e.g. code pages) > and trigger the OOM killer earlier, before most UI code is evicted from memory? 
How does the kernel know that important memory is evicted? E.g. say that your graphics stack is under pressure and it has to drop internal caches. No outstanding processes will be swapped out, yet your UI will be completely frozen. > In my use case, I am happy sacrificing e.g. 0.5GB and kill runaway > tasks _before_ the system freezes. Potentially OOM killer would also > work better in such conditions. I almost never work at close to full > memory capacity, it's always a single task that goes wrong and brings > the system down. If you know which task that is, then you can put it into a memory cgroup with a stricter memory limit and have it killed before the overall system starts suffering. > The problem with PSI sensing is that it works after the fact (after > the freeze has already occurred). It is not very different from > issuing SysRq-f manually on a frozen system, although it would still > be a handy feature for batched tasks and remote access. Not really. PSI is giving you a metric that tells you how much time you spend on memory reclaim. So you can start watching the system from lower utilization already. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 16:32 ` Michal Hocko @ 2019-08-08 17:57 ` ndrw.xf 2019-08-08 18:59 ` Michal Hocko 0 siblings, 1 reply; 48+ messages in thread From: ndrw.xf @ 2019-08-08 17:57 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 8 August 2019 17:32:28 BST, Michal Hocko <mhocko@kernel.org> wrote: > >> Would it be possible to reserve a fixed (configurable) amount of RAM >for caches, > >I am afraid there is nothing like that available and I would even argue >it doesn't make much sense either. What would you consider to be a >cache? A kernel/userspace reclaimable memory? What about any other in >kernel memory users? How would you setup such a limit and make it >reasonably maintainable over different kernel releases when the memory >footprint changes over time? Frankly, I don't know. The earlyoom userspace tool works well enough for me so I assumed this functionality could be implemented in kernel. Default thresholds would have to be tested but it is unlikely zero is the optimum value. >Besides that how does that differ from the existing reclaim mechanism? >Once your cache hits the limit, there would have to be some sort of the >reclaim to happen and then we are back to square one when the reclaim >is >making progress but you are effectively treshing over the hot working >set (e.g. code pages) By forcing OOM killer. Reclaiming memory when system becomes unresponsive is precisely what I want to avoid. >> and trigger OOM killer earlier, before most UI code is evicted from >memory? > >How does the kernel knows that important memory is evicted? I assume current memory management policy (LRU?) is sufficient to keep most frequently used pages in memory. 
>If you know which task that is then you can put it into a memory cgroup >with a stricter memory limit and have it killed before the overall >system >starts suffering. This is what I intended to use. But I don't know how to bypass SystemD or configure such policies via SystemD. >PSI is giving you a metric that tells you how much time you >spend on memory reclaim. So you can start watching the system from >lower utilization already. This is fantastic news. Really. I didn't know this is how it works. Two potential issues, though: 1. PSI (if possible) should be normalised wrt the memory reclaiming cost (SSDs have lower cost than HDDs). If not automatically then perhaps via a user-configurable option. That's somewhat similar to having configurable PSI thresholds. 2. It seems PSI measures the _rate_ pages are evicted from memory. While this may correlate with the _absolute_ amount of memory left, it is not the same. Perhaps weighting PSI with the absolute amount of memory used for caches would improve this metric. Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
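[Editor's note] ndrw's second point, weighting PSI by how much cache remains, is not something the kernel provides. A purely illustrative userspace combination (the formula and inputs are assumptions, not a documented metric) might look like:

```python
def cache_weighted_pressure(psi_some_avg10, cached_kb, mem_total_kb):
    """Scale the PSI 'some' average by how little page cache remains.

    Purely illustrative heuristic: the same PSI reading is treated as
    more alarming when caches are nearly exhausted. Inputs would come
    from /proc/pressure/memory (avg10 of the 'some' line) and
    /proc/meminfo (Cached, MemTotal).
    """
    cache_fraction = min(cached_kb / mem_total_kb, 1.0)
    return psi_some_avg10 * (1.0 - cache_fraction)
```

E.g. a 10% stall reading with no cache left scores 10.0, while the same reading with half of RAM still in cache scores 5.0.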
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 17:57 ` ndrw.xf @ 2019-08-08 18:59 ` Michal Hocko 2019-08-08 21:59 ` ndrw 0 siblings, 1 reply; 48+ messages in thread From: Michal Hocko @ 2019-08-08 18:59 UTC (permalink / raw) To: ndrw.xf Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Thu 08-08-19 18:57:02, ndrw.xf@redhazel.co.uk wrote: > > > On 8 August 2019 17:32:28 BST, Michal Hocko <mhocko@kernel.org> wrote: > > > >> Would it be possible to reserve a fixed (configurable) amount of RAM > >for caches, > > > >I am afraid there is nothing like that available and I would even argue > >it doesn't make much sense either. What would you consider to be a > >cache? A kernel/userspace reclaimable memory? What about any other in > >kernel memory users? How would you setup such a limit and make it > >reasonably maintainable over different kernel releases when the memory > >footprint changes over time? > > Frankly, I don't know. The earlyoom userspace tool works well enough > for me so I assumed this functionality could be implemented in > kernel. Default thresholds would have to be tested but it is unlikely > zero is the optimum value. Well, I am afraid that implementing anything like that in the kernel will lead to many regressions and bug reports. People tend to have very different opinions on when it is suitable to kill a potentially important part of a workload just because memory gets low. > >Besides that how does that differ from the existing reclaim mechanism? > >Once your cache hits the limit, there would have to be some sort of the > >reclaim to happen and then we are back to square one when the reclaim > >is > >making progress but you are effectively treshing over the hot working > >set (e.g. code pages) > > By forcing OOM killer. Reclaiming memory when system becomes unresponsive is precisely what I want to avoid. 
> > >> and trigger OOM killer earlier, before most UI code is evicted from > >memory? > > > >How does the kernel know that important memory is evicted? > > I assume current memory management policy (LRU?) is sufficient to keep most frequently used pages in memory. The LRU aspect doesn't help much, really. If we are reclaiming the same set of pages because they are needed for the workload to operate then we are effectively thrashing no matter what kind of replacement policy you are going to use. [...] > >PSI is giving you a metric that tells you how much time you > >spend on memory reclaim. So you can start watching the system from > >lower utilization already. > > This is a fantastic news. Really. I didn't know this is how it > works. Two potential issues, though: > 1. PSI (if possible) should be normalised wrt the memory reclaiming > cost (SSDs have lower cost than HDDs). If not automatically then > perhaps via a user configurable option. That's somewhat similar to > having configurable PSI thresholds. The cost of the reclaim is inherently reflected in those numbers already because it gives you the amount of time that is spent getting memory for you. If you are under memory pressure then the memory reclaim is a part of the allocation path. > 2. It seems PSI measures the _rate_ pages are evicted from > memory. While this may correlate with the _absolute_ amount of > memory left, it is not the same. Perhaps weighting PSI with the absolute > amount of memory used for caches would improve this metric. Please refer to Documentation/accounting/psi.rst for more information about how PSI works. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
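[Editor's note] The psi.rst format Michal points to is straightforward to consume from userspace. A small sketch, with field names as documented ('some'/'full' lines, avg10/avg60/avg300 percentages, and a monotonic 'total' stall counter in microseconds):

```python
def parse_psi(text):
    """Parse the 'some'/'full' lines of /proc/pressure/memory.

    Format per Documentation/accounting/psi.rst, e.g.:
      some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
    """
    result = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)
        fields = dict(item.split("=") for item in rest.split())
        # 'total' is an integer microsecond counter; the averages are floats
        result[kind] = {k: int(v) if k == "total" else float(v)
                        for k, v in fields.items()}
    return result

def read_memory_pressure():
    """Read the live numbers (needs CONFIG_PSI, kernel 4.20+)."""
    with open("/proc/pressure/memory") as f:
        return parse_psi(f.read())
```

`parse_psi(...)["full"]["avg10"]` then gives the percentage of the last 10 seconds in which *all* non-idle tasks were stalled on memory.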
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 18:59 ` Michal Hocko @ 2019-08-08 21:59 ` ndrw 2019-08-09 8:57 ` Michal Hocko 0 siblings, 1 reply; 48+ messages in thread From: ndrw @ 2019-08-08 21:59 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 08/08/2019 19:59, Michal Hocko wrote: > Well, I am afraid that implementing anything like that in the kernel > will lead to many regressions and bug reports. People tend to have very > different opinions on when it is suitable to kill a potentially > important part of a workload just because memory gets low. Are you proposing having a zero memory reserve or not having such option at all? I'm fine with the current default (zero reserve/margin). I strongly prefer forcing OOM killer when the system is still running normally. Not just for preventing stalls: in my limited testing I found the OOM killer on a stalled system rather inaccurate, occasionally killing system services etc. I had much better experience with earlyoom. > LRU aspect doesn't help much, really. If we are reclaiming the same set > of pages becuase they are needed for the workload to operate then we are > effectivelly treshing no matter what kind of replacement policy you are > going to use. In my case it would work fine (my system already works well with earlyoom, and without it it remains responsive until last couple hundred MB of RAM). >>> PSI is giving you a matric that tells you how much time you >>> spend on the memory reclaim. So you can start watching the system from >>> lower utilization already. I've tested it on a system with 45GB of RAM, SSD, swap disabled (my intention was to approximate a worst-case scenario) and it didn't really detect stall before it happened. 
I can see some activity after reaching ~42GB; the system remains fully responsive until it suddenly freezes and requires sysrq-f. PSI appears to increase a bit when the system is about to run out of memory but the change is so small it would be difficult to set a reliable threshold. I expect the PSI numbers to increase significantly after the stall (I wasn't able to capture them) but, as mentioned above, I was hoping for a solution that would work before the stall.

$ while true; do sleep 1; cat /proc/pressure/memory ; done
[starting a test script and waiting for several minutes to fill up memory]
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
full avg10=0.00 avg60=0.00 avg300=0.00 total=6442
some avg10=0.00 avg60=0.00 avg300=0.00 total=18950
full avg10=0.00 avg60=0.00 avg300=0.00 total=11576
some avg10=0.00 avg60=0.00 avg300=0.00 total=25655
full avg10=0.00 avg60=0.00 avg300=0.00 total=16159
some avg10=0.00 avg60=0.00 avg300=0.00 total=31438
full avg10=0.00 avg60=0.00 avg300=0.00 total=19552
some avg10=0.00 avg60=0.00 avg300=0.00 total=44549
full avg10=0.00 avg60=0.00 avg300=0.00 total=27772
some avg10=0.00 avg60=0.00 avg300=0.00 total=52520
full avg10=0.00 avg60=0.00 avg300=0.00 total=32580
some avg10=0.00 avg60=0.00 avg300=0.00 total=60451
full avg10=0.00 avg60=0.00 avg300=0.00 total=37704
some avg10=0.00 avg60=0.00 avg300=0.00 total=68986
full avg10=0.00 avg60=0.00 avg300=0.00 total=42859
some avg10=0.00 avg60=0.00 avg300=0.00 total=76598
full avg10=0.00 avg60=0.00 avg300=0.00 total=48370
some avg10=0.00 avg60=0.00 avg300=0.00 total=83080
full avg10=0.00 avg60=0.00 avg300=0.00 total=52930
some avg10=0.00 avg60=0.00 avg300=0.00 total=89384
full avg10=0.00 avg60=0.00 avg300=0.00 total=56350
some avg10=0.00 avg60=0.00 avg300=0.00 total=95293
full avg10=0.00 avg60=0.00 avg300=0.00 total=60260
some avg10=0.00 avg60=0.00 avg300=0.00 total=101566
full avg10=0.00 avg60=0.00 avg300=0.00 total=64408
some avg10=0.00 avg60=0.00 avg300=0.00 total=108131
full avg10=0.00 avg60=0.00 avg300=0.00 total=68412
some avg10=0.00 avg60=0.00 avg300=0.00 total=121932
full avg10=0.00 avg60=0.00 avg300=0.00 total=77413
some avg10=0.00 avg60=0.00 avg300=0.00 total=140807
full avg10=0.00 avg60=0.00 avg300=0.00 total=91269
some avg10=0.00 avg60=0.00 avg300=0.00 total=170494
full avg10=0.00 avg60=0.00 avg300=0.00 total=110611
[stall, sysrq-f]

Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
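[Editor's note] What the log above shows is that the avg10/avg60/avg300 columns stay at 0.00 while the 'total' counter (accumulated stall time in microseconds) keeps climbing and accelerating. A rough sketch of a watcher based on the first difference of 'total', which sees that trend well before the averages move:

```python
def stall_fraction(prev_total_us, curr_total_us, interval_s=1.0):
    """Fraction of a sampling interval spent stalled on memory, from two
    successive readings of the monotonic 'total' counter (microseconds)
    in /proc/pressure/memory."""
    return (curr_total_us - prev_total_us) / (interval_s * 1_000_000)

# 1-second samples of the 'full' counter from the tail of the log above:
totals = [48370, 52930, 56350, 60260, 64408, 68412, 77413, 91269, 110611]
deltas = [stall_fraction(a, b) for a, b in zip(totals, totals[1:])]
# the per-second stall time roughly quadruples over the last few samples,
# even though every avg10 reading in the log is still 0.00
```

A daemon sampling like this could act (or raise an alert) once the per-interval stall fraction trends upward, rather than waiting for the 10-second average to register.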
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 21:59 ` ndrw @ 2019-08-09 8:57 ` Michal Hocko 2019-08-09 10:09 ` ndrw 2019-08-10 21:07 ` ndrw 0 siblings, 2 replies; 48+ messages in thread From: Michal Hocko @ 2019-08-09 8:57 UTC (permalink / raw) To: ndrw Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Thu 08-08-19 22:59:32, ndrw wrote: > On 08/08/2019 19:59, Michal Hocko wrote: > > Well, I am afraid that implementing anything like that in the kernel > > will lead to many regressions and bug reports. People tend to have very > > different opinions on when it is suitable to kill a potentially > > important part of a workload just because memory gets low. > > Are you proposing having a zero memory reserve or not having such option at > all? I'm fine with the current default (zero reserve/margin). We already do have a reserve (min_free_kbytes). That gives kswapd some room to perform reclaim in the background without obvious latencies to allocating tasks (well, CPU is still used so there is still some effect). Kswapd tries to keep a balance: free memory low but still with some room to satisfy an immediate memory demand. Once kswapd doesn't catch up with the memory demand we dive into direct reclaim and that is where people usually see latencies coming from. The main problem here is that it is hard to tell from a single allocation latency that we have a bigger problem. As already said, the usual thrashing scenario doesn't show a problem during reclaim because pages can be freed up very efficiently. The problem is that they are refaulted very quickly so we are effectively rotating the working set like crazy. Compare that to a normal used-once streaming IO workload which is generating a lot of page cache that can be recycled at a similar pace but the working set doesn't get freed. 
Free memory figures will look very similar in both cases. > I strongly prefer forcing OOM killer when the system is still running > normally. Not just for preventing stalls: in my limited testing I found the > OOM killer on a stalled system rather inaccurate, occasionally killing > system services etc. I had much better experience with earlyoom. Good that earlyoom works for you. All I am saying is that this is not a generally applicable heuristic because we do care about a larger variety of workloads. I should probably emphasise that the OOM killer is there as a _last resort_ hand brake when something goes terribly wrong. It operates at times when any user intervention would be really hard because there is a lack of resources to be actionable. [...] > > > > PSI is giving you a metric that tells you how much time you > > > > spend on memory reclaim. So you can start watching the system from > > > > lower utilization already. > > I've tested it on a system with 45GB of RAM, SSD, swap disabled (my > intention was to approximate a worst-case scenario) and it didn't really > detect the stall before it happened. I can see some activity after reaching > ~42GB, the system remains fully responsive until it suddenly freezes and > requires sysrq-f. This is useful feedback! What was your workload? Which kernel version? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
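[Editor's note] For comparison, the earlyoom tool discussed here acts on /proc/meminfo rather than PSI. A simplified sketch of that heuristic; the 10% threshold is illustrative (earlyoom's actual default also considers swap and does considerably more):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo into {field: value-in-kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])  # values are in kB
    return info

def below_threshold(info, min_avail_percent=10):
    """earlyoom-style rule (threshold illustrative): act while the kernel
    still reports some MemAvailable, instead of waiting for a stall.
    MemAvailable is the kernel's own estimate of memory usable without
    swapping, which already accounts for reclaimable caches."""
    return info["MemAvailable"] * 100 < info["MemTotal"] * min_avail_percent
```

A monitoring loop would re-read /proc/meminfo every second and send SIGTERM/SIGKILL to the largest consumer once `below_threshold` holds, which is exactly the "kill before the thrashing starts" behaviour ndrw reports working well.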
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 8:57 ` Michal Hocko @ 2019-08-09 10:09 ` ndrw 2019-08-09 10:50 ` Michal Hocko 2019-08-10 21:07 ` ndrw 1 sibling, 1 reply; 48+ messages in thread From: ndrw @ 2019-08-09 10:09 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 09/08/2019 09:57, Michal Hocko wrote: > We already do have a reserve (min_free_kbytes). That gives kswapd some > room to perform reclaim in the background without obvious latencies to > allocating tasks (well CPU still be used so there is still some effect). I tried this option in the past. Unfortunately, it didn't prevent freezes. My understanding is this option reserves some amount of memory to not be swapped out but does not prevent the kernel from evicting all pages from cache when more memory is needed. > Kswapd tries to keep a balance and free memory low but still with some > room to satisfy an immediate memory demand. Once kswapd doesn't catch up > with the memory demand we dive into the direct reclaim and that is where > people usually see latencies coming from. Reclaiming memory is fine, of course, but not all the way to 0 caches. No caches means all executable pages, ro pages (e.g. fonts) are evicted from memory and have to be constantly reloaded on every user action. All this while competing with tasks that are using up all memory. This happens with or without swap, although swap does spread this issue in time a bit. > The main problem here is that it is hard to tell from a single > allocation latency that we have a bigger problem. As already said, the > usual thrashing scenario doesn't show a problem during reclaim because > pages can be freed up very efficiently. The problem is that they are > refaulted very quickly so we are effectively rotating the working set like > crazy. 
Compare that to a normal used-once streaming IO workload which is > generating a lot of page cache that can be recycled in a similar pace > but a working set doesn't get freed. Free memory figures will look very > similar in both cases. Thank you for the explanation. It is indeed a difficult problem - some cached pages (streaming IO) will likely not be needed again and should be discarded asap, other (like mmapped executable/ro pages of UI utilities) will cause thrashing when evicted under high memory pressure. Another aspect is that PSI is probably not the best measure of detecting imminent thrashing. However, if it can at least detect a freeze that has already occurred and force the OOM killer that is still a lot better than a dead system, which is the current user experience. > Good that earlyoom works for you. I am giving it as an example of a heuristic that seems to work very well for me. Something to look into. And yes, I wouldn't mind having such mechanism built into the kernel. > All I am saying is that this is not > generally applicable heuristic because we do care about a larger variety > of workloads. I should probably emphasise that the OOM killer is there > as a _last resort_ hand break when something goes terribly wrong. It > operates at times when any user intervention would be really hard > because there is a lack of resources to be actionable. It is indeed a last resort solution - without it the system is unusable. Still, accuracy matters because killing a wrong task does not fix the problem (a task hogging memory is still running) and may break the system anyway if something important is killed instead. [...] > This is a useful feedback! What was your workload? Which kernel version? I tested it by running a python script that processes a large amount of data in memory (needs around 15GB of RAM). I normally run 2 instances of that script in parallel but for testing I started 4 of them. 
I sometimes experience the same issue when using multiple regular memory intensive desktop applications in a manner described in the first post but that's harder to reproduce because of the user input needed. [ 0.000000] Linux version 5.0.0-21-generic (buildd@lgw01-amd64-036) (gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #22-Ubuntu SMP Tue Jul 2 13:27:33 UTC 2019 (Ubuntu 5.0.0-21.22-generic 5.0.15) AMD CPU with 4 cores, 8 threads. AMDGPU graphics stack. Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 10:09 ` ndrw @ 2019-08-09 10:50 ` Michal Hocko 2019-08-09 14:18 ` Pintu Agarwal 2019-08-10 12:34 ` ndrw 0 siblings, 2 replies; 48+ messages in thread From: Michal Hocko @ 2019-08-09 10:50 UTC (permalink / raw) To: ndrw Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Fri 09-08-19 11:09:33, ndrw wrote: > On 09/08/2019 09:57, Michal Hocko wrote: > > We already do have a reserve (min_free_kbytes). That gives kswapd some > > room to perform reclaim in the background without obvious latencies to > > allocating tasks (well CPU still be used so there is still some effect). > > I tried this option in the past. Unfortunately, it didn't prevent freezes. My > understanding is this option reserves some amount of memory to not be to not be used by normal allocations. It defines reclaim watermarks and that influences when the background and direct reclaim start to act. > swapped out but does not prevent the kernel from evicting all pages from > cache when more memory is needed. It doesn't have any say on the actual decision on what to reclaim. > > Kswapd tries to keep a balance and free memory low but still with some > > room to satisfy an immediate memory demand. Once kswapd doesn't catch up > > with the memory demand we dive into the direct reclaim and that is where > > people usually see latencies coming from. > > Reclaiming memory is fine, of course, but not all the way to 0 caches. No > caches means all executable pages, ro pages (e.g. fonts) are evicted from > memory and have to be constantly reloaded on every user action. All this > while competing with tasks that are using up all memory. This happens with > or without swap, although swap does spread this issue in time a bit. We try to protect a low amount of cache. Have a look at the get_scan_count function. 
But the exact amount of the cache to be protected is really hard to know without a crystal ball or understanding of the workload. The kernel has neither of the two. > > The main problem here is that it is hard to tell from a single > > allocation latency that we have a bigger problem. As already said, the > > usual thrashing scenario doesn't show a problem during reclaim because > > pages can be freed up very efficiently. The problem is that they are > > refaulted very quickly so we are effectively rotating the working set like > > crazy. Compare that to a normal used-once streaming IO workload which is > > generating a lot of page cache that can be recycled at a similar pace > > but the working set doesn't get freed. Free memory figures will look very > > similar in both cases. > > Thank you for the explanation. It is indeed a difficult problem - some > cached pages (streaming IO) will likely not be needed again and should be > discarded asap, others (like mmapped executable/ro pages of UI utilities) > will cause thrashing when evicted under high memory pressure. Another aspect > is that PSI is probably not the best measure of detecting imminent > thrashing. However, if it can at least detect a freeze that has already > occurred and force the OOM killer, that is still a lot better than a dead > system, which is the current user experience. We have been thinking about this problem for a long time and couldn't come up with anything much better than we have now. PSI is the most recent improvement in that area. If you have better ideas then patches are always welcome. > > Good that earlyoom works for you. > > I am giving it as an example of a heuristic that seems to work very well for > me. Something to look into. And yes, I wouldn't mind having such a mechanism > built into the kernel. > > > All I am saying is that this is not > > a generally applicable heuristic because we do care about a larger variety > > of workloads. 
I should probably emphasise that the OOM killer is there > > as a _last resort_ hand break when something goes terribly wrong. It > > operates at times when any user intervention would be really hard > > because there is a lack of resources to be actionable. > > It is indeed a last resort solution - without it the system is unusable. > Still, accuracy matters because killing a wrong task does not fix the > problem (a task hogging memory is still running) and may break the system > anyway if something important is killed instead. That is a completely orthogonal problem, I am afraid. So far we have been discussing _when_ to trigger OOM killer. This is _who_ to kill. I haven't heard any recent examples that the victim selection would be way off and killing something obviously incorrect. > [...] > > > This is a useful feedback! What was your workload? Which kernel version? > > I tested it by running a python script that processes a large amount of data > in memory (needs around 15GB of RAM). I normally run 2 instances of that > script in parallel but for testing I started 4 of them. I sometimes > experience the same issue when using multiple regular memory intensive > desktop applications in a manner described in the first post but that's > harder to reproduce because of the user input needed. Something that other people can play with to reproduce the issue would be more than welcome. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 10:50 ` Michal Hocko @ 2019-08-09 14:18 ` Pintu Agarwal 2019-08-10 12:34 ` ndrw 1 sibling, 0 replies; 48+ messages in thread From: Pintu Agarwal @ 2019-08-09 14:18 UTC (permalink / raw) To: Michal Hocko Cc: ndrw, Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm, Pintu Kumar [...] Hi, This is an interesting topic for me so I would like to join the conversation. I will be glad if I can be of any help here, either in testing PSI or verifying some scenarios and observations. I have some experience working with low-memory embedded devices: RAM as low as 128MB, 256MB, less than 1GB mostly, with/without Display, DRM/Graphics support, along with ZRAM as swap space configured as 25% of RAM size. The eMMC storage space is also as low as 4GB or 8GB max. So, I have experienced this sluggishness, hangs, and OOM-kill issues quite a number of times, and I would like to share my experience and observations here. Recently, I have been exploring the PSI feature on my ARM Qemu/Beagle-Bone environment, so I can share some feedback for this as well. System sluggishness can result from 4 types of pressure (especially on smartphone devices): * memory allocation pressure * I/O pressure * Scheduling pressure * Network pressure I think the topic of concern here is memory pressure, so I would like to share some thoughts about this. * In my opinion, memory pressure should be internal to the system and not visible to the end users. * The pressure metrics can vary from system to system, so it's difficult to apply a single policy. * I guess this is the time to apply "Machine Learning" and "Artificial Intelligence" into the system :) * The memory pressure starts with how many times and how quickly the system is entering the slow path. Thus slow-path monitoring may give some clue about pressure building in the system. 
Thus I used to use a slow-path counter. Too much slow-path activity in the beginning itself indicates that the system needs to be re-designed. * The system should be prevented from entering the slow path again and again, thus avoiding pressure. If this happens then it's time to reclaim memory in large chunks, rather than in smaller chunks. Maybe it's time to think about a shrink_all_memory() knob in the kernel. It can be run as bottom-half processing, maybe from cgroups. * Some experiments were done in the past. Interested people can check this paper: http://events17.linuxfoundation.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf * The system is already behaving sluggishly even before it enters the oom-kill stage. So, most of the time the oom stage is skipped, has not occurred, or is just looping around. Thus, some kind of oom monitoring may help to gather some suspects. That's the reason I proposed to use something called an oom-stall-counter. That means the system is entering oom, but possibly not oom-kill. If this counter is updated, we assume that the system started behaving sluggishly. * An oom-kill-counter can also help in determining how much killing is happening in kernel space. Example: if PSI pressure is building up and this counter is not updating... But in any case system daemons should be protected from killing. * Some killing policy should be left to user space. So a standard system daemon (or kthread) should be designed along these lines. It should be configured dynamically based on the system and oom-score. From my previous experience, in Tizen, we used something called the resourced daemon. https://git.tizen.org/cgit/platform/core/system/resourced/tree/src/memory?h=tizen * Instead of a static policy there should be something called a "Dynamic Low Memory Manager" (DLLM) policy. That is, at every stage (slow-path, swapping, compaction-fail, reclaim-fail, oom) some action can be taken. Earlier this event was triggered using vmpressure, but now it can be replaced with PSI. 
* Another major culprit of sluggishness in the long run is system daemons occupying all of the swap space and never releasing it. So, even if we kill applications due to oom, it may not help much, since daemons will never be killed. So, I proposed something called "Dynamic Swappiness", where the swappiness of daemons can be lowered dynamically, while normal applications have higher values. In the past I have done several experiments on this; soon I will be publishing a paper on it. * Maybe it is helpful, to understand better, if we start from a very minimal scale (just 64MB to 512MB RAM) with busybox. If we can tune this perfectly, then larger scales will automatically have no issues. With respect to PSI, here are my observations: * The PSI memory averaging windows (10, 60, 300) are too high for an embedded system. I think these settings should be dynamic, or user-configurable, or there should be one more entry for 1s or less. * PSI memory values are updated after the oom-kill in the kernel has already happened, which means the sluggishness already occurred. So, I have to utilize the "total" field and monitor the difference manually. E.g. if the difference between the previous total and the next total is more than 100ms and rising, then we suspect OOM. * Currently, PSI values are system-wide. That is, after the sluggishness occurred, it is difficult to predict which task caused it. So, I was thinking of adding a new entry to capture task details as well. These are some of my opinions. They may or may not be applicable directly. Further brainstorming or discussion might be required. Regards, Pintu ^ permalink raw reply [flat|nested] 48+ messages in thread
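[Editor's note] The configurable sub-10s thresholds Pintu asks for exist since kernel 5.2 as PSI triggers, documented in Documentation/accounting/psi.rst: userspace writes "<some|full> <stall µs> <window µs>" to /proc/pressure/memory and polls the fd for POLLPRI. A minimal sketch (requires CONFIG_PSI and appropriate privileges; the 150ms/1s values are the example from the documentation, not a recommendation):

```python
import os
import select

def trigger_spec(kind, stall_us, window_us):
    """Build a PSI trigger string, e.g. b'some 150000 1000000' = notify
    when more than 150ms of partial memory stall accumulates within any
    1-second window."""
    assert kind in ("some", "full")
    return f"{kind} {stall_us} {window_us}".encode()

def wait_for_pressure(stall_us=150_000, window_us=1_000_000):
    """Block until the kernel reports the requested memory stall."""
    fd = os.open("/proc/pressure/memory", os.O_RDWR)
    try:
        os.write(fd, trigger_spec("some", stall_us, window_us))
        poller = select.poll()
        poller.register(fd, select.POLLPRI)  # PSI events arrive as POLLPRI
        poller.poll()                        # returns once threshold is crossed
    finally:
        os.close(fd)                         # closing unregisters the trigger
```

This is event-driven rather than sampled, so a userspace killer can react within the chosen window instead of waiting for the 10-second average to move.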
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 10:50 ` Michal Hocko 2019-08-09 14:18 ` Pintu Agarwal @ 2019-08-10 12:34 ` ndrw 2019-08-12 8:24 ` Michal Hocko 1 sibling, 1 reply; 48+ messages in thread From: ndrw @ 2019-08-10 12:34 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 09/08/2019 11:50, Michal Hocko wrote: > We try to protect low amount of cache. Have a look at get_scan_count > function. But the exact amount of the cache to be protected is really > hard to know without a crystal ball or understanding of the workload. > The kernel has neither of the two. Thank you. I'm familiarizing myself with the code. Is there anyone I could discuss some details with? I don't want to create too much noise here. For example, are file pages created by mmapping files, and are anon pages exclusively allocated on the heap (RW data)? If so, where do "streaming IO" pages belong? > We have been thinking about this problem for a long time and couldn't > come up with anything much better than we have now. PSI is the most recent > improvement in that area. If you have better ideas then patches are > always welcome. In general, I found there are very few user-accessible knobs for adjusting caching, especially in the pre-OOM phase. On the other hand, swapping and dirty page caching have many options or can even be disabled completely. For example, I would like to try disabling/limiting eviction of some/all file pages (for example exec pages) akin to disabling swapping, but there is no such mechanism. Yes, there would likely be problems with large RO mmapped files that would need to be addressed, but in many applications users would be interested in having such options. Adjusting how aggressive/conservative the system should be with the OOM killer also falls into this category. 
>> [OOM killer accuracy] > That is a completely orthogonal problem, I am afraid. So far we have > been discussing _when_ to trigger OOM killer. This is _who_ to kill. I > haven't heard any recent examples that the victim selection would be way > off and killing something obviously incorrect. You are right. I'd assumed earlyoom is more accurate because the OOM killer performs better on a system that isn't stalled yet (perhaps it does). But actually, earlyoom doesn't trigger the OOM killer at all: https://github.com/rfjakob/earlyoom#why-not-trigger-the-kernel-oom-killer Apparently some applications (chrome and electron-based tools) set their oom_score_adj incorrectly - this matches my observations of OOM killer behavior: https://bugs.chromium.org/p/chromium/issues/detail?id=333617 > Something that other people can play with to reproduce the issue would > be more than welcome. This is the script I used. It reliably reproduces the issue: https://github.com/ndrw6/import_postcodes/blob/master/import_postcodes.py but it has quite a few dependencies, needs some input data and, in general, does a lot more than just fill up the memory. I will try to come up with something simpler. Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
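[Editor's note] The oom_score_adj skew ndrw links to can be audited directly from /proc. A small sketch; the classification labels are informal shorthand, but the value semantics (-1000 disables the OOM killer for a task, negative values shield it, positive values make it a preferred victim) are as documented in proc(5):

```python
import os

def adj_class(adj):
    """Informal reading of an oom_score_adj value (-1000..1000)."""
    if adj <= -1000:
        return "unkillable"
    if adj < 0:
        return "shielded"
    if adj > 0:
        return "preferred victim"
    return "neutral"

def scan():
    """List (pid, comm, oom_score_adj) for all visible processes."""
    out = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/oom_score_adj") as f:
                adj = int(f.read())
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited mid-scan, or unreadable entry
        if adj != 0:
            out.append((int(pid), comm, adj))
    return out

if __name__ == "__main__":
    for pid, comm, adj in sorted(scan(), key=lambda t: t[2]):
        print(f"{pid:>7} {comm:<20} {adj:>5} ({adj_class(adj)})")
```

Running this on a desktop makes it easy to spot processes that have shifted their own badness score and would distort victim selection in the way the chromium bug describes.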
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-10 12:34 ` ndrw @ 2019-08-12 8:24 ` Michal Hocko 0 siblings, 0 replies; 48+ messages in thread From: Michal Hocko @ 2019-08-12 8:24 UTC (permalink / raw) To: ndrw Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Sat 10-08-19 13:34:06, ndrw wrote: > On 09/08/2019 11:50, Michal Hocko wrote: > > We try to protect a low amount of cache. Have a look at the get_scan_count > > function. But the exact amount of the cache to be protected is really > > hard to know without a crystal ball or understanding of the workload. > > The kernel has neither of the two. > > Thank you. I'm familiarizing myself with the code. Is there anyone I could > discuss some details with? I don't want to create too much noise here. The linux-mm mailing list sounds like a good fit. > For example, are file pages created by mmapping files, and are anon pages > exclusively allocated on the heap (RW data)? If so, where do "streaming IO" > pages belong? Page cache is generated by both buffered IO (read/write) and file mmaps. Anonymous memory comes from MAP_PRIVATE mappings of file-backed memory or from MAP_ANON. Streaming IO generally refers to single-pass IO whose data is not reused later (e.g. a backup). > > We have been thinking about this problem for a long time and couldn't > > come up with anything much better than we have now. PSI is the most recent > > improvement in that area. If you have better ideas then patches are > > always welcome. > > In general, I found there are very few user-accessible knobs for adjusting > caching, especially in the pre-OOM phase. On the other hand, swapping and dirty > page caching have many options or can even be disabled completely. 
> > For example, I would like to try disabling/limiting eviction of some/all > file pages (for example exec pages) akin to disabling swapping, but there is > no such mechanism. Yes, there would likely be problems with large RO mmapped > files that would need to be addressed, but in many applications users would > be interested in having such options. > > Adjusting how aggressive/conservative the system should be with the OOM > killer also falls into this category. What would that mean, and how would it be configured? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 8:57 ` Michal Hocko 2019-08-09 10:09 ` ndrw @ 2019-08-10 21:07 ` ndrw 1 sibling, 0 replies; 48+ messages in thread From: ndrw @ 2019-08-10 21:07 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Suren Baghdasaryan, Vlastimil Babka, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 09/08/2019 09:57, Michal Hocko wrote: > This is useful feedback! What was your workload? Which kernel version? With 16GB zram swap and swappiness=60 I get avg10 memory PSI numbers of about 10 when swap is half filled and ~30 immediately before the freeze. Swapping with zram has less effect on system responsiveness compared to swapping to an SSD, so, if combined with the proposed PSI-triggered OOM killer, this could be a viable solution. Still, using swap only to make PSI sensing work is a bit of overkill when triggering the OOM killer at non-zero available memory would do the job just as well. I don't really need these extra few GB of memory; I just want to get rid of system freezes. Perhaps we could have both heuristics. Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 15:10 ` ndrw.xf 2019-08-08 16:32 ` Michal Hocko @ 2021-07-24 17:32 ` Alexey Avramov 1 sibling, 0 replies; 48+ messages in thread From: Alexey Avramov @ 2021-07-24 17:32 UTC (permalink / raw) To: ndrw.xf Cc: akpm, aros, hannes, linux-kernel, linux-mm, mhocko, surenb, vbabka > Would it be possible to reserve a fixed (configurable) amount of RAM > for caches, and trigger OOM killer earlier, before most UI code is > evicted from memory? Yes! Try this patch: https://github.com/hakavlad/le9-patch The patch provides sysctl knobs for protecting the specified amount of clean file pages under memory pressure. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-07 20:51 ` Johannes Weiner ` (2 preceding siblings ...) 2019-08-08 11:48 ` Michal Hocko @ 2019-08-08 14:47 ` Vlastimil Babka 2019-08-08 17:27 ` Johannes Weiner 3 siblings, 1 reply; 48+ messages in thread From: Vlastimil Babka @ 2019-08-08 14:47 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 8/7/19 10:51 PM, Johannes Weiner wrote: > From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Mon, 5 Aug 2019 13:15:16 -0400 > Subject: [PATCH] psi: trigger the OOM killer on severe thrashing Thanks a lot, perhaps finally we are going to eat the elephant ;) I've tested this by booting with mem=8G and activating browser tabs as long as I could. Then initially the system started thrashing and didn't recover for minutes. Then I realized sysrq+f is disabled... Fixed that up after next reboot, tried lower thresholds, also started monitoring /proc/pressure/memory, and found out that after minutes of not being able to move the cursor, both avg10 and avg60 shows only around 15 for both some and full. Lowered thrashing_oom_level to 10 and (with thrashing_oom_period of 5) the thrashing OOM finally started kicking, and the system recovered by itself in reasonable time. So my conclusion is that the patch works, but there's something odd with suspiciously low PSI memory values on my system. Any idea how to investigate this? Also, does it matter that it's a modern desktop, so systemd puts everything into cgroups, and the unified cgroup2 hierarchy is also mounted? Thanks, Vlastimil ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 14:47 ` Vlastimil Babka @ 2019-08-08 17:27 ` Johannes Weiner 2019-08-09 14:56 ` Vlastimil Babka 0 siblings, 1 reply; 48+ messages in thread From: Johannes Weiner @ 2019-08-08 17:27 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote: > On 8/7/19 10:51 PM, Johannes Weiner wrote: > > From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 > > From: Johannes Weiner <hannes@cmpxchg.org> > > Date: Mon, 5 Aug 2019 13:15:16 -0400 > > Subject: [PATCH] psi: trigger the OOM killer on severe thrashing > > Thanks a lot, perhaps finally we are going to eat the elephant ;) > > I've tested this by booting with mem=8G and activating browser tabs as > long as I could. Then initially the system started thrashing and didn't > recover for minutes. Then I realized sysrq+f is disabled... Fixed that > up after next reboot, tried lower thresholds, also started monitoring > /proc/pressure/memory, and found out that after minutes of not being > able to move the cursor, both avg10 and avg60 shows only around 15 for > both some and full. Lowered thrashing_oom_level to 10 and (with > thrashing_oom_period of 5) the thrashing OOM finally started kicking, > and the system recovered by itself in reasonable time. It sounds like there is a missing annotation. The time has to be going somewhere, after all. One *known* missing vector I fixed recently is stalls in submit_bio() itself when refaulting, but it's not merged yet. Attaching the patch below, can you please test it? > So my conclusion is that the patch works, but there's something odd with > suspiciously low PSI memory values on my system. Any idea how to > investigate this? 
Also, does it matter that it's a modern desktop, so > systemd puts everything into cgroups, and the unified cgroup2 hierarchy > is also mounted? That shouldn't interfere because 1) pressure is reported recursively up the cgroup tree, so unless something else runs completely fine on the system, global pressure should reflect cgroup pressure and 2) the systemd defaults doesn't set any memory limits or protections, so if the system is hanging, it's unlikely that anything runs fine. bcc tools (https://iovisor.github.io/bcc/) has an awesome program called 'offcputime' that gives you stack traces of sleeping tasks. This could give an insight into where time is going and point out operations we might not be annotating correctly yet. --- From 1b3888bdf075f86f226af4e350c8a88435d1fe8e Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Thu, 11 Jul 2019 16:01:40 -0400 Subject: [PATCH] psi: annotate refault stalls from IO submission psi tracks the time tasks wait for refaulting pages to become uptodate, but it does not track the time spent submitting the IO. The submission part can be significant if backing storage is contended or when cgroup throttling (io.latency) is in effect - a lot of time is spent in submit_bio(). In that case, we underreport memory pressure. Annotate submit_bio() to account submission time as memory stall when the bio is reading userspace workingset pages. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- block/bio.c | 3 +++ block/blk-core.c | 23 ++++++++++++++++++++++- include/linux/blk_types.h | 1 + 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/block/bio.c b/block/bio.c index 29cd6cf4da51..4dd9ea0b068b 100644 --- a/block/bio.c +++ b/block/bio.c @@ -805,6 +805,9 @@ void __bio_add_page(struct bio *bio, struct page *page, bio->bi_iter.bi_size += len; bio->bi_vcnt++; + + if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page))) + bio_set_flag(bio, BIO_WORKINGSET); } EXPORT_SYMBOL_GPL(__bio_add_page); diff --git a/block/blk-core.c b/block/blk-core.c index 5d1fc8e17dd1..5993922d63fb 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -36,6 +36,7 @@ #include <linux/blk-cgroup.h> #include <linux/debugfs.h> #include <linux/bpf.h> +#include <linux/psi.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> @@ -1127,6 +1128,10 @@ EXPORT_SYMBOL_GPL(direct_make_request); */ blk_qc_t submit_bio(struct bio *bio) { + bool workingset_read = false; + unsigned long pflags; + blk_qc_t ret; + /* * If it's a regular read/write or a barrier with data attached, * go through the normal accounting stuff before submission. @@ -1142,6 +1147,8 @@ blk_qc_t submit_bio(struct bio *bio) if (op_is_write(bio_op(bio))) { count_vm_events(PGPGOUT, count); } else { + if (bio_flagged(bio, BIO_WORKINGSET)) + workingset_read = true; task_io_account_read(bio->bi_iter.bi_size); count_vm_events(PGPGIN, count); } @@ -1156,7 +1163,21 @@ blk_qc_t submit_bio(struct bio *bio) } } - return generic_make_request(bio); + /* + * If we're reading data that is part of the userspace + * workingset, count submission time as memory stall. When the + * device is congested, or the submitting cgroup IO-throttled, + * submission can be a significant part of overall IO time. 
+ */ + if (workingset_read) + psi_memstall_enter(&pflags); + + ret = generic_make_request(bio); + + if (workingset_read) + psi_memstall_leave(&pflags); + + return ret; } EXPORT_SYMBOL(submit_bio); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 6a53799c3fe2..2f77e3446760 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -209,6 +209,7 @@ enum { BIO_BOUNCED, /* bio is a bounce bio */ BIO_USER_MAPPED, /* contains user pages */ BIO_NULL_MAPPED, /* contains invalid user pages */ + BIO_WORKINGSET, /* contains userspace workingset pages */ BIO_QUIET, /* Make BIO Quiet */ BIO_CHAIN, /* chained bio, ->bi_remaining in effect */ BIO_REFFED, /* bio has elevated ->bi_cnt */ -- 2.22.0 ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-08 17:27 ` Johannes Weiner @ 2019-08-09 14:56 ` Vlastimil Babka 2019-08-09 17:31 ` Johannes Weiner 0 siblings, 1 reply; 48+ messages in thread From: Vlastimil Babka @ 2019-08-09 14:56 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 8/8/19 7:27 PM, Johannes Weiner wrote: > On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote: >> On 8/7/19 10:51 PM, Johannes Weiner wrote: >>> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 >>> From: Johannes Weiner <hannes@cmpxchg.org> >>> Date: Mon, 5 Aug 2019 13:15:16 -0400 >>> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing >> >> Thanks a lot, perhaps finally we are going to eat the elephant ;) >> >> I've tested this by booting with mem=8G and activating browser tabs as >> long as I could. Then initially the system started thrashing and didn't >> recover for minutes. Then I realized sysrq+f is disabled... Fixed that >> up after next reboot, tried lower thresholds, also started monitoring >> /proc/pressure/memory, and found out that after minutes of not being >> able to move the cursor, both avg10 and avg60 shows only around 15 for >> both some and full. Lowered thrashing_oom_level to 10 and (with >> thrashing_oom_period of 5) the thrashing OOM finally started kicking, >> and the system recovered by itself in reasonable time. > > It sounds like there is a missing annotation. The time has to be going > somewhere, after all. One *known* missing vector I fixed recently is > stalls in submit_bio() itself when refaulting, but it's not merged > yet. Attaching the patch below, can you please test it? It made a difference, but not enough, it seems. 
Before the patch I could observe "io:full avg10" around 75% and "memory:full avg10" around 20%, after the patch, "memory:full avg10" went to around 45%, while io stayed the same (BTW should the refaults be discounted from the io counters, so that the sum is still <=100%?) As a result I could change the knobs to recover successfully with thrashing detected for 10s of 40% memory pressure. Perhaps being low on memory we can't detect refaults so well due to limited number of shadow entries, or there was genuine non-refault I/O in the mix. The detection would then probably have to look at both I/O and memory? Thanks, Vlastimil ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 14:56 ` Vlastimil Babka @ 2019-08-09 17:31 ` Johannes Weiner 2019-08-13 13:47 ` Vlastimil Babka 0 siblings, 1 reply; 48+ messages in thread From: Johannes Weiner @ 2019-08-09 17:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On Fri, Aug 09, 2019 at 04:56:28PM +0200, Vlastimil Babka wrote: > On 8/8/19 7:27 PM, Johannes Weiner wrote: > > On Thu, Aug 08, 2019 at 04:47:18PM +0200, Vlastimil Babka wrote: > >> On 8/7/19 10:51 PM, Johannes Weiner wrote: > >>> From 9efda85451062dea4ea287a886e515efefeb1545 Mon Sep 17 00:00:00 2001 > >>> From: Johannes Weiner <hannes@cmpxchg.org> > >>> Date: Mon, 5 Aug 2019 13:15:16 -0400 > >>> Subject: [PATCH] psi: trigger the OOM killer on severe thrashing > >> > >> Thanks a lot, perhaps finally we are going to eat the elephant ;) > >> > >> I've tested this by booting with mem=8G and activating browser tabs as > >> long as I could. Then initially the system started thrashing and didn't > >> recover for minutes. Then I realized sysrq+f is disabled... Fixed that > >> up after next reboot, tried lower thresholds, also started monitoring > >> /proc/pressure/memory, and found out that after minutes of not being > >> able to move the cursor, both avg10 and avg60 shows only around 15 for > >> both some and full. Lowered thrashing_oom_level to 10 and (with > >> thrashing_oom_period of 5) the thrashing OOM finally started kicking, > >> and the system recovered by itself in reasonable time. > > > > It sounds like there is a missing annotation. The time has to be going > > somewhere, after all. One *known* missing vector I fixed recently is > > stalls in submit_bio() itself when refaulting, but it's not merged > > yet. Attaching the patch below, can you please test it? > > It made a difference, but not enough, it seems. 
Before the patch I could > observe "io:full avg10" around 75% and "memory:full avg10" around 20%, > after the patch, "memory:full avg10" went to around 45%, while io stayed > the same (BTW should the refaults be discounted from the io counters, so > that the sum is still <=100%?) > > As a result I could change the knobs to recover successfully with > thrashing detected for 10s of 40% memory pressure. > > Perhaps being low on memory we can't detect refaults so well due to > limited number of shadow entries, or there was genuine non-refault I/O > in the mix. The detection would then probably have to look at both I/O > and memory? Thanks for testing it. It's possible that there is legitimate non-refault IO, and there can be interaction of course between that and the refault IO. But to be sure that all genuine refaults are captured, can you record the workingset_* values from /proc/vmstat before/after the thrash storm? In particular, workingset_nodereclaim would indicate whether we are losing refault information. [ The different resource pressures are not meant to be summed up. Refaults truly are both IO events and memory events: they indicate memory contention, but they also contribute to the IO load. So both metrics need to include them, or it would skew the picture when you only look at one of them. ] ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-09 17:31 ` Johannes Weiner @ 2019-08-13 13:47 ` Vlastimil Babka 0 siblings, 0 replies; 48+ messages in thread From: Vlastimil Babka @ 2019-08-13 13:47 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Suren Baghdasaryan, Artem S. Tashkinov, Andrew Morton, LKML, linux-mm On 8/9/19 7:31 PM, Johannes Weiner wrote: >> It made a difference, but not enough, it seems. Before the patch I could >> observe "io:full avg10" around 75% and "memory:full avg10" around 20%, >> after the patch, "memory:full avg10" went to around 45%, while io stayed >> the same (BTW should the refaults be discounted from the io counters, so >> that the sum is still <=100%?) >> >> As a result I could change the knobs to recover successfully with >> thrashing detected for 10s of 40% memory pressure. >> >> Perhaps being low on memory we can't detect refaults so well due to >> limited number of shadow entries, or there was genuine non-refault I/O >> in the mix. The detection would then probably have to look at both I/O >> and memory? > > Thanks for testing it. It's possible that there is legitimate > non-refault IO, and there can be interaction of course between that > and the refault IO. But to be sure that all genuine refaults are > captured, can you record the workingset_* values from /proc/vmstat > before/after the thrash storm? In particular, workingset_nodereclaim > would indicate whether we are losing refault information. Let's see... 
after a ~45 second stall that I ended up by alt-sysrq-f, I see the following pressure info: cpu:some avg10=1.04 avg60=2.22 avg300=2.01 total=147402828 io:some avg10=97.13 avg60=65.48 avg300=28.86 total=240442256 io:full avg10=83.93 avg60=57.05 avg300=24.56 total=212125506 memory:some avg10=54.62 avg60=33.69 avg300=15.89 total=67989547 memory:full avg10=44.48 avg60=28.17 avg300=13.17 total=55963961 Captured vmstat workingset values before: workingset_nodes 15756 workingset_refault 6111959 workingset_activate 1805063 workingset_restore 919138 workingset_nodereclaim 40796 pgpgin 33889644 after: workingset_nodes 14842 workingset_refault 9248248 workingset_activate 1966317 workingset_restore 961179 workingset_nodereclaim 41060 pgpgin 46488352 Doesn't seem like losing too much refault info, and it's indeed a mix of refaults and other I/O? (difference is 3M for refaults and 12.5M for pgpgin). > [ The different resource pressures are not meant to be summed > up. Refaults truly are both IO events and memory events: they > indicate memory contention, but they also contribute to the IO > load. So both metrics need to include them, or it would skew the > picture when you only look at one of them. ] Understood, makes sense. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-06 1:08 ` Suren Baghdasaryan 2019-08-06 9:36 ` Vlastimil Babka @ 2019-08-06 21:43 ` James Courtier-Dutton 1 sibling, 0 replies; 48+ messages in thread From: James Courtier-Dutton @ 2019-08-06 21:43 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Johannes Weiner, Vlastimil Babka, Artem S. Tashkinov, LKML, linux-mm, Michal Hocko On Tue, 6 Aug 2019 at 02:09, Suren Baghdasaryan <surenb@google.com> wrote: > > 80% of the last 10 seconds spent in full stall would definitely be a > problem. If the system was already low on memory (which it probably > is, or we would not be reclaiming so hard and registering such a big > stall) then oom-killer would probably kill something before 8 seconds > are passed. There are other things to consider also. I can reproduce these types of symptoms and memory pressure is 100% NOT the cause. (top showing 4GB of a 16GB system in use) The cause as I see it is disk pressure and the lack of multiple queues for disk IO requests. For example, one process can hog 100% of the disk, without other applications even being able to write just one sector. We need a way for the Linux kernel to better multiplex access to the disk, adding QoS and allowing interactive processes to interrupt long background disk IO tasks. If we could balance disk access across each active process, the user, on their desktop, would think the system was more responsive. Kind Regards James ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-04 9:23 Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Artem S. Tashkinov 2019-08-05 12:13 ` Vlastimil Babka @ 2019-08-06 19:00 ` Florian Weimer 2019-08-20 6:46 ` Daniel Drake 2 siblings, 0 replies; 48+ messages in thread From: Florian Weimer @ 2019-08-06 19:00 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, linux-mm * Artem S. Tashkinov: > There's this bug which has been bugging many people for many years > already and which is reproducible in less than a few minutes under the > latest and greatest kernel, 5.2.6. All the kernel parameters are set to > defaults. > > Steps to reproduce: > > 1) Boot with mem=4G > 2) Disable swap to make everything faster (sudo swapoff -a) > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox > 4) Start opening tabs in either of them and watch your free RAM decrease Do you see this with Intel graphics? I think these drivers still use the GEM shrinker, which effectively bypasses most kernel memory management heuristics. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-04 9:23 Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Artem S. Tashkinov 2019-08-05 12:13 ` Vlastimil Babka 2019-08-06 19:00 ` Florian Weimer @ 2019-08-20 6:46 ` Daniel Drake 2019-08-21 21:42 ` James Courtier-Dutton 2019-08-23 1:54 ` ndrw 2 siblings, 2 replies; 48+ messages in thread From: Daniel Drake @ 2019-08-20 6:46 UTC (permalink / raw) To: aros; +Cc: linux-kernel, linux, hadess, hannes Hi, Artem S. Tashkinov wrote: > Once you hit a situation when opening a new tab requires more RAM than > is currently available, the system will stall hard. You will barely be > able to move the mouse pointer. Your disk LED will be flashing > incessantly (I'm not entirely sure why). You will not be able to run new > applications or close currently running ones. > > This little crisis may continue for minutes or even longer. I think > that's not how the system should behave in this situation. I believe > something must be done about that to avoid this stall. Thanks for reviving this discussion. Indeed, this is a real pain point in the Linux experience. For Endless, we sunk some time into this and emerged with psi being the best solution we could find. The way it works on a time basis seems very appropriate when what we're ultimately interested in is maintaining desktop UI interactivity. With psi enabled in the kernel, we add a small userspace daemon to kill a process when psi reports that *all* userspace tasks are being blocked on kernel memory management work for (at least) 1 second in a 10 second period. https://github.com/endlessm/eos-boot-helper/blob/master/psi-monitor/psi-monitor.c To share our results so far, despite this daemon being a quick initial implementation, we find that it is bringing excellent results, no more memory pressure hangs. 
The system recovers in less than 30 seconds, usually in more like 10-15 seconds. Sadly a process got killed along the way, but that's a lot better than the user having no option other than pulling the plug. The system may not always recover to a totally smooth state, but the responsiveness to mouse movements and clicks is still decent, so at that point the user can close some more windows to restore full UI performance again. There's just one issue we've seen so far: a single report of psi reporting memory pressure on a desktop system with 4GB RAM which is only running the normal desktop components plus a single gmail tab in the web browser. psi occasionally reports high memory pressure, so then psi-monitor steps in and kills the browser tab, which seems erroneous. We haven't had a chance to look at this in detail yet. Here's a log from the kernel OOM killer showing the memory and process state at this point. https://gist.github.com/dsd/b338bab0206dcce78263f6bb87de7d4a > I'm almost sure some sysctl parameters could be changed to avoid this > situation but something tells me this could be done for everyone and > made default because some non tech-savvy users will just give up on > Linux if they ever get in a situation like this and they won't be keen > or even be able to Google for solutions. As you anticipated, myself and others already jumped in with solutions appropriate for tech-savvy users. Getting solutions widely deployed is indeed another important aspect to tackle. If you're curious to see how this can look from a "just works" standpoint, you might be interested in downloading Endless (www.endlessos.com) and running your tests there; we have the above solution running and active out of the box. Bastien Nocera has recently adapted and extended our solution, presumably with an eye towards getting this more widely deployed as a standard part of the Linux desktop. 
https://gitlab.freedesktop.org/hadess/low-memory-monitor/ And if there is a meaningful way to make the kernel behave better, that would obviously be of huge value too. Thanks Daniel ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-20 6:46 ` Daniel Drake @ 2019-08-21 21:42 ` James Courtier-Dutton 2019-08-29 12:29 ` Michal Hocko 2019-09-02 20:15 ` Pavel Machek 1 sibling, 2 replies; 48+ messages in thread From: James Courtier-Dutton @ 2019-08-21 21:42 UTC (permalink / raw) To: Daniel Drake Cc: Artem S. Tashkinov, LKML Mailing List, linux, hadess, Johannes Weiner On Tue, 20 Aug 2019 at 07:47, Daniel Drake <drake@endlessm.com> wrote: > > Hi, > > And if there is a meaningful way to make the kernel behave better, that would > obviously be of huge value too. > > Thanks > Daniel Hi, Is there a way for an application to be told that there is a memory pressure situation? For example, say I do a "make -j32" and halfway through the compile it hits a memory pressure situation. If make could be told about it, it could give up on some of the parallel compiles and instead proceed as if the user had typed "make -j4", then retry the compile jobs that failed due to memory pressure. I know all applications won't be this clever, but a kernel API would let an application do something about it, if its programmer has thought about it. It could be similar with, say, a Hadoop application. If the Hadoop process detects memory pressure, it could back off, process the data more slowly, and not try to do so much at the same time. The kernel could also detect which processes are contributing most to the memory pressure (trying to do mallocs) and give them less processor time, and instead ask all processes to release some memory; those processes that understood the kernel API for that notification could actually do something about it. I have also found, for the desktop, that one of the biggest pressures on the system is disk pressure. Accessing the disk causes the UI to be less responsive. 
For example, if I am in vim, and trying to type letters on the keyboard, whether some other application is using the disk or not should have no impact on my letter writing. Has anyone got any ideas with regards to what we can do about that? Kind Regards James ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-21 21:42 ` James Courtier-Dutton @ 2019-08-29 12:29 ` Michal Hocko 2019-09-02 20:15 ` Pavel Machek 1 sibling, 0 replies; 48+ messages in thread From: Michal Hocko @ 2019-08-29 12:29 UTC (permalink / raw) To: James Courtier-Dutton Cc: Daniel Drake, Artem S. Tashkinov, LKML Mailing List, linux, hadess, Johannes Weiner On Wed 21-08-19 22:42:29, James Courtier-Dutton wrote: > On Tue, 20 Aug 2019 at 07:47, Daniel Drake <drake@endlessm.com> wrote: > > > > Hi, > > > > And if there is a meaningful way to make the kernel behave better, that would > > obviously be of huge value too. > > > > Thanks > > Daniel > > Hi, > > Is there a way for an application to be told that there is a memory > pressure situation? PSI (CONFIG_PSI) measures global as well as per memcg pressure characteristics. [...] > I have also found, for the desktop, one of the biggest pressures on > the system is disk pressure. Accessing the disk causes the UI to be > less responsive. > For example, if I am in vim, and trying to type letters on the > keyboard, whether some other application is using the disk or not > should have no impact on my letter writing. Has anyone got any ideas > with regards to what we can do about that? This is what we have the page cache for. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-21 21:42 ` James Courtier-Dutton 2019-08-29 12:29 ` Michal Hocko @ 2019-09-02 20:15 ` Pavel Machek 1 sibling, 0 replies; 48+ messages in thread From: Pavel Machek @ 2019-09-02 20:15 UTC (permalink / raw) To: James Courtier-Dutton Cc: Daniel Drake, Artem S. Tashkinov, LKML Mailing List, linux, hadess, Johannes Weiner Hi! > > > > And if there is a meaningful way to make the kernel behave better, that would > > obviously be of huge value too. > > > > Thanks > > Daniel > > Hi, > > Is there a way for an application to be told that there is a memory > pressure situation? > For example, say I do a "make -j32" and halfway through the compile > it hits a memory pressure situation. > If make could be told about it, it could give up on some of the > parallel compiles and instead proceed as if the user had typed "make > -j4", then retry the compile jobs that failed due > to memory pressure. > I know all applications won't be this clever, but a kernel > API would let an application do something about it, if its > programmer has thought about it. Support is not really needed in many applications. It would be nice to have for make and web browsers... And I suspect it is easy to do once the interface becomes available. Best regards, Pavel ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-20 6:46 ` Daniel Drake 2019-08-21 21:42 ` James Courtier-Dutton @ 2019-08-23 1:54 ` ndrw 2019-08-23 2:14 ` Daniel Drake 1 sibling, 1 reply; 48+ messages in thread From: ndrw @ 2019-08-23 1:54 UTC (permalink / raw) To: Daniel Drake, aros; +Cc: linux-kernel, linux, hadess, hannes On 20/08/2019 07:46, Daniel Drake wrote: > To share our results so far, despite this daemon being a quick initial > implementation, we find that it is bringing excellent results, no more memory > pressure hangs. The system recovers in less than 30 seconds, usually in more > like 10-15 seconds. That's obviously a lot better than hard freezes but I wouldn't call such system lock-ups an excellent result. PSI-triggered OOM killer would have indeed been very useful as an emergency brake, and IMHO such mechanism should be built in the kernel and enabled by default. But in my experience it does a very poor job at detecting imminent freezes on systems without swap or with very fast swap (zram). So far, watching MemAvailable (like earlyoom does) is far more reliable and accurate. Unfortunately, there just doesn't seem to be a kernel feature that would reserve a user-defined amount of memory for caches. > There's just one issue we've seen so far: a single report of psi reporting > memory pressure on a desktop system with 4GB RAM which is only running > the normal desktop components plus a single gmail tab in the web browser. > psi occasionally reports high memory pressure, so then psi-monitor steps in and > kills the browser tab, which seems erroneous. Is it Chrome/Chromium? If so, that's a known bug (https://bugs.chromium.org/p/chromium/issues/detail?id=333617) Best regards, ndrw ^ permalink raw reply [flat|nested] 48+ messages in thread
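The MemAvailable approach ndrw describes (and earlyoom implements) reduces to a simple check against `/proc/meminfo`. A sketch against a captured sample so it stands alone — the 10% floor is illustrative, not earlyoom's exact default:

```shell
sample='MemTotal:        4046448 kB
MemFree:          121604 kB
MemAvailable:     201456 kB'

avail=$(printf '%s\n' "$sample" | awk '/^MemAvailable:/ { print $2 }')
total=$(printf '%s\n' "$sample" | awk '/^MemTotal:/ { print $2 }')
pct=$(( 100 * avail / total ))          # integer percent of RAM still available
echo "available: ${pct}%"
if [ "$pct" -lt 10 ]; then
    echo "below 10% floor: an earlyoom-style daemon would act here"
fi
```

On a live system, feed it `cat /proc/meminfo` instead of the sample. MemAvailable is the kernel's own estimate of memory usable without swapping, which is why it tracks imminent thrashing better than MemFree.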
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure 2019-08-23 1:54 ` ndrw @ 2019-08-23 2:14 ` Daniel Drake 0 siblings, 0 replies; 48+ messages in thread From: Daniel Drake @ 2019-08-23 2:14 UTC (permalink / raw) To: ndrw Cc: aros, Linux Kernel, Linux Upstreaming Team, Bastien Nocera, Johannes Weiner On Fri, Aug 23, 2019 at 9:54 AM ndrw <ndrw.xf@redhazel.co.uk> wrote: > That's obviously a lot better than hard freezes but I wouldn't call such > system lock-ups an excellent result. PSI-triggered OOM killer would have > indeed been very useful as an emergency brake, and IMHO such mechanism > should be built in the kernel and enabled by default. But in my > experience it does a very poor job at detecting imminent freezes on > systems without swap or with very fast swap (zram). Perhaps you could share your precise test environment and the PSI condition you are expecting to hit (that is not being hit). Except for the single failure report mentioned, it's been working fine here in all setups, including with zram which is shipped out of the box. The nice thing about psi is that it's based on how much real-world time the kernel is spending doing memory management. So it's very well poised to handle differences in swap speed etc. You effectively just set the threshold for how much time you view as excessive for the kernel to be busy doing MM, and psi tells you when that's hit. > > There's just one issue we've seen so far: a single report of psi reporting > > memory pressure on a desktop system with 4GB RAM which is only running > > the normal desktop components plus a single gmail tab in the web browser. > > psi occasionally reports high memory pressure, so then psi-monitor steps in and > > kills the browser tab, which seems erroneous. > > Is it Chrome/Chromium? If so, that's a known bug > (https://bugs.chromium.org/p/chromium/issues/detail?id=333617) The issue does not concern which process is being killed. 
The issue is that in the single report we have of this, psi is apparently reporting high memory pressure even though the system has plenty of free memory. Daniel ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20190805090514.5992-1-hdanton@sina.com>]
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure [not found] <20190805090514.5992-1-hdanton@sina.com> @ 2019-08-05 12:01 ` Artem S. Tashkinov 0 siblings, 0 replies; 48+ messages in thread From: Artem S. Tashkinov @ 2019-08-05 12:01 UTC (permalink / raw) To: Hillf Danton; +Cc: linux-kernel, linux-mm On 8/5/19 9:05 AM, Hillf Danton wrote: > > On Sun, 4 Aug 2019 09:23:17 +0000 "Artem S. Tashkinov" <aros@gmx.com> wrote: >> Hello, >> >> There's this bug which has been bugging many people for many years >> already and which is reproducible in less than a few minutes under the >> latest and greatest kernel, 5.2.6. All the kernel parameters are set to >> defaults. > > Thanks for report! >> >> Steps to reproduce: >> >> 1) Boot with mem=4G >> 2) Disable swap to make everything faster (sudo swapoff -a) >> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox >> 4) Start opening tabs in either of them and watch your free RAM decrease > > We saw another corner-case cpu hog report under memory pressure also > with swap disabled. In that report the xfs filesystem was an factor > with CONFIG_MEMCG enabled. Anything special, say like > > kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193] > or > [ 3225.313209] Xorg: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 > > in your kernel log? I'm running ext4 only without LVM, encryption or anything like that. Plain GPT/MBR partitions with plenty of free space and no disk errors. >> >> Once you hit a situation when opening a new tab requires more RAM than >> is currently available, the system will stall hard. You will barely be >> able to move the mouse pointer. Your disk LED will be flashing >> incessantly (I'm not entirely sure why). You will not be able to run new >> applications or close currently running ones. 
> > A cpu hog may come on top of memory hog in some scenario. It might have happened as well - I couldn't know since I wasn't able to open a terminal. Once the system recovered there was no trace of anything extraordinary. >> >> This little crisis may continue for minutes or even longer. I think >> that's not how the system should behave in this situation. I believe >> something must be done about that to avoid this stall. > > Yes, Sir. >> >> I'm almost sure some sysctl parameters could be changed to avoid this >> situation but something tells me this could be done for everyone and >> made default because some non tech-savvy users will just give up on >> Linux if they ever get in a situation like this and they won't be keen >> or even be able to Google for solutions. > > I am not willing to repeat that it is hard to produce a pill for all > patients, but the info you post will help solve the crisis sooner. > > Hillf > In case you have troubles reproducing this bug report I can publish a VM image - still everything is quite mundane: Fedora 30 + XFCE + web browser. Nothing else, nothing fancy. Regards, Artem ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
@ 2019-08-06 8:57 Johannes Buchner
0 siblings, 0 replies; 48+ messages in thread
From: Johannes Buchner @ 2019-08-06 8:57 UTC (permalink / raw)
To: linux-kernel
> On Mon, Aug 5, 2019 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
>> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
>> > > Hello,
>> > >
>> > > There's this bug which has been bugging many people for many years
>> > > already and which is reproducible in less than a few minutes under the
>> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
>> > > defaults.
>> > >
>> > > Steps to reproduce:
>> > >
>> > > 1) Boot with mem=4G
>> > > 2) Disable swap to make everything faster (sudo swapoff -a)
>> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
>> > > 4) Start opening tabs in either of them and watch your free RAM decrease
>> > >
>> > > Once you hit a situation when opening a new tab requires more RAM than
>> > > is currently available, the system will stall hard. You will barely be
>> > > able to move the mouse pointer. Your disk LED will be flashing
>> > > incessantly (I'm not entirely sure why). You will not be able to run new
>> > > applications or close currently running ones.
>> >
>> > > This little crisis may continue for minutes or even longer. I think
>> > > that's not how the system should behave in this situation. I believe
>> > > something must be done about that to avoid this stall.
>> >
>> > Yeah that's a known problem, made worse SSD's in fact, as they are able
>> > to keep refaulting the last remaining file pages fast enough, so there
>> > is still apparent progress in reclaim and OOM doesn't kick in.
>> >
>> > At this point, the likely solution will be probably based on pressure
>> > stall monitoring (PSI). I don't know how far we are from a built-in
>> > monitor with reasonable defaults for a desktop workload, so CCing
>> > relevant folks.
>>
>> Yes, psi was specifically developed to address this problem. Before
>> it, the kernel had to make all decisions based on relative event rates
>> but had no notion of time. Whereas to the user, time is clearly an
>> issue, and in fact makes all the difference. So psi quantifies the
>> time the workload spends executing vs. spinning its wheels.
>>
>> But choosing a universal cutoff for killing is not possible, since it
>> depends on the workload and the user's expectation: GUI and other
>> latency-sensitive applications care way before a compile job or video
>> encoding would care.
>>
>> Because of that, there are things like oomd and lmkd as mentioned, to
>> leave the exact policy decision to userspace.
>>
>> That being said, I think we should be able to provide a bare minimum
>> inside the kernel to avoid complete livelocks where the user does not
>> believe the machine would be able to recover without a reboot.
>>
>> The goal wouldn't be a glitch-free user experience - the kernel does
>> not know enough about the applications to even attempt that. It should
>> just not hang indefinitely. Maybe similar to the hung task detector.
>>
>> How about something like the below patch? With that, the kernel
>> catches excessive thrashing that happens before reclaim fails:
>>
>> [snip]
>>
>> +
>> +#define OOM_PRESSURE_LEVEL 80
>> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC)
>
> 80% of the last 10 seconds spent in full stall would definitely be a
> problem. If the system was already low on memory (which it probably
> is, or we would not be reclaiming so hard and registering such a big
> stall) then oom-killer would probably kill something before 8 seconds
> are passed. If my line of thinking is correct, then do we really
> benefit from such additional protection mechanism? I might be wrong
> here because my experience is limited to embedded systems with
> relatively small amounts of memory.
When one or more processes fight for memory, much of the time is spent
stalling. Would an acceptable alternative strategy be, instead of
killing a process, to stop processes in proportion to their stall time
and memory usage? By stop I mean delaying their scheduling (akin to kill
-STOP/sleep/kill -CONT), or interleaving the scheduling of
large-memory-using processes so they do not have to fight against each
other.
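The hold-instead-of-kill idea above can be sketched from userspace with plain signals. This demo pauses a throwaway child process so it is safe to run; selecting a real victim (e.g. the largest-RSS process via `ps -eo pid,rss --sort=-rss`) is left as a comment, since stopping the wrong process — say, the display server — can wedge a desktop:

```shell
# Real victim selection would look something like:
#   victim=$(ps -eo pid,rss --sort=-rss --no-headers | awk 'NR==1 { print $1 }')
sleep 60 &
victim=$!

kill -STOP "$victim"            # freeze it: no CPU use, no new allocations
sleep 1                         # give the kernel a moment to deliver the stop
state=$(ps -o stat= -p "$victim" | tr -d ' \n' | cut -c1)
echo "state while held: $state" # 'T' = stopped

kill -CONT "$victim"            # release it once pressure subsides
kill "$victim" 2>/dev/null      # clean up the demo child
```

Note this only pauses execution; the stopped process's resident pages are not reclaimed unless the kernel swaps or drops them, so on a swapless system the benefit is limited to halting further allocation.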
Cheers,
Johannes
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure @ 2019-08-06 19:43 Remi Gauvin 0 siblings, 0 replies; 48+ messages in thread From: Remi Gauvin @ 2019-08-06 19:43 UTC (permalink / raw) To: Linux Kernel Mailing List Sorry, I don't have the original message to reply to.. But to those interested, I have found a solution to the kernel's complete inability to allocate more memory when it needs to swap out. Increase the /proc/sys/vm/watermark_scale_factor from the default 10 to 500 It will make a huge difference, especially with swap on SSD, the kernel will swap out gracefully to allocate more memory, and you can get a few GB more memory in use before really noticing performance problems. ^ permalink raw reply [flat|nested] 48+ messages in thread
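Remi's tunable can be applied as follows. The value is in units of 1/10000 of RAM, so 500 widens the gap at which kswapd starts reclaiming to roughly 5% of memory, versus ~0.1% with the default of 10 (root required; the sysctl.d filename below is an arbitrary choice):

```shell
# Apply at runtime:
sysctl vm.watermark_scale_factor=500
# equivalently: echo 500 > /proc/sys/vm/watermark_scale_factor

# Persist across reboots:
echo 'vm.watermark_scale_factor = 500' >> /etc/sysctl.d/99-watermark.conf
```

The trade-off is that a larger gap keeps more memory free for bursts, effectively reserving a slice of RAM that caches and anonymous pages cannot fill.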
end of thread, other threads:[~2021-07-24 17:38 UTC | newest] Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-04 9:23 Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Artem S. Tashkinov 2019-08-05 12:13 ` Vlastimil Babka 2019-08-05 13:31 ` Michal Hocko 2019-08-05 16:47 ` Suren Baghdasaryan 2019-08-05 18:55 ` Johannes Weiner 2019-08-06 9:29 ` Michal Hocko 2019-08-05 19:31 ` Johannes Weiner 2019-08-06 1:08 ` Suren Baghdasaryan 2019-08-06 9:36 ` Vlastimil Babka 2019-08-06 14:27 ` Johannes Weiner 2019-08-06 14:36 ` Michal Hocko 2019-08-06 16:27 ` Suren Baghdasaryan 2019-08-06 22:01 ` Johannes Weiner 2019-08-07 7:59 ` Michal Hocko 2019-08-07 20:51 ` Johannes Weiner 2019-08-07 21:01 ` Andrew Morton 2019-08-07 21:34 ` Johannes Weiner 2019-08-07 21:12 ` Johannes Weiner 2019-08-08 11:48 ` Michal Hocko 2019-08-08 15:10 ` ndrw.xf 2019-08-08 16:32 ` Michal Hocko 2019-08-08 17:57 ` ndrw.xf 2019-08-08 18:59 ` Michal Hocko 2019-08-08 21:59 ` ndrw 2019-08-09 8:57 ` Michal Hocko 2019-08-09 10:09 ` ndrw 2019-08-09 10:50 ` Michal Hocko 2019-08-09 14:18 ` Pintu Agarwal 2019-08-10 12:34 ` ndrw 2019-08-12 8:24 ` Michal Hocko 2019-08-10 21:07 ` ndrw 2021-07-24 17:32 ` Alexey Avramov 2019-08-08 14:47 ` Vlastimil Babka 2019-08-08 17:27 ` Johannes Weiner 2019-08-09 14:56 ` Vlastimil Babka 2019-08-09 17:31 ` Johannes Weiner 2019-08-13 13:47 ` Vlastimil Babka 2019-08-06 21:43 ` James Courtier-Dutton 2019-08-06 19:00 ` Florian Weimer 2019-08-20 6:46 ` Daniel Drake 2019-08-21 21:42 ` James Courtier-Dutton 2019-08-29 12:29 ` Michal Hocko 2019-09-02 20:15 ` Pavel Machek 2019-08-23 1:54 ` ndrw 2019-08-23 2:14 ` Daniel Drake [not found] <20190805090514.5992-1-hdanton@sina.com> 2019-08-05 12:01 ` Artem S. Tashkinov 2019-08-06 8:57 Johannes Buchner 2019-08-06 19:43 Remi Gauvin