* [LSF/MM TOPIC] Congestion @ 2019-12-31 12:59 Matthew Wilcox 2020-01-04 9:09 ` Dave Chinner 2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko 0 siblings, 2 replies; 14+ messages in thread From: Matthew Wilcox @ 2019-12-31 12:59 UTC (permalink / raw) To: lsf-pc, linux-fsdevel, linux-mm I don't want to present this topic; I merely noticed the problem. I nominate Jens Axboe and Michal Hocko as session leaders. See the thread here: https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ Summary: Congestion is broken and has been for years, and everybody's system is sleeping waiting for congestion that will never clear. A good outcome for this meeting would be: - MM defines what information they want from the block stack. - Block stack commits to giving them that information. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [LSF/MM TOPIC] Congestion 2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox @ 2020-01-04 9:09 ` Dave Chinner 2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2020-01-04 9:09 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm On Tue, Dec 31, 2019 at 04:59:08AM -0800, Matthew Wilcox wrote: > > I don't want to present this topic; I merely noticed the problem. > I nominate Jens Axboe and Michael Hocko as session leaders. See the > thread here: > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > Summary: Congestion is broken and has been for years, and everybody's > system is sleeping waiting for congestion that will never clear. Another symptom: system does not sleep because there is no recorded congestion so it doesn't back off when it should (the wait_iff_congested() backoff case). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox 2020-01-04 9:09 ` Dave Chinner @ 2020-01-06 11:55 ` Michal Hocko 2020-01-06 23:21 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Michal Hocko @ 2020-01-06 11:55 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > I don't want to present this topic; I merely noticed the problem. > I nominate Jens Axboe and Michael Hocko as session leaders. See the > thread here: Thanks for bringing this up Matthew! The change in the behavior came as a surprise to me. I can lead the session for the MM side. > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > Summary: Congestion is broken and has been for years, and everybody's > system is sleeping waiting for congestion that will never clear. > > A good outcome for this meeting would be: > > - MM defines what information they want from the block stack. The history of the congestion waiting is kinda hairy but I will try to summarize expectations we used to have and we can discuss how much of that has been real and what followed up as a cargo cult. Maybe we just find out that we do not need functionality like that anymore. I believe Mel would be a great contributor to the discussion. > - Block stack commits to giving them that information. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko @ 2020-01-06 23:21 ` Dave Chinner 2020-01-07 8:23 ` Chris Murphy ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Dave Chinner @ 2020-01-06 23:21 UTC (permalink / raw) To: Michal Hocko; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > I don't want to present this topic; I merely noticed the problem. > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > thread here: > > Thanks for bringing this up Matthew! The change in the behavior came as > a surprise to me. I can lead the session for the MM side. > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > Summary: Congestion is broken and has been for years, and everybody's > > system is sleeping waiting for congestion that will never clear. > > > > A good outcome for this meeting would be: > > > > - MM defines what information they want from the block stack. > > The history of the congestion waiting is kinda hairy but I will try to > summarize expectations we used to have and we can discuss how much of > that has been real and what followed up as a cargo cult. Maybe we just > find out that we do not need functionality like that anymore. I believe > Mel would be a great contributor to the discussion. We most definitely do need some form of reclaim throttling based on IO congestion, because it is trivial to drive the system into swap storms and OOM killer invocation when there are large dirty slab caches that require IO to make reclaim progress and there's little in the way of page cache to reclaim. 
This is one of the biggest issues I've come across trying to make XFS inode reclaim non-blocking - the existing code blocks on inode writeback IO congestion to throttle the overall reclaim rate and so prevents swap storms and OOM killer rampages from occurring. The moment I remove the inode writeback blocking from the reclaim path and move the backoffs to the core reclaim congestion backoff algorithms, I see a substantial increase in the typical reclaim scan priority. This is because the reclaim code does not have an integrated back-off mechanism that can balance reclaim throttling between slab cache and page cache reclaim. This results in insufficient page reclaim backoff under slab cache backoff conditions, leading to excessive page cache reclaim and swapping out all the anonymous pages in memory. Then performance goes to hell as userspace starts to block on page faults, swap thrashing like this:

page_fault
  swap_in
    alloc page
      direct reclaim
        swap out anon page
          submit_bio
            wbt_throttle

IOWs, page reclaim doesn't back off until userspace gets throttled in the block layer doing swap out during swap in during page faults. For these sorts of workloads there should be little to no swap thrashing occurring - throttling reclaim to the rate at which inodes are cleaned by async IO dispatcher threads is what is needed here, not continuing to wind up reclaim priority until swap storms and the oom killer end up killing the machine... I also see this when the inode cache load is on a separate device to the swap partition - both devices end up at 100% utilisation, one doing inode writeback flat out (about 300,000 inodes/sec from an inode cache of 5-10 million inodes), the other is swap thrashing from a page cache of only 250-500 pages in size. Hence the way congestion was historically dealt with as a "global condition" still needs to exist in some manner - congestion on a single device is sufficient to cause the high level reclaim algorithms to misbehave badly...
Hence it seems to me that having IO load feedback to the memory reclaim algorithms is most definitely required for memory reclaim to be able to make the correct decisions about what to reclaim. If the shrinker for the cache that uses 50% of RAM in the machine is saying "backoff needed" and its underlying device is congested and limiting object reclaim rates, then it's a pretty good indication that reclaim should back off and wait for IO progress to be made instead of trying to reclaim from other LRUs that hold an insignificant amount of memory compared to the huge cache that is backed up waiting on IO completion to make progress.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
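Dave's priority wind-up can be illustrated with a toy model of the direct-reclaim priority loop (plain user-space Python with made-up numbers, not kernel code - the shift-by-priority scan scaling mirrors the mm/vmscan.c convention, everything else is a simplifying assumption): a shrinker that blocks on IO congestion stops the loop at low priority, while a shrinker that merely says "back off" and returns lets the loop keep raising the priority and scanning the page LRUs harder.

```python
DEF_PRIORITY = 12  # same default as mm/vmscan.c

def direct_reclaim(lru_pages, slab_blocks):
    """Return (pages_scanned, final_priority) for one reclaim attempt.

    slab_blocks=True models today's XFS shrinker: it blocks on inode
    writeback congestion, so reclaim makes progress while sleeping and
    the loop stops winding up.  slab_blocks=False models a non-blocking
    shrinker with no integrated backoff: the loop just retries at ever
    higher priority.
    """
    scanned = 0
    for priority in range(DEF_PRIORITY, -1, -1):
        # Page LRUs are scanned at roughly lru_pages >> priority per pass.
        scanned += lru_pages >> priority
        if slab_blocks:
            return scanned, priority  # throttled; no further wind-up
    return scanned, 0                 # wound all the way up to priority 0

blocking_scanned, _ = direct_reclaim(1 << 20, slab_blocks=True)
winding_scanned, _ = direct_reclaim(1 << 20, slab_blocks=False)
# Winding up to priority 0 scans the LRUs thousands of times more pages
# here - the excessive page cache reclaim / swap-out Dave describes.
```

In this model the non-blocking case scans ~8000x more LRU pages than the blocking case for the same memory demand, which is the qualitative effect being described, not a measured ratio.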
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-06 23:21 ` Dave Chinner @ 2020-01-07 8:23 ` Chris Murphy 2020-01-07 11:53 ` Michal Hocko 2020-01-07 11:53 ` Michal Hocko 2020-01-09 11:07 ` Jan Kara 2 siblings, 1 reply; 14+ messages in thread From: Chris Murphy @ 2020-01-07 8:23 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm, Mel Gorman On Mon, Jan 6, 2020 at 4:21 PM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > > > I don't want to present this topic; I merely noticed the problem. > > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > > thread here: > > > > Thanks for bringing this up Matthew! The change in the behavior came as > > a surprise to me. I can lead the session for the MM side. > > > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > > > Summary: Congestion is broken and has been for years, and everybody's > > > system is sleeping waiting for congestion that will never clear. > > > > > > A good outcome for this meeting would be: > > > > > > - MM defines what information they want from the block stack. > > > > The history of the congestion waiting is kinda hairy but I will try to > > summarize expectations we used to have and we can discuss how much of > > that has been real and what followed up as a cargo cult. Maybe we just > > find out that we do not need functionality like that anymore. I believe > > Mel would be a great contributor to the discussion. > > We most definitely do need some form of reclaim throttling based on > IO congestion, because it is trivial to drive the system into swap > storms and OOM killer invocation when there are large dirty slab > caches that require IO to make reclaim progress and there's little > in the way of page cache to reclaim. 
> > This is one of the biggest issues I've come across trying to make > XFS inode reclaim non-blocking - the existing code blocks on inode > writeback IO congestion to throttle the overall reclaim rate and > so prevents swap storms and OOM killer rampages from occurring. > > The moment I remove the inode writeback blocking from the reclaim > path and move the backoffs to the core reclaim congestion backoff > algorithms, I see a sustantial increase in the typical reclaim scan > priority. This is because the reclaim code does not have an > integrated back-off mechanism that can balance reclaim throttling > between slab cache and page cache reclaim. This results in > insufficient page reclaim backoff under slab cache backoff > conditions, leading to excessive page cache reclaim and swapping out > all the anonymous pages in memory. Then performance goes to hell as > userspace then starts to block on page faults swap thrashing like > this: This really caught my attention, however unrelated it may actually be. The gist of my question is: what are distributions doing wrong, that it's possible for an unprivileged process to take down a system such that an ordinary user reaches for the power button? [1] More helpful would be, what should distributions be doing better to avoid the problem in the first place? User space oom daemons are now popular, and there's talk about avoiding swap thrashing and oom by strict use of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a swap device at all, what are you crazy? Then there's swap on ZRAM. And alas zswap too. So what's actually recommended to help with this problem? I don't have many original thoughts, but I can't find a reference for why my brain is telling me the kernel oom-killer is mainly concerned about kernel survival in low memory situations, and not user space. 
But an approximation is "It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails." [2] However, a) failure has happened way before oom-killer is invoked, back when the GUI became unresponsive, and b) often it kills some small thing, seemingly freeing up just enough memory that the kernel is happy to stay in this state for an indeterminate time. For my testing that's 30 minutes, but I'm compelled to defend a user who asserts a mere 15-second grace period before reaching for the power button. This isn't a common experience across a broad user population, but those who have experienced it once are really familiar with it (they haven't experienced it only once). And I really want to know what can be done to make the user experience better, but it's not clear to me how to do that. [1] Fedora 30/31 default installation, 8G RAM, 8G swap (on plain SSD partition), and compile webkitgtk. Within ~5 minutes all RAM is consumed, and the "swap storm" begins. The GUI stutters, even the mouse pointer starts to get choppy, and soon after it's pretty much locked up for all practical purposes. Most typically, it stays this way for 30+ minutes. Occasionally oom-killer kicks in and clobbers something. Sometimes it's one of the compile threads. And also occasionally it'll be something absurd like sshd, sssd, or systemd-journald - which really makes no sense at all. [2] https://linux-mm.org/OOM_Killer -- Chris Murphy ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-07 8:23 ` Chris Murphy @ 2020-01-07 11:53 ` Michal Hocko 2020-01-07 20:12 ` Chris Murphy 0 siblings, 1 reply; 14+ messages in thread From: Michal Hocko @ 2020-01-07 11:53 UTC (permalink / raw) To: Chris Murphy Cc: Dave Chinner, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm, Mel Gorman On Tue 07-01-20 01:23:38, Chris Murphy wrote: > On Mon, Jan 6, 2020 at 4:21 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > > > > > I don't want to present this topic; I merely noticed the problem. > > > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > > > thread here: > > > > > > Thanks for bringing this up Matthew! The change in the behavior came as > > > a surprise to me. I can lead the session for the MM side. > > > > > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > > > > > Summary: Congestion is broken and has been for years, and everybody's > > > > system is sleeping waiting for congestion that will never clear. > > > > > > > > A good outcome for this meeting would be: > > > > > > > > - MM defines what information they want from the block stack. > > > > > > The history of the congestion waiting is kinda hairy but I will try to > > > summarize expectations we used to have and we can discuss how much of > > > that has been real and what followed up as a cargo cult. Maybe we just > > > find out that we do not need functionality like that anymore. I believe > > > Mel would be a great contributor to the discussion. > > > > We most definitely do need some form of reclaim throttling based on > > IO congestion, because it is trivial to drive the system into swap > > storms and OOM killer invocation when there are large dirty slab > > caches that require IO to make reclaim progress and there's little > > in the way of page cache to reclaim. 
> > > > This is one of the biggest issues I've come across trying to make > > XFS inode reclaim non-blocking - the existing code blocks on inode > > writeback IO congestion to throttle the overall reclaim rate and > > so prevents swap storms and OOM killer rampages from occurring. > > > > The moment I remove the inode writeback blocking from the reclaim > > path and move the backoffs to the core reclaim congestion backoff > > algorithms, I see a substantial increase in the typical reclaim scan > > priority. This is because the reclaim code does not have an > > integrated back-off mechanism that can balance reclaim throttling > > between slab cache and page cache reclaim. This results in > > insufficient page reclaim backoff under slab cache backoff > > conditions, leading to excessive page cache reclaim and swapping out > > all the anonymous pages in memory. Then performance goes to hell as > > userspace starts to block on page faults, swap thrashing like > > this: > > This really caught my attention, however unrelated it may actually be. > The gist of my question is: what are distributions doing wrong, that > it's possible for an unprivileged process to take down a system such > that an ordinary user reaches for the power button? [1] Well, a free ticket to all the available memory is the key here I believe. Memory cgroups can be of great help in reducing the amount of memory available to untrusted users. I am not sure whether that would help your example in the footnote though. It seems your workload is reaching a thrashing state. It would be interesting to get some more data to see whether that is a result of real memory demand or of memory reclaim misbehavior (it would be great to collect /proc/vmstat data while the system is behaving like that and report it in a separate email thread). > More helpful > would be, what should distributions be doing better to avoid the > problem in the first place?
> User space oom daemons are now popular, > and there's talk about avoiding swap thrashing and oom by strict use > of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a > swap device at all, what are you crazy? Then there's swap on ZRAM. And > alas zswap too. So what's actually recommended to help with this > problem? I believe this will be workload specific and it is always appreciated to report the behavior as mentioned above. > I don't have many original thoughts, but I can't find a reference for > why my brain is telling me the kernel oom-killer is mainly concerned > about kernel survival in low memory situations, and not user space. This is indeed the case. It is a last-resort measure to survive memory depletion. Unfortunately the oom detection doesn't cope well with thrashing scenarios where the memory is still reasonably easy to reclaim while userspace cannot make much progress because it is constantly refaulting its working set. PSI has been a great step forward for those workloads. We haven't found a good way to integrate that information into the oom detection yet, unfortunately, because an acceptable level of refaulting is very workload dependent. [...] > [1] > Fedora 30/31 default installation, 8G RAM, 8G swap (on plain SSD > partition), and compile webkitgtk. Within ~5 minutes all RAM is > consumed, and the "swap storm" begins. The GUI stutters, even the mouse > pointer starts to get choppy, and soon after it's pretty much locked > up for all practical purposes. Most typically, it > stays this way for 30+ minutes. Occasionally oom-killer kicks in and > clobbers something. Sometimes it's one of the compile threads. And > also occasionally it'll be something absurd like sshd, sssd, or > systemd-journald - which really makes no sense at all. > > [2] > https://linux-mm.org/OOM_Killer > > -- > Chris Murphy -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
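For reference, the cgroup confinement Michal suggests can be sketched roughly as follows (hypothetical group name and limit values - this assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup, as systemd sets up by default, and root privileges; the numbers would need tuning per machine):

```shell
# Put the build in its own cgroup so a runaway compile is throttled
# and OOM-killed inside the group instead of taking the desktop down.
mkdir /sys/fs/cgroup/build
echo "6G"   > /sys/fs/cgroup/build/memory.high      # throttle above this
echo "7G"   > /sys/fs/cgroup/build/memory.max       # hard cap
echo "512M" > /sys/fs/cgroup/build/memory.swap.max  # bound swap thrashing
echo $$     > /sys/fs/cgroup/build/cgroup.procs     # move this shell in
# ... then start the webkitgtk build from this shell as usual.
```

On a systemd machine the same effect is available without poking cgroupfs directly, e.g. `systemd-run --scope -p MemoryHigh=6G -p MemoryMax=7G <build command>`.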
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-07 11:53 ` Michal Hocko @ 2020-01-07 20:12 ` Chris Murphy 0 siblings, 0 replies; 14+ messages in thread From: Chris Murphy @ 2020-01-07 20:12 UTC (permalink / raw) To: Michal Hocko Cc: Chris Murphy, Matthew Wilcox, lsf-pc, Linux FS Devel, linux-mm, Mel Gorman On Tue, Jan 7, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 07-01-20 01:23:38, Chris Murphy wrote: > > More helpful > > would be, what should distributions be doing better to avoid the > > problem in the first place? User space oom daemons are now popular, > > and there's talk about avoiding swap thrashing and oom by strict use > > of cgroupsv2 and PSI. Some people say, oh yeah duh, just don't make a > > swap device at all, what are you crazy? Then there's swap on ZRAM. And > > alas zswap too. So what's actually recommended to help with this > > problem? > > I believe this will be workload specific and it is always appreciated to > report the behavior as mentioned above. I'll do so in a separate email. But by what mechanism is workload determined or categorized? And how is the system dynamically reconfigured to better handle different workloads? These are general-purpose operating systems; of course a user has different workloads from moment to moment. -- Chris Murphy ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-06 23:21 ` Dave Chinner 2020-01-07 8:23 ` Chris Murphy @ 2020-01-07 11:53 ` Michal Hocko 2020-01-09 11:07 ` Jan Kara 2 siblings, 0 replies; 14+ messages in thread From: Michal Hocko @ 2020-01-07 11:53 UTC (permalink / raw) To: Dave Chinner; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Tue 07-01-20 10:21:00, Dave Chinner wrote: > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > > > I don't want to present this topic; I merely noticed the problem. > > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > > thread here: > > > > Thanks for bringing this up Matthew! The change in the behavior came as > > a surprise to me. I can lead the session for the MM side. > > > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > > > Summary: Congestion is broken and has been for years, and everybody's > > > system is sleeping waiting for congestion that will never clear. > > > > > > A good outcome for this meeting would be: > > > > > > - MM defines what information they want from the block stack. > > > > The history of the congestion waiting is kinda hairy but I will try to > > summarize expectations we used to have and we can discuss how much of > > that has been real and what followed up as a cargo cult. Maybe we just > > find out that we do not need functionality like that anymore. I believe > > Mel would be a great contributor to the discussion. > > We most definitely do need some form of reclaim throttling based on > IO congestion, because it is trivial to drive the system into swap > storms and OOM killer invocation when there are large dirty slab > caches that require IO to make reclaim progress and there's little > in the way of page cache to reclaim. Just to clarify. I do agree that we need some form of throttling. Sorry if my wording was confusing. 
What I meant is that I am not sure whether wait_iff_congested as it is implemented now is the right way. We definitely have to slow/block the reclaim when there is a lot of dirty (meta)data. How to do that is a good topic to discuss. [skipping the rest of the email which has many good points] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-06 23:21 ` Dave Chinner 2020-01-07 8:23 ` Chris Murphy 2020-01-07 11:53 ` Michal Hocko @ 2020-01-09 11:07 ` Jan Kara 2020-01-09 23:00 ` Dave Chinner 2 siblings, 1 reply; 14+ messages in thread From: Jan Kara @ 2020-01-09 11:07 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Tue 07-01-20 10:21:00, Dave Chinner wrote: > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > > > I don't want to present this topic; I merely noticed the problem. > > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > > thread here: > > > > Thanks for bringing this up Matthew! The change in the behavior came as > > a surprise to me. I can lead the session for the MM side. > > > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > > > Summary: Congestion is broken and has been for years, and everybody's > > > system is sleeping waiting for congestion that will never clear. > > > > > > A good outcome for this meeting would be: > > > > > > - MM defines what information they want from the block stack. > > > > The history of the congestion waiting is kinda hairy but I will try to > > summarize expectations we used to have and we can discuss how much of > > that has been real and what followed up as a cargo cult. Maybe we just > > find out that we do not need functionality like that anymore. I believe > > Mel would be a great contributor to the discussion. > > We most definitely do need some form of reclaim throttling based on > IO congestion, because it is trivial to drive the system into swap > storms and OOM killer invocation when there are large dirty slab > caches that require IO to make reclaim progress and there's little > in the way of page cache to reclaim. Agreed, but I guess the question is how do we implement that in a reliable fashion? 
More on that below... > This is one of the biggest issues I've come across trying to make > XFS inode reclaim non-blocking - the existing code blocks on inode > writeback IO congestion to throttle the overall reclaim rate and > so prevents swap storms and OOM killer rampages from occurring. > > The moment I remove the inode writeback blocking from the reclaim > path and move the backoffs to the core reclaim congestion backoff > algorithms, I see a substantial increase in the typical reclaim scan > priority. This is because the reclaim code does not have an > integrated back-off mechanism that can balance reclaim throttling > between slab cache and page cache reclaim. This results in > insufficient page reclaim backoff under slab cache backoff > conditions, leading to excessive page cache reclaim and swapping out > all the anonymous pages in memory. Then performance goes to hell as > userspace starts to block on page faults, swap thrashing like > this:
> page_fault
>   swap_in
>     alloc page
>       direct reclaim
>         swap out anon page
>           submit_bio
>             wbt_throttle
> IOWs, page reclaim doesn't back off until userspace gets throttled > in the block layer doing swap out during swap in during page > faults. For these sorts of workloads there should be little to no > swap thrashing occurring - throttling reclaim to the rate at which > inodes are cleaned by async IO dispatcher threads is what is needed > here, not continuing to wind up reclaim priority until swap storms > and the oom killer end up killing the machine... > > I also see this when the inode cache load is on a separate device to > the swap partition - both devices end up at 100% utilisation, one > doing inode writeback flat out (about 300,000 inodes/sec from an > inode cache of 5-10 million inodes), the other is swap thrashing > from a page cache of only 250-500 pages in size.
> > Hence the way congestion was historically dealt with as a "global > condition" still needs to exist in some manner - congestion on a > single device is sufficient to cause the high level reclaim > algorithms to misbehave badly... > > Hence it seems to me that having IO load feedback to the memory > reclaim algorithms is most definitely required for memory reclaim to > be able to make the correct decisions about what to reclaim. If the > shrinker for the cache that uses 50% of RAM in the machine is saying > "backoff needed" and its underlying device is > congested and limiting object reclaim rates, then it's a pretty good > indication that reclaim should back off and wait for IO progress to > be made instead of trying to reclaim from other LRUs that hold an > insignificant amount of memory compared to the huge cache that is > backed up waiting on IO completion to make progress.... Yes and I think here's the key detail: Reclaim really needs to wait for slab object cleaning to progress so that the slab cache can be shrunk. This is related, but not always in a straightforward way, to IO progress and even less to IO congestion on some device. I can easily imagine that e.g. cleaning of inodes to reclaim inode slab may not be efficient enough to utilize the full parallelism of fast storage, so the storage will not ever become congested - sure it's an inefficiency that could be fixed but should it misguide reclaim? I don't think so... So I think that to solve this problem in a robust way, we need to provide a mechanism for slab shrinkers to say something like "hang on, I can reclaim X objects you asked for but it will take time, I'll signal to you when they are reclaimable". This way we avoid blocking in the shrinker and can do more efficient async batched reclaim, and on the mm side we have the freedom to either wait for slab reclaim to progress (if this slab is fundamental to memory pressure) or just go try to reclaim something else. Of course, the devil is in the details :).
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 14+ messages in thread
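Jan's "I'll signal to you when they are reclaimable" contract could look something like the following user-space sketch (hypothetical API and names - Python threads stand in for the async IO dispatch; nothing like this exists in mainline):

```python
import threading

class AsyncShrinker:
    """Sketch of the proposed contract: scan() never blocks, it just
    kicks off IO on a batch of objects and later signals completion so
    the mm side can decide whether to wait or reclaim elsewhere."""

    def __init__(self):
        self.reclaimable = threading.Event()
        self.freed = 0

    def scan(self, nr_to_scan):
        # Start async "writeback" of nr_to_scan dirty objects and
        # return immediately instead of blocking on congestion.
        def io_complete():
            self.freed = nr_to_scan      # objects are now clean
            self.reclaimable.set()       # "signal when reclaimable"
        threading.Timer(0.01, io_complete).start()

    def wait_for_progress(self, timeout):
        # The mm side's choice: wait (this slab dominates memory
        # pressure) or time out quickly and go try another cache.
        return self.reclaimable.wait(timeout)

shrinker = AsyncShrinker()
shrinker.scan(128)
made_progress = shrinker.wait_for_progress(timeout=2.0)
```

The point of the split is exactly the freedom Jan describes: the caller, not the shrinker, decides whether blocking on this cache is worth it.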
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-09 11:07 ` Jan Kara @ 2020-01-09 23:00 ` Dave Chinner 2020-02-05 16:05 ` Mel Gorman 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2020-01-09 23:00 UTC (permalink / raw) To: Jan Kara Cc: Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Thu, Jan 09, 2020 at 12:07:51PM +0100, Jan Kara wrote: > On Tue 07-01-20 10:21:00, Dave Chinner wrote: > > On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote: > > > On Tue 31-12-19 04:59:08, Matthew Wilcox wrote: > > > > > > > > I don't want to present this topic; I merely noticed the problem. > > > > I nominate Jens Axboe and Michael Hocko as session leaders. See the > > > > thread here: > > > > > > Thanks for bringing this up Matthew! The change in the behavior came as > > > a surprise to me. I can lead the session for the MM side. > > > > > > > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/ > > > > > > > > Summary: Congestion is broken and has been for years, and everybody's > > > > system is sleeping waiting for congestion that will never clear. > > > > > > > > A good outcome for this meeting would be: > > > > > > > > - MM defines what information they want from the block stack. > > > > > > The history of the congestion waiting is kinda hairy but I will try to > > > summarize expectations we used to have and we can discuss how much of > > > that has been real and what followed up as a cargo cult. Maybe we just > > > find out that we do not need functionality like that anymore. I believe > > > Mel would be a great contributor to the discussion. > > > > We most definitely do need some form of reclaim throttling based on > > IO congestion, because it is trivial to drive the system into swap > > storms and OOM killer invocation when there are large dirty slab > > caches that require IO to make reclaim progress and there's little > > in the way of page cache to reclaim. 
> > Agreed, but I guess the question is how do we implement that in a reliable > fashion? More on that below... .... > > Hence it seems to me that having IO load feedback to the memory > > reclaim algorithms is most definitely required for memory reclaim to > > be able to make the correct decisions about what to reclaim. If the > > shrinker for the cache that uses 50% of RAM in the machine is saying > > "backoff needed" and it's underlying device is > > congested and limiting object reclaim rates, then it's a pretty good > > indication that reclaim should back off and wait for IO progress to > > be made instead of trying to reclaim from other LRUs that hold an > > insignificant amount of memory compared to the huge cache that is > > backed up waiting on IO completion to make progress.... > > Yes and I think here's the key detail: Reclaim really needs to wait for > slab object cleaning to progress so that slab cache can be shrinked. This > is related, but not always in a straightforward way, with IO progress and > even less with IO congestion on some device. I can easily imagine that e.g. > cleaning of inodes to reclaim inode slab may not be efficient enough to > utilize full paralelism of a fast storage so the storage will not ever > become congested XFS can currently write back inodes at several hundred MB/s if the underlying storage is capable of sustaining that. i.e. it can drive hundreds of thousands of metadata IOPS if the underlying storage can handle that. With the non-blocking reclaim mods, it's all async writeback, so at least for XFS we will be able to drive fast devices into congestion. > - sure it's an inefficiency that could be fixed but should > it misguide reclaim? The problem is that even cleaning inodes at this rate, I can't get reclaim to actually do the right thing. Reclaim is already going wrong for really fast devices.. > I don't think so... 
So I think that to solve this > problem in a robust way, we need to provide a mechanism for slab shrinkers > to say something like "hang on, I can reclaim X objects you asked for but > it will take time, I'll signal to you when they are reclaimable". This way > we avoid blocking in the shrinker and can do more efficient async batched > reclaim and on mm side we have the freedom to either wait for slab reclaim > to progress (if this slab is fundamental to memory pressure) or just go try > reclaim something else. Of course, the devil is in the details :). That's pretty much exactly what my non-blocking XFS inode reclaim patches do. It tries to scan, but when it can't make progress it sets a "need backoff" flag and defers the remaining work and expects the high level code to make a sensible back-off decision. The problem is that the decision the high level code makes at the moment is not sensible - it is "back off for a bit, then increase the reclaim priority and reclaim from the page cache again". That's what is driving the swap storms - inode reclaim says "back-off" and stops trying to do reclaim, and that causes the high level code to reclaim the page cache harder. OTOH, if we *block in the inode shrinker* as we do now, then we don't increase reclaim priority (and hence the amount of page cache scanning) and so the reclaim algorithms don't drive deeply into swap-storm conditions. That's the fundamental problem here - we need to throttle reclaim without *needing to restart the entire high level reclaim loop*. This is an architecture problem more than anything - node and memcg aware shrinkers outnumber the page cache LRU zones by a large number, but we can't throttle on individual shrinkers and wait for them to make progress like we can individual page LRU zone lists. Hence if we want to throttle an individual shrinker, the *only reliable option* we currently have is for the shrinker to block itself. 
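The scan-but-defer pattern described here can be modelled very roughly as follows. This is an illustrative userspace sketch, not the kernel's shrinker API: all struct and function names are invented, and the real XFS patches track deferred work per-shrinker rather than returning it.

```c
#include <assert.h>
#include <stdbool.h>

/* Result of one non-blocking shrinker pass: what was freed immediately,
 * what needs IO before it can be freed, and whether the caller should
 * back off and wait for IO progress instead of the shrinker blocking. */
struct scan_result {
    long freed;         /* clean objects reclaimed this pass */
    long deferred;      /* dirty objects deferred until IO completes */
    bool need_backoff;  /* hint to the high level reclaim code */
};

struct scan_result shrinker_scan(long nr_to_scan, long nr_clean, long nr_dirty)
{
    struct scan_result r = { 0, 0, false };

    /* Free clean objects immediately; they need no IO. */
    r.freed = nr_to_scan < nr_clean ? nr_to_scan : nr_clean;

    /* The remainder of the request can only be satisfied by dirty
     * objects, which need writeback first: defer them and flag backoff
     * rather than sleeping inside the shrinker. */
    r.deferred = nr_to_scan - r.freed;
    if (r.deferred > nr_dirty)
        r.deferred = nr_dirty;
    r.need_backoff = r.deferred > 0;
    return r;
}
```

The point of the sketch is the shape of the interface: the shrinker never sleeps, and the "how long to wait" decision is pushed up to the caller, which is exactly where the thread argues the current high-level code makes the wrong choice.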
I note that we handle similar "need more individual work" conditions in other writeback situations. e.g. the BDI has a "b_more_io" list to park inodes that require more writeback than a single pass. This allows writeback to *fairly* revisit inodes that require large amounts of writeback to do more writeback without needing to start a whole new BDI dirty inode writeback pass. I suspect that this is the sort of thing we need for reclaim - we need to park shrinker instances that needed backoff onto a "need more reclaim" list that we continue to iterate and back-off on until we've done the reclaim work that this specific reclaim priority pass required us to do. And, realistically, to make this all work in a consistent manner, the zone LRU walkers really should be transitioned to run as shrinker instances that are node and memcg aware, and so they do individual backoff and throttling in the same manner that large slab caches do. This way we end up with an integrated, consistent high level reclaim management architecture that automatically balances page cache vs slab cache reclaim balance... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-01-09 23:00 ` Dave Chinner @ 2020-02-05 16:05 ` Mel Gorman 2020-02-06 23:19 ` Dave Chinner 0 siblings, 1 reply; 14+ messages in thread From: Mel Gorman @ 2020-02-05 16:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman This thread is ancient but I'm only getting to it now, to express an interest in the general discussion as much as anything else. On Fri, Jan 10, 2020 at 10:00:43AM +1100, Dave Chinner wrote: > > I don't think so... So I think that to solve this > > problem in a robust way, we need to provide a mechanism for slab shrinkers > > to say something like "hang on, I can reclaim X objects you asked for but > > it will take time, I'll signal to you when they are reclaimable". This way > > we avoid blocking in the shrinker and can do more efficient async batched > > reclaim and on mm side we have the freedom to either wait for slab reclaim > > to progress (if this slab is fundamental to memory pressure) or just go try > > reclaim something else. Of course, the devil is in the details :). > > That's pretty much exactly what my non-blocking XFS inode reclaim > patches do. It tries to scan, but when it can't make progress it > sets a "need backoff" flag and defers the remaining work and expects > the high level code to make a sensible back-off decision. > > The problem is that the decision the high level code makes at the > moment is not sensible - it is "back off for a bit, then increase > the reclaim priority and reclaim from the page cache again". That's > what is driving the swap storms - inode reclaim says "back-off" and > stops trying to do reclaim, and that causes the high level code to > reclaim the page cache harder. 
> > OTOH, if we *block in the inode shrinker* as we do now, then we > don't increase reclaim priority (and hence the amount of page cache > scanning) and so the reclaim algorithms don't drive deeply into > swap-storm conditions. > > That's the fundamental problem here - we need to throttle reclaim > without *needing to restart the entire high level reclaim loop*. > This is an architecture problem more than anything - node and memcg > aware shrinkers outnumber the page cache LRU zones by a large > number, but we can't throttle on individual shrinkers and wait for > them to make progress like we can individual page LRU zone lists. > Hence if we want to throttle an individual shrinker, the *only > reliable option* we currently have is for the shrinker to block > itself. > Despite the topic name, I'm leaning towards thinking that this is not a congestion issue as such. The throttling mechanism based on BDI partially solved old problems of swap storm, direct reclaim issued writeback (historical) or excessive scanning leading to premature OOM kill. When reclaim stopped issuing and waiting on writeback it had to rely on congestion control instead and it always was a bit fragile but mostly worked until hardware moved on, storage got faster, memories got larger, or did something crazy like buy a second disk. The common reason that stalling would occur is because large amounts of dirty/writeback pages were encountered at the tail of the LRU leading to large amounts of CPU time spent on useless scanning and increasing scan rates until OOM occurred. It never took into account any other factor like shrinker state. But fundamentally what gets a process into trouble is when "reclaim efficiency" drops. Efficiency is the ratio between reclaim scan and reclaim steal with perfect efficiency being one page scanned results in one page reclaimed. As long as reclaim efficiency is perfect, a system may be thrashing but it's not stalling on writeback. 
It may still be stalling on read but that tends to be less harmful. Blocking on "congestion" caught one very bad condition where efficiency drops -- excessive dirty/writeback pages on the tail of the file LRU. It happened to be a common condition such as if a USB stick was being written but not the only one. When it happened, excessive clean file pages would be taken, swap storms occur and the system thrashes while the dirty pages are being cleaned.

Roughly in order of severity, the most relevant causes of efficiency drops that come to mind are:

 o page is unevictable due to mlock (goes to separate list)
 o page is accessed and gets activated
 o THP has to be split and does another lap through the LRU
 o page could not be unmapped (probably heavily shared and should be activated anyway)
 o page is dirty/writeback and goes back on the LRU
 o page has associated buffers that cannot be freed

While I'm nowhere near having enough time to write a prototype, I think it could be to throttle reclaim based on recent allocation rate and the contributors to poor reclaim efficiency. Recent allocation rate is appropriate because processes dirtying memory should get caught in balance_dirty_pages. It's only heavy allocators that can drive excessive reclaim for multiple unrelated processes. So first, try and keep a rough track of the recent allocation rate or maybe just something like the number of consecutive allocations that entered the slow path due to a low watermark failure. Once a task enters direct reclaim, track the reasons for poor reclaim efficiency (like the list above but maybe add shrinkers) and calculate a score based on weight. An accessed page would have a light weight, a dirty page would have a heavy weight. Shrinkers could apply some unknown weight but I don't know what might be sensible or what the relative weighting would be. If direct reclaim should continue for another loop, wait on a per-node waitqueue until kswapd frees pages above the high watermark or a timeout. 
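The hand-waved weighting could be sketched as below. To be clear, everything here is invented for illustration — the skip reasons, the weight values, and the timeout formula are all assumptions, not anything the kernel defines:

```c
#include <assert.h>

/* Reasons a page was skipped during a direct reclaim pass.  The set
 * and the weights are hypothetical; light weights for cheap-to-retry
 * conditions, heavy weights for pages that need IO before reclaim. */
enum skip_reason {
    SKIP_ACTIVATED,        /* accessed page got activated */
    SKIP_UNMAP_FAILED,     /* heavily shared, could not unmap */
    SKIP_THP_SPLIT,        /* THP needs splitting, another LRU lap */
    SKIP_DIRTY_WRITEBACK,  /* needs writeback first */
    NR_SKIP_REASONS
};

static const int skip_weight[NR_SKIP_REASONS] = {
    [SKIP_ACTIVATED]       = 1,
    [SKIP_UNMAP_FAILED]    = 2,
    [SKIP_THP_SPLIT]       = 4,
    [SKIP_DIRTY_WRITEBACK] = 8,
};

/* Fold the per-reason skip counts from one reclaim pass into a score. */
long backoff_score(const long counts[NR_SKIP_REASONS])
{
    long score = 0;
    for (int i = 0; i < NR_SKIP_REASONS; i++)
        score += counts[i] * skip_weight[i];
    return score;
}

/* Backoff timeout in ms: scales with the score and with how heavy an
 * allocator the task has recently been (alloc_rate: 0..10, invented
 * scale), capped so a task never sleeps too long per loop. */
long backoff_timeout_ms(long score, int alloc_rate)
{
    long ms = (score / 100) * (1 + alloc_rate);
    return ms > 100 ? 100 : ms;
}
```

A pass that skipped mostly dirty pages thus produces a much longer backoff than one that merely activated referenced pages, which is the behavioural distinction the proposal is after.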
The length of the timeout would depend on how heavy an allocator the process is and the reasons why reclaim efficiency was dropping. The timeout costs should accumulate while a task remains in direct reclaim to limit the chance that an unrelated process is punished. It's all hand-waving but I think this would be enough to detect a heavy allocator encountering lots of dirty pages at the tail of the LRU at high frequency without relying on BDI congestion detection. The downside is if the system really is thrashing then a light allocator can become a heavy allocator because it's trying to read itself from swap or fetch hot data. > And, realistically, to make this all work in a consistent manner, > the zone LRU walkers really should be transitioned to run as shrinker > instances that are node and memcg aware, and so they do individual > backoff and throttling in the same manner that large slab caches do. > This way we end up with an integrated, consistent high level reclaim > management architecture that automatically balances page cache vs > slab cache reclaim balance... > That'd probably make more sense but I don't think it would be mandatory to get some basic replacement for wait_iff_congested working. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-02-05 16:05 ` Mel Gorman @ 2020-02-06 23:19 ` Dave Chinner 2020-02-07 0:08 ` Matthew Wilcox 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2020-02-06 23:19 UTC (permalink / raw) To: Mel Gorman Cc: Jan Kara, Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Wed, Feb 05, 2020 at 04:05:51PM +0000, Mel Gorman wrote: > This thread is ancient but I'm only getting to it now, to express an > interest in the general discussion as much as anything else. > > On Fri, Jan 10, 2020 at 10:00:43AM +1100, Dave Chinner wrote: > > > I don't think so... So I think that to solve this > > > problem in a robust way, we need to provide a mechanism for slab shrinkers > > > to say something like "hang on, I can reclaim X objects you asked for but > > > it will take time, I'll signal to you when they are reclaimable". This way > > > we avoid blocking in the shrinker and can do more efficient async batched > > > reclaim and on mm side we have the freedom to either wait for slab reclaim > > > to progress (if this slab is fundamental to memory pressure) or just go try > > > reclaim something else. Of course, the devil is in the details :). > > > > That's pretty much exactly what my non-blocking XFS inode reclaim > > patches do. It tries to scan, but when it can't make progress it > > sets a "need backoff" flag and defers the remaining work and expects > > the high level code to make a sensible back-off decision. > > > > The problem is that the decision the high level code makes at the > > moment is not sensible - it is "back off for a bit, then increase > > the reclaim priority and reclaim from the page cache again". That's > > what is driving the swap storms - inode reclaim says "back-off" and > > stops trying to do reclaim, and that causes the high level code to > > reclaim the page cache harder. 
> > > > OTOH, if we *block in the inode shrinker* as we do now, then we > > don't increase reclaim priority (and hence the amount of page cache > > scanning) and so the reclaim algorithms don't drive deeply into > > swap-storm conditions. > > > > That's the fundamental problem here - we need to throttle reclaim > > without *needing to restart the entire high level reclaim loop*. > > This is an architecture problem more than anything - node and memcg > > aware shrinkers outnumber the page cache LRU zones by a large > > number, but we can't throttle on individual shrinkers and wait for > > them to make progress like we can individual page LRU zone lists. > > Hence if we want to throttle an individual shrinker, the *only > > reliable option* we currently have is for the shrinker to block > > itself. > > > > Despite the topic name, I'm leaning towards thinking that this is not a > congestion issue as such. The throttling mechanism based on BDI partially > solved old problems of swap storm, direct reclaim issued writeback > (historical) or excessive scanning leading to premature OOM kill. When > reclaim stopped issuing and waiting on writeback it had to rely on congestion > control instead and it always was a bit fragile but mostly worked until > hardware moved on, storage got faster, memories got larger, or did > something crazy like buy a second disk. That's because the code didn't evolve with the changing capabilities of the hardware. Nobody cared because it "mostly worked" and then when it didn't they worked around it in other ways. e.g. the block layer writeback throttle largely throttles swap storms by limiting the amount of swap IO memory reclaim can issue. The issue is that it doesn't prevent swap storms, just moves them "back out of sight" so no-one cares about them much again... 
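The effect of the block layer writeback throttle mentioned above can be caricatured as a simple in-flight IO budget. This is a deliberately crude userspace model with invented names — the real blk-wbt code adapts its limits dynamically from completion latencies rather than using a fixed cap:

```c
#include <assert.h>

/* Toy model of a writeback throttle: a fixed budget of in-flight
 * writes, replenished as IOs complete.  Submissions over budget are
 * refused, which is where the real code would put the submitter to
 * sleep -- bounding how much swap IO reclaim can push at once. */
struct wb_throttle {
    int inflight;  /* writes currently outstanding */
    int limit;     /* maximum allowed in flight */
};

/* Returns 1 if the write may be issued now, 0 if the caller must wait. */
int wbt_try_submit(struct wb_throttle *t)
{
    if (t->inflight >= t->limit)
        return 0;
    t->inflight++;
    return 1;
}

/* Called on IO completion: frees one slot in the budget. */
void wbt_complete(struct wb_throttle *t)
{
    if (t->inflight > 0)
        t->inflight--;
}
```

This also illustrates Dave's complaint: the throttle caps the damage a swap storm does to the device, but nothing in it feeds back to reclaim to stop the storm being generated in the first place.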
> The common reason that stalling would occur is because large amounts of > dirty/writeback pages were encountered at the tail of the LRU leading to > large amounts of CPU time spent on useless scanning and increasing scan > rates until OOM occurred. It never took into account any other factor > like shrinker state. > > But fundamentally what gets a process into trouble is when "reclaim > efficiency" drops. Efficiency is the ratio between reclaim scan and > reclaim steal with perfect efficiency being one page scanned results in > one page reclaimed. As long as reclaim efficiency is perfect, a system > may be thrashing but it's not stalling on writeback. It may still be > stalling on read but that tends to be less harmful. > > Blocking on "congestion" caught one very bad condition where efficiency > drops -- excessive dirty/writeback pages on the tail of the file LRU. It > happened to be a common condition such as if a USB stick was being written > but not the only one. When it happened, excessive clean file pages would > be taken, swap storms occur and the system thrashes while the dirty > pages are being cleaned. > > Roughly in order of severity, the most relevant causes of efficiency > drops that come to mind are > > o page is unevictable due to mlock (goes to separate list) > o page is accessed and gets activated > o THP has to be split and does another lap through the LRU > o page could not be unmapped (probably heavily shared and should be > activated anyway) > o page is dirty/writeback and goes back on the LRU > o page has associated buffers that cannot be freed One of the issues I see here is the focus on the congestion problem entirely from the point of view of page reclaim. What I tried to point out above is that we have *all* the same issues with inode reclaim in the shrinker. The common reason for stalling inode reclaim is large amounts of dirty/writeback inodes on the tail of the LRU. 
Inode reclaim efficiency drops occur because:

 o All LRUs are currently performing reclaim scans, so new direct scans only cause lock and/or IO contention rather than increasing scan rates.
 o inode is unevictable because it is currently pinned by the journal
 o inode has been referenced and activated, so gets skipped
 o inode is locked, so does another lap through the LRU
 o inode is dirty/writeback, so does another lap of the LRU

IOWs, shrinkers have exactly the same problems as the page LRU reclaim. This is why I'm advocating for this problem to be solved in a generic manner, not as a solution focussed entirely around the requirements of page reclaim. > While I'm nowhere near having enough time to write a prototype, I think > it could be to throttle reclaim based on recent allocation rate and the > contributors to poor reclaim efficiency. > > Recent allocation rate is appropriate because processes dirtying memory > should get caught in balance_dirty_pages. It's only heavy allocators that > can drive excessive reclaim for multiple unrelated processes. So first, > try and keep a rough track of the recent allocation rate or maybe just > something like the number of consecutive allocations that entered the > slow path due to a low watermark failure. Inode dirtying is throttled by the filesystem journal space, which has nothing really to do with memory pressure. And inode allocation isn't throttled in any way at all, except by memory reclaim when there is memory pressure. That's the underlying reason that we've traditionally had to throttle reclaim under heavy inode allocation pressure - blocking reclaim on "congestion" in the inode shrinker caught this very bad condition where efficiency drops off a cliff.... Hence a solution that relies on measuring recent allocation rate needs to first add all that infrastructure for every shrinker that does allocation. A solution that only works for the page LRU infrastructure is not a viable solution to the problems being raised here. 
> Once a task enters direct reclaim, track the reasons for poor reclaim > efficiency (like the list above but maybe add shrinkers) and calculate a > score based on weight. An accessed page would have a light weight, a dirty > page would have a heavy weight. Shrinkers could apply some unknown weight > but I don't know what might be sensible or what the relative weighting > would be. I'm not sure what a weighting might achieve given we might be scanning millions of objects (we can have hundreds of millions of cached inodes on the LRUs). An aggregated weight does not indicate whether we skipped lots of referenced clean inodes that would otherwise be trivial to reclaim, or we skipped a smaller amount of dirty inodes that will take a *long time* to reclaim because they require IO.... > If direct reclaim should continue for another loop, wait on a per-node > waitqueue until kswapd frees pages above the high watermark or a > timeout. The length of the timeout would depend on how heavy an allocator > the process is and the reasons why reclaim efficiency was dropping. The > timeout costs should accumulate while a task remains in direct reclaim > to limit the chance that an unrelated process is punished. > > It's all hand-waving but I think this would be enough to detect a heavy > allocator encountering lots of dirty pages at the tail of the LRU at high > frequency without relying on BDI congestion detection. The downside is if > the system really is thrashing then a light allocator can become a heavy > allocator because it's trying to read itself from swap or fetch hot data. But detecting an abundance of dirty pages/inodes on the LRU doesn't really solve the problem of determining if and/or how long we should wait for IO before we try to free more objects. 
There is no problem with having lots of dirty pages/inodes on the LRU as long as the IO subsystem keeps up with the rate at which reclaim is asking them to be written back via async mechanisms (bdi writeback, metadata writeback, etc). The problem comes when we cannot make efficient progress cleaning pages/inodes on the LRU because the IO subsystem is overloaded and cannot clean pages/inodes any faster. At this point, we have to wait for the IO subsystem to make progress and without feedback from the IO subsystem, we have no idea how fast that progress is made. Hence we have no idea how long we need to wait before trying to reclaim again. i.e. the answer can be different depending on hardware behaviour, not just the current instantaneous reclaim and IO state. That's the fundamental problem we need to solve, and realistically it can only be done with some level of feedback from the IO subsystem. I'd be quite happy to attach a BDI to the reclaim feedback structure and tell the high level code "wait on this bdi for X completions or X milliseconds" rather than the current global "any BDI that completes an IO and is not congested breaks waiters out of congestion_wait". This would also solve the problem of wait_iff_congested() waiting on any congested BDI (or being woken by any uncongested BDI) rather than the actual BDI the shrinker is operating/waiting on. Further, we can actually refine this by connecting the memcg the object belongs to to the blk_cgroup that the object is cleaned through (e.g. wb_get_lookup()) from inside reclaim. Hence we should be able to determine if the blkcg that backs the memcg we are reclaiming from is slow because it is being throttled, even though the backing device itself is not congested. IOWs, we will not block memcg reclaim on IO congestion caused by other memcgs.... 
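The "wait on this bdi for X completions or X milliseconds" proposal could take a shape like the following. This is a userspace sketch with invented names — there is no such structure in the kernel, and a real implementation would hang the wait off the BDI's completion path:

```c
#include <assert.h>
#include <stdbool.h>

/* What reclaim would hand back to the high level code instead of
 * calling the global congestion_wait(): which backing device it is
 * actually waiting on, and how much progress counts as "enough". */
struct reclaim_backoff {
    int bdi_id;               /* the BDI the shrinker is waiting on */
    long completions_needed;  /* wake early once this many IOs finish */
    long timeout_ms;          /* otherwise give up after this long */
};

/* Simulated wakeup check: progress only counts if it happened on the
 * BDI we named.  Returns true if reclaim should retry now, false if
 * it should keep waiting (or time out and try something else). */
bool wait_for_bdi_progress(const struct reclaim_backoff *b,
                           int completed_bdi_id, long completions)
{
    if (completed_bdi_id != b->bdi_id)
        return false;  /* completions on some other device don't count */
    return completions >= b->completions_needed;
}
```

Note how the `bdi_id` check models the fix for the wait_iff_congested() problem described above: a waiter is no longer woken by any uncongested BDI in the system, only by progress on the device its objects are actually queued behind.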
> > And, realistically, to make this all work in a consistent manner, > > the zone LRU walkers really should be transitioned to run as shrinker > > instances that are node and memcg aware, and so they do individual > > backoff and throttling in the same manner that large slab caches do. > > This way we end up with an integrated, consistent high level reclaim > > management architecture that automatically balances page cache vs > > slab cache reclaim balance... > > That'd probably make more sense but I don't think it would be mandatory > to get some basic replacement for wait_iff_congested working. Sure, that's the near term issue, but long term it just makes no sense to treat shrinker reclaim differently from page reclaim because they have all the same requirements for IO backoff feedback and control. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-02-06 23:19 ` Dave Chinner @ 2020-02-07 0:08 ` Matthew Wilcox 2020-02-13 3:18 ` Andrew Morton 0 siblings, 1 reply; 14+ messages in thread From: Matthew Wilcox @ 2020-02-07 0:08 UTC (permalink / raw) To: Dave Chinner Cc: Mel Gorman, Jan Kara, Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Fri, Feb 07, 2020 at 10:19:28AM +1100, Dave Chinner wrote: > But detecting an abundance of dirty pages/inodes on the LRU doesn't > really solve the problem of determining if and/or how long we should > wait for IO before we try to free more objects. There is no problem > with having lots of dirty pages/inodes on the LRU as long as the IO > subsystem keeps up with the rate at which reclaim is asking them to > be written back via async mechanisms (bdi writeback, metadata > writeback, etc). > > The problem comes when we cannot make efficient progress cleaning > pages/inodes on the LRU because the IO subsystem is overloaded and > cannot clean pages/inodes any faster. At this point, we have to wait > for the IO subsystem to make progress and without feedback from the > IO subsystem, we have no idea how fast that progress is made. Hence > we have no idea how long we need to wait before trying to reclaim > again. i.e. the answer can be different depending on hardware > behaviour, not just the current instantaneous reclaim and IO state. > > That's the fundamental problem we need to solve, and realistically > it can only be done with some level of feedback from the IO > subsystem. That triggered a memory for me. Jeremy Kerr presented a paper at LCA2006 on a different model where the device driver pulls dirty things from the VM rather than having the VM push dirty things to the device driver. It was prototyped in K42 rather than Linux, but the idea might be useful. http://jk.ozlabs.org/projects/k42/ http://jk.ozlabs.org/projects/k42/device-driven-IO-lca06.pdf ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Congestion 2020-02-07 0:08 ` Matthew Wilcox @ 2020-02-13 3:18 ` Andrew Morton 0 siblings, 0 replies; 14+ messages in thread From: Andrew Morton @ 2020-02-13 3:18 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Mel Gorman, Jan Kara, Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Mel Gorman On Thu, 6 Feb 2020 16:08:53 -0800 Matthew Wilcox <willy@infradead.org> wrote: > On Fri, Feb 07, 2020 at 10:19:28AM +1100, Dave Chinner wrote: > > But detecting an abundance of dirty pages/inodes on the LRU doesn't > > really solve the problem of determining if and/or how long we should > > wait for IO before we try to free more objects. There is no problem > > with having lots of dirty pages/inodes on the LRU as long as the IO > > subsystem keeps up with the rate at which reclaim is asking them to > > be written back via async mechanisms (bdi writeback, metadata > > writeback, etc). > > > > The problem comes when we cannot make efficient progress cleaning > > pages/inodes on the LRU because the IO subsystem is overloaded and > > cannot clean pages/inodes any faster. At this point, we have to wait > > for the IO subsystem to make progress and without feedback from the > > IO subsystem, we have no idea how fast that progress is made. Hence > > we have no idea how long we need to wait before trying to reclaim > > again. i.e. the answer can be different depending on hardware > > behaviour, not just the current instantaneous reclaim and IO state. > > > > That's the fundamental problem we need to solve, and realistically > > it can only be done with some level of feedback from the IO > > subsystem. > > That triggered a memory for me. Jeremy Kerr presented a paper at LCA2006 > on a different model where the device driver pulls dirty things from the VM > rather than having the VM push dirty things to the device driver. It was > prototyped in K42 rather than Linux, but the idea might be useful. 
> > http://jk.ozlabs.org/projects/k42/ > http://jk.ozlabs.org/projects/k42/device-driven-IO-lca06.pdf Fun. Device drivers say "I have spare bandwidth so send me some stuff". But if device drivers could do that, we wouldn't have broken congestion in the first place ;) ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2020-02-13 3:18 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-12-31 12:59 [LSF/MM TOPIC] Congestion Matthew Wilcox 2020-01-04 9:09 ` Dave Chinner 2020-01-06 11:55 ` [Lsf-pc] " Michal Hocko 2020-01-06 23:21 ` Dave Chinner 2020-01-07 8:23 ` Chris Murphy 2020-01-07 11:53 ` Michal Hocko 2020-01-07 20:12 ` Chris Murphy 2020-01-07 11:53 ` Michal Hocko 2020-01-09 11:07 ` Jan Kara 2020-01-09 23:00 ` Dave Chinner 2020-02-05 16:05 ` Mel Gorman 2020-02-06 23:19 ` Dave Chinner 2020-02-07 0:08 ` Matthew Wilcox 2020-02-13 3:18 ` Andrew Morton