* [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred @ 2016-12-22 0:21 David Rientjes 2016-12-22 8:31 ` Kirill A. Shutemov ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: David Rientjes @ 2016-12-22 0:21 UTC (permalink / raw) To: Andrew Morton Cc: Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm Currently, when defrag is set to "madvise", thp allocations will direct reclaim. However, when defrag is set to "defer", all thp allocations do not attempt reclaim regardless of MADV_HUGEPAGE. This patch always directly reclaims for MADV_HUGEPAGE regions when defrag is not set to "never." The idea is that MADV_HUGEPAGE regions really want to be backed by hugepages and are willing to endure the latency at fault as it was the default behavior prior to commit 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option"). In this form, "defer" is a stronger, more heavyweight version of "madvise". Signed-off-by: David Rientjes <rientjes@google.com> --- Documentation/vm/transhuge.txt | 7 +++++-- mm/huge_memory.c | 10 ++++++---- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -121,8 +121,11 @@ to utilise them. "defer" means that an application will wake kswapd in the background to reclaim pages and wake kcompact to compact memory so that THP is -available in the near future. It's the responsibility of khugepaged -to then install the THP pages later. +available in the near future, unless it is for a region where +madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be +used. Kcompactd will attempt to make hugepages available for allocation in +the near future and khugepaged will try to collapse existing memory into +hugepages later. "madvise" will enter direct reclaim like "always" but only for regions that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, */ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) { - bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); + const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) && vma_madvised) return GFP_TRANSHUGE; else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, - &transparent_hugepage_flags)) - return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; - else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, + &transparent_hugepage_flags)) { + return GFP_TRANSHUGE_LIGHT | + (vma_madvised ? __GFP_DIRECT_RECLAIM : + __GFP_KSWAPD_RECLAIM); + } else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes @ 2016-12-22 8:31 ` Kirill A. Shutemov 2016-12-22 10:00 ` Michal Hocko 2017-01-02 8:38 ` Vlastimil Babka 2 siblings, 0 replies; 29+ messages in thread From: Kirill A. Shutemov @ 2016-12-22 8:31 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Wed, Dec 21, 2016 at 04:21:54PM -0800, David Rientjes wrote: > Currently, when defrag is set to "madvise", thp allocations will direct > reclaim. However, when defrag is set to "defer", all thp allocations do > not attempt reclaim regardless of MADV_HUGEPAGE. > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > is not set to "never." The idea is that MADV_HUGEPAGE regions really > want to be backed by hugepages and are willing to endure the latency at > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > thp: set THP defrag by default to madvise and add a stall-free defrag > option"). > > In this form, "defer" is a stronger, more heavyweight version of > "madvise". > > Signed-off-by: David Rientjes <rientjes@google.com> Makes senses to me. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes 2016-12-22 8:31 ` Kirill A. Shutemov @ 2016-12-22 10:00 ` Michal Hocko 2016-12-22 21:05 ` David Rientjes 2017-01-02 8:38 ` Vlastimil Babka 2 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-22 10:00 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Wed 21-12-16 16:21:54, David Rientjes wrote: > Currently, when defrag is set to "madvise", thp allocations will direct > reclaim. However, when defrag is set to "defer", all thp allocations do > not attempt reclaim regardless of MADV_HUGEPAGE. > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > is not set to "never." The idea is that MADV_HUGEPAGE regions really > want to be backed by hugepages and are willing to endure the latency at > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > thp: set THP defrag by default to madvise and add a stall-free defrag > option"). AFAIR "defer" is implemented exactly as intended. To offer a never-stall but allow to form THP in the background option. The patch description doesn't explain why this is not good anymore. Could you give us more details about the motivation and why "madvise" doesn't work for you? This is a user visible change so the reason should better be really documented and strong. I am worried that after this patch we really lose a lightweight option which guarantees _no_ THP specific stalls during the page fault which was the biggest pain in the past. > In this form, "defer" is a stronger, more heavyweight version of > "madvise". > > Signed-off-by: David Rientjes <rientjes@google.com> > --- > Documentation/vm/transhuge.txt | 7 +++++-- > mm/huge_memory.c | 10 ++++++---- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt > --- a/Documentation/vm/transhuge.txt > +++ b/Documentation/vm/transhuge.txt > @@ -121,8 +121,11 @@ to utilise them. > > "defer" means that an application will wake kswapd in the background > to reclaim pages and wake kcompact to compact memory so that THP is > -available in the near future. It's the responsibility of khugepaged > -to then install the THP pages later. > +available in the near future, unless it is for a region where > +madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be > +used. Kcompactd will attempt to make hugepages available for allocation in > +the near future and khugepaged will try to collapse existing memory into > +hugepages later. > > "madvise" will enter direct reclaim like "always" but only for regions > that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, > */ > static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) > { > - bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > + const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > > if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > &transparent_hugepage_flags) && vma_madvised) > return GFP_TRANSHUGE; > else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, > - &transparent_hugepage_flags)) > - return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; > - else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, > + &transparent_hugepage_flags)) { > + return GFP_TRANSHUGE_LIGHT | > + (vma_madvised ? __GFP_DIRECT_RECLAIM : > + __GFP_KSWAPD_RECLAIM); > + } else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, > &transparent_hugepage_flags)) > return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 10:00 ` Michal Hocko @ 2016-12-22 21:05 ` David Rientjes 2016-12-23 8:51 ` Michal Hocko 2016-12-30 12:36 ` Mel Gorman 0 siblings, 2 replies; 29+ messages in thread From: David Rientjes @ 2016-12-22 21:05 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Thu, 22 Dec 2016, Michal Hocko wrote: > > Currently, when defrag is set to "madvise", thp allocations will direct > > reclaim. However, when defrag is set to "defer", all thp allocations do > > not attempt reclaim regardless of MADV_HUGEPAGE. > > > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > > is not set to "never." The idea is that MADV_HUGEPAGE regions really > > want to be backed by hugepages and are willing to endure the latency at > > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > > thp: set THP defrag by default to madvise and add a stall-free defrag > > option"). > > AFAIR "defer" is implemented exactly as intended. To offer a never-stall > but allow to form THP in the background option. The patch description > doesn't explain why this is not good anymore. Could you give us more > details about the motivation and why "madvise" doesn't work for > you? This is a user visible change so the reason should better be really > documented and strong. > The offering of defer breaks backwards compatibility with previous settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on .text segment remap and try to force thp backing if available but not directly reclaim for non VM_HUGEPAGE vmas. This was very advantageous. We prefer that to stay unchanged and allow kcompactd compaction to be triggered in background by everybody else as opposed to direct reclaim. We do not have that ability without this patch. Without this patch, we will be forced to offer multiple sysfs tunables to define (1) direct vs background compact, (2) madvise behavior, (3) always, (4) never and we cannot have 2^4 settings for "defrag" alone. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 21:05 ` David Rientjes @ 2016-12-23 8:51 ` Michal Hocko 2016-12-23 10:01 ` David Rientjes 2016-12-30 12:36 ` Mel Gorman 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-23 8:51 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Thu 22-12-16 13:05:27, David Rientjes wrote: > On Thu, 22 Dec 2016, Michal Hocko wrote: > > > > Currently, when defrag is set to "madvise", thp allocations will direct > > > reclaim. However, when defrag is set to "defer", all thp allocations do > > > not attempt reclaim regardless of MADV_HUGEPAGE. > > > > > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > > > is not set to "never." The idea is that MADV_HUGEPAGE regions really > > > want to be backed by hugepages and are willing to endure the latency at > > > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > > > thp: set THP defrag by default to madvise and add a stall-free defrag > > > option"). > > > > AFAIR "defer" is implemented exactly as intended. To offer a never-stall > > but allow to form THP in the background option. The patch description > > doesn't explain why this is not good anymore. Could you give us more > > details about the motivation and why "madvise" doesn't work for > > you? This is a user visible change so the reason should better be really > > documented and strong. > > > > The offering of defer breaks backwards compatibility with previous > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on > .text segment remap and try to force thp backing if available but not > directly reclaim for non VM_HUGEPAGE vmas. I do not understand the backwards compatibility issue part here. Maybe I am missing something but the semantic of defrag=madvise hasn't changed and a new flag can hardly break backward compatibility. > This was very advantageous. > We prefer that to stay unchanged and allow kcompactd compaction to be > triggered in background by everybody else as opposed to direct reclaim. > We do not have that ability without this patch. So why don't you use defrag=madvise? > Without this patch, we will be forced to offer multiple sysfs tunables to > define (1) direct vs background compact, (2) madvise behavior, (3) always, > (4) never and we cannot have 2^4 settings for "defrag" alone. I disagree. I think the current set of defrag values should be sufficient. We can completely disable direct reclaim, enable it only for opt-in, enable for all and never allow to stall. The advantage of this set of values is that they have _clear_ semantic and behave consistently. If you change defer to "almost never stall except when MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good reason to never allow stalls - especially when he doesn't have a control over the code he is running. Your patch would break this usecase. If we want to provide a better background THP availability we should focus more on kcompactd and the way how it is invoked. Currently we only wake it up during the page allocation path. Long term we want to make it more watermak based I believe. Similar to kswapd. Vlastimil is already playing with this idea. I would prefer such a long term plan more than tweaking THP configuration. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-23 8:51 ` Michal Hocko @ 2016-12-23 10:01 ` David Rientjes 2016-12-23 11:18 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2016-12-23 10:01 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Fri, 23 Dec 2016, Michal Hocko wrote: > > The offering of defer breaks backwards compatibility with previous > > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on > > .text segment remap and try to force thp backing if available but not > > directly reclaim for non VM_HUGEPAGE vmas. > > I do not understand the backwards compatibility issue part here. Maybe I > am missing something but the semantic of defrag=madvise hasn't changed > and a new flag can hardly break backward compatibility. > We have no way to compact memory for users who are not using MADV_HUGEPAGE, which is some customers, others require MADV_HUGEPAGE for .text segment remap while loading their binary, without defrag=always or defrag=defer. The problem is that we want to demand direct compact for MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the madvise. We have no setting, without this patch, to ask for background compaction for everybody so that their fault does not have long latency and for some customers to demand compaction. It's a userspace decision, not a kernel decision, and we have lost that ability. > > This was very advantageous. > > We prefer that to stay unchanged and allow kcompactd compaction to be > > triggered in background by everybody else as opposed to direct reclaim. > > We do not have that ability without this patch. > > So why don't you use defrag=madvise? > Um, wtf? Prior to the patch, we used defrag=always because we do not have low latency option; everybody was forced into it. Now that we do have the option, we wish to use deferred compaction so that we have opportunity to fault hugepages in near future. We also have userspace apps, and others have database apps, which want hugepages and are ok with any latency. This should not be a difficult point to understand. Allow the user to define if they are willing to accept latency with MADV_HUGEPAGE. > I disagree. I think the current set of defrag values should be > sufficient. We can completely disable direct reclaim, enable it only for > opt-in, enable for all and never allow to stall. The advantage of this > set of values is that they have _clear_ semantic and behave > consistently. If you change defer to "almost never stall except when > MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good > reason to never allow stalls - especially when he doesn't have a control > over the code he is running. Your patch would break this usecase. > ?????? Why does the admin care if a user's page fault wants to reclaim to get high order memory? The user incurs the penalty for MADV_HUGEPAGE, it always has. Lol. This objection is nonsensical. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-23 10:01 ` David Rientjes @ 2016-12-23 11:18 ` Michal Hocko 2016-12-23 22:46 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-23 11:18 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Fri 23-12-16 02:01:33, David Rientjes wrote: > On Fri, 23 Dec 2016, Michal Hocko wrote: > > > > The offering of defer breaks backwards compatibility with previous > > > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on > > > .text segment remap and try to force thp backing if available but not > > > directly reclaim for non VM_HUGEPAGE vmas. > > > > I do not understand the backwards compatibility issue part here. Maybe I > > am missing something but the semantic of defrag=madvise hasn't changed > > and a new flag can hardly break backward compatibility. > > > > We have no way to compact memory for users who are not using > MADV_HUGEPAGE, yes we have. it is defrag=always. If you do not want direct compaction and the resulting allocation stalls then you have to rely on kcompactd which is something we should work longterm. > which is some customers, others require MADV_HUGEPAGE for > .text segment remap while loading their binary, without defrag=always or > defrag=defer. The problem is that we want to demand direct compact for > MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the > madvise. and that is the point of defrag=madvise to give them this direct compaction. > We have no setting, without this patch, to ask for background > compaction for everybody so that their fault does not have long latency > and for some customers to demand compaction. that is true and what I am trying to say is that we should aim to give this background compaction for everybody via kcompactd because there are more users than THP who might benefit from low latency high order pages availability. We shouldn't tweak the defer option for that purpose. > It's a userspace decision, not a kernel decision, and we have lost > that ability. I must be missing something but which setting did allow this before? > > > This was very advantageous. > > > We prefer that to stay unchanged and allow kcompactd compaction to be > > > triggered in background by everybody else as opposed to direct reclaim. > > > We do not have that ability without this patch. > > > > So why don't you use defrag=madvise? > > > > Um, wtf? Prior to the patch, we used defrag=always because we do not have > low latency option; everybody was forced into it. Now that we do have > the option, we wish to use deferred compaction so that we have opportunity > to fault hugepages in near future. We also have userspace apps, and > others have database apps, which want hugepages and are ok with any > latency. This should not be a difficult point to understand. Allow the > user to define if they are willing to accept latency with MADV_HUGEPAGE. > > > I disagree. I think the current set of defrag values should be > > sufficient. We can completely disable direct reclaim, enable it only for > > opt-in, enable for all and never allow to stall. The advantage of this > > set of values is that they have _clear_ semantic and behave > > consistently. If you change defer to "almost never stall except when > > MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good > > reason to never allow stalls - especially when he doesn't have a control > > over the code he is running. Your patch would break this usecase. > > > > ?????? Why does the admin care if a user's page fault wants to reclaim to > get high order memory? Because the whole point of the defrag knob is to allow _administrator_ control how much we try to fault in THP. And the primary motivation were latencies. The whole point of introducing defer option was to _never_ stall in the page fault while it still allows to kick the background compaction. If you really want to tweak any option then madvise would be more appropriate IMHO because the semantic would be still clear. Use direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd for others. That being said, I understand your usecase and agree that it is useful to allow background compaction to allow smooth THP deferred allocations while madvise users can tolerate stalls from the direct compaction. I just disagree with tweaking defer option to allow for that because I _believe_ that we should accomplish this in a more generic way and allow kcompactd to be more configurable and proactive. We definitely need background compaction for other users than THP. So please try to think about it some more before sending more wtf replies... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-23 11:18 ` Michal Hocko @ 2016-12-23 22:46 ` David Rientjes 2016-12-26 9:02 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2016-12-23 22:46 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Fri, 23 Dec 2016, Michal Hocko wrote: > > We have no way to compact memory for users who are not using > > MADV_HUGEPAGE, > > yes we have. it is defrag=always. If you do not want direct compaction > and the resulting allocation stalls then you have to rely on kcompactd > which is something we should work longterm. > No, the point of madvise(MADV_HUGEPAGE) is for applications to tell the kernel that they really want hugepages. Really. Everybody else either never did direct compaction or did a substantially watered down version of it. Now, we have a situation where you can either do direct compaction for MADV_HUGEPAGE and nothing for anybody else, or direct compaction for everybody. In our usecase, we want everybody to kick off background compaction because order=9 gfp_mask & __GFP_KSWAPD_RECLAIM is the only thing that is going to trigger background compaction but are unable to do so without still incurring lengthy pagefaults for non MADV_HUGEPAGE users. > > which is some customers, others require MADV_HUGEPAGE for > > .text segment remap while loading their binary, without defrag=always or > > defrag=defer. The problem is that we want to demand direct compact for > > MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the > > madvise. > > and that is the point of defrag=madvise to give them this direct > compaction. > Do you see the problem by first suggesting defrag=always at the top of your reply and then defrag=madvise now? We cannot set both at once, it's the entire problem with the tristate and now quadstate setting. We want a combination: EVERYBODY kicks off background compaction and applications that really want hugepages and are fine with incuring lengthy page fault, such as those (for the third time) remapping .text segment and doing madvise(MADV_HUGEPAGE) before fault, can use the madvise. > > We have no setting, without this patch, to ask for background > > compaction for everybody so that their fault does not have long latency > > and for some customers to demand compaction. > > that is true and what I am trying to say is that we should aim to give > this background compaction for everybody via kcompactd because there are > more users than THP who might benefit from low latency high order pages > availability. My patch does that, we _defer_ for everybody unless you're using madvise(MADV_HUGEPAGE) and really want hugepages. Forget defrag=never exists, it's not important in the discussion. Forget defrag=always exists because all apps, like batch jobs, don't want lengthy pagefaults. We have two options remaining: - defrag=defer: everybody kicks off background compaction, _nobody_ does direct compaction - defrag=madvise: madvise(MADV_HUGEPAGE) does direct compaction, everybody else does nothing The point you're missing is that we _want_ defrag=defer. We really do. We don't want to stall in the page allocator to get thp, but we want to try to make it available in the short term. However, apps that do madvise(MADV_HUGEPAGE), like remapping your .text segment and wanting your text backed by hugepages and incurring the expense up front, or a database, or a vm, _want_ hugepages now and don't care about lengthy page faults. The point is that I HAVE NO SETTING to get that behavior and defrag=madvise is _not_ a solution because it requires the presence of an app that is doing madvise(MADV_HUGEPAGE) AND faulting memory to get any order=9 compaction. > > ?????? Why does the admin care if a user's page fault wants to reclaim to > > get high order memory? > > Because the whole point of the defrag knob is to allow _administrator_ > control how much we try to fault in THP. And the primary motivation were > latencies. The whole point of introducing defer option was to _never_ > stall in the page fault while it still allows to kick the background > compaction. If you really want to tweak any option then madvise would be > more appropriate IMHO because the semantic would be still clear. Use > direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd > for others. > You want defrag=madvise to start doing background compaction for everybody, which was never done before for existing users of defrag=madvise? That might be possible, I don't really care, I just think it's riskier because there are existing users of defrag=madvise who are opting in to new behavior because of the kernel change. This patch changes defrag=defer because it's the new option and people setting the mode know what they are getting. I disagree with your description of what the defrag setting is intended for. The setting of thp defrag is to optimize for apps that truly want transparent behavior, i.e. they aren't doing madvise(MADV_HUGEPAGE). Are they willing to incur lengthy pagefaults for thp when not doing any madvise(2)? defrag=defer should not mean that users of madvise(MADV_HUGEPAGE) that have clearly specified their intent should not be allowed to try compacting memory themselves because they have indicated they are fine with such an expense by doing the madvise(2). This is obviously fine for Kirill, and I have users who remap their .text segment and do madvise(MADV_DONTNEED) because they really want hugepages when they are exec'd, so I'd kindly ask you to consider the real-world use cases that require background compaction to make hugepages available for everybody but allow apps to opt-in to take the expense of compaction on themselves rather than your own theory of what users want. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-23 22:46 ` David Rientjes @ 2016-12-26 9:02 ` Michal Hocko 2016-12-27 0:53 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-26 9:02 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Fri 23-12-16 14:46:43, David Rientjes wrote: [...] > You want defrag=madvise to start doing background compaction for > everybody, which was never done before for existing users of > defrag=madvise? That might be possible, I don't really care, I just think > it's riskier because there are existing users of defrag=madvise who are > opting in to new behavior because of the kernel change. This patch > changes defrag=defer because it's the new option and people setting the > mode know what they are getting. But my primary argument is that if you tweak "defer" value behavior then you lose the only "stall free yet allow background compaction" option. That option is really important. You seem to think that it is the application which is under the control. And I am not all that surprised because you are under control of the whole userspace in your deployments. But there are others where the administrator is not under the control of what application asks for yet he is responsible for the overal "experience" if you will. Long stalls during the page faults are often seen as bugs and users might not really care whether the application writer really wanted THP or not... [...] > This is obviously fine for Kirill, and I have users who remap their .text > segment and do madvise(MADV_DONTNEED) because they really want hugepages > when they are exec'd, so I'd kindly ask you to consider the real-world use > cases that require background compaction to make hugepages available for > everybody but allow apps to opt-in to take the expense of compaction on > themselves rather than your own theory of what users want. I definitely _agree_ that this is a very important usecase! I am just trying to think long term and a more sophisticated background compaction is something that we definitely lack and _want_ longterm. There are more high order users than THP. I believe we really want to teach kcompactd to maintain configurable amount of highorder pages. If there is really a need for an immediate solution^Wworkaround then I think that tweaking the madvise option should be reasonably safe. Admins are really prepared for stalls because they are explicitly opting in for madvise behavior and they will get a background compaction on top. This is a new behavior but I do not see how it would be harmful. If an excessive compaction is a problem then THP can be reduced to madvise only vmas. But, I really _do_ care about having a stall free option which is not a complete disable of the background compaction for THP. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f3c2040edbb1..3679c47faef4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -622,8 +622,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, - &transparent_hugepage_flags) && vma_madvised) - return GFP_TRANSHUGE; + &transparent_hugepage_flags)) + return (vma_madvise) ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-26 9:02 ` Michal Hocko @ 2016-12-27 0:53 ` David Rientjes 2016-12-27 2:32 ` Kirill A. Shutemov 2016-12-27 9:41 ` Michal Hocko 0 siblings, 2 replies; 29+ messages in thread From: David Rientjes @ 2016-12-27 0:53 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Mon, 26 Dec 2016, Michal Hocko wrote: > But my primary argument is that if you tweak "defer" value behavior > then you lose the only "stall free yet allow background compaction" > option. That option is really important. Important to who? What regresses if we kick a background kthread to compact memory for order-9 pageblocks? Why don't we allow userspace to clear __GFP_KSWAPD_RECLAIM if we don't want background reclaim for allocations? > You seem to think that it > is the application which is under the control. And I am not all that > surprised because you are under control of the whole userspace in your > deployments. I have no control over the userspace that runs on my "deployments," I caution you to not make any inferences. > But there are others where the administrator is not under > the control of what application asks for yet he is responsible for the > overal "experience" if you will. The administrator is in charge of an "experience" and wants to avoid background compaction for thp allocations but not background reclaim for any other allocation? (Why am I even replying to this?) If the admin is concerned about anybody doing compaction, they can set defrag to "never". They have had this ability since thp was introduced. > Long stalls during the page faults are > often seen as bugs and users might not really care whether the > application writer really wanted THP or not... > There are no long stalls during page faults introduced by this patch, we are waking up a kthread to do the work. > I definitely _agree_ that this is a very important usecase! I am just > trying to think long term and a more sophisticated background compaction > is something that we definitely lack and _want_ longterm. There are more > high order users than THP. I believe we really want to teach kcompactd > to maintain configurable amount of highorder pages. > We are addressing thp defrag here, not any other use for background compaction for other high-order allocations. I'd prefer that we stay on topic, please. This is only about setting thp defrag to "defer" and if it is possible to kick background compaction and defer direct compaction. We need this patch, Kirill has acked it, and I simply have no more time to talk in circles. > If there is really a need for an immediate solution^Wworkaround then I > think that tweaking the madvise option should be reasonably safe. Admins > are really prepared for stalls because they are explicitly opting in for > madvise behavior and they will get a background compaction on top. This > is a new behavior but I do not see how it would be harmful. If an > excessive compaction is a problem then THP can be reduced to madvise > only vmas. > > But, I really _do_ care about having a stall free option which is not a > complete disable of the background compaction for THP. > This is completely wrong. Before the "defer" option has been introduced, we had "madvise" and should maintain its behavior as much as possible so there are no surprises. We don't change behavior for a tunable out from under existing users because you think you know better. With the new "defer" option, we can make this a stronger variant of "madvise", which Kirill acked, so that existing users of MADV_HUGEPAGE have no change in behavior and we can configure whether we do direct or background compaction for everybody else. If people don't want background compaction, they can set defrag to "madvise". If they want it, they can set it to "defer". It's very simple. That said, I simply don't have the time to continue in circular arguments and would respectfully ask Andrew to apply this acked patch. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-27 0:53 ` David Rientjes @ 2016-12-27 2:32 ` Kirill A. Shutemov 2016-12-27 9:41 ` Michal Hocko 1 sibling, 0 replies; 29+ messages in thread From: Kirill A. Shutemov @ 2016-12-27 2:32 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Mon, Dec 26, 2016 at 04:53:39PM -0800, David Rientjes wrote: > > If there is really a need for an immediate solution^Wworkaround then I > > think that tweaking the madvise option should be reasonably safe. Admins > > are really prepared for stalls because they are explicitly opting in for > > madvise behavior and they will get a background compaction on top. This > > is a new behavior but I do not see how it would be harmful. If an > > excessive compaction is a problem then THP can be reduced to madvise > > only vmas. > > > > But, I really _do_ care about having a stall free option which is not a > > complete disable of the background compaction for THP. > > > > This is completely wrong. Before the "defer" option has been introduced, > we had "madvise" and should maintain its behavior as much as possible so > there are no surprises. We don't change behavior for a tunable out from > under existing users because you think you know better. With the new > "defer" option, we can make this a stronger variant of "madvise", which > Kirill acked, so that existing users of MADV_HUGEPAGE have no change in > behavior and we can configure whether we do direct or background > compaction for everybody else. If people don't want background > compaction, they can set defrag to "madvise". If they want it, they can > set it to "defer". It's very simple. > > That said, I simply don't have the time to continue in circular arguments > and would respectfully ask Andrew to apply this acked patch. +1. I don't see a point to make "defer" weaker than "madvise". MADV_HUGEPAGE is a way for an application to say that it's okay with paying price for huge page allocation. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-27 0:53 ` David Rientjes 2016-12-27 2:32 ` Kirill A. Shutemov @ 2016-12-27 9:41 ` Michal Hocko 2016-12-27 21:36 ` David Rientjes 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-27 9:41 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Mon 26-12-16 16:53:39, David Rientjes wrote: > On Mon, 26 Dec 2016, Michal Hocko wrote: > > > But my primary argument is that if you tweak "defer" value behavior > > then you lose the only "stall free yet allow background compaction" > > option. That option is really important. > > Important to who? To all users who want to have THP without stalls experience. This was the whole point of 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option"). > What regresses if we kick a background kthread to compact memory for > order-9 pageblocks? I am not worried about this part. I am worried about the direct compaction part. > Why don't we allow userspace to clear __GFP_KSWAPD_RECLAIM if we don't > want background reclaim for allocations? > > > You seem to think that it > > is the application which is under the control. And I am not all that > > surprised because you are under control of the whole userspace in your > > deployments. > > I have no control over the userspace that runs on my "deployments," I > caution you to not make any inferences. the usecase you have described suggested otherwise. The way how you are using madvise sounds pretty much intentional to me. This is quite a different thing than running an application which uses madivise because it _thinks_ it is a good idea and you are left with that decision and cannot do anything about that. > > But there are others where the administrator is not under > > the control of what application asks for yet he is responsible for the > > overal "experience" if you will. > > The administrator is in charge of an "experience" and wants to avoid > background compaction for thp allocations but not background reclaim for > any other allocation? I do not understand why you are mentioning the background reclaim/compaction again. All I am talking about here is the _direct_ compaction and the way to prevent from it for _all_ THP requests regardless of the madvise status because that is not under the admin control. > (Why am I even replying to this?) If the admin is > concerned about anybody doing compaction, they can set defrag to "never". > They have had this ability since thp was introduced. > > > Long stalls during the page faults are > > often seen as bugs and users might not really care whether the > > application writer really wanted THP or not... > > > > There are no long stalls during page faults introduced by this patch, we > are waking up a kthread to do the work. Yes there _are_. All madvised vmas can stall now which was not the case before. This is breaking the semantic of the defer option as it was introduced and intended (which should be pretty clear from its name). > > I definitely _agree_ that this is a very important usecase! I am just > > trying to think long term and a more sophisticated background compaction > > is something that we definitely lack and _want_ longterm. There are more > > high order users than THP. I believe we really want to teach kcompactd > > to maintain configurable amount of highorder pages. > > > > We are addressing thp defrag here, not any other use for background > compaction for other high-order allocations. I'd prefer that we stay on > topic, please. This is only about setting thp defrag to "defer" and if it > is possible to kick background compaction and defer direct compaction. We > need this patch, Kirill has acked it, and I simply have no more time to > talk in circles. You seem to completely ignore the review feedback and given arguments which is really sad... > > If there is really a need for an immediate solution^Wworkaround then I > > think that tweaking the madvise option should be reasonably safe. Admins > > are really prepared for stalls because they are explicitly opting in for > > madvise behavior and they will get a background compaction on top. This > > is a new behavior but I do not see how it would be harmful. If an > > excessive compaction is a problem then THP can be reduced to madvise > > only vmas. > > > > But, I really _do_ care about having a stall free option which is not a > > complete disable of the background compaction for THP. > > > > This is completely wrong. Before the "defer" option has been introduced, > we had "madvise" and should maintain its behavior as much as possible so > there are no surprises. We don't change behavior for a tunable out from > under existing users because you think you know better. With the new > "defer" option, we can make this a stronger variant of "madvise", which I do not see why "defer" would be any different in that regards. The defer option is there for 3 releases already. It's not an rc thing... I fail to see why adding a background behavior to one existing knob is a problem while adding a _directly_ visible one to another is OK. This just doesn't make any sense to me. > Kirill acked, so that existing users of MADV_HUGEPAGE have no change in > behavior and we can configure whether we do direct or background > compaction for everybody else. If people don't want background > compaction, they can set defrag to "madvise". If they want it, they can > set it to "defer". It's very simple. > > > That said, I simply don't have the time to continue in circular arguments > and would respectfully ask Andrew to apply this acked patch. for reasons mentioned already Nacked-by: Michal Hocko <mhocko@suse.com> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-27 9:41 ` Michal Hocko @ 2016-12-27 21:36 ` David Rientjes 2016-12-28 8:48 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2016-12-27 21:36 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Tue, 27 Dec 2016, Michal Hocko wrote: > > Important to who? > > To all users who want to have THP without stalls experience. This was > the whole point of 444eb2a449ef ("mm: thp: set THP defrag by default to > madvise and add a stall-free defrag option"). > THEY DO NOT STALL. If the application is not using madvise(MADV_HUGEPAGE), all we do is kick kcompactd. Nothing else. We don't need any kernel tunable for an admin to override an application doing madvise(MADV_HUGEPAGE) when it wants hugepages and is perfectly happy stalling for them. > > > You seem to think that it > > > is the application which is under the control. And I am not all that > > > surprised because you are under control of the whole userspace in your > > > deployments. > > > > I have no control over the userspace that runs on my "deployments," I > > caution you to not make any inferences. > > the usecase you have described suggested otherwise. The way how you are > using madvise sounds pretty much intentional to me. This is quite a > different thing than running an application which uses madivise because > it _thinks_ it is a good idea and you are left with that decision and > cannot do anything about that. > I literally cannot believe I am reading this on lkml. I am legitimately stunned by this. You are saying that the admin thinks it knows better than the application writer and that its madvise(MADV_HUGEPAGE) wasn't actually intentional? We don't introduce tunables so that admins can control the intentional behavior of an application, unless that behavior is a security concern. The application has specified it wants to wait for hugepages and has backwards compatibility with defrag=madvise settings since thp was introduced, which introduced MADV_HUGEPAGE. It's the entire point of MADV_HUGEPAGE existing. > > > Long stalls during the page faults are > > > often seen as bugs and users might not really care whether the > > > application writer really wanted THP or not... > > > > > > > There are no long stalls during page faults introduced by this patch, we > > are waking up a kthread to do the work. > > Yes there _are_. All madvised vmas can stall now which was not the case > before. This is breaking the semantic of the defer option as it was > introduced and intended (which should be pretty clear from its name). > All madvised VMAs stall now because THEY WANT TO STALL. It is unbelievable that you would claim otherwise or think that you know better than the application writer about their application. > > We are addressing thp defrag here, not any other use for background > > compaction for other high-order allocations. I'd prefer that we stay on > > topic, please. This is only about setting thp defrag to "defer" and if it > > is possible to kick background compaction and defer direct compaction. We > > need this patch, Kirill has acked it, and I simply have no more time to > > talk in circles. > > You seem to completely ignore the review feedback and given arguments > which is really sad... > I am perfectly satisfied with Kirill's review feedback because (1) it makes sense and (2) it supports allowing users who do MADV_HUGEPAGE to actually try to get hugepages, which is the point of the madvise. > > That said, I simply don't have the time to continue in circular arguments > > and would respectfully ask Andrew to apply this acked patch. > > for reasons mentioned already > Nacked-by: Michal Hocko <mhocko@suse.com> I hope I'm not being unrealistically optimistic in assuming that this will be the end of this thread. The patch should be merged. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-27 21:36 ` David Rientjes @ 2016-12-28 8:48 ` Michal Hocko 2016-12-28 21:33 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-28 8:48 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Tue 27-12-16 13:36:54, David Rientjes wrote: [...] > All madvised VMAs stall now because THEY WANT TO STALL. It is > unbelievable that you would claim otherwise or think that you know better > than the application writer about their application. I do care more about _users_ and their _experience_ than what application _writers_ think is the best. This is the whole point of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to the system that using transparent hugepages is _preferable_, not mandatory. We have an option to allow stalls for those vmas to increase the allocation success rate. We also have tunable to completely ignore it. And we should also have an option to not stall. Your interpretation of what is the expected stalling behavior of MADV_HUGEPAGE doesn't match what the man page says: : MADV_HUGEPAGE (since Linux 2.6.38) : Enable Transparent Huge Pages (THP) for pages in the range specified : by addr and length. Currently, Transparent Huge Pages work only with : private anonymous pages (see mmap(2)). The kernel will regularly : scan the areas marked as huge page candidates to replace them with : huge pages. The kernel will also allocate huge pages directly : when the region is naturally aligned to the huge page size (see : posix_memalign(2)). : : This feature is primarily aimed at applications that use large mappings : of data and access large regions of that memory at a time (e.g., : virtualization systems such as QEMU). It can very easily waste memory : (e.g., a 2MB mapping that only ever accesses 1 byte will result in 2MB : of wired memory instead of one 4KB page). See the Linux kernel source : file Documentation/vm/transhuge.txt for more details. : : The MADV_HUGEPAGE and MADV_NOHUGEPAGE operations are available only if : the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE. while it clearly contradicts what "defer" option is documented to provide in Documentation/vm/transhuge.txt: : "defer" means that an application will wake kswapd in the background : to reclaim pages and wake kcompact to compact memory so that THP is : available in the near future. It's the responsibility of khugepaged : to then install the THP pages later. I am not going to reply to other your claims because I would just repeat myself. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-28 8:48 ` Michal Hocko @ 2016-12-28 21:33 ` David Rientjes 2016-12-29 8:24 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2016-12-28 21:33 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Wed, 28 Dec 2016, Michal Hocko wrote: > I do care more about _users_ and their _experience_ than what > application _writers_ think is the best. This is the whole point > of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to > the system that using transparent hugepages is _preferable_, not > mandatory. We have an option to allow stalls for those vmas to increase > the allocation success rate. We also have tunable to completely ignore > it. And we should also have an option to not stall. > The application developer who uses madvise(MADV_HUGEPAGE) is doing so for a reason. We lack the ability to defragment in the background for all users who don't want to block while allowing madvise(MADV_HUGEPAGE) users to block, as the changelog for this patch clearly indicates. Thanks. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-28 21:33 ` David Rientjes @ 2016-12-29 8:24 ` Michal Hocko 0 siblings, 0 replies; 29+ messages in thread From: Michal Hocko @ 2016-12-29 8:24 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm On Wed 28-12-16 13:33:49, David Rientjes wrote: > On Wed, 28 Dec 2016, Michal Hocko wrote: > > > I do care more about _users_ and their _experience_ than what > > application _writers_ think is the best. This is the whole point > > of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to > > the system that using transparent hugepages is _preferable_, not > > mandatory. We have an option to allow stalls for those vmas to increase > > the allocation success rate. We also have tunable to completely ignore > > it. And we should also have an option to not stall. > > > > The application developer who uses madvise(MADV_HUGEPAGE) is doing so for > a reason. and nobody questions that... But the application developer can hardly forsee the environment where the application runs. And what might look as a reasonable cost/benefit balance in one setup can turn out completely wrong in a different one - just consider the fragmentation which is the primary contributor to stalls. It is hardly predictable and vary between different workloads/setups a lot. While we have a way (policty if you will) to tell that madvise should be honored as much as possible (defrag=madvise) we do not have a way to tell that even madvised vmas are not worth stalling over because the benefit would not offset the cost. > We lack the ability to defragment in the background for all users who > don't want to block while allowing madvise(MADV_HUGEPAGE) users to block, > as the changelog for this patch clearly indicates. And I agree that this is something to be addressed. I just disagree that this patch is the way how to achieve that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 21:05 ` David Rientjes 2016-12-23 8:51 ` Michal Hocko @ 2016-12-30 12:36 ` Mel Gorman 2016-12-30 12:56 ` Michal Hocko 2016-12-30 22:30 ` David Rientjes 1 sibling, 2 replies; 29+ messages in thread From: Mel Gorman @ 2016-12-30 12:36 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Thu, Dec 22, 2016 at 01:05:27PM -0800, David Rientjes wrote: > On Thu, 22 Dec 2016, Michal Hocko wrote: > > > > Currently, when defrag is set to "madvise", thp allocations will direct > > > reclaim. However, when defrag is set to "defer", all thp allocations do > > > not attempt reclaim regardless of MADV_HUGEPAGE. > > > > > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > > > is not set to "never." The idea is that MADV_HUGEPAGE regions really > > > want to be backed by hugepages and are willing to endure the latency at > > > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > > > thp: set THP defrag by default to madvise and add a stall-free defrag > > > option"). > > > > AFAIR "defer" is implemented exactly as intended. To offer a never-stall > > but allow to form THP in the background option. The patch description > > doesn't explain why this is not good anymore. Could you give us more > > details about the motivation and why "madvise" doesn't work for > > you? This is a user visible change so the reason should better be really > > documented and strong. > > > I ended up not reading this whole thread in detail because it went back and forth a lot. Michal is correct in that my intent for defer was to have "never stall" as the default behaviour. This was because of the number of severe stalls users experienced that lead to recommendations in tuning guides to always disable THP. I'd also seen multiple instances in bug reports for stalls where it was suggested that THP be disabled even when it could not have been a factor. It would be preferred to keep the default behaviour to avoid reintroducing such bugs. That said; > The offering of defer breaks backwards compatibility with previous > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on > .text segment remap and try to force thp backing if available but not > directly reclaim for non VM_HUGEPAGE vmas. This was very advantageous. > We prefer that to stay unchanged and allow kcompactd compaction to be > triggered in background by everybody else as opposed to direct reclaim. > We do not have that ability without this patch. > I accept the reasoning that applications that use MADV_HUGEPAGE really want huge pages and may be willing to incur a large stall to get them. It's impossible for the kernel to know in all cases which behaviour is desirable so something is needed. > Without this patch, we will be forced to offer multiple sysfs tunables to > define (1) direct vs background compact, (2) madvise behavior, (3) always, > (4) never and we cannot have 2^4 settings for "defrag" alone. In itself, I don't see this as a bad thing. I won't nak the patch as-is although I consider it unfortunate and worry that we'll see bugs again about slow start times for qemu (the most common application I'm aware of that uses MADV_HUGEPAGE). If that happens, we'd be forced to have a workaround like a systemtap script that intercepted MADV_HUGEPAGE and stripped it or LD_PRELOAD if a specific application can be controlled. I'll neither ack nor nak this patch. However, I would much prefer an additional option be added to sysfs called defer-fault that would avoid all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I would also prefer that the default option is "defer" for both MADV_HUGEPAGE and faults. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-30 12:36 ` Mel Gorman @ 2016-12-30 12:56 ` Michal Hocko 2016-12-30 14:08 ` Mel Gorman 2016-12-30 22:30 ` David Rientjes 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2016-12-30 12:56 UTC (permalink / raw) To: Mel Gorman Cc: David Rientjes, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Fri 30-12-16 12:36:20, Mel Gorman wrote: [...] > I'll neither ack nor nak this patch. However, I would much prefer an > additional option be added to sysfs called defer-fault that would avoid > all fault-based stalls but still potentially stall for MADV_HUGEPAGE. Would you consider changing the semantic of defer=madvise to invoke KSWAPD for !madvised vmas as acceptable. It would be a change in semantic but I am wondering what would be a risk and potential regression space. Also I am planning to send a pro-active compaction based on a "watermark" as an LSF/MM topic proposal. I suspect that no additional thp specific tunable will be needed if we have a proper compaction watermark tunable. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-30 12:56 ` Michal Hocko @ 2016-12-30 14:08 ` Mel Gorman 0 siblings, 0 replies; 29+ messages in thread From: Mel Gorman @ 2016-12-30 14:08 UTC (permalink / raw) To: Michal Hocko Cc: David Rientjes, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Fri, Dec 30, 2016 at 01:56:16PM +0100, Michal Hocko wrote: > On Fri 30-12-16 12:36:20, Mel Gorman wrote: > [...] > > I'll neither ack nor nak this patch. However, I would much prefer an > > additional option be added to sysfs called defer-fault that would avoid > > all fault-based stalls but still potentially stall for MADV_HUGEPAGE. > > Would you consider changing the semantic of defer=madvise to invoke > KSWAPD for !madvised vmas as acceptable. It would be a change in > semantic but I am wondering what would be a risk and potential > regression space. > I'd worry a little, but not a lot. The concern would be that kswapd waking up would reclaim pages and cause major faults that would have remained resident with the current semantics. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-30 12:36 ` Mel Gorman 2016-12-30 12:56 ` Michal Hocko @ 2016-12-30 22:30 ` David Rientjes 2017-01-03 10:37 ` Mel Gorman 1 sibling, 1 reply; 29+ messages in thread From: David Rientjes @ 2016-12-30 22:30 UTC (permalink / raw) To: Mel Gorman Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Fri, 30 Dec 2016, Mel Gorman wrote: > Michal is correct in that my intent for defer was to have "never stall" > as the default behaviour. This was because of the number of severe stalls > users experienced that lead to recommendations in tuning guides to always > disable THP. I'd also seen multiple instances in bug reports for stalls > where it was suggested that THP be disabled even when it could not have > been a factor. It would be preferred to keep the default behaviour to > avoid reintroducing such bugs. > I sympathize with that, I've dealt with a number of issues that we have encountered where thp defrag was either at fault or wasn't, and there were also suggestions to set defrag to "madvise" to rule it out and that impacted other users. I'm curious if you could show examples where there were severe stalls being encountered by applications that did madvise(MADV_HUGEPAGE) and users were forced to set madvise to "never". That is, after all, the only topic for consideration in this thread: the direct impact to users of madvise(MADV_HUGEPAGE). If an application does it, I believe that's a demand for work to be done at allocation time to try to get hugepages. They can certainly provide an application-level option to not do the MADV_HUGEPAGE. Qemu is no different, you can add options to do madvise(MADV_HUGEPAGE) or not, and you can also do it after fault. The problem with the current option set is that we don't have the ability to trigger background compaction for everybody, which only very minimally impacts their page fault latency since it just wakes up kcompactd, and allow MADV_HUGEPAGE users to accept that up-front cost by doing direct compaction. My usecase, remapping .text segment and faulting thp memory at startup, demands that ability. Setting defrag=madvise gets that behavior, but nobody else triggers background compaction when thp memory fails and we _want_ that behavior so work is being done to defrag. Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I argue that's the wrong behavior. > I'll neither ack nor nak this patch. However, I would much prefer an > additional option be added to sysfs called defer-fault that would avoid > all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I > would also prefer that the default option is "defer" for both MADV_HUGEPAGE > and faults. > If you want a fifth option added to sysfs for thp defrag, that's fine, we can easily do that. I'm slightly concerned with more and more options added that we will eventually approach the 2^4 option count that I mentioned earlier and nobody will know what to select. I'm fine with the kernel default remaining as "madvise," we will just set it to whatever gets us "direct for madvise, background for everybody else" behavior as we were planning on using "defer." We can either do (1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try to get hugepages, potentially adding options to qemu to suppress their MADV_HUGEPAGE if users have complained (would even fix the issue on 2.6 kernels) or do it after majority has been faulted, or (2) add a fifth defrag option to do this suggested behavior and maintain that option forever. I'd obviously prefer the former since I consider MADV_HUGEPAGE and not willing to stall as a userspace issue that can _trivially_ be worked around in userspace, but in the interest of moving forward on this we can do the latter if you'd prefer. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-30 22:30 ` David Rientjes @ 2017-01-03 10:37 ` Mel Gorman 2017-01-03 21:57 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-01-03 10:37 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Fri, Dec 30, 2016 at 02:30:32PM -0800, David Rientjes wrote: > On Fri, 30 Dec 2016, Mel Gorman wrote: > > > Michal is correct in that my intent for defer was to have "never stall" > > as the default behaviour. This was because of the number of severe stalls > > users experienced that lead to recommendations in tuning guides to always > > disable THP. I'd also seen multiple instances in bug reports for stalls > > where it was suggested that THP be disabled even when it could not have > > been a factor. It would be preferred to keep the default behaviour to > > avoid reintroducing such bugs. > > > > I sympathize with that, I've dealt with a number of issues that we have > encountered where thp defrag was either at fault or wasn't, and there were > also suggestions to set defrag to "madvise" to rule it out and that > impacted other users. > > I'm curious if you could show examples where there were severe stalls > being encountered by applications that did madvise(MADV_HUGEPAGE) I do not have a bug report that is specific to MADV_HUGEPAGE. Until very recently they would have been masked by THP fault overhead in general. The current defer logic isn't in the field long enough to generate bugs that are detailed enough to catch something like this. > and > users were forced to set madvise to "never". In the bugs I've dealt with, the switch was between "always" and "never". I haven't seen a bug specific to "madvise". > That is, after all, the only > topic for consideration in this thread: the direct impact to users of > madvise(MADV_HUGEPAGE). If an application does it, I believe that's a > demand for work to be done at allocation time to try to get hugepages. > They can certainly provide an application-level option to not do the > MADV_HUGEPAGE. Qemu is no different, you can add options to do > madvise(MADV_HUGEPAGE) or not, and you can also do it after fault. > True, it's possible that this is minor hence why I didn't want to outright Nak the patch. > The problem with the current option set is that we don't have the ability > to trigger background compaction for everybody, which only very minimally > impacts their page fault latency since it just wakes up kcompactd, and > allow MADV_HUGEPAGE users to accept that up-front cost by doing direct > compaction. My usecase, remapping .text segment and faulting thp memory > at startup, demands that ability. Setting defrag=madvise gets that > behavior, but nobody else triggers background compaction when thp memory > fails and we _want_ that behavior so work is being done to defrag. > Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I > argue that's the wrong behavior. > Again, I accept your reasoning and I don't have direct evidence that it'll be a problem. In an emergency, it could also be worked around using LD_PRELOAD or a systemtap script until a kernel fix could be applied. Unfortunately it could also be years before a patch like this would hit enough users for me to spot the problem in the field. That's not enough to Nak the patch but it was enough to suggest an alternative that would side-step the problem ever occurring. > > I'll neither ack nor nak this patch. However, I would much prefer an > > additional option be added to sysfs called defer-fault that would avoid > > all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I > > would also prefer that the default option is "defer" for both MADV_HUGEPAGE > > and faults. > > > > If you want a fifth option added to sysfs for thp defrag, that's fine, we > can easily do that. I'm slightly concerned with more and more options > added that we will eventually approach the 2^4 option count that I > mentioned earlier and nobody will know what to select. I'm fine with the > kernel default remaining as "madvise," I find it hard to believe this one *can* explode. There are a limited number of user-triggable actions that can trigger stalls. > we will just set it to whatever > gets us "direct for madvise, background for everybody else" behavior as we > were planning on using "defer." > > We can either do > > (1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try > to get hugepages, potentially adding options to qemu to suppress > their MADV_HUGEPAGE if users have complained (would even fix the > issue on 2.6 kernels) or do it after majority has been faulted, or > > (2) add a fifth defrag option to do this suggested behavior and maintain > that option forever. > > I'd obviously prefer the former since I consider MADV_HUGEPAGE and not > willing to stall as a userspace issue that can _trivially_ be worked > around in userspace, but in the interest of moving forward on this we can > do the latter if you'd prefer. The latter is preferred because it prevents any possibility of encountering this in the field and being unable to workaround it with LD_PRELOAD or systemtap hackery but I won't nak the former either on the grounds I have no data it's a problem and it could be a year or more before I have an example. If it's encountered, we'll be back at introducing another sysfs option. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-03 10:37 ` Mel Gorman @ 2017-01-03 21:57 ` David Rientjes 2017-01-04 10:12 ` Mel Gorman 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2017-01-03 21:57 UTC (permalink / raw) To: Mel Gorman Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Tue, 3 Jan 2017, Mel Gorman wrote: > > I sympathize with that, I've dealt with a number of issues that we have > > encountered where thp defrag was either at fault or wasn't, and there were > > also suggestions to set defrag to "madvise" to rule it out and that > > impacted other users. > > > > I'm curious if you could show examples where there were severe stalls > > being encountered by applications that did madvise(MADV_HUGEPAGE) > > I do not have a bug report that is specific to MADV_HUGEPAGE. Until very > recently they would have been masked by THP fault overhead in general. I parse this, the masking of thp fault overhead in general, as an indication that the qemu user was using defrag set to "always" rather than the new kernel default of "madvise". I wholeheartedly agree that we don't want defrag to be set to "always" be default, but that's not really a huge concern: we can easily set it to anything else by initscripts. Qemu, when they added the MADV_HUGEPAGE, obviously wanted to try to allocate hugepages at fault using the available means when defrag was set to "madvise": https://patchwork.ozlabs.org/patch/177695 So now qemu notices no difference that the kernel default has changed, but you later reference qemu in your email about bugs concerning "slow start times." It's puzzling unless you're offering a defrag setting of "defer" to workaround this potential bug report, which affects the whole machine and now qemu users have _no_ option to try to get thp at fault because the admin thinks he knows better, essentially making MADV_HUGEPAGE a no-op with no alternative provided. That's specifically what I'm arguing against. Qemu can be fixed, and I'll do it myself if necessary, when allocating a new RAMBlock or translation buffer to suppress the MADV_HUGEPAGE if configured. It's a very trivial change, and I can do that if you'll kindly point me to the initial bug report so I can propose it to the appropriate user. As Vlastimil also correctly brings up, there is already a prctl(PR_SET_THP_DISABLE) option available to prevent hugepages at fault and simply requires you to fork the process in the correct context to inherit the vma setting, see commit 1e1836e84f87. > The current defer logic isn't in the field long enough to generate bugs > that are detailed enough to catch something like this. > Let us consider this email as a generating a bug that we, the users of MADV_HUGEPAGE that are using the madvise(2) correctly and add flags to suppress it when desired correctly, have no option to allow background compaction for everybody when we cannot allocate thp immediately but also allow users of our library to accept the cost of direct compaction at fault because they really want their .text segment remapped and backed by hugepages. > > The problem with the current option set is that we don't have the ability > > to trigger background compaction for everybody, which only very minimally > > impacts their page fault latency since it just wakes up kcompactd, and > > allow MADV_HUGEPAGE users to accept that up-front cost by doing direct > > compaction. My usecase, remapping .text segment and faulting thp memory > > at startup, demands that ability. Setting defrag=madvise gets that > > behavior, but nobody else triggers background compaction when thp memory > > fails and we _want_ that behavior so work is being done to defrag. > > Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I > > argue that's the wrong behavior. > > > > Again, I accept your reasoning and I don't have direct evidence that it'll be > a problem. In an emergency, it could also be worked around using LD_PRELOAD > or a systemtap script until a kernel fix could be applied. Unfortunately it > could also be years before a patch like this would hit enough users for me > to spot the problem in the field. That's not enough to Nak the patch but > it was enough to suggest an alternative that would side-step the problem > ever occurring. > Or simply forking the application after doing prctl(PR_SET_THP_DISABLE)? What exactly are you working around with a LD_PRELOAD that isn't addressed by this? Btw, is there a qemu bug filed that makes doing the MADV_HUGEPAGE configurable? I don't find it at https://bugs.launchpad.net/qemu. > > If you want a fifth option added to sysfs for thp defrag, that's fine, we > > can easily do that. I'm slightly concerned with more and more options > > added that we will eventually approach the 2^4 option count that I > > mentioned earlier and nobody will know what to select. I'm fine with the > > kernel default remaining as "madvise," > > I find it hard to believe this one *can* explode. There are a limited > number of user-triggable actions that can trigger stalls. > I'm confused as to whether you support the addition of a fifth option that users will have to learn what they want, or whether you are open to changing the behavior of "defer" to actually respect userspace madvise(2)? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-03 21:57 ` David Rientjes @ 2017-01-04 10:12 ` Mel Gorman 2017-01-04 21:53 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-01-04 10:12 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Tue, Jan 03, 2017 at 01:57:33PM -0800, David Rientjes wrote: > On Tue, 3 Jan 2017, Mel Gorman wrote: > > > > I sympathize with that, I've dealt with a number of issues that we have > > > encountered where thp defrag was either at fault or wasn't, and there were > > > also suggestions to set defrag to "madvise" to rule it out and that > > > impacted other users. > > > > > > I'm curious if you could show examples where there were severe stalls > > > being encountered by applications that did madvise(MADV_HUGEPAGE) > > > > I do not have a bug report that is specific to MADV_HUGEPAGE. Until very > > recently they would have been masked by THP fault overhead in general. > > I parse this, the masking of thp fault overhead in general, as an > indication that the qemu user was using defrag set to "always" rather than > the new kernel default of "madvise". > There is a slight disconnect. The bug reports I'm aware of predate the introduction of "defer" and the current "madvise" semantics for defrag. The current semantics have not had enough time in the field to generate reports. I expect lag before users are aware of "defer" due to the number of recommendations out there about blindly disabling THP. This because the majority of users I deal with are not running mainline kernels. > <SNIP> > > Qemu can be fixed, and I'll do it myself if necessary, when allocating a > new RAMBlock or translation buffer to suppress the MADV_HUGEPAGE if > configured. It's a very trivial change, and I can do that if you'll > kindly point me to the initial bug report so I can propose it to the > appropriate user. > I don't have a QEMU-related bug to point to. Even if I did, it would be an enterprise distribution bug that isn't public. My expectation is that if I get a bug, that it's going to be QEMU-related but I'm guessing. > As Vlastimil also correctly brings up, there is already a > prctl(PR_SET_THP_DISABLE) option available to prevent hugepages at fault > and simply requires you to fork the process in the correct context to > inherit the vma setting, see commit 1e1836e84f87. > That disables everything and functionally similar to disabling THP. > > The current defer logic isn't in the field long enough to generate bugs > > that are detailed enough to catch something like this. > > > > Let us consider this email as a generating a bug that we, the users of > MADV_HUGEPAGE that are using the madvise(2) correctly and add flags to > suppress it when desired correctly, have no option to allow background > compaction for everybody when we cannot allocate thp immediately but also > allow users of our library to accept the cost of direct compaction at > fault because they really want their .text segment remapped and backed by > hugepages. > Again, I accept that. A new option for the semantics you want instead of adjusting the existing "defer" semantics is preferred to give the "more heavyweight version of madvise" you're looking for. This is simply because it's easier to resolve in the field when users are not always that knowledgable and are typically reluctant to modify application launching but usually open to adjusting kernel tunables. > > > The problem with the current option set is that we don't have the ability > > > to trigger background compaction for everybody, which only very minimally > > > impacts their page fault latency since it just wakes up kcompactd, and > > > allow MADV_HUGEPAGE users to accept that up-front cost by doing direct > > > compaction. My usecase, remapping .text segment and faulting thp memory > > > at startup, demands that ability. Setting defrag=madvise gets that > > > behavior, but nobody else triggers background compaction when thp memory > > > fails and we _want_ that behavior so work is being done to defrag. > > > Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I > > > argue that's the wrong behavior. > > > > > > > Again, I accept your reasoning and I don't have direct evidence that it'll be > > a problem. In an emergency, it could also be worked around using LD_PRELOAD > > or a systemtap script until a kernel fix could be applied. Unfortunately it > > could also be years before a patch like this would hit enough users for me > > to spot the problem in the field. That's not enough to Nak the patch but > > it was enough to suggest an alternative that would side-step the problem > > ever occurring. > > > > Or simply forking the application after doing prctl(PR_SET_THP_DISABLE)? > What exactly are you working around with a LD_PRELOAD that isn't addressed > by this? > LD_PRELOAD can mask the MADV flag for get the existing "defer" semantics. Disabling THP entirely is something else. > Btw, is there a qemu bug filed that makes doing the MADV_HUGEPAGE > configurable? I don't find it at https://bugs.launchpad.net/qemu. > Not that I'm aware of but I didn't go looking either. > > > If you want a fifth option added to sysfs for thp defrag, that's fine, we > > > can easily do that. I'm slightly concerned with more and more options > > > added that we will eventually approach the 2^4 option count that I > > > mentioned earlier and nobody will know what to select. I'm fine with the > > > kernel default remaining as "madvise," > > > > I find it hard to believe this one *can* explode. There are a limited > > number of user-triggable actions that can trigger stalls. > > > > I'm confused as to whether you support the addition of a fifth option that > users will have to learn what they want, or whether you are open to > changing the behavior of "defer" to actually respect userspace madvise(2)? I would prefer the fifth option and have users select the option if they need it and preserve the existing semantics of defer. However, as I've stated before, as I'm not actually aware of problems with madvise and the current semantics so I cannot nak the patch you have. It'll simply require to have some solution in mind if a bug is encountered. When THP was first introduced, it was months before users started reporting bugs as most users I deal with are not running mainline kernels. There was similar lag when automatic NUMA balancing was introduced. I expect a similar lag for the existing "madvise" default and some education about using "defer" instead of disabling THP entirely when stalls are encountered. It's the hazard of dealing with users that lag behind mainline and it's not a unique problem. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-04 10:12 ` Mel Gorman @ 2017-01-04 21:53 ` David Rientjes 0 siblings, 0 replies; 29+ messages in thread From: David Rientjes @ 2017-01-04 21:53 UTC (permalink / raw) To: Mel Gorman Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm On Wed, 4 Jan 2017, Mel Gorman wrote: > There is a slight disconnect. The bug reports I'm aware of predate the > introduction of "defer" and the current "madvise" semantics for defrag. The > current semantics have not had enough time in the field to generate > reports. I expect lag before users are aware of "defer" due to the number > of recommendations out there about blindly disabling THP. This because > the majority of users I deal with are not running mainline kernels. > I find it sad that we need to have five options for thp defrag and that people now need to research options that don't break their userspace and affect all applications on the system, especially when one of those options now appears to have a hypothetical usecase, but in the interest of making forward progress in this area for support that we truly need, I'll reluctantly propose it as a patch. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2016-12-22 0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes 2016-12-22 8:31 ` Kirill A. Shutemov 2016-12-22 10:00 ` Michal Hocko @ 2017-01-02 8:38 ` Vlastimil Babka 2017-01-03 22:44 ` David Rientjes 2 siblings, 1 reply; 29+ messages in thread From: Vlastimil Babka @ 2017-01-02 8:38 UTC (permalink / raw) To: David Rientjes, Andrew Morton Cc: Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm On 12/22/2016 01:21 AM, David Rientjes wrote: > Currently, when defrag is set to "madvise", thp allocations will direct > reclaim. However, when defrag is set to "defer", all thp allocations do > not attempt reclaim regardless of MADV_HUGEPAGE. > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag > is not set to "never." The idea is that MADV_HUGEPAGE regions really > want to be backed by hugepages and are willing to endure the latency at > fault as it was the default behavior prior to commit 444eb2a449ef ("mm: > thp: set THP defrag by default to madvise and add a stall-free defrag > option"). > > In this form, "defer" is a stronger, more heavyweight version of > "madvise". > > Signed-off-by: David Rientjes <rientjes@google.com> I'm late to the thread (I did read it fully though), so instead of multiple responses, I'll just list my observations here: - "defer", e.g. background kswapd+compaction is not a silver bullet, it will also affect the system. Mel already mentioned extra reclaim. Compaction also has CPU costs, just hides the accounting to a kernel thread so it's not visible as latency. It also increases zone/node lru_lock and lock pressure. For the same reasons, admin might want to limit direct compaction for THP, even for madvise() apps. It's also likely that "defer" might have lower system overhead than "madvise", as with "defer", reclaim/compaction is done by one per-node thread at a time, but there might be multiple madvise() threads. So there might be sense in not allowing madvise() apps to do direct reclaim/compaction on "defer". - for overriding specific apps such as QEMU (including their madvise() usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to LD_PRELOAD stuff IMO. - I have wondered about exactly the issue here when Mel proposed the defer option [1]. Mel responded that it doesn't seem needed at that point. Now it seems it is. Too bad you didn't raise it then, but to be fair you were not CC'd. So would something like this be possible? > echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag > cat /sys/kernel/mm/transparent_hugepage/defrag always [defer] [madvise] never I'm not sure about the analogous kernel boot option though, I guess those can't use spaces, so maybe comma-separated? If that's not acceptable, then I would probably rather be for changing "madvise" to include "defer", than the other way around. When we augment kcompactd to be more proactive, it might easily be that it will effectively act as "defer", even when defrag=none is set, anyway. [1] http://marc.info/?l=linux-mm&m=145683613929750&w=2 > --- > Documentation/vm/transhuge.txt | 7 +++++-- > mm/huge_memory.c | 10 ++++++---- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt > --- a/Documentation/vm/transhuge.txt > +++ b/Documentation/vm/transhuge.txt > @@ -121,8 +121,11 @@ to utilise them. > > "defer" means that an application will wake kswapd in the background > to reclaim pages and wake kcompact to compact memory so that THP is > -available in the near future. It's the responsibility of khugepaged > -to then install the THP pages later. > +available in the near future, unless it is for a region where > +madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be > +used. Kcompactd will attempt to make hugepages available for allocation in > +the near future and khugepaged will try to collapse existing memory into > +hugepages later. > > "madvise" will enter direct reclaim like "always" but only for regions > that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, > */ > static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) > { > - bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > + const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); > > if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > &transparent_hugepage_flags) && vma_madvised) > return GFP_TRANSHUGE; > else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, > - &transparent_hugepage_flags)) > - return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; > - else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, > + &transparent_hugepage_flags)) { > + return GFP_TRANSHUGE_LIGHT | > + (vma_madvised ? __GFP_DIRECT_RECLAIM : > + __GFP_KSWAPD_RECLAIM); > + } else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, > &transparent_hugepage_flags)) > return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); > > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-02 8:38 ` Vlastimil Babka @ 2017-01-03 22:44 ` David Rientjes 2017-01-04 8:32 ` Vlastimil Babka 0 siblings, 1 reply; 29+ messages in thread From: David Rientjes @ 2017-01-03 22:44 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm On Mon, 2 Jan 2017, Vlastimil Babka wrote: > I'm late to the thread (I did read it fully though), so instead of > multiple responses, I'll just list my observations here: > > - "defer", e.g. background kswapd+compaction is not a silver bullet, it > will also affect the system. Mel already mentioned extra reclaim. > Compaction also has CPU costs, just hides the accounting to a kernel > thread so it's not visible as latency. It also increases zone/node > lru_lock and lock pressure. > > For the same reasons, admin might want to limit direct compaction for > THP, even for madvise() apps. It's also likely that "defer" might have > lower system overhead than "madvise", as with "defer", > reclaim/compaction is done by one per-node thread at a time, but there > might be multiple madvise() threads. So there might be sense in not > allowing madvise() apps to do direct reclaim/compaction on "defer". > Hmm, is there a significant benefit to setting "defer" rather than "never" if you can rely on khugepaged to trigger compaction when it tries to allocate. I suppose if there is nothing to collapse that this won't do compaction, but is this not intended for users who always want to defer when not immediately available? "Defer" in it's current setting is useless, in my opinion, other than providing it as a simple workaround to users when their applications are doing MADV_HUGEPAGE without allowing them to configure it. We would love to use "defer" if it didn't completely break MADV_HUGEPAGE, though. > - for overriding specific apps such as QEMU (including their madvise() > usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to > LD_PRELOAD stuff IMO. > Very good point, and I think it's also worthwhile to allow users to suppress the MADV_HUGEPAGE when allocating a translation buffer in qemu if they choose to do so; it's a very trivial patch to qemu to allow this to be configurable. I haven't proposed it because I don't personally have a need for it, and haven't been pointed to anyone who has a need for it. > - I have wondered about exactly the issue here when Mel proposed the > defer option [1]. Mel responded that it doesn't seem needed at that > point. Now it seems it is. Too bad you didn't raise it then, but to be > fair you were not CC'd. > My understanding is that the defer option is available to users who cannot modify their binary to suppress an madvise(MADV_HUGEPAGE) and are unaware that PR_SET_THP_DISABLE exists. The prctl was added specifically when you cannot control your binary. > So would something like this be possible? > > > echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag > > cat /sys/kernel/mm/transparent_hugepage/defrag > always [defer] [madvise] never > > I'm not sure about the analogous kernel boot option though, I guess > those can't use spaces, so maybe comma-separated? > > If that's not acceptable, then I would probably rather be for changing > "madvise" to include "defer", than the other way around. When we augment > kcompactd to be more proactive, it might easily be that it will > effectively act as "defer", even when defrag=none is set, anyway. > The concern I have with changing the behavior of "madvise" is that it changes long standing behavior that people have correctly implemented userspace applications with. I suggest doing this only with "defer" since it's an option that is new, nobody appears to be deploying with, and makes it much more powerful. I think we could make the kernel default as "defer" later as well and not break userspace that has been setting "madvise" ever since the 2.6 kernel. My position is this: userspace that does MADV_HUGEPAGES knows what it's doing. Let it stall if it wants to stall. If users don't want it to be done, allow them to configure it. If a binary has forced you into using it, use the prctl. Otherwise, I think "defer" doing background compaction for everybody and direct compaction for users who really want hugepages is appropriate and is precisely what I need. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-03 22:44 ` David Rientjes @ 2017-01-04 8:32 ` Vlastimil Babka 2017-01-04 9:46 ` Michal Hocko 2017-01-04 22:04 ` David Rientjes 0 siblings, 2 replies; 29+ messages in thread From: Vlastimil Babka @ 2017-01-04 8:32 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm On 01/03/2017 11:44 PM, David Rientjes wrote: > On Mon, 2 Jan 2017, Vlastimil Babka wrote: > >> I'm late to the thread (I did read it fully though), so instead of >> multiple responses, I'll just list my observations here: >> >> - "defer", e.g. background kswapd+compaction is not a silver bullet, it >> will also affect the system. Mel already mentioned extra reclaim. >> Compaction also has CPU costs, just hides the accounting to a kernel >> thread so it's not visible as latency. It also increases zone/node >> lru_lock and lock pressure. >> >> For the same reasons, admin might want to limit direct compaction for >> THP, even for madvise() apps. It's also likely that "defer" might have >> lower system overhead than "madvise", as with "defer", >> reclaim/compaction is done by one per-node thread at a time, but there >> might be multiple madvise() threads. So there might be sense in not >> allowing madvise() apps to do direct reclaim/compaction on "defer". >> > > Hmm, is there a significant benefit to setting "defer" rather than "never" > if you can rely on khugepaged to trigger compaction when it tries to > allocate. I suppose if there is nothing to collapse that this won't do > compaction, but is this not intended for users who always want to defer > when not immediately available? I guess two things - khugepaged is quite sleepy and will not respond to demand quickly, so it won't compact that much than kcompactd triggered by "defer" - thus with "defer" it's more likely that although some THP faults will fail, others in near future will succeed and benefit from THP immediately. Again, khugepaged is much slower. But it may recover long-running processes that were unlucky in the initial faults, so it's not useless. > "Defer" in it's current setting is useless, in my opinion, other than > providing it as a simple workaround to users when their applications are > doing MADV_HUGEPAGE without allowing them to configure it. I don't think the primary motivation for "defer" was to restrict MADV_HUGEPAGE apps, but rather to prevent latency to the majority of apps oblivious to THP when the default was "always". On the other hand, setting "madvise" would make performance needlessly worse in some scenarios, so "defer" is a compromise that tries to provide THP's but without the latency, and still much more timely than khugepaged. But that's just my POV, Mel probably has/had also the MADV_HUGEPAGE restriction in mind. I'd expect that the "you have to disable THP" cargo-cult originated around apps (databases?) that did not use MADV_HUGEPAGE, though. > We would love > to use "defer" if it didn't completely break MADV_HUGEPAGE, though. Right. >> - for overriding specific apps such as QEMU (including their madvise() >> usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to >> LD_PRELOAD stuff IMO. >> > > Very good point, and I think it's also worthwhile to allow users to > suppress the MADV_HUGEPAGE when allocating a translation buffer in qemu if > they choose to do so; it's a very trivial patch to qemu to allow this to > be configurable. I haven't proposed it because I don't personally have a > need for it, and haven't been pointed to anyone who has a need for it. > >> - I have wondered about exactly the issue here when Mel proposed the >> defer option [1]. Mel responded that it doesn't seem needed at that >> point. Now it seems it is. Too bad you didn't raise it then, but to be >> fair you were not CC'd. >> > > My understanding is that the defer option is available to users who cannot > modify their binary to suppress an madvise(MADV_HUGEPAGE) and are unaware > that PR_SET_THP_DISABLE exists. The prctl was added specifically when you > cannot control your binary. Yeah, it's easier than LD_PRELOAD, but still not system-wide transparent. >> So would something like this be possible? >> >>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag >>> cat /sys/kernel/mm/transparent_hugepage/defrag >> always [defer] [madvise] never >> >> I'm not sure about the analogous kernel boot option though, I guess >> those can't use spaces, so maybe comma-separated? No opinion on the above? I think it could be somewhat more elegant than a fifth-option that Mel said he would prefer, and deliver the same flexibility. >> If that's not acceptable, then I would probably rather be for changing >> "madvise" to include "defer", than the other way around. When we augment >> kcompactd to be more proactive, it might easily be that it will >> effectively act as "defer", even when defrag=none is set, anyway. >> > > The concern I have with changing the behavior of "madvise" is that it > changes long standing behavior that people have correctly implemented > userspace applications with. I suggest doing this only with "defer" since > it's an option that is new, nobody appears to be deploying with, and makes > it much more powerful. I think we could make the kernel default as > "defer" later as well and not break userspace that has been setting > "madvise" ever since the 2.6 kernel. > > My position is this: userspace that does MADV_HUGEPAGES knows what it's > doing. Let it stall if it wants to stall. If users don't want it to be > done, allow them to configure it. If a binary has forced you into using > it, use the prctl. Otherwise, I think "defer" doing background compaction > for everybody and direct compaction for users who really want hugepages is > appropriate and is precisely what I need. I'm not completely against this. But we haven't really ruled out the most flexible option yet... > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-04 8:32 ` Vlastimil Babka @ 2017-01-04 9:46 ` Michal Hocko 2017-01-04 22:04 ` David Rientjes 1 sibling, 0 replies; 29+ messages in thread From: Michal Hocko @ 2017-01-04 9:46 UTC (permalink / raw) To: Vlastimil Babka Cc: David Rientjes, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm On Wed 04-01-17 09:32:55, Vlastimil Babka wrote: > On 01/03/2017 11:44 PM, David Rientjes wrote: > > On Mon, 2 Jan 2017, Vlastimil Babka wrote: [...] > >>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag > >>> cat /sys/kernel/mm/transparent_hugepage/defrag > >> always [defer] [madvise] never > >> > >> I'm not sure about the analogous kernel boot option though, I guess > >> those can't use spaces, so maybe comma-separated? > > No opinion on the above? I think it could be somewhat more elegant than > a fifth-option that Mel said he would prefer, and deliver the same > flexibility. I am not sure we have considered the kcompactd watermark option throughly as well. In case the relation is not clear because I admit that the propsal was scattered in more emails. So let me summarize it here. Let's add a system configuration whih would control the pro-active background compaction which would - wake up kcompactd pro-actively even when there is no immediate memory pressure - based on the timeout - keep compacting as long as the requested order is under the configured watermark and the compaction makes further progress. Admin can set up this tunable to reflect demand for the THP in the particular workload. Now how it would play with the THP specific defrag options? - never - THP allocations will be tried without any feedback to kcopactd - no stalls in the page fault path - defer - THP allocations will be tried and kcompactd woken up outside of its wmark setting to catch with the workload - no stalls in the page fault path - madvise - do the direct compaction for madvised VMAs and rely on kcompactd watermarks setting to do the background compaction - always - do the direct compaction for all VMAs We won't have to add or modify any new THP specific option and we will have a generic user independent tunable to tell that the system should try to generate high order pages which is something that is demand for. Such a solution would be more flexible as well because the configuration could reflect the demand much better. Is there any reason, except for not being implemented yet, that would make it inappropriate for the described usecase? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred 2017-01-04 8:32 ` Vlastimil Babka 2017-01-04 9:46 ` Michal Hocko @ 2017-01-04 22:04 ` David Rientjes 1 sibling, 0 replies; 29+ messages in thread From: David Rientjes @ 2017-01-04 22:04 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm On Wed, 4 Jan 2017, Vlastimil Babka wrote: > > Hmm, is there a significant benefit to setting "defer" rather than "never" > > if you can rely on khugepaged to trigger compaction when it tries to > > allocate. I suppose if there is nothing to collapse that this won't do > > compaction, but is this not intended for users who always want to defer > > when not immediately available? > > I guess two things > - khugepaged is quite sleepy and will not respond to demand quickly, so > it won't compact that much than kcompactd triggered by "defer" That's configurable, so if a user sets defrag to never, they also have the ability to make khugepaged more aggressive in the background to complement that decision. > I don't think the primary motivation for "defer" was to restrict > MADV_HUGEPAGE apps, but rather to prevent latency to the majority of > apps oblivious to THP when the default was "always". On the other hand, > setting "madvise" would make performance needlessly worse in some > scenarios, so "defer" is a compromise that tries to provide THP's but > without the latency, and still much more timely than khugepaged. > It's disappointing we need to have an option that exists solely to suppress a userspace MADV_HUGEPAGE and not actually fix the userspace to not do the MADV_HUGEPAGE in the first place by making it configurable. That is backwards compatible and doesn't require a new kernel version. This never gets answered in the thread, however, and I offered to make the very trivial patch to qemu to do that for the translation buffer but nobody who uses qemu is even asking for this. It's baffling. > >> So would something like this be possible? > >> > >>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag > >>> cat /sys/kernel/mm/transparent_hugepage/defrag > >> always [defer] [madvise] never > >> > >> I'm not sure about the analogous kernel boot option though, I guess > >> those can't use spaces, so maybe comma-separated? > > No opinion on the above? I think it could be somewhat more elegant than > a fifth-option that Mel said he would prefer, and deliver the same > flexibility. > I think this would work, but I'm concerned about two things: (1) the kernel command line format as you pointed out earlier, (2) allowing two options to be combined but not other options (always + never), so it takes even more explaining to do to say what you can actually formulate and what the results of that combining is. The tristate, quadstate, and now quint-state options for thp were never extendable, but now this appears to be the most desired option. We can await the bug reports of users who say their MADV_HUGEPAGE is a no-op, though, and tell them their admin needs to switch away from "defer" if anybody actually ever uses that setting. I think you, me, and Kirill are mostly on the same page with respect to this, but I can't argue against hypothetical usecases and how we need to wait years for "defer" to be available to see if any bug reports are generated to make a decision in this area, so my final proposal in this matter will be the reluctant fifth option and if it doesn't work I'll just carry this for ourselves (we have no use for "defer" without this patch). ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2017-01-04 22:05 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-22 0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes 2016-12-22 8:31 ` Kirill A. Shutemov 2016-12-22 10:00 ` Michal Hocko 2016-12-22 21:05 ` David Rientjes 2016-12-23 8:51 ` Michal Hocko 2016-12-23 10:01 ` David Rientjes 2016-12-23 11:18 ` Michal Hocko 2016-12-23 22:46 ` David Rientjes 2016-12-26 9:02 ` Michal Hocko 2016-12-27 0:53 ` David Rientjes 2016-12-27 2:32 ` Kirill A. Shutemov 2016-12-27 9:41 ` Michal Hocko 2016-12-27 21:36 ` David Rientjes 2016-12-28 8:48 ` Michal Hocko 2016-12-28 21:33 ` David Rientjes 2016-12-29 8:24 ` Michal Hocko 2016-12-30 12:36 ` Mel Gorman 2016-12-30 12:56 ` Michal Hocko 2016-12-30 14:08 ` Mel Gorman 2016-12-30 22:30 ` David Rientjes 2017-01-03 10:37 ` Mel Gorman 2017-01-03 21:57 ` David Rientjes 2017-01-04 10:12 ` Mel Gorman 2017-01-04 21:53 ` David Rientjes 2017-01-02 8:38 ` Vlastimil Babka 2017-01-03 22:44 ` David Rientjes 2017-01-04 8:32 ` Vlastimil Babka 2017-01-04 9:46 ` Michal Hocko 2017-01-04 22:04 ` David Rientjes
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).