* THP backed thread stacks @ 2023-03-06 23:57 Mike Kravetz 2023-03-07 0:15 ` Peter Xu ` (3 more replies) 0 siblings, 4 replies; 23+ messages in thread From: Mike Kravetz @ 2023-03-06 23:57 UTC (permalink / raw) To: linux-mm, linux-kernel One of our product teams recently experienced 'memory bloat' in their environment. The application in this environment is the JVM which creates hundreds of threads. Threads are ultimately created via pthread_create which also creates the thread stacks. pthread attributes are modified so that stacks are 2MB in size. It just so happens that due to allocation patterns, all their stacks are at 2MB boundaries. The system has THP always set, so a huge page is allocated at the first (write) fault when libpthread initializes the stack. It would seem that this is expected behavior. If you set THP always, you may get huge pages anywhere. However, I can't help but think that backing stacks with huge pages by default may not be the right thing to do. Stacks by their very nature grow in somewhat unpredictable ways over time. Using a large virtual space so that memory is allocated as needed is the desired behavior. The only way to address their 'memory bloat' via thread stacks today is by switching THP to madvise. Just wondering if there is anything better or more selective that can be done? Does it make sense to have THP backed stacks by default? If not, who would be best at disabling? A couple thoughts: - The kernel could disable huge pages on stacks. libpthread/glibc pass the unused flag MAP_STACK. We could key off this and disable huge pages. However, I'm sure there is somebody somewhere today that is getting better performance because they have huge pages backing their stacks. - We could push this to glibc/libpthreads and have them use MADV_NOHUGEPAGE on thread stacks. However, this also has the potential of regressing performance if somebody somewhere is getting better performance due to huge pages. - Other thoughts? 
Perhaps this is just expected behavior of THP always which is unfortunate in this situation. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-06 23:57 THP backed thread stacks Mike Kravetz @ 2023-03-07 0:15 ` Peter Xu 2023-03-07 0:40 ` Mike Kravetz 2023-03-07 10:10 ` David Hildenbrand ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Peter Xu @ 2023-03-07 0:15 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-mm, linux-kernel On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > One of our product teams recently experienced 'memory bloat' in their > environment. The application in this environment is the JVM which > creates hundreds of threads. Threads are ultimately created via > pthread_create which also creates the thread stacks. pthread attributes > are modified so that stacks are 2MB in size. It just so happens that > due to allocation patterns, all their stacks are at 2MB boundaries. The > system has THP always set, so a huge page is allocated at the first > (write) fault when libpthread initializes the stack. > > It would seem that this is expected behavior. If you set THP always, > you may get huge pages anywhere. > > However, I can't help but think that backing stacks with huge pages by > default may not be the right thing to do. Stacks by their very nature > grow in somewhat unpredictable ways over time. Using a large virtual > space so that memory is allocated as needed is the desired behavior. > > The only way to address their 'memory bloat' via thread stacks today is > by switching THP to madvise. > > Just wondering if there is anything better or more selective that can be > done? Does it make sense to have THP backed stacks by default? If not, > who would be best at disabling? A couple thoughts: > - The kernel could disable huge pages on stacks. libpthread/glibc pass > the unused flag MAP_STACK. We could key off this and disable huge pages. > However, I'm sure there is somebody somewhere today that is getting better > performance because they have huge pages backing their stacks. 
> - We could push this to glibc/libpthreads and have them use > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > of regressing performance if somebody somewhere is getting better > performance due to huge pages. Yes it seems it's always not safe to change a default behavior to me. For stack I really can't tell why it must be different here. I assume the problem is the wasted space and it exaggerates easily with N-threads. But IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() is populated by THP for each thread there'll also be a huge waste. > - Other thoughts? > > Perhaps this is just expected behavior of THP always which is unfortunate > in this situation. I would think it's proper the app explicitly choose what it wants if possible, and we do have the interfaces. Then, would pthread_attr_getstack() plus MADV_NOHUGEPAGE work, which to be applied from the JVM framework level? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-07 0:15 ` Peter Xu @ 2023-03-07 0:40 ` Mike Kravetz 2023-03-08 19:02 ` Mike Kravetz 0 siblings, 1 reply; 23+ messages in thread From: Mike Kravetz @ 2023-03-07 0:40 UTC (permalink / raw) To: Peter Xu; +Cc: linux-mm, linux-kernel On 03/06/23 19:15, Peter Xu wrote: > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > One of our product teams recently experienced 'memory bloat' in their > > environment. The application in this environment is the JVM which > > creates hundreds of threads. Threads are ultimately created via > > pthread_create which also creates the thread stacks. pthread attributes > > are modified so that stacks are 2MB in size. It just so happens that > > due to allocation patterns, all their stacks are at 2MB boundaries. The > > system has THP always set, so a huge page is allocated at the first > > (write) fault when libpthread initializes the stack. > > > > It would seem that this is expected behavior. If you set THP always, > > you may get huge pages anywhere. > > > > However, I can't help but think that backing stacks with huge pages by > > default may not be the right thing to do. Stacks by their very nature > > grow in somewhat unpredictable ways over time. Using a large virtual > > space so that memory is allocated as needed is the desired behavior. > > > > The only way to address their 'memory bloat' via thread stacks today is > > by switching THP to madvise. > > > > Just wondering if there is anything better or more selective that can be > > done? Does it make sense to have THP backed stacks by default? If not, > > who would be best at disabling? A couple thoughts: > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > However, I'm sure there is somebody somewhere today that is getting better > > performance because they have huge pages backing their stacks. 
> > - We could push this to glibc/libpthreads and have them use > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > of regressing performance if somebody somewhere is getting better > > performance due to huge pages. > > Yes it seems it's always not safe to change a default behavior to me. > > For stack I really can't tell why it must be different here. I assume the > problem is the wasted space and it exaggerates easily with N-threads. But > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > is populated by THP for each thread there'll also be a huge waste. > > > - Other thoughts? > > > > Perhaps this is just expected behavior of THP always which is unfortunate > > in this situation. > > I would think it's proper the app explicitly choose what it wants if > possible, and we do have the interfaces. > > Then, would pthread_attr_getstack() plus MADV_NOHUGEPAGE work, which to be > applied from the JVM framework level? Yes, I believe the only way for this to work would be for the JVM (or any application) to explicitly allocate space for the stacks themselves. Then they could do a MADV_NOHUGEPAGE on the stack before calling pthread_create. The JVM (or application) would also need to create the guard page within the stack that libpthread/glibc would normally create. I'm still checking, but I think the JVM will also need to add some additional code so that it knows when threads exit so it can unmap the stacks. That was also something 'for free' if libpthread/glibc is used for stack creation. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-07 0:40 ` Mike Kravetz @ 2023-03-08 19:02 ` Mike Kravetz 2023-03-09 22:38 ` Zach O'Keefe 0 siblings, 1 reply; 23+ messages in thread From: Mike Kravetz @ 2023-03-08 19:02 UTC (permalink / raw) To: Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport Cc: linux-mm, linux-kernel On 03/06/23 16:40, Mike Kravetz wrote: > On 03/06/23 19:15, Peter Xu wrote: > > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > > > > > Just wondering if there is anything better or more selective that can be > > > done? Does it make sense to have THP backed stacks by default? If not, > > > who would be best at disabling? A couple thoughts: > > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > > However, I'm sure there is somebody somewhere today that is getting better > > > performance because they have huge pages backing their stacks. > > > - We could push this to glibc/libpthreads and have them use > > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > > of regressing performance if somebody somewhere is getting better > > > performance due to huge pages. > > > > Yes it seems it's always not safe to change a default behavior to me. > > > > For stack I really can't tell why it must be different here. I assume the > > problem is the wasted space and it exaggerates easily with N-threads. But > > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > > is populated by THP for each thread there'll also be a huge waste. I may be alone in my thinking here, but it seems that stacks by their nature are not generally good candidates for huge pages. I am just thinking about the 'normal' use case where stacks contain local function data and arguments. 
Am I missing something, or are huge pages really a benefit here? Of course, I can imagine some thread with a large amount of frequently accessed data allocated on its stack which could benefit from huge pages. But, this seems to be an exception rather than the rule. I understand the argument that THP always means always and everywhere. It just seems that thread stacks may be 'special enough' to consider disabling by default. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-08 19:02 ` Mike Kravetz @ 2023-03-09 22:38 ` Zach O'Keefe 2023-03-09 23:33 ` Mike Kravetz 0 siblings, 1 reply; 23+ messages in thread From: Zach O'Keefe @ 2023-03-09 22:38 UTC (permalink / raw) To: Mike Kravetz Cc: Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport, linux-mm, linux-kernel On Wed, Mar 8, 2023 at 11:02 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 03/06/23 16:40, Mike Kravetz wrote: > > On 03/06/23 19:15, Peter Xu wrote: > > > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > > > > > > > Just wondering if there is anything better or more selective that can be > > > > done? Does it make sense to have THP backed stacks by default? If not, > > > > who would be best at disabling? A couple thoughts: > > > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > > > However, I'm sure there is somebody somewhere today that is getting better > > > > performance because they have huge pages backing their stacks. > > > > - We could push this to glibc/libpthreads and have them use > > > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > > > of regressing performance if somebody somewhere is getting better > > > > performance due to huge pages. > > > > > > Yes it seems it's always not safe to change a default behavior to me. > > > > > > For stack I really can't tell why it must be different here. I assume the > > > problem is the wasted space and it exaggerates easily with N-threads. But > > > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > > > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > > > is populated by THP for each thread there'll also be a huge waste. > > I may be alone in my thinking here, but it seems that stacks by their nature > are not generally good candidates for huge pages. 
I am just thinking about > the 'normal' use case where stacks contain local function data and arguments. > Am I missing something, or are huge pages really a benefit here? > > Of course, I can imagine some thread with a large amount of frequently > accessed data allocated on it's stack which could benefit from huge > pages. But, this seems to be an exception rather than the rule. > > I understand the argument that THP always means always and everywhere. > It just seems that thread stacks may be 'special enough' to consider > disabling by default Just my drive-by 2c, but would agree with you here (at least wrt hugepages not being good candidates, in general). A user mmap()'ing memory has a lot more (direct) control over how they fault / utilize the memory: you know when you're running out of space and can map more space as needed. For these stacks, you're setting the stack size to 2MB just as a precaution so you can avoid overflow -- AFAIU there is no intention of using the whole mapping (and looking at some data, it's very likely you won't come close). That said, why bother setting stack attribute to 2MiB in size if there isn't some intention of possibly being THP-backed? Moreover, how did it happen that the mappings were always hugepage-aligned here? > -- > Mike Kravetz > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-09 22:38 ` Zach O'Keefe @ 2023-03-09 23:33 ` Mike Kravetz 2023-03-10 0:05 ` Zach O'Keefe 0 siblings, 1 reply; 23+ messages in thread From: Mike Kravetz @ 2023-03-09 23:33 UTC (permalink / raw) To: Zach O'Keefe Cc: Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport, linux-mm, linux-kernel On 03/09/23 14:38, Zach O'Keefe wrote: > On Wed, Mar 8, 2023 at 11:02 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 03/06/23 16:40, Mike Kravetz wrote: > > > On 03/06/23 19:15, Peter Xu wrote: > > > > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > > > > > > > > > Just wondering if there is anything better or more selective that can be > > > > > done? Does it make sense to have THP backed stacks by default? If not, > > > > > who would be best at disabling? A couple thoughts: > > > > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > > > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > > > > However, I'm sure there is somebody somewhere today that is getting better > > > > > performance because they have huge pages backing their stacks. > > > > > - We could push this to glibc/libpthreads and have them use > > > > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > > > > of regressing performance if somebody somewhere is getting better > > > > > performance due to huge pages. > > > > > > > > Yes it seems it's always not safe to change a default behavior to me. > > > > > > > > For stack I really can't tell why it must be different here. I assume the > > > > problem is the wasted space and it exaggerates easily with N-threads. But > > > > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > > > > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > > > > is populated by THP for each thread there'll also be a huge waste. 
> > > > I may be alone in my thinking here, but it seems that stacks by their nature > > are not generally good candidates for huge pages. I am just thinking about > > the 'normal' use case where stacks contain local function data and arguments. > > Am I missing something, or are huge pages really a benefit here? > > > > Of course, I can imagine some thread with a large amount of frequently > > accessed data allocated on it's stack which could benefit from huge > > pages. But, this seems to be an exception rather than the rule. > > > > I understand the argument that THP always means always and everywhere. > > It just seems that thread stacks may be 'special enough' to consider > > disabling by default > > Just my drive-by 2c, but would agree with you here (at least wrt > hugepages not being good candidates, in general). A user mmap()'ing > memory has a lot more (direct) control over how they fault / utilize > the memory: you know when you're running out of space and can map more > space as needed. For these stacks, you're setting the stack size to > 2MB just as a precaution so you can avoid overflow -- AFAIU there is > no intention of using the whole mapping (and looking at some data, > it's very likely you won't come close). > > That said, why bother setting stack attribute to 2MiB in size if there > isn't some intention of possibly being THP-backed? Moreover, how did > it happen that the mappings were always hugepage-aligned here? I do not have the details as to why the Java group chose 2MB for stack size. My 'guess' is that they are trying to save on virtual space (although that seems silly). 2MB is actually reducing the default size. The default pthread stack size on my desktop (fedora) is 8MB. This also is a nice multiple of THP size. I think the hugepage alignment in their environment was somewhat luck. One suggestion made was to change stack size to avoid alignment and hugepage usage. That 'works' but seems kind of hackish. 
Also, David H pointed out the somewhat recent commit to align sufficiently large mappings to THP boundaries. This is going to make all stacks huge page aligned. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-09 23:33 ` Mike Kravetz @ 2023-03-10 0:05 ` Zach O'Keefe 2023-03-10 1:40 ` William Kucharski 2023-03-10 22:02 ` Yang Shi 0 siblings, 2 replies; 23+ messages in thread From: Zach O'Keefe @ 2023-03-10 0:05 UTC (permalink / raw) To: Mike Kravetz Cc: Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport, linux-mm, linux-kernel On Thu, Mar 9, 2023 at 3:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 03/09/23 14:38, Zach O'Keefe wrote: > > On Wed, Mar 8, 2023 at 11:02 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 03/06/23 16:40, Mike Kravetz wrote: > > > > On 03/06/23 19:15, Peter Xu wrote: > > > > > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > > > > > > > > > > > Just wondering if there is anything better or more selective that can be > > > > > > done? Does it make sense to have THP backed stacks by default? If not, > > > > > > who would be best at disabling? A couple thoughts: > > > > > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > > > > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > > > > > However, I'm sure there is somebody somewhere today that is getting better > > > > > > performance because they have huge pages backing their stacks. > > > > > > - We could push this to glibc/libpthreads and have them use > > > > > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > > > > > of regressing performance if somebody somewhere is getting better > > > > > > performance due to huge pages. > > > > > > > > > > Yes it seems it's always not safe to change a default behavior to me. > > > > > > > > > > For stack I really can't tell why it must be different here. I assume the > > > > > problem is the wasted space and it exaggerates easily with N-threads. 
But > > > > > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > > > > > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > > > > > is populated by THP for each thread there'll also be a huge waste. > > > > > > I may be alone in my thinking here, but it seems that stacks by their nature > > > are not generally good candidates for huge pages. I am just thinking about > > > the 'normal' use case where stacks contain local function data and arguments. > > > Am I missing something, or are huge pages really a benefit here? > > > > > > Of course, I can imagine some thread with a large amount of frequently > > > accessed data allocated on it's stack which could benefit from huge > > > pages. But, this seems to be an exception rather than the rule. > > > > > > I understand the argument that THP always means always and everywhere. > > > It just seems that thread stacks may be 'special enough' to consider > > > disabling by default > > > > Just my drive-by 2c, but would agree with you here (at least wrt > > hugepages not being good candidates, in general). A user mmap()'ing > > memory has a lot more (direct) control over how they fault / utilize > > the memory: you know when you're running out of space and can map more > > space as needed. For these stacks, you're setting the stack size to > > 2MB just as a precaution so you can avoid overflow -- AFAIU there is > > no intention of using the whole mapping (and looking at some data, > > it's very likely you won't come close). > > > > That said, why bother setting stack attribute to 2MiB in size if there > > isn't some intention of possibly being THP-backed? Moreover, how did > > it happen that the mappings were always hugepage-aligned here? > > I do not have the details as to why the Java group chose 2MB for stack > size. My 'guess' is that they are trying to save on virtual space (although > that seems silly). 2MB is actually reducing the default size. 
The > default pthread stack size on my desktop (fedora) is 8MB [..] Oh, that's interesting -- I did not know that. That's huge. > [..] This also is > a nice multiple of THP size. > > I think the hugepage alignment in their environment was somewhat luck. > One suggestion made was to change stack size to avoid alignment and > hugepage usage. That 'works' but seems kind of hackish. That was my first thought, if the alignment was purely due to luck, and not somebody manually specifying it. Agreed it's kind of hackish if anyone can get bit by this by sheer luck. > Also, David H pointed out the somewhat recent commit to align sufficiently > large mappings to THP boundaries. This is going to make all stacks huge > page aligned. I think that change was reverted by Linus in commit 0ba09b173387 ("Revert "mm: align larger anonymous mappings on THP boundaries""), until its perf regressions were better understood -- and I haven't seen a revamp of it. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-10 0:05 ` Zach O'Keefe @ 2023-03-10 1:40 ` William Kucharski 2023-03-10 11:25 ` David Hildenbrand 2023-03-10 22:02 ` Yang Shi 1 sibling, 1 reply; 23+ messages in thread From: William Kucharski @ 2023-03-10 1:40 UTC (permalink / raw) To: Zach O'Keefe Cc: Mike Kravetz, Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport, Linux-MM, LKML > On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@google.com> wrote: > >> I think the hugepage alignment in their environment was somewhat luck. >> One suggestion made was to change stack size to avoid alignment and >> hugepage usage. That 'works' but seems kind of hackish. > > That was my first thought, if the alignment was purely due to luck, > and not somebody manually specifying it. Agreed it's kind of hackish > if anyone can get bit by this by sheer luck. I don't agree it's "hackish" at all, but I go more into that below. > >> Also, David H pointed out the somewhat recent commit to align sufficiently >> large mappings to THP boundaries. This is going to make all stacks huge >> page aligned. > > I think that change was reverted by Linus in commit 0ba09b173387 > ("Revert "mm: align larger anonymous mappings on THP boundaries""), > until it's perf regressions were better understood -- and I haven't > seen a revamp of it. It's too bad it was reverted, though I understand the concerns regarding it. From my point of view, if an address is properly aligned and a caller is asking for 2M+ to be mapped, it's going to be advantageous from a purely system-focused point of view to do that mapping with a THP. It's less work for the kernel, generates fewer future page faults, involves less page table manipulation and in general means less hassle all around in the generic case. 
Of course there are all sorts of cases where it may not be the best solution from a performance point of view, but in general I've always preferred the approach of "do it if you CAN" rather than "do it only if asked" for such mappings. You can make a similar bloat argument to the original concern regarding text mappings; you may map a large text region with a THP, and locality of reference may be such that the application actually references little of the mapped space. It still seems that on average you're better off mapping via a THP when possible. It's difficult to heuristically determine whether a caller is "really" going to use a 2M+ space it wants or if it's just being "greedy" and/or is trying to reserve space for "growth later" unless the system receives specific madvise() hints from the caller, so I would prefer an approach where callers would madvise() to shut off rather than enable the behavior. But that's just my $.02 in a discussion where lots of pennies are already being scattered about. :-) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-10 1:40 ` William Kucharski @ 2023-03-10 11:25 ` David Hildenbrand 2023-03-11 12:24 ` William Kucharski 0 siblings, 1 reply; 23+ messages in thread From: David Hildenbrand @ 2023-03-10 11:25 UTC (permalink / raw) To: William Kucharski, Zach O'Keefe Cc: Mike Kravetz, Peter Xu, Rik van Riel, Mike Rapoport, Linux-MM, LKML On 10.03.23 02:40, William Kucharski wrote: > > >> On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@google.com> wrote: >> >>> I think the hugepage alignment in their environment was somewhat luck. >>> One suggestion made was to change stack size to avoid alignment and >>> hugepage usage. That 'works' but seems kind of hackish. >> >> That was my first thought, if the alignment was purely due to luck, >> and not somebody manually specifying it. Agreed it's kind of hackish >> if anyone can get bit by this by sheer luck. > > I don't agree it's "hackish" at all, but I go more into that below. > >> >>> Also, David H pointed out the somewhat recent commit to align sufficiently >>> large mappings to THP boundaries. This is going to make all stacks huge >>> page aligned. >> >> I think that change was reverted by Linus in commit 0ba09b173387 >> ("Revert "mm: align larger anonymous mappings on THP boundaries""), >> until it's perf regressions were better understood -- and I haven't >> seen a revamp of it. > > It's too bad it was reverted, though I understand the concerns regarding it. > > From my point of view, if an address is properly aligned and a caller is > asking for 2M+ to be mapped, it's going to be advantageous from a purely > system-focused point of view to do that mapping with a THP. Just noting that, if user space requests multiple smaller mappings, and the kernel decides to all place them in the same PMD, all VMAs might get merged and you end up with a properly aligned VMA where khugepaged would happily place a THP. 
That case is, of course, different to the "user space asks for 2M+" mapping case, but from khugepaged perspective they might look alike -- and it might be unclear if a THP is valuable or not (IOW maybe that THP could be better used somewhere else). -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-10 11:25 ` David Hildenbrand @ 2023-03-11 12:24 ` William Kucharski 2023-03-12 0:55 ` Hillf Danton 0 siblings, 1 reply; 23+ messages in thread From: William Kucharski @ 2023-03-11 12:24 UTC (permalink / raw) To: David Hildenbrand Cc: Zach O'Keefe, Mike Kravetz, Peter Xu, Rik van Riel, Mike Rapoport, Linux-MM, LKML > On Mar 10, 2023, at 04:25, David Hildenbrand <david@redhat.com> wrote: > > On 10.03.23 02:40, William Kucharski wrote: >>> On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@google.com> wrote: >>> >>>> I think the hugepage alignment in their environment was somewhat luck. >>>> One suggestion made was to change stack size to avoid alignment and >>>> hugepage usage. That 'works' but seems kind of hackish. >>> >>> That was my first thought, if the alignment was purely due to luck, >>> and not somebody manually specifying it. Agreed it's kind of hackish >>> if anyone can get bit by this by sheer luck. >> I don't agree it's "hackish" at all, but I go more into that below. >>> >>>> Also, David H pointed out the somewhat recent commit to align sufficiently >>>> large mappings to THP boundaries. This is going to make all stacks huge >>>> page aligned. >>> >>> I think that change was reverted by Linus in commit 0ba09b173387 >>> ("Revert "mm: align larger anonymous mappings on THP boundaries""), >>> until it's perf regressions were better understood -- and I haven't >>> seen a revamp of it. >> It's too bad it was reverted, though I understand the concerns regarding it. >> From my point of view, if an address is properly aligned and a caller is >> asking for 2M+ to be mapped, it's going to be advantageous from a purely >> system-focused point of view to do that mapping with a THP. > > Just noting that, if user space requests multiple smaller mappings, and the kernel decides to all place them in the same PMD, all VMAs might get merged and you end up with a properly aligned VMA where khugepaged would happily place a THP. 
> > That case is, of course, different to the "user space asks for 2M+" mapping case, but from khugepaged perspective they might look alike -- and it might be unclear if a THP is valuable or not (IOW maybe that THP could be better used somewhere else). That's a really, really good point. My general philosophy on the subject (if the address is aligned and the caller is asking for a THP-sized allocation, why not map it with a THP if you can) kind of falls apart when it's the system noticing it can coalesce a bunch of smaller allocations into one THP via khugepaged. Arguably it's the difference between the caller knowing it's asking for something THP-sized on its behalf and the system deciding to remap a bunch of disparate mappings using a THP because _it_ can. If we were to say allow a caller's request for a THP-sized allocation/mapping take priority over those from khugepaged, it would not only be a major vector for abuse, it would also lead to completely indeterminate behavior ("When I start my browser after a reboot I get a bunch of THPs, but after the system's been up for a few weeks, I don't, how come?") I don't have a good answer here. -- Bill ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-11 12:24 ` William Kucharski @ 2023-03-12 0:55 ` Hillf Danton 2023-03-12 4:39 ` William Kucharski 0 siblings, 1 reply; 23+ messages in thread From: Hillf Danton @ 2023-03-12 0:55 UTC (permalink / raw) To: William Kucharski Cc: David Hildenbrand, Zach O'Keefe, Mike Kravetz, Peter Xu, Rik van Riel, Mike Rapoport, Linux-MM, LKML On 11 Mar 2023 12:24:58 +0000 William Kucharski <william.kucharski@oracle.com> >> On Mar 10, 2023, at 04:25, David Hildenbrand <david@redhat.com> wrote: >> On 10.03.23 02:40, William Kucharski wrote: >>>> On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@google.com> wrote: >>>>=20 >>>>> I think the hugepage alignment in their environment was somewhat luck. >>>>> One suggestion made was to change stack size to avoid alignment and >>>>> hugepage usage. That 'works' but seems kind of hackish. >>>>=20 >>>> That was my first thought, if the alignment was purely due to luck, >>>> and not somebody manually specifying it. Agreed it's kind of hackish >>>> if anyone can get bit by this by sheer luck. >>> I don't agree it's "hackish" at all, but I go more into that below. >>>>=20 >>>>> Also, David H pointed out the somewhat recent commit to align sufficie= >ntly >>>>> large mappings to THP boundaries. This is going to make all stacks hu= >ge >>>>> page aligned. >>>>=20 >>>> I think that change was reverted by Linus in commit 0ba09b173387 >>>> ("Revert "mm: align larger anonymous mappings on THP boundaries""), >>>> until it's perf regressions were better understood -- and I haven't >>>> seen a revamp of it. >>> It's too bad it was reverted, though I understand the concerns regarding= > it. 
>>> From my point of view, if an address is properly aligned and a caller is >>> asking for 2M+ to be mapped, it's going to be advantageous from a purely >>> system-focused point of view to do that mapping with a THP. >> >> Just noting that, if user space requests multiple smaller mappings, and the kernel decides to all place them in the same PMD, all VMAs might get merged and you end up with a properly aligned VMA where khugepaged would happily place a THP. >> >> That case is, of course, different to the "user space asks for 2M+" mapping case, but from khugepaged perspective they might look alike -- and it might be unclear if a THP is valuable or not (IOW maybe that THP could be better used somewhere else). > >That's a really, really good point. > >My general philosophy on the subject (if the address is aligned and the caller is asking for a THP-sized allocation, why not map it with a THP if you can) kind of falls apart when it's the system noticing it can coalesce a bunch of smaller allocations into one THP via khugepaged. > >Arguably it's the difference between the caller knowing it's asking for something THP-sized on its behalf and the system deciding to remap a bunch of disparate mappings using a THP because _it_ can. > >If we were to say allow a caller's request for a THP-sized allocation/mapping take priority over those from khugepaged, it would not only be a major vector for abuse, it would also lead to completely indeterminate behavior ("When I start my browser after a reboot I get a bunch of THPs, but after the system's been up for a few weeks, I don't, how come?") Given transparent_hugepage_flags, how would it be abused? And indeterminate? > >I don't have a good answer here. > > -- Bill ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-12 0:55 ` Hillf Danton @ 2023-03-12 4:39 ` William Kucharski 0 siblings, 0 replies; 23+ messages in thread From: William Kucharski @ 2023-03-12 4:39 UTC (permalink / raw) To: Hillf Danton Cc: David Hildenbrand, Zach O'Keefe, Mike Kravetz, Peter Xu, Rik van Riel, Mike Rapoport, Linux-MM, LKML > On Mar 11, 2023, at 5:55 PM, Hillf Danton <hdanton@sina.com> wrote: > > On 11 Mar 2023 12:24:58 +0000 William Kucharski <william.kucharski@oracle.com> >>> On Mar 10, 2023, at 04:25, David Hildenbrand <david@redhat.com> wrote: >>> On 10.03.23 02:40, William Kucharski wrote: >>>>> On Mar 9, 2023, at 17:05, Zach O'Keefe <zokeefe@google.com> wrote: >>>>> >>>>>> I think the hugepage alignment in their environment was somewhat luck. >>>>>> One suggestion made was to change stack size to avoid alignment and >>>>>> hugepage usage. That 'works' but seems kind of hackish. >>>>> That was my first thought, if the alignment was purely due to luck, >>>>> and not somebody manually specifying it. Agreed it's kind of hackish >>>>> if anyone can get bit by this by sheer luck. >>>> I don't agree it's "hackish" at all, but I go more into that below. >>>>> >>>>>> Also, David H pointed out the somewhat recent commit to align sufficiently >>>>>> large mappings to THP boundaries. This is going to make all stacks huge >>>>>> page aligned. >>>>> I think that change was reverted by Linus in commit 0ba09b173387 >>>>> ("Revert "mm: align larger anonymous mappings on THP boundaries""), >>>>> until its perf regressions were better understood -- and I haven't >>>>> seen a revamp of it. >>>> It's too bad it was reverted, though I understand the concerns regarding it. 
>>>> From my point of view, if an address is properly aligned and a caller is >>>> asking for 2M+ to be mapped, it's going to be advantageous from a purely >>>> system-focused point of view to do that mapping with a THP. >>> >>> Just noting that, if user space requests multiple smaller mappings, and the kernel decides to all place them in the same PMD, all VMAs might get merged and you end up with a properly aligned VMA where khugepaged would happily place a THP. >>> >>> That case is, of course, different to the "user space asks for 2M+" mapping case, but from khugepaged perspective they might look alike -- and it might be unclear if a THP is valuable or not (IOW maybe that THP could be better used somewhere else). >> >> That's a really, really good point. >> >> My general philosophy on the subject (if the address is aligned and the caller is asking for a THP-sized allocation, why not map it with a THP if you can) kind of falls apart when it's the system noticing it can coalesce a bunch of smaller allocations into one THP via khugepaged. >> >> Arguably it's the difference between the caller knowing it's asking for something THP-sized on its behalf and the system deciding to remap a bunch of disparate mappings using a THP because _it_ can. >> >> If we were to say allow a caller's request for a THP-sized allocation/mapping take priority over those from khugepaged, it would not only be a major vector for abuse, it would also lead to completely indeterminate behavior ("When I start my browser after a reboot I get a bunch of THPs, but after the system's been up for a few weeks, I don't, how come?") > > Given transparent_hugepage_flags, how would it be abused? And indeterminate? 
I was speaking in terms of heuristics, if we allowed callers making THP-mappable requests to have priority over khugepaged requests by default, it would be easy for callers to abuse that to request most THP-mappable memory, leaving little with which khugepaged could coalesce smaller pages. This is much the way hugetlbfs is sometimes used now where if callers can't get the allocations they require, users reboot the machine and make sure their applications requiring such allocations run first. My apologies if I am missing an existing mechanism preventing this, it's been a bit since I walked through that code. -- Bill ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-10 0:05 ` Zach O'Keefe 2023-03-10 1:40 ` William Kucharski @ 2023-03-10 22:02 ` Yang Shi 1 sibling, 0 replies; 23+ messages in thread From: Yang Shi @ 2023-03-10 22:02 UTC (permalink / raw) To: Zach O'Keefe Cc: Mike Kravetz, Peter Xu, David Hildenbrand, Rik van Riel, Mike Rapoport, linux-mm, linux-kernel On Thu, Mar 9, 2023 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote: > > On Thu, Mar 9, 2023 at 3:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 03/09/23 14:38, Zach O'Keefe wrote: > > > On Wed, Mar 8, 2023 at 11:02 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > On 03/06/23 16:40, Mike Kravetz wrote: > > > > > On 03/06/23 19:15, Peter Xu wrote: > > > > > > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > > > > > > > > > > > > > Just wondering if there is anything better or more selective that can be > > > > > > > done? Does it make sense to have THP backed stacks by default? If not, > > > > > > > who would be best at disabling? A couple thoughts: > > > > > > > - The kernel could disable huge pages on stacks. libpthread/glibc pass > > > > > > > the unused flag MAP_STACK. We could key off this and disable huge pages. > > > > > > > However, I'm sure there is somebody somewhere today that is getting better > > > > > > > performance because they have huge pages backing their stacks. > > > > > > > - We could push this to glibc/libpthreads and have them use > > > > > > > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > > > > > > > of regressing performance if somebody somewhere is getting better > > > > > > > performance due to huge pages. > > > > > > > > > > > > Yes it seems it's always not safe to change a default behavior to me. > > > > > > > > > > > > For stack I really can't tell why it must be different here. I assume the > > > > > > problem is the wasted space and it exaggerates easily with N-threads. 
But > > > > > > IIUC it'll be the same as thp to normal memories iiuc, e.g., there can be a > > > > > > per-thread mmap() of 2MB even if only 4K is used each, then if such mmap() > > > > > > is populated by THP for each thread there'll also be a huge waste. > > > > > > > > I may be alone in my thinking here, but it seems that stacks by their nature > > > > are not generally good candidates for huge pages. I am just thinking about > > > > the 'normal' use case where stacks contain local function data and arguments. > > > > Am I missing something, or are huge pages really a benefit here? > > > > > > > > Of course, I can imagine some thread with a large amount of frequently > > > > accessed data allocated on it's stack which could benefit from huge > > > > pages. But, this seems to be an exception rather than the rule. > > > > > > > > I understand the argument that THP always means always and everywhere. > > > > It just seems that thread stacks may be 'special enough' to consider > > > > disabling by default > > > > > > Just my drive-by 2c, but would agree with you here (at least wrt > > > hugepages not being good candidates, in general). A user mmap()'ing > > > memory has a lot more (direct) control over how they fault / utilize > > > the memory: you know when you're running out of space and can map more > > > space as needed. For these stacks, you're setting the stack size to > > > 2MB just as a precaution so you can avoid overflow -- AFAIU there is > > > no intention of using the whole mapping (and looking at some data, > > > it's very likely you won't come close). > > > > > > That said, why bother setting stack attribute to 2MiB in size if there > > > isn't some intention of possibly being THP-backed? Moreover, how did > > > it happen that the mappings were always hugepage-aligned here? > > > > I do not have the details as to why the Java group chose 2MB for stack > > size. 
My 'guess' is that they are trying to save on virtual space (although > > that seems silly). 2MB is actually reducing the default size. The > > default pthread stack size on my desktop (fedora) is 8MB [..] > > Oh, that's interesting -- I did not know that. That's huge. > > > [..] This also is > > a nice multiple of THP size. > > > > I think the hugepage alignment in their environment was somewhat luck. > > One suggestion made was to change stack size to avoid alignment and > > hugepage usage. That 'works' but seems kind of hackish. > > That was my first thought, if the alignment was purely due to luck, > and not somebody manually specifying it. Agreed it's kind of hackish > if anyone can get bit by this by sheer luck. > > > Also, David H pointed out the somewhat recent commit to align sufficiently > > large mappings to THP boundaries. This is going to make all stacks huge > > page aligned. > > I think that change was reverted by Linus in commit 0ba09b173387 > ("Revert "mm: align larger anonymous mappings on THP boundaries""), > until its perf regressions were better understood -- and I haven't > seen a revamp of it. The regression has been fixed and it is not related to this commit. I suggested to Andrew that he resurrect this commit a couple of months ago, but it has not been done. > > > -- > > Mike Kravetz > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-06 23:57 THP backed thread stacks Mike Kravetz 2023-03-07 0:15 ` Peter Xu @ 2023-03-07 10:10 ` David Hildenbrand 2023-03-07 19:02 ` Mike Kravetz 2023-03-07 13:36 ` Mike Rapoport 2023-03-17 17:52 ` Matthew Wilcox 3 siblings, 1 reply; 23+ messages in thread From: David Hildenbrand @ 2023-03-07 10:10 UTC (permalink / raw) To: Mike Kravetz, linux-mm, linux-kernel; +Cc: Rik van Riel On 07.03.23 00:57, Mike Kravetz wrote: > One of our product teams recently experienced 'memory bloat' in their > environment. The application in this environment is the JVM which > creates hundreds of threads. Threads are ultimately created via > pthread_create which also creates the thread stacks. pthread attributes > are modified so that stacks are 2MB in size. It just so happens that > due to allocation patterns, all their stacks are at 2MB boundaries. Is this also related to a recent change, where we try to always align at PMD boundaries now, such that this gets more likely? commit f35b5d7d676e59e401690b678cd3cfec5e785c23 Author: Rik van Riel <riel@surriel.com> Date: Tue Aug 9 14:24:57 2022 -0400 mm: align larger anonymous mappings on THP boundaries As a side note, I even heard of complaints about memory bloat when switching from 4k -> 64k page size with many threads ... -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-07 10:10 ` David Hildenbrand @ 2023-03-07 19:02 ` Mike Kravetz 0 siblings, 0 replies; 23+ messages in thread From: Mike Kravetz @ 2023-03-07 19:02 UTC (permalink / raw) To: David Hildenbrand; +Cc: linux-mm, linux-kernel, Rik van Riel On 03/07/23 11:10, David Hildenbrand wrote: > On 07.03.23 00:57, Mike Kravetz wrote: > > One of our product teams recently experienced 'memory bloat' in their > > environment. The application in this environment is the JVM which > > creates hundreds of threads. Threads are ultimately created via > > pthread_create which also creates the thread stacks. pthread attributes > > are modified so that stacks are 2MB in size. It just so happens that > > due to allocation patterns, all their stacks are at 2MB boundaries. > > Is this also related to a recent change, where we try to always align at PMD > boundaries now, such that this gets more likely? Nope, it happens on a kernel without this change. > commit f35b5d7d676e59e401690b678cd3cfec5e785c23 > Author: Rik van Riel <riel@surriel.com> > Date: Tue Aug 9 14:24:57 2022 -0400 > > mm: align larger anonymous mappings on THP boundaries > > > As a side note, I even heard of complains about memory bloat when switching > from 4k -> 64k page size with many threads ... It seems like the 'answer' is to have applications explicitly opt out of THP if they know it is detrimental for some reason. In this case, it makes sense to opt out for thread stacks that are known to be 2MB at most. Unfortunately, this means the application would need to replicate stack creation (including guard pages) as well as cleanup that is done in libpthread/glibc. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-06 23:57 THP backed thread stacks Mike Kravetz 2023-03-07 0:15 ` Peter Xu 2023-03-07 10:10 ` David Hildenbrand @ 2023-03-07 13:36 ` Mike Rapoport 2023-03-17 17:52 ` Matthew Wilcox 3 siblings, 0 replies; 23+ messages in thread From: Mike Rapoport @ 2023-03-07 13:36 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-mm, linux-kernel Hi Mike, On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > One of our product teams recently experienced 'memory bloat' in their > environment. The application in this environment is the JVM which > creates hundreds of threads. Threads are ultimately created via > pthread_create which also creates the thread stacks. pthread attributes > are modified so that stacks are 2MB in size. It just so happens that > due to allocation patterns, all their stacks are at 2MB boundaries. The > system has THP always set, so a huge page is allocated at the first > (write) fault when libpthread initializes the stack. > > It would seem that this is expected behavior. If you set THP always, > you may get huge pages anywhere. > > However, I can't help but think that backing stacks with huge pages by > default may not be the right thing to do. Stacks by their very nature > grow in somewhat unpredictable ways over time. Using a large virtual > space so that memory is allocated as needed is the desired behavior. > > The only way to address their 'memory bloat' via thread stacks today is > by switching THP to madvise. > > Just wondering if there is anything better or more selective that can be > done? Does it make sense to have THP backed stacks by default? If not, > who would be best at disabling? A couple thoughts: > - The kernel could disable huge pages on stacks. libpthread/glibc pass > the unused flag MAP_STACK. We could key off this and disable huge pages. > However, I'm sure there is somebody somewhere today that is getting better > performance because they have huge pages backing their stacks. 
> - We could push this to glibc/libpthreads and have them use > MADV_NOHUGEPAGE on thread stacks. However, this also has the potential > of regressing performance if somebody somewhere is getting better > performance due to huge pages. > - Other thoughts? Push this to the application? :) Something like pthread_attr_getstack() + madvise(MADV_NOHUGEPAGE) will do the job, no? > Perhaps this is just expected behavior of THP always which is unfortunate > in this situation. > -- > Mike Kravetz > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-06 23:57 THP backed thread stacks Mike Kravetz ` (2 preceding siblings ...) 2023-03-07 13:36 ` Mike Rapoport @ 2023-03-17 17:52 ` Matthew Wilcox 2023-03-17 18:46 ` Mike Kravetz 2023-03-18 12:58 ` David Laight 3 siblings, 2 replies; 23+ messages in thread From: Matthew Wilcox @ 2023-03-17 17:52 UTC (permalink / raw) To: Mike Kravetz; +Cc: linux-mm, linux-kernel On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > One of our product teams recently experienced 'memory bloat' in their > environment. The application in this environment is the JVM which > creates hundreds of threads. Threads are ultimately created via > pthread_create which also creates the thread stacks. pthread attributes > are modified so that stacks are 2MB in size. It just so happens that > due to allocation patterns, all their stacks are at 2MB boundaries. The > system has THP always set, so a huge page is allocated at the first > (write) fault when libpthread initializes the stack. Do you happen to have an strace (or similar) so we can understand what the application is doing? My understanding is that for a normal app (like, say, 'cat'), we'll allow up to an 8MB stack, but we only create a VMA that is 4kB in size and set the VM_GROWSDOWN flag on it (to allow it to magically grow). Therefore we won't create a 2MB page because the VMA is too small. It sounds like the pthread library is maybe creating a 2MB stack as a 2MB VMA, and that's why we're seeing this behaviour? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-17 17:52 ` Matthew Wilcox @ 2023-03-17 18:46 ` Mike Kravetz 2023-03-20 11:12 ` David Hildenbrand 2023-03-18 12:58 ` David Laight 1 sibling, 1 reply; 23+ messages in thread From: Mike Kravetz @ 2023-03-17 18:46 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-mm, linux-kernel On 03/17/23 17:52, Matthew Wilcox wrote: > On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > > One of our product teams recently experienced 'memory bloat' in their > > environment. The application in this environment is the JVM which > > creates hundreds of threads. Threads are ultimately created via > > pthread_create which also creates the thread stacks. pthread attributes > > are modified so that stacks are 2MB in size. It just so happens that > > due to allocation patterns, all their stacks are at 2MB boundaries. The > > system has THP always set, so a huge page is allocated at the first > > (write) fault when libpthread initializes the stack. > > Do you happen to have an strace (or similar) so we can understand what > the application is doing? > > My understanding is that for a normal app (like, say, 'cat'), we'll > allow up to an 8MB stack, but we only create a VMA that is 4kB in size > and set the VM_GROWSDOWN flag on it (to allow it to magically grow). > Therefore we won't create a 2MB page because the VMA is too small. > > It sounds like the pthread library is maybe creating a 2MB stack as > a 2MB VMA, and that's why we're seeing this behaviour? Yes, pthread stacks create a VMA equal to stack size which is different than 'main thread' stack. The 2MB size for pthread stacks created by JVM is actually them explicitly requesting the size (8MB default). We have a good understanding of what is happening. Behavior actually changed a bit with glibc versions in OL7 vs OL8. Do note that THP usage is somewhat out of the control of an application IF they rely on glibc/pthread to allocate stacks. 
Only way for application to make sure pthread stacks do not use THP would be for them to allocate themselves. Then, they would need to set up the guard page themselves. They would also need to monitor the status of all threads to determine when stacks could be deleted. A bunch of extra code that glibc/pthread already does for free. Oracle glibc team is also involved, and it 'looks' like they may have upstream buy in to add a flag to explicitly enable or disable hugepages on pthread stacks. It seems like consensus from mm community is that we should not treat stacks any differently than any other mappings WRT THP. That is OK, just wanted to throw it out there. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-17 18:46 ` Mike Kravetz @ 2023-03-20 11:12 ` David Hildenbrand 2023-03-20 17:46 ` William Kucharski 0 siblings, 1 reply; 23+ messages in thread From: David Hildenbrand @ 2023-03-20 11:12 UTC (permalink / raw) To: Mike Kravetz, Matthew Wilcox; +Cc: linux-mm, linux-kernel On 17.03.23 19:46, Mike Kravetz wrote: > On 03/17/23 17:52, Matthew Wilcox wrote: >> On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: >>> One of our product teams recently experienced 'memory bloat' in their >>> environment. The application in this environment is the JVM which >>> creates hundreds of threads. Threads are ultimately created via >>> pthread_create which also creates the thread stacks. pthread attributes >>> are modified so that stacks are 2MB in size. It just so happens that >>> due to allocation patterns, all their stacks are at 2MB boundaries. The >>> system has THP always set, so a huge page is allocated at the first >>> (write) fault when libpthread initializes the stack. >> >> Do you happen to have an strace (or similar) so we can understand what >> the application is doing? >> >> My understanding is that for a normal app (like, say, 'cat'), we'll >> allow up to an 8MB stack, but we only create a VMA that is 4kB in size >> and set the VM_GROWSDOWN flag on it (to allow it to magically grow). >> Therefore we won't create a 2MB page because the VMA is too small. >> >> It sounds like the pthread library is maybe creating a 2MB stack as >> a 2MB VMA, and that's why we're seeing this behaviour? > > Yes, pthread stacks create a VMA equal to stack size which is different > than 'main thread' stack. The 2MB size for pthread stacks created by > JVM is actually them explicitly requesting the size (8MB default). > > We have a good understanding of what is happening. Behavior actually > changed a bit with glibc versions in OL7 vs OL8. 
Do note that THP usage > is somewhat out of the control of an application IF they rely on > glibc/pthread to allocate stacks. Only way for application to make sure > pthread stacks do not use THP would be for them to allocate themselves. > Then, they would need to set up the guard page themselves. They would > also need to monitor the status of all threads to determine when stacks > could be deleted. A bunch of extra code that glibc/pthread already does > for free. > > Oracle glibc team is also involved, and it 'looks' like they may have > upstream buy in to add a flag to explicitly enable or disable hugepages > on pthread stacks. > > It seems like consensus from mm community is that we should not > treat stacks any differently than any other mappings WRT THP. That is > OK, just wanted to throw it out there. I wonder if this might be one of the cases where we don't want to allocate a THP on first access to fill holes we don't know if they are all going to get used. But we might want to let khugepaged place a THP if all PTEs are already populated. Hm. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-20 11:12 ` David Hildenbrand @ 2023-03-20 17:46 ` William Kucharski 2023-03-20 17:52 ` David Hildenbrand 2023-03-20 18:06 ` Mike Kravetz 0 siblings, 2 replies; 23+ messages in thread From: William Kucharski @ 2023-03-20 17:46 UTC (permalink / raw) To: David Hildenbrand; +Cc: Mike Kravetz, Matthew Wilcox, Linux-MM, linux-kernel > On Mar 20, 2023, at 05:12, David Hildenbrand <david@redhat.com> wrote: > > On 17.03.23 19:46, Mike Kravetz wrote: >> On 03/17/23 17:52, Matthew Wilcox wrote: >>> On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: >>>> One of our product teams recently experienced 'memory bloat' in their >>>> environment. The application in this environment is the JVM which >>>> creates hundreds of threads. Threads are ultimately created via >>>> pthread_create which also creates the thread stacks. pthread attributes >>>> are modified so that stacks are 2MB in size. It just so happens that >>>> due to allocation patterns, all their stacks are at 2MB boundaries. The >>>> system has THP always set, so a huge page is allocated at the first >>>> (write) fault when libpthread initializes the stack. >>> >>> Do you happen to have an strace (or similar) so we can understand what >>> the application is doing? >>> >>> My understanding is that for a normal app (like, say, 'cat'), we'll >>> allow up to an 8MB stack, but we only create a VMA that is 4kB in size >>> and set the VM_GROWSDOWN flag on it (to allow it to magically grow). >>> Therefore we won't create a 2MB page because the VMA is too small. >>> >>> It sounds like the pthread library is maybe creating a 2MB stack as >>> a 2MB VMA, and that's why we're seeing this behaviour? >> Yes, pthread stacks create a VMA equal to stack size which is different >> than 'main thread' stack. The 2MB size for pthread stacks created by >> JVM is actually them explicitly requesting the size (8MB default). >> We have a good understanding of what is happening. 
Behavior actually >> changed a bit with glibc versions in OL7 vs OL8. Do note that THP usage >> is somewhat out of the control of an application IF they rely on >> glibc/pthread to allocate stacks. Only way for application to make sure >> pthread stacks do not use THP would be for them to allocate themselves. >> Then, they would need to set up the guard page themselves. They would >> also need to monitor the status of all threads to determine when stacks >> could be deleted. A bunch of extra code that glibc/pthread already does >> for free. >> Oracle glibc team is also involved, and it 'looks' like they may have >> upstream buy in to add a flag to explicitly enable or disable hugepages >> on pthread stacks. >> It seems like concensus from mm community is that we should not >> treat stacks any differently than any other mappings WRT THP. That is >> OK, just wanted to throw it out there. > > I wonder if this might we one of the cases where we don't want to allocate a THP on first access to fill holes we don't know if they are all going to get used. But we might want to let khugepaged place a THP if all PTEs are already populated. Hm. > > -- > Thanks, > > David / dhildenb Unless we do decide to start honoring MAP_STACK, we would be setting an interesting precedent here in that stacks would be the only THP allocation that would be denied a large page until it first proved it was actually going to use all the individual PAGESIZE pages comprising one. Should mapping a text page using a THP be likewise deferred until each PAGESIZE page comprising it had been accessed? 
Given the main questions of: 1) How to know whether it's a stack allocation 2) How to determine whether the app is consciously trying to allocate the stack via a THP or if it just happened to win the address alignment/size lottery 3) Whether to honor the THP allocation in either case It seems taking the khugepaged approach would require Yet Another Flag to provide a way for an application that KNOWS a THP-mapped stack would be useful to get it without having to incorporate a loop to touch a byte in every PAGESIZE page in their allocated aligned stack and hope it gets its upgrade. William Kucharski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-20 17:46 ` William Kucharski @ 2023-03-20 17:52 ` David Hildenbrand 2023-03-20 18:06 ` Mike Kravetz 1 sibling, 0 replies; 23+ messages in thread From: David Hildenbrand @ 2023-03-20 17:52 UTC (permalink / raw) To: William Kucharski; +Cc: Mike Kravetz, Matthew Wilcox, Linux-MM, linux-kernel On 20.03.23 18:46, William Kucharski wrote: > > >> On Mar 20, 2023, at 05:12, David Hildenbrand <david@redhat.com> wrote: >> >> On 17.03.23 19:46, Mike Kravetz wrote: >>> On 03/17/23 17:52, Matthew Wilcox wrote: >>>> On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: >>>>> One of our product teams recently experienced 'memory bloat' in their >>>>> environment. The application in this environment is the JVM which >>>>> creates hundreds of threads. Threads are ultimately created via >>>>> pthread_create which also creates the thread stacks. pthread attributes >>>>> are modified so that stacks are 2MB in size. It just so happens that >>>>> due to allocation patterns, all their stacks are at 2MB boundaries. The >>>>> system has THP always set, so a huge page is allocated at the first >>>>> (write) fault when libpthread initializes the stack. >>>> >>>> Do you happen to have an strace (or similar) so we can understand what >>>> the application is doing? >>>> >>>> My understanding is that for a normal app (like, say, 'cat'), we'll >>>> allow up to an 8MB stack, but we only create a VMA that is 4kB in size >>>> and set the VM_GROWSDOWN flag on it (to allow it to magically grow). >>>> Therefore we won't create a 2MB page because the VMA is too small. >>>> >>>> It sounds like the pthread library is maybe creating a 2MB stack as >>>> a 2MB VMA, and that's why we're seeing this behaviour? >>> Yes, pthread stacks create a VMA equal to stack size which is different >>> than 'main thread' stack. The 2MB size for pthread stacks created by >>> JVM is actually them explicitly requesting the size (8MB default). 
>>> We have a good understanding of what is happening. Behavior actually >>> changed a bit with glibc versions in OL7 vs OL8. Do note that THP usage >>> is somewhat out of the control of an application IF they rely on >>> glibc/pthread to allocate stacks. Only way for application to make sure >>> pthread stacks do not use THP would be for them to allocate themselves. >>> Then, they would need to set up the guard page themselves. They would >>> also need to monitor the status of all threads to determine when stacks >>> could be deleted. A bunch of extra code that glibc/pthread already does >>> for free. >>> Oracle glibc team is also involved, and it 'looks' like they may have >>> upstream buy in to add a flag to explicitly enable or disable hugepages >>> on pthread stacks. >>> It seems like concensus from mm community is that we should not >>> treat stacks any differently than any other mappings WRT THP. That is >>> OK, just wanted to throw it out there. >> >> I wonder if this might we one of the cases where we don't want to allocate a THP on first access to fill holes we don't know if they are all going to get used. But we might want to let khugepaged place a THP if all PTEs are already populated. Hm. >> >> -- >> Thanks, >> >> David / dhildenb > > Unless we do decide to start honoring MAP_STACK, we would be setting an interesting precedent here in that stacks would be the only THP allocation that would be denied a large page until it first proved it was actually going to use all the individual PAGESIZE pages comprising one. Should mapping a text page using a THP be likewise deferred until each PAGESIZE page comprising it had been accessed? IMHO, it's a bit different, because text pages are not anon pages. I suspect is_stack_mapping() -> VM_STACK -> VM_GROWSUP/VM_GROWSDOWN is not always reliable? -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: THP backed thread stacks 2023-03-20 17:46 ` William Kucharski 2023-03-20 17:52 ` David Hildenbrand @ 2023-03-20 18:06 ` Mike Kravetz 1 sibling, 0 replies; 23+ messages in thread From: Mike Kravetz @ 2023-03-20 18:06 UTC (permalink / raw) To: William Kucharski Cc: David Hildenbrand, Matthew Wilcox, Linux-MM, linux-kernel On 03/20/23 10:46, William Kucharski wrote: > > > > On Mar 20, 2023, at 05:12, David Hildenbrand <david@redhat.com> wrote: > > > > On 17.03.23 19:46, Mike Kravetz wrote: > >> On 03/17/23 17:52, Matthew Wilcox wrote: > >>> On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote: > >>>> One of our product teams recently experienced 'memory bloat' in their > >>>> environment. The application in this environment is the JVM which > >>>> creates hundreds of threads. Threads are ultimately created via > >>>> pthread_create which also creates the thread stacks. pthread attributes > >>>> are modified so that stacks are 2MB in size. It just so happens that > >>>> due to allocation patterns, all their stacks are at 2MB boundaries. The > >>>> system has THP always set, so a huge page is allocated at the first > >>>> (write) fault when libpthread initializes the stack. > >>> > >>> Do you happen to have an strace (or similar) so we can understand what > >>> the application is doing? > >>> > >>> My understanding is that for a normal app (like, say, 'cat'), we'll > >>> allow up to an 8MB stack, but we only create a VMA that is 4kB in size > >>> and set the VM_GROWSDOWN flag on it (to allow it to magically grow). > >>> Therefore we won't create a 2MB page because the VMA is too small. > >>> > >>> It sounds like the pthread library is maybe creating a 2MB stack as > >>> a 2MB VMA, and that's why we're seeing this behaviour? > >> Yes, pthread stacks create a VMA equal to stack size which is different > >> than 'main thread' stack. The 2MB size for pthread stacks created by > >> JVM is actually them explicitly requesting the size (8MB default). 
> >> We have a good understanding of what is happening. Behavior actually > >> changed a bit with glibc versions in OL7 vs OL8. Do note that THP usage > >> is somewhat out of the control of an application IF they rely on > >> glibc/pthread to allocate stacks. Only way for application to make sure > >> pthread stacks do not use THP would be for them to allocate themselves. > >> Then, they would need to set up the guard page themselves. They would > >> also need to monitor the status of all threads to determine when stacks > >> could be deleted. A bunch of extra code that glibc/pthread already does > >> for free. > >> Oracle glibc team is also involved, and it 'looks' like they may have > >> upstream buy-in to add a flag to explicitly enable or disable hugepages > >> on pthread stacks. > >> It seems like consensus from mm community is that we should not > >> treat stacks any differently than any other mappings WRT THP. That is > >> OK, just wanted to throw it out there. > > > > I wonder if this might be one of the cases where we don't want to allocate a THP on first access to fill holes we don't know if they are all going to get used. But we might want to let khugepaged place a THP if all PTEs are already populated. Hm. > > > > -- > > Thanks, > > > > David / dhildenb > > Unless we do decide to start honoring MAP_STACK, we would be setting an interesting precedent here in that stacks would be the only THP allocation that would be denied a large page until it first proved it was actually going to use all the individual PAGESIZE pages comprising one. Should mapping a text page using a THP be likewise deferred until each PAGESIZE page comprising it had been accessed? 
> > Given the main questions of: > > 1) How to know whether it's a stack allocation > > 2) How to determine whether the app is consciously trying to allocate the stack via a THP or if it just happened to win the address alignment/size lottery > > 3) Whether to honor the THP allocation in either case > > It seems taking the khugepaged approach would require Yet Another Flag to provide a way for an application that KNOWS a THP-mapped stack would be useful to get it without having to incorporate a loop to touch a byte in every PAGESIZE page in their allocated aligned stack and hope it gets its upgrade. > Just another 2 cents thrown into the pool. We currently treat the 'main thread' stack differently than pthread stacks. My understanding of main stack handling matches that described by Matthew above. So, IIUC the only way the 'main stack' could possibly use THP pages is if base pages were collapsed via khugepaged. -- Mike Kravetz ^ permalink raw reply [flat|nested] 23+ messages in thread
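[Editorial note] For anyone reproducing the "memory bloat" Mike describes, the effect is easy to observe from inside the process: /proc/[pid]/smaps carries a per-VMA "AnonHugePages:" field (see proc(5)), so summing it shows how much anonymous memory, thread stacks included, THP is actually backing. A rough sketch:

```c
#include <stdio.h>

/* Sum the per-VMA "AnonHugePages:" fields in /proc/self/smaps.
 * Returns the total in kB, or -1 if smaps cannot be opened.  A process
 * whose 2MB-aligned pthread stacks were faulted in under THP "always"
 * will typically show at least 2048 kB per such stack. */
long anon_huge_kb(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    char line[256];
    long total = 0, kb;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            total += kb;
    fclose(f);
    return total;
}
```

Comparing this total before and after spawning threads (or after switching THP to madvise) makes the stack contribution visible directly, rather than inferring it from RSS in ps/top.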
* RE: THP backed thread stacks 2023-03-17 17:52 ` Matthew Wilcox 2023-03-17 18:46 ` Mike Kravetz @ 2023-03-18 12:58 ` David Laight 1 sibling, 0 replies; 23+ messages in thread From: David Laight @ 2023-03-18 12:58 UTC (permalink / raw) To: 'Matthew Wilcox', Mike Kravetz; +Cc: linux-mm, linux-kernel From: Matthew Wilcox > Sent: 17 March 2023 17:53 ... > My understanding is that for a normal app (like, say, 'cat'), we'll > allow up to an 8MB stack, but we only create a VMA that is 4kB in size > and set the VM_GROWSDOWN flag on it (to allow it to magically grow). > Therefore we won't create a 2MB page because the VMA is too small. Is there any way that glibc (or anything else) could request that for a thread stack? It would make the process 'memory size' reported by ps/top much more meaningful for programs with threads. I've noticed some (what should be) small programs having a size (rss?) of 277m. I'm sure a lot of it is thread stack. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 23+ messages in thread
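[Editorial note] What David Laight is asking for would roughly correspond to glibc mapping a small initial VMA with MAP_GROWSDOWN and letting the kernel grow it on fault, the way the main-thread stack behaves. This is only a sketch of the mapping call, not something glibc does today: among other reasons, such a mapping has no hard per-thread size bound, and the kernel's stack guard gap differs from glibc's explicit PROT_NONE guard page, which is part of why pthread stacks are sized up front.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map a single page the kernel may grow downward on fault, mimicking
 * the main-thread stack.  Returns the page's address or NULL. */
void *small_growable_stack(void)
{
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                   -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Because such a VMA starts at 4kB, the huge-page fault path never sees a PMD-sized VMA, matching Matthew's description of why 'cat' does not get a 2MB stack page.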
end of thread, other threads:[~2023-03-20 18:07 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-06 23:57 THP backed thread stacks Mike Kravetz 2023-03-07 0:15 ` Peter Xu 2023-03-07 0:40 ` Mike Kravetz 2023-03-08 19:02 ` Mike Kravetz 2023-03-09 22:38 ` Zach O'Keefe 2023-03-09 23:33 ` Mike Kravetz 2023-03-10 0:05 ` Zach O'Keefe 2023-03-10 1:40 ` William Kucharski 2023-03-10 11:25 ` David Hildenbrand 2023-03-11 12:24 ` William Kucharski 2023-03-12 0:55 ` Hillf Danton 2023-03-12 4:39 ` William Kucharski 2023-03-10 22:02 ` Yang Shi 2023-03-07 10:10 ` David Hildenbrand 2023-03-07 19:02 ` Mike Kravetz 2023-03-07 13:36 ` Mike Rapoport 2023-03-17 17:52 ` Matthew Wilcox 2023-03-17 18:46 ` Mike Kravetz 2023-03-20 11:12 ` David Hildenbrand 2023-03-20 17:46 ` William Kucharski 2023-03-20 17:52 ` David Hildenbrand 2023-03-20 18:06 ` Mike Kravetz 2023-03-18 12:58 ` David Laight