* Re: [PATCH v2] mm: Reduce memory bloat with THP [not found] <1516318444-30868-1-git-send-email-nitingupta910@gmail.com> @ 2018-01-19 12:49 ` Michal Hocko 2018-01-19 20:59 ` Nitin Gupta 0 siblings, 1 reply; 12+ messages in thread From: Michal Hocko @ 2018-01-19 12:49 UTC (permalink / raw) To: Nitin Gupta Cc: steven.sistare, Nitin Gupta, Andrew Morton, Ingo Molnar, Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander (Sasha Levin), Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On Thu 18-01-18 15:33:16, Nitin Gupta wrote: > From: Nitin Gupta <nitin.m.gupta@oracle.com> > > Currently, if the THP enabled policy is "always", or the mode > is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage > is allocated on a page fault if the pud or pmd is empty. This > yields the best VA translation performance, but increases memory > consumption if some small page ranges within the huge page are > never accessed. Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always users. > An alternate behavior for such page faults is to install a > hugepage only when a region is actually found to be (almost) > fully mapped and active. This is a compromise between > translation performance and memory consumption. Currently there > is no way for an application to choose this compromise for the > page fault conditions above. Is that really true? We have /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none This is not reflected during the PF of course but you can control the behavior there as well. Either by the global setting or a per proces prctl. > With this change, whenever an application issues MADV_DONTNEED on a > memory region, the region is marked as "space-efficient". 
For such > regions, a hugepage is not immediately allocated on first write. Kirill didn't like it in the previous version and I do not like this either. You are adding a very subtle side effect which might be completely unexpected. Consider a userspace memory allocator which uses MADV_DONTNEED to free up unused memory. Now you have basically put it out of THP usage. If the memory is used really scarce then we have MADV_NOHUGEPAGE. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 12+ messages in thread
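The allocator pattern Michal refers to can be sketched roughly as follows. This is a minimal illustration, not code from the patch: arena_release() and arena_no_hugepages() are hypothetical helper names, and only the madvise() flags themselves come from the discussion. On Linux, a private anonymous range reads back as zero-filled pages after MADV_DONTNEED, which is why allocators use it to return memory while keeping the mapping.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical allocator helper: give the pages backing
 * [addr, addr + len) back to the kernel but keep the mapping.
 * addr must be page-aligned. Returns 0 on success, -1 on error.
 */
int arena_release(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Hypothetical helper: opt a range out of THP altogether.
 * Fails (returns -1) if the kernel or libc has no THP support.
 */
int arena_no_hugepages(void *addr, size_t len)
{
#ifdef MADV_NOHUGEPAGE
    return madvise(addr, len, MADV_NOHUGEPAGE);
#else
    (void)addr;
    (void)len;
    return -1;
#endif
}
```

With the proposed patch, the first helper would also have marked the range "space-efficient" as a side effect, which is the behaviour being objected to; the second helper is the explicit opt-out that already exists.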
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-19 12:49 ` [PATCH v2] mm: Reduce memory bloat with THP Michal Hocko @ 2018-01-19 20:59 ` Nitin Gupta 2018-01-25 0:47 ` Zi Yan 2018-01-25 9:58 ` Michal Hocko 0 siblings, 2 replies; 12+ messages in thread From: Nitin Gupta @ 2018-01-19 20:59 UTC (permalink / raw) To: Michal Hocko, Nitin Gupta Cc: steven.sistare, Andrew Morton, Ingo Molnar, Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander (Sasha Levin), Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On 1/19/18 4:49 AM, Michal Hocko wrote: > On Thu 18-01-18 15:33:16, Nitin Gupta wrote: >> From: Nitin Gupta <nitin.m.gupta@oracle.com> >> >> Currently, if the THP enabled policy is "always", or the mode >> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage >> is allocated on a page fault if the pud or pmd is empty. This >> yields the best VA translation performance, but increases memory >> consumption if some small page ranges within the huge page are >> never accessed. > > Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always > users. > Yes, allocating hugepage on first touch is the current behavior for above two cases. However, I see issues with this current behavior. Firstly, THP=always mode is often too aggressive/wasteful to be useful for any realistic workloads. For THP=madvise, users may want to back active parts of memory region with hugepages while avoiding aggressive hugepage allocation on first touch. Or, they may really want the current behavior. With this patch, users would have the option to pick what behavior they want by passing hints to the kernel in the form of MADV_HUGEPAGE and MADV_DONTNEED madvise calls. 
>> An alternate behavior for such page faults is to install a >> hugepage only when a region is actually found to be (almost) >> fully mapped and active. This is a compromise between >> translation performance and memory consumption. Currently there >> is no way for an application to choose this compromise for the >> page fault conditions above. > > Is that really true? We have /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none > This is not reflected during the PF of course but you can control the > behavior there as well. Either by the global setting or a per proces > prctl. > I think this part of patch description needs some rewording. This patch is to change *only* the page fault behavior. Once pages are installed, khugepaged does its job as usual, using max_ptes_none and other config values. I'm not trying to change any khugepaged behavior here. >> With this change, whenever an application issues MADV_DONTNEED on a >> memory region, the region is marked as "space-efficient". For such >> regions, a hugepage is not immediately allocated on first write. > > Kirill didn't like it in the previous version and I do not like this > either. You are adding a very subtle side effect which might completely > unexpected. Consider userspace memory allocator which uses MADV_DONTNEED > to free up unused memory. Now you have put it out of THP usage > basically. > Userpsace may want a region to be considered by khugepaged while opting out of hugepage allocation on first touch. Asking userspace memory allocators to have to track and reclaim unused parts of a THP allocated hugepage does not seems right, as the kernel can use simple userspace hints to avoid allocating extra memory in the first place. I agree that this patch is adding a subtle side-effect which may take some applications by surprise. 
However, I often see the opposite too: for many workloads, disabling THP is the first piece of advice, because this aggressive allocation of hugepages on first touch is unexpected and too wasteful. For example:

1) Disabling THP for TokuDB (storage engine for MySQL, MariaDB)
   http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/
2) Disable THP on MongoDB
   https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
3) Disable THP for Couchbase Server
   https://blog.couchbase.com/often-overlooked-linux-os-tweaks/
4) Redis
   http://antirez.com/news/84

> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>

It's not really about memory scarcity but about a more efficient use of it. Applications may want hugepage benefits without requiring any changes to app code, which is what THP is supposed to provide, while still avoiding memory bloat. -Nitin ^ permalink raw reply [flat|nested] 12+ messages in thread
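For reference, the khugepaged knob mentioned above can be read from userspace with a few lines of C. This is only a sketch: read_knob() and min_present_ptes() are hypothetical helper names, and the arithmetic assumes 512 base pages per PMD-sized huge page (4 KiB pages, 2 MiB huge pages, as on x86-64). max_ptes_none is the number of not-present PTEs khugepaged tolerates when collapsing a range, so the number of pages that must already be mapped is the difference.

```c
#include <stdio.h>

/* Path of the knob named in the thread. */
#define MAX_PTES_NONE_PATH \
    "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

/* Read a single integer from a sysfs-style file; -1 on any failure. */
long read_knob(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/*
 * Minimum number of mapped base pages khugepaged needs in a range
 * before it may collapse it, for a given max_ptes_none setting.
 */
long min_present_ptes(long ptes_per_huge_page, long max_ptes_none)
{
    long need = ptes_per_huge_page - max_ptes_none;

    return need < 1 ? 1 : need;
}
```

For instance, min_present_ptes(512, read_knob(MAX_PTES_NONE_PATH)) gives 1 when the knob is 511, and 512 when it is 0. None of this affects the page fault path, which is the part the patch tries to change.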
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-19 20:59 ` Nitin Gupta @ 2018-01-25 0:47 ` Zi Yan 2018-01-25 19:41 ` Nitin Gupta 2018-01-25 9:58 ` Michal Hocko 1 sibling, 1 reply; 12+ messages in thread From: Zi Yan @ 2018-01-25 0:47 UTC (permalink / raw) To: Nitin Gupta Cc: Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 2469 bytes --] > >>> With this change, whenever an application issues MADV_DONTNEED on a >>> memory region, the region is marked as "space-efficient". For such >>> regions, a hugepage is not immediately allocated on first write. >> >> Kirill didn't like it in the previous version and I do not like this >> either. You are adding a very subtle side effect which might completely >> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED >> to free up unused memory. Now you have put it out of THP usage >> basically. >> > > Userpsace may want a region to be considered by khugepaged while opting > out of hugepage allocation on first touch. Asking userspace memory > allocators to have to track and reclaim unused parts of a THP allocated > hugepage does not seems right, as the kernel can use simple userspace > hints to avoid allocating extra memory in the first place. > > I agree that this patch is adding a subtle side-effect which may take > some applications by surprise. However, I often see the opposite too: > for many workloads, disabling THP is the first advise as this aggressive > allocation of hugepages on first touch is unexpected and is too > wasteful. 
For e.g.: > > 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB) > http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/ > > 2) Disable THP on MongoDB > https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/ > > 3) Disable THP for Couchbase Server > https://blog.couchbase.com/often-overlooked-linux-os-tweaks/ > > 4) Redis > http://antirez.com/news/84 > > >> If the memory is used really scarce then we have MADV_NOHUGEPAGE. >> > > It's not really about memory scarcity but a more efficient use of it. > Applications may want hugepage benefits without requiring any changes to > app code which is what THP is supposed to provide, while still avoiding > memory bloat. >

I read these links and found that there are mainly two complaints:
1. THP causes latency spikes, because direct compaction slows down THP allocation.
2. THP bloats the memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than THP size and fails because of THP.

The first complaint is not related to this patch. For the second one, at least with recent kernels, MADV_DONTNEED splits THPs and returns the memory range you specified in madvise(). Am I missing anything?

-- Best Regards, Yan Zi ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-25 0:47 ` Zi Yan @ 2018-01-25 19:41 ` Nitin Gupta 2018-01-25 21:13 ` Mel Gorman 2018-01-25 22:29 ` Andrea Arcangeli 0 siblings, 2 replies; 12+ messages in thread From: Nitin Gupta @ 2018-01-25 19:41 UTC (permalink / raw) To: Zi Yan Cc: Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On 01/24/2018 04:47 PM, Zi Yan wrote: >>>> With this change, whenever an application issues MADV_DONTNEED on a >>>> memory region, the region is marked as "space-efficient". For such >>>> regions, a hugepage is not immediately allocated on first write. >>> Kirill didn't like it in the previous version and I do not like this >>> either. You are adding a very subtle side effect which might completely >>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED >>> to free up unused memory. Now you have put it out of THP usage >>> basically. >>> >> Userpsace may want a region to be considered by khugepaged while opting >> out of hugepage allocation on first touch. Asking userspace memory >> allocators to have to track and reclaim unused parts of a THP allocated >> hugepage does not seems right, as the kernel can use simple userspace >> hints to avoid allocating extra memory in the first place. >> >> I agree that this patch is adding a subtle side-effect which may take >> some applications by surprise. However, I often see the opposite too: >> for many workloads, disabling THP is the first advise as this aggressive >> allocation of hugepages on first touch is unexpected and is too >> wasteful. 
For e.g.: >> >> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB) >> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/ >> >> 2) Disable THP on MongoDB >> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/ >> >> 3) Disable THP for Couchbase Server >> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/ >> >> 4) Redis >> http://antirez.com/news/84 >> >> >>> If the memory is used really scarce then we have MADV_NOHUGEPAGE. >>> >> It's not really about memory scarcity but a more efficient use of it. >> Applications may want hugepage benefits without requiring any changes to >> app code which is what THP is supposed to provide, while still avoiding >> memory bloat. >> > I read these links and find that there are mainly two complains: > 1. THP causes latency spikes, because direction compaction slows down THP allocation, > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than > THP size and fails because of THP. > > The first complain is not related to this patch. I'm trying to address many different THP issues and memory bloat is first among them. > For second one, at least with recent kernels, MADV_DONTNEED splits THPs and returns the memory range you > specified in madvise(). Am I missing anything? > Yes, MADV_DONTNEED splits THPs and releases the requested range but this is not solving the issue of aggressive alloc-hugepage-on-first-touch policy of THP=madvise on MADV_HUGEPAGE regions. Sure, some workloads may prefer that policy but for application that don't, this patch give them an option to give hints to the kernel to go for gradual hugepage promotion via khugepaged only (and not on first touch). It's not good if an application has to track which parts of their (implicitly allocated) hugepage are in use and which sub-parts are free so they can issue MADV_DONTNEED calls on them. 
This approach really does not make THP "transparent" and requires a lot of mm tracking code in userspace. Nitin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-25 19:41 ` Nitin Gupta @ 2018-01-25 21:13 ` Mel Gorman 2018-02-01 1:09 ` Nitin Gupta 2018-02-01 10:27 ` Kirill A. Shutemov 2018-01-25 22:29 ` Andrea Arcangeli 1 sibling, 2 replies; 12+ messages in thread From: Mel Gorman @ 2018-01-25 21:13 UTC (permalink / raw) To: Nitin Gupta Cc: Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote: > >> It's not really about memory scarcity but a more efficient use of it. > >> Applications may want hugepage benefits without requiring any changes to > >> app code which is what THP is supposed to provide, while still avoiding > >> memory bloat. > >> > > I read these links and find that there are mainly two complains: > > 1. THP causes latency spikes, because direction compaction slows down THP allocation, > > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than > > THP size and fails because of THP. > > > > The first complain is not related to this patch. > > I'm trying to address many different THP issues and memory bloat is > first among them. Expecting userspace to get this right is probably going to go sideways. It'll be screwed up and be sub-optimal or have odd semantics for existing madvise flags. The fact is that an application may not even know if it's going to be sparsely using memory in advance if it's a computation load modelling from unknown input data. I suggest you read the old Talluri paper "Surpassing the TLB Performance of Superpages with Less Operating System Support" and pay attention to Section 4.
There it discusses a page reservation scheme whereby on fault a naturally aligned set of base pages is reserved and only one correctly placed base page is inserted at the faulting address. It was tied into a hypothetical piece of hardware that doesn't exist to give best-effort support for superpages, so it does not directly help you, but the initial idea is sound. There are holes in the paper from today's perspective but it was written in the 90's.

From there, read "Transparent operating system support for superpages" by Navarro, particularly chapter 4, paying attention to the parts where it talks about opportunism and promotion threshold.

Superficially, it goes like this:

1. On fault, reserve a THP in the allocator and use one base page that is correctly-aligned for the faulting address. By correctly-aligned, I mean that you use a base page whose offset would be naturally contiguous if it ever was part of a huge page.
2. On subsequent faults, attempt to use a base page that is naturally aligned to be a THP.
3. When a "threshold" of base pages are inserted, allocate the remaining pages and promote it to a THP.
4. If there is memory pressure, spill "reserved" pages into the main allocation pool and lose the opportunity to promote (which will need khugepaged to recover).

By definition, a promotion threshold of 1 would be the existing scheme of allocating a THP on the first fault and some users will want that. It also should be the default to avoid unexpected overhead. For workloads where memory is being sparsely addressed and the increased overhead of THP is unwelcome, the threshold should be tuned higher, with a maximum possible value of HPAGE_PMD_NR.

It's non-trivial to do this because at minimum a page fault has to check if there is a potential promotion candidate by checking the PTEs around the faulting address searching for a correctly-aligned base page that is already inserted.
If there is, then check if the correctly aligned base page for the current faulting address is free and if so use it. It'll also then need to check the remaining PTEs to see whether the promotion threshold has been reached and, if so, promote it to a THP (or else teach khugepaged to do an in-place promotion if possible). In other words, implementing the promotion threshold is both hard and not free. However, if it did exist then the only tunable would be the "promotion threshold" and applications would not need any special awareness of their address space. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 12+ messages in thread
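The reservation and promotion-threshold scheme outlined above can be modelled in userspace to make the bookkeeping concrete. The following is a toy simulation only, not kernel code: struct reservation and the reservation_*() functions are hypothetical names, BASE_PAGES_PER_HUGE assumes 512 base pages per huge page (the usual HPAGE_PMD_NR on x86-64 with 4 KiB pages), and it ignores locking, PTE scanning and allocation failure entirely.

```c
#include <stdbool.h>
#include <string.h>

#define BASE_PAGES_PER_HUGE 512 /* assumption: 2 MiB / 4 KiB */

/* One naturally aligned huge-page-sized region of address space. */
struct reservation {
    bool mapped[BASE_PAGES_PER_HUGE]; /* which base pages are inserted */
    unsigned int nr_mapped;
    unsigned int threshold;           /* promote at this many pages */
    bool reserved;                    /* backing huge page still held */
    bool promoted;
};

void reservation_init(struct reservation *r, unsigned int threshold)
{
    memset(r, 0, sizeof(*r));
    if (threshold < 1)
        threshold = 1;
    if (threshold > BASE_PAGES_PER_HUGE)
        threshold = BASE_PAGES_PER_HUGE;
    r->threshold = threshold;
    r->reserved = true; /* step 1: reserve on first use */
}

/* Steps 2 and 3: fault on base page idx; true once the region is huge. */
bool reservation_fault(struct reservation *r, unsigned int idx)
{
    unsigned int i;

    if (r->promoted || idx >= BASE_PAGES_PER_HUGE)
        return r->promoted;
    if (!r->mapped[idx]) {
        r->mapped[idx] = true;
        r->nr_mapped++;
    }
    /* Promotion is only possible while the reservation is intact. */
    if (r->reserved && r->nr_mapped >= r->threshold) {
        for (i = 0; i < BASE_PAGES_PER_HUGE; i++)
            r->mapped[i] = true;
        r->nr_mapped = BASE_PAGES_PER_HUGE;
        r->promoted = true;
    }
    return r->promoted;
}

/* Step 4: memory pressure; returns the number of pages given back. */
unsigned int reservation_spill(struct reservation *r)
{
    if (r->promoted || !r->reserved)
        return 0;
    r->reserved = false;
    return BASE_PAGES_PER_HUGE - r->nr_mapped;
}
```

With a threshold of 1 the first fault promotes immediately, which matches the existing allocate-on-first-touch behaviour; with a threshold of HPAGE_PMD_NR a region is promoted only once every base page has been touched. After a spill the model never promotes, standing in for the case where khugepaged would have to recover the opportunity later.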
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-25 21:13 ` Mel Gorman @ 2018-02-01 1:09 ` Nitin Gupta 2018-02-01 10:09 ` Mel Gorman 2018-02-01 10:27 ` Kirill A. Shutemov 1 sibling, 1 reply; 12+ messages in thread From: Nitin Gupta @ 2018-02-01 1:09 UTC (permalink / raw) To: Mel Gorman Cc: Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On 01/25/2018 01:13 PM, Mel Gorman wrote: > On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote: >>>> It's not really about memory scarcity but a more efficient use of it. >>>> Applications may want hugepage benefits without requiring any changes to >>>> app code which is what THP is supposed to provide, while still avoiding >>>> memory bloat. >>>> >>> I read these links and find that there are mainly two complains: >>> 1. THP causes latency spikes, because direction compaction slows down THP allocation, >>> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than >>> THP size and fails because of THP. >>> >>> The first complain is not related to this patch. >> >> I'm trying to address many different THP issues and memory bloat is >> first among them. > > Expecting userspace to get this right is probably going to go sideways. > It'll be screwed up and be sub-optimal or have odd semantics for existing > madvise flags. The fact is that an application may not even know if it's > going to be sparsely using memory in advance if it's a computation load > modelling from unknown input data. 
> > I suggest you read the old Talluri paper "Superpassing the TLB Performance > of Superpages with Less Operating System Support" and pay attention to > Section 4. There it discusses a page reservation scheme whereby on fault > a naturally aligned set of base pages are reserved and only one correctly > placed base page is inserted into the faulting address. It was tied into > a hypothetical piece of hardware that doesn't exist to give best-effort > support for superpages so it does not directly help you but the initial > idea is sound. There are holes in the paper from todays perspective but > it was written in the 90's. > > From there, read "Transparent operating system support for superpages" > by Navarro, particularly chapter 4 paying attention to the parts where > it talks about opportunism and promotion threshold. > > Superficially, it goes like this > > 1. On fault, reserve a THP in the allocator and use one base page that > is correctly-aligned for the faulting addresses. By correctly-aligned, > I mean that you use base page whose offset would be naturally contiguous > if it ever was part of a huge page. > 2. On subsequent faults, attempt to use a base page that is naturally > aligned to be a THP > 3. When a "threshold" of base pages are inserted, allocate the remaining > pages and promote it to a THP > 4. If there is memory pressure, spill "reserved" pages into the main > allocation pool and lose the opportunity to promote (which will need > khugepaged to recover) > > By definition, a promotion threshold of 1 would be the existing scheme > of allocation a THP on the first fault and some users will want that. It > also should be the default to avoid unexpected overhead. For workloads > where memory is being sparsely addressed and the increased overhead of > THP is unwelcome then the threshold should be tuned higher with a maximum > possible value of HPAGE_PMD_NR. 
> > It's non-trivial to do this because at minimum a page fault has to check > if there is a potential promotion candidate by checking the PTEs around > the faulting address searching for a correctly-aligned base page that is > already inserted. If there is, then check if the correctly aligned base > page for the current faulting address is free and if so use it. It'll > also then need to check the remaining PTEs to see if both the promotion > threshold has been reached and if so, promote it to a THP (or else teach > khugepaged to do an in-place promotion if possible). In other words, > implementing the promotion threshold is both hard and it's not free. > > However, if it did exist then the only tunable would be the "promotion > threshold" and applications would not need any special awareness of their > address space. > I went through both references you mentioned and I really like the idea of reservation-based hugepage allocation. Navarro also extends the idea to allow multiple hugepage sizes to be used (as support by underlying hardware) which was next in order of what I wanted to do in THP. So, please ignore this patch and I would work towards implementing ideas in these papers. Thanks for the feedback. Nitin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-02-01 1:09 ` Nitin Gupta @ 2018-02-01 10:09 ` Mel Gorman 0 siblings, 0 replies; 12+ messages in thread From: Mel Gorman @ 2018-02-01 10:09 UTC (permalink / raw) To: Nitin Gupta Cc: Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On Wed, Jan 31, 2018 at 05:09:48PM -0800, Nitin Gupta wrote: > > > > It's non-trivial to do this because at minimum a page fault has to check > > if there is a potential promotion candidate by checking the PTEs around > > the faulting address searching for a correctly-aligned base page that is > > already inserted. If there is, then check if the correctly aligned base > > page for the current faulting address is free and if so use it. It'll > > also then need to check the remaining PTEs to see if both the promotion > > threshold has been reached and if so, promote it to a THP (or else teach > > khugepaged to do an in-place promotion if possible). In other words, > > implementing the promotion threshold is both hard and it's not free. > > > > However, if it did exist then the only tunable would be the "promotion > > threshold" and applications would not need any special awareness of their > > address space. > > > > I went through both references you mentioned and I really like the > idea of reservation-based hugepage allocation. Navarro also extends > the idea to allow multiple hugepage sizes to be used (as support by > underlying hardware) which was next in order of what I wanted to do in > THP. > Don't sweat too much about the multiple page size part. 
At the time Navarro was writing, it was expected that hardware would support multiple page sizes with fine granularity (e.g. what Itanium did). Just covering the PMD huge page size would go a long way towards balancing memory consumption and huge page usage. > So, please ignore this patch and I would work towards implementing > ideas in these papers. > > Thanks for the feedback. > My pleasure. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-01-25 21:13 ` Mel Gorman 2018-02-01 1:09 ` Nitin Gupta @ 2018-02-01 10:27 ` Kirill A. Shutemov 2018-02-01 10:46 ` Mel Gorman 1 sibling, 1 reply; 12+ messages in thread From: Kirill A. Shutemov @ 2018-02-01 10:27 UTC (permalink / raw) To: Mel Gorman Cc: Nitin Gupta, Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On Thu, Jan 25, 2018 at 09:13:03PM +0000, Mel Gorman wrote: > On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote: > > >> It's not really about memory scarcity but a more efficient use of it. > > >> Applications may want hugepage benefits without requiring any changes to > > >> app code which is what THP is supposed to provide, while still avoiding > > >> memory bloat. > > >> > > > I read these links and find that there are mainly two complains: > > > 1. THP causes latency spikes, because direction compaction slows down THP allocation, > > > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than > > > THP size and fails because of THP. > > > > > > The first complain is not related to this patch. > > > > I'm trying to address many different THP issues and memory bloat is > > first among them. > > Expecting userspace to get this right is probably going to go sideways. > It'll be screwed up and be sub-optimal or have odd semantics for existing > madvise flags. The fact is that an application may not even know if it's > going to be sparsely using memory in advance if it's a computation load > modelling from unknown input data. 
> > I suggest you read the old Talluri paper "Superpassing the TLB Performance > of Superpages with Less Operating System Support" and pay attention to > Section 4. There it discusses a page reservation scheme whereby on fault > a naturally aligned set of base pages are reserved and only one correctly > placed base page is inserted into the faulting address. It was tied into > a hypothetical piece of hardware that doesn't exist to give best-effort > support for superpages so it does not directly help you but the initial > idea is sound. There are holes in the paper from todays perspective but > it was written in the 90's. > > From there, read "Transparent operating system support for superpages" > by Navarro, particularly chapter 4 paying attention to the parts where > it talks about opportunism and promotion threshold. > > Superficially, it goes like this > > 1. On fault, reserve a THP in the allocator and use one base page that > is correctly-aligned for the faulting addresses. By correctly-aligned, > I mean that you use base page whose offset would be naturally contiguous > if it ever was part of a huge page. > 2. On subsequent faults, attempt to use a base page that is naturally > aligned to be a THP > 3. When a "threshold" of base pages are inserted, allocate the remaining > pages and promote it to a THP > 4. If there is memory pressure, spill "reserved" pages into the main > allocation pool and lose the opportunity to promote (which will need > khugepaged to recover) > > By definition, a promotion threshold of 1 would be the existing scheme > of allocation a THP on the first fault and some users will want that. It > also should be the default to avoid unexpected overhead. For workloads > where memory is being sparsely addressed and the increased overhead of > THP is unwelcome then the threshold should be tuned higher with a maximum > possible value of HPAGE_PMD_NR. 
> > It's non-trivial to do this because at minimum a page fault has to check > if there is a potential promotion candidate by checking the PTEs around > the faulting address searching for a correctly-aligned base page that is > already inserted. If there is, then check if the correctly aligned base > page for the current faulting address is free and if so use it. It'll > also then need to check the remaining PTEs to see if both the promotion > threshold has been reached and if so, promote it to a THP (or else teach > khugepaged to do an in-place promotion if possible). In other words, > implementing the promotion threshold is both hard and it's not free. "not free" is an understatement. Converting a PTE page table to a PMD would require down_write(mmap_sem). Doing it from within the page fault path would also mean that we need to drop the down_read(mmap_sem) we hold, re-acquire it with down_write(), find the vma again and re-validate that nothing changed in the meantime... That's an interesting exercise, but I'm skeptical it would result in anything practical. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm: Reduce memory bloat with THP 2018-02-01 10:27 ` Kirill A. Shutemov @ 2018-02-01 10:46 ` Mel Gorman 0 siblings, 0 replies; 12+ messages in thread From: Mel Gorman @ 2018-02-01 10:46 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Nitin Gupta, Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm On Thu, Feb 01, 2018 at 01:27:30PM +0300, Kirill A. Shutemov wrote: > > It's non-trivial to do this because at minimum a page fault has to check > > if there is a potential promotion candidate by checking the PTEs around > > the faulting address searching for a correctly-aligned base page that is > > already inserted. If there is, then check if the correctly aligned base > > page for the current faulting address is free and if so use it. It'll > > also then need to check the remaining PTEs to see if both the promotion > > threshold has been reached and if so, promote it to a THP (or else teach > > khugepaged to do an in-place promotion if possible). In other words, > > implementing the promotion threshold is both hard and it's not free. > > "not free" is understatement. > > Converting PTE page table to PMD would require down_write(mmap_sem). > Doing it from within page fault path would also mean that we need to drop > down_read(mmap) we hold, re-aquaire it with down_write(), find the vma again > and re-validate that nothing changed in meanwhile... > > That's an interesting exercise, but I'm skeptical it would result in anything > practical. > The details are painful but we're somewhat caught between a rock and a hard place for workloads that sparsely reference memory and want to avoid excessive memory usage. 
Given that the cost will be high, it may need to dynamically detect what
the promotion threshold is -- default high and reduce it on a per-task
basis if promotions are frequent. Either way, expecting applications to
get it right with hints is the road to hell paved with good intentions.
If they were able to get this right, they would be using
prctl(PR_SET_THP_DISABLE) already.

-- 
Mel Gorman
SUSE Labs
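For reference, the per-process opt-out mentioned above is a single prctl() call. The sketch below drives it from Python through ctypes; the option numbers are the ones defined in <linux/prctl.h>, while the helper names `set_thp_disabled` and `thp_disabled` are made up for this illustration. It assumes a Linux system whose libc exports prctl().

```python
import ctypes
import os

# Option numbers from <linux/prctl.h>.
PR_SET_THP_DISABLE = 41
PR_GET_THP_DISABLE = 42


def _prctl(option, arg2=0):
    """Call prctl(2) with the unused arguments zeroed, as the kernel requires."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    res = libc.prctl(
        ctypes.c_int(option),
        ctypes.c_ulong(arg2),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if res < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return res


def set_thp_disabled(disabled):
    """Set or clear the "THP disable" flag for the calling process."""
    _prctl(PR_SET_THP_DISABLE, 1 if disabled else 0)


def thp_disabled():
    """Return True if the "THP disable" flag is set for the calling process."""
    return _prctl(PR_GET_THP_DISABLE) == 1
```

A process would call `set_thp_disabled(True)` early in startup; the flag is inherited across fork() and preserved across execve(), so a small wrapper can also apply it on behalf of an unmodified program.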
* Re: [PATCH v2] mm: Reduce memory bloat with THP
  2018-01-25 19:41 ` Nitin Gupta
  2018-01-25 21:13 ` Mel Gorman
@ 2018-01-25 22:29 ` Andrea Arcangeli
  1 sibling, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2018-01-25 22:29 UTC
To: Nitin Gupta
Cc: Zi Yan, Michal Hocko, Nitin Gupta, steven.sistare, Andrew Morton,
    Ingo Molnar, Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov,
    Peter Zijlstra, Vegard Nossum, Levin, Alexander, Mike Rapoport,
    Hillf Danton, Shaohua Li, Anshuman Khandual, David Rientjes,
    Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox,
    Ross Zwisler, Hugh Dickins, Tobin C Harding, linux-kernel, linux-mm

On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> I'm trying to address many different THP issues and memory bloat is
> first among them.

You quoted redis in an earlier email, but the redis issue has nothing to do
with MADV_DONTNEED.

I can quickly explain the redis issue. Redis uses fork() to create a
readonly copy of the memory to do snapshotting in the child, while the
parent still writes to the memory. THP CoWs in the parent have higher
latency than 4k CoWs. They also take more memory, but that's secondary: in
fact the maximum waste of memory in this model will reach the same worst
case (x2) with 4k CoWs too, no difference. The problem is the copy-user
there: it adds latency and wastes CPU.

Redis can simply use userfaultfd WP mode once it is upstream, and then it
will use 4k granularity, as the granularity of the writeprotect userfaults
is up to userland to decide. The main benefit is that it can avoid the
worst case degradation of using x2 physical memory (disabling THP makes
zero difference in that regard: if storage is very slow, x2 physical memory
can still be used if very unlucky), it can throttle the WP writes (anon COW
cannot throttle), and it can avoid fork() altogether so it shares the same
pagetables.
It can also put the "user-CoWed" pages (in the fault handler) in front of
the write queue, to be written first, using a ring buffer for the CoWed 4k
pages, to keep memory utilization even lower even though THP stays on at
all times for all pages that didn't get a CoW yet. This will be an optimal
snapshot method, much better than fork() no matter if 4k or THP are backing
the memory.

In short MADV_DONTNEED has nothing to do with redis. If mysql gets an
improvement, surely you can post a benchmark instead of URLs.

If you want low memory usage at the cost of potentially slower performance
overall you should use transparent_hugepage=madvise.

The cases where THP is not a good tradeoff are generally related to lower
performance in copy-user or the higher cost of compaction if the app is
only ever doing short lived allocations.

If you post a reproducible benchmark with a real life app that gets an
improvement with whatever change you're doing, it'll be possible to
evaluate it.

Thanks,
Andrea
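The copy-user cost and the shared x2 worst case described in this message can be checked with a little arithmetic. The helper below is an illustrative model only: `cow_bytes_copied` is a made-up name, and it assumes every write after fork() hits a page still shared with the child, so the first write to each page triggers exactly one copy of that whole page.

```python
def cow_bytes_copied(write_offsets, page_size):
    """Bytes copied by CoW for a set of write offsets at a given page size.

    Each distinct page touched is copied once, in full, on its first write.
    """
    touched_pages = {offset // page_size for offset in write_offsets}
    return len(touched_pages) * page_size


MiB = 1 << 20

# 100 sparse writes, one per 2 MiB region: THP copies 512x more data.
sparse = [i * 2 * MiB for i in range(100)]
print(cow_bytes_copied(sparse, 4096))      # 4 KiB pages
print(cow_bytes_copied(sparse, 2 * MiB))   # 2 MiB huge pages

# Writing every 4 KiB page of an 8 MiB region: both sizes end up copying
# the whole region, which is the shared x2 worst case.
dense = range(0, 8 * MiB, 4096)
print(cow_bytes_copied(dense, 4096), cow_bytes_copied(dense, 2 * MiB))
```

With sparse writes the huge-page case copies 2 MiB per fault instead of 4 KiB, which is where the extra latency and CPU time come from, while the dense case shows that the peak memory overhead converges to the same value for both page sizes.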
* Re: [PATCH v2] mm: Reduce memory bloat with THP
  2018-01-19 20:59 ` Nitin Gupta
  2018-01-25  0:47 ` Zi Yan
@ 2018-01-25  9:58 ` Michal Hocko
  2018-01-25 22:40 ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2018-01-25 9:58 UTC
To: Nitin Gupta
Cc: Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar, Mel Gorman,
    Nadav Amit, Minchan Kim, Kirill A. Shutemov, Peter Zijlstra,
    Vegard Nossum, Levin, Alexander (Sasha Levin), Mike Rapoport,
    Hillf Danton, Shaohua Li, Anshuman Khandual, Andrea Arcangeli,
    David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse,
    Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding,
    linux-kernel, linux-mm

On Fri 19-01-18 12:59:17, Nitin Gupta wrote:
> On 1/19/18 4:49 AM, Michal Hocko wrote:
> > On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
> >> From: Nitin Gupta <nitin.m.gupta@oracle.com>
> >>
> >> Currently, if the THP enabled policy is "always", or the mode
> >> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
> >> is allocated on a page fault if the pud or pmd is empty. This
> >> yields the best VA translation performance, but increases memory
> >> consumption if some small page ranges within the huge page are
> >> never accessed.
> >
> > Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> > users.
> >
>
> Yes, allocating hugepage on first touch is the current behavior for
> above two cases. However, I see issues with this current behavior.
> Firstly, THP=always mode is often too aggressive/wasteful to be useful
> for any realistic workloads. For THP=madvise, users may want to back
> active parts of memory region with hugepages while avoiding aggressive
> hugepage allocation on first touch. Or, they may really want the current
> behavior.

Then they should use THP=never and rely on khugepaged to compact madvise
regions. This will avoid the first touch problem and you can also control
how large a portion of the THP has to be mapped already.
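As a rough illustration of the khugepaged control referred to above: the max_ptes_none tunable bounds how many of the 512 PTEs in a 2 MiB range may still be empty when khugepaged collapses the range into a huge page. The sketch below models only that one condition; khugepaged applies further checks that are not shown, and the function names are invented for this example. The sysfs path is the one given earlier in the thread.

```python
HPAGE_PMD_NR = 512  # PTEs per 2 MiB huge page, assuming 4 KiB base pages

MAX_PTES_NONE_PATH = (
    "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"
)


def read_max_ptes_none(path=MAX_PTES_NONE_PATH):
    """Read the current max_ptes_none setting from sysfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def khugepaged_may_collapse(mapped_ptes, max_ptes_none):
    """Simplified model: collapse is allowed only if the number of empty
    PTEs in the range does not exceed max_ptes_none."""
    none_ptes = HPAGE_PMD_NR - mapped_ptes
    return none_ptes <= max_ptes_none


# Default setting (511): a single mapped page is enough.
print(khugepaged_may_collapse(1, 511))
# A stricter setting of 64 requires at least 448 of 512 pages to be mapped.
print(khugepaged_may_collapse(448, 64), khugepaged_may_collapse(447, 64))
```

Lowering max_ptes_none therefore acts as a promotion threshold of the kind the patch wants, although it takes effect asynchronously in khugepaged and not at page fault time.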

> With this patch, users would have the option to pick what behavior they
> want by passing hints to the kernel in the form of MADV_HUGEPAGE and
> MADV_DONTNEED madvise calls.

More on this below.

[...]
> >> With this change, whenever an application issues MADV_DONTNEED on a
> >> memory region, the region is marked as "space-efficient". For such
> >> regions, a hugepage is not immediately allocated on first write.
> >
> > Kirill didn't like it in the previous version and I do not like this
> > either. You are adding a very subtle side effect which might be completely
> > unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
> > to free up unused memory. Now you have put it out of THP usage
> > basically.
> >
>
> Userspace may want a region to be considered by khugepaged while opting
> out of hugepage allocation on first touch. Asking userspace memory
> allocators to have to track and reclaim unused parts of a THP allocated
> hugepage does not seem right, as the kernel can use simple userspace
> hints to avoid allocating extra memory in the first place.

Yes. This is in sync with what I wrote. Allocators shouldn't care and that
is why MADV_DONTNEED with a side effect is simply wrong.

> I agree that this patch is adding a subtle side-effect which may take
> some applications by surprise. However, I often see the opposite too:
> for many workloads, disabling THP is the first advice as this aggressive
> allocation of hugepages on first touch is unexpected and is too
> wasteful. For e.g.:

Ohh, absolutely. And that is why we have changed the default in upstream
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option")

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH v2] mm: Reduce memory bloat with THP
  2018-01-25  9:58 ` Michal Hocko
@ 2018-01-25 22:40 ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2018-01-25 22:40 UTC
To: Michal Hocko
Cc: Nitin Gupta, Nitin Gupta, steven.sistare, Andrew Morton, Ingo Molnar,
    Mel Gorman, Nadav Amit, Minchan Kim, Kirill A. Shutemov,
    Peter Zijlstra, Vegard Nossum, Levin, Alexander (Sasha Levin),
    Mike Rapoport, Hillf Danton, Shaohua Li, Anshuman Khandual,
    David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Jérôme Glisse,
    Matthew Wilcox, Ross Zwisler, Hugh Dickins, Tobin C Harding,
    linux-kernel, linux-mm

On Thu, Jan 25, 2018 at 10:58:32AM +0100, Michal Hocko wrote:
> Ohh, absolutely. And that is why we have changed the default in upstream
> 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> stall-free defrag option")

Agreed, that direct compaction change should already address the cases
quoted in the other URLs.

One of the URLs is about using fork() to snapshot a nosql db state. That
one can't be helped by the above commit, but it's still unrelated to
MADV_DONTNEED or memory bloat.

It would be possible to fully fix the use of fork() for snapshotting
without userfaultfd WP mode, by just adding an madvise that forces 4k CoWs
on top of 2M THP and calling it in the parent that keeps writing to memory
while the child is writing the readonly copy to disk. But I believe
userfaultfd WP will be way more optimal as it provides so many other
advantages (i.e. avoid fork() in the first place and use pthread_create,
be able to throttle on I/O and limit the max memory usage to something less
than x2 RAM without the risk of triggering the OOM killer, have a ring that
is written immediately to keep the mem utilization low, etc.).

Thanks,
Andrea
end of thread, other threads:[~2018-02-01 10:46 UTC | newest]

Thread overview: 12+ messages
     [not found] <1516318444-30868-1-git-send-email-nitingupta910@gmail.com>
2018-01-19 12:49 ` [PATCH v2] mm: Reduce memory bloat with THP Michal Hocko
2018-01-19 20:59   ` Nitin Gupta
2018-01-25  0:47     ` Zi Yan
2018-01-25 19:41       ` Nitin Gupta
2018-01-25 21:13         ` Mel Gorman
2018-02-01  1:09           ` Nitin Gupta
2018-02-01 10:09             ` Mel Gorman
2018-02-01 10:27               ` Kirill A. Shutemov
2018-02-01 10:46                 ` Mel Gorman
2018-01-25 22:29         ` Andrea Arcangeli
2018-01-25  9:58     ` Michal Hocko
2018-01-25 22:40       ` Andrea Arcangeli