* IO regression after ab8fabd46f on x86 kernels with high memory
@ 2013-04-26 23:44 Pierre-Loup A. Griffais
  2013-04-27  1:53 ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-26 23:44 UTC (permalink / raw)
  To: hannes; +Cc: linux-kernel, torvalds, riel, sonnyrao, kamezawa.hiroyu, akpm

I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
takes between two and three minutes. It looks like a similar throughput
regression happens on any machine running an i386 PAE kernel with large
amounts of memory; the threshold seems to be 16G, and passing mem=15G on
the kernel command line fixes it.

I bisected it to the following change:

commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
Author: Johannes Weiner <jweiner@redhat.com>
Date:   Tue Jan 10 15:07:42 2012 -0800

    mm: exclude reserved pages from dirtyable memory

I realize running x86 kernels against large amounts of memory is not
advised for various reasons, but I would not expect such a big
regression in basic functionality to be one of them. Is that
accurate, or are these configurations expected to become unusable from
3.3 onwards?

Also CCing Sonny since it looks like he tried to fix an overflow issue
related to the same change with commit c8b74c2f66049, but I'm still
experiencing the problem with a kernel built from master.

Thanks,
 - Pierre-Loup

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-26 23:44 IO regression after ab8fabd46f on x86 kernels with high memory Pierre-Loup A. Griffais @ 2013-04-27 1:53 ` Rik van Riel 2013-04-27 2:42 ` Johannes Weiner 0 siblings, 1 reply; 16+ messages in thread From: Rik van Riel @ 2013-04-27 1:53 UTC (permalink / raw) To: Pierre-Loup A. Griffais Cc: hannes, linux-kernel, torvalds, sonnyrao, kamezawa.hiroyu, akpm On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote: > I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a > 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it > takes between two and three minutes. It looks like a similar throughput > regression happens on any machine running an i386 PAE kernel with high > amounts of memory; the threshold seems to be 16G; passing mem=15G to the > kernel commandline fixes it. If you have that much memory in the system, you will want to run a 64 bit kernel to avoid all kinds of memory management corner cases. > I bisected it to the following change: > > commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d > Author: Johannes Weiner <jweiner@redhat.com> > Date: Tue Jan 10 15:07:42 2012 -0800 > > mm: exclude reserved pages from dirtyable memory > > I realize running x86 kernels against high amounts of memory is not > advised for various reasons, but I would assume that such a big > regression in basic functionality to not be part of them. Is that > accurate, or are these configurations expected to become unusable from > 3.3 onwards? Reverting that patch would probably break i686 PAE systems with lots of memory at a different threshold. With more than 8-12GB of memory, an i686 kernel is between a rock and a hard place. Whether you move it closer to the rock, or closer to the hard place, all you do is change the way in which it breaks. 
> Also CCing Sonny since it looks like he tried to fix an overflow issue > related to the same change with commit c8b74c2f66049, but I'm still > experiencing the problem with a kernel built from master. > > Thanks, > - Pierre-Loup -- All rights reversed ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-27 1:53 ` Rik van Riel @ 2013-04-27 2:42 ` Johannes Weiner 2013-04-29 21:53 ` Pierre-Loup A. Griffais 0 siblings, 1 reply; 16+ messages in thread From: Johannes Weiner @ 2013-04-27 2:42 UTC (permalink / raw) To: Rik van Riel Cc: Pierre-Loup A. Griffais, linux-kernel, torvalds, sonnyrao, kamezawa.hiroyu, akpm On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote: > On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote: > >I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a > >180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it > >takes between two and three minutes. It looks like a similar throughput > >regression happens on any machine running an i386 PAE kernel with high > >amounts of memory; the threshold seems to be 16G; passing mem=15G to the > >kernel commandline fixes it. > > If you have that much memory in the system, you will > want to run a 64 bit kernel to avoid all kinds of > memory management corner cases. Agreed. You can even keep your 32 bit userland, just swap the kernel... > >I bisected it to the following change: > > > >commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d > >Author: Johannes Weiner <jweiner@redhat.com> > >Date: Tue Jan 10 15:07:42 2012 -0800 > > > > mm: exclude reserved pages from dirtyable memory > > > >I realize running x86 kernels against high amounts of memory is not > >advised for various reasons, but I would assume that such a big > >regression in basic functionality to not be part of them. Is that > >accurate, or are these configurations expected to become unusable from > >3.3 onwards? > > Reverting that patch would probably break i686 PAE systems with > lots of memory at a different threshold. It would also re-introduce the reclaim stalls when zones with very little page cache due to lowmem reserves end up with a large percentage of their LRU dirty. 
And that affects modern machines too, because of the lowmem reserves
in DMA32 due to relatively bigger Normal zones.

On such large highmem machines, however, the imbalance between highmem
and lowmem is so enormous that the lowmem reserves basically exclude
all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of
sanely allowable dirty memory is actually a function of lowmem, not
highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is
not considered dirtyable out of the box, the amount of dirtyable
memory on your machine is 0. You can work around this by setting
vm.highmem_is_dirtyable=1.
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-27 2:42 ` Johannes Weiner @ 2013-04-29 21:53 ` Pierre-Loup A. Griffais 2013-04-29 22:03 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Pierre-Loup A. Griffais @ 2013-04-29 21:53 UTC (permalink / raw) To: Johannes Weiner Cc: Rik van Riel, linux-kernel, torvalds, sonnyrao, kamezawa.hiroyu, akpm On 04/26/2013 07:42 PM, Johannes Weiner wrote: > On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote: >> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote: >>> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a >>> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it >>> takes between two and three minutes. It looks like a similar throughput >>> regression happens on any machine running an i386 PAE kernel with high >>> amounts of memory; the threshold seems to be 16G; passing mem=15G to the >>> kernel commandline fixes it. >> >> If you have that much memory in the system, you will >> want to run a 64 bit kernel to avoid all kinds of >> memory management corner cases. > > Agreed. You can even keep your 32 bit userland, just swap the > kernel... > >>> I bisected it to the following change: >>> >>> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d >>> Author: Johannes Weiner <jweiner@redhat.com> >>> Date: Tue Jan 10 15:07:42 2012 -0800 >>> >>> mm: exclude reserved pages from dirtyable memory >>> >>> I realize running x86 kernels against high amounts of memory is not >>> advised for various reasons, but I would assume that such a big >>> regression in basic functionality to not be part of them. Is that >>> accurate, or are these configurations expected to become unusable from >>> 3.3 onwards? >> >> Reverting that patch would probably break i686 PAE systems with >> lots of memory at a different threshold. 
>
> It would also re-introduce the reclaim stalls when zones with very
> little page cache due to lowmem reserves end up with a large
> percentage of their LRU dirty. And that affects modern machines too,
> because of the lowmem reserves in DMA32 due to relatively bigger
> Normal zones.
>
> On such large highmem machines, however, the imbalance between highmem
> and lowmem is so enormous that the lowmem reserves basically exclude
> all of lowmem from page cache usage.
>
> But because dirty highmem creates lowmem pressure, and the amount of
> sanely allowable dirty memory is actually a function of lowmem, not
> highmem, highmem is not included in the amount of dirtyable memory.
>
> So because your lowmem is not available for page cache and highmem is
> not considered dirtyable out of the box, the amount of dirtyable
> memory on your machine is 0. You can work around this by setting
> vm.highmem_is_dirtyable=1.

I understand the technical concerns; we had some existing issues on 3.2
with 24/32GB machines where the kernel would start erroneously
OOM-killing new processes after a while; booting with mem=16G solved
that. But now this goes a level further, since the machine is unusable
upfront, right at boot, even with mem=16G. As such, this clearly seems
like a regression more than a tradeoff.

We're in a situation where popular distros ship 32-bit as the default
"use this if you're not sure what to get" option, with PAE also enabled
by default, and with most modern computers shipping with more than 16G
of RAM, especially for gaming. Looking at the Steam HW survey data, we
have hundreds of users using this combination; this commit means that
installing package updates that pull in a new kernel will immediately
cause their system to become unusable.

Other than this particular concern, what's the high-level take-away? Is
PAE support in the Linux kernel a false promise that distros should not
be shipping by default, if at all? Should it be removed from the kernel
entirely if these configurations are knowingly broken by commits like
this?

Thanks,
 - Pierre-Loup
* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 21:53       ` Pierre-Loup A. Griffais
@ 2013-04-29 22:03         ` Linus Torvalds
  2013-04-29 22:08           ` Pierre-Loup A. Griffais
                             ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Linus Torvalds @ 2013-04-29 22:03 UTC
  To: Pierre-Loup A. Griffais
  Cc: Johannes Weiner, Rik van Riel, Linux Kernel Mailing List,
	sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
<pgriffais@valvesoftware.com> wrote:
>
> Other than this particular concern, what's the high-level take-away? Is PAE
> support in the Linux kernel a false promise that distros should not be
> shipping by default, if at all? Should it be removed from the kernel
> entirely if these configurations are knowingly broken by commits like this?

PAE is "make it barely work". The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
"random noise".

Yeah, the difference between "we can cache it" and "we have to do IO"
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the "can
sit in the cache" to the "do lots of IO" situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.

               Linus
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-29 22:03 ` Linus Torvalds @ 2013-04-29 22:08 ` Pierre-Loup A. Griffais 2013-05-02 4:37 ` Sonny Rao 2013-04-30 0:48 ` Rik van Riel 2013-05-08 19:10 ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin 2 siblings, 1 reply; 16+ messages in thread From: Pierre-Loup A. Griffais @ 2013-04-29 22:08 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Rik van Riel, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton On 04/29/2013 03:03 PM, Linus Torvalds wrote: > On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais > <pgriffais@valvesoftware.com> wrote: >> >> Other than this particular concern, what's the high-level take-away? Is PAE >> support in the Linux kernel a false promise than distros should not be >> shipping by default, if at all? Should it be removed from the kernel >> entirely if these configurations are knowingly broken by commits like this? > > PAE is "make it barely work". The whole concept is fundamentally > flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't > even understand *how* flawed and stupid that is. > > Don't do it. Upgrade to 64-bit, or live with the fact that IO > performance will suck. The fact that it happened to work better under > your particular load with one particular IO size is entirely just > "random noise". > > Yeah, the difference between "we can cache it" and "we have to do IO" > is huge. With a 32-bit kernel, we do IO much earlier now, just to > avoid some really nasty situations. That makes you go from the "can > sit in the cache" to the "do lots of IO" situation. Tough. > > Seriously, you can compile yourself a 64-bit kernel and continue to > use your 32-bit user-land. And you can complain to whatever distro you > used that it didn't do that in the first place. But we're not going to > bother with trying to tune PAE for some particular load. It's just not > worth it to anybody. 
All of this came from me trying to reproduce slowdowns reported by other
people; I personally run a 64-bit kernel and understand how bad an idea
it is to attempt to run 32-bit kernels with PAE enabled on modern
machines. However, my goal is to keep end users who don't necessarily
understand this from getting bitten by it and breaking their systems by
upgrading their kernels. I will indeed bring this up with distributors
and point out that shipping PAE kernels by default is not a good idea
given these problems and your stance on the matter.

Thanks,
 - Pierre-Loup

>
>                 Linus
>
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-29 22:08 ` Pierre-Loup A. Griffais @ 2013-05-02 4:37 ` Sonny Rao 0 siblings, 0 replies; 16+ messages in thread From: Sonny Rao @ 2013-05-02 4:37 UTC (permalink / raw) To: Pierre-Loup A. Griffais Cc: Linus Torvalds, Johannes Weiner, Rik van Riel, Linux Kernel Mailing List, KAMEZAWA Hiroyuki, Andrew Morton On Mon, Apr 29, 2013 at 3:08 PM, Pierre-Loup A. Griffais <pgriffais@valvesoftware.com> wrote: > On 04/29/2013 03:03 PM, Linus Torvalds wrote: >> >> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais >> <pgriffais@valvesoftware.com> wrote: >>> >>> >>> Other than this particular concern, what's the high-level take-away? Is >>> PAE >>> support in the Linux kernel a false promise than distros should not be >>> shipping by default, if at all? Should it be removed from the kernel >>> entirely if these configurations are knowingly broken by commits like >>> this? >> >> >> PAE is "make it barely work". The whole concept is fundamentally >> flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't >> even understand *how* flawed and stupid that is. >> >> Don't do it. Upgrade to 64-bit, or live with the fact that IO >> performance will suck. The fact that it happened to work better under >> your particular load with one particular IO size is entirely just >> "random noise". >> >> Yeah, the difference between "we can cache it" and "we have to do IO" >> is huge. With a 32-bit kernel, we do IO much earlier now, just to >> avoid some really nasty situations. That makes you go from the "can >> sit in the cache" to the "do lots of IO" situation. Tough. >> >> Seriously, you can compile yourself a 64-bit kernel and continue to >> use your 32-bit user-land. And you can complain to whatever distro you >> used that it didn't do that in the first place. But we're not going to >> bother with trying to tune PAE for some particular load. It's just not >> worth it to anybody. 
> > > All of this came from me trying to reproduce slowdowns reported by other > people; I personally run a 64-bit kernel and understand how bad of an idea > it is to attempt to run 32-bit kernels with PAE enabled on modern machines. > However, my goal is to avoid ending up with a variety of end-users that > don't necessarily understand this getting bitten by it and breaking their > systems by upgrading their kernels. I will indeed bring this up with > distributors and point out than shipping PAE kernels by default is not a > good idea given these problems and your stance on the matter. > Sorry just saw this (my stupid gmail filters for lkml) The slow-down we ran into wasn't even on PAE -- it was *just* with highmem on a 2GB system. The non-zero amount (90MB? or so) of highmem was enough to cause major problems due to that particular underflow. I would say regardless of how much memory you have, if the system can use a 64-bit kernel, then it almost certainly should. I've seen some very minor performance impacts on 64-bit capable Atom systems with tiny L2 caches, but it's almost in the noise and not worth the pain. > Thanks, > - Pierre-Loup > >> >> Linus >> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-29 22:03 ` Linus Torvalds 2013-04-29 22:08 ` Pierre-Loup A. Griffais @ 2013-04-30 0:48 ` Rik van Riel 2013-04-30 1:06 ` Pierre-Loup A. Griffais 2013-05-02 1:34 ` Steven Rostedt 2013-05-08 19:10 ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin 2 siblings, 2 replies; 16+ messages in thread From: Rik van Riel @ 2013-04-30 0:48 UTC (permalink / raw) To: Linus Torvalds Cc: Pierre-Loup A. Griffais, Johannes Weiner, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton On 04/29/2013 06:03 PM, Linus Torvalds wrote: > Seriously, you can compile yourself a 64-bit kernel and continue to > use your 32-bit user-land. And you can complain to whatever distro you > used that it didn't do that in the first place. But we're not going to > bother with trying to tune PAE for some particular load. It's just not > worth it to anybody. I can think of one way to "tune PAE" that will help avoid the breakage, and at the same time draw the attention of users. Limit the memory that a 32 bit PAE kernel uses, to something small enough where the user will not encounter random breakage. Maybe 8 or 12GB? It could also print out a friendly message, to inform the user they should upgrade to a 64 bit kernel to enjoy the use of all of their memory. It is a bit of a heavy stick, but I suspect that it would clue in all of the affected users. If you have no objection to this, I'll whip up a patch. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-30  0:48           ` Rik van Riel
@ 2013-04-30  1:06             ` Pierre-Loup A. Griffais
  2013-05-02  1:34             ` Steven Rostedt
  1 sibling, 0 replies; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-30 1:06 UTC
  To: Rik van Riel
  Cc: Linus Torvalds, Johannes Weiner, Linux Kernel Mailing List,
	sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton

On 04/29/2013 05:48 PM, Rik van Riel wrote:
> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> I can think of one way to "tune PAE" that will help
> avoid the breakage, and at the same time draw the
> attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to
> something small enough where the user will not
> encounter random breakage. Maybe 8 or 12GB?
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.
>
> It is a bit of a heavy stick, but I suspect that
> it would clue in all of the affected users.
>
> If you have no objection to this, I'll whip up a
> patch.
>

That would be pretty useful, especially if I can then convince
distributors to apply it and roll it out ASAP. I haven't personally
observed any problems with mem=15G, whereas mem=16G exhibits the IO
issue upfront, and anything above that exhibits the OOM-killer / low
memory starvation issue that existed before Johannes' change.

Thanks,
 - Pierre-Loup
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-30 0:48 ` Rik van Riel 2013-04-30 1:06 ` Pierre-Loup A. Griffais @ 2013-05-02 1:34 ` Steven Rostedt 2013-05-02 2:46 ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel 1 sibling, 1 reply; 16+ messages in thread From: Steven Rostedt @ 2013-05-02 1:34 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Pierre-Loup A. Griffais, Johannes Weiner, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote: > > It could also print out a friendly message, to > inform the user they should upgrade to a 64 bit > kernel to enjoy the use of all of their memory. Oh, oh, oh!!! Can we use my message: http://lwn.net/Articles/501769/ OK, maybe it's not so friendly ;-) -- Steve ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] mm,x86: limit 32 bit kernel to 12GB memory
  2013-05-02  1:34             ` Steven Rostedt
@ 2013-05-02  2:46               ` Rik van Riel
  2013-05-02  7:37                 ` Pierre-Loup A. Griffais
  2013-05-02 20:03                 ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2013-05-02 2:46 UTC
  To: Steven Rostedt
  Cc: Linus Torvalds, Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On Wed, 1 May 2013 21:34:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
> >
> > It could also print out a friendly message, to
> > inform the user they should upgrade to a 64 bit
> > kernel to enjoy the use of all of their memory.
>
> Oh, oh, oh!!! Can we use my message:
>
>   http://lwn.net/Articles/501769/
>
> OK, maybe it's not so friendly ;-)

Here's a somewhat friendlier one. Printing out the total amount of
memory in the system may give them some extra motivation to upgrade
to a 64 bit kernel :)

---8<----
Subject: mm,x86: limit 32 bit kernel to 12GB memory

Running 32 bit kernels on very large memory systems is a recipe
for disaster, due to fundamental architectural limits in both
Linux and the hardware. Moreover, all modern hardware with large
memory supports 64 bits.

However, many users continue using 32 bit kernels, and end up
encountering stability and performance problems as a result.

It may be better to save those people the frustration of stability
issues by limiting memory on a 32 bit kernel to 12GB (about the upper
limit that still works right), and printing a friendly reminder that
they really should be using a 64 bit kernel.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/setup.h |  1 +
 arch/x86/mm/init_32.c        | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..79de6bf 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -14,6 +14,7 @@
  */
 #define MAXMEM_PFN	 PFN_DOWN(MAXMEM)
 #define MAX_NONPAE_PFN	(1 << 20)
+#define MAX_PAE_PFN	(3 << 20)
 
 #endif /* __i386__ */
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 3ac7e31..e35b3f5 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void)
 
 #define MSG_HIGHMEM_TRIMMED \
 	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
+
+#define MSG_HIGHMEM_INSANE \
+	"Warning: 32 bit kernels on large memory systems have problems.\n" \
+	"Limiting memory to 12GB for system stability.\n" \
+	"Use a 64 bit kernel to access all %lu MB of memory.\n"
+
 /*
  * We have more RAM than fits into lowmem - we try to put it into
  * highmem, also taking the highmem=x boot parameter into account:
@@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void)
 		max_pfn = MAX_NONPAE_PFN;
 		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
 	}
+#else /* !CONFIG_HIGHMEM64G */
+	if (max_pfn > MAX_PAE_PFN) {
+		printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8);
+		max_pfn = MAX_PAE_PFN;
+	}
 #endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 }
* Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory 2013-05-02 2:46 ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel @ 2013-05-02 7:37 ` Pierre-Loup A. Griffais 2013-05-02 20:03 ` Linus Torvalds 1 sibling, 0 replies; 16+ messages in thread From: Pierre-Loup A. Griffais @ 2013-05-02 7:37 UTC (permalink / raw) To: Rik van Riel Cc: Steven Rostedt, Linus Torvalds, Johannes Weiner, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton Reviewed-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com> On 05/01/2013 07:46 PM, Rik van Riel wrote: > On Wed, 1 May 2013 21:34:26 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: >> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote: >>> >>> It could also print out a friendly message, to >>> inform the user they should upgrade to a 64 bit >>> kernel to enjoy the use of all of their memory. >> >> Oh, oh, oh!!! Can we use my message: >> >> http://lwn.net/Articles/501769/ >> >> OK, maybe it's not so friendly ;-) > > Here's a somewhat friendlier one. Printing out the total amount of > memory in the system may give them some extra motivation to upgrade > to a 64 bit kernel :) > > ---8<---- > Subject: mm,x86: limit 32 bit kernel to 12GB memory > > Running 32 bit kernels on very large memory systems is a recipe > for disaster, due to fundamental architectural limits in both > Linux and the hardware. Moreover, all modern hardware with large > memory supports 64 bits. > > However, many users continue using 32 bit kernels, and end up > encountering stability and performance problems as a result. > > It may be better to save those people the frustration of stability > issues by limiting memory on a 32 bit kernel to 12GB (about the upper > limit that still works right), and printing a friendly reminder that > they really should be using a 64 bit kernel. 
> > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > arch/x86/include/asm/setup.h | 1 + > arch/x86/mm/init_32.c | 11 +++++++++++ > 2 files changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h > index b7bf350..79de6bf 100644 > --- a/arch/x86/include/asm/setup.h > +++ b/arch/x86/include/asm/setup.h > @@ -14,6 +14,7 @@ > */ > #define MAXMEM_PFN PFN_DOWN(MAXMEM) > #define MAX_NONPAE_PFN (1 << 20) > +#define MAX_PAE_PFN (3 << 20) > > #endif /* __i386__ */ > > diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c > index 3ac7e31..e35b3f5 100644 > --- a/arch/x86/mm/init_32.c > +++ b/arch/x86/mm/init_32.c > @@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void) > > #define MSG_HIGHMEM_TRIMMED \ > "Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n" > + > +#define MSG_HIGHMEM_INSANE \ > + "Warning: 32 bit kernels on large memory systems have problems.\n" \ > + "Limiting memory to 12GB for system stability.\n" \ > + "Use a 64 bit kernel to access all %lu MB of memory.\n" > + > /* > * We have more RAM than fits into lowmem - we try to put it into > * highmem, also taking the highmem=x boot parameter into account: > @@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void) > max_pfn = MAX_NONPAE_PFN; > printk(KERN_WARNING MSG_HIGHMEM_TRIMMED); > } > +#else /* !CONFIG_HIGHMEM64G */ > + if (max_pfn > MAX_PAE_PFN) { > + printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8); > + max_pfn = MAX_PFN; > + } > #endif /* !CONFIG_HIGHMEM64G */ > #endif /* !CONFIG_HIGHMEM */ > } > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory 2013-05-02 2:46 ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel 2013-05-02 7:37 ` Pierre-Loup A. Griffais @ 2013-05-02 20:03 ` Linus Torvalds 2013-05-11 9:16 ` Yuhong Bao 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-05-02 20:03 UTC (permalink / raw) To: Rik van Riel Cc: Steven Rostedt, Pierre-Loup A. Griffais, Johannes Weiner, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton On Wed, May 1, 2013 at 7:46 PM, Rik van Riel <riel@redhat.com> wrote: > > Here's a somewhat friendlier one. Printing out the total amount of > memory in the system may give them some extra motivation to upgrade > to a 64 bit kernel :) This needs more work: - suggesting a 64-bit kernel on a truly 32-bit CPU is insane, so it had better actually check the CPUID for 64-bit support ("lm" for "long mode"). - we don't remove features, so there should be a kernel command line option to say "I'm insane, I know this is going to have problems, I want you to try to use more memory anyway" and disable the new 12GB limit - I don't think it's necessarily "system stability". The problem with large amounts of highmem ends up being that we end up using up almost all of the lowmem just to *track* the huge amount of highmem, and then we have so little lowmem that we suck at performance and have various random problems. So it's not just "system stability", it's more fluid than that. The "it's more fluid than that" is also why I'd want to have a way to override it. Using up all lowmem to track highmem is actually ok under some very specific loads. If you have a setup where you have tons of highmem, but all it is ever used for is anonymous user pages, you don't need a lot of lowmem. Some of the craziest PAE users were that class of use, and for all we know there are still crazy people with real 32-bit CPU's that want to do it. 
We don't really want to support it, we don't really care, but I don't
think we want to then say "you cannot do that" either. We want to say
"you're a f*cking crazy moron, and we don't think what you do is a
good idea, but if you absolutely want to shoot yourself in the
foot, here's how to do it. Don't expect things to work well in
general, but you might have a load where it's acceptable".

               Linus
* RE: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory 2013-05-02 20:03 ` Linus Torvalds @ 2013-05-11 9:16 ` Yuhong Bao 0 siblings, 0 replies; 16+ messages in thread From: Yuhong Bao @ 2013-05-11 9:16 UTC (permalink / raw) To: Linus Torvalds, Rik van Riel Cc: Steven Rostedt, Pierre-Loup A. Griffais, Johannes Weiner, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki > - I don't think it's necessarily "system stability". The problem with > large amounts of highmem ends up being that we end up using up almost > all of the lowmem just to *track* the huge amount of highmem, and then > we have so little lowmem that we suck at performance and have various > random problems. So it's not just "system stability", it's more fluid > than that. FYI 32-bit Windows already limits to 16GB when 3G/1G split is used for a similar reason. (They default to 2G/2G split.) Yuhong Bao ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: IO regression after ab8fabd46f on x86 kernels with high memory 2013-04-29 22:03 ` Linus Torvalds 2013-04-29 22:08 ` Pierre-Loup A. Griffais 2013-04-30 0:48 ` Rik van Riel @ 2013-05-08 19:10 ` H. Peter Anvin 2013-06-03 1:17 ` Yuhong Bao 2 siblings, 1 reply; 16+ messages in thread From: H. Peter Anvin @ 2013-05-08 19:10 UTC (permalink / raw) To: Linus Torvalds Cc: Pierre-Loup A. Griffais, Johannes Weiner, Rik van Riel, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton On 04/29/2013 03:03 PM, Linus Torvalds wrote: > On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais > <pgriffais@valvesoftware.com> wrote: >> >> Other than this particular concern, what's the high-level take-away? Is PAE >> support in the Linux kernel a false promise that distros should not be >> shipping by default, if at all? Should it be removed from the kernel >> entirely if these configurations are knowingly broken by commits like this? > > PAE is "make it barely work". The whole concept is fundamentally > flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't > even understand *how* flawed and stupid that is. > Let's be straight... the problem isn't PAE per se, the problem is *HIGHMEM*. PAE just allows HIGHMEM to stretch further into problematic territory. Distros install PAE kernels by default because it is required to support NX. That is fine. The problem is that once your memory crosses the HIGHMEM threshold -- 896 MiB in the normal configuration -- then you are in "this is going to hurt" territory. I have seen HIGHMEM devastate performance without even crossing the 4 GiB threshold where PAE is required. We kernel guys have been asking the distros to ship 64-bit kernels even in their 32-bit distros for many years, but concerns of compat issues and the desire to deprecate 32-bit userspace seems to have kept that from happening. -hpa ^ permalink raw reply [flat|nested] 16+ messages in thread
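
Editorial sketch: the distinction between the HIGHMEM threshold and the PAE threshold can be illustrated by splitting a machine's RAM at the 896 MiB boundary mentioned above. The function name and the fixed boundary are illustrative assumptions. On a real system the lowmem zone is somewhat smaller than 896 MiB because of reserved regions, and the split can be changed at build time.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

LOWMEM_LIMIT = 896 * MIB  # the "normal configuration" threshold from the message


def split_memory(total_bytes: int, lowmem_limit: int = LOWMEM_LIMIT):
    """Return (lowmem, highmem) in bytes for a given amount of RAM."""
    low = min(total_bytes, lowmem_limit)
    return low, total_bytes - low


for total in (512 * MIB, 2 * GIB, 4 * GIB, 16 * GIB):
    low, high = split_memory(total)
    needs_pae = total > 4 * GIB  # more than 4 GiB cannot be addressed without PAE
    print(f"{total // MIB:6d} MiB RAM: low={low // MIB} MiB "
          f"high={high // MIB} MiB pae_required={needs_pae}")
```

A 2 GiB machine already has more highmem (1152 MiB) than lowmem (896 MiB) without PAE being involved at all, which is the situation the message describes as hurting performance below the 4 GiB mark.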
* RE: IO regression after ab8fabd46f on x86 kernels with high memory 2013-05-08 19:10 ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin @ 2013-06-03 1:17 ` Yuhong Bao 0 siblings, 0 replies; 16+ messages in thread From: Yuhong Bao @ 2013-06-03 1:17 UTC (permalink / raw) To: H. Peter Anvin, Linus Torvalds Cc: Pierre-Loup A. Griffais, Johannes Weiner, Rik van Riel, Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki > We kernel guys have been asking the distros to ship 64-bit kernels even > in their 32-bit distros for many years, but concerns of compat issues > and the desire to deprecate 32-bit userspace seems to have kept that > from happening. And now there is another reason: to call 64-bit EFI runtime services. In retrospect, I would have stuck with 32-bit EFI with 64-bit kernels calling runtime services in compatibility mode, but of course it is too late for that now. Yuhong Bao ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2013-06-03 1:24 UTC]

Thread overview: 16+ messages
2013-04-26 23:44 IO regression after ab8fabd46f on x86 kernels with high memory Pierre-Loup A. Griffais
2013-04-27 1:53 ` Rik van Riel
2013-04-27 2:42 ` Johannes Weiner
2013-04-29 21:53 ` Pierre-Loup A. Griffais
2013-04-29 22:03 ` Linus Torvalds
2013-04-29 22:08 ` Pierre-Loup A. Griffais
2013-05-02 4:37 ` Sonny Rao
2013-04-30 0:48 ` Rik van Riel
2013-04-30 1:06 ` Pierre-Loup A. Griffais
2013-05-02 1:34 ` Steven Rostedt
2013-05-02 2:46 ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel
2013-05-02 7:37 ` Pierre-Loup A. Griffais
2013-05-02 20:03 ` Linus Torvalds
2013-05-11 9:16 ` Yuhong Bao
2013-05-08 19:10 ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin
2013-06-03 1:17 ` Yuhong Bao