linux-kernel.vger.kernel.org archive mirror
* IO regression after ab8fabd46f on x86 kernels with high memory
@ 2013-04-26 23:44 Pierre-Loup A. Griffais
  2013-04-27  1:53 ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-26 23:44 UTC (permalink / raw)
  To: hannes; +Cc: linux-kernel, torvalds, riel, sonnyrao, kamezawa.hiroyu, akpm

I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a 
180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it 
takes between two and three minutes. It looks like a similar throughput 
regression happens on any machine running an i386 PAE kernel with high 
amounts of memory; the threshold seems to be 16G; passing mem=15G to the 
kernel commandline fixes it.

I bisected it to the following change:

commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
Author: Johannes Weiner <jweiner@redhat.com>
Date:   Tue Jan 10 15:07:42 2012 -0800

     mm: exclude reserved pages from dirtyable memory

I realize running x86 kernels against high amounts of memory is not
advised for various reasons, but I would not expect such a big
regression in basic functionality to be one of them. Is that
accurate, or are these configurations expected to become unusable from
3.3 onwards?

Also CCing Sonny since it looks like he tried to fix an overflow issue 
related to the same change with commit c8b74c2f66049, but I'm still 
experiencing the problem with a kernel built from master.

Thanks,
  - Pierre-Loup


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-26 23:44 IO regression after ab8fabd46f on x86 kernels with high memory Pierre-Loup A. Griffais
@ 2013-04-27  1:53 ` Rik van Riel
  2013-04-27  2:42   ` Johannes Weiner
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2013-04-27  1:53 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais
  Cc: hannes, linux-kernel, torvalds, sonnyrao, kamezawa.hiroyu, akpm

On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> takes between two and three minutes. It looks like a similar throughput
> regression happens on any machine running an i386 PAE kernel with high
> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> kernel commandline fixes it.

If you have that much memory in the system, you will
want to run a 64 bit kernel to avoid all kinds of
memory management corner cases.

> I bisected it to the following change:
>
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner <jweiner@redhat.com>
> Date:   Tue Jan 10 15:07:42 2012 -0800
>
>      mm: exclude reserved pages from dirtyable memory
>
> I realize running x86 kernels against high amounts of memory is not
> advised for various reasons, but I would assume that such a big
> regression in basic functionality to not be part of them. Is that
> accurate, or are these configurations expected to become unusable from
> 3.3 onwards?

Reverting that patch would probably break i686 PAE systems with
lots of memory at a different threshold.

With more than 8-12GB of memory, an i686 kernel is between a
rock and a hard place. Whether you move it closer to the rock,
or closer to the hard place, all you do is change the way in
which it breaks.

> Also CCing Sonny since it looks like he tried to fix an overflow issue
> related to the same change with commit c8b74c2f66049, but I'm still
> experiencing the problem with a kernel built from master.
>
> Thanks,
>   - Pierre-Loup


-- 
All rights reversed


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-27  1:53 ` Rik van Riel
@ 2013-04-27  2:42   ` Johannes Weiner
  2013-04-29 21:53     ` Pierre-Loup A. Griffais
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2013-04-27  2:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pierre-Loup A. Griffais, linux-kernel, torvalds, sonnyrao,
	kamezawa.hiroyu, akpm

On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> >I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> >180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> >takes between two and three minutes. It looks like a similar throughput
> >regression happens on any machine running an i386 PAE kernel with high
> >amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> >kernel commandline fixes it.
> 
> If you have that much memory in the system, you will
> want to run a 64 bit kernel to avoid all kinds of
> memory management corner cases.

Agreed.  You can even keep your 32 bit userland, just swap the
kernel...

> >I bisected it to the following change:
> >
> >commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> >Author: Johannes Weiner <jweiner@redhat.com>
> >Date:   Tue Jan 10 15:07:42 2012 -0800
> >
> >     mm: exclude reserved pages from dirtyable memory
> >
> >I realize running x86 kernels against high amounts of memory is not
> >advised for various reasons, but I would assume that such a big
> >regression in basic functionality to not be part of them. Is that
> >accurate, or are these configurations expected to become unusable from
> >3.3 onwards?
> 
> Reverting that patch would probably break i686 PAE systems with
> lots of memory at a different threshold.

It would also re-introduce the reclaim stalls when zones with very
little page cache due to lowmem reserves end up with a large
percentage of their LRU dirty.  And that affects modern machines too,
because of the lowmem reserves in DMA32 due to relatively bigger
Normal zones.

On such large highmem machines, however, the imbalance between highmem
and lowmem is so enormous that the lowmem reserves basically exclude
all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of
sanely allowable dirty memory is actually a function of lowmem, not
highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is
not considered dirtyable out of the box, the amount of dirtyable
memory on your machine is 0.  You can work around this by setting
vm.highmem_is_dirtyable=1.
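
The accounting described above can be sketched with a toy user-space
model. This is not the kernel's code: the struct, the function names and
the idea of a single per-zone "reserve" figure are simplifications made
up for illustration. It only shows why the sum comes out as 0 when
lowmem is entirely reserved and highmem is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one memory zone; all counts are in pages (hypothetical). */
struct zone_model {
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long reserve_pages;	/* pages held back from page cache use */
	bool is_highmem;
};

/* Pages of a zone that may hold dirty page cache, clamped at zero. */
static unsigned long zone_dirtyable(const struct zone_model *z)
{
	unsigned long usable = z->free_pages + z->reclaimable_pages;

	return usable > z->reserve_pages ? usable - z->reserve_pages : 0;
}

/* Sum over all zones, skipping highmem unless highmem_is_dirtyable is set. */
static unsigned long dirtyable_memory(const struct zone_model *zones, size_t n,
				      bool highmem_is_dirtyable)
{
	unsigned long total = 0;

	assert(zones != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (zones[i].is_highmem && !highmem_is_dirtyable)
			continue;
		total += zone_dirtyable(&zones[i]);
	}
	return total;
}
```

With one lowmem zone whose reserve exceeds its free plus reclaimable
pages and one large highmem zone, the model yields 0 dirtyable pages by
default and a large non-zero value once highmem is counted, which is the
effect of the sysctl mentioned above.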


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-27  2:42   ` Johannes Weiner
@ 2013-04-29 21:53     ` Pierre-Loup A. Griffais
  2013-04-29 22:03       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-29 21:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, linux-kernel, torvalds, sonnyrao, kamezawa.hiroyu, akpm

On 04/26/2013 07:42 PM, Johannes Weiner wrote:
> On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
>> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
>>> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
>>> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
>>> takes between two and three minutes. It looks like a similar throughput
>>> regression happens on any machine running an i386 PAE kernel with high
>>> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
>>> kernel commandline fixes it.
>>
>> If you have that much memory in the system, you will
>> want to run a 64 bit kernel to avoid all kinds of
>> memory management corner cases.
>
> Agreed.  You can even keep your 32 bit userland, just swap the
> kernel...
>
>>> I bisected it to the following change:
>>>
>>> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
>>> Author: Johannes Weiner <jweiner@redhat.com>
>>> Date:   Tue Jan 10 15:07:42 2012 -0800
>>>
>>>      mm: exclude reserved pages from dirtyable memory
>>>
>>> I realize running x86 kernels against high amounts of memory is not
>>> advised for various reasons, but I would assume that such a big
>>> regression in basic functionality to not be part of them. Is that
>>> accurate, or are these configurations expected to become unusable from
>>> 3.3 onwards?
>>
>> Reverting that patch would probably break i686 PAE systems with
>> lots of memory at a different threshold.
>
> It would also re-introduce the reclaim stalls when zones with very
> little page cache due to lowmem reserves end up with a large
> percentage of their LRU dirty.  And that affects modern machines too,
> because of the lowmem reserves in DMA32 due to relatively bigger
> Normal zones.
>
> On such large highmem machines, however, the imbalance between highmem
> and lowmem is so enormous that the lowmem reserves basically exclude
> all of lowmem from page cache usage.
>
> But because dirty highmem creates lowmem pressure, and the amount of
> sanely allowable dirty memory is actually a function of lowmem, not
> highmem, highmem is not included in the amount of dirtyable memory.
>
> So because your lowmem is not available for page cache and highmem is
> not considered dirtyable out of the box, the amount of dirtyable
> memory on your machine is 0.  You can workaround this by setting
> vm.highmem_is_dirtyable=1.

I understand the technical concerns; we had some existing issues on 3.2 
with 24/32GB machines where the kernel would start erroneously 
OOM-killing new processes after a while; booting with mem=16G solved 
that. But now this goes a level further, since the machine is unusable 
upfront, right at boot, even with mem=16G. As such this clearly seems
like a regression more than a tradeoff.

We're in a situation where popular distros ship 32-bit as the default 
"use this if you're not sure what to get" option, with PAE also enabled 
by default. Most modern computers are shipping with more than 16G of RAM,
especially for gaming. Looking at the Steam HW survey data we have 
hundreds of users using this combination; this commit means that 
installing package updates that pull in a new kernel will immediately 
cause their system to become unusable.

Other than this particular concern, what's the high-level take-away? Is 
PAE support in the Linux kernel a false promise that distros should not
be shipping by default, if at all? Should it be removed from the kernel 
entirely if these configurations are knowingly broken by commits like this?

Thanks,
  - Pierre-Loup




* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 21:53     ` Pierre-Loup A. Griffais
@ 2013-04-29 22:03       ` Linus Torvalds
  2013-04-29 22:08         ` Pierre-Loup A. Griffais
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Linus Torvalds @ 2013-04-29 22:03 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais
  Cc: Johannes Weiner, Rik van Riel, Linux Kernel Mailing List,
	sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
<pgriffais@valvesoftware.com> wrote:
>
> Other than this particular concern, what's the high-level take-away? Is PAE
> support in the Linux kernel a false promise than distros should not be
> shipping by default, if at all? Should it be removed from the kernel
> entirely if these configurations are knowingly broken by commits like this?

PAE is "make it barely work". The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
"random noise".

Yeah, the difference between "we can cache it" and "we have to do IO"
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the "can
sit in the cache" to the "do lots of IO" situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.

                Linus


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 22:03       ` Linus Torvalds
@ 2013-04-29 22:08         ` Pierre-Loup A. Griffais
  2013-05-02  4:37           ` Sonny Rao
  2013-04-30  0:48         ` Rik van Riel
  2013-05-08 19:10         ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin
  2 siblings, 1 reply; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-29 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Rik van Riel, Linux Kernel Mailing List,
	sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton

On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
> <pgriffais@valvesoftware.com> wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise than distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't
> even understand *how* flawed and stupid that is.
>
> Don't do it. Upgrade to 64-bit, or live with the fact that IO
> performance will suck. The fact that it happened to work better under
> your particular load with one particular IO size is entirely just
> "random noise".
>
> Yeah, the difference between "we can cache it" and "we have to do IO"
> is huge. With a 32-bit kernel, we do IO much earlier now, just to
> avoid some really nasty situations. That makes you go from the "can
> sit in the cache" to the "do lots of IO" situation. Tough.
>
> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

All of this came from me trying to reproduce slowdowns reported by other 
people; I personally run a 64-bit kernel and understand how bad of an 
idea it is to attempt to run 32-bit kernels with PAE enabled on modern 
machines. However, my goal is to avoid ending up with a variety of 
end-users that don't necessarily understand this getting bitten by it 
and breaking their systems by upgrading their kernels. I will indeed 
bring this up with distributors and point out that shipping PAE kernels
by default is not a good idea given these problems and your stance on 
the matter.

Thanks,
  - Pierre-Loup

>
>                  Linus
>



* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 22:03       ` Linus Torvalds
  2013-04-29 22:08         ` Pierre-Loup A. Griffais
@ 2013-04-30  0:48         ` Rik van Riel
  2013-04-30  1:06           ` Pierre-Loup A. Griffais
  2013-05-02  1:34           ` Steven Rostedt
  2013-05-08 19:10         ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin
  2 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2013-04-30  0:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On 04/29/2013 06:03 PM, Linus Torvalds wrote:

> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

I can think of one way to "tune PAE" that will help
avoid the breakage, and at the same time draw the
attention of users.

Limit the memory that a 32 bit PAE kernel uses, to
something small enough where the user will not
encounter random breakage.  Maybe 8 or 12GB?

It could also print out a friendly message, to
inform the user they should upgrade to a 64 bit
kernel to enjoy the use of all of their memory.

It is a bit of a heavy stick, but I suspect that
it would clue in all of the affected users.

If you have no objection to this, I'll whip up a
patch.



* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-30  0:48         ` Rik van Riel
@ 2013-04-30  1:06           ` Pierre-Loup A. Griffais
  2013-05-02  1:34           ` Steven Rostedt
  1 sibling, 0 replies; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-04-30  1:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Johannes Weiner, Linux Kernel Mailing List,
	sonnyrao, KAMEZAWA Hiroyuki, Andrew Morton

On 04/29/2013 05:48 PM, Rik van Riel wrote:
> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> I can think of one way to "tune PAE" that will help
> avoid the breakage, and at the same time draw the
> attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to
> something small enough where the user will not
> encounter random breakage.  Maybe 8 or 12GB?
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.
>
> It is a bit of a heavy stick, but I suspect that
> it would clue in all of the affected users.
>
> If you have no objection to this, I'll whip up a
> patch.
>

That would be pretty useful, especially if I can then convince 
distributors to apply it and roll it out ASAP. I haven't personally 
observed any problems with mem=15G whereas mem=16G exhibits the IO issue 
upfront and more than that exhibits the OOM-killer / low memory 
starvation issue that existed before Johannes' change.

Thanks,
  - Pierre-Loup


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-30  0:48         ` Rik van Riel
  2013-04-30  1:06           ` Pierre-Loup A. Griffais
@ 2013-05-02  1:34           ` Steven Rostedt
  2013-05-02  2:46             ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2013-05-02  1:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
> 
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.

Oh, oh, oh!!! Can we use my message:

  http://lwn.net/Articles/501769/

OK, maybe it's not so friendly ;-)

-- Steve



* [PATCH] mm,x86: limit 32 bit kernel to 12GB memory
  2013-05-02  1:34           ` Steven Rostedt
@ 2013-05-02  2:46             ` Rik van Riel
  2013-05-02  7:37               ` Pierre-Loup A. Griffais
  2013-05-02 20:03               ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2013-05-02  2:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On Wed, 1 May 2013 21:34:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
> > 
> > It could also print out a friendly message, to
> > inform the user they should upgrade to a 64 bit
> > kernel to enjoy the use of all of their memory.
> 
> Oh, oh, oh!!! Can we use my message:
> 
>   http://lwn.net/Articles/501769/
> 
> OK, maybe it's not so friendly ;-)

Here's a somewhat friendlier one. Printing out the total amount of
memory in the system may give them some extra motivation to upgrade
to a 64 bit kernel :)

---8<----
Subject: mm,x86: limit 32 bit kernel to 12GB memory
 
Running 32 bit kernels on very large memory systems is a recipe
for disaster, due to fundamental architectural limits in both
Linux and the hardware. Moreover, all modern hardware with large
memory supports 64 bits.

However, many users continue using 32 bit kernels, and end up
encountering stability and performance problems as a result.

It may be better to save those people the frustration of stability
issues by limiting memory on a 32 bit kernel to 12GB (about the upper
limit that still works right), and printing a friendly reminder that
they really should be using a 64 bit kernel.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/setup.h |  1 +
 arch/x86/mm/init_32.c        | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..79de6bf 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -14,6 +14,7 @@
  */
 #define MAXMEM_PFN	PFN_DOWN(MAXMEM)
 #define MAX_NONPAE_PFN	(1 << 20)
+#define MAX_PAE_PFN	(3 << 20)
 
 #endif /* __i386__ */
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 3ac7e31..e35b3f5 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void)
 
 #define MSG_HIGHMEM_TRIMMED \
 	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
+
+#define MSG_HIGHMEM_INSANE \
+	"Warning: 32 bit kernels on large memory systems have problems.\n" \
+	"Limiting memory to 12GB for system stability.\n" \
+	"Use a 64 bit kernel to access all %lu MB of memory.\n"
+
 /*
  * We have more RAM than fits into lowmem - we try to put it into
  * highmem, also taking the highmem=x boot parameter into account:
@@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void)
 		max_pfn = MAX_NONPAE_PFN;
 		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
 	}
+#else /* !CONFIG_HIGHMEM64G */
+	if (max_pfn > MAX_PAE_PFN) {
+		printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8);
+		max_pfn = MAX_PAE_PFN;
+	}
 #endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 }
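
As a cross-check of the constants in the patch, here is a small
stand-alone sketch of the page-frame arithmetic. It assumes 4 KiB pages
and assumes the clamp target is MAX_PAE_PFN; the helper names
pfn_to_mb() and clamp_pae_pfn() are invented for illustration and do not
exist in the kernel.

```c
#include <assert.h>

#define MAX_NONPAE_PFN	(1UL << 20)	/* 2^20 pages of 4 KiB = 4GB */
#define MAX_PAE_PFN	(3UL << 20)	/* 3 * 2^20 pages of 4 KiB = 12GB */

/*
 * Convert a page frame count to MB. With 4 KiB pages (page_shift = 12)
 * there are 2^(20 - 12) = 256 pages per MB, hence the ">> 8" in the patch.
 */
static unsigned long pfn_to_mb(unsigned long pfn, unsigned int page_shift)
{
	assert(page_shift <= 20);
	return pfn >> (20 - page_shift);
}

/* Trim a page frame count to the 12GB limit (hypothetical helper). */
static unsigned long clamp_pae_pfn(unsigned long max_pfn)
{
	return max_pfn > MAX_PAE_PFN ? MAX_PAE_PFN : max_pfn;
}
```

Under these assumptions a 16GB machine has max_pfn = 4 << 20, the
warning would report 16384 MB, and the clamped value corresponds to
12288 MB.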


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 22:08         ` Pierre-Loup A. Griffais
@ 2013-05-02  4:37           ` Sonny Rao
  0 siblings, 0 replies; 16+ messages in thread
From: Sonny Rao @ 2013-05-02  4:37 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais
  Cc: Linus Torvalds, Johannes Weiner, Rik van Riel,
	Linux Kernel Mailing List, KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Apr 29, 2013 at 3:08 PM, Pierre-Loup A. Griffais
<pgriffais@valvesoftware.com> wrote:
> On 04/29/2013 03:03 PM, Linus Torvalds wrote:
>>
>> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
>> <pgriffais@valvesoftware.com> wrote:
>>>
>>>
>>> Other than this particular concern, what's the high-level take-away? Is
>>> PAE
>>> support in the Linux kernel a false promise than distros should not be
>>> shipping by default, if at all? Should it be removed from the kernel
>>> entirely if these configurations are knowingly broken by commits like
>>> this?
>>
>>
>> PAE is "make it barely work". The whole concept is fundamentally
>> flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't
>> even understand *how* flawed and stupid that is.
>>
>> Don't do it. Upgrade to 64-bit, or live with the fact that IO
>> performance will suck. The fact that it happened to work better under
>> your particular load with one particular IO size is entirely just
>> "random noise".
>>
>> Yeah, the difference between "we can cache it" and "we have to do IO"
>> is huge. With a 32-bit kernel, we do IO much earlier now, just to
>> avoid some really nasty situations. That makes you go from the "can
>> sit in the cache" to the "do lots of IO" situation. Tough.
>>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
>
> All of this came from me trying to reproduce slowdowns reported by other
> people; I personally run a 64-bit kernel and understand how bad of an idea
> it is to attempt to run 32-bit kernels with PAE enabled on modern machines.
> However, my goal is to avoid ending up with a variety of end-users that
> don't necessarily understand this getting bitten by it and breaking their
> systems by upgrading their kernels. I will indeed bring this up with
> distributors and point out than shipping PAE kernels by default is not a
> good idea given these problems and your stance on the matter.
>

Sorry, I just saw this (my stupid gmail filters for lkml). The slow-down
we ran into wasn't even on PAE -- it was *just* with highmem on a 2GB
system.  The non-zero amount (90MB? or so) of highmem was enough to
cause major problems due to that particular underflow.

I would say regardless of how much memory you have, if the system can
use a 64-bit kernel, then it almost certainly should.  I've seen some
very minor performance impacts on 64-bit capable Atom systems with
tiny L2 caches, but it's almost in the noise and not worth the pain.

> Thanks,
>  - Pierre-Loup
>
>>
>>                  Linus
>>
>


* Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory
  2013-05-02  2:46             ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel
@ 2013-05-02  7:37               ` Pierre-Loup A. Griffais
  2013-05-02 20:03               ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Pierre-Loup A. Griffais @ 2013-05-02  7:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Steven Rostedt, Linus Torvalds, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

Reviewed-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com>

On 05/01/2013 07:46 PM, Rik van Riel wrote:
> On Wed, 1 May 2013 21:34:26 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
>>>
>>> It could also print out a friendly message, to
>>> inform the user they should upgrade to a 64 bit
>>> kernel to enjoy the use of all of their memory.
>>
>> Oh, oh, oh!!! Can we use my message:
>>
>>    http://lwn.net/Articles/501769/
>>
>> OK, maybe it's not so friendly ;-)
>
> Here's a somewhat friendlier one. Printing out the total amount of
> memory in the system may give them some extra motivation to upgrade
> to a 64 bit kernel :)
>
> ---8<----
> Subject: mm,x86: limit 32 bit kernel to 12GB memory
>
> Running 32 bit kernels on very large memory systems is a recipe
> for disaster, due to fundamental architectural limits in both
> Linux and the hardware. Moreover, all modern hardware with large
> memory supports 64 bits.
>
> However, many users continue using 32 bit kernels, and end up
> encountering stability and performance problems as a result.
>
> It may be better to save those people the frustration of stability
> issues by limiting memory on a 32 bit kernel to 12GB (about the upper
> limit that still works right), and printing a friendly reminder that
> they really should be using a 64 bit kernel.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>   arch/x86/include/asm/setup.h |  1 +
>   arch/x86/mm/init_32.c        | 11 +++++++++++
>   2 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
> index b7bf350..79de6bf 100644
> --- a/arch/x86/include/asm/setup.h
> +++ b/arch/x86/include/asm/setup.h
> @@ -14,6 +14,7 @@
>    */
>   #define MAXMEM_PFN	PFN_DOWN(MAXMEM)
>   #define MAX_NONPAE_PFN	(1 << 20)
> +#define MAX_PAE_PFN	(3 << 20)
>
>   #endif /* __i386__ */
>
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 3ac7e31..e35b3f5 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -600,6 +600,12 @@ static void __init lowmem_pfn_init(void)
>
>   #define MSG_HIGHMEM_TRIMMED \
>   	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
> +
> +#define MSG_HIGHMEM_INSANE \
> +	"Warning: 32 bit kernels on large memory systems have problems.\n" \
> +	"Limiting memory to 12GB for system stability.\n" \
> +	"Use a 64 bit kernel to access all %lu MB of memory.\n"
> +
>   /*
>    * We have more RAM than fits into lowmem - we try to put it into
>    * highmem, also taking the highmem=x boot parameter into account:
> @@ -634,6 +640,11 @@ static void __init highmem_pfn_init(void)
>   		max_pfn = MAX_NONPAE_PFN;
>   		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
>   	}
> +#else /* !CONFIG_HIGHMEM64G */
> +	if (max_pfn > MAX_PAE_PFN) {
> +		printk(KERN_WARNING MSG_HIGHMEM_INSANE, max_pfn>>8);
> +		max_pfn = MAX_PAE_PFN;
> +	}
>   #endif /* !CONFIG_HIGHMEM64G */
>   #endif /* !CONFIG_HIGHMEM */
>   }
>



* Re: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory
  2013-05-02  2:46             ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel
  2013-05-02  7:37               ` Pierre-Loup A. Griffais
@ 2013-05-02 20:03               ` Linus Torvalds
  2013-05-11  9:16                 ` Yuhong Bao
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2013-05-02 20:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Steven Rostedt, Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On Wed, May 1, 2013 at 7:46 PM, Rik van Riel <riel@redhat.com> wrote:
>
> Here's a somewhat friendlier one. Printing out the total amount of
> memory in the system may give them some extra motivation to upgrade
> to a 64 bit kernel :)

This needs more work:

 - suggesting a 64-bit kernel on a truly 32-bit CPU is insane, so it
had better actually check the CPUID for 64-bit support ("lm" for "long
mode").

 - we don't remove features, so there should be a kernel command line
option to say "I'm insane, I know this is going to have problems, I
want you to try to use more memory anyway" and disable the new 12GB
limit

 - I don't think it's necessarily "system stability". The problem with
large amounts of highmem ends up being that we end up using up almost
all of the lowmem just to *track* the huge amount of highmem, and then
we have so little lowmem that we suck at performance and have various
random problems. So it's not just "system stability", it's more fluid
than that.
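
To put rough numbers on the "lowmem used to track highmem" point, the
sketch below estimates how much per-page metadata a given amount of RAM
needs. The figures are assumptions, not taken from this thread: 4 KiB
pages, and a caller-supplied size for the per-page structure (32 bytes
is used in the example as a plausible 32-bit value; the real size
depends on the kernel configuration). The function name is invented.

```c
#include <assert.h>

#define PAGES_PER_MB 256ULL	/* assumes 4 KiB pages */

/* MB of per-page tracking metadata needed to describe ram_mb MB of RAM. */
static unsigned long long page_metadata_mb(unsigned long long ram_mb,
					   unsigned int bytes_per_page_struct)
{
	assert(bytes_per_page_struct > 0);
	return (ram_mb * PAGES_PER_MB * bytes_per_page_struct) >> 20;
}
```

With the assumed 32 bytes per page, 16GB of RAM needs about 128 MB of
metadata and 64GB needs about 512 MB. On a 32-bit kernel that metadata
has to live in lowmem, which is typically well under 1GB in total, so
the share left for everything else shrinks quickly as RAM grows.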

The "it's more fluid than that" is also why I'd want to have a way to
override it. Using up all lowmem to track highmem is actually ok under
some very specific loads. If you have a setup where you have tons of
highmem, but all it is ever used for is anonymous user pages, you
don't need a lot of lowmem. Some of the craziest PAE users were that
class of use, and for all we know there are still crazy people with
real 32-bit CPUs that want to do it.

We don't really want to support it, we don't really care, but I don't
think we want to then say "you cannot do that" either. We want to say
"you're a f*cking crazy moron, and we don't think what you do is a
good idea, but if you absolutely want to shoot yourself in the
foot, here's how to do it. Don't expect things to work well in
general, but you might have a load where it's acceptable".

                  Linus


* Re: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-04-29 22:03       ` Linus Torvalds
  2013-04-29 22:08         ` Pierre-Loup A. Griffais
  2013-04-30  0:48         ` Rik van Riel
@ 2013-05-08 19:10         ` H. Peter Anvin
  2013-06-03  1:17           ` Yuhong Bao
  2 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2013-05-08 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pierre-Loup A. Griffais, Johannes Weiner, Rik van Riel,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki,
	Andrew Morton

On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
> <pgriffais@valvesoftware.com> wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
> 
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
> 

Let's be straight... the problem isn't PAE per se, the problem is
*HIGHMEM*.  PAE just allows HIGHMEM to stretch further into problematic
territory.

Distros install PAE kernels by default because it is required to support
NX.  That is fine.

The problem is that once your memory crosses the HIGHMEM threshold
-- 896 MiB in the normal configuration -- then you are in "this is going
to hurt" territory.  I have seen HIGHMEM devastate performance without
even crossing the 4 GiB threshold where PAE is required.

We kernel guys have been asking the distros to ship 64-bit kernels even
in their 32-bit distros for many years, but concerns of compat issues
and the desire to deprecate 32-bit userspace seem to have kept that
from happening.

	-hpa


* RE: [PATCH] mm,x86: limit 32 bit kernel to 12GB memory
  2013-05-02 20:03               ` Linus Torvalds
@ 2013-05-11  9:16                 ` Yuhong Bao
  0 siblings, 0 replies; 16+ messages in thread
From: Yuhong Bao @ 2013-05-11  9:16 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel
  Cc: Steven Rostedt, Pierre-Loup A. Griffais, Johannes Weiner,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki

>  - I don't think it's necessarily "system stability". The problem with
> large amounts of highmem ends up being that we end up using up almost
> all of the lowmem just to *track* the huge amount of highmem, and then
> we have so little lowmem that we suck at performance and have various
> random problems. So it's not just "system stability", it's more fluid
> than that.

FYI, 32-bit Windows already limits memory to 16GB when the 3G/1G split
is used, for a similar reason. (It defaults to a 2G/2G split.)

Yuhong Bao


* RE: IO regression after ab8fabd46f on x86 kernels with high memory
  2013-05-08 19:10         ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin
@ 2013-06-03  1:17           ` Yuhong Bao
  0 siblings, 0 replies; 16+ messages in thread
From: Yuhong Bao @ 2013-06-03  1:17 UTC (permalink / raw)
  To: H. Peter Anvin, Linus Torvalds
  Cc: Pierre-Loup A. Griffais, Johannes Weiner, Rik van Riel,
	Linux Kernel Mailing List, sonnyrao, KAMEZAWA Hiroyuki

> We kernel guys have been asking the distros to ship 64-bit kernels even
> in their 32-bit distros for many years, but concerns of compat issues
> and the desire to deprecate 32-bit userspace seem to have kept that
> from happening.

And now there is another reason: to call 64-bit EFI runtime services.
In retrospect, I would have stuck with 32-bit EFI, with 64-bit kernels
calling runtime services in compatibility mode, but of course it is too
late for that now.

Yuhong Bao


end of thread, ~2013-06-03  1:24 UTC

Thread overview: 16+ messages
2013-04-26 23:44 IO regression after ab8fabd46f on x86 kernels with high memory Pierre-Loup A. Griffais
2013-04-27  1:53 ` Rik van Riel
2013-04-27  2:42   ` Johannes Weiner
2013-04-29 21:53     ` Pierre-Loup A. Griffais
2013-04-29 22:03       ` Linus Torvalds
2013-04-29 22:08         ` Pierre-Loup A. Griffais
2013-05-02  4:37           ` Sonny Rao
2013-04-30  0:48         ` Rik van Riel
2013-04-30  1:06           ` Pierre-Loup A. Griffais
2013-05-02  1:34           ` Steven Rostedt
2013-05-02  2:46             ` [PATCH] mm,x86: limit 32 bit kernel to 12GB memory Rik van Riel
2013-05-02  7:37               ` Pierre-Loup A. Griffais
2013-05-02 20:03               ` Linus Torvalds
2013-05-11  9:16                 ` Yuhong Bao
2013-05-08 19:10         ` IO regression after ab8fabd46f on x86 kernels with high memory H. Peter Anvin
2013-06-03  1:17           ` Yuhong Bao
