linux-parisc.vger.kernel.org archive mirror
* Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
@ 2019-04-06 15:20 Mikulas Patocka
  2019-04-06 17:26 ` Mikulas Patocka
  2019-04-08  9:52 ` Mel Gorman
  0 siblings, 2 replies; 9+ messages in thread
From: Mikulas Patocka @ 2019-04-06 15:20 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Helge Deller, James E.J. Bottomley,
	John David Anglin, linux-parisc, linux-mm
  Cc: Vlastimil Babka, Andrea Arcangeli, Zi Yan

Hi

The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small 
amounts of memory when an external fragmentation event occurs") breaks 
memory management on parisc.

I have a parisc machine with 7GiB RAM, the chipset maps the physical 
memory to three zones:
	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
(but it is not NUMA)

With the patch 1c30844d2, the kernel will incorrectly reclaim the first 
zone when it fills up, ignoring the fact that there are two completely 
free zones. Basically, it limits cache size to 1GiB.

For example, if I run:
# dd if=/dev/sda of=/dev/null bs=1M count=2048

- with the proper kernel, there should be "Buffers - 2GiB" when this 
command finishes. With the patch 1c30844d2, buffers will consume just 1GiB 
or slightly more, because the kernel was incorrectly reclaiming them.
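
A minimal way to reproduce this (assuming /dev/sda holds at least 2GiB of
data; the "Buffers" figure is read from /proc/meminfo):

# sync; echo 3 > /proc/sys/vm/drop_caches	# start with an empty cache
# dd if=/dev/sda of=/dev/null bs=1M count=2048
# grep Buffers /proc/meminfo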

Mikulas


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-06 15:20 Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs" Mikulas Patocka
@ 2019-04-06 17:26 ` Mikulas Patocka
  2019-04-08  9:52 ` Mel Gorman
  1 sibling, 0 replies; 9+ messages in thread
From: Mikulas Patocka @ 2019-04-06 17:26 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Helge Deller, James E.J. Bottomley,
	John David Anglin, linux-parisc, linux-mm
  Cc: Vlastimil Babka, Andrea Arcangeli, Zi Yan



On Sat, 6 Apr 2019, Mikulas Patocka wrote:

> Hi
> 
> The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small 
> amounts of memory when an external fragmentation event occurs") breaks 
> memory management on parisc.
> 
> I have a parisc machine with 7GiB RAM, the chipset maps the physical 
> memory to three zones:
> 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> (but it is not NUMA)
> 
> With the patch 1c30844d2, the kernel will incorrectly reclaim the first 
> zone when it fills up, ignoring the fact that there are two completely 
> free zones. Basically, it limits cache size to 1GiB.
> 
> For example, if I run:
> # dd if=/dev/sda of=/dev/null bs=1M count=2048
> 
> - with the proper kernel, there should be "Buffers - 2GiB" when this 
> command finishes. With the patch 1c30844d2, buffers will consume just 1GiB 
> or slightly more, because the kernel was incorrectly reclaiming them.
> 
> Mikulas

BTW, 3 years ago, there was exactly the same bug: 
https://marc.info/?l=linux-kernel&m=146472966215941&w=2

Mikulas


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-06 15:20 Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs" Mikulas Patocka
  2019-04-06 17:26 ` Mikulas Patocka
@ 2019-04-08  9:52 ` Mel Gorman
  2019-04-08 11:10   ` Mikulas Patocka
  2019-04-08 14:29   ` James Bottomley
  1 sibling, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2019-04-08  9:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Helge Deller, James E.J. Bottomley,
	John David Anglin, linux-parisc, linux-mm, Vlastimil Babka,
	Andrea Arcangeli, Zi Yan

On Sat, Apr 06, 2019 at 11:20:35AM -0400, Mikulas Patocka wrote:
> Hi
> 
> The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small 
> amounts of memory when an external fragmentation event occurs") breaks 
> memory management on parisc.
> 
> I have a parisc machine with 7GiB RAM, the chipset maps the physical 
> memory to three zones:
> 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> (but it is not NUMA)
> 
> With the patch 1c30844d2, the kernel will incorrectly reclaim the first 
> zone when it fills up, ignoring the fact that there are two completely 
> free zones. Basically, it limits cache size to 1GiB.
> 
> For example, if I run:
> # dd if=/dev/sda of=/dev/null bs=1M count=2048
> 
> - with the proper kernel, there should be "Buffers - 2GiB" when this 
> command finishes. With the patch 1c30844d2, buffers will consume just 1GiB 
> or slightly more, because the kernel was incorrectly reclaiming them.
> 

I could argue that the feature is behaving as expected for separate
pgdats but that's neither here nor there. The bug is real but I have a
few questions.

First, if pa-risc is !NUMA then why are separate local ranges
represented as separate nodes? Is it because of DISCONTIGMEM or something
else? DISCONTIGMEM is before my time so I'm not familiar with it and
I consider it "essentially dead" but the arch init code seems to setup
pgdats for each physical contiguous range so it's a possibility. The most
likely explanation is pa-risc does not have hardware with addressing
limitations smaller than the CPUs physical address limits and it's
possible to have more ranges than available zones but clarification would
be nice.  By rights, SPARSEMEM would be supported on pa-risc but that
would be a time-consuming and somewhat futile exercise.  Regardless of the
explanation, as pa-risc does not appear to support transparent hugepages,
an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
as that commit was primarily about THP with secondary concerns around
SLUB. This is probably the most straight-forward solution but it'd need
a comment obviously. I do not know what the distro configurations for
pa-risc set as I'm not a user of gentoo or debian.
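
A minimal sketch of what such a special case might look like (purely
illustrative, not an actual patch; the sysctl's default is defined in
mm/page_alloc.c):

	/*
	 * Sketch: leave the boost disabled by default when the "nodes" are
	 * only an artifact of DISCONTIGMEM rather than a real NUMA topology.
	 */
	int watermark_boost_factor __read_mostly =
			IS_ENABLED(CONFIG_DISCONTIGMEM) ? 0 : 15000;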

Second, if you set the sysctl vm.watermark_boost_factor=0, does the
problem go away? If so, an option would be to set this sysctl to 0 by
default on distros that support pa-risc. Would that be suitable?
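
For reference, at runtime that is:

# sysctl -w vm.watermark_boost_factor=0

i.e. writing 0 to /proc/sys/vm/watermark_boost_factor.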

Finally, I'm sure this has been asked before, but why is pa-risc alive?
It appears a new CPU has not been manufactured since 2005. Even Alpha
I can understand being semi-alive since it's an interesting case for
weakly-ordered memory models. pa-risc appears to be supported and active
for debian at least so someone cares. It's not the only feature like this
that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
I'm looking at you, your machines are all dead since the early 2000's
AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.

-- 
Mel Gorman
SUSE Labs


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08  9:52 ` Mel Gorman
@ 2019-04-08 11:10   ` Mikulas Patocka
  2019-04-08 12:54     ` Mel Gorman
  2019-04-08 14:29   ` James Bottomley
  1 sibling, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2019-04-08 11:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Helge Deller, James E.J. Bottomley,
	John David Anglin, linux-parisc, linux-mm, Vlastimil Babka,
	Andrea Arcangeli, Zi Yan



On Mon, 8 Apr 2019, Mel Gorman wrote:

> On Sat, Apr 06, 2019 at 11:20:35AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small 
> > amounts of memory when an external fragmentation event occurs") breaks 
> > memory management on parisc.
> > 
> > I have a parisc machine with 7GiB RAM, the chipset maps the physical 
> > memory to three zones:
> > 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> > 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> > 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> > (but it is not NUMA)
> > 
> > With the patch 1c30844d2, the kernel will incorrectly reclaim the first 
> > zone when it fills up, ignoring the fact that there are two completely 
> > free zones. Basically, it limits cache size to 1GiB.
> > 
> > For example, if I run:
> > # dd if=/dev/sda of=/dev/null bs=1M count=2048
> > 
> > - with the proper kernel, there should be "Buffers - 2GiB" when this 
> > command finishes. With the patch 1c30844d2, buffers will consume just 1GiB 
> > or slightly more, because the kernel was incorrectly reclaiming them.
> > 
> 
> I could argue that the feature is behaving as expected for separate
> pgdats but that's neither here nor there. The bug is real but I have a
> few questions.
> 
> First, if pa-risc is !NUMA then why are separate local ranges
> represented as separate nodes? Is it because of DISCONTIGMEM or something
> else? DISCONTIGMEM is before my time so I'm not familiar with it and

I'm not an expert in this area, I don't know.

> I consider it "essentially dead" but the arch init code seems to setup
> pgdats for each physical contiguous range so it's a possibility. The most
> likely explanation is pa-risc does not have hardware with addressing
> limitations smaller than the CPUs physical address limits and it's
> possible to have more ranges than available zones but clarification would
> be nice.  By rights, SPARSEMEM would be supported on pa-risc but that
> would be a time-consuming and somewhat futile exercise.  Regardless of the
> explanation, as pa-risc does not appear to support transparent hugepages,
> an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
> as that commit was primarily about THP with secondary concerns around
> SLUB. This is probably the most straight-forward solution but it'd need
> a comment obviously. I do not know what the distro configurations for
> pa-risc set as I'm not a user of gentoo or debian.

I use Debian Sid, but I compile my own kernel. I uploaded the kernel 
.config here: 
http://people.redhat.com/~mpatocka/testcases/parisc-config.txt

> Second, if you set the sysctl vm.watermark_boost_factor=0, does the
> problem go away? If so, an option would be to set this sysctl to 0 by
> default on distros that support pa-risc. Would that be suitable?

I have tried it and the problem almost goes away. With 
vm.watermark_boost_factor=0, if I read 2GiB data from the disk, the buffer 
cache will contain about 1.8GiB. So, there's still some superfluous page 
reclaim, but it is smaller.


BTW. I'm interested - on real NUMA machines - is reclaiming the file cache 
really a better option than allocating the file cache from non-local node?


> Finally, I'm sure this has been asked before, but why is pa-risc alive?
> It appears a new CPU has not been manufactured since 2005. Even Alpha
> I can understand being semi-alive since it's an interesting case for
> weakly-ordered memory models. pa-risc appears to be supported and active
> for debian at least so someone cares. It's not the only feature like this
> that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
> I'm looking at you, your machines are all dead since the early 2000's
> AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.

I use it to test programs for portability to risc.

If one could choose between buying an expensive POWER system or a cheap 
pa-risc system, pa-risc may be a better choice. The last pa-risc model has 
four cores at 1.1GHz, so it is not completely unusable.

Mikulas

> -- 
> Mel Gorman
> SUSE Labs
> 


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08 11:10   ` Mikulas Patocka
@ 2019-04-08 12:54     ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2019-04-08 12:54 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Helge Deller, James E.J. Bottomley,
	John David Anglin, linux-parisc, linux-mm, Vlastimil Babka,
	Andrea Arcangeli, Zi Yan

On Mon, Apr 08, 2019 at 07:10:11AM -0400, Mikulas Patocka wrote:
> > First, if pa-risc is !NUMA then why are separate local ranges
> > represented as separate nodes? Is it because of DISCONTIGMEM or something
> > else? DISCONTIGMEM is before my time so I'm not familiar with it and
> 
> I'm not an expert in this area, I don't know.
> 

Ok.

> > I consider it "essentially dead" but the arch init code seems to setup
> > pgdats for each physical contiguous range so it's a possibility. The most
> > likely explanation is pa-risc does not have hardware with addressing
> > limitations smaller than the CPUs physical address limits and it's
> > possible to have more ranges than available zones but clarification would
> > be nice.  By rights, SPARSEMEM would be supported on pa-risc but that
> > would be a time-consuming and somewhat futile exercise.  Regardless of the
> > explanation, as pa-risc does not appear to support transparent hugepages,
> > an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
> > as that commit was primarily about THP with secondary concerns around
> > SLUB. This is probably the most straight-forward solution but it'd need
> > a comment obviously. I do not know what the distro configurations for
> > pa-risc set as I'm not a user of gentoo or debian.
> 
> I use Debian Sid, but I compile my own kernel. I uploaded the kernel 
> .config here: 
> http://people.redhat.com/~mpatocka/testcases/parisc-config.txt
> 

DISCONTIGMEM is set, so that matches the arch init code. Glancing at the
history, it seems my assumption was accurate: discontig used NUMA
structures for non-NUMA machines to allow code to be reused and to
simplify matters.

I'll put together a patch that disables this feature on DISCONTIGMEM, as
its behaviour there is surprising.

> > Second, if you set the sysctl vm.watermark_boost_factor=0, does the
> > problem go away? If so, an option would be to set this sysctl to 0 by
> > default on distros that support pa-risc. Would that be suitable?
> 
> I have tried it and the problem almost goes away. With 
> vm.watermark_boost_factor=0, if I read 2GiB data from the disk, the buffer 
> cache will contain about 1.8GiB. So, there's still some superfluous page 
> reclaim, but it is smaller.
> 

Ok, for NUMA, I would generally expect some small amounts of reclaim on
a per-node basis from kswapd waking up as the node fills. I know in your
case there is no NUMA but from a memory consumption/reclaim point of
view, it doesn't matter. There are multiple active node structures so
it's treated as such.
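
For what it's worth, the per-pgdat layout should be visible in
/proc/zoneinfo even without CONFIG_NUMA; each range gets its own
"Node N, zone ..." section:

$ grep '^Node' /proc/zoneinfo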

In the short term, I suggest you update /etc/sysctl.conf to work around
the issue.
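
Something along these lines, in /etc/sysctl.conf or a file under
/etc/sysctl.d/, applied with "sysctl -p" or at the next boot:

vm.watermark_boost_factor = 0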

> BTW. I'm interested - on real NUMA machines - is reclaiming the file cache 
> really a better option than allocating the file cache from non-local node?
> 

The patch is not related to file cache concerns; it's for the long-term
viability of high-order allocations, particularly THP, but also SLUB, which
uses high-order allocations by default.
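
If you want to watch the effect, the per-order free lists that external
fragmentation erodes are visible in /proc/buddyinfo, and /proc/pagetypeinfo
breaks them down further by migrate type:

$ cat /proc/buddyinfo
$ cat /proc/pagetypeinfo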

> 
> > Finally, I'm sure this has been asked before, but why is pa-risc alive?
> > It appears a new CPU has not been manufactured since 2005. Even Alpha
> > I can understand being semi-alive since it's an interesting case for
> > weakly-ordered memory models. pa-risc appears to be supported and active
> > for debian at least so someone cares. It's not the only feature like this
> > that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
> > I'm looking at you, your machines are all dead since the early 2000's
> > AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.
> 
> I use it to test programs for portability to risc.
> 
> If one could choose between buying an expensive POWER system or a cheap 
> pa-risc system, pa-risc may be a better choice. The last pa-risc model has 
> four cores at 1.1GHz, so it is not completely unusable.

Well if it was me and I was checking portability to risc, I'd probably
get hold of a raspberry pi but we all have different ways of looking at
things.

-- 
Mel Gorman
SUSE Labs


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08  9:52 ` Mel Gorman
  2019-04-08 11:10   ` Mikulas Patocka
@ 2019-04-08 14:29   ` James Bottomley
  2019-04-08 15:22     ` Helge Deller
  1 sibling, 1 reply; 9+ messages in thread
From: James Bottomley @ 2019-04-08 14:29 UTC (permalink / raw)
  To: Mel Gorman, Mikulas Patocka
  Cc: Andrew Morton, Helge Deller, John David Anglin, linux-parisc,
	linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> First, if pa-risc is !NUMA then why are separate local ranges
> represented as separate nodes? Is it because of DISCONTIGMEM or
> something else? DISCONTIGMEM is before my time so I'm not familiar
> with it and I consider it "essentially dead" but the arch init code
> seems to setup pgdats for each physical contiguous range so it's a
> possibility. The most likely explanation is pa-risc does not have
> hardware with addressing limitations smaller than the CPUs physical
> address limits and it's possible to have more ranges than available
> zones but clarification would be nice.

Let me try, since I remember the ancient history.  In the early days,
there had to be a single mem_map array covering all of physical memory.
 Some pa-risc systems had huge gaps in the physical memory; I think one
gap was somewhere around 1GB, so this led us to wasting huge amounts
of space in mem_map on non-existent memory.  What CONFIG_DISCONTIGMEM
did was allow you to represent this discontinuity on a non-NUMA system
using numa nodes, so we effectively got one node per discontiguous
range.  It's hacky, but it worked.  I thought we finally got converted
to sparsemem by the NUMA people, but I can't find the commit.

James



* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08 14:29   ` James Bottomley
@ 2019-04-08 15:22     ` Helge Deller
  2019-04-08 19:44       ` James Bottomley
  2019-04-09 20:09       ` Helge Deller
  0 siblings, 2 replies; 9+ messages in thread
From: Helge Deller @ 2019-04-08 15:22 UTC (permalink / raw)
  To: James Bottomley, Mel Gorman, Mikulas Patocka
  Cc: Andrew Morton, John David Anglin, linux-parisc, linux-mm,
	Vlastimil Babka, Andrea Arcangeli, Zi Yan

On 08.04.19 16:29, James Bottomley wrote:
> On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
>> First, if pa-risc is !NUMA then why are separate local ranges
>> represented as separate nodes? Is it because of DISCONTIGMEM or
>> something else? DISCONTIGMEM is before my time so I'm not familiar
>> with it and I consider it "essentially dead" but the arch init code
>> seems to setup pgdats for each physical contiguous range so it's a
>> possibility. The most likely explanation is pa-risc does not have
>> hardware with addressing limitations smaller than the CPUs physical
>> address limits and it's possible to have more ranges than available
>> zones but clarification would be nice.
>
> Let me try, since I remember the ancient history.  In the early days,
> there had to be a single mem_map array covering all of physical memory.
>  Some pa-risc systems had huge gaps in the physical memory; I think one
> gap was somewhere around 1GB, so this led us to wasting huge amounts
> of space in mem_map on non-existent memory.  What CONFIG_DISCONTIGMEM
> did was allow you to represent this discontinuity on a non-NUMA system
> using numa nodes, so we effectively got one node per discontiguous
> range.  It's hacky, but it worked.  I thought we finally got converted
> to sparsemem by the NUMA people, but I can't find the commit.

James, you tried once:
https://patchwork.kernel.org/patch/729441/

It seems we'd better move over to sparsemem now?

Helge


* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08 15:22     ` Helge Deller
@ 2019-04-08 19:44       ` James Bottomley
  2019-04-09 20:09       ` Helge Deller
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2019-04-08 19:44 UTC (permalink / raw)
  To: Helge Deller, Mel Gorman, Mikulas Patocka
  Cc: Andrew Morton, John David Anglin, linux-parisc, linux-mm,
	Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, 2019-04-08 at 17:22 +0200, Helge Deller wrote:
> On 08.04.19 16:29, James Bottomley wrote:
> > On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> > > First, if pa-risc is !NUMA then why are separate local ranges
> > > represented as separate nodes? Is it because of DISCONTIGMEM or
> > > something else? DISCONTIGMEM is before my time so I'm not
> > > familiar with it and I consider it "essentially dead" but the
> > > arch init code seems to setup pgdats for each physical contiguous
> > > range so it's a possibility. The most likely explanation is pa-
> > > risc does not have hardware with addressing limitations smaller
> > > than the CPUs physical address limits and it's possible to have
> > > more ranges than available zones but clarification would be nice.
> > 
> > Let me try, since I remember the ancient history.  In the early
> > days, there had to be a single mem_map array covering all of
> > physical memory.  Some pa-risc systems had huge gaps in the
> > physical memory; I think one gap was somewhere around 1GB, so this
> > led us to wasting huge amounts of space in mem_map on non-existent 
> > memory.  What CONFIG_DISCONTIGMEM did was allow you to represent
> > this discontinuity on a non-NUMA system using numa nodes, so we
> > effectively got one node per discontiguous range.  It's hacky, but
> > it worked.  I thought we finally got converted to sparsemem by the
> > NUMA people, but I can't find the commit.
> 
> James, you tried once:
> https://patchwork.kernel.org/patch/729441/

Ah, so what I was remembering as someone else's problem was, in fact,
my problem?  Hey, I should bottle my memory recall algorithms and sell
them as executive training courses.

> It seems we'd better move over to sparsemem now?

I think so.  The basics of the patch likely apply and hopefully in the
intervening 8 years some of the problems I identified have been fixed.

James



* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
  2019-04-08 15:22     ` Helge Deller
  2019-04-08 19:44       ` James Bottomley
@ 2019-04-09 20:09       ` Helge Deller
  1 sibling, 0 replies; 9+ messages in thread
From: Helge Deller @ 2019-04-09 20:09 UTC (permalink / raw)
  To: Helge Deller
  Cc: James Bottomley, Mel Gorman, Mikulas Patocka, Andrew Morton,
	John David Anglin, linux-parisc, linux-mm, Vlastimil Babka,
	Andrea Arcangeli, Zi Yan

* Helge Deller <deller@gmx.de>:
> On 08.04.19 16:29, James Bottomley wrote:
> > On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> >> First, if pa-risc is !NUMA then why are separate local ranges
> >> represented as separate nodes? Is it because of DISCONTIGMEM or
> >> something else? DISCONTIGMEM is before my time so I'm not familiar
> >> with it and I consider it "essentially dead" but the arch init code
> >> seems to setup pgdats for each physical contiguous range so it's a
> >> possibility. The most likely explanation is pa-risc does not have
> >> hardware with addressing limitations smaller than the CPUs physical
> >> address limits and it's possible to have more ranges than available
> >> zones but clarification would be nice.
> >
> > Let me try, since I remember the ancient history.  In the early days,
> > there had to be a single mem_map array covering all of physical memory.
> >  Some pa-risc systems had huge gaps in the physical memory; I think one
> > gap was somewhere around 1GB, so this led us to wasting huge amounts
> > of space in mem_map on non-existent memory.  What CONFIG_DISCONTIGMEM
> > did was allow you to represent this discontinuity on a non-NUMA system
> > using numa nodes, so we effectively got one node per discontiguous
> > range.  It's hacky, but it worked.  I thought we finally got converted
> > to sparsemem by the NUMA people, but I can't find the commit.
>
> James, you tried once:
> https://patchwork.kernel.org/patch/729441/
>
> It seems we'd better move over to sparsemem now?

Below is an updated patch to convert parisc from DISCONTIGMEM to
SPARSEMEM. It builds and boots for me on 32- and 64-bit machines.
Mikulas, could you check whether you still see the cache limited to 1GiB
with this patch applied?

Helge

---------------------

From 2c30c3a61bbfb56850862a7f7127416325fe126f Mon Sep 17 00:00:00 2001
From: Helge Deller <deller@gmx.de>
Date: Tue, 9 Apr 2019 21:52:35 +0200
Subject: [PATCH] parisc: Switch from DISCONTIGMEM to SPARSEMEM

Signed-off-by: Helge Deller <deller@gmx.de>

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index c8e6212..4f1397f 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -36,6 +36,7 @@ config PARISC
 	select GENERIC_STRNCPY_FROM_USER
 	select SYSCTL_ARCH_UNALIGN_ALLOW
 	select SYSCTL_EXCEPTION_TRACE
+	select ARCH_DISCARD_MEMBLOCK
 	select HAVE_MOD_ARCH_SPECIFIC
 	select VIRT_TO_BUS
 	select MODULES_USE_ELF_RELA
@@ -311,21 +312,16 @@ config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on 64BIT

-config ARCH_DISCONTIGMEM_ENABLE
+config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	depends on 64BIT

 config ARCH_FLATMEM_ENABLE
 	def_bool y

-config ARCH_DISCONTIGMEM_DEFAULT
+config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
-	depends on ARCH_DISCONTIGMEM_ENABLE
-
-config NODES_SHIFT
-	int
-	default "3"
-	depends on NEED_MULTIPLE_NODES
+	depends on ARCH_SPARSEMEM_ENABLE

 source "kernel/Kconfig.hz"

diff --git a/arch/parisc/include/asm/mmzone.h b/arch/parisc/include/asm/mmzone.h
index fafa389..8d39040 100644
--- a/arch/parisc/include/asm/mmzone.h
+++ b/arch/parisc/include/asm/mmzone.h
@@ -2,62 +2,6 @@
 #ifndef _PARISC_MMZONE_H
 #define _PARISC_MMZONE_H

-#define MAX_PHYSMEM_RANGES 8 /* Fix the size for now (current known max is 3) */
+#define MAX_PHYSMEM_RANGES 4 /* Fix the size for now (current known max is 3) */

-#ifdef CONFIG_DISCONTIGMEM
-
-extern int npmem_ranges;
-
-struct node_map_data {
-    pg_data_t pg_data;
-};
-
-extern struct node_map_data node_data[];
-
-#define NODE_DATA(nid)          (&node_data[nid].pg_data)
-
-/* We have these possible memory map layouts:
- * Astro: 0-3.75, 67.75-68, 4-64
- * zx1: 0-1, 257-260, 4-256
- * Stretch (N-class): 0-2, 4-32, 34-xxx
- */
-
-/* Since each 1GB can only belong to one region (node), we can create
- * an index table for pfn to nid lookup; each entry in pfnnid_map
- * represents 1GB, and contains the node that the memory belongs to. */
-
-#define PFNNID_SHIFT (30 - PAGE_SHIFT)
-#define PFNNID_MAP_MAX  512     /* support 512GB */
-extern signed char pfnnid_map[PFNNID_MAP_MAX];
-
-#ifndef CONFIG_64BIT
-#define pfn_is_io(pfn) ((pfn & (0xf0000000UL >> PAGE_SHIFT)) == (0xf0000000UL >> PAGE_SHIFT))
-#else
-/* io can be 0xf0f0f0f0f0xxxxxx or 0xfffffffff0000000 */
-#define pfn_is_io(pfn) ((pfn & (0xf000000000000000UL >> PAGE_SHIFT)) == (0xf000000000000000UL >> PAGE_SHIFT))
-#endif
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-	unsigned int i;
-
-	if (unlikely(pfn_is_io(pfn)))
-		return 0;
-
-	i = pfn >> PFNNID_SHIFT;
-	BUG_ON(i >= ARRAY_SIZE(pfnnid_map));
-
-	return pfnnid_map[i];
-}
-
-static inline int pfn_valid(int pfn)
-{
-	int nid = pfn_to_nid(pfn);
-
-	if (nid >= 0)
-		return (pfn < node_end_pfn(nid));
-	return 0;
-}
-
-#endif
 #endif /* _PARISC_MMZONE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index b77f49c..93caf17 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -147,9 +147,9 @@ extern int npmem_ranges;
 #define __pa(x)			((unsigned long)(x)-PAGE_OFFSET)
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 #define pfn_valid(pfn)		((pfn) < max_mapnr)
-#endif /* CONFIG_DISCONTIGMEM */
+#endif

 #ifdef CONFIG_HUGETLB_PAGE
 #define HPAGE_SHIFT		PMD_SHIFT /* fixed for transparent huge pages */
diff --git a/arch/parisc/include/asm/sparsemem.h b/arch/parisc/include/asm/sparsemem.h
new file mode 100644
index 0000000..b7d1dc9
--- /dev/null
+++ b/arch/parisc/include/asm/sparsemem.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_PARISC_SPARSEMEM_H
+#define ASM_PARISC_SPARSEMEM_H
+
+/* We have these possible memory map layouts:
+ * Astro: 0-3.75, 67.75-68, 4-64
+ * zx1: 0-1, 257-260, 4-256
+ * Stretch (N-class): 0-2, 4-32, 34-xxx
+ */
+
+#define MAX_PHYSMEM_BITS	42
+#define SECTION_SIZE_BITS	37
+
+#endif
diff --git a/arch/parisc/kernel/parisc_ksyms.c b/arch/parisc/kernel/parisc_ksyms.c
index 7baa226..174213b 100644
--- a/arch/parisc/kernel/parisc_ksyms.c
+++ b/arch/parisc/kernel/parisc_ksyms.c
@@ -138,12 +138,6 @@ extern void $$dyncall(void);
 EXPORT_SYMBOL($$dyncall);
 #endif

-#ifdef CONFIG_DISCONTIGMEM
-#include <asm/mmzone.h>
-EXPORT_SYMBOL(node_data);
-EXPORT_SYMBOL(pfnnid_map);
-#endif
-
 #ifdef CONFIG_FUNCTION_TRACER
 extern void _mcount(void);
 EXPORT_SYMBOL(_mcount);
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index d0b1662..9523394 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -48,11 +48,6 @@ pmd_t pmd0[PTRS_PER_PMD] __attribute__ ((__section__ (".data..vm0.pmd"), aligned
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __attribute__ ((__section__ (".data..vm0.pgd"), aligned(PAGE_SIZE)));
 pte_t pg0[PT_INITIAL * PTRS_PER_PTE] __attribute__ ((__section__ (".data..vm0.pte"), aligned(PAGE_SIZE)));

-#ifdef CONFIG_DISCONTIGMEM
-struct node_map_data node_data[MAX_NUMNODES] __read_mostly;
-signed char pfnnid_map[PFNNID_MAP_MAX] __read_mostly;
-#endif
-
 static struct resource data_resource = {
 	.name	= "Kernel data",
 	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
@@ -76,11 +71,11 @@ static struct resource sysram_resources[MAX_PHYSMEM_RANGES] __read_mostly;
  * information retrieved in kernel/inventory.c.
  */

-physmem_range_t pmem_ranges[MAX_PHYSMEM_RANGES] __read_mostly;
-int npmem_ranges __read_mostly;
+physmem_range_t pmem_ranges[MAX_PHYSMEM_RANGES] __initdata;
+int npmem_ranges __initdata;

 #ifdef CONFIG_64BIT
-#define MAX_MEM         (~0UL)
+#define MAX_MEM         (1UL << MAX_PHYSMEM_BITS)
 #else /* !CONFIG_64BIT */
 #define MAX_MEM         (3584U*1024U*1024U)
 #endif /* !CONFIG_64BIT */
@@ -119,7 +114,7 @@ static void __init mem_limit_func(void)
 static void __init setup_bootmem(void)
 {
 	unsigned long mem_max;
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	physmem_range_t pmem_holes[MAX_PHYSMEM_RANGES - 1];
 	int npmem_holes;
 #endif
@@ -137,23 +132,20 @@ static void __init setup_bootmem(void)
 		int j;

 		for (j = i; j > 0; j--) {
-			unsigned long tmp;
+			physmem_range_t tmp;

 			if (pmem_ranges[j-1].start_pfn <
 			    pmem_ranges[j].start_pfn) {

 				break;
 			}
-			tmp = pmem_ranges[j-1].start_pfn;
-			pmem_ranges[j-1].start_pfn = pmem_ranges[j].start_pfn;
-			pmem_ranges[j].start_pfn = tmp;
-			tmp = pmem_ranges[j-1].pages;
-			pmem_ranges[j-1].pages = pmem_ranges[j].pages;
-			pmem_ranges[j].pages = tmp;
+			tmp = pmem_ranges[j-1];
+			pmem_ranges[j-1] = pmem_ranges[j];
+			pmem_ranges[j] = tmp;
 		}
 	}

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	/*
 	 * Throw out ranges that are too far apart (controlled by
 	 * MAX_GAP).
@@ -165,7 +157,7 @@ static void __init setup_bootmem(void)
 			 pmem_ranges[i-1].pages) > MAX_GAP) {
 			npmem_ranges = i;
 			printk("Large gap in memory detected (%ld pages). "
-			       "Consider turning on CONFIG_DISCONTIGMEM\n",
+			       "Consider turning on CONFIG_SPARSEMEM\n",
 			       pmem_ranges[i].start_pfn -
 			       (pmem_ranges[i-1].start_pfn +
 			        pmem_ranges[i-1].pages));
@@ -230,9 +222,8 @@ static void __init setup_bootmem(void)

 	printk(KERN_INFO "Total Memory: %ld MB\n",mem_max >> 20);

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	/* Merge the ranges, keeping track of the holes */
-
 	{
 		unsigned long end_pfn;
 		unsigned long hole_pages;
@@ -255,18 +246,6 @@ static void __init setup_bootmem(void)
 	}
 #endif

-#ifdef CONFIG_DISCONTIGMEM
-	for (i = 0; i < MAX_PHYSMEM_RANGES; i++) {
-		memset(NODE_DATA(i), 0, sizeof(pg_data_t));
-	}
-	memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
-
-	for (i = 0; i < npmem_ranges; i++) {
-		node_set_state(i, N_NORMAL_MEMORY);
-		node_set_online(i);
-	}
-#endif
-
 	/*
 	 * Initialize and free the full range of memory in each range.
 	 */
@@ -314,7 +293,7 @@ static void __init setup_bootmem(void)
 	memblock_reserve(__pa(KERNEL_BINARY_TEXT_START),
 			(unsigned long)(_end - KERNEL_BINARY_TEXT_START));

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM

 	/* reserve the holes */

@@ -360,6 +339,9 @@ static void __init setup_bootmem(void)

 	/* Initialize Page Deallocation Table (PDT) and check for bad memory. */
 	pdc_pdt_init();
+
+	memblock_allow_resize();
+	memblock_dump_all();
 }

 static int __init parisc_text_address(unsigned long vaddr)
@@ -709,37 +691,46 @@ static void __init gateway_init(void)
 		  PAGE_SIZE, PAGE_GATEWAY, 1);
 }

-void __init paging_init(void)
+static void __init parisc_bootmem_free(void)
 {
+	unsigned long zones_size[MAX_NR_ZONES] = { 0, };
+	unsigned long holes_size[MAX_NR_ZONES] = { 0, };
+	unsigned long mem_start_pfn = ~0UL, mem_end_pfn = 0, mem_size_pfn = 0;
 	int i;

+	for (i = 0; i < npmem_ranges; i++) {
+		unsigned long start = pmem_ranges[i].start_pfn;
+		unsigned long size = pmem_ranges[i].pages;
+		unsigned long end = start + size;
+
+		if (mem_start_pfn > start)
+			mem_start_pfn = start;
+		if (mem_end_pfn < end)
+			mem_end_pfn = end;
+		mem_size_pfn += size;
+	}
+
+	zones_size[0] = mem_end_pfn - mem_start_pfn;
+	holes_size[0] = zones_size[0] - mem_size_pfn;
+
+	free_area_init_node(0, zones_size, mem_start_pfn, holes_size);
+}
+
+void __init paging_init(void)
+{
 	setup_bootmem();
 	pagetable_init();
 	gateway_init();
 	flush_cache_all_local(); /* start with known state */
 	flush_tlb_all_local(NULL);

-	for (i = 0; i < npmem_ranges; i++) {
-		unsigned long zones_size[MAX_NR_ZONES] = { 0, };
-
-		zones_size[ZONE_NORMAL] = pmem_ranges[i].pages;
-
-#ifdef CONFIG_DISCONTIGMEM
-		/* Need to initialize the pfnnid_map before we can initialize
-		   the zone */
-		{
-		    int j;
-		    for (j = (pmem_ranges[i].start_pfn >> PFNNID_SHIFT);
-			 j <= ((pmem_ranges[i].start_pfn + pmem_ranges[i].pages) >> PFNNID_SHIFT);
-			 j++) {
-			pfnnid_map[j] = i;
-		    }
-		}
-#endif
-
-		free_area_init_node(i, zones_size,
-				pmem_ranges[i].start_pfn, NULL);
-	}
+	/*
+	 * Mark all memblocks as present for sparsemem using
+	 * memory_present() and then initialize sparsemem.
+	 */
+	memblocks_present();
+	sparse_init();
+	parisc_bootmem_free();
 }

 #ifdef CONFIG_PA20

