* Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Mikulas Patocka @ 2019-04-06 15:20 UTC
To: Mel Gorman, Andrew Morton, Helge Deller, James E.J. Bottomley, John David Anglin, linux-parisc, linux-mm
Cc: Vlastimil Babka, Andrea Arcangeli, Zi Yan

Hi

The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small
amounts of memory when an external fragmentation event occurs") breaks
memory management on parisc.

I have a parisc machine with 7GiB RAM, the chipset maps the physical
memory to three zones:
	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
(but it is not NUMA)

With the patch 1c30844d2, the kernel will incorrectly reclaim the first
zone when it fills up, ignoring the fact that there are two completely
free zones. Basically, it limits cache size to 1GiB.

For example, if I run:
# dd if=/dev/sda of=/dev/null bs=1M count=2048

- with the proper kernel, there should be "Buffers - 2GiB" when this
command finishes. With the patch 1c30844d2, buffers will consume just 1GiB
or slightly more, because the kernel was incorrectly reclaiming them.

Mikulas

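The symptom in this report is the "Buffers" figure in /proc/meminfo after the dd run. A minimal C sketch of reading that figure is given here. The helper names meminfo_buffers_kib() and read_buffers_kib() are made up for illustration and do not belong to any tool mentioned in the thread; the sketch assumes the usual "Buffers: <n> kB" line format.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the "Buffers:" value in KiB from /proc/meminfo-style text,
 * or -1 if no such line is present. (Hypothetical helper.)
 */
long meminfo_buffers_kib(const char *text)
{
	const char *line = text;

	while (line && *line) {
		long kib;

		/* Only matches when the current line starts with "Buffers:". */
		if (sscanf(line, "Buffers: %ld kB", &kib) == 1)
			return kib;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/* Read /proc/meminfo and return Buffers in KiB, or -1 on error. */
long read_buffers_kib(void)
{
	char buf[8192];	/* assumption: /proc/meminfo fits in 8 KiB */
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return meminfo_buffers_kib(buf);
}
```

Calling read_buffers_kib() once the dd command has finished and dividing the result by 1024 * 1024 gives the buffer cache size in GiB. Going by the report, roughly 2 would be expected on an unaffected kernel and roughly 1 with commit 1c30844d2 applied.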
* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Mikulas Patocka @ 2019-04-06 17:26 UTC
To: Mel Gorman, Andrew Morton, Helge Deller, James E.J. Bottomley, John David Anglin, linux-parisc, linux-mm
Cc: Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Sat, 6 Apr 2019, Mikulas Patocka wrote:

> Hi
>
> The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small
> amounts of memory when an external fragmentation event occurs") breaks
> memory management on parisc.
>
> I have a parisc machine with 7GiB RAM, the chipset maps the physical
> memory to three zones:
> 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> (but it is not NUMA)
>
> With the patch 1c30844d2, the kernel will incorrectly reclaim the first
> zone when it fills up, ignoring the fact that there are two completely
> free zones. Basically, it limits cache size to 1GiB.
>
> For example, if I run:
> # dd if=/dev/sda of=/dev/null bs=1M count=2048
>
> - with the proper kernel, there should be "Buffers - 2GiB" when this
> command finishes. With the patch 1c30844d2, buffers will consume just 1GiB
> or slightly more, because the kernel was incorrectly reclaiming them.
>
> Mikulas

BTW, 3 years ago, there was exactly the same bug:
https://marc.info/?l=linux-kernel&m=146472966215941&w=2

Mikulas

* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Mel Gorman @ 2019-04-08 9:52 UTC
To: Mikulas Patocka
Cc: Andrew Morton, Helge Deller, James E.J. Bottomley, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Sat, Apr 06, 2019 at 11:20:35AM -0400, Mikulas Patocka wrote:
> Hi
>
> The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small
> amounts of memory when an external fragmentation event occurs") breaks
> memory management on parisc.
>
> I have a parisc machine with 7GiB RAM, the chipset maps the physical
> memory to three zones:
> 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> (but it is not NUMA)
>
> With the patch 1c30844d2, the kernel will incorrectly reclaim the first
> zone when it fills up, ignoring the fact that there are two completely
> free zones. Basically, it limits cache size to 1GiB.
>
> For example, if I run:
> # dd if=/dev/sda of=/dev/null bs=1M count=2048
>
> - with the proper kernel, there should be "Buffers - 2GiB" when this
> command finishes. With the patch 1c30844d2, buffers will consume just 1GiB
> or slightly more, because the kernel was incorrectly reclaiming them.
>

I could argue that the feature is behaving as expected for separate
pgdats but that's neither here nor there. The bug is real but I have a
few questions.

First, if pa-risc is !NUMA then why are separate local ranges
represented as separate nodes? Is it because of DISCONTIGMEM or something
else? DISCONTIGMEM is before my time so I'm not familiar with it and
I consider it "essentially dead" but the arch init code seems to set up
pgdats for each physical contiguous range so it's a possibility. The most
likely explanation is pa-risc does not have hardware with addressing
limitations smaller than the CPU's physical address limits and it's
possible to have more ranges than available zones but clarification would
be nice. By rights, SPARSEMEM would be supported on pa-risc but that
would be a time-consuming and somewhat futile exercise. Regardless of the
explanation, as pa-risc does not appear to support transparent hugepages,
an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
as that commit was primarily about THP with secondary concerns around
SLUB. This is probably the most straightforward solution but it'd need
a comment obviously. I do not know what the distro configurations for
pa-risc set as I'm not a user of gentoo or debian.

Second, if you set the sysctl vm.watermark_boost_factor=0, does the
problem go away? If so, an option would be to set this sysctl to 0 by
default on distros that support pa-risc. Would that be suitable?

Finally, I'm sure this has been asked before but why is pa-risc alive?
It appears a new CPU has not been manufactured since 2005. Even Alpha
I can understand being semi-alive since it's an interesting case for
weakly-ordered memory models. pa-risc appears to be supported and active
for debian at least so someone cares. It's not the only feature like this
that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
I'm looking at you, your machines are all dead since the early 2000's
AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.

-- 
Mel Gorman
SUSE Labs

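For context on the sysctl asked about in this message: going by the kernel's sysctl documentation, vm.watermark_boost_factor is expressed in fractions of 10,000 of a zone's high watermark, its default of 15000 allows a boost of up to 150% of the high watermark, and 0 disables the feature. The sketch below only illustrates that arithmetic. max_watermark_boost() is a hypothetical helper, not the kernel's own function, and the page counts used with it are example values.

```c
/*
 * Upper bound on the watermark boost, in pages, for a zone whose high
 * watermark is high_wmark_pages. boost_factor is in units of 1/10000,
 * as for vm.watermark_boost_factor. (Hypothetical helper.)
 */
unsigned long max_watermark_boost(unsigned long high_wmark_pages,
				  unsigned int boost_factor)
{
	/* Widen before multiplying so large watermarks cannot overflow. */
	unsigned long long scaled =
		(unsigned long long)high_wmark_pages * boost_factor;

	return (unsigned long)(scaled / 10000);
}
```

With an example high watermark of 1000 pages, a factor of 15000 bounds the boost at 1500 pages, while a factor of 0 bounds it at 0 pages, which is why setting the sysctl to 0 is proposed as a test.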
* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Mikulas Patocka @ 2019-04-08 11:10 UTC
To: Mel Gorman
Cc: Andrew Morton, Helge Deller, James E.J. Bottomley, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, 8 Apr 2019, Mel Gorman wrote:

> On Sat, Apr 06, 2019 at 11:20:35AM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > The patch 1c30844d2dfe272d58c8fc000960b835d13aa2ac ("mm: reclaim small
> > amounts of memory when an external fragmentation event occurs") breaks
> > memory management on parisc.
> >
> > I have a parisc machine with 7GiB RAM, the chipset maps the physical
> > memory to three zones:
> > 	0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
> > 	1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
> > 	2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> > (but it is not NUMA)
> >
> > With the patch 1c30844d2, the kernel will incorrectly reclaim the first
> > zone when it fills up, ignoring the fact that there are two completely
> > free zones. Basically, it limits cache size to 1GiB.
> >
> > For example, if I run:
> > # dd if=/dev/sda of=/dev/null bs=1M count=2048
> >
> > - with the proper kernel, there should be "Buffers - 2GiB" when this
> > command finishes. With the patch 1c30844d2, buffers will consume just 1GiB
> > or slightly more, because the kernel was incorrectly reclaiming them.
> >
>
> I could argue that the feature is behaving as expected for separate
> pgdats but that's neither here nor there. The bug is real but I have a
> few questions.
>
> First, if pa-risc is !NUMA then why are separate local ranges
> represented as separate nodes? Is it because of DISCONTIGMEM or something
> else? DISCONTIGMEM is before my time so I'm not familiar with it and

I'm not an expert in this area, I don't know.

> I consider it "essentially dead" but the arch init code seems to set up
> pgdats for each physical contiguous range so it's a possibility. The most
> likely explanation is pa-risc does not have hardware with addressing
> limitations smaller than the CPU's physical address limits and it's
> possible to have more ranges than available zones but clarification would
> be nice. By rights, SPARSEMEM would be supported on pa-risc but that
> would be a time-consuming and somewhat futile exercise. Regardless of the
> explanation, as pa-risc does not appear to support transparent hugepages,
> an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
> as that commit was primarily about THP with secondary concerns around
> SLUB. This is probably the most straightforward solution but it'd need
> a comment obviously. I do not know what the distro configurations for
> pa-risc set as I'm not a user of gentoo or debian.

I use Debian Sid, but I compile my own kernel. I uploaded the kernel
.config here:
http://people.redhat.com/~mpatocka/testcases/parisc-config.txt

> Second, if you set the sysctl vm.watermark_boost_factor=0, does the
> problem go away? If so, an option would be to set this sysctl to 0 by
> default on distros that support pa-risc. Would that be suitable?

I have tried it and the problem almost goes away. With
vm.watermark_boost_factor=0, if I read 2GiB of data from the disk, the
buffer cache will contain about 1.8GiB. So, there's still some superfluous
page reclaim, but it is smaller.

BTW. I'm interested - on real NUMA machines - is reclaiming the file cache
really a better option than allocating the file cache from a non-local node?

> Finally, I'm sure this has been asked before but why is pa-risc alive?
> It appears a new CPU has not been manufactured since 2005. Even Alpha
> I can understand being semi-alive since it's an interesting case for
> weakly-ordered memory models. pa-risc appears to be supported and active
> for debian at least so someone cares. It's not the only feature like this
> that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
> I'm looking at you, your machines are all dead since the early 2000's
> AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.

I use it to test programs for portability to risc.

If one could choose between buying an expensive power system or a cheap
pa-risc system, pa-risc may be a better choice. The last pa-risc model has
four cores at 1.1GHz, so it is not completely unusable.

Mikulas

> --
> Mel Gorman
> SUSE Labs
>

* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Mel Gorman @ 2019-04-08 12:54 UTC
To: Mikulas Patocka
Cc: Andrew Morton, Helge Deller, James E.J. Bottomley, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, Apr 08, 2019 at 07:10:11AM -0400, Mikulas Patocka wrote:
> > First, if pa-risc is !NUMA then why are separate local ranges
> > represented as separate nodes? Is it because of DISCONTIGMEM or something
> > else? DISCONTIGMEM is before my time so I'm not familiar with it and
>
> I'm not an expert in this area, I don't know.
>

Ok.

> > I consider it "essentially dead" but the arch init code seems to set up
> > pgdats for each physical contiguous range so it's a possibility. The most
> > likely explanation is pa-risc does not have hardware with addressing
> > limitations smaller than the CPU's physical address limits and it's
> > possible to have more ranges than available zones but clarification would
> > be nice. By rights, SPARSEMEM would be supported on pa-risc but that
> > would be a time-consuming and somewhat futile exercise. Regardless of the
> > explanation, as pa-risc does not appear to support transparent hugepages,
> > an option is to special case watermark_boost_factor to be 0 on DISCONTIGMEM
> > as that commit was primarily about THP with secondary concerns around
> > SLUB. This is probably the most straightforward solution but it'd need
> > a comment obviously. I do not know what the distro configurations for
> > pa-risc set as I'm not a user of gentoo or debian.
>
> I use Debian Sid, but I compile my own kernel. I uploaded the kernel
> .config here:
> http://people.redhat.com/~mpatocka/testcases/parisc-config.txt
>

DISCONTIGMEM is set so, based on the arch init code and a glance at
the history, it seems my assumption was accurate. Discontig used NUMA
structures for non-NUMA machines to allow code to be reused and simplify
matters. I'll put together a patch that disables this feature on
DISCONTIGMEM, where its behaviour is surprising.

> > Second, if you set the sysctl vm.watermark_boost_factor=0, does the
> > problem go away? If so, an option would be to set this sysctl to 0 by
> > default on distros that support pa-risc. Would that be suitable?
>
> I have tried it and the problem almost goes away. With
> vm.watermark_boost_factor=0, if I read 2GiB data from the disk, the buffer
> cache will contain about 1.8GiB. So, there's still some superfluous page
> reclaim, but it is smaller.
>

Ok, for NUMA, I would generally expect some small amounts of reclaim on a
per-node basis from kswapd waking up as the node fills. I know in your
case there is no NUMA but from a memory consumption/reclaim point of view,
it doesn't matter. There are multiple active node structures so it's
treated as such.

In the short-term, I suggest you update /etc/sysctl.conf to work around
the issue.

> BTW. I'm interested - on real NUMA machines - is reclaiming the file cache
> really a better option than allocating the file cache from non-local node?
>

The patch is not related to file cache concerns, it's for long-term
viability of high-order allocations, particularly THP but also SLUB which
uses high-order allocations by default.

>
> > Finally, I'm sure this has been asked before but why is pa-risc alive?
> > It appears a new CPU has not been manufactured since 2005. Even Alpha
> > I can understand being semi-alive since it's an interesting case for
> > weakly-ordered memory models. pa-risc appears to be supported and active
> > for debian at least so someone cares. It's not the only feature like this
> > that is bizarrely alive but it is curious -- 32 bit NUMA support on x86,
> > I'm looking at you, your machines are all dead since the early 2000's
> > AFAIK and anyone else using NUMA on 32-bit x86 needs their head examined.
>
> I use it to test programs for portability to risc.
>
> If one could choose between buying an expensive power system or a cheap
> pa-risc system, pa-risc may be a better choice. The last pa-risc model has
> four cores at 1.1GHz, so it is not completely unusable.

Well if it was me and I was checking portability to risc, I'd probably
get hold of a raspberry pi but we all have different ways of looking at
things.

-- 
Mel Gorman
SUSE Labs

* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: James Bottomley @ 2019-04-08 14:29 UTC
To: Mel Gorman, Mikulas Patocka
Cc: Andrew Morton, Helge Deller, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> First, if pa-risc is !NUMA then why are separate local ranges
> represented as separate nodes? Is it because of DISCONTIGMEM or
> something else? DISCONTIGMEM is before my time so I'm not familiar
> with it and I consider it "essentially dead" but the arch init code
> seems to set up pgdats for each physical contiguous range so it's a
> possibility. The most likely explanation is pa-risc does not have
> hardware with addressing limitations smaller than the CPU's physical
> address limits and it's possible to have more ranges than available
> zones but clarification would be nice.

Let me try, since I remember the ancient history. In the early days,
there had to be a single mem_map array covering all of physical memory.
Some pa-risc systems had huge gaps in the physical memory; I think one
gap was somewhere around 1GB, so this led us to wasting huge amounts
of space in mem_map on non-existent memory. What CONFIG_DISCONTIGMEM
did was allow you to represent this discontinuity on a non-NUMA system
using NUMA nodes, so we effectively got one node per discontiguous
range. It's hacky, but it worked. I thought we finally got converted
to sparsemem by the NUMA people, but I can't find the commit.

James

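The cost of a single flat mem_map described above can be estimated for the three ranges given in the original report. This is only a back-of-the-envelope sketch: the 4 KiB page size and the 64-byte struct page are assumptions, not values stated anywhere in the thread, and the type and function names are invented for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED	12	/* assumption: 4 KiB pages */
#define STRUCT_PAGE_BYTES	64	/* assumption: sizeof(struct page) */

struct phys_range {
	uint64_t start;	/* first byte of the range */
	uint64_t end;	/* last byte of the range (inclusive) */
};

/* The three ranges quoted in the original report. */
static const struct phys_range reported_ranges[3] = {
	{ 0x0000000000000000ULL, 0x000000003fffffffULL },
	{ 0x0000000100000000ULL, 0x00000001bfdfffffULL },
	{ 0x0000004040000000ULL, 0x00000040ffffffffULL },
};

/* mem_map bytes for one flat array covering 0 .. highest end address. */
uint64_t flat_mem_map_bytes(const struct phys_range *r, int n)
{
	uint64_t top = 0;
	int i;

	for (i = 0; i < n; i++)
		if (r[i].end + 1 > top)
			top = r[i].end + 1;
	return (top >> PAGE_SHIFT_ASSUMED) * STRUCT_PAGE_BYTES;
}

/* mem_map bytes when only pages that actually exist are described. */
uint64_t populated_mem_map_bytes(const struct phys_range *r, int n)
{
	uint64_t pages = 0;
	int i;

	for (i = 0; i < n; i++)
		pages += (r[i].end + 1 - r[i].start) >> PAGE_SHIFT_ASSUMED;
	return pages * STRUCT_PAGE_BYTES;
}
```

Under those assumptions, a flat array spanning 0 to 0x40ffffffff would take 4160 MiB, more than half of the machine's 7166 MiB of RAM, while describing only the populated pages takes about 112 MiB. That gap is the waste that one node per range avoided.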
* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Helge Deller @ 2019-04-08 15:22 UTC
To: James Bottomley, Mel Gorman, Mikulas Patocka
Cc: Andrew Morton, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On 08.04.19 16:29, James Bottomley wrote:
> On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
>> First, if pa-risc is !NUMA then why are separate local ranges
>> represented as separate nodes? Is it because of DISCONTIGMEM or
>> something else? DISCONTIGMEM is before my time so I'm not familiar
>> with it and I consider it "essentially dead" but the arch init code
>> seems to set up pgdats for each physical contiguous range so it's a
>> possibility. The most likely explanation is pa-risc does not have
>> hardware with addressing limitations smaller than the CPU's physical
>> address limits and it's possible to have more ranges than available
>> zones but clarification would be nice.
>
> Let me try, since I remember the ancient history. In the early days,
> there had to be a single mem_map array covering all of physical memory.
> Some pa-risc systems had huge gaps in the physical memory; I think one
> gap was somewhere around 1GB, so this led us to wasting huge amounts
> of space in mem_map on non-existent memory. What CONFIG_DISCONTIGMEM
> did was allow you to represent this discontinuity on a non-NUMA system
> using NUMA nodes, so we effectively got one node per discontiguous
> range. It's hacky, but it worked. I thought we finally got converted
> to sparsemem by the NUMA people, but I can't find the commit.

James, you tried once:
https://patchwork.kernel.org/patch/729441/

It seems we had better move over to sparsemem now?

Helge

* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: James Bottomley @ 2019-04-08 19:44 UTC
To: Helge Deller, Mel Gorman, Mikulas Patocka
Cc: Andrew Morton, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

On Mon, 2019-04-08 at 17:22 +0200, Helge Deller wrote:
> On 08.04.19 16:29, James Bottomley wrote:
> > On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> > > First, if pa-risc is !NUMA then why are separate local ranges
> > > represented as separate nodes? Is it because of DISCONTIGMEM or
> > > something else? DISCONTIGMEM is before my time so I'm not
> > > familiar with it and I consider it "essentially dead" but the
> > > arch init code seems to set up pgdats for each physical contiguous
> > > range so it's a possibility. The most likely explanation is
> > > pa-risc does not have hardware with addressing limitations smaller
> > > than the CPU's physical address limits and it's possible to have
> > > more ranges than available zones but clarification would be nice.
> >
> > Let me try, since I remember the ancient history. In the early
> > days, there had to be a single mem_map array covering all of
> > physical memory. Some pa-risc systems had huge gaps in the
> > physical memory; I think one gap was somewhere around 1GB, so this
> > led us to wasting huge amounts of space in mem_map on non-existent
> > memory. What CONFIG_DISCONTIGMEM did was allow you to represent
> > this discontinuity on a non-NUMA system using NUMA nodes, so we
> > effectively got one node per discontiguous range. It's hacky, but
> > it worked. I thought we finally got converted to sparsemem by the
> > NUMA people, but I can't find the commit.
>
> James, you tried once:
> https://patchwork.kernel.org/patch/729441/

Ah, so what I was remembering as someone else's problem was, in fact,
my problem? Hey, I should bottle my memory recall algorithms and sell
them as executive training courses.

> It seems we better should move over to sparsemem now?

I think so. The basics of the patch likely apply and hopefully in the
intervening 8 years some of the problems I identified have been fixed.

James

* Re: Memory management broken by "mm: reclaim small amounts of memory when an external fragmentation event occurs"
From: Helge Deller @ 2019-04-09 20:09 UTC
To: Helge Deller
Cc: James Bottomley, Mel Gorman, Mikulas Patocka, Andrew Morton, John David Anglin, linux-parisc, linux-mm, Vlastimil Babka, Andrea Arcangeli, Zi Yan

* Helge Deller <deller@gmx.de>:
> On 08.04.19 16:29, James Bottomley wrote:
> > On Mon, 2019-04-08 at 10:52 +0100, Mel Gorman wrote:
> >> First, if pa-risc is !NUMA then why are separate local ranges
> >> represented as separate nodes? Is it because of DISCONTIGMEM or
> >> something else? DISCONTIGMEM is before my time so I'm not familiar
> >> with it and I consider it "essentially dead" but the arch init code
> >> seems to set up pgdats for each physical contiguous range so it's a
> >> possibility. The most likely explanation is pa-risc does not have
> >> hardware with addressing limitations smaller than the CPU's physical
> >> address limits and it's possible to have more ranges than available
> >> zones but clarification would be nice.
> >
> > Let me try, since I remember the ancient history. In the early days,
> > there had to be a single mem_map array covering all of physical memory.
> > Some pa-risc systems had huge gaps in the physical memory; I think one
> > gap was somewhere around 1GB, so this led us to wasting huge amounts
> > of space in mem_map on non-existent memory. What CONFIG_DISCONTIGMEM
> > did was allow you to represent this discontinuity on a non-NUMA system
> > using NUMA nodes, so we effectively got one node per discontiguous
> > range. It's hacky, but it worked. I thought we finally got converted
> > to sparsemem by the NUMA people, but I can't find the commit.
>
> James, you tried once:
> https://patchwork.kernel.org/patch/729441/
>
> It seems we better should move over to sparsemem now?

Below is an updated patch to convert parisc from DISCONTIGMEM to SPARSEMEM.
It builds and boots for me on 32- and 64-bit machines.

Mikulas, could you check whether you still see the cache limited to 1GiB
with this patch applied?

Helge

---------------------

From 2c30c3a61bbfb56850862a7f7127416325fe126f Mon Sep 17 00:00:00 2001
From: Helge Deller <deller@gmx.de>
Date: Tue, 9 Apr 2019 21:52:35 +0200
Subject: [PATCH] parisc: Switch from DISCONTIGMEM to SPARSEMEM

Signed-off-by: Helge Deller <deller@gmx.de>

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index c8e6212..4f1397f 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -36,6 +36,7 @@ config PARISC
 	select GENERIC_STRNCPY_FROM_USER
 	select SYSCTL_ARCH_UNALIGN_ALLOW
 	select SYSCTL_EXCEPTION_TRACE
+	select ARCH_DISCARD_MEMBLOCK
 	select HAVE_MOD_ARCH_SPECIFIC
 	select VIRT_TO_BUS
 	select MODULES_USE_ELF_RELA
@@ -311,21 +312,16 @@ config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on 64BIT
 
-config ARCH_DISCONTIGMEM_ENABLE
+config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	depends on 64BIT
 
 config ARCH_FLATMEM_ENABLE
 	def_bool y
 
-config ARCH_DISCONTIGMEM_DEFAULT
+config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
-	depends on ARCH_DISCONTIGMEM_ENABLE
-
-config NODES_SHIFT
-	int
-	default "3"
-	depends on NEED_MULTIPLE_NODES
+	depends on ARCH_SPARSEMEM_ENABLE
 
 source "kernel/Kconfig.hz"
 
diff --git a/arch/parisc/include/asm/mmzone.h b/arch/parisc/include/asm/mmzone.h
index fafa389..8d39040 100644
--- a/arch/parisc/include/asm/mmzone.h
+++ b/arch/parisc/include/asm/mmzone.h
@@ -2,62 +2,6 @@
 #ifndef _PARISC_MMZONE_H
 #define _PARISC_MMZONE_H
 
-#define MAX_PHYSMEM_RANGES 8 /* Fix the size for now (current known max is 3) */
+#define MAX_PHYSMEM_RANGES 4 /* Fix the size for now (current known max is 3) */
 
-#ifdef CONFIG_DISCONTIGMEM
-
-extern int npmem_ranges;
-
-struct node_map_data {
-    pg_data_t pg_data;
-};
-
-extern struct node_map_data node_data[];
-
-#define NODE_DATA(nid)          (&node_data[nid].pg_data)
-
-/* We have these possible memory map layouts:
- *	Astro: 0-3.75, 67.75-68, 4-64
- *	zx1: 0-1, 257-260, 4-256
- *	Stretch (N-class): 0-2, 4-32, 34-xxx
- */
-
-/* Since each 1GB can only belong to one region (node), we can create
- * an index table for pfn to nid lookup; each entry in pfnnid_map
- * represents 1GB, and contains the node that the memory belongs to. */
-
-#define PFNNID_SHIFT (30 - PAGE_SHIFT)
-#define PFNNID_MAP_MAX  512     /* support 512GB */
-extern signed char pfnnid_map[PFNNID_MAP_MAX];
-
-#ifndef CONFIG_64BIT
-#define pfn_is_io(pfn) ((pfn & (0xf0000000UL >> PAGE_SHIFT)) == (0xf0000000UL >> PAGE_SHIFT))
-#else
-/* io can be 0xf0f0f0f0f0xxxxxx or 0xfffffffff0000000 */
-#define pfn_is_io(pfn) ((pfn & (0xf000000000000000UL >> PAGE_SHIFT)) == (0xf000000000000000UL >> PAGE_SHIFT))
-#endif
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-	unsigned int i;
-
-	if (unlikely(pfn_is_io(pfn)))
-		return 0;
-
-	i = pfn >> PFNNID_SHIFT;
-	BUG_ON(i >= ARRAY_SIZE(pfnnid_map));
-
-	return pfnnid_map[i];
-}
-
-static inline int pfn_valid(int pfn)
-{
-	int nid = pfn_to_nid(pfn);
-
-	if (nid >= 0)
-		return (pfn < node_end_pfn(nid));
-	return 0;
-}
-
-#endif
 #endif /* _PARISC_MMZONE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index b77f49c..93caf17 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -147,9 +147,9 @@ extern int npmem_ranges;
 #define __pa(x)			((unsigned long)(x)-PAGE_OFFSET)
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
 
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 #define pfn_valid(pfn)		((pfn) < max_mapnr)
-#endif /* CONFIG_DISCONTIGMEM */
+#endif
 
 #ifdef CONFIG_HUGETLB_PAGE
 #define HPAGE_SHIFT		PMD_SHIFT /* fixed for transparent huge pages */
diff --git a/arch/parisc/include/asm/sparsemem.h b/arch/parisc/include/asm/sparsemem.h
new file mode 100644
index 0000000..b7d1dc9
--- /dev/null
+++ b/arch/parisc/include/asm/sparsemem.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_PARISC_SPARSEMEM_H
+#define ASM_PARISC_SPARSEMEM_H
+
+/* We have these possible memory map layouts:
+ *	Astro: 0-3.75, 67.75-68, 4-64
+ *	zx1: 0-1, 257-260, 4-256
+ *	Stretch (N-class): 0-2, 4-32, 34-xxx
+ */
+
+#define MAX_PHYSMEM_BITS	42
+#define SECTION_SIZE_BITS	37
+
+#endif
diff --git a/arch/parisc/kernel/parisc_ksyms.c b/arch/parisc/kernel/parisc_ksyms.c
index 7baa226..174213b 100644
--- a/arch/parisc/kernel/parisc_ksyms.c
+++ b/arch/parisc/kernel/parisc_ksyms.c
@@ -138,12 +138,6 @@ extern void $$dyncall(void);
 EXPORT_SYMBOL($$dyncall);
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-#include <asm/mmzone.h>
-EXPORT_SYMBOL(node_data);
-EXPORT_SYMBOL(pfnnid_map);
-#endif
-
 #ifdef CONFIG_FUNCTION_TRACER
 extern void _mcount(void);
 EXPORT_SYMBOL(_mcount);
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index d0b1662..9523394 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -48,11 +48,6 @@ pmd_t pmd0[PTRS_PER_PMD] __attribute__ ((__section__ (".data..vm0.pmd"), aligned
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __attribute__ ((__section__ (".data..vm0.pgd"), aligned(PAGE_SIZE)));
 pte_t pg0[PT_INITIAL * PTRS_PER_PTE] __attribute__ ((__section__ (".data..vm0.pte"), aligned(PAGE_SIZE)));
 
-#ifdef CONFIG_DISCONTIGMEM
-struct node_map_data node_data[MAX_NUMNODES] __read_mostly;
-signed char pfnnid_map[PFNNID_MAP_MAX] __read_mostly;
-#endif
-
 static struct resource data_resource = {
 	.name	= "Kernel data",
 	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
@@ -76,11 +71,11 @@ static struct resource sysram_resources[MAX_PHYSMEM_RANGES] __read_mostly;
  * information retrieved in kernel/inventory.c.
  */
 
-physmem_range_t pmem_ranges[MAX_PHYSMEM_RANGES] __read_mostly;
-int npmem_ranges __read_mostly;
+physmem_range_t pmem_ranges[MAX_PHYSMEM_RANGES] __initdata;
+int npmem_ranges __initdata;
 
 #ifdef CONFIG_64BIT
-#define MAX_MEM         (~0UL)
+#define MAX_MEM         (1UL << MAX_PHYSMEM_BITS)
 #else /* !CONFIG_64BIT */
 #define MAX_MEM         (3584U*1024U*1024U)
 #endif /* !CONFIG_64BIT */
@@ -119,7 +114,7 @@ static void __init mem_limit_func(void)
 static void __init setup_bootmem(void)
 {
 	unsigned long mem_max;
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	physmem_range_t pmem_holes[MAX_PHYSMEM_RANGES - 1];
 	int npmem_holes;
 #endif
@@ -137,23 +132,20 @@ static void __init setup_bootmem(void)
 		int j;
 
 		for (j = i; j > 0; j--) {
-			unsigned long tmp;
+			physmem_range_t tmp;
 
 			if (pmem_ranges[j-1].start_pfn <
 			    pmem_ranges[j].start_pfn) {
 
 				break;
 			}
-			tmp = pmem_ranges[j-1].start_pfn;
-			pmem_ranges[j-1].start_pfn = pmem_ranges[j].start_pfn;
-			pmem_ranges[j].start_pfn = tmp;
-			tmp = pmem_ranges[j-1].pages;
-			pmem_ranges[j-1].pages = pmem_ranges[j].pages;
-			pmem_ranges[j].pages = tmp;
+			tmp = pmem_ranges[j-1];
+			pmem_ranges[j-1] = pmem_ranges[j];
+			pmem_ranges[j] = tmp;
 		}
 	}
 
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	/*
 	 * Throw out ranges that are too far apart (controlled by
 	 * MAX_GAP).
@@ -165,7 +157,7 @@ static void __init setup_bootmem(void)
 			 pmem_ranges[i-1].pages) > MAX_GAP) {
 			npmem_ranges = i;
 			printk("Large gap in memory detected (%ld pages). "
-			       "Consider turning on CONFIG_DISCONTIGMEM\n",
+			       "Consider turning on CONFIG_SPARSEMEM\n",
 			       pmem_ranges[i].start_pfn -
 			       (pmem_ranges[i-1].start_pfn +
 			        pmem_ranges[i-1].pages));
@@ -230,9 +222,8 @@ static void __init setup_bootmem(void)
 
 	printk(KERN_INFO "Total Memory: %ld MB\n",mem_max >> 20);
 
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 	/* Merge the ranges, keeping track of the holes */
-
 	{
 		unsigned long end_pfn;
 		unsigned long hole_pages;
@@ -255,18 +246,6 @@ static void __init setup_bootmem(void)
 	}
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-	for (i = 0; i < MAX_PHYSMEM_RANGES; i++) {
-		memset(NODE_DATA(i), 0, sizeof(pg_data_t));
-	}
-	memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
-
-	for (i = 0; i < npmem_ranges; i++) {
-		node_set_state(i, N_NORMAL_MEMORY);
-		node_set_online(i);
-	}
-#endif
-
 	/*
 	 * Initialize and free the full range of memory in each range.
 	 */
@@ -314,7 +293,7 @@ static void __init setup_bootmem(void)
 	memblock_reserve(__pa(KERNEL_BINARY_TEXT_START),
 			(unsigned long)(_end - KERNEL_BINARY_TEXT_START));
 
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
 
 	/* reserve the holes */
 
@@ -360,6 +339,9 @@ static void __init setup_bootmem(void)
 
 	/* Initialize Page Deallocation Table (PDT) and check for bad memory. */
 	pdc_pdt_init();
+
+	memblock_allow_resize();
+	memblock_dump_all();
 }
 
 static int __init parisc_text_address(unsigned long vaddr)
@@ -709,37 +691,46 @@ static void __init gateway_init(void)
 		  PAGE_SIZE, PAGE_GATEWAY, 1);
 }
 
-void __init paging_init(void)
+static void __init parisc_bootmem_free(void)
 {
+	unsigned long zones_size[MAX_NR_ZONES] = { 0, };
+	unsigned long holes_size[MAX_NR_ZONES] = { 0, };
+	unsigned long mem_start_pfn = ~0UL, mem_end_pfn = 0, mem_size_pfn = 0;
 	int i;
 
+	for (i = 0; i < npmem_ranges; i++) {
+		unsigned long start = pmem_ranges[i].start_pfn;
+		unsigned long size = pmem_ranges[i].pages;
+		unsigned long end = start + size;
+
+		if (mem_start_pfn > start)
+			mem_start_pfn = start;
+		if (mem_end_pfn < end)
+			mem_end_pfn = end;
+		mem_size_pfn += size;
+	}
+
+	zones_size[0] = mem_end_pfn - mem_start_pfn;
+	holes_size[0] = zones_size[0] - mem_size_pfn;
+
+	free_area_init_node(0, zones_size, mem_start_pfn, holes_size);
+}
+
+void __init paging_init(void)
+{
 	setup_bootmem();
 	pagetable_init();
 	gateway_init();
 	flush_cache_all_local(); /* start with known state */
 	flush_tlb_all_local(NULL);
 
-	for (i = 0; i < npmem_ranges; i++) {
-		unsigned long zones_size[MAX_NR_ZONES] = { 0, };
-
-		zones_size[ZONE_NORMAL] = pmem_ranges[i].pages;
-
-#ifdef CONFIG_DISCONTIGMEM
-		/* Need to initialize the pfnnid_map before we can initialize
-		   the zone */
-		{
-		    int j;
-		    for (j = (pmem_ranges[i].start_pfn >> PFNNID_SHIFT);
-			 j <= ((pmem_ranges[i].start_pfn + pmem_ranges[i].pages) >> PFNNID_SHIFT);
-			 j++) {
-			pfnnid_map[j] = i;
-		    }
-		}
-#endif
-
-		free_area_init_node(i, zones_size,
-				pmem_ranges[i].start_pfn, NULL);
-	}
+	/*
+	 * Mark all memblocks as present for sparsemem using
+	 * memory_present() and then initialize sparsemem.
+	 */
+	memblocks_present();
+	sparse_init();
+	parisc_bootmem_free();
 }
 
 #ifdef CONFIG_PA20

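Two pieces of the patch can be tried in userspace: the insertion sort that now swaps whole range structs, and the span/holes arithmetic in parisc_bootmem_free(). The sketch below mirrors that logic under stated assumptions. struct pfn_range stands in for the kernel's physmem_range_t, both function names are invented for the example, 4 KiB pages are assumed when converting the reported addresses to page frame numbers, and the unsorted input order is a guess at what firmware might report, loosely following the "zx1" layout in the patch comment.

```c
/* Stand-in for physmem_range_t: a start page frame and a page count. */
struct pfn_range {
	unsigned long long start_pfn;
	unsigned long long pages;
};

/* Insertion sort by start_pfn, swapping whole structs as the patch does. */
void sort_ranges(struct pfn_range *r, int n)
{
	int i, j;

	for (i = 1; i < n; i++) {
		for (j = i; j > 0; j--) {
			struct pfn_range tmp;

			if (r[j - 1].start_pfn < r[j].start_pfn)
				break;
			tmp = r[j - 1];
			r[j - 1] = r[j];
			r[j] = tmp;
		}
	}
}

/*
 * Compute the size of the single zone spanning all ranges and the
 * number of pages inside that span that are holes.
 */
void span_and_holes(const struct pfn_range *r, int n,
		    unsigned long long *span, unsigned long long *holes)
{
	unsigned long long start = ~0ULL, end = 0, present = 0;
	int i;

	*span = 0;
	*holes = 0;
	if (n <= 0)
		return;	/* guard added for the sketch; not in the patch */

	for (i = 0; i < n; i++) {
		unsigned long long range_end = r[i].start_pfn + r[i].pages;

		if (start > r[i].start_pfn)
			start = r[i].start_pfn;
		if (end < range_end)
			end = range_end;
		present += r[i].pages;
	}
	*span = end - start;
	*holes = *span - present;
}
```

For the machine in the original report this gives a span of 68157440 pages, of which 66322944 are holes and only 1834496 are backed by RAM, so almost all of the single zone handed to free_area_init_node() is reported as holes.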