linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* quicklists confuse meminfo
@ 2008-03-09 10:19 Thomas Gleixner
  2008-03-09 10:26 ` Bart Van Assche
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Thomas Gleixner @ 2008-03-09 10:19 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Christoph Lameter,
	Bart Van Assche

Bart reported http://bugzilla.kernel.org/show_bug.cgi?id=9991. He
assumed a memory leak in 32bit kernels when he analyzed the output of
/proc/meminfo.

The leak is not a leak, it's an accounting bug. quicklists keep a
large amount of pages which are accounted as used memory.

It can be easily observed by watching /proc/meminfo on an idle system
and then run

# while true; do ls >/dev/null; done

for a while. The amount of free memory decreases steadily to the
point, where the quicklist limit kicks in.

To verify this I added quicklists (patch below) to the /proc/meminfo
output and the decrease of free memory is matching the increase of the
memory which is kept in quicklists.

I'm not sure how we should handle the quicklist accounting, but the
current state of silently hiding the free memory in the quicklists is
definitely not acceptable.

Another strange observation about quicklists is the imbalance of the
quicklists across CPUs. Running the above loop on a 2way machine I can
observe that the quicklist pages are acuumulating on one CPU. Stopping
and restarting the loop a couple of times can shift the accumulation
from one to the other CPU.

Thanks,
	tglx

---
 fs/proc/proc_misc.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_misc.c
+++ linux-2.6/fs/proc/proc_misc.c
@@ -49,6 +49,8 @@
 #include <linux/crash_dump.h>
 #include <linux/pid_namespace.h>
 #include <linux/bootmem.h>
+#include <linux/quicklist.h>
+
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/io.h>
@@ -183,8 +185,11 @@ static int meminfo_read_proc(char *page,
 		"Committed_AS: %8lu kB\n"
 		"VmallocTotal: %8lu kB\n"
 		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
-		K(i.totalram),
+		"VmallocChunk: %8lu kB\n"
+#ifdef CONFIG_QUICKLIST
+		"QuickLists:   %8lu kB\n"
+#endif
+		,K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
 		K(cached),
@@ -215,6 +220,9 @@ static int meminfo_read_proc(char *page,
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_QUICKLIST
+		,K(quicklist_total_size())
+#endif
 		);
 
 		len += hugetlb_report_meminfo(page + len);


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 10:19 quicklists confuse meminfo Thomas Gleixner
@ 2008-03-09 10:26 ` Bart Van Assche
  2008-03-09 10:29 ` Andi Kleen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2008-03-09 10:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Christoph Lameter

On Sun, Mar 9, 2008 at 11:19 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Bart reported http://bugzilla.kernel.org/show_bug.cgi?id=9991.
...
>  +               "QuickLists:   %8lu kB\n"

Regarding the patch: maybe it's better to spell this entry
"Quicklists" than "QuickLists", since quicklist is a single word and
not two words ?

Regarding the memory occupied by quicklists: why does the amount of
memory occupied by quicklists increase over time ? Is there a way to
free the memory occupied by quicklists, similar to the way the memory
occupied by caches can be freed (echo 3 >/proc/sys/vm/drop_caches) ?

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 10:19 quicklists confuse meminfo Thomas Gleixner
  2008-03-09 10:26 ` Bart Van Assche
@ 2008-03-09 10:29 ` Andi Kleen
  2008-03-09 10:42 ` KOSAKI Motohiro
  2008-03-09 11:14 ` Ingo Molnar
  3 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2008-03-09 10:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Christoph Lameter, Bart Van Assche

Thomas Gleixner <tglx@linutronix.de> writes:

> Bart reported http://bugzilla.kernel.org/show_bug.cgi?id=9991. He
> assumed a memory leak in 32bit kernels when he analyzed the output of
> /proc/meminfo.
> 
> The leak is not a leak, it's an accounting bug. quicklists keep a
> large amount of pages which are accounted as used memory.

There are various other subsystems which can cache substantial memory
under the right circumstances. Do you want to add all of them 
to /proc/meminfo? I'm not sure that will scale long term.

One more general possibility would be to integrate this with
with the shrinker callbacks. Everyone who caches memory should
have a shrinker. Perhaps that could be integrated with some reporting
facility that adds a dynamic counter field that is displayed
somewhere in /proc or /sys.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 10:19 quicklists confuse meminfo Thomas Gleixner
  2008-03-09 10:26 ` Bart Van Assche
  2008-03-09 10:29 ` Andi Kleen
@ 2008-03-09 10:42 ` KOSAKI Motohiro
  2008-03-09 12:00   ` Thomas Gleixner
  2008-03-09 11:14 ` Ingo Molnar
  3 siblings, 1 reply; 44+ messages in thread
From: KOSAKI Motohiro @ 2008-03-09 10:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kosaki.motohiro, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Christoph Lameter, Bart Van Assche

Hi

in general, I like this patch and I think this is very useful.
but I don't like #ifdef ;)

> +#ifdef CONFIG_QUICKLIST
> +		"QuickLists:   %8lu kB\n"
> +#endif
> +		,K(i.totalram),
>  		K(i.freeram),
>  		K(i.bufferram),
>  		K(cached),
> @@ -215,6 +220,9 @@ static int meminfo_read_proc(char *page,
>  		(unsigned long)VMALLOC_TOTAL >> 10,
>  		vmi.used >> 10,
>  		vmi.largest_chunk >> 10
> +#ifdef CONFIG_QUICKLIST
> +		,K(quicklist_total_size())
> +#endif

Do you dislike my following patch?

- kosaki



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 fs/proc/proc_misc.c       |    6 ++++--
 include/linux/quicklist.h |    7 +++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: b/fs/proc/proc_misc.c
===================================================================
--- a/fs/proc/proc_misc.c       2008-03-09 16:04:39.000000000 +0900
+++ b/fs/proc/proc_misc.c       2008-03-09 19:47:47.000000000 +0900
@@ -182,7 +182,8 @@ static int meminfo_read_proc(char *page,
                "Committed_AS: %8lu kB\n"
                "VmallocTotal: %8lu kB\n"
                "VmallocUsed:  %8lu kB\n"
-               "VmallocChunk: %8lu kB\n",
+               "VmallocChunk: %8lu kB\n"
+               "Quicklists:   %8lu kB\n",
                K(i.totalram),
                K(i.freeram),
                K(i.bufferram),
@@ -213,7 +214,8 @@ static int meminfo_read_proc(char *page,
                K(committed),
                (unsigned long)VMALLOC_TOTAL >> 10,
                vmi.used >> 10,
-               vmi.largest_chunk >> 10
+               vmi.largest_chunk >> 10,
+               K(quicklist_total_size())
                );

                len += hugetlb_report_meminfo(page + len);
Index: b/include/linux/quicklist.h
===================================================================
--- a/include/linux/quicklist.h 2008-03-09 16:02:28.000000000 +0900
+++ b/include/linux/quicklist.h 2008-03-09 19:45:58.000000000 +0900
@@ -80,6 +80,13 @@ void quicklist_trim(int nr, void (*dtor)

 unsigned long quicklist_total_size(void);

+#else
+
+static inline unsigned long quicklist_total_size(void)
+{
+       return 0;
+}
+
 #endif

 #endif /* LINUX_QUICKLIST_H */



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 10:19 quicklists confuse meminfo Thomas Gleixner
                   ` (2 preceding siblings ...)
  2008-03-09 10:42 ` KOSAKI Motohiro
@ 2008-03-09 11:14 ` Ingo Molnar
  2008-03-09 11:56   ` Thomas Gleixner
                     ` (2 more replies)
  3 siblings, 3 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 11:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Andrew Morton, Christoph Lameter, Bart Van Assche


* Thomas Gleixner <tglx@linutronix.de> wrote:

> Bart reported http://bugzilla.kernel.org/show_bug.cgi?id=9991. He 
> assumed a memory leak in 32bit kernels when he analyzed the output of 
> /proc/meminfo.
> 
> The leak is not a leak, it's an accounting bug. quicklists keep a 
> large amount of pages which are accounted as used memory.
[...]
> Another strange observation about quicklists is the imbalance of the 
> quicklists across CPUs. Running the above loop on a 2way machine I can 
> observe that the quicklist pages are acuumulating on one CPU. Stopping 
> and restarting the loop a couple of times can shift the accumulation 
> from one to the other CPU.

hm. I think we should not let this much RAM hang around in a 
special-purpose allocator like quicklists. Shouldnt the quicklists be 
temporary in nature, and be trimmed much more agressively?

in fact, we have a check_pgt_cache() call in cpu_idle(), which does:

        quicklist_trim(0, pgd_dtor, 25, 16);

but it appears we dont do quicklist trimming anywhere else! So if a 
system has no idle time, the quicklist can grow unbounded, and that's a 
real memory leak IMO.

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 11:14 ` Ingo Molnar
@ 2008-03-09 11:56   ` Thomas Gleixner
  2008-03-09 12:01     ` Ingo Molnar
  2008-03-09 12:03   ` Johannes Weiner
  2008-03-09 12:03   ` KOSAKI Motohiro
  2 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2008-03-09 11:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Andrew Morton, Christoph Lameter, Bart Van Assche

On Sun, 9 Mar 2008, Ingo Molnar wrote:

> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > Bart reported http://bugzilla.kernel.org/show_bug.cgi?id=9991. He 
> > assumed a memory leak in 32bit kernels when he analyzed the output of 
> > /proc/meminfo.
> > 
> > The leak is not a leak, it's an accounting bug. quicklists keep a 
> > large amount of pages which are accounted as used memory.
> [...]
> > Another strange observation about quicklists is the imbalance of the 
> > quicklists across CPUs. Running the above loop on a 2way machine I can 
> > observe that the quicklist pages are acuumulating on one CPU. Stopping 
> > and restarting the loop a couple of times can shift the accumulation 
> > from one to the other CPU.
> 
> hm. I think we should not let this much RAM hang around in a 
> special-purpose allocator like quicklists. Shouldnt the quicklists be 
> temporary in nature, and be trimmed much more agressively?
> 
> in fact, we have a check_pgt_cache() call in cpu_idle(), which does:
> 
>         quicklist_trim(0, pgd_dtor, 25, 16);
> 
> but it appears we dont do quicklist trimming anywhere else! So if a 
> system has no idle time, the quicklist can grow unbounded, and that's a 
> real memory leak IMO.

Right, also the quicklist_trim() in idle() is freeing at max 16 pages
in one go. According to the quicklist_trim() code we keep up to
(node_free_pages / 16) in the quicklist unconditionally, which seems
rather odd as well.

Thanks,
	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 10:42 ` KOSAKI Motohiro
@ 2008-03-09 12:00   ` Thomas Gleixner
  0 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2008-03-09 12:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Christoph Lameter, Bart Van Assche

On Sun, 9 Mar 2008, KOSAKI Motohiro wrote:
> Hi
> 
> in general, I like this patch and I think this is very useful.
> but I don't like #ifdef ;)

/me neither :)
 
> > +#ifdef CONFIG_QUICKLIST
> > +		"QuickLists:   %8lu kB\n"
> > +#endif
> > +		,K(i.totalram),
> >  		K(i.freeram),
> >  		K(i.bufferram),
> >  		K(cached),
> > @@ -215,6 +220,9 @@ static int meminfo_read_proc(char *page,
> >  		(unsigned long)VMALLOC_TOTAL >> 10,
> >  		vmi.used >> 10,
> >  		vmi.largest_chunk >> 10
> > +#ifdef CONFIG_QUICKLIST
> > +		,K(quicklist_total_size())
> > +#endif
> 
> Do you dislike my following patch?

I like it. I'm just not sure about the accounting of cached memory in
general, though quicklists seem to be a pretty obvious one.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 11:56   ` Thomas Gleixner
@ 2008-03-09 12:01     ` Ingo Molnar
  2008-03-09 12:49       ` Andi Kleen
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 12:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Andrew Morton, Christoph Lameter, Bart Van Assche


* Thomas Gleixner <tglx@linutronix.de> wrote:

> Right, also the quicklist_trim() in idle() is freeing at max 16 pages 
> in one go. According to the quicklist_trim() code we keep up to 
> (node_free_pages / 16) in the quicklist unconditionally, which seems 
> rather odd as well.

i suspect the right approach would be to get rid of them altogether and 
to convert the code to RCU and plain old alloc/free instead.

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 11:14 ` Ingo Molnar
  2008-03-09 11:56   ` Thomas Gleixner
@ 2008-03-09 12:03   ` Johannes Weiner
  2008-03-09 12:03   ` KOSAKI Motohiro
  2 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2008-03-09 12:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Christoph Lameter, Bart Van Assche

Hi,

Ingo Molnar <mingo@elte.hu> writes:

> hm. I think we should not let this much RAM hang around in a 
> special-purpose allocator like quicklists. Shouldnt the quicklists be 
> temporary in nature, and be trimmed much more agressively?
>
> in fact, we have a check_pgt_cache() call in cpu_idle(), which does:
>
>         quicklist_trim(0, pgd_dtor, 25, 16);
>
> but it appears we dont do quicklist trimming anywhere else! So if a 
> system has no idle time, the quicklist can grow unbounded, and that's a 
> real memory leak IMO.

It is also called from tlbu_finish_mmu().

	Hannes

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 11:14 ` Ingo Molnar
  2008-03-09 11:56   ` Thomas Gleixner
  2008-03-09 12:03   ` Johannes Weiner
@ 2008-03-09 12:03   ` KOSAKI Motohiro
  2008-03-09 12:09     ` Ingo Molnar
  2 siblings, 1 reply; 44+ messages in thread
From: KOSAKI Motohiro @ 2008-03-09 12:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche

Hi

> hm. I think we should not let this much RAM hang around in a 
> special-purpose allocator like quicklists. Shouldnt the quicklists be 
> temporary in nature, and be trimmed much more agressively?
> 
> in fact, we have a check_pgt_cache() call in cpu_idle(), which does:
> 
>         quicklist_trim(0, pgd_dtor, 25, 16);
> 
> but it appears we dont do quicklist trimming anywhere else! So if a 
> system has no idle time, the quicklist can grow unbounded, and that's a 
> real memory leak IMO.

I test following method.

1. $ hackbench 100 process 1000
2. $ cat /proc/meminfo

quicklist consume 1GB memory of 8GB total memory system.
it seems too large cache ;)

IMHO we need shrink pgtable cache mecanism.


---------------------------------------
MemTotal:      7683328 kB
MemFree:       4940672 kB
Buffers:          2816 kB
Cached:          23232 kB
SwapCached:      72192 kB
Active:          56832 kB
Inactive:        77440 kB
SwapTotal:     2031488 kB
SwapFree:      1907776 kB
Dirty:             704 kB
Writeback:           0 kB
AnonPages:       48384 kB
Mapped:          20864 kB
Slab:          1325824 kB
SReclaimable:    11968 kB
SUnreclaim:    1313856 kB
PageTables:      13632 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   5873152 kB
Committed_AS:   531008 kB
VmallocTotal: 17592177655808 kB
VmallocUsed:     28864 kB
VmallocChunk: 17592177623104 kB
Quicklists:    1194304 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:    262144 kB



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:03   ` KOSAKI Motohiro
@ 2008-03-09 12:09     ` Ingo Molnar
  2008-03-09 12:34       ` Ingo Molnar
  2008-03-09 12:47       ` KOSAKI Motohiro
  0 siblings, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 12:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Christoph Lameter, Bart Van Assche


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > but it appears we dont do quicklist trimming anywhere else! So if a 
> > system has no idle time, the quicklist can grow unbounded, and 
> > that's a real memory leak IMO.
> 
> I test following method.
> 
> 1. $ hackbench 100 process 1000
> 2. $ cat /proc/meminfo
> 
> quicklist consume 1GB memory of 8GB total memory system.
> it seems too large cache ;)
> 
> IMHO we need shrink pgtable cache mecanism.

ouch! Could you try the patch below? How large is the quicklist cache 
with this applied?

	Ingo

---------------------->
Subject: x86: trim quicklist more agressively
From: Ingo Molnar <mingo@elte.hu>
Date: Sun Mar 09 12:59:41 CET 2008

trim the quicklists more agressively, up to 1024 pages at once.
(which pretty much means we keep this special-purpose cache as
small as possible)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/pgtable_32.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/pgtable_32.c
+++ linux-x86.q/arch/x86/mm/pgtable_32.c
@@ -358,7 +358,7 @@ void pgd_free(struct mm_struct *mm, pgd_
 
 void check_pgt_cache(void)
 {
-	quicklist_trim(0, pgd_dtor, 25, 16);
+	quicklist_trim(0, pgd_dtor, 25, 1024);
 }
 
 void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:09     ` Ingo Molnar
@ 2008-03-09 12:34       ` Ingo Molnar
  2008-03-09 12:51         ` KOSAKI Motohiro
                           ` (3 more replies)
  2008-03-09 12:47       ` KOSAKI Motohiro
  1 sibling, 4 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 12:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Christoph Lameter, Bart Van Assche


* Ingo Molnar <mingo@elte.hu> wrote:

> > IMHO we need shrink pgtable cache mecanism.
> 
> ouch! Could you try the patch below? How large is the quicklist cache 
> with this applied?

hm, Thomas pointed it out that this wont solve all the problems as 
quicklists have a built-in "preserve me" throttle (which is rather 
stupid).

the right solution is to get rid of quicklists altogether (Thomas 
expects to have patches for that later today).

If anyone implements a _sane_ general purpose allocator that can 
preserve constructed objects then i'm all for utilizing it in x86 too, 
but quicklists arent that mechanism ...

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:09     ` Ingo Molnar
  2008-03-09 12:34       ` Ingo Molnar
@ 2008-03-09 12:47       ` KOSAKI Motohiro
  1 sibling, 0 replies; 44+ messages in thread
From: KOSAKI Motohiro @ 2008-03-09 12:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche

Hi

> > IMHO we need shrink pgtable cache mecanism.
> 
> ouch! Could you try the patch below? How large is the quicklist cache 
> with this applied?

I porting to IA64, applied, and tested.
but no change behavior.

because quicklist.c::max_pages() formula is strange.
my freememory is about 5GB, thus this function calculate 5GB/16 ~= 300MB,
my machine has 8cpu, thus 300MB x 8 ~= 2.5GB consume at worst case.
(because this variable is per cpu variable)

IMHO FRACTION_OF_NODE_MEM(16) is too small.

after change FRACTION_OF_NODE_MEM to 256,
i got following result.


---------------------------------------------------
MemTotal:      7683328 kB
MemFree:       5904704 kB
Buffers:         28672 kB
Cached:         168384 kB
SwapCached:       1344 kB
Active:         173376 kB
Inactive:       126016 kB
SwapTotal:     2031488 kB
SwapFree:      2030144 kB
Dirty:              64 kB
Writeback:           0 kB
AnonPages:      101120 kB
Mapped:          34496 kB
Slab:          1332608 kB
SReclaimable:    21824 kB
SUnreclaim:    1310784 kB
PageTables:      12480 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   5873152 kB
Committed_AS:   383744 kB
VmallocTotal: 17592177655808 kB
VmallocUsed:     28864 kB
VmallocChunk: 17592177623040 kB
Quicklists:      61888 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:    262144 kB
-----------------------------------------------

the following patch is explain to my shrink plan and change to 
FRACTION_OF_NODE_MEM value.
of cource, this is only base of discussion.

I don't tested performance regression test yet.




Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 fs/drop_caches.c           |    3 +++
 include/asm-ia64/pgalloc.h |    9 ++++++++-
 include/linux/quicklist.h  |    7 +++++++
 mm/quicklist.c             |    2 +-
 mm/vmscan.c                |    8 ++++++++
 5 files changed, 27 insertions(+), 2 deletions(-)

Index: b/fs/drop_caches.c
===================================================================
--- a/fs/drop_caches.c  2008-03-09 16:04:14.000000000 +0900
+++ b/fs/drop_caches.c  2008-03-09 20:38:44.000000000 +0900
@@ -8,6 +8,7 @@
 #include <linux/writeback.h>
 #include <linux/sysctl.h>
 #include <linux/gfp.h>
+#include <asm/pgalloc.h>

 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
@@ -63,6 +64,8 @@ int drop_caches_sysctl_handler(ctl_table
                        drop_pagecache();
                if (sysctl_drop_caches & 2)
                        drop_slab();
+               if (sysctl_drop_caches & 4)
+                       drop_pgtable_cache();
        }
        return 0;
 }
Index: b/include/linux/quicklist.h
===================================================================
--- a/include/linux/quicklist.h 2008-03-09 19:45:58.000000000 +0900
+++ b/include/linux/quicklist.h 2008-03-09 21:49:07.000000000 +0900
@@ -87,6 +87,13 @@ static inline unsigned long quicklist_to
        return 0;
 }

+static inline void quicklist_trim(int nr, void (*dtor)(void *),
+                                 unsigned long min_pages,
+                                 unsigned long max_free)
+{
+}
+
+
 #endif

 #endif /* LINUX_QUICKLIST_H */
Index: b/mm/quicklist.c
===================================================================
--- a/mm/quicklist.c    2008-03-09 16:04:53.000000000 +0900
+++ b/mm/quicklist.c    2008-03-09 21:35:18.000000000 +0900
@@ -21,7 +21,7 @@

 DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];

-#define FRACTION_OF_NODE_MEM   16
+#define FRACTION_OF_NODE_MEM   256

 static unsigned long max_pages(unsigned long min_pages)
 {
Index: b/include/asm-ia64/pgalloc.h
===================================================================
--- a/include/asm-ia64/pgalloc.h        2008-03-09 21:53:14.000000000 +0900
+++ b/include/asm-ia64/pgalloc.h        2008-03-09 21:53:53.000000000 +0900
@@ -114,9 +114,16 @@ static inline void pte_free_kernel(struc

 static inline void check_pgt_cache(void)
 {
-       quicklist_trim(0, NULL, 25, 16);
+       quicklist_trim(0, NULL, 25, 1024);
 }

 #define __pte_free_tlb(tlb, pte)       pte_free((tlb)->mm, pte)

+#define HAVE_ARCH_DROP_PGTABLE_CACHE
+static inline void drop_pgtable_cache(void)
+{
+       quicklist_trim(0, NULL, 25, ULONG_MAX);
+}
+
+
 #endif                         /* _ASM_IA64_PGALLOC_H */
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c       2008-03-09 19:42:47.000000000 +0900
+++ b/mm/vmscan.c       2008-03-09 20:45:24.000000000 +0900
@@ -42,6 +42,7 @@

 #include <asm/tlbflush.h>
 #include <asm/div64.h>
+#include <asm/pgalloc.h>

 #include <linux/swapops.h>

@@ -1362,6 +1363,7 @@ static unsigned long do_try_to_free_page
                 * over limit cgroups
                 */
                if (scan_global_lru(sc)) {
+                       drop_pgtable_cache();
                        shrink_slab(sc->nr_scanned, gfp_mask, lru_pages);
                        if (reclaim_state) {
                                nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -1589,6 +1591,7 @@ loop_again:
                        if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
                                                end_zone, 0))
                                nr_reclaimed += shrink_zone(priority, zone, &sc);
+                       drop_pgtable_cache();
                        reclaim_state->reclaimed_slab = 0;
                        nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
                                                lru_pages);
@@ -1831,6 +1834,7 @@ unsigned long shrink_all_memory(unsigned
        nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
        /* If slab caches are huge, it's better to hit them first */
        while (nr_slab >= lru_pages) {
+               drop_pgtable_cache();
                reclaim_state.reclaimed_slab = 0;
                shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
                if (!reclaim_state.reclaimed_slab)
@@ -1868,6 +1872,7 @@ unsigned long shrink_all_memory(unsigned
                        if (ret >= nr_pages)
                                goto out;

+                       drop_pgtable_cache();
                        reclaim_state.reclaimed_slab = 0;
                        shrink_slab(sc.nr_scanned, sc.gfp_mask,
                                        count_lru_pages());
@@ -1880,6 +1885,7 @@ unsigned long shrink_all_memory(unsigned
                }
        }

+       drop_pgtable_cache();
        /*
         * If ret = 0, we could not shrink LRUs, but there may be something
         * in slab caches
@@ -2038,6 +2044,8 @@ static int __zone_reclaim(struct zone *z
                } while (priority >= 0 && nr_reclaimed < nr_pages);
        }

+       drop_pgtable_cache();
+
        slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
        if (slab_reclaimable > zone->min_slab_pages) {
                /*


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:01     ` Ingo Molnar
@ 2008-03-09 12:49       ` Andi Kleen
  2008-03-10 15:51         ` Christoph Lameter
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2008-03-09 12:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Christoph Lameter, Bart Van Assche

Ingo Molnar <mingo@elte.hu> writes:

> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > Right, also the quicklist_trim() in idle() is freeing at max 16 pages 
> > in one go. According to the quicklist_trim() code we keep up to 
> > (node_free_pages / 16) in the quicklist unconditionally, which seems 
> > rather odd as well.
> 
> i suspect the right approach would be to get rid of them altogether and 
> to convert the code to RCU and plain old alloc/free instead.

Are you sure RCU is even needed? AFAIK all delayed freeing 
on page table level is already handled by the standard free-after-flush
mmu code. 

iirc quicklists were just to avoid rezeroing pages which are known to
be zero at free time (in theory __GFP_ZERO should handle that at 
page_alloc.c level, but it doesn't) and to get a little private fast path 
for the page allocator (might actually predate the current page_alloc
fast paths)

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:34       ` Ingo Molnar
@ 2008-03-09 12:51         ` KOSAKI Motohiro
  2008-03-09 13:20         ` Thomas Gleixner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: KOSAKI Motohiro @ 2008-03-09 12:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche

> hm, Thomas pointed it out that this wont solve all the problems as 
> quicklists have a built-in "preserve me" throttle (which is rather 
> stupid).
> 
> the right solution is to get rid of quicklists altogether (Thomas 
> expects to have patches for that later today).
> 
> If anyone implements a _sane_ general purpose allocator that can 
> preserve constructed objects then i'm all for utilizing it in x86 too, 
> but quicklists arent that mechanism ...

Apparently, my patch seems to have been too late ;)
please forget it.


- kosaki



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:34       ` Ingo Molnar
  2008-03-09 12:51         ` KOSAKI Motohiro
@ 2008-03-09 13:20         ` Thomas Gleixner
  2008-03-09 18:46         ` Andrew Morton
  2008-03-09 19:11         ` Arjan van de Ven
  3 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2008-03-09 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, LKML, Linus Torvalds, Andrew Morton,
	Christoph Lameter, Bart Van Assche

On Sun, 9 Mar 2008, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > IMHO we need shrink pgtable cache mecanism.
> > 
> > ouch! Could you try the patch below? How large is the quicklist cache 
> > with this applied?
> 
> hm, Thomas pointed it out that this wont solve all the problems as 
> quicklists have a built-in "preserve me" throttle (which is rather 
> stupid).

There is also the imbalance across CPUs. I think I figured out what's
going there as well. The allocation happens on one CPU (via
page_alloc), but the tear down happens on the other CPU, which
accumulates the pages in the quicklist. So the quicklist of the busy
CPU is empty, while the one of the idle CPU goes up to the limit. When
I pin the loop to one CPU then the quicklists are stable.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:34       ` Ingo Molnar
  2008-03-09 12:51         ` KOSAKI Motohiro
  2008-03-09 13:20         ` Thomas Gleixner
@ 2008-03-09 18:46         ` Andrew Morton
  2008-03-09 20:21           ` Andi Kleen
  2008-03-10 15:54           ` Christoph Lameter
  2008-03-09 19:11         ` Arjan van de Ven
  3 siblings, 2 replies; 44+ messages in thread
From: Andrew Morton @ 2008-03-09 18:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Christoph Lameter, Bart Van Assche

On Sun, 9 Mar 2008 13:34:32 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> the right solution is to get rid of quicklists altogether

Yes, I think so.

- They are pretty marginal from a performance POV (iirc)

- They've been a relatively rich source of bugs

- As I said when we merged them (under protest): Private object caches
  like this are just a bad idea - caches should be *shared*, because some
  other code path which wants a zeroed page wants a cache-warm one, not a
  cache-cold one from the allocator (iirc there was doubt over how
  cache-warm these pages are, however).

  Making __GFP_ZERO smarter/more efficient would be a preferable way of
  addressing any performance problems we have in there.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:34       ` Ingo Molnar
                           ` (2 preceding siblings ...)
  2008-03-09 18:46         ` Andrew Morton
@ 2008-03-09 19:11         ` Arjan van de Ven
  2008-03-09 19:25           ` Ingo Molnar
  3 siblings, 1 reply; 44+ messages in thread
From: Arjan van de Ven @ 2008-03-09 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche

On Sun, 9 Mar 2008 13:34:32 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > IMHO we need shrink pgtable cache mecanism.
> > 
> > ouch! Could you try the patch below? How large is the quicklist
> > cache with this applied?
> 
> hm, Thomas pointed it out that this wont solve all the problems as 
> quicklists have a built-in "preserve me" throttle (which is rather 
> stupid).
> 
> the right solution is to get rid of quicklists altogether (Thomas 
> expects to have patches for that later today).

careful with this; the quicklists aren't JUST for speed
they are also there to make sure a page we free that is a pagetable,
is not reused until we have finished flushing the tlbs on all the cpus that saw it.
This is a really hard correctness requirement, and while I can see
that quicklists are probably not the best way to achieve this, we can't just
throw away the behavior ;(

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 19:11         ` Arjan van de Ven
@ 2008-03-09 19:25           ` Ingo Molnar
  2008-03-09 19:27             ` Ingo Molnar
  2008-03-10 15:55             ` Christoph Lameter
  0 siblings, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 19:25 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche


* Arjan van de Ven <arjan@infradead.org> wrote:

> > > ouch! Could you try the patch below? How large is the quicklist 
> > > cache with this applied?
> > 
> > hm, Thomas pointed it out that this wont solve all the problems as 
> > quicklists have a built-in "preserve me" throttle (which is rather 
> > stupid).
> > 
> > the right solution is to get rid of quicklists altogether (Thomas 
> > expects to have patches for that later today).
> 
> careful with this; the quicklists aren't JUST for speed they are also 
> there to make sure a page we free that is a pagetable, is not reused 
> until we have finished flushing the tlbs on all the cpus that saw it. 
> This is a really hard correctness requirement, and while I can see 
> that quicklists are probably not the best way to achieve this, we 
> can't just throw away the behavior ;(

no, that's not true anymore - and the current quicklists code doesnt do 
anything like that AFAICS. It used to be a lot more complex, but now 
it's just a thin wrapper around the page allocator.

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 19:25           ` Ingo Molnar
@ 2008-03-09 19:27             ` Ingo Molnar
  2008-03-09 19:31               ` Ingo Molnar
  2008-03-10 15:57               ` Christoph Lameter
  2008-03-10 15:55             ` Christoph Lameter
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 19:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche


* Ingo Molnar <mingo@elte.hu> wrote:

> > careful with this; the quicklists aren't JUST for speed they are 
> > also there to make sure a page we free that is a pagetable, is not 
> > reused until we have finished flushing the tlbs on all the cpus that 
> > saw it. This is a really hard correctness requirement, and while I 
> > can see that quicklists are probably not the best way to achieve 
> > this, we can't just throw away the behavior ;(
> 
> no, that's not true anymore - and the current quicklists code doesnt 
> do anything like that AFAICS. It used to be a lot more complex, but 
> now it's just a thin wrapper around the page allocator.

i.e. the patch below should do the trick.

(it's still work in progress but it seems to boot just fine)

	Ingo

------------>
Subject: x86: patches/remove-quicklists.patch
From: Ingo Molnar <mingo@elte.hu>
Date: Sun Mar 09 20:04:32 CET 2008

---
 arch/x86/mm/pgtable_32.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Index: linux-x86.q/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/pgtable_32.c
+++ linux-x86.q/arch/x86/mm/pgtable_32.c
@@ -10,7 +10,6 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
-#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
@@ -338,12 +337,16 @@ static void pgd_mop_up_pmds(struct mm_st
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
+	pgd_t *pgd;
+
+	pgd = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	if (pgd)
+		pgd_ctor(pgd);
 
 	mm->pgd = pgd;		/* so that alloc_pd can use it */
 
 	if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
-		quicklist_free(0, pgd_dtor, pgd);
+		free_page((unsigned long)pgd);
 		pgd = NULL;
 	}
 
@@ -353,12 +356,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	quicklist_free(0, pgd_dtor, pgd);
+	pgd_dtor(pgd);
+	free_page((unsigned long)pgd);
 }
 
 void check_pgt_cache(void)
 {
-	quicklist_trim(0, pgd_dtor, 25, 16);
 }
 
 void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 19:27             ` Ingo Molnar
@ 2008-03-09 19:31               ` Ingo Molnar
  2008-03-10 15:57               ` Christoph Lameter
  1 sibling, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2008-03-09 19:31 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Christoph Lameter, Bart Van Assche


* Ingo Molnar <mingo@elte.hu> wrote:

> i.e. the patch below should do the trick.
> 
> (it's still work in progress but it seems to boot just fine)

updated patch below - Thomas found a bug in the alloc-failure cleanup 
path. I shouldnt hack late Sunday night :)

	Ingo

--------------->
Subject: x86: patches/remove-quicklists.patch
From: Ingo Molnar <mingo@elte.hu>
Date: Sun Mar 09 20:04:32 CET 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/pgtable_32.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-x86.q/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/pgtable_32.c
+++ linux-x86.q/arch/x86/mm/pgtable_32.c
@@ -10,7 +10,6 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
-#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
@@ -338,12 +337,17 @@ static void pgd_mop_up_pmds(struct mm_st
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
+	pgd_t *pgd;
+
+	pgd = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	if (pgd)
+		pgd_ctor(pgd);
 
 	mm->pgd = pgd;		/* so that alloc_pd can use it */
 
 	if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
-		quicklist_free(0, pgd_dtor, pgd);
+		pgd_dtor(pgd);
+		free_page((unsigned long)pgd);
 		pgd = NULL;
 	}
 
@@ -353,12 +357,12 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	quicklist_free(0, pgd_dtor, pgd);
+	pgd_dtor(pgd);
+	free_page((unsigned long)pgd);
 }
 
 void check_pgt_cache(void)
 {
-	quicklist_trim(0, pgd_dtor, 25, 16);
 }
 
 void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 18:46         ` Andrew Morton
@ 2008-03-09 20:21           ` Andi Kleen
  2008-03-10 15:54           ` Christoph Lameter
  1 sibling, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2008-03-09 20:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Christoph Lameter, Bart Van Assche

Andrew Morton <akpm@linux-foundation.org> writes:

> On Sun, 9 Mar 2008 13:34:32 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > the right solution is to get rid of quicklists altogether
> 
> Yes, I think so.
> 
> - They are pretty marginal from a performance POV (iirc)

One general issue -- as noted again by Christoph Lameter recently --
is that the order 0 fast path in page_alloc.c isn't actually very 
fast. That is why people keep inventing their own...

 > - As I said when we merged them (under protest): Private object caches
>   like this are just a bad idea - caches should be *shared*, because some
>   other code path which wants a zeroed page wants a cache-warm one, not a
>   cache-cold one from the allocator (iirc there was doubt over how
>   cache-warm these pages are, however).
> 
>   Making __GFP_ZERO smarter/more efficient would be a preferable way of
>   addressing any performance problems we have in there.

To do the same as quicklists you would need a __free_pages_zeroed()
and separate buddy lists I think.  Later is probably somewhat ugly.
Or perhaps do it only for order 0?

Or perhaps idle time zeroing should be reinvestigated on modern CPUs,
but I'm always a little sceptical of that.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 12:49       ` Andi Kleen
@ 2008-03-10 15:51         ` Christoph Lameter
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2008-03-10 15:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Linus Torvalds,
	Andrew Morton, Bart Van Assche, William Lee Irwin III,
	Benjamin Herrenschmidt, David Miller

On Sun, 9 Mar 2008, Andi Kleen wrote:

> iirc quicklists were just to avoid rezeroing pages which are known to
> be zero at free time (in theory __GFP_ZERO should handle that at 
> page_alloc.c level, but it doesn't) and to get a little private fast path 
> for the page allocator (might actually predate the current page_alloc
> fast paths)

Quicklists were introduced to increase performance of program termination 
and startup. The avoidance of zeroing is one effect. The other reason that
made this a good optimization is the bad page allocator performance 
in general vs a simple LIFO list.

The numbers that we saw on  x86_64 were around a 90% reduction in overhead 
with the quicklists.

The reason for the throttle is that termination and starting of large 
programs would have to zero large amounts of memory and go through the 
page allocator for all of this if the quicklists would be bounded to a 
fixed limit. Having a fraction of free memory allows preserving large 
amounts of page table pages for the next process that starts.

Right now quicklists do not make much sense because the x86_64 portion was 
removed. IMHO The usefulness for i386 pgd/pud caching is negligible. The 
code came initially from IA64 arch code (I think it was first on sparc64 
though). After it became available in the core it was used by various other arches. 

There have been a couple of people who wanted to continue work on 
quicklists (which made me focus on different things) but so far nothing 
has happened. Replacement of the i386 portion with alloc/free would be 
fairly straightforward I would think.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 18:46         ` Andrew Morton
  2008-03-09 20:21           ` Andi Kleen
@ 2008-03-10 15:54           ` Christoph Lameter
  2008-03-10 16:43             ` Andi Kleen
  2008-03-11  4:07             ` Nick Piggin
  1 sibling, 2 replies; 44+ messages in thread
From: Christoph Lameter @ 2008-03-10 15:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Bart Van Assche

On Sun, 9 Mar 2008, Andrew Morton wrote:

> - They are pretty marginal from a performance POV (iirc)

Pretty significant in our experience.

>   Making __GFP_ZERO smarter/more efficient would be a preferable way of
>   addressing any performance problems we have in there.

Looking at the page allocator "fastpath": The basic reaons that this was 
such a good optimization was that the page allocator is expensive to call. 
The hotpath gets more and more clogged with logic. Fixing the page 
allocator to be more efficient may be the right approach here. That could 
also include having a list of zeroed pages.

Zeroed pages however will not address the issue of having initialized pgd 
(which seems to be what i386 needs).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 19:25           ` Ingo Molnar
  2008-03-09 19:27             ` Ingo Molnar
@ 2008-03-10 15:55             ` Christoph Lameter
  1 sibling, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2008-03-10 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Andrew Morton, Bart Van Assche

On Sun, 9 Mar 2008, Ingo Molnar wrote:

> > careful with this; the quicklists aren't JUST for speed they are also 
> > there to make sure a page we free that is a pagetable, is not reused 
> > until we have finished flushing the tlbs on all the cpus that saw it. 
> > This is a really hard correctness requirement, and while I can see 
> > that quicklists are probably not the best way to achieve this, we 
> > can't just throw away the behavior ;(
> 
> no, that's not true anymore - and the current quicklists code doesnt do 
> anything like that AFAICS. It used to be a lot more complex, but now 
> it's just a thin wrapper around the page allocator.

Sure it does that. It interacts with the TLB logic which is another bad 
thing as Linus has pointed out.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-09 19:27             ` Ingo Molnar
  2008-03-09 19:31               ` Ingo Molnar
@ 2008-03-10 15:57               ` Christoph Lameter
  1 sibling, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2008-03-10 15:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Andrew Morton, Bart Van Assche

Reviewed-by: Christoph Lameter <clameter@sgi.com>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 15:54           ` Christoph Lameter
@ 2008-03-10 16:43             ` Andi Kleen
  2008-03-10 17:19               ` Hugh Dickins
  2008-03-11  4:07             ` Nick Piggin
  1 sibling, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2008-03-10 16:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner,
	LKML, Linus Torvalds, Bart Van Assche

Christoph Lameter <clameter@sgi.com> writes:
> 
> Zeroed pages however will not address the issue of having initialized pgd 
> (which seems to be what i386 needs).

pgd is tiny on i386 PAE (4 * 16 bytes). Are you sure reinitializing that
is a serious issue? For i386 non PAE (and x86-64) it is just a normal 4k
page.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 16:43             ` Andi Kleen
@ 2008-03-10 17:19               ` Hugh Dickins
  2008-03-10 17:25                 ` Andi Kleen
  2008-03-10 17:31                 ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 44+ messages in thread
From: Hugh Dickins @ 2008-03-10 17:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds, Bart Van Assche,
	Jeremy Fitzhardinge

On Mon, 10 Mar 2008, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
> > 
> > Zeroed pages however will not address the issue of having initialized pgd 
> > (which seems to be what i386 needs).
> 
> pgd is tiny on i386 PAE (4 * 16 bytes). Are you sure reinitializing that
> is a serious issue? ...

It used to be tiny (32 aligned bytes), then 2.6.22's quicklist enlarged
that to a whole (lowmem) page.  I think we were all too busy with other
stuff to protest loudly enough about that bloat.

If the quicklists are going, it'd be good for PAE to go back to a
kmem_cache of 32-byte entries as in 2.6.21 - I think Ingo's patch is
still using a whole page there.

Or have sl?b alignment changes, or virtualization issues (locking
per underlying struct page?), made a kmem_cache awkward there now?

Hugh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 17:19               ` Hugh Dickins
@ 2008-03-10 17:25                 ` Andi Kleen
  2008-03-10 17:31                 ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2008-03-10 17:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, Christoph Lameter, Andrew Morton, Ingo Molnar,
	KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Bart Van Assche, Jeremy Fitzhardinge

On Mon, Mar 10, 2008 at 05:19:32PM +0000, Hugh Dickins wrote:
> On Mon, 10 Mar 2008, Andi Kleen wrote:
> > Christoph Lameter <clameter@sgi.com> writes:
> > > 
> > > Zeroed pages however will not address the issue of having initialized pgd 
> > > (which seems to be what i386 needs).
> > 
> > pgd is tiny on i386 PAE (4 * 16 bytes). Are you sure reinitializing that
> > is a serious issue? ...
> 
> It used to be tiny (32 aligned bytes), then 2.6.22's quicklist enlarged
> that to a whole (lowmem) page.  I think we were all too busy with other
> stuff to protest loudly enough about that bloat.

Interesting; I missed.

Hmm, that might have been for Xen or Vmware too (which like some structures
be page aligned) But I agree it is bloat in general.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 17:19               ` Hugh Dickins
  2008-03-10 17:25                 ` Andi Kleen
@ 2008-03-10 17:31                 ` Jeremy Fitzhardinge
  2008-03-10 17:53                   ` Andi Kleen
  1 sibling, 1 reply; 44+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-10 17:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, Christoph Lameter, Andrew Morton, Ingo Molnar,
	KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Bart Van Assche

Hugh Dickins wrote:
> On Mon, 10 Mar 2008, Andi Kleen wrote:
>   
>> Christoph Lameter <clameter@sgi.com> writes:
>>     
>>> Zeroed pages however will not address the issue of having initialized pgd 
>>> (which seems to be what i386 needs).
>>>       
>> pgd is tiny on i386 PAE (4 * 16 bytes). Are you sure reinitializing that
>> is a serious issue? ...
>>     
>
> It used to be tiny (32 aligned bytes), then 2.6.22's quicklist enlarged
> that to a whole (lowmem) page.  I think we were all too busy with other
> stuff to protest loudly enough about that bloat.
>   

Yes, I was surprised about the pgd page allocation.  It's currently 
necessary for Xen (but I think I can fix that easily enough).  But I was 
especially surprised it was imposed for everyone.

> If the quicklists are going, it'd be good for PAE to go back to a
> kmem_cache of 32-byte entries as in 2.6.21 - I think Ingo's patch is
> still using a whole page there.
>   

+1.  We'd still need to maintain a list to link all the pgds together, 
but I think we can just allocate that out of the cache too (either the 
same object or a separate pgd list cache).

> Or have sl?b alignment changes, or virtualization issues (locking
> per underlying struct page?), made a kmem_cache awkward there now?
>   
I think only Xen has a constraint.  At the moment we rely on page-sized 
pgds for two things:

   1. Xen marks the whole page as being of pgd-type, and so it can't
      have non-pgd contents (as the page would be RO, and any contents
      would be validated as pgd entries).  However I can fix that by
      maintaining a separate per-cpu pgd page, and just copy the four
      entries over when cr3 is reloaded.  This would move the
      Xen-specific requirements into the Xen code without affecting the
      rest of the kernel.
   2. We still need to maintain a list of pgds, as I discussed above.

So from my perspective there are no insoluble problems, and I'd fully 
support the transition.

I've been working on unifying the pgalloc stuff, and the quicklist 
(i386) vs non-quicklist (x86-64) use has been a bit of a thorn in my 
side.  Of course there'll still be the page-sized pgd vs non-page-sized 
pgd, but we just have to live with that.

    J

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 17:31                 ` Jeremy Fitzhardinge
@ 2008-03-10 17:53                   ` Andi Kleen
  2008-03-10 18:35                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2008-03-10 17:53 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Andi Kleen, Christoph Lameter, Andrew Morton,
	Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Bart Van Assche

>   1. Xen marks the whole page as being of pgd-type, and so it can't
>      have non-pgd contents (as the page would be RO, and any contents
>      would be validated as pgd entries).  However I can fix that by

That sounds more like a Xen bug. Why should it validate anything except
the first four entries which are used by the hardware? 

I can imagine that false sharing would be slow because you would
have unnecessary traps, but it shouldn't be a correctness issue.

>      maintaining a separate per-cpu pgd page, and just copy the four
>      entries over when cr3 is reloaded.  This would move the
>      Xen-specific requirements into the Xen code without affecting the
>      rest of the kernel.

x86-64 used to do that, but it turned out this breaks some shared cache 
optimizations on the P4 between SMT threads. So you might actually see 
user space performance regressions from it.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 17:53                   ` Andi Kleen
@ 2008-03-10 18:35                     ` Jeremy Fitzhardinge
  2008-03-10 19:06                       ` Andi Kleen
  2008-03-10 20:54                       ` H. Peter Anvin
  0 siblings, 2 replies; 44+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-10 18:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Christoph Lameter, Andrew Morton, Ingo Molnar,
	KOSAKI Motohiro, Thomas Gleixner, LKML, Linus Torvalds,
	Bart Van Assche

Andi Kleen wrote:
>>   1. Xen marks the whole page as being of pgd-type, and so it can't
>>      have non-pgd contents (as the page would be RO, and any contents
>>      would be validated as pgd entries).  However I can fix that by
>>     
>
> That sounds more like a Xen bug. Why should it validate anything except
> the first four entries which are used by the hardware? 
>
> I can imagine that false sharing would be slow because you would
> have unnecessary traps, but it shouldn't be a correctness issue.
>   

Yeah, the most fundamental problem is that the whole page is RO, so even 
if Xen trapped and emulated, it still makes a very bad neighbour.  I was 
also going to say that there's no reason why you couldn't pack multiple 
pgds into one page, but I think we can only specify the cr3 at page 
resolution anyway.

>>      maintaining a separate per-cpu pgd page, and just copy the four
>>      entries over when cr3 is reloaded.  This would move the
>>      Xen-specific requirements into the Xen code without affecting the
>>      rest of the kernel.
>>     
>
> x86-64 used to do that, but it turned out this breaks some shared cache 
> optimizations on the P4 between SMT threads. So you might actually see 
> user space performance regressions from it.
>   

Hm, given that Xen doesn't make any topology guarantees about vcpus, I 
don't think it would make much difference.  I think it would only be 
observable if you actually pinned the vcpus to physical siblings.  
Besides we could easily maintain sibling info and look to see if we're 
about to reload cr3 to match the other thread(s), then just use the same 
pgd copy.

    J

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 18:35                     ` Jeremy Fitzhardinge
@ 2008-03-10 19:06                       ` Andi Kleen
  2008-03-10 20:54                       ` H. Peter Anvin
  1 sibling, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2008-03-10 19:06 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Hugh Dickins, Christoph Lameter, Andrew Morton,
	Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Bart Van Assche

On Mon, Mar 10, 2008 at 11:35:50AM -0700, Jeremy Fitzhardinge wrote:
> Hm, given that Xen doesn't make any topology guarantees about vcpus, I 
> don't think it would make much difference.  I think it would only be 
> observable if you actually pinned the vcpus to physical siblings.  
> Besides we could easily maintain sibling info and look to see if we're 
> about to reload cr3 to match the other thread(s), then just use the same 
> pgd copy.

The optimization doesn't really require any topology information; 
you just should make sure that if you have a VM shared between
different CPUs that it has the same CR3. The easiest way to enforce
that is to always only use a single CR3 for a VM.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 18:35                     ` Jeremy Fitzhardinge
  2008-03-10 19:06                       ` Andi Kleen
@ 2008-03-10 20:54                       ` H. Peter Anvin
  2008-03-10 21:26                         ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 44+ messages in thread
From: H. Peter Anvin @ 2008-03-10 20:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Hugh Dickins, Christoph Lameter, Andrew Morton,
	Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Bart Van Assche

Jeremy Fitzhardinge wrote:
> 
> Yeah, the most fundamental problem is that the whole page is RO, so even 
> if Xen trapped and emulated, it still makes a very bad neighbour.  I was 
> also going to say that there's no reason why you couldn't pack multiple 
> pgds into one page, but I think we can only specify the cr3 at page 
> resolution anyway.
> 

Wrong.  With PAE you can specify it at any 32-byte boundary.

	-hpa


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 20:54                       ` H. Peter Anvin
@ 2008-03-10 21:26                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 44+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-10 21:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Hugh Dickins, Christoph Lameter, Andrew Morton,
	Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner, LKML,
	Linus Torvalds, Bart Van Assche

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> Yeah, the most fundamental problem is that the whole page is RO, so 
>> even if Xen trapped and emulated, it still makes a very bad 
>> neighbour.  I was also going to say that there's no reason why you 
>> couldn't pack multiple pgds into one page, but I think we can only 
>> specify the cr3 at page resolution anyway.
>>
>
> Wrong.  With PAE you can specify it at any 32-byte boundary. 

Yes, the x86 lets you, but Xen doesn't - mostly because it reuses the 
lower 12 bits as upper bits of the pfn so that you can specify a 
pagetable base at >4G.

    J

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-10 15:54           ` Christoph Lameter
  2008-03-10 16:43             ` Andi Kleen
@ 2008-03-11  4:07             ` Nick Piggin
  2008-03-21 12:52               ` Bart Van Assche
  1 sibling, 1 reply; 44+ messages in thread
From: Nick Piggin @ 2008-03-11  4:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Ingo Molnar, KOSAKI Motohiro, Thomas Gleixner,
	LKML, Linus Torvalds, Bart Van Assche

On Tuesday 11 March 2008 02:54, Christoph Lameter wrote:
> On Sun, 9 Mar 2008, Andrew Morton wrote:
> > - They are pretty marginal from a performance POV (iirc)
>
> Pretty significant in our experience.
>
> >   Making __GFP_ZERO smarter/more efficient would be a preferable way of
> >   addressing any performance problems we have in there.
>
> Looking at the page allocator "fastpath": The basic reaons that this was
> such a good optimization was that the page allocator is expensive to call.
> The hotpath gets more and more clogged with logic. Fixing the page
> allocator to be more efficient may be the right approach here. That could
> also include having a list of zeroed pages.

This is insane. We add more and more of this NUMA and cpuset and anti-frag
and page zeroing logic to the page allocator, and then decide that we
don't actually need to obey any of those rules when we're running lat_proc.

You will never be able to make the page allocator faster than a single list
of pages. The reason is because we actually *want* some of these checks and
heuristics in the page allocator.

And I doubt a list of zeroed pages is the right approach. That's just
adding more complexity for lat_proc AFAIKS.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-11  4:07             ` Nick Piggin
@ 2008-03-21 12:52               ` Bart Van Assche
  2008-03-21 14:45                 ` Ingo Molnar
  0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2008-03-21 12:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, Ingo Molnar, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds

On Tue, Mar 11, 2008 at 5:07 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>  [ ... ]
>
>  This is insane. We add more and more of this NUMA and cpuset and anti-frag
>  and page zeroing logic to the page allocator, and then decide that we
>  don't actually need to obey any of those rules when we're running lat_proc.
>
>  You will never be able to make the page allocator faster than a single list
>  of pages. The reason is because we actually *want* some of these checks and
>  heuristics in the page allocator.
>
>  And I doubt a list of zeroed pages is the right approach. That's just
>  adding more complexity for lat_proc AFAIKS.

Is anyone currently working on this
(http://bugzilla.kernel.org/show_bug.cgi?id=9991) ? I think this
should either be fixed or documented as a known issue.

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-21 12:52               ` Bart Van Assche
@ 2008-03-21 14:45                 ` Ingo Molnar
  2008-03-26  7:45                   ` Bart Van Assche
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2008-03-21 14:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Nick Piggin, Christoph Lameter, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds


* Bart Van Assche <bart.vanassche@gmail.com> wrote:

> Is anyone currently working on this 
> (http://bugzilla.kernel.org/show_bug.cgi?id=9991) ? I think this 
> should either be fixed or documented as a known issue.

it should be fixed in latest -git (by commit 985a34bd75cc8c9) - can you 
still see it?

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-21 14:45                 ` Ingo Molnar
@ 2008-03-26  7:45                   ` Bart Van Assche
  2008-03-26  7:53                     ` Andrew Morton
  2008-03-26  8:13                     ` Ingo Molnar
  0 siblings, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2008-03-26  7:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Christoph Lameter, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds

On Fri, Mar 21, 2008 at 3:45 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
>  * Bart Van Assche <bart.vanassche@gmail.com> wrote:
>
>  > Is anyone currently working on this
>  > (http://bugzilla.kernel.org/show_bug.cgi?id=9991) ? I think this
>  > should either be fixed or documented as a known issue.
>
>  it should be fixed in latest -git (by commit 985a34bd75cc8c9) - can you
>  still see it?

This issue is indeed fixed now, thanks (retested with kernel version
2.6.25-rc6-00333-ga4083c9-dirty). Where can I find the patch that
fixed this issue ? A Google query for the commit ID returned only one
result, and that was an URL that pointed to your e-mail.

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-26  7:45                   ` Bart Van Assche
@ 2008-03-26  7:53                     ` Andrew Morton
  2008-03-26  8:13                     ` Ingo Molnar
  1 sibling, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2008-03-26  7:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ingo Molnar, Nick Piggin, Christoph Lameter, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds

On Wed, 26 Mar 2008 08:45:46 +0100 "Bart Van Assche" <bart.vanassche@gmail.com> wrote:

> On Fri, Mar 21, 2008 at 3:45 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >  * Bart Van Assche <bart.vanassche@gmail.com> wrote:
> >
> >  > Is anyone currently working on this
> >  > (http://bugzilla.kernel.org/show_bug.cgi?id=9991) ? I think this
> >  > should either be fixed or documented as a known issue.
> >
> >  it should be fixed in latest -git (by commit 985a34bd75cc8c9) - can you
> >  still see it?
> 
> This issue is indeed fixed now, thanks (retested with kernel version
> 2.6.25-rc6-00333-ga4083c9-dirty). Where can I find the patch that
> fixed this issue ? A Google query for the commit ID returned only one
> result, and that was an URL that pointed to your e-mail.
> 

commit 985a34bd75cc8c96e43f00dcdda7c3fdb51a3026
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Sun Mar 9 13:14:37 2008 +0100

    x86: remove quicklists
    
    quicklists cause a serious memory leak on 32-bit x86,
    as documented at:
    
      http://bugzilla.kernel.org/show_bug.cgi?id=9991
    
    the reason is that the quicklist pool is a special-purpose
    cache that grows out of proportion. It is not accounted for
    anywhere and users have no way to even realize that it's
    the quicklists that are causing RAM usage spikes. It was
    supposed to be a relatively small pool, but as demonstrated
    by KOSAKI Motohiro, they can grow as large as:
    
      Quicklists:    1194304 kB
    
    given how much trouble this code has caused historically,
    and given that Andrew objected to its introduction on x86
    (years ago), the best option at this point is to remove them.
    
    [ any performance benefits of caching constructed pgds should
      be implemented in a more generic way (possibly within the page
      allocator), while still allowing constructed pages to be
      allocated by other workloads. ]
    
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f41c953..237fc12 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -66,9 +66,6 @@ config MMU
 config ZONE_DMA
 	def_bool y
 
-config QUICKLIST
-	def_bool X86_32
-
 config SBUS
 	bool
 
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 73aba71..2f9e9af 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -342,12 +342,16 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 
-	mm->pgd = pgd;		/* so that alloc_pd can use it */
+	/* so that alloc_pd can use it */
+	mm->pgd = pgd;
+	if (pgd)
+		pgd_ctor(pgd);
 
 	if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
-		quicklist_free(0, pgd_dtor, pgd);
+		pgd_dtor(pgd);
+		free_page((unsigned long)pgd);
 		pgd = NULL;
 	}
 
@@ -357,12 +361,8 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	quicklist_free(0, pgd_dtor, pgd);
-}
-
-void check_pgt_cache(void)
-{
-	quicklist_trim(0, pgd_dtor, 25, 16);
+	pgd_dtor(pgd);
+	free_page((unsigned long)pgd);
 }
 
 void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index a842c72..4e6a0fc 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -26,10 +26,9 @@ struct mm_struct;
 struct vm_area_struct;
 
 extern pgd_t swapper_pg_dir[1024];
-extern struct kmem_cache *pmd_cache;
-void check_pgt_cache(void);
 
-static inline void pgtable_cache_init(void) {}
+static inline void pgtable_cache_init(void) { }
+static inline void check_pgt_cache(void) { }
 void paging_init(void);
 
 


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-26  7:45                   ` Bart Van Assche
  2008-03-26  7:53                     ` Andrew Morton
@ 2008-03-26  8:13                     ` Ingo Molnar
  2008-03-26 10:37                       ` Bart Van Assche
  1 sibling, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2008-03-26  8:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Nick Piggin, Christoph Lameter, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds


* Bart Van Assche <bart.vanassche@gmail.com> wrote:

> On Fri, Mar 21, 2008 at 3:45 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >  * Bart Van Assche <bart.vanassche@gmail.com> wrote:
> >
> >  > Is anyone currently working on this
> >  > (http://bugzilla.kernel.org/show_bug.cgi?id=9991) ? I think this
> >  > should either be fixed or documented as a known issue.
> >
> >  it should be fixed in latest -git (by commit 985a34bd75cc8c9) - can you
> >  still see it?
> 
> This issue is indeed fixed now, thanks (retested with kernel version 
> 2.6.25-rc6-00333-ga4083c9-dirty). Where can I find the patch that 
> fixed this issue ? A Google query for the commit ID returned only one 
> result, and that was an URL that pointed to your e-mail.

thanks a lot Bart for the persistent testing - this was a nasty one to 
track down. Find the standalone fix below.

	Ingo

---------------->
commit 985a34bd75cc8c96e43f00dcdda7c3fdb51a3026
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Sun Mar 9 13:14:37 2008 +0100

    x86: remove quicklists
    
    quicklists cause a serious memory leak on 32-bit x86,
    as documented at:
    
      http://bugzilla.kernel.org/show_bug.cgi?id=9991
    
    the reason is that the quicklist pool is a special-purpose
    cache that grows out of proportion. It is not accounted for
    anywhere and users have no way to even realize that it's
    the quicklists that are causing RAM usage spikes. It was
    supposed to be a relatively small pool, but as demonstrated
    by KOSAKI Motohiro, they can grow as large as:
    
      Quicklists:    1194304 kB
    
    given how much trouble this code has caused historically,
    and given that Andrew objected to its introduction on x86
    (years ago), the best option at this point is to remove them.
    
    [ any performance benefits of caching constructed pgds should
      be implemented in a more generic way (possibly within the page
      allocator), while still allowing constructed pages to be
      allocated by other workloads. ]
    
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f41c953..237fc12 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -66,9 +66,6 @@ config MMU
 config ZONE_DMA
 	def_bool y
 
-config QUICKLIST
-	def_bool X86_32
-
 config SBUS
 	bool
 
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 73aba71..2f9e9af 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -342,12 +342,16 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 
-	mm->pgd = pgd;		/* so that alloc_pd can use it */
+	/* so that alloc_pd can use it */
+	mm->pgd = pgd;
+	if (pgd)
+		pgd_ctor(pgd);
 
 	if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
-		quicklist_free(0, pgd_dtor, pgd);
+		pgd_dtor(pgd);
+		free_page((unsigned long)pgd);
 		pgd = NULL;
 	}
 
@@ -357,12 +361,8 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	pgd_mop_up_pmds(mm, pgd);
-	quicklist_free(0, pgd_dtor, pgd);
-}
-
-void check_pgt_cache(void)
-{
-	quicklist_trim(0, pgd_dtor, 25, 16);
+	pgd_dtor(pgd);
+	free_page((unsigned long)pgd);
 }
 
 void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index a842c72..4e6a0fc 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -26,10 +26,9 @@ struct mm_struct;
 struct vm_area_struct;
 
 extern pgd_t swapper_pg_dir[1024];
-extern struct kmem_cache *pmd_cache;
-void check_pgt_cache(void);
 
-static inline void pgtable_cache_init(void) {}
+static inline void pgtable_cache_init(void) { }
+static inline void check_pgt_cache(void) { }
 void paging_init(void);
 
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-26  8:13                     ` Ingo Molnar
@ 2008-03-26 10:37                       ` Bart Van Assche
  2008-03-26 16:34                         ` Christoph Lameter
  0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2008-03-26 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Christoph Lameter, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds

On Wed, Mar 26, 2008 at 9:13 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
>  * Bart Van Assche <bart.vanassche@gmail.com> wrote:
>
>  > This issue is indeed fixed now, thanks (retested with kernel version
>  > 2.6.25-rc6-00333-ga4083c9-dirty). Where can I find the patch that
>  > fixed this issue ? A Google query for the commit ID returned only one
>  > result, and that was an URL that pointed to your e-mail.
>
>  thanks a lot Bart for the persistent testing - this was a nasty one to
>  track down. Find the standalone fix below.

Thanks to everyone who worked on this. By the way, I have asked Albert
Calahan, the procps maintainer, to fix the free/top/vmstat tools such
that these take recently added /proc/meminfo fields into account and
display correct values on recent 2.6 kernels. See also
http://bugzilla.kernel.org/show_bug.cgi?id=9991 and
http://marc.info/?l=linux-mm&m=120496901605830&w=2.

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-26 10:37                       ` Bart Van Assche
@ 2008-03-26 16:34                         ` Christoph Lameter
  2008-03-27  9:48                           ` Bart Van Assche
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Lameter @ 2008-03-26 16:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ingo Molnar, Nick Piggin, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds

On Wed, 26 Mar 2008, Bart Van Assche wrote:

> Thanks to everyone who worked on this. By the way, I have asked Albert 
> Calahan, the procps maintainer, to fix the free/top/vmstat tools such 
> that these take recently added /proc/meminfo fields into account and 
> display correct values on recent 2.6 kernels. See also 
> http://bugzilla.kernel.org/show_bug.cgi?id=9991 and 
> http://marc.info/?l=linux-mm&m=120496901605830&w=2.

Be aware that some distros have hacked their meminfo (proc and 
/sys/devices/system/node/node*/meminfo) output to be more compatible with 
recent 2.6 kernels but not updated to the new ZVC counter scheme. On those 
kernels (<2.6.17) counters may show weird values and the sum of the per 
node counters in sysfs may not yield the same amount shown in 
/proc/meminfo.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: quicklists confuse meminfo
  2008-03-26 16:34                         ` Christoph Lameter
@ 2008-03-27  9:48                           ` Bart Van Assche
  0 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2008-03-27  9:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Nick Piggin, Andrew Morton, KOSAKI Motohiro,
	Thomas Gleixner, LKML, Linus Torvalds, Albert Cahalan

On Wed, Mar 26, 2008 at 5:34 PM, Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 26 Mar 2008, Bart Van Assche wrote:
>
>  > Thanks to everyone who worked on this. By the way, I have asked Albert
>  > Calahan, the procps maintainer, to fix the free/top/vmstat tools such
>  > that these take recently added /proc/meminfo fields into account and
>  > display correct values on recent 2.6 kernels. See also
>  > http://bugzilla.kernel.org/show_bug.cgi?id=9991 and
>  > http://marc.info/?l=linux-mm&m=120496901605830&w=2.
>
>  Be aware that some distros have hacked their meminfo (proc and
>  /sys/devices/system/node/node*/meminfo) output to be more compatible with
>  recent 2.6 kernels but not updated to the new ZVC counter scheme. On those
>  kernels (<2.6.17) counters may show weird values and the sum of the per
>  node counters in sysfs may not yield the same amount shown in
>  /proc/meminfo.

In my opinion /proc/meminfo is part of the kernelspace - userspace
API, and any Linux distributor that modifies /proc/meminfo is
responsible for modifying the tools that rely on this information.

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2008-03-27  9:49 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-09 10:19 quicklists confuse meminfo Thomas Gleixner
2008-03-09 10:26 ` Bart Van Assche
2008-03-09 10:29 ` Andi Kleen
2008-03-09 10:42 ` KOSAKI Motohiro
2008-03-09 12:00   ` Thomas Gleixner
2008-03-09 11:14 ` Ingo Molnar
2008-03-09 11:56   ` Thomas Gleixner
2008-03-09 12:01     ` Ingo Molnar
2008-03-09 12:49       ` Andi Kleen
2008-03-10 15:51         ` Christoph Lameter
2008-03-09 12:03   ` Johannes Weiner
2008-03-09 12:03   ` KOSAKI Motohiro
2008-03-09 12:09     ` Ingo Molnar
2008-03-09 12:34       ` Ingo Molnar
2008-03-09 12:51         ` KOSAKI Motohiro
2008-03-09 13:20         ` Thomas Gleixner
2008-03-09 18:46         ` Andrew Morton
2008-03-09 20:21           ` Andi Kleen
2008-03-10 15:54           ` Christoph Lameter
2008-03-10 16:43             ` Andi Kleen
2008-03-10 17:19               ` Hugh Dickins
2008-03-10 17:25                 ` Andi Kleen
2008-03-10 17:31                 ` Jeremy Fitzhardinge
2008-03-10 17:53                   ` Andi Kleen
2008-03-10 18:35                     ` Jeremy Fitzhardinge
2008-03-10 19:06                       ` Andi Kleen
2008-03-10 20:54                       ` H. Peter Anvin
2008-03-10 21:26                         ` Jeremy Fitzhardinge
2008-03-11  4:07             ` Nick Piggin
2008-03-21 12:52               ` Bart Van Assche
2008-03-21 14:45                 ` Ingo Molnar
2008-03-26  7:45                   ` Bart Van Assche
2008-03-26  7:53                     ` Andrew Morton
2008-03-26  8:13                     ` Ingo Molnar
2008-03-26 10:37                       ` Bart Van Assche
2008-03-26 16:34                         ` Christoph Lameter
2008-03-27  9:48                           ` Bart Van Assche
2008-03-09 19:11         ` Arjan van de Ven
2008-03-09 19:25           ` Ingo Molnar
2008-03-09 19:27             ` Ingo Molnar
2008-03-09 19:31               ` Ingo Molnar
2008-03-10 15:57               ` Christoph Lameter
2008-03-10 15:55             ` Christoph Lameter
2008-03-09 12:47       ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).