linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* get_vmalloc_info() and /proc/meminfo insanely expensive
@ 2015-08-13  3:29 Linus Torvalds
  2015-08-13  4:00 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13  3:29 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton, Al Viro; +Cc: Linux Kernel Mailing List

I just did some profiling of a simple "make test" in the git repo, and
was surprised by the top kernel offender: get_vmalloc_info() showed up
at roughly 4% cpu use.

It turns out that bash ends up reading /proc/meminfo on every single
activation, and "make test" is basically just running a huge
collection of shell scripts. You can verify by just doing

    strace -o trace sh -c "echo"

to see what bash does on your system. I suspect it's actually glibc,
because a quick google finds the function "get_phys_pages()" that just
looks at the "MemTotal" line (or possibly get_avphys_pageslooks at the
MemFree" line).

Ok, so bash is insane for caring so deeply that it does this
regardless of anything else. But what else is new - user space does
odd things. It's like a truism.

My gut feel for this is that we should just rate-limit this and cache
the vmalloc information for a fraction of a second or something. Maybe
we could expose total memory sizes in some more efficient format, but
it's not like existing binaries will magically de-crapify themselves,
so just speeding up meminfo sounds like a good thing.

Maybe we could even cache the whole seqfile buffer - Al? How painful
would something like that be? Although from the profiles, it's really
just the vmalloc info gathering that shows up as actually wasting CPU
cycles..

                    Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  3:29 get_vmalloc_info() and /proc/meminfo insanely expensive Linus Torvalds
@ 2015-08-13  4:00 ` Andrew Morton
  2015-08-13  5:52   ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-08-13  4:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Wed, 12 Aug 2015 20:29:34 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I just did some profiling of a simple "make test" in the git repo, and
> was surprised by the top kernel offender: get_vmalloc_info() showed up
> at roughly 4% cpu use.
> 
> It turns out that bash ends up reading /proc/meminfo on every single
> activation, and "make test" is basically just running a huge
> collection of shell scripts. You can verify by just doing
> 
>     strace -o trace sh -c "echo"
> 
> to see what bash does on your system. I suspect it's actually glibc,
> because a quick google finds the function "get_phys_pages()" that just
> looks at the "MemTotal" line (or possibly get_avphys_pageslooks at the
> MemFree" line).

And bash surely isn't interested in vmalloc stats.  Putting all these
things in the same file wasn't the smartest thing we've ever done.

> Ok, so bash is insane for caring so deeply that it does this
> regardless of anything else. But what else is new - user space does
> odd things. It's like a truism.
> 
> My gut feel for this is that we should just rate-limit this and cache
> the vmalloc information for a fraction of a second or something. Maybe
> we could expose total memory sizes in some more efficient format, but
> it's not like existing binaries will magically de-crapify themselves,
> so just speeding up meminfo sounds like a good thing.
> 
> Maybe we could even cache the whole seqfile buffer - Al? How painful
> would something like that be? Although from the profiles, it's really
> just the vmalloc info gathering that shows up as actually wasting CPU
> cycles..
> 

Do your /proc/meminfo vmalloc numbers actually change during that build?
Mine don't.  Perhaps we can cache the most recent vmalloc_info and
invalidate that cache whenever someone does a vmalloc/vfree/etc.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  4:00 ` Andrew Morton
@ 2015-08-13  5:52   ` Linus Torvalds
  2015-08-13  7:42     ` Rasmus Villemoes
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13  5:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Wed, Aug 12, 2015 at 9:00 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Do your /proc/meminfo vmalloc numbers actually change during that build?
> Mine don't.  Perhaps we can cache the most recent vmalloc_info and
> invalidate that cache whenever someone does a vmalloc/vfree/etc.

Sure, that works too.

Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
pretty strange in setup_vmalloc_vm(), for example. If that newly
allocated "va" that we haven't even exposed to anybody yet has its
address or size changed, we're screwed in so many ways.

I get the feeling this file should be rewritten. But that's not going
to happen. The "let's just cache the last value for one jiffy" seemed
to be the minimal fixup to it.

                     Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  5:52   ` Linus Torvalds
@ 2015-08-13  7:42     ` Rasmus Villemoes
  2015-08-13 16:50       ` Linus Torvalds
  2015-08-13 18:32     ` Linus Torvalds
  2015-08-13 20:26     ` Andrew Morton
  2 siblings, 1 reply; 12+ messages in thread
From: Rasmus Villemoes @ 2015-08-13  7:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13 2015, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Aug 12, 2015 at 9:00 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> Do your /proc/meminfo vmalloc numbers actually change during that build?
>> Mine don't.  Perhaps we can cache the most recent vmalloc_info and
>> invalidate that cache whenever someone does a vmalloc/vfree/etc.
>
> Sure, that works too.
>
> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.
>

I think it's simpler and better to fix glibc. Looking at the history,
the code for get_[av]phys_pages was added in 1996 (in the git repo
commit 845dcb57), with comments such as

 /* Return the number of pages of physical memory in the system.  There
    is currently (as of version 2.0.21) no system call to determine the
    number.  It is planned for the 2.1.x series to add this, though.

and

    /* XXX Here will come a test for the new system call.  */


And that syscall seems to be sysinfo(). So even though sysinfo() also
returns much more than required, it is still more than an order of
magnitude faster than reading and parsing /proc/meminfo (in the quick
microbench I threw together):

#include <stdio.h>
#include <sys/sysinfo.h>
#include <rdtsc.h>

void do_get_phys_pages(void)
{
	get_phys_pages();
}
void do_get_avphys_pages(void)
{
	get_avphys_pages();
}


void do_sysinfo(void)
{
	struct sysinfo info;

	sysinfo(&info);
}

void time_this(const char *name, void (*f)(void), int rep)
{
	int i;
	unsigned long start, stop;

	start = rdtsc();
	for (i = 0; i < rep; ++i)
		f();
	stop = rdtsc();

	printf("%-20s\t%d\t%lu\t%.1f\n", name, rep, stop-start, (double)(stop-start)/rep);
}

#define time_this(f, rep) time_this(#f, do_ ## f, rep)

int main(void)
{
	time_this(sysinfo, 1);
	time_this(get_phys_pages, 1);
	time_this(get_avphys_pages, 1);

	time_this(sysinfo, 1);
	time_this(get_phys_pages, 1);
	time_this(get_avphys_pages, 1);

	time_this(sysinfo, 10);
	time_this(get_phys_pages, 10);
	time_this(get_avphys_pages, 10);
	
	return 0;
}

$ ./sysinfo 
sysinfo                 1       6056    6056.0
get_phys_pages          1       226744  226744.0
get_avphys_pages        1       84480   84480.0
sysinfo                 1       2288    2288.0
get_phys_pages          1       73216   73216.0
get_avphys_pages        1       76692   76692.0
sysinfo                 10      6856    685.6
get_phys_pages          10      626936  62693.6
get_avphys_pages        10      604440  60444.0

Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  7:42     ` Rasmus Villemoes
@ 2015-08-13 16:50       ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 16:50 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13, 2015 at 12:42 AM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> I think it's simpler and better to fix glibc.

Well, I certainly don't disagree, but at the same time I suspect that
(a) many distros will not update glibc very aggressively and (b) we
should fix the unnecessarily expensive kernel operation regardless.

I think that just printing out the ASCII numbers in /proc/meminfo
should be a lot more expensive than computing the numbers. Which is
clearly not the case now, because we do that crazy expensive dynamic
computation every time, even though it hardly ever really changes (and
nobody even really cares about the values).

So I'd prefer to fix the kernel for existing insane users (because the
kernel is just doing stupid things that it didn't expect anybody to
care about), _and_ fix glibc.

                     Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  5:52   ` Linus Torvalds
  2015-08-13  7:42     ` Rasmus Villemoes
@ 2015-08-13 18:32     ` Linus Torvalds
  2015-08-13 18:52       ` Linus Torvalds
  2015-08-13 21:15       ` Tony Luck
  2015-08-13 20:26     ` Andrew Morton
  2 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 18:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]

On Wed, Aug 12, 2015 at 10:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.

Here's a totally untested patch (I'll reboot and test soon - it does
at least compile for me).

Notice the total lack of locking, which means that it's fundamentally
racy. I really can't find it inside myself to care. Introducing those
vmalloc fields was a mistake to begin with, any races here are "ok, we
get values that were valid at some point, but it might have been a
second ago".

And I also can't find it in myself to care about the "on 32-bit,
jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
ever reading /proc/meminfo for 49 days, and then reads it in _just_
the right second, and gets a really stale value, I'm just going to do
my honey badger impression.

Because we really shouldn't have added the vmalloc information to
/proc/meminfo to begin with, and nobody ever cares about those values
anyway.

Comments?

                Linus

[-- Attachment #2: vmalloc.patch --]
[-- Type: text/x-patch, Size: 2253 bytes --]

 mm/vmalloc.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2faaa2976447..0d0b96ed8948 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2688,7 +2688,7 @@ static int __init proc_vmalloc_init(void)
 }
 module_init(proc_vmalloc_init);
 
-void get_vmalloc_info(struct vmalloc_info *vmi)
+static void calc_vmalloc_info(struct vmalloc_info *vmi)
 {
 	struct vmap_area *va;
 	unsigned long free_area_size;
@@ -2735,5 +2735,51 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
 out:
 	rcu_read_unlock();
 }
-#endif
 
+/*
+ * Calculating the vmalloc information is expensive, and nobody really
+ * deeply cares about it anyway. Yet, some versions of glibc end up
+ * reading /proc/meminfo a lot, not because they care about the vmalloc
+ * fields, but because they care about the total memory info.
+ *
+ * So to alleviate that expense, we cache the vmalloc information for
+ * a second. NOTE! This is fundamentally racy, since the accesses to
+ * the two fields in "struct vmalloc_info" and the cache timeout are
+ * all entirely unsynchronized. We just don't care.
+ */
+void get_vmalloc_info(struct vmalloc_info *vmi)
+{
+	static unsigned long cache_timeout = INITIAL_JIFFIES;
+	static struct vmalloc_info cached_info;
+	unsigned long now = jiffies, last = READ_ONCE(cache_timeout);
+
+	if (now - last < HZ) {
+		*vmi = cached_info;
+		return;
+	}
+
+	/*
+	 * We update the cache timeout early, because we (again) do
+	 * not care if somebody else comes in and sees slightly stale
+	 * information. We'd rather return more stale information
+	 * than waste time with multiple CPU's all calculating the
+	 * new state.
+	 *
+	 * Note: the barriers are here not to fix any races, but to
+	 * avoid the compiling spreading out the updates to these
+	 * variables any more than necessary.
+	 *
+	 * Also note that we calculate the new state into the 'vmi'
+	 * buffer that is passed in, and private to the caller. That
+	 * is intentional: we do not want to update the cached info
+	 * incrementally during the calculations.
+	 */
+	WRITE_ONCE(cache_timeout, now);
+	barrier();
+
+	calc_vmalloc_info(vmi);
+
+	barrier();
+	cached_info = *vmi;
+}
+#endif

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13 18:32     ` Linus Torvalds
@ 2015-08-13 18:52       ` Linus Torvalds
  2015-08-13 19:50         ` David Rientjes
  2015-08-13 21:15       ` Tony Luck
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 18:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's a totally untested patch (I'll reboot and test soon - it does
> at least compile for me).

Seems to work for me. Anybody want to object to the admittedly disgusting patch?

                 Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13 18:52       ` Linus Torvalds
@ 2015-08-13 19:50         ` David Rientjes
  2015-08-13 20:04           ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: David Rientjes @ 2015-08-13 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, 13 Aug 2015, Linus Torvalds wrote:

> On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Here's a totally untested patch (I'll reboot and test soon - it does
> > at least compile for me).
> 
> Seems to work for me. Anybody want to object to the admittedly disgusting patch?
> 

Rather than a time based approach, why not invalidate when it is known to 
change, either on the next call to get_vmalloc_info() or as part of the 
vmalloc operation itself?  If the numbers don't change during your test, 
this would seem better and I don't think any vmalloc() function is 
intended to be hot.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13 19:50         ` David Rientjes
@ 2015-08-13 20:04           ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 20:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13, 2015 at 12:50 PM, David Rientjes <rientjes@google.com> wrote:
>
> Rather than a time based approach, why not invalidate when it is known to
> change, either on the next call to get_vmalloc_info() or as part of the
> vmalloc operation itself?

I started doing that, looking at all the users of vmap_area_lock, and
decided that it's just too messy for me. I wanted something minimal
and obvious. The vmap_area handling is not obvious, especially since
the whole vmalloc statistics don't just depend on the vmap area list,
but also take the individual flags into account (ie it counts
VM_LAZY_FREE[ING] entries as having already been removed etc).

So I started looking at actually turning the vmap_area_lock into a
seqlock or even a rwlock (because some of the users are pure readers)
just to clarify things, and that wasn't hard per se, but I threw up my
hands in disgust. The code that modifies the thing is full of "take
lock, look up,. drop lock, do something, take lock again") etc.

Yes, we could just sprinkle "invalidate_cache = 1" statements around
in all the places that can cause this, but that would be pretty ad-hoc
and ugly too.  And since the reader side is actually entirely lockless
(currently using rcu, with my patch it has the additional lockless and
racy cache reading), to do it *right* you actually need to use at a
minimum memory ordering rules. So it would be fairly subtle too.

In contrast, my "just base it on time" sure as hell isn't subtle. It's
not great, but at least it's obvious and the crazy logic is
localized..

So I'd be all for somebody actually taking the time to do it right.
I'm pretty sure nobody really cares enough, though.

Willing to prove me wrong?

                   Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13  5:52   ` Linus Torvalds
  2015-08-13  7:42     ` Rasmus Villemoes
  2015-08-13 18:32     ` Linus Torvalds
@ 2015-08-13 20:26     ` Andrew Morton
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2015-08-13 20:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Wed, 12 Aug 2015 22:52:44 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
> 
> I get the feeling this file should be rewritten. But that's not going
> to happen.

Nick Piggin started doing a rewrite in 2008 but it was never completed
(https://lwn.net/Articles/285341/).


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13 18:32     ` Linus Torvalds
  2015-08-13 18:52       ` Linus Torvalds
@ 2015-08-13 21:15       ` Tony Luck
  2015-08-13 22:56         ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: Tony Luck @ 2015-08-13 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> And I also can't find it in myself to care about the "on 32-bit,
> jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
> ever reading /proc/meminfo for 49 days, and then reads it in _just_
> the right second, and gets a really stale value, I'm just going to do
> my honey badger impression.

could you at least care enough to write that as

      if (time_after(now, last + HZ)) {

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
  2015-08-13 21:15       ` Tony Luck
@ 2015-08-13 22:56         ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 22:56 UTC (permalink / raw)
  To: Tony Luck; +Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List

On Thu, Aug 13, 2015 at 2:15 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> could you at least care enough to write that as
>
>       if (time_after(now, last + HZ)) {

Absolutely not. That would be wrong.

"time_after()" covers half the "unsigned long" space (very much on
purpose). We want the cached case to trigger _only_ for the much
tighter case of exactly 1000 jiffies.

                 Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-08-13 22:56 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-13  3:29 get_vmalloc_info() and /proc/meminfo insanely expensive Linus Torvalds
2015-08-13  4:00 ` Andrew Morton
2015-08-13  5:52   ` Linus Torvalds
2015-08-13  7:42     ` Rasmus Villemoes
2015-08-13 16:50       ` Linus Torvalds
2015-08-13 18:32     ` Linus Torvalds
2015-08-13 18:52       ` Linus Torvalds
2015-08-13 19:50         ` David Rientjes
2015-08-13 20:04           ` Linus Torvalds
2015-08-13 21:15       ` Tony Luck
2015-08-13 22:56         ` Linus Torvalds
2015-08-13 20:26     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).