* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 5:52 ` Linus Torvalds
@ 2015-08-13 7:42 ` Rasmus Villemoes
2015-08-13 16:50 ` Linus Torvalds
2015-08-13 18:32 ` Linus Torvalds
2015-08-13 20:26 ` Andrew Morton
2 siblings, 1 reply; 12+ messages in thread
From: Rasmus Villemoes @ 2015-08-13 7:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13 2015, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, Aug 12, 2015 at 9:00 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> Do your /proc/meminfo vmalloc numbers actually change during that build?
>> Mine don't. Perhaps we can cache the most recent vmalloc_info and
>> invalidate that cache whenever someone does a vmalloc/vfree/etc.
>
> Sure, that works too.
>
> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.
>
I think it's simpler and better to fix glibc. Looking at the history,
the code for get_[av]phys_pages was added in 1996 (in the git repo
commit 845dcb57), with comments such as
/* Return the number of pages of physical memory in the system. There
is currently (as of version 2.0.21) no system call to determine the
number. It is planned for the 2.1.x series to add this, though.
and
/* XXX Here will come a test for the new system call. */
And that syscall seems to be sysinfo(). So even though sysinfo() also
returns much more than required, it is still more than an order of
magnitude faster than reading and parsing /proc/meminfo (in the quick
microbench I threw together):
#include <stdio.h>
#include <sys/sysinfo.h>
#include <rdtsc.h>
void do_get_phys_pages(void)
{
get_phys_pages();
}
void do_get_avphys_pages(void)
{
get_avphys_pages();
}
void do_sysinfo(void)
{
struct sysinfo info;
sysinfo(&info);
}
void time_this(const char *name, void (*f)(void), int rep)
{
int i;
unsigned long start, stop;
start = rdtsc();
for (i = 0; i < rep; ++i)
f();
stop = rdtsc();
printf("%-20s\t%d\t%lu\t%.1f\n", name, rep, stop-start, (double)(stop-start)/rep);
}
#define time_this(f, rep) time_this(#f, do_ ## f, rep)
int main(void)
{
time_this(sysinfo, 1);
time_this(get_phys_pages, 1);
time_this(get_avphys_pages, 1);
time_this(sysinfo, 1);
time_this(get_phys_pages, 1);
time_this(get_avphys_pages, 1);
time_this(sysinfo, 10);
time_this(get_phys_pages, 10);
time_this(get_avphys_pages, 10);
return 0;
}
$ ./sysinfo
sysinfo 1 6056 6056.0
get_phys_pages 1 226744 226744.0
get_avphys_pages 1 84480 84480.0
sysinfo 1 2288 2288.0
get_phys_pages 1 73216 73216.0
get_avphys_pages 1 76692 76692.0
sysinfo 10 6856 685.6
get_phys_pages 10 626936 62693.6
get_avphys_pages 10 604440 60444.0
Rasmus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 7:42 ` Rasmus Villemoes
@ 2015-08-13 16:50 ` Linus Torvalds
0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 16:50 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13, 2015 at 12:42 AM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> I think it's simpler and better to fix glibc.
Well, I certainly don't disagree, but at the same time I suspect that
(a) many distros will not update glibc very aggressively and (b) we
should fix the unnecessarily expensive kernel operation regardless.
I think that just printing out the ASCII numbers in /proc/meminfo
should be a lot more expensive than computing the numbers. Which is
clearly not the case now, because we do that crazy expensive dynamic
computation every time, even though it hardly ever really changes (and
nobody even really cares about the values).
So I'd prefer to fix the kernel for existing insane users (because the
kernel is just doing stupid things that it didn't expect anybody to
care about), _and_ fix glibc.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 5:52 ` Linus Torvalds
2015-08-13 7:42 ` Rasmus Villemoes
@ 2015-08-13 18:32 ` Linus Torvalds
2015-08-13 18:52 ` Linus Torvalds
2015-08-13 21:15 ` Tony Luck
2015-08-13 20:26 ` Andrew Morton
2 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 18:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List
[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]
On Wed, Aug 12, 2015 at 10:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.
Here's a totally untested patch (I'll reboot and test soon - it does
at least compile for me).
Notice the total lack of locking, which means that it's fundamentally
racy. I really can't find it inside myself to care. Introducing those
vmalloc fields was a mistake to begin with, any races here are "ok, we
get values that were valid at some point, but it might have been a
second ago".
And I also can't find it in myself to care about the "on 32-bit,
jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
ever reading /proc/meminfo for 49 days, and then reads it in _just_
the right second, and gets a really stale value, I'm just going to do
my honey badger impression.
Because we really shouldn't have added the vmalloc information to
/proc/meminfo to begin with, and nobody ever cares about those values
anyway.
Comments?
Linus
[-- Attachment #2: vmalloc.patch --]
[-- Type: text/x-patch, Size: 2253 bytes --]
mm/vmalloc.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 48 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2faaa2976447..0d0b96ed8948 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2688,7 +2688,7 @@ static int __init proc_vmalloc_init(void)
}
module_init(proc_vmalloc_init);
-void get_vmalloc_info(struct vmalloc_info *vmi)
+static void calc_vmalloc_info(struct vmalloc_info *vmi)
{
struct vmap_area *va;
unsigned long free_area_size;
@@ -2735,5 +2735,51 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
out:
rcu_read_unlock();
}
-#endif
+/*
+ * Calculating the vmalloc information is expensive, and nobody really
+ * deeply cares about it anyway. Yet, some versions of glibc end up
+ * reading /proc/meminfo a lot, not because they care about the vmalloc
+ * fields, but because they care about the total memory info.
+ *
+ * So to alleviate that expense, we cache the vmalloc information for
+ * a second. NOTE! This is fundamentally racy, since the accesses to
+ * the two fields in "struct vmalloc_info" and the cache timeout are
+ * all entirely unsynchronized. We just don't care.
+ */
+void get_vmalloc_info(struct vmalloc_info *vmi)
+{
+ static unsigned long cache_timeout = INITIAL_JIFFIES;
+ static struct vmalloc_info cached_info;
+ unsigned long now = jiffies, last = READ_ONCE(cache_timeout);
+
+ if (now - last < HZ) {
+ *vmi = cached_info;
+ return;
+ }
+
+ /*
+ * We update the cache timeout early, because we (again) do
+ * not care if somebody else comes in and sees slightly stale
+ * information. We'd rather return more stale information
+ * than waste time with multiple CPU's all calculating the
+ * new state.
+ *
+ * Note: the barriers are here not to fix any races, but to
+ * avoid the compiling spreading out the updates to these
+ * variables any more than necessary.
+ *
+ * Also note that we calculate the new state into the 'vmi'
+ * buffer that is passed in, and private to the caller. That
+ * is intentional: we do not want to update the cached info
+ * incrementally during the calculations.
+ */
+ WRITE_ONCE(cache_timeout, now);
+ barrier();
+
+ calc_vmalloc_info(vmi);
+
+ barrier();
+ cached_info = *vmi;
+}
+#endif
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 18:32 ` Linus Torvalds
@ 2015-08-13 18:52 ` Linus Torvalds
2015-08-13 19:50 ` David Rientjes
2015-08-13 21:15 ` Tony Luck
1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 18:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's a totally untested patch (I'll reboot and test soon - it does
> at least compile for me).
Seems to work for me. Anybody want to object to the admittedly disgusting patch?
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 18:52 ` Linus Torvalds
@ 2015-08-13 19:50 ` David Rientjes
2015-08-13 20:04 ` Linus Torvalds
0 siblings, 1 reply; 12+ messages in thread
From: David Rientjes @ 2015-08-13 19:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, 13 Aug 2015, Linus Torvalds wrote:
> On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Here's a totally untested patch (I'll reboot and test soon - it does
> > at least compile for me).
>
> Seems to work for me. Anybody want to object to the admittedly disgusting patch?
>
Rather than a time based approach, why not invalidate when it is known to
change, either on the next call to get_vmalloc_info() or as part of the
vmalloc operation itself? If the numbers don't change during your test,
this would seem better and I don't think any vmalloc() function is
intended to be hot.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 19:50 ` David Rientjes
@ 2015-08-13 20:04 ` Linus Torvalds
0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 20:04 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13, 2015 at 12:50 PM, David Rientjes <rientjes@google.com> wrote:
>
> Rather than a time based approach, why not invalidate when it is known to
> change, either on the next call to get_vmalloc_info() or as part of the
> vmalloc operation itself?
I started doing that, looking at all the users of vmap_area_lock, and
decided that it's just too messy for me. I wanted something minimal
and obvious. The vmap_area handling is not obvious, especially since
the whole vmalloc statistics don't just depend on the vmap area list,
but also take the individual flags into account (ie it counts
VM_LAZY_FREE[ING] entries as having already been removed etc).
So I started looking at actually turning the vmap_area_lock into a
seqlock or even a rwlock (because some of the users are pure readers)
just to clarify things, and that wasn't hard per se, but I threw up my
hands in disgust. The code that modifies the thing is full of "take
lock, look up,. drop lock, do something, take lock again") etc.
Yes, we could just sprinkle "invalidate_cache = 1" statements around
in all the places that can cause this, but that would be pretty ad-hoc
and ugly too. And since the reader side is actually entirely lockless
(currently using rcu, with my patch it has the additional lockless and
racy cache reading), to do it *right* you actually need to use at a
minimum memory ordering rules. So it would be fairly subtle too.
In contrast, my "just base it on time" sure as hell isn't subtle. It's
not great, but at least it's obvious and the crazy logic is
localized..
So I'd be all for somebody actually taking the time to do it right.
I'm pretty sure nobody really cares enough, though.
Willing to prove me wrong?
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 18:32 ` Linus Torvalds
2015-08-13 18:52 ` Linus Torvalds
@ 2015-08-13 21:15 ` Tony Luck
2015-08-13 22:56 ` Linus Torvalds
1 sibling, 1 reply; 12+ messages in thread
From: Tony Luck @ 2015-08-13 21:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> And I also can't find it in myself to care about the "on 32-bit,
> jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
> ever reading /proc/meminfo for 49 days, and then reads it in _just_
> the right second, and gets a really stale value, I'm just going to do
> my honey badger impression.
could you at least care enough to write that as
if (time_after(now, last + HZ)) {
-Tony
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 21:15 ` Tony Luck
@ 2015-08-13 22:56 ` Linus Torvalds
0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2015-08-13 22:56 UTC (permalink / raw)
To: Tony Luck; +Cc: Andrew Morton, Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Thu, Aug 13, 2015 at 2:15 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> could you at least care enough to write that as
>
> if (time_after(now, last + HZ)) {
Absolutely not. That would be wrong.
"time_after()" covers half the "unsigned long" space (very much on
purpose). We want the cached case to trigger _only_ for the much
tighter case of exactly 1000 jiffies.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: get_vmalloc_info() and /proc/meminfo insanely expensive
2015-08-13 5:52 ` Linus Torvalds
2015-08-13 7:42 ` Rasmus Villemoes
2015-08-13 18:32 ` Linus Torvalds
@ 2015-08-13 20:26 ` Andrew Morton
2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2015-08-13 20:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Joonsoo Kim, Al Viro, Linux Kernel Mailing List
On Wed, 12 Aug 2015 22:52:44 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
>
> I get the feeling this file should be rewritten. But that's not going
> to happen.
Nick Piggin started doing a rewrite in 2008 but it was never completed
(https://lwn.net/Articles/285341/).
^ permalink raw reply [flat|nested] 12+ messages in thread