From: "azurIt" <azurit@pobox.sk>
To: "Mel Gorman" <mel@csn.ul.ie>,
"Andrew Morton" <akpm@linux-foundation.org>
Cc: "Eric Dumazet" <eric.dumazet@gmail.com>,
"Changli Gao" <xiaosuo@gmail.com>,
"Américo Wang" <xiyou.wangcong@gmail.com>,
"Jiri Slaby" <jslaby@suse.cz>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <linux-fsdevel@vger.kernel.org>,
"Jiri Slaby" <jirislaby@gmail.com>
Subject: Re: Regression from 2.6.36
Date: Fri, 15 Apr 2011 11:59:03 +0200
Message-ID: <20110415115903.315DEAA1@pobox.sk>
In-Reply-To: <20110414102501.GE11871@csn.ul.ie>
This new patch also works fine and fixes the problem.
Mel, I cannot run your script:
# perl watch-highorder-latency.pl
Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
# ls -ld /sys/kernel/debug/
ls: cannot access /sys/kernel/debug/: No such file or directory
azur
______________________________________________________________
> From: "Mel Gorman" <mel@csn.ul.ie>
> To: Andrew Morton <akpm@linux-foundation.org>
> Date: 14.04.2011 12:25
> Subject: Re: Regression from 2.6.36
>
> CC: "Eric Dumazet" <eric.dumazet@gmail.com>, "Changli Gao" <xiaosuo@gmail.com>, "Américo Wang" <xiyou.wangcong@gmail.com>, "Jiri Slaby" <jslaby@suse.cz>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "Jiri Slaby" <jirislaby@gmail.com>
>On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
>> On Wed, 13 Apr 2011 04:37:36 +0200
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Tuesday, 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
>> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@gmail.com> wrote:
>> > >
>> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
>> > > > <akpm@linux-foundation.org> wrote:
>> > > > >
>> > > > > It's somewhat unclear (to me) what caused this regression.
>> > > > >
>> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
>> > > > > and this makes the page allocator go nuts trying to satisfy high-order
>> > > > > page allocation requests?
>> > > > >
>> > > > > Is it because the kernel now will usually free the fdtable
>> > > > > synchronously within the rcu callback, rather than deferring this to a
>> > > > > workqueue?
>> > > > >
>> > > > > The latter seems unlikely, so I'm thinking this was a case of
>> > > > > high-order-allocations-considered-harmful?
>> > > > >
>> > > >
>> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
>> > > > fragmentation. For example, a request for 5 pages gets 8 pages, so 3
>> > > > pages are wasted, and eventually memory thrashing sets in.
>> > >
>> > > That theory sounds less likely, but could be tested by using
>> > > alloc_pages_exact().
>> > >
>> >
>> > Very unlikely, since fdtable sizes are powers of two, unless you hit
>> > sysctl_nr_open and it was changed (default value being 2^20)
>> >
>>
>> So am I correct in believing that this regression is due to the
>> high-order allocations putting excess stress onto page reclaim?
>>
>
>This is very plausible but it would be nice to get confirmation on
>what the size of the fdtable was to be sure. If it's big enough for
>high-order allocations and it's a fork-heavy workload with memory
>mostly in use, the fork() latencies could be getting very high. In
>addition, each fork is potentially kicking kswapd awake (to rebalance
>the zone for higher orders). I do not see CONFIG_COMPACTION enabled
>meaning that if I'm right in that kswapd is awake and fork() is
>entering direct reclaim, then we are lumpy reclaiming as well which
>can stall pretty severely.
>
>> If so, then how large _are_ these allocations? This perhaps can be
>> determined from /proc/slabinfo. They must be pretty huge, because slub
>> likes to do excessively-large allocations and the system handles that
>> reasonably well.
>>
>
>I'd be interested in finding out the value of /proc/sys/fs/file-max and
>the output of ulimit -n (max open files) on the main server. This
>should help us determine what the size of the fdtable is.
>
>> I suppose that a suitable fix would be
>>
>>
>> From: Andrew Morton <akpm@linux-foundation.org>
>>
>> Azurit reports large increases in system time after 2.6.36 when running
>> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
>> to allocate fdmem if possible").
>>
>> That patch caused the vfs to use kmalloc() for very large allocations and
>> this is causing excessive work (and presumably excessive reclaim) within
>> the page allocator.
>>
>> Fix it by falling back to vmalloc() earlier - when the allocation attempt
>> would have been considered "costly" by reclaim.
>>
>> Reported-by: azurIt <azurit@pobox.sk>
>> Cc: Changli Gao <xiaosuo@gmail.com>
>> Cc: Americo Wang <xiyou.wangcong@gmail.com>
>> Cc: Jiri Slaby <jslaby@suse.cz>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>> fs/file.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff -puN fs/file.c~a fs/file.c
>> --- a/fs/file.c~a
>> +++ a/fs/file.c
>> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> */
>> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>>
>> -static inline void *alloc_fdmem(unsigned int size)
>> +static void *alloc_fdmem(unsigned int size)
>> {
>> - void *data;
>> -
>> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> - if (data != NULL)
>> - return data;
>> -
>> + /*
>> + * Very large allocations can stress page reclaim, so fall back to
>> + * vmalloc() if the allocation size will be considered "large" by the VM.
>> + */
>> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>
>The reporter will need to retest to confirm this is really ok. The patch
>that was reported to help avoided high-order allocations entirely. If
>fork-heavy workloads are really entering direct reclaim and increasing
>fork latency enough to ruin performance, then this patch will also
>suffer. How much it helps depends on how big the fdtable is.
>
>> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> + if (data != NULL)
>> + return data;
>> + }
>> return vmalloc(size);
>> }
>>
>
>I'm attaching a primitive perl script that reports high-order allocation
>latencies. It'd be interesting to see what its output looks like,
>particularly when the server is in trouble, if the bug reporter has the
>time.
>
>--
>Mel Gorman
>SUSE Labs
>
>
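For reference, the arithmetic behind the quoted patch can be sketched in user-space C. This is only an illustration, not kernel code: the sketch_* names are hypothetical, and it assumes 4 KiB pages, 8-byte pointers (x86-64) and PAGE_ALLOC_COSTLY_ORDER == 3. It also assumes the allocation of interest is the array of file pointers, sized as the number of descriptors times the pointer size. It shows how a request is rounded up to a power-of-two number of pages (the 5 -> 8 pages case discussed above) and when the patched alloc_fdmem() would still try kmalloc() before falling back to vmalloc().

```c
/* Assumptions for illustration only: 4 KiB pages, 8-byte pointers. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_PTR_SIZE     8UL
/* Mirrors the kernel constant PAGE_ALLOC_COSTLY_ORDER (3). */
#define SKETCH_COSTLY_ORDER 3U

/* Smallest order such that (page size << order) covers the request. */
static unsigned int sketch_order_for(unsigned long size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Pages lost when the request is rounded up to a power-of-two block. */
static unsigned long sketch_wasted_pages(unsigned long size)
{
	unsigned long needed = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	return (1UL << sketch_order_for(size)) - needed;
}

/*
 * Returns 1 if the size is within the "not costly" limit, i.e. the
 * patched alloc_fdmem() would attempt kmalloc() first; 0 if it would
 * go straight to vmalloc().
 */
static int sketch_tries_kmalloc(unsigned long size)
{
	return size <= (SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER);
}

/* Bytes needed for the file pointer array of nr_fds descriptors. */
static unsigned long sketch_fd_array_bytes(unsigned long nr_fds)
{
	return nr_fds * SKETCH_PTR_SIZE;
}
```

Under these assumptions the cutoff is 32 KiB, which corresponds to a table of 4096 descriptors; anything larger would skip kmalloc() entirely. A table at the default sysctl_nr_open limit of 2^20 descriptors would need 8 MiB, an order-11 request.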
Thread overview: 45+ messages
2011-03-15 13:25 Regression from 2.6.36 azurIt
2011-03-17 0:15 ` Greg KH
2011-03-17 0:53 ` Dave Jones
2011-03-17 13:30 ` azurIt
2011-04-07 10:01 ` azurIt
2011-04-07 10:19 ` Jiri Slaby
2011-04-07 11:21 ` Américo Wang
2011-04-07 11:57 ` Eric Dumazet
2011-04-07 12:13 ` Eric Dumazet
2011-04-07 15:27 ` Changli Gao
2011-04-07 15:36 ` Eric Dumazet
2011-04-12 22:49 ` Andrew Morton
2011-04-13 1:23 ` Changli Gao
2011-04-13 1:31 ` Andrew Morton
2011-04-13 2:37 ` Eric Dumazet
2011-04-13 6:54 ` Regarding memory fragmentation using malloc Pintu Agarwal
2011-04-13 11:44 ` Américo Wang
2011-04-13 13:56 ` Pintu Agarwal
2011-04-13 15:25 ` Michal Nazarewicz
2011-04-14 6:44 ` Pintu Agarwal
2011-04-14 10:47 ` Michal Nazarewicz
2011-04-14 12:24 ` Pintu Agarwal
2011-04-14 12:31 ` Michal Nazarewicz
2011-04-13 21:16 ` Regression from 2.6.36 Andrew Morton
2011-04-13 21:24 ` Andrew Morton
2011-04-19 19:29 ` azurIt
2011-04-19 19:55 ` Andrew Morton
2011-04-13 21:44 ` David Rientjes
2011-04-13 21:54 ` Andrew Morton
2011-04-14 2:10 ` Eric Dumazet
2011-04-14 5:28 ` Andrew Morton
2011-04-14 6:31 ` Eric Dumazet
2011-04-14 9:08 ` azurIt
2011-04-14 10:27 ` Eric Dumazet
2011-04-14 10:31 ` azurIt
2011-04-14 10:25 ` Mel Gorman
2011-04-15 9:59 ` azurIt [this message]
2011-04-15 10:47 ` Mel Gorman
2011-04-15 10:56 ` azurIt
2011-04-15 11:17 ` Mel Gorman
2011-04-15 11:36 ` azurIt
2011-04-15 13:01 ` Mel Gorman
2011-04-15 13:21 ` azurIt
2011-04-15 14:15 ` Mel Gorman
2011-04-08 12:25 ` azurIt