linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Dave Chinner <david@fromorbit.com>, Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Date: Wed, 28 May 2014 19:42:40 -0700	[thread overview]
Message-ID: <CA+55aFzdq2V-Q3WUV7hQJG8jBSAvBqdYLVTNtbD4ObVZ5yDRmw@mail.gmail.com> (raw)
In-Reply-To: <20140529013007.GF6677@dastard>

On Wed, May 28, 2014 at 6:30 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> You're focussing on the specific symptoms, not the bigger picture.
> i.e. you're ignoring all the other "let's start IO" triggers in
> direct reclaim. e.g there's two separate plug flush triggers in
> shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases here too.

In fact, I also find it distasteful just how much stack space some of
those VM routines are just using up on their own, never mind any
actual IO paths at all. The fact that __alloc_pages_nodemask() uses
350 bytes of stackspace on its own is actually quite disturbing. The
fact that kernel_map_pages() apparently has almost 400 bytes of stack
is just crazy. Obviously that case only happens with
CONFIG_DEBUG_PAGEALLOC, but still..

> I'm not saying we shouldn't turn of swap from direct reclaim, just
> that all we'd be doing by turning off swap is playing whack-a-stack
> - the next report will simply be from one of the other direct
> reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea,
though. It's not like we will ever really improve unless we start
whacking the worst cases. And it should still be a fairly limited
number.

After all, historically, some of the cases we've played whack-a-mole
on have been in XFS, so I'd think you'd be thrilled to see some other
code get blamed this time around ;)

> Regardless of whether it is swap or something external queues the
> bio on the plug, perhaps we should look at why it's done inline
> rather than by kblockd, where it was moved because it was blowing
> the stack from schedule():

So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.

Of course, we do need to worry about latency for starting IO, but any
of these kinds of memory-pressure writeback patterns are pretty much
by definition not about the latency of one _particular_ IO, so they
don't tent to be latency-sensitive. Quite the reverse: we start
writeback and then end up waiting on something else altogether
(possibly a writeback that got started much earlier).

swapout certainly is _not_ IO-latency-sensitive, especially these
days. And while we _do_ want to throttle in direct reclaim, if it's
about throttling I'd certainly think that it sounds quite reasonable
to push any unplugging to kblockd than try to do that synchronously.
If we are throttling in direct-reclaim, we need to slow things _down_
for the writer, not worry about latency.

> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.

Clearly it is true that it lacks the actual filesystem part needed for
the writeback. At the same time, Minchan's example is certainly a good
one of a filesystem (ext4) already being reasonably deep in its own
stack space when it then wants memory.

Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

        struct mpage_da_data mpd;
        struct blk_plug plug;

which looks at least understandable (nothing like the mess in the VM
code where the stack usage is because gcc creates horrible spills)

> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow, which is something
> I've never seen before.

Well, we've definitely have had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is lilkely worse in this respect too.

And Minchan running out of stack is at least _partly_ due to his debug
options (that DEBUG_PAGEALLOC thing as an extreme example, but I
suspect there's a few other options there that generate more bloated
data structures too too).

>                That implies no IO in direct reclaim context
> is safe - either from swap or io_schedule() unplugging. It also
> lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been ocurring outside the
> filesystems....

I think Minchan's stack trace definitely backs you up on that. The
filesystem part - despite that one ext4_writepages() function - is a
very small part of the whole. It sits at about ~1kB of stack. Just the
VM "top-level" writeback code is about as much, and then the VM page
alloc/shrinking code when the filesystem needs memory is *twice* that,
and then the block layer and the virtio code are another 1kB each.

The rest is just kthread overhead and that DEBUG_PAGEALLOC thing.
Other debug options might be bloating Minchan's stack use numbers in
general, but probably not by massive amounts. Locks will generally be
_hugely_ bigger due to lock debugging, but that's seldom on the stack.

So no, this is not a filesystem problem. This is definitely core VM
and block layer, no arguments what-so-ever.

I note that Jens wasn't cc'd. Added him in.

               Linus

  parent reply	other threads:[~2014-05-29  2:42 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
2014-05-28  8:37   ` Dave Chinner
2014-05-28  9:13     ` Dave Chinner
2014-05-28 16:06       ` Johannes Weiner
2014-05-28 21:55         ` Dave Chinner
2014-05-29  6:06         ` Minchan Kim
2014-05-28  9:04   ` Michael S. Tsirkin
2014-05-29  1:09     ` Minchan Kim
2014-05-29  2:44       ` Steven Rostedt
2014-05-29  4:11         ` Minchan Kim
2014-05-29  2:47       ` Rusty Russell
2014-05-29  4:10     ` virtio_ring stack usage Rusty Russell
2014-05-28  9:27   ` [RFC 2/2] x86_64: expand kernel stack to 16K Borislav Petkov
2014-05-29 13:23     ` One Thousand Gnomes
2014-05-28 14:14   ` Steven Rostedt
2014-05-28 14:23     ` H. Peter Anvin
2014-05-28 22:11       ` Dave Chinner
2014-05-28 22:42         ` H. Peter Anvin
2014-05-28 23:17           ` Dave Chinner
2014-05-28 23:21             ` H. Peter Anvin
2014-05-28 15:43   ` Richard Weinberger
2014-05-28 16:08     ` Steven Rostedt
2014-05-28 16:11       ` Richard Weinberger
2014-05-28 16:13       ` Linus Torvalds
2014-05-28 16:09   ` Linus Torvalds
2014-05-28 22:31     ` Dave Chinner
2014-05-28 22:41       ` Linus Torvalds
2014-05-29  1:30         ` Dave Chinner
2014-05-29  1:58           ` Dave Chinner
2014-05-29  2:51             ` Linus Torvalds
2014-05-29 23:36             ` Minchan Kim
2014-05-30  0:05               ` Linus Torvalds
2014-05-30  0:20                 ` Minchan Kim
2014-05-30  0:31                   ` Linus Torvalds
2014-05-30  0:50                     ` Minchan Kim
2014-05-30  1:24                       ` Linus Torvalds
2014-05-30  1:58                         ` Dave Chinner
2014-05-30  2:13                           ` Linus Torvalds
2014-05-30  6:21                         ` Minchan Kim
2014-05-30  1:30                 ` Linus Torvalds
2014-05-30  0:15               ` Dave Chinner
2014-05-30  2:12                 ` Minchan Kim
2014-05-30  4:37                   ` Linus Torvalds
2014-05-31  1:45                     ` Linus Torvalds
2014-05-30  6:12                   ` Minchan Kim
2014-06-03 13:28                   ` Rasmus Villemoes
2014-06-03 19:04                     ` Linus Torvalds
2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
2014-06-11 15:16                           ` Oleg Nesterov
2014-06-10 12:29                         ` [PATCH 2/2] wait: Use the per-task wait_queue_t in ___wait_event macro Rasmus Villemoes
2014-06-10 15:50                         ` [PATCH 0/2] Per-task wait_queue_t Peter Zijlstra
2014-06-12 21:46                           ` Rasmus Villemoes
2014-05-29  2:42           ` Linus Torvalds [this message]
2014-05-29  5:14             ` [RFC 2/2] x86_64: expand kernel stack to 16K H. Peter Anvin
2014-05-29  6:01             ` Rusty Russell
2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
2014-05-29 15:39                   ` Linus Torvalds
2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
2014-05-29 10:07                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
2014-05-29 11:18                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
2014-05-29  7:52                   ` Peter Zijlstra
2014-05-29 11:05                     ` Rusty Russell
2014-05-29 11:33                       ` Michael S. Tsirkin
2014-05-29 11:29                   ` Michael S. Tsirkin
2014-05-30  2:37                     ` Rusty Russell
2014-05-30  6:21                       ` Rusty Russell
2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
2014-05-29 10:39                   ` Dave Chinner
2014-05-29 11:08                   ` Rusty Russell
2014-05-29 23:45                     ` Minchan Kim
2014-05-30  1:06                       ` Minchan Kim
2014-05-30  6:56                       ` Rusty Russell
2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
2014-05-29 15:24               ` Linus Torvalds
2014-05-29 23:40                 ` Minchan Kim
2014-05-29 23:53                 ` Dave Chinner
2014-05-30  0:06                   ` Dave Jones
2014-05-30  0:21                     ` Dave Chinner
2014-05-30  0:29                       ` Dave Jones
2014-05-30  0:32                       ` Minchan Kim
2014-05-30  1:34                         ` Dave Chinner
2014-05-30 15:25                           ` H. Peter Anvin
2014-05-30 15:41                             ` Linus Torvalds
2014-05-30 15:52                               ` H. Peter Anvin
2014-05-30 16:06                                 ` Linus Torvalds
2014-05-30 17:24                                   ` Dave Hansen
2014-05-30 18:12                                     ` H. Peter Anvin
2014-10-21  2:00                               ` Dave Jones
2014-10-21  4:59                                 ` Andy Lutomirski
2014-05-30  9:48                 ` Richard Weinberger
2014-05-30 15:36                   ` Linus Torvalds
2014-05-31  2:06             ` Jens Axboe
2014-06-02 22:59               ` Dave Chinner
2014-06-03 13:02               ` Konstantin Khlebnikov
2014-05-29  3:46     ` Minchan Kim
2014-05-29  4:13       ` Linus Torvalds
2014-05-29  5:10         ` Minchan Kim
2014-05-30 21:23     ` Andi Kleen
2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
2014-05-29  3:52   ` Minchan Kim
2014-05-29  3:01 ` Steven Rostedt
2014-05-29  3:49   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFzdq2V-Q3WUV7hQJG8jBSAvBqdYLVTNtbD4ObVZ5yDRmw@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=dave.hansen@intel.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=mingo@kernel.org \
    --cc=mst@redhat.com \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).