linux-mm.kvack.org archive mirror
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Mon, 14 Aug 2017 19:52:16 -0700	[thread overview]
Message-ID: <CA+55aFyHVV=eTtAocUrNLymQOCj55qkF58+N+Tjr2YS9TrqFow@mail.gmail.com> (raw)
In-Reply-To: <20170815022743.GB28715@tassilo.jf.intel.com>

On Mon, Aug 14, 2017 at 7:27 PM, Andi Kleen <ak@linux.intel.com> wrote:
>
> We could try it and it may even help in this case and it may
> be a good idea in any case on such a system, but:
>
> - Even with a large hash table it might be that by chance all CPUs
> will be queued up on the same page
> - There are a lot of other wait queues in the kernel and they all
> could run into a similar problem
> - I suspect it's even possible to construct it from user space
> as a kind of DoS attack

Maybe. Which is why I didn't NAK the patch outright.

But I don't think it's the solution for the scalability issue you guys
found. It's just a workaround, and it's likely a bad one at that.

> Now in one case (on a smaller system) we debugged we had
>
> - 4S system with 208 logical threads
> - during the test the wait queue length was 3700 entries.
> - the last CPUs queued had to wait roughly 0.8s
>
> This gives a budget of roughly 1us per wake up.

I'm not at all convinced that follows.

When bad scaling happens, you often end up hitting quadratic (or
worse) behavior. So if you are able to fix the scaling by some fixed
amount, it's possible that almost _all_ the problems just go away.

The real issue is that "3700 entries" part. What was it that actually
triggered them? In particular, if it's just a hashing issue, and we
can trivially just make the hash table be bigger (256 entries is
*tiny*) then the whole thing goes away.

Which is why I really want to hear what happens if you just change
PAGE_WAIT_TABLE_BITS to 16. The right fix would be to just make it
scale by memory, but before we even do that, let's just look at what
happens when you increase the size the stupid way.

Maybe those 3700 entries will just shrink down to 14 entries because
the hash just works fine and 256 entries was just much much too small
when you have hundreds of thousands of threads or whatever.

But it is *also* possible that it's actually all waiting on the exact
same page, and there's some way to do a thundering herd on the page
lock bit, for example. But then it would be really good to hear what
it is that triggers that.

The thing is, the reason we perform well on many loads in the kernel
is that I have *always* pushed back against bad workarounds.

We do *not* do lock back-off in our locks, for example, because I told
people that lock contention gets fixed by not contending, not by
trying to act better when things have already become bad.

This is the same issue. We don't "fix" things by papering over some
symptom. We try to fix the _actual_ underlying problem. Maybe there is
some caller that can simply be rewritten. Maybe we can do other tricks
than just make the wait tables bigger. But we should not say "3700
entries is ok, let's just make that sh*t be interruptible".

That is what the patch does now, and that is why I dislike the patch.

So I _am_ NAK'ing the patch if nobody is willing to even try alternatives.

Because a band-aid is ok for "some theoretical worst-case behavior".

But a band-aid is *not* ok for "we can't even be bothered to try to
figure out the right thing, so we're just adding this hack and leaving
it".

             Linus


Thread overview: 65+ messages
2017-08-15  0:52 [PATCH 1/2] sched/wait: Break up long wake list walk Tim Chen
2017-08-15  0:52 ` [PATCH 2/2] sched/wait: Introduce lock breaker in wake_up_page_bit Tim Chen
2017-08-15  1:48 ` [PATCH 1/2] sched/wait: Break up long wake list walk Linus Torvalds
2017-08-15  2:27   ` Andi Kleen
2017-08-15  2:52     ` Linus Torvalds [this message]
2017-08-15  3:15       ` Andi Kleen
2017-08-15  3:28         ` Linus Torvalds
2017-08-15 19:05           ` Tim Chen
2017-08-15 19:41             ` Linus Torvalds
2017-08-15 19:47               ` Linus Torvalds
2017-08-15 22:47           ` Davidlohr Bueso
2017-08-15 22:56             ` Linus Torvalds
2017-08-15 22:57               ` Linus Torvalds
2017-08-15 23:50                 ` Linus Torvalds
2017-08-16 23:22                   ` Eric W. Biederman
2017-08-17 16:17   ` Liang, Kan
2017-08-17 16:25     ` Linus Torvalds
2017-08-17 20:18       ` Liang, Kan
2017-08-17 20:44         ` Linus Torvalds
2017-08-18 12:23           ` Mel Gorman
2017-08-18 14:20             ` Liang, Kan
2017-08-18 14:46               ` Mel Gorman
2017-08-18 16:36                 ` Tim Chen
2017-08-18 16:45                   ` Andi Kleen
2017-08-18 16:53                 ` Liang, Kan
2017-08-18 17:48                   ` Linus Torvalds
2017-08-18 18:54                     ` Mel Gorman
2017-08-18 19:14                       ` Linus Torvalds
2017-08-18 19:58                         ` Andi Kleen
2017-08-18 20:10                           ` Linus Torvalds
2017-08-21 18:32                         ` Mel Gorman
2017-08-21 18:56                           ` Liang, Kan
2017-08-22 17:23                             ` Liang, Kan
2017-08-22 18:19                               ` Linus Torvalds
2017-08-22 18:25                                 ` Linus Torvalds
2017-08-22 18:56                                 ` Peter Zijlstra
2017-08-22 19:15                                   ` Linus Torvalds
2017-08-22 19:08                                 ` Peter Zijlstra
2017-08-22 19:30                                   ` Linus Torvalds
2017-08-22 19:37                                     ` Andi Kleen
2017-08-22 21:08                                       ` Christopher Lameter
2017-08-22 21:24                                         ` Andi Kleen
2017-08-22 22:52                                           ` Linus Torvalds
2017-08-22 23:19                                             ` Linus Torvalds
2017-08-23 14:51                                             ` Liang, Kan
2017-08-22 19:55                                 ` Liang, Kan
2017-08-22 20:42                                   ` Linus Torvalds
2017-08-22 20:53                                     ` Peter Zijlstra
2017-08-22 20:58                                       ` Linus Torvalds
2017-08-23 14:49                                     ` Liang, Kan
2017-08-23 15:58                                       ` Tim Chen
2017-08-23 18:17                                         ` Linus Torvalds
2017-08-23 20:55                                           ` Liang, Kan
2017-08-23 23:30                                           ` Linus Torvalds
2017-08-24 17:49                                             ` Tim Chen
2017-08-24 18:16                                               ` Linus Torvalds
2017-08-24 20:44                                                 ` Mel Gorman
2017-08-25 16:44                                                   ` Tim Chen
2017-08-23 16:04                                 ` Mel Gorman
2017-08-18 20:05                     ` Andi Kleen
2017-08-18 20:29                       ` Linus Torvalds
2017-08-18 20:29                     ` Liang, Kan
2017-08-18 20:34                       ` Linus Torvalds
2017-08-18 16:55             ` Linus Torvalds
2017-08-18 13:06           ` Liang, Kan
