From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
	"H. Peter Anvin" <hpa@zytor.com>,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org,  x86@kernel.org, bp@alien8.de,
	brauner@kernel.org, bristot@redhat.com,  bsegall@google.com,
	dave.hansen@linux.intel.com, dianders@chromium.org,
	 dietmar.eggemann@arm.com, eric.devolder@oracle.com,
	hca@linux.ibm.com,  hch@infradead.org,
	jacob.jun.pan@linux.intel.com, jgg@ziepe.ca,
	 jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com,
	 kinseyho@google.com, kirill.shutemov@linux.intel.com,
	lstoakes@gmail.com,  luto@kernel.org, mgorman@suse.de,
	mic@digikod.net,  michael.christie@oracle.com, mingo@redhat.com,
	mjguzik@gmail.com,  mst@redhat.com, npiggin@gmail.com,
	peterz@infradead.org, pmladek@suse.com,
	 rick.p.edgecombe@intel.com, rostedt@goodmis.org,
	surenb@google.com,  tglx@linutronix.de, urezki@gmail.com,
	vincent.guittot@linaro.org,  vschneid@redhat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
Date: Thu, 14 Mar 2024 23:13:56 -0400
Message-ID: <CA+CK2bAmOj2J10szVijNikexFZ1gmA913vvxnqW4DJKWQikwqQ@mail.gmail.com>
In-Reply-To: <ZfNWojLB7qjjB0Zw@casper.infradead.org>

On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > >
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack!  That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > >
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it.  I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > >
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.

I like this approach! I think we could also consider having permanent
big stacks for some kernel-only threads like kvm-vcpu. A cooperative
stack increase framework could work well and wouldn't negatively
impact the performance of context switching. However, a thorough
analysis would be necessary to proactively identify potential stack
overflow situations.
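
To make the cooperative idea concrete, here is a rough user-space model of
the "ensure we have at least N bytes of stack remaining" scheme, using the
32kB reservation and 12kB initial population mentioned above. Everything in
it (struct dyn_stack, stack_ensure(), stack_consume(), stack_trim()) is a
hypothetical name made up for illustration; it is not code from the RFC and
it does no real page mapping, it only tracks the bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SZ        4096u
#define RESERVED_PAGES 8u	/* 32kB of virtual space reserved per thread */
#define INITIAL_PAGES  3u	/* 12kB populated at the top of the reservation */

/* Hypothetical per-thread bookkeeping; a real kernel would track PTEs. */
struct dyn_stack {
	size_t mapped_pages;	/* pages currently backed by memory */
	size_t used_bytes;	/* bytes already consumed by call frames */
};

/* Bytes that can still be used without touching an unmapped page. */
static size_t stack_remaining(const struct dyn_stack *s)
{
	return s->mapped_pages * PAGE_SZ - s->used_bytes;
}

/*
 * Cooperative call made before a stack-hungry path: make sure at least
 * 'need' bytes are available, populating more pages if necessary.
 * Returns false (and maps nothing) if the reservation is too small.
 */
static bool stack_ensure(struct dyn_stack *s, size_t need)
{
	size_t limit = (size_t)RESERVED_PAGES * PAGE_SZ;

	if (need > limit - s->used_bytes)
		return false;
	while (stack_remaining(s) < need)
		s->mapped_pages++;	/* stands in for allocating + mapping a page */
	return true;
}

/* Model a call chain consuming stack; it must stay within mapped pages. */
static void stack_consume(struct dyn_stack *s, size_t bytes)
{
	assert(bytes <= stack_remaining(s));
	s->used_bytes += bytes;
}

/* On return to userspace the stack is empty: free the excess pages. */
static void stack_trim(struct dyn_stack *s)
{
	s->used_bytes = 0;
	s->mapped_pages = INITIAL_PAGES;
}
```

In this model a caller such as XFS would call stack_ensure(s, 6144) before
entering its deep path, and the context-switch path never has to look at the
stack at all; the cost is that every deep path must remember to ask.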

> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
>
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead?  This was all described up-thread.

Handling faults in #DF is possible. It requires code inspection to
handle race conditions such as the one shown by tglx. However, as
Andy pointed out, this is not supported by the SDM, as #DF is an abort
context (yet we do return from it because of ESPFIX64, so return is
possible).

My question, however, is this: if we ignore the memory savings and only
consider the reliability aspect of this feature, which is better:
unconditionally crashing the machine because a guard page was reached,
or printing a huge warning with backtrace information about the
offending stack, handling the fault, and surviving? I know that
historically Linus has preferred WARN() to BUG() [1]. But this is a
somewhat different scenario compared to a simple BUG vs WARN.

Pasha

[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
