From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 akpm@linux-foundation.org, x86@kernel.org, bp@alien8.de,
	brauner@kernel.org,  bristot@redhat.com, bsegall@google.com,
	dave.hansen@linux.intel.com,  dianders@chromium.org,
	dietmar.eggemann@arm.com, hca@linux.ibm.com,  hch@infradead.org,
	hpa@zytor.com, jacob.jun.pan@linux.intel.com, jgg@ziepe.ca,
	 jpoimboe@kernel.org, jroedel@suse.de, juri.lelli@redhat.com,
	 kent.overstreet@linux.dev, kinseyho@google.com,
	 kirill.shutemov@linux.intel.com, lstoakes@gmail.com,
	luto@kernel.org,  mgorman@suse.de, mic@digikod.net,
	michael.christie@oracle.com,  mingo@redhat.com, mst@redhat.com,
	npiggin@gmail.com, peterz@infradead.org,  pmladek@suse.com,
	rick.p.edgecombe@intel.com, rostedt@goodmis.org,
	 surenb@google.com, tglx@linutronix.de, urezki@gmail.com,
	 vincent.guittot@linaro.org, vschneid@redhat.com
Subject: Re: [RFC 00/14] Dynamic Kernel Stacks
Date: Mon, 11 Mar 2024 15:55:06 -0400
Message-ID: <CA+CK2bC5Q9cUH8WkOU0FCYC-XE9JJ52QdrXLbUTR3zLBK5Ah=Q@mail.gmail.com>
In-Reply-To: <CAGudoHHFQPiYkpHrBqSUVDtxaWXLbSc3ZJDOwMEzheBLO8E6Lw@mail.gmail.com>

On Mon, Mar 11, 2024 at 3:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 3/11/24, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> 1. what about faults when the thread holds a bunch of arbitrary locks
> >> or has preemption disabled? is the allocation lockless?
> >
> > Each thread has a 4-page stack:
> > Pre-allocated page (1): always allocated and mapped at thread
> > creation.
> > Dynamic pages (3): mapped on demand when the thread faults on its stack.
> >
> > A per-CPU data structure holds 3 dynamic pages for each CPU. These
> > pages are used to handle stack faults occurring when a running thread
> > faults (even within interrupt-disabled contexts). Typically, only one
> > page is needed, but in the rare case where the thread accesses beyond
> > that, we might use up to all three pages in a single fault. This
> > structure allows for atomic handling of stack faults, preventing
> > conflicts from other processes. Additionally, the thread's 16K-aligned
> > virtual address (VA) and guaranteed pre-allocated page mean that
> > no page table allocation is required during the fault.
> >
> > When a thread leaves the CPU in normal kernel mode, we check a flag to
> > see if it has experienced stack faults. If so, we charge the thread
> > for the new stack pages and refill the per-CPU data structure with any
> > missing pages.
> >
>
> So this also has to happen if the thread holds a bunch of arbitrary
> semaphores and goes off cpu with them? Anyhow, see below.

Yes, this is alright: if a thread is allowed to sleep, it should not
be holding any locks that alloc_pages() needs.
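
To make the mechanism described above concrete, here is a rough
sketch of the fault path; every name below is made up for
illustration and is not an actual symbol from the patchset:

/*
 * Sketch only -- hypothetical names.  Each CPU keeps three ready
 * pages so a stack fault can be handled atomically, without taking
 * allocator locks, even in interrupt-disabled context.
 */
#define DYNAMIC_STACK_PAGES	3

struct dstack_pool {
	struct page	*pages[DYNAMIC_STACK_PAGES];
	int		nr;
};
static DEFINE_PER_CPU(struct dstack_pool, dstack_pool);

static bool dynamic_stack_fault(struct task_struct *tsk, unsigned long addr)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);
	unsigned long base = (unsigned long)tsk->stack;

	if (addr < base || addr >= base + THREAD_SIZE)
		return false;		/* not this task's stack */

	/*
	 * Map every missing page between the faulting address and the
	 * already-mapped portion above it; usually one page, but a big
	 * jump can consume all three pool pages in a single fault.
	 * The 16K-aligned VA plus the always-present top page mean the
	 * page tables already exist, so nothing is allocated here.
	 */
	while (addr < stack_lowest_mapped(tsk)) {   /* hypothetical helper */
		BUG_ON(!pool->nr);	/* empty pool: the fatal OOM case */
		map_stack_page(tsk, pool->pages[--pool->nr]); /* hypothetical */
	}
	set_tsk_thread_flag(tsk, TIF_STACK_REFILL); /* hypothetical flag */
	return true;
}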

> >> 2. what happens if there is no memory from which to map extra pages in
> >> the first place? you may be in position where you can't go off cpu
> >
> > When the per-CPU data structure cannot be refilled, and a new thread
> > faults, we issue a message indicating a critical stack fault. This
> > triggers a system-wide panic, similar to a guard-page access violation.
> >
>
> OOM handling is fundamentally what I was worried about. I'm confident
> this failure mode makes the feature unsuitable for general-purpose
> deployments.

The primary goal of this series is to enhance system safety, not to
introduce additional risks. Memory savings are a welcome side effect.
Please see below for explanations.
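
Concretely, the off-CPU side sketched above might look roughly like
this (again with hypothetical names; a sketch of the described
behavior, not the actual patch code):

/*
 * When a task that faulted on its stack goes off CPU in normal
 * kernel mode, charge it for the pages it consumed and top the
 * per-CPU pool back up.  This sketch assumes a may-sleep context
 * with no allocator locks held -- exactly the constraint discussed
 * above.
 */
static void dynamic_stack_refill(struct task_struct *prev)
{
	struct dstack_pool *pool = this_cpu_ptr(&dstack_pool);

	if (!test_and_clear_tsk_thread_flag(prev, TIF_STACK_REFILL))
		return;

	charge_stack_pages(prev);	/* hypothetical accounting hook */

	while (pool->nr < DYNAMIC_STACK_PAGES) {
		struct page *page = alloc_page(GFP_KERNEL | __GFP_NOWARN);

		/*
		 * If the pool cannot be refilled, the next stack fault
		 * on this CPU cannot be served: that is the "critical
		 * stack fault" above, reported and followed by a panic,
		 * much like a guard-page violation today.
		 */
		if (!page)
			break;
		pool->pages[pool->nr++] = page;
	}
}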

>
> Now, I have no vote here, it may be this is perfectly fine as an
> optional feature, which it is in your patchset. However, if this is to
> go in, the option description definitely needs a big fat warning about
> possible panics if enabled.
>
> I fully agree something(tm) should be done about stacks and the
> current usage is a massive bummer. I wonder if things would be ok if
> they shrank to just 12K? Perhaps that would provide big enough

The current 1 pre-allocated / 3 dynamic page split is just WIP; we
could equally use 2 pre-allocated / 2 dynamic pages, 3/1, etc.
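
For illustration, the split is just a pair of constants (example
values and names, not the patch's actual ones):

/* Any split that sums to THREAD_SIZE works; 1/3 is the current WIP. */
#define THREAD_PREALLOC_PAGES	1	/* could be 2 or 3 instead */
#define THREAD_DYNAMIC_PAGES	\
	(THREAD_SIZE / PAGE_SIZE - THREAD_PREALLOC_PAGES)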

At Google, we still use 8K stacks (we did not increase them to 16K
when upstream did in 2014) and are only now encountering extreme
cases where the 8K limit is reached. Consequently, we plan to
increase the limit to 16K. Dynamic Kernel Stacks let us keep an 8K
pre-allocated stack while handling page faults only in exceptionally
rare circumstances.

Another option is to increase THREAD_SIZE to 32K while keeping 16K
pre-allocated. That pre-allocates the same amount as upstream does
today, but avoids guard-page panics, making the system safer for
everyone.
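
For example, with 10,000 kernel threads, 16K of pre-allocated stack
costs roughly 160MB up front either way; the difference is that the
extra 16K per thread is mapped only for the rare thread that actually
grows that deep, instead of a deep stack overrunning the guard page
and panicking the machine.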

Pasha

Thread overview: 98+ messages
2024-03-11 16:46 [RFC 00/14] Dynamic Kernel Stacks Pasha Tatashin
2024-03-11 16:46 ` [RFC 01/14] task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check Pasha Tatashin
2024-03-17 14:36   ` Christophe JAILLET
2024-03-17 15:13     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 02/14] fork: Clean-up ifdef logic around stack allocation Pasha Tatashin
2024-03-11 16:46 ` [RFC 03/14] fork: Clean-up naming of vm_strack/vm_struct variables in vmap stacks code Pasha Tatashin
2024-03-17 14:42   ` Christophe JAILLET
2024-03-19 16:32     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 04/14] fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE Pasha Tatashin
2024-03-17 14:45   ` Christophe JAILLET
2024-03-17 15:14     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 05/14] fork: check charging success before zeroing stack Pasha Tatashin
2024-03-12 15:57   ` Kirill A. Shutemov
2024-03-12 16:52     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 06/14] fork: zero vmap stack using clear_page() instead of memset() Pasha Tatashin
2024-03-12  7:15   ` Nikolay Borisov
2024-03-12 16:53     ` Pasha Tatashin
2024-03-14  7:55       ` Christophe Leroy
2024-03-14 13:52         ` Pasha Tatashin
2024-03-17 14:48   ` Christophe JAILLET
2024-03-17 15:15     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 07/14] fork: use the first page in stack to store vm_stack in cached_stacks Pasha Tatashin
2024-03-11 16:46 ` [RFC 08/14] fork: separate vmap stack alloction and free calls Pasha Tatashin
2024-03-14 15:18   ` Jeff Xie
2024-03-14 17:14     ` Pasha Tatashin
2024-03-17 14:51   ` Christophe JAILLET
2024-03-17 15:15     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 09/14] mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range_noflush() public functions Pasha Tatashin
2024-03-11 16:46 ` [RFC 10/14] fork: Dynamic Kernel Stacks Pasha Tatashin
2024-03-11 19:32   ` Randy Dunlap
2024-03-11 19:55     ` Pasha Tatashin
2024-03-11 16:46 ` [RFC 11/14] x86: add support for " Pasha Tatashin
2024-03-11 22:17   ` Andy Lutomirski
2024-03-11 23:10     ` Pasha Tatashin
2024-03-11 23:33       ` Thomas Gleixner
2024-03-11 23:34       ` Andy Lutomirski
2024-03-12  0:08         ` Pasha Tatashin
2024-03-12  0:23           ` Pasha Tatashin
2024-03-11 23:34     ` Dave Hansen
2024-03-11 23:41       ` Andy Lutomirski
2024-03-11 23:56         ` Nadav Amit
2024-03-12  0:02           ` Andy Lutomirski
2024-03-12  7:20             ` Nadav Amit
2024-03-12  0:53           ` Dave Hansen
2024-03-12  1:25             ` H. Peter Anvin
2024-03-12  2:16               ` Andy Lutomirski
2024-03-12  2:20                 ` H. Peter Anvin
2024-03-12 21:58   ` Andi Kleen
2024-03-13 10:23   ` Thomas Gleixner
2024-03-13 13:43     ` Pasha Tatashin
2024-03-13 15:28       ` Pasha Tatashin
2024-03-13 16:12         ` Thomas Gleixner
2024-03-14 14:03           ` Pasha Tatashin
2024-03-14 18:26             ` Thomas Gleixner
2024-03-11 16:46 ` [RFC 12/14] task_stack.h: Clean-up stack_not_used() implementation Pasha Tatashin
2024-03-11 16:46 ` [RFC 13/14] task_stack.h: Add stack_not_used() support for dynamic stack Pasha Tatashin
2024-03-11 16:46 ` [RFC 14/14] fork: Dynamic Kernel Stack accounting Pasha Tatashin
2024-03-11 17:09 ` [RFC 00/14] Dynamic Kernel Stacks Mateusz Guzik
2024-03-11 18:58   ` Pasha Tatashin
2024-03-11 19:21     ` Mateusz Guzik
2024-03-11 19:55       ` Pasha Tatashin [this message]
2024-03-12 17:18 ` H. Peter Anvin
2024-03-12 19:45   ` Pasha Tatashin
2024-03-12 21:36     ` H. Peter Anvin
2024-03-14 19:05       ` Kent Overstreet
2024-03-14 19:23         ` Pasha Tatashin
2024-03-14 19:28           ` Kent Overstreet
2024-03-14 19:34             ` Pasha Tatashin
2024-03-14 19:49               ` Kent Overstreet
2024-03-12 22:18     ` David Laight
2024-03-14 19:43   ` Matthew Wilcox
2024-03-14 19:53     ` Kent Overstreet
2024-03-14 19:57       ` Matthew Wilcox
2024-03-14 19:58         ` Kent Overstreet
2024-03-15  3:13         ` Pasha Tatashin
2024-03-15  3:39           ` H. Peter Anvin
2024-03-16 19:17             ` Pasha Tatashin
2024-03-17  0:41               ` Matthew Wilcox
2024-03-17  1:32                 ` Kent Overstreet
2024-03-17 14:19                 ` Pasha Tatashin
2024-03-17 14:43               ` Brian Gerst
2024-03-17 16:15                 ` Pasha Tatashin
2024-03-17 21:30                   ` Brian Gerst
2024-03-18 14:59                     ` Pasha Tatashin
2024-03-18 21:02                       ` Brian Gerst
2024-03-19 14:56                         ` Pasha Tatashin
2024-03-17 18:57               ` David Laight
2024-03-18 15:09                 ` Pasha Tatashin
2024-03-18 15:13                   ` Pasha Tatashin
2024-03-18 15:19                   ` Matthew Wilcox
2024-03-18 15:30                     ` Pasha Tatashin
2024-03-18 15:53                       ` David Laight
2024-03-18 16:57                         ` Pasha Tatashin
2024-03-18 15:38               ` David Laight
2024-03-18 17:00                 ` Pasha Tatashin
2024-03-18 17:37                   ` Pasha Tatashin
2024-03-15  4:17           ` H. Peter Anvin
2024-03-17  0:47     ` H. Peter Anvin
