Re: [PATCH v7 0/6] x86: Improve Minimum Alternate Stack Size

From: Ingo Molnar <mingo@kernel.org>
To: Len Brown <lenb@kernel.org>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>,
	Borislav Petkov <bp@suse.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andy Lutomirski <luto@kernel.org>, X86 ML <x86@kernel.org>,
	"Brown, Len" <len.brown@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	hjl.tools@gmail.com, Dave Martin <Dave.Martin@arm.com>,
	jannh@google.com, mpe@ellerman.id.au, carlos@redhat.com,
	"bothersome-borer for tony.luck@intel.com" <tony.luck@intel.com>,
	"Ravi V. Shankar" <ravi.v.shankar@intel.com>,
	libc-alpha@sourceware.org, linux-arch@vger.kernel.org,
	linux-api@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v7 0/6] x86: Improve Minimum Alternate Stack Size
Date: Sat, 20 Mar 2021 18:32:00 +0100	[thread overview]
Message-ID: <20210320173200.GA4153106@gmail.com> (raw)
In-Reply-To: <CAJvTdKnpWL8y4N_BrCiK7fU0UXERwuuM8o84LUpp7Watxd8STw@mail.gmail.com>

* Len Brown <lenb@kernel.org> wrote:

> On Wed, Mar 17, 2021 at 6:45 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> > >
> > > * Chang S. Bae <chang.seok.bae@intel.com> wrote:
> > >
> > > > During signal entry, the kernel pushes data onto the normal userspace
> > > > stack. On x86, the data pushed onto the user stack includes XSAVE state,
> > > > which has grown over time as new features and larger registers have been
> > > > added to the architecture.
> > > >
> > > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers and
> > > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value is
> > > > compiled into programs and is part of the user/kernel ABI. The MINSIGSTKSZ
> > > > constant indicates to userspace how much data the kernel expects to push on
> > > > the user stack, [2][3].
> > > >
> > > > However, this constant is much too small and does not reflect recent
> > > > additions to the architecture. For instance, when AVX-512 states are in
> > > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB.
> > > >
> > > > The bug report [4] explains this as an ABI issue. The small MINSIGSTKSZ can
> > > > cause user stack overflow when delivering a signal.
> > >
> > > >   uapi: Define the aux vector AT_MINSIGSTKSZ
> > > >   x86/signal: Introduce helpers to get the maximum signal frame size
> > > >   x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ
> > > >   selftest/sigaltstack: Use the AT_MINSIGSTKSZ aux vector if available
> > > >   x86/signal: Detect and prevent an alternate signal stack overflow
> > > >   selftest/x86/signal: Include test cases for validating sigaltstack
> > >
> > > So this looks really complicated, is this justified?
> > >
> > > Why not just internally round up sigaltstack size if it's too small?
> > > This would be more robust, as it would fix applications that use
> > > MINSIGSTKSZ but don't use the new AT_MINSIGSTKSZ facility.
> > >
> > > I.e. does AT_MINSIGSTKSZ have any other uses than avoiding the
> > > segfault if MINSIGSTKSZ is used to create a small signal stack?
> >
> > I.e. if the kernel sees a too small ->ss_size in sigaltstack() it
> > would ignore ->ss_sp and mmap() a new sigaltstack instead and use that
> > for the signal handler stack.
> >
> > This would automatically make MINSIGSTKSZ - and other too small sizes
> > work today, and in the future.
> >
> > But the question is, is there user-space usage of sigaltstacks that
> > relies on controlling or reading the contents of the stack?
> >
> > longjmp using programs perhaps?
> 
> For the legacy binary that requests a too-small sigaltstack, there are
> several choices:
> 
> We could detect the too-small stack at sigaltstack(2) invocation and
> return an error.
> This results in two deal-killing problems:
> First, some applications don't check the return value, so the check
> would be fruitless.
> Second, those that check and error-out may be programs that never
> actually take the signal, and so we'd be causing a dusty binary to
> exit, when it didn't exit on another system, or another kernel.
> 
> Or we could detect the too small stack at signal registration time.
> This has the same two deal-killers as above.
> 
> Then there is the approach in this patch-set, which detects an
> imminent stack overflow at run time.
> It has neither of the two problems above, and the benefit that we now
> prevent data corruption
> that could have been happening on some systems already today.  The
> down side is that the dusty binary
> that does request the too-small stack can now die at run time.
> 
> So your idea of recognizing the problem and conjuring up a 
> sufficient stack is compelling, since it would likely "just work", 
> no matter how dumb the program. But where would the the sufficient 
> stack come from -- is this a new kernel buffer, or is there a way to 
> abscond some user memory?  I would expect a signal handler to look 
> at the data on its stack and nobody else will look at that stack.  
> But this is already an unreasonable program for allocating a special 
> signal stack in the first place :-/ So yes, one could imagine the 
> signal handler could longjump instead of gracefully completing, and 
> if this specially allocated signal stack isn't where the user 
> planned, that could be trouble.

We could mmap() (implicitly) new anonymous memory - but I can see why 
this is probably more trouble than worth...

> Another idea we discussed was to detect the potential overflow at 
> run-time, and instead of killing the process, just push the signal 
> onto the regular user stack. this might actually work, but it is 
> sort of devious; and it would not work in the case where the user 
> overflowed their regular stack already, which may be the most 
> (only?) compelling reason that they allocated and declared a special 
> sigaltstack in the first place...

Yeah, this doesn't sound deterministic enough.

Ok, thanks for the detailed answers - I withdraw my objections, let's 
proceed with the approach you are proposing?

Thanks,

	Ingo