Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Paul Gofman <gofmanp@gmail.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	kernel@collabora.com, Thomas Gleixner <tglx@linutronix.de>,
	Kees Cook <keescook@chromium.org>, Will Drewry <wad@chromium.org>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Zebediah Figura <zfigura@codeweavers.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
Date: Sun, 31 May 2020 22:37:28 +0300
Message-ID: <38da7b26-8ff4-419a-c848-5eebf4969647@gmail.com>
In-Reply-To: <CALCETrV+rYnUnve09=n+Zb8BR8mDBq6txX9LmEw7r8tAA7d+2Q@mail.gmail.com>

On 5/31/20 21:57, Andy Lutomirski wrote:
>
> I think that the implementation may well want to live in seccomp, but
> doing this as a seccomp filter isn't quite right.  It's not a security
> thing -- it's an emulation thing.  Seccomp is all about making
> inescapable sandboxes, but that's not what you're doing at all, and
> the fact that seccomp filters are preserved across execve() sounds
> like it'll be annoying for you.

Yes, sure, preserving those filters (more broadly, the lack of ability
to change them at any time in an arbitrary way) is the major problem
preventing us from using seccomp filters as is for a generic solution.
Apart from that, growing the table too much (which might be the case if
we mark all the denied address ranges there) may become a performance
problem, but not necessarily; that is something to be tested.

>
> What if there was a special filter type that ran a BPF program on each
> syscall, and the program was allowed to access user memory to make its
> decisions, e.g. to look at some list of memory addresses.  But this
> would explicitly *not* be a security feature -- execve() would remove
> the filter, and the filter's outcome would be one of redirecting
> execution or allowing the syscall.  If the "allow" outcome occurs,
> then regular seccomp filters run.  Obviously the exact semantics here
> would need some care.

Yes, absolutely, we are not implementing any sandboxing in Wine and are
not seeing this as a security feature.

Is the approach discussed in another branch of this thread [1] in some
way similar to what you suggest? If, instead of the list of memory
addresses, we can use a single flag which each thread sets when
crossing the Windows program / native boundary, we won't have to grow
the lookup table indefinitely. Otherwise I am afraid the list of
addresses might grow big, but I have no reason to think it necessarily
won't work; that is something we could evaluate further, and we could
also test the performance with a brief proof-of-concept implementation.
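
To make the trade-off between the two variants concrete, here is a
rough user-space sketch. It is not kernel code and not part of the
patch; every name in it (addr_range, intercept_by_range,
in_windows_code and so on) is made up for illustration, and it assumes
that "intercept" means the syscall was issued from Windows code and
should be redirected rather than allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a half-open range [start, end) that holds Windows code. */
struct addr_range {
	uintptr_t start;
	uintptr_t end;
};

/*
 * Variant A: address-list lookup.
 * The table is assumed to be sorted by start and non-overlapping.
 * Each decision costs O(log n), and the table grows with every range
 * that has to be marked.
 */
static bool intercept_by_range(const struct addr_range *tbl, size_t n,
			       uintptr_t ip)
{
	size_t lo = 0, hi = n;

	assert(n == 0 || tbl != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip < tbl[mid].start)
			hi = mid;
		else if (ip >= tbl[mid].end)
			lo = mid + 1;
		else
			return true;	/* ip lies inside a marked range */
	}
	return false;
}

/*
 * Variant B: a single per-thread flag, flipped whenever the thread
 * crosses the Windows program / native boundary. The decision is one
 * load, independent of how much Windows code is mapped.
 */
static _Thread_local bool in_windows_code;

static void enter_windows_code(void) { in_windows_code = true; }
static void leave_windows_code(void) { in_windows_code = false; }

static bool intercept_by_flag(void)
{
	return in_windows_code;
}
```

With the flag the cost no longer depends on the number of ranges, at
the price of having to flip the flag reliably on every boundary
crossing.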

1. https://lkml.org/lkml/2020/5/31/199