Re: pt_regs->ax == -ENOSYS

From: "H. Peter Anvin" <hpa@zytor.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Borislav Petkov <bp@alien8.de>,
	LKML <linux-kernel@vger.kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Kees Cook <keescook@chromium.org>, Will Drewry <wad@chromium.org>
Subject: Re: pt_regs->ax == -ENOSYS
Date: Tue, 27 Apr 2021 17:20:55 -0700
Message-ID: <3a502aae-4124-5cb2-1dac-bc18b8158fbe@zytor.com>
In-Reply-To: <CALCETrWzL=jgnWd+6YuBo02GG8vTvsG22sXGaUQCc37vwQ6HdA@mail.gmail.com>

On 4/27/21 5:11 PM, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On 4/27/21 4:23 PM, Andy Lutomirski wrote:
>>>
>>> I much prefer the model of saying that the bits that make sense for
>>> the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
>>> everything else) are all valid.  This way there are no weird reserved
>>> bits, no weird ptrace() interactions, etc.  I'm a tiny bit concerned
>>> that this would result in a backwards compatibility issue, but not
>>> very.  This would involve changing syscall_get_nr(), but that doesn't
>>> seem so bad.  The biggest problem is that seccomp hardcoded syscall
>>> nrs to 32 bit.
>>>
>>> An alternative would be to declare that we always truncate to 32 bits,
>>> except that 64-bit SYSCALL with high bits set is an error and results
>>> in ENOSYS. The ptrace interaction there is potentially nasty.
>>>
>>> Basically, all choices here kind of suck, and I haven't done a real
>>> analysis of all the issues...
>>>
>>
>> OK, I really don't understand this. The *current* way of doing it causes
>> a bunch of ugly corner conditions, including in ptrace, which this would
>> get rid of. It isn't any different than passing any other argument which
>> is an int -- in fact we have this whole machinery to deal with that subcase.
>>
> 
> Let's suppose we decide to truncate the syscall nr.  What would the
> actual semantics be?  Would ptrace see the truncated value in orig_ax?
> How about syscall user dispatch?  What happens if ptrace writes a
> value with high bits set to orig_ax?  Do we truncate it again?  Or do
> we say that ptrace *can't* write too large a value?
> 
> For better or worse, RAX is 64 bits, orig_ax is a 64-bit field, and
> it currently has nonsensical semantics.  Redefining orig_ax as a
> 32-bit field is surely possible, but doing so cleanly is not
> necessarily any easier than any other approach.  If it weren't for
> seccomp, I would say that the obviously correct answer is to just
> treat it everywhere as a 64-bit number.
> 

We *used* to truncate the system call number, treating it as unsigned.
That causes a massive headache for ptrace when a 32-bit tracer wants to
write -1, which is a bit hacky.

I would personally like orig_ax to hold the register value as passed in,
and for the truncation to be done by syscall_get_nr().
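
As a minimal sketch of that split (not the actual kernel code; struct
sketch_regs and sketch_syscall_get_nr() are made-up names standing in for
pt_regs and syscall_get_nr()), the idea is roughly:

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct pt_regs. */
struct sketch_regs {
	uint64_t orig_ax;	/* full 64-bit register value, stored as passed in */
};

/*
 * Truncate only when the number is read; orig_ax itself is never modified.
 * The cast to int assumes the usual two's-complement wraparound that
 * gcc and clang provide.
 */
static inline int sketch_syscall_get_nr(const struct sketch_regs *regs)
{
	return (int)(uint32_t)regs->orig_ax;
}
```

With this, an orig_ax of 0xffffffff is read back as -1 by the accessor,
while the stored 64-bit value is left untouched.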

I also note that kernel/seccomp.c and the tracing infrastructure both
expect a signed int as the system call number. Yes, orig_ax is a 64-bit
field, but so are the other register fields, which don't necessarily
directly reflect the value of an argument. Take, say, %rdi in the case
of sys_write: it is an int argument, so it gets sign-extended, and this
is *not* reflected in ptrace.
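
To illustrate the int-argument case with a hypothetical helper
(sketch_int_arg() is invented here, not a kernel function): the value
used for an int argument is the low 32 bits of the register,
sign-extended, whatever the upper bits held, while the raw register
value stays as it was.

```c
#include <stdint.h>

/*
 * Value used for an int argument: take the low 32 bits of the 64-bit
 * register and sign-extend them. The register value itself is passed
 * by value here and is not changed.
 */
static inline int64_t sketch_int_arg(uint64_t reg)
{
	return (int64_t)(int32_t)(uint32_t)reg;
}
```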

	-hpa