linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* pt_regs->ax == -ENOSYS
@ 2021-04-27 21:15 H. Peter Anvin
  2021-04-27 21:28 ` Andy Lutomirski
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-27 21:15 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, linux-kernel, oleg, Kees Cook, Will Drewry

Trying to stomp out some possible cargo cult programming?

In the process of going through the various entry code paths, I have to 
admit to being a bit confused why pt_regs->ax is set to -ENOSYS very 
early in the system call path.

What is perhaps even more confusing is:

__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
{
         nr = syscall_enter_from_user_mode(regs, nr);

         instrumentation_begin();
         if (likely(nr < NR_syscalls)) {
                 nr = array_index_nospec(nr, NR_syscalls);
                 regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
         } else if (likely((nr & __X32_SYSCALL_BIT) &&
                           (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
                 nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
                                         X32_NR_syscalls);
                 regs->ax = x32_sys_call_table[nr](regs);
#endif
         }
         instrumentation_end();
         syscall_exit_to_user_mode(regs);
}
#endif

Now, unless I'm completely out to sea, it seems to me that if 
syscall_enter_from_user_mode() changes the system call number to an 
invalid number and pt_regs->ax to !-ENOSYS then the system call will 
return a different value(!) depending on if it is out of range for the 
table (whatever was poked into pt_regs->ax) or if it corresponds to a 
hole in the table. This seems to me at least to be The Wrong Thing.

Calling regs->ax = sys_ni_syscall() in an else clause would arguably be 
the right thing here, except possibly in the case where nr (or (int)nr, 
see below) == -1 or < 0.

Now, syscall_get_nr() returns the low 32 bits of the system call number 
unconditionally. There are places where we look at the sign of this 
number, which means that 0xffffffff7fffffff is "positive" and 
0x7fffffffffffffff is "negative". We have gone back and forth more than 
once on if we should look at %rax or just %eax on a system call... I 
have to admit that the current design makes me a bit nervous.

Finally, can anything bad happen in some weird corner case inside one of 
the syscall_*_mode() calls or after an interrupt if someone tries to 
call syscall(-1) or another negative number?

Food for thought or just my not being up to date?

Thanks,

	-hpa


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 21:15 pt_regs->ax == -ENOSYS H. Peter Anvin
@ 2021-04-27 21:28 ` Andy Lutomirski
  2021-04-27 22:58   ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2021-04-27 21:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, linux-kernel, oleg, Kees Cook, Will Drewry


> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> 
> Trying to stomp out some possible cargo cult programming?
> 
> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
> 

It has to get set to _something_, and copying orig_ax seems perhaps silly.  There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.

> What is perhaps even more confusing is:
> 
> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> {
>        nr = syscall_enter_from_user_mode(regs, nr);
> 
>        instrumentation_begin();
>        if (likely(nr < NR_syscalls)) {
>                nr = array_index_nospec(nr, NR_syscalls);
>                regs->ax = sys_call_table[nr](regs);
> #ifdef CONFIG_X86_X32_ABI
>        } else if (likely((nr & __X32_SYSCALL_BIT) &&
>                          (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
>                nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
>                                        X32_NR_syscalls);
>                regs->ax = x32_sys_call_table[nr](regs);
> #endif
>        }
>        instrumentation_end();
>        syscall_exit_to_user_mode(regs);
> }
> #endif
> 
> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.

I think you’re right.

> 
> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.

I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

> 
> Now, syscall_get_nr() returns the low 32 bits of the system call number unconditionally. There are places where we look at the sign of this number, which means that 0xffffffff7fffffff is "positive" and 0x7fffffffffffffff is "negative". We have gone back and forth more than once on if we should look at %rax or just %eax on a system call... I have to admit that the current design makes me a bit nervous.
> 
> Finally, can anything bad happen in some weird corner case inside one of the syscall_*_mode() calls or after an interrupt if someone tries to call syscall(-1) or another negative number?
> 
> Food for thought or just my not being up to date?
> 
> Thanks,
> 
>    -hpa
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 21:28 ` Andy Lutomirski
@ 2021-04-27 22:58   ` H. Peter Anvin
  2021-04-27 23:23     ` Andy Lutomirski
  2021-04-27 23:29     ` Kees Cook
  0 siblings, 2 replies; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-27 22:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, linux-kernel, oleg, Kees Cook, Will Drewry

On 4/27/21 2:28 PM, Andy Lutomirski wrote:
> 
>> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Trying to stomp out some possible cargo cult programming?
>>
>> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
>>
> 
> It has to get set to _something_, and copying orig_ax seems perhaps silly.  There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
> 

Yeah. I obviously ran into this working on the common entry-exit code 
for FRED; the frame has annoyingly different formats because of this, 
and I wanted to avoid slowing down the system call path.

>> What is perhaps even more confusing is:
>>
>> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
>> {
>>         nr = syscall_enter_from_user_mode(regs, nr);
>>
>>         instrumentation_begin();
>>         if (likely(nr < NR_syscalls)) {
>>                 nr = array_index_nospec(nr, NR_syscalls);
>>                 regs->ax = sys_call_table[nr](regs);
>> #ifdef CONFIG_X86_X32_ABI
>>         } else if (likely((nr & __X32_SYSCALL_BIT) &&
>>                           (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
>>                 nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
>>                                         X32_NR_syscalls);
>>                 regs->ax = x32_sys_call_table[nr](regs);
>> #endif
>>         }
>>         instrumentation_end();
>>         syscall_exit_to_user_mode(regs);
>> }
>> #endif
>>
>> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
> 
> I think you’re right.
> 
>>
>> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
> 
> I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

I'm thinking overall that depending on 64-bit %rax is once again a 
mistake; I realize that the assembly code that did that kept breaking 
because people messed with it, but we still have:

/*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
  * This importantly ignores the high bits on 64-bit, so comparisons
  * sign-extend the low 32 bits.
  */
static inline int syscall_get_nr(struct task_struct *task, struct 
pt_regs *regs)
{
         return regs->orig_ax;
}

"Different interpretation of the same data" is a notorious security 
trap. Zero-extending orig_ax would cause different behavior on 32 and 64 
bits and differ from the above, so I'm thinking that just once and for 
all defining the system call number as a signed int for all the x86 ABIs 
would be the sanest.

It still doesn't really answer the question if "movq $-1,%rax; syscall" 
or "movl $-1,%eax; syscall" could somehow cause bad things to happen, 
though, which makes me a little bit nervous still.

	-hpa

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 22:58   ` H. Peter Anvin
@ 2021-04-27 23:23     ` Andy Lutomirski
  2021-04-28  0:05       ` H. Peter Anvin
  2021-04-27 23:29     ` Kees Cook
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2021-04-27 23:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, LKML, Oleg Nesterov, Kees Cook, Will Drewry

On Tue, Apr 27, 2021 at 3:58 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
> >
> >> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> Trying to stomp out some possible cargo cult programming?
> >>
> >> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
> >>
> >
> > It has to get set to _something_, and copying orig_ax seems perhaps silly.  There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
> >
>
> Yeah. I obviously ran into this working on the common entry-exit code
> for FRED; the frame has annoyingly different formats because of this,
> and I wanted to avoid slowing down the system call path.
>
> >> What is perhaps even more confusing is:
> >>
> >> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> >> {
> >>         nr = syscall_enter_from_user_mode(regs, nr);
> >>
> >>         instrumentation_begin();
> >>         if (likely(nr < NR_syscalls)) {
> >>                 nr = array_index_nospec(nr, NR_syscalls);
> >>                 regs->ax = sys_call_table[nr](regs);
> >> #ifdef CONFIG_X86_X32_ABI
> >>         } else if (likely((nr & __X32_SYSCALL_BIT) &&
> >>                           (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
> >>                 nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
> >>                                         X32_NR_syscalls);
> >>                 regs->ax = x32_sys_call_table[nr](regs);
> >> #endif
> >>         }
> >>         instrumentation_end();
> >>         syscall_exit_to_user_mode(regs);
> >> }
> >> #endif
> >>
> >> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
> >
> > I think you’re right.
> >
> >>
> >> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
> >
> > I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?
>
> I'm thinking overall that depending on 64-bit %rax is once again a
> mistake; I realize that the assembly code that did that kept breaking
> because people messed with it, but we still have:
>
> /*
>   * Only the low 32 bits of orig_ax are meaningful, so we return int.
>   * This importantly ignores the high bits on 64-bit, so comparisons
>   * sign-extend the low 32 bits.
>   */
> static inline int syscall_get_nr(struct task_struct *task, struct
> pt_regs *regs)
> {
>          return regs->orig_ax;
> }
>
> "Different interpretation of the same data" is a notorious security
> trap. Zero-extending orig_ax would cause different behavior on 32 and 64
> bits and differ from the above, so I'm thinking that just once and for
> all defining the system call number as a signed int for all the x86 ABIs
> would be the sanest.
>
> It still doesn't really answer the question if "movq $-1,%rax; syscall"
> or "movl $-1,%eax; syscall" could somehow cause bad things to happen,
> though, which makes me a little bit nervous still.
>

I much prefer the model of saying that the bits that make sense for
the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
everything else) are all valid.  This way there are no weird reserved
bits, no weird ptrace() interactions, etc.  I'm a tiny bit concerned
that this would result in a backwards compatibility issue, but not
very.  This would involve changing syscall_get_nr(), but that doesn't
seem so bad.  The biggest problem is that seccomp hardcoded syscall
nrs to 32 bit.

An alternative would be to declare that we always truncate to 32 bits,
except that 64-bit SYSCALL with high bits set is an error and results
in ENOSYS. The ptrace interaction there is potentially nasty.

Basically, all choices here kind of suck, and I haven't done a real
analysis of all the issues...

--Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 22:58   ` H. Peter Anvin
  2021-04-27 23:23     ` Andy Lutomirski
@ 2021-04-27 23:29     ` Kees Cook
  2021-04-27 23:51       ` Andy Lutomirski
  1 sibling, 1 reply; 12+ messages in thread
From: Kees Cook @ 2021-04-27 23:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Andrew Lutomirski, Borislav Petkov, linux-kernel, oleg,
	Will Drewry

On Tue, Apr 27, 2021 at 03:58:03PM -0700, H. Peter Anvin wrote:
> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
> > 
> > > On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > > 
> > > Trying to stomp out some possible cargo cult programming?
> > > 
> > > In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
> > > 
> > 
> > It has to get set to _something_, and copying orig_ax seems perhaps silly.  There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
> > 
> 
> Yeah. I obviously ran into this working on the common entry-exit code for
> FRED; the frame has annoyingly different formats because of this, and I
> wanted to avoid slowing down the system call path.
> 
> > > What is perhaps even more confusing is:
> > > 
> > > __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> > > {
> > >         nr = syscall_enter_from_user_mode(regs, nr);
> > > 
> > >         instrumentation_begin();
> > >         if (likely(nr < NR_syscalls)) {
> > >                 nr = array_index_nospec(nr, NR_syscalls);
> > >                 regs->ax = sys_call_table[nr](regs);
> > > #ifdef CONFIG_X86_X32_ABI
> > >         } else if (likely((nr & __X32_SYSCALL_BIT) &&
> > >                           (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
> > >                 nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
> > >                                         X32_NR_syscalls);
> > >                 regs->ax = x32_sys_call_table[nr](regs);
> > > #endif
> > >         }
> > >         instrumentation_end();
> > >         syscall_exit_to_user_mode(regs);
> > > }
> > > #endif
> > > 
> > > Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
> > 
> > I think you’re right.
> > 
> > > 
> > > Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
> > 
> > I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?

FWIW, there is some confusion with how syscall_trac_enter() signals the
"skip syscall" condition (-1L), vs actually calling "syscall -1". Right
now they're not really distinguished, and the early ENOSYS is there so that
"pretend it happened" can be implemented (by either ptrace or seccomp).
As in, "set return value to $whatever, and don't run a syscall".

> I'm thinking overall that depending on 64-bit %rax is once again a mistake;
> I realize that the assembly code that did that kept breaking because people
> messed with it, but we still have:
> 
> /*
>  * Only the low 32 bits of orig_ax are meaningful, so we return int.
>  * This importantly ignores the high bits on 64-bit, so comparisons
>  * sign-extend the low 32 bits.
>  */
> static inline int syscall_get_nr(struct task_struct *task, struct pt_regs
> *regs)
> {
>         return regs->orig_ax;
> }
> 
> "Different interpretation of the same data" is a notorious security trap.
> Zero-extending orig_ax would cause different behavior on 32 and 64 bits and
> differ from the above, so I'm thinking that just once and for all defining
> the system call number as a signed int for all the x86 ABIs would be the
> sanest.
> 
> It still doesn't really answer the question if "movq $-1,%rax; syscall" or
> "movl $-1,%eax; syscall" could somehow cause bad things to happen, though,
> which makes me a little bit nervous still.

I don't think this matters? What's the condition you're worried about
here? The syscall table lookup is going to be safe.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 23:29     ` Kees Cook
@ 2021-04-27 23:51       ` Andy Lutomirski
  2021-04-28  2:05         ` Kees Cook
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2021-04-27 23:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Andrew Lutomirski, Borislav Petkov, linux-kernel, oleg,
	Will Drewry



> On Apr 27, 2021, at 4:29 PM, Kees Cook <keescook@chromium.org> wrote:
> 
> On Tue, Apr 27, 2021 at 03:58:03PM -0700, H. Peter Anvin wrote:
>>> On 4/27/21 2:28 PM, Andy Lutomirski wrote:
>>> 
>>>> On Apr 27, 2021, at 2:15 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>> 
>>>> Trying to stomp out some possible cargo cult programming?
>>>> 
>>>> In the process of going through the various entry code paths, I have to admit to being a bit confused why pt_regs->ax is set to -ENOSYS very early in the system call path.
>>>> 
>>> 
>>> It has to get set to _something_, and copying orig_ax seems perhaps silly.  There could also be code that relies on ptrace poking -1 into the nr resulting in -ENOSYS.
>>> 
>> 
>> Yeah. I obviously ran into this working on the common entry-exit code for
>> FRED; the frame has annoyingly different formats because of this, and I
>> wanted to avoid slowing down the system call path.
>> 
>>>> What is perhaps even more confusing is:
>>>> 
>>>> __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
>>>> {
>>>>        nr = syscall_enter_from_user_mode(regs, nr);
>>>> 
>>>>        instrumentation_begin();
>>>>        if (likely(nr < NR_syscalls)) {
>>>>                nr = array_index_nospec(nr, NR_syscalls);
>>>>                regs->ax = sys_call_table[nr](regs);
>>>> #ifdef CONFIG_X86_X32_ABI
>>>>        } else if (likely((nr & __X32_SYSCALL_BIT) &&
>>>>                          (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
>>>>                nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
>>>>                                        X32_NR_syscalls);
>>>>                regs->ax = x32_sys_call_table[nr](regs);
>>>> #endif
>>>>        }
>>>>        instrumentation_end();
>>>>        syscall_exit_to_user_mode(regs);
>>>> }
>>>> #endif
>>>> 
>>>> Now, unless I'm completely out to sea, it seems to me that if syscall_enter_from_user_mode() changes the system call number to an invalid number and pt_regs->ax to !-ENOSYS then the system call will return a different value(!) depending on if it is out of range for the table (whatever was poked into pt_regs->ax) or if it corresponds to a hole in the table. This seems to me at least to be The Wrong Thing.
>>> 
>>> I think you’re right.
>>> 
>>>> 
>>>> Calling regs->ax = sys_ni_syscall() in an else clause would arguably be the right thing here, except possibly in the case where nr (or (int)nr, see below) == -1 or < 0.
>>> 
>>> I think the check should be -1 for 64 bit but (u32)nr == (u32)-1 for the 32-bit path. Does that seem reasonable?
> 
> FWIW, there is some confusion with how syscall_trac_enter() signals the
> "skip syscall" condition (-1L), vs actually calling "syscall -1".

Fortunately there is not, and never will be, a syscall -1.  But I agree that calling max syscall + 1 should behave identically to calling a nonexistent syscall in the middle of the table.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 23:23     ` Andy Lutomirski
@ 2021-04-28  0:05       ` H. Peter Anvin
  2021-04-28  0:11         ` Andy Lutomirski
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-28  0:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, LKML, Oleg Nesterov, Kees Cook, Will Drewry

On 4/27/21 4:23 PM, Andy Lutomirski wrote:
> 
> I much prefer the model of saying that the bits that make sense for
> the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
> everything else) are all valid.  This way there are no weird reserved
> bits, no weird ptrace() interactions, etc.  I'm a tiny bit concerned
> that this would result in a backwards compatibility issue, but not
> very.  This would involve changing syscall_get_nr(), but that doesn't
> seem so bad.  The biggest problem is that seccomp hardcoded syscall
> nrs to 32 bit.
> 
> An alternative would be to declare that we always truncate to 32 bits,
> except that 64-bit SYSCALL with high bits set is an error and results
> in ENOSYS. The ptrace interaction there is potentially nasty.
> 
> Basically, all choices here kind of suck, and I haven't done a real
> analysis of all the issues...
> 

OK, I really don't understand this. The *current* way of doing it causes 
a bunch of ugly corner conditions, including in ptrace, which this would 
get rid of. It isn't any different than passing any other argument which 
is an int -- in fact we have this whole machinery to deal with that subcase.

If it makes you feel better, we could even sign-extend the value in 
orig_ax, but that seems unnecessary and a bit broken to me.

	-hpa


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-28  0:05       ` H. Peter Anvin
@ 2021-04-28  0:11         ` Andy Lutomirski
  2021-04-28  0:20           ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2021-04-28  0:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, LKML, Oleg Nesterov, Kees Cook, Will Drewry

On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 4/27/21 4:23 PM, Andy Lutomirski wrote:
> >
> > I much prefer the model of saying that the bits that make sense for
> > the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
> > everything else) are all valid.  This way there are no weird reserved
> > bits, no weird ptrace() interactions, etc.  I'm a tiny bit concerned
> > that this would result in a backwards compatibility issue, but not
> > very.  This would involve changing syscall_get_nr(), but that doesn't
> > seem so bad.  The biggest problem is that seccomp hardcoded syscall
> > nrs to 32 bit.
> >
> > An alternative would be to declare that we always truncate to 32 bits,
> > except that 64-bit SYSCALL with high bits set is an error and results
> > in ENOSYS. The ptrace interaction there is potentially nasty.
> >
> > Basically, all choices here kind of suck, and I haven't done a real
> > analysis of all the issues...
> >
>
> OK, I really don't understand this. The *current* way of doing it causes
> a bunch of ugly corner conditions, including in ptrace, which this would
> get rid of. It isn't any different than passing any other argument which
> is an int -- in fact we have this whole machinery to deal with that subcase.
>

Let's suppose we decide to truncate the syscall nr.  What would the
actual semantics be?  Would ptrace see the truncated value in orig_ax?
 How about syscall user dispatch?  What happens if ptrace writes a
value with high bits set to orig_ax?  Do we truncate it again?  Or do
we say that ptrace *can't* write too large a value?

For better for worse, RAX is 64 bits, orig_ax is a 64-bit field, and
it currently has nonsensical semantics.  Redefining orig_ax as a
32-bit field is surely possible, but doing so cleanly is not
necessarily any easier than any other approach.  If it weren't for
seccomp, I would say that the obviously correct answer is to just
treat it everywhere as a 64-bit number.

--Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-28  0:11         ` Andy Lutomirski
@ 2021-04-28  0:20           ` H. Peter Anvin
  2021-04-28  0:46             ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-28  0:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	LKML, Oleg Nesterov, Kees Cook, Will Drewry



On 4/27/21 5:11 PM, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 5:05 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On 4/27/21 4:23 PM, Andy Lutomirski wrote:
>>>
>>> I much prefer the model of saying that the bits that make sense for
>>> the syscall type (all 64 for 64-bit SYSCALL and the low 32 for
>>> everything else) are all valid.  This way there are no weird reserved
>>> bits, no weird ptrace() interactions, etc.  I'm a tiny bit concerned
>>> that this would result in a backwards compatibility issue, but not
>>> very.  This would involve changing syscall_get_nr(), but that doesn't
>>> seem so bad.  The biggest problem is that seccomp hardcoded syscall
>>> nrs to 32 bit.
>>>
>>> An alternative would be to declare that we always truncate to 32 bits,
>>> except that 64-bit SYSCALL with high bits set is an error and results
>>> in ENOSYS. The ptrace interaction there is potentially nasty.
>>>
>>> Basically, all choices here kind of suck, and I haven't done a real
>>> analysis of all the issues...
>>>
>>
>> OK, I really don't understand this. The *current* way of doing it causes
>> a bunch of ugly corner conditions, including in ptrace, which this would
>> get rid of. It isn't any different than passing any other argument which
>> is an int -- in fact we have this whole machinery to deal with that subcase.
>>
> 
> Let's suppose we decide to truncate the syscall nr.  What would the
> actual semantics be?  Would ptrace see the truncated value in orig_ax?
>   How about syscall user dispatch?  What happens if ptrace writes a
> value with high bits set to orig_ax?  Do we truncate it again?  Or do
> we say that ptrace *can't* write too large a value?
> 
> For better for worse, RAX is 64 bits, orig_ax is a 64-bit field, and
> it currently has nonsensical semantics.  Redefining orig_ax as a
> 32-bit field is surely possible, but doing so cleanly is not
> necessarily any easier than any other approach.  If it weren't for
> seccomp, I would say that the obviously correct answer is to just
> treat it everywhere as a 64-bit number.
> 

We *used* to truncate the system call number; that was unsigned. It 
causes massive headache to ptrace if a 32-bit ptrace wants to write -1, 
which is a bit hacky.

I would personally like to see orig_ax to be the register passed in and 
for the truncation to happen by syscall_get_nr().

I also note that kernel/seccomp.c and the tracing infrastructure all 
expect a signed int as the system call number. Yes, orig_ax is a 64-bit 
field, but so are the other register fields which doesn't necessarily 
directly reflect the value of an argument -- like, say, %rdi in the case 
of sys_write - it is an int argument so it gets sign extended; this is 
*not* reflected in ptrace.

	-hpa

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-28  0:20           ` H. Peter Anvin
@ 2021-04-28  0:46             ` H. Peter Anvin
  0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-28  0:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	LKML, Oleg Nesterov, Kees Cook, Will Drewry

On 4/27/21 5:20 PM, H. Peter Anvin wrote:
> 
> We *used* to truncate the system call number; that was unsigned. It 
> causes massive headache to ptrace if a 32-bit ptrace wants to write -1, 
> which is a bit hacky.
> 
> I would personally like to see orig_ax to be the register passed in and 
> for the truncation to happen by syscall_get_nr().
> 
> I also note that kernel/seccomp.c and the tracing infrastructure all 
> expect a signed int as the system call number. Yes, orig_ax is a 64-bit 
> field, but so are the other register fields which doesn't necessarily 
> directly reflect the value of an argument -- like, say, %rdi in the case 
> of sys_write - it is an int argument so it gets sign extended; this is 
> *not* reflected in ptrace.
> 
>      -hpa

We could even do this, to make it perhaps harder to mess up:

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 409f661481e1..4e8e5c2e35f4 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -41,7 +41,10 @@ struct pt_regs {
  	unsigned short gs;
  	unsigned short __gsh;
  	/* On interrupt, this is the error code. */
-	unsigned long orig_ax;
+	union {
+		unsigned long orig_ax;
+		int syscall_nr;
+	};
  	unsigned long ip;
  	unsigned short cs;
  	unsigned short __csh;
@@ -78,7 +81,10 @@ struct pt_regs {
   * On syscall entry, this is syscall#. On CPU exception, this is error 
code.
   * On hw interrupt, it's IRQ number:
   */
-	unsigned long orig_ax;
+	union {
+		unsigned long orig_ax;
+		int syscall_nr;
+	};
  /* Return frame for iretq */
  	unsigned long ip;
  	unsigned long cs;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-27 23:51       ` Andy Lutomirski
@ 2021-04-28  2:05         ` Kees Cook
  2021-04-28  2:07           ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Kees Cook @ 2021-04-28  2:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Andrew Lutomirski, Borislav Petkov, linux-kernel, oleg,
	Will Drewry

On Tue, Apr 27, 2021 at 04:51:06PM -0700, Andy Lutomirski wrote:
> Fortunately there is not, and never will be, a syscall -1.  But I
> agree that calling max syscall + 1 should behave identically to calling
> a nonexistent syscall in the middle of the table.

If that happens, we have to separate the meaning of -1L from ptrace,
seccomp, etc. (i.e. we can't just add an "else { result = -ENOSYS; }" to
the syscall table dispatching code, since that'll overwrite any written
return value when the syscall is meant to be skipped with a specific
return value set by ptrace/seccomp.

syscall_trace_enter() will currently return either -1 or the
syscall. Which means someone making a "syscall -1" will get the skip
semantics currently (though the preloaded -ENOSYS results in the
"expected" outcome).

arm64 recently had to untangle this too:

15956689a0e6 arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
59ee987ea47c arm64: ptrace: Add a comment describing our syscall entry/exit trap ABI
139dbe5d8ed3 arm64: syscall: Expand the comment about ptrace and syscall(-1)
d83ee6e3e75d arm64: ptrace: Use NO_SYSCALL instead of -1 in syscall_trace_enter()

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pt_regs->ax == -ENOSYS
  2021-04-28  2:05         ` Kees Cook
@ 2021-04-28  2:07           ` H. Peter Anvin
  0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2021-04-28  2:07 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andrew Lutomirski,
	Borislav Petkov, linux-kernel, oleg, Will Drewry

Earlier in the thread the suggestion was to have (int)pt_regs->orig_ax < 0 indicate a nonsyscall.

On April 27, 2021 7:05:56 PM PDT, Kees Cook <keescook@chromium.org> wrote:
>On Tue, Apr 27, 2021 at 04:51:06PM -0700, Andy Lutomirski wrote:
>> Fortunately there is not, and never will be, a syscall -1.  But I
>> agree that calling max syscall + 1 should behave identically to
>calling
>> a nonexistent syscall in the middle of the table.
>
>If that happens, we have to separate the meaning of -1L from ptrace,
>seccomp, etc. (i.e. we can't just add an "else { result = -ENOSYS; }"
>to
>the syscall table dispatching code, since that'll overwrite any written
>return value when the syscall is meant to be skipped with a specific
>return value set by ptrace/seccomp.
>
>syscall_trace_enter() will currently return either -1 or the
>syscall. Which means someone making a "syscall -1" will get the skip
>semantics currently (though the preloaded -ENOSYS results in the
>"expected" outcome).
>
>arm64 recently had to untangle this too:
>
>15956689a0e6 arm64: compat: Ensure upper 32 bits of x0 are zero on
>syscall return
>59ee987ea47c arm64: ptrace: Add a comment describing our syscall
>entry/exit trap ABI
>139dbe5d8ed3 arm64: syscall: Expand the comment about ptrace and
>syscall(-1)
>d83ee6e3e75d arm64: ptrace: Use NO_SYSCALL instead of -1 in
>syscall_trace_enter()

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-04-28  2:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-27 21:15 pt_regs->ax == -ENOSYS H. Peter Anvin
2021-04-27 21:28 ` Andy Lutomirski
2021-04-27 22:58   ` H. Peter Anvin
2021-04-27 23:23     ` Andy Lutomirski
2021-04-28  0:05       ` H. Peter Anvin
2021-04-28  0:11         ` Andy Lutomirski
2021-04-28  0:20           ` H. Peter Anvin
2021-04-28  0:46             ` H. Peter Anvin
2021-04-27 23:29     ` Kees Cook
2021-04-27 23:51       ` Andy Lutomirski
2021-04-28  2:05         ` Kees Cook
2021-04-28  2:07           ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).