* Invoking a system call from within the kernel
@ 2017-11-16  2:16 Demi Marie Obenour
  2017-11-16  9:54 ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Demi Marie Obenour @ 2017-11-16  2:16 UTC (permalink / raw)
  To: kernelnewbies

I am looking to write my first driver.  This driver will create a single
character device, which can be opened by any user.  The device will
support one ioctl:

        long ioctl_syscall(int fd, long syscall, long args[6]);

This is simply equivalent to:

        syscall(syscall, args[0], args[1], args[2], args[3], args[4],
                args[5]);

and indeed I want it to behave *identically* to that.  That means that
ptracers are notified about the syscall (and given the opportunity to
update its arguments), and that seccomp_bpf filters are applied.
Furthermore, it means that all arguments to the syscall need full
validation, as if they came from userspace (because they do).

Is there an in-kernel API that allows one to invoke an arbitrary syscall
with arguments AND proper ptrace/seccomp_bpf filtering?  If not, how
difficult would it be to create one?
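To make the intended semantics concrete, here is the userspace side as a
small wrapper (the name do_syscall is mine, purely for illustration):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Userspace model of the proposed ioctl: forward a syscall number and
 * six generic arguments to the raw syscall(2) entry point.  The driver
 * would do the same from kernel context, after running the caller's
 * seccomp filters and notifying any ptracer. */
static long do_syscall(long nr, const long args[6])
{
        return syscall(nr, args[0], args[1], args[2],
                       args[3], args[4], args[5]);
}
```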

Sincerely,

Demi Obenour

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Invoking a system call from within the kernel
  2017-11-16  2:16 Invoking a system call from within the kernel Demi Marie Obenour
@ 2017-11-16  9:54 ` Greg KH
  2017-11-18 18:15   ` Demi Marie Obenour
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2017-11-16  9:54 UTC (permalink / raw)
  To: kernelnewbies

On Wed, Nov 15, 2017 at 09:16:35PM -0500, Demi Marie Obenour wrote:
> I am looking to write my first driver.  This driver will create a single
> character device, which can be opened by any user.  The device will
> support one ioctl:
> 
>         long ioctl_syscall(int fd, long syscall, long args[6]);
> 
> This is simply equivalent to:
> 
>         syscall(syscall, args[0], args[1], args[2], args[3], args[4],
>                 args[5]);

Wait, why?  Why do you want to do something like this, what problem are
you trying to solve that you feel that something like this is the
solution?  Let's step back and see if there isn't a better way to do
this.

> and indeed I want it to behave *identically* to that.  That means that
> ptracers are notified about the syscall (and given the opportunity to
> update its arguments), and that seccomp_bpf filters are applied.
> Furthermore, it means that all arguments to the syscall need full
> validation, as if they came from userspace (because they do).
> 
> Is there an in-kernel API that allows one to invoke an arbitrary syscall
> with arguments AND proper ptrace/seccomp_bpf filtering?  If not, how
> difficult would it be to create one?

Wouldn't creating such an interface be more work than just using the
correct user/kernel interface in the first place?  :)

Again, what is the problem you are trying to solve here.

thanks,

greg k-h

* Invoking a system call from within the kernel
  2017-11-16  9:54 ` Greg KH
@ 2017-11-18 18:15   ` Demi Marie Obenour
  2017-11-18 18:44     ` valdis.kletnieks at vt.edu
  0 siblings, 1 reply; 7+ messages in thread
From: Demi Marie Obenour @ 2017-11-18 18:15 UTC (permalink / raw)
  To: kernelnewbies


On Thu, Nov 16, 2017 at 10:54:24AM +0100, Greg KH wrote:
> On Wed, Nov 15, 2017 at 09:16:35PM -0500, Demi Marie Obenour wrote:
> > I am looking to write my first driver.  This driver will create a single
> > character device, which can be opened by any user.  The device will
> > support one ioctl:
> > 
> >         long ioctl_syscall(int fd, long syscall, long args[6]);
> > 
> > This is simply equivalent to:
> > 
> >         syscall(syscall, args[0], args[1], args[2], args[3], args[4],
> >                 args[5]);
> 
> Wait, why?  Why do you want to do something like this, what problem are
> you trying to solve that you feel that something like this is the
> solution?  Let's step back and see if there isn't a better way to do
> this.
> 
You are correct that there is a different problem that I really want to
solve.

Here is the different problem:  I want to have a new device (let's call
it `/dev/async_syscall`), with root:root owner and 0600 permissions.
When the user opens the device, the returned file descriptor can be used
to submit an async syscall request using the following ioctl:

        /* Fixed-size types to avoid a 32-bit compat layer */
        struct linux_async_syscall {
                __u64 syscall;
                __u64 args[6];
                __u64 user1;
                __u64 user2;
        };

        /* arguments is really a struct linux_async_syscall * */
        /* n_syscalls is really a size_t */
        /* num_succeeded is really an int * */
        int ioctl(int fd, LINUX_ASYNC_SYSCALL, __u64 n_syscalls,
                  __u64 arguments, __u64 num_succeeded);

Here `arguments` is an array of `struct linux_async_syscall` with
size `n_syscalls`, and `num_succeeded` is a pointer to an `int` that
receives the number of successfully submitted system calls.
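As a usage sketch, filling one request slot for an async read() would look
roughly like this (the fill_async_read helper is mine, and the ioctl request
number is of course not allocated yet):

```c
#include <string.h>
#include <sys/syscall.h>
#include <linux/types.h>

struct linux_async_syscall {
        __u64 syscall;
        __u64 args[6];
        __u64 user1;   /* opaque cookies echoed back to userspace */
        __u64 user2;   /* in the completion packet */
};

/* Fill one request slot describing an async read(fd, buf, len). */
static void fill_async_read(struct linux_async_syscall *req,
                            int fd, void *buf, __u64 len, __u64 cookie)
{
        memset(req, 0, sizeof(*req));
        req->syscall = SYS_read;
        req->args[0] = (__u64)fd;
        req->args[1] = (__u64)(unsigned long)buf;
        req->args[2] = len;
        req->user1   = cookie;
}
```

Because every field is a __u64, the struct has the same size and layout for
32-bit and 64-bit callers, which is the point of avoiding a compat layer.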

In the kernel, this does the following:

1. Check that the parameters make sense
2. Copy them into kernel memory, and place the memory somewhere where it
   will be freed if the process terminates.
3. For each `struct linux_async_syscall` passed:
   1. Run seccomp filters to ensure that the process can actually make
      the syscall.
   2. Check the syscall against a whitelist of system calls that can be
      made asynchronously.
4. Call the in-kernel implementation of clone(), creating a new
   kernel thread.
5. In the parent, return success if and only if the thread creation was
   successful.
6. In the child, for each `struct linux_async_syscall` passed, invoke
   the system call, as if from userspace.  Upon return, post a message
   to the file descriptor, which the userspace process can then
   retrieve with read(2).
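In rough C, modeled in userspace, the submission path above would look like
this (seccomp_check, syscall_whitelisted and spawn_worker are stand-ins of
my own invention, not real kernel APIs, and the batch limit is assumed):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ASYNC_MAX_BATCH 64   /* assumed sanity limit, not settled */

struct async_req { uint64_t syscall; uint64_t args[6]; };

/* Stand-ins for the real kernel facilities; behavior here is stubbed. */
static int seccomp_check(uint64_t nr) { (void)nr; return 1; } /* allow all */
static int syscall_whitelisted(uint64_t nr)
{
        return nr == 0 || nr == 1 || nr == 2;  /* x86-64 read/write/open */
}
static int spawn_worker(struct async_req *batch, size_t n)
{
        (void)batch; (void)n;  /* the real thing would clone() a worker */
        return 1;
}

/* Model of steps 1-5: validate, filter each request, hand the batch to
 * a worker.  The worker (step 6) runs the calls and posts completions. */
static int submit_async(struct async_req *reqs, size_t n)
{
        size_t i;

        if (reqs == NULL || n == 0 || n > ASYNC_MAX_BATCH)  /* step 1 */
                return -EINVAL;
        for (i = 0; i < n; i++) {                           /* step 3 */
                if (!seccomp_check(reqs[i].syscall))        /* step 3.1 */
                        return -EPERM;
                if (!syscall_whitelisted(reqs[i].syscall))  /* step 3.2 */
                        return -ENOSYS;
        }
        return spawn_worker(reqs, n) ? 0 : -EAGAIN;         /* steps 4-5 */
}
```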

I am sure there are more optimizations to be made, or possibly an
entirely different and superior approach.
> > and indeed I want it to behave *identically* to that.  That means that
> > ptracers are notified about the syscall (and given the opportunity to
> > update its arguments), and that seccomp_bpf filters are applied.
> > Furthermore, it means that all arguments to the syscall need full
> > validation, as if they came from userspace (because they do).
> > 
> > Is there an in-kernel API that allows one to invoke an arbitrary syscall
> > with arguments AND proper ptrace/seccomp_bpf filtering?  If not, how
> > difficult would it be to create one?
> 
> Wouldn't creating such an interface be more work than just using the
> correct user/kernel interface in the first place?  :)
>
Yes, it would. :)

However, the ioctl I actually want to implement (see above) does the
system call asynchronously.  That isn't possible using the existing
APIs.
> 
> Again, what is the problem you are trying to solve here.
>
See above :)  Basically, I am trying to improve performance and reduce
complexity of programs that need to do a lot of buffered file I/O.
> 
> thanks,
> 
> greg k-h
>
Thank you, Greg!

Demi

* Invoking a system call from within the kernel
  2017-11-18 18:15   ` Demi Marie Obenour
@ 2017-11-18 18:44     ` valdis.kletnieks at vt.edu
  2017-11-18 19:09       ` Demi Marie Obenour
  0 siblings, 1 reply; 7+ messages in thread
From: valdis.kletnieks at vt.edu @ 2017-11-18 18:44 UTC (permalink / raw)
  To: kernelnewbies

On Sat, 18 Nov 2017 13:15:27 -0500, Demi Marie Obenour said:

> However, the ioctl I actually want to implement (see above) does the
> system call asynchronously.  That isn't possible using the existing
> APIs.

Ever consider that it's because there's no clear semantics for what
executing an arbitrary syscall asynchronously even *means*?

What does an async getuid() mean?  For bonus points, what does it
return if the program does an async getuid(), and then does a
setuid() call *before the async call completes*?

What is the return value of an async call that fails?  How is it
returned, and how do you tell if a negative return code is
from the async code failing, or the syscall failing?

> See above :)  Basically, I am trying to improve performance and reduce
> complexity of programs that need to do a lot of buffered file I/O.

We already have an AIO subsystem for exactly this.  And eventfd's, and
poll(), and a bunch of other stuff.

And they improve performance, but increase complexity.  It's pretty
hard to make

	while ((rc = read(....)) > 0)
		rc2 = write(....);

less complex.  Catching the return of an async call makes it more complex.

* Invoking a system call from within the kernel
  2017-11-18 18:44     ` valdis.kletnieks at vt.edu
@ 2017-11-18 19:09       ` Demi Marie Obenour
  2017-11-19  0:49         ` valdis.kletnieks at vt.edu
  2017-11-19  9:50         ` Greg KH
  0 siblings, 2 replies; 7+ messages in thread
From: Demi Marie Obenour @ 2017-11-18 19:09 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Nov 18, 2017 at 01:44:44PM -0500, valdis.kletnieks at vt.edu wrote:
> On Sat, 18 Nov 2017 13:15:27 -0500, Demi Marie Obenour said:
> 
> > However, the ioctl I actually want to implement (see above) does the
> > system call asynchronously.  That isn't possible using the existing
> > APIs.
> 
> Ever consider that it's because there's no clear semantics for what
> executing an arbitrary syscall asynchronously even *means*?
> 
> What does an async getuid() mean?  For bonus points, what does it
> return if the program does an async getuid(), and then does a
> setuid() call *before the async call completes*?
> 
Only whitelisted system calls would be allowed, such as open(), read(),
and write().  Async getuid() would not be allowed.  Nor would async
exit() or exit_group().

The only system calls that would be whitelisted for async use are those
that could potentially block on I/O.  "Block" is used in a general
sense: it includes disk I/O as well as network I/O.
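Such a whitelist could be as simple as a per-syscall-number bitmap (a
sketch of my own; the numbers in the usage note are x86-64's):

```c
#include <stdint.h>

/* One bit per syscall number; sketch of the proposed async whitelist. */
static uint64_t async_whitelist[8];   /* covers syscall numbers 0..511 */

static void allow(unsigned nr)
{
        async_whitelist[nr / 64] |= UINT64_C(1) << (nr % 64);
}

static int allowed(unsigned nr)
{
        if (nr >= 512)
                return 0;
        return (async_whitelist[nr / 64] >> (nr % 64)) & 1;
}
```

On x86-64, allow(0), allow(1) and allow(2) would whitelist read, write
and open, while getuid (102) and exit_group (231) stay denied.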
>
> What is the return value of an async call that fails?  How is it
> returned, and how do you tell if a negative return code is
> from the async code failing, or the syscall failing?
> 
If an async call fails, the packet posted to the file descriptor
contains the negative error code.
>
> > See above :)  Basically, I am trying to improve performance and reduce
> > complexity of programs that need to do a lot of buffered file I/O.
> 
> We already have an AIO subsystem for exactly this.  And eventfd's, and
> poll(), and a bunch of other stuff.
>
This actually works with poll()/epoll()/etc.  Specifically, the device
file descriptor becomes readable when a completion event is posted to
it, indicating that an async system call has completed and its result is
available.
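A sketch of the consumer side, assuming a completion-packet layout like
the one below (the exact layout is not settled; user1/user2 are echoed
from the request so the loop can match completions to requests):

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Assumed completion packet that read(2) on the device would return. */
struct async_completion {
        uint64_t user1;
        uint64_t user2;
        int64_t  retval;   /* syscall return value, or -errno on failure */
};

/* Returns 1 on success; on failure returns 0 and stores the errno. */
static int completion_ok(const struct async_completion *c, long *err)
{
        if (c->retval < 0) {
                *err = (long)-c->retval;
                return 0;
        }
        return 1;
}

/* Event-loop side: wait for the device fd to become readable, then
 * read one completion packet.  Returns 0 on success, -1 otherwise. */
static int wait_one_completion(int epfd, int async_fd,
                               struct async_completion *out)
{
        struct epoll_event ev;

        if (epoll_wait(epfd, &ev, 1, -1) != 1)
                return -1;
        if (read(async_fd, out, sizeof(*out)) != sizeof(*out))
                return -1;
        return 0;
}
```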
> 
> And they improve performance, but increase complexity.  It's pretty
> hard to make
> 
> 	while ((rc = read(....)) > 0)
> 		rc2 = write(....);
> 
> less complex.  Catching the return of an async call makes it more complex.
>
Many programs (such as Node.js, NGINX, Firefox, Chrome, and every other
GUI program) use an event loop architecture.  To maintain
responsiveness, it is necessary to avoid blocking calls on the main
thread (the thread that runs the event loop).  For filesystem
operations, this is generally done by doing the operation in a thread
pool.
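For comparison, the userspace version of that pattern looks roughly like
this (a pthreads sketch; a real event loop would keep a pool and poll for
results rather than join synchronously):

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hand a blocking read() to a worker thread and collect the result,
 * which is what event-loop runtimes do for filesystem I/O today. */
struct read_job { int fd; void *buf; size_t len; ssize_t result; };

static void *read_worker(void *arg)
{
        struct read_job *job = arg;
        job->result = read(job->fd, job->buf, job->len);
        return NULL;
}

static ssize_t read_in_thread(int fd, void *buf, size_t len)
{
        struct read_job job = { fd, buf, len, -1 };
        pthread_t t;

        if (pthread_create(&t, NULL, read_worker, &job))
                return -1;
        pthread_join(&t, NULL);   /* simplified: a real loop would not block */
        return job.result;
}
```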

Async system calls move the thread pool to the kernel.  The kernel has
system-wide information and can perform optimizations (e.g. regarding
scheduling and thread-pool size) that userspace cannot.  Furthermore,
the kernel threadpool threads have no userspace counterparts, so they
avoid requiring a userspace stack or other data structures.

There was a previous attempt to implement async system calls using the
AIO interface.  Linus rejected it on the basis that an async system call
API should be more general.

Sincerely,

Demi

* Invoking a system call from within the kernel
  2017-11-18 19:09       ` Demi Marie Obenour
@ 2017-11-19  0:49         ` valdis.kletnieks at vt.edu
  2017-11-19  9:50         ` Greg KH
  1 sibling, 0 replies; 7+ messages in thread
From: valdis.kletnieks at vt.edu @ 2017-11-19  0:49 UTC (permalink / raw)
  To: kernelnewbies

On Sat, 18 Nov 2017 14:09:31 -0500, Demi Marie Obenour said:

> Only whitelisted system calls would be allowed, such as open(), read(),
> and write().  Async getuid() would not be allowed.  Nor would async
> exit() or exit_group().

You missed the point - If you allow async calls, you need to deal with
the fact that this can change the semantics of things and introduce race
conditions.

What semantics does an async open() have? Under what conditions
does open() take long enough that doing it asynchronously provides
a benefit?

What system calls are you going to allow to be async?  (Hold that
thought for a moment, we'll return to it...)

> If an async call fails, the packet posted to the file descriptor
> contains the negative error code.

OK.. Was that a -5 error from async() itself, or a -5 from the async read()?

> Many programs (such as Node.js, NGINX, Firefox, Chrome, and every other
> GUI program) use an event loop architecture.  To maintain
> responsiveness, it is necessary to avoid blocking calls on the main
> thread (the thread that runs the event loop).  For filesystem
> operations, this is generally done by doing the operation in a thread
> pool.

And somehow, all those event loops are able to work just fine
without adding kernel infrastructure.  Given that track record,
you'll need to show a *large* benefit in order to get it into
the kernel.   Hint:  kdbus didn't make it in.

> There was a previous attempt to implement async system calls using the
> AIO interface.  Linus rejected it on the basis that an async system call
> API should be more general.

Do you have enough system calls to make it more general than AIO?





* Invoking a system call from within the kernel
  2017-11-18 19:09       ` Demi Marie Obenour
  2017-11-19  0:49         ` valdis.kletnieks at vt.edu
@ 2017-11-19  9:50         ` Greg KH
  1 sibling, 0 replies; 7+ messages in thread
From: Greg KH @ 2017-11-19  9:50 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Nov 18, 2017 at 02:09:31PM -0500, Demi Marie Obenour wrote:
> Async system calls move the thread pool to the kernel.  The kernel has
> system-wide information and perform optimizations regarding e.g.
> scheduling and threadpool size that userspace cannot.  Furthermore,
> the kernel threadpool threads have no userspace counterparts, so they
> avoid requiring a userspace stack or other data structures.

But they are not "free"; you have to handle all of that within the
kernel now, in a way that properly accounts for all the resources and
constraints needed to correctly manage such logic.

> There was a previous attempt to implement async system calls using the
> AIO interface.  Linus rejected it on the basis that an async system call
> API should be more general.

Yes, please go look at those previous attempts and learn from why they
failed if you wish to try to attempt this again.  Don't ignore history :)

Best of luck, it should be some fun work.

greg k-h

end of thread, other threads:[~2017-11-19  9:50 UTC | newest]

Thread overview: 7+ messages
-- links below jump to the message on this page --
2017-11-16  2:16 Invoking a system call from within the kernel Demi Marie Obenour
2017-11-16  9:54 ` Greg KH
2017-11-18 18:15   ` Demi Marie Obenour
2017-11-18 18:44     ` valdis.kletnieks at vt.edu
2017-11-18 19:09       ` Demi Marie Obenour
2017-11-19  0:49         ` valdis.kletnieks at vt.edu
2017-11-19  9:50         ` Greg KH
