netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RFC: Ioctl v2
@ 2022-05-20 16:16 Kent Overstreet
  2022-05-20 20:31 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Kent Overstreet @ 2022-05-20 16:16 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-block, netdev; +Cc: mcgrof, tytso

At LSF we had our annual talk about all the ways ioctls suck. It got me thinking
about a proposal for a simple lightweight replacement.

Problems with ioctls:

* There's no namespacing, and ioctl numbers clash

Ioctl numbers clashing isn't a _huge_ issue in practice because you'll only have
so many chunks of code handling ioctls for the same FD (VFS, filesystem or
driver) and because ioctl struct size is also dispatched on, but it is pretty
gross - there's nothing preventing different drivers from picking the same ioctl
numbers. And since we've got one byte for the "namespace" and another byte for
the ioctl number, and according to my grep 7k different ioctls, betcha this
happens somewhere - I think Luis had an example at LSF.

Where the lack of real namespacing bites us more is when ioctls get promoted
from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
promoted to the VFS when it really shouldn't have, because it was exposing ext2
specific data structures.

But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
it's really easy to do without adequate review - you don't have to change the
ioctl number and break userspace, so why would you?

Introducing real namespacing would mean that promoting an ioctl to the VFS level
would really have to be a new ioctl, and it'll get people to think more about
what the new ioctl would be.

* The calling convention sucks

With ioctls, you have to define a struct for your parameters, and struct members
might be used for inputs, or outputs, or both.

The problem is, these structs really need to be fully portable the same way
structs defined for on disk formats have to be, and we've got no way of checking
for this. This is a real minefield: if you need to pass a pointer type, you
can't pass a pointer because sizeof(void *) is different (and kernel space might
be 64 bit, with userspace 32 bit or 64 bit) - and you can't pass a ulong either,
it has to be a u64.

The whole "define a struct for your parameters" was a hack and a bad idea.
Ioctls are just function calls - they're driver-private syscalls - and they
should work like function calls.

IOCTL v2 proposal:

* Namespacing

To solve the namespacing issue, I want to steal an approach I've seen from
OpenGL, where extensions are namespaced: an extension will be referenced by name
where the name is vendor.foo, and when an extension becomes standard it gets a
new name (arb.foo instead of nvidia.foo, I think? it's been awhile).

To do this we'll need to define ioctls by name, not by hardcoded number, and
likewise userspace will have to call ioctls by name, not by number. To avoid a
string lookup on every ioctl call, I propose a new syscall

int sys_get_ioctl_nr(char *name)

And then userspace will just call this once for every ioctl it uses, either at
program startup or lazily when an ioctl is first called. This can all be nicely
hidden in a little wrapper library.

We'll want to randomize ioctl numbers in kernel space, to ensure userspace
_can't_ hard code them.

Also, another thing that came up at LSF was introspection, it's hard for
strace() et al to handle ioctls. Implementing this name -> nr mapping will give
us a registry of ioctls supported on a given kernel which we can make available
in /proc; and while we're at it, why not include the prototype too?

* Better calling convention

ioctls are just private syscalls. Syscalls look like normal function calls, why
can't ioctls?  Some ioctls do complicated things that require defining structs
with all the tricky layout rules that we kernel devs have all had beaten into
our brains - but most probably would not, if we could do normal-looking function
calls.

Well, syscalls do require arch specific code to handle calling conventions, and
we don't want that. What I propose doing is having the underlying syscall be

#define IOCTL_MAXARGS	8

struct ioctl_args {
	__u64	args[IOCTL_MAXARGS];
};

int sys_ioctl_v2(int fd, int ioctl_nr, struct ioctl_args __user *args)

Userspace won't call this directly. Userspace will call normal looking
functions, like:

int bcachefs_ioctl_disk_add(int fd, unsigned flags, char __user *disk_path);

Which will be a wrapper that casts the function arguments to u64s (or s64s for
signed integers, so that we don't have surprises when kernel space and user
space disagree about sizeof(long)) and then does the actual syscall.

------------------

I want to circulate this and get some comments and feedback, and if no one
raises any serious objections - I'd love to get collaborators to work on this
with me. Flame away!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-20 16:16 RFC: Ioctl v2 Kent Overstreet
@ 2022-05-20 20:31 ` Andrew Lunn
  2022-05-21 16:45   ` Kent Overstreet
  2022-05-20 23:45 ` Theodore Ts'o
  2022-06-09 22:02 ` Jan Engelhardt
  2 siblings, 1 reply; 11+ messages in thread
From: Andrew Lunn @ 2022-05-20 20:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof, tytso

> I want to circulate this and get some comments and feedback, and if
> no one raises any serious objections - I'd love to get collaborators
> to work on this with me. Flame away!

Hi Kent

I doubt you will get much interest from netdev. netdev already
considers ioctl as legacy, and mostly uses netlink and a message
passing structure, which is easy to extend in a backwards compatible
manor.

https://man7.org/linux/man-pages/man7/netlink.7.html

	Andrew

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-20 16:16 RFC: Ioctl v2 Kent Overstreet
  2022-05-20 20:31 ` Andrew Lunn
@ 2022-05-20 23:45 ` Theodore Ts'o
  2022-05-25 17:20   ` Kent Overstreet
  2022-06-09 22:02 ` Jan Engelhardt
  2 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2022-05-20 23:45 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof

On Fri, May 20, 2022 at 12:16:52PM -0400, Kent Overstreet wrote:
> 
> Where the lack of real namespacing bites us more is when ioctls get promoted
> from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
> promoted to the VFS when it really shouldn't have, because it was exposing ext2
> specific data structures.
> 
> But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
> it's really easy to do without adequate review - you don't have to change the
> ioctl number and break userspace, so why would you?
> 
> Introducing real namespacing would mean that promoting an ioctl to the VFS level
> would really have to be a new ioctl, and it'll get people to think more about
> what the new ioctl would be.

It's not clear that making harder not to break userspace is a
*feature*.  If existing programs are using a particular ioctl
namespace, being able to have other file systems adopt it has
historically been considered a *feature* not a *bug*.

At the time, we had working utilities, chattr and lsattr, which were
deployed on all Linux distributions, and newer file systems, such as
xfs, reiserfs, btrfs, etc., decided they wanted to piggy-back on those
existing utilities.  Forcing folks to deploy new utilities just
because it's the best way to force "adequate review" might be swinging
the pendulum too far in the straight-jacket direction.

It's worked the other way, too.  For example, the "shutdown" ioctl was
originally comes from XFS, and ext4 adopted that feature because the
existing interface was perfectly good, and that allows us to to get
the testing from xfstests for free.  The same goes for the extended
attribute, "trim", "freeze" and "thaw" ioctls.  

We've also promoted ioctls from btrfs to the VFS layer, including
FICLONE, GETLABEL, SETLABEL.  Take a look at include/uapi/linux/fs.h.
Ioctl's in 'X' namespace come from XFS.  Ioctl's in the 0x94 interface
come from btrfs.  And of course ioctl's in the 'f' interface come from
ext2/ext3/ext4.  It's actually worked pretty well, and we should
acknowledge that.

In the case of extended attributes, we had perfectly working userspace
tools that would have ***broken*** if we adhered to a doctrinaire,
when you promote an interface, we break the userspace interface Just
Because it's the Good Computer Science Thing To Do.

> ioctls are just private syscalls. Syscalls look like normal function calls, why
> can't ioctls?  Some ioctls do complicated things that require defining structs
> with all the tricky layout rules that we kernel devs have all had beaten into
> our brains - but most probably would not, if we could do normal-looking function
> calls.
> 
> Well, syscalls do require arch specific code to handle calling conventions, and
> we don't want that.....
>
> Userspace won't call this directly. Userspace will call normal looking
> functions, like:
> 
> int bcachefs_ioctl_disk_add(int fd, unsigned flags, char __user *disk_path);
> 
> Which will be a wrapper that casts the function arguments to u64s (or s64s for
> signed integers, so that we don't have surprises when kernel space and user
> space disagree about sizeof(long)) and then does the actual syscall.

So this approach requires that someone has to actually implement the
wrapper library.  Who will that be?  The answer could be, "let libc do
it", but then we need to worry about all the C libraries out there
actually adopting the new ioctl, which takes a long time, and
historically, some C library maintainers have had.... opinionated
points of view about the sort of "value add that should be done at the
C Library level".

There are other examples, such as the AIO and extended attribute
libraries, but they require someone willing to maintain those
libraries for the long term.  In some cases, such as the extended
attribute, that makes total sense --- but that's because that's a
library which is using an ioctl which was promoted from a specific
file system to a generic VFS interface, and so it had a lot of users.

In other cases, the only user of a particular interface might be a
file system specific userspace utility --- the kind of thing which is
shipped in btrfs-progs, e2fsprogs, f2fs-tools, xfsprogs, etc.
Creating a wrapper library for those kinds of ioctls is going to be
overkill.

So I suspect there won't be a lot of standardization here.  It could
be that there will be wrapper function that lives in util-linux, or
e2fsprogs, or xfsprogs, or whatever.  But in that case, we could do
exactly the same thing vis-a-vis creating wrapper function using the
existing ioctl interface.

For example, I have an ext2fs library function
ext2fs_get_device_size2(), which wraps not only the BLKGETSIZE and
BLKGETSIZE64 ioctls, but also the equivalents for Apple Darwin
(DKIOCGETBLOCKCOUNT), FreeBSD/NetBSD (DIOCGDINFO and later
DIOCGMEDIASIZE), and the Window's DeviceIoControl()'s
IOCTL_DISK_GET_DRIVE_GEOMETRY call.  The point is that wrapper
functions are very much orthogonal to the ioctl interface; we're all
going to have wrapper functions, and we'll create them where they are
needed.

If we force developers to have to create and maintain wrapper
libraries for all new ioctl2 interfaces, Just Because It's Proper
Computer Science Design Best Practices, it's going to be painful
enough that I suspect people will just stick to adding new ioctl code
points to the existing ioctl() interface.

So I think we need to be a little careful here.  We made adding System
Calls difficult, because it was Better(tm).  And there would good
reasons for that.  But that has also incentivized people to use the
escape hatches.  Make the "perfect" too painful, and people will find
ways to avoid using it when at all possible.

						- Ted

P.S.  During the LSF/MM hallway track, one maintainer said that the
fsdevel bikeshedding had gotten *so* painful with the process not
having a guaranteed consensus within a reasonal period of time, that
he was planning on adding a file system specific ioctl, and then later
on, if other people were wanted to use it, they would be free to use
the same ioctl code point and interface that he had come up with.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-20 20:31 ` Andrew Lunn
@ 2022-05-21 16:45   ` Kent Overstreet
  2022-05-21 18:55     ` Andrew Lunn
  2022-05-21 19:45     ` Stephen Hemminger
  0 siblings, 2 replies; 11+ messages in thread
From: Kent Overstreet @ 2022-05-21 16:45 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof, tytso

On Fri, May 20, 2022 at 10:31:02PM +0200, Andrew Lunn wrote:
> > I want to circulate this and get some comments and feedback, and if
> > no one raises any serious objections - I'd love to get collaborators
> > to work on this with me. Flame away!
> 
> Hi Kent
> 
> I doubt you will get much interest from netdev. netdev already
> considers ioctl as legacy, and mostly uses netlink and a message
> passing structure, which is easy to extend in a backwards compatible
> manor.

The more I look at netlink the more I wonder what on earth it's targeted at or
was trying to solve. It must exist for a reason, but I've written a few ioctls
myself and I can't fathom a situation where I'd actually want any of the stuff
netlink provides.

Why bother with getting a special socket type? Why asynchronous messages with
all the marshalling/unmarshalling that entails?

From what I've seen all we really want is driver private syscalls, and the
things about ioctls that suck are where it's _not_ like syscalls. Let's just
make it work more like normal function calls.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-21 16:45   ` Kent Overstreet
@ 2022-05-21 18:55     ` Andrew Lunn
  2022-05-21 19:45     ` Stephen Hemminger
  1 sibling, 0 replies; 11+ messages in thread
From: Andrew Lunn @ 2022-05-21 18:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof, tytso

On Sat, May 21, 2022 at 12:45:46PM -0400, Kent Overstreet wrote:
> On Fri, May 20, 2022 at 10:31:02PM +0200, Andrew Lunn wrote:
> > > I want to circulate this and get some comments and feedback, and if
> > > no one raises any serious objections - I'd love to get collaborators
> > > to work on this with me. Flame away!
> > 
> > Hi Kent
> > 
> > I doubt you will get much interest from netdev. netdev already
> > considers ioctl as legacy, and mostly uses netlink and a message
> > passing structure, which is easy to extend in a backwards compatible
> > manor.
> 
> The more I look at netlink the more I wonder what on earth it's targeted at or
> was trying to solve. It must exist for a reason, but I've written a few ioctls
> myself and I can't fathom a situation where I'd actually want any of the stuff
> netlink provides.
> 
> Why bother with getting a special socket type? Why asynchronous messages with
> all the marshalling/unmarshalling that entails?

Hi Kent

It has its uses, but my main point was, it is unlikely netdev will buy
into anything else.

> >From what I've seen all we really want is driver private syscalls

netdev is actually very opposed to private syscalls. Given the chance,
each driver will define its own vendor specific APIs, there will be
zero interoperability, you need vendor tools, the documentation will
be missing etc. So netdev tries very hard to have well defined APIs
which are vendor neutral to cover anything a driver, or the network
stack, wants to do.

    Andrew

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-21 16:45   ` Kent Overstreet
  2022-05-21 18:55     ` Andrew Lunn
@ 2022-05-21 19:45     ` Stephen Hemminger
  2022-05-25 17:02       ` Kent Overstreet
  1 sibling, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2022-05-21 19:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andrew Lunn, linux-fsdevel, linux-kernel, linux-block, netdev,
	mcgrof, tytso

On Sat, 21 May 2022 12:45:46 -0400
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Fri, May 20, 2022 at 10:31:02PM +0200, Andrew Lunn wrote:
> > > I want to circulate this and get some comments and feedback, and if
> > > no one raises any serious objections - I'd love to get collaborators
> > > to work on this with me. Flame away!  
> > 
> > Hi Kent
> > 
> > I doubt you will get much interest from netdev. netdev already
> > considers ioctl as legacy, and mostly uses netlink and a message
> > passing structure, which is easy to extend in a backwards compatible
> > manor.  
> 
> The more I look at netlink the more I wonder what on earth it's targeted at or
> was trying to solve. It must exist for a reason, but I've written a few ioctls
> myself and I can't fathom a situation where I'd actually want any of the stuff
> netlink provides.

Netlink was built for networking operations, you want to set something like a route with a large
number of varying parameters in one transaction. And you don't want to have to invent
a new system call every time a new option is added.

Also, you want to monitor changes and see these events for a userspace control
application such as a routing daemon.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-21 19:45     ` Stephen Hemminger
@ 2022-05-25 17:02       ` Kent Overstreet
  2022-05-25 20:57         ` Andrew Lunn
  2022-06-01  8:29         ` Leon Romanovsky
  0 siblings, 2 replies; 11+ messages in thread
From: Kent Overstreet @ 2022-05-25 17:02 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Andrew Lunn, linux-fsdevel, linux-kernel, linux-block, netdev,
	mcgrof, tytso

On Sat, May 21, 2022 at 12:45:59PM -0700, Stephen Hemminger wrote:
> On Sat, 21 May 2022 12:45:46 -0400
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> > On Fri, May 20, 2022 at 10:31:02PM +0200, Andrew Lunn wrote:
> > > > I want to circulate this and get some comments and feedback, and if
> > > > no one raises any serious objections - I'd love to get collaborators
> > > > to work on this with me. Flame away!  
> > > 
> > > Hi Kent
> > > 
> > > I doubt you will get much interest from netdev. netdev already
> > > considers ioctl as legacy, and mostly uses netlink and a message
> > > passing structure, which is easy to extend in a backwards compatible
> > > manor.  
> > 
> > The more I look at netlink the more I wonder what on earth it's targeted at or
> > was trying to solve. It must exist for a reason, but I've written a few ioctls
> > myself and I can't fathom a situation where I'd actually want any of the stuff
> > netlink provides.
> 
> Netlink was built for networking operations, you want to set something like a route with a large
> number of varying parameters in one transaction. And you don't want to have to invent
> a new system call every time a new option is added.
> 
> Also, you want to monitor changes and see these events for a userspace control
> application such as a routing daemon.

That makes sense - perhaps the new mount API could've been done as a netlink
interface :)

But perhaps it makes sense to have both - netlink for the big complicated
stateful operations, ioctl v2 for the simpler ones. I haven't looked at netlink
usage at all, but most of the filesystem ioctls I've looked at fall into the the
simple bucket, for me.

Actually, I have one in bcachefs that might fit better into the netlink bucket -
maybe while I've got your attention you could tell me what this is like in
netlink land.

In bcachefs, we have "data jobs", where userspace asks us to do something that
requires walking data and performing some operation on them - this is used for
manual rebalance, evacuating data off a device, scrub (when that gets
implemented), etc.

The way I did this was with an ioctl that takes as a parameter the job to
perform, then it kicks off a kernel thread to do the work and returns a file
descriptor, which userspace reads from to find out the current status of the job
(which it uses to implement a progress indicator). We kill off the kthread if
the file descriptor is closed, meaning ctrl-c works as expected.

I really like how this turned out, it's not much code and super slick - I was
considering abstracting it out as generic functionality. But this definitely
sounds like what netlink is targeted at - thoughts?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-20 23:45 ` Theodore Ts'o
@ 2022-05-25 17:20   ` Kent Overstreet
  0 siblings, 0 replies; 11+ messages in thread
From: Kent Overstreet @ 2022-05-25 17:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof

On Fri, May 20, 2022 at 07:45:39PM -0400, Theodore Ts'o wrote:
> On Fri, May 20, 2022 at 12:16:52PM -0400, Kent Overstreet wrote:
> > 
> > Where the lack of real namespacing bites us more is when ioctls get promoted
> > from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
> > promoted to the VFS when it really shouldn't have, because it was exposing ext2
> > specific data structures.
> > 
> > But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
> > it's really easy to do without adequate review - you don't have to change the
> > ioctl number and break userspace, so why would you?
> > 
> > Introducing real namespacing would mean that promoting an ioctl to the VFS level
> > would really have to be a new ioctl, and it'll get people to think more about
> > what the new ioctl would be.
> 
> It's not clear that making harder not to break userspace is a
> *feature*.  If existing programs are using a particular ioctl
> namespace, being able to have other file systems adopt it has
> historically been considered a *feature* not a *bug*.
> 
> At the time, we had working utilities, chattr and lsattr, which were
> deployed on all Linux distributions, and newer file systems, such as
> xfs, reiserfs, btrfs, etc., decided they wanted to piggy-back on those
> existing utilities.  Forcing folks to deploy new utilities just
> because it's the best way to force "adequate review" might be swinging
> the pendulum too far in the straight-jacket direction.

But back in those days, users updating those core utilities was a much bigger
hassle. That's changing, we don't really have users compiling from source
anymore, and we're slowly but steadily moving towards the rolling releases
world: slackware -> debian -> nixos.

And on the developer side of things, historically developers have worked on a
few packages where they were comfortable with the process, and packages tended
more towards giant monorepos, but the younger generation is more used to working
across multiple packages/repositories as necessary (this is where github has
been emphatically a good thing, despite being proprietary; it's standardized a
lot of the friction-y "how do I submit to this repo" stuff).

So I understand where you're coming from but I think this is worth rethinking.

Additionally, bikeshedding gets really painful when people are trying to plan
for all eventualities and think too far ahead. Having multiple stages of review
is _helpful_ with this. If an ioctl is filesystem-private, it's perfectly fine
for it embed filesystem specific data structures - if we can ensure that that
won't get lifted to the VFS layer without anyone noticing!

> ...
>
> In the case of extended attributes, we had perfectly working userspace
> tools that would have ***broken*** if we adhered to a doctrinaire,
> when you promote an interface, we break the userspace interface Just
> Because it's the Good Computer Science Thing To Do.

Not broken, though - it just would've needed updating to support additional
filesystems, and when the ioctls don't need changing the patches would be
trivial:

ret = ioctl_xfs_goingdown(..);
becomes
ret = ioctl_fs_goingdown(...) ?:
      ioctl_xfs_goingdown(...);

(I'm the only one I know who does that chaining syntax in C, but I like it :)

> So this approach requires that someone has to actually implement the
> wrapper library.  Who will that be?  The answer could be, "let libc do
> it", but then we need to worry about all the C libraries out there
> actually adopting the new ioctl, which takes a long time, and
> historically, some C library maintainers have had.... opinionated
> points of view about the sort of "value add that should be done at the
> C Library level".

Not libc, and we definitely don't want to have to update that library for every
new ioctl - I'm imagining that library just being responsible for the "query
kernel for ioctl numbers" part, the ioctl definitions themselves will still come
from kernel headers.

> For example, I have an ext2fs library function
> ext2fs_get_device_size2(), which wraps not only the BLKGETSIZE and
> BLKGETSIZE64 ioctls, but also the equivalents for Apple Darwin
> (DKIOCGETBLOCKCOUNT), FreeBSD/NetBSD (DIOCGDINFO and later
> DIOCGMEDIASIZE), and the Window's DeviceIoControl()'s
> IOCTL_DISK_GET_DRIVE_GEOMETRY call.  The point is that wrapper
> functions are very much orthogonal to the ioctl interface; we're all
> going to have wrapper functions, and we'll create them where they are
> needed.

This seems unrelated to the ioctl v2 discussion, but - it would be _great_ if we
could get that in a separate repository where others could make use of it :)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-25 17:02       ` Kent Overstreet
@ 2022-05-25 20:57         ` Andrew Lunn
  2022-06-01  8:29         ` Leon Romanovsky
  1 sibling, 0 replies; 11+ messages in thread
From: Andrew Lunn @ 2022-05-25 20:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Stephen Hemminger, linux-fsdevel, linux-kernel, linux-block,
	netdev, mcgrof, tytso

> Actually, I have one in bcachefs that might fit better into the netlink bucket -
> maybe while I've got your attention you could tell me what this is like in
> netlink land.
> 
> In bcachefs, we have "data jobs", where userspace asks us to do something that
> requires walking data and performing some operation on them - this is used for
> manual rebalance, evacuating data off a device, scrub (when that gets
> implemented), etc.
> 
> The way I did this was with an ioctl that takes as a parameter the job to
> perform, then it kicks off a kernel thread to do the work and returns a file
> descriptor, which userspace reads from to find out the current status of the job
> (which it uses to implement a progress indicator). We kill off the kthread if
> the file descriptor is closed, meaning ctrl-c works as expected.
> 
> I really like how this turned out, it's not much code and super slick - I was
> considering abstracting it out as generic functionality. But this definitely
> sounds like what netlink is targeted at - thoughts?

What is tricky with networking, is that it has a Big Lock, the
RTNL. All ioctl and netlink operations are performed while holding
this lock. So you cannot do an operation which takes a while.

But i implemented something similar to what you want a couple of years
ago. Ethernet cable testing. It is split into a couple of netlink
messages. There is one to initiate the cable test, and you can pass
some parameters. If the Ethernet PHY supports it, you get back an
immediate ACK, or an error messages, with probably -EOPNOTSUP, or
-EINVAL. This is all done with the RTNL held, the lock being released
after the reply.

The PHY then actually starts doing the cable test. I can take from a
couple of seconds, to 10-20 seconds, depending on exactly how it is
implemented, how fast the PHY is etc.

Once the PHY has finished, it broadcasts a report of the cable test to
userspace. Any process can receive this. So the invoking ethtool
--cable-test eth42 process waits around for it and dumps the test
results.

broadcasting messages is a big part of netlink. 'ip monitor' will
receive all these broadcasts and decode them. So you get to see routes
added/remove, ARP resolutions, interfaces going up and down. etc.

Your ctrl-c handling does not exist, as far as i know. With cable
testing, it runs to completion and makes the report. It could be there
is nobody listening. At least for some PHYs you cannot abort a cable
test once started, so ctrl-c might not even make sense.

There is a video of my LPC talk online somewhere. But 1/2 of it is
physics, how cable testing actually works. There is some details of
the netlink part and how the PHY state machine works.

    Andrew




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-25 17:02       ` Kent Overstreet
  2022-05-25 20:57         ` Andrew Lunn
@ 2022-06-01  8:29         ` Leon Romanovsky
  1 sibling, 0 replies; 11+ messages in thread
From: Leon Romanovsky @ 2022-06-01  8:29 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Stephen Hemminger, Andrew Lunn, linux-fsdevel, linux-kernel,
	linux-block, netdev, mcgrof, tytso, RDMA mailing list

On Wed, May 25, 2022 at 01:02:33PM -0400, Kent Overstreet wrote:
> On Sat, May 21, 2022 at 12:45:59PM -0700, Stephen Hemminger wrote:
> > On Sat, 21 May 2022 12:45:46 -0400
> > Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > > On Fri, May 20, 2022 at 10:31:02PM +0200, Andrew Lunn wrote:
> > > > > I want to circulate this and get some comments and feedback, and if
> > > > > no one raises any serious objections - I'd love to get collaborators
> > > > > to work on this with me. Flame away!  
> > > > 
> > > > Hi Kent
> > > > 
> > > > I doubt you will get much interest from netdev. netdev already
> > > > considers ioctl as legacy, and mostly uses netlink and a message
> > > > passing structure, which is easy to extend in a backwards compatible
> > > > manor.  
> > > 
> > > The more I look at netlink the more I wonder what on earth it's targeted at or
> > > was trying to solve. It must exist for a reason, but I've written a few ioctls
> > > myself and I can't fathom a situation where I'd actually want any of the stuff
> > > netlink provides.
> > 
> > Netlink was built for networking operations, you want to set something like a route with a large
> > number of varying parameters in one transaction. And you don't want to have to invent
> > a new system call every time a new option is added.
> > 
> > Also, you want to monitor changes and see these events for a userspace control
> > application such as a routing daemon.
> 
> That makes sense - perhaps the new mount API could've been done as a netlink
> interface :)
> 
> But perhaps it makes sense to have both - netlink for the big complicated
> stateful operations, ioctl v2 for the simpler ones. I haven't looked at netlink
> usage at all, but most of the filesystem ioctls I've looked at fall into the the
> simple bucket, for me.

In RDMA, we solved this thing (standard entry points, multiple
parameters and vendor specific data) by combining netlink and ioctls.

The entry point is done with ioctls (mainly performance reason, but not
only) while data is passed in netlink attributes style.

ib_uverbs_ioctl:
https://elixir.bootlin.com/linux/v5.18/source/drivers/infiniband/core/uverbs_ioctl.c#L605

Latest example of newly added global to whole stack command:
RDMA/uverbs: Add uverbs command for dma-buf based MR registration
https://lore.kernel.org/linux-rdma/1608067636-98073-4-git-send-email-jianxin.xiong@intel.com/

Thanks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RFC: Ioctl v2
  2022-05-20 16:16 RFC: Ioctl v2 Kent Overstreet
  2022-05-20 20:31 ` Andrew Lunn
  2022-05-20 23:45 ` Theodore Ts'o
@ 2022-06-09 22:02 ` Jan Engelhardt
  2 siblings, 0 replies; 11+ messages in thread
From: Jan Engelhardt @ 2022-06-09 22:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-block, netdev, mcgrof, tytso


On Friday 2022-05-20 18:16, Kent Overstreet wrote:
>Problems with ioctls:
>
>* There's no namespacing, and ioctl numbers clash
>
>IOCTL v2 proposal:
>
>* Namespacing
>
>To solve the namespacing issue, I want to steal an approach I've seen from
>OpenGL, where extensions are namespaced: an extension will be referenced by name
>where the name is vendor.foo, and when an extension becomes standard it gets a
>new name (arb.foo instead of nvidia.foo, I think? it's been awhile).
>To do this we'll need to define ioctls by name, not by hardcoded number, [...]

https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_NV_glsl_shader.html
Right there on the front matter: "Registered number #13".

https://www.khronos.org/registry/vulkan/specs/1.3/styleguide.html
"All extensions must be registered with Khronos."
"Each extension is assigned a range of values that can be used to create globally-unique enum
values"

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-06-09 22:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-20 16:16 RFC: Ioctl v2 Kent Overstreet
2022-05-20 20:31 ` Andrew Lunn
2022-05-21 16:45   ` Kent Overstreet
2022-05-21 18:55     ` Andrew Lunn
2022-05-21 19:45     ` Stephen Hemminger
2022-05-25 17:02       ` Kent Overstreet
2022-05-25 20:57         ` Andrew Lunn
2022-06-01  8:29         ` Leon Romanovsky
2022-05-20 23:45 ` Theodore Ts'o
2022-05-25 17:20   ` Kent Overstreet
2022-06-09 22:02 ` Jan Engelhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).