All of lore.kernel.org
 help / color / mirror / Atom feed
* [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
@ 2016-09-03  5:20 Andrei Vagin
  2016-09-04 19:03 ` Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Andrei Vagin @ 2016-09-03  5:20 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: Kirill Kolyshkin, David Ahern, Patrick McHardy

The netlink interface proved itself as a great way to perform
descriptor-based kernel/userspace communication. It is especially useful
for cases involving a big amount of data to transfer. The netlink
communication protocol is simple and elegant; it also allows to extend
the message format without breaking backward compatibility.

One big problem of netlink is credentials. When a user-space process is
opening a new file descriptor, kernel saves the opener's credentials to
f_cred field of the file struct. After that, every access to that fd are
checked against the saved credentials. In essence, this allows for a
process to open a file descriptor as root and then drop capabilities.
With netlink socket, it is not possible to implement this access scheme.

Currently netlink is widely used in the network subsystem, but there are
also a few users outside of networking, such as audit and taskstats.
Developers who used netlink for anything except the networking know
there are some issues. For example, taskstats code has broken user and
pid namespace support.

Another potential user of netlink socket is task_diag, a faster
/proc/PID-like interface proposed some time ago
(https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
interface for it, too, but the whole feature is currently blocked by the
netlink discussion.

A few months ago Andy Lutomirski suggested to rework the netlink
interface in order to solve the known issues. We suggest discussing his
idea:

----- snip --- snip --- snip -----
(taken from http://lists.openwall.net/netdev/2016/05/05/51)

The tl;dr is that Andrey wants to add an interface to ask a pidns some
questions, and netlink looks natural, except that using netlink sockets
to interrogate a pidns seems rather problematic.  I would also love to
see a decent interface for interrogating user namespaces, and again,
netlink would be great, except that it's a socket and makes no sense in
this context.

Netlink had, and possibly still has, tons of serious security bugs
involving code checking send() callers' creds.  I found and fixed a few
a couple years ago.  To reiterate once again, send() CANNOT use caller
creds safely.  (I feel like I say this once every few weeks. It's
getting old.)

I realize that it's convenient to use a socket as a context to keep
state between syscalls, but it has some annoying side effects:

 - It makes people want to rely on send()'s caller's creds.
 - It's miserable in combination with seccomp.
 - It doesn't play nicely with namespaces.
 - It makes me wonder why things like task_diag, which have nothing
   to do with networking, seem to get tangled up with networking.


Would it be worth considering adding a parallel interface, using it for
new things, and slowly migrating old use cases over?

int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);

ns is an actual namespace fd or:

KERNEL_COMMAND_CURRENT_NETNS
KERNEL_COMMAND_CURRENT_PIDNS
etc, or a special one:
KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
non-root namespace.

KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
relevant current ns is the init namespace.  (This feature is optional,
but it would allow gradually namespacing global things.)

command is an enumerated command.  Each command implies a namespace
type, and, if you feed this thing the wrong namespace type, you get
EINVAL.  The high bit of command indicates whether it's read-only
command.

iov gives a command in the format expected, which, for the most part,
would be a netlink message.

The return value is an fd that you can call read/readv on to read the
response.  It's not a socket (or at least you can't do normal socket
operations on it if it is a socket behind the scenes).  The
implementation of read() promises *not* to look at caller creds.  The
returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.

When you've read all the data, all you can do is close the fd.  You
can't issue another command on the same fd.  You also can't call write()
or send() on the fd unless someone has a good reason why you should be
able to and why it's safe.  You can't issue another command on the same
fd.

I imagine that the implementation could re-use a bunch of netlink code
under the hood.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-03  5:20 [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those Andrei Vagin
@ 2016-09-04 19:03 ` Andy Lutomirski
  2016-09-05 19:30 ` Alexei Starovoitov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-09-04 19:03 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Kirill Kolyshkin, ksummit-discuss, David Ahern, Patrick McHardy

On Fri, Sep 2, 2016 at 10:20 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> The netlink interface proved itself as a great way to perform
> descriptor-based kernel/userspace communication. It is especially useful
> for cases involving a big amount of data to transfer. The netlink
> communication protocol is simple and elegant; it also allows to extend
> the message format without breaking backward compatibility.

I'm interested, surprise surprise :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-03  5:20 [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those Andrei Vagin
  2016-09-04 19:03 ` Andy Lutomirski
@ 2016-09-05 19:30 ` Alexei Starovoitov
  2016-09-06 16:05   ` Stephen Hemminger
  2016-09-12 14:05   ` Hannes Reinecke
  2016-09-13 17:29 ` Eric W. Biederman
  2016-11-01  3:15 ` Andrei Vagin
  3 siblings, 2 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2016-09-05 19:30 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Kirill Kolyshkin, ksummit-discuss, Eric Dumazet, David Ahern,
	Pablo Neira Ayuso

On Fri, Sep 2, 2016 at 10:20 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> The netlink interface proved itself as a great way to perform
> descriptor-based kernel/userspace communication. It is especially useful
> for cases involving a big amount of data to transfer. The netlink
> communication protocol is simple and elegant; it also allows to extend
> the message format without breaking backward compatibility.
>
> One big problem of netlink is credentials. When a user-space process is
> opening a new file descriptor, kernel saves the opener's credentials to
> f_cred field of the file struct. After that, every access to that fd are
> checked against the saved credentials. In essence, this allows for a
> process to open a file descriptor as root and then drop capabilities.
> With netlink socket, it is not possible to implement this access scheme.
>
> Currently netlink is widely used in the network subsystem, but there are
> also a few users outside of networking, such as audit and taskstats.
> Developers who used netlink for anything except the networking know
> there are some issues. For example, taskstats code has broken user and
> pid namespace support.
>
> Another potential user of netlink socket is task_diag, a faster
> /proc/PID-like interface proposed some time ago
> (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> interface for it, too, but the whole feature is currently blocked by the
> netlink discussion.
>
> A few months ago Andy Lutomirski suggested to rework the netlink
> interface in order to solve the known issues. We suggest discussing his
> idea:
>
> ----- snip --- snip --- snip -----
> (taken from http://lists.openwall.net/netdev/2016/05/05/51)
>
> The tl;dr is that Andrey wants to add an interface to ask a pidns some
> questions, and netlink looks natural, except that using netlink sockets
> to interrogate a pidns seems rather problematic.  I would also love to
> see a decent interface for interrogating user namespaces, and again,
> netlink would be great, except that it's a socket and makes no sense in
> this context.
>
> Netlink had, and possibly still has, tons of serious security bugs
> involving code checking send() callers' creds.  I found and fixed a few
> a couple years ago.  To reiterate once again, send() CANNOT use caller
> creds safely.  (I feel like I say this once every few weeks. It's
> getting old.)
>
> I realize that it's convenient to use a socket as a context to keep
> state between syscalls, but it has some annoying side effects:
>
>  - It makes people want to rely on send()'s caller's creds.
>  - It's miserable in combination with seccomp.
>  - It doesn't play nicely with namespaces.
>  - It makes me wonder why things like task_diag, which have nothing
>    to do with networking, seem to get tangled up with networking.
>
>
> Would it be worth considering adding a parallel interface, using it for
> new things, and slowly migrating old use cases over?
>
> int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
>
> ns is an actual namespace fd or:
>
> KERNEL_COMMAND_CURRENT_NETNS
> KERNEL_COMMAND_CURRENT_PIDNS
> etc, or a special one:
> KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> non-root namespace.
>
> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> relevant current ns is the init namespace.  (This feature is optional,
> but it would allow gradually namespacing global things.)
>
> command is an enumerated command.  Each command implies a namespace
> type, and, if you feed this thing the wrong namespace type, you get
> EINVAL.  The high bit of command indicates whether it's read-only
> command.
>
> iov gives a command in the format expected, which, for the most part,
> would be a netlink message.
>
> The return value is an fd that you can call read/readv on to read the
> response.  It's not a socket (or at least you can't do normal socket
> operations on it if it is a socket behind the scenes).  The
> implementation of read() promises *not* to look at caller creds.  The
> returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
>
> When you've read all the data, all you can do is close the fd.  You
> can't issue another command on the same fd.  You also can't call write()
> or send() on the fd unless someone has a good reason why you should be
> able to and why it's safe.  You can't issue another command on the same
> fd.
>
> I imagine that the implementation could re-use a bunch of netlink code
> under the hood.

I'm very interested in this discussion.
Adding few folks as well.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-05 19:30 ` Alexei Starovoitov
@ 2016-09-06 16:05   ` Stephen Hemminger
  2016-09-12 14:05   ` Hannes Reinecke
  1 sibling, 0 replies; 15+ messages in thread
From: Stephen Hemminger @ 2016-09-06 16:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kirill Kolyshkin, Andrei Vagin, ksummit-discuss, Eric Dumazet,
	David Ahern, Pablo Neira Ayuso

On Mon, 5 Sep 2016 12:30:13 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Sep 2, 2016 at 10:20 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> > The netlink interface proved itself as a great way to perform
> > descriptor-based kernel/userspace communication. It is especially useful
> > for cases involving a big amount of data to transfer. The netlink
> > communication protocol is simple and elegant; it also allows to extend
> > the message format without breaking backward compatibility.
> >
> > One big problem of netlink is credentials. When a user-space process is
> > opening a new file descriptor, kernel saves the opener's credentials to
> > f_cred field of the file struct. After that, every access to that fd are
> > checked against the saved credentials. In essence, this allows for a
> > process to open a file descriptor as root and then drop capabilities.
> > With netlink socket, it is not possible to implement this access scheme.
> >
> > Currently netlink is widely used in the network subsystem, but there are
> > also a few users outside of networking, such as audit and taskstats.
> > Developers who used netlink for anything except the networking know
> > there are some issues. For example, taskstats code has broken user and
> > pid namespace support.
> >
> > Another potential user of netlink socket is task_diag, a faster
> > /proc/PID-like interface proposed some time ago
> > (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> > interface for it, too, but the whole feature is currently blocked by the
> > netlink discussion.
> >
> > A few months ago Andy Lutomirski suggested to rework the netlink
> > interface in order to solve the known issues. We suggest discussing his
> > idea:
> >
> > ----- snip --- snip --- snip -----
> > (taken from http://lists.openwall.net/netdev/2016/05/05/51)
> >
> > The tl;dr is that Andrey wants to add an interface to ask a pidns some
> > questions, and netlink looks natural, except that using netlink sockets
> > to interrogate a pidns seems rather problematic.  I would also love to
> > see a decent interface for interrogating user namespaces, and again,
> > netlink would be great, except that it's a socket and makes no sense in
> > this context.
> >
> > Netlink had, and possibly still has, tons of serious security bugs
> > involving code checking send() callers' creds.  I found and fixed a few
> > a couple years ago.  To reiterate once again, send() CANNOT use caller
> > creds safely.  (I feel like I say this once every few weeks. It's
> > getting old.)
> >
> > I realize that it's convenient to use a socket as a context to keep
> > state between syscalls, but it has some annoying side effects:
> >
> >  - It makes people want to rely on send()'s caller's creds.
> >  - It's miserable in combination with seccomp.
> >  - It doesn't play nicely with namespaces.
> >  - It makes me wonder why things like task_diag, which have nothing
> >    to do with networking, seem to get tangled up with networking.
> >
> >
> > Would it be worth considering adding a parallel interface, using it for
> > new things, and slowly migrating old use cases over?
> >
> > int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
> >
> > ns is an actual namespace fd or:
> >
> > KERNEL_COMMAND_CURRENT_NETNS
> > KERNEL_COMMAND_CURRENT_PIDNS
> > etc, or a special one:
> > KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> > non-root namespace.
> >
> > KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> > relevant current ns is the init namespace.  (This feature is optional,
> > but it would allow gradually namespacing global things.)
> >
> > command is an enumerated command.  Each command implies a namespace
> > type, and, if you feed this thing the wrong namespace type, you get
> > EINVAL.  The high bit of command indicates whether it's read-only
> > command.
> >
> > iov gives a command in the format expected, which, for the most part,
> > would be a netlink message.
> >
> > The return value is an fd that you can call read/readv on to read the
> > response.  It's not a socket (or at least you can't do normal socket
> > operations on it if it is a socket behind the scenes).  The
> > implementation of read() promises *not* to look at caller creds.  The
> > returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
> >
> > When you've read all the data, all you can do is close the fd.  You
> > can't issue another command on the same fd.  You also can't call write()
> > or send() on the fd unless someone has a good reason why you should be
> > able to and why it's safe.  You can't issue another command on the same
> > fd.
> >
> > I imagine that the implementation could re-use a bunch of netlink code
> > under the hood.  
> 
> I'm very interested in this discussion.
> Adding few folks as well.

I am interested as well. We should also put this on agenda at netconf.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-05 19:30 ` Alexei Starovoitov
  2016-09-06 16:05   ` Stephen Hemminger
@ 2016-09-12 14:05   ` Hannes Reinecke
  1 sibling, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2016-09-12 14:05 UTC (permalink / raw)
  To: ksummit-discuss

On 09/05/2016 09:30 PM, Alexei Starovoitov wrote:
> On Fri, Sep 2, 2016 at 10:20 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
>> The netlink interface proved itself as a great way to perform
>> descriptor-based kernel/userspace communication. It is especially useful
>> for cases involving a big amount of data to transfer. The netlink
>> communication protocol is simple and elegant; it also allows to extend
>> the message format without breaking backward compatibility.
>>
>> One big problem of netlink is credentials. When a user-space process is
>> opening a new file descriptor, kernel saves the opener's credentials to
>> f_cred field of the file struct. After that, every access to that fd are
>> checked against the saved credentials. In essence, this allows for a
>> process to open a file descriptor as root and then drop capabilities.
>> With netlink socket, it is not possible to implement this access scheme.
>>
>> Currently netlink is widely used in the network subsystem, but there are
>> also a few users outside of networking, such as audit and taskstats.
>> Developers who used netlink for anything except the networking know
>> there are some issues. For example, taskstats code has broken user and
>> pid namespace support.
>>
>> Another potential user of netlink socket is task_diag, a faster
>> /proc/PID-like interface proposed some time ago
>> (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
>> interface for it, too, but the whole feature is currently blocked by the
>> netlink discussion.
>>
>> A few months ago Andy Lutomirski suggested to rework the netlink
>> interface in order to solve the known issues. We suggest discussing his
>> idea:
>>
>> ----- snip --- snip --- snip -----
>> (taken from http://lists.openwall.net/netdev/2016/05/05/51)
>>
>> The tl;dr is that Andrey wants to add an interface to ask a pidns some
>> questions, and netlink looks natural, except that using netlink sockets
>> to interrogate a pidns seems rather problematic.  I would also love to
>> see a decent interface for interrogating user namespaces, and again,
>> netlink would be great, except that it's a socket and makes no sense in
>> this context.
>>
>> Netlink had, and possibly still has, tons of serious security bugs
>> involving code checking send() callers' creds.  I found and fixed a few
>> a couple years ago.  To reiterate once again, send() CANNOT use caller
>> creds safely.  (I feel like I say this once every few weeks. It's
>> getting old.)
>>
>> I realize that it's convenient to use a socket as a context to keep
>> state between syscalls, but it has some annoying side effects:
>>
>>  - It makes people want to rely on send()'s caller's creds.
>>  - It's miserable in combination with seccomp.
>>  - It doesn't play nicely with namespaces.
>>  - It makes me wonder why things like task_diag, which have nothing
>>    to do with networking, seem to get tangled up with networking.
>>
>>
>> Would it be worth considering adding a parallel interface, using it for
>> new things, and slowly migrating old use cases over?
>>
>> int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
>>
>> ns is an actual namespace fd or:
>>
>> KERNEL_COMMAND_CURRENT_NETNS
>> KERNEL_COMMAND_CURRENT_PIDNS
>> etc, or a special one:
>> KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
>> non-root namespace.
>>
>> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
>> relevant current ns is the init namespace.  (This feature is optional,
>> but it would allow gradually namespacing global things.)
>>
>> command is an enumerated command.  Each command implies a namespace
>> type, and, if you feed this thing the wrong namespace type, you get
>> EINVAL.  The high bit of command indicates whether it's read-only
>> command.
>>
>> iov gives a command in the format expected, which, for the most part,
>> would be a netlink message.
>>
>> The return value is an fd that you can call read/readv on to read the
>> response.  It's not a socket (or at least you can't do normal socket
>> operations on it if it is a socket behind the scenes).  The
>> implementation of read() promises *not* to look at caller creds.  The
>> returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
>>
>> When you've read all the data, all you can do is close the fd.  You
>> can't issue another command on the same fd.  You also can't call write()
>> or send() on the fd unless someone has a good reason why you should be
>> able to and why it's safe.  You can't issue another command on the same
>> fd.
>>
>> I imagine that the implementation could re-use a bunch of netlink code
>> under the hood.
> 
> I'm very interested in this discussion.
> Adding few folks as well.
> 
Yes, please.
I'd be interested in this, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.com			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-03  5:20 [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those Andrei Vagin
  2016-09-04 19:03 ` Andy Lutomirski
  2016-09-05 19:30 ` Alexei Starovoitov
@ 2016-09-13 17:29 ` Eric W. Biederman
  2016-09-16  5:58   ` Andrei Vagin
  2016-11-01  3:15 ` Andrei Vagin
  3 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2016-09-13 17:29 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Kirill Kolyshkin, Patrick McHardy, ksummit-discuss, David Ahern


Andrei Vagin <avagin@virtuozzo.com> writes:

> The netlink interface proved itself as a great way to perform
> descriptor-based kernel/userspace communication. It is especially useful
> for cases involving a big amount of data to transfer. The netlink
> communication protocol is simple and elegant; it also allows to extend
> the message format without breaking backward compatibility.
>
> One big problem of netlink is credentials. When a user-space process is
> opening a new file descriptor, kernel saves the opener's credentials to
> f_cred field of the file struct. After that, every access to that fd are
> checked against the saved credentials. In essence, this allows for a
> process to open a file descriptor as root and then drop capabilities.
> With netlink socket, it is not possible to implement this access
> scheme.

A historical oversight, and unfortunately implementing it breaks
routing daemons.

> Currently netlink is widely used in the network subsystem, but there are
> also a few users outside of networking, such as audit and taskstats.
> Developers who used netlink for anything except the networking know
> there are some issues. For example, taskstats code has broken user and
> pid namespace support.
>
> Another potential user of netlink socket is task_diag, a faster
> /proc/PID-like interface proposed some time ago
> (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> interface for it, too, but the whole feature is currently blocked by the
> netlink discussion.

I disagree.  It is not part of the networking subystem so netlink does
is very very unlikely to make sense.  In general netlink is an over
engineered solution outside of networking.

My overall impression is that network people get networking protocols
and so netlink makes a good fit for the network stack. On the other had
non-network people in general don't in general do well with networking
intefaces, so I do not recommend netlink for anything outside of the
network subsystem.

All of that is before you start getting into namespace details.

Now some of that is at least in part because of the volume of use the
interface is expected to get.  Low volume interfaces tend as a rule to
have more ``interesting'' corner cases.  Regardless of the subsystem.

Looking at your referenced task_diag interface I think netlink is
completely unsuitable because your interface does not follow the good
netlink pattern for binary attributes and binary data.  Possibly
a taskdiagfd() system call would make sense.  Or quite likely a pure
taskdiag() system call.

Things should be simplified to the point where the design is clear
easily understood and easily tested.  What you are really suggesting is
tossing out proc with the motiviation of checkpoint/restart.  Perhaps
that is fine.  There are certainly other avenues to consider there.

There are reasons proc is in text format and while useful it is not
fundamentally because text is human readable.  The reasons have to do
with maintainability of data structures.

I tend to think Andy's solution is also over engineered.  Either we have
a file descriptor in which case a ns argument is unnecessary or we have
or we have a raw syscall which immediately returns the information.
In which case an ns argument is very unnecessary.

So yes I would be interested in the conversation.  Although I think
there is some serious homework that needs to happen.

Eric


> A few months ago Andy Lutomirski suggested to rework the netlink
> interface in order to solve the known issues. We suggest discussing his
> idea:
>
> ----- snip --- snip --- snip -----
> (taken from http://lists.openwall.net/netdev/2016/05/05/51)
>
> The tl;dr is that Andrey wants to add an interface to ask a pidns some
> questions, and netlink looks natural, except that using netlink sockets
> to interrogate a pidns seems rather problematic.  I would also love to
> see a decent interface for interrogating user namespaces, and again,
> netlink would be great, except that it's a socket and makes no sense in
> this context.
>
> Netlink had, and possibly still has, tons of serious security bugs
> involving code checking send() callers' creds.  I found and fixed a few
> a couple years ago.  To reiterate once again, send() CANNOT use caller
> creds safely.  (I feel like I say this once every few weeks. It's
> getting old.)
>
> I realize that it's convenient to use a socket as a context to keep
> state between syscalls, but it has some annoying side effects:
>
>  - It makes people want to rely on send()'s caller's creds.
>  - It's miserable in combination with seccomp.
>  - It doesn't play nicely with namespaces.
>  - It makes me wonder why things like task_diag, which have nothing
>    to do with networking, seem to get tangled up with networking.
>
>
> Would it be worth considering adding a parallel interface, using it for
> new things, and slowly migrating old use cases over?
>
> int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
>
> ns is an actual namespace fd or:
>
> KERNEL_COMMAND_CURRENT_NETNS
> KERNEL_COMMAND_CURRENT_PIDNS
> etc, or a special one:
> KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> non-root namespace.
>
> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> relevant current ns is the init namespace.  (This feature is optional,
> but it would allow gradually namespacing global things.)
>
> command is an enumerated command.  Each command implies a namespace
> type, and, if you feed this thing the wrong namespace type, you get
> EINVAL.  The high bit of command indicates whether it's read-only
> command.
>
> iov gives a command in the format expected, which, for the most part,
> would be a netlink message.
>
> The return value is an fd that you can call read/readv on to read the
> response.  It's not a socket (or at least you can't do normal socket
> operations on it if it is a socket behind the scenes).  The
> implementation of read() promises *not* to look at caller creds.  The
> returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
>
> When you've read all the data, all you can do is close the fd.  You
> can't issue another command on the same fd.  You also can't call write()
> or send() on the fd unless someone has a good reason why you should be
> able to and why it's safe.  You can't issue another command on the same
> fd.
>
> I imagine that the implementation could re-use a bunch of netlink code
> under the hood.
> _______________________________________________
> Ksummit-discuss mailing list
> Ksummit-discuss@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-13 17:29 ` Eric W. Biederman
@ 2016-09-16  5:58   ` Andrei Vagin
  2016-09-18 20:18     ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2016-09-16  5:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Kolyshkin, Patrick McHardy, ksummit-discuss, David Ahern

On Tue, Sep 13, 2016 at 12:29:35PM -0500, Eric W. Biederman wrote:
> 
> Andrei Vagin <avagin@virtuozzo.com> writes:
> 
> > The netlink interface proved itself as a great way to perform
> > descriptor-based kernel/userspace communication. It is especially useful
> > for cases involving a big amount of data to transfer. The netlink
> > communication protocol is simple and elegant; it also allows to extend
> > the message format without breaking backward compatibility.
> >
> > One big problem of netlink is credentials. When a user-space process is
> > opening a new file descriptor, kernel saves the opener's credentials to
> > f_cred field of the file struct. After that, every access to that fd are
> > checked against the saved credentials. In essence, this allows for a
> > process to open a file descriptor as root and then drop capabilities.
> > With netlink socket, it is not possible to implement this access
> > scheme.
> 
> A historical oversight, and unfortunately implementing it breaks
> routing daemons.
> 
> > Currently netlink is widely used in the network subsystem, but there are
> > also a few users outside of networking, such as audit and taskstats.
> > Developers who used netlink for anything except the networking know
> > there are some issues. For example, taskstats code has broken user and
> > pid namespace support.
> >
> > Another potential user of netlink socket is task_diag, a faster
> > /proc/PID-like interface proposed some time ago
> > (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> > interface for it, too, but the whole feature is currently blocked by the
> > netlink discussion.
> 
> I disagree.  It is not part of the networking subystem so netlink does
> is very very unlikely to make sense.  In general netlink is an over
> engineered solution outside of networking.
> 
> My overall impression is that network people get networking protocols
> and so netlink makes a good fit for the network stack. On the other had
> non-network people in general don't in general do well with networking
> intefaces, so I do not recommend netlink for anything outside of the
> network subsystem.
> 
> All of that is before you start getting into namespace details.
> 
> Now some of that is at least in part because of the volume of use the
> interface is expected to get.  Low volume interfaces tend as a rule to
> have more ``interesting'' corner cases.  Regardless of the subsystem.
> 
> Looking at your referenced task_diag interface I think netlink is
> completely unsuitable because your interface does not follow the good
> netlink pattern for binary attributes and binary data.  Possibly
> a taskdiagfd() system call would make sense.

Eric, thank you for the feedback. Could you elaborate what do you mean
when you say: "does not follow the good netlink pattern for binary
attributes and binary data." Maybe you can give an example of bad
patterns.

There was an another version where a proc file is used instead of netlink
sockets:
https://lwn.net/Articles/683371/
https://github.com/avagin/linux-task-diag/blob/devel/fs/proc/task_diag.c

In this version, I don't use netlink sockets, but I use the netlink format
for messages. The motivation here is to have expandable format for the
future improvements.

> Or quite likely a pure taskdiag() system call.

The amount of data may be quite big to get them for one iteration, so I
would prefer to have a file descriptor.

> 
> Things should be simplified to the point where the design is clear
> easily understood and easily tested.  What you are really suggesting is
> tossing out proc with the motiviation of checkpoint/restart.  Perhaps
> that is fine.  There are certainly other avenues to consider there.

I want to think that the motivation is to make a good and fast interface
to use it from code. We already checked this interface in criu, perf and
procps, in all cases we get significant performance improvements.

Thanks,
Andrei

> 
> There are reasons proc is in text format and while useful it is not
> fundamentally because text is human readable.  The reasons have to do
> with maintainability of data structures.
> 
> I tend to think Andy's solution is also over engineered.  Either we have
> a file descriptor in which case a ns argument is unnecessary or we have
> or we have a raw syscall which immediately returns the information.
> In which case an ns argument is very unnecessary.
> 
> So yes I would be interested in the conversation.  Although I think
> there is some serious homework that needs to happen.
> 
> Eric
> 
> 
> > A few months ago Andy Lutomirski suggested to rework the netlink
> > interface in order to solve the known issues. We suggest discussing his
> > idea:
> >
> > ----- snip --- snip --- snip -----
> > (taken from http://lists.openwall.net/netdev/2016/05/05/51)
> >
> > The tl;dr is that Andrey wants to add an interface to ask a pidns some
> > questions, and netlink looks natural, except that using netlink sockets
> > to interrogate a pidns seems rather problematic.  I would also love to
> > see a decent interface for interrogating user namespaces, and again,
> > netlink would be great, except that it's a socket and makes no sense in
> > this context.
> >
> > Netlink had, and possibly still has, tons of serious security bugs
> > involving code checking send() callers' creds.  I found and fixed a few
> > a couple years ago.  To reiterate once again, send() CANNOT use caller
> > creds safely.  (I feel like I say this once every few weeks. It's
> > getting old.)
> >
> > I realize that it's convenient to use a socket as a context to keep
> > state between syscalls, but it has some annoying side effects:
> >
> >  - It makes people want to rely on send()'s caller's creds.
> >  - It's miserable in combination with seccomp.
> >  - It doesn't play nicely with namespaces.
> >  - It makes me wonder why things like task_diag, which have nothing
> >    to do with networking, seem to get tangled up with networking.
> >
> >
> > Would it be worth considering adding a parallel interface, using it for
> > new things, and slowly migrating old use cases over?
> >
> > int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
> >
> > ns is an actual namespace fd or:
> >
> > KERNEL_COMMAND_CURRENT_NETNS
> > KERNEL_COMMAND_CURRENT_PIDNS
> > etc, or a special one:
> > KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> > non-root namespace.
> >
> > KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> > relevant current ns is the init namespace.  (This feature is optional,
> > but it would allow gradually namespacing global things.)
> >
> > command is an enumerated command.  Each command implies a namespace
> > type, and, if you feed this thing the wrong namespace type, you get
> > EINVAL.  The high bit of command indicates whether it's read-only
> > command.
> >
> > iov gives a command in the format expected, which, for the most part,
> > would be a netlink message.
> >
> > The return value is an fd that you can call read/readv on to read the
> > response.  It's not a socket (or at least you can't do normal socket
> > operations on it if it is a socket behind the scenes).  The
> > implementation of read() promises *not* to look at caller creds.  The
> > returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
> >
> > When you've read all the data, all you can do is close the fd.  You
> > can't issue another command on the same fd.  You also can't call write()
> > or send() on the fd unless someone has a good reason why you should be
> > able to and why it's safe.  You can't issue another command on the same
> > fd.
> >
> > I imagine that the implementation could re-use a bunch of netlink code
> > under the hood.
> > _______________________________________________
> > Ksummit-discuss mailing list
> > Ksummit-discuss@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-16  5:58   ` Andrei Vagin
@ 2016-09-18 20:18     ` Eric W. Biederman
  2016-10-11  2:13       ` Andrei Vagin
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2016-09-18 20:18 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Kirill Kolyshkin, Patrick McHardy, ksummit-discuss, David Ahern

Andrei Vagin <avagin@virtuozzo.com> writes:

> On Tue, Sep 13, 2016 at 12:29:35PM -0500, Eric W. Biederman wrote:
>> 
>> Andrei Vagin <avagin@virtuozzo.com> writes:
>> 
>> > The netlink interface proved itself as a great way to perform
>> > descriptor-based kernel/userspace communication. It is especially useful
>> > for cases involving a big amount of data to transfer. The netlink
>> > communication protocol is simple and elegant; it also allows to extend
>> > the message format without breaking backward compatibility.
>> >
>> > One big problem of netlink is credentials. When a user-space process is
>> > opening a new file descriptor, kernel saves the opener's credentials to
>> > f_cred field of the file struct. After that, every access to that fd are
>> > checked against the saved credentials. In essence, this allows for a
>> > process to open a file descriptor as root and then drop capabilities.
>> > With netlink socket, it is not possible to implement this access
>> > scheme.
>> 
>> A historical oversight, and unfortunately implementing it breaks
>> routing daemons.
>> 
>> > Currently netlink is widely used in the network subsystem, but there are
>> > also a few users outside of networking, such as audit and taskstats.
>> > Developers who used netlink for anything except the networking know
>> > there are some issues. For example, taskstats code has broken user and
>> > pid namespace support.
>> >
>> > Another potential user of netlink socket is task_diag, a faster
>> > /proc/PID-like interface proposed some time ago
>> > (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
>> > interface for it, too, but the whole feature is currently blocked by the
>> > netlink discussion.
>> 
>> I disagree.  It is not part of the networking subystem so netlink does
>> is very very unlikely to make sense.  In general netlink is an over
>> engineered solution outside of networking.
>> 
>> My overall impression is that network people get networking protocols
>> and so netlink makes a good fit for the network stack. On the other had
>> non-network people in general don't in general do well with networking
>> intefaces, so I do not recommend netlink for anything outside of the
>> network subsystem.
>> 
>> All of that is before you start getting into namespace details.
>> 
>> Now some of that is at least in part because of the volume of use the
>> interface is expected to get.  Low volume interfaces tend as a rule to
>> have more ``interesting'' corner cases.  Regardless of the subsystem.
>> 
>> Looking at your referenced task_diag interface I think netlink is
>> completely unsuitable because your interface does not follow the good
>> netlink pattern for binary attributes and binary data.  Possibly
>> a taskdiagfd() system call would make sense.
>
> Eric, thank you for the feedback. Could you elaborate what do you mean
> when you say: "does not follow the good netlink pattern for binary
> attributes and binary data." Maybe you can give an example of bad
> patterns.

So.  The worst case for binary attributes and binary data that I know of
is demonstrated by the many many many variants of the stat system call.

In general when I hear binary interface with attributes what I hear is a
maintenance disaster, where when I hear ascii I hear something that is
built in a more maintainable fashion.

The routing netlink protocol handles this by very carefully making every
attribute returned contained in a tag and length, and perhaps I was
misread your code but I did not see that discipline in use.  One of the
good features of it is that rtnetlink can add attributes that were not
requested by the caller and everything continues to work, even though
there was not negotiation.

> There was an another version where a proc file is used instead of netlink
> sockets:
> https://lwn.net/Articles/683371/
> https://github.com/avagin/linux-task-diag/blob/devel/fs/proc/task_diag.c
>
> In this version, I don't use netlink sockets, but I use the netlink format
> for messages. The motivation here is to have expandable format for the
> future improvements.

Perhaps I misread how you are using the netlink format.  Your
description of needing to request additional attributes sounded like you
were not using such an expandable binary protocol.

>> Or quite likely a pure taskdiag() system call.
>
> The amount of data may be quite big to get them for one iteration, so I
> would prefer to have a file descriptor.
>
>> 
>> Things should be simplified to the point where the design is clear
>> easily understood and easily tested.  What you are really suggesting is
>> tossing out proc with the motiviation of checkpoint/restart.  Perhaps
>> that is fine.  There are certainly other avenues to consider there.
>
> I want to think that the motivation is to make a good and fast interface
> to use it from code. We already checked this interface in criu, perf and
> procps, in all cases we get significant performance improvements.

Depending it looks like it probably doubles the maintenance work in a
code base that does not always seem to have enough maintenance.

One idea for this kind of situation is to have a readfile or a readfiles
system call that is a cousin of readlink.  That for small files just
gives you the data in the file without the cost of creating a file
descriptor.  This idea can be extended to multiple files, and it could
have a readv like interface.

To some extent it matters where the cost is.

At a practical level we already have a binary protocol for getting a
list of all of the processes in the system, readdir on /proc.

Honestly if the code is well done and does not require too much
maintenance I don't care.  But I think we need to take a good hard look
at maintenance issues when adding a duplicate interface for what feels
like the same information.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-18 20:18     ` Eric W. Biederman
@ 2016-10-11  2:13       ` Andrei Vagin
  2016-10-11 14:14         ` Alexey Dobriyan
  0 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2016-10-11  2:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Kolyshkin, Patrick McHardy, ksummit-discuss, David Ahern

On Sun, Sep 18, 2016 at 03:18:13PM -0500, Eric W. Biederman wrote:
> Andrei Vagin <avagin@virtuozzo.com> writes:
> 
> > On Tue, Sep 13, 2016 at 12:29:35PM -0500, Eric W. Biederman wrote:
> >> 
> >> Andrei Vagin <avagin@virtuozzo.com> writes:
> >> 
> >> > The netlink interface proved itself as a great way to perform
> >> > descriptor-based kernel/userspace communication. It is especially useful
> >> > for cases involving a big amount of data to transfer. The netlink
> >> > communication protocol is simple and elegant; it also allows to extend
> >> > the message format without breaking backward compatibility.
> >> >
> >> > One big problem of netlink is credentials. When a user-space process is
> >> > opening a new file descriptor, kernel saves the opener's credentials to
> >> > f_cred field of the file struct. After that, every access to that fd are
> >> > checked against the saved credentials. In essence, this allows for a
> >> > process to open a file descriptor as root and then drop capabilities.
> >> > With netlink socket, it is not possible to implement this access
> >> > scheme.
> >> 
> >> A historical oversight, and unfortunately implementing it breaks
> >> routing daemons.
> >> 
> >> > Currently netlink is widely used in the network subsystem, but there are
> >> > also a few users outside of networking, such as audit and taskstats.
> >> > Developers who used netlink for anything except the networking know
> >> > there are some issues. For example, taskstats code has broken user and
> >> > pid namespace support.
> >> >
> >> > Another potential user of netlink socket is task_diag, a faster
> >> > /proc/PID-like interface proposed some time ago
> >> > (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> >> > interface for it, too, but the whole feature is currently blocked by the
> >> > netlink discussion.
> >> 
> >> I disagree.  It is not part of the networking subystem so netlink does
> >> is very very unlikely to make sense.  In general netlink is an over
> >> engineered solution outside of networking.
> >> 
> >> My overall impression is that network people get networking protocols
> >> and so netlink makes a good fit for the network stack. On the other had
> >> non-network people in general don't in general do well with networking
> >> intefaces, so I do not recommend netlink for anything outside of the
> >> network subsystem.
> >> 
> >> All of that is before you start getting into namespace details.
> >> 
> >> Now some of that is at least in part because of the volume of use the
> >> interface is expected to get.  Low volume interfaces tend as a rule to
> >> have more ``interesting'' corner cases.  Regardless of the subsystem.
> >> 
> >> Looking at your referenced task_diag interface I think netlink is
> >> completely unsuitable because your interface does not follow the good
> >> netlink pattern for binary attributes and binary data.  Possibly
> >> a taskdiagfd() system call would make sense.
> >
> > Eric, thank you for the feedback. Could you elaborate what do you mean
> > when you say: "does not follow the good netlink pattern for binary
> > attributes and binary data." Maybe you can give an example of bad
> > patterns.
> 
> So.  The worst case for binary attributes and binary data that I know of
> is demonstrated by the many many many variants of the stat system call.

It is the worst case because it's imposible to extend the initial
interface, isn't it?

In my case we can extend existing attributes and can add new ones
without breaking compatibility. It's actually why the netlink format is
used here.

> 
> In general when I hear binary interface with attributes what I hear is a
> maintenance disaster, where when I hear ascii I hear something that is
> built in a more maintainable fashion.

The problem is that encoding and decoding operations are heavy in
this case. I mean we need to convert data into a text format and then
decode them back into a binary format.

Here is a perf output which shows that we spend a lot of time in
seq_printf():

$ sudo perf record -g sh -c 'cat /proc/*/status > /dev/null'
$ sudo perf report -s cpu
-  100.00%   100.00%  -001                                                                                                                                   ▒
   - 51.81% __GI___libc_read                                                                                                                                 ▒
      - 51.62% entry_SYSCALL_64_fastpath                                                                                                                     ▒
         - 51.55% sys_read                                                                                                                                   ▒
            - 51.36% vfs_read                                                                                                                                ▒
               - 51.11% __vfs_read                                                                                                                           ▒
                  - 50.73% seq_read                                                                                                                          ▒
                     - 50.29% proc_single_show                                                                                                               ▒
                        - 49.85% proc_pid_status                                                                                                             ▒
                           - 18.61% render_sigset_t                                                                                                          ▒
                              + 16.20% seq_printf                                                                                                            ▒
                           - 12.57% seq_printf                                                                                                               ▒
                              + 11.37% seq_vprintf                                                                                                           ▒
                           - 6.73% task_mem                                                                                                                  ▒
                              + 5.77% seq_printf                                                                                                             ▒
                              + 0.70% hugetlb_report_usage                                                                                                   ▒
                           - 3.93% render_cap_t                                                                                                              ◆
                              + 3.24% seq_printf                                                                                                             ▒
                                0.63% seq_puts                                                                                                               ▒
                           - 2.48% cpuset_task_status_allowed                                                                                                ▒
                              + seq_printf                                                                                                                   ▒
                             0.57% get_task_mm                                                                                                               ▒
   - 10.48% __GI___libc_open                                                                                                                                 ▒
        entry_SYSCALL_64_fastpath                                                                                                                            ▒
      - sys_open                                                                                                                                             ▒
         - do_sys_open                                                                                                                                       ▒
            + 9.79% do_filp_open  

> 
> The routing netlink protocol handles this by very carefully making every
> attribute returned contained in a tag and length, and perhaps I was
> misread your code but I did not see that discipline in use.  One of the
> good features of it is that rtnetlink can add attributes that were not
> requested by the caller and everything continues to work, even though
> there was not negotiation.

All this it true for task_diag.

> 
> > There was an another version where a proc file is used instead of netlink
> > sockets:
> > https://lwn.net/Articles/683371/
> > https://github.com/avagin/linux-task-diag/blob/devel/fs/proc/task_diag.c
> >
> > In this version, I don't use netlink sockets, but I use the netlink format
> > for messages. The motivation here is to have expandable format for the
> > future improvements.
> 
> Perhaps I misread how you are using the netlink format.  Your
> description of needing to request additional attributes sounded like you
> were not using such an expandable binary protocol.

I think you misunderstood this point. task_diag allows to specify what kind
of information about processes are required. For example, you can request
credentials, vma-s, memory consumtions, ets.

And each this request can be described by a few netlink attributes and
it is possiable to add new attributes in a future.

> 
> >> Or quite likely a pure taskdiag() system call.
> >
> > The amount of data may be quite big to get them for one iteration, so I
> > would prefer to have a file descriptor.
> >
> >> 
> >> Things should be simplified to the point where the design is clear
> >> easily understood and easily tested.  What you are really suggesting is
> >> tossing out proc with the motiviation of checkpoint/restart.  Perhaps
> >> that is fine.  There are certainly other avenues to consider there.
> >
> > I want to think that the motivation is to make a good and fast interface
> > to use it from code. We already checked this interface in criu, perf and
> > procps, in all cases we get significant performance improvements.
> 
> Depending it looks like it probably doubles the maintenance work in a
> code base that does not always seem to have enough maintenance.
> 
> One idea for this kind of situation is to have a readfile or a readfiles
> system call that is a cousin of readlink.  That for small files just
> gives you the data in the file without the cost of creating a file
> descriptor.  This idea can be extended to multiple files, and it could
> have a readv like interface.

You can see from my perf output, that sys_open is only 10%, but seq_file
is more than 30%. The kernel is spendind the most time to encode data,
but then tools like ps, top, perf, criu are spending time to decode them
back and it's the main problem.

> 
> To some extent it matters where the cost is.
> 
> At a practical level we already have a binary protocol for getting a
> list of all of the processes in the system, readdir on /proc.
> 
> Honestly if the code is well done and does not require too much
> maintenance I don't care.  But I think we need to take a good hard look
> at maintenance issues when adding a duplicate interface for what feels
> like the same information.

I completely agree with you here and I try to prove that it's worth it.

Thanks,
Andrei

> 
> Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-10-11  2:13       ` Andrei Vagin
@ 2016-10-11 14:14         ` Alexey Dobriyan
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Dobriyan @ 2016-10-11 14:14 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Kirill Kolyshkin, David Ahern, ksummit-discuss, Patrick McHardy

>> > On Tue, Sep 13, 2016 at 12:29:35PM -0500, Eric W. Biederman wrote:

>> To some extent it matters where the cost is.
>>
>> At a practical level we already have a binary protocol for getting a
>> list of all of the processes in the system, readdir on /proc.

readdir() on /proc is only partially binary, because the interesting part
for checkpoint purposes is in text

Generic userspace may or may not convert pid back to binary
(things like top should do it because of sorting output).

And _you_ need to filter out kernel threads.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-09-03  5:20 [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those Andrei Vagin
                   ` (2 preceding siblings ...)
  2016-09-13 17:29 ` Eric W. Biederman
@ 2016-11-01  3:15 ` Andrei Vagin
  2016-11-01 14:58   ` James Bottomley
  3 siblings, 1 reply; 15+ messages in thread
From: Andrei Vagin @ 2016-11-01  3:15 UTC (permalink / raw)
  To: ksummit-discuss
  Cc: Kirill Kolyshkin, Eric Dumazet, David Ahern, Patrick McHardy,
	Pablo Neira Ayuso

Hello everyone,

I have never been before on kernel-summit and don't know how talks are
planed here. Looks like we have enough people how want to duscuss
the subject. Maybe, if it is not too late, we can choose a time for this
discussion?

I'm here for Linux Plumbers and have a definite plan on Friday from 2pm
to 5pm. Any other time works for me.

Thanks,
Andrei

On Fri, Sep 02, 2016 at 10:20:14PM -0700, Andrei Vagin wrote:
> The netlink interface proved itself as a great way to perform
> descriptor-based kernel/userspace communication. It is especially useful
> for cases involving a big amount of data to transfer. The netlink
> communication protocol is simple and elegant; it also allows to extend
> the message format without breaking backward compatibility.
> 
> One big problem of netlink is credentials. When a user-space process is
> opening a new file descriptor, kernel saves the opener's credentials to
> f_cred field of the file struct. After that, every access to that fd are
> checked against the saved credentials. In essence, this allows for a
> process to open a file descriptor as root and then drop capabilities.
> With netlink socket, it is not possible to implement this access scheme.
> 
> Currently netlink is widely used in the network subsystem, but there are
> also a few users outside of networking, such as audit and taskstats.
> Developers who used netlink for anything except the networking know
> there are some issues. For example, taskstats code has broken user and
> pid namespace support.
> 
> Another potential user of netlink socket is task_diag, a faster
> /proc/PID-like interface proposed some time ago
> (https://lkml.org/lkml/2015/7/6/142). It makes sense to use the netlink
> interface for it, too, but the whole feature is currently blocked by the
> netlink discussion.
> 
> A few months ago Andy Lutomirski suggested to rework the netlink
> interface in order to solve the known issues. We suggest discussing his
> idea:
> 
> ----- snip --- snip --- snip -----
> (taken from http://lists.openwall.net/netdev/2016/05/05/51)
> 
> The tl;dr is that Andrey wants to add an interface to ask a pidns some
> questions, and netlink looks natural, except that using netlink sockets
> to interrogate a pidns seems rather problematic.  I would also love to
> see a decent interface for interrogating user namespaces, and again,
> netlink would be great, except that it's a socket and makes no sense in
> this context.
> 
> Netlink had, and possibly still has, tons of serious security bugs
> involving code checking send() callers' creds.  I found and fixed a few
> a couple years ago.  To reiterate once again, send() CANNOT use caller
> creds safely.  (I feel like I say this once every few weeks. It's
> getting old.)
> 
> I realize that it's convenient to use a socket as a context to keep
> state between syscalls, but it has some annoying side effects:
> 
>  - It makes people want to rely on send()'s caller's creds.
>  - It's miserable in combination with seccomp.
>  - It doesn't play nicely with namespaces.
>  - It makes me wonder why things like task_diag, which have nothing
>    to do with networking, seem to get tangled up with networking.
> 
> 
> Would it be worth considering adding a parallel interface, using it for
> new things, and slowly migrating old use cases over?
> 
> int issue_kernel_command(int ns, int command, const struct iovec *iov, int iovcnt, int flags);
> 
> ns is an actual namespace fd or:
> 
> KERNEL_COMMAND_CURRENT_NETNS
> KERNEL_COMMAND_CURRENT_PIDNS
> etc, or a special one:
> KERNEL_COMMAND_GLOBAL.  KERNEL_COMMAND_GLOBAL can't be used in a
> non-root namespace.
> 
> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> relevant current ns is the init namespace.  (This feature is optional,
> but it would allow gradually namespacing global things.)
> 
> command is an enumerated command.  Each command implies a namespace
> type, and, if you feed this thing the wrong namespace type, you get
> EINVAL.  The high bit of command indicates whether it's read-only
> command.
> 
> iov gives a command in the format expected, which, for the most part,
> would be a netlink message.
> 
> The return value is an fd that you can call read/readv on to read the
> response.  It's not a socket (or at least you can't do normal socket
> operations on it if it is a socket behind the scenes).  The
> implementation of read() promises *not* to look at caller creds.  The
> returned fd is unconditionally cloexec -- it's 2016 already.  Sheesh.
> 
> When you've read all the data, all you can do is close the fd.  You
> can't issue another command on the same fd.  You also can't call write()
> or send() on the fd unless someone has a good reason why you should be
> able to and why it's safe.  You can't issue another command on the same
> fd.
> 
> I imagine that the implementation could re-use a bunch of netlink code
> under the hood.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-11-01  3:15 ` Andrei Vagin
@ 2016-11-01 14:58   ` James Bottomley
  2016-11-01 16:39     ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2016-11-01 14:58 UTC (permalink / raw)
  To: Andrei Vagin, ksummit-discuss
  Cc: Kirill Kolyshkin, Eric Dumazet, Patrick McHardy,
	Pablo Neira Ayuso, David Ahern

On Mon, 2016-10-31 at 20:15 -0700, Andrei Vagin wrote:
> Hello everyone,
> 
> I have never been before on kernel-summit and don't know how talks 
> are planed here. Looks like we have enough people how want to duscuss
> the subject. Maybe, if it is not too late, we can choose a time for 
> this discussion?
> 
> I'm here for Linux Plumbers and have a definite plan on Friday from 
> 2pm to 5pm. Any other time works for me.

I suppose the process is a bit nebulous, let me try to explain it.

Plumbers and the Kernel Summit open track are running in parallel. 
 Plumbers has room for BoFs if you want to submit one:

https://linuxplumbersconf.org/2016/ocw/events/LPC2016BOFS/proposals

They'll be scheduled on Thursday evening.  You could also try to get it
into a relevant MC, but you have to negotiate that with the person
running the MC.

The Kernel Summit plenary track is also accepting submissions here:

https://linuxplumbersconf.org/2016/ocw/events/KS2016/proposals

The KS track has the Coronado room from Tuesday-Friday (ending around
14:00 on Friday), so you can work out your best time to schedule this. 
 Ted (as KS chair) makes the final decision, so it's a good idea as
well as submitting the proposal to send email here so Ted can gauge how
much interest there is for the proposal.

James

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
  2016-11-01 14:58   ` James Bottomley
@ 2016-11-01 16:39     ` Theodore Ts'o
       [not found]       ` <CANaxB-ycZFtZW3=WasEDXgBwf3NF4C46aNwTOpKqHjuPbN5e-Q@mail.gmail.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2016-11-01 16:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Kirill Kolyshkin, Andrei Vagin, ksummit-discuss, Eric Dumazet,
	David Ahern, Patrick McHardy, Pablo Neira Ayuso

On Tue, Nov 01, 2016 at 08:58:55AM -0600, James Bottomley wrote:
> The Kernel Summit plenary track is also accepting submissions here:
> 
> https://linuxplumbersconf.org/2016/ocw/events/KS2016/proposals
> 
> The KS track has the Coronado room from Tuesday-Friday (ending around
> 14:00 on Friday), so you can work out your best time to schedule this. 
>  Ted (as KS chair) makes the final decision, so it's a good idea as
> well as submitting the proposal to send email here so Ted can gauge how
> much interest there is for the proposal.

More than good idea; *please* send a proposal to this list, along with
your preferences of when the session should scheduled, so the critical
participants for your sessions don't have conflicts with the other LPC
events.

If you make the presentation using the web link above, it means you'll
be able to edit the description after we approve it, and it will be
reflected on the schedule.  Otherwise I will make the submission on
your behalf, which might be more convenient for you, but the tradeoff
is that you won't be able to edit abstract afterwards.

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
       [not found]       ` <CANaxB-ycZFtZW3=WasEDXgBwf3NF4C46aNwTOpKqHjuPbN5e-Q@mail.gmail.com>
@ 2016-11-03 15:41         ` Andrey Vagin
  0 siblings, 0 replies; 15+ messages in thread
From: Andrey Vagin @ 2016-11-03 15:41 UTC (permalink / raw)
  To: Andy Lutomirski, Stephen Hemminger, Arnd Bergmann, Eric W. Biederman
  Cc: Kirill Kolyshkin, ksummit-discuss, Eric Dumazet, David Ahern,
	Patrick McHardy, Pablo Neira Ayuso, Alexei Starovoitov

Hi All,

The session is scheduled to Thursday at 3pm:
Title: Netlink engine issues, and ways to fix those
https://www.linuxplumbersconf.org/2016/ocw/proposals/4599

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those
@ 2016-11-03 21:04 Kirill Kolyshkin
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Kolyshkin @ 2016-11-03 21:04 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: ksummit-discuss, Eric Dumazet, David Ahern, Patrick McHardy,
	Pablo Neira Ayuso, Alexei Starovoitov

[-- Attachment #1: Type: text/plain, Size: 310 bytes --]

Just to remind you, the session is about to begin, it's Coronado room.

On Nov 3, 2016 9:41 AM, Andrey Vagin <avagin@openvz.org> wrote:
Hi All,

The session is scheduled to Thursday at 3pm:
Title: Netlink engine issues, and ways to fix those
https://www.linuxplumbersconf.org/2016/ocw/proposals/4599


[-- Attachment #2: Type: text/html, Size: 844 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-11-03 21:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-03  5:20 [Ksummit-discuss] [TECH TOPIC] Netlink engine issues, and ways to fix those Andrei Vagin
2016-09-04 19:03 ` Andy Lutomirski
2016-09-05 19:30 ` Alexei Starovoitov
2016-09-06 16:05   ` Stephen Hemminger
2016-09-12 14:05   ` Hannes Reinecke
2016-09-13 17:29 ` Eric W. Biederman
2016-09-16  5:58   ` Andrei Vagin
2016-09-18 20:18     ` Eric W. Biederman
2016-10-11  2:13       ` Andrei Vagin
2016-10-11 14:14         ` Alexey Dobriyan
2016-11-01  3:15 ` Andrei Vagin
2016-11-01 14:58   ` James Bottomley
2016-11-01 16:39     ` Theodore Ts'o
     [not found]       ` <CANaxB-ycZFtZW3=WasEDXgBwf3NF4C46aNwTOpKqHjuPbN5e-Q@mail.gmail.com>
2016-11-03 15:41         ` Andrey Vagin
2016-11-03 21:04 Kirill Kolyshkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.