All of lore.kernel.org
* [RFC] RPCBIND: add anonymous listening socket in addition to named one
@ 2011-12-28 15:17 Stanislav Kinsbursky
  2011-12-28 17:03 ` Chuck Lever
  2011-12-28 18:22 ` bfields
  0 siblings, 2 replies; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-28 15:17 UTC (permalink / raw)
  To: Trond.Myklebust, bfields; +Cc: linux-nfs

Hello.
I've run into a problem registering the lockd service with rpcbind from inside a
container. My container operates in its own network namespace and has its own
root. On service registration, however, the kernel connects to the named unix
socket from the rpciod workqueue, so every connect is performed with that
workqueue's fs->root. As a result, the kernel socket used to register a service
with the local portmapper always connects to the same user-space socket,
regardless of the fs->root of the process that requested the registration.
A possible solution, which I would like to discuss, is to add one more listening
socket to the rpcbind process, this time an anonymous one. Anonymous unix
sockets accept connections only within their own network namespace, so the
kernel socket will always connect to the user-space socket in the same network
namespace.
Does anyone have any objections to this? Or perhaps a better solution to the
problem?
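
For what it's worth, the "anonymous" socket idea can be sketched from user space with a Linux abstract-namespace AF_UNIX socket (a sun_path starting with a NUL byte): no filesystem entry is created, and the name is resolved within the current network namespace. The socket name below is made up for the demo; this is a sketch, not rpcbind code.

```python
import socket

# An abstract AF_UNIX address starts with a NUL byte.  It lives in a
# per-network-namespace table in the kernel, not on the filesystem,
# so fs->root plays no role in resolving it.
ADDR = b"\0rpcbind-demo"  # hypothetical name, for illustration only

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(ADDR)            # no /var/run entry is created
srv.listen(1)

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(ADDR)         # looked up in the current net namespace, no path walk

conn, _ = srv.accept()
conn.sendall(b"ok")
reply = cli.recv(2)
print(reply.decode())
```

Since nothing is created under /var/run, no pathname lookup happens on connect; only the network namespace of the connecting task determines which listener is reached.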

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 15:17 [RFC] RPCBIND: add anonymous listening socket in addition to named one Stanislav Kinsbursky
@ 2011-12-28 17:03 ` Chuck Lever
  2011-12-28 17:30   ` Stanislav Kinsbursky
  2011-12-28 18:22 ` bfields
  1 sibling, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2011-12-28 17:03 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, bfields, linux-nfs


On Dec 28, 2011, at 10:17 AM, Stanislav Kinsbursky wrote:

> Hello.
> I've experienced a problem with registering Lockd service with rpcbind in container. My container operates in it's own network namespace context and has it's own root. But on service register, kernel tries to connect to named unix socket by using rpciod_workqueue. Thus any connect is done with the same fs->root, and this leads to that kernel socket, used for registering service with local portmapper, will always connect to the same user-space socket regardless to fs->root of process, requested register operation.
> Possible solution for this problem, which I would like to discuss, is to add one more listening socket to rpcbind process. But this one should be anonymous. Anonymous unix sockets accept connections only within it's network namespace context, so kernel socket connect will be done always to the user-space socket in the same network namespace.

A UNIX socket is used so that rpcbind can record the identity of the process on the other end of the socket.  That way only that user may unregister this service.  That user is known as the registration's "owner."  Whatever solution is chosen, I believe we need to preserve the registration owner functionality.

> Does anyone have any objections to this? Or, probably, better solution for the problem?


Isn't this also an issue for TCP connections to other hosts?  How does the kernel RPC client choose a TCP connection's source address?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 17:03 ` Chuck Lever
@ 2011-12-28 17:30   ` Stanislav Kinsbursky
  2011-12-28 17:59     ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-28 17:30 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond.Myklebust, bfields, linux-nfs

On 28.12.2011 21:03, Chuck Lever wrote:
>
> On Dec 28, 2011, at 10:17 AM, Stanislav Kinsbursky wrote:
>
> > Hello.
> > I've experienced a problem with registering Lockd service with rpcbind in 
> container. My container operates in it's own network namespace context and has 
> it's own root. But on service register, kernel tries to connect to named unix 
> socket by using rpciod_workqueue. Thus any connect is done with the same 
> fs->root, and this leads to that kernel socket, used for registering service 
> with local portmapper, will always connect to the same user-space socket 
> regardless to fs->root of process, requested register operation.
> > Possible solution for this problem, which I would like to discuss, is to add 
> one more listening socket to rpcbind process. But this one should be 
> anonymous. Anonymous unix sockets accept connections only within it's network 
> namespace context, so kernel socket connect will be done always to the 
> user-space socket in the same network namespace.
>
> A UNIX socket is used so that rpcbind can record the identity of the process 
> on the other end of the socket.  That way only that user may unregister this 
> service.  That user is known as the registration's "owner."  Whatever solution 
> is chosen, I believe we need to preserve the registration owner functionality.
>

Sorry, but I don't get it.
What do you mean by "user" and "identity"?

> > Does anyone have any objections to this? Or, probably, better solution for 
> the problem?
>
>
> Isn't this also an issue for TCP connections to other hosts?  How does the 
> kernel RPC client choose a TCP connection's source address?
>

I'm confused here too. What TCP connections are you talking about? And what 
source address?

A little more info about the whole "NFS in container" structure (as I see it):
1) Each container operates in its own network namespace and has its own root.
2) Each container has its own network device(s) and IP address(es).
3) Each container has its own rpcbind process instance.
4) Each service (like lockd and, in the future, nfsd) will register itself with
every per-net rpcbind instance it has to.

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 17:30   ` Stanislav Kinsbursky
@ 2011-12-28 17:59     ` Chuck Lever
  2011-12-29 11:48       ` Stanislav Kinsbursky
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2011-12-28 17:59 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, bfields, linux-nfs


On Dec 28, 2011, at 12:30 PM, Stanislav Kinsbursky wrote:

> On 28.12.2011 21:03, Chuck Lever wrote:
>> 
>> On Dec 28, 2011, at 10:17 AM, Stanislav Kinsbursky wrote:
>> 
>> > Hello.
>> > I've experienced a problem with registering Lockd service with rpcbind in container. My container operates in it's own network namespace context and has it's own root. But on service register, kernel tries to connect to named unix socket by using rpciod_workqueue. Thus any connect is done with the same fs->root, and this leads to that kernel socket, used for registering service with local portmapper, will always connect to the same user-space socket regardless to fs->root of process, requested register operation.
>> > Possible solution for this problem, which I would like to discuss, is to add one more listening socket to rpcbind process. But this one should be anonymous. Anonymous unix sockets accept connections only within it's network namespace context, so kernel socket connect will be done always to the user-space socket in the same network namespace.
>> 
>> A UNIX socket is used so that rpcbind can record the identity of the process on the other end of the socket.  That way only that user may unregister this service.  That user is known as the registration's "owner."  Whatever solution is chosen, I believe we need to preserve the registration owner functionality.
>> 
> 
> Sorry, but I don't get get it.
> What do you mean by "user" and "identity"?

When an RPC application registers itself with the local rpcbind daemon, it does so with an AF_UNIX socket.  rpcbind scrapes the UID of the RPC application process off the other end of the socket, and records that UID with the new registration.  For example:

[cel@forain ~]$ rpcinfo
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.149.137        status     29
    100024    1    tcp       0.0.0.0.152.179        status     29
    100024    1    udp6      ::.148.0               status     29
    100024    1    tcp6      ::.217.71              status     29
[cel@forain ~]$

The last column is the "owner."  That's the UID of the process that performed the registration.  Only processes running under that UID may unregister that service.

This doesn't work for registrations that were performed via a network interface (like lo).  It only works when an application uses the AF_UNIX socket.

The point of this is to prevent other users from replacing a registration.  Now any user can register an RPC service and be sure that it won't be stomped on by some other user.

Whatever solution to your problem you find, it must preserve this behavior.  Will using an anonymous socket allow rpcbind to discover the UID of the registering process?
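
For reference, the owner-recording mechanism described above can be demonstrated from user space with SO_PEERCRED on Linux, which lets a daemon read the peer's credentials off an AF_UNIX socket. This is a minimal sketch, not rpcbind's actual code; both ends live in one process here, so the reported UID is simply our own:

```python
import os
import socket
import struct

# A connected AF_UNIX pair; rpcbind would instead accept() on its
# listening socket and query the accepted connection.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# SO_PEERCRED returns struct ucred { pid_t pid; uid_t uid; gid_t gid; }.
fmt = "3i"
creds = parent.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize(fmt))
pid, uid, gid = struct.unpack(fmt, creds)
print(uid)  # the UID rpcbind would record as the registration's "owner"
```

A registration arriving over lo carries no such credentials, which is why, as noted above, the owner check works only for the AF_UNIX transport.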

> 
>> > Does anyone have any objections to this? Or, probably, better solution for the problem?
>> 
>> 
>> Isn't this also an issue for TCP connections to other hosts?  How does the kernel RPC client choose a TCP connection's source address?
>> 
> 
> I'm confused here too. What TPC connections are you talking about? And what source address?

A TCP socket has two endpoints.  The source address and port for the local endpoint is chosen when the socket is bound.  The destination address and port for the remote endpoint is chosen when the socket is connected.

RPC client consumers, such as lockd, the NFS client, or the MNT client, have to make outbound TCP connections to other hosts.  In user space, RPC TCP sockets use the IP address of the current network namespace as their source address.

If the kernel RPC client makes a TCP connection to another host, how is the TCP socket's source address determined? If the answer is that kernel_bind() chooses this source address, and that kernel_bind() call is performed in the rpciod connect worker, then the source address is always chosen in the root network namespace (unless the rpciod connect worker is namespace aware).
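
To make the bind/connect distinction concrete, here is a user-space loopback sketch of the sequence in question (illustrative only; the kernel path uses kernel_bind()/kernel_connect() internally rather than these syscalls directly):

```python
import socket

# A listener standing in for the remote RPC service.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))        # port 0: let the kernel pick one
srv.listen(1)
port = srv.getsockname()[1]

# The client side: bind() fixes the SOURCE address, connect() the
# DESTINATION.  Which addresses are even visible at bind time depends
# on the network namespace the calls run in.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.bind(("127.0.0.1", 0))        # choose the source address explicitly
cli.connect(("127.0.0.1", port))  # choose the destination

src_addr, src_port = cli.getsockname()
print(src_addr)
```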

> I little bit more info about the whole "NFS in container" structure (as I see it):
> 1) Each container operates in it's own network namespace and has it's own root.
> 2) Each contatiner has it's own network device(s) and IP address(es).

Right.  As above, I assumed the IP address of the current network namespace is used as the source address on outbound TCP connections.  That would mean that the rpciod work queue that handles such connections would have to be network namespace aware.

If it is, why isn't this also enough for RPC over AF_UNIX sockets?  The network namespace in effect when the kernel performs the connect should determine which rpcbind is listening on the other end of the AF_UNIX socket in your local network namespace, unless I've misunderstood your problem.

> 3) Each container has it's own rpcbind process instance.
> 4) Each service (like LockD and NFSd  in future) will register itself with all per-net rpcbind instances it have to.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 15:17 [RFC] RPCBIND: add anonymous listening socket in addition to named one Stanislav Kinsbursky
  2011-12-28 17:03 ` Chuck Lever
@ 2011-12-28 18:22 ` bfields
  2011-12-29 11:48   ` Stanislav Kinsbursky
  1 sibling, 1 reply; 15+ messages in thread
From: bfields @ 2011-12-28 18:22 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, linux-nfs

On Wed, Dec 28, 2011 at 07:17:30PM +0400, Stanislav Kinsbursky wrote:
> I've experienced a problem with registering Lockd service with
> rpcbind in container. My container operates in it's own network
> namespace context and has it's own root. But on service register,
> kernel tries to connect to named unix socket by using
> rpciod_workqueue. Thus any connect is done with the same fs->root,

There's no way to pass the correct context down to the rpc task and from
there to the registration code?

--b.

> and this leads to that kernel socket, used for registering service
> with local portmapper, will always connect to the same user-space
> socket regardless to fs->root of process, requested register
> operation.
> Possible solution for this problem, which I would like to discuss,
> is to add one more listening socket to rpcbind process. But this one
> should be anonymous. Anonymous unix sockets accept connections only
> within it's network namespace context, so kernel socket connect will
> be done always to the user-space socket in the same network
> namespace.
> Does anyone have any objections to this? Or, probably, better
> solution for the problem?
> 
> -- 
> Best regards,
> Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 17:59     ` Chuck Lever
@ 2011-12-29 11:48       ` Stanislav Kinsbursky
  2011-12-29 16:03         ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-29 11:48 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond.Myklebust, bfields, linux-nfs

On 28.12.2011 21:59, Chuck Lever wrote:
>
> On Dec 28, 2011, at 12:30 PM, Stanislav Kinsbursky wrote:
>
>> On 28.12.2011 21:03, Chuck Lever wrote:
>>>
>>> On Dec 28, 2011, at 10:17 AM, Stanislav Kinsbursky wrote:
>>>
>>>> Hello.
>>>> I've experienced a problem with registering Lockd service with rpcbind in container. My container operates in it's own network namespace context and has it's own root. But on service register, kernel tries to connect to named unix socket by using rpciod_workqueue. Thus any connect is done with the same fs->root, and this leads to that kernel socket, used for registering service with local portmapper, will always connect to the same user-space socket regardless to fs->root of process, requested register operation.
>>>> Possible solution for this problem, which I would like to discuss, is to add one more listening socket to rpcbind process. But this one should be anonymous. Anonymous unix sockets accept connections only within it's network namespace context, so kernel socket connect will be done always to the user-space socket in the same network namespace.
>>>
>>> A UNIX socket is used so that rpcbind can record the identity of the process on the other end of the socket.  That way only that user may unregister this service.  That user is known as the registration's "owner."  Whatever solution is chosen, I believe we need to preserve the registration owner functionality.
>>>
>>
>> Sorry, but I don't get get it.
>> What do you mean by "user" and "identity"?
>
> When an RPC application registers itself with the local rpcbind daemon, it does so with an AF_UNIX socket.  rpcbind scrapes the UID of the RPC application process off the other end of the socket, and records that UID with the new registration.  For example:
>
> [cel@forain ~]$ rpcinfo
>     program version netid     address                service    owner
>      100000    4    tcp6      ::.0.111               portmapper superuser
>      100000    3    tcp6      ::.0.111               portmapper superuser
>      100000    4    udp6      ::.0.111               portmapper superuser
>      100000    3    udp6      ::.0.111               portmapper superuser
>      100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>      100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>      100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>      100000    4    udp       0.0.0.0.0.111          portmapper superuser
>      100000    3    udp       0.0.0.0.0.111          portmapper superuser
>      100000    2    udp       0.0.0.0.0.111          portmapper superuser
>      100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>      100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>      100024    1    udp       0.0.0.0.149.137        status     29
>      100024    1    tcp       0.0.0.0.152.179        status     29
>      100024    1    udp6      ::.148.0               status     29
>      100024    1    tcp6      ::.217.71              status     29
> [cel@forain ~]$
>
> The last column is the "owner."  That's the UID of the process that performed the registration.  Only processes running under that UID may unregister that service.
>
> This doesn't work for registrations that were performed via a network interface (like lo).  It only works when an application uses the AF_UNIX socket.
>
> The point of this is to prevent other users from replacing a registration.  Now any user can register an RPC service and be sure that it won't be stomped on by some other user.
>
> Whatever solution to your problem you find, it must preserve this behavior.  Will using an anonymous socket allow rpcbind to discover the UID of the registering process?
>

First of all, thanks for the detailed explanation.
And the answer is yes - an anonymous socket will allow rpcbind to discover the
UID of the registering process.
At least I don't see any difference on this path between named and anonymous
sockets (unix_listen and unix_stream_connect).

>
> A TCP socket has two endpoints.  The source address and port for the local endpoint is chosen when the socket is bound.  The destination address and port for the remote endpoint is chosen when the socket is connected.
>
> RPC client consumers, such as lockd, the NFS client, or the MNT client, have to make outbound TCP connections to other hosts.  In user space, RPC TCP sockets use the IP address of the current network namespace as their source address.
>
> If the kernel RPC client makes a TCP connection to another host, how is the TCP socket's source address determined? If the answer is that kernel_bind() chooses this source address, and that kernel_bind() call is performed in the rpciod connect worker, then the source address is always chosen in the root network namespace (unless the rpciod connect worker is namespace aware).
>

The address to bind to is taken from the transport, which is set up at RPC
client creation (and the client is created in sync mode, i.e. in the caller's
context). So no problem here, I hope.

>> I little bit more info about the whole "NFS in container" structure (as I see it):
>> 1) Each container operates in it's own network namespace and has it's own root.
>> 2) Each contatiner has it's own network device(s) and IP address(es).
>
> Right.  As above, I assumed the IP address of the current network namespace is used as the source address on outbound TCP connections.  That would mean that the rpciod work queue that handles such connections would have to be network namespace aware.
>

And it is. IOW, the rpciod workqueue just handles rpc_tasks as they are.

> If it is, why isn't this also enough for RPC over AF_UNIX sockets?  The network namespace in effect when the kernel performs the connect should determine which rpcbind is listening on the other end of the AF_UNIX socket in your local network namespace, unless I've misunderstood your problem.
>

Because named (!) unix sockets are not network namespace aware; they are
"current->fs->root" aware. I.e. if the connect operation were performed in
"sync mode" (i.e. from the context of the mount or NFS server start operation),
then everything would work fine (provided each container runs in its own root,
of course).
But currently all connect operations are done by the rpciod workqueue. I.e.
regardless of the root of the process that started the service registration,
the unix socket will always be looked up by the path "/var/run/rpcbind.sock"
starting from the rpciod workqueue's root. I could set the desired root before
the kernel connect while handling the RPC task in the rpciod workqueue and
revert to the old one after the connection; this would also solve the problem.
But that approach looks like an ugly hack to me, and it also requires an
additional pointer in the sock_xprt structure to pass the desired context to
the rpciod workqueue handler.

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-28 18:22 ` bfields
@ 2011-12-29 11:48   ` Stanislav Kinsbursky
  0 siblings, 0 replies; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-29 11:48 UTC (permalink / raw)
  To: bfields; +Cc: Trond.Myklebust, linux-nfs

On 28.12.2011 22:22, bfields@fieldses.org wrote:
> On Wed, Dec 28, 2011 at 07:17:30PM +0400, Stanislav Kinsbursky wrote:
>> I've experienced a problem with registering Lockd service with
>> rpcbind in container. My container operates in it's own network
>> namespace context and has it's own root. But on service register,
>> kernel tries to connect to named unix socket by using
>> rpciod_workqueue. Thus any connect is done with the same fs->root,
>
> There's no way to pass the correct context down to the rpc task and from
> there to the registration code?
>

This context is current->fs->root. It's used to look up the unix socket inode
by name on connect.
Obviously, current here is always the "rpciod_workqueue" kernel thread, and
thus always has the same root.
The only way I can see to pass and use the context is to extend sock_xprt with
a pointer to the correct root, then set this root before kernel_connect() and
restore the old one afterwards.
But this looks ugly and too complicated for the issue, from my point of view.
Or maybe you have a more elegant solution to this problem?

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 11:48       ` Stanislav Kinsbursky
@ 2011-12-29 16:03         ` Chuck Lever
  2011-12-29 16:12           ` Stanislav Kinsbursky
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2011-12-29 16:03 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, bfields, linux-nfs


On Dec 29, 2011, at 6:48 AM, Stanislav Kinsbursky wrote:

> On 28.12.2011 21:59, Chuck Lever wrote:
>> 
>> On Dec 28, 2011, at 12:30 PM, Stanislav Kinsbursky wrote:
>> 
>>> On 28.12.2011 21:03, Chuck Lever wrote:
>>>> 
>>>> On Dec 28, 2011, at 10:17 AM, Stanislav Kinsbursky wrote:
>>>> 
>>>>> Hello.
>>>>> I've experienced a problem with registering Lockd service with rpcbind in container. My container operates in it's own network namespace context and has it's own root. But on service register, kernel tries to connect to named unix socket by using rpciod_workqueue. Thus any connect is done with the same fs->root, and this leads to that kernel socket, used for registering service with local portmapper, will always connect to the same user-space socket regardless to fs->root of process, requested register operation.
>>>>> Possible solution for this problem, which I would like to discuss, is to add one more listening socket to rpcbind process. But this one should be anonymous. Anonymous unix sockets accept connections only within it's network namespace context, so kernel socket connect will be done always to the user-space socket in the same network namespace.
>>>> 
>>>> A UNIX socket is used so that rpcbind can record the identity of the process on the other end of the socket.  That way only that user may unregister this service.  That user is known as the registration's "owner."  Whatever solution is chosen, I believe we need to preserve the registration owner functionality.
>>>> 
>>> 
>>> Sorry, but I don't get get it.
>>> What do you mean by "user" and "identity"?
>> 
>> When an RPC application registers itself with the local rpcbind daemon, it does so with an AF_UNIX socket.  rpcbind scrapes the UID of the RPC application process off the other end of the socket, and records that UID with the new registration.  For example:
>> 
>> [cel@forain ~]$ rpcinfo
>>    program version netid     address                service    owner
>>     100000    4    tcp6      ::.0.111               portmapper superuser
>>     100000    3    tcp6      ::.0.111               portmapper superuser
>>     100000    4    udp6      ::.0.111               portmapper superuser
>>     100000    3    udp6      ::.0.111               portmapper superuser
>>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>>     100024    1    udp       0.0.0.0.149.137        status     29
>>     100024    1    tcp       0.0.0.0.152.179        status     29
>>     100024    1    udp6      ::.148.0               status     29
>>     100024    1    tcp6      ::.217.71              status     29
>> [cel@forain ~]$
>> 
>> The last column is the "owner."  That's the UID of the process that performed the registration.  Only processes running under that UID may unregister that service.
>> 
>> This doesn't work for registrations that were performed via a network interface (like lo).  It only works when an application uses the AF_UNIX socket.
>> 
>> The point of this is to prevent other users from replacing a registration.  Now any user can register an RPC service and be sure that it won't be stomped on by some other user.
>> 
>> Whatever solution to your problem you find, it must preserve this behavior.  Will using an anonymous socket allow rpcbind to discover the UID of the registering process?
>> 
> 
> First of all, thanks for detailed explanation.
> And the answer is yes - anonymous socket will allow rpcbind to discover the UID of the registering process.
> At least I don't see any differences in this place between named and anonymous socket (unix_listen and unix_stream_connect).
> 
>> 
>> A TCP socket has two endpoints.  The source address and port for the local endpoint is chosen when the socket is bound.  The destination address and port for the remote endpoint is chosen when the socket is connected.
>> 
>> RPC client consumers, such as lockd, the NFS client, or the MNT client, have to make outbound TCP connections to other hosts.  In user space, RPC TCP sockets use the IP address of the current network namespace as their source address.
>> 
>> If the kernel RPC client makes a TCP connection to another host, how is the TCP socket's source address determined? If the answer is that kernel_bind() chooses this source address, and that kernel_bind() call is performed in the rpciod connect worker, then the source address is always chosen in the root network namespace (unless the rpciod connect worker is namespace aware).
>> 
> 
> This address to bind to is taken from transport, which is set on RPC client creation (which is created in sync mode). So, no problem here, I hope.
> 
>>> I little bit more info about the whole "NFS in container" structure (as I see it):
>>> 1) Each container operates in it's own network namespace and has it's own root.
>>> 2) Each contatiner has it's own network device(s) and IP address(es).
>> 
>> Right.  As above, I assumed the IP address of the current network namespace is used as the source address on outbound TCP connections.  That would mean that the rpciod work queue that handles such connections would have to be network namespace aware.
>> 
> 
> And it is. IOW, rpciod_workqueue just handles rpc_tasks as is.
> 
>> If it is, why isn't this also enough for RPC over AF_UNIX sockets?  The network namespace in effect when the kernel performs the connect should determine which rpcbind is listening on the other end of the AF_UNIX socket in your local network namespace, unless I've misunderstood your problem.
>> 
> 
> Because unix named (!) sockets are not network namespace aware.

OK, this is the part I was unaware of.  Bruce and I had talked about this a few months ago, and my take-away was that these sockets were network namespace-aware, such that this was supposed to work as we want, already.

> They are "current->fs->root" aware.

In other words, these are relative to the local file namespace, not the local network namespace.  I was afraid of that.

> I.e. if this connect operation would be performed on "sync mode" (i.e. from the context of mount of NFS server start operation), then all will works fine (in case of each container works in it's own root, of course).
> But currently all connect operations are done by rpciod_workqueue. I.e. regardless to root of process, started service registration, unix socket will be always looked up by path "/var/run/rpcbind.sock" starting from rpciod_workqueue root. I can set desired root before kernel connect during handling RPC task by rpciod_workqueue and revert back to the old one after connection and this will solve the problem to.
> But this approach looks like ugly hack to me. And also requires additional pointer in sock_xprt structure to bypass desired context to rpciod_workqueue handler.

Can several network namespaces share the same file namespace?  That might cause them to share the same rpcbind, which is undesirable.  Might this also be a problem for user space?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 16:03         ` Chuck Lever
@ 2011-12-29 16:12           ` Stanislav Kinsbursky
  2011-12-29 16:23             ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-29 16:12 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond.Myklebust, bfields, linux-nfs

On 29.12.2011 20:03, Chuck Lever wrote:
>> Because unix named (!) sockets are not network namespace aware.
>
> OK, this is the part I was unaware of.  Bruce and I had talked about this a few months ago, and my take-away was that these sockets were network namespace-aware, such that this was supposed to work as we want, already.
>
>> They are "current->fs->root" aware.
>
> In other words, these are relative to the local file namespace, not the local network namespace.  I was afraid of that.
>
>> I.e., if this connect operation were performed in "sync mode" (i.e. from the context of the mount or NFS server start operation), then everything would work fine (provided each container runs in its own root, of course).
>> But currently all connect operations are done by rpciod_workqueue. I.e., regardless of the root of the process that started service registration, the unix socket will always be looked up by the path "/var/run/rpcbind.sock" starting from the rpciod_workqueue root. I could set the desired root before the kernel connect while rpciod_workqueue handles the RPC task and revert to the old one after connecting, and this would solve the problem too.
>> But this approach looks like an ugly hack to me. It also requires an additional pointer in the sock_xprt structure to pass the desired context to the rpciod_workqueue handler.
>
> Can several network namespaces share the same file namespace?  That might cause them to share the same rpcbind, which is undesirable.  Might this also be a problem for user space?
>

Yes, they can, but only in principle. And it would be a problem for user-space
programs that use named unix sockets for network-related work (like rpcbind,
for instance).
But, actually, I don't see any sense in having several network namespaces with
the same root. Perhaps someone can suggest a specific "real life" setup that
uses such a scheme. But it's not a container, and thus no guarantees should be
provided in that case, from my point of view.
Otherwise we need to throw away this unix-sockets approach and use only
network-namespace-aware routines. But again, is that really required?
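The network-namespace-aware alternative proposed in the original RFC, anonymous listening sockets, can be illustrated from user space with Linux abstract AF_UNIX sockets: their names live in a per-netns table rather than in the filesystem. A sketch follows; the name "\0rpcbind-demo" is made up for the demo:

```python
import os
import socket

# Linux abstract AF_UNIX socket: the address starts with a NUL byte and
# is registered per network namespace, so no filesystem lookup is
# involved at all. A connect() from another netns would simply fail with
# ECONNREFUSED instead of reaching a foreign rpcbind.
name = "\0rpcbind-demo"

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(name)
server.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(name)  # resolved in the kernel's per-netns table

# No node was created anywhere in the filesystem for this name.
assert not os.path.exists("/rpcbind-demo")

client.close()
server.close()
```

Note this is Linux-specific behavior; pathname AF_UNIX sockets on the same system still resolve through the file namespace as discussed above.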

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 16:12           ` Stanislav Kinsbursky
@ 2011-12-29 16:23             ` Chuck Lever
  2011-12-29 17:04               ` Stanislav Kinsbursky
  2011-12-29 17:42               ` Stanislav Kinsbursky
  0 siblings, 2 replies; 15+ messages in thread
From: Chuck Lever @ 2011-12-29 16:23 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Trond.Myklebust, bfields, linux-nfs


On Dec 29, 2011, at 11:12 AM, Stanislav Kinsbursky wrote:

> 29.12.2011 20:03, Chuck Lever wrote:
>>> Because unix named (!) sockets are not network namespace aware.
>> 
>> OK, this is the part I was unaware of.  Bruce and I had talked about this a few months ago, and my take-away was that these sockets were network namespace-aware, such that this was supposed to work as we want, already.
>> 
>>> They are "current->fs->root" aware.
>> 
>> In other words, these are relative to the local file namespace, not the local network namespace.  I was afraid of that.
>> 
>>> I.e., if this connect operation were performed in "sync mode" (i.e. from the context of the mount or NFS server start operation), then everything would work fine (provided each container runs in its own root, of course).
>>> But currently all connect operations are done by rpciod_workqueue. I.e., regardless of the root of the process that started service registration, the unix socket will always be looked up by the path "/var/run/rpcbind.sock" starting from the rpciod_workqueue root. I could set the desired root before the kernel connect while rpciod_workqueue handles the RPC task and revert to the old one after connecting, and this would solve the problem too.
>>> But this approach looks like an ugly hack to me. It also requires an additional pointer in the sock_xprt structure to pass the desired context to the rpciod_workqueue handler.
>> 
>> Can several network namespaces share the same file namespace?  That might cause them to share the same rpcbind, which is undesirable.  Might this also be a problem for user space?
>> 
> 
> Yes, they can, but only in principle. And it would be a problem for user-space programs that use named unix sockets for network-related work (like rpcbind, for instance).
> But, actually, I don't see any sense in having several network namespaces with the same root. Perhaps someone can suggest a specific "real life" setup that uses such a scheme.

I can't think of one either.

> But it's not a container, and thus no guarantees should be provided in that case, from my point of view.

That's probably reasonable, and should be documented publicly if we take the approach of keeping a unique /var/run/rpcbind.sock for each network namespace.

> Otherwise we need to throw away this unix-sockets approach and use only network-namespace-aware routines. But again, is that really required?

/var/run/rpcbind.sock is a formal libtirpc/rpcbind API that is common to libtirpc on other OSes.  Now, it's not likely that any application except the kernel uses it directly.  Still, I'm leery of removing it entirely. 

My sense is that handing an fs->root value to the rpciod workqueue gives us behavior that is closest to what we have now in a single network namespace configuration.  In other words, it's a change that introduces the least amount of "surprise" to the current RPC architecture.
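That save/swap/restore idea can be sketched in user space, with a working directory standing in for the kernel's fs->root. All names below are illustrative, not kernel API; the real patch has to do the equivalent with get_fs_root()/set_fs_root() inside the workqueue handler:

```python
import os
import tempfile

def run_in_saved_context(saved_dir, operation):
    """Install a caller's saved context around a deferred operation."""
    worker_dir = os.getcwd()    # analogous to saving the worker's fs->root
    os.chdir(saved_dir)         # analogous to set_fs_root(saved root)
    try:
        return operation()      # the connect performed by the worker
    finally:
        os.chdir(worker_dir)    # always restore the worker's own context

# The "caller" prepares a private root containing its own rpcbind socket
# path, while the "worker" normally runs in a different context.
caller_root = tempfile.mkdtemp()
with open(os.path.join(caller_root, "rpcbind.sock"), "w") as f:
    f.write("stub")

found = run_in_saved_context(caller_root,
                             lambda: os.path.exists("rpcbind.sock"))
assert found                                # resolved in the caller's root
assert not os.path.exists("rpcbind.sock")   # worker context was restored
```

The design point is the same one Chuck makes: the context swap is confined to the deferred operation, so nothing else running on the worker observes a change.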

Do you have a patch so we can see just how ugly this might get?

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 16:23             ` Chuck Lever
@ 2011-12-29 17:04               ` Stanislav Kinsbursky
  2011-12-29 17:42               ` Stanislav Kinsbursky
  1 sibling, 0 replies; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-29 17:04 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond.Myklebust, bfields, linux-nfs

29.12.2011 20:23, Chuck Lever wrote:
>
> On Dec 29, 2011, at 11:12 AM, Stanislav Kinsbursky wrote:
>
>> 29.12.2011 20:03, Chuck Lever wrote:
>>>> Because unix named (!) sockets are not network namespace aware.
>>>
>>> OK, this is the part I was unaware of.  Bruce and I had talked about this a few months ago, and my take-away was that these sockets were network namespace-aware, such that this was supposed to work as we want, already.
>>>
>>>> They are "current->fs->root" aware.
>>>
>>> In other words, these are relative to the local file namespace, not the local network namespace.  I was afraid of that.
>>>
>>>> I.e., if this connect operation were performed in "sync mode" (i.e. from the context of the mount or NFS server start operation), then everything would work fine (provided each container runs in its own root, of course).
>>>> But currently all connect operations are done by rpciod_workqueue. I.e., regardless of the root of the process that started service registration, the unix socket will always be looked up by the path "/var/run/rpcbind.sock" starting from the rpciod_workqueue root. I could set the desired root before the kernel connect while rpciod_workqueue handles the RPC task and revert to the old one after connecting, and this would solve the problem too.
>>>> But this approach looks like an ugly hack to me. It also requires an additional pointer in the sock_xprt structure to pass the desired context to the rpciod_workqueue handler.
>>>
>>> Can several network namespaces share the same file namespace?  That might cause them to share the same rpcbind, which is undesirable.  Might this also be a problem for user space?
>>>
>>
>> Yes, they can, but only in principle. And it would be a problem for user-space programs that use named unix sockets for network-related work (like rpcbind, for instance).
>> But, actually, I don't see any sense in having several network namespaces with the same root. Perhaps someone can suggest a specific "real life" setup that uses such a scheme.
>
> I can't think of one either.
>
>> But it's not a container, and thus no guarantees should be provided in that case, from my point of view.
>
> That's probably reasonable, and should be documented publicly if we take the approach of keeping a unique /var/run/rpcbind.sock for each network namespace.
>

Ok, thanks for pointing that out.

>> Otherwise we need to throw away this unix-sockets approach and use only network-namespace-aware routines. But again, is that really required?
>
> /var/run/rpcbind.sock is a formal libtirpc/rpcbind API that is common to libtirpc on other OSes.  Now, it's not likely that any application except the kernel uses it directly.  Still, I'm leery of removing it entirely.
>

BTW, I'm not proposing to remove it.

> My sense is that handing an fs->root value to the rpciod workqueue gives us behavior that is closest to what we have now in a single network namespace configuration.  In other words, it's a change that introduces the least amount of "surprise" to the current RPC architecture.
>

I dug a little into the rpcbind sources, and you are probably right.

> Do you have a patch so we can see just how ugly this might get?
>

Not yet, but I will send one soon.

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 16:23             ` Chuck Lever
  2011-12-29 17:04               ` Stanislav Kinsbursky
@ 2011-12-29 17:42               ` Stanislav Kinsbursky
  2012-01-25 11:12                 ` Stanislav Kinsbursky
  1 sibling, 1 reply; 15+ messages in thread
From: Stanislav Kinsbursky @ 2011-12-29 17:42 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond.Myklebust, bfields, linux-nfs

29.12.2011 20:23, Chuck Lever wrote:
>
> Do you have a patch so we can see just how ugly this might get?
>

Something like this (I've checked - it works properly).

From: Stanislav Kinsbursky <skinsbursky@parallels.com>


---
  fs/fs_struct.c        |    1 +
  net/sunrpc/xprtsock.c |   20 ++++++++++++++++++++
  2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 78b519c..0f984c3 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -36,6 +36,7 @@ void set_fs_root(struct fs_struct *fs, struct path *path)
         if (old_root.dentry)
                 path_put_longterm(&old_root);
  }
+EXPORT_SYMBOL_GPL(set_fs_root);

  /*
   * Replace the fs->{pwdmnt,pwd} with {mnt,dentry}. Put the old values.
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 610a74a..d22ef5b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -37,6 +37,7 @@
  #include <linux/sunrpc/svcsock.h>
  #include <linux/sunrpc/xprtsock.h>
  #include <linux/file.h>
+#include <linux/fs_struct.h>
  #ifdef CONFIG_SUNRPC_BACKCHANNEL
  #include <linux/sunrpc/bc_xprt.h>
  #endif
@@ -239,6 +240,8 @@ struct sock_xprt {
         void                    (*old_state_change)(struct sock *);
         void                    (*old_write_space)(struct sock *);
         void                    (*old_error_report)(struct sock *);
+
+       struct path             *root;
  };

  /*
@@ -848,6 +851,11 @@ static void xs_destroy(struct rpc_xprt *xprt)

         cancel_delayed_work_sync(&transport->connect_worker);

+       if (transport->root) {
+               path_put(transport->root);
+               kfree(transport->root);
+       }
+
         xs_close(xprt);
         xs_free_peer_addresses(xprt);
         xprt_free(xprt);
@@ -1881,6 +1889,7 @@ static void xs_local_setup_socket(struct work_struct *work)
         struct rpc_xprt *xprt = &transport->xprt;
         struct socket *sock;
         int status = -EIO;
+       struct path root;

         if (xprt->shutdown)
                 goto out;
@@ -1898,6 +1907,9 @@ static void xs_local_setup_socket(struct work_struct *work)
         dprintk("RPC:       worker connecting xprt %p via AF_LOCAL to %s\n",
                         xprt, xprt->address_strings[RPC_DISPLAY_ADDR]);

+       get_fs_root(current->fs, &root);
+       set_fs_root(current->fs, transport->root);
+
         status = xs_local_finish_connecting(xprt, sock);
         switch (status) {
         case 0:
@@ -1915,6 +1927,8 @@ static void xs_local_setup_socket(struct work_struct *work)
                                 xprt->address_strings[RPC_DISPLAY_ADDR]);
         }

+       set_fs_root(current->fs, &root);
+
  out:
         xprt_clear_connecting(xprt);
         xprt_wake_pending_tasks(xprt, status);
@@ -2577,6 +2591,12 @@ static struct rpc_xprt *xs_setup_local(struct xprt_create *args)
                 INIT_DELAYED_WORK(&transport->connect_worker,
                                         xs_local_setup_socket);
                 xs_format_peer_addresses(xprt, "local", RPCBIND_NETID_LOCAL);
+
+               ret = ERR_PTR(-ENOMEM);
+               transport->root = kmalloc(sizeof(struct path), GFP_KERNEL);
+               if (!transport->root)
+                       goto out_err;
+               get_fs_root(current->fs, transport->root);
                 break;
         default:
                 ret = ERR_PTR(-EAFNOSUPPORT);


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2011-12-29 17:42               ` Stanislav Kinsbursky
@ 2012-01-25 11:12                 ` Stanislav Kinsbursky
  2012-01-25 14:41                   ` bfields
  0 siblings, 1 reply; 15+ messages in thread
From: Stanislav Kinsbursky @ 2012-01-25 11:12 UTC (permalink / raw)
  To: Chuck Lever, bfields; +Cc: Trond.Myklebust, linux-nfs

Chuck, Bruce, have you seen the patch I'm replying to?
I would appreciate any comments on this problem.
Also, perhaps we could instead just perform the local transport's socket
connection in sync mode? That should also solve our problem.
What do you think?

-- 
Best regards,
Stanislav Kinsbursky


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2012-01-25 11:12                 ` Stanislav Kinsbursky
@ 2012-01-25 14:41                   ` bfields
  2012-01-25 16:02                     ` Stanislav Kinsbursky
  0 siblings, 1 reply; 15+ messages in thread
From: bfields @ 2012-01-25 14:41 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Chuck Lever, Trond.Myklebust, linux-nfs

On Wed, Jan 25, 2012 at 03:12:40PM +0400, Stanislav Kinsbursky wrote:
> Chuck, Bruce, have you seen the patch I'm replying to?
> I would appreciate any comments on this problem.
> Also, perhaps we could instead just perform the local transport's socket
> connection in sync mode? That should also solve our problem.
> What do you think?

I'm hoping Chuck will answer....

(I'd recommend just resending the patch, though; it's more likely to get
a response than if readers need to go dig the original out of mail
archives.)

--b.


* Re: [RFC] RPCBIND: add anonymous listening socket in addition to named one
  2012-01-25 14:41                   ` bfields
@ 2012-01-25 16:02                     ` Stanislav Kinsbursky
  0 siblings, 0 replies; 15+ messages in thread
From: Stanislav Kinsbursky @ 2012-01-25 16:02 UTC (permalink / raw)
  To: bfields; +Cc: Chuck Lever, Trond.Myklebust, linux-nfs

25.01.2012 18:41, bfields@fieldses.org wrote:
> On Wed, Jan 25, 2012 at 03:12:40PM +0400, Stanislav Kinsbursky wrote:
>> Chuck, Bruce, have you seen the patch I'm replying to?
>> I would appreciate any comments on this problem.
>> Also, perhaps we could instead just perform the local transport's socket
>> connection in sync mode? That should also solve our problem.
>> What do you think?
>
> I'm hoping Chuck will answer....
>
> (I'd recommend just resending the patch, though; it's more likely to get
> a response than if readers need to go dig the original out of mail
> archives.)
>

Thanks, but I really don't like the idea behind that patch. I'd prefer to
connect in sync mode for unix sockets instead; at least that is a local (NFS)
change.
But this code isn't familiar to me, and I'm afraid of possible pitfalls.

-- 
Best regards,
Stanislav Kinsbursky


end of thread, other threads:[~2012-01-25 16:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-28 15:17 [RFC] RPCBIND: add anonymous listening socket in addition to named one Stanislav Kinsbursky
2011-12-28 17:03 ` Chuck Lever
2011-12-28 17:30   ` Stanislav Kinsbursky
2011-12-28 17:59     ` Chuck Lever
2011-12-29 11:48       ` Stanislav Kinsbursky
2011-12-29 16:03         ` Chuck Lever
2011-12-29 16:12           ` Stanislav Kinsbursky
2011-12-29 16:23             ` Chuck Lever
2011-12-29 17:04               ` Stanislav Kinsbursky
2011-12-29 17:42               ` Stanislav Kinsbursky
2012-01-25 11:12                 ` Stanislav Kinsbursky
2012-01-25 14:41                   ` bfields
2012-01-25 16:02                     ` Stanislav Kinsbursky
2011-12-28 18:22 ` bfields
2011-12-29 11:48   ` Stanislav Kinsbursky
