netdev.vger.kernel.org archive mirror
* Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
@ 2015-10-19 16:59 Stephen Hemminger
  2015-10-19 23:33 ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Stephen Hemminger @ 2015-10-19 16:59 UTC (permalink / raw)
  To: netdev

This looks like a corner case, but worth forwarding.

Begin forwarded message:

Date: Mon, 19 Oct 2015 13:21:33 +0000
From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
To: "shemminger@linux-foundation.org" <shemminger@linux-foundation.org>
Subject: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)


https://bugzilla.kernel.org/show_bug.cgi?id=106241

            Bug ID: 106241
           Summary: shutdown(3)/close(3) behaviour is incorrect for
                    sockets in accept(3)
           Product: Networking
           Version: 2.5
    Kernel Version: 3.10.0-229.14.1.el7.x86_64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
          Assignee: shemminger@linux-foundation.org
          Reporter: Alan.Burlison@oracle.com
        Regression: No

Created attachment 190501
  --> https://bugzilla.kernel.org/attachment.cgi?id=190501&action=edit
Test program illustrating the problem

The Linux behaviour in the following scenario is incorrect:

1. ThreadA opens, binds, listens and accepts on a socket, waiting for
connections.

2. Some time later ThreadB calls shutdown on the socket ThreadA is waiting in
accept on.

Here is what happens:

On Linux, the shutdown call in ThreadB succeeds and the accept call in ThreadA
returns with EINVAL.

On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA
continues to wait in accept.

Relevant POSIX manpages:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html

The POSIX shutdown manpage says:

"The shutdown() function shall cause all or part of a full-duplex connection on
the socket associated with the file descriptor socket to be shut down."
...
"[ENOTCONN] The socket is not connected."

Pages 229 & 303 of "UNIX System V Network Programming" say:

"shutdown can only be called on sockets that have been previously connected"

"The socket [passed to accept that] fd refers to does not participate in the
connection. It remains available to receive further connect indications"

That is pretty clear: sockets being waited on with accept are not connected by
definition. Nor is the accept socket connected when a client connects to it;
it is the socket returned by accept that is connected to the client. Therefore
the Solaris behaviour of failing the shutdown call is correct.

To get the required behaviour, namely ThreadB causing ThreadA to exit the
accept call with an error, the correct approach is for ThreadB to call close on
the socket that ThreadA is waiting on in accept.

On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA
fails and returns EBADF.

On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in
accept until there is an incoming connection. That accept returns successfully.
However subsequent accept calls on the same socket return EBADF.

The Linux behaviour is fundamentally broken in three places:

1. Allowing shutdown to succeed on an unconnected socket is incorrect.  

2. Returning a successful accept on a closed file descriptor is incorrect,
especially as future accept calls on the same socket fail.

3. Once shutdown has been called on the socket, calling close on the socket
fails with EBADF. That is incorrect: shutdown should just prevent further IO on
the socket, it should not close it.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-19 16:59 Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Stephen Hemminger
@ 2015-10-19 23:33 ` Eric Dumazet
  2015-10-20  1:12   ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-19 23:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Alan.Burlison

On Mon, 2015-10-19 at 09:59 -0700, Stephen Hemminger wrote:
> This looks like a corner case, but worth forwarding.
> 
> Begin forwarded message:
> 
> Date: Mon, 19 Oct 2015 13:21:33 +0000
> From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
> To: "shemminger@linux-foundation.org" <shemminger@linux-foundation.org>
> Subject: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=106241
> 
>             Bug ID: 106241
>            Summary: shutdown(3)/close(3) behaviour is incorrect for
>                     sockets in accept(3)
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 3.10.0-229.14.1.el7.x86_64
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>           Assignee: shemminger@linux-foundation.org
>           Reporter: Alan.Burlison@oracle.com
>         Regression: No
> 
> Created attachment 190501
>   --> https://bugzilla.kernel.org/attachment.cgi?id=190501&action=edit
> Test program illustrating the problem
> 
> The Linux behaviour in the following scenario is incorrect:
> 
> 1. ThreadA opens, binds, listens and accepts on a socket, waiting for
> connections.
> 
> 2. Some time later ThreadB calls shutdown on the socket ThreadA is waiting in
> accept on.
> 
> Here is what happens:
> 
> On Linux, the shutdown call in ThreadB succeeds and the accept call in ThreadA
> returns with EINVAL.
> 
> On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA
> continues to wait in accept.
> 
> Relevant POSIX manpages:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html
> 
> The POSIX shutdown manpage says:
> 
> "The shutdown() function shall cause all or part of a full-duplex connection on
> the socket associated with the file descriptor socket to be shut down."
> ...
> "[ENOTCONN] The socket is not connected."
> 
> Pages 229 & 303 of "UNIX System V Network Programming" say:
> 
> "shutdown can only be called on sockets that have been previously connected"
> 
> "The socket [passed to accept that] fd refers to does not participate in the
> connection. It remains available to receive further connect indications"
> 
> That is pretty clear: sockets being waited on with accept are not connected by
> definition. Nor is the accept socket connected when a client connects to it;
> it is the socket returned by accept that is connected to the client. Therefore
> the Solaris behaviour of failing the shutdown call is correct.
> 
> To get the required behaviour, namely ThreadB causing ThreadA to exit the
> accept call with an error, the correct approach is for ThreadB to call close on
> the socket that ThreadA is waiting on in accept.
> 
> On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA
> fails and returns EBADF.
> 
> On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in
> accept until there is an incoming connection. That accept returns successfully.
> However subsequent accept calls on the same socket return EBADF.
> 
> The Linux behaviour is fundamentally broken in three places:
> 
> 1. Allowing shutdown to succeed on an unconnected socket is incorrect.  
> 
> 2. Returning a successful accept on a closed file descriptor is incorrect,
> especially as future accept calls on the same socket fail.
> 
> 3. Once shutdown has been called on the socket, calling close on the socket
> fails with EBADF. That is incorrect: shutdown should just prevent further IO on
> the socket, it should not close it.
> 

It looks like it is a long-standing problem, right?

inet_shutdown() has had this very specific comment from the beginning of the
git tree:

        switch (sk->sk_state) {
...
        /* Remaining two branches are temporary solution for missing
         * close() in multithreaded environment. It is _not_ a good idea,
         * but we have no choice until close() is repaired at VFS level.
         */
        case TCP_LISTEN:
                if (!(how & RCV_SHUTDOWN))
                        break;
                /* Fall through */
        case TCP_SYN_SENT:
                err = sk->sk_prot->disconnect(sk, O_NONBLOCK);
                sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
                break;
        }

Claiming Solaris does it differently is kind of moot.
Linux is not Solaris.

Unless this is proven to be a real problem (and not only by trying to backport from Solaris to Linux),
we probably won't change this.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-19 23:33 ` Eric Dumazet
@ 2015-10-20  1:12   ` Alan Burlison
  2015-10-20  1:45     ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-20  1:12 UTC (permalink / raw)
  To: Eric Dumazet, Stephen Hemminger; +Cc: netdev

On 20/10/2015 00:33, Eric Dumazet wrote:

> It looks like it is a long-standing problem, right?

Yep, seems so.

> inet_shutdown() has had this very specific comment from the beginning of the
> git tree:
>
>          switch (sk->sk_state) {
> ...
>          /* Remaining two branches are temporary solution for missing
>           * close() in multithreaded environment. It is _not_ a good idea,
>           * but we have no choice until close() is repaired at VFS level.
>           */
>          case TCP_LISTEN:
>                  if (!(how & RCV_SHUTDOWN))
>                          break;
>                  /* Fall through */
>          case TCP_SYN_SENT:
>                  err = sk->sk_prot->disconnect(sk, O_NONBLOCK);
>                  sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
>                  break;
>          }

I think it's probably an intrinsic part of the way *NIX file descriptors 
and their reuse have worked since the dawn of *NIX time - at which time
threads didn't exist, so this problem didn't either. The advent of
threads made this hole possible, which I believe is what the
comment above is pointing out. The problem people are trying to solve by
calling shutdown() on a listen()ing socket is the race in MT programs
between a socket being closed and the same file descriptor being 
recycled by a subsequent open()/socket() etc.

> Claiming Solaris does it differently is kind of moot.
> Linux is not Solaris.

Agreed that Linux != Solaris, but the argument I'm being faced with is 
that anything that doesn't behave in the same way as Linux is wrong by 
definition. And the problem with that is that the Linux behaviour of
shutdown() on a listen()/accept() socket is, I believe, incorrect anyway:
my reading of POSIX says that shutdown() is only valid on connected
sockets, sockets in listen()/accept() aren't connected by
definition, and yet Linux allows shutdown() to succeed when it should
probably return ENOTCONN. Yes, there's a potential race with FDs being
recycled, but you can get that with vanilla file FDs as well, where 
shutdown() isn't an option.

Another problem is that if I call close() on a Linux socket that's in 
accept() the accept call just sits there until there's an incoming 
connection, which succeeds even though the socket is supposed to be 
closed, but then an immediately following accept() on the same socket 
fails. And yet another problem is that poll() on a socket that's had 
listen() called on it returns immediately even if there's no incoming 
connection on it, which I believe makes multiplexing a set of sockets 
which includes a socket you want to accept() on impossible. The test 
program I attached to the bug allows you to play around with the 
different combinations.

> Unless this is proven to be a real problem (and not only by trying to backport from Solaris to Linux),
> we probably won't change this.

It's a real problem (with Hadoop, which contains C/C++ to do low-level 
I/O) and it's the other way around: I am porting that code from Linux to
Solaris.

I accept that you probably can't change the behaviour of shutdown() in 
Linux without breaking existing code, for example it seems libmicrohttpd 
also assumes it's OK to call shutdown() on a listen() socket on Linux, 
see https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html

However, even if the shutdown() behaviour can't be changed, the Linux
close()/poll() behaviour on listen()/accept() sockets seems rather odd.

There *may* be a way around this that's race-free and cross-platform 
involving the use of /dev/null and dup2(), see Listing Five on 
http://www.drdobbs.com/parallel/file-descriptors-and-multithreaded-progr/212001285 
but I haven't confirmed it works yet.

Thanks,

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20  1:12   ` Alan Burlison
@ 2015-10-20  1:45     ` Eric Dumazet
  2015-10-20  9:59       ` Alan Burlison
  2015-10-21  3:49       ` Al Viro
  0 siblings, 2 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-20  1:45 UTC (permalink / raw)
  To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:

> Another problem is that if I call close() on a Linux socket that's in 
> accept() the accept call just sits there until there's an incoming 
> connection, which succeeds even though the socket is supposed to be 
> closed, but then an immediately following accept() on the same socket 
> fails. 

This is exactly what the comment I pasted documents.

On linux, doing close(listener) on one thread does _not_ wakeup other
threads doing accept(listener)

So I guess allowing shutdown(listener) was a way to somehow propagate
some info on the threads stuck in accept()

This is a VFS issue, and a long standing one.

Think of all cases like dup() and fd passing games, and the close(fd)
being able to signal out of band info is racy.

close() is literally removing one ref count on a file.
Expecting it to do some kind of magical cleanup of a socket is not
reasonable/practical.

In a multi-threaded program, each thread doing an accept() increases the
refcount on the file.

Really, I have no idea of how Solaris coped with this, and I do not want
to know.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20  1:45     ` Eric Dumazet
@ 2015-10-20  9:59       ` Alan Burlison
  2015-10-20 11:24         ` David Miller
  2015-10-20 13:19         ` Fw: " Eric Dumazet
  2015-10-21  3:49       ` Al Viro
  1 sibling, 2 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-20  9:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 02:45, Eric Dumazet wrote:

> On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:
>
>> Another problem is that if I call close() on a Linux socket that's in
>> accept() the accept call just sits there until there's an incoming
>> connection, which succeeds even though the socket is supposed to be
>> closed, but then an immediately following accept() on the same socket
>> fails.
>
> This is exactly what the comment I pasted documents.

Yes, and it's caveated with "temporary solution" and "_not_ a good idea" 
and says that the problem is that close() needs repairing. In other 
words, the change in shutdown() behaviour appears to be a workaround for
an acknowledged bug, and it's one with consequences.

There are two separate things here.

The first is the close/reopen race issue with filehandles. As I said, I 
believe that's an artefact of history, because it wasn't possible for it 
to happen before threads and was possible after them. There appears to 
be no good way to avoid this with the current *NIX filehandle semantics 
when threads are being used. History sucks.

The second is how/if you might work around that. As far as I can tell,
Linux does so by allowing shutdown() to be called on unconnected sockets,
using that as a signal that threads waiting in accept() should return
from the accept with a failure but without the filehandle actually being
closed, and therefore not being available for reuse, and therefore not
subject to potential races.  However, by doing so I believe the behaviour
of shutdown is then not POSIX-conforming. The Linux manpage for
shutdown(2) says "CONFORMING TO POSIX.1-2001", but as far as I can tell it
isn't. At the very least I believe the manpage needs changing.

> On linux, doing close(listener) on one thread does _not_ wakeup other
> threads doing accept(listener)

Allowing an in-progress accept() to continue and to succeed at some 
point in the distant future on a filehandle that's closed seems incorrect.

Also, the behaviour of poll() that I mentioned - that it returns 
immediately for a socket in the listen() state - does seem like an 
out-and-out bug to me; I haven't seen any explanation of why that might
be correct behaviour.

> So I guess allowing shutdown(listener) was a way to somehow propagate
> some info on the threads stuck in accept()

Yes, I think you are right.

> This is a VFS issue, and a long standing one.

That seems to be what the comment you quoted is saying, yes.

> Think of all cases like dup() and fd passing games, and the close(fd)
> being able to signal out of band info is racy.

Yes, I agree there are potential race conditions, intrinsic to the way 
*NIX FDs work. As I said earlier, there *may* be a portable way around 
this using /dev/null & dup2() but I haven't had a chance to investigate
that yet.

> close() is literally removing one ref count on a file.
> Expecting it doing some kind of magical cleanup of a socket is not
> reasonable/practical.

I'm not sure what that means at an application level - no matter how 
many copies I make of a FD in a process it's still just an integer and 
calling close on it closes it and causes any future IOs to fail - except 
for the case of sockets in accept() it seems, which continue and may 
even eventually succeed. Leaving aside the behaviour of shutdown() on 
listening sockets, the current behaviour of close() on a socket in 
accept() seems incorrect. And then of course there's also the poll() issue.

> On a multi threaded program, each thread doing an accept() increased the
> refcount on the file.

That may be how Linux implements accept(), but I don't see anything 
about refcounting in the POSIX spec for accept().

> Really, I have no idea of how Solaris coped with this, and I do not want
> to know.

The bug goes into quite some detail about how Solaris behaves. The issue 
here is that we have two implementations, Linux and Solaris, both 
claiming to be POSIX-conformant but both showing different behaviour. 
There's a discussion to be had about the whys and wherefores of that 
difference, but saying that you don't want to know how Solaris behaves 
isn't really going to help move the conversation along.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20  9:59       ` Alan Burlison
@ 2015-10-20 11:24         ` David Miller
  2015-10-20 11:39           ` Alan Burlison
  2015-10-20 13:19         ` Fw: " Eric Dumazet
  1 sibling, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-20 11:24 UTC (permalink / raw)
  To: Alan.Burlison; +Cc: eric.dumazet, stephen, netdev

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 20 Oct 2015 10:59:46 +0100

> The bug goes into quite some detail about how Solaris behaves. The
> issue here is that we have two implementations, Linux and Solaris,
> both claiming to be POSIX-conformant but both showing different
> behaviour. There's a discussion to be had about the whys and
> wherefores of that difference, but saying that you don't want to know
> how Solaris behaves isn't really going to help move the conversation
> along.

With two decades of precedent, applications will need to find a way
to cope with the behavior on every existing Linux kernel out there.

Even if we were to propose something here and change things, it won't
be available on real sites for 6 months at a minimum, and only on an
extremely small fraction of actual machines.

It's more practical for userspace to cope with the behavior.  This is
simply because coping with current behavior will work on every Linux
kernel on the planet, and also it won't require us to potentially
break any existing setups.

This to me matters more than any squabbling over semantics or who
really implements POSIX correctly or not.  That simply does not matter
at all.

It especially does not matter as far as I and millions of existing
socket apps out there are concerned.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 11:24         ` David Miller
@ 2015-10-20 11:39           ` Alan Burlison
  0 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-20 11:39 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, stephen, netdev

On 20/10/2015 12:24, David Miller wrote:

> With two decades of precedent, applications will need to find a way
> to cope with the behavior on every existing Linux kernel out there.
>
> Even if we were to propose something here and change things, it won't
> be available on real sites for 6 months at a minimum, and only on an
> extremely small fraction of actual machines.
>
> It's more practical for userspace to cope with the behavior.  This is
> simply because coping with current behavior will work on every Linux
> kernel on the planet, and also it won't require us to potentially
> break any existing setups.

Yes, as I said I think in practice about the best that can be done is to 
document the behaviour, although I think the assertion that there are
millions of apps that would be affected is an over-estimate. Only MT
apps that use accept() in one thread and shutdown() in another should be
impacted, i.e. mainly threaded apps that act as network service providers
of one form or another.

But having said that, the behaviour of close() and poll() on sockets 
being used in accept() still looks incorrect.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20  9:59       ` Alan Burlison
  2015-10-20 11:24         ` David Miller
@ 2015-10-20 13:19         ` Eric Dumazet
  2015-10-20 13:45           ` Alan Burlison
  1 sibling, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-20 13:19 UTC (permalink / raw)
  To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 10:59 +0100, Alan Burlison wrote:

> That may be how Linux implements accept(), but I don't see anything 
> about refcounting in the POSIX spec for accept().

That's an internal implementation detail. POSIX does not document the Linux
kernel's overall design and specific tricks.

Linux is GPL, while Solaris is proprietary code. There is quite a
difference, and we do not want to copy Solaris behavior. We want our own
way, practical, and good enough.

If POSIX makes sense we try to be compliant. If not, we do not.

If you are interested, take a look at fs/* code, and try to implement
your proposal and keep good performance.

You might find a clever way, without infringing prior art. We did not
yet.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 13:19         ` Fw: " Eric Dumazet
@ 2015-10-20 13:45           ` Alan Burlison
  2015-10-20 15:30             ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-20 13:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 14:19, Eric Dumazet wrote:

>> That may be how Linux implements accept(), but I don't see anything
>> about refcounting in the POSIX spec for accept().
>
> That's an internal implementation detail. POSIX does not document the Linux
> kernel's overall design and specific tricks.

No, neither does it document Solaris kernel design, or *BSD design, or 
Windows design. It does however specify what the externally observable 
behaviour has to be in order to be POSIX compliant.

> Linux is GPL, while Solaris is proprietary code. There is quite a
> difference, and we do not want to copy Solaris behavior. We want our own
> way, practical, and good enough.

I don't see what the licensing terms of particular implementations have 
to do with POSIX; indeed, as far as I can tell POSIX says nothing at all
about the subject. It's therefore not pertinent to this discussion.

I'm not expecting Linux to copy every Solaris behaviour, if I was I 
might for example be suggesting that Linux dropped support for the 
SO_RCVTIMEO and SO_SNDTIMEO setsockopt() options on AF_UNIX sockets, 
because Solaris doesn't currently implement those options. That would 
clearly be a ridiculous stance for me to take, what I've actually done 
is logged a bug against Solaris because it's clearly something we need 
to fix.

What I do think is reasonable is that if Linux claims POSIX conformance 
then it either conforms or documents the variant behaviour, and as I've
said I don't believe it does conform in the case of shutdown().

> If POSIX makes sense we try to be compliant. If not, we do not.

That's one possible design option. However, in this case the Linux
manpage claims the Linux behaviour is POSIX-compliant and, as far as I
can tell, it isn't. As I've already said several times, I agree there's
probably not much that can be done about it without causing breakage, 
which is why I suggested simply documenting the behaviour may be the 
best option.

And I still haven't seen any reasoning behind the Linux close() and 
poll() behaviour on sockets that are in the listen state.

> If you are interested, take a look at fs/* code, and try to implement
> your proposal and keep good performance.
>
> You might find a clever way, without infringing prior art. We did not
> yet.

I might, but contractually I can't, unfortunately.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 13:45           ` Alan Burlison
@ 2015-10-20 15:30             ` Eric Dumazet
  2015-10-20 18:31               ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-20 15:30 UTC (permalink / raw)
  To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 14:45 +0100, Alan Burlison wrote:

> And I still haven't seen any reasoning behind the Linux close() and 
> poll() behaviour on sockets that are in the listen state.

Same answer.

A close() does _not_ wakeup an accept() or a poll() (this is exactly the
same problem), or any blocking system call using the same 'file'

This is the choice Linus Torvalds and Al Viro did years ago.

It is set in stone, unless someone comes up with a very nice patch set
that does not break existing applications.

close() man page states :

NOTES
      It  is  probably unwise to close file descriptors while they may be in
      use by system calls in other threads in the  same  process.   Since  a
      file  descriptor may be reused, there are some obscure race conditions
      that may cause unintended side effects.

You are in this grey zone.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 15:30             ` Eric Dumazet
@ 2015-10-20 18:31               ` Alan Burlison
  2015-10-20 18:42                 ` Eric Dumazet
  2015-10-21 10:25                 ` David Laight
  0 siblings, 2 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-20 18:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 16:30, Eric Dumazet wrote:

> A close() does _not_ wakeup an accept() or a poll() (this is exactly the
> same problem), or any blocking system call using the same 'file'

Not waking up the accept() is one thing; allowing the accept() to
successfully complete some indeterminate time after the socket has been
closed is something else entirely. You shouldn't have to call shutdown()
to terminate an accept() on a socket; close() should suffice. Yes, if
you want to use the shutdown() 'feature' to kick the accept() thread out
of the call to accept() without closing the FD and you don't care about
cross-platform compatibility then you can call shutdown() followed by
close(). However, that's only ever required on Linux as far as I can
tell, and even on Linux, applications that deal with the thread race by
other means shouldn't be forced to use shutdown() when just close()
would suffice.

The problem with poll() is that it returns immediately when passed a FD 
that is in the listening state, rather than waiting until there's an
incoming connection to handle. As I said, that means you can't use 
poll() to multiplex between read/write/accept sockets.

> close() man page states :
>
> NOTES
>        It  is  probably unwise to close file descriptors while they may be in
>        use by system calls in other threads in the  same  process.   Since  a
>        file  descriptor may be reused, there are some obscure race conditions
>        that may cause unintended side effects.
>
> You are in this grey zone.

No, the race issue with file descriptor reuse and the close() behaviour 
are not the same thing. The manpage comment is correct, but not relevant.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 18:31               ` Alan Burlison
@ 2015-10-20 18:42                 ` Eric Dumazet
  2015-10-21 10:25                 ` David Laight
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-20 18:42 UTC (permalink / raw)
  To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 19:31 +0100, Alan Burlison wrote:

> No, the race issue with file descriptor reuse and the close() behaviour 
> are not the same thing. The manpage comment is correct, but not relevant.

Ok, it seems you know better than me, so I will stop the discussion and
wait for your patches.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20  1:45     ` Eric Dumazet
  2015-10-20  9:59       ` Alan Burlison
@ 2015-10-21  3:49       ` Al Viro
  2015-10-21 14:38         ` Alan Burlison
  1 sibling, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-21  3:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alan Burlison, Stephen Hemminger, netdev, dholland-tech

On Mon, Oct 19, 2015 at 06:45:32PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:
> 
> > Another problem is that if I call close() on a Linux socket that's in 
> > accept() the accept call just sits there until there's an incoming 
> > connection, which succeeds even though the socket is supposed to be 
> > closed, but then an immediately following accept() on the same socket 
> > fails. 
> 
> This is exactly what the comment I pasted documents.
> 
> On linux, doing close(listener) on one thread does _not_ wakeup other
> threads doing accept(listener)
> 
> So I guess allowing shutdown(listener) was a way to somehow propagate
> some info on the threads stuck in accept()
> 
> This is a VFS issue, and a long standing one.
> 
> Think of all cases like dup() and fd passing games, and the close(fd)
> being able to signal out of band info is racy.
> 
> close() is literally removing one ref count on a file.

> Expecting it doing some kind of magical cleanup of a socket is not
> reasonable/practical.
> 
> On a multi threaded program, each thread doing an accept() increased the
> refcount on the file.

Refcount is an implementation detail, of course.  However, in any Unix I know
of, there are two separate notions - descriptor losing connection to opened
file (be it from close(), exit(), execve(), dup2(), etc.) and opened file
getting closed.

The latter cannot happen while there are descriptors connected to the
file in question, of course.  However, that is not the only thing
that might prevent an opened file from getting closed - e.g. sending an
SCM_RIGHTS datagram with attached descriptor connected to the opened file
in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened
file until it is successfully received or discarded (in the former case
the recipient will get a new descriptor referring to that opened file, of course).
Having the original descriptor closed right after sendmsg(2) does *not*
do anything to the opened file.  On any Unix that implements descriptor-passing.

There's going to be a notion of "last close"; that's what this refcount is
about and _that_ is more than implementation detail.

The real question is what kind of semantics one would want in the following
situations:

1)
// fd is a socket
fcntl(fd, F_SETFD, FD_CLOEXEC);
fork();
in parent: accept(fd);
in child: execve()

2)
// fd is a socket, 1 is /dev/null
fork();
in parent: accept(fd);
in child: dup2(1, fd);

3)
// fd is a socket
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: close(fd);

4)
// fd is a socket, 1 is /dev/null
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: dup2(1, fd);

5)
// fd is a socket, 1 is /dev/null
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: close(fd2);

6)
// fd is a socket
in thread A: accept(fd);
in thread B: close(fd);

In other words, is that destruction of
	* any descriptor referring to this socket [utterly insane for obvious
reasons]
	* the last descriptor referring to this socket (modulo descriptor
passing, etc.) [a bitch to implement, unless we treat a syscall in progress
as keeping the opened file open], or
	* _the_ descriptor used to issue accept(2) [a bitch to implement,
with a lot of fun races in an already race-prone area]?
Additional question is whether it's
	* just a magical behaviour of close(2) [ugly], or
	* something that happens when descriptor gets dissociated from
opened file [obviously more consistent]?

BTW, for real fun, consider this:
7)
// fd is a socket
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: accept(fd);
in thread C: accept(fd2);
in thread D: close(fd);

Which threads (if any), should get hit where it hurts?

I honestly don't know what Solaris does; AFAICS, FreeBSD behaves like Linux
these days.  NetBSD plays really weird games in their fd_close(); what
they are trying to achieve is at least sane - in (7) they'd hit A and B with
EBADF and C would restart and continue waiting, in (3,4,6) A gets EBADF,
in (1,2,5) accept() is unaffected.  The problem is that their solution is
racy - they have a separate refcount on _descriptor_, plus a file method
(->fo_restart) for triggering an equivalent of signal interrupting anything
that might be blocked on that sucker, with syscall restart (and subsequent
EBADF on attempt to refetch the sucker).  Racy if we reopen or are doing dup2()
in the first place - these restarts might get CPU just after we return from
dup2() and pick the *new* descriptor just fine.  It might be possible to fix
their approach (having
        if (__predict_false(ff->ff_file == NULL)) {
                /*
                 * Another user of the file is already closing, and is
                 * waiting for other users of the file to drain.  Release
                 * our reference, and wake up the closer.
                 */
                atomic_dec_uint(&ff->ff_refcnt);
                cv_broadcast(&ff->ff_closing);
path in fd_close() mark the thread as "don't bother restarting, just bugger
off" might be workable), but... it's still pretty costly.  They
pay with memory footprint (at least 32 bits per descriptor, and that's
leaving aside the fun issues with what to wait on) and the only thing that
might be saving them from cacheline ping-pong from hell is that their
struct fdfile is really fat - there's a lot more than just an extra
u32 in there.

I have no idea what semantics Solaris has in that area and how racy
their descriptor table handling is.  And no, I'm not going to RTFS their
kernel, CDDL being what it is.  I *do* know that Linux and all *BSD kernels
had pretty severe races in that area.  Quite a few of those, and a lot more
painful than the one RTFS(NetBSD) seems to have caught just now.  So I would
seriously recommend the folks who are free to RTFS(Solaris) to review that
area.  Carefully.  There tend to be dragons.

_IF_ somebody can come up with clean semantics and a tolerable approach to
implementing it, I'd be glad to see that.  What we do is "syscall
in progress keeps the file it operates upon open, no matter what happens to
descriptors".  AFAICS, what NetBSD tries to implement is also reasonably
clean wrt semantics ("detaching an opened file from a descriptor that
is being operated upon by some syscalls triggers restart or failure of all
syscalls operating on the opened file in question and waits for them
to bugger off", but their implementation appears to be both racy and far too
heavyweight, with no obvious solutions to the latter.

Come to think of it, restart-based solutions have an obvious problem -
if we were talking about restart due to signal, the userland code could
(and would have to) block those, just to avoid this kind of issues with
the wrong descriptor picked on restart.  But there's no way to block _that_,
so if you have two descriptors referring to the same socket and 4 threads doing
A: sleeps in accept(fd1)
B: sleeps in accept(fd2)
C: close(fd1)
D: (with all precautions re signals taken by the whole thing) dup2(fd3, fd2)
you can end up with C coming first, kicking A and B (as operating on that
socket) with A legitimately failing and B going into restart.  And losing
CPU to D, which does that dup2(), so when B regains CPU it's operating on
the socket it never intended to.  So this approach seems to be broken, no
matter what...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-20 18:31               ` Alan Burlison
  2015-10-20 18:42                 ` Eric Dumazet
@ 2015-10-21 10:25                 ` David Laight
  2015-10-21 10:49                   ` Alan Burlison
  1 sibling, 1 reply; 138+ messages in thread
From: David Laight @ 2015-10-21 10:25 UTC (permalink / raw)
  To: 'Alan Burlison', Eric Dumazet; +Cc: Stephen Hemminger, netdev

From: Alan Burlison
> Sent: 20 October 2015 19:31
...
> The problem with poll() is that it returns immediately when passed a FD
> that is in the listening state. rather than waiting until there's an
> incoming connection to handle. As I said, that means you can't use
> poll() to multiplex between read/write/accept sockets.

That seems to work for me...

	David

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 10:25                 ` David Laight
@ 2015-10-21 10:49                   ` Alan Burlison
  2015-10-21 11:28                     ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-21 10:49 UTC (permalink / raw)
  To: David Laight, Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 21/10/2015 11:25, David Laight wrote:

>> The problem with poll() is that it returns immediately when passed a FD
>> that is in the listening state. rather than waiting until there's an
>> incoming connection to handle. As I said, that means you can't use
>> poll() to multiplex between read/write/accept sockets.
>
> That seems to work for me...

In my test case I was setting all the available event bits in 
pollfd.events to see what came back. With poll() on a listen() socket 
you get an immediate return with bits set in revents indicating the 
socket is available for output, which of course it isn't. Indeed an 
attempt to write to it fails. If you remove the output event bits in 
pollfd.events then the poll() waits as expected until there's an 
incoming connection on the socket.

I suppose one answer is "Well, don't do that then" but returning an 
output indication on a socket that's in listen() seems rather odd.

With POLLOUT|POLLWRNORM|POLLWRBAND:

main: polling #1
[returns immediately]
main: poll #1: Success
poll fd: 0 revents: POLLOUT POLLWRBAND
main: write #1: Transport endpoint is not connected

Without POLLOUT|POLLWRNORM|POLLWRBAND:

main: polling #1
[waits for connection]
main: poll #1: Success
poll fd: 0 revents: POLLIN POLLRDNORM

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 10:49                   ` Alan Burlison
@ 2015-10-21 11:28                     ` Eric Dumazet
  2015-10-21 13:03                       ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-21 11:28 UTC (permalink / raw)
  To: Alan Burlison; +Cc: David Laight, Stephen Hemminger, netdev

On Wed, 2015-10-21 at 11:49 +0100, Alan Burlison wrote:
> On 21/10/2015 11:25, David Laight wrote:
> 
> >> The problem with poll() is that it returns immediately when passed a FD
> >> that is in the listening state. rather than waiting until there's an
> >> incoming connection to handle. As I said, that means you can't use
> >> poll() to multiplex between read/write/accept sockets.
> >
> > That seems to work for me...
> 
> In my test case I was setting all the available event bits in 
> pollfd.events to see what came back. With poll() on a listen() socket 
> you get an immediate return with bits set in revents indicating the 
> socket is available for output, which of course it isn't. Indeed an 
> attempt to write to it fails. If you remove the output event bits in 
> pollfd.events then the poll() waits as expected until there's an 
> incoming connection on the socket.
> 
> I suppose one answer is "Well, don't do that then" but returning an 
> output indication on a socket that's in listen() seems rather odd.
> 
> With POLLOUT|POLLWRNORM|POLLWRBAND:
> 
> main: polling #1
> [returns immediately]
> main: poll #1: Success
> poll fd: 0 revents: POLLOUT POLLWRBAND
> main: write #1: Transport endpoint is not connected
> 
> Without POLLOUT|POLLWRNORM|POLLWRBAND:
> 
> main: polling #1
> [waits for connection]
> main: poll #1: Success
> poll fd: 0 revents: POLLIN POLLRDNORM
> 

This works for me. Please double check your programs

242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295 <unfinished ...>
242046 <... poll resumed> )             = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35888), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4)                         = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35890), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4)                         = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35892), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4)                         = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35894), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4)                         = 0

poll() only cares about POLLIN | POLLRDNORM for a listener.


static inline unsigned int inet_csk_listen_poll(const struct sock *sk)
{
        return !reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) ?
                        (POLLIN | POLLRDNORM) : 0;
}

if (sk->sk_state == TCP_LISTEN)
    return inet_csk_listen_poll(sk);

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 11:28                     ` Eric Dumazet
@ 2015-10-21 13:03                       ` Alan Burlison
  2015-10-21 13:29                         ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-21 13:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Laight, Stephen Hemminger, netdev

On 21/10/2015 12:28, Eric Dumazet wrote:

> This works for me. Please double check your programs

I have just done so; it works as you say for AF_INET sockets, but if you
switch to AF_UNIX sockets it does the wrong thing in the way I described.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 13:03                       ` Alan Burlison
@ 2015-10-21 13:29                         ` Eric Dumazet
  0 siblings, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-21 13:29 UTC (permalink / raw)
  To: Alan Burlison; +Cc: David Laight, Stephen Hemminger, netdev

On Wed, 2015-10-21 at 14:03 +0100, Alan Burlison wrote:
> On 21/10/2015 12:28, Eric Dumazet wrote:
> 
> > This works for me. Please double check your programs
> 
> I have just done so, it works as you say for AF_INET sockets but if you 
> switch to AF_UNIX sockets it does the wrong thing in the way I described.
> 

Oh well. Please try the following:


diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f658235fb4..24dec8bb571d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -328,7 +328,8 @@ found:
 
 static inline int unix_writable(struct sock *sk)
 {
-	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+	return sk->sk_state != TCP_LISTEN &&
+	       (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
 
 static void unix_write_space(struct sock *sk)

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21  3:49       ` Al Viro
@ 2015-10-21 14:38         ` Alan Burlison
  2015-10-21 15:30           ` David Miller
                             ` (3 more replies)
  0 siblings, 4 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-21 14:38 UTC (permalink / raw)
  To: Al Viro, Eric Dumazet
  Cc: Stephen Hemminger, netdev, dholland-tech, Casper Dik

On 21/10/2015 04:49, Al Viro wrote:

Firstly, thank you for the comprehensive and considered reply.

> Refcount is an implementation detail, of course.  However, in any Unix I know
> of, there are two separate notions - descriptor losing connection to opened
> file (be it from close(), exit(), execve(), dup2(), etc.) and opened file
> getting closed.

Yep, it's an implementation detail inside the kernel - Solaris also has 
a refcount inside its vnodes. However that's really only dimly visible 
at the process level, where all you have is an integer file ID.

> The latter cannot happen while there are descriptors connected to the
> file in question, of course.  However, that is not the only thing
> that might prevent an opened file from getting closed - e.g. sending an
> SCM_RIGHTS datagram with attached descriptor connected to the opened file
> in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened
> file until it is successfully received or discarded (in the former case
> recepient will get a new descriptor refering to that opened file, of course).
> Having the original descriptor closed right after sendmsg(2) does *not*
> do anything to opened file.  On any Unix that implements descriptor-passing.

I believe async IO data is another way that a file can remain live after 
a close(), from the close() section of IEEE Std 1003.1:

"An I/O operation that is not canceled completes as if the close() 
operation had not yet occurred"

> There's going to be a notion of "last close"; that's what this refcount is
> about and _that_ is more than implementation detail.

Yes, POSIX distinguishes between "file descriptor" and "file 
description" (ugh!) and the close() page says:

"When all file descriptors associated with an open file description have 
been closed, the open file description shall be freed."

In the context of this discussion I believe it's the behaviour of the 
integer file descriptor that's the issue. Once it's had close() called 
on it then it's invalid, and any IO on it should fail, even if the 
underlying file description is still 'live'.

> In other words, is that destruction of
> 	* any descriptor refering to this socket [utterly insane for obvious
> reasons]
> 	* the last descriptor refering to this socket (modulo descriptor
> passing, etc.) [a bitch to implement, unless we treat a syscall in progress
> as keeping the opened file open], or
> 	* _the_ descriptor used to issue accept(2) [a bitch to implement,
> with a lot of fun races in an already race-prone area]?

From reading the POSIX close() page I believe the second option is the
correct one.

> Additional question is whether it's
> 	* just a magical behaviour of close(2) [ugly], or
> 	* something that happens when descriptor gets dissociated from
> opened file [obviously more consistent]?

The second, I believe.

> BTW, for real fun, consider this:
> 7)
> // fd is a socket
> fd2 = dup(fd);
> in thread A: accept(fd);
> in thread B: accept(fd);
> in thread C: accept(fd2);
> in thread D: close(fd);
>
> Which threads (if any), should get hit where it hurts?

A & B should return from the accept with an error. C should continue. 
Which is what happens on Solaris.

> I have no idea what semantics does Solaris have in that area and how racy
> their descriptor table handling is.  And no, I'm not going to RTFS their
> kernel, CDDL being what it is.

I can answer that for you :-) I've looked through the appropriate bits 
of the Solaris kernel code and my colleague Casper has written an 
excellent summary of what happens, so with his permission I've just 
copied it verbatim below:

----------
Since at least Solaris 7 (1998), a thread which is sleeping
on a file descriptor which is being closed by another thread,
will be woken up.

To this end each thread keeps a list of file descriptors
in use by the current active system call.

When a file descriptor is closed and this file descriptor
is marked as being in use by other threads, the kernel
will search all threads to see which have this file descriptor
listed as in use. For each such thread, the kernel tells
the thread that its active fds list is now stale and, if
possible, makes the thread run.

While this algorithm is pretty expensive, it is not often invoked.

The thread running close() will NOT return until all other threads
using that filedescriptor have released it.

When run, the thread will return from its syscall and will in most cases
return EBADF. A second thread trying to close this same file descriptor
may return earlier with close() returning EBADF.
----------

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 14:38         ` Alan Burlison
@ 2015-10-21 15:30           ` David Miller
  2015-10-21 16:04             ` Casper.Dik
  2015-10-21 16:32           ` Fw: " Eric Dumazet
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-21 15:30 UTC (permalink / raw)
  To: Alan.Burlison
  Cc: viro, eric.dumazet, stephen, netdev, dholland-tech, casper.dik

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Wed, 21 Oct 2015 15:38:51 +0100

> While this algorithm is pretty expensive, it is not often invoked.

I bet it can be easily intentionally invoked, by a malicious entity no
less.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 15:30           ` David Miller
@ 2015-10-21 16:04             ` Casper.Dik
  2015-10-21 21:18               ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-21 16:04 UTC (permalink / raw)
  To: David Miller
  Cc: Alan.Burlison, viro, eric.dumazet, stephen, netdev, dholland-tech


From: David Miller <davem@davemloft.net>
Date: Wed, 21 Oct 2015 08:30:08 -0700 (PDT) (17:30 CEST)

>From: Alan Burlison <Alan.Burlison@oracle.com>
>Date: Wed, 21 Oct 2015 15:38:51 +0100
>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>I bet it can be easily intentionally invoked, by a malicious entity no
>less.

It is only expensive within the process itself.  Whether it is run inside 
the kernel isn't much different in the context of Solaris.  If you have an 
attacker who can run any code, it doesn't really matter what that code
is.  It is not really expensive (like grabbing expensive locks or holding
them for any length of time).  It's basically O(n) in the number of
threads in the process.

If you have an application which can be triggered into doing that, it is
still a bug in the application.  

Is such a socket still listed with netstat on Linux?  I believe it uses
/proc and it will not be able to find that socket through the list of
opened files.

If we look at our typical problem we have an accept loop:

		for (;;) {
			newfd = accept(fd, ...);  /* X */

			/* stuff */
		}

Meanwhile we have a second thread doing a "close(fd);" and possibly opening
another file which just happens to return this particular fd.

In Solaris one of the following things will happen,
whatever the first thread is doing once close() is called:

	- accept() dies with EBADF (close() before or during the call to
	  accept())
	- accept() returns some other error (new fd you can't accept on)
	- accept() returns a new fd (if it was closed and reopened and
	  the new fd allows accept())

On Linux exactly the same thing happens, *except* when we find ourselves in accept():
then we wait until a connection is made or "shutdown()" is called.

I don't think any of the outcomes in the first thread is acceptable; 
clearly there is insufficient synchronization between the threads.


At that point Linux cannot find out who owns the socket:

#  netstat -p -a | grep /tmp/unix
unix  2      [ ACC ]     STREAM     LISTENING     14743  -                   /tmp/unix_sock

In Solaris you'd get:

netstat -u -f unix| grep unix_
stream-ord casper     5334 shutdown       /tmp/unix_sock

Simple synchronization can be done.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 14:38         ` Alan Burlison
  2015-10-21 15:30           ` David Miller
@ 2015-10-21 16:32           ` Eric Dumazet
  2015-10-21 18:51           ` Al Viro
  2015-10-23 18:30           ` Fw: " David Holland
  3 siblings, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-21 16:32 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Al Viro, Stephen Hemminger, netdev, dholland-tech, Casper Dik

On Wed, 2015-10-21 at 15:38 +0100, Alan Burlison wrote:

> ----------
> Since at least Solaris 7 (1998), a thread which is sleeping
> on a file descriptor which is being closed by another thread,
> will be woken up.
> 
> To this end each thread keeps a list of file descriptors
> in use by the current active system call.

Ouch.

> 
> When a file descriptor is closed and this file descriptor
> is marked as being in use by other threads, the kernel
> will search all threads to see which have this file descriptor
> listed as in use. For each such thread, the kernel tells
> the thread that its active fds list is now stale and, if
> possible, makes the thread run.
> 

This is what I feared.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 14:38         ` Alan Burlison
  2015-10-21 15:30           ` David Miller
  2015-10-21 16:32           ` Fw: " Eric Dumazet
@ 2015-10-21 18:51           ` Al Viro
  2015-10-21 20:33             ` Casper.Dik
                               ` (2 more replies)
  2015-10-23 18:30           ` Fw: " David Holland
  3 siblings, 3 replies; 138+ messages in thread
From: Al Viro @ 2015-10-21 18:51 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech, Casper Dik

On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:

> >There's going to be a notion of "last close"; that's what this refcount is
> >about and _that_ is more than implementation detail.
> 
> Yes, POSIX distinguishes between "file descriptor" and "file
> description" (ugh!) and the close() page says:

Would've been better if they went for something like "IO channel" for
the latter ;-/

> "When all file descriptors associated with an open file description
> have been closed, the open file description shall be freed."

BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
only:
| Ancillary data is also possible at the socket level. The <sys/socket.h>
| header shall define the following symbolic constant for use as the cmsg_type
| value when cmsg_level is SOL_SOCKET:
|
| SCM_RIGHTS
|     Indicates that the data array contains the access rights to be sent or
| received.

with no further details whatsoever.  It's been there since at least 4.3-Reno;
does anybody still use the older variant (->msg_accrights, that is)?  IIRC,
there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
descriptor-passing, but more or less current versions appear to support
SCM_RIGHTS...  In any case, descriptor-passing had been there in some form
since at least '83 (the old variant is already present in 4.2) and considering
it out-of-scope for POSIX is bloody ridiculous, IMO.

Unless they want to consider in-flight descriptor-passing datagrams as
collections of file descriptors, the quoted sentence is seriously misleading.
And then there's mmap(), which they do kinda-sorta mention...

> >In other words, is that destruction of
> >	* any descriptor refering to this socket [utterly insane for obvious
> >reasons]
> >	* the last descriptor refering to this socket (modulo descriptor
> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
> >as keeping the opened file open], or
> >	* _the_ descriptor used to issue accept(2) [a bitch to implement,
> >with a lot of fun races in an already race-prone area]?
> 
> From reading the POSIX close() page I believe the second option is
> the correct one.

Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
behaviour, in your opinion?  Because fd is sure as hell not the last
descriptor refering to that socket - fd2 remains alive and well.

Behaviour you describe below matches the _third_ option.

> >BTW, for real fun, consider this:
> >7)
> >// fd is a socket
> >fd2 = dup(fd);
> >in thread A: accept(fd);
> >in thread B: accept(fd);
> >in thread C: accept(fd2);
> >in thread D: close(fd);
> >
> >Which threads (if any), should get hit where it hurts?
> 
> A & B should return from the accept with an error. C should
> continue. Which is what happens on Solaris.

> To this end each thread keeps a list of file descriptors
> in use by the current active system call.

Yecchhhh...  How much cross-CPU traffic does that cause on
multithread processes?  Not on close(2), on maintaining the
descriptor use counts through the normal syscalls.

> When a file descriptor is closed and this file descriptor
> is marked as being in use by other threads, the kernel
> will search all threads to see which have this file descriptor
> listed as in use. For each such thread, the kernel tells
> the thread that its active fds list is now stale and, if
> possible, makes the thread run.
>
> While this algorithm is pretty expensive, it is not often invoked.

Sure, but the upkeep of data structures it would need is there
whether you actually end up triggering it or not.  Both in
memory footprint and in cacheline pingpong...

Besides, the list of threads using given descriptor table also needs
to be maintained, unless you scan *all* threads in the system (which
would be quite a fun wrt latency and affect a lot more than just the
process doing something dumb and rare).

BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic.  It
calls fd_close() outside of ->fd_lock (has to, since fd_close() is
grabbing that itself), so another thread doing e.g.  fcntl(newfd, F_GETFD)
in the middle of dup2(oldfd, newfd) might end up with EBADF, even though
both before and after dup2() newfd had been open.  What's worse,
thread A:
	fd1 = open("foo", ...);
	fd2 = open("bar", ...);
	...
	dup2(fd1, fd2);
thread B:
	fd = open("baz", ...);
might, AFAICS, end up with fd == fd2 and refering to foo instead of baz.
All it takes is the last open() managing to grab ->fd_lock just as fd_close()
from dup2() has dropped it.  Which is an unexpected behaviour, to put it
mildly, no matter how much standard lawyering one applies...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 18:51           ` Al Viro
@ 2015-10-21 20:33             ` Casper.Dik
  2015-10-22  4:21               ` Al Viro
  2015-11-02 10:03               ` David Laight
  2015-10-21 22:28             ` Alan Burlison
  2015-10-22  1:29             ` David Miller
  2 siblings, 2 replies; 138+ messages in thread
From: Casper.Dik @ 2015-10-21 20:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech


>On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
>
>> >There's going to be a notion of "last close"; that's what this refcount is
>> >about and _that_ is more than implementation detail.
>> 
>> Yes, POSIX distinguishes between "file descriptor" and "file
>> description" (ugh!) and the close() page says:
>
>Would've been better if they went for something like "IO channel" for
>the latter ;-/

Or at least some other word.  A file descriptor is just an index to
a list of file pointers (and wasn't it named so?)

>> "When all file descriptors associated with an open file description
>> have been closed, the open file description shall be freed."
>
>BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
>only:
>| Ancillary data is also possible at the socket level. The <sys/socket.h>
>| header shall define the following symbolic constant for use as the cmsg_type
>| value when cmsg_level is SOL_SOCKET:
>|
>| SCM_RIGHTS
>|     Indicates that the data array contains the access rights to be sent or
>| received.
>
>with no further details whatsoever.  It's been there since at least 4.3-Reno;
>does anybody still use the older variant (->msg_accrights, that is)?  IIRC,
>there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
>descriptor-passing, but more or less current versions appear to support
>SCM_RIGHTS...  In any case, descriptor-passing had been there in some form
>since at least '83 (the old variant is already present in 4.2) and considering
>it out-of-scope for POSIX is bloody ridiculous, IMO.

SCM_RIGHTS was introduced as part of the POSIX standardization of BSD 
sockets.  It looks like it became part of Solaris 2.6, but the default was 
the non-standard sockets implementation, so you may easily find 
msg->accrights but not SCM_RIGHTS.

msg_accrights is what was introduced in BSD, likely in the first 
implementation of socket-based file descriptor passing.

SysV has its own form of file descriptor passing.

But that interface is too ad hoc, so SCM_RIGHTS is a standardization 
allowing multiple types of messages to be sent around.
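
For reference, passing a descriptor with SCM_RIGHTS over an AF_UNIX socket
looks roughly like this (a minimal sketch, not production code; error
handling omitted):

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Sketch: send file descriptor fd_to_send over the AF_UNIX socket
	 * sock, together with one dummy payload byte. */
	ssize_t send_fd(int sock, int fd_to_send)
	{
		char data = 'x';
		struct iovec iov = { &data, 1 };
		char cbuf[CMSG_SPACE(sizeof(int))];
		struct msghdr msg;
		struct cmsghdr *cmsg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = cbuf;
		msg.msg_controllen = sizeof(cbuf);

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

		return sendmsg(sock, &msg, 0);
	}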

>Unless they want to consider in-flight descriptor-passing datagrams as
>collections of file descriptors, the quoted sentence is seriously misleading.
>And then there's mmap(), which they do kinda-sorta mention...

Well, a file descriptor really only exists in the context of a process; 
in-flight it is no longer a file descriptor, as there is no process context 
with a file descriptor table; so pointers to file descriptions are passed 
around.

>> >In other words, is that destruction of
>> >	* any descriptor refering to this socket [utterly insane for obvious
>> >reasons]
>> >	* the last descriptor refering to this socket (modulo descriptor
>> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
>> >as keeping the opened file open], or
>> >	* _the_ descriptor used to issue accept(2) [a bitch to implement,
>> >with a lot of fun races in an already race-prone area]?
>> 
>> From reading the POSIX close() page I believe the second option is
>> the correct one.
>
>Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
>behaviour, in your opinion?  Because fd is sure as hell not the last
>descriptor refering to that socket - fd2 remains alive and well.
>
>Behaviour you describe below matches the _third_ option.

>> >BTW, for real fun, consider this:
>> >7)
>> >// fd is a socket
>> >fd2 = dup(fd);
>> >in thread A: accept(fd);
>> >in thread B: accept(fd);
>> >in thread C: accept(fd2);
>> >in thread D: close(fd);
>> >
>> >Which threads (if any), should get hit where it hurts?
>> 
>> A & B should return from the accept with an error. C should
>> continue. Which is what happens on Solaris.
>
>> To this end each thread keeps a list of file descriptors
>> in use by the current active system call.
>
>Yecchhhh...  How much cross-CPU traffic does that cause on
>multithread processes?  Not on close(2), on maintaining the
>descriptor use counts through the normal syscalls.

That is pretty much what we do in the Solaris implementation, but there is 
not much cross-CPU traffic.  Of course, you will need to keep locks in the 
file descriptor table, if only to find the actual file pointer.

The work is done only in the case of a badly written application
where close is required to hunt down all threads currently using
the specific file descriptor.

>> When a file descriptor is closed and this file descriptor
>> is marked as being in use by other threads, the kernel
>> will search all threads to see which have this file descriptor
>> listed as in use. For each such thread, the kernel tells
>> the thread that its active fds list is now stale and, if
>> possible, makes the thread run.
>>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>Sure, but the upkeep of data structures it would need is there
>whether you actually end up triggering it or not.  Both in
>memory footprint and in cacheline pingpong...

Most of the work on using a file descriptor is local to the thread.

>Besides, the list of threads using given descriptor table also needs
>to be maintained, unless you scan *all* threads in the system (which
>would be quite a fun wrt latency and affect a lot more than just the
>process doing something dumb and rare).

In Solaris all threads using the same file descriptor table are all the 
threads in the same process.  This is typical for Unix but it is not what 
we have in Linux, or so I'm told.   So we already have that particular
list of threads.

>BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic.  It
>calls fd_close() outside of ->fd_lock (has to, since fd_close() is
>grabbing that itself), so another thread doing e.g.  fcntl(newfd, F_GETFD)
>in the middle of dup2(oldfd, newfd) might end up with EBADF, even though
>both before and after dup2() newfd had been open.  What's worse,
>thread A:
>	fd1 = open("foo", ...);
>	fd2 = open("bar", ...);
>	...
>	dup2(fd1, fd2);
>thread B:
>	fd = open("baz", ...);
>might, AFAICS, end up with fd == fd2 and refering to foo instead of baz.
>All it takes is the last open() managing to grab ->fd_lock just as fd_close()
>from dup2() has dropped it.  Which is an unexpected behaviour, to put it
>mildly, no matter how much standard lawyering one applies...

It could happen even when the implementation does not have any races; but 
I think you're saying that if you know that the open call in thread B 
comes after the open to fd2, then fd != fd2.

This looks indeed like a bug.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 16:04             ` Casper.Dik
@ 2015-10-21 21:18               ` Eric Dumazet
  2015-10-21 21:28                 ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-21 21:18 UTC (permalink / raw)
  To: Casper.Dik
  Cc: David Miller, Alan.Burlison, viro, stephen, netdev, dholland-tech

On Wed, 2015-10-21 at 18:04 +0200, Casper.Dik@oracle.com wrote:

> It is only expensive within the process itself. 

thread synchro would require additional barriers to maintain this state,
for a very unlikely case.

The atomic_{inc|dec}() on file refcount are already a problem.

We certainly do not use/take a lock to find a file pointer into file
descriptor table.

(This lock is only used at open() or close() time)

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 21:18               ` Eric Dumazet
@ 2015-10-21 21:28                 ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-21 21:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, David Miller, Alan.Burlison, stephen, netdev, dholland-tech

On Wed, Oct 21, 2015 at 02:18:30PM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-21 at 18:04 +0200, Casper.Dik@oracle.com wrote:
> 
> > It is only expensive within the process itself. 
> 
> thread synchro would require additional barriers to maintain this state,
> for a very unlikely case.
> 
> The atomic_{inc|dec}() on file refcount are already a problem.

To be fair, this would avoid at least _that_ - file refcount updates would
be less frequent.  However, unlike file->f_count, *descriptor* refcounts
would be much harder to keep in different cachelines.

> We certainly do not use/take a lock to find a file pointer into file
> descriptor table.
> 
> (This lock is only used at open() or close() time)

as well as dup2() and friends, of course.  But yes, we are pretty careful
about avoiding non-local stores on the normal "get file by descriptor" path -
in case of non-shared descriptor table we don't do them at all, and in case
of shared one we only modify refcount in struct file.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 18:51           ` Al Viro
  2015-10-21 20:33             ` Casper.Dik
@ 2015-10-21 22:28             ` Alan Burlison
  2015-10-22  1:29             ` David Miller
  2 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-21 22:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech, Casper Dik

On 21/10/2015 19:51, Al Viro wrote:

> BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
> only:
> | Ancillary data is also possible at the socket level. The <sys/socket.h>
> | header shall define the following symbolic constant for use as the cmsg_type
> | value when cmsg_level is SOL_SOCKET:
> |
> | SCM_RIGHTS
> |     Indicates that the data array contains the access rights to be sent or
> | received.

That's still exactly the same in Issue 7, Technical Corrigenda 2 which 
is in final review at the moment.

>>> In other words, is that destruction of
>>> 	* any descriptor refering to this socket [utterly insane for obvious
>>> reasons]
>>> 	* the last descriptor refering to this socket (modulo descriptor
>>> passing, etc.) [a bitch to implement, unless we treat a syscall in progress
>>> as keeping the opened file open], or
>>> 	* _the_ descriptor used to issue accept(2) [a bitch to implement,
>>> with a lot of fun races in an already race-prone area]?
>>
>>  From reading the POSIX close() page I believe the second option is
>> the correct one.
>
> Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> behaviour, in your opinion?  Because fd is sure as hell not the last
> descriptor refering to that socket - fd2 remains alive and well.
>
> Behaviour you describe below matches the _third_ option.

Ah, I wasn't 100% sure what the intended difference between #2 and #3 
was TBH, it does sound like I meant #3, yes :-)

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 18:51           ` Al Viro
  2015-10-21 20:33             ` Casper.Dik
  2015-10-21 22:28             ` Alan Burlison
@ 2015-10-22  1:29             ` David Miller
  2015-10-22  4:17               ` Alan Burlison
  2 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-22  1:29 UTC (permalink / raw)
  To: viro
  Cc: Alan.Burlison, eric.dumazet, stephen, netdev, dholland-tech, casper.dik

From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Wed, 21 Oct 2015 19:51:04 +0100

> Sure, but the upkeep of data structures it would need is there
> whether you actually end up triggering it or not.  Both in
> memory footprint and in cacheline pingpong...

+1

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  1:29             ` David Miller
@ 2015-10-22  4:17               ` Alan Burlison
  2015-10-22  4:44                 ` Al Viro
  2015-10-22  6:15                 ` Casper.Dik
  0 siblings, 2 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-22  4:17 UTC (permalink / raw)
  To: David Miller, viro
  Cc: eric.dumazet, stephen, netdev, dholland-tech, casper.dik

On 22/10/2015 02:29, David Miller wrote:

> From: Al Viro <viro@ZenIV.linux.org.uk>
> Date: Wed, 21 Oct 2015 19:51:04 +0100
>
>> Sure, but the upkeep of data structures it would need is there
>> whether you actually end up triggering it or not.  Both in
>> memory footprint and in cacheline pingpong...
>
> +1

It's been said that the current mechanisms in Linux & some BSD variants 
can be subject to races, and the behaviour exhibited doesn't conform to 
POSIX, for example requiring the use of shutdown() on unconnected 
sockets because close() doesn't kick off other threads accept()ing on 
the same fd. I'd be interested to hear if there's a better and more 
performant way of handling the situation that doesn't involve doing the 
sort of bookkeeping Casper described.

To quote one of my colleague's favourite sayings: Performance is a goal, 
correctness is a constraint.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 20:33             ` Casper.Dik
@ 2015-10-22  4:21               ` Al Viro
  2015-10-22 10:55                 ` Alan Burlison
  2015-11-02 10:03               ` David Laight
  1 sibling, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22  4:21 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech

On Wed, Oct 21, 2015 at 10:33:04PM +0200, Casper.Dik@oracle.com wrote:
> 
> >On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
> >
> >> >There's going to be a notion of "last close"; that's what this refcount is
> >> >about and _that_ is more than implementation detail.
> >> 
> >> Yes, POSIX distinguishes between "file descriptor" and "file
> >> description" (ugh!) and the close() page says:
> >
> >Would've been better if they went for something like "IO channel" for
> >the latter ;-/
> 
> Or at least some other word.  A file descriptor is just an index to
> a list of file pointers (and wasn't named so?)

*nod*

There's no less than 3 distinct notions associated with the word "file" -
"file as collection of bytes on filesystem", "opened file as IO channel" and
"file descriptor", all related ;-/  "File description" vs. "file descriptor"
is atrociously bad terminology.

> >Unless they want to consider in-flight descriptor-passing datagrams as
> >collections of file descriptors, the quoted sentence is seriously misleading.
> >And then there's mmap(), which they do kinda-sorta mention...
> 
> Well, a file descriptor really only exists in the context of a process; 
> in-flight it is no longer a file descriptor as there process context with 
> a file descriptor table; so pointers to file descriptions are passed 
> around.

Yes.  Note, BTW, that descriptor contains a bit more than a pointer - there
are properties associated with it (close-on-exec and is-it-already-claimed),
which makes abusing it for describing SCM_RIGHTS payloads even more of a
stretch.  IOW, description of semantics for close() and friends needs fixing -
it simply does not match the situation on anything that would be anywhere
near POSIX compliance in other areas.
 
> >Sure, but the upkeep of data structures it would need is there
> >whether you actually end up triggering it or not.  Both in
> >memory footprint and in cacheline pingpong...
> 
> Most of the work on using a file descriptor is local to the thread.

Using - sure, but what of cacheline dirtied every time you resolve a
descriptor to file reference?  How much does it cover and to what
degree is that local to thread?  When it's a refcount inside struct file -
no big deal, we'll be reading the same cacheline anyway and unless several
threads are pounding on the same file with syscalls at the same time,
that won't be a big deal.  But when refcounts are associated with
descriptors...

In case of Linux we have two bitmaps and an array of pointers associated
with descriptor table.  They grow on demand (in parallel)
	* reserving a descriptor is done under ->file_lock (dropped/regained
around memory allocation if we end up expanding the sucker, actual reassignment
of pointers to array/bitmaps is under that spinlock)
	* installing a pointer is lockless (we wait for ongoing resize to
settle, RCU takes care of the rest)
	* grabbing a file by index is lockless as well
	* removing a pointer is under ->file_lock, so's replacing it by dup2().

Grabbing a file by descriptor follows pointer from task_struct to descriptor
table, from descriptor table to element of array of pointers (embedded when
we have few descriptors, but becomes separately allocated when more is
needed), and from array element to struct file.  In struct file we fetch
->f_mode and (if descriptor table is shared) atomically increment ->f_count.
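
Schematically (a paraphrase, not the actual kernel source; "table_is_shared"
stands in for the shared-table check) the lookup is:

	rcu_read_lock();
	fdt = rcu_dereference(files->fdt);    /* current descriptor table */
	file = fd < fdt->max_fds ? rcu_dereference(fdt->fd[fd]) : NULL;
	if (file && table_is_shared &&
	    !atomic_long_inc_not_zero(&file->f_count))
		file = NULL;                  /* lost the race with the last close */
	rcu_read_unlock();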

For comparison, NetBSD has an extra level of indirection (with similar
tricks for embedding them while there are few descriptors), with a lot
fatter structure around the pointer to file - they keep close-on-exec and
in-use in there, along with refcount and their equivalent of waitqueue.
These structures, once they grow past the embedded set, are allocated
one-by-one, so copying the table on fork() costs a _lot_ more.  Rather
than an array of pointers to files they have an array of pointers to
those guys.  Reserving a descriptor triggers allocation of new struct
fdfile and installing a pointer to it into the array.  Allows for slightly
simpler installing of pointer to file afterwards - unlike us, they don't
have to be careful about array resize happening in parallel.

Grabbing a file by index is lockless, so's installing a pointer to file.
Reserving a descriptor is under ->fd_lock (mutex rather than a spinlock).
Removing a pointer is under ->fd_lock, so's replacing it by
dup2(), but dup2() has an unpleasant race (see upthread).

They do the same amount of pointer-chasing on lookup proper, but only because
they do not look into struct file itself there.  Which happens immediately
afterwards, since callers *will* look into what they've got.  I didn't look
into the details of barrier use in there, but it looks generally sane.

Cacheline pingpong is probably not a big deal there, but only because these
structures are fat and scattered.  Another fun issue is that they have
in-use bits buried deep, which means that they need to mirror them in
a separate bitmap - would cost too much otherwise.  They actually went for
two-level bitmap - the first one with a bit per 32 descriptors, another -
with bit per descriptor.  Might or might not be worth nicking (and 1:32
ratio needs experimenting)...

The worst issues seem to be memory footprint and cost on fork().  Extra
level of indirection is unpleasant, but not terribly so.  Use of mutex
instead of spinlock is a secondary issue - it should be reasonably easy
to deal with.  And there are outright races, both in dup2() and in
restart-related logics...

Having a mirror of in-use bits is another potential source of races -
hadn't looked into that deeply enough.  That'll probably come into
play on any attempts to reduce fork() costs...

> >Besides, the list of threads using given descriptor table also needs
> >to be maintained, unless you scan *all* threads in the system (which
> >would be quite a fun wrt latency and affect a lot more than just the
> >process doing something dumb and rare).
> 
> In Solaris all threads using the same file descriptor table are all the 
> threads in the same process.  This is typical for Unix but it is not what 
> we have in Linux, or so I'm told.   So we already have that particular
> list of threads.

Linux (as well as *BSD and Plan 9) has descriptor table as first-class
sharable resource - any thread you spawn might share or copy it, and
you can unshare yours at any time.  Maintaining the list of threads
using the damn thing wouldn't be too much PITA, anyway - that's a secondary
issue.
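
(For illustration - on Linux the sharing is decided per clone(): pthreads
pass CLONE_FILES, and a thread can take a private copy at any time with

	unshare(CLONE_FILES);	/* calling thread gets its own copy of the
				   descriptor table; others keep the shared one */

declared in <sched.h> with _GNU_SOURCE.)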

> >BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic.  It
> >calls fd_close() outside of ->fd_lock (has to, since fd_close() is
> >grabbing that itself), so another thread doing e.g.  fcntl(newfd, F_GETFD)
> >in the middle of dup2(oldfd, newfd) might end up with EBADF, even though
> >both before and after dup2() newfd had been open.  What's worse,
> >thread A:
> >	fd1 = open("foo", ...);
> >	fd2 = open("bar", ...);
> >	...
> >	dup2(fd1, fd2);
> >thread B:
> >	fd = open("baz", ...);
> >might, AFAICS, end up with fd == fd2 and refering to foo instead of baz.
> >All it takes is the last open() managing to grab ->fd_lock just as fd_close()
> >from dup2() has dropped it.  Which is an unexpected behaviour, to put it
> >mildly, no matter how much standard lawyering one applies...
> 
> It could happen even when the implementation does not have any races; but 
> I think you're saying that you know that the open call in thread B 
> comes after the open to fd2, then fd != fd2.

No, what I'm saying is that dup2() of one descriptor thread A has opened to
another such descriptor, with no other threads ever doing anything to either,
should *not* affect anything done by other threads.  If open() in thread B
comes before fd2 = open("bar", ...), it still should (and will) get a different
descriptor.

The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
over a descriptor we'd opened and kept open should be safe.  As it is,
their implementation boils down to "userland must do process-wide exclusion
between *any* dup2() and all syscalls that might create a new descriptor -
open()/pipe()/socket()/accept()/recvmsg()/dup()/etc".  At the very least,
it's a big QoI problem, especially since such userland exclusion would have
to be taken around the operations that can block for a long time.  Sure,
POSIX wording regarding dup2 is so weak that this behaviour can be argued
to be compliant, but... replacement of the opened file associated with
newfd really ought to be atomic to be of any use for multithreaded processes.
IOW, if newfd is an opened descriptor prior to dup2() and no other thread
attempts to close it by any means, there should be no window during which
it would appear to be not opened.  Linux and FreeBSD satisfy that; OpenBSD
seems to do the same, from the quick look.  NetBSD doesn't, no idea about
Solaris.  FWIW, older NetBSD implementation (prior to "File descriptor changes,
discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
kept the old implementation, with tons of other races in that area, but this
dup2() race got introduced in that rewrite.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:17               ` Alan Burlison
@ 2015-10-22  4:44                 ` Al Viro
  2015-10-22  6:03                   ` Al Viro
                                     ` (2 more replies)
  2015-10-22  6:15                 ` Casper.Dik
  1 sibling, 3 replies; 138+ messages in thread
From: Al Viro @ 2015-10-22  4:44 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik

On Thu, Oct 22, 2015 at 05:17:50AM +0100, Alan Burlison wrote:

> It's been said that the current mechanisms in Linux & some BSD
> variants can be subject to races

You do realize that it goes for the entire area?  And the races found
in this thread are in the BSD variant that tries to do something similar
to what you guys say Solaris is doing, so I'm not sure which way does
that argument go.  A high-level description with the same level of details
will be race-free in _all_ of them.  The devil is in the details, of course,
and historically they had been very easy to get wrong.  And extra complexity
in that area doesn't make things better.

>, and the behaviour exhibited
> doesn't conform to POSIX, for example requiring the use of
> shutdown() on unconnected sockets because close() doesn't kick off
> other threads accept()ing on the same fd.

Umm...  The old kernels (and even more - old userland) are not going to
disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or
no POSIX.

> I'd be interested to hear
> if there's a better and more performant way of handling the
> situation that doesn't involve doing the sort of bookkeeping Casper
> described,.

So would a lot of other people.

> To quote one of my colleague's favourite sayings: Performance is a
> goal, correctness is a constraint.

Except that in this case "correctness" is the matter of rather obscure and
ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
inherently bad, but it's nowhere near being an absolute requirement.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:44                 ` Al Viro
@ 2015-10-22  6:03                   ` Al Viro
  2015-10-22  6:34                     ` Casper.Dik
  2015-10-22  6:51                   ` Casper.Dik
  2015-10-22 11:15                   ` Alan Burlison
  2 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22  6:03 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik

On Thu, Oct 22, 2015 at 05:44:58AM +0100, Al Viro wrote:

> Except that in this case "correctness" is the matter of rather obscure and
> ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
> inherently bad, but it's nowhere near being an absolute requirement.

PS: in principle, a fairly ugly trick might suffice for accept(2), but
I'm less than happy with going there.  Namely, we could
	* have ->accept() get descriptor number
	* have ->flush() get descriptor number in addition to current->files
and have it DTRT for sockets in the middle of accept(2).

However, in addition to being ugly as hell, it has the problem with the points
where we call ->flush(), specifically do_dup2() and __close_fd().
It's done *after* the replacement/removal from descriptor table, so another
socket might have already gotten the same descriptor and we'd get spurious
termination of accept(2).

And I'm really curious about the things Solaris would do with dup2() there.
Does it take into account the possibility of new accept() coming just as
dup2() is trying to terminate the ongoing ones?  Is there a window when
descriptor-to-file lookups would fail?  Looks like a race/deadlock country...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:17               ` Alan Burlison
  2015-10-22  4:44                 ` Al Viro
@ 2015-10-22  6:15                 ` Casper.Dik
  2015-10-22 11:30                   ` Eric Dumazet
  1 sibling, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22  6:15 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Miller, viro, eric.dumazet, stephen, netdev, dholland-tech


>It's been said that the current mechanisms in Linux & some BSD variants 
>can be subject to races, and the behaviour exhibited doesn't conform to 
>POSIX, for example requiring the use of shutdown() on unconnected 
>sockets because close() doesn't kick off other threads accept()ing on 
>the same fd. I'd be interested to hear if there's a better and more 
>performant way of handling the situation that doesn't involve doing the 
>sort of bookkeeping Casper described,.

Of course, the implementation is now around 18 years old; clearly a lot of 
things have changed since then.

In the particular case of Linux close() on a socket, surely it must be 
possible to detect at close that it is a listening socket and that you are 
about to close the last reference; the kernel could then do the shutdown() 
all by itself.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  6:03                   ` Al Viro
@ 2015-10-22  6:34                     ` Casper.Dik
  2015-10-22 17:21                       ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22  6:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech



>And I'm really curious about the things Solaris would do with dup2() there.
>Does it take into account the possibility of new accept() coming just as
>dup2() is trying to terminate the ongoing ones?  Is there a window when
>descriptor-to-file lookups would fail?  Looks like a race/deadlock country...

Solaris does not "terminate" threads; instead it tells them that the
file descriptor information used is stale and wakes up the thread.

The accept call gets woken up and it checks for incoming connections; it 
will then either find a new connection and return that particular 
connection, or it will find nothing and return EINTR; in the post-syscall 
glue this is checked (the kernel thread has been told to take the 
expensive post-syscall routine) and, if the system call was interrupted, 
EBADF is returned instead.

It is also possible for the connection to come in late; then the socket 
will be changed and the already accepted (in TCP terms, not in socket API 
terms) embryonic connection will be closed too, as is normal when a 
listening socket is closed with a list of not-yet-accept()ed connections.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:44                 ` Al Viro
  2015-10-22  6:03                   ` Al Viro
@ 2015-10-22  6:51                   ` Casper.Dik
  2015-10-22 11:18                     ` Alan Burlison
  2015-10-22 11:15                   ` Alan Burlison
  2 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22  6:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech


From: Al Viro <viro@ZenIV.linux.org.uk>

>Except that in this case "correctness" is the matter of rather obscure and
>ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
>inherently bad, but it's nowhere near being an absolute requirement.

It would be more fruitful to have such a discussion in one of the OpenGroup 
mailing lists; people gathered there have a lot of experience and it is 
also possible to fix the standard when it turns out that it is indeed as 
vague as you claim it is (I don't quite agree with that).

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:21               ` Al Viro
@ 2015-10-22 10:55                 ` Alan Burlison
  2015-10-22 18:16                   ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 10:55 UTC (permalink / raw)
  To: Al Viro, Casper.Dik
  Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech

On 22/10/2015 05:21, Al Viro wrote:

>> Most of the work on using a file descriptor is local to the thread.
>
> Using - sure, but what of cacheline dirtied every time you resolve a
> descriptor to file reference?

Don't you have to do that anyway, to do anything useful with the file?

> How much does it cover and to what
> degree is that local to thread?  When it's a refcount inside struct file -
> no big deal, we'll be reading the same cacheline anyway and unless several
> threads are pounding on the same file with syscalls at the same time,
> that won't be a big deal.  But when refcounts are associated with
> descriptors...

There is a refcount in the struct held in the per-process list of open 
files and the 'slow path' processing is only taken if there's more than 
one LWP in the process that's accessing the file.

> In case of Linux we have two bitmaps and an array of pointers associated
> with descriptor table.  They grow on demand (in parallel)
> 	* reserving a descriptor is done under ->file_lock (dropped/regained
> around memory allocation if we end up expanding the sucker, actual reassignment
> of pointers to array/bitmaps is under that spinlock)
> 	* installing a pointer is lockless (we wait for ongoing resize to
> settle, RCU takes care of the rest)
> 	* grabbing a file by index is lockless as well
> 	* removing a pointer is under ->file_lock, so's replacing it by dup2().

Is that table per-process or global?

> The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
> over a descriptor we'd opened and kept open should be safe.  As it is,
> their implementation boils down to "userland must do process-wide exclusion
> between *any* dup2() and all syscalls that might create a new descriptor -
> open()/pipe()/socket()/accept()/recvmsg()/dup()/etc".  At the very least,
> it's a big QoI problem, especially since such userland exclusion would have
> to be taken around the operations that can block for a long time.  Sure,
> POSIX wording regarding dup2 is so weak that this behaviour can be argued
> to be compliant, but... replacement of the opened file associated with
> newfd really ought to be atomic to be of any use for multithreaded processes.

There's existing language in the Issue 7 dup2() description that says 
dup2() has to be atomic:

"the dup2( ) function provides unique services, as no other
interface is able to atomically replace an existing file descriptor."

And there is some new language in Issue 7 Technical Corrigenda 2 that 
reinforces that, when it's talking about reassignment of 
stdin/stdout/stderr:

"Furthermore, a close() followed by a reopen operation (e.g. open(), 
dup() etc) is not atomic; dup2() should be used to change standard file 
descriptors."

I don't think that it's possible to claim that a non-atomic dup2() is 
POSIX-compliant.
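
For example (a sketch only, "app.log" being a stand-in name, error handling
and the usual <fcntl.h>/<unistd.h> includes omitted), replacing stderr
without any window where descriptor 2 is unassigned:

	int logfd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
	dup2(logfd, STDERR_FILENO);   /* atomic replacement of descriptor 2 */
	close(logfd);

A close(STDERR_FILENO) followed by open() would instead leave a window where
another thread could be handed descriptor 2.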

> IOW, if newfd is an opened descriptor prior to dup2() and no other thread
> attempts to close it by any means, there should be no window during which
> it would appear to be not opened.  Linux and FreeBSD satisfy that; OpenBSD
> seems to do the same, from the quick look.  NetBSD doesn't, no idea about
> Solaris.  FWIW, older NetBSD implementation (prior to "File descriptor changes,
> discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
> had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
> kept the old implementation, with tons of other races in that area, but this
> dup2() race got introduced in that rewrite.

Related to dup2(), there's some rather surprising behaviour on Linux. 
Here's the scenario:

----------
ThreadA opens, listens and accepts on socket fd1, waiting for incoming 
connections.

ThreadB waits for a while, then opens normal file fd2 for read/write.
ThreadB uses dup2 to make fd1 a clone of fd2.
ThreadB closes fd2.

ThreadA remains sat in accept on fd1 which is now a plain file, not a 
socket.

ThreadB writes to fd1, the result of which appears in the file, so fd1 
is indeed operating as a plain file.

ThreadB exits. ThreadA is still sat in accept on fd1.

A connection is made to fd1 by another process. The accept call succeeds 
and returns the incoming  connection. fd1 is still operating as a 
socket, even though it's now actually a plain file.
----------
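
In code the scenario boils down to roughly this (sketch only, error
handling and addresses omitted):

	/* Thread A */
	int fd1 = socket(AF_INET, SOCK_STREAM, 0);
	/* bind(), listen() ... */
	int newfd = accept(fd1, NULL, NULL);   /* blocks */

	/* Thread B, some time later */
	int fd2 = open("plain_file", O_RDWR);
	dup2(fd2, fd1);                        /* fd1 now refers to the plain file */
	close(fd2);
	write(fd1, "data", 4);                 /* lands in the file... */
	/* ...yet Thread A's accept() on fd1 still completes against the
	 * original socket. */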

I assume this is another consequence of the fact that threads waiting in 
accept don't get a notification if the fd they are using is closed, 
either directly by a call to close or by a syscall such as dup2. Not 
waking up other threads on a fd when it is closed seems like undesirable 
behaviour.

I can see the reasoning behind allowing shutdown to be used to do such a 
wakeup even if that's not POSIX-compliant - it may make it slightly 
easier for applications to avoid fd recycling races. However the current 
situation is that shutdown is the *only* way to perform such a wakeup - 
simply closing the fd has no effect on any other threads. That seems 
incorrect.
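
The workaround in question looks roughly like this (sketch only):

	/* Thread A: blocked in the accept loop */
	int connfd = accept(listen_fd, NULL, NULL);
	/* on Linux this returns -1/EINVAL once another thread has shut the
	 * listening socket down */

	/* Thread B: wake Thread A, then close the descriptor */
	shutdown(listen_fd, SHUT_RDWR);
	close(listen_fd);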

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  4:44                 ` Al Viro
  2015-10-22  6:03                   ` Al Viro
  2015-10-22  6:51                   ` Casper.Dik
@ 2015-10-22 11:15                   ` Alan Burlison
  2 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 11:15 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik

On 22/10/2015 05:44, Al Viro wrote:

>> It's been said that the current mechanisms in Linux & some BSD
>> variants can be subject to races
>
> You do realize that it goes for the entire area?  And the races found
> in this thread are in the BSD variant that tries to do something similar
> to what you guys say Solaris is doing, so I'm not sure which way does
> that argument go.  A high-level description with the same level of details
> will be race-free in _all_ of them.  The devil is in the details, of course,
> and historically they had been very easy to get wrong.  And extra complexity
> in that area doesn't make things better.

Yes, I absolutely agree it's difficult to get it right. Modulo 
undetected bugs I believe the Solaris implementation is race free, and 
if it isn't I think it's fair to say we'd consider that to be a bug.

>> , and the behaviour exhibited
>> doesn't conform to POSIX, for example requiring the use of
>> shutdown() on unconnected sockets because close() doesn't kick off
>> other threads accept()ing on the same fd.
>
> Umm...  The old kernels (and even more - old userland) are not going to
> disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or
> no POSIX.

Yes, I understand the problem and in an earlier part of the discussion I 
said that I suspected that all that could really be done was to document 
the behaviour as changing it would break existing code. I still think 
that's probably the only workable option.

>> I'd be interested to hear
>> if there's a better and more performant way of handling the
>> situation that doesn't involve doing the sort of bookkeeping Casper
>> described,.
>
> So would a lot of other people.

:-)

>> To quote one of my colleague's favourite sayings: Performance is a
>> goal, correctness is a constraint.
>
> Except that in this case "correctness" is the matter of rather obscure and
> ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
> inherently bad, but it's nowhere near being an absolute requirement.

I don't think it's *that* obscure; when I found the original shutdown() 
problem google showed up another occurrence pretty quickly - 
https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html. 
If a fd is closed then allowing other uses of it to continue in other 
threads seems incorrect to me, as in the dup2() scenario I posted.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  6:51                   ` Casper.Dik
@ 2015-10-22 11:18                     ` Alan Burlison
  0 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 11:18 UTC (permalink / raw)
  To: Casper.Dik, Al Viro
  Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech

On 22/10/2015 07:51, Casper.Dik@oracle.com wrote:

> It would more fruitful to have such a discussion in one of the OpenGroup
> mailing lists; people gathered there have a lot of experience and it is
> also possible to fix the standard when it turns out that it indeed as
> vague as you claim it is (I don't quite agree with that)

+1. If there's interest in doing that I'll ask our POSIX rep the best 
way of initiating such a conversation.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  6:15                 ` Casper.Dik
@ 2015-10-22 11:30                   ` Eric Dumazet
  2015-10-22 11:58                     ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-22 11:30 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, David Miller, viro, stephen, netdev, dholland-tech

On Thu, 2015-10-22 at 08:15 +0200, Casper.Dik@oracle.com wrote:
> >It's been said that the current mechanisms in Linux & some BSD variants 
> >can be subject to races, and the behaviour exhibited doesn't conform to 
> >POSIX, for example requiring the use of shutdown() on unconnected 
> >sockets because close() doesn't kick off other threads accept()ing on 
> >the same fd. I'd be interested to hear if there's a better and more 
> >performant way of handling the situation that doesn't involve doing the 
> >sort of bookkeeping Casper described,.
> 
> Of course, the implementation is now around 18 years old; clearly a lot of 
> things have changed since then.
> 
> In the particular case of Linux close() on a socket, surely it must be 
> possible to detect at close that it is a listening socket and that you are 
> about to close the last reference; the kernel could then do the shutdown() 
> all by itself.

We absolutely do not _want_ to do this just so that linux becomes slower
to the point Solaris can compete, or you guys can avoid some work.

close(fd) is very far from knowing a file is a 'listener' or even a
'socket' without extra cache line misses.

To force a close of an accept() or whatever blocking socket related
system call a shutdown() makes a lot of sense.

This would have zero additional overhead for the fast path.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 11:30                   ` Eric Dumazet
@ 2015-10-22 11:58                     ` Alan Burlison
  2015-10-22 12:10                       ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 11:58 UTC (permalink / raw)
  To: Eric Dumazet, Casper.Dik
  Cc: David Miller, viro, stephen, netdev, dholland-tech

On 22/10/2015 12:30, Eric Dumazet wrote:

> We absolutely do not _want_ to do this just so that linux becomes slower
> to the point Solaris can compete, or you guys can avoid some work.

Sentiments such as that really have no place in a discussion that's been 
focussed primarily on the behaviour of interfaces, albeit with 
digressions into the potential performance impacts. The discussion has 
been cordial and I for one appreciate Al Viro's posts on the subject, 
from which I've learned a lot. Can we please keep it that way? Thanks.

> close(fd) is very far from knowing a file is a 'listener' or even a
> 'socket' without extra cache line misses.
>
> To force a close of an accept() or whatever blocking socket related
> system call a shutdown() makes a lot of sense.
>
> This would have zero additional overhead for the fast path.

Yes, that would I believe be a significant improvement.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 11:58                     ` Alan Burlison
@ 2015-10-22 12:10                       ` Eric Dumazet
  2015-10-22 13:12                         ` David Miller
  2015-10-22 13:14                         ` Alan Burlison
  0 siblings, 2 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-22 12:10 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, David Miller, viro, stephen, netdev, dholland-tech

On Thu, 2015-10-22 at 12:58 +0100, Alan Burlison wrote:
> On 22/10/2015 12:30, Eric Dumazet wrote:
> 
> > We absolutely do not _want_ to do this just so that linux becomes slower
> > to the point Solaris can compete, or you guys can avoid some work.
> 
> Sentiments such as that really have no place in a discussion that's been 
> focussed primarily on the behaviour of interfaces, albeit with 
> digressions into the potential performance impacts. The discussion has 
> been cordial and I for one appreciate Al Viro's posts on the subject, 
> from which I've leaned a lot. Can we please keep it that way? Thanks.

Certainly not.

I am a major linux networking developer and won't accept linux being
hijacked by guys who never contributed to it, just so it meets their
unreasonable expectations.

We absolutely care about performance. And I do not care you focus on
POSIX crap.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 12:10                       ` Eric Dumazet
@ 2015-10-22 13:12                         ` David Miller
  2015-10-22 13:14                         ` Alan Burlison
  1 sibling, 0 replies; 138+ messages in thread
From: David Miller @ 2015-10-22 13:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: Alan.Burlison, Casper.Dik, viro, stephen, netdev, dholland-tech

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Oct 2015 05:10:58 -0700

> I am a major linux networking developper and wont accept linux is
> hijacked by guys who never contributed to it, just so it meets their
> unreasonable expectations.
> 
> We absolutely care about performance. And I do not care you focus on
> POSIX crap.

+1

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 12:10                       ` Eric Dumazet
  2015-10-22 13:12                         ` David Miller
@ 2015-10-22 13:14                         ` Alan Burlison
  2015-10-22 17:05                           ` Al Viro
  1 sibling, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 13:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, David Miller, viro, stephen, netdev, dholland-tech

On 22/10/2015 13:10, Eric Dumazet wrote:

>> Sentiments such as that really have no place in a discussion that's been
>> focussed primarily on the behaviour of interfaces, albeit with
>> digressions into the potential performance impacts. The discussion has
>> been cordial and I for one appreciate Al Viro's posts on the subject,
>> from which I've leaned a lot. Can we please keep it that way? Thanks.
>
> Certainly not.

OK, in  which case I'll make this my last on-list reply to this part of 
the thread as I think continuing it is unlikely to be productive. If you 
would like to further discuss your concerns about my motivations I'm 
happy to do so off list, along with anyone you want to cc in. Thanks.

> I am a major linux networking developper and wont accept linux is
> hijacked by guys who never contributed to it, just so it meets their
> unreasonable expectations.

Yes, I'm aware of who you are. And if my expectations were completely 
unreasonable then I'd have expected the conversation to have already 
drawn to a close by now.

> We absolutely care about performance. And I do not care you focus on
> POSIX crap.

Yes, I understand the concern about the potential performance impact and 
it's a valid concern. And I also understand that the current Linux 
behaviour of shutdown() on unconnected sockets probably can't be changed 
without causing breakage and is therefore unlikely to happen as well.

The issues I hit were in the context of application porting, where the 
APIs in question are covered by POSIX. The Linux manpages for open(), 
close(), socket(), dup2() and shutdown() all claim POSIX.1-2001 
conformance. If performance is the most important concern then it's a 
valid decision to prioritise that over POSIX conformance, but you simply 
can't continue to claim that the relevant Linux APIs are fully POSIX 
conformant, so I believe at the minimum the Linux manpages need modifying.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 13:14                         ` Alan Burlison
@ 2015-10-22 17:05                           ` Al Viro
  2015-10-22 17:39                             ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 17:05 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech

On Thu, Oct 22, 2015 at 02:14:42PM +0100, Alan Burlison wrote:

> The issues I hit were in the context of application porting, where
> the APIs in question are covered by POSIX. The Linux manpages for
> open(), close(), socket(), dup2() and shutdown() all claim
> POSIX.1-2001 conformance. If performance is the most important
> concern then it's a valid decision to prioritise that over POSIX
> conformance, you simply can't continue to claim that the relevant
> Linux APIs are fully POSIX conformant, so I believe at the minimum
> the Linux manpages need modifying.

Oh, for...  Right in this thread an example of complete BS has been quoted
from POSIX close(2).  The part about closing a file when the last descriptor
gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
it be).  Semantics around the distinction between file descriptors and
<barf> file descriptions is underspecified, not to mention being very poorly
written.

You want to add something along the lines of "if any action by another thread
changes the mapping from file descriptors to file descriptions for any
file descriptor passed to syscall, such and such things should happen" - go
ahead and specify what should happen.  As it is, I don't see anything of
that sort in e.g. accept(2).  And no,
	[EBADF]
	    The socket argument is not a valid file descriptor.
in there is nowhere near being unambiguous enough - everyone agrees that
argument should be a valid descriptor at the time of call, but I would be
very surprised to find _any_ implementation (including Solaris one)
recheck that upon exit to userland.

For more bullshit from the same source (issue 7, close(2)):
	If fildes refers to a socket, close() shall cause the socket to be
	destroyed. If the socket is in connection-mode, and the SO_LINGER
	option is set for the socket with non-zero linger time, and the socket
	has untransmitted data, then close() shall block for up to the current
	linger interval until all data is transmitted.
I challenge you to find *any* implementation that would have
	fd = socket(...);
	close(dup(fd));
do what this wonder of technical prose clearly requests.  In the same text we
also have
	When all file descriptors associated with a pipe or FIFO special file
	are closed, any data remaining in the pipe or FIFO shall be discarded.
as well as explicit (and underspecified, but perhaps they do it elsewhere)
"last close" in parts related to sockets and ptys.

And that is not to mention the dup2(2) wording in there:
	If fildes2 is already a valid open file descriptor, it shall be
	closed first
which is (a) inviting misinterpretation that would make the damn thing
non-atomic (the only mention of atomicity is in the non-normative sections)
and (b) says fsck-all about the effects of closing a descriptor.  The latter
is a problem, since nothing in close(2) bothers making a distinction between
the effects specific to a particular syscall and those common to all ways of
closing a descriptor.  And no, it's not nitpicking - consider e.g. the
parts concerning the order of events triggered by close(2) (such and such
should be completed before close(2) returns); should it be taken as "same
events should be completed before newfd is associated with the file description
referred to by oldfd"?  It _is_ user-visible, since close(2) removes fcntl
locks.  Sure, there is (otherwise unexplained)
	The dup2() function is not intended for use in critical regions
	as a synchronization mechanism.
down in informative sections, so one can infer that event order here isn't
to be relied upon.  With no way to guess whether the event order concerning
e.g. effect on ongoing accept(newfd) is any different in that respect.

The entire area in Issue 7 stinks.  It might make sense to try and fix it
up, but let's not pretend that what's in there right now does specify the
semantics in this kind of situation.

I'm not saying that the Solaris approach yields inherently bad semantics or
that it's impossible to implement without a high scalability price and/or
a high memory footprint.  But waving the flag of POSIX compliance when you
are actually talking about the ways your implementation plugs the holes in
a badly incomplete spec...

Not to contribute to a pissing contest, but IIRC Solaris wasn't even the first
kernel having to deal with the possibility of the descriptor table getting changed
by another thread under an ongoing syscall.  Which was completely outside of
POSIX scope, not that the Plan 9 folks gave a damn.  For Linux that can of worms
had been opened in 1.3.22 (Sep 1995), for OpenBSD - the next January, a month later
followed by FreeBSD (ab, said to be based on the OpenBSD variant) and in Jan 1998
by NetBSD (said to be partially based on the FreeBSD one).  All of those had been
more or less inspired by the Plan 9 approach (in the case of *BSD the original
implementation was by Ron Minnich).  Not sure when Plan 9 implemented
their variant; it was definitely there by the beginning of 1993 (going by
the date on the Release 1 rfork(2)).  That would be what, around the time of
Solaris 2.1?

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22  6:34                     ` Casper.Dik
@ 2015-10-22 17:21                       ` Al Viro
  2015-10-22 18:24                         ` Casper.Dik
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 17:21 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech

On Thu, Oct 22, 2015 at 08:34:19AM +0200, Casper.Dik@oracle.com wrote:
> 
> 
> >And I'm really curious about the things Solaris would do with dup2() there.
> >Does it take into account the possibility of new accept() coming just as
> >dup2() is trying to terminate the ongoing ones?  Is there a window when
> >descriptor-to-file lookups would fail?  Looks like a race/deadlock country...
> 
Solaris does not "terminate" threads; instead it tells them that the
file descriptor information used is stale and wakes up the thread.

Sorry, lousy wording - I meant "terminate syscall in another thread".
Better yet, make that "what happens if new accept(newfd) comes while dup2()
waits for affected syscalls in other threads to finish"?  Assuming it
does wait, that is...

While we are at it, what's the relative order of record locks removal
and switching the meaning of newfd?  In our kernel it happens *after*
the switchover (i.e. if another thread is waiting for a record lock held on
any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
the state with newfd still referring to what it used to; note that the waiter
might be using any descriptor referring to the file newfd used to refer
to, so it won't be affected by the "wake those who had the meaning of
their arguments changed" side of things).

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 17:05                           ` Al Viro
@ 2015-10-22 17:39                             ` Alan Burlison
  2015-10-22 18:56                               ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 17:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech

On 22/10/2015 18:05, Al Viro wrote:

> Oh, for...  Right in this thread an example of complete BS has been quoted
> from POSIX close(2).  The part about closing a file when the last descriptor
> gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
> it be).

That's not exactly what it says; we've already discussed that, for example 
in the case of pending async IO on a filehandle.

> Semantics around the distinction between file descriptors and
> <barf> file descriptions is underspecified, not to mention being very poorly
> written.

I agree that part could do with some polishing.

> You want to add something along the lines of "if any action by another thread
> changes the mapping from file descriptors to file descriptions for any
> file descriptor passed to syscall, such and such things should happen" - go
> ahead and specify what should happen.  As it is, I don't see anything of
> that sort in e.g. accept(2).  And no,
> 	[EBADF]
> 	    The socket argument is not a valid file descriptor.
> in there is nowhere near being unambiguous enough - everyone agrees that
> argument should be a valid descriptor at the time of call, but I would be
> very surprised to find _any_ implementation (including Solaris one)
> recheck that upon exit to userland.

The scenario I described previously, where dup2() is used to modify a fd 
that's being used in accept, results in the accept call terminating in 
the same way as if close had been called on it - Casper gave details 
earlier.
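
A rough, untested sketch of that scenario (loopback listener; the 1-second
sleep and 5-second alarm are illustrative only):

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int newfd;			/* the listening socket under dispute */

static void *thread_a(void *unused)
{
	int conn = accept(newfd, NULL, NULL);

	/* Solaris (as described in this thread): woken with an error once
	   dup2() replaces newfd.  Linux: keeps waiting on the original
	   socket until a connection arrives. */
	printf("accept() = %d, errno = %d\n", conn, conn < 0 ? errno : 0);
	return NULL;
}

int main(void)
{
	struct sockaddr_in addr;
	pthread_t a;
	int oldfd = open("/dev/null", O_RDONLY);

	newfd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(newfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(newfd, 10);

	pthread_create(&a, NULL, thread_a, NULL);
	sleep(1);			/* let thread A block in accept() */
	dup2(oldfd, newfd);		/* thread B's dup2() over the listener */

	alarm(5);			/* bound the demo if accept() never returns */
	pthread_join(a, NULL);
	return 0;
}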

> For more bullshit from the same source (issue 7, close(2)):
> 	If fildes refers to a socket, close() shall cause the socket to be
> 	destroyed. If the socket is in connection-mode, and the SO_LINGER
> 	option is set for the socket with non-zero linger time, and the socket
> 	has untransmitted data, then close() shall block for up to the current
> 	linger interval until all data is transmitted.
> I challenge you to find *any* implementation that would have
> 	fd = socket(...);
> 	close(dup(fd));
> do what this wonder of technical prose clearly requests.  In the same text we
> also have
> 	When all file descriptors associated with a pipe or FIFO special file
> 	are closed, any data remaining in the pipe or FIFO shall be discarded.
> as well as explicit (and underspecified, but perhaps they do it elsewhere)
> "last close" in parts related to sockets and ptys.

Yes, Casper has just reported that to TOG, see 
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11573. 
Our assessment is that sockets should behave the same way as plain 
files, i.e. 'last close'.

> And that is not to mention the dup2(2) wording in there:
> 	If fildes2 is already a valid open file descriptor, it shall be
> 	closed first
> which is (a) inviting misinterpretation that would make the damn thing
> non-atomic (the only mentioning of atomicity is in non-normative sections)

I've already explained why I believe atomic behaviour of dup2() is 
required by POSIX. If you feel it's not clear then we can ask POSIX for 
clarification.

> and (b) says fsck-all about the effects of closing descriptor.  The latter
> is a problem, since nothing in close(2) bothers making a distinction between
> the effects specific to particular syscall and those common to all ways of
> closing a descriptor.  And no, it's not a nitpicking - consider e.g. the
> parts concerning the order of events triggered by close(2) (such and such
> should be completed before close(2) returns); should it be taken as "same
> events should be completed before newfd is associated with the file description
> referred to by oldfd"?  It _is_ user-visible, since close(2) removes fcntl
> locks.  Sure, there is (otherwise unexplained)
> 	The dup2() function is not intended for use in critical regions
> 	as a synchronization mechanism.
> down in informative sections, so one can infer that event order here isn't
> to be relied upon.  With no way to guess whether the event order concerning
> e.g. effect on ongoing accept(newfd) is any different in that respect.

I think "it shall be closed first" makes it pretty clear that what is 
expected is the same behaviour as any direct invocation of close, and 
that has to happen before the reassignment. What makes you believe 
that isn't the case?

> The entire area in Issue 7 stinks.  It might make sense to try and fix it
> up, but let's not pretend that what's in there right now does specify the
> semantics in this kind of situations.

Sorry, I disagree.

> I'm not saying that Solaris approach yields an inherently bad semantics or
> that it's impossible to implement without high scalability price and/or
> high memory footprint.  But waving the flag of POSIX compliance when you
> are actually talking about the ways your implementation plugs the holes in
> a badly incomplete spec...

Personally I believe the spec is clear enough to allow an unambiguous 
interpretation of the required behavior in this area. If you think there 
are areas where the Solaris behaviour is in disagreement with the spec 
then I'd be interested to hear them.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 10:55                 ` Alan Burlison
@ 2015-10-22 18:16                   ` Al Viro
  2015-10-22 20:15                     ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 18:16 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech

On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote:
> On 22/10/2015 05:21, Al Viro wrote:
> 
> >>Most of the work on using a file descriptor is local to the thread.
> >
> >Using - sure, but what of cacheline dirtied every time you resolve a
> >descriptor to file reference?
> 
> Don't you have to do that anyway, to do anything useful with the file?

Dirtying the cacheline that contains struct file itself is different, but
that's not per-descriptor.

> >In case of Linux we have two bitmaps and an array of pointers associated
> >with descriptor table.  They grow on demand (in parallel)
> >	* reserving a descriptor is done under ->file_lock (dropped/regained
> >around memory allocation if we end up expanding the sucker, actual reassignment
> >of pointers to array/bitmaps is under that spinlock)
> >	* installing a pointer is lockless (we wait for ongoing resize to
> >settle, RCU takes care of the rest)
> >	* grabbing a file by index is lockless as well
> >	* removing a pointer is under ->file_lock, so's replacing it by dup2().
> 
> Is that table per-process or global?

Usually it's per-process, but any thread could ask for a private instance
to work with (and then spawn more threads sharing that instance - or getting
independent copies).

It's common for Plan 9-inspired models - basically, you treat every thread
as a machine that consists of
	* memory
	* file descriptor table
	* namespace
	* signal handlers
	...
	* CPU (i.e. actual thread of execution).
The last part can't be shared; anything else can.  fork(2) variant used to
start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD)
is told which components should be copies of parent's ones and which should
be shared with the parent.  fork(2) is simply "copy everything except for the
namespace".  It's fairly common to have "share everything", but intermediate
variants are also possible.  There are constraints (e.g. you can't share
signal handlers without sharing the memory space), but descriptor table
can be shared independently from memory space just fine.  There's also a
way to say "unshare this, this and that components" - mapped to unshare(2) in
Linux and to rfork(2) in Plan 9.
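
A minimal, untested sketch of that (the 64K stack and /dev/null descriptor
are arbitrary): with CLONE_FILES the new task shares the caller's descriptor
table, so a close() done by the child is visible to the parent:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int fd;

static int child(void *unused)
{
	close(fd);		/* affects the shared descriptor table */
	return 0;
}

int main(void)
{
	char *stack = malloc(65536);
	pid_t pid;

	fd = open("/dev/null", O_RDONLY);

	/* CLONE_FILES: share the descriptor table; memory is still copied. */
	pid = clone(child, stack + 65536, CLONE_FILES | SIGCHLD, NULL);
	waitpid(pid, NULL, 0);

	/* fd was closed by the child, so the lookup now fails with EBADF. */
	printf("fcntl(fd, F_GETFD) = %d\n", fcntl(fd, F_GETFD));
	return 0;
}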

The best way to think of that is to consider the descriptor table as a first-class
object a thread can be connected to.  Usually you have one for each process,
with all threads belonging to that process connected to the same thing,
but that's just the most common use.

> I don't think that it's possible to claim that a non-atomic dup2()
> is POSIX-compliant.

Except that it's in the non-normative part of dup2(2), AFAICS.  I certainly
agree that it would be standards lawyering beyond reason, but "not
possible to claim" is too optimistic.  Maybe I'm just more cynical...

> ThreadA remains sat in accept on fd1 which is now a plain file, not
> a socket.

No.  accept() is not an operation on file descriptors; it's an operation on
file descriptions (pardon for use of that terminology).  They are specified
by passing descriptors, but there's a hell of a difference between e.g.
dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
(operations on descriptions).

Lookups are done once per syscall; the only exception is F_SETLK{,W}, where
we recheck that the descriptor is referring to the same thing before granting
the lock.

Again, POSIX is still underspecifying the semantics of shared descriptor
tables; back when the bulk of it had been written there had been no way
to have a descriptor -> description mapping changed under a syscall by
action of another thread.  Hell, they still hadn't picked up on some things
that happened in the early 80s, let alone the early-to-mid 90s...

Linux and Solaris happen to cover these gaps differently; FreeBSD and
OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 17:21                       ` Al Viro
@ 2015-10-22 18:24                         ` Casper.Dik
  2015-10-22 19:07                           ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22 18:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech


>On Thu, Oct 22, 2015 at 08:34:19AM +0200, Casper.Dik@oracle.com wrote:
>> 
>> 
>> >And I'm really curious about the things Solaris would do with dup2() there.
>> >Does it take into account the possibility of new accept() coming just as
>> >dup2() is trying to terminate the ongoing ones?  Is there a window when
>> >descriptor-to-file lookups would fail?  Looks like a race/deadlock country...
>> 
>> Solaris does not "terminate" threads; instead it tells them that the
>> file descriptor information used is stale and wakes up the thread.
>
>Sorry, lousy wording - I meant "terminate syscall in another thread".
>Better yet, make that "what happens if new accept(newfd) comes while dup2()
>waits for affected syscalls in other threads to finish"?  Assuming it
>does wait, that is..

No, there is no such window; the accept() call either returns EBADF
(dup2() wins the race) or it returns a new file descriptor (and dup2()
then closes the listening descriptor).

One or the other.

>While we are at it, what's the relative order of record locks removal
>and switching the meaning of newfd?  In our kernel it happens *after*
>the switchover (i.e. if another thread is waiting for a record lock held on
>any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
>the state with newfd still referring to what it used to; note that the waiter
>might be using any descriptor referring to the file newfd used to refer
>to, so it won't be affected by the "wake those who had the meaning of
>their arguments changed" side of things).

The external behaviour is atomic; you cannot distinguish the order
between the closing of the original file (and waking up other threads
waiting for a record lock) and changing the file referenced by that newfd.

But this does not involve a global or process-specific lock.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 17:39                             ` Alan Burlison
@ 2015-10-22 18:56                               ` Al Viro
  2015-10-22 19:50                                 ` Casper.Dik
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 18:56 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech

On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
> On 22/10/2015 18:05, Al Viro wrote:
> 
> >Oh, for...  Right in this thread an example of complete BS has been quoted
> >from POSIX close(2).  The part about closing a file when the last descriptor
> >gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
> >it be).
> 
> That's not exactly what it says, we've already discussed, for
> example in the case of pending async IO on a filehandle.

Sigh...  It completely fails to mention descriptor-passing.  Which
	a) is relevant to what "last close" means and
	b) had been there for nearly a third of a century.

> I agree that part could do with some polishing.

google("wire brush of enlightenment") is what comes to mind...

> >and (b) says fsck-all about the effects of closing descriptor.  The latter
> >is a problem, since nothing in close(2) bothers making a distinction between
> >the effects specific to particular syscall and those common to all ways of
> >closing a descriptor.  And no, it's not a nitpicking - consider e.g. the
> >parts concerning the order of events triggered by close(2) (such and such
> >should be completed before close(2) returns); should it be taken as "same
> >events should be completed before newfd is associated with the file description
> >referred to by oldfd"?  It _is_ user-visible, since close(2) removes fcntl
> >locks.  Sure, there is (otherwise unexplained)
> >	The dup2() function is not intended for use in critical regions
> >	as a synchronization mechanism.
> >down in informative sections, so one can infer that event order here isn't
> >to be relied upon.  With no way to guess whether the event order concerning
> >e.g. effect on ongoing accept(newfd) is any different in that respect.
> 
> I think "it shall be closed first" makes it pretty clear that what
> is expected is the same behaviour as any direct invocation of close,
> and that has to happen before the reassignment. What makes you
> believe that isn't the case?

So unless I'm misparsing something, you want
thread A: accept(newfd)
thread B: dup2(oldfd, newfd)
have accept() bugger off before the switchover happens?

What should happen if thread C does accept(newfd) right as B has decided that
there's nothing more to wait for?  For close(newfd) it would be simple - we are
going to have lookup by descriptor fail with EBADF anyway, so making it do
so as soon as we go hunting for those who are currently in accept(newfd)
would do the trick - no new threads like that shall appear and as long as
the descriptor is not declared free for taking by descriptor allocation nobody
is going to be screwed by open() picking that slot of descriptor table too
early.  Trying to do that for dup2() would lose atomicity.  I honestly don't
know how Solaris behaves in that case, BTW - the race (if any) would probably
be hard to hit, so in case of Linux I would have to go and RTFS before saying
that there isn't one.  I can't do that with Solaris; all I can do here
is ask you guys...

Moreover, see above for record locks removal.  Should that happen prior to
switchover?  If you have

dup2(fd, fd2);
set a record lock on fd2
spawn a thread
in child, try to grab the same lock on fd2
in parent, do some work and close(fd)

you are guaranteed that the child won't see fd referring to the same file after it
acquires the lock.
Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

FWIW, Linux behaviour in that area is to have record locks removal done
between the switchover and return to userland in case of dup2() and between
the removal from descriptor table and return to userland in case of close().

> Personally I believe the spec is clear enough to allow an
> unambiguous interpretation of the required behavior in this area. If
> you think there are areas where the Solaris behaviour is in
> disagreement with the spec then I'd be interested to hear them.

The spec is so vague that I strongly suspect that *both* Solaris and Linux
behaviours are not in disagreement with it (modulo shutdown(2) extension
Linux-side and we are really stuck with that one).

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 18:24                         ` Casper.Dik
@ 2015-10-22 19:07                           ` Al Viro
  2015-10-22 19:51                             ` Casper.Dik
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 19:07 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech

On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote:

> The external behaviour is atomic; you cannot distinguish the order
> between the closing of the original file (and waking up other threads
> waiting for a record lock) and changing the file referenced by that newfd.
> 
> But this does not involve a global or process-specific lock.

Interesting...  Do you mean that descriptor-to-file lookup blocks until that
rundown finishes?

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 18:56                               ` Al Viro
@ 2015-10-22 19:50                                 ` Casper.Dik
  2015-10-23 17:09                                   ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22 19:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, Eric Dumazet, David Miller, stephen, netdev,
	dholland-tech


From: Al Viro <viro@ZenIV.linux.org.uk>

>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
>> On 22/10/2015 18:05, Al Viro wrote:
>> 
>> >Oh, for...  Right in this thread an example of complete BS has been quoted
>> >from POSIX close(2).  The part about closing a file when the last descriptor
>> >gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
>> >it be).
>> 
>> That's not exactly what it says, we've already discussed, for
>> example in the case of pending async IO on a filehandle.
>
>Sigh...  It completely fails to mention descriptor-passing.  Which
>	a) is relevant to what "last close" means and
>	b) had been there for nearly a third of a century.

Why is that different?  These clearly count as file descriptors.

>> I agree that part could do with some polishing.
>
>google("wire brush of enlightenment") is what comes to mind...

Standardese is similar to legalese; it is not writing that is directly open 
to interpretation, and those who are not inducted into that style of writing 
may have some problem interpreting what exactly is meant by the wording of 
the standard.



>> I think "it shall be closed first" makes it pretty clear that what
>> is expected is the same behaviour as any direct invocation of close,
>> and that has to happen before the reassignment. What makes you
>> believe that's isn't the case?
>
>So unless I'm misparsing something, you want
>thread A: accept(newfd)
>thread B: dup2(oldfd, newfd)
>have accept() bugger off before the switchover happens?

Well, certainly *before* we return from dup2().
(and clearly only once we have determined that dup2() will return
successfully)

>What should happen if thread C does accept(newfd) right as B has decided that
>there's nothing more to wait?  For close(newfd) it would be simple - we are
>going to have lookup by descriptor fail with EBADF anyway, so making it do
>so as soon as we go hunting for those who are currently in accept(newfd)
>would do the trick - no new threads like that shall appear and as long as
>the descriptor is not declared free for taking by descriptor allocation nobody
>is going to be screwed by open() picking that slot of descriptor table too
>early.  Trying to do that for dup2() would lose atomicity.  I honestly don't
>know how Solaris behaves in that case, BTW - the race (if any) would probably
>be hard to hit, so in case of Linux I would have to go and RTFS before saying
>that there isn't one.  I can't do that with Solaris; all I can do here
>is ask you guys...

Solaris dup2() behaves exactly like close().

>Moreover, see above for record locks removal.  Should that happen prior to
>switchover?  If you have
>
>dup2(fd, fd2);
>set a record lock on fd2
>spawn a thread
>in child, try to grab the same lock on fd2
>in parent, do some work and close(fd)

>you are guaranteed that the child won't see fd referring to the same file after it
>acquires the lock.

Here you are talking about a lock held by the "parent" and that the
"child" will only get the lock once close(fd) is done?

Yes.  The final "close" is done *after* the pointer has been removed from 
the file descriptor table.

>Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

Yes.

>FWIW, Linux behaviour in that area is to have record locks removal done
>between the switchover and return to userland in case of dup2() and between
>the removal from descriptor table and return to userland in case of close().
>
>> Personally I believe the spec is clear enough to allow an
>> unambiguous interpretation of the required behavior in this area. If
>> you think there are areas where the Solaris behaviour is in
>> disagreement with the spec then I'd be interested to hear them.
>
>The spec is so vague that I strongly suspect that *both* Solaris and Linux
>behaviours are not in disagreement with it (modulo shutdown(2) extension
>Linux-side and we are really stuck with that one).

I'm not sure if the standard allows a handful of threads in accept() for a 
file descriptor which has already been closed *and* can be re-issued for 
other uses.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 19:07                           ` Al Viro
@ 2015-10-22 19:51                             ` Casper.Dik
  2015-10-22 21:57                               ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-22 19:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech


>On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote:
>
>> The external behaviour is atomic; you cannot distinguish the order
>> between the closing of the original file (and waking up other threads
>> waiting for a record lock) and changing the file referenced by that newfd.
>> 
>> But this does not involve a global or process-specific lock.
>
>Interesting...  Do you mean that descriptor-to-file lookup blocks until that
>rundown finishes?

For that particular file descriptor, yes.  (I'm assuming you mean the 
Solaris kernel running down all lwps which have a system call in progress on that 
particular file descriptor.)  All other fd-to-file lookups are not blocked 
at all by this locking.

It should be clear that any such occurrences are application errors and 
should hardly ever be seen in practice.  It is also known when this is 
needed, so it is hardly even attempted.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 18:16                   ` Al Viro
@ 2015-10-22 20:15                     ` Alan Burlison
  0 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-22 20:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Casper.Dik, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech

On 22/10/15 19:16, Al Viro wrote:

>> Don't you have to do that anyway, to do anything useful with the file?
>
> Dirtying the cacheline that contains struct file itself is different, but
> that's not per-descriptor.

Yes, true enough.

> Usually it's per-process, but any thread could ask for a private instance
> to work with (and then spawn more threads sharing that instance - or getting
> independent copies).

[snip]

Thanks again for the info, interesting. Solaris also does things along 
the same lines. In fact we recently extended posix_spawn so it could be 
used by the JVM to create subprocesses without wholesale copying, and 
Casper has done some optimisation work on posix_spawn as well.

>> I don't think that it's possible to claim that a non-atomic dup2()
>> is POSIX-compliant.
>
> Except that it's in non-normative part of dup2(2), AFAICS.  I certainly
> agree that it would be a standard lawyering beyond reason, but "not
> possible to claim" is too optimistic.  Maybe I'm just more cynical...

Possibly so, and possibly justifiably so as well ;-)

>> ThreadA remains sat in accept on fd1 which is now a plain file, not
>> a socket.
>
> No.  accept() is not an operation on file descriptors; it's an operation on
> file descriptions (pardon for use of that terminology).  They are specified
> by passing descriptors, but there's a hell of a difference between e.g.
> dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
> (operations on descriptions).
>
> Lookups are done once per syscall; the only exception is F_SETLK{,W}, where
> we recheck that the descriptor is referring to the same thing before granting
> the lock.

Yes, but if you believe that dup2() requires an implicit close() within 
it and that dup2() has to be atomic then expecting that other threads 
waiting on the same fd in accept() will get a notification seems 
reasonable enough.

> Again, POSIX is still underspecifying the semantics of shared descriptor
> tables; back when the bulk of it had been written there had been no way
> to have a descriptor -> description mapping changed under a syscall by
> action of another thread.  Hell, they still hadn't picked on some things
> that happened in early 80s, let alone early-to-mid 90s...

That's indisputably true - much of the POSIX behaviour predates threads 
and it shows, quite badly, in some places.

> Linux and Solaris happen to cover these gaps differently; FreeBSD and
> OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.

Yes, true enough.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 19:51                             ` Casper.Dik
@ 2015-10-22 21:57                               ` Al Viro
  2015-10-23  9:52                                 ` Casper.Dik
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-22 21:57 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech

On Thu, Oct 22, 2015 at 09:51:05PM +0200, Casper.Dik@oracle.com wrote:
> 
> >On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote:
> >
> >> The external behaviour atomic; you cannot distinguish the order
> >> between the closing of the original file (and waking up other threads
> >> waiting for a record lock) or changing the file referenced by that newfd.
> >> 
> >> But this not include a global or process specific lock.
> >
> >Interesting...  Do you mean that decriptor-to-file lookup blocks until that
> >rundown finishes?
> 
> For that particular file descriptor, yes.  (I'm assuming you mean the 
> Solaris kernel running down all lwps who have a system in progress on that 
> particular file descriptor).  All other fd to file lookups are not blocked 
> at all by this locking.
> 
> It should be clear that any such occurrences are application errors and 
> should be hardly ever seen in practice.  It is also known when this is 
> needed so it is hardly even attempted.

Ho-hum...  It could even be made lockless in fast path; the problems I see
are
	* descriptor-to-file lookup becomes unsafe in a lot of locking
conditions.  Sure, most of that happens on the entry to some syscall, with
very light locking environment, but... auditing every sodding ioctl that
might be doing such lookups is an interesting exercise, and then there are
->mount() instances doing the same thing.  And procfs accesses.  Probably
nothing impossible to deal with, but nothing pleasant either.
	* memory footprint.  In case of Linux on amd64 or sparc64,
main()
{
	int i;
	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
		;
}
will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n,
of course).  How much will Solaris eat on the same?
	* related to the above - how much cacheline sharing will that involve?
These per-descriptor use counts are a bitch to pack, and giving each a cacheline
of its own...  <shudder>

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 21:57                               ` Al Viro
@ 2015-10-23  9:52                                 ` Casper.Dik
  2015-10-23 13:02                                   ` Eric Dumazet
  2015-10-24  2:30                                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
  0 siblings, 2 replies; 138+ messages in thread
From: Casper.Dik @ 2015-10-23  9:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech



>Ho-hum...  It could even be made lockless in fast path; the problems I see
>are
>	* descriptor-to-file lookup becomes unsafe in a lot of locking
>conditions.  Sure, most of that happens on the entry to some syscall, with
>very light locking environment, but... auditing every sodding ioctl that
>might be doing such lookups is an interesting exercise, and then there are
>->mount() instances doing the same thing.  And procfs accesses.  Probably
>nothing impossible to deal with, but nothing pleasant either.

In the Solaris kernel code, the ioctl code is generally not handed a file 
descriptor but instead a file pointer (i.e., the lookup is done early in 
the system call).

In those specific cases where a system call needs to convert a file 
descriptor to a file pointer, there is only one routine which can be used.

>	* memory footprint.  In case of Linux on amd64 or sparc64,
>main()
>{
>	int i;
>	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
>		;
>}
>will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n,
>of course).  How much will Solaris eat on the same?

Yeah, that is a large amount of memory.  Of course, the table is only 
sized when it is extended and there is a reason why there is a limit on 
file descriptors.  But we're using more data per file descriptor entry.


>	* related to the above - how much cacheline sharing will that involve?
>These per-descriptor use counts are bitch to pack, and giving each a cacheline
>of its own...  <shudder>

As I said, we do actually use a lock and yes that means that you really  
want to have a single cache line for each and every entry.  It does make 
it easy to have non-racy file description updates.  You certainly do not 
want false sharing when there is a lot of contention.

Other data is used to make sure that it only takes O(log(n)) to find the 
lowest available file descriptor entry.  (Where n, I think, is the returned
descriptor)

Uncontended locks aren't expensive.  And all is done on a single cache 
line.

One question about the Linux implementation: what happens when a socket in 
select is closed?  I'm assuming that the kernel waits until "shutdown" is 
given or a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in 
accept()?  I would think so (it would be visible in netstat but you can't 
easily find out who has it).

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23  9:52                                 ` Casper.Dik
@ 2015-10-23 13:02                                   ` Eric Dumazet
  2015-10-23 13:20                                     ` Casper.Dik
  2015-10-23 13:35                                     ` Alan Burlison
  2015-10-24  2:30                                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
  1 sibling, 2 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 13:02 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 11:52 +0200, Casper.Dik@oracle.com wrote:
> 
> >Ho-hum...  It could even be made lockless in fast path; the problems I see
> >are
> >	* descriptor-to-file lookup becomes unsafe in a lot of locking
> >conditions.  Sure, most of that happens on the entry to some syscall, with
> >very light locking environment, but... auditing every sodding ioctl that
> >might be doing such lookups is an interesting exercise, and then there are
> >->mount() instances doing the same thing.  And procfs accesses.  Probably
> >nothing impossible to deal with, but nothing pleasant either.
> 
> In the Solaris kernel code, the ioctl code is generally not handed a file 
> descriptor but instead a file pointer (i.e., the lookup is done early in 
> the system call).
> 
> In those specific cases where a system call needs to convert a file 
> descriptor to a file pointer, there is only one routine which can be used.
> 
> >	* memory footprint.  In case of Linux on amd64 or sparc64,
> >main()
> >{
> >	int i;
> >	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
> >		;
> >}
> >will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n,
> >of course).  How much will Solaris eat on the same?
> 
> Yeah, that is a large amount of memory.  Of course, the table is only 
> sized when it is extended and there is a reason why there is a limit on 
> file descriptors.  But we're using more data per file descriptor entry.
> 
> 
> >	* related to the above - how much cacheline sharing will that involve?
> >These per-descriptor use counts are bitch to pack, and giving each a cacheline
> >of its own...  <shudder>
> 
> As I said, we do actually use a lock and yes that means that you really  
> want to have a single cache line for each and every entry.  It does make 
> it easy to have non-racy file description updates.  You certainly do not 
> want false sharing when there is a lot of contention.
> 
> Other data is used to make sure that it only takes O(log(n)) to find the 
> lowest available file descriptor entry.  (Where n, I think, is the returned
> descriptor)

Yet another POSIX deficiency.

When a server deals with 10,000,000+ sockets, we absolutely do not care about
this requirement.

O(log(n)) is still crazy if it involves O(log(n)) cache misses.

> 
> Not contended locks aren't expensive.  And all is done on a single cache 
> line.
> 
> One question about the Linux implementation: what happens when a socket in 
> select is closed?  I'm assuming that the kernel waits until "shutdown" is 
> given or when a connection comes in?
> 
> Is it a problem that you can "hide" your listening socket with a thread in 
> accept()?  I would think so (It would be visible in netstat but you can't 
> easily find out who has it)

Again, netstat -p on a server with 10,000,000 sockets never completes.

Never try this unless you are desperate and want to avoid a reboot
maybe.

If you absolutely want to nuke a listener because of untrusted
applications, we better implement a proper syscall.

Android has such a facility.

An alternative would be to extend netlink (the ss command from the iproute2
package) to carry one pid per socket.

ss -atnp state listening

-> would not have to readlink (/proc/*/fd/*)

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 13:02                                   ` Eric Dumazet
@ 2015-10-23 13:20                                     ` Casper.Dik
  2015-10-23 13:48                                       ` Eric Dumazet
  2015-10-23 14:13                                       ` Eric Dumazet
  2015-10-23 13:35                                     ` Alan Burlison
  1 sibling, 2 replies; 138+ messages in thread
From: Casper.Dik @ 2015-10-23 13:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech



>Yet another POSIX deficiency.
>
>When a server deals with 10,000,000+ socks, we absolutely do not care of
>this requirement.
>
>O(log(n)) is still crazy if it involves O(log(n)) cache misses.

You miss the fine point of the algorithm; you *always* find an available 
file descriptor in O(log(N)) (where N is the size of the table).

Does your algorithm guarantee that?

>> Is it a problem that you can "hide" your listening socket with a thread in 
>> accept()?  I would think so (It would be visible in netstat but you can't 
>> easily find out who has it)
>
>Again, netstat -p on a server with 10,000,000 sockets never completes.

This point was not about a 10M sockets server but in general...

>Never try this unless you are desperate and want to avoid a reboot
>maybe.
>
>If you absolutely want to nuke a listener because of untrusted
>applications, we better implement a proper syscall.
>
>Android has such a facility.

Solaris has had such an option too, but that wasn't the point.  You really 
don't want to know which application is doing this?

>Alternative would be to extend netlink (ss command from iproute2
>package) to carry one pid per socket.
>
>ss -atnp state listening

That would be an option too.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 13:02                                   ` Eric Dumazet
  2015-10-23 13:20                                     ` Casper.Dik
@ 2015-10-23 13:35                                     ` Alan Burlison
  2015-10-23 14:21                                       ` Eric Dumazet
  1 sibling, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-23 13:35 UTC (permalink / raw)
  To: Eric Dumazet, Casper.Dik
  Cc: Al Viro, David Miller, stephen, netdev, dholland-tech

On 23/10/2015 14:02, Eric Dumazet wrote:

>> Other data is used to make sure that it only takes O(log(n)) to find the
>> lowest available file descriptor entry.  (Where n, I think, is the returned
>> descriptor)
>
> Yet another POSIX deficiency.
>
> When a server deals with 10,000,000+ socks, we absolutely do not care of
> this requirement.
>
> O(log(n)) is still crazy if it involves O(log(n)) cache misses.

If you think it's a POSIX deficiency then logging a DR is probably the 
correct way of addressing that. And as I've said it's fine to decide 
that you don't care about what POSIX says on the subject but you can't 
simultaneously claim POSIX conformance. One or the other, not both.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 13:20                                     ` Casper.Dik
@ 2015-10-23 13:48                                       ` Eric Dumazet
  2015-10-23 14:13                                       ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 13:48 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 15:20 +0200, Casper.Dik@oracle.com wrote:

> This point was not about a 10M sockets server but in general...
> 

Hey, we do not design the OS only for handling desktop workloads.

If netstat does not work on this typical server workload, then it is a
joke.

> Solaris has had such an option too, but that wasn't the point.  You really 
> don't want to know which application is doing this?


Apparently nobody had this need before you asked.

My point is that instead of carrying ~30 years of legacy, we had
better design the system properly so that we can _use_ it in all
situations.

ps used to open /dev/kmem a long time ago.

fuser and similar commands having to know system internals are really
hacks.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 13:20                                     ` Casper.Dik
  2015-10-23 13:48                                       ` Eric Dumazet
@ 2015-10-23 14:13                                       ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 14:13 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 15:20 +0200, Casper.Dik@oracle.com wrote:
> 
> >Yet another POSIX deficiency.
> >
> >When a server deals with 10,000,000+ socks, we absolutely do not care of
> >this requirement.
> >
> >O(log(n)) is still crazy if it involves O(log(n)) cache misses.
> 
> You miss the fine point of the algorithm; you *always* find an available 
> file descriptor in O(log(N)) (where N is the size of the table).
> 
> Does your algorithm guarantee that?

It's a simple bitmap. Each cache line contains 64*8 = 512 bits. Lookup
is actually quite fast thanks to hardware prefetches.

The main problem we had until recently was that we used to acquire the fd table
lock 3 times per accept()/close() pair.

This patch took care of the issue :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a81252b774b53e628a8a0fe18e2b8fc236d92cc

Then, adding an O_RANDOMFD flag (or O_DO_NOT_REQUIRE_POSIX_FD_RULES) helps
a lot, as it allows us to pick a random starting point for
the bitmap search.

In practice, finding a slot in the fd array is O(1): exactly one cache line
miss.
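
For illustration only (a userspace sketch, not the kernel code; the word
count and contents are arbitrary), that kind of bitmap search looks roughly
like this:

#include <stdint.h>
#include <stdio.h>

#define NWORDS 8			/* 8 * 64 = 512 bits, one cache line */

static int lowest_free(const uint64_t *map)
{
	for (int i = 0; i < NWORDS; i++) {
		uint64_t free_bits = ~map[i];	/* clear bits = free slots */

		if (free_bits)			/* GCC/Clang builtin: count trailing zeros */
			return i * 64 + __builtin_ctzll(free_bits);
	}
	return -1;				/* bitmap full */
}

int main(void)
{
	/* word 0 fully used; in word 1 slots 64..66 are used */
	uint64_t map[NWORDS] = { ~0ULL, 0x7ULL };

	printf("lowest free descriptor slot: %d\n", lowest_free(map));	/* 67 */
	return 0;
}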

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 13:35                                     ` Alan Burlison
@ 2015-10-23 14:21                                       ` Eric Dumazet
  2015-10-23 15:46                                         ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 14:21 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 14:35 +0100, Alan Burlison wrote:

> If you think it's a POSIX deficiency then logging a DR is probably the 
> correct way of addressing that. And as I've said it's fine to decide 
> that you don't care about what POSIX says on the subject but you can't 
> simultaneously claim POSIX conformance. One or the other, not both.

I claim nothing. If you believe a man page should be fixed, please send
a patch to the man page maintainer.

Have you tested the patch I sent ?

The goal here is to improve things, not to say you are right or I am
right. I am very often wrong; then what?

This list is about linux kernel development.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 14:21                                       ` Eric Dumazet
@ 2015-10-23 15:46                                         ` Alan Burlison
  2015-10-23 16:00                                           ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-23 15:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On 23/10/2015 15:21, Eric Dumazet wrote:

> I claim nothing. If you believe a man page should be fixed, please send
> a patch to man page maintainer.

Ermm, you *really* want me to submit a patch removing 'Conforms to 
POSIX.1-2001' from *every* Linux manpage?

> Have you tested the patch I sent ?

The AF_UNIX poll one? No, I don't have the means to do so, and in any 
case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
if that helps.

> The goal here is to improve things, not to say you are right or I am
> right. I am very often wrong, then what ?
>
> This list is about linux kernel development.

Thank you for the clarification.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 15:46                                         ` Alan Burlison
@ 2015-10-23 16:00                                           ` Eric Dumazet
  2015-10-23 16:07                                             ` Alan Burlison
  2015-10-23 16:19                                             ` Eric Dumazet
  0 siblings, 2 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 16:00 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 16:46 +0100, Alan Burlison wrote:
> On 23/10/2015 15:21, Eric Dumazet wrote:
> 
> > I claim nothing. If you believe a man page should be fixed, please send
> > a patch to man page maintainer.
> 
> Ermm, you *really* want me to submit a patch removing 'Conforms to 
> POSIX.1-2001' from *every* Linux manpage?

Only on the pages where you think there is an error that matters.

> 
> > Have you tested the patch I sent ?
> 
> The AF_UNIX poll one? No, I don't have the means to do so, and in any 
> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
> if that helps.

We submit patches when someone needs a fix.

If not, we have more urgent issues to solve first.

I wrote following test case, and confirmed the patch fixes the issue.

I will submit it formally.

Thanks.

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <poll.h>

static void fail(const char *str)
{
	perror(str);
	printf("FAIL\n");
	exit(1);
}

int main(int argc, char *argv[])
{
	int listener = socket(AF_UNIX, SOCK_STREAM, 0);
	struct pollfd pfd;
	struct sockaddr_un addr;
	int res;

	if (listener == -1)
		perror("socket()");
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1)
		fail("bind()");  
	if (listen(listener, 10) == -1)
		fail("listen()");
	pfd.fd = listener;
	pfd.events = -1;
	res = poll(&pfd, 1, 10);
	if (res == -1)
		fail("poll()");
	if (res == 1 && pfd.revents & (POLLOUT|POLLWRNORM|POLLWRBAND)) {
		fprintf(stderr, "poll(af_unix listener) returned a POLLOUT status !\n");
		printf("FAIL\n");
		return 1;
	}
	printf("OK\n");
	return 0;
}

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 16:00                                           ` Eric Dumazet
@ 2015-10-23 16:07                                             ` Alan Burlison
  2015-10-23 16:19                                             ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-23 16:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On 23/10/2015 17:00, Eric Dumazet wrote:

>> Ermm, you *really* want me to submit a patch removing 'Conforms to
>> POSIX.1-2001' from *every* Linux manpage?
>
> Only on the pages you think there is an error that matters.

If there's consensus that the current shutdown(), dup2(), close() and 
accept() behaviour is not POSIX-compliant then I can do that, sure.

>>> Have you tested the patch I sent ?
>>
>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
>> if that helps.
>
> We submit patches when someone needs a fix.
>
> If not, we have more urgent issues to solve first.
>
> I wrote following test case, and confirmed the patch fixes the issue.
>
> I will submit it formally.

Thanks, works for me - the poll() issue only affects AF_UNIX sockets in 
the listen state and is easily avoided by simply not setting the output 
bits in the poll events mask, so it's not exactly high priority.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 16:00                                           ` Eric Dumazet
  2015-10-23 16:07                                             ` Alan Burlison
@ 2015-10-23 16:19                                             ` Eric Dumazet
  2015-10-23 16:40                                               ` Alan Burlison
  1 sibling, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 16:19 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 09:00 -0700, Eric Dumazet wrote:
> on the pages you think there is an error that matters.
> 
> > 
> > > Have you tested the patch I sent ?
> > 
> > The AF_UNIX poll one? No, I don't have the means to do so, and in any 
> > case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
> > if that helps.

BTW, there is no kernel bug here. POSIX poll() man page says :

POLLOUT
    Normal data may be written without blocking. 

If you attempt to write on a listener, write() does _not_ block and
returns -1, which seems correct behavior to me, in accordance with the man
page.

socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
listen(3, 10)                           = 0
write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)

Could you point out which part of POSIX mandates that an af_unix
listener MUST filter out POLLOUT?

Thanks.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 16:19                                             ` Eric Dumazet
@ 2015-10-23 16:40                                               ` Alan Burlison
  2015-10-23 17:47                                                 ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-23 16:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On 23/10/2015 17:19, Eric Dumazet wrote:

>>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
>>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
>>> if that helps.
>
> BTW, there is no kernel bug here. POSIX poll() man page says :
>
> POLLOUT
>      Normal data may be written without blocking.
>
> If you attempt to write on a listener, write() does _not_ block and
> returns -1, which seems correct behavior to me, in accordance with man
> page.

Except of course data may not be written, because an attempt to actually 
do so fails, because the socket is in the listen state, is not connected 
and therefore no attempt to write to it could ever succeed. The only bit 
of the required behaviour that the current AF_UNIX poll implementation 
actually gets right is the "without blocking" bit, and that's only the 
case because the failure is detected immediately and the write call 
returns immediately with an error.

> socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
> bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
> listen(3, 10)                           = 0
> write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)
>
> Could you point out which part of POSIX is mandating that af_unix
> listener MUST filter out POLLOUT ?

"A file descriptor for a socket that is listening for connections shall 
indicate that it is ready for reading, once connections are available. A 
file descriptor for a socket that is connecting asynchronously shall 
indicate that it is ready for writing, once a connection has been 
established."

If POSIX had to explicitly list every possible thing that 
implementations *should not* do rather than just those that they 
*should* do then it would be even more unwieldy than it already is.

And if what you are asserting is correct, why isn't the AF_INET 
behaviour the same as the AF_UNIX behaviour?

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-22 19:50                                 ` Casper.Dik
@ 2015-10-23 17:09                                   ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-23 17:09 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, Eric Dumazet, David Miller, stephen, netdev,
	dholland-tech

On Thu, Oct 22, 2015 at 09:50:10PM +0200, Casper.Dik@oracle.com wrote:
> >Sigh...  It completely fails to mention descriptor-passing.  Which
> >	a) is relevant to what "last close" means and
> >	b) had been there for nearly the third of a century.
> 
> Why is that different?  These clearly count as file descriptors.

To quote your own posting upthread (Message-ID:
<201510212033.t9LKX4G8007718@room101.nl.oracle.com>)

> Well, a file descriptor really only exists in the context of a process; 
> in-flight it is no longer a file descriptor as there process context with 
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> a file descriptor table; so pointers to file descriptions are passed 
> around.

IMO it shows that the wording is anything but clear.  The only way to claim
that it's accurate is, indeed, to declare that the contents of in-flight
SCM_RIGHTS datagram should be counted as file descriptors and that's too
much of a stretch for the reasons you've pointed to.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 16:40                                               ` Alan Burlison
@ 2015-10-23 17:47                                                 ` Eric Dumazet
  2015-10-23 17:59                                                   ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 17:47 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Fri, 2015-10-23 at 17:40 +0100, Alan Burlison wrote:
> On 23/10/2015 17:19, Eric Dumazet wrote:
> 
> >>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
> >>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
> >>> if that helps.
> >
> > BTW, there is no kernel bug here. POSIX poll() man page says :
> >
> > POLLOUT
> >      Normal data may be written without blocking.
> >
> > If you attempt to write on a listener, write() does _not_ block and
> > returns -1, which seems correct behavior to me, in accordance with man
> > page.
> 
> Except of course data may not be written, because an attempt to actually 
> do so fails, because the socket is in the listen state, is not connected 
> and therefore no attempt to write to it could ever succeed. The only bit 
> of the required behaviour that the current AF_UNIX poll implementation 
> actually gets right is the "without blocking" bit, and that's only the 
> case because the failure is detected immediately and the write call 
> returns immediately with an error.

Yeah, I know some people use poll(NULL, 0, timeout) to implement
msleep().

Because it definitely impresses friends.

So why not poll(&pfd, 1, timeout) to do the same, with a socket listener
and POLLOUT in pfd.events?

Go figure. I'll send the fine patch.

Thanks.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* [PATCH net-next] af_unix: do not report POLLOUT on listeners
  2015-10-23 17:47                                                 ` Eric Dumazet
@ 2015-10-23 17:59                                                   ` Eric Dumazet
  2015-10-25 13:45                                                     ` David Miller
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-23 17:59 UTC (permalink / raw)
  To: Alan Burlison, David Miller
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

From: Eric Dumazet <edumazet@google.com>

poll(POLLOUT) on a listener should not report fd is ready for
a write().

This would break some applications using poll() and pfd.events = -1,
as they would not block in poll()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alan Burlison <Alan.Burlison@oracle.com>
Tested-by: Eric Dumazet <edumazet@google.com>
---
 net/unix/af_unix.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f658235fb4..aaa0b58d6aba 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,9 +326,10 @@ found:
 	return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static int unix_writable(const struct sock *sk)
 {
-	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+	return sk->sk_state != TCP_LISTEN &&
+	       (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
 
 static void unix_write_space(struct sock *sk)

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 14:38         ` Alan Burlison
                             ` (2 preceding siblings ...)
  2015-10-21 18:51           ` Al Viro
@ 2015-10-23 18:30           ` David Holland
  2015-10-23 19:51             ` Al Viro
  3 siblings, 1 reply; 138+ messages in thread
From: David Holland @ 2015-10-23 18:30 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Al Viro, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech,
	Casper Dik

On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
 > On 21/10/2015 04:49, Al Viro wrote:
 > >BTW, for real fun, consider this:
 > >7)
 > >// fd is a socket
 > >fd2 = dup(fd);
 > >in thread A: accept(fd);
 > >in thread B: accept(fd);
 > >in thread C: accept(fd2);
 > >in thread D: close(fd);
 > >
 > >Which threads (if any), should get hit where it hurts?
 > 
 > A & B should return from the accept with an error. C should continue. Which
 > is what happens on Solaris.

So, I'm coming late to this discussion and I don't have the original
context; however, to me this cited behavior seems undesirable and if I
ran across it in the wild I would probably describe it as a bug.

System call processing for operations on files involves translating a
file descriptor (a number) into an open-file object (or "file
description"), struct file in BSD and I think also in Linux. The
actual system call logic operates on the open-file object, so once the
translation happens application monkeyshines involving file descriptor
numbers should have no effect on calls in progress. Other behavior
would violate the principle of least surprise, as this basic
architecture predates POSIX.
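
A tiny illustration of that split (a sketch; any readable file will do in place
of /etc/passwd): dup() hands out a second descriptor number for the same
open-file object, so the file offset - and the object's lifetime - are shared:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/passwd", O_RDONLY);
	int fd2 = dup(fd);		/* new number, same file description */
	char c;

	read(fd, &c, 1);		/* advances the shared offset */
	printf("offset seen via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));
	close(fd);			/* drops one reference ... */
	printf("read via fd2: %zd\n", read(fd2, &c, 1));	/* ... fd2 still works */
	close(fd2);
	return 0;
}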

The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are
supposed to be atomic, regardless of what POSIX might or might not
say.

I did not write that code but I'll pass the report on to those who
did.

-- 
David A. Holland
dholland@netbsd.org

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23 18:30           ` Fw: " David Holland
@ 2015-10-23 19:51             ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-23 19:51 UTC (permalink / raw)
  To: David Holland
  Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, Casper Dik

On Fri, Oct 23, 2015 at 06:30:25PM +0000, David Holland wrote:

> So, I'm coming late to this discussion and I don't have the original
> context; however, to me this cited behavior seems undesirable and if I
> ran across it in the wild I would probably describe it as a bug.

Unfortunately, that's precisely what NetBSD is trying to implement (and
that's what will happen if nothing else reopens fd).  See the logics in
fd_close(), with ->fo_restart() and waiting for all activity to settle
down.  As for the missing context, what fd_close() is doing is also
unreliable - inducing ERESTART in other threads sitting in accept(2) and
things like that and waiting for them to run into EBADF they'll get
(barring races) on syscall restart; threads sitting in accept() et.al.
on the same struct file, but with different descriptors will hopefully
go into restart and continue unaffected.  All that machinery relies on
nothing having reused the descriptor for socket(2), dup2() target, etc. while
those threads had been going through the syscall restart - if that happens,
you are SOL, since accept(2) _will_ restart on an unexpected socket.

Moreover, if you fix dup2() atomicity, this approach will reliably shit
itself for situations when dup2() rather than close() is used to close
the socket.  It relies upon having at least some window where the victim
descriptor would be yielding EBADF.

> System call processing for operations on files involves translating a
> file descriptor (a number) into an open-file object (or "file
> description"), struct file in BSD and I think also in Linux. The
> actual system call logic operates on the open-file object, so once the
> translation happens application monkeyshines involving file descriptor
> numbers should have no effect on calls in progress. Other behavior
> would violate the principle of least surprise, as this basic
> architecture predates POSIX.

Well, to be fair, until '93 there was no way to have the descriptor table changed
under a syscall in the first place.  The old model (everything up to and
including 4.4BSD final) simply didn't include anything of that sort - mapping
from descriptors to open files was not shared and all changes a syscall might
see were ones done by the syscall itself.

So this thing isn't covered by the basic architecture - it's something that
had been significantly new merely two decades ago.  And POSIX still hasn't
quite caught up with that newfangled 4.2BSD thing...

IMO what you've described above is fine - that's how Linux works, that's
how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008
or so.  "Cancel syscall if any of the descriptors got dissociated from
opened files by action of another thread, have the dissociating operation
wait for all affected syscalls to run down" thing had been introduced then
and it is similar to what Solaris is doing.

AFAICS, the main issue with that is the memory footprint from hell and/or
cacheline clusterfuck.  Having accept(2) bugger off with e.g. EINTR in such
situation isn't inherently worse or better than having it sit there as if
close() or dup2() has not happened - matter of taste, and if there had been a
way to do it without inflicting the price on processes that do not pull that
kind of crap in the first place... might be worth considering.

As it is, the memory footprint seems to be too heavy.  I'm not entirely
convinced that there's no clever way to avoid that, but right now I don't
see anything that would look like a good approach.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-23  9:52                                 ` Casper.Dik
  2015-10-23 13:02                                   ` Eric Dumazet
@ 2015-10-24  2:30                                   ` Al Viro
  2015-10-27  9:08                                     ` Casper.Dik
  1 sibling, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-24  2:30 UTC (permalink / raw)
  To: Casper.Dik
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech

On Fri, Oct 23, 2015 at 11:52:34AM +0200, Casper.Dik@oracle.com wrote:
> 
> 
> >Ho-hum...  It could even be made lockless in fast path; the problems I see
> >are
> >	* descriptor-to-file lookup becomes unsafe in a lot of locking
> >conditions.  Sure, most of that happens on the entry to some syscall, with
> >very light locking environment, but... auditing every sodding ioctl that
> >might be doing such lookups is an interesting exercise, and then there are
> >->mount() instances doing the same thing.  And procfs accesses.  Probably
> >nothing impossible to deal with, but nothing pleasant either.
> 
> In the Solaris kernel code, the ioctl code is generally not handed a file 
> descriptor but instead a file pointer (i.e., the lookup is done early in 
> the system call).

The one that comes as the first argument of ioctl(2) - sure, but e.g.
BTRFS_IOC_CLONE_RANGE gets a pointer to this:
struct btrfs_ioctl_clone_range_args {
  __s64 src_fd;
  __u64 src_offset, src_length;
  __u64 dest_offset;
};
as the third argument.  VFS sure as hell has no idea of that thing - it's
up to btrfs_ioctl() to copy it in and deal with what it had been given.
While we are at it, ioctl(fd, BTRFS_IOC_CLONE, src_fd) also gets the
second descriptor-to-file lookup in btrfs-specific code; fd, of course,
is looked up by VFS code.  Now, these two are done in locking-neutral
environment, but that's just two of several dozens.

And no, I'm not fond of such irregular ways to pass file descriptors, but
we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...
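
For concreteness, a minimal userspace sketch of that pattern (the paths are
made up, and both files have to live on the same btrfs filesystem): the second
descriptor travels inside the ioctl argument, invisible to the generic
descriptor-to-file lookup the VFS does for the first argument:

#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct btrfs_ioctl_clone_range_args args;
	int src = open("/mnt/btrfs/src", O_RDONLY);
	int dst = open("/mnt/btrfs/dst", O_WRONLY | O_CREAT, 0644);

	args.src_fd = src;	/* looked up by btrfs_ioctl(), not by the VFS */
	args.src_offset = 0;
	args.src_length = 0;	/* 0 = clone up to EOF for this ioctl */
	args.dest_offset = 0;

	if (ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args) < 0)
		perror("BTRFS_IOC_CLONE_RANGE");

	close(src);
	close(dst);
	return 0;
}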

> In those specific cases where a system call needs to convert a file 
> descriptor to a file pointer, there is only one routine which can be used.

Obviously, but the problem is deadlock avoidance using it.

> As I said, we do actually use a lock and yes that means that you really  
> want to have a single cache line for each and every entry.  It does make 
> it easy to have non-racy file description updates.  You certainly do not 
> want false sharing when there is a lot of contention.
> 
> Other data is used to make sure that it only takes O(log(n)) to find the 
> lowest available file descriptor entry.  (Where n, I think, is the returned
> descriptor)

TBH, with that kind of memory footprint I would be more interested in
constants than in asymptotics - not that O(log(n)) would be hard to
arrange (a bunch of bitmaps with something like 1:512 ratio between the
levels, to keep the damn thing within a reasonable cacheline size would
probably do with not too horrible constant; 3 levels of that would already
give 128M descriptors, and that's a gigabyte worth of struct file *
alone; with your "cacheline per descriptor" it's what, about 8Gb eaten by
descriptor table?)

Hell knows, might be worth doing regardless of anything else.  Not making
it worse than our plain bitmap variant in any situations shouldn't be hard...
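
As a toy illustration of that multi-level bitmap idea (userspace, two levels
with a 1:64 fanout rather than the 1:512 mentioned above, and emphatically not
kernel code): a summary word records which leaf words are completely full, so
finding the lowest free slot touches O(levels) words instead of scanning the
whole in-use bitmap:

#include <stdint.h>
#include <stdio.h>

#define LEAVES 64			/* 64 leaf words * 64 bits = 4096 slots */

static uint64_t summary;		/* bit i set => leaf[i] is completely full */
static uint64_t leaf[LEAVES];

static int alloc_lowest(void)
{
	int l, b;

	if (~summary == 0)
		return -1;				/* everything in use */
	l = __builtin_ctzll(~summary);			/* lowest leaf with room */
	b = __builtin_ctzll(~leaf[l]);			/* lowest free bit in it */
	leaf[l] |= 1ULL << b;
	if (~leaf[l] == 0)
		summary |= 1ULL << l;			/* leaf just became full */
	return l * 64 + b;
}

static void release(int n)
{
	leaf[n / 64] &= ~(1ULL << (n % 64));
	summary &= ~(1ULL << (n / 64));			/* leaf has room again */
}

int main(void)
{
	printf("%d %d %d\n", alloc_lowest(), alloc_lowest(), alloc_lowest());
	release(1);
	printf("%d\n", alloc_lowest());			/* reuses slot 1 */
	return 0;
}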

> Not contended locks aren't expensive.  And all is done on a single cache 
> line.

The memory footprint is really scary.  Bitmaps are pretty much noise, but
blowing it up by a factor of 8 on normal 64bit (or 16 on something like Itanic -
or Venus for that matter, which is more relevant for you guys) is hard to justify.

That said, what's the point of "close won't return until..."?  After all,
you can't guarantee that thread with cancelled syscall won't lose CPU
immediately upon return to userland, so it *can't* make any assumptions
about the descriptor not having been already reused.  I don't get it - what
does that buy for userland code?

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH net-next] af_unix: do not report POLLOUT on listeners
  2015-10-23 17:59                                                   ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet
@ 2015-10-25 13:45                                                     ` David Miller
  0 siblings, 0 replies; 138+ messages in thread
From: David Miller @ 2015-10-25 13:45 UTC (permalink / raw)
  To: eric.dumazet
  Cc: Alan.Burlison, Casper.Dik, viro, stephen, netdev, dholland-tech

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Oct 2015 10:59:16 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> poll(POLLOUT) on a listener should not report fd is ready for
> a write().
> 
> This would break some applications using poll() and pfd.events = -1,
> as they would not block in poll()
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Alan Burlison <Alan.Burlison@oracle.com>
> Tested-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-24  2:30                                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
@ 2015-10-27  9:08                                     ` Casper.Dik
  2015-10-27 10:52                                       ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: Casper.Dik @ 2015-10-27  9:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech



>And no, I'm not fond of such irregular ways to pass file descriptors, but
>we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

Yeah, there are a number of calls which supposedly work on one file descriptor
but take a second argument which is also a file descriptor; mostly part of ioctl().

>> In those specific cases where a system call needs to convert a file 
> >> descriptor to a file pointer, there is only one routine which can be used.
>
>Obviously, but the problem is deadlock avoidance using it.

The Solaris algorithm is quite different, and as such there is no chance of
having a deadlock when using that function (there are in fact several such functions).


>The memory footprint is really scary.  Bitmaps are pretty much noise, but
>blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic -
>or Venus for that matter, which is more relevant for you guys)

Fair enough.  I think we have some systems with a larger cache line.

>Said that, what's the point of "close won't return until..."?  After all,
>you can't guarantee that thread with cancelled syscall won't lose CPU
>immediately upon return to userland, so it *can't* make any assumptions
>about the descriptor not having been already reused.  I don't get it - what
>does that buy for userland code?

Generally I wouldn't see that as a problem, but in the case of a socket 
blocking on accept indefinitely, I do see it as a problem especially as 
the thread actually wants to stop listening.

But in general, this is basically a problem with the application: the file
descriptor space is shared between threads, and if one thread is sniping
at open files you do have a problem.  Whatever the kernel does in that
case perhaps doesn't matter all that much: the application needs to be
fixed anyway.

Casper

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27  9:08                                     ` Casper.Dik
@ 2015-10-27 10:52                                       ` Alan Burlison
  2015-10-27 12:01                                         ` Eric Dumazet
                                                           ` (3 more replies)
  0 siblings, 4 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 10:52 UTC (permalink / raw)
  To: Casper.Dik, Al Viro
  Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 09:08, Casper.Dik@oracle.com wrote:

> Generally I wouldn't see that as a problem, but in the case of a socket
> blocking on accept indefinitely, I do see it as a problem especially as
> the thread actually wants to stop listening.
>
> But in general, this is basically a problem with the application: the file
> descriptor space is shared between threads and having one thread sniping
> at open files, you do have a problem and whatever the kernel does in that
> case perhaps doesn't matter all that much: the application needs to be
> fixed anyway.

The scenario in Hadoop is that the FD is being used by a thread that's 
waiting in accept and another thread wants to shut it down, e.g. because 
the application is terminating and needs to stop all threads cleanly. I 
agree the use of shutdown()+close() on Linux or dup2() on Solaris is 
pretty much an application-level hack - the concern in both cases is 
that the file descriptor that's being used in the accept() might be 
recycled by another thread. However that just raises the question of why
the FD isn't properly encapsulated by the application in a singleton
object, with the required shutdown semantics provided by having a
mechanism to invalidate the singleton and its contained FD.

There are other mechanisms that could be used to do a clean shutdown 
that don't require the OS to provide workarounds for arguably broken 
application behaviour, for example by setting a 'shutdown' flag in the 
object and then doing a dummy connect() to the accepting FD to kick it 
off the accept() and thereby getting it to re-check the 'shutdown' flag 
and not re-enter the accept().
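
A rough sketch of that flag-plus-dummy-connect() approach, as it might look in
C with pthreads (all names are illustrative and error handling is omitted):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static atomic_int shutting_down;
static int listener;
static struct sockaddr_in addr;

static void *accept_loop(void *arg)
{
	while (!atomic_load(&shutting_down)) {
		int conn = accept(listener, NULL, NULL);

		if (conn < 0)
			continue;
		if (atomic_load(&shutting_down)) {	/* woken only to exit */
			close(conn);
			break;
		}
		/* ... hand conn off to a worker here ... */
		close(conn);
	}
	return arg;
}

static void request_shutdown(void)
{
	int dummy = socket(AF_INET, SOCK_STREAM, 0);

	atomic_store(&shutting_down, 1);	/* set the flag first ... */
	connect(dummy, (struct sockaddr *)&addr, sizeof(addr));	/* ... then kick accept() */
	close(dummy);
}

int main(void)
{
	socklen_t len = sizeof(addr);
	pthread_t t;

	listener = socket(AF_INET, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(listener, (struct sockaddr *)&addr, sizeof(addr));
	getsockname(listener, (struct sockaddr *)&addr, &len);	/* learn the port */
	listen(listener, 16);

	pthread_create(&t, NULL, accept_loop, NULL);
	sleep(1);
	request_shutdown();
	pthread_join(t, NULL);
	close(listener);	/* safe: no thread is using the FD any more */
	return 0;
}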

If the object encapsulating a FD is invalidated and that prevents the FD 
being used any more because the only access is via that object, then it 
simply doesn't matter if the FD is reused elsewhere, there can be no 
race so a complicated, platform-dependent dance isn't needed.

Unfortunately Hadoop isn't the only thing that pulls the shutdown() 
trick, so I don't think there's a simple fix for this, as discussed 
earlier in the thread. Having said that, if close() on Linux also did an 
implicit shutdown() it would mean that well-written applications that 
handled the scoping, sharing and reuse of FDs properly could just call 
close() and have it work the same way across *NIX platforms.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 10:52                                       ` Alan Burlison
@ 2015-10-27 12:01                                         ` Eric Dumazet
  2015-10-27 12:27                                           ` Alan Burlison
  2015-10-27 13:42                                         ` David Miller
                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-27 12:01 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Tue, 2015-10-27 at 10:52 +0000, Alan Burlison wrote:

> Unfortunately Hadoop isn't the only thing that pulls the shutdown() 
> trick, so I don't think there's a simple fix for this, as discussed 
> earlier in the thread. Having said that, if close() on Linux also did an 
> implicit shutdown() it would mean that well-written applications that 
> handled the scoping, sharing and reuse of FDs properly could just call 
> close() and have it work the same way across *NIX platforms.


Are non-multi-threaded applications considered well written?

listener = socket(...);
bind(listener, ...);
listen(listener, 10000);
Loop 1 10
  if (fork() == 0)
    do_accept(listener)

Now if a child does a close(listener), or is killed, you propose that it
does an implicit shutdown() and all other children no longer can
accept() ?

Surely you did not give all the details of how it is really working.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 12:01                                         ` Eric Dumazet
@ 2015-10-27 12:27                                           ` Alan Burlison
  2015-10-27 12:44                                             ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 12:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On 27/10/2015 12:01, Eric Dumazet wrote:

> Are non multi threaded applications considered well written ?
>
> listener = socket(...);
> bind(listener, ...);
> listen(fd, 10000);
> Loop 1 10
>    if (fork() == 0)
>      do_accept(listener)
>
> Now if a child does a close(listener), or is killed, you propose that it
> does an implicit shutdown() and all other children no longer can
> accept() ?

No, of course not. I made it quite clear I was talking about MT programs.

> Surely you did not gave all details on how it is really working.

In the case of Hadoop, it works the way I describe.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 12:27                                           ` Alan Burlison
@ 2015-10-27 12:44                                             ` Eric Dumazet
  0 siblings, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-27 12:44 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech

On Tue, 2015-10-27 at 12:27 +0000, Alan Burlison wrote:
> On 27/10/2015 12:01, Eric Dumazet wrote:
> 
> > Are non multi threaded applications considered well written ?
> >
> > listener = socket(...);
> > bind(listener, ...);
> > listen(fd, 10000);
> > Loop 1 10
> >    if (fork() == 0)
> >      do_accept(listener)
> >
> > Now if a child does a close(listener), or is killed, you propose that it
> > does an implicit shutdown() and all other children no longer can
> > accept() ?
> 
> No, of course not. I made it quite clear I was talking about MT programs.

Nothing is clear. Sorry.

Now what about programs using both the fork() model and MT, to get one MT
process per NUMA node?

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 13:42                                         ` David Miller
@ 2015-10-27 13:37                                           ` Alan Burlison
  2015-10-27 13:59                                             ` David Miller
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 13:37 UTC (permalink / raw)
  To: David Miller
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 13:42, David Miller wrote:

> This semantic would only exist after Linux version X.Y.Z and vendor
> kernels that decided to backport the feature.
>
> Ergo, this application would ironically be non-portable on Linux
> machines.

Yes, that's true enough, at least until nobody is using the old versions any more.

> If portable Linux applications have to handle the situation using
> existing facilities there is absolutely zero value to add it now
> because it only will add more complexity to applications handling
> things correctly because they will always have two cases to somehow
> conditionally handle under Linux.
>
> And if the intention is to just always assume the close() semantic
> thing is there, then you have given me a disincentive to ever add the
> facility.

If you took that argument to its logical extreme then you'd never make
any changes that altered existing behaviour, and that's patently
not the case.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 10:52                                       ` Alan Burlison
  2015-10-27 12:01                                         ` Eric Dumazet
@ 2015-10-27 13:42                                         ` David Miller
  2015-10-27 13:37                                           ` Alan Burlison
  2015-10-27 23:17                                         ` Al Viro
  2015-10-29 14:58                                         ` David Holland
  3 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-27 13:42 UTC (permalink / raw)
  To: Alan.Burlison
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 27 Oct 2015 10:52:46 +0000

> an implicit shutdown() it would mean that well-written applications
> that handled the scoping, sharing and reuse of FDs properly could just
> call close() and have it work the same way across *NIX platforms.

This semantic would only exist after Linux version X.Y.Z and vendor
kernels that decided to backport the feature.

Ergo, this application would ironically be non-portable on Linux
machines.

If portable Linux applications have to handle the situation using
existing facilities, there is absolutely zero value in adding it now,
because it will only add more complexity to applications that handle
things correctly: they will always have two cases to somehow
conditionally handle under Linux.

And if the intention is to just always assume the close() semantic
thing is there, then you have given me a disincentive to ever add the
facility.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 13:37                                           ` Alan Burlison
@ 2015-10-27 13:59                                             ` David Miller
  2015-10-27 14:13                                               ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-27 13:59 UTC (permalink / raw)
  To: Alan.Burlison
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 27 Oct 2015 13:37:26 +0000

> If you took that argument to it's logical extreme they you'd never
> make any changes that made changes to existing behaviour, and that's
> patently not the case.

You know exactly what I mean, and what you're saying here is just a
scarecrow distracting the discussion from the real issue.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 13:59                                             ` David Miller
@ 2015-10-27 14:13                                               ` Alan Burlison
  2015-10-27 14:39                                                 ` David Miller
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 14:13 UTC (permalink / raw)
  To: David Miller
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 13:59, David Miller wrote:

> From: Alan Burlison <Alan.Burlison@oracle.com>
> Date: Tue, 27 Oct 2015 13:37:26 +0000
>
>> If you took that argument to it's logical extreme they you'd never
>> make any changes that made changes to existing behaviour, and that's
>> patently not the case.
>
> You know exactly what I mean, and what you're saying here is just a
> scarecrow distracting the discussion from the real issue.

I think you probably mean "a straw man".

The problematic case is MT applications that don't manage sharing of FDs 
in a sensible way and need to have a way of terminating any accept() in 
other threads. On Linux that's currently done with a shutdown()+close(), 
on other platforms you can use open()+dup2(). However the Linux 
shutdown() mechanism relies on bending the POSIX semantics and the 
dup2() mechanism doesn't work on Linux as it also doesn't kick other 
threads off accept().

At the moment, on Linux you have to explicitly call shutdown() on a 
socket on which another thread may be sat in accept(). If closing the 
socket in one thread terminated any accept()s in other threads, in the 
same way that an explicit shutdown() does, then the explicit shutdown() 
wouldn't be needed for more sensibly written apps that weren't prone to 
FD recycling races. As far as I can tell, that would work cross-platform.
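
A minimal sketch of that Linux-specific sequence with pthreads (names are
illustrative and error handling is omitted): thread B uses shutdown() to kick
thread A out of accept(), and only closes the descriptor once nothing is using
it any more:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int listener;

static void *thread_a(void *arg)
{
	int conn = accept(listener, NULL, NULL);	/* blocks here */

	if (conn < 0)
		perror("accept");	/* EINVAL on Linux after the shutdown() */
	else
		close(conn);
	return arg;
}

int main(void)
{
	struct sockaddr_in sin;
	pthread_t a;

	listener = socket(AF_INET, SOCK_STREAM, 0);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(listener, (struct sockaddr *)&sin, sizeof(sin));
	listen(listener, 16);

	pthread_create(&a, NULL, thread_a, NULL);
	sleep(1);				/* let thread A block in accept() */

	/* thread B's side of the shutdown()+close() trick */
	shutdown(listener, SHUT_RDWR);		/* kicks thread A out of accept() */
	pthread_join(a, NULL);			/* nobody else touches the FD now */
	close(listener);			/* only now can the number be recycled */
	return 0;
}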

Ideally there'd be a single way of doing this that worked 
cross-platform, at the moment there isn't. And yes, even if such a 
mechanism was available now it would be some time before it could be 
assumed to be available everywhere. I don't know enough about the Linux 
implementation to know if there is a practical way around this, and of 
course even if such a change were made, potential breakage of existing 
code would be a concern. If there's a better, cross-platform way of 
doing this then I'm all ears.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 14:13                                               ` Alan Burlison
@ 2015-10-27 14:39                                                 ` David Miller
  2015-10-27 14:39                                                   ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-27 14:39 UTC (permalink / raw)
  To: Alan.Burlison
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 27 Oct 2015 14:13:31 +0000

> Ideally there'd be a single way of doing this that worked
> cross-platform, at the moment there isn't. And yes, even if such a
> mechanism was available now it would be some time before it could be
> assumed to be available everywhere.

You will never be able to assume it is available everywhere under
Linux.  Ever.

This is the fundamental issue that you seem to completely not
understand.

You cannot just assume 5 years from now or whatever that the close()
thing is there even if I added it to the tree right now.

Your intent is to somewhere down the road assume this, and therefore
distribute a broken piece of infrastructure that only works on some
Linux systems.

This is not acceptable.

The backwards compat code will need to be in your code forever.  There
is no way around it.  That is, again, unless you want your code to not
work on a non-trivial number of Linux systems out there.

Making this worse is that there isn't going to be a straightforward
nor reliable way to test for the presence of this at run time.

You _have_ a way to accomplish what you want to do today and it works
on every Linux system on the planet.

Given the constraints, and the fact that you're going to have to
account for this situation somehow in your code forever, I see very
little to no value in adding the close() thing.

So your cross-platform unified behavior goal is simply unobtainable.
So please deal with reality rather than wishful, impractical things.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 14:39                                                 ` David Miller
@ 2015-10-27 14:39                                                   ` Alan Burlison
  2015-10-27 15:04                                                     ` David Miller
  0 siblings, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 14:39 UTC (permalink / raw)
  To: David Miller
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 14:39, David Miller wrote:

> You will never be able to assume it is available everywhere under
> Linux.  Ever.
>
> This is the fundamental issue that you seem to completely not
> understand.

If that was true in general then Linux would be dead and there would be 
no point ever adding any new features to it.

> You cannot just assume 5 years from now or whatever that the close()
> thing is there even if I added it to the tree right now.

No, a configure-time feature probe would be needed, as is generally 
considered to be good practice.

> Your intent is to somewhere down the road assume this, and therefore
> distribute a broken piece of infrastructure that only works on some
> Linux systems.
>
> This is not acceptable.

Acceptable to who? Unless you are claiming to speak for the authors of 
every piece of software that ever runs on Linux you can't make that 
assertion. For example, Hadoop relies on Linux CGroups to provide 
features you'd want in production deployments, yet CGroups only appeared 
in Linux 2.6.24.

> The backwards compat code will need to be in your code forever.  There
> is no way around it.  That is, again, unless you want your code to not
> work on a non-trivial number of Linux systems out there.

That's simply not true. Both Linux and Solaris have dropped support for 
old features in the past. Yes it takes a long time to do so but it's 
perfectly possible.

> Making this worse is that there isn't going to be a straightforward
> nor reliable way to test for the presence of this at run time.

I attached a test case to the original bug that demonstrates the 
platform-specific behaviours, it would be easy enough to modify that to 
use as a configure-time feature probe.

> You _have_ a way to accomplish what you want to do today and it works
> on every Linux system on the planet.
>
> Given the constraints, and the fact that you're going to have to
> account for this situation somehow in your code forever, I see very
> little to no value in adding the close() thing.

I think your assessment is unduly pessimistic.

> So your cross-platform unified behavior goal is simply unobtainable.
> So please deal with reality rather than wishful inpractical things.

If you don't think this is worth discussing because you personally don't 
believe cross-platform compatibility is worth bothering with then fine, 
say so. But then you need to persuade everyone else with a stake in 
Linux that your viewpoint is the only valid one, and you will have to 
also persuade them that Linux should no longer concern itself with POSIX 
compliance. I know *I* couldn't do that for Solaris, and with all due 
respect, I very much doubt *you* can do so for Linux.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 14:39                                                   ` Alan Burlison
@ 2015-10-27 15:04                                                     ` David Miller
  2015-10-27 15:53                                                       ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: David Miller @ 2015-10-27 15:04 UTC (permalink / raw)
  To: Alan.Burlison
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 27 Oct 2015 14:39:56 +0000

> On 27/10/2015 14:39, David Miller wrote:
> 
>> Making this worse is that there isn't going to be a straightforward
>> nor reliable way to test for the presence of this at run time.
> 
> I attached a test case to the original bug that demonstrates the
> platform-specific behaviours, it would be easy enough to modify that
> to use as a configure-time feature probe.

I said "run time".  Like when your final code runs on a target system.

Not at build tree configure time.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 15:04                                                     ` David Miller
@ 2015-10-27 15:53                                                       ` Alan Burlison
  0 siblings, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-27 15:53 UTC (permalink / raw)
  To: David Miller
  Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 15:04, David Miller wrote:

> I said "run time".  Like when your final code runs on a target system.
>
> Not at build tree configure time.

Yes, you could do it then as well.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 10:52                                       ` Alan Burlison
  2015-10-27 12:01                                         ` Eric Dumazet
  2015-10-27 13:42                                         ` David Miller
@ 2015-10-27 23:17                                         ` Al Viro
  2015-10-28  0:13                                           ` Eric Dumazet
  2015-10-28 16:04                                           ` Alan Burlison
  2015-10-29 14:58                                         ` David Holland
  3 siblings, 2 replies; 138+ messages in thread
From: Al Viro @ 2015-10-27 23:17 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, David Miller, eric.dumazet, stephen, netdev, dholland-tech

On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote:
> Unfortunately Hadoop isn't the only thing that pulls the shutdown()
> trick, so I don't think there's a simple fix for this, as discussed
> earlier in the thread. Having said that, if close() on Linux also
> did an implicit shutdown() it would mean that well-written
> applications that handled the scoping, sharing and reuse of FDs
> properly could just call close() and have it work the same way
> across *NIX platforms.

... except for all Linux, FreeBSD and OpenBSD versions out there, but
hey, who's counting those, right?  Not to mention the OSX behaviour -
I really have no idea what it does; the FreeBSD ancestry in its kernel
is distant enough for a lot of changes to have happened in that area.

So...  Which Unices other than Solaris and NetBSD actually behave that
way?  I.e. have close(fd) cancel accept(fd) another thread is sitting
in.  Note that NetBSD implementation has known races.  Linux, FreeBSD
and OpenBSD don't do that at all.

Frankly, as far as I'm concerned, the bottom line is
	* there are two variants of semantics in that area and there's not
much that could be done about that.
	* POSIX is vague enough for both variants to comply with it (it's
also very badly written in the area in question).
	* I don't see any way to implement something similar to Solaris
behaviour without a huge increase of memory footprint or massive cacheline
pingpong.  Solaris appears to go for memory footprint from hell - cacheline
per descriptor (instead of a pointer per descriptor).
	* the benefits of Solaris-style behaviour are not obvious - all things
equal it would be interesting, but the things are very much not equal.  What's
more, if your userland code is such that accept() argument could be closed by
another thread, the caller *cannot* do anything with said argument after
accept() returns, no matter which variant of semantics is used.
	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
with some use patterns.  The cacheline pingpong in the bitmap is probably
inevitable, unless we accept considerably heavier memory footprint,
but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
to trigger - close(3);open(...); will have the next open() after that
scanning the entire in-use bitmap.  I think I see a way to improve it
without slowing the normal case down, but I'll need to experiment a
bit before I post patches.  Anybody with examples of real-world loads
that make our descriptor allocator degrade is very welcome to post
the reproducers...
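
In lieu of a real-world load, a toy reproducer along those lines (the numbers,
/dev/null and the rlimit bump are arbitrary; this is a sketch, not a
benchmark): keep a large descriptor table populated, then repeatedly free a
low and a high slot so that every second open() has to walk the in-use bitmap
from the bottom to the top:

#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	struct rlimit rl = { 1 << 20, 1 << 20 };
	struct timeval t0, t1;
	int i, top = -1;

	setrlimit(RLIMIT_NOFILE, &rl);		/* may need privilege; a smaller
						   table still shows the effect */

	for (i = 0; i < (1 << 18); i++) {	/* populate the descriptor table */
		int fd = open("/dev/null", O_RDONLY);

		if (fd < 0)
			break;
		top = fd;
	}
	fprintf(stderr, "highest fd in use: %d\n", top);

	gettimeofday(&t0, NULL);
	for (i = 0; i < 1000; i++) {
		close(3);
		close(top);
		open("/dev/null", O_RDONLY);	/* reuses fd 3 cheaply */
		open("/dev/null", O_RDONLY);	/* scans all the way up to 'top' */
	}
	gettimeofday(&t1, NULL);
	fprintf(stderr, "1000 iterations: %ld us\n",
		(t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
	return 0;
}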

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 23:17                                         ` Al Viro
@ 2015-10-28  0:13                                           ` Eric Dumazet
  2015-10-28 12:35                                             ` Al Viro
  2015-10-28 16:04                                           ` Alan Burlison
  1 sibling, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-28  0:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Burlison, Casper.Dik, David Miller, stephen, netdev, dholland-tech

On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:

> 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator to degrade is very welcome to post
> the reproducers...

Well, I do have real-world loads, but quite hard to setup in a lab :(

Note that we also hit the 'struct cred'->usage refcount for every
open()/close()/sock_alloc(), and simply moving uid/gid out of the first
cache line really helps, as current_fsuid() and current_fsgid() no
longer forces a pingpong.

I moved seldom used fields on the first cache line, so that overall
memory usage did not change (192 bytes on 64 bit arches)


diff --git a/include/linux/cred.h b/include/linux/cred.h
index 8d70e1361ecd..460efae83522 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,7 +124,17 @@ struct cred {
 #define CRED_MAGIC     0x43736564
 #define CRED_MAGIC_DEAD        0x44656144
 #endif
-       kuid_t          uid;            /* real UID of the task */
+       struct rcu_head rcu;            /* RCU deletion hook */
+
+       kernel_cap_t    cap_inheritable; /* caps our children can inherit */
+       kernel_cap_t    cap_permitted;  /* caps we're permitted */
+       kernel_cap_t    cap_effective;  /* caps we can actually use */
+       kernel_cap_t    cap_bset;       /* capability bounding set */
+       kernel_cap_t    cap_ambient;    /* Ambient capability set */
+
+       kuid_t          uid ____cacheline_aligned_in_smp;
+                                       /* real UID of the task */
+
        kgid_t          gid;            /* real GID of the task */
        kuid_t          suid;           /* saved UID of the task */
        kgid_t          sgid;           /* saved GID of the task */
@@ -133,11 +143,6 @@ struct cred {
        kuid_t          fsuid;          /* UID for VFS ops */
        kgid_t          fsgid;          /* GID for VFS ops */
        unsigned        securebits;     /* SUID-less security management */
-       kernel_cap_t    cap_inheritable; /* caps our children can inherit */
-       kernel_cap_t    cap_permitted;  /* caps we're permitted */
-       kernel_cap_t    cap_effective;  /* caps we can actually use */
-       kernel_cap_t    cap_bset;       /* capability bounding set */
-       kernel_cap_t    cap_ambient;    /* Ambient capability set */
 #ifdef CONFIG_KEYS
        unsigned char   jit_keyring;    /* default keyring to attach requested
                                         * keys to */
@@ -152,7 +157,6 @@ struct cred {
        struct user_struct *user;       /* real user ID subscription */
        struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
        struct group_info *group_info;  /* supplementary groups for euid/fsgid */
-       struct rcu_head rcu;            /* RCU deletion hook */
 };
 
 extern void __put_cred(struct cred *);

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28  0:13                                           ` Eric Dumazet
@ 2015-10-28 12:35                                             ` Al Viro
  2015-10-28 13:24                                               ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-28 12:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

[Linus and Dave added, Solaris and NetBSD folks dropped from Cc]

On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> 
> > 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.  I think I see a way to improve it
> > without slowing the normal case down, but I'll need to experiment a
> > bit before I post patches.  Anybody with examples of real-world loads
> > that make our descriptor allocator to degrade is very welcome to post
> > the reproducers...
> 
> Well, I do have real-world loads, but quite hard to setup in a lab :(
> 
> Note that we also hit the 'struct cred'->usage refcount for every
> open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> cache line really helps, as current_fsuid() and current_fsgid() no
> longer forces a pingpong.
> 
> I moved seldom used fields on the first cache line, so that overall
> memory usage did not change (192 bytes on 64 bit arches)

[snip]

Makes sense, but there's a funny thing about that refcount - the part
coming from ->f_cred is the most frequently changed *and* almost all
places using ->f_cred are just looking at its fields and do not manipulate
its refcount.  The only exception (do_process_acct()) is easy to eliminate
just by storing a separate reference to the current creds of acct(2) caller
and using it instead of looking at ->f_cred.  What's more, the place where we
grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
most of the time such a reference is there for dropping ->f_cred (in
file_free()/file_free_rcu()).

With that change in kernel/acct.c done, we could do the following:
	a) split the cred refcount into the normal and percpu parts and
add a spinlock in there.
	b) have put_cred() do this:
		if (atomic_dec_and_test(&cred->usage)) {
			this_cpu_add(&cred->f_cred_usage, 1);
			call_rcu(&cred->rcu, put_f_cred_rcu);
		}
	c) have get_empty_filp() increment current_cred ->f_cred_usage with
this_cpu_add()
	d) have file_free() do
		percpu_counter_dec(&nr_files);
		rcu_read_lock();
		if (likely(atomic_read(&f->f_cred->usage))) {
			this_cpu_add(&f->f_cred->f_cred_usage, -1);
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
		} else {
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
		}
file_free_rcu() being
static void file_free_rcu(struct rcu_head *head)
{
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
        put_f_cred(&f->f_cred->rcu);
        kmem_cache_free(filp_cachep, f);
}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
	spin_lock cred->lock
	this_cpu_add(&cred->f_cred_usage, -1);
	find the sum of cred->f_cred_usage
	spin_unlock cred->lock
	if the sum has not reached 0
		return
	current put_cred_rcu(cred)

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
(most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via
call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
&f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
different rcu_head are used.

atomic_read() check in file_free() might give false positives if it comes
just before put_cred() on another CPU kills the last non-f_cred reference.
It's not a problem, since put_f_cred() from that put_cred() won't be
executed until we drop rcu_read_lock(), so we can safely decrement the
cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
be dropping the last of that - the same put_cred() would've incremented
->f_cred_usage).

Does anybody see problems with that approach?  I'm going to grab some sleep
(only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
field-reordering patch when I get up...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 12:35                                             ` Al Viro
@ 2015-10-28 13:24                                               ` Eric Dumazet
  2015-10-28 14:47                                                 ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-28 13:24 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, 2015-10-28 at 12:35 +0000, Al Viro wrote:
> [Linus and Dave added, Solaris and NetBSD folks dropped from Cc]
> 
> On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> > 
> > > 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > > inevitable, unless we accept considerably heavier memory footprint,
> > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > > to trigger - close(3);open(...); will have the next open() after that
> > > scanning the entire in-use bitmap.  I think I see a way to improve it
> > > without slowing the normal case down, but I'll need to experiment a
> > > bit before I post patches.  Anybody with examples of real-world loads
> > > that make our descriptor allocator to degrade is very welcome to post
> > > the reproducers...
> > 
> > Well, I do have real-world loads, but quite hard to setup in a lab :(
> > 
> > Note that we also hit the 'struct cred'->usage refcount for every
> > open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> > cache line really helps, as current_fsuid() and current_fsgid() no
> > longer forces a pingpong.
> > 
> > I moved seldom used fields on the first cache line, so that overall
> > memory usage did not change (192 bytes on 64 bit arches)
> 
> [snip]
> 
> Makes sense, but there's a funny thing about that refcount - the part
> coming from ->f_cred is the most frequently changed *and* almost all
> places using ->f_cred are just looking at its fields and do not manipulate
> its refcount.  The only exception (do_process_acct()) is easy to eliminate
> just by storing a separate reference to the current creds of acct(2) caller
> and using it instead of looking at ->f_cred.  What's more, the place where we
> grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
> most of the time such a reference is there for dropping ->f_cred (in
> file_free()/file_free_rcu()).
> 
> With that change in kernel/acct.c done, we could do the following:
> 	a) split the cred refcount into the normal and percpu parts and
> add a spinlock in there.
> 	b) have put_cred() do this:
> 		if (atomic_dec_and_test(&cred->usage)) {
> 			this_cpu_add(&cred->f_cred_usage, 1);
> 			call_rcu(&cred->rcu, put_f_cred_rcu);
> 		}
> 	c) have get_empty_filp() increment current_cred ->f_cred_usage with
> this_cpu_add()
> 	d) have file_free() do
> 		percpu_counter_dec(&nr_files);
> 		rcu_read_lock();
> 		if (likely(atomic_read(&f->f_cred->usage))) {
> 			this_cpu_add(&f->f_cred->f_cred_usage, -1);
> 			rcu_read_unlock();
> 			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
> 		} else {
> 			rcu_read_unlock();
> 			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> 		}
> file_free_rcu() being
> static void file_free_rcu(struct rcu_head *head)
> {
>         struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
>         put_f_cred(&f->f_cred->rcu);
>         kmem_cache_free(filp_cachep, f);
> }
> and file_free_rcu_light() - the same sans put_f_cred();
> 
> with put_f_cred() doing
> 	spin_lock cred->lock
> 	this_cpu_add(&cred->f_cred_usage, -1);
> 	find the sum of cred->f_cred_usage
> 	spin_unlock cred->lock
> 	if the sum has not reached 0
> 		return
> 	current put_cred_rcu(cred)
> 
> IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
> (most of) ->f_cred dropping.
> 
> Note that there are two paths leading to put_f_cred() in the above - via
> call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
> &f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
> different rcu_head are used.
> 
> atomic_read() check in file_free() might give false positives if it comes
> just before put_cred() on another CPU kills the last non-f_cred reference.
> It's not a problem, since put_f_cred() from that put_cred() won't be
> executed until we drop rcu_read_lock(), so we can safely decrement the
> cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
> be dropping the last of that - the same put_cred() would've incremented
> ->f_cred_usage).
> 
> Does anybody see problems with that approach?  I'm going to grab some sleep
> (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
> field-reordering patch when I get up...

Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?

Thanks !

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 13:24                                               ` Eric Dumazet
@ 2015-10-28 14:47                                                 ` Eric Dumazet
  2015-10-28 21:13                                                   ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-28 14:47 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:

> Before I take a deep look at your suggestion, are you sure plain use of
> include/linux/percpu-refcount.h infra is not possible for struct cred ?

BTW, I am not convinced we need to spend so much energy and per-cpu
memory for struct cred refcount.

The big problem is fd array spinlock of course and bitmap search for
POSIX compliance.

The cache line trashing in struct cred is a minor one ;)

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 23:17                                         ` Al Viro
  2015-10-28  0:13                                           ` Eric Dumazet
@ 2015-10-28 16:04                                           ` Alan Burlison
  1 sibling, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-28 16:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Casper.Dik, David Miller, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 23:17, Al Viro wrote:

> Frankly, as far as I'm concerned, the bottom line is
> 	* there are two variants of semantics in that area and there's not
> much that could be done about that.

Yes, that seems to be the case.

> 	* POSIX is vague enough for both variants to comply with it (it's
> also very badly written in the area in question).

On that aspect I disagree: the POSIX semantics seem clear to me, and they are 
different to the Linux behaviour.

> 	* I don't see any way to implement something similar to Solaris
> behaviour without a huge increase of memory footprint or massive cacheline
> pingpong.  Solaris appears to go for memory footprint from hell - cacheline
> per descriptor (instead of a pointer per descriptor).

Yes, that does seem to be the case. Thanks for the detailed explanation 
you've provided as to why that's so.

> 	* the benefits of Solaris-style behaviour are not obvious - all things
> equal it would be interesting, but the things are very much not equal.  What's
> more, if your userland code is such that accept() argument could be closed by
> another thread, the caller *cannot* do anything with said argument after
> accept() returns, no matter which variant of semantics is used.

Yes, irrespective of how you terminate the accept, once it returns with 
an error it's unsafe to use the FD, with the exception of failures such 
as EAGAIN, EINTR, etc. However, the shutdown() behaviour of Linux is not 
POSIX compliant, and allowing an accept to continue on a FD that's been 
closed doesn't seem correct either.

> 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator to degrade is very welcome to post
> the reproducers...

It looks like the remaining discussion is going to be about Linux 
implementation details so I'll bow out at this point. Thanks again for 
all the helpful explanation.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 14:47                                                 ` Eric Dumazet
@ 2015-10-28 21:13                                                   ` Al Viro
  2015-10-28 21:44                                                     ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-28 21:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> 
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> 
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
> 
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
> 
> The cache line trashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in
there is "all other references are gone", and that can happen in
interesting locking environments.  I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but
it's not the only problem in the existing implementation - close a small
descriptor when you've got a lot of them and look for the second open
after that.  _That_ can lead to thousands of cachelines being read through,
all under the table spinlock.  It's literally orders of magnitude worse.
And if the first open after that close happens to be for a short-living
descriptor, you'll get the same situation back in your face as soon as you
close it.

I think we can seriously improve that without screwing the fast path by
adding "summary" bitmaps once the primary grows past the cacheline worth
of bits.  With bits in the summary bitmap corresponding to cacheline-sized
chunks of the primary, being set iff all bits in the corresponding chunk
are set.  If the summary map grows larger than one cacheline, add the
second-order one (that happens at quarter million descriptors and serves
until 128 million; adding the third-order map is probably worthless).

I want to maintain the same kind of "everything below this is known to be
in use" thing as we do now.  Allocation would start with looking into the
same place in primary bitmap where we'd looked now and similar search
forward for zero bit.  _However_, it would stop at cacheline boundary.
If nothing had been found, we look in the corresponding place in the
summary bitmap and search for zero bit there.  Again, no more than up
to the cacheline boundary.  If something is found, we've got a chunk in
the primary known to contain a zero bit; if not - go to the second-level
and search there, etc.

When a zero bit in the primary had been found, check if it's within the
rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
that bit.  If there are zero bits left in the same word - we are done,
otherwise check the still unread words in the cacheline and see if all
of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified.
We might end up touching more than that, but it's going to be rare and
the cases when it happens are very likely to lead to much worse amount of
memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits
nearby and buggering off if there are such.  If the entire cacheline used
to be all-bits-set, clear the bit in summary and, if there's a second-order
summary, get the bit in there clear as well - it's probably not worth
bothering with checking that all the cacheline in summary bitmap had been
all-bits-set.  Again, the normal case is the same as now.

It'll need profiling and tuning, but AFAICS it's doable without making the
things worse than they are now, and it should get rid of those O(N) fetches
under spinlock cases.  And yes, those are triggerable and visible in
profiles.  IMO it's worth trying to fix...
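
To make the shape of that concrete, here is a purely illustrative userland
model of the idea - a single summary level, 512-bit (one cacheline) chunks,
no locking, no rlimit handling, and none of the real fdtable machinery; the
real version would also keep the "everything below this is in use" hint:

#include <stdbool.h>
#include <stdint.h>

#define WORD_BITS	64
#define CHUNK_BITS	512				/* one 64-byte cacheline */
#define MAX_FDS		(1 << 20)

static uint64_t primary[MAX_FDS / WORD_BITS];		    /* one bit per fd */
static uint64_t summary[MAX_FDS / CHUNK_BITS / WORD_BITS]; /* one bit per chunk */

/* linear search for a zero bit in [from, to); -1 if none */
static int find_zero(const uint64_t *map, int from, int to)
{
	for (int b = from; b < to; b++)
		if (!(map[b / WORD_BITS] & (1ULL << (b % WORD_BITS))))
			return b;
	return -1;
}

static bool chunk_full(int chunk)
{
	for (int w = 0; w < CHUNK_BITS / WORD_BITS; w++)
		if (~primary[chunk * (CHUNK_BITS / WORD_BITS) + w])
			return false;
	return true;
}

int alloc_fd(int hint)				/* caller guarantees hint < MAX_FDS */
{
	int chunk = hint / CHUNK_BITS;
	int fd = find_zero(primary, hint, (chunk + 1) * CHUNK_BITS);

	if (fd < 0) {
		/* chunk exhausted: the summary says which later chunk has room */
		chunk = find_zero(summary, chunk + 1, MAX_FDS / CHUNK_BITS);
		if (chunk < 0)
			return -1;
		fd = find_zero(primary, chunk * CHUNK_BITS, (chunk + 1) * CHUNK_BITS);
	}
	primary[fd / WORD_BITS] |= 1ULL << (fd % WORD_BITS);
	if (chunk_full(fd / CHUNK_BITS))
		summary[(fd / CHUNK_BITS) / WORD_BITS] |= 1ULL << ((fd / CHUNK_BITS) % WORD_BITS);
	return fd;
}

void free_fd(int fd)
{
	primary[fd / WORD_BITS] &= ~(1ULL << (fd % WORD_BITS));
	summary[(fd / CHUNK_BITS) / WORD_BITS] &= ~(1ULL << ((fd / CHUNK_BITS) % WORD_BITS));
}

The common case touches exactly one cacheline worth of the primary bitmap;
the summary (and a possible second-order map) is only consulted when that
chunk turns out to be completely full.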

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 21:13                                                   ` Al Viro
@ 2015-10-28 21:44                                                     ` Eric Dumazet
  2015-10-28 22:33                                                       ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-28 21:44 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, 2015-10-28 at 21:13 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> > 
> > > Before I take a deep look at your suggestion, are you sure plain use of
> > > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> > 
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> > 
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> > 
> > The cache line trashing in struct cred is a minor one ;)
> 
> percpu-refcount isn't convenient - the only such candidate for ref_kill in
> there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit, TBH...

OK then ...

> Cacheline pingpong on the descriptors bitmap is probably inevitable, but
> it's not the only problem in the existing implementation - close a small
> descriptor when you've got a lot of them and look for the second open
> after that.  _That_ can lead to thousands of cachelines being read through,
> all under the table spinlock.  It's literally orders of magnitude worse.
> And if the first open after that close happens to be for a short-living
> descriptor, you'll get the same situation back in your face as soon as you
> close it.
> 
> I think we can seriously improve that without screwing the fast path by
> adding "summary" bitmaps once the primary grows past the cacheline worth
> of bits.  With bits in the summary bitmap corresponding to cacheline-sized
> chunks of the primary, being set iff all bits in the corresponding chunk
> are set.  If the summary map grows larger than one cacheline, add the
> second-order one (that happens at quarter million descriptors and serves
> until 128 million; adding the third-order map is probably worthless).
> 
> I want to maintain the same kind of "everything below this is known to be
> in use" thing as we do now.  Allocation would start with looking into the
> same place in primary bitmap where we'd looked now and similar search
> forward for zero bit.  _However_, it would stop at cacheline boundary.
> If nothing had been found, we look in the corresponding place in the
> summary bitmap and search for zero bit there.  Again, no more than up
> to the cacheline boundary.  If something is found, we've got a chunk in
> the primary known to contain a zero bit; if not - go to the second-level
> and search there, etc.
> 
> When a zero bit in the primary had been found, check if it's within the
> rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
> that bit.  If there are zero bits left in the same word - we are done,
> otherwise check the still unread words in the cacheline and see if all
> of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.
> 
> Normal case is exactly the same as now - one cacheline accessed and modified.
> We might end up touching more than that, but it's going to be rare and
> the cases when it happens are very likely to lead to much worse amount of
> memory traffic with the current code.
> 
> Freeing is done by zeroing the bit in primary, checking for other zero bits
> nearby and buggering off if there are such.  If the entire cacheline used
> to be all-bits-set, clear the bit in summary and, if there's a second-order
> summary, get the bit in there clear as well - it's probably not worth
> bothering with checking that all the cacheline in summary bitmap had been
> all-bits-set.  Again, the normal case is the same as now.
> 
> It'll need profiling and tuning, but AFAICS it's doable without making the
> things worse than they are now, and it should get rid of those O(N) fetches
> under spinlock cases.  And yes, those are triggerable and visible in
> profiles.  IMO it's worth trying to fix...
> 

Well, all this complexity goes away with a O_FD_FASTALLOC /
SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
tells the kernel we do not care about getting the lowest possible fd as POSIX
mandates.

With this bit set, the bitmap search can start at a random point, and we
find a slot in O(1) : one cache line miss, if you have at least one free
bit/slot per 512 bits (a 64-byte cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

... // active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
... // passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt
lot of applications absolutely rely on the POSIX thing...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 21:44                                                     ` Eric Dumazet
@ 2015-10-28 22:33                                                       ` Al Viro
  2015-10-28 23:08                                                         ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-28 22:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:

> Well, all this complexity goes away with a O_FD_FASTALLOC /
> SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> tells the kernel we do not care about getting the lowest possible fd as POSIX
> mandates.

... which won't do a damn thing for existing userland.

> Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> lot of applications absolutely rely on the POSIX thing...

We obviously can't turn that into default behaviour, though.  BTW, what
distribution do you have in mind for those random descriptors?  Uniform
on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
memory footprint pretty soon...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 22:33                                                       ` Al Viro
@ 2015-10-28 23:08                                                         ` Eric Dumazet
  2015-10-29  0:15                                                           ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-28 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, 2015-10-28 at 22:33 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
> 
> > Well, all this complexity goes away with a O_FD_FASTALLOC /
> > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> > tells the kernel we do not care about getting the lowest possible fd as POSIX
> > mandates.
> 
> ... which won't do a damn thing for existing userland.

For userland that needs 5,000,000+ sockets, I can tell you they will be
using this flag as soon as they are aware it exists ;)

> 
> > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > lot of applications absolutely rely on the POSIX thing...
> 
> We obviously can't turn that into default behaviour, though.  BTW, what
> distribution do you have in mind for those random descriptors?  Uniform
> on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> memory footprint pretty soon...

Simply [0 , fdt->max_fds] is working well in most cases.




^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-28 23:08                                                         ` Eric Dumazet
@ 2015-10-29  0:15                                                           ` Al Viro
  2015-10-29  3:29                                                             ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-29  0:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > lot of applications absolutely rely on the POSIX thing...
> > 
> > We obviously can't turn that into default behaviour, though.  BTW, what
> > distribution do you have in mind for those random descriptors?  Uniform
> > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > memory footprint pretty soon...
> 
> Simply [0 , fdt->max_fds] is working well in most cases.

Umm...  So first you dup2() to establish the ->max_fds you want, then
do such opens?  What used/unused ratio do you expect to deal with?
And what kind of locking are you going to use?  Keep in mind that
e.g. dup2() is dependent on the lack of allocations while it's working,
so it's not as simple as "we don't need no stinkin' ->files_lock"...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  0:15                                                           ` Al Viro
@ 2015-10-29  3:29                                                             ` Eric Dumazet
  2015-10-29  4:16                                                               ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-29  3:29 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Thu, 2015-10-29 at 00:15 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > > lot of applications absolutely rely on the POSIX thing...
> > > 
> > > We obviously can't turn that into default behaviour, though.  BTW, what
> > > distribution do you have in mind for those random descriptors?  Uniform
> > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > > memory footprint pretty soon...
> > 
> > Simply [0 , fdt->max_fds] is working well in most cases.
> 
> Umm...  So first you dup2() to establish the ->max_fds you want, then
> do such opens?

Yes, dup2() is done at program startup, knowing the expected max load
(in terms of concurrent fds) + ~10% (the actual fd array size can be more
than this because of power-of-two rounding in alloc_fdtable())

But this is an optimization : If you do not use the initial dup2(), the
fd array can be automatically expanded if needed (all slots are in use)

>   What used/unused ratio do you expect to deal with?
> And what kind of locking are you going to use?  Keep in mind that
> e.g. dup2() is dependent on the lack of allocations while it's working,
> so it's not as simple as "we don't need no stinkin' ->files_lock"...

No locking change. files->file_lock is still taken.

We only want to minimize time to find an empty slot.

The trick is to not start bitmap search at files->next_fd, but a random
point. This is a win if we assume there are enough holes.

low = start;
if (low < files->next_fd)
    low = files->next_fd;

res = -1;
if (flags & O_FD_FASTALLOC) {
	random_point = pick_random_between(low, fdt->max_fds);

	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
				random_point);
	/* No empty slot found, try the other range */
	if (res >= fdt->max_fds) {
		res = find_next_zero_bit(fdt->open_fds,
				random_point, low);
		if (res >= random_point)
			res = -1;
	}
}
...





^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  3:29                                                             ` Eric Dumazet
@ 2015-10-29  4:16                                                               ` Al Viro
  2015-10-29 12:35                                                                 ` Eric Dumazet
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-29  4:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote:

> But this is an optimization : If you do not use the initial dup2(), the
> fd array can be automatically expanded if needed (all slots are in use)

Whee...

> No locking change. files->file_lock is still taken.
> 
> We only want to minimize time to find an empty slot.

Then I'd say that my variant is going to win.  It *will* lead to
cacheline pingpong in more cases than yours, but I'm quite sure that
it will be a win as far as the total amount of cachelines accessed.

> The trick is to not start bitmap search at files->next_fd, but a random
> point. This is a win if we assume there are enough holes.
> 
> low = start;
> if (low < files->next_fd)
>     low = files->next_fd;
> 
> res = -1;
> if (flags & O_FD_FASTALLOC) {
> 	random_point = pick_random_between(low, fdt->max_fds);
> 
> 	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
> 				random_point);
> 	/* No empty slot found, try the other range */
> 	if (res >= fdt->max_fds) {
> 		res = find_next_zero_bit(fdt->open_fds,
> 				random_point, low);
> 		if (res >= random_point)
> 			res = -1;
> 	}
> }

Have you tried to experiment with that in userland?  I mean, emulate that
thing in normal userland code, count the cacheline accesses and drive it
with the use patterns collected from actual applications.

I can sit down and play with math expectations, but I suspect that it's
easier to experiment.  It's nothing but an intuition (I hadn't seriously
done probability theory in quite a while, and my mathematical tastes run
more to geometry and topology anyway), but... I would expect it to degrade
badly when the bitmap is reasonably dense.

Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
not cheap - it's done via #PF triggered in kernel mode, with handler
noticing that the faulting address is in vmalloc range and doing the
right thing.  IOW, if your bitmap is very sparse, the price of page faults
needs to be taken into account.

AFAICS, the only benefit of that thing is keeping dirtied cachelines far
from each other.  Which might be a win overall, but I'm not convinced that
the rest won't offset the effect of that...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29  4:16                                                               ` Al Viro
@ 2015-10-29 12:35                                                                 ` Eric Dumazet
  2015-10-29 13:48                                                                   ` Eric Dumazet
  2015-10-30 17:18                                                                   ` Linus Torvalds
  0 siblings, 2 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-29 12:35 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Thu, 2015-10-29 at 04:16 +0000, Al Viro wrote:

> Have you tried to experiment with that in userland?  I mean, emulate that
> thing in normal userland code, count the cacheline accesses and drive it
> with the use patterns collected from actual applications.

Sure.

> 
> I can sit down and play with math expectations, but I suspect that it's
> easier to experiment.  It's nothing but an intuition (I hadn't seriously
> done probability theory in quite a while, and my mathematical tastes run
> more to geometry and topology anyway), but... I would expect it to degrade
> badly when the bitmap is reasonably dense.
> 
> Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
> not cheap - it's done via #PF triggered in kernel mode, with handler
> noticing that the faulting address is in vmalloc range and doing the
> right thing.  IOW, if your bitmap is very sparse, the price of page faults
> needs to be taken into account.

This vmalloc PF is pure noise.
This only matters for the very first allocations.

We target programs opening zillions of fd in their lifetime ;)

Not having to expand a 4,000,000 slots fd array while fully loaded also
removes a latency spike that is very often not desirable.


> 
> AFAICS, the only benefit of that thing is keeping dirtied cachelines far
> from each other.  Which might be a win overall, but I'm not convinced that
> the rest won't offset the effect of that...

Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
It also lowers time we hold the fd array spinlock while doing fd alloc.

Userland test program I wrote a few months back

Current kernel :

    64.98%  [kernel]          [k] queued_spin_lock_slowpath    
    14.88%  opensock          [.] memset    // this part simulates user land actual work ;)                   
    11.15%  [kernel]          [k] _find_next_bit.part.0        
     0.69%  [kernel]          [k] _raw_spin_lock               
     0.46%  [kernel]          [k] memset_erms                  
     0.38%  [kernel]          [k] sk_alloc                     
     0.37%  [kernel]          [k] kmem_cache_alloc             
     0.33%  [kernel]          [k] get_empty_filp               
     0.31%  [kernel]          [k] kmem_cache_free              
     0.26%  [kernel]          [k] __alloc_fd                   
     0.26%  opensock          [.] child_function               
     0.18%  [kernel]          [k] inode_init_always            
     0.17%  opensock          [.] __random_r                   


/*
 * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets)
 * - Size fd arrays in kernel to avoid resizings that kill latencies.
 * - Then launch xx threads doing
 *    populate the fd array of the process, opening 'max' files.
 *    
 *    - Loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0);
 *
 * Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]
 */

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

unsigned int count;
int skflags;

#define NBTHREADS_MAX 4096
pthread_t tid[NBTHREADS_MAX];
int nbthreads;
int nbthreads_req = 24;
int stop_all;

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

/* expand kernel fd array for optimal perf.
 * This could be done by doing a loop on dup(),
 * or can be done using dup2()
 */
int expand_fd_array(int max)
{
	int target, res;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd == -1) {
		perror("socket()");
		return -1;
	}
	for (;;) {
		count = max;
		target = count;
		if (skflags & SOCK_FD_FASTALLOC)
			target += count/10;
		res = dup2(fd, target);
		if (res != -1) {
			close(res);
			break;
		}
		max -= max/10;
	}
	printf("count=%u (check/increase ulimit -n)\n", count);
	return 0;
}

static char state[32] = {
	0, 1, 2, 3, 4, 5, 6, 7,
	8, 9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23,
	24, 25, 26, 27, 28, 29, 30, 31
};

/* each thread is using ~400 KB of data per unit of work */
#define WORKING_SET_SIZE 400000

static void *child_function(void *arg)
{
	unsigned int max = count / nbthreads_req;
	struct random_data buf = { 0 };	/* glibc requires state == NULL before initstate_r() */
	unsigned int idx;
	int *tab;
	unsigned long iter = 0;
	unsigned long *work_set = malloc(WORKING_SET_SIZE);
	int i;

	if (!work_set)
		return NULL;
	tab = malloc(max * sizeof(int));
	if (!tab) {
		free(work_set);
		return NULL;
	}
	memset(tab, 255, max * sizeof(int));
	
	initstate_r(getpid(), state, sizeof(state), &buf);

	tab[0] = socket(AF_INET, SOCK_STREAM | skflags, 0);
	for (i = 1; i < max; i++)
		tab[i] = dup(tab[0]);

	while (!stop_all) {
		random_r(&buf, &idx);
		idx = idx % max;
		close(tab[idx]);

		/* user space needs typically to use a bit of the memory. */
		memset(work_set, idx, WORKING_SET_SIZE);

		tab[idx] = socket(AF_INET, SOCK_STREAM | skflags, 0);
		if (tab[idx] == -1) {
			perror("socket");
			break;
		}
		iter++;
	}
	for (i = 0; i < max; i++)
		close(tab[i]);
	free(tab);
	free(work_set);
	printf("%lu\n", iter);
	return NULL;
}

static int launch_threads(void)
{
	int i, err;

	for (i = 0; i < nbthreads_req; i++) {
		err = pthread_create(&tid[i], NULL, child_function, NULL);
		if (err)
			return err;
		nbthreads++;
	}
	return 0;
}

static void wait_end(void)
{
	int i;
	for (i = 0; i < nbthreads; i++)
		pthread_join(tid[i], NULL);
}

static void usage(int code)
{
	fprintf(stderr, "Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int c;
	int max = 1000000;
	int duration = 10;

	while ((c = getopt(argc, argv, "fn:t:l:")) != -1) {
		switch (c) {
		case 'f':
			skflags = SOCK_FD_FASTALLOC;
			break;
		case 'n':
			max = atoi(optarg);
			break;
		case 't':
			nbthreads_req = atoi(optarg);
			if (nbthreads_req > NBTHREADS_MAX)
				usage(1);
			break;
		case 'l':
			duration = atoi(optarg);
			break;
		default:
			usage(1);
		}
	}
	system("sysctl -w fs.file-max=8000000");
	expand_fd_array(max);
	launch_threads();
	sleep(duration);
	stop_all = 1;
	wait_end();
}



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 12:35                                                                 ` Eric Dumazet
@ 2015-10-29 13:48                                                                   ` Eric Dumazet
  2015-10-30 17:18                                                                   ` Linus Torvalds
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-29 13:48 UTC (permalink / raw)
  To: Al Viro
  Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel

On Thu, 2015-10-29 at 05:35 -0700, Eric Dumazet wrote:

> Current kernel :
> 
>     64.98%  [kernel]          [k] queued_spin_lock_slowpath    
>     14.88%  opensock          [.] memset    // this part simulates user land actual work ;)                   
>     11.15%  [kernel]          [k] _find_next_bit.part.0        
>      0.69%  [kernel]          [k] _raw_spin_lock               
>      0.46%  [kernel]          [k] memset_erms                  
>      0.38%  [kernel]          [k] sk_alloc                     
>      0.37%  [kernel]          [k] kmem_cache_alloc             
>      0.33%  [kernel]          [k] get_empty_filp               
>      0.31%  [kernel]          [k] kmem_cache_free              
>      0.26%  [kernel]          [k] __alloc_fd                   
>      0.26%  opensock          [.] child_function               
>      0.18%  [kernel]          [k] inode_init_always            
>      0.17%  opensock          [.] __random_r                   

With the attached prototype patch we get this profile instead:

You can see we no longer hit the spinlock issue and cache waste
in find_next_bit.

Userland can really progress _much_ faster.

    76.86%  opensock          [.] memset                        
     1.31%  [kernel]          [k] _raw_spin_lock                
     1.15%  assd              [.] 0x000000000056f32c            
     1.08%  [kernel]          [k] kmem_cache_free               
     0.97%  [kernel]          [k] kmem_cache_alloc              
     0.83%  [kernel]          [k] sk_alloc                      
     0.72%  [kernel]          [k] memset_erms                   
     0.70%  opensock          [.] child_function                
     0.67%  [kernel]          [k] get_empty_filp                
     0.65%  [kernel]          [k] __alloc_fd                    
     0.58%  [kernel]          [k] __close_fd                    
     0.49%  [kernel]          [k] queued_spin_lock_slowpath     

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..eabb9a626259 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
+#include <linux/random.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -471,6 +472,19 @@ int __alloc_fd(struct files_struct *files,
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+
+	if (unlikely(flags & O_FD_FASTALLOC)) {
+		u32 rnd, limit = min(end, fdt->max_fds);
+
+		/*
+		 * Note: do not bother with files->next_fd,
+		 * this is for POSIX lovers...
+		 */
+		rnd = ((u64)prandom_u32() * limit) >> 32;
+		fd = find_next_zero_bit(fdt->open_fds, limit, rnd);
+		if (fd < limit)
+			goto ok;
+	}
 	fd = start;
 	if (fd < files->next_fd)
 		fd = files->next_fd;
@@ -499,7 +513,7 @@ repeat:
 
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
-
+ok:
 	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e28e6b7..3823d082af4c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063effe0cc1..badd421dd9f4 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_FD_FASTALLOC
+#define O_FD_FASTALLOC 0x40000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/net/socket.c b/net/socket.c
index 9963a0b53a64..6dde02b2eaf9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1227,9 +1227,10 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_FD_FASTALLOC & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1240,7 +1241,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	if (retval < 0)
 		goto out;
 
-	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK | O_FD_FASTALLOC));
 	if (retval < 0)
 		goto out_release;
 
@@ -1266,7 +1267,7 @@ SYSCALL_DEFINE4(socketpair, int, family, int, type, int, protocol,
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1436,7 +1437,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 	int err, len, newfd, fput_needed;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-27 10:52                                       ` Alan Burlison
                                                           ` (2 preceding siblings ...)
  2015-10-27 23:17                                         ` Al Viro
@ 2015-10-29 14:58                                         ` David Holland
  2015-10-29 15:18                                           ` Alan Burlison
  2015-10-30 17:43                                           ` David Laight
  3 siblings, 2 replies; 138+ messages in thread
From: David Holland @ 2015-10-29 14:58 UTC (permalink / raw)
  To: Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev,
	dholland-tech

On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote:
 > >But in general, this is basically a problem with the application: the file
 > >descriptor space is shared between threads and having one thread sniping
 > >at open files, you do have a problem and whatever the kernel does in that
 > >case perhaps doesn't matter all that much: the application needs to be
 > >fixed anyway.
 > 
 > The scenario in Hadoop is that the FD is being used by a thread that's
 > waiting in accept and another thread wants to shut it down, e.g. because
 > the application is terminating and needs to stop all threads cleanly.

ISTM that the best way to do this is to post a signal to the thread so
accept bails with EINTR, at which point it can check to see if it's
supposed to be exiting.

Otherwise it sounds like the call you're looking for is not close(2)
but revoke(2). Last I remember Linux doesn't have revoke because
there's no way to implement it that isn't a trainwreck.
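
For reference, a minimal sketch of that signal-plus-EINTR pattern (the use
of SIGUSR1, the 'stopping' flag and the helper names are illustrative, not
taken from any particular application):

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>

static volatile sig_atomic_t stopping;

static void wake_handler(int sig)
{
	(void)sig;			/* exists only to interrupt blocking syscalls */
}

void install_handler(void)
{
	struct sigaction sa = { .sa_handler = wake_handler };

	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;		/* no SA_RESTART: accept() fails with EINTR */
	sigaction(SIGUSR1, &sa, NULL);
}

void *acceptor(void *arg)
{
	int lfd = *(int *)arg;

	for (;;) {
		int cfd = accept(lfd, NULL, NULL);

		if (cfd < 0) {
			if (errno == EINTR) {
				if (stopping)
					break;	/* told to exit cleanly */
				continue;
			}
			perror("accept");
			break;
		}
		/* ... hand cfd off to a worker ... */
	}
	return NULL;
}

void request_stop(pthread_t tid)
{
	stopping = 1;
	pthread_kill(tid, SIGUSR1);	/* the blocked accept() returns EINTR */
}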

-- 
David A. Holland
dholland@netbsd.org

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 14:58                                         ` David Holland
@ 2015-10-29 15:18                                           ` Alan Burlison
  2015-10-29 16:01                                             ` David Holland
  2015-10-30 17:43                                           ` David Laight
  1 sibling, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-29 15:18 UTC (permalink / raw)
  To: David Holland
  Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev

On 29/10/2015 14:58, David Holland wrote:

> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Yes, you could use pthread_kill, but that would require keeping a list 
of the tids of all the threads that were using the FD, and that really 
just moves the problem elsewhere rather than fixing it.

> Otherwise it sounds like the call you're looking for is not close(2)
> but revoke(2). Last I remember Linux doesn't have revoke because
> there's no way to implement it that isn't a trainwreck.

close(2) as specified by POSIX works just fine on Solaris; if that 
was the case everywhere then it wouldn't be an issue. And for cases 
where it is necessary to keep the FD assigned because of races, the 
dup2(2) trick works fine as well.
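
For reference, a minimal sketch of that dup2(2) trick (the helper name is
made up): pointing the descriptor at /dev/null keeps the fd number reserved,
so it cannot be recycled by a racing open()/socket() in another thread;
whether a thread already blocked in accept() on that fd gets woken is exactly
the semantic difference being argued over in this thread.

#include <fcntl.h>
#include <unistd.h>

int neutralize_fd(int fd)
{
	int nullfd = open("/dev/null", O_RDWR);

	if (nullfd < 0)
		return -1;
	if (dup2(nullfd, fd) < 0) {	/* atomically repoints fd at /dev/null */
		close(nullfd);
		return -1;
	}
	close(nullfd);
	return 0;
}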

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 15:18                                           ` Alan Burlison
@ 2015-10-29 16:01                                             ` David Holland
  2015-10-29 16:15                                               ` Alan Burlison
  0 siblings, 1 reply; 138+ messages in thread
From: David Holland @ 2015-10-29 16:01 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Holland, Casper.Dik, Al Viro, David Miller, eric.dumazet,
	stephen, netdev

On Thu, Oct 29, 2015 at 03:18:40PM +0000, Alan Burlison wrote:
 > On 29/10/2015 14:58, David Holland wrote:
 > >ISTM that the best way to do this is to post a signal to the thread so
 > >accept bails with EINTR, at which point it can check to see if it's
 > >supposed to be exiting.
 > 
 > Yes, you could use pthread_kill, but that would require keeping a list of
 > the tids of all the threads that were using the FD, and that really just
 > moves the problem elsewhere rather than fixing it.

Hardly; it moves the burden of doing stupid things to the
application. If as you said the goal is to shut down all threads
cleanly, then it doesn't need to keep track in detail anyway; it can
just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for
some reason, or whatever.

 > >Otherwise it sounds like the call you're looking for is not close(2)
 > >but revoke(2). Last I remember Linux doesn't have revoke because
 > >there's no way to implement it that isn't a trainwreck.
 > 
 > close(2) as specified by POSIX works just fine on Solaris; if that was
 > the case everywhere then it wouldn't be an issue. And for cases where it is
 > necessary to keep the FD assigned because of races, the dup2(2) trick works
 > fine as well.

close(2) as specified by POSIX doesn't prohibit this weird revoke-like
behavior, but there's nothing in there that mandates it either. (I
thought this discussion had already clarified that.)

Note that while NetBSD apparently supports this behavior because
someone copied it from Solaris, I'm about to go recommend it be
removed.

-- 
David A. Holland
dholland@netbsd.org

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 16:01                                             ` David Holland
@ 2015-10-29 16:15                                               ` Alan Burlison
  2015-10-29 17:07                                                 ` Al Viro
  2015-10-30  5:44                                                 ` David Holland
  0 siblings, 2 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-29 16:15 UTC (permalink / raw)
  To: David Holland
  Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev

On 29/10/2015 16:01, David Holland wrote:

> Hardly; it moves the burden of doing stupid things to the
> application. If as you said the goal is to shut down all threads
> cleanly, then it doesn't need to keep track in detail anyway; it can
> just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for
> some reason, or whatever.

I agree that the root issue is poor application design, but posting a 
signal to every thread is not a solution if you only want to shut down a 
subset of threads.

> close(2) as specified by POSIX doesn't prohibit this weird revoke-like
> behavior, but there's nothing in there that mandates it either. (I
> thought this discussion had already clarified that.)

There was an attempt to interpret POSIX that way, with which I still 
disagree. If a FD is closed or reassigned then any current pending 
operations on it should be terminated.

> Note that while NetBSD apparently supports this behavior because
> someone copied it from Solaris, I'm about to go recommend it be
> removed.

Which behaviour? The abort accept() on close() behaviour?

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 16:15                                               ` Alan Burlison
@ 2015-10-29 17:07                                                 ` Al Viro
  2015-10-29 17:12                                                   ` Alan Burlison
  2015-10-30  1:55                                                   ` David Miller
  2015-10-30  5:44                                                 ` David Holland
  1 sibling, 2 replies; 138+ messages in thread
From: Al Viro @ 2015-10-29 17:07 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Holland, Casper.Dik, David Miller, eric.dumazet, stephen, netdev

On Thu, Oct 29, 2015 at 04:15:33PM +0000, Alan Burlison wrote:

> There was an attempt to interpret POSIX that way, with which I still
> disagree. If a FD is closed or reassigned then any current pending
> operations on it should be terminated.

Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact
place in POSIX that requires such behaviour?

This is getting ridiculous - if we are talking about POSIX-mandated behaviour
of close(), please show where is it mandated.  Using close(2) on a descriptor
that might be used by other threads is a bloody bad design in userland code -
I think everyone in this thread agrees on that.  Making that a recommended way
to do _anything_ is nuts.

Now, no userland code, however lousy it might be, should be able to screw
the system.  But that isn't the issue - our variant is providing that just
fine.

BTW, "cancel accept(2) because sending a signal is hard" is bogus anyway -
a thread in accept(3) just about to enter the kernel would get buggered
if another thread closes that descriptor and the third one does socket(2).
_IF_ you are doing that kind of "close a descriptor under other threads"
thing, you need to inform the potentially affected threads anyway, and
you'd better not rely on them being currently in kernel mode.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 17:07                                                 ` Al Viro
@ 2015-10-29 17:12                                                   ` Alan Burlison
  2015-10-30  1:54                                                     ` David Miller
  2015-10-30  1:55                                                   ` David Miller
  1 sibling, 1 reply; 138+ messages in thread
From: Alan Burlison @ 2015-10-29 17:12 UTC (permalink / raw)
  To: Al Viro; +Cc: David Miller, eric.dumazet, stephen, netdev

[off list]

On 29/10/2015 17:07, Al Viro wrote:

> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact
> place in POSIX that requires such behaviour?

If that's the way the conversation is going to go, sorry, no.

-- 
Alan Burlison
--

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 17:12                                                   ` Alan Burlison
@ 2015-10-30  1:54                                                     ` David Miller
  0 siblings, 0 replies; 138+ messages in thread
From: David Miller @ 2015-10-30  1:54 UTC (permalink / raw)
  To: Alan.Burlison; +Cc: viro, eric.dumazet, stephen, netdev

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Thu, 29 Oct 2015 17:12:44 +0000

> On 29/10/2015 17:07, Al Viro wrote:
> 
>> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote
>> the exact
>> place in POSIX that requires such behaviour?
> 
> If that's the way the conversation is going to go, sorry, no.

I find Al's request to be frankly quite reasonable, as is his
frustration expressed in his tone as well.

Furthermore, NetBSD's intention to try and get rid of the
close() on accept() behavior shows that the Linux developers
are not in the minority of being against requiring this
behavior or even finding it desirable in any way.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 17:07                                                 ` Al Viro
  2015-10-29 17:12                                                   ` Alan Burlison
@ 2015-10-30  1:55                                                   ` David Miller
  1 sibling, 0 replies; 138+ messages in thread
From: David Miller @ 2015-10-30  1:55 UTC (permalink / raw)
  To: viro
  Cc: Alan.Burlison, dholland-tech, Casper.Dik, eric.dumazet, stephen, netdev

From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Thu, 29 Oct 2015 17:07:48 +0000

> _IF_ you are doing that kind of "close a descriptor under other threads"
> thing, you need to inform the potentially affected threads anyway, and
> you'd better not rely on them being currently in kernel mode.

+1

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 16:15                                               ` Alan Burlison
  2015-10-29 17:07                                                 ` Al Viro
@ 2015-10-30  5:44                                                 ` David Holland
  1 sibling, 0 replies; 138+ messages in thread
From: David Holland @ 2015-10-30  5:44 UTC (permalink / raw)
  To: Alan Burlison
  Cc: David Holland, Casper.Dik, Al Viro, David Miller, eric.dumazet,
	stephen, netdev

On Thu, Oct 29, 2015 at 04:15:33PM +0000, Alan Burlison wrote:
 > >close(2) as specified by POSIX doesn't prohibit this weird revoke-like
 > >behavior, but there's nothing in there that mandates it either. (I
 > >thought this discussion had already clarified that.)
 > 
 > There was an attempt to interpret POSIX that way, with which I still
 > disagree. If a FD is closed or reassigned then any current pending
 > operations on it should be terminated.

C&V, please.

 > >Note that while NetBSD apparently supports this behavior because
 > >someone copied it from Solaris, I'm about to go recommend it be
 > >removed.
 > 
 > Which behaviour? The abort accept() on close() behaviour?

That, and aborting anything else too. Close isn't revoke.

-- 
David A. Holland
dholland@netbsd.org

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 12:35                                                                 ` Eric Dumazet
  2015-10-29 13:48                                                                   ` Eric Dumazet
@ 2015-10-30 17:18                                                                   ` Linus Torvalds
  2015-10-30 21:02                                                                     ` Al Viro
  1 sibling, 1 reply; 138+ messages in thread
From: Linus Torvalds @ 2015-10-30 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
> find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
> It also lowers time we hold the fd array spinlock while doing fd alloc.

I do wonder if we couldn't just speed up the bitmap allocator by an
order of magnitude. It would be nicer to be able to make existing
loads faster without any new "don't bother with POSIX semantics" flag.

We could, for example, extend the "open_fds" bitmap with a
"second-level" bitmap that has information on when the first level is
full. We traverse the open_fd's bitmap one word at a time anyway, we
could have a second-level bitmap that has one bit per word to say
whether that word is already full.

So you'd basically have:

 - open_fds: array with the real bitmap (exactly like we do now)
 - full_fds_bits: array of bits that shows which of the "unsigned
longs" in 'open_fds' are already full.

and then you can basically walk 4096 fd's at a time by just checking
one single word in the full_fds_bits[] array.

So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use
"find_next_fd()", which woulf do something like

    #define NR_BITS_PER_BIT (64*sizeof(long)*sizeof(long))

    unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
    {
        unsigned long fd;

        while (start < fdt->max_fds) {
            unsigned long startbit = start / BITS_PER_LONG;
            unsigned long bitbit = startbit / BITS_PER_LONG;
            unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG;

            if (bitmax > fdt->max_fds)
                bitmax = fdt->max_fds;

            // Are all the words covered by this full_fds_bits word full?
            if (!~fdt->full_fds_bits[bitbit]) {
                start = bitmax;
                continue;
            }

            fd = find_next_zero_bit(fdt->open_fds, bitmax, start);
            if (fd < bitmax)
                return fd;

            // no free bit at/after 'start' in this chunk: move on
            start = bitmax;
        }
        return fdt->max_fds;
    }

which really should cut down on the bit traversal by a factor of 64.

Of course, then you do have to maintain that bitmap-of-bits in
__set_open_fd() and __clear_open_fd(), but that should be pretty
trivial too:

 - __clear_open_fd() would become just

        __clear_bit(fd, fdt->open_fds);
        __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);

 - and __set_open_fd() would become

        __set_bit(fd, fdt->open_fds);
        fd /= BITS_PER_LONG;
        if (!~fdt->open_fds[fd])
            __set_bit(fd, fdt->full_fds_bits);

or something.

NOTE NOTE NOTE! The above is all very much written without any testing
in the email client, so while I tried to make it look like "real
code", consider the above just pseudo-code.

The advantage of the above is that it should just work for existing
binaries. It may not be quite as optimal as just introducing a new
"don't care about POSIX" feature, but quite frankly, if it cuts down
the bad case of "find_next_zero_bit()" by a factror of 64 (and then
adds a *small* expense factor on top of that), I suspect it should be
"good enough" even for your nasty case.

What do you think? Willing to try the above approach (with any
inevitable bug-fixes) and see how it compares?

Obviously in addition to any fixes to my pseudo-code above you'd need
to add the allocations for the new "full_fds_bits" etc, but I think it
should be easy to make the full_fds_bit allocation be *part* of the
"open_fds" allocation, so you wouldn't need a new allocation in
alloc_fdtable(). We already do that whole "use a single allocation" to
combine open_fds with close_on_exec into one single allocation.
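
For example, the single allocation could be carved up along these lines
(a sketch only; 'full_bits_bytes' stands for however many bytes the new
summary bitmap needs):

	data = alloc_fdmem(2 * nr / BITS_PER_BYTE + full_bits_bytes);
	fdt->open_fds = data;
	data += nr / BITS_PER_BYTE;	/* skip over open_fds */
	fdt->close_on_exec = data;
	data += nr / BITS_PER_BYTE;	/* skip over close_on_exec */
	fdt->full_fds_bits = data;	/* summary bits live at the tail */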

                    Linus

^ permalink raw reply	[flat|nested] 138+ messages in thread

* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-29 14:58                                         ` David Holland
  2015-10-29 15:18                                           ` Alan Burlison
@ 2015-10-30 17:43                                           ` David Laight
  2015-10-30 21:09                                             ` Al Viro
  1 sibling, 1 reply; 138+ messages in thread
From: David Laight @ 2015-10-30 17:43 UTC (permalink / raw)
  To: 'David Holland', Alan Burlison
  Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev

From: David Holland
> Sent: 29 October 2015 14:59
> On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote:
>  > >But in general, this is basically a problem with the application: the file
>  > >descriptor space is shared between threads and having one thread sniping
>  > >at open files, you do have a problem and whatever the kernel does in that
>  > >case perhaps doesn't matter all that much: the application needs to be
>  > >fixed anyway.
>  >
>  > The scenario in Hadoop is that the FD is being used by a thread that's
>  > waiting in accept and another thread wants to shut it down, e.g. because
>  > the application is terminating and needs to stop all threads cleanly.
> 
> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Actually, just send it a connect indication.
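
A minimal sketch of that wake-up (assuming the listener is reachable from
the same process, e.g. bound to loopback; error handling trimmed):

/* Wake a thread blocked in accept() on listen_fd by making a throw-away
 * connection to the address it is listening on. */
#include <sys/socket.h>
#include <unistd.h>

static int wake_acceptor(int listen_fd)
{
	struct sockaddr_storage ss;
	socklen_t len = sizeof(ss);
	int s, err = -1;

	if (getsockname(listen_fd, (struct sockaddr *)&ss, &len) < 0)
		return -1;
	s = socket(ss.ss_family, SOCK_STREAM, 0);
	if (s < 0)
		return -1;
	/* accept() in the blocked thread returns once this lands; that
	 * thread then checks an "exiting" flag and bails out cleanly. */
	if (connect(s, (struct sockaddr *)&ss, len) == 0)
		err = 0;
	close(s);
	return err;
}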

ISTM that the correct call should be listen(fd, 0);
Although that doesn't help a thread stuck in recvmsg() for a datagram.

It is also tempting to think that close(fd) should sleep until all
io activities using that fd have completed - whether or not blocking
calls are woken.

	David

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 17:18                                                                   ` Linus Torvalds
@ 2015-10-30 21:02                                                                     ` Al Viro
  2015-10-30 21:23                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-30 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote:

> I do wonder if we couldn't just speed up the bitmap allocator by an
> order of magnitude. It would be nicer to be able to make existing
> loads faster without any new "don't bother with POSIX semantics" flag.
> 
> We could, for example, extend the "open_fds" bitmap with a
> "second-level" bitmap that has information on when the first level is
> full. We traverse the open_fd's bitmap one word at a time anyway, we
> could have a second-level bitmap that has one bit per word to say
> whether that word is already full.

Your variant has 1:64 ratio; obviously better than now, but we can actually
do 1:bits-per-cacheline quite easily.

I've been playing with a variant that has more than two bitmaps, and
AFAICS it
	a) does not increase the number of cachelines pulled and
	b) keeps it well-bounded even in the absolute worst case
(128M-odd descriptors filled, followed by close(0);dup2(1,0); -
in that case it ends up accessing 7 cachelines' worth of
bitmaps; your variant will read through 4000-odd cachelines of
the summary bitmap alone; the mainline is even worse).
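
For scale (assuming 64-byte cachelines, i.e. 512-bit chunks, and 64-bit longs):

	level 1: one bit per 512 descriptors
	level 2: one bit per 512^2 = 262,144 descriptors
	level 3: one bit per 512^3 = 134,217,728 descriptors

so the 128M-odd case above is exactly the scale at which all three summary
levels come into play, while a flat 1:64 summary for 128M descriptors is
2M bits = 256KB, i.e. the 4000-odd cachelines mentioned.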

> The advantage of the above is that it should just work for existing
> binaries. It may not be quite as optimal as just introducing a new
> "don't care about POSIX" feature, but quite frankly, if it cuts down
> the bad case of "find_next_zero_bit()" by a factror of 64 (and then
> adds a *small* expense factor on top of that), I suspect it should be
> "good enough" even for your nasty case.
> 
> What do you think? Willing to try the above approach (with any
> inevitable bug-fixes) and see how it compares?
> 
> Obviously in addition to any fixes to my pseudo-code above you'd need
> to add the allocations for the new "full_fds_bits" etc, but I think it
> should be easy to make the full_fds_bit allocation be *part* of the
> "open_fds" allocation, so you wouldn't need a new allocation in
> alloc_fdtable(). We already do that whole "use a single allocation" to
> combine open_fds with close_on_exec into one single allocation.

I'll finish testing what I've got and post it; it costs 3 extra pointers
in the files_struct and a bit fatter bitmap allocation (less than 0.2%
extra).  All the arguments regarding the unmodified binaries apply, of
course, and so far it looks fairly compact...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 17:43                                           ` David Laight
@ 2015-10-30 21:09                                             ` Al Viro
  2015-11-04 15:54                                               ` David Laight
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-30 21:09 UTC (permalink / raw)
  To: David Laight
  Cc: 'David Holland',
	Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen,
	netdev

On Fri, Oct 30, 2015 at 05:43:21PM +0000, David Laight wrote:

> ISTM that the correct call should be listen(fd, 0);
> Although that doesn't help a thread stuck in recvmsg() for a datagram.
> 
> It is also tempting to think that close(fd) should sleep until all
> io activities using that fd have completed - whether or not blocking
> calls are woken.

Sigh...  The kernel has no idea when other threads are done with "all
io activities using that fd" - it can wait for them to leave kernel
mode, but there's fuck-all it can do about e.g. a userland loop
doing write() as long as there's more data to send.  And no, you can't
rely upon them catching EBADF on the next iteration - by the time they
get there, close() might very well have returned and open() from yet
another thread might've grabbed the same descriptor.  Welcome to your
data being written to hell knows what...

That's precisely the reason why "wait in close()" kind of semantics
is worthless - the races are still there, and having them a bit
harder to hit just makes them harder to debug.  Worse, it might
create an impression of safety where there's none to be had.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:02                                                                     ` Al Viro
@ 2015-10-30 21:23                                                                       ` Linus Torvalds
  2015-10-30 21:50                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 138+ messages in thread
From: Linus Torvalds @ 2015-10-30 21:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Your variant has 1:64 ratio; obviously better than now, but we can actually
> do 1:bits-per-cacheline quite easily.

Ok, but in that case you end up needing a counter for each cacheline
too (to count how many bits are set, in order to know when to say
"cacheline is entirely full").

Because otherwise you'll end up having to check the whole cacheline
(rather than check just one word, or check the "bit counter") when you
set a bit.

So I suspect a "bitmap of full words" ends up being much simpler.  But
hey, if you can make your more complex thing work, go right ahead.

               Linus

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:23                                                                       ` Linus Torvalds
@ 2015-10-30 21:50                                                                         ` Linus Torvalds
  2015-10-30 22:33                                                                           ` Al Viro
  2015-10-31  1:07                                                                           ` Eric Dumazet
  0 siblings, 2 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-10-30 21:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Your variant has 1:64 ratio; obviously better than now, but we can actually
>> do 1:bits-per-cacheline quite easily.
>
> Ok, but in that case you end up needing a counter for each cacheline
> too (to count how many bits, in order to know when to say "cacheline
> is entirely full").

So here's a largely untested version of my "one bit per word"
approach. It seems to work, but looking at it, I'm unhappy with a few
things:

 - using kmalloc() for the .full_fds_bits[] array is simple, but
disgusting, since 99% of all programs just have a single word.

   I know I talked about just adding the allocation to the same one
that allocates the bitmaps themselves, but I got lazy and didn't do
it. Especially since that code seems to try fairly hard to make the
allocations nice powers of two, according to the comments. That may
actually matter from an allocation standpoint.

 - Maybe we could just use that "full_fds_bits_init" field for when a
single word is sufficient, and avoid the kmalloc that way?

Anyway. This is a pretty simple patch, and I actually think that we
could just get rid of the "next_fd" logic entirely with this. That
would make this *patch* be more complicated, but it would make the
resulting *code* be simpler.

Hmm? Want to play with this? Eric, what does this do to your test-case?

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4966 bytes --]

 fs/file.c               | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/fdtable.h |  2 ++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..0b89a97cabb5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt)
 {
 	kvfree(fdt->fd);
 	kvfree(fdt->open_fds);
+	kfree(fdt->full_fds_bits);
 	kfree(fdt);
 }
 
@@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 
+	/*
+	 * The "bitmap of bitmaps" has a bit set for each word in
+	 * the open_fds array that is full. We initialize it to
+	 * zero both at allocation and at copying, because it is
+	 * important that it never have a bit set for a non-full
+	 * word, but it doesn't have to be exact otherwise.
+	 */
+	fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL);
+	if (!fdt->full_fds_bits)
+		goto out_nofull;
+
 	return fdt;
 
+out_nofull:
+	kvfree(fdt->open_fds);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +482,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +519,7 @@ repeat:
 		fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
 

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:50                                                                         ` Linus Torvalds
@ 2015-10-30 22:33                                                                           ` Al Viro
  2015-10-30 23:52                                                                             ` Linus Torvalds
  2015-10-31  1:07                                                                           ` Eric Dumazet
  1 sibling, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-30 22:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:

> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.

Dropping next_fd would screw you in case of strictly sequential allocations...

Your point re power-of-two allocations is well-taken, but then I'm not
sure that kzalloc() is good enough here.  Look: you have a bit for every
64 descriptors, i.e. a byte per 512.  For the 10M case Eric had been talking
about, that'll yield 32Kb worth of your secondary bitmap.  It's right on
the edge of the range where vmalloc() becomes attractive; for something
bigger it gets even worse...

Currently we go for vmalloc (on bitmaps) once we are past 128K descriptors
(two bitmaps packed together => 256Kbit = 32Kb).  kmalloc() is very
sensitive to size being a power of two, but IIRC vmalloc() isn't...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 22:33                                                                           ` Al Viro
@ 2015-10-30 23:52                                                                             ` Linus Torvalds
  2015-10-31  0:09                                                                               ` Al Viro
                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-10-30 23:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

On Fri, Oct 30, 2015 at 3:33 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
>
>> Anyway. This is a pretty simple patch, and I actually think that we
>> could just get rid of the "next_fd" logic entirely with this. That
>> would make this *patch* be more complicated, but it would make the
>> resulting *code* be simpler.
>
> Dropping next_fd would screw you in case of strictly sequential allocations...

I don't think it would matter in real life, since I don't really think
you have lots of fd's with strictly sequential behavior.

That said, the trivial "open lots of fds" benchmark would show it, so
I guess we can just keep it. The next_fd logic is not expensive or
complex, after all.

> Your point re power-of-two allocations is well-taken, but then I'm not
> sure that kzalloc() is good enough here.

Attached is an updated patch that just uses the regular bitmap
allocator and extends it to also have the bitmap of bitmaps. It
actually simplifies the patch, so I guess it's better this way.

Anyway, I've tested it all a bit more, and for a trivial worst-case
stress program that explicitly kills the next_fd logic by doing

    for (i = 0; i < 1000000; i++) {
        close(3);
        dup2(0,3);
        if (dup(0))
            break;
    }

it takes it down from roughly 10s to 0.2s. So the patch is quite
noticeable on that kind of pattern.

NOTE! You'll obviously need to increase your limits to actually be
able to do the above with lots of file descriptors.

I ran Eric's test-program too, and find_next_zero_bit() dropped to a
fraction of a percent. It's not entirely gone, but it's down in the
noise.

I really suspect this patch is "good enough" in reality, and I would
*much* rather do something like this than add a new non-POSIX flag
that people have to update their binaries for. I agree with Eric that
*some* people will do so, but it's still the wrong thing to do. Let's
just make performance with the normal semantics be good enough that we
don't need to play odd special games.

Eric?

                       Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4590 bytes --]

 fs/file.c               | 39 +++++++++++++++++++++++++++++++++++----
 include/linux/fdtable.h |  2 ++
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = alloc_fdmem(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
 	fdt->open_fds = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
+	data += nr / BITS_PER_BYTE;
+	fdt->full_fds_bits = data;
 
 	return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +470,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +507,7 @@ repeat:
 		fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
 

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                                             ` Linus Torvalds
@ 2015-10-31  0:09                                                                               ` Al Viro
  2015-10-31 15:59                                                                               ` Eric Dumazet
  2015-10-31 19:34                                                                               ` Al Viro
  2 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-31  0:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:

> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

IIRC, at least a part of what Eric used to complain about was that in
seriously multithreaded processes doing a lot of e.g. socket(2) we end up
with a lot of bouncing of the cacheline containing the first free bits in
the bitmap.  But looking at the whole thing, I really wonder if the tons of
threads asking for random bytes won't see at least as much cacheline bouncing
while getting said bytes, so I'm not sure that rationale has survived.

PS: this problem obviously exists in Linus' variant as well as in mine;
the question is whether Eric's approach manages to avoid it in the first place.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:50                                                                         ` Linus Torvalds
  2015-10-30 22:33                                                                           ` Al Viro
@ 2015-10-31  1:07                                                                           ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-31  1:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >> Your variant has 1:64 ratio; obviously better than now, but we can actually
> >> do 1:bits-per-cacheline quite easily.
> >
> > Ok, but in that case you end up needing a counter for each cacheline
> > too (to count how many bits, in order to know when to say "cacheline
> > is entirely full").
> 
> So here's a largely untested version of my "one bit per word"
> approach. It seems to work, but looking at it, I'm unhappy with a few
> things:
> 
>  - using kmalloc() for the .full_fds_bits[] array is simple, but
> disgusting, since 99% of all programs just have a single word.
> 
>    I know I talked about just adding the allocation to the same one
> that allocates the bitmaps themselves, but I got lazy and didn't do
> it. Especially since that code seems to try fairly hard to make the
> allocations nice powers of two, according to the comments. That may
> actually matter from an allocation standpoint.
> 
>  - Maybe we could just use that "full_fds_bits_init" field for when a
> single word is sufficient, and avoid the kmalloc that way?

At least make sure the allocation uses a cache line, so that multiple
processes do not share the same cache line for this possibly hot field:

fdt->full_fds_bits = kzalloc(max_t(size_t,
				   L1_CACHE_BYTES,
				   BITBIT_SIZE(nr)),
			     GFP_KERNEL);

> 
> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
> 
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far Linus, 500 % increase, thanks a lot !

Tested using 16 threads, 8 on Socket0, 8 on Socket1 

Before patch :

# ulimit -n 12000000
# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10   
count=10000000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3845134   (6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test
by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 -f
count=10000000 (check/increase ulimit -n)
total = 3505252

If I raise to 48 threads, the FAST_ALLOC wins by 5 % (3752087 instead of
3546666).

Oh, and 48 threads without any patches : 383027

-> program runs one order of magnitude faster, congrats !



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                                             ` Linus Torvalds
  2015-10-31  0:09                                                                               ` Al Viro
@ 2015-10-31 15:59                                                                               ` Eric Dumazet
  2015-10-31 19:34                                                                               ` Al Viro
  2 siblings, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-31 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote:
> > Dropping next_fd would screw you in case of strictly sequential allocations...
> 
> I don't think it would matter in real life, since I don't really think
> you have lots of fd's with strictly sequential behavior.
> 
> That said, the trivial "open lots of fds" benchmark would show it, so
> I guess we can just keep it. The next_fd logic is not expensive or
> complex, after all.

+1


> Attached is an updated patch that just uses the regular bitmap
> allocator and extends it to also have the bitmap of bitmaps. It
> actually simplifies the patch, so I guess it's better this way.
> 
> Anyway, I've tested it all a bit more, and for a trivial worst-case
> stress program that explicitly kills the next_fd logic by doing
> 
>     for (i = 0; i < 1000000; i++) {
>         close(3);
>         dup2(0,3);
>         if (dup(0))
>             break;
>     }
> 
> it takes it down from roughly 10s to 0.2s. So the patch is quite
> noticeable on that kind of pattern.
> 
> NOTE! You'll obviously need to increase your limits to actually be
> able to do the above with lots of file descriptors.
> 
> I ran Eric's test-program too, and find_next_zero_bit() dropped to a
> fraction of a percent. It's not entirely gone, but it's down in the
> noise.
> 
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

I absolutely agree a generic solution is far better, especially when
its performance is in par.

Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>

Note that a non-POSIX flag (or a thread personality hint)
would still allow the kernel to do proper NUMA affinity placement: say
the fd_array and bitmaps are split across the 2 nodes (or more, but most
servers nowadays really have 2 sockets).

Then at fd allocation time, we can prefer to pick an fd for which the memory
holding the various bits and the file pointer is on the local node.

This speeds up subsequent fd system calls in programs that constantly
blow away cpu caches, saving QPI transactions.
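
A purely illustrative sketch of that placement idea (node_start[] is
hypothetical; nothing like it exists in the kernel today):

	/* Hypothetical: bias the search towards a per-node slice of the
	 * table, so threads on different sockets tend to touch different,
	 * node-local bitmap cachelines and struct file pointers. */
	int node = numa_node_id();
	unsigned int from = max(start, files->node_start[node]);
	fd = find_next_fd(fdt, from);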

Thanks a lot Linus.

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3992764

lpaa24:~# ./opensock -t 48 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3545249

Profile with 16 threads :

    69.55%  opensock          [.] memset                       
    11.83%  [kernel]          [k] queued_spin_lock_slowpath    
     1.91%  [kernel]          [k] _find_next_bit.part.0        
     1.68%  [kernel]          [k] _raw_spin_lock               
     0.99%  [kernel]          [k] kmem_cache_alloc             
     0.99%  [kernel]          [k] memset_erms                  
     0.95%  [kernel]          [k] get_empty_filp               
     0.82%  [kernel]          [k] __close_fd                   
     0.73%  [kernel]          [k] __alloc_fd                   
     0.65%  [kernel]          [k] sk_alloc                     
     0.63%  opensock          [.] child_function               
     0.56%  [kernel]          [k] fput                         
     0.35%  [kernel]          [k] sock_alloc                   
     0.31%  [kernel]          [k] kmem_cache_free              
     0.31%  [kernel]          [k] inode_init_always            
     0.28%  [kernel]          [k] d_set_d_op                   
     0.27%  [kernel]          [k] entry_SYSCALL_64_after_swapgs

Profile with 48 threads :

    57.92%  [kernel]          [k] queued_spin_lock_slowpath    
    32.14%  opensock          [.] memset                       
     0.81%  [kernel]          [k] _find_next_bit.part.0        
     0.51%  [kernel]          [k] _raw_spin_lock               
     0.45%  [kernel]          [k] kmem_cache_alloc             
     0.38%  [kernel]          [k] kmem_cache_free              
     0.34%  [kernel]          [k] __close_fd                   
     0.32%  [kernel]          [k] memset_erms                  
     0.25%  [kernel]          [k] __alloc_fd                   
     0.24%  [kernel]          [k] get_empty_filp               
     0.23%  opensock          [.] child_function               
     0.18%  [kernel]          [k] __d_alloc                    
     0.17%  [kernel]          [k] inode_init_always            
     0.16%  [kernel]          [k] sock_alloc                   
     0.16%  [kernel]          [k] del_timer                    
     0.15%  [kernel]          [k] entry_SYSCALL_64_after_swapgs
     0.15%  perf              [.] 0x000000000004d924           
     0.15%  [kernel]          [k] tcp_close                    





^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 23:52                                                                             ` Linus Torvalds
  2015-10-31  0:09                                                                               ` Al Viro
  2015-10-31 15:59                                                                               ` Eric Dumazet
@ 2015-10-31 19:34                                                                               ` Al Viro
  2015-10-31 19:54                                                                                 ` Linus Torvalds
  2 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-31 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

... and here's the current variant of mine.  FWIW, it seems to survive
LTP and xfstests + obvious "let's torture the allocator".  On the
"million dups" test it seems to be about 25% faster than the one Linus
had posted, at ten millions - about 80%.  On opensock results seem to be
about 20% better than with the variant Linus has posted, but I'm not sure
if the testbox is anywhere near the expected, so I'd appreciate if you'd
given it a spin on your setups.

It obviously needs saner comments, tuning, etc.  BTW, another obvious
low-hanging fruit with this data structure is count_open_files() (and
that goes for the 1:64 bitmap Linus uses) - do dup2(0, 10000000); close(10000000);
fork(); and count_open_files() is left chewing through the damn bitmap from
about 16M down to the low tens.  While holding ->file_lock, at that...
I'm not saying it's critical, and it's definitely followup patch fodder
in either approach, but it's easy enough to do.

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..fa43cbe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,8 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +48,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -62,7 +66,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +75,53 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level)) {
+		unsigned long *p = nfdt->bitmaps[level - 1];
+		for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+			if (~p[n])
+				return;
+		__set_bit(0, nfdt->bitmaps[level]);
+	}
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+	int level = 0;
 
 	/*
 	 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +153,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 		goto out_fdt;
 	fdt->fd = data;
 
+	if (nr > BITS_PER_CHUNK)
+		nr = round_up(nr, BITS_PER_CHUNK);
 	data = alloc_fdmem(max_t(size_t,
 				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
-	fdt->open_fds = data;
+	fdt->bitmaps[0] = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
-
+	fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL;
+	while (unlikely(nr > BITS_PER_CHUNK)) {
+		nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		data = alloc_fdmem(nr);
+		if (!data)
+			goto out_bitmaps;
+		fdt->bitmaps[++level] = data;
+	}
 	return fdt;
 
+out_bitmaps:
+	while (level >= 0)
+		kvfree(fdt->bitmaps[level--]);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +280,47 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
+static bool set_and_check(unsigned long *start, unsigned n)
+{
+	int i;
+	start += (n & -BITS_PER_CHUNK) / BITS_PER_LONG;
+	n %= BITS_PER_CHUNK;
+	__set_bit(n, start);
+	for (i = 0; i < BITS_PER_CHUNK / BITS_PER_LONG; i++)
+		if (~*start++)
+			return true;
+	return false;
+}
+
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
 {
-	__set_bit(fd, fdt->open_fds);
+	int level;
+	for (level = 0; ; level++, fd /= BITS_PER_CHUNK) {
+		if (level == 3 || !fdt->bitmaps[level + 1]) {
+			__set_bit(fd, fdt->bitmaps[level]);
+			break;
+		}
+		if (likely(set_and_check(fdt->bitmaps[level], fd)))
+			break;
+	}
 }
 
 static inline void __clear_open_fd(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->open_fds);
+	int level;
+	unsigned long *p = fdt->bitmaps[0] + fd / BITS_PER_LONG, v;
+	v = *p;
+	__clear_bit(fd % BITS_PER_LONG, p);
+	if (~v)		// quick test to avoid looking at other cachelines
+		return;
+	for (level = 1; level <= 3; level++) {
+		if (!fdt->bitmaps[level])
+			break;
+		fd /= BITS_PER_CHUNK;
+		if (!test_bit(fd, fdt->bitmaps[level]))
+			break;
+		__clear_bit(fd, fdt->bitmaps[level]);
+	}
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -246,7 +330,7 @@ static int count_open_files(struct fdtable *fdt)
 
 	/* Find the last open fd */
 	for (i = size / BITS_PER_LONG; i > 0; ) {
-		if (fdt->open_fds[--i])
+		if (fdt->bitmaps[0][--i])
 			break;
 	}
 	i = (i + 1) * BITS_PER_LONG;
@@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 {
 	struct files_struct *newf;
 	struct file **old_fds, **new_fds;
-	int open_files, size, i;
+	int open_files, size, i, n;
 	struct fdtable *old_fdt, *new_fdt;
 
 	*errorp = -ENOMEM;
@@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
-	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->bitmaps[0] = newf->open_fds_init;
+	new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
-	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8);
+
+	n = round_up(open_files, BITS_PER_CHUNK);
+	for (i = 1; i <= 3; i++) {
+		if (!new_fdt->bitmaps[i])
+			break;
+		n /= BITS_PER_CHUNK;
+		n = round_up(n, BITS_PER_CHUNK);
+		memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8);
+	}
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		int left = (new_fdt->max_fds - open_files) / 8;
 		int start = open_files / BITS_PER_LONG;
 
-		memset(&new_fdt->open_fds[start], 0, left);
-		memset(&new_fdt->close_on_exec[start], 0, left);
+		memset(new_fdt->close_on_exec + start, 0, left);
+		if (likely(!new_fdt->bitmaps[1])) {
+			memset(new_fdt->bitmaps[0] + start, 0, left);
+			goto done;
+		}
+		start = round_up(open_files, BITS_PER_CHUNK);
+		n = round_up(new_fdt->max_fds, BITS_PER_CHUNK);
+		for (i = 0 ; i <= 3; i++) {
+			char *p = (void *)new_fdt->bitmaps[i];
+			if (!p)
+				break;
+			n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			memset(p + start / 8, 0, (n - start) / 8);
+		}
 	}
 
+done:
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
@@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		i = j * BITS_PER_LONG;
 		if (i >= fdt->max_fds)
 			break;
-		set = fdt->open_fds[j++];
+		set = fdt->bitmaps[0][j++];
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
@@ -453,70 +561,146 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
-		.open_fds	= init_files.open_fds_init,
+		.bitmaps[0]	= init_files.open_fds_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
-/*
- * allocate a file descriptor, mark it busy.
- */
+/* search for the next zero bit in cacheline */
+static unsigned scan(unsigned long *start, unsigned size, unsigned from,
+		int check_zeroes)
+{
+	unsigned long *p = start + from / BITS_PER_LONG, *q = p, *end;
+	unsigned bit = from % BITS_PER_LONG, res;
+	unsigned long v = *p, w = v + (1UL<<bit);
+
+	start += (from & -BITS_PER_CHUNK) / BITS_PER_LONG;
+	end = start + size / BITS_PER_LONG;
+
+	if (unlikely(!(w & ~v))) {
+		while (likely(++q < end)) {
+			v = *q;
+			w = v + 1;
+			if (likely(w))
+				goto got_it;
+		}
+		return size;	// not in this chunk
+	}
+got_it:
+	res = __ffs(w & ~v);		// log2, really - it's a power of 2
+	res += (q - start) * BITS_PER_LONG;
+	if (!check_zeroes)
+		return res;
+	if (likely(~(v | w)))		// would zeroes remain in *q?
+		return res;
+	if (p == q || !bit)		// was *p fully checked?
+		p--;
+	while (++q < end)		// any zeros in the tail?
+		if (likely(~*q))
+			return res;
+	if (unlikely(check_zeroes > 1))
+		for (q = start; q <= p; q++)
+			if (~*q)
+				return res;
+	return res | (1U<<31);
+}
+
 int __alloc_fd(struct files_struct *files,
 	       unsigned start, unsigned end, unsigned flags)
 {
-	unsigned int fd;
+	unsigned int fd, base;
 	int error;
 	struct fdtable *fdt;
-
+	unsigned count, level, n;
+	int summary;
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+	count = fdt->max_fds;
+	summary = 2;
 	fd = start;
-	if (fd < files->next_fd)
+	if (likely(fd <= files->next_fd)) {
 		fd = files->next_fd;
-
-	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
-
-	/*
-	 * N.B. For clone tasks sharing a files structure, this test
-	 * will limit the total number of files that can be opened.
-	 */
-	error = -EMFILE;
-	if (fd >= end)
-		goto out;
-
-	error = expand_files(files, fd);
-	if (error < 0)
+		summary = 1;
+	}
+	base = fd;
+	if (unlikely(base >= count))
+		goto expand2;
+	if (likely(!fdt->bitmaps[1])) {
+		base = scan(fdt->bitmaps[0], count, base, 0);
+		if (unlikely(base == count))
+			goto expand;
+		if (unlikely(base >= end)) {
+			error = -EMFILE;
+			goto out;
+		}
+		fd = base;
+		__set_bit(fd, fdt->bitmaps[0]);
+		goto found;
+	}
+	n = scan(fdt->bitmaps[0], BITS_PER_CHUNK, base, summary);
+	base &= -BITS_PER_CHUNK;
+	base += n & ~(1U<<31);
+	if (unlikely(n == BITS_PER_CHUNK)) {
+		int bits[3];
+		level = 0;
+		do {
+			bits[level] = count;
+			count = DIV_ROUND_UP(count, BITS_PER_CHUNK);
+			base /= BITS_PER_CHUNK;
+			n = scan(fdt->bitmaps[++level], BITS_PER_CHUNK, base, 0);
+			base &= -BITS_PER_CHUNK;
+			base += n;
+			if (unlikely(base >= count))
+				goto expand;
+		} while (unlikely(n == BITS_PER_CHUNK));
+		while (level--) {
+			base *= BITS_PER_CHUNK;
+			n = scan(fdt->bitmaps[level], BITS_PER_CHUNK, base, !level);
+			if (WARN_ON(n == BITS_PER_CHUNK)) {
+				error = -EINVAL;
+				goto out;
+			}
+			base += n & ~(1U<<31);
+			if (unlikely(base >= bits[level]))
+				goto expand;
+		}
+	}
+	if (unlikely(base >= end)) {
+		error = -EMFILE;
 		goto out;
-
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
-
-	if (start <= files->next_fd)
+	}
+	fd = base;
+	__set_bit(fd, fdt->bitmaps[0]);
+	if (unlikely(n & (1U << 31))) {
+		for (level = 1; ; level++) {
+			base /= BITS_PER_CHUNK;
+			if (level == 3 || !fdt->bitmaps[level + 1]) {
+				__set_bit(base, fdt->bitmaps[level]);
+				break;
+			}
+			if (likely(set_and_check(fdt->bitmaps[level], base)))
+				break;
+		}
+	}
+found:
+	if (summary == 1)
 		files->next_fd = fd + 1;
-
-	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
-
 out:
 	spin_unlock(&files->file_lock);
 	return error;
+expand:
+	base = fdt->max_fds;
+expand2:
+	error = expand_files(files, base);
+	if (error < 0)
+		goto out;
+	goto repeat;
 }
 
 static int alloc_fd(unsigned start, unsigned flags)
@@ -809,7 +993,8 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
+	if (!tofree)
+		__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
diff --git a/fs/select.c b/fs/select.c
index 0155473..670f542 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds)
 	set = ~(~0UL << (n & (BITS_PER_LONG-1)));
 	n /= BITS_PER_LONG;
 	fdt = files_fdtable(current->files);
-	open_fds = fdt->open_fds + n;
+	open_fds = fdt->bitmaps[0] + n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e2..6ef5274 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -25,7 +25,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
-	unsigned long *open_fds;
+	unsigned long *bitmaps[4];
 	struct rcu_head rcu;
 };
 
@@ -36,7 +36,7 @@ static inline bool close_on_exec(int fd, const struct fdtable *fdt)
 
 static inline bool fd_is_open(int fd, const struct fdtable *fdt)
 {
-	return test_bit(fd, fdt->open_fds);
+	return test_bit(fd, fdt->bitmaps[0]);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:34                                                                               ` Al Viro
@ 2015-10-31 19:54                                                                                 ` Linus Torvalds
  2015-10-31 20:29                                                                                   ` Al Viro
  2015-10-31 20:45                                                                                   ` Eric Dumazet
  0 siblings, 2 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-10-31 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... and here's the current variant of mine.

Ugh. I really liked how simple mine ended up being. Yours is definitely not.

And based on the profiles from Eric, finding the fd is no longer the
problem even with my simpler patch. The problem ends up being the
contention on the file_lock spinlock.

Eric, I assume that's not "expand_fdtable", since your test-program
seems to expand the fd array at the beginning. So it's presumably all
from the __alloc_fd() use, but we should double-check.. Eric, can you
do a callgraph profile and see which caller is the hottest?

                 Linus

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:54                                                                                 ` Linus Torvalds
@ 2015-10-31 20:29                                                                                   ` Al Viro
  2015-11-02  0:24                                                                                     ` Al Viro
  2015-10-31 20:45                                                                                   ` Eric Dumazet
  1 sibling, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-10-31 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.

Note that it's not the final variant - just something that should be
testable.  There are all kinds of things that still need cleaning/simplifying
in there - e.g. scan() is definitely more complex than needed (if nothing
else, the "small bitmap" case is simply find_next_zero_bit(), and the
rest all have size equal to a full cacheline).  Moreover, I'd overdone the
"... and check if there are other zero bits left" thing - earlier versions
of its callers used that a lot, but with the exception of two of them it was
absolutely worthless.  So it ended up more generic than necessary and
I'm going to trim that crap down.

It's still going to end up more complex than yours, obviously, but not as
badly as it is now.  I'm not sure if the speedup will be worth the
extra complexity, thus asking for testing...  On _some_ loads it is
considerably faster (at least by a factor of 5), but whether those loads
resemble anything that occurs on real systems...

BTW, a considerable amount of unpleasantness is due to ragged-right-end kinds
of problems - set /proc/sys/fs/nr_open to something other than a power of
two and a whole lot of fun issues start happening.  I went for "if there
are secondary bitmaps at all, pad all bitmaps to a multiple of a cacheline",
which at least somewhat mitigates that mess; hell knows, there might be
a clever way to sidestep it entirely...

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 19:54                                                                                 ` Linus Torvalds
  2015-10-31 20:29                                                                                   ` Al Viro
@ 2015-10-31 20:45                                                                                   ` Eric Dumazet
  2015-10-31 21:23                                                                                     ` Linus Torvalds
  1 sibling, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-10-31 20:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, 2015-10-31 at 12:54 -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> And based on the profiles from Eric, finding the fd is no longer the
> problem even with my simpler patch. The problem ends up being the
> contention on the file_lock spinlock.
> 
> Eric, I assume that's not "expand_fdtable", since your test-program
> seems to expand the fd array at the beginning. So it's presumably all
> from the __alloc_fd() use, but we should double-check.. Eric, can you
> do a callgraph profile and see which caller is the hottest?

Sure : profile taken while the test runs using 16 threads (since this is
probably not too biased a micro-benchmark...)

# hostname : lpaa24
# os release : 4.3.0-smp-DEV
# perf version : 3.12.0-6-GOOGLE
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 264126320 kB
# cmdline : /usr/bin/perf record -a -g sleep 4 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, msr = 38, uncore_cbox_10 = 35, uncore_cbox_11 = 36, software = 1, power = 7, uncore_irp = 8, uncore_pcu = 37, tracepoint = 2, uncore_
# Samples: 260K of event 'cycles'
# Event count (approx.): 196742182232
#
# Overhead        Command        Shared Object                                                                                                                
# ........  .............  ...................  ..............................................................................................................
#
    67.15%       opensock  opensock             [.] memset                                                                                                    
                 |
                 --- memset

    13.84%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath                                                                                 
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.97%-- _raw_spin_lock
                    |          |          
                    |          |--53.03%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--46.83%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.13%-- [...]
                     --0.03%-- [...]

     1.84%       opensock  [kernel.kallsyms]    [k] _find_next_bit.part.0                                                                                     
                 |
                 --- _find_next_bit.part.0
                    |          
                    |--65.97%-- find_next_zero_bit
                    |          __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--34.01%-- __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          |          
                    |           --100.00%-- 0x0
                     --0.02%-- [...]

     1.59%       opensock  [kernel.kallsyms]    [k] _raw_spin_lock                                                                                            
                 |
                 --- _raw_spin_lock
                    |          
                    |--28.78%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--26.53%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          
                    |--13.95%-- cache_alloc_refill
                    |          |          
                    |          |--99.48%-- kmem_cache_alloc
                    |          |          |          
                    |          |          |--81.20%-- sk_prot_alloc
                    |          |          |          sk_alloc
                    |          |          |          inet_create
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--8.43%-- sock_alloc_inode
                    |          |          |          alloc_inode
                    |          |          |          new_inode_pseudo
                    |          |          |          sock_alloc
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--5.80%-- __d_alloc
                    |          |          |          d_alloc_pseudo
                    |          |          |          sock_alloc_file
                    |          |          |          sock_map_fd
                    |          |          |          sys_socket



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 20:45                                                                                   ` Eric Dumazet
@ 2015-10-31 21:23                                                                                     ` Linus Torvalds
  2015-10-31 21:51                                                                                       ` Al Viro
  2015-10-31 22:34                                                                                       ` Eric Dumazet
  0 siblings, 2 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-10-31 21:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 1:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>     13.84%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
>                  |
>                  --- queued_spin_lock_slowpath
>                     |
>                     |--99.97%-- _raw_spin_lock
>                     |          |
>                     |          |--53.03%-- __close_fd
>                     |          |
>                     |          |--46.83%-- __alloc_fd

Interesting. "__close_fd" actually looks more expensive than
allocation. They presumably get called equally often, so it's probably
some cache effect.

__close_fd() doesn't do anything even remotely interesting as far as I
can tell, but it strikes me that we probably take a *lot* of cache
misses on the stupid "close-on-exec" flags, which are probably always
zero anyway.

Mind testing something really stupid, and making the __clear_bit() in
__clear_close_on_exec() conditional, something like this:

     static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
     {
    -       __clear_bit(fd, fdt->close_on_exec);
    +       if (test_bit(fd, fdt->close_on_exec))
    +               __clear_bit(fd, fdt->close_on_exec);
     }

and see if it makes a difference.

This is the kind of thing that a single-threaded (or even
single-socket) test will never actually show, because it caches well
enough. But for two sockets, I could imagine the unnecessary dirtying
of cachelines and ping-pong being noticeable.

The other stuff we probably can't do all that much about. Unless we
decide to go for some complicated lockless optimistic file descriptor
allocation scheme with retry-on-failure instead of locks. Which I'm
sure is possible, but I'm equally sure is painful.

                    Linus

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 21:23                                                                                     ` Linus Torvalds
@ 2015-10-31 21:51                                                                                       ` Al Viro
  2015-10-31 22:34                                                                                       ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-31 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 02:23:31PM -0700, Linus Torvalds wrote:

> The other stuff we probably can't do all that much about. Unless we
> decide to go for some complicated lockless optimistic file descriptor
> allocation scheme with retry-on-failure instead of locks. Which I'm
> sure is possible, but I'm equally sure is painful.

The interesting part is dup2() - we'd have to do something like
	serialize against other dup2
	was_claimed = atomically set and test bit in bitmap
	if was_claimed
		tofree = fdt->fd[fd];
		if (!tofree)
			fail with EBUSY
	install into ->fd[...]
	end of critical area
in there; __alloc_fd() could be made retry-on-failure, but I don't see
how to cope with dup2 vs. dup2 without an explicit exclusion.
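
Spelled out as entirely hypothetical code - the lock name is invented and
the interaction with table expansion is skipped - that would be something
like:

static int dup2_claim(struct files_struct *files, struct fdtable *fdt,
		      struct file *file, unsigned fd)
{
	struct file *tofree = NULL;

	mutex_lock(&files->dup2_mutex);		/* invented: dup2-vs-dup2 only */
	if (test_and_set_bit(fd, fdt->open_fds)) {
		/* bit already claimed: either a fully installed file
		 * (plain dup2 over an open descriptor) or a half-done
		 * __alloc_fd() that owns the bit but hasn't installed
		 * its file yet */
		tofree = fdt->fd[fd];
		if (!tofree) {
			mutex_unlock(&files->dup2_mutex);
			return -EBUSY;
		}
	}
	rcu_assign_pointer(fdt->fd[fd], file);	/* install into ->fd[...] */
	mutex_unlock(&files->dup2_mutex);
	if (tofree)
		filp_close(tofree, files);
	return 0;
}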

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 21:23                                                                                     ` Linus Torvalds
  2015-10-31 21:51                                                                                       ` Al Viro
@ 2015-10-31 22:34                                                                                       ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Eric Dumazet @ 2015-10-31 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Miller, Stephen Hemminger, Network Development,
	David Howells, linux-fsdevel

On Sat, 2015-10-31 at 14:23 -0700, Linus Torvalds wrote:

> Mind testing something really stupid, and making the __clear_bit() in
> __clear_close_on_exec() conditional, something like this:
> 
>      static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>      {
>     -       __clear_bit(fd, fdt->close_on_exec);
>     +       if (test_bit(fd, fdt->close_on_exec))
>     +               __clear_bit(fd, fdt->close_on_exec);
>      }
> 
> and see if it makes a difference.

It does ;)

About a 4% qps increase

3 runs : 
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
total = 4176651
total = 4178012
total = 4105226

instead of :
total = 3910620
total = 3874567
total = 3971028
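
(The opensock source isn't part of this thread; for readers trying to
picture what is being measured, the per-thread loop is presumably little
more than the guess below - option handling, the pre-expanded fd table and
the big user-space memset that dominates the profiles are all left out:)

#include <sys/socket.h>
#include <unistd.h>

/* guessed shape of one opensock worker thread, not the real program */
static unsigned long worker(volatile int *stop)
{
	unsigned long n = 0;

	while (!*stop) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);	/* sys_socket */
		if (fd < 0)
			break;
		close(fd);					/* sys_close */
		n++;		/* presumably what the "total" lines count */
	}
	return n;
}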

Perf profile :

    69.12%       opensock  opensock             [.] memset                                       
                 |
                 --- memset

    12.37%       opensock  [kernel.kallsyms]    [k] queued_spin_lock_slowpath                    
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.99%-- _raw_spin_lock
                    |          |          
                    |          |--51.99%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--47.79%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.21%-- [...]
                     --0.01%-- [...]

     1.92%       opensock  [kernel.kallsyms]    [k] _find_next_bit.part.0                        
                 |
                 --- _find_next_bit.part.0
                    |          
                    |--66.93%-- find_next_zero_bit
                    |          __alloc_fd
                    |          get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                     --33.07%-- __alloc_fd
                               get_unused_fd_flags
                               sock_map_fd
                               sys_socket
                               entry_SYSCALL_64_fastpath
                               __socket
                               |          
                                --100.00%-- 0x0

     1.63%       opensock  [kernel.kallsyms]    [k] _raw_spin_lock                               
                 |
                 --- _raw_spin_lock
                    |          
                    |--28.66%-- get_unused_fd_flags
                    |          sock_map_fd



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-31 20:29                                                                                   ` Al Viro
@ 2015-11-02  0:24                                                                                     ` Al Viro
  2015-11-02  0:59                                                                                       ` Linus Torvalds
  2015-11-02  2:14                                                                                       ` Eric Dumazet
  0 siblings, 2 replies; 138+ messages in thread
From: Al Viro @ 2015-11-02  0:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 08:29:29PM +0000, Al Viro wrote:
> On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > ... and here's the current variant of mine.
> > 
> > Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> Note that it's not the final variant - just something that should be
> testable.  There are all kinds of things that still need cleaning/simplifying
> in there - e.g. scan() is definitely more complex than needed (if nothing
> else, the "small bitmap" case is simply find_next_zero_bit(), and the
> rest all have size equal to a full cacheline; moreover, I'd overdone the
> "... and check if there are other zero bits left" thing - its callers
> used to use that a lot, and with the exception of two of them it was
> absolutely worthless).  So it ended up more generic than necessary and
> I'm going to trim that crap down.
> 
> It's still going to end up more complex than yours, obviously, but not as
> badly as it is now.  I'm not sure if the speedup will be worth the
> extra complexity, thus asking for testing...  On _some_ loads it is
> considerably faster (at least by factor of 5), but whether those loads
> resemble anything that occurs on real systems...

This ought to be a bit cleaner.  Eric, could you test the variant below on your
setup?
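
The idea, in short: bitmaps[0] is the usual open-fd bitmap, and each higher
level carries one bit per 512-bit (one cacheline) chunk of the level below,
set only when that chunk is completely full, so finding a free slot means
walking down from the top instead of scanning the whole primary bitmap.  A
rough sketch of that walk (illustration only - the patch's scan() also hunts
for remaining zero bits, and the real code starts from files->next_fd):

/* reader's sketch, not the patch: find a free fd via the summary levels */
static long sketch_find_free(unsigned long *bitmaps[4],
			     unsigned long nbits[4], int top)
{
	unsigned long pos = 0;
	int level;

	for (level = top; level >= 0; level--) {
		pos = find_next_zero_bit(bitmaps[level], nbits[level], pos);
		if (pos >= nbits[level])
			return -1;		/* every descriptor is in use */
		if (level)
			pos *= 512;		/* descend into that chunk */
	}
	return pos;	/* a zero bit at level 0 is a free descriptor */
}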

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..0144920 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,9 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+#define BYTES_PER_CHUNK (BITS_PER_CHUNK / 8)
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +49,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -56,13 +61,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+static inline bool cacheline_full(unsigned long *p)
+{
+	int n;
+
+	for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+		if (likely(~p[n]))
+			return false;
+	return true;
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +86,48 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level && cacheline_full(nfdt->bitmaps[level - 1])))
+		__set_bit(0, nfdt->bitmaps[level]);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+	int level = 0;
 
 	/*
 	 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +159,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 		goto out_fdt;
 	fdt->fd = data;
 
+	if (nr > BITS_PER_CHUNK)
+		nr = round_up(nr, BITS_PER_CHUNK);
 	data = alloc_fdmem(max_t(size_t,
 				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
-	fdt->open_fds = data;
+	fdt->bitmaps[0] = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
-
+	fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL;
+	while (unlikely(nr > BITS_PER_CHUNK)) {
+		nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		data = alloc_fdmem(nr);
+		if (!data)
+			goto out_bitmaps;
+		fdt->bitmaps[++level] = data;
+	}
 	return fdt;
 
+out_bitmaps:
+	while (level >= 0)
+		kvfree(fdt->bitmaps[level--]);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +286,41 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned fd, struct fdtable *fdt)
 {
-	__set_bit(fd, fdt->open_fds);
+	int level;
+	for (level = 0; ; level++) {
+		unsigned long *p;
+
+		__set_bit(fd, fdt->bitmaps[level]);
+
+		if (level == 3 || !fdt->bitmaps[level + 1])
+			break;
+
+		fd /= BITS_PER_CHUNK;
+
+		p = fdt->bitmaps[level] + BIT_WORD(fd * BITS_PER_CHUNK);
+		if (likely(!cacheline_full(p)))
+			break;
+	}
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->open_fds);
+	int level;
+	unsigned long *p = fdt->bitmaps[0] + BIT_WORD(fd);
+	unsigned long v = *p;
+	__clear_bit(fd % BITS_PER_LONG, p);
+	if (~v)		// quick test to avoid looking at other cachelines
+		return;
+	for (level = 1; level <= 3; level++) {
+		if (!fdt->bitmaps[level])
+			break;
+		fd /= BITS_PER_CHUNK;
+		if (!test_bit(fd, fdt->bitmaps[level]))
+			break;
+		__clear_bit(fd, fdt->bitmaps[level]);
+	}
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -246,7 +330,7 @@ static int count_open_files(struct fdtable *fdt)
 
 	/* Find the last open fd */
 	for (i = size / BITS_PER_LONG; i > 0; ) {
-		if (fdt->open_fds[--i])
+		if (fdt->bitmaps[0][--i])
 			break;
 	}
 	i = (i + 1) * BITS_PER_LONG;
@@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 {
 	struct files_struct *newf;
 	struct file **old_fds, **new_fds;
-	int open_files, size, i;
+	int open_files, size, i, n;
 	struct fdtable *old_fdt, *new_fdt;
 
 	*errorp = -ENOMEM;
@@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
-	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->bitmaps[0] = newf->open_fds_init;
+	new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
 
-	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8);
+
+	n = round_up(open_files, BITS_PER_CHUNK);
+	for (i = 1; i <= 3; i++) {
+		if (!new_fdt->bitmaps[i])
+			break;
+		n /= BITS_PER_CHUNK;
+		n = round_up(n, BITS_PER_CHUNK);
+		memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8);
+	}
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		int left = (new_fdt->max_fds - open_files) / 8;
 		int start = open_files / BITS_PER_LONG;
 
-		memset(&new_fdt->open_fds[start], 0, left);
-		memset(&new_fdt->close_on_exec[start], 0, left);
+		memset(new_fdt->close_on_exec + start, 0, left);
+		if (likely(!new_fdt->bitmaps[1])) {
+			memset(new_fdt->bitmaps[0] + start, 0, left);
+			goto done;
+		}
+		start = round_up(open_files, BITS_PER_CHUNK);
+		n = round_up(new_fdt->max_fds, BITS_PER_CHUNK);
+		for (i = 0 ; i <= 3; i++) {
+			char *p = (void *)new_fdt->bitmaps[i];
+			if (!p)
+				break;
+			n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK);
+			memset(p + start / 8, 0, (n - start) / 8);
+		}
 	}
 
+done:
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
 	return newf;
@@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files)
 		i = j * BITS_PER_LONG;
 		if (i >= fdt->max_fds)
 			break;
-		set = fdt->open_fds[j++];
+		set = fdt->bitmaps[0][j++];
 		while (set) {
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
@@ -453,70 +561,128 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
-		.open_fds	= init_files.open_fds_init,
+		.bitmaps[0]	= init_files.open_fds_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
-/*
- * allocate a file descriptor, mark it busy.
- */
+/* search for the next zero bit in cacheline */
+#define NO_FREE (1ULL<<32)
+#define LAST_FREE (2ULL<<32)
+#define CHUNK_ALIGNED(p) IS_ALIGNED((uintptr_t)p, BYTES_PER_CHUNK)
+
+static __u64 scan(struct fdtable *fdt, int level, unsigned from)
+{
+	unsigned long *p = fdt->bitmaps[level] + BIT_WORD(from);
+	unsigned long v = *p, w = v + BIT_MASK(from);
+
+	from = round_down(from, BITS_PER_CHUNK);
+
+	if (unlikely(!(w & ~v))) {
+		while (!CHUNK_ALIGNED(++p)) {
+			v = *p;
+			w = v + 1;
+			if (likely(w))
+				goto got_it;
+		}
+		return NO_FREE | (from + BITS_PER_CHUNK);
+	}
+got_it:
+	from += __ffs(w & ~v);		// log2, really - it's a power of 2
+	from += 8 * ((uintptr_t)p % BYTES_PER_CHUNK);
+	if (level)			// don't bother with looking for more
+		return from;
+	if (likely(~(v | w)))		// would zeroes remain in *p?
+		return from;
+	while (!CHUNK_ALIGNED(++p))	// any zeros in the tail?
+		if (likely(~*p))
+			return from;
+	return LAST_FREE | from;
+}
+
 int __alloc_fd(struct files_struct *files,
 	       unsigned start, unsigned end, unsigned flags)
 {
 	unsigned int fd;
+	__u64 base;
 	int error;
 	struct fdtable *fdt;
+	unsigned count;
 
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+	count = fdt->max_fds;
 	fd = start;
 	if (fd < files->next_fd)
 		fd = files->next_fd;
-
-	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
-
-	/*
-	 * N.B. For clone tasks sharing a files structure, this test
-	 * will limit the total number of files that can be opened.
-	 */
-	error = -EMFILE;
-	if (fd >= end)
-		goto out;
-
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
-
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
+	if (unlikely(fd >= count)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
 		goto repeat;
-
+	}
+	if (likely(!fdt->bitmaps[1])) {
+		base = find_next_zero_bit(fdt->bitmaps[0], count, fd);
+		if (unlikely(base == count))
+			goto expand;
+		if (unlikely(base >= end)) {
+			error = -EMFILE;
+			goto out;
+		}
+		fd = base;
+		__set_bit(fd, fdt->bitmaps[0]);
+		goto found;
+	}
+	base = scan(fdt, 0, fd);
+	if (unlikely(base & NO_FREE)) {
+		int bits[3];
+		int level = 0;
+		do {
+			if (unlikely((u32)base >= count))
+				goto expand;
+			bits[level] = count;
+			count = DIV_ROUND_UP(count, BITS_PER_CHUNK);
+			base = scan(fdt, ++level, (u32)base / BITS_PER_CHUNK);
+		} while (unlikely(base & NO_FREE));
+		while (level) {
+			if (unlikely((u32)base >= count))
+				goto expand;
+			base = scan(fdt, --level, (u32)base * BITS_PER_CHUNK);
+			if (WARN_ON(base & NO_FREE)) {
+				error = -EINVAL;
+				goto out;
+			}
+			count = bits[level];
+		}
+		if (unlikely((u32)base >= count))
+			goto expand;
+	}
+	fd = base;
+	if (unlikely(fd >= end)) {
+		error = -EMFILE;
+		goto out;
+	}
+	if (likely(!(base & LAST_FREE)))
+		__set_bit(fd, fdt->bitmaps[0]);
+	else
+		__set_open_fd(fd, fdt);
+found:
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
-
-	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
-
 out:
 	spin_unlock(&files->file_lock);
 	return error;
+expand:
+	error = expand_files(files, fdt->max_fds);
+	if (error < 0)
+		goto out;
+	goto repeat;
 }
 
 static int alloc_fd(unsigned start, unsigned flags)
@@ -809,7 +975,8 @@ __releases(&files->file_lock)
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
-	__set_open_fd(fd, fdt);
+	if (!tofree)
+		__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
 	else
diff --git a/fs/select.c b/fs/select.c
index 0155473..670f542 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds)
 	set = ~(~0UL << (n & (BITS_PER_LONG-1)));
 	n /= BITS_PER_LONG;
 	fdt = files_fdtable(current->files);
-	open_fds = fdt->open_fds + n;
+	open_fds = fdt->bitmaps[0] + n;
 	max = 0;
 	if (set) {
 		set &= BITS(fds, n);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e2..6ef5274 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -25,7 +25,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
-	unsigned long *open_fds;
+	unsigned long *bitmaps[4];
 	struct rcu_head rcu;
 };
 
@@ -36,7 +36,7 @@ static inline bool close_on_exec(int fd, const struct fdtable *fdt)
 
 static inline bool fd_is_open(int fd, const struct fdtable *fdt)
 {
-	return test_bit(fd, fdt->open_fds);
+	return test_bit(fd, fdt->bitmaps[0]);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  0:24                                                                                     ` Al Viro
@ 2015-11-02  0:59                                                                                       ` Linus Torvalds
  2015-11-02  2:14                                                                                       ` Eric Dumazet
  1 sibling, 0 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-11-02  0:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Dumazet, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sun, Nov 1, 2015 at 4:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> This ought to be a bit cleaner.  Eric, could you test the variant below on your
> setup?

I'd love to see the numbers, but at the same time I really can't say I
love your patch.

I've merged my own two patches for now (not actually pushed out - I
don't want to distract people from just testing 4.3 for a while),
because I felt that those had a unreasonably high bang-for-the-buck
(ie big speedups for something that still keeps the code very simple).

I'm definitely open to improving this further, even go as far as your
patch, but just looking at your version of __alloc_fd(), I just don't
get the warm and fuzzies.

                  Linus

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  0:24                                                                                     ` Al Viro
  2015-11-02  0:59                                                                                       ` Linus Torvalds
@ 2015-11-02  2:14                                                                                       ` Eric Dumazet
  2015-11-02  6:22                                                                                         ` Al Viro
  1 sibling, 1 reply; 138+ messages in thread
From: Eric Dumazet @ 2015-11-02  2:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:

> This ought to be a bit cleaner.  Eric, could you test the variant below on your
> setup?

Sure !

5 runs of :
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10

total = 4386311
total = 4560402
total = 4437309
total = 4516227
total = 4478778


With 48 threads :

./opensock -t 48 -n 10000000 -l 10
total = 4940245
total = 4848513
total = 4813153
total = 4813946
total = 5127804

Perf output taken on the 16 threads run :

    74.71%       opensock  opensock           [.] memset                                    
                 |
                 --- memset

     5.64%       opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath                 
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.89%-- _raw_spin_lock
                    |          |          
                    |          |--52.74%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--46.97%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.30%-- [...]
                     --0.11%-- [...]

     1.69%       opensock  [kernel.kallsyms]  [k] _raw_spin_lock                            
                 |
                 --- _raw_spin_lock
                    |          
                    |--27.37%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--22.40%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          
                    |--15.79%-- cache_alloc_refill
                    |          |          
                    |          |--99.27%-- kmem_cache_alloc
                    |          |          |          
                    |          |          |--81.25%-- sk_prot_alloc
                    |          |          |          sk_alloc
                    |          |          |          inet_create
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--9.08%-- sock_alloc_inode
                    |          |          |          alloc_inode
                    |          |          |          new_inode_pseudo
                    |          |          |          sock_alloc
                    |          |          |          __sock_create
                    |          |          |          sock_create
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |          |--4.98%-- __d_alloc
                    |          |          |          d_alloc_pseudo
                    |          |          |          sock_alloc_file
                    |          |          |          sock_map_fd
                    |          |          |          sys_socket
                    |          |          |          entry_SYSCALL_64_fastpath
                    |          |          |          __socket
                    |          |          |          
                    |          |           --4.69%-- get_empty_filp
                    |          |                     alloc_file
                    |          |                     sock_alloc_file
                    |          |                     sock_map_fd
                    |          |                     sys_socket
                    |          |                     entry_SYSCALL_64_fastpath
                    |          |                     __socket
                    |          |          
                    |           --0.73%-- kmem_cache_alloc_trace
                    |                     sock_alloc_inode
                    |                     alloc_inode
                    |                     new_inode_pseudo
                    |                     sock_alloc
                    |                     __sock_create
                    |                     sock_create
                    |                     sys_socket
                    |                     entry_SYSCALL_64_fastpath
                    |                     __socket
                    |          
                    |--10.80%-- sock_alloc_file
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--7.47%-- sock_alloc
                    |          __sock_create
                    |          sock_create
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--6.96%-- kmem_cache_alloc
                    |          |          
                    |          |--72.94%-- sk_prot_alloc
                    |          |          sk_alloc
                    |          |          inet_create
                    |          |          __sock_create
                    |          |          sock_create
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |          |--15.51%-- sock_alloc_inode
                    |          |          alloc_inode
                    |          |          new_inode_pseudo
                    |          |          sock_alloc
                    |          |          __sock_create
                    |          |          sock_create
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |          |--7.59%-- get_empty_filp
                    |          |          alloc_file
                    |          |          sock_alloc_file
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          
                    |           --3.96%-- __d_alloc
                    |                     d_alloc_pseudo
                    |                     sock_alloc_file
                    |                     sock_map_fd
                    |                     sys_socket
                    |                     entry_SYSCALL_64_fastpath
                    |                     __socket
                    |          
                    |--3.74%-- d_instantiate
                    |          sock_alloc_file
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--2.03%-- iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.60%-- __fsnotify_inode_delete
                    |          __destroy_inode
                    |          destroy_inode
                    |          evict
                    |          iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.55%-- evict
                    |          iput
                    |          __dentry_kill
                    |          dput
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.53%-- inet_release
                    |          sock_release
                    |          sock_close
                    |          __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath
                    |          int_ret_from_sys_call
                    |          __libc_close
                    |          
                    |--0.51%-- __fput
                    |          ____fput
                    |          task_work_run
                    |          prepare_exit_to_usermode
                    |          syscall_return_slowpath


Perf taken on the 48 threads run :

   48.34%       opensock  [kernel.kallsyms]   [k] queued_spin_lock_slowpath           
                 |
                 --- queued_spin_lock_slowpath
                    |          
                    |--99.93%-- _raw_spin_lock
                    |          |          
                    |          |--50.11%-- __close_fd
                    |          |          sys_close
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __libc_close
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |          |          
                    |          |--49.85%-- __alloc_fd
                    |          |          get_unused_fd_flags
                    |          |          sock_map_fd
                    |          |          sys_socket
                    |          |          entry_SYSCALL_64_fastpath
                    |          |          __socket
                    |          |          |          
                    |          |           --100.00%-- 0x0
                    |           --0.03%-- [...]
                     --0.07%-- [...]

    41.03%       opensock  opensock            [.] memset                              
                 |
                 --- memset

     0.69%       opensock  [kernel.kallsyms]   [k] _raw_spin_lock                      
                 |
                 --- _raw_spin_lock
                    |          
                    |--30.22%-- sys_close
                    |          entry_SYSCALL_64_fastpath
                    |          __libc_close
                    |          |          
                    |           --100.00%-- 0x0
                    |          
                    |--30.15%-- get_unused_fd_flags
                    |          sock_map_fd
                    |          sys_socket
                    |          entry_SYSCALL_64_fastpath
                    |          __socket
                    |          
                    |--14.61%-- cache_alloc_refill
                    |          |          

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02  2:14                                                                                       ` Eric Dumazet
@ 2015-11-02  6:22                                                                                         ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-11-02  6:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, David Miller, Stephen Hemminger,
	Network Development, David Howells, linux-fsdevel

On Sun, Nov 01, 2015 at 06:14:43PM -0800, Eric Dumazet wrote:
> On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:
> 
> > This ought to be a bit cleaner.  Eric, could you test the variant below on your
> > setup?
> 
> Sure !
> 
> 5 runs of :
> lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
> 
> total = 4386311
> total = 4560402
> total = 4437309
> total = 4516227
> total = 4478778

Umm...  With Linus' variant it was what, around 4000000?  +10% or so, then...

> With 48 threads :
> 
> ./opensock -t 48 -n 10000000 -l 10
> total = 4940245
> total = 4848513
> total = 4813153
> total = 4813946
> total = 5127804

And that - +40%?  Interesting...  And it looks like at 48 threads you are
still seeing arseloads of contention, but apparently less than with Linus'
variant...  What if you throw the __clear_close_on_exec() patch on
top of that?

Looks like it's spending less time under ->file_lock...  Could you get
information on fs/file.o hotspots?

^ permalink raw reply	[flat|nested] 138+ messages in thread

* RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-21 20:33             ` Casper.Dik
  2015-10-22  4:21               ` Al Viro
@ 2015-11-02 10:03               ` David Laight
  2015-11-02 10:29                 ` Al Viro
  1 sibling, 1 reply; 138+ messages in thread
From: David Laight @ 2015-11-02 10:03 UTC (permalink / raw)
  To: 'Casper.Dik@oracle.com', Al Viro
  Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech

From: Casper.Dik@oracle.com
> Sent: 21 October 2015 21:33
..
> >Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> >behaviour, in your opinion?  Because fd is sure as hell not the last
> >descriptor refering to that socket - fd2 remains alive and well.
> >
> >Behaviour you describe below matches the _third_ option.
> 
> >> >BTW, for real fun, consider this:
> >> >7)
> >> >// fd is a socket
> >> >fd2 = dup(fd);
> >> >in thread A: accept(fd);
> >> >in thread B: accept(fd);
> >> >in thread C: accept(fd2);
> >> >in thread D: close(fd);

If D is going to do this, it ought to be using dup2() to clone
a copy of (say) /dev/null onto 'fd'.

> >> >Which threads (if any), should get hit where it hurts?
> >>
> >> A & B should return from the accept with an error. C should
> >> continue. Which is what happens on Solaris.

That seems to completely break the normal *nix file scheme...

> >> To this end each thread keeps a list of file descriptors
> >> in use by the current active system call.
> >

Remember, Solaris (and SYSV) has extra levels of multiplexing between
userspace and char special drivers (and probably sockets) than Linux does.
As well as having multiple fd referencing the same struct FILE, multiple
FILE can point to the same inode.
If you have two different /dev entries for the same major/minor you
also end up with separate inodes - all finally referencing the same
driver data (indexed only by minor number).
(You actually get two inodes in the chain, one for the disk filesystem
and one char-special. The ref counts on both can be greater than 1.)

	David

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-02 10:03               ` David Laight
@ 2015-11-02 10:29                 ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-11-02 10:29 UTC (permalink / raw)
  To: David Laight
  Cc: 'Casper.Dik@oracle.com',
	Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev,
	dholland-tech

On Mon, Nov 02, 2015 at 10:03:58AM +0000, David Laight wrote:
> Remember, Solaris (and SYSV) has extra levels of multiplexing between
> userspace and char special drivers (and probably sockets) than Linux does.
> As well as having multiple fd referencing the same struct FILE, multiple
> FILE can point to the same inode.

As they ever could on any Unix.  Every open(2) results in a new struct file
(BTW, I've never seen that capitalization for kernel structure - not in
v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather
surprised if any kernel, Solaris included, would've renamed 'struct file'
to 'struct FILE').

> If you have two different /dev entries for the same major/minor you
> also end up with separate inodes - all finally referencing the same
> driver data (indexed only by minor number).

Again, the same goes for all Unices, both Linux and Solaris included.
And what the devil does that have to do with sockets, anyway?  Or with
the problem in question, while we are at it - there we have different
descriptors pointing to the same struct file behaving differently;
anything sensitive to file type would be past that point.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-10-30 21:09                                             ` Al Viro
@ 2015-11-04 15:54                                               ` David Laight
  2015-11-04 16:27                                                 ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: David Laight @ 2015-11-04 15:54 UTC (permalink / raw)
  To: 'Al Viro'
  Cc: 'David Holland',
	Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen,
	netdev

From: Al Viro
> Sent: 30 October 2015 21:10
> On Fri, Oct 30, 2015 at 05:43:21PM +0000, David Laight wrote:
> 
> > ISTM that the correct call should be listen(fd, 0);
> > Although that doesn't help a thread stuck in recvmsg() for a datagram.
> >
> > It is also tempting to think that close(fd) should sleep until all
> > io activities using that fd have completed - whether or not blocking
> > calls are woken.
> 
> Sigh...  The kernel has no idea when other threads are done with "all
> io activities using that fd" - it can wait for them to leave the
> kernel mode, but there's fuck-all it can do about e.g. a userland
> loop doing write() until there's more data to send.  And no, you can't
> rely upon them catching EBADF on the next iteration - by the time they
> get there, close() might very well have returned and open() from yet
> another thread might've grabbed the same descriptor.  Welcome to your
> data being written to hell knows what...

That just means that the application must use dup2() rather than close().
It must do that anyway since the thread it is trying to stop might be
sleeping in the system call stub in libc at the time a close() and open()
happen.
The listening (in this case) thread would need to look at its global
data to determine that it is supposed to exit, and then close the fd itself.
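
In code, the idiom being described is roughly the following userland sketch
(not taken from any program discussed in this thread; a thread already
parked inside accept() before the dup2() happens is of course exactly the
case being argued about):

#include <fcntl.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <unistd.h>

static atomic_int please_stop;
static int devnull;

static void init_revoke(void)
{
	devnull = open("/dev/null", O_RDWR);	/* opened once, never closed */
}

static void request_shutdown(int fd)
{
	atomic_store(&please_stop, 1);		/* tell the owner to stop */
	dup2(devnull, fd);			/* repoint fd, don't free the slot */
}

static void listener(int fd)
{
	while (!atomic_load(&please_stop)) {
		/* after the dup2() above, new accept() calls fail with
		 * ENOTSOCK instead of handing out connections */
		int c = accept(fd, NULL, NULL);
		if (c >= 0)
			close(c);
	}
	close(fd);	/* the owning thread releases the descriptor itself */
}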

	David

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-04 15:54                                               ` David Laight
@ 2015-11-04 16:27                                                 ` Al Viro
  2015-11-06 15:07                                                   ` David Laight
  0 siblings, 1 reply; 138+ messages in thread
From: Al Viro @ 2015-11-04 16:27 UTC (permalink / raw)
  To: David Laight
  Cc: 'David Holland',
	Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen,
	netdev

On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote:
> > Sigh...  The kernel has no idea when other threads are done with "all
> > io activities using that fd" - it can wait for them to leave the
> > kernel mode, but there's fuck-all it can do about e.g. a userland
> > loop doing write() until there's more data to send.  And no, you can't
> > rely upon them catching EBADF on the next iteration - by the time they
> > get there, close() might very well have returned and open() from yet
> > another thread might've grabbed the same descriptor.  Welcome to your
> > data being written to hell knows what...
> 
> That just means that the application must use dup2() rather than close().
> It must do that anyway since the thread it is trying to stop might be
> sleeping in the system call stub in libc at the time a close() and open()
> happen.

Oh, _lovely_.  So instead of continuation of that write(2) going down
the throat of something opened by unrelated thread, it (starting from a
pretty arbitrary point) will go into the descriptor the closing thread
passed to dup2().  Until it, in turn, gets closed, at which point we
are back to square one.  That, of course, makes it so much better -
whatever had I been thinking about that made me miss that?

> The listening (in this case) thread would need to look at its global
> data to determine that it is supposed to exit, and then close the fd itself.

Right until it crosses into the kernel mode and does descriptor-to-file
lookup, presumably?  Because prior to that point this kernel-side
"protection" oesn't come into play.  In other words, this is inherently
racy, and AFAICS you are the first poster in that thread who disagrees
with that.

_Any_ userland code that would be racy without that kludge of semantics
in close()/dup2() is *still* racy with it.  If that crap gets triggered
at all, the userland code responsible for that is broken.  Said crap
makes the race windows more narrow, but it doesn't really close them.
And IMO it's rather misguided, especially since it's a) quiet and b)
costly as hell.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-04 16:27                                                 ` Al Viro
@ 2015-11-06 15:07                                                   ` David Laight
  2015-11-06 19:31                                                     ` Al Viro
  0 siblings, 1 reply; 138+ messages in thread
From: David Laight @ 2015-11-06 15:07 UTC (permalink / raw)
  To: 'Al Viro'
  Cc: 'David Holland',
	Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen,
	netdev

From: Al Viro
> Sent: 04 November 2015 16:28
> On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote:
> > > Sigh...  The kernel has no idea when other threads are done with "all
> > > io activities using that fd" - it can wait for them to leave the
> > > kernel mode, but there's fuck-all it can do about e.g. a userland
> > > loop doing write() until there's more data to send.  And no, you can't
> > > rely upon them catching EBADF on the next iteration - by the time they
> > > get there, close() might very well have returned and open() from yet
> > > another thread might've grabbed the same descriptor.  Welcome to your
> > > data being written to hell knows what...
> >
> > That just means that the application must use dup2() rather than close().
> > It must do that anyway since the thread it is trying to stop might be
> > sleeping in the system call stub in libc at the time a close() and open()
> > happen.
> 
> Oh, _lovely_.  So instead of continuation of that write(2) going down
> the throat of something opened by unrelated thread, it (starting from a
> pretty arbitrary point) will go into the descriptor the closing thread
> passed to dup2().  Until it, in turn, gets closed, at which point we
> are back to square one.  That, of course, makes it so much better -
> whatever had I been thinking about that made me miss that?

Do I detect a note of sarcasm?

You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
So any writes (etc.) would be discarded in a safe manner, and nothing would block.
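
A minimal C sketch of that dup2()-over-/dev/null approach; the helper name,
flag name and ownership model are illustrative assumptions, not from this
thread:

#include <fcntl.h>
#include <unistd.h>

static volatile int please_exit;   /* checked by the thread that owns the fd */

/* Replace fd's file with /dev/null instead of close()ing it, so the
 * descriptor number stays allocated and any further write()/read() on it
 * is harmlessly discarded; the owning thread later sees please_exit and
 * closes the fd itself. */
int neutralize_fd(int fd)
{
	int nullfd = open("/dev/null", O_RDWR);

	if (nullfd < 0)
		return -1;

	please_exit = 1;           /* ask the owner to stop and close fd */

	/* dup2() atomically makes fd refer to /dev/null; the number is
	 * never free in between, so no unrelated open() can be handed it. */
	if (dup2(nullfd, fd) < 0) {
		close(nullfd);
		return -1;
	}
	close(nullfd);             /* fd itself still refers to /dev/null */
	return 0;
}

The point of preferring this over a plain close() is that the descriptor
number never becomes free, so an unrelated open() cannot be handed it while
other threads may still issue I/O on it; the owning thread then notices the
flag and closes the fd itself.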

	David

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
  2015-11-06 15:07                                                   ` David Laight
@ 2015-11-06 19:31                                                     ` Al Viro
  0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-11-06 19:31 UTC (permalink / raw)
  To: David Laight
  Cc: 'David Holland',
	Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen,
	netdev

On Fri, Nov 06, 2015 at 03:07:30PM +0000, David Laight wrote:

> > Oh, _lovely_.  So instead of continuation of that write(2) going down
> > the throat of something opened by unrelated thread, it (starting from a
> > pretty arbitrary point) will go into the descriptor the closing thread
> > passed to dup2().  Until it, in turn, gets closed, at which point we
> > are back to square one.  That, of course, makes it so much better -
> > whatever had I been thinking about that made me miss that?
> 
> Do I detect a note of sarcasm.
> 
> You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
> So any writes (etc) would be discarded in a safe manner, and nothing will block.

... and never close that descriptor afterwards?  Then why would you care whether
dup2() blocks there or not?  Are you seriously saying that there is any code
that would (a) rely on that kind of guarantee and (b) not be racy as hell?

Until now everyone had been talking about that thing as an attempt at mitigation
for broken userland code; you seem to be the first one to imply that this can
be used as a deliberate technique...

^ permalink raw reply	[flat|nested] 138+ messages in thread

end of thread, other threads:[~2015-11-06 19:31 UTC | newest]

Thread overview: 138+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-19 16:59 Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Stephen Hemminger
2015-10-19 23:33 ` Eric Dumazet
2015-10-20  1:12   ` Alan Burlison
2015-10-20  1:45     ` Eric Dumazet
2015-10-20  9:59       ` Alan Burlison
2015-10-20 11:24         ` David Miller
2015-10-20 11:39           ` Alan Burlison
2015-10-20 13:19         ` Fw: " Eric Dumazet
2015-10-20 13:45           ` Alan Burlison
2015-10-20 15:30             ` Eric Dumazet
2015-10-20 18:31               ` Alan Burlison
2015-10-20 18:42                 ` Eric Dumazet
2015-10-21 10:25                 ` David Laight
2015-10-21 10:49                   ` Alan Burlison
2015-10-21 11:28                     ` Eric Dumazet
2015-10-21 13:03                       ` Alan Burlison
2015-10-21 13:29                         ` Eric Dumazet
2015-10-21  3:49       ` Al Viro
2015-10-21 14:38         ` Alan Burlison
2015-10-21 15:30           ` David Miller
2015-10-21 16:04             ` Casper.Dik
2015-10-21 21:18               ` Eric Dumazet
2015-10-21 21:28                 ` Al Viro
2015-10-21 16:32           ` Fw: " Eric Dumazet
2015-10-21 18:51           ` Al Viro
2015-10-21 20:33             ` Casper.Dik
2015-10-22  4:21               ` Al Viro
2015-10-22 10:55                 ` Alan Burlison
2015-10-22 18:16                   ` Al Viro
2015-10-22 20:15                     ` Alan Burlison
2015-11-02 10:03               ` David Laight
2015-11-02 10:29                 ` Al Viro
2015-10-21 22:28             ` Alan Burlison
2015-10-22  1:29             ` David Miller
2015-10-22  4:17               ` Alan Burlison
2015-10-22  4:44                 ` Al Viro
2015-10-22  6:03                   ` Al Viro
2015-10-22  6:34                     ` Casper.Dik
2015-10-22 17:21                       ` Al Viro
2015-10-22 18:24                         ` Casper.Dik
2015-10-22 19:07                           ` Al Viro
2015-10-22 19:51                             ` Casper.Dik
2015-10-22 21:57                               ` Al Viro
2015-10-23  9:52                                 ` Casper.Dik
2015-10-23 13:02                                   ` Eric Dumazet
2015-10-23 13:20                                     ` Casper.Dik
2015-10-23 13:48                                       ` Eric Dumazet
2015-10-23 14:13                                       ` Eric Dumazet
2015-10-23 13:35                                     ` Alan Burlison
2015-10-23 14:21                                       ` Eric Dumazet
2015-10-23 15:46                                         ` Alan Burlison
2015-10-23 16:00                                           ` Eric Dumazet
2015-10-23 16:07                                             ` Alan Burlison
2015-10-23 16:19                                             ` Eric Dumazet
2015-10-23 16:40                                               ` Alan Burlison
2015-10-23 17:47                                                 ` Eric Dumazet
2015-10-23 17:59                                                   ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet
2015-10-25 13:45                                                     ` David Miller
2015-10-24  2:30                                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
2015-10-27  9:08                                     ` Casper.Dik
2015-10-27 10:52                                       ` Alan Burlison
2015-10-27 12:01                                         ` Eric Dumazet
2015-10-27 12:27                                           ` Alan Burlison
2015-10-27 12:44                                             ` Eric Dumazet
2015-10-27 13:42                                         ` David Miller
2015-10-27 13:37                                           ` Alan Burlison
2015-10-27 13:59                                             ` David Miller
2015-10-27 14:13                                               ` Alan Burlison
2015-10-27 14:39                                                 ` David Miller
2015-10-27 14:39                                                   ` Alan Burlison
2015-10-27 15:04                                                     ` David Miller
2015-10-27 15:53                                                       ` Alan Burlison
2015-10-27 23:17                                         ` Al Viro
2015-10-28  0:13                                           ` Eric Dumazet
2015-10-28 12:35                                             ` Al Viro
2015-10-28 13:24                                               ` Eric Dumazet
2015-10-28 14:47                                                 ` Eric Dumazet
2015-10-28 21:13                                                   ` Al Viro
2015-10-28 21:44                                                     ` Eric Dumazet
2015-10-28 22:33                                                       ` Al Viro
2015-10-28 23:08                                                         ` Eric Dumazet
2015-10-29  0:15                                                           ` Al Viro
2015-10-29  3:29                                                             ` Eric Dumazet
2015-10-29  4:16                                                               ` Al Viro
2015-10-29 12:35                                                                 ` Eric Dumazet
2015-10-29 13:48                                                                   ` Eric Dumazet
2015-10-30 17:18                                                                   ` Linus Torvalds
2015-10-30 21:02                                                                     ` Al Viro
2015-10-30 21:23                                                                       ` Linus Torvalds
2015-10-30 21:50                                                                         ` Linus Torvalds
2015-10-30 22:33                                                                           ` Al Viro
2015-10-30 23:52                                                                             ` Linus Torvalds
2015-10-31  0:09                                                                               ` Al Viro
2015-10-31 15:59                                                                               ` Eric Dumazet
2015-10-31 19:34                                                                               ` Al Viro
2015-10-31 19:54                                                                                 ` Linus Torvalds
2015-10-31 20:29                                                                                   ` Al Viro
2015-11-02  0:24                                                                                     ` Al Viro
2015-11-02  0:59                                                                                       ` Linus Torvalds
2015-11-02  2:14                                                                                       ` Eric Dumazet
2015-11-02  6:22                                                                                         ` Al Viro
2015-10-31 20:45                                                                                   ` Eric Dumazet
2015-10-31 21:23                                                                                     ` Linus Torvalds
2015-10-31 21:51                                                                                       ` Al Viro
2015-10-31 22:34                                                                                       ` Eric Dumazet
2015-10-31  1:07                                                                           ` Eric Dumazet
2015-10-28 16:04                                           ` Alan Burlison
2015-10-29 14:58                                         ` David Holland
2015-10-29 15:18                                           ` Alan Burlison
2015-10-29 16:01                                             ` David Holland
2015-10-29 16:15                                               ` Alan Burlison
2015-10-29 17:07                                                 ` Al Viro
2015-10-29 17:12                                                   ` Alan Burlison
2015-10-30  1:54                                                     ` David Miller
2015-10-30  1:55                                                   ` David Miller
2015-10-30  5:44                                                 ` David Holland
2015-10-30 17:43                                           ` David Laight
2015-10-30 21:09                                             ` Al Viro
2015-11-04 15:54                                               ` David Laight
2015-11-04 16:27                                                 ` Al Viro
2015-11-06 15:07                                                   ` David Laight
2015-11-06 19:31                                                     ` Al Viro
2015-10-22  6:51                   ` Casper.Dik
2015-10-22 11:18                     ` Alan Burlison
2015-10-22 11:15                   ` Alan Burlison
2015-10-22  6:15                 ` Casper.Dik
2015-10-22 11:30                   ` Eric Dumazet
2015-10-22 11:58                     ` Alan Burlison
2015-10-22 12:10                       ` Eric Dumazet
2015-10-22 13:12                         ` David Miller
2015-10-22 13:14                         ` Alan Burlison
2015-10-22 17:05                           ` Al Viro
2015-10-22 17:39                             ` Alan Burlison
2015-10-22 18:56                               ` Al Viro
2015-10-22 19:50                                 ` Casper.Dik
2015-10-23 17:09                                   ` Al Viro
2015-10-23 18:30           ` Fw: " David Holland
2015-10-23 19:51             ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).