* Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Stephen Hemminger
Date: 2015-10-19 16:59
To: netdev

This looks like a corner case, but worth forwarding.

Begin forwarded message:

Date: Mon, 19 Oct 2015 13:21:33 +0000
From: "bugzilla-daemon@bugzilla.kernel.org" <bugzilla-daemon@bugzilla.kernel.org>
To: "shemminger@linux-foundation.org" <shemminger@linux-foundation.org>
Subject: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

https://bugzilla.kernel.org/show_bug.cgi?id=106241

            Bug ID: 106241
           Summary: shutdown(3)/close(3) behaviour is incorrect for
                    sockets in accept(3)
           Product: Networking
           Version: 2.5
    Kernel Version: 3.10.0-229.14.1.el7.x86_64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
          Assignee: shemminger@linux-foundation.org
          Reporter: Alan.Burlison@oracle.com
        Regression: No

Created attachment 190501
  --> https://bugzilla.kernel.org/attachment.cgi?id=190501&action=edit
Test program illustrating the problem

The Linux behaviour in the following scenario is incorrect:

1. ThreadA opens, binds, listens and accepts on a socket, waiting for
connections.

2. Some time later ThreadB calls shutdown on the socket ThreadA is waiting in
accept on.

Here is what happens:

On Linux, the shutdown call in ThreadB succeeds and the accept call in ThreadA
returns with EINVAL.

On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA
continues to wait in accept.

Relevant POSIX manpages:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html

The POSIX shutdown manpage says:

"The shutdown() function shall cause all or part of a full-duplex connection on
the socket associated with the file descriptor socket to be shut down."
...
"[ENOTCONN] The socket is not connected."

Pages 229 & 303 of "UNIX System V Network Programming" say:

"shutdown can only be called on sockets that have been previously connected"

"The socket [passed to accept that] fd refers to does not participate in the
connection. It remains available to receive further connect indications"

That is pretty clear: sockets being waited on with accept are not connected by
definition. Nor is the accept socket connected when a client connects to it;
it is the socket returned by accept that is connected to the client. Therefore
the Solaris behaviour of failing the shutdown call is correct.

In order to get the required behaviour of ThreadB causing ThreadA to exit the
accept call with an error, the correct approach is for ThreadB to call close
on the socket that ThreadA is waiting on in accept.

On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA
fails and returns EBADF.

On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in
accept until there is an incoming connection. That accept returns successfully.
However subsequent accept calls on the same socket return EBADF.

The Linux behaviour is fundamentally broken in three places:

1. Allowing shutdown to succeed on an unconnected socket is incorrect.

2. Returning a successful accept on a closed file descriptor is incorrect,
especially as future accept calls on the same socket fail.

3. Once shutdown has been called on the socket, calling close on the socket
fails with EBADF. That is incorrect: shutdown should just prevent further IO
on the socket, it should not close it.

^ permalink raw reply [flat|nested] 138+ messages in thread
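The platform difference the report describes can be checked in a few lines. This is a minimal sketch in Python (not the C test program attached to the bug), probing what shutdown() does on a listening socket that has never been connected; the port is chosen by the kernel and the recorded outcome differs by platform as described above:

```python
import errno
import socket

# A listening socket that has never been connected.
lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lst.bind(("127.0.0.1", 0))
lst.listen(5)

try:
    lst.shutdown(socket.SHUT_RDWR)
    outcome = "shutdown-succeeded"   # the behaviour the report observes on Linux
except OSError as e:
    # Solaris fails the call; ENOTCONN matches the POSIX reading quoted above.
    outcome = errno.errorcode.get(e.errno, str(e.errno))
print(outcome)

try:
    lst.close()
except OSError:
    pass   # the report claims close() after shutdown() can itself fail (point 3)
```

Running this on the two systems makes the disagreement concrete without needing two threads.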
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet
Date: 2015-10-19 23:33
To: Stephen Hemminger; +Cc: netdev, Alan.Burlison

On Mon, 2015-10-19 at 09:59 -0700, Stephen Hemminger wrote:
> This looks like a corner case, but worth forwarding.
>
> Begin forwarded message:
> [...]
> On Linux, the shutdown call in ThreadB succeeds and the accept call in
> ThreadA returns with EINVAL.
>
> On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN.
> ThreadA continues to wait in accept.
> [...]
> The Linux behaviour is fundamentally broken in three places:
>
> 1. Allowing shutdown to succeed on an unconnected socket is incorrect.
>
> 2. Returning a successful accept on a closed file descriptor is incorrect,
> especially as future accept calls on the same socket fail.
>
> 3. Once shutdown has been called on the socket, calling close on the socket
> fails with EBADF. That is incorrect: shutdown should just prevent further IO
> on the socket, it should not close it.

It looks like it is a long-standing problem, right?

inet_shutdown() has had this very specific comment from the beginning of the git tree:

	switch (sk->sk_state) {
	...
	/* Remaining two branches are temporary solution for missing
	 * close() in multithreaded environment. It is _not_ a good idea,
	 * but we have no choice until close() is repaired at VFS level.
	 */
	case TCP_LISTEN:
		if (!(how & RCV_SHUTDOWN))
			break;
		/* Fall through */
	case TCP_SYN_SENT:
		err = sk->sk_prot->disconnect(sk, O_NONBLOCK);
		sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
		break;
	}

Claiming Solaris does it differently is kind of moot. Linux is not Solaris.

Unless proven a real problem (and not only by trying to backport from Solaris to Linux), we probably won't change this.
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-20 1:12
To: Eric Dumazet, Stephen Hemminger; +Cc: netdev

On 20/10/2015 00:33, Eric Dumazet wrote:

> It looks like it is a long-standing problem, right?

Yep, seems so.

> inet_shutdown() has had this very specific comment from the beginning of
> the git tree:
>
>	/* Remaining two branches are temporary solution for missing
>	 * close() in multithreaded environment. It is _not_ a good idea,
>	 * but we have no choice until close() is repaired at VFS level.
>	 */
> [...]

I think it's probably an intrinsic part of the way *NIX file descriptors and their reuse have worked since the dawn of *NIX time, at which point threads didn't exist, so this problem didn't either. The advent of threads made this hole possible, which I believe is what the comment above is pointing out. The problem people are trying to solve by calling shutdown() on a listen()ing socket is the race in MT programs between a socket being closed and the same file descriptor being recycled by a subsequent open()/socket() etc.

> Claiming Solaris does it differently is kind of moot.
> Linux is not Solaris.

Agreed that Linux != Solaris, but the argument I'm being faced with is that anything that doesn't behave in the same way as Linux is wrong by definition. And the problem with that is that the Linux behaviour of shutdown() on a listen()/accept() socket is, I believe, incorrect anyway: my reading of POSIX says that shutdown() is only valid on connected sockets, sockets in listen()/accept() aren't connected by definition, and Linux allows shutdown() to succeed when it should probably return ENOTCONN. Yes, there's a potential race with FDs being recycled, but you can get that with vanilla file FDs as well, where shutdown() isn't an option.

Another problem is that if I call close() on a Linux socket that's in accept(), the accept call just sits there until there's an incoming connection, which succeeds even though the socket is supposed to be closed, but then an immediately following accept() on the same socket fails.

And yet another problem is that poll() on a socket that's had listen() called on it returns immediately even if there's no incoming connection on it, which I believe makes it impossible to multiplex over a set of sockets that includes a socket you want to accept() on.

The test program I attached to the bug allows you to play around with the different combinations.

> Unless proven a real problem (and not only by trying to backport from
> Solaris to Linux), we probably won't change this.

It's a real problem (with Hadoop, which contains C/C++ to do low-level I/O), and it's the other way around: I am porting that code from Linux to Solaris. I accept that you probably can't change the behaviour of shutdown() in Linux without breaking existing code; for example it seems libmicrohttpd also assumes it's OK to call shutdown() on a listen() socket on Linux, see https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html

However, even if the shutdown() behaviour can't be changed, the Linux close()/poll() behaviour on listen()/accept() sockets seems rather odd. There *may* be a way around this that's race-free and cross-platform, involving the use of /dev/null and dup2(); see Listing Five in http://www.drdobbs.com/parallel/file-descriptors-and-multithreaded-progr/212001285 but I haven't confirmed it works yet.

Thanks,

--
Alan Burlison
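The dup2()-based workaround referred to above can be sketched as follows. This is a minimal Python illustration of the general technique, not the Dr. Dobb's listing itself; it shows the descriptor swap, and the variable names are illustrative:

```python
import os
import socket

# A listening socket whose descriptor another thread might block on.
lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lst.bind(("127.0.0.1", 0))
lst.listen(5)
fd = lst.fileno()

# Instead of close()ing fd (which frees the number for reuse and creates
# the recycling race), atomically replace the open file it refers to with
# /dev/null. The descriptor number stays allocated in this process, so no
# other thread can accidentally pick it up.
devnull = os.open(os.devnull, os.O_RDWR)
os.dup2(devnull, fd)
os.close(devnull)

# Any subsequent socket call on fd now fails, because it refers to
# /dev/null rather than a socket.
try:
    lst.accept()
    outcome = "accepted"
except OSError:
    outcome = "not-a-socket"
print(outcome)
lst.close()
```

Note that this only guarantees that future calls on the descriptor fail; whether a thread already blocked in accept() on that descriptor is woken by the dup2() is exactly the open question Alan flags.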
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet
Date: 2015-10-20 1:45
To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:
> Another problem is that if I call close() on a Linux socket that's in
> accept(), the accept call just sits there until there's an incoming
> connection, which succeeds even though the socket is supposed to be
> closed, but then an immediately following accept() on the same socket
> fails.

This is exactly what the comment I pasted documents.

On Linux, doing close(listener) in one thread does _not_ wake up other threads doing accept(listener). So I guess allowing shutdown(listener) was a way to somehow propagate some info to the threads stuck in accept().

This is a VFS issue, and a long-standing one. Think of all the cases like dup() and fd-passing games: having close(fd) signal out-of-band info is racy.

close() is literally removing one refcount on a file. Expecting it to do some kind of magical cleanup of a socket is not reasonable/practical. In a multithreaded program, each thread doing an accept() increases the refcount on the file.

Really, I have no idea how Solaris coped with this, and I do not want to know.
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-20 9:59
To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 02:45, Eric Dumazet wrote:

>> Another problem is that if I call close() on a Linux socket that's in
>> accept(), the accept call just sits there until there's an incoming
>> connection, which succeeds even though the socket is supposed to be
>> closed, but then an immediately following accept() on the same socket
>> fails.
>
> This is exactly what the comment I pasted documents.

Yes, and it's caveated with "temporary solution" and "_not_ a good idea" and says that the problem is that close() needs repairing. In other words, the change in shutdown() behaviour appears to be a workaround for an acknowledged bug, and it's one with consequences.

There are two separate things here. The first is the close/reopen race issue with filehandles. As I said, I believe that's an artefact of history, because it wasn't possible for it to happen before threads and was possible after them. There appears to be no good way to avoid this with the current *NIX filehandle semantics when threads are being used. History sucks.

The second is how/if you might work around that. As far as I can tell, Linux does so by allowing shutdown() to be called on unconnected sockets, and uses that as a signal that threads waiting in accept() should return from the accept with a failure, but without the filehandle actually being closed, and therefore not being available for reuse, and therefore not subject to potential races. However, by doing so I believe the behaviour of shutdown() is then not POSIX-conforming. The Linux manpage for shutdown(2) says "CONFORMING TO POSIX.1-2001"; as far as I can tell it isn't. At the very least I believe the manpage needs changing.

> On Linux, doing close(listener) in one thread does _not_ wake up other
> threads doing accept(listener).

Allowing an in-progress accept() to continue and to succeed at some point in the distant future on a filehandle that's closed seems incorrect. Also, the behaviour of poll() that I mentioned, returning immediately for a socket in the listen() state, does seem like an out-and-out bug to me; I haven't seen any explanation of why that might be correct behaviour.

> So I guess allowing shutdown(listener) was a way to somehow propagate
> some info to the threads stuck in accept().

Yes, I think you are right.

> This is a VFS issue, and a long-standing one.

That seems to be what the comment you quoted is saying, yes.

> Think of all the cases like dup() and fd-passing games: having close(fd)
> signal out-of-band info is racy.

Yes, I agree there are potential race conditions, intrinsic to the way *NIX FDs work. As I said earlier, there *may* be a portable way around this using /dev/null & dup2() but I haven't had a chance to investigate that yet.

> close() is literally removing one refcount on a file. Expecting it to do
> some kind of magical cleanup of a socket is not reasonable/practical.

I'm not sure what that means at an application level. No matter how many copies I make of a FD in a process, it's still just an integer, and calling close on it closes it and causes any future IOs to fail; except, it seems, for the case of sockets in accept(), which continue and may even eventually succeed. Leaving aside the behaviour of shutdown() on listening sockets, the current behaviour of close() on a socket in accept() seems incorrect. And then of course there's also the poll() issue.

> In a multithreaded program, each thread doing an accept() increases the
> refcount on the file.

That may be how Linux implements accept(), but I don't see anything about refcounting in the POSIX spec for accept().

> Really, I have no idea how Solaris coped with this, and I do not want
> to know.

The bug goes into quite some detail about how Solaris behaves. The issue here is that we have two implementations, Linux and Solaris, both claiming to be POSIX-conformant but showing different behaviour. There's a discussion to be had about the whys and wherefores of that difference, but saying that you don't want to know how Solaris behaves isn't really going to help move the conversation along.

--
Alan Burlison
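The close()-while-blocked-in-accept() behaviour being debated can be observed with a small two-thread probe. This is a sketch in Python under the assumption that timing sleeps are enough to let the thread block; on Linux, per the discussion, the in-progress accept(2) holds its own reference to the file, so it typically survives the close() and completes on the next incoming connection, while other platforms may instead fail it:

```python
import socket
import threading
import time

lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lst.bind(("127.0.0.1", 0))
lst.listen(5)
port = lst.getsockname()[1]

result = {}

def acceptor():
    # Block in accept() on the listener.
    try:
        conn, _ = lst.accept()
        result["accept"] = "connection"
        conn.close()
    except OSError:
        result["accept"] = "error"

t = threading.Thread(target=acceptor)
t.start()
time.sleep(0.2)          # let the thread enter accept(2)

lst.close()              # close the listener out from under it
time.sleep(0.2)

# On Linux the blocked accept(2) still references the socket, so a client
# can still connect and the accept completes despite the close().
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    cli.connect(("127.0.0.1", port))
except OSError:
    pass                 # on platforms where close() tore the listener down
cli.close()
t.join(5)
print(result.get("accept"))
```

Which branch the acceptor takes is exactly the platform difference under dispute, so the probe records the outcome rather than presuming one.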
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: David Miller
Date: 2015-10-20 11:24
To: Alan.Burlison; +Cc: eric.dumazet, stephen, netdev

From: Alan Burlison <Alan.Burlison@oracle.com>
Date: Tue, 20 Oct 2015 10:59:46 +0100

> The bug goes into quite some detail about how Solaris behaves. The
> issue here is that we have two implementations, Linux and Solaris,
> both claiming to be POSIX-conformant but showing different behaviour.
> There's a discussion to be had about the whys and wherefores of that
> difference, but saying that you don't want to know how Solaris behaves
> isn't really going to help move the conversation along.

With two decades of precedent, applications will need to find a way to cope with the behavior on every existing Linux kernel out there.

Even if we were to propose something here and change things, it won't be available on real sites for 6 months at a minimum, and only on an extremely small fraction of actual machines.

It's more practical for userspace to cope with the behavior. This is simply because coping with the current behavior will work on every Linux kernel on the planet, and also it won't require us to potentially break any existing setups.

That to me matters more than any squabbling over semantics or who really implements POSIX correctly or not. That simply does not matter at all. It especially does not matter as far as I and the millions of existing socket apps out there are concerned.
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-20 11:39
To: David Miller; +Cc: eric.dumazet, stephen, netdev

On 20/10/2015 12:24, David Miller wrote:

> With two decades of precedent, applications will need to find a way
> to cope with the behavior on every existing Linux kernel out there.
>
> Even if we were to propose something here and change things, it won't
> be available on real sites for 6 months at a minimum, and only on an
> extremely small fraction of actual machines.
>
> It's more practical for userspace to cope with the behavior. This is
> simply because coping with the current behavior will work on every Linux
> kernel on the planet, and also it won't require us to potentially
> break any existing setups.

Yes, as I said, I think in practice about the best that can be done is to document the behaviour, although I think the assertion that millions of apps would be affected is an over-estimate. Only MT apps that use accept() in one thread and shutdown() in another should be impacted, i.e. mainly threaded apps that act as network service providers of one form or another.

But having said that, the behaviour of close() and poll() on sockets being used in accept() still looks incorrect.

--
Alan Burlison
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet
Date: 2015-10-20 13:19
To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 10:59 +0100, Alan Burlison wrote:
> That may be how Linux implements accept(), but I don't see anything
> about refcounting in the POSIX spec for accept().

That's an internal implementation detail. POSIX does not document the Linux kernel's overall design and specific tricks.

Linux is GPL, while Solaris is proprietary code. There is quite a difference, and we do not want to copy Solaris behavior. We want our own way: practical, and good enough. If POSIX makes sense we try to be compliant. If not, we do not.

If you are interested, take a look at the fs/* code, and try to implement your proposal while keeping good performance. You might find a clever way, without infringing prior art. We have not yet.
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-20 13:45
To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 14:19, Eric Dumazet wrote:

>> That may be how Linux implements accept(), but I don't see anything
>> about refcounting in the POSIX spec for accept().
>
> That's an internal implementation detail. POSIX does not document the
> Linux kernel's overall design and specific tricks.

No, neither does it document Solaris kernel design, or *BSD design, or Windows design. It does however specify what the externally observable behaviour has to be in order to be POSIX-compliant.

> Linux is GPL, while Solaris is proprietary code. There is quite a
> difference, and we do not want to copy Solaris behavior. We want our own
> way: practical, and good enough.

I don't see what the licensing terms of particular implementations have to do with POSIX; indeed, as far as I can tell POSIX says nothing at all about the subject. It's therefore not pertinent to this discussion.

I'm not expecting Linux to copy every Solaris behaviour. If I were, I might for example be suggesting that Linux drop support for the SO_RCVTIMEO and SO_SNDTIMEO setsockopt() options on AF_UNIX sockets, because Solaris doesn't currently implement those options. That would clearly be a ridiculous stance for me to take; what I've actually done is logged a bug against Solaris, because it's clearly something we need to fix.

What I do think is reasonable is that if Linux claims POSIX conformance then it either conforms or documents the variant behaviour, and as I've said, I don't believe it does conform in the case of shutdown().

> If POSIX makes sense we try to be compliant. If not, we do not.

That's one possible design option. However, in this case the Linux manpage claims the Linux behaviour is POSIX-compliant, and as far as I can tell it isn't. As I've already said several times, I agree there's probably not much that can be done about it without causing breakage, which is why I suggested that simply documenting the behaviour may be the best option.

And I still haven't seen any reasoning behind the Linux close() and poll() behaviour on sockets that are in the listen state.

> If you are interested, take a look at the fs/* code, and try to implement
> your proposal while keeping good performance.
>
> You might find a clever way, without infringing prior art. We have not
> yet.

I might, but contractually I can't, unfortunately.

--
Alan Burlison
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet
Date: 2015-10-20 15:30
To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 14:45 +0100, Alan Burlison wrote:
> And I still haven't seen any reasoning behind the Linux close() and
> poll() behaviour on sockets that are in the listen state.

Same answer. A close() does _not_ wake up an accept() or a poll() (this is exactly the same problem), or any blocking system call using the same 'file'.

This is the choice Linus Torvalds and Al Viro made years ago. It is set in stone, unless someone comes up with a very nice patch set that does not break existing applications.

The close() man page states:

	NOTES
	It is probably unwise to close file descriptors while they may be in
	use by system calls in other threads in the same process. Since a
	file descriptor may be reused, there are some obscure race conditions
	that may cause unintended side effects.

You are in this grey zone.
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-20 18:31
To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 20/10/2015 16:30, Eric Dumazet wrote:

> A close() does _not_ wake up an accept() or a poll() (this is exactly the
> same problem), or any blocking system call using the same 'file'.

Not waking up the accept() is one thing; allowing the accept() to successfully complete some indeterminate time after the socket has been closed is something else entirely. You shouldn't have to call shutdown() to terminate an accept() on a socket; close() should suffice.

Yes, if you want to use the shutdown() 'feature' to kick the accept() thread out of the call to accept() without closing the FD, and you don't care about cross-platform compatibility, then you can call shutdown() followed by close(). However that's only ever required on Linux as far as I can tell, and even on Linux, applications that deal with the thread race by other means shouldn't be forced to use shutdown() when just close() would suffice.

The problem with poll() is that it returns immediately when passed a FD that is in the listening state, rather than waiting until there's an incoming connection to handle. As I said, that means you can't use poll() to multiplex between read/write/accept sockets.

> The close() man page states:
>
>	NOTES
>	It is probably unwise to close file descriptors while they may be in
>	use by system calls in other threads in the same process. Since a
>	file descriptor may be reused, there are some obscure race conditions
>	that may cause unintended side effects.
>
> You are in this grey zone.

No, the race issue with file descriptor reuse and the close() behaviour are not the same thing. The manpage comment is correct, but not relevant.

--
Alan Burlison
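The Linux-specific idiom being debated here, using shutdown() to kick a thread out of a blocked accept() before close()ing, can be sketched as follows. This is a Python sketch rather than anyone's production code; whether shutdown() on a listener succeeds at all is precisely the platform difference the thread is about, so the fallback path handles systems where it fails:

```python
import socket
import threading
import time

lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lst.bind(("127.0.0.1", 0))
lst.listen(5)
port = lst.getsockname()[1]

result = {}

def acceptor():
    try:
        conn, _ = lst.accept()
        result["accept"] = "connection"
        conn.close()
    except OSError:
        result["accept"] = "error"   # on Linux: EINVAL after shutdown()

t = threading.Thread(target=acceptor)
t.start()
time.sleep(0.2)                      # let the thread enter accept(2)

try:
    # Linux: the call succeeds and the blocked accept() fails with EINVAL.
    lst.shutdown(socket.SHUT_RDWR)
except OSError:
    # Solaris-style: shutdown() on a listener fails (e.g. ENOTCONN), so
    # unblock the acceptor with a throwaway connection instead.
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", port))
    c.close()

t.join(5)
lst.close()        # close() only once accept() has returned: no reuse race
print(result.get("accept"))
```

The point of the ordering is that the descriptor number is not released until no thread can still be blocked on it, which is the race the shutdown()-before-close() idiom exists to avoid.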
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet
Date: 2015-10-20 18:42
To: Alan Burlison; +Cc: Stephen Hemminger, netdev

On Tue, 2015-10-20 at 19:31 +0100, Alan Burlison wrote:
> No, the race issue with file descriptor reuse and the close() behaviour
> are not the same thing. The manpage comment is correct, but not relevant.

OK, it seems you know better than me. I will stop the discussion and wait for your patches.
* RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: David Laight
Date: 2015-10-21 10:25
To: 'Alan Burlison', Eric Dumazet; +Cc: Stephen Hemminger, netdev

From: Alan Burlison
> Sent: 20 October 2015 19:31
...
> The problem with poll() is that it returns immediately when passed a FD
> that is in the listening state, rather than waiting until there's an
> incoming connection to handle. As I said, that means you can't use
> poll() to multiplex between read/write/accept sockets.

That seems to work for me...

	David
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Alan Burlison
Date: 2015-10-21 10:49
To: David Laight, Eric Dumazet; +Cc: Stephen Hemminger, netdev

On 21/10/2015 11:25, David Laight wrote:

>> The problem with poll() is that it returns immediately when passed a FD
>> that is in the listening state, rather than waiting until there's an
>> incoming connection to handle. As I said, that means you can't use
>> poll() to multiplex between read/write/accept sockets.
>
> That seems to work for me...

In my test case I was setting all the available event bits in pollfd.events to see what came back. With poll() on a listen() socket you get an immediate return with bits set in revents indicating the socket is available for output, which of course it isn't; indeed, an attempt to write to it fails. If you remove the output event bits from pollfd.events then the poll() waits as expected until there's an incoming connection on the socket.

I suppose one answer is "Well, don't do that then", but returning an output indication on a socket that's in listen() seems rather odd.

With POLLOUT|POLLWRNORM|POLLWRBAND:

	main: polling #1
	[returns immediately]
	main: poll #1: Success
	poll fd: 0 revents: POLLOUT POLLWRBAND
	main: write #1: Transport endpoint is not connected

Without POLLOUT|POLLWRNORM|POLLWRBAND:

	main: polling #1
	[waits for connection]
	main: poll #1: Success
	poll fd: 0 revents: POLLIN POLLRDNORM

--
Alan Burlison
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 10:49 ` Alan Burlison @ 2015-10-21 11:28 ` Eric Dumazet 2015-10-21 13:03 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-21 11:28 UTC (permalink / raw) To: Alan Burlison; +Cc: David Laight, Stephen Hemminger, netdev On Wed, 2015-10-21 at 11:49 +0100, Alan Burlison wrote: > On 21/10/2015 11:25, David Laight wrote: > > >> The problem with poll() is that it returns immediately when passed a FD > >> that is in the listening state. rather than waiting until there's an > >> incoming connection to handle. As I said, that means you can't use > >> poll() to multiplex between read/write/accept sockets. > > > > That seems to work for me... > > In my test case I was setting all the available event bits in > pollfd.events to see what came back. With poll() on a listen() socket > you get an immediate return with bits set in revents indicating the > socket is available for output, which of course it isn't. Indeed an > attempt to write to it fails. If you remove the output event bits in > pollfd.events then the poll() waits as expected until there's an > incoming connection on the socket. > > I suppose one answer is "Well, don't do that then" but returning an > output indication on a socket that's in listen() seems rather odd. > > With POLLOUT|POLLWRNORM|POLLWRBAND: > > main: polling #1 > [returns immediately] > main: poll #1: Success > poll fd: 0 revents: POLLOUT POLLWRBAND > main: write #1: Transport endpoint is not connected > > Without POLLOUT|POLLWRNORM|POLLWRBAND: > > main: polling #1 > [waits for connection] > main: poll #1: Success > poll fd: 0 revents: POLLIN POLLRDNORM > This works for me. Please double check your programs 242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295 <unfinished ...> 242046 <... 
poll resumed> ) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35888), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35890), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35892), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35894), inet_pton(AF_INET6, "::ffff:10.246.7.151", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0

poll() only cares about POLLIN | POLLRDNORM for a listener.

static inline unsigned int inet_csk_listen_poll(const struct sock *sk)
{
	return !reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) ?
			(POLLIN | POLLRDNORM) : 0;
}

	if (sk->sk_state == TCP_LISTEN)
		return inet_csk_listen_poll(sk);

^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 11:28 ` Eric Dumazet @ 2015-10-21 13:03 ` Alan Burlison 2015-10-21 13:29 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-21 13:03 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Laight, Stephen Hemminger, netdev On 21/10/2015 12:28, Eric Dumazet wrote: > This works for me. Please double check your programs I have just done so, it works as you say for AF_INET sockets but if you switch to AF_UNIX sockets it does the wrong thing in the way I described. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 13:03 ` Alan Burlison @ 2015-10-21 13:29 ` Eric Dumazet 0 siblings, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-21 13:29 UTC (permalink / raw) To: Alan Burlison; +Cc: David Laight, Stephen Hemminger, netdev On Wed, 2015-10-21 at 14:03 +0100, Alan Burlison wrote: > On 21/10/2015 12:28, Eric Dumazet wrote: > > > This works for me. Please double check your programs > > I have just done so, it works as you say for AF_INET sockets but if you > switch to AF_UNIX sockets it does the wrong thing in the way I described. > Oh well. Please try the following :

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f658235fb4..24dec8bb571d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -328,7 +328,8 @@ found:
 
 static inline int unix_writable(struct sock *sk)
 {
-	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+	return sk->sk_state != TCP_LISTEN &&
+	       (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
 
 static void unix_write_space(struct sock *sk)

^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-20 1:45 ` Eric Dumazet 2015-10-20 9:59 ` Alan Burlison @ 2015-10-21 3:49 ` Al Viro 2015-10-21 14:38 ` Alan Burlison 1 sibling, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-21 3:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: Alan Burlison, Stephen Hemminger, netdev, dholland-tech On Mon, Oct 19, 2015 at 06:45:32PM -0700, Eric Dumazet wrote: > On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote: > > > Another problem is that if I call close() on a Linux socket that's in > > accept() the accept call just sits there until there's an incoming > > connection, which succeeds even though the socket is supposed to be > > closed, but then an immediately following accept() on the same socket > > fails. > > This is exactly what the comment I pasted documents. > > On linux, doing close(listener) on one thread does _not_ wakeup other > threads doing accept(listener) > > So I guess allowing shutdown(listener) was a way to somehow propagate > some info on the threads stuck in accept() > > This is a VFS issue, and a long standing one. > > Think of all cases like dup() and fd passing games, and the close(fd) > being able to signal out of band info is racy. > > close() is literally removing one ref count on a file. > Expecting it doing some kind of magical cleanup of a socket is not > reasonable/practical. > > On a multi threaded program, each thread doing an accept() increased the > refcount on the file. Refcount is an implementation detail, of course. However, in any Unix I know of, there are two separate notions - descriptor losing connection to opened file (be it from close(), exit(), execve(), dup2(), etc.) and opened file getting closed. The latter cannot happen while there are descriptors connected to the file in question, of course. However, that is not the only thing that might prevent an opened file from getting closed - e.g. 
sending an SCM_RIGHTS datagram with attached descriptor connected to the opened file in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened file until it is successfully received or discarded (in the former case the recipient will get a new descriptor referring to that opened file, of course). Having the original descriptor closed right after sendmsg(2) does *not* do anything to the opened file. On any Unix that implements descriptor-passing. There's going to be a notion of "last close"; that's what this refcount is about and _that_ is more than an implementation detail. The real question is what kind of semantics one would want in the following situations:

1)	// fd is a socket
	fcntl(fd, F_SETFD, FD_CLOEXEC);
	fork();
	in parent: accept(fd);
	in child: execve()

2)	// fd is a socket, 1 is /dev/null
	fork();
	in parent: accept(fd);
	in child: dup2(1, fd);

3)	// fd is a socket
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: close(fd);

4)	// fd is a socket, 1 is /dev/null
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: dup2(1, fd);

5)	// fd is a socket, 1 is /dev/null
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: close(fd2);

6)	// fd is a socket
	in thread A: accept(fd);
	in thread B: close(fd);

In other words, is what matters the destruction of
* any descriptor referring to this socket [utterly insane for obvious reasons]
* the last descriptor referring to this socket (modulo descriptor passing, etc.) [a bitch to implement, unless we treat a syscall in progress as keeping the opened file open], or
* _the_ descriptor used to issue accept(2) [a bitch to implement, with a lot of fun races in an already race-prone area]?

An additional question is whether it's
* just a magical behaviour of close(2) [ugly], or
* something that happens when a descriptor gets dissociated from the opened file [obviously more consistent]?
BTW, for real fun, consider this:

7)	// fd is a socket
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: accept(fd);
	in thread C: accept(fd2);
	in thread D: close(fd);

Which threads (if any) should get hit where it hurts? I honestly don't know what Solaris does; AFAICS, FreeBSD behaves like Linux these days. NetBSD plays really weird games in their fd_close(); what they are trying to achieve is at least sane - in (7) they'd hit A and B with EBADF and C would restart and continue waiting, in (3,4,6) A gets EBADF, in (1,2,5) accept() is unaffected. The problem is that their solution is racy - they have a separate refcount on _descriptor_, plus a file method (->fo_restart) for triggering an equivalent of a signal interrupting anything that might be blocked on that sucker, with syscall restart (and subsequent EBADF on an attempt to refetch the sucker). Racy if we reopen or are doing dup2() in the first place - these restarts might get CPU just after we return from dup2() and pick the *new* descriptor just fine. It might be possible to fix their approach (having the

	if (__predict_false(ff->ff_file == NULL)) {
		/*
		 * Another user of the file is already closing, and is
		 * waiting for other users of the file to drain.  Release
		 * our reference, and wake up the closer.
		 */
		atomic_dec_uint(&ff->ff_refcnt);
		cv_broadcast(&ff->ff_closing);

path in fd_close() mark the thread as "don't bother restarting, just bugger off" might be workable), but... it's still pretty costly. They pay with memory footprint (at least 32 bits per descriptor, and that's leaving aside the fun issues with what to wait on) and the only thing that might be saving them from cacheline ping-pong from hell is that their struct fdfile is really fat - there's a lot more than just an extra u32 in there. I have no idea what semantics Solaris has in that area and how racy their descriptor table handling is. And no, I'm not going to RTFS their kernel, CDDL being what it is.
I *do* know that Linux and all *BSD kernels had pretty severe races in that area. Quite a few of those, and a lot more painful than the one RTFS(NetBSD) seems to have caught just now. So I would seriously recommend that the folks who are free to RTFS(Solaris) review that area. Carefully. There tend to be dragons.

_IF_ somebody can come up with clean semantics and a tolerable approach to implementing them, I'd be glad to see that. What we do is "syscall in progress keeps the file it operates upon open, no matter what happens to descriptors". AFAICS, what NetBSD tries to implement is also reasonably clean wrt semantics ("detaching an opened file from a descriptor that is being operated upon by some syscalls triggers restart or failure of all syscalls operating on the opened file in question and waits for them to bugger off"), but their implementation appears to be both racy and far too heavyweight, with no obvious solutions to the latter.

Come to think of that, restart-based solutions have an obvious problem - if we were talking about restart due to a signal, the userland code could (and would have to) block those, just to avoid this kind of issue with the wrong descriptor picked on restart. But there's no way to block _that_, so if you have two descriptors referring to the same socket and 4 threads doing

	A: sleeps in accept(fd1)
	B: sleeps in accept(fd2)
	C: close(fd1)
	D: (with all precautions re signals taken by the whole thing) dup2(fd3, fd2)

you can end up with C coming first, kicking A and B (as operating on that socket) with A legitimately failing and B going into restart. And losing CPU to D, which does that dup2(), so when B regains CPU it's operating on the socket it never intended to. So this approach seems to be broken, no matter what... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 3:49 ` Al Viro @ 2015-10-21 14:38 ` Alan Burlison 2015-10-21 15:30 ` David Miller ` (3 more replies) 0 siblings, 4 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-21 14:38 UTC (permalink / raw) To: Al Viro, Eric Dumazet Cc: Stephen Hemminger, netdev, dholland-tech, Casper Dik On 21/10/2015 04:49, Al Viro wrote: Firstly, thank you for the comprehensive and considered reply. > Refcount is an implementation detail, of course. However, in any Unix I know > of, there are two separate notions - descriptor losing connection to opened > file (be it from close(), exit(), execve(), dup2(), etc.) and opened file > getting closed. Yep, it's an implementation detail inside the kernel - Solaris also has a refcount inside its vnodes. However that's really only dimly visible at the process level, where all you have is an integer file ID. > The latter cannot happen while there are descriptors connected to the > file in question, of course. However, that is not the only thing > that might prevent an opened file from getting closed - e.g. sending an > SCM_RIGHTS datagram with attached descriptor connected to the opened file > in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened > file until it is successfully received or discarded (in the former case > recepient will get a new descriptor refering to that opened file, of course). > Having the original descriptor closed right after sendmsg(2) does *not* > do anything to opened file. On any Unix that implements descriptor-passing. I believe async IO data is another way that a file can remain live after a close(), from the close() section of IEEE Std 1003.1: "An I/O operation that is not canceled completes as if the close() operation had not yet occurred" > There's going to be a notion of "last close"; that's what this refcount is > about and _that_ is more than implementation detail. 
Yes, POSIX distinguishes between "file descriptor" and "file description" (ugh!) and the close() page says: "When all file descriptors associated with an open file description have been closed, the open file description shall be freed." In the context of this discussion I believe it's the behaviour of the integer file descriptor that's the issue. Once it's had close() called on it then it's invalid, and any IO on it should fail, even if the underlying file description is still 'live'. > In other words, is that destruction of > * any descriptor refering to this socket [utterly insane for obvious > reasons] > * the last descriptor refering to this socket (modulo descriptor > passing, etc.) [a bitch to implement, unless we treat a syscall in progress > as keeping the opened file open], or > * _the_ descriptor used to issue accept(2) [a bitch to implement, > with a lot of fun races in an already race-prone area]? From reading the POSIX close() page I believe the second option is the correct one. > Additional question is whether it's > * just a magical behaviour of close(2) [ugly], or > * something that happens when descriptor gets dissociated from > opened file [obviously more consistent]? The second, I believe. > BTW, for real fun, consider this: > 7) > // fd is a socket > fd2 = dup(fd); > in thread A: accept(fd); > in thread B: accept(fd); > in thread C: accept(fd2); > in thread D: close(fd); > > Which threads (if any), should get hit where it hurts? A & B should return from the accept with an error. C should continue. Which is what happens on Solaris. > I have no idea what semantics does Solaris have in that area and how racy > their descriptor table handling is. And no, I'm not going to RTFS their > kernel, CDDL being what it is. 
I can answer that for you :-) I've looked through the appropriate bits of the Solaris kernel code and my colleague Casper has written an excellent summary of what happens, so with his permission I've just copied it verbatim below:

----------
Since at least Solaris 7 (1998), a thread which is sleeping on a file descriptor which is being closed by another thread, will be woken up.

To this end each thread keeps a list of file descriptors in use by the current active system call.

When a file descriptor is closed and this file descriptor is marked as being in use by other threads, the kernel will search all threads to see which have this file descriptor listed as in use. For each such thread, the kernel tells the thread that its active fds list is now stale and, if possible, makes the thread run.

While this algorithm is pretty expensive, it is not often invoked. The thread running close() will NOT return until all other threads using that file descriptor have released it. When run, the thread will return from its syscall and will in most cases return EBADF. A second thread trying to close this same file descriptor may return earlier with close() returning EBADF.
----------

-- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 14:38 ` Alan Burlison @ 2015-10-21 15:30 ` David Miller 2015-10-21 16:04 ` Casper.Dik 2015-10-21 16:32 ` Fw: " Eric Dumazet ` (2 subsequent siblings) 3 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-21 15:30 UTC (permalink / raw) To: Alan.Burlison Cc: viro, eric.dumazet, stephen, netdev, dholland-tech, casper.dik From: Alan Burlison <Alan.Burlison@oracle.com> Date: Wed, 21 Oct 2015 15:38:51 +0100 > While this algorithm is pretty expensive, it is not often invoked. I bet it can be easily intentionally invoked, by a malicious entity no less. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 15:30 ` David Miller @ 2015-10-21 16:04 ` Casper.Dik 2015-10-21 21:18 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-21 16:04 UTC (permalink / raw) To: David Miller Cc: Alan.Burlison, viro, eric.dumazet, stephen, netdev, dholland-tech From: David Miller <davem@davemloft.net> Date: Wed, 21 Oct 2015 08:30:08 -0700 (PDT) (17:30 CEST) >From: Alan Burlison <Alan.Burlison@oracle.com> >Date: Wed, 21 Oct 2015 15:38:51 +0100 > >> While this algorithm is pretty expensive, it is not often invoked. > >I bet it can be easily intentionally invoked, by a malicious entity no >less. It is only expensive within the process itself. Whether it is run inside the kernel isn't much different in the context of Solaris. If you have an attacker which can run any code, it doesn't really matter what that code is. It is not really expensive (no expensive locks are grabbed, nor held for any length of time). It's basically O(n) in the number of threads in the process. If you have an application which can be triggered into doing that, it is still a bug in the application. Is such a socket still listed with netstat on Linux? I believe it uses /proc and it will not be able to find that socket through the list of opened files. If we look at our typical problem, we have an accept loop:

	for (;;) {
		newfd = accept(fd, ...);
		/* X */
		/* stuff */
	}

while we have a second thread doing a "close(fd);" and possibly opening another file which just happens to return this particular fd.
In Solaris one of the following things will happen, whatever the first thread is doing once close() is called:

 - accept() dies with EBADF (close() before or during the call to accept())
 - accept() returns some other error (new fd you can't accept on)
 - accept() returns a new fd (if it was closed and reopened and the new fd allows accept())

On Linux exactly the same thing happens *except* when we find ourselves in accept(); then we wait until a connection is made or shutdown() is called. I don't think any of the outcomes in the first thread is acceptable; clearly there is insufficient synchronization between the threads. At that point Linux cannot find out who owns the socket:

# netstat -p -a | grep /tmp/unix
unix  2      [ ACC ]     STREAM     LISTENING     14743    -                   /tmp/unix_sock

In Solaris you'd get:

netstat -u -f unix | grep unix_
stream-ord casper 5334 shutdown /tmp/unix_sock

Simple synchronization can be done. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 16:04 ` Casper.Dik @ 2015-10-21 21:18 ` Eric Dumazet 2015-10-21 21:28 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-21 21:18 UTC (permalink / raw) To: Casper.Dik Cc: David Miller, Alan.Burlison, viro, stephen, netdev, dholland-tech On Wed, 2015-10-21 at 18:04 +0200, Casper.Dik@oracle.com wrote: > It is only expensive within the process itself. thread synchro would require additional barriers to maintain this state, for a very unlikely case. The atomic_{inc|dec}() on file refcount are already a problem. We certainly do not use/take a lock to find a file pointer into file descriptor table. (This lock is only used at open() or close() time) ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 21:18 ` Eric Dumazet @ 2015-10-21 21:28 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-10-21 21:28 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, David Miller, Alan.Burlison, stephen, netdev, dholland-tech On Wed, Oct 21, 2015 at 02:18:30PM -0700, Eric Dumazet wrote: > On Wed, 2015-10-21 at 18:04 +0200, Casper.Dik@oracle.com wrote: > > > It is only expensive within the process itself. > > thread synchro would require additional barriers to maintain this state, > for a very unlikely case. > > The atomic_{inc|dec}() on file refcount are already a problem. To be fair, this would avoid at least _that_ - file refcount updates would be less frequent. However, unlike file->f_count, *descriptor* refcounts would be much harder to keep in different cachelines. > We certainly do not use/take a lock to find a file pointer into file > descriptor table. > > (This lock is only used at open() or close() time) as well as dup2() and friends, of course. But yes, we are pretty careful about avoiding non-local stores on the normal "get file by descriptor" path - in case of non-shared descriptor table we don't do them at all, and in case of shared one we only modify refcount in struct file. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 14:38 ` Alan Burlison 2015-10-21 15:30 ` David Miller @ 2015-10-21 16:32 ` Eric Dumazet 2015-10-21 18:51 ` Al Viro 2015-10-23 18:30 ` Fw: " David Holland 3 siblings, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-21 16:32 UTC (permalink / raw) To: Alan Burlison Cc: Al Viro, Stephen Hemminger, netdev, dholland-tech, Casper Dik On Wed, 2015-10-21 at 15:38 +0100, Alan Burlison wrote: > ---------- > Since at least Solaris 7 (1998), a thread which is sleeping > on a file descriptor which is being closed by another thread, > will be woken up. > > To this end each thread keeps a list of file descriptors > in use by the current active system call. Ouch. > > When a file descriptor is closed and this file descriptor > is marked as being in use by other threads, the kernel > will search all threads to see which have this file descriptor > listed as in use. For each such thread, the kernel tells > the thread that its active fds list is now stale and, if > possible, makes the thread run. > This is what I feared. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 14:38 ` Alan Burlison 2015-10-21 15:30 ` David Miller 2015-10-21 16:32 ` Fw: " Eric Dumazet @ 2015-10-21 18:51 ` Al Viro 2015-10-21 20:33 ` Casper.Dik ` (2 more replies) 2015-10-23 18:30 ` Fw: " David Holland 3 siblings, 3 replies; 138+ messages in thread From: Al Viro @ 2015-10-21 18:51 UTC (permalink / raw) To: Alan Burlison Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech, Casper Dik On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > >There's going to be a notion of "last close"; that's what this refcount is > >about and _that_ is more than implementation detail. > > Yes, POSIX distinguishes between "file descriptor" and "file > description" (ugh!) and the close() page says: Would've been better if they went for something like "IO channel" for the latter ;-/ > "When all file descriptors associated with an open file description > have been closed, the open file description shall be freed." BTW, is SCM_RIGHTS outside of scope? They do mention it in one place only: | Ancillary data is also possible at the socket level. The <sys/socket.h> | header shall define the following symbolic constant for use as the cmsg_type | value when cmsg_level is SOL_SOCKET: | | SCM_RIGHTS | Indicates that the data array contains the access rights to be sent or | received. with no further details whatsoever. It's been there since at least 4.3-Reno; does anybody still use the older variant (->msg_accrights, that is)? IIRC, there was some crap circa 2.6 when Solaris used to do ->msg_accrights for descriptor-passing, but more or less current versions appear to support SCM_RIGHTS... In any case, descriptor-passing had been there in some form since at least '83 (the old variant is already present in 4.2) and considering it out-of-scope for POSIX is bloody ridiculous, IMO. 
Unless they want to consider in-flight descriptor-passing datagrams as collections of file descriptors, the quoted sentence is seriously misleading. And then there's mmap(), which they do kinda-sorta mention... > >In other words, is that destruction of > > * any descriptor refering to this socket [utterly insane for obvious > >reasons] > > * the last descriptor refering to this socket (modulo descriptor > >passing, etc.) [a bitch to implement, unless we treat a syscall in progress > >as keeping the opened file open], or > > * _the_ descriptor used to issue accept(2) [a bitch to implement, > >with a lot of fun races in an already race-prone area]? > > From reading the POSIX close() page I believe the second option is > the correct one. Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that behaviour, in your opinion? Because fd is sure as hell not the last descriptor refering to that socket - fd2 remains alive and well. Behaviour you describe below matches the _third_ option. > >BTW, for real fun, consider this: > >7) > >// fd is a socket > >fd2 = dup(fd); > >in thread A: accept(fd); > >in thread B: accept(fd); > >in thread C: accept(fd2); > >in thread D: close(fd); > > > >Which threads (if any), should get hit where it hurts? > > A & B should return from the accept with an error. C should > continue. Which is what happens on Solaris. > To this end each thread keeps a list of file descriptors > in use by the current active system call. Yecchhhh... How much cross-CPU traffic does that cause on multithread processes? Not on close(2), on maintaining the descriptor use counts through the normal syscalls. > When a file descriptor is closed and this file descriptor > is marked as being in use by other threads, the kernel > will search all threads to see which have this file descriptor > listed as in use. For each such thread, the kernel tells > the thread that its active fds list is now stale and, if > possible, makes the thread run. 
> > While this algorithm is pretty expensive, it is not often invoked. Sure, but the upkeep of data structures it would need is there whether you actually end up triggering it or not. Both in memory footprint and in cacheline pingpong... Besides, the list of threads using given descriptor table also needs to be maintained, unless you scan *all* threads in the system (which would be quite a fun wrt latency and affect a lot more than just the process doing something dumb and rare). BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic. It calls fd_close() outside of ->fd_lock (has to, since fd_close() is grabbing that itself), so another thread doing e.g. fcntl(newfd, F_GETFD) in the middle of dup2(oldfd, newfd) might end up with EBADF, even though both before and after dup2() newfd had been open. What's worse, thread A: fd1 = open("foo", ...); fd2 = open("bar", ...); ... dup2(fd1, fd2); thread B: fd = open("baz", ...); might, AFAICS, end up with fd == fd2 and refering to foo instead of baz. All it takes is the last open() managing to grab ->fd_lock just as fd_close() from dup2() has dropped it. Which is an unexpected behaviour, to put it mildly, no matter how much standard lawyering one applies... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 18:51 ` Al Viro @ 2015-10-21 20:33 ` Casper.Dik 2015-10-22 4:21 ` Al Viro 2015-11-02 10:03 ` David Laight 2015-10-21 22:28 ` Alan Burlison 2015-10-22 1:29 ` David Miller 2 siblings, 2 replies; 138+ messages in thread From: Casper.Dik @ 2015-10-21 20:33 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech >On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > >> >There's going to be a notion of "last close"; that's what this refcount is >> >about and _that_ is more than implementation detail. >> >> Yes, POSIX distinguishes between "file descriptor" and "file >> description" (ugh!) and the close() page says: > >Would've been better if they went for something like "IO channel" for >the latter ;-/ Or at least some other word. A file descriptor is just an index to a list of file pointers (and wasn't named so?) >> "When all file descriptors associated with an open file description >> have been closed, the open file description shall be freed." > >BTW, is SCM_RIGHTS outside of scope? They do mention it in one place >only: >| Ancillary data is also possible at the socket level. The <sys/socket.h> >| header shall define the following symbolic constant for use as the cmsg_type >| value when cmsg_level is SOL_SOCKET: >| >| SCM_RIGHTS >| Indicates that the data array contains the access rights to be sent or >| received. > >with no further details whatsoever. It's been there since at least 4.3-Reno; >does anybody still use the older variant (->msg_accrights, that is)? IIRC, >there was some crap circa 2.6 when Solaris used to do ->msg_accrights for >descriptor-passing, but more or less current versions appear to support >SCM_RIGHTS... 
In any case, descriptor-passing had been there in some form >since at least '83 (the old variant is already present in 4.2) and considering >it out-of-scope for POSIX is bloody ridiculous, IMO. SCM_RIGHTS was introduced as part of the POSIX standardization of BSD sockets. Looks like they became part of Solaris 2.6, but the default was non-standard sockets so you may easily find msg->accrights but not SCM_RIGHTS. msg_accrights is what was introduced in BSD in likely the first implementation of socket-based file descriptor passing. SysV has its own mechanism for passing file descriptors. But that interface is too ad hoc, so SCM_RIGHTS is a standardization allowing multiple types of messages to be sent around. >Unless they want to consider in-flight descriptor-passing datagrams as >collections of file descriptors, the quoted sentence is seriously misleading. >And then there's mmap(), which they do kinda-sorta mention... Well, a file descriptor really only exists in the context of a process; in-flight it is no longer a file descriptor, as there is no process context with a file descriptor table; so pointers to file descriptions are passed around. >> >In other words, is that destruction of >> > * any descriptor refering to this socket [utterly insane for obvious >> >reasons] >> > * the last descriptor refering to this socket (modulo descriptor >> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress >> >as keeping the opened file open], or >> > * _the_ descriptor used to issue accept(2) [a bitch to implement, >> >with a lot of fun races in an already race-prone area]? >> >> From reading the POSIX close() page I believe the second option is >> the correct one. > >Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that >behaviour, in your opinion? Because fd is sure as hell not the last >descriptor refering to that socket - fd2 remains alive and well. > >Behaviour you describe below matches the _third_ option.
>> >BTW, for real fun, consider this: >> >7) >> >// fd is a socket >> >fd2 = dup(fd); >> >in thread A: accept(fd); >> >in thread B: accept(fd); >> >in thread C: accept(fd2); >> >in thread D: close(fd); >> > >> >Which threads (if any), should get hit where it hurts? >> >> A & B should return from the accept with an error. C should >> continue. Which is what happens on Solaris. > >> To this end each thread keeps a list of file descriptors >> in use by the current active system call. > >Yecchhhh... How much cross-CPU traffic does that cause on >multithread processes? Not on close(2), on maintaining the >descriptor use counts through the normal syscalls. That is pretty much what we do in the Solaris implementation, but there is not much cross-CPU traffic. Of course, you will need to keep locks in the file descriptor table if only to find the actual file pointer. The work is done only in the case of a badly written application where close is required to hunt down all threads currently using the specific file descriptor. >> When a file descriptor is closed and this file descriptor >> is marked as being in use by other threads, the kernel >> will search all threads to see which have this file descriptor >> listed as in use. For each such thread, the kernel tells >> the thread that its active fds list is now stale and, if >> possible, makes the thread run. >> >> While this algorithm is pretty expensive, it is not often invoked. > >Sure, but the upkeep of data structures it would need is there >whether you actually end up triggering it or not. Both in >memory footprint and in cacheline pingpong... Most of the work on using a file descriptor is local to the thread. >Besides, the list of threads using given descriptor table also needs >to be maintained, unless you scan *all* threads in the system (which >would be quite a fun wrt latency and affect a lot more than just the >process doing something dumb and rare). 
In Solaris all threads using the same file descriptor table are all the threads in the same process. This is typical for Unix but it is not what we have in Linux, or so I'm told. So we already have that particular list of threads. >BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic. It >calls fd_close() outside of ->fd_lock (has to, since fd_close() is >grabbing that itself), so another thread doing e.g. fcntl(newfd, F_GETFD) >in the middle of dup2(oldfd, newfd) might end up with EBADF, even though >both before and after dup2() newfd had been open. What's worse, >thread A: > fd1 = open("foo", ...); > fd2 = open("bar", ...); > ... > dup2(fd1, fd2); >thread B: > fd = open("baz", ...); >might, AFAICS, end up with fd == fd2 and refering to foo instead of baz. >All it takes is the last open() managing to grab ->fd_lock just as fd_close() >from dup2() has dropped it. Which is an unexpected behaviour, to put it >mildly, no matter how much standard lawyering one applies... It could happen even when the implementation does not have any races; but I think you're saying that if you know the open call in thread B comes after the open of fd2, then fd != fd2 should hold. This does indeed look like a bug. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
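A toy model of the scheme Casper describes: each thread records the descriptors pinned by its active system call, a shared-use marker is set lazily when a second thread starts using the same descriptor, and close() takes the expensive scan only when that marker is present. This is purely illustrative Python (all names are invented), not Solaris code:

```python
class KThread:
    """Toy stand-in for a kernel thread: tracks fds its active syscall uses."""
    def __init__(self, name):
        self.name = name
        self.in_use = set()    # descriptors pinned by the current syscall
        self.stale = False     # set when one of those descriptors is closed

class FdTable:
    """Toy per-process descriptor table with a Solaris-style close-time scan."""
    def __init__(self, threads):
        self.threads = threads
        self.shared = set()    # fds known to be in use by more than one thread

    def enter_syscall(self, thread, fd):
        thread.in_use.add(fd)
        if sum(1 for t in self.threads if fd in t.in_use) > 1:
            self.shared.add(fd)      # mark: close() must hunt down the users

    def close(self, fd):
        if fd in self.shared:        # slow path, only for badly written apps
            self.shared.discard(fd)
            for t in self.threads:
                if fd in t.in_use:
                    t.stale = True   # "your active fds list is now stale"

a, b, c = KThread("A"), KThread("B"), KThread("C")
table = FdTable([a, b, c])
table.enter_syscall(a, 3)   # threads A and B block in accept(3)
table.enter_syscall(b, 3)
table.enter_syscall(c, 4)   # thread C uses a dup'ed descriptor, fd 4
table.close(3)              # wakes A and B; C is untouched
assert a.stale and b.stale and not c.stale
```

Note how the well-behaved case (no sharing) never enters the scan; the cost Al is worried about is in maintaining `in_use` on every syscall, not in close() itself.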
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 20:33 ` Casper.Dik @ 2015-10-22 4:21 ` Al Viro 2015-10-22 10:55 ` Alan Burlison 2015-11-02 10:03 ` David Laight 1 sibling, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 4:21 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech On Wed, Oct 21, 2015 at 10:33:04PM +0200, Casper.Dik@oracle.com wrote: > > >On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > > > >> >There's going to be a notion of "last close"; that's what this refcount is > >> >about and _that_ is more than implementation detail. > >> > >> Yes, POSIX distinguishes between "file descriptor" and "file > >> description" (ugh!) and the close() page says: > > > >Would've been better if they went for something like "IO channel" for > >the latter ;-/ > > Or at least some other word. A file descriptor is just an index to > a list of file pointers (and wasn't named so?) *nod* There's no less than 3 distinct notions associated with the word "file" - "file as collection of bytes on filesystem", "opened file as IO channel" and "file descriptor", all related ;-/ "File description" vs. "file descriptor" is atrociously bad terminology. > >Unless they want to consider in-flight descriptor-passing datagrams as > >collections of file descriptors, the quoted sentence is seriously misleading. > >And then there's mmap(), which they do kinda-sorta mention... > > Well, a file descriptor really only exists in the context of a process; > in-flight it is no longer a file descriptor as there process context with > a file descriptor table; so pointers to file descriptions are passed > around. Yes. Note, BTW, that descriptor contains a bit more than a pointer - there are properties associated with it (close-on-exec and is-it-already-claimed), which makes abusing it for describing SCM_RIGHTS payloads even more of a stretch. 
IOW, description of semantics for close() and friends needs fixing - it simply does not match the situation on anything that would be anywhere near POSIX compliance in other areas. > >Sure, but the upkeep of data structures it would need is there > >whether you actually end up triggering it or not. Both in > >memory footprint and in cacheline pingpong... > > Most of the work on using a file descriptor is local to the thread. Using - sure, but what of cacheline dirtied every time you resolve a descriptor to file reference? How much does it cover and to what degree is that local to thread? When it's a refcount inside struct file - no big deal, we'll be reading the same cacheline anyway and unless several threads are pounding on the same file with syscalls at the same time, that won't be a big deal. But when refcounts are associated with descriptors... In case of Linux we have two bitmaps and an array of pointers associated with descriptor table. They grow on demand (in parallel) * reserving a descriptor is done under ->file_lock (dropped/regained around memory allocation if we end up expanding the sucker, actual reassignment of pointers to array/bitmaps is under that spinlock) * installing a pointer is lockless (we wait for ongoing resize to settle, RCU takes care of the rest) * grabbing a file by index is lockless as well * removing a pointer is under ->file_lock, so's replacing it by dup2(). Grabbing a file by descriptor follows pointer from task_struct to descriptor table, from descriptor table to element of array of pointers (embedded when we have few descriptors, but becomes separately allocated when more is needed), and from array element to struct file. In struct file we fetch ->f_mode and (if descriptor table is shared) atomically increment ->f_count. 
For comparison, NetBSD has an extra level of indirection (with similar tricks for embedding them while there are few descriptors), with a lot fatter structure around the pointer to file - they keep close-on-exec and in-use in there, along with refcount and their equivalent of waitqueue. These structures, once they grow past the embedded set, are allocated one-by-one, so copying the table on fork() costs a _lot_ more. Rather than an array of pointers to files they have an array of pointers to those guys. Reserving a descriptor triggers allocation of new struct fdfile and installing a pointer to it into the array. Allows for slightly simpler installing of pointer to file afterwards - unlike us, they don't have to be careful about array resize happening in parallel. Grabbing a file by index is lockless, so's installing a pointer to file. Reserving a descriptor is under ->fd_lock (mutex rather than a spinlock). Removing a pointer is under ->fd_lock, so's replacing it by dup2(), but dup2() has an unpleasant race (see upthread). They do the same amount of pointer-chasing on lookup proper, but only because they do not look into struct file itself there. Which happens immediately afterwards, since callers *will* look into what they've got. I didn't look into the details of barrier use in there, but it looks generally sane. Cacheline pingpong is probably not a big deal there, but only because these structures are fat and scattered. Another fun issue is that they have in-use bits buried deep, which means that they need to mirror them in a separate bitmap - would cost too much otherwise. They actually went for two-level bitmap - the first one with a bit per 32 descriptors, another - with bit per descriptor. Might or might not be worth nicking (and 1:32 ratio needs experimenting)... The worst issues seem to be memory footprint and cost on fork(). Extra level of indirection is unpleasant, but not terribly so. 
Use of mutex instead of spinlock is a secondary issue - it should be reasonably easy to deal with. And there are outright races, both in dup2() and in restart-related logics... Having a mirror of in-use bits is another potential source of races - hadn't looked into that deeply enough. That'll probably come into play on any attempts to reduce fork() costs... > >Besides, the list of threads using given descriptor table also needs > >to be maintained, unless you scan *all* threads in the system (which > >would be quite a fun wrt latency and affect a lot more than just the > >process doing something dumb and rare). > > In Solaris all threads using the same file descriptor table are all the > threads in the same process. This is typical for Unix but it is not what > we have in Linux, or so I'm told. So we already have that particular > list of threads. Linux (as well as *BSD and Plan 9) has descriptor table as first-class sharable resource - any thread you spawn might share or copy it, and you can unshare yours at any time. Maintaining the list of threads using the damn thing wouldn't be too much PITA, anyway - that's a secondary issue. > >BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic. It > >calls fd_close() outside of ->fd_lock (has to, since fd_close() is > >grabbing that itself), so another thread doing e.g. fcntl(newfd, F_GETFD) > >in the middle of dup2(oldfd, newfd) might end up with EBADF, even though > >both before and after dup2() newfd had been open. What's worse, > >thread A: > > fd1 = open("foo", ...); > > fd2 = open("bar", ...); > > ... > > dup2(fd1, fd2); > >thread B: > > fd = open("baz", ...); > >might, AFAICS, end up with fd == fd2 and refering to foo instead of baz. > >All it takes is the last open() managing to grab ->fd_lock just as fd_close() > >from dup2() has dropped it. Which is an unexpected behaviour, to put it > >mildly, no matter how much standard lawyering one applies... 
> > It could happen even when the implementation does not have any races; but > I think you're saying that you know that the open call in thread B > comes after the open to fd2, then fd != fd2. No, what I'm saying is that dup2() of one descriptor thread A has opened to another such descriptor, with no other threads ever doing anything to either should *not* affect anything done by other threads. If open() in thread B comes before fd2 = open("bar", ...), it still should (and will) get a different descriptor. The point is, dup2() over _unused_ descriptor is inherently racy, but dup2() over a descriptor we'd opened and kept open should be safe. As it is, their implementation boils down to "userland must do process-wide exclusion between *any* dup2() and all syscalls that might create a new descriptor - open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the very least, it's a big QoI problem, especially since such userland exclusion would have to be taken around the operations that can block for a long time. Sure, POSIX wording regarding dup2 is so weak that this behaviour can be argued to be compliant, but... replacement of the opened file associated with newfd really ought to be atomic to be of any use for multithreaded processes. IOW, if newfd is an opened descriptor prior to dup2() and no other thread attempts to close it by any means, there should be no window during which it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD seems to do the same, from the quick look. NetBSD doesn't, no idea about Solaris. FWIW, older NetBSD implementation (prior to "File descriptor changes, discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it had fixed a lot of crap, so it's entirely possible that OpenBSD simply has kept the old implementation, with tons of other races in that area, but this dup2() race got introduced in that rewrite. ^ permalink raw reply [flat|nested] 138+ messages in thread
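The two-level in-use bitmap described earlier in this message (one bit per descriptor, plus a summary bit per 32-descriptor word that is set when the word is full, so the scan can skip full words wholesale) is easy to model. This is an illustrative sketch of the idea only, not either kernel's allocator:

```python
NBITS = 32  # descriptors covered by one low-level word

class FdBitmap:
    def __init__(self, nwords=4):
        self.low = [0] * nwords   # one bit per descriptor
        self.high = 0             # one bit per word; set when that word is full

    def alloc(self):
        for w in range(len(self.low)):
            if self.high >> w & 1:
                continue          # word is full: skip 32 descriptors at once
            word = self.low[w]
            bit = (~word & (word + 1)).bit_length() - 1  # lowest clear bit
            self.low[w] |= 1 << bit
            if self.low[w] == (1 << NBITS) - 1:
                self.high |= 1 << w
            return w * NBITS + bit
        raise OSError("descriptor table full")

    def free(self, fd):
        w, bit = divmod(fd, NBITS)
        self.low[w] &= ~(1 << bit)
        self.high &= ~(1 << w)    # the word can no longer be full

m = FdBitmap()
assert m.alloc() == 0 and m.alloc() == 1   # lowest free descriptor first
m.free(0)
assert m.alloc() == 0                      # freed slots are reused first
```

The 1:32 ratio is exactly the "needs experimenting" knob: a coarser summary level skips more slots per test but goes stale sooner on free().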
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:21 ` Al Viro @ 2015-10-22 10:55 ` Alan Burlison 2015-10-22 18:16 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-22 10:55 UTC (permalink / raw) To: Al Viro, Casper.Dik Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech On 22/10/2015 05:21, Al Viro wrote: >> Most of the work on using a file descriptor is local to the thread. > > Using - sure, but what of cacheline dirtied every time you resolve a > descriptor to file reference? Don't you have to do that anyway, to do anything useful with the file? > How much does it cover and to what > degree is that local to thread? When it's a refcount inside struct file - > no big deal, we'll be reading the same cacheline anyway and unless several > threads are pounding on the same file with syscalls at the same time, > that won't be a big deal. But when refcounts are associated with > descriptors... There is a refcount in the struct held in the per-process list of open files and the 'slow path' processing is only taken if there's more than one LWP in the process that's accessing the file. > In case of Linux we have two bitmaps and an array of pointers associated > with descriptor table. They grow on demand (in parallel) > * reserving a descriptor is done under ->file_lock (dropped/regained > around memory allocation if we end up expanding the sucker, actual reassignment > of pointers to array/bitmaps is under that spinlock) > * installing a pointer is lockless (we wait for ongoing resize to > settle, RCU takes care of the rest) > * grabbing a file by index is lockless as well > * removing a pointer is under ->file_lock, so's replacing it by dup2(). Is that table per-process or global? > The point is, dup2() over _unused_ descriptor is inherently racy, but dup2() > over a descriptor we'd opened and kept open should be safe. 
As it is, > their implementation boils down to "userland must do process-wide exclusion > between *any* dup2() and all syscalls that might create a new descriptor - > open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the very least, > it's a big QoI problem, especially since such userland exclusion would have > to be taken around the operations that can block for a long time. Sure, > POSIX wording regarding dup2 is so weak that this behaviour can be argued > to be compliant, but... replacement of the opened file associated with > newfd really ought to be atomic to be of any use for multithreaded processes. There's existing language in the Issue 7 dup2() description that says dup2() has to be atomic: "the dup2( ) function provides unique services, as no other interface is able to atomically replace an existing file descriptor." And there is some new language in Issue 7 Technical Corrigenda 2 that reinforces that, when it's talking about reassignment of stdin/stdout/stderr: "Furthermore, a close() followed by a reopen operation (e.g. open(), dup() etc) is not atomic; dup2() should be used to change standard file descriptors." I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant. > IOW, if newfd is an opened descriptor prior to dup2() and no other thread > attempts to close it by any means, there should be no window during which > it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD > seems to do the same, from the quick look. NetBSD doesn't, no idea about > Solaris. FWIW, older NetBSD implementation (prior to "File descriptor changes, > discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it > had fixed a lot of crap, so it's entirely possible that OpenBSD simply has > kept the old implementation, with tons of other races in that area, but this > dup2() race got introduced in that rewrite. Related to dup2(), there's some rather surprising behaviour on Linux. 
Here's the scenario: ---------- ThreadA opens, listens and accepts on socket fd1, waiting for incoming connections. ThreadB waits for a while, then opens normal file fd2 for read/write. ThreadB uses dup2 to make fd1 a clone of fd2. ThreadB closes fd2. ThreadA remains sat in accept on fd1 which is now a plain file, not a socket. ThreadB writes to fd1, the result of which appears in the file, so fd1 is indeed operating as a plain file. ThreadB exits. ThreadA is still sat in accept on fd1. A connection is made to fd1 by another process. The accept call succeeds and returns the incoming connection. fd1 is still operating as a socket, even though it's now actually a plain file. ---------- I assume this is another consequence of the fact that threads waiting in accept don't get a notification if the fd they are using is closed, either directly by a call to close or by a syscall such as dup2. Not waking up other threads on a fd when it is closed seems like it's undesirable behaviour. I can see the reasoning behind allowing shutdown to be used to do such a wakeup even if that's not POSIX-compliant - it may make it slightly easier for applications to avoid fd recycling races. However the current situation is that shutdown is the *only* way to perform such a wakeup - simply closing the fd has no effect on any other threads. That seems incorrect. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
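The scenario in this message can be reproduced from userspace. The sketch below (loopback TCP, with a 0.2 s sleep to let the thread block in accept(), so it is timing-dependent) shows the Linux behaviour: after dup2() re-points the descriptor at /dev/null, the already-blocked accept() still completes because the syscall took its own reference to the listening description when it looked the descriptor up:

```python
import os
import socket
import threading
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

result = {}

def acceptor():
    # the in-progress accept() holds a reference to the listening
    # description, taken when the syscall resolved the descriptor
    conn, _ = srv.accept()
    result["data"] = conn.recv(2)

t = threading.Thread(target=acceptor)
t.start()
time.sleep(0.2)                      # let the thread block inside accept()

null = os.open(os.devnull, os.O_RDWR)
os.dup2(null, srv.fileno())          # the descriptor number now names /dev/null

cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"ok")                   # ...yet the old description still accepts
t.join(5)
assert result.get("data") == b"ok"   # accept() succeeded despite the dup2()
```

Afterwards the descriptor number really is a character device, while the accept has returned a perfectly good connection: both halves of the "surprising behaviour" at once.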
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 10:55 ` Alan Burlison @ 2015-10-22 18:16 ` Al Viro 2015-10-22 20:15 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 18:16 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote: > On 22/10/2015 05:21, Al Viro wrote: > > >>Most of the work on using a file descriptor is local to the thread. > > > >Using - sure, but what of cacheline dirtied every time you resolve a > >descriptor to file reference? > > Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. > >In case of Linux we have two bitmaps and an array of pointers associated > >with descriptor table. They grow on demand (in parallel) > > * reserving a descriptor is done under ->file_lock (dropped/regained > >around memory allocation if we end up expanding the sucker, actual reassignment > >of pointers to array/bitmaps is under that spinlock) > > * installing a pointer is lockless (we wait for ongoing resize to > >settle, RCU takes care of the rest) > > * grabbing a file by index is lockless as well > > * removing a pointer is under ->file_lock, so's replacing it by dup2(). > > Is that table per-process or global? Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). It's common for Plan 9-inspired models - basically, you treat every thread as a machine that consists of * memory * file descriptor table * namespace * signal handlers ... * CPU (i.e. actual thread of execution). The last part can't be shared; anything else can. 
fork(2) variant used to start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD) is told which components should be copies of parent's ones and which should be shared with the parent. fork(2) is simply "copy everything except for the namespace". It's fairly common to have "share everything", but intermediate variants are also possible. There are constraints (e.g. you can't share signal handlers without sharing the memory space), but descriptor table can be shared independently from memory space just fine. There's also a way to say "unshare this, this and that components" - mapped to unshare(2) in Linux and to rfork(2) in Plan 9. Best way to think of that is to consider descriptor table as a first-class object a thread can be connected to. Usually you have one for each process, with all threads belonging to that process connected to the same thing, but that's just the most common use. > I don't think that it's possible to claim that a non-atomic dup2() > is POSIX-compliant. Except that it's in non-normative part of dup2(2), AFAICS. I certainly agree that it would be a standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical... > ThreadA remains sat in accept on fd1 which is now a plain file, not > a socket. No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon for use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETFL{,W}, where we recheck that descriptor is refering to the same thing before granting the lock. 
Again, POSIX is still underspecifying the semantics of shared descriptor tables; back when the bulk of it had been written there had been no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still hadn't picked on some things that happened in early 80s, let alone early-to-mid 90s... Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. ^ permalink raw reply [flat|nested] 138+ messages in thread
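The "descriptor table as a first-class sharable resource" model above can be observed from plain POSIX userspace, even without clone(): threads share one table, while fork() takes a copy. A small sketch (assumes os.fork is available, i.e. a Unix-like host):

```python
import os
import threading

r, w = os.pipe()

# fork() copies the descriptor table: the child's close() leaves the
# parent's entries untouched
pid = os.fork()
if pid == 0:
    os.close(r)
    os.close(w)
    os._exit(0)
os.waitpid(pid, 0)
os.write(w, b"x")
assert os.read(r, 1) == b"x"          # parent's descriptors are intact

# threads share the table: a dup2() done in one thread is immediately
# visible to every other thread in the process
null = os.open(os.devnull, os.O_RDONLY)
t = threading.Thread(target=os.dup2, args=(null, r))
t.start()
t.join()
assert os.fstat(r).st_rdev == os.fstat(null).st_rdev
```

clone(2) with or without CLONE_FILES, and unshare(2), just make this copy-vs-share choice explicit per thread instead of tying it to the process boundary.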
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 18:16 ` Al Viro @ 2015-10-22 20:15 ` Alan Burlison 0 siblings, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-22 20:15 UTC (permalink / raw) To: Al Viro Cc: Casper.Dik, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech On 22/10/15 19:16, Al Viro wrote: >> Don't you have to do that anyway, to do anything useful with the file? > > Dirtying the cacheline that contains struct file itself is different, but > that's not per-descriptor. Yes, true enough. > Usually it's per-process, but any thread could ask for a private instance > to work with (and then spawn more threads sharing that instance - or getting > independent copies). [snip] Thanks again for the info, interesting. Solaris also does things along the same lines. In fact we recently extended posix_spawn so it could be used by the JVM to create subprocesses without wholescale copying, and Casper has done some optimisation work on posix_spawn as well. >> I don't think that it's possible to claim that a non-atomic dup2() >> is POSIX-compliant. > > Except that it's in non-normative part of dup2(2), AFAICS. I certainly > agree that it would be a standard lawyering beyond reason, but "not > possible to claim" is too optimistic. Maybe I'm just more cynical... Possibly so, and possibly justifiably so as well ;-) >> ThreadA remains sat in accept on fd1 which is now a plain file, not >> a socket. > > No. accept() is not an operation on file descriptors; it's an operation on > file descriptions (pardon for use of that terminology). They are specified > by passing descriptors, but there's a hell of a difference between e.g. > dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() > (operations on descriptions). > > Lookups are done once per syscall; the only exception is F_SETFL{,W}, where > we recheck that descriptor is refering to the same thing before granting > the lock. 
Yes, but if you believe that dup2() requires an implicit close() within it and that dup2() has to be atomic then expecting that other threads waiting on the same fd in accept() will get a notification seems reasonable enough. > Again, POSIX is still underspecifying the semantics of shared descriptor > tables; back when the bulk of it had been written there had been no way > to have a descriptor -> description mapping changed under a syscall by > action of another thread. Hell, they still hadn't picked on some things > that happened in early 80s, let alone early-to-mid 90s... That's indisputably true - much of the POSIX behaviour predates threads and it shows, quite badly, in some places. > Linux and Solaris happen to cover these gaps differently; FreeBSD and > OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. Yes, true enough. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
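The atomic-replacement service discussed above is easy to see in action: the descriptor number stays continuously valid while the description behind it is swapped in a single step. A small illustration (nothing here is specific to any one kernel):

```python
import os
import tempfile

tmp = tempfile.TemporaryFile()
null = os.open(os.devnull, os.O_WRONLY)

fd = os.dup(tmp.fileno())                 # fd names the temp file's description
assert os.fstat(fd).st_ino == os.fstat(tmp.fileno()).st_ino

# dup2() does the implicit close of fd and the re-point at null's
# description as one operation: fd is never observed as unused in between,
# which is why it is the only safe way to redirect a live descriptor
os.dup2(null, fd)
assert os.fstat(fd).st_rdev == os.fstat(null).st_rdev
os.write(fd, b"discarded")                # fd now behaves as /dev/null
```

A close() followed by open() would leave a window in which another thread's open() could grab the freed number, which is precisely the recycling race the TC2 wording warns about.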
* RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 20:33 ` Casper.Dik 2015-10-22 4:21 ` Al Viro @ 2015-11-02 10:03 ` David Laight 2015-11-02 10:29 ` Al Viro 1 sibling, 1 reply; 138+ messages in thread From: David Laight @ 2015-11-02 10:03 UTC (permalink / raw) To: 'Casper.Dik@oracle.com', Al Viro Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech From: Casper.Dik@oracle.com > Sent: 21 October 2015 21:33 .. > >Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that > >behaviour, in your opinion? Because fd is sure as hell not the last > >descriptor refering to that socket - fd2 remains alive and well. > > > >Behaviour you describe below matches the _third_ option. > > >> >BTW, for real fun, consider this: > >> >7) > >> >// fd is a socket > >> >fd2 = dup(fd); > >> >in thread A: accept(fd); > >> >in thread B: accept(fd); > >> >in thread C: accept(fd2); > >> >in thread D: close(fd); If D is going to do this, it ought to be using dup2() to clone a copy of (say) /dev/null onto 'fd'. > >> >Which threads (if any), should get hit where it hurts? > >> > >> A & B should return from the accept with an error. C should > >> continue. Which is what happens on Solaris. That seems to completely break the normal *nix file scheme... > >> To this end each thread keeps a list of file descriptors > >> in use by the current active system call. > > Remember, Solaris (and SYSV) has extra levels of multiplexing between userspace and char special drivers (and probably sockets) than Linux does. As well as having multiple fd referencing the same struct FILE, multiple FILE can point to the same inode. If you have two different /dev entries for the same major/minor you also end up with separate inodes - all finally referencing the same driver data (indexed only by minor number). (You actually get two inodes in the chain, one for the disk filesystem and one char-special. 
The ref counts on both can be greater than 1.) David ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-02 10:03 ` David Laight @ 2015-11-02 10:29 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-11-02 10:29 UTC (permalink / raw) To: David Laight Cc: 'Casper.Dik@oracle.com', Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech On Mon, Nov 02, 2015 at 10:03:58AM +0000, David Laight wrote: > Remember, Solaris (and SYSV) has extra levels of multiplexing between > userspace and char special drivers (and probably sockets) than Linux does. > As well as having multiple fd referencing the same struct FILE, multiple > FILE can point to the same inode. As they ever could on any Unix. Every open(2) results in a new struct file (BTW, I've never seen that capitalization for kernel structure - not in v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather surprised if any kernel, Solaris included, would've renamed 'struct file' to 'struct FILE'). > If you have two different /dev entries for the same major/minor you > also end up with separate inodes - all finally referencing the same > driver data (indexed only by minor number). Again, the same goes for all Unices, both Linux and Solaris included. And what the devil does that have to do with sockets, anyway? Or with the problem in question, while we are at it - they have different descriptors pointing to the same struct file behave differently; anything sensitive to file type would be past that point. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 18:51 ` Al Viro 2015-10-21 20:33 ` Casper.Dik @ 2015-10-21 22:28 ` Alan Burlison 2015-10-22 1:29 ` David Miller 2 siblings, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-21 22:28 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, Stephen Hemminger, netdev, dholland-tech, Casper Dik On 21/10/2015 19:51, Al Viro wrote: > BTW, is SCM_RIGHTS outside of scope? They do mention it in one place > only: > | Ancillary data is also possible at the socket level. The <sys/socket.h> > | header shall define the following symbolic constant for use as the cmsg_type > | value when cmsg_level is SOL_SOCKET: > | > | SCM_RIGHTS > | Indicates that the data array contains the access rights to be sent or > | received. That's still exactly the same in Issue 7, Technical Corrigenda 2 which is in final review at the moment. >>> In other words, is that destruction of >>> * any descriptor refering to this socket [utterly insane for obvious >>> reasons] >>> * the last descriptor refering to this socket (modulo descriptor >>> passing, etc.) [a bitch to implement, unless we treat a syscall in progress >>> as keeping the opened file open], or >>> * _the_ descriptor used to issue accept(2) [a bitch to implement, >>> with a lot of fun races in an already race-prone area]? >> >> From reading the POSIX close() page I believe the second option is >> the correct one. > > Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that > behaviour, in your opinion? Because fd is sure as hell not the last > descriptor refering to that socket - fd2 remains alive and well. > > Behaviour you describe below matches the _third_ option. Ah, I wasn't 100% sure what the intended difference between #2 and #3 was TBH, it does sound like I meant #3, yes :-) -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
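The SCM_RIGHTS mechanism quoted above can be exercised from userspace; the receiver ends up with a fresh descriptor number naming the same open file description, which is the "pointers to file descriptions are passed around" point made earlier in the thread. A sketch using the standard socket/os modules:

```python
import array
import os
import socket

# Pass the write end of a pipe across a Unix-domain socketpair via SCM_RIGHTS.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()

fds = array.array("i", [w])
a.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

msg, ancdata, flags, addr = b.recvmsg(1, socket.CMSG_SPACE(fds.itemsize))
received = array.array("i")
for level, ctype, data in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
        received.frombytes(data[: len(data) - (len(data) % received.itemsize)])

new_fd = received[0]
# new_fd is a different descriptor number, but it names the same
# description as w: writes through it surface at the pipe's read end
os.write(new_fd, b"hello")
assert os.read(r, 5) == b"hello"
```

While the message sat in the socket queue, there was no descriptor at all, only the kernel's reference to the description; the receiving recvmsg() is what mints the new table entry.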
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 18:51 ` Al Viro 2015-10-21 20:33 ` Casper.Dik 2015-10-21 22:28 ` Alan Burlison @ 2015-10-22 1:29 ` David Miller 2015-10-22 4:17 ` Alan Burlison 2 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-22 1:29 UTC (permalink / raw) To: viro Cc: Alan.Burlison, eric.dumazet, stephen, netdev, dholland-tech, casper.dik From: Al Viro <viro@ZenIV.linux.org.uk> Date: Wed, 21 Oct 2015 19:51:04 +0100 > Sure, but the upkeep of data structures it would need is there > whether you actually end up triggering it or not. Both in > memory footprint and in cacheline pingpong... +1 ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 1:29 ` David Miller @ 2015-10-22 4:17 ` Alan Burlison 2015-10-22 4:44 ` Al Viro 2015-10-22 6:15 ` Casper.Dik 0 siblings, 2 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-22 4:17 UTC (permalink / raw) To: David Miller, viro Cc: eric.dumazet, stephen, netdev, dholland-tech, casper.dik On 22/10/2015 02:29, David Miller wrote: > From: Al Viro <viro@ZenIV.linux.org.uk> > Date: Wed, 21 Oct 2015 19:51:04 +0100 > >> Sure, but the upkeep of data structures it would need is there >> whether you actually end up triggering it or not. Both in >> memory footprint and in cacheline pingpong... > > +1 It's been said that the current mechanisms in Linux & some BSD variants can be subject to races, and the behaviour exhibited doesn't conform to POSIX, for example requiring the use of shutdown() on unconnected sockets because close() doesn't kick off other threads accept()ing on the same fd. I'd be interested to hear if there's a better and more performant way of handling the situation that doesn't involve doing the sort of bookkeeping Casper described. To quote one of my colleague's favourite sayings: Performance is a goal, correctness is a constraint. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
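The behaviour in dispute here — shutdown() on an unconnected listening socket waking a blocked accept() — is straightforward to demonstrate on Linux. A timing-dependent sketch (0.2 s sleep to let the thread block; on Solaris, per the original bug report, the shutdown() would instead fail with ENOTCONN and the accept() would stay blocked):

```python
import errno
import socket
import threading
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

err = {}

def acceptor():
    try:
        srv.accept()
    except OSError as e:
        err["errno"] = e.errno

t = threading.Thread(target=acceptor)
t.start()
time.sleep(0.2)                     # let the thread block inside accept()

try:
    srv.shutdown(socket.SHUT_RDWR)  # Linux: succeeds and wakes the accept
except OSError as e:
    err["shutdown"] = e.errno       # Solaris would fail here with ENOTCONN

t.join(5)
# On Linux the blocked accept() returns EINVAL, matching the bug report:
assert err.get("errno") == errno.EINVAL
```

close(srv.fileno()) in place of the shutdown() would leave the acceptor thread blocked indefinitely, which is exactly the asymmetry being complained about.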
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:17 ` Alan Burlison @ 2015-10-22 4:44 ` Al Viro 2015-10-22 6:03 ` Al Viro ` (2 more replies) 2015-10-22 6:15 ` Casper.Dik 1 sibling, 3 replies; 138+ messages in thread From: Al Viro @ 2015-10-22 4:44 UTC (permalink / raw) To: Alan Burlison Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik On Thu, Oct 22, 2015 at 05:17:50AM +0100, Alan Burlison wrote: > It's been said that the current mechanisms in Linux & some BSD > variants can be subject to races You do realize that it goes for the entire area? And the races found in this thread are in the BSD variant that tries to do something similar to what you guys say Solaris is doing, so I'm not sure which way does that argument go. A high-level description with the same level of details will be race-free in _all_ of them. The devil is in the details, of course, and historically they had been very easy to get wrong. And extra complexity in that area doesn't make things better. >, and the behaviour exhibited > doesn't conform to POSIX, for example requiring the use of > shutdown() on unconnected sockets because close() doesn't kick off > other threads accept()ing on the same fd. Umm... The old kernels (and even more - old userland) are not going to disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or no POSIX. > I'd be interested to hear > if there's a better and more performant way of handling the > situation that doesn't involve doing the sort of bookkeeping Casper > described,. So would a lot of other people. > To quote one of my colleague's favourite sayings: Performance is a > goal, correctness is a constraint. Except that in this case "correctness" is the matter of rather obscure and ill-documented areas in POSIX. Don't get me wrong - this semantics isn't inherently bad, but it's nowhere near being an absolute requirement. 
^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:44 ` Al Viro @ 2015-10-22 6:03 ` Al Viro 2015-10-22 6:34 ` Casper.Dik 2015-10-22 6:51 ` Casper.Dik 2015-10-22 11:15 ` Alan Burlison 2 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 6:03 UTC (permalink / raw) To: Alan Burlison Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik On Thu, Oct 22, 2015 at 05:44:58AM +0100, Al Viro wrote: > Except that in this case "correctness" is the matter of rather obscure and > ill-documented areas in POSIX. Don't get me wrong - this semantics isn't > inherently bad, but it's nowhere near being an absolute requirement. PS: in principle, a fairly ugly trick might suffice for accept(2), but I'm less than happy with going there. Namely, we could * have ->accept() get descriptor number * have ->flush() get descriptor number in addition to current->files and have it DTRT for sockets in the middle of accept(2). However, in addition to being ugly as hell, it has the problem with the points where we call ->flush(), specifically do_dup2() and __close_fd(). It's done *after* the replacement/removal from descriptor table, so another socket might have already gotten the same descriptor and we'd get spurious termination of accept(2). And I'm really curious about the things Solaris would do with dup2() there. Does it take into account the possibility of new accept() coming just as dup2() is trying to terminate the ongoing ones? Is there a window when descriptor-to-file lookups would fail? Looks like a race/deadlock country... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 6:03 ` Al Viro @ 2015-10-22 6:34 ` Casper.Dik 2015-10-22 17:21 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 6:34 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech >And I'm really curious about the things Solaris would do with dup2() there. >Does it take into account the possibility of new accept() coming just as >dup2() is trying to terminate the ongoing ones? Is there a window when >descriptor-to-file lookups would fail? Looks like a race/deadlock country... Solaris does not "terminate" threads, instead it tells them that the file descriptor information used is stale and wakes up the thread. The accept call gets woken up and it checks for incoming connections; it will then either find a new connection and return that particular connection, or it will find nothing and return EINTR; in the post-syscall glue this is checked (the kernel thread has been told to take the expensive post-syscall routine) and if the system call was interrupted, EBADF is returned instead. It is also possible for the connection to come in late, in which case the socket will be changed and the already accepted (in TCP terms, not in socket API terms) embryonic connection will be closed too, as is normal when a listening socket with a queue of not-yet-accept()ed connections is closed. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 6:34 ` Casper.Dik @ 2015-10-22 17:21 ` Al Viro 2015-10-22 18:24 ` Casper.Dik 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 17:21 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 08:34:19AM +0200, Casper.Dik@oracle.com wrote: > > >And I'm really curious about the things Solaris would do with dup2() there. > >Does it take into account the possibility of new accept() coming just as > >dup2() is trying to terminate the ongoing ones? Is there a window when > >descriptor-to-file lookups would fail? Looks like a race/deadlock country... > > Solaris does not "terminate" threads, instead it tells them that the > file descriptor information used is stale and wakes up the thread. Sorry, lousy wording - I meant "terminate syscall in another thread". Better yet, make that "what happens if new accept(newfd) comes while dup2() waits for affected syscalls in other threads to finish"? Assuming it does wait, that is... While we are at it, what's the relative order of record locks removal and switching the meaning of newfd? In our kernel it happens *after* the switchover (i.e. if another thread is waiting for a record lock held on any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see the state with newfd still referring to what it used to; note that waiter might be using any descriptor referring to the file newfd used to refer to, so it won't be affected by the "wake those who had the meaning of their arguments change" side of things). ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 17:21 ` Al Viro @ 2015-10-22 18:24 ` Casper.Dik 2015-10-22 19:07 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 18:24 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech >On Thu, Oct 22, 2015 at 08:34:19AM +0200, Casper.Dik@oracle.com wrote: >> >> >> >And I'm really curious about the things Solaris would do with dup2() there. >> >Does it take into account the possibility of new accept() coming just as >> >dup2() is trying to terminate the ongoing ones? Is there a window when >> >descriptor-to-file lookups would fail? Looks like a race/deadlock country... >> >> Solaris does not "terminate" threads, instead it tells them that the >> file descriptor information used is stale and wakes up the thread. >Sorry, lousy wording - I meant "terminate syscall in another thread". >Better yet, make that "what happens if new accept(newfd) comes while dup2() >waits for affected syscalls in other threads to finish"? Assuming it >does wait, that is.. No, there is no such window; the accept() call either returns EBADF (dup2() wins the race) or it returns a new file descriptor (and dup2() then closes the listening descriptor). One or the other. >While we are at it, what's the relative order of record locks removal >and switching the meaning of newfd? In our kernel it happens *after* >the switchover (i.e. if another thread is waiting for a record lock held on >any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see >the state with newfd still referring to what it used to; note that waiter >might be using any descriptor referring to the file newfd used to refer >to, so it won't be affected by the "wake those who had the meaning of >their arguments change" side of things).
The external behaviour is atomic; you cannot distinguish the order between the closing of the original file (and waking up other threads waiting for a record lock) or changing the file referenced by that newfd. But this does not involve a global or process-specific lock. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 18:24 ` Casper.Dik @ 2015-10-22 19:07 ` Al Viro 2015-10-22 19:51 ` Casper.Dik 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 19:07 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote: > The external behaviour is atomic; you cannot distinguish the order > between the closing of the original file (and waking up other threads > waiting for a record lock) or changing the file referenced by that newfd. > > But this does not involve a global or process-specific lock. Interesting... Do you mean that descriptor-to-file lookup blocks until that rundown finishes? ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 19:07 ` Al Viro @ 2015-10-22 19:51 ` Casper.Dik 2015-10-22 21:57 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 19:51 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech >On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote: > >> The external behaviour is atomic; you cannot distinguish the order >> between the closing of the original file (and waking up other threads >> waiting for a record lock) or changing the file referenced by that newfd. >> >> But this does not involve a global or process-specific lock. >Interesting... Do you mean that descriptor-to-file lookup blocks until that >rundown finishes? For that particular file descriptor, yes. (I'm assuming you mean the Solaris kernel running down all lwps who have a system call in progress on that particular file descriptor). All other fd-to-file lookups are not blocked at all by this locking. It should be clear that any such occurrences are application errors and should be hardly ever seen in practice. It is also known when this is needed, so it is hardly ever attempted. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 19:51 ` Casper.Dik @ 2015-10-22 21:57 ` Al Viro 2015-10-23 9:52 ` Casper.Dik 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 21:57 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 09:51:05PM +0200, Casper.Dik@oracle.com wrote: > > >On Thu, Oct 22, 2015 at 08:24:51PM +0200, Casper.Dik@oracle.com wrote: > > > >> The external behaviour is atomic; you cannot distinguish the order > >> between the closing of the original file (and waking up other threads > >> waiting for a record lock) or changing the file referenced by that newfd. > >> > >> But this does not involve a global or process-specific lock. > > > >Interesting... Do you mean that descriptor-to-file lookup blocks until that > >rundown finishes? > > For that particular file descriptor, yes. (I'm assuming you mean the > Solaris kernel running down all lwps who have a system call in progress on that > particular file descriptor). All other fd-to-file lookups are not blocked > at all by this locking. > > It should be clear that any such occurrences are application errors and > should be hardly ever seen in practice. It is also known when this is > needed, so it is hardly ever attempted. Ho-hum... It could even be made lockless in fast path; the problems I see are * descriptor-to-file lookup becomes unsafe in a lot of locking conditions. Sure, most of that happens on the entry to some syscall, with very light locking environment, but... auditing every sodding ioctl that might be doing such lookups is an interesting exercise, and then there are ->mount() instances doing the same thing. And procfs accesses. Probably nothing impossible to deal with, but nothing pleasant either. * memory footprint.
In case of Linux on amd64 or sparc64,

main()
{
	int i;

	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
		;
}

will chew 132Mb of kernel data (16M pointers + 32Mbit, assuming sufficient ulimit -n, of course). How much will Solaris eat on the same? * related to the above - how much cacheline sharing will that involve? These per-descriptor use counts are bitch to pack, and giving each a cacheline of its own... <shudder> ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 21:57 ` Al Viro @ 2015-10-23 9:52 ` Casper.Dik 2015-10-23 13:02 ` Eric Dumazet 2015-10-24 2:30 ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro 0 siblings, 2 replies; 138+ messages in thread From: Casper.Dik @ 2015-10-23 9:52 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech >Ho-hum... It could even be made lockless in fast path; the problems I see >are > * descriptor-to-file lookup becomes unsafe in a lot of locking >conditions. Sure, most of that happens on the entry to some syscall, with >very light locking environment, but... auditing every sodding ioctl that >might be doing such lookups is an interesting exercise, and then there are >->mount() instances doing the same thing. And procfs accesses. Probably >nothing impossible to deal with, but nothing pleasant either. In the Solaris kernel code, the ioctl code is generally not handed a file descriptor but instead a file pointer (i.e., the lookup is done early in the system call). In those specific cases where a system call needs to convert a file descriptor to a file pointer, there is only one routine which can be used. > * memory footprint. In case of Linux on amd64 or sparc64, >main() >{ > int i; > for (i = 0; i < 1<<24; dup2(0, i++)) // 16M descriptors > ; >} >will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n, >of course). How much will Solaris eat on the same? Yeah, that is a large amount of memory. Of course, the table is only sized when it is extended and there is a reason why there is a limit on file descriptors. But we're using more data per file descriptor entry. > * related to the above - how much cacheline sharing will that involve? >These per-descriptor use counts are bitch to pack, and giving each a cacheline >of its own...
<shudder> As I said, we do actually use a lock and yes that means that you really want to have a single cache line for each and every entry. It does make it easy to have non-racy file description updates. You certainly do not want false sharing when there is a lot of contention. Other data is used to make sure that it only takes O(log(n)) to find the lowest available file descriptor entry. (Where n, I think, is the returned descriptor) Uncontended locks aren't expensive. And all is done on a single cache line. One question about the Linux implementation: what happens when a socket in select is closed? I'm assuming that the kernel waits until "shutdown" is given or when a connection comes in? Is it a problem that you can "hide" your listening socket with a thread in accept()? I would think so (it would be visible in netstat but you can't easily find out who has it) Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 9:52 ` Casper.Dik @ 2015-10-23 13:02 ` Eric Dumazet 2015-10-23 13:20 ` Casper.Dik 2015-10-23 13:35 ` Alan Burlison 2015-10-24 2:30 ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro 1 sibling, 2 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 13:02 UTC (permalink / raw) To: Casper.Dik Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 11:52 +0200, Casper.Dik@oracle.com wrote: > > >Ho-hum... It could even be made lockless in fast path; the problems I see > >are > > * descriptor-to-file lookup becomes unsafe in a lot of locking > >conditions. Sure, most of that happens on the entry to some syscall, with > >very light locking environment, but... auditing every sodding ioctl that > >might be doing such lookups is an interesting exercise, and then there are > >->mount() instances doing the same thing. And procfs accesses. Probably > >nothing impossible to deal with, but nothing pleasant either. > > In the Solaris kernel code, the ioctl code is generally not handled a file > descriptor but instead a file pointer (i.e., the lookup is done early in > the system call). > > In those specific cases where a system call needs to convert a file > descriptor to a file pointer, there is only one routines which can be used. > > > * memory footprint. In case of Linux on amd64 or sparc64, > >main() > >{ > > int i; > > for (i = 0; i < 1<<24; dup2(0, i++)) // 16M descriptors > > ; > >} > >will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n, > >of course). How much will Solaris eat on the same? > > Yeah, that is a large amount of memory. Of course, the table is only > sized when it is extended and there is a reason why there is a limit on > file descriptors. But we're using more data per file descriptor entry. 
> > > > * related to the above - how much cacheline sharing will that involve? > >These per-descriptor use counts are bitch to pack, and giving each a cacheline > >of its own... <shudder> > > As I said, we do actually use a lock and yes that means that you really > want to have a single cache line for each and every entry. It does make > it easy to have non-racy file description updates. You certainly do not > want false sharing when there is a lot of contention. > > Other data is used to make sure that it only takes O(log(n)) to find the > lowest available file descriptor entry. (Where n, I think, is the returned > descriptor) Yet another POSIX deficiency. When a server deals with 10,000,000+ socks, we absolutely do not care of this requirement. O(log(n)) is still crazy if it involves O(log(n)) cache misses. > > Not contended locks aren't expensive. And all is done on a single cache > line. > > One question about the Linux implementation: what happens when a socket in > select is closed? I'm assuming that the kernel waits until "shutdown" is > given or when a connection comes in? > > Is it a problem that you can "hide" your listening socket with a thread in > accept()? I would think so (It would be visible in netstat but you can't > easily find out why has it) Again, netstat -p on a server with 10,000,000 sockets never completes. Never try this unless you are desperate and want to avoid a reboot maybe. If you absolutely want to nuke a listener because of untrusted applications, we better implement a proper syscall. Android has such a facility. Alternative would be to extend netlink (ss command from iproute2 package) to carry one pid per socket. ss -atnp state listening -> would not have to readlink (/proc/*/fd/*) ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 13:02 ` Eric Dumazet @ 2015-10-23 13:20 ` Casper.Dik 2015-10-23 13:48 ` Eric Dumazet 2015-10-23 14:13 ` Eric Dumazet 1 sibling, 2 replies; 138+ messages in thread From: Casper.Dik @ 2015-10-23 13:20 UTC (permalink / raw) To: Eric Dumazet Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech >Yet another POSIX deficiency. > >When a server deals with 10,000,000+ socks, we absolutely do not care of >this requirement. > >O(log(n)) is still crazy if it involves O(log(n)) cache misses. You miss the fine point of the algorithm; you *always* find an available file descriptor in O(log(N)) (where N is the size of the table). Does your algorithm guarantee that? >> Is it a problem that you can "hide" your listening socket with a thread in >> accept()? I would think so (It would be visible in netstat but you can't >> easily find out who has it) > >Again, netstat -p on a server with 10,000,000 sockets never completes. This point was not about a 10M sockets server but in general... >Never try this unless you are desperate and want to avoid a reboot >maybe. > >If you absolutely want to nuke a listener because of untrusted >applications, we better implement a proper syscall. > >Android has such a facility. Solaris has had such an option too, but that wasn't the point. You really don't want to know which application is doing this? >Alternative would be to extend netlink (ss command from iproute2 >package) to carry one pid per socket. > >ss -atnp state listening That would be an option too. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 13:20 ` Casper.Dik @ 2015-10-23 13:48 ` Eric Dumazet 2015-10-23 14:13 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 13:48 UTC (permalink / raw) To: Casper.Dik Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 15:20 +0200, Casper.Dik@oracle.com wrote: > This point was not about a 10M sockets server but in general... > Hey, we do not design the OS only for handling desktop workloads. If netstat does not work on this typical server workload, then it is a joke. > Solaris has had such an option too, but that wasn't the point. You really > don't want to know which application is doing this? Apparently nobody had this need before you asked. My point is that instead of carrying ~30 years of legacy, we had better design the system properly so that we can _use_ it in all situations. ps used to open /dev/kmem a long time ago. fuser and similar commands having to know system internals are really hacks. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 13:20 ` Casper.Dik 2015-10-23 13:48 ` Eric Dumazet @ 2015-10-23 14:13 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 14:13 UTC (permalink / raw) To: Casper.Dik Cc: Al Viro, Alan Burlison, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 15:20 +0200, Casper.Dik@oracle.com wrote: > > >Yet another POSIX deficiency. > > > >When a server deals with 10,000,000+ socks, we absolutely do not care of > >this requirement. > > > >O(log(n)) is still crazy if it involves O(log(n)) cache misses. > > You miss the fine point of the algorithm; you *always* find an available > file descriptor in O(log(N)) (where N is the size of the table). > > Does your algorithm guarantee that? It's a simple bitmap. Each cache line contains 64*8 = 512 bits. Lookup is actually quite fast thanks to hardware prefetches. The main problem we had until recently was that we used to acquire the fd table lock 3 times per accept()/close() pair. This patch took care of the issue : http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a81252b774b53e628a8a0fe18e2b8fc236d92cc Then, adding an O_RANDOMFD flag (or O_DO_NOT_REQUIRE_POSIX_FD_RULES) is helping a lot, as it allows us to pick a random starting point during the bitmap search. In practice, finding a slot in the fd array is O(1) : one cache line miss exactly. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 13:02 ` Eric Dumazet 2015-10-23 13:20 ` Casper.Dik @ 2015-10-23 13:35 ` Alan Burlison 2015-10-23 14:21 ` Eric Dumazet 1 sibling, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-23 13:35 UTC (permalink / raw) To: Eric Dumazet, Casper.Dik Cc: Al Viro, David Miller, stephen, netdev, dholland-tech On 23/10/2015 14:02, Eric Dumazet wrote: >> Other data is used to make sure that it only takes O(log(n)) to find the >> lowest available file descriptor entry. (Where n, I think, is the returned >> descriptor) > > Yet another POSIX deficiency. > > When a server deals with 10,000,000+ socks, we absolutely do not care of > this requirement. > > O(log(n)) is still crazy if it involves O(log(n)) cache misses. If you think it's a POSIX deficiency then logging a DR is probably the correct way of addressing that. And as I've said it's fine to decide that you don't care about what POSIX says on the subject but you can't simultaneously claim POSIX conformance. One or the other, not both. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 13:35 ` Alan Burlison @ 2015-10-23 14:21 ` Eric Dumazet 2015-10-23 15:46 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 14:21 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 14:35 +0100, Alan Burlison wrote: > If you think it's a POSIX deficiency then logging a DR is probably the > correct way of addressing that. And as I've said it's fine to decide > that you don't care about what POSIX says on the subject but you can't > simultaneously claim POSIX conformance. One or the other, not both. I claim nothing. If you believe a man page should be fixed, please send a patch to man page maintainer. Have you tested the patch I sent ? The goal here is to improve things, not to say you are right or I am right. I am very often wrong, then what ? This list is about linux kernel development. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 14:21 ` Eric Dumazet @ 2015-10-23 15:46 ` Alan Burlison 2015-10-23 16:00 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-23 15:46 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On 23/10/2015 15:21, Eric Dumazet wrote: > I claim nothing. If you believe a man page should be fixed, please send > a patch to man page maintainer. Ermm, you *really* want me to submit a patch removing 'Conforms to POSIX.1-2001' from *every* Linux manpage? > Have you tested the patch I sent ? The AF_UNIX poll one? No, I don't have the means to do so, and in any case that's not a POSIX issue, just a plain bug. I'm happy to log a bug if that helps. > The goal here is to improve things, not to say you are right or I am > right. I am very often wrong, then what ? > > This list is about linux kernel development. Thank you for the clarification. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 15:46 ` Alan Burlison @ 2015-10-23 16:00 ` Eric Dumazet 2015-10-23 16:07 ` Alan Burlison 2015-10-23 16:19 ` Eric Dumazet 0 siblings, 2 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 16:00 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 16:46 +0100, Alan Burlison wrote: > On 23/10/2015 15:21, Eric Dumazet wrote: > > > I claim nothing. If you believe a man page should be fixed, please send > > a patch to man page maintainer. > > Ermm, you *really* want me to submit a patch removing 'Conforms to > POSIX.1-2001' from *every* Linux manpage? Only on the pages you think there is an error that matters. > > > Have you tested the patch I sent ? > > The AF_UNIX poll one? No, I don't have the means to do so, and in any > case that's not a POSIX issue, just a plain bug. I'm happy to log a bug > if that helps. We submit patches when someone needs a fix. If not, we have more urgent issues to solve first. I wrote following test case, and confirmed the patch fixes the issue. I will submit it formally. Thanks. 
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <poll.h>

static void fail(const char *str)
{
	perror(str);
	printf("FAIL\n");
	exit(1);
}

int main(int argc, char *argv[])
{
	int listener = socket(AF_UNIX, SOCK_STREAM, 0);
	struct pollfd pfd;
	struct sockaddr_un addr;
	int res;

	if (listener == -1)
		perror("socket()");
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1)
		fail("bind()");
	if (listen(listener, 10) == -1)
		fail("listen()");
	pfd.fd = listener;
	pfd.events = -1;
	res = poll(&pfd, 1, 10);
	if (res == -1)
		fail("poll()");
	if (res == 1 && pfd.revents & (POLLOUT|POLLWRNORM|POLLWRBAND)) {
		fprintf(stderr, "poll(af_unix listener) returned a POLLOUT status !\n");
		printf("FAIL\n");
		return 1;
	}
	printf("OK\n");
	return 0;
}

^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 16:00 ` Eric Dumazet @ 2015-10-23 16:07 ` Alan Burlison 2015-10-23 16:19 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-23 16:07 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On 23/10/2015 17:00, Eric Dumazet wrote: >> Ermm, you *really* want me to submit a patch removing 'Conforms to >> POSIX.1-2001' from *every* Linux manpage? > > Only on the pages you think there is an error that matters. If there's consensus that the current shutdown(), dup2(), close() and accept() behaviour are not POSIX-compliant then I can do that, sure. >>> Have you tested the patch I sent ? >> >> The AF_UNIX poll one? No, I don't have the means to do so, and in any >> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug >> if that helps. > > We submit patches when someone needs a fix. > > If not, we have more urgent issues to solve first. > > I wrote following test case, and confirmed the patch fixes the issue. > > I will submit it formally. Thanks, works for me - the poll() issue only affects AF_UNIX sockets in the listen state and is easily avoided by simply not setting the output bits in the poll events mask, so it's not exactly high priority. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 16:00 ` Eric Dumazet 2015-10-23 16:07 ` Alan Burlison @ 2015-10-23 16:19 ` Eric Dumazet 2015-10-23 16:40 ` Alan Burlison 1 sibling, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 16:19 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 09:00 -0700, Eric Dumazet wrote: > on the pages you think there is an error that matters. > > > > > > Have you tested the patch I sent ? > > > > The AF_UNIX poll one? No, I don't have the means to do so, and in any > > case that's not a POSIX issue, just a plain bug. I'm happy to log a bug > > if that helps. BTW, there is no kernel bug here. POSIX poll() man page says : POLLOUT Normal data may be written without blocking. If you attempt to write on a listener, write() does _not_ block and returns -1, which seems correct behavior to me, in accordance with man page. socket(PF_LOCAL, SOCK_STREAM, 0) = 3 bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0 listen(3, 10) = 0 write(3, "test", 4) = -1 ENOTCONN (Transport endpoint is not connected) Could you point out which part of POSIX is mandating that af_unix listener MUST filter out POLLOUT ? Thanks. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 16:19 ` Eric Dumazet @ 2015-10-23 16:40 ` Alan Burlison 2015-10-23 17:47 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-23 16:40 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On 23/10/2015 17:19, Eric Dumazet wrote: >>> The AF_UNIX poll one? No, I don't have the means to do so, and in any >>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug >>> if that helps. > > BTW, there is no kernel bug here. POSIX poll() man page says : > > POLLOUT > Normal data may be written without blocking. > > If you attempt to write on a listener, write() does _not_ block and > returns -1, which seems correct behavior to me, in accordance with man > page. Except of course data may not be written, because an attempt to actually do so fails, because the socket is in the listen state, is not connected and therefore no attempt to write to it could ever succeed. The only bit of the required behaviour that the current AF_UNIX poll implementation actually gets right is the "without blocking" bit, and that's only the case because the failure is detected immediately and the write call returns immediately with an error. > socket(PF_LOCAL, SOCK_STREAM, 0) = 3 > bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0 > listen(3, 10) = 0 > write(3, "test", 4) = -1 ENOTCONN (Transport endpoint is not connected) > > Could you point out which part of POSIX is mandating that af_unix > listener MUST filter out POLLOUT ? "A file descriptor for a socket that is listening for connections shall indicate that it is ready for reading, once connections are available. A file descriptor for a socket that is connecting asynchronously shall indicate that it is ready for writing, once a connection has been established." 
If POSIX had to explicitly list every possible thing that implementations *should not* do rather than just those that they *should* do then it would be even more unwieldy than it already is. And if what you are asserting is correct, why isn't the AF_INET behaviour the same as the AF_UNIX behaviour? -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
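The first sentence of the POSIX passage Alan quotes — a listening socket shall indicate it is ready for reading once connections are available — is straightforward to demonstrate. A sketch, again assuming Linux autobind for the listener's unnamed AF_UNIX address; the function name is invented:

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Illustrates the POSIX text quoted above: a listening socket becomes
 * "ready for reading" (i.e. accept() will not block) once a connection
 * is pending.  Uses Linux autobind to name the listener; the function
 * name is invented for the example. */
short listener_pollin_after_connect(void)
{
    struct sockaddr_un sun;
    socklen_t len = sizeof(sun);
    struct pollfd pfd;
    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    int cfd = socket(AF_UNIX, SOCK_STREAM, 0);

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    bind(lfd, (struct sockaddr *)&sun, sizeof(sa_family_t)); /* autobind */
    getsockname(lfd, (struct sockaddr *)&sun, &len);  /* learn the name */
    listen(lfd, 10);

    connect(cfd, (struct sockaddr *)&sun, len); /* queue a connection */

    pfd.fd = lfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    poll(&pfd, 1, 1000);
    close(cfd);
    close(lfd);
    return pfd.revents & POLLIN; /* nonzero: accept() would not block */
}
```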
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 16:40 ` Alan Burlison @ 2015-10-23 17:47 ` Eric Dumazet 2015-10-23 17:59 ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 17:47 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Fri, 2015-10-23 at 17:40 +0100, Alan Burlison wrote: > On 23/10/2015 17:19, Eric Dumazet wrote: > > >>> The AF_UNIX poll one? No, I don't have the means to do so, and in any > >>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug > >>> if that helps. > > > > BTW, there is no kernel bug here. POSIX poll() man page says : > > > > POLLOUT > > Normal data may be written without blocking. > > > > If you attempt to write on a listener, write() does _not_ block and > > returns -1, which seems correct behavior to me, in accordance with man > > page. > > Except of course data may not be written, because an attempt to actually > do so fails, because the socket is in the listen state, is not connected > and therefore no attempt to write to it could ever succeed. The only bit > of the required behaviour that the current AF_UNIX poll implementation > actually gets right is the "without blocking" bit, and that's only the > case because the failure is detected immediately and the write call > returns immediately with an error. Yeah, I know some people use poll(NULL, 0, timeout) to implement msleep(). Because it definitely impresses friends. So why not poll(&pfd, 1, timeout) to do the same, with a socket listener and POLLOUT in pfd.events Go figure. I'll send the fine patch. Thanks. ^ permalink raw reply [flat|nested] 138+ messages in thread
* [PATCH net-next] af_unix: do not report POLLOUT on listeners 2015-10-23 17:47 ` Eric Dumazet @ 2015-10-23 17:59 ` Eric Dumazet 2015-10-25 13:45 ` David Miller 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-23 17:59 UTC (permalink / raw) To: Alan Burlison, David Miller Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech From: Eric Dumazet <edumazet@google.com> poll(POLLOUT) on a listener should not report fd is ready for a write(). This would break some applications using poll() and pfd.events = -1, as they would not block in poll() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alan Burlison <Alan.Burlison@oracle.com> Tested-by: Eric Dumazet <edumazet@google.com> --- net/unix/af_unix.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 94f658235fb4..aaa0b58d6aba 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -326,9 +326,10 @@ found: return s; } -static inline int unix_writable(struct sock *sk) +static int unix_writable(const struct sock *sk) { - return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf; + return sk->sk_state != TCP_LISTEN && + (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf; } static void unix_write_space(struct sock *sk) ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [PATCH net-next] af_unix: do not report POLLOUT on listeners 2015-10-23 17:59 ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet @ 2015-10-25 13:45 ` David Miller 0 siblings, 0 replies; 138+ messages in thread From: David Miller @ 2015-10-25 13:45 UTC (permalink / raw) To: eric.dumazet Cc: Alan.Burlison, Casper.Dik, viro, stephen, netdev, dholland-tech From: Eric Dumazet <eric.dumazet@gmail.com> Date: Fri, 23 Oct 2015 10:59:16 -0700 > From: Eric Dumazet <edumazet@google.com> > > poll(POLLOUT) on a listener should not report fd is ready for > a write(). > > This would break some applications using poll() and pfd.events = -1, > as they would not block in poll() > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Reported-by: Alan Burlison <Alan.Burlison@oracle.com> > Tested-by: Eric Dumazet <edumazet@google.com> Applied, thanks Eric. ^ permalink raw reply [flat|nested] 138+ messages in thread
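With the patch above applied (it landed for 4.4), polling an AF_UNIX listener no longer reports POLLOUT. A quick check along these lines confirms it — a hedged sketch only: the function name is invented, and on pre-patch kernels the POLLOUT bit would still come back set:

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Checks the behaviour the patch above installs: an AF_UNIX listener
 * polled for POLLIN|POLLOUT reports neither (nothing to accept yet,
 * and POLLOUT is suppressed in TCP_LISTEN state).  On pre-4.4 kernels
 * the same call would report POLLOUT.  Function name is invented. */
short poll_unix_listener(void)
{
    struct sockaddr_un sun;
    struct pollfd pfd;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    bind(fd, (struct sockaddr *)&sun, sizeof(sa_family_t)); /* autobind */
    listen(fd, 10);

    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;
    pfd.revents = 0;
    poll(&pfd, 1, 0);       /* just a snapshot, don't block */
    close(fd);
    return pfd.revents;
}
```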
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 9:52 ` Casper.Dik 2015-10-23 13:02 ` Eric Dumazet @ 2015-10-24 2:30 ` Al Viro 2015-10-27 9:08 ` Casper.Dik 1 sibling, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-24 2:30 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Fri, Oct 23, 2015 at 11:52:34AM +0200, Casper.Dik@oracle.com wrote: > > > >Ho-hum... It could even be made lockless in fast path; the problems I see > >are > > * descriptor-to-file lookup becomes unsafe in a lot of locking > >conditions. Sure, most of that happens on the entry to some syscall, with > >very light locking environment, but... auditing every sodding ioctl that > >might be doing such lookups is an interesting exercise, and then there are > >->mount() instances doing the same thing. And procfs accesses. Probably > >nothing impossible to deal with, but nothing pleasant either. > > In the Solaris kernel code, the ioctl code is generally not handled a file > descriptor but instead a file pointer (i.e., the lookup is done early in > the system call). The one that comes as the first argument of ioctl(2) - sure, but e.g. BTRFS_IOC_CLONE_RANGE gets a pointer to this: struct btrfs_ioctl_clone_range_args { __s64 src_fd; __u64 src_offset, src_length; __u64 dest_offset; }; as the third argument. VFS sure as hell has no idea of that thing - it's up to btrfs_ioctl() to copy it in and deal with what it had been given. While we are at it, ioctl(fd, BTRFS_IOC_CLONE, src_fd) also gets the second descriptor-to-file lookup in btrfs-specific code; fd, of course, is looked up by VFS code. Now, these two are done in locking-neutral environment, but that's just two of several dozens. And no, I'm not fond of such irregular ways to pass file descriptors, but we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity... 
> In those specific cases where a system call needs to convert a file > descriptor to a file pointer, there is only one routines which can be used. Obviously, but the problem is deadlock avoidance using it. > As I said, we do actually use a lock and yes that means that you really > want to have a single cache line for each and every entry. It does make > it easy to have non-racy file description updates. You certainly do not > want false sharing when there is a lot of contention. > > Other data is used to make sure that it only takes O(log(n)) to find the > lowest available file descriptor entry. (Where n, I think, is the returned > descriptor) TBH, with that kind of memory footprint I would be more interested in constants than in asymptotics - not that O(log(n)) would be hard to arrange (a bunch of bitmaps with something like 1:512 ratio between the levels, to keep the damn thing within a reasonable cacheline size would probably do with not too horrible constant; 3 levels of that would already give 128M descriptors, and that's a gigabyte worth of struct file * alone; with your "cacheline per descriptor" it's what, about 8Gb eaten by descriptor table?) Hell knows, might be worth doing regardless of anything else. Not making it worse than our plain bitmap variant in any situations shouldn't be hard... > Not contended locks aren't expensive. And all is done on a single cache > line. The memory footprint is really scary. Bitmaps are pretty much noise, but blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic - or Venus for that matter, which is more relevant for you guys) Said that, what's the point of "close won't return until..."? After all, you can't guarantee that thread with cancelled syscall won't lose CPU immediately upon return to userland, so it *can't* make any assumptions about the descriptor not having been already reused. I don't get it - what does that buy for userland code? ^ permalink raw reply [flat|nested] 138+ messages in thread
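The "bunch of bitmaps with something like 1:512 ratio between the levels" idea can be sketched with a toy two-level version — 1:64 here, one summary word over 64 leaf words, 4096 ids total. This illustrates the technique only; it is not the kernel's actual allocator:

```c
#include <stdint.h>

/* Toy two-level bitmap allocator: a summary bit is set when its leaf
 * word is completely full, so finding the lowest free id costs two
 * find-first-zero operations instead of a linear scan of the in-use
 * bitmap.  Illustration of the technique only, not kernel code. */
#define LEAVES 64

struct idmap {
    uint64_t summary;         /* bit i set => leaf[i] completely full */
    uint64_t leaf[LEAVES];
};

int alloc_lowest(struct idmap *m)
{
    int w, b;

    if (m->summary == ~0ULL)
        return -1;                            /* all 4096 ids in use */
    w = __builtin_ctzll(~m->summary);         /* first non-full leaf */
    b = __builtin_ctzll(~m->leaf[w]);         /* first clear bit in it */
    m->leaf[w] |= 1ULL << b;
    if (m->leaf[w] == ~0ULL)
        m->summary |= 1ULL << w;
    return w * 64 + b;
}

void free_id(struct idmap *m, int id)
{
    m->leaf[id / 64] &= ~(1ULL << (id % 64));
    m->summary &= ~(1ULL << (id / 64));       /* leaf no longer full */
}
```

Three such levels with a 1:512 fan-out would cover the 128M descriptors Al mentions while keeping each lookup within a few cachelines.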
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-24 2:30 ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro @ 2015-10-27 9:08 ` Casper.Dik 2015-10-27 10:52 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-27 9:08 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech >And no, I'm not fond of such irregular ways to pass file descriptors, but >we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity... Yeah, there are a number of calls which are supposed to work on one but have a second argument which is also a file descriptor; mostly part of ioctl(). >> In those specific cases where a system call needs to convert a file >> descriptor to a file pointer, there is only one routines which can be used. > >Obviously, but the problem is deadlock avoidance using it. The Solaris algorithm is quite different and as such there is no chance of having a deadlock using that function (there are a bunch of functions) >The memory footprint is really scary. Bitmaps are pretty much noise, but >blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic - >or Venus for that matter, which is more relevant for you guys) Fair enough. I think we have some systems with a larger cache line. >Said that, what's the point of "close won't return until..."? After all, >you can't guarantee that thread with cancelled syscall won't lose CPU >immediately upon return to userland, so it *can't* make any assumptions >about the descriptor not having been already reused. I don't get it - what >does that buy for userland code? Generally I wouldn't see that as a problem, but in the case of a socket blocking on accept indefinitely, I do see it as a problem especially as the thread actually wants to stop listening. 
But in general, this is basically a problem with the application: the file descriptor space is shared between threads, and with one thread sniping at open files you do have a problem. Whatever the kernel does in that case perhaps doesn't matter all that much: the application needs to be fixed anyway. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 9:08 ` Casper.Dik @ 2015-10-27 10:52 ` Alan Burlison 2015-10-27 12:01 ` Eric Dumazet ` (3 more replies) 0 siblings, 4 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-27 10:52 UTC (permalink / raw) To: Casper.Dik, Al Viro Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech On 27/10/2015 09:08, Casper.Dik@oracle.com wrote: > Generally I wouldn't see that as a problem, but in the case of a socket > blocking on accept indefinitely, I do see it as a problem especially as > the thread actually wants to stop listening. > > But in general, this is basically a problem with the application: the file > descriptor space is shared between threads and having one thread sniping > at open files, you do have a problem and whatever the kernel does in that > case perhaps doesn't matter all that much: the application needs to be > fixed anyway. The scenario in Hadoop is that the FD is being used by a thread that's waiting in accept and another thread wants to shut it down, e.g. because the application is terminating and needs to stop all threads cleanly. I agree the use of shutdown()+close() on Linux or dup2() on Solaris is pretty much an application-level hack - the concern in both cases is that the file descriptor that's being used in the accept() might be recycled by another thread. However that just begs the question of why the FD isn't properly encapsulated by the application in a singleton object, with the required shut down semantics provided by having a mechanism to invalidate the singleton and its contained FD. 
There are other mechanisms that could be used to do a clean shutdown that don't require the OS to provide workarounds for arguably broken application behaviour, for example by setting a 'shutdown' flag in the object and then doing a dummy connect() to the accepting FD to kick it off the accept() and thereby getting it to re-check the 'shutdown' flag and not re-enter the accept(). If the object encapsulating a FD is invalidated and that prevents the FD being used any more because the only access is via that object, then it simply doesn't matter if the FD is reused elsewhere, there can be no race so a complicated, platform-dependent dance isn't needed. Unfortunately Hadoop isn't the only thing that pulls the shutdown() trick, so I don't think there's a simple fix for this, as discussed earlier in the thread. Having said that, if close() on Linux also did an implicit shutdown() it would mean that well-written applications that handled the scoping, sharing and reuse of FDs properly could just call close() and have it work the same way across *NIX platforms. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
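The "shutdown flag plus dummy connect()" scheme described above can be sketched as follows. Everything here — the names, the single static listener — is invented for the example; a real application would encapsulate the fd in an object as Alan suggests:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the flag-plus-dummy-connect() shutdown scheme: the
 * terminating thread sets a flag and makes a throwaway connection so
 * the acceptor wakes up, re-checks the flag and exits, without any
 * shutdown()/close() on the shared fd.  All names are illustrative. */
static atomic_int stop_requested;
static int listen_fd;
static struct sockaddr_in listen_addr;

static void *acceptor(void *arg)
{
    (void)arg;
    for (;;) {
        int c = accept(listen_fd, NULL, NULL);
        if (c >= 0)
            close(c);
        if (atomic_load(&stop_requested))
            return (void *)1;        /* clean, flag-driven exit */
        /* ... a real server would service the connection here ... */
    }
}

int run_demo(void)
{
    pthread_t t;
    void *ret;
    socklen_t len = sizeof(listen_addr);
    int wake_fd;

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&listen_addr, 0, sizeof(listen_addr));
    listen_addr.sin_family = AF_INET;
    listen_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(listen_fd, (struct sockaddr *)&listen_addr, sizeof(listen_addr));
    getsockname(listen_fd, (struct sockaddr *)&listen_addr, &len);
    listen(listen_fd, 10);

    pthread_create(&t, NULL, acceptor, NULL);
    atomic_store(&stop_requested, 1);      /* request shutdown... */
    wake_fd = socket(AF_INET, SOCK_STREAM, 0);
    connect(wake_fd, (struct sockaddr *)&listen_addr, len); /* ...and kick */
    pthread_join(t, &ret);
    close(wake_fd);
    close(listen_fd);
    return ret == (void *)1;
}
```

Because the flag is set before the wake-up connection is made, the acceptor is guaranteed to observe it when accept() returns, and no fd-recycling race is possible.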
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 10:52 ` Alan Burlison @ 2015-10-27 12:01 ` Eric Dumazet 2015-10-27 12:27 ` Alan Burlison 2015-10-27 13:42 ` David Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-27 12:01 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Tue, 2015-10-27 at 10:52 +0000, Alan Burlison wrote: > Unfortunately Hadoop isn't the only thing that pulls the shutdown() > trick, so I don't think there's a simple fix for this, as discussed > earlier in the thread. Having said that, if close() on Linux also did an > implicit shutdown() it would mean that well-written applications that > handled the scoping, sharing and reuse of FDs properly could just call > close() and have it work the same way across *NIX platforms. Are non multi threaded applications considered well written ? listener = socket(...); bind(listener, ...); listen(listener, 10000); Loop 1 10 if (fork() == 0) do_accept(listener) Now if a child does a close(listener), or is killed, you propose that it does an implicit shutdown() and all other children no longer can accept() ? Surely you did not give all details on how it is really working. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 12:01 ` Eric Dumazet @ 2015-10-27 12:27 ` Alan Burlison 2015-10-27 12:44 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-27 12:27 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On 27/10/2015 12:01, Eric Dumazet wrote: > Are non multi threaded applications considered well written ? > > listener = socket(...); > bind(listener, ...); > listen(listener, 10000); > Loop 1 10 > if (fork() == 0) > do_accept(listener) > > Now if a child does a close(listener), or is killed, you propose that it > does an implicit shutdown() and all other children no longer can > accept() ? No, of course not. I made it quite clear I was talking about MT programs. > Surely you did not give all details on how it is really working. In the case of Hadoop, it works the way I describe. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 12:27 ` Alan Burlison @ 2015-10-27 12:44 ` Eric Dumazet 0 siblings, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-27 12:44 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, stephen, netdev, dholland-tech On Tue, 2015-10-27 at 12:27 +0000, Alan Burlison wrote: > On 27/10/2015 12:01, Eric Dumazet wrote: > > > Are non multi threaded applications considered well written ? > > > > listener = socket(...); > > bind(listener, ...); > > listen(listener, 10000); > > Loop 1 10 > > if (fork() == 0) > > do_accept(listener) > > > > Now if a child does a close(listener), or is killed, you propose that it > > does an implicit shutdown() and all other children no longer can > > accept() ? > > No, of course not. I made it quite clear I was talking about MT programs. Nothing is clear. Sorry. Now what about programs using both the fork() model and MT, to get one MT process per NUMA node ? ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 10:52 ` Alan Burlison 2015-10-27 12:01 ` Eric Dumazet @ 2015-10-27 13:42 ` David Miller 2015-10-27 13:37 ` Alan Burlison 2015-10-27 23:17 ` Al Viro 2015-10-29 14:58 ` David Holland 3 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-27 13:42 UTC (permalink / raw) To: Alan.Burlison Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech From: Alan Burlison <Alan.Burlison@oracle.com> Date: Tue, 27 Oct 2015 10:52:46 +0000 > an implicit shutdown() it would mean that well-written applications > that handled the scoping, sharing and reuse of FDs properly could just > call close() and have it work the same way across *NIX platforms. This semantic would only exist after Linux version X.Y.Z and vendor kernels that decided to backport the feature. Ergo, this application would ironically be non-portable on Linux machines. If portable Linux applications have to handle the situation using existing facilities there is absolutely zero value to add it now because it only will add more complexity to applications handling things correctly because they will always have two cases to somehow conditionally handle under Linux. And if the intention is to just always assume the close() semantic thing is there, then you have given me a disincentive to ever add the facility. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 13:42 ` David Miller @ 2015-10-27 13:37 ` Alan Burlison 2015-10-27 13:59 ` David Miller 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-27 13:37 UTC (permalink / raw) To: David Miller Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech On 27/10/2015 13:42, David Miller wrote: > This semantic would only exist after Linux version X.Y.Z and vendor > kernels that decided to backport the feature. > > Ergo, this application would ironically be non-portable on Linux > machines. Yes, that's true enough, until nobody was using the old versions any more. > If portable Linux applications have to handle the situation using > existing facilities there is absolutely zero value to add it now > because it only will add more complexity to applications handling > things correctly because they will always have two cases to somehow > conditionally handle under Linux. > > And if the intention is to just always assume the close() semantic > thing is there, then you have given me a disincentive to ever add the > facility. If you took that argument to its logical extreme then you'd never make any changes to existing behaviour, and that's patently not the case. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 13:37 ` Alan Burlison @ 2015-10-27 13:59 ` David Miller 2015-10-27 14:13 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-27 13:59 UTC (permalink / raw) To: Alan.Burlison Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech From: Alan Burlison <Alan.Burlison@oracle.com> Date: Tue, 27 Oct 2015 13:37:26 +0000 > If you took that argument to it's logical extreme they you'd never > make any changes that made changes to existing behaviour, and that's > patently not the case. You know exactly what I mean, and what you're saying here is just a scarecrow distracting the discussion from the real issue. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 13:59 ` David Miller @ 2015-10-27 14:13 ` Alan Burlison 2015-10-27 14:39 ` David Miller 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-27 14:13 UTC (permalink / raw) To: David Miller Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech On 27/10/2015 13:59, David Miller wrote: > From: Alan Burlison <Alan.Burlison@oracle.com> > Date: Tue, 27 Oct 2015 13:37:26 +0000 > >> If you took that argument to it's logical extreme they you'd never >> make any changes that made changes to existing behaviour, and that's >> patently not the case. > > You know exactly what I mean, and what you're saying here is just a > scarecrow distracting the discussion from the real issue. I think you probably mean "a straw man". The problematic case is MT applications that don't manage sharing of FDs in a sensible way and need to have a way of terminating any accept() in other threads. On Linux that's currently done with a shutdown()+close(), on other platforms you can use open()+dup2(). However the Linux shutdown() mechanism relies on bending the POSIX semantics and the dup2() mechanism doesn't work on Linux as it also doesn't kick other threads off accept(). At the moment, on Linux you have to explicitly call shutdown() on a socket on which another thread may be sat in accept(). If closing the socket in one thread terminated any accept()s in other threads, in the same way that an explicit shutdown() does, then the explicit shutdown() wouldn't be needed for more sensibly written apps that weren't prone to FD recycling races. As far as I can tell, that would work cross-platform. Ideally there'd be a single way of doing this that worked cross-platform, at the moment there isn't. And yes, even if such a mechanism was available now it would be some time before it could be assumed to be available everywhere. 
I don't know enough about the Linux implementation to know if there is a practical way around this, and of course even if such a change were made, potential breakage of existing code would be a concern. If there's a better, cross-platform way of doing this then I'm all ears. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
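For reference, the Linux-specific shutdown() trick the thread keeps returning to looks like this in miniature: a blocked accept() returns -1/EINVAL once another thread calls shutdown() on the listener. (Per the original bug report, Solaris instead fails the shutdown() with ENOTCONN and the acceptor keeps waiting.) The function names below are invented for the example:

```c
#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Miniature of the Linux shutdown() trick: one thread blocks in
 * accept(), another calls shutdown() on the listening socket, and on
 * Linux the blocked accept() returns -1 with errno set to EINVAL.
 * Function names are invented for the example. */
static int lfd;
static int accept_err;

static void *blocked_acceptor(void *arg)
{
    (void)arg;
    if (accept(lfd, NULL, NULL) == -1)
        accept_err = errno;
    return NULL;
}

int shutdown_wakes_accept(void)
{
    pthread_t t;
    struct sockaddr_in a;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lfd, (struct sockaddr *)&a, sizeof(a));
    listen(lfd, 10);

    pthread_create(&t, NULL, blocked_acceptor, NULL);
    usleep(100 * 1000);              /* let the thread block in accept() */
    shutdown(lfd, SHUT_RDWR);        /* succeeds on Linux, wakes acceptor */
    pthread_join(t, NULL);
    close(lfd);
    return accept_err;               /* EINVAL on Linux */
}
```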
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 14:13 ` Alan Burlison @ 2015-10-27 14:39 ` David Miller 2015-10-27 14:39 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-27 14:39 UTC (permalink / raw) To: Alan.Burlison Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech From: Alan Burlison <Alan.Burlison@oracle.com> Date: Tue, 27 Oct 2015 14:13:31 +0000 > Ideally there'd be a single way of doing this that worked > cross-platform, at the moment there isn't. And yes, even if such a > mechanism was available now it would be some time before it could be > assumed to be available everywhere. You will never be able to assume it is available everywhere under Linux. Ever. This is the fundamental issue that you seem to completely not understand. You cannot just assume 5 years from now or whatever that the close() thing is there even if I added it to the tree right now. Your intent is to somewhere down the road assume this, and therefore distribute a broken piece of infrastructure that only works on some Linux systems. This is not acceptable. The backwards compat code will need to be in your code forever. There is no way around it. That is, again, unless you want your code to not work on a non-trivial number of Linux systems out there. Making this worse is that there isn't going to be a straightforward nor reliable way to test for the presence of this at run time. You _have_ a way to accomplish what you want to do today and it works on every Linux system on the planet. Given the constraints, and the fact that you're going to have to account for this situation somehow in your code forever, I see very little to no value in adding the close() thing. So your cross-platform unified behavior goal is simply unobtainable. So please deal with reality rather than wishful impractical things. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 14:39 ` David Miller @ 2015-10-27 14:39 ` Alan Burlison 2015-10-27 15:04 ` David Miller 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-27 14:39 UTC (permalink / raw) To: David Miller Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech On 27/10/2015 14:39, David Miller wrote: > You will never be able to assume it is available everywhere under > Linux. Ever. > > This is the fundamental issue that you seem to completely not > understand. If that was true in general then Linux would be dead and there would be no point ever adding any new features to it. > You cannot just assume 5 years from now or whatever that the close() > thing is there even if I added it to the tree right now. No, a configure-time feature probe would be needed, as is generally considered to be good practice. > Your intent is to somewhere down the road assume this, and therefore > distribute a broken piece of infrastructure that only works on some > Linux systems. > > This is not acceptable. Acceptable to who? Unless you are claiming to speak for the authors of every piece of software that ever runs on Linux you can't make that assertion. For example, Hadoop relies on Linux CGroups to provide features you'd want in production deployments, yet CGroups only appeared in Linux 2.6.24. > The backwards compat code will need to be in your code forever. There > is no way around it. That is, again, unless you want your code to not > work on a non-trivial number of Linux systems out there. That's simply not true. Both Linux and Solaris have dropped support for old features in the past. Yes it takes a long time to do so but it's perfectly possible. > Making this worse is that there isn't going to be a straightforward > nor reliable way to test for the presence of this at run time. 
I attached a test case to the original bug that demonstrates the platform-specific behaviours, it would be easy enough to modify that to use as a configure-time feature probe. > You _have_ a way to accomplish what you want to do today and it works > on every Linux system on the planet. > > Given the constraints, and the fact that you're going to have to > account for this situation somehow in your code forever, I see very > little to no value in adding the close() thing. I think your assessment is unduly pessimistic. > So your cross-platform unified behavior goal is simply unobtainable. > So please deal with reality rather than wishful inpractical things. If you don't think this is worth discussing because you personally don't believe cross-platform compatibility is worth bothering with then fine, say so. But then you need to persuade everyone else with a stake in Linux that your viewpoint is the only valid one, and you will have to also persuade them that Linux should no longer concern itself with POSIX compliance. I know *I* couldn't do that for Solaris, and with all due respect, I very much doubt *you* can do so for Linux. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 14:39 ` Alan Burlison @ 2015-10-27 15:04 ` David Miller 2015-10-27 15:53 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: David Miller @ 2015-10-27 15:04 UTC (permalink / raw) To: Alan.Burlison Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech From: Alan Burlison <Alan.Burlison@oracle.com> Date: Tue, 27 Oct 2015 14:39:56 +0000 > On 27/10/2015 14:39, David Miller wrote: > >> Making this worse is that there isn't going to be a straightforward >> nor reliable way to test for the presence of this at run time. > > I attached a test case to the original bug that demonstrates the > platform-specific behaviours, it would be easy enough to modify that > to use as a configure-time feature probe. I said "run time". Like when your final code runs on a target system. Not at build tree configure time. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 15:04 ` David Miller @ 2015-10-27 15:53 ` Alan Burlison 0 siblings, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-27 15:53 UTC (permalink / raw) To: David Miller Cc: Casper.Dik, viro, eric.dumazet, stephen, netdev, dholland-tech On 27/10/2015 15:04, David Miller wrote: > I said "run time". Like when your final code runs on a target system. > > Not at build tree configure time. Yes, you could do it then as well. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 10:52 ` Alan Burlison 2015-10-27 12:01 ` Eric Dumazet 2015-10-27 13:42 ` David Miller @ 2015-10-27 23:17 ` Al Viro 2015-10-28 0:13 ` Eric Dumazet 2015-10-28 16:04 ` Alan Burlison 2015-10-29 14:58 ` David Holland 3 siblings, 2 replies; 138+ messages in thread From: Al Viro @ 2015-10-27 23:17 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote: > Unfortunately Hadoop isn't the only thing that pulls the shutdown() > trick, so I don't think there's a simple fix for this, as discussed > earlier in the thread. Having said that, if close() on Linux also > did an implicit shutdown() it would mean that well-written > applications that handled the scoping, sharing and reuse of FDs > properly could just call close() and have it work the same way > across *NIX platforms. ... except for all Linux, FreeBSD and OpenBSD versions out there, but hey, who's counting those, right? Not to mention the OSX behaviour - I really have no idea what it does; the FreeBSD ancestry in its kernel is distant enough for a lot of changes to have happened in that area. So... Which Unices other than Solaris and NetBSD actually behave that way? I.e. have close(fd) cancel accept(fd) another thread is sitting in. Note that NetBSD implementation has known races. Linux, FreeBSD and OpenBSD don't do that at all. Frankly, as far as I'm concerned, the bottom line is * there are two variants of semantics in that area and there's not much that could be done about that. * POSIX is vague enough for both variants to comply with it (it's also very badly written in the area in question). * I don't see any way to implement something similar to Solaris behaviour without a huge increase of memory footprint or massive cacheline pingpong. 
Solaris appears to go for memory footprint from hell - cacheline per descriptor (instead of a pointer per descriptor). * the benefits of Solaris-style behaviour are not obvious - all things equal it would be interesting, but the things are very much not equal. What's more, if your userland code is such that accept() argument could be closed by another thread, the caller *cannot* do anything with said argument after accept() returns, no matter which variant of semantics is used. * [Linux-specific aside] our __alloc_fd() can degrade quite badly with some use patterns. The cacheline pingpong in the bitmap is probably inevitable, unless we accept considerably heavier memory footprint, but we also have a case when alloc_fd() takes O(n) and it's _not_ hard to trigger - close(3);open(...); will have the next open() after that scanning the entire in-use bitmap. I think I see a way to improve it without slowing the normal case down, but I'll need to experiment a bit before I post patches. Anybody with examples of real-world loads that make our descriptor allocator degrade is very welcome to post the reproducers... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 23:17 ` Al Viro @ 2015-10-28 0:13 ` Eric Dumazet 2015-10-28 12:35 ` Al Viro 2015-10-28 16:04 ` Alan Burlison 1 sibling, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-28 0:13 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, Casper.Dik, David Miller, stephen, netdev, dholland-tech On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote: > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > with some use patterns. The cacheline pingpong in the bitmap is probably > inevitable, unless we accept considerably heavier memory footprint, > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > to trigger - close(3);open(...); will have the next open() after that > scanning the entire in-use bitmap. I think I see a way to improve it > without slowing the normal case down, but I'll need to experiment a > bit before I post patches. Anybody with examples of real-world loads > that make our descriptor allocator to degrade is very welcome to post > the reproducers... Well, I do have real-world loads, but quite hard to set up in a lab :( Note that we also hit the 'struct cred'->usage refcount for every open()/close()/sock_alloc(), and simply moving uid/gid out of the first cache line really helps, as current_fsuid() and current_fsgid() no longer force a pingpong. 
I moved seldom-used fields onto the first cache line, so that overall memory usage did not change (192 bytes on 64-bit arches):

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 8d70e1361ecd..460efae83522 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,7 +124,17 @@ struct cred {
 #define CRED_MAGIC	0x43736564
 #define CRED_MAGIC_DEAD	0x44656144
 #endif
-	kuid_t		uid;		/* real UID of the task */
+	struct rcu_head	rcu;		/* RCU deletion hook */
+
+	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
+	kernel_cap_t	cap_permitted;	/* caps we're permitted */
+	kernel_cap_t	cap_effective;	/* caps we can actually use */
+	kernel_cap_t	cap_bset;	/* capability bounding set */
+	kernel_cap_t	cap_ambient;	/* Ambient capability set */
+
+	kuid_t		uid ____cacheline_aligned_in_smp;
+					/* real UID of the task */
+
 	kgid_t		gid;		/* real GID of the task */
 	kuid_t		suid;		/* saved UID of the task */
 	kgid_t		sgid;		/* saved GID of the task */
@@ -133,11 +143,6 @@ struct cred {
 	kuid_t		fsuid;		/* UID for VFS ops */
 	kgid_t		fsgid;		/* GID for VFS ops */
 	unsigned	securebits;	/* SUID-less security management */
-	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
-	kernel_cap_t	cap_permitted;	/* caps we're permitted */
-	kernel_cap_t	cap_effective;	/* caps we can actually use */
-	kernel_cap_t	cap_bset;	/* capability bounding set */
-	kernel_cap_t	cap_ambient;	/* Ambient capability set */
 #ifdef CONFIG_KEYS
 	unsigned char	jit_keyring;	/* default keyring to attach requested
 					 * keys to */
@@ -152,7 +157,6 @@ struct cred {
 	struct user_struct *user;	/* real user ID subscription */
 	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
 	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
-	struct rcu_head	rcu;		/* RCU deletion hook */
 };

 extern void __put_cred(struct cred *);

^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 0:13 ` Eric Dumazet @ 2015-10-28 12:35 ` Al Viro 2015-10-28 13:24 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-28 12:35 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel [Linus and Dave added, Solaris and NetBSD folks dropped from Cc] On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote: > On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote: > > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. I think I see a way to improve it > > without slowing the normal case down, but I'll need to experiment a > > bit before I post patches. Anybody with examples of real-world loads > > that make our descriptor allocator to degrade is very welcome to post > > the reproducers... > > Well, I do have real-world loads, but quite hard to setup in a lab :( > > Note that we also hit the 'struct cred'->usage refcount for every > open()/close()/sock_alloc(), and simply moving uid/gid out of the first > cache line really helps, as current_fsuid() and current_fsgid() no > longer forces a pingpong. > > I moved seldom used fields on the first cache line, so that overall > memory usage did not change (192 bytes on 64 bit arches) [snip] Makes sense, but there's a funny thing about that refcount - the part coming from ->f_cred is the most frequently changed *and* almost all places using ->f_cred are just looking at its fields and do not manipulate its refcount. 
The only exception (do_process_acct()) is easy to eliminate just by storing a separate reference to the current creds of the acct(2) caller and using it instead of looking at ->f_cred. What's more, the place where we grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and* most of the time such a reference is there for dropping ->f_cred (in file_free()/file_free_rcu()).

With that change in kernel/acct.c done, we could do the following:

a) split the cred refcount into the normal and percpu parts and add a spinlock in there.

b) have put_cred() do this:
	if (atomic_dec_and_test(&cred->usage)) {
		this_cpu_add(&cred->f_cred_usage, 1);
		call_rcu(&cred->rcu, put_f_cred_rcu);
	}

c) have get_empty_filp() increment current_cred ->f_cred_usage with this_cpu_add()

d) have file_free() do
	percpu_counter_dec(&nr_files);
	rcu_read_lock();
	if (likely(atomic_read(&f->f_cred->usage))) {
		this_cpu_add(&f->f_cred->f_cred_usage, -1);
		rcu_read_unlock();
		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
	} else {
		rcu_read_unlock();
		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
	}

file_free_rcu() being
	static void file_free_rcu(struct rcu_head *head)
	{
		struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
		put_f_cred(&f->f_cred->rcu);
		kmem_cache_free(filp_cachep, f);
	}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
	spin_lock cred->lock
	this_cpu_add(&cred->f_cred_usage, -1);
	find the sum of cred->f_cred_usage
	spin_unlock cred->lock
	if the sum has not reached 0
		return
	current put_cred_rcu(cred)

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and (most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on &f->f_u.fu_rcuhead. Both are RCU-delayed and they can happen in parallel - different rcu_head are used. 
atomic_read() check in file_free() might give false positives if it comes just before put_cred() on another CPU kills the last non-f_cred reference. It's not a problem, since put_f_cred() from that put_cred() won't be executed until we drop rcu_read_lock(), so we can safely decrement the cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't be dropping the last of that - the same put_cred() would've incremented ->f_cred_usage). Does anybody see problems with that approach? I'm going to grab some sleep (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's field-reordering patch when I get up... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 12:35 ` Al Viro @ 2015-10-28 13:24 ` Eric Dumazet 2015-10-28 14:47 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-28 13:24 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, 2015-10-28 at 12:35 +0000, Al Viro wrote: > [Linus and Dave added, Solaris and NetBSD folks dropped from Cc] > > On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote: > > On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote: > > > > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > > with some use patterns. The cacheline pingpong in the bitmap is probably > > > inevitable, unless we accept considerably heavier memory footprint, > > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > > to trigger - close(3);open(...); will have the next open() after that > > > scanning the entire in-use bitmap. I think I see a way to improve it > > > without slowing the normal case down, but I'll need to experiment a > > > bit before I post patches. Anybody with examples of real-world loads > > > that make our descriptor allocator to degrade is very welcome to post > > > the reproducers... > > > > Well, I do have real-world loads, but quite hard to setup in a lab :( > > > > Note that we also hit the 'struct cred'->usage refcount for every > > open()/close()/sock_alloc(), and simply moving uid/gid out of the first > > cache line really helps, as current_fsuid() and current_fsgid() no > > longer forces a pingpong. 
> > > > I moved seldom used fields on the first cache line, so that overall > > memory usage did not change (192 bytes on 64 bit arches) > > [snip] > > Makes sense, but there's a funny thing about that refcount - the part > coming from ->f_cred is the most frequently changed *and* almost all > places using ->f_cred are just looking at its fields and do not manipulate > its refcount. The only exception (do_process_acct()) is easy to eliminate > just by storing a separate reference to the current creds of acct(2) caller > and using it instead of looking at ->f_cred. What's more, the place where we > grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and* > most of the time such a reference is there for dropping ->f_cred (in > file_free()/file_free_rcu()). > > With that change in kernel/acct.c done, we could do the following: > a) split the cred refcount into the normal and percpu parts and > add a spinlock in there. > b) have put_cred() do this: > if (atomic_dec_and_test(&cred->usage)) { > this_cpu_add(&cred->f_cred_usage, 1); > call_rcu(&cred->rcu, put_f_cred_rcu); > } > c) have get_empty_filp() increment current_cred ->f_cred_usage with > this_cpu_add() > d) have file_free() do > percpu_counter_dec(&nr_files); > rcu_read_lock(); > if (likely(atomic_read(&f->f_cred->usage))) { > this_cpu_add(&f->f_cred->f_cred_usage, -1); > rcu_read_unlock(); > call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light); > } else { > rcu_read_unlock(); > call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); > } > file_free_rcu() being > static void file_free_rcu(struct rcu_head *head) > { > struct file *f = container_of(head, struct file, f_u.fu_rcuhead); > put_f_cred(&f->f_cred->rcu); > kmem_cache_free(filp_cachep, f); > } > and file_free_rcu_light() - the same sans put_f_cred(); > > with put_f_cred() doing > spin_lock cred->lock > this_cpu_add(&cred->f_cred_usage, -1); > find the sum of cred->f_cred_usage > spin_unlock cred->lock > if the sum has not reached 0 > return > 
current put_cred_rcu(cred) > > IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and > (most of) ->f_cred dropping. > > Note that there are two paths leading to put_f_cred() in the above - via > call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on > &f->f_u.fu_rcuhead. Both are RCU-delayed and they can happen in parallel - > different rcu_head are used. > > atomic_read() check in file_free() might give false positives if it comes > just before put_cred() on another CPU kills the last non-f_cred reference. > It's not a problem, since put_f_cred() from that put_cred() won't be > executed until we drop rcu_read_lock(), so we can safely decrement the > cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't > be dropping the last of that - the same put_cred() would've incremented > ->f_cred_usage). > > Does anybody see problems with that approach? I'm going to grab some sleep > (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's > field-reordering patch when I get up... Before I take a deep look at your suggestion, are you sure plain use of include/linux/percpu-refcount.h infra is not possible for struct cred ? Thanks ! ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 13:24 ` Eric Dumazet @ 2015-10-28 14:47 ` Eric Dumazet 2015-10-28 21:13 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-28 14:47 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote: > Before I take a deep look at your suggestion, are you sure plain use of > include/linux/percpu-refcount.h infra is not possible for struct cred ? BTW, I am not convinced we need to spend so much energy and per-cpu memory for struct cred refcount. The big problem is fd array spinlock of course and bitmap search for POSIX compliance. The cache line trashing in struct cred is a minor one ;) ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 14:47 ` Eric Dumazet @ 2015-10-28 21:13 ` Al Viro 2015-10-28 21:44 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-28 21:13 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote: > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote: > > > Before I take a deep look at your suggestion, are you sure plain use of > > include/linux/percpu-refcount.h infra is not possible for struct cred ? > > BTW, I am not convinced we need to spend so much energy and per-cpu > memory for struct cred refcount. > > The big problem is fd array spinlock of course and bitmap search for > POSIX compliance. > > The cache line trashing in struct cred is a minor one ;) percpu-refcount isn't convenient - the only such candidate for ref_kill in there is "all other references are gone", and that can happen in interesting locking environments. I doubt that it would be a good fit, TBH... Cacheline pingpong on the descriptors bitmap is probably inevitable, but it's not the only problem in the existing implementation - close a small descriptor when you've got a lot of them and look for the second open after that. _That_ can lead to thousands of cachelines being read through, all under the table spinlock. It's literally orders of magnitude worse. And if the first open after that close happens to be for a short-living descriptor, you'll get the same situation back in your face as soon as you close it. I think we can seriously improve that without screwing the fast path by adding "summary" bitmaps once the primary grows past the cacheline worth of bits. With bits in the summary bitmap corresponding to cacheline-sized chunks of the primary, being set iff all bits in the corresponding chunk are set. 
If the summary map grows larger than one cacheline, add the second-order one (that happens at quarter million descriptors and serves until 128 million; adding the third-order map is probably worthless). I want to maintain the same kind of "everything below this is known to be in use" thing as we do now. Allocation would start with looking into the same place in primary bitmap where we'd looked now and similar search forward for zero bit. _However_, it would stop at cacheline boundary. If nothing had been found, we look in the corresponding place in the summary bitmap and search for zero bit there. Again, no more than up to the cacheline boundary. If something is found, we've got a chunk in the primary known to contain a zero bit; if not - go to the second-level and search there, etc. When a zero bit in the primary had been found, check if it's within the rlimit (passed to __alloc_fd() explicitly) and either bugger off or set that bit. If there are zero bits left in the same word - we are done, otherwise check the still unread words in the cacheline and see if all of them are ~0UL. If all of them are, set the bit in summary bitmap, etc. Normal case is exactly the same as now - one cacheline accessed and modified. We might end up touching more than that, but it's going to be rare and the cases when it happens are very likely to lead to much worse amount of memory traffic with the current code. Freeing is done by zeroing the bit in primary, checking for other zero bits nearby and buggering off if there are such. If the entire cacheline used to be all-bits-set, clear the bit in summary and, if there's a second-order summary, get the bit in there clear as well - it's probably not worth bothering with checking that all the cacheline in summary bitmap had been all-bits-set. Again, the normal case is the same as now. 
It'll need profiling and tuning, but AFAICS it's doable without making the things worse than they are now, and it should get rid of those O(N) fetches under spinlock cases. And yes, those are triggerable and visible in profiles. IMO it's worth trying to fix... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 21:13 ` Al Viro @ 2015-10-28 21:44 ` Eric Dumazet 2015-10-28 22:33 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-28 21:44 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, 2015-10-28 at 21:13 +0000, Al Viro wrote: > On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote: > > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote: > > > > > Before I take a deep look at your suggestion, are you sure plain use of > > > include/linux/percpu-refcount.h infra is not possible for struct cred ? > > > > BTW, I am not convinced we need to spend so much energy and per-cpu > > memory for struct cred refcount. > > > > The big problem is fd array spinlock of course and bitmap search for > > POSIX compliance. > > > > The cache line trashing in struct cred is a minor one ;) > > percpu-refcount isn't convenient - the only such candidate for ref_kill in > there is "all other references are gone", and that can happen in > interesting locking environments. I doubt that it would be a good fit, TBH... OK then ... > Cacheline pingpong on the descriptors bitmap is probably inevitable, but > it's not the only problem in the existing implementation - close a small > descriptor when you've got a lot of them and look for the second open > after that. _That_ can lead to thousands of cachelines being read through, > all under the table spinlock. It's literally orders of magnitude worse. > And if the first open after that close happens to be for a short-living > descriptor, you'll get the same situation back in your face as soon as you > close it. > > I think we can seriously improve that without screwing the fast path by > adding "summary" bitmaps once the primary grows past the cacheline worth > of bits. 
With bits in the summary bitmap corresponding to cacheline-sized > chunks of the primary, being set iff all bits in the corresponding chunk > are set. If the summary map grows larger than one cacheline, add the > second-order one (that happens at quarter million descriptors and serves > until 128 million; adding the third-order map is probably worthless). > > I want to maintain the same kind of "everything below this is known to be > in use" thing as we do now. Allocation would start with looking into the > same place in primary bitmap where we'd looked now and similar search > forward for zero bit. _However_, it would stop at cacheline boundary. > If nothing had been found, we look in the corresponding place in the > summary bitmap and search for zero bit there. Again, no more than up > to the cacheline boundary. If something is found, we've got a chunk in > the primary known to contain a zero bit; if not - go to the second-level > and search there, etc. > > When a zero bit in the primary had been found, check if it's within the > rlimit (passed to __alloc_fd() explicitly) and either bugger off or set > that bit. If there are zero bits left in the same word - we are done, > otherwise check the still unread words in the cacheline and see if all > of them are ~0UL. If all of them are, set the bit in summary bitmap, etc. > > Normal case is exactly the same as now - one cacheline accessed and modified. > We might end up touching more than that, but it's going to be rare and > the cases when it happens are very likely to lead to much worse amount of > memory traffic with the current code. > > Freeing is done by zeroing the bit in primary, checking for other zero bits > nearby and buggering off if there are such. 
If the entire cacheline used > to be all-bits-set, clear the bit in summary and, if there's a second-order > summary, get the bit in there clear as well - it's probably not worth > bothering with checking that all the cacheline in summary bitmap had been > all-bits-set. Again, the normal case is the same as now. > > It'll need profiling and tuning, but AFAICS it's doable without making the > things worse than they are now, and it should get rid of those O(N) fetches > under spinlock cases. And yes, those are triggerable and visible in > profiles. IMO it's worth trying to fix... > Well, all this complexity goes away with an O_FD_FASTALLOC / SOCK_FD_FASTALLOC bit in various fd allocations, which specifically tells the kernel we do not care about getting the lowest possible fd as POSIX mandates. With this bit set, the bitmap search can start at a random point, and we find a slot in O(1) : one cache line miss, if you have at least one free bit/slot per 512 bits (64-byte cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif
#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif
...
// active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
...
// passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt a lot of applications absolutely rely on the POSIX thing...

^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 21:44 ` Eric Dumazet @ 2015-10-28 22:33 ` Al Viro 2015-10-28 23:08 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-28 22:33 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote: > Well, all this complexity goes away with a O_FD_FASTALLOC / > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically > tells the kernel we do not care getting the lowest possible fd as POSIX > mandates. ... which won't do a damn thing for existing userland. > Except for legacy stuff and stdin/stdout/stderr games, I really doubt > lot of applications absolutely rely on the POSIX thing... We obviously can't turn that into default behaviour, though. BTW, what distribution do you have in mind for those random descriptors? Uniform on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the memory footprint pretty soon... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 22:33 ` Al Viro @ 2015-10-28 23:08 ` Eric Dumazet 2015-10-29 0:15 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-28 23:08 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, 2015-10-28 at 22:33 +0000, Al Viro wrote: > On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote: > > > Well, all this complexity goes away with a O_FD_FASTALLOC / > > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically > > tells the kernel we do not care getting the lowest possible fd as POSIX > > mandates. > > ... which won't do a damn thing for existing userland. For the userland that need +5,000,000 socket, I can tell you they are using this flag as soon they are aware it exists ;) > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt > > lot of applications absolutely rely on the POSIX thing... > > We obviously can't turn that into default behaviour, though. BTW, what > distribution do you have in mind for those random descriptors? Uniform > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the > memory footprint pretty soon... Simply [0 , fdt->max_fds] is working well in most cases. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-28 23:08 ` Eric Dumazet @ 2015-10-29 0:15 ` Al Viro 2015-10-29 3:29 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-29 0:15 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote: > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt > > > lot of applications absolutely rely on the POSIX thing... > > > > We obviously can't turn that into default behaviour, though. BTW, what > > distribution do you have in mind for those random descriptors? Uniform > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the > > memory footprint pretty soon... > > Simply [0 , fdt->max_fds] is working well in most cases. Umm... So first you dup2() to establish the ->max_fds you want, then do such opens? What used/unused ratio do you expect to deal with? And what kind of locking are you going to use? Keep in mind that e.g. dup2() is dependent on the lack of allocations while it's working, so it's not as simple as "we don't need no stinkin' ->files_lock"... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 0:15 ` Al Viro @ 2015-10-29 3:29 ` Eric Dumazet 2015-10-29 4:16 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-29 3:29 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Thu, 2015-10-29 at 00:15 +0000, Al Viro wrote: > On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote: > > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt > > > > lot of applications absolutely rely on the POSIX thing... > > > > > > We obviously can't turn that into default behaviour, though. BTW, what > > > distribution do you have in mind for those random descriptors? Uniform > > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the > > > memory footprint pretty soon... > > > > Simply [0 , fdt->max_fds] is working well in most cases. > > Umm... So first you dup2() to establish the ->max_fds you want, then > do such opens? Yes, dup2() is done at program startup, knowing the expected max load (in terms of concurrent fds) + ~10 % (actual fd array size can be more than this because of power-of-two rounding in alloc_fdtable() ) But this is an optimization : If you do not use the initial dup2(), the fd array can be automatically expanded if needed (when all slots are in use) > What used/unused ratio do you expect to deal with? > And what kind of locking are you going to use? Keep in mind that > e.g. dup2() is dependent on the lack of allocations while it's working, > so it's not as simple as "we don't need no stinkin' ->files_lock"... No locking change. files->file_lock is still taken. We only want to minimize time to find an empty slot. The trick is not to start the bitmap search at files->next_fd, but at a random point. This is a win if we assume there are enough holes. 
low = start;
if (low < files->next_fd)
	low = files->next_fd;

res = -1;
if (flags & O_FD_FASTALLOC) {
	random_point = pick_random_between(low, fdt->max_fds);

	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
				 random_point);
	/* No empty slot found, try the other range */
	if (res >= fdt->max_fds) {
		res = find_next_zero_bit(fdt->open_fds,
					 random_point, low);
		if (res >= random_point)
			res = -1;
	}
}
...

^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 3:29 ` Eric Dumazet @ 2015-10-29 4:16 ` Al Viro 2015-10-29 12:35 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-29 4:16 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote: > But this is an optimization : If you do not use the initial dup2(), the > fd array can be automatically expanded if needed (all slots are in use) Whee... > No locking change. files->file_lock is still taken. > > We only want to minimize time to find an empty slot. Then I'd say that my variant is going to win. It *will* lead to cacheline pingpong in more cases than yours, but I'm quite sure that it will be a win as far as the total amount of cachelines accessed. > The trick is to not start bitmap search at files->next_fd, but a random > point. This is a win if we assume there are enough holes. > > low = start; > if (low < files->next_fd) > low = files->next_fd; > > res = -1; > if (flags & O_FD_FASTALLOC) { > random_point = pick_random_between(low, fdt->max_fds); > > res = find_next_zero_bit(fdt->open_fds, fdt->max_fds, > random_point); > /* No empty slot found, try the other range */ > if (res >= fdt->max_fds) { > res = find_next_zero_bit(fdt->open_fds, > low, random_point); > if (res >= random_point) > res = -1; > } > } Have you tried to experiment with that in userland? I mean, emulate that thing in normal userland code, count the cacheline accesses and drive it with the use patterns collected from actual applications. I can sit down and play with math expectations, but I suspect that it's easier to experiment. It's nothing but an intuition (I hadn't seriously done probability theory in quite a while, and my mathematical tastes run more to geometry and topology anyway), but... 
I would expect it to degrade badly when the bitmap is reasonably dense. Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's not cheap - it's done via #PF triggered in kernel mode, with handler noticing that the faulting address is in vmalloc range and doing the right thing. IOW, if your bitmap is very sparse, the price of page faults needs to be taken into account. AFAICS, the only benefit of that thing is keeping dirtied cachelines far from each other. Which might be a win overall, but I'm not convinced that the rest won't offset the effect of that... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 4:16 ` Al Viro @ 2015-10-29 12:35 ` Eric Dumazet 2015-10-29 13:48 ` Eric Dumazet 2015-10-30 17:18 ` Linus Torvalds 0 siblings, 2 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-29 12:35 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Thu, 2015-10-29 at 04:16 +0000, Al Viro wrote: > Have you tried to experiment with that in userland? I mean, emulate that > thing in normal userland code, count the cacheline accesses and drive it > with the use patterns collected from actual applications. Sure. > > I can sit down and play with math expectations, but I suspect that it's > easier to experiment. It's nothing but an intuition (I hadn't seriously > done probability theory in quite a while, and my mathematical tastes run > more to geometry and topology anyway), but... I would expect it to degrade > badly when the bitmap is reasonably dense. > > Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's > not cheap - it's done via #PF triggered in kernel mode, with handler > noticing that the faulting address is in vmalloc range and doing the > right thing. IOW, if your bitmap is very sparse, the price of page faults > needs to be taken into account. This vmalloc PF is pure noise. This only matters for the very first allocations. We target programs opening zillions of fd in their lifetime ;) Not having to expand a 4,000,000 slots fd array while fully loaded also removes a latency spike that is very often not desirable. > > AFAICS, the only benefit of that thing is keeping dirtied cachelines far > from each other. Which might be a win overall, but I'm not convinced that > the rest won't offset the effect of that... Well, I already tested the O_FD_FASTALLOC thing, and I can tell you find_next_zero_bit() is nowhere to be found in kernel profiles anymore. 
It also lowers time we hold the fd array spinlock while doing fd alloc. User land test program I wrote few months back Current kernel : 64.98% [kernel] [k] queued_spin_lock_slowpath 14.88% opensock [.] memset // this part simulates user land actual work ;) 11.15% [kernel] [k] _find_next_bit.part.0 0.69% [kernel] [k] _raw_spin_lock 0.46% [kernel] [k] memset_erms 0.38% [kernel] [k] sk_alloc 0.37% [kernel] [k] kmem_cache_alloc 0.33% [kernel] [k] get_empty_filp 0.31% [kernel] [k] kmem_cache_free 0.26% [kernel] [k] __alloc_fd 0.26% opensock [.] child_function 0.18% [kernel] [k] inode_init_always 0.17% opensock [.] __random_r /* * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets) * - Size fd arrays in kernel to avoid resizings that kill latencies. * - Then launch xx threads doing * populate the fd array of the process, opening 'max' files. * * - Loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0); * * Usage : opensock [ -n fds_count ] [ -t threads_count] [-f] */ #include <pthread.h> #include <stdio.h> #include <sys/types.h> #include <sys/socket.h> #include <stdlib.h> #include <errno.h> #include <fcntl.h> #include <unistd.h> #include <stdlib.h> #include <string.h> unsigned int count; int skflags; #define NBTHREADS_MAX 4096 pthread_t tid[NBTHREADS_MAX]; int nbthreads; int nbthreads_req = 24; int stop_all; #ifndef O_FD_FASTALLOC #define O_FD_FASTALLOC 0x40000000 #endif #ifndef SOCK_FD_FASTALLOC #define SOCK_FD_FASTALLOC O_FD_FASTALLOC #endif /* expand kernel fd array for optimal perf. 
* This could be done by doing a loop on dup(), * or can be done using dup2() */ int expand_fd_array(int max) { int target, res; int fd = socket(AF_INET, SOCK_STREAM, 0); if (fd == -1) { perror("socket()"); return -1; } for (;;) { count = max; target = count; if (skflags & SOCK_FD_FASTALLOC) target += count/10; res = dup2(fd, target); if (res != -1) { close(res); break; } max -= max/10; } printf("count=%u (check/increase ulimit -n)\n", count); return 0; } static char state[32] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }; /* each thread is using ~400 KB of data per unit of work */ #define WORKING_SET_SIZE 400000 static void *child_function(void *arg) { unsigned int max = count / nbthreads_req; struct random_data buf; unsigned int idx; int *tab; unsigned long iter = 0; unsigned long *work_set = malloc(WORKING_SET_SIZE); int i; if (!work_set) return NULL; tab = malloc(max * sizeof(int)); if (!tab) { free(work_set); return NULL; } memset(tab, 255, max * sizeof(int)); initstate_r(getpid(), state, sizeof(state), &buf); tab[0] = socket(AF_INET, SOCK_STREAM | skflags, 0); for (i = 1; i < max; i++) tab[i] = dup(tab[0]); while (!stop_all) { random_r(&buf, &idx); idx = idx % max; close(tab[idx]); /* user space needs typically to use a bit of the memory. 
*/ memset(work_set, idx, WORKING_SET_SIZE); tab[idx] = socket(AF_INET, SOCK_STREAM | skflags, 0); if (tab[idx] == -1) { perror("socket"); break; } iter++; } for (i = 0; i < max; i++) close(tab[i]); free(tab); free(work_set); printf("%lu\n", iter); return NULL; } static int launch_threads(void) { int i, err; for (i = 0; i < nbthreads_req; i++) { err = pthread_create(&tid[i], NULL, child_function, NULL); if (err) return err; nbthreads++; } return 0; } static void wait_end(void) { int i; for (i = 0; i < nbthreads; i++) pthread_join(tid[i], NULL); } static void usage(int code) { fprintf(stderr, "Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]\n"); exit(code); } int main(int argc, char *argv[]) { int c; int max = 1000000; int duration = 10; while ((c = getopt(argc, argv, "fn:t:l:")) != -1) { switch (c) { case 'f': skflags = SOCK_FD_FASTALLOC; break; case 'n': max = atoi(optarg); break; case 't': nbthreads_req = atoi(optarg); if (nbthreads_req > NBTHREADS_MAX) usage(1); break; case 'l': duration = atoi(optarg); break; default: usage(1); } } system("sysctl -w fs.file-max=8000000"); expand_fd_array(max); launch_threads(); sleep(duration); stop_all = 1; wait_end(); } ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 12:35 ` Eric Dumazet @ 2015-10-29 13:48 ` Eric Dumazet 2015-10-30 17:18 ` Linus Torvalds 1 sibling, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-29 13:48 UTC (permalink / raw) To: Al Viro Cc: David Miller, stephen, netdev, Linus Torvalds, dhowells, linux-fsdevel On Thu, 2015-10-29 at 05:35 -0700, Eric Dumazet wrote: > Current kernel : > > 64.98% [kernel] [k] queued_spin_lock_slowpath > 14.88% opensock [.] memset // this part simulates user land actual work ;) > 11.15% [kernel] [k] _find_next_bit.part.0 > 0.69% [kernel] [k] _raw_spin_lock > 0.46% [kernel] [k] memset_erms > 0.38% [kernel] [k] sk_alloc > 0.37% [kernel] [k] kmem_cache_alloc > 0.33% [kernel] [k] get_empty_filp > 0.31% [kernel] [k] kmem_cache_free > 0.26% [kernel] [k] __alloc_fd > 0.26% opensock [.] child_function > 0.18% [kernel] [k] inode_init_always > 0.17% opensock [.] __random_r With attached prototype patch we get this profile instead : You can see we no longer hit the spinlock issue and cache waste in find_next_bit. Userland can really progress _much_ faster. 76.86% opensock [.] memset 1.31% [kernel] [k] _raw_spin_lock 1.15% assd [.] 0x000000000056f32c 1.08% [kernel] [k] kmem_cache_free 0.97% [kernel] [k] kmem_cache_alloc 0.83% [kernel] [k] sk_alloc 0.72% [kernel] [k] memset_erms 0.70% opensock [.] 
child_function 0.67% [kernel] [k] get_empty_filp 0.65% [kernel] [k] __alloc_fd 0.58% [kernel] [k] __close_fd 0.49% [kernel] [k] queued_spin_lock_slowpath diff --git a/fs/file.c b/fs/file.c index 6c672ad329e9..eabb9a626259 100644 --- a/fs/file.c +++ b/fs/file.c @@ -22,6 +22,7 @@ #include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/workqueue.h> +#include <linux/random.h> int sysctl_nr_open __read_mostly = 1024*1024; int sysctl_nr_open_min = BITS_PER_LONG; @@ -471,6 +472,19 @@ int __alloc_fd(struct files_struct *files, spin_lock(&files->file_lock); repeat: fdt = files_fdtable(files); + + if (unlikely(flags & O_FD_FASTALLOC)) { + u32 rnd, limit = min(end, fdt->max_fds); + + /* + * Note: do not bother with files->next_fd, + * this is for POSIX lovers... + */ + rnd = ((u64)prandom_u32() * limit) >> 32; + fd = find_next_zero_bit(fdt->open_fds, limit, rnd); + if (fd < limit) + goto ok; + } fd = start; if (fd < files->next_fd) fd = files->next_fd; @@ -499,7 +513,7 @@ repeat: if (start <= files->next_fd) files->next_fd = fd + 1; - +ok: __set_open_fd(fd, fdt); if (flags & O_CLOEXEC) __set_close_on_exec(fd, fdt); diff --git a/include/linux/net.h b/include/linux/net.h index 70ac5e28e6b7..3823d082af4c 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -76,6 +76,7 @@ enum sock_type { #ifndef SOCK_NONBLOCK #define SOCK_NONBLOCK O_NONBLOCK #endif +#define SOCK_FD_FASTALLOC O_FD_FASTALLOC #endif /* ARCH_HAS_SOCKET_TYPES */ diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index e063effe0cc1..badd421dd9f4 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -88,6 +88,10 @@ #define __O_TMPFILE 020000000 #endif +#ifndef O_FD_FASTALLOC +#define O_FD_FASTALLOC 0x40000000 +#endif + /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) diff --git a/net/socket.c 
b/net/socket.c index 9963a0b53a64..6dde02b2eaf9 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1227,9 +1227,10 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK); BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK); BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK); + BUILD_BUG_ON(SOCK_FD_FASTALLOC & SOCK_TYPE_MASK); flags = type & ~SOCK_TYPE_MASK; - if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) + if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC)) return -EINVAL; type &= SOCK_TYPE_MASK; @@ -1240,7 +1241,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) if (retval < 0) goto out; - retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK)); + retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK | O_FD_FASTALLOC)); if (retval < 0) goto out_release; @@ -1266,7 +1267,7 @@ SYSCALL_DEFINE4(socketpair, int, family, int, type, int, protocol, int flags; flags = type & ~SOCK_TYPE_MASK; - if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) + if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC)) return -EINVAL; type &= SOCK_TYPE_MASK; @@ -1436,7 +1437,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr, int err, len, newfd, fput_needed; struct sockaddr_storage address; - if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) + if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC)) return -EINVAL; if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK)) ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 12:35 ` Eric Dumazet 2015-10-29 13:48 ` Eric Dumazet @ 2015-10-30 17:18 ` Linus Torvalds 2015-10-30 21:02 ` Al Viro 1 sibling, 1 reply; 138+ messages in thread From: Linus Torvalds @ 2015-10-30 17:18 UTC (permalink / raw) To: Eric Dumazet Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > Well, I already tested the O_FD_FASTALLOC thing, and I can tell you > find_next_zero_bit() is nowhere to be found in kernel profiles anymore. > It also lowers time we hold the fd array spinlock while doing fd alloc. I do wonder if we couldn't just speed up the bitmap allocator by an order of magnitude. It would be nicer to be able to make existing loads faster without any new "don't bother with POSIX semantics" flag. We could, for example, extend the "open_fds" bitmap with a "second-level" bitmap that has information on when the first level is full. We traverse the open_fd's bitmap one word at a time anyway, we could have a second-level bitmap that has one bit per word to say whether that word is already full. So you'd basically have: - open_fds: array with the real bitmap (exactly like we do now) - full_fds_bits: array of bits that shows which of the "unsigned longs" in 'open_fs' are already full. and then you can basically walk 4096 fd's at a time by just checking one single word in the full_fds_bits[] array. 
So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use "find_next_fd()", which would do something like #define NR_BITS_PER_BIT (64*sizeof(long)*sizeof(long)) unsigned long find_next_fd(struct fdtable *fdt, unsigned long start) { while (start < fdt->max_fds) { unsigned long startbit = start / BITS_PER_LONG; unsigned long bitbit = startbit / BITS_PER_LONG; unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG; if (bitmax > max_fds) bitmax = fdt->max_fds; // Are all the bits in all the bitbit arrays already set? if (!~fds->full_fds_bits[bitbit]) { start = bitmax; continue; } fd = find_next_zero_bit(fdt->open_fds, bitmax, start); if (fd < bitmax) return fd; } return fdt->max_fds; } which really should cut down on the bit traversal by a factor of 64. Of course, then you do have to maintain that bitmap-of-bits in __set_open_fd() and __clear_open_fd(), but that should be pretty trivial too: - __clear_open_fd() would become just __clear_bit(fd, fdt->open_fds); __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits); - and __set_open_fd() would become __set_bit(fd, fdt->open_fds); fd /= BITS_PER_LONG; if (!~fdt->open_fds[fd]) __set_bit(fd, fdt->full_fds_bits); or something. NOTE NOTE NOTE! The above is all very much written without any testing in the email client, so while I tried to make it look like "real code", consider the above just pseudo-code. The advantage of the above is that it should just work for existing binaries. It may not be quite as optimal as just introducing a new "don't bother with POSIX" feature, but quite frankly, if it cuts down the bad case of "find_next_zero_bit()" by a factor of 64 (and then adds a *small* expense factor on top of that), I suspect it should be "good enough" even for your nasty case. What do you think? Willing to try the above approach (with any inevitable bug-fixes) and see how it compares? 
Obviously in addition to any fixes to my pseudo-code above you'd need to add the allocations for the new "full_fds_bits" etc, but I think it should be easy to make the full_fds_bit allocation be *part* of the "open_fds" allocation, so you wouldn't need a new allocation in alloc_fdtable(). We already do that whole "use a single allocation" to combine open_fds with close_on_exec into one single allocation. Linus ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 17:18 ` Linus Torvalds @ 2015-10-30 21:02 ` Al Viro 2015-10-30 21:23 ` Linus Torvalds 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-30 21:02 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote: > I do wonder if we couldn't just speed up the bitmap allocator by an > order of magnitude. It would be nicer to be able to make existing > loads faster without any new "don't bother with POSIX semantics" flag. > > We could, for example, extend the "open_fds" bitmap with a > "second-level" bitmap that has information on when the first level is > full. We traverse the open_fd's bitmap one word at a time anyway, we > could have a second-level bitmap that has one bit per word to say > whether that word is already full. Your variant has 1:64 ratio; obviously better than now, but we can actually do 1:bits-per-cacheline quite easily. I've been playing with a variant that has more than two bitmaps, and AFAICS it a) does not increase the amount of cacheline pulled and b) keeps it well-bound even in the absolutely worst case (128M-odd descriptors filled, followed by close(0);dup2(1,0); - in that case it ends up accessing the 7 cachelines worth of bitmaps; your variant will read through 4000-odd cachelines of the summary bitmap alone; the mainline is even worse). > The advantage of the above is that it should just work for existing > binaries. It may not be quite as optimal as just introducing a new > "don't care about POSIX" feature, but quite frankly, if it cuts down > the bad case of "find_next_zero_bit()" by a factror of 64 (and then > adds a *small* expense factor on top of that), I suspect it should be > "good enough" even for your nasty case. > > What do you think? 
Willing to try the above approach (with any > inevitable bug-fixes) and see how it compares? > > Obviously in addition to any fixes to my pseudo-code above you'd need > to add the allocations for the new "full_fds_bits" etc, but I think it > should be easy to make the full_fds_bit allocation be *part* of the > "open_fds" allocation, so you wouldn't need a new allocation in > alloc_fdtable(). We already do that whole "use a single allocation" to > combine open_fds with close_on_exec into one single allocation. I'll finish testing what I've got and post it; it costs 3 extra pointers in the files_struct and a bit fatter bitmap allocation (less than 0.2% extra). All the arguments regarding the unmodified binaries apply, of course, and so far it looks fairly compact... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 21:02 ` Al Viro @ 2015-10-30 21:23 ` Linus Torvalds 2015-10-30 21:50 ` Linus Torvalds 0 siblings, 1 reply; 138+ messages in thread From: Linus Torvalds @ 2015-10-30 21:23 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Your variant has 1:64 ratio; obviously better than now, but we can actually > do 1:bits-per-cacheline quite easily. Ok, but in that case you end up needing a counter for each cacheline too (to count how many bits, in order to know when to say "cacheline is entirely full"). Because otherwise you'll end up having to check the whole cacheline (rather than check just one word, or check the "bit counter") when you set a bit. So I suspect a "bitmap of full words" ends up being much simpler. But hey, if you can make your more complex thing work, go right ahead. Linus ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 21:23 ` Linus Torvalds @ 2015-10-30 21:50 ` Linus Torvalds 2015-10-30 22:33 ` Al Viro 2015-10-31 1:07 ` Eric Dumazet 0 siblings, 2 replies; 138+ messages in thread From: Linus Torvalds @ 2015-10-30 21:50 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1508 bytes --] On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> Your variant has 1:64 ratio; obviously better than now, but we can actually >> do 1:bits-per-cacheline quite easily. > > Ok, but in that case you end up needing a counter for each cacheline > too (to count how many bits, in order to know when to say "cacheline > is entirely full"). So here's a largely untested version of my "one bit per word" approach. It seems to work, but looking at it, I'm unhappy with a few things: - using kmalloc() for the .full_fds_bits[] array is simple, but disgusting, since 99% of all programs just have a single word. I know I talked about just adding the allocation to the same one that allocates the bitmaps themselves, but I got lazy and didn't do it. Especially since that code seems to try fairly hard to make the allocations nice powers of two, according to the comments. That may actually matter from an allocation standpoint. - Maybe we could just use that "full_fds_bits_init" field for when a single word is sufficient, and avoid the kmalloc that way? Anyway. This is a pretty simple patch, and I actually think that we could just get rid of the "next_fd" logic entirely with this. That would make this *patch* be more complicated, but it would make the resulting *code* be simpler. Hmm? Want to play with this? Eric, what does this do to your test-case? 
Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 4966 bytes --] fs/file.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- include/linux/fdtable.h | 2 ++ 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/fs/file.c b/fs/file.c index 6c672ad329e9..0b89a97cabb5 100644 --- a/fs/file.c +++ b/fs/file.c @@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt) { kvfree(fdt->fd); kvfree(fdt->open_fds); + kfree(fdt->full_fds_bits); kfree(fdt); } @@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu) __free_fdtable(container_of(rcu, struct fdtable, rcu)); } +#define BITBIT_NR(nr) BITS_TO_LONGS(BITS_TO_LONGS(nr)) +#define BITBIT_SIZE(nr) (BITBIT_NR(nr) * sizeof(long)) + /* * Expand the fdset in the files_struct. Called with the files spinlock * held for write. @@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) memset((char *)(nfdt->open_fds) + cpy, 0, set); memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy); memset((char *)(nfdt->close_on_exec) + cpy, 0, set); + + cpy = BITBIT_SIZE(ofdt->max_fds); + set = BITBIT_SIZE(nfdt->max_fds) - cpy; + memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy); + memset(cpy+(char *)nfdt->full_fds_bits, 0, set); } static struct fdtable * alloc_fdtable(unsigned int nr) @@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr) data += nr / BITS_PER_BYTE; fdt->close_on_exec = data; + /* + * The "bitmap of bitmaps" has a bit set for each word in + * the open_fds array that is full. We initialize it to + * zero both at allocation and at copying, because it is + * important that it never have a bit set for a non-full + * word, but it doesn't have to be exact otherwise. 
+ */ + fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL); + if (!fdt->full_fds_bits) + goto out_nofull; + return fdt; +out_nofull: + kvfree(fdt->open_fds); out_arr: kvfree(fdt->fd); out_fdt: @@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) __clear_bit(fd, fdt->close_on_exec); } -static inline void __set_open_fd(int fd, struct fdtable *fdt) +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt) { __set_bit(fd, fdt->open_fds); + fd /= BITS_PER_LONG; + if (!~fdt->open_fds[fd]) + __set_bit(fd, fdt->full_fds_bits); } -static inline void __clear_open_fd(int fd, struct fdtable *fdt) +static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt) { __clear_bit(fd, fdt->open_fds); + __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits); } static int count_open_files(struct fdtable *fdt) @@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) new_fdt->max_fds = NR_OPEN_DEFAULT; new_fdt->close_on_exec = newf->close_on_exec_init; new_fdt->open_fds = newf->open_fds_init; + new_fdt->full_fds_bits = newf->full_fds_bits_init; new_fdt->fd = &newf->fd_array[0]; spin_lock(&oldf->file_lock); @@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8); memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8); + memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files)); for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; @@ -454,10 +482,25 @@ struct files_struct init_files = { .fd = &init_files.fd_array[0], .close_on_exec = init_files.close_on_exec_init, .open_fds = init_files.open_fds_init, + .full_fds_bits = init_files.full_fds_bits_init, }, .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock), }; +static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start) +{ + unsigned long maxfd = fdt->max_fds; + unsigned long maxbit = 
maxfd / BITS_PER_LONG; + unsigned long bitbit = start / BITS_PER_LONG; + + bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; + if (bitbit > maxfd) + return maxfd; + if (bitbit > start) + start = bitbit; + return find_next_zero_bit(fdt->open_fds, maxfd, start); +} + /* * allocate a file descriptor, mark it busy. */ @@ -476,7 +519,7 @@ repeat: fd = files->next_fd; if (fd < fdt->max_fds) - fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd); + fd = find_next_fd(fdt, fd); /* * N.B. For clone tasks sharing a files structure, this test diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 674e3e226465..5295535b60c6 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -26,6 +26,7 @@ struct fdtable { struct file __rcu **fd; /* current fd array */ unsigned long *close_on_exec; unsigned long *open_fds; + unsigned long *full_fds_bits; struct rcu_head rcu; }; @@ -59,6 +60,7 @@ struct files_struct { int next_fd; unsigned long close_on_exec_init[1]; unsigned long open_fds_init[1]; + unsigned long full_fds_bits_init[1]; struct file __rcu * fd_array[NR_OPEN_DEFAULT]; }; ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 21:50 ` Linus Torvalds @ 2015-10-30 22:33 ` Al Viro 2015-10-30 23:52 ` Linus Torvalds 2015-10-31 1:07 ` Eric Dumazet 1 sibling, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-30 22:33 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote: > Anyway. This is a pretty simple patch, and I actually think that we > could just get rid of the "next_fd" logic entirely with this. That > would make this *patch* be more complicated, but it would make the > resulting *code* be simpler. Dropping next_fd would screw you in case of strictly sequential allocations... Your point re power-of-two allocations is well-taken, but then I'm not sure that kzalloc() is good enough here. Look: you have a bit for every 64 descriptors, i.e. byte per 512. On 10M case Eric had been talking about that'll yield 32Kb worth of your secondary bitmap. It's right on the edge of the range where vmalloc() becomes attractive; for something bigger it gets even worse... Currently we go for vmalloc (on bitmaps) once we are past 128K descriptors (two bitmaps packed together => 256Kbit = 32Kb). kmalloc() is very sensitive to size being a power of two, but IIRC vmalloc() isn't... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 22:33 ` Al Viro @ 2015-10-30 23:52 ` Linus Torvalds 2015-10-31 0:09 ` Al Viro ` (2 more replies) 0 siblings, 3 replies; 138+ messages in thread From: Linus Torvalds @ 2015-10-30 23:52 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 2106 bytes --] On Fri, Oct 30, 2015 at 3:33 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote: > >> Anyway. This is a pretty simple patch, and I actually think that we >> could just get rid of the "next_fd" logic entirely with this. That >> would make this *patch* be more complicated, but it would make the >> resulting *code* be simpler. > > Dropping next_fd would screw you in case of strictly sequential allocations... I don't think it would matter in real life, since I don't really think you have lots of fd's with strictly sequential behavior. That said, the trivial "open lots of fds" benchmark would show it, so I guess we can just keep it. The next_fd logic is not expensive or complex, after all. > Your point re power-of-two allocations is well-taken, but then I'm not > sure that kzalloc() is good enough here. Attached is an updated patch that just uses the regular bitmap allocator and extends it to also have the bitmap of bitmaps. It actually simplifies the patch, so I guess it's better this way. Anyway, I've tested it all a bit more, and for a trivial worst-case stress program that explicitly kills the next_fd logic by doing for (i = 0; i < 1000000; i++) { close(3); dup2(0,3); if (dup(0)) break; } it takes it down from roughly 10s to 0.2s. So the patch is quite noticeable on that kind of pattern. NOTE! You'll obviously need to increase your limits to actually be able to do the above with lots of file descriptors. 
I ran Eric's test-program too, and find_next_zero_bit() dropped to a fraction of a percent. It's not entirely gone, but it's down in the noise. I really suspect this patch is "good enough" in reality, and I would *much* rather do something like this than add a new non-POSIX flag that people have to update their binaries for. I agree with Eric that *some* people will do so, but it's still the wrong thing to do. Let's just make performance with the normal semantics be good enough that we don't need to play odd special games. Eric? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 4590 bytes --] fs/file.c | 39 +++++++++++++++++++++++++++++++++++---- include/linux/fdtable.h | 2 ++ 2 files changed, 37 insertions(+), 4 deletions(-) diff --git a/fs/file.c b/fs/file.c index 6c672ad329e9..6f6eb2b03af5 100644 --- a/fs/file.c +++ b/fs/file.c @@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu) __free_fdtable(container_of(rcu, struct fdtable, rcu)); } +#define BITBIT_NR(nr) BITS_TO_LONGS(BITS_TO_LONGS(nr)) +#define BITBIT_SIZE(nr) (BITBIT_NR(nr) * sizeof(long)) + /* * Expand the fdset in the files_struct. Called with the files spinlock * held for write. 
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) memset((char *)(nfdt->open_fds) + cpy, 0, set); memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy); memset((char *)(nfdt->close_on_exec) + cpy, 0, set); + + cpy = BITBIT_SIZE(ofdt->max_fds); + set = BITBIT_SIZE(nfdt->max_fds) - cpy; + memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy); + memset(cpy+(char *)nfdt->full_fds_bits, 0, set); } static struct fdtable * alloc_fdtable(unsigned int nr) @@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr) fdt->fd = data; data = alloc_fdmem(max_t(size_t, - 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES)); + 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES)); if (!data) goto out_arr; fdt->open_fds = data; data += nr / BITS_PER_BYTE; fdt->close_on_exec = data; + data += nr / BITS_PER_BYTE; + fdt->full_fds_bits = data; return fdt; @@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) __clear_bit(fd, fdt->close_on_exec); } -static inline void __set_open_fd(int fd, struct fdtable *fdt) +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt) { __set_bit(fd, fdt->open_fds); + fd /= BITS_PER_LONG; + if (!~fdt->open_fds[fd]) + __set_bit(fd, fdt->full_fds_bits); } -static inline void __clear_open_fd(int fd, struct fdtable *fdt) +static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt) { __clear_bit(fd, fdt->open_fds); + __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits); } static int count_open_files(struct fdtable *fdt) @@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) new_fdt->max_fds = NR_OPEN_DEFAULT; new_fdt->close_on_exec = newf->close_on_exec_init; new_fdt->open_fds = newf->open_fds_init; + new_fdt->full_fds_bits = newf->full_fds_bits_init; new_fdt->fd = &newf->fd_array[0]; spin_lock(&oldf->file_lock); @@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) 
memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8); memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8); + memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files)); for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; @@ -454,10 +470,25 @@ struct files_struct init_files = { .fd = &init_files.fd_array[0], .close_on_exec = init_files.close_on_exec_init, .open_fds = init_files.open_fds_init, + .full_fds_bits = init_files.full_fds_bits_init, }, .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock), }; +static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start) +{ + unsigned long maxfd = fdt->max_fds; + unsigned long maxbit = maxfd / BITS_PER_LONG; + unsigned long bitbit = start / BITS_PER_LONG; + + bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; + if (bitbit > maxfd) + return maxfd; + if (bitbit > start) + start = bitbit; + return find_next_zero_bit(fdt->open_fds, maxfd, start); +} + /* * allocate a file descriptor, mark it busy. */ @@ -476,7 +507,7 @@ repeat: fd = files->next_fd; if (fd < fdt->max_fds) - fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd); + fd = find_next_fd(fdt, fd); /* * N.B. For clone tasks sharing a files structure, this test diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 674e3e226465..5295535b60c6 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -26,6 +26,7 @@ struct fdtable { struct file __rcu **fd; /* current fd array */ unsigned long *close_on_exec; unsigned long *open_fds; + unsigned long *full_fds_bits; struct rcu_head rcu; }; @@ -59,6 +60,7 @@ struct files_struct { int next_fd; unsigned long close_on_exec_init[1]; unsigned long open_fds_init[1]; + unsigned long full_fds_bits_init[1]; struct file __rcu * fd_array[NR_OPEN_DEFAULT]; }; ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 23:52 ` Linus Torvalds @ 2015-10-31 0:09 ` Al Viro 2015-10-31 15:59 ` Eric Dumazet 2015-10-31 19:34 ` Al Viro 2 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-10-31 0:09 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote: > I really suspect this patch is "good enough" in reality, and I would > *much* rather do something like this than add a new non-POSIX flag > that people have to update their binaries for. I agree with Eric that > *some* people will do so, but it's still the wrong thing to do. Let's > just make performance with the normal semantics be good enough that we > don't need to play odd special games. > > Eric? IIRC, at least a part of what Eric used to complain about was that on seriously multithreaded processes doing a lot of e.g. socket(2) we end up with a lot of bouncing of the cacheline containing the first free bits in the bitmap. But looking at the whole thing, I really wonder whether the tons of threads asking for random bytes won't get at least as bad cacheline bouncing while getting said bytes, so I'm not sure that rationale has survived. PS: this problem obviously exists in Linus' variant as well as in mine; the question is whether Eric's approach manages to avoid it in the first place. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 23:52 ` Linus Torvalds 2015-10-31 0:09 ` Al Viro @ 2015-10-31 15:59 ` Eric Dumazet 2015-10-31 19:34 ` Al Viro 2 siblings, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-31 15:59 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote: sequential allocations... > > I don't think it would matter in real life, since I don't really think > you have lots of fd's with strictly sequential behavior. > > That said, the trivial "open lots of fds" benchmark would show it, so > I guess we can just keep it. The next_fd logic is not expensive or > complex, after all. +1 > Attached is an updated patch that just uses the regular bitmap > allocator and extends it to also have the bitmap of bitmaps. It > actually simplifies the patch, so I guess it's better this way. > > Anyway, I've tested it all a bit more, and for a trivial worst-case > stress program that explicitly kills the next_fd logic by doing > > for (i = 0; i < 1000000; i++) { > close(3); > dup2(0,3); > if (dup(0)) > break; > } > > it takes it down from roughly 10s to 0.2s. So the patch is quite > noticeable on that kind of pattern. > > NOTE! You'll obviously need to increase your limits to actually be > able to do the above with lots of file descriptors. > > I ran Eric's test-program too, and find_next_zero_bit() dropped to a > fraction of a percent. It's not entirely gone, but it's down in the > noise. > > I really suspect this patch is "good enough" in reality, and I would > *much* rather do something like this than add a new non-POSIX flag > that people have to update their binaries for. I agree with Eric that > *some* people will do so, but it's still the wrong thing to do. 
Let's > just make performance with the normal semantics be good enough that we > don't need to play odd special games. > > Eric? I absolutely agree a generic solution is far better, especially when its performance is on par. Tested-by: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Note that a non-POSIX flag (or a thread personality hint) would still allow the kernel to do proper NUMA affinity placement: say the fd_array and bitmaps are split on the 2 nodes (or more, but most servers nowadays have 2 sockets really). Then at fd allocation time, we can prefer to pick an fd for which the memory holding the various bits and the file pointer is in the local node. This speeds up subsequent fd system calls in programs that constantly blow away cpu caches, saving QPI transactions. Thanks a lot, Linus. lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 count=10000000 (check/increase ulimit -n) total = 3992764 lpaa24:~# ./opensock -t 48 -n 10000000 -l 10 count=10000000 (check/increase ulimit -n) total = 3545249 Profile with 16 threads : 69.55% opensock [.] memset 11.83% [kernel] [k] queued_spin_lock_slowpath 1.91% [kernel] [k] _find_next_bit.part.0 1.68% [kernel] [k] _raw_spin_lock 0.99% [kernel] [k] kmem_cache_alloc 0.99% [kernel] [k] memset_erms 0.95% [kernel] [k] get_empty_filp 0.82% [kernel] [k] __close_fd 0.73% [kernel] [k] __alloc_fd 0.65% [kernel] [k] sk_alloc 0.63% opensock [.] child_function 0.56% [kernel] [k] fput 0.35% [kernel] [k] sock_alloc 0.31% [kernel] [k] kmem_cache_free 0.31% [kernel] [k] inode_init_always 0.28% [kernel] [k] d_set_d_op 0.27% [kernel] [k] entry_SYSCALL_64_after_swapgs Profile with 48 threads : 57.92% [kernel] [k] queued_spin_lock_slowpath 32.14% opensock [.]
memset 0.81% [kernel] [k] _find_next_bit.part.0 0.51% [kernel] [k] _raw_spin_lock 0.45% [kernel] [k] kmem_cache_alloc 0.38% [kernel] [k] kmem_cache_free 0.34% [kernel] [k] __close_fd 0.32% [kernel] [k] memset_erms 0.25% [kernel] [k] __alloc_fd 0.24% [kernel] [k] get_empty_filp 0.23% opensock [.] child_function 0.18% [kernel] [k] __d_alloc 0.17% [kernel] [k] inode_init_always 0.16% [kernel] [k] sock_alloc 0.16% [kernel] [k] del_timer 0.15% [kernel] [k] entry_SYSCALL_64_after_swapgs 0.15% perf [.] 0x000000000004d924 0.15% [kernel] [k] tcp_close ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 23:52 ` Linus Torvalds 2015-10-31 0:09 ` Al Viro 2015-10-31 15:59 ` Eric Dumazet @ 2015-10-31 19:34 ` Al Viro 2015-10-31 19:54 ` Linus Torvalds 2 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-31 19:34 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote: > I really suspect this patch is "good enough" in reality, and I would > *much* rather do something like this than add a new non-POSIX flag > that people have to update their binaries for. I agree with Eric that > *some* people will do so, but it's still the wrong thing to do. Let's > just make performance with the normal semantics be good enough that we > don't need to play odd special games. > > Eric? ... and here's the current variant of mine. FWIW, it seems to survive LTP and xfstests + obvious "let's torture the allocator". On the "million dups" test it seems to be about 25% faster than the one Linus had posted; at ten million - about 80%. On opensock the results seem to be about 20% better than with the variant Linus has posted, but I'm not sure if the testbox is anywhere near the expected, so I'd appreciate it if you'd give it a spin on your setups. It obviously needs saner comments, tuning, etc. BTW, another obvious low-hanging fruit with this data structure is count_open_files() (and that goes for the 1:64 bitmap Linus uses) - dup2(0, 10000000); close(10000000); fork(); and count_open_files() is chewing through the damn bitmap from about 16M down to low tens. While holding ->files_lock, at that... I'm not saying it's critical, and it's definitely followup patch fodder in either approach, but it's easy enough to do.
diff --git a/fs/file.c b/fs/file.c index 6c672ad..fa43cbe 100644 --- a/fs/file.c +++ b/fs/file.c @@ -30,6 +30,8 @@ int sysctl_nr_open_min = BITS_PER_LONG; int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; +#define BITS_PER_CHUNK 512 + static void *alloc_fdmem(size_t size) { /* @@ -46,8 +48,10 @@ static void *alloc_fdmem(size_t size) static void __free_fdtable(struct fdtable *fdt) { + int i; kvfree(fdt->fd); - kvfree(fdt->open_fds); + for (i = 0; i <= 3; i++) + kvfree(fdt->bitmaps[i]); kfree(fdt); } @@ -62,7 +66,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu) */ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) { - unsigned int cpy, set; + unsigned int cpy, set, to, from, level, n; BUG_ON(nfdt->max_fds < ofdt->max_fds); @@ -71,18 +75,53 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) memcpy(nfdt->fd, ofdt->fd, cpy); memset((char *)(nfdt->fd) + cpy, 0, set); - cpy = ofdt->max_fds / BITS_PER_BYTE; - set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE; - memcpy(nfdt->open_fds, ofdt->open_fds, cpy); - memset((char *)(nfdt->open_fds) + cpy, 0, set); + cpy = ofdt->max_fds / 8; + set = (nfdt->max_fds - ofdt->max_fds) / 8; memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy); memset((char *)(nfdt->close_on_exec) + cpy, 0, set); + if (likely(!nfdt->bitmaps[1])) { + // flat to flat + memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy); + memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set); + return; + } + to = round_up(nfdt->max_fds, BITS_PER_CHUNK); + set = (to - ofdt->max_fds) / 8; + // copy and pad the primary + memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8); + memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set); + // copy and pad the old secondaries + from = round_up(ofdt->max_fds, BITS_PER_CHUNK); + for (level = 1; level <= 3; level++) { + if (!ofdt->bitmaps[level]) + break; + to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK); + from = round_up(from / 
BITS_PER_CHUNK, BITS_PER_CHUNK); + memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8); + memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8); + } + // zero the new ones (if any) + for (n = level; n <= 3; n++) { + if (!nfdt->bitmaps[n]) + break; + to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK); + memset(nfdt->bitmaps[n], 0, to / 8); + } + // and maybe adjust bit 0 in the first new one. + if (unlikely(n != level)) { + unsigned long *p = nfdt->bitmaps[level - 1]; + for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++) + if (~p[n]) + return; + __set_bit(0, nfdt->bitmaps[level]); + } } static struct fdtable * alloc_fdtable(unsigned int nr) { struct fdtable *fdt; void *data; + int level = 0; /* * Figure out how many fds we actually want to support in this fdtable. @@ -114,16 +153,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr) goto out_fdt; fdt->fd = data; + if (nr > BITS_PER_CHUNK) + nr = round_up(nr, BITS_PER_CHUNK); data = alloc_fdmem(max_t(size_t, 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES)); if (!data) goto out_arr; - fdt->open_fds = data; + fdt->bitmaps[0] = data; data += nr / BITS_PER_BYTE; fdt->close_on_exec = data; - + fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL; + while (unlikely(nr > BITS_PER_CHUNK)) { + nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK); + data = alloc_fdmem(nr); + if (!data) + goto out_bitmaps; + fdt->bitmaps[++level] = data; + } return fdt; +out_bitmaps: + while (level >= 0) + kvfree(fdt->bitmaps[level--]); out_arr: kvfree(fdt->fd); out_fdt: @@ -229,14 +280,47 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) __clear_bit(fd, fdt->close_on_exec); } +static bool set_and_check(unsigned long *start, unsigned n) +{ + int i; + start += (n & -BITS_PER_CHUNK) / BITS_PER_LONG; + n %= BITS_PER_CHUNK; + __set_bit(n, start); + for (i = 0; i < BITS_PER_CHUNK / BITS_PER_LONG; i++) + if (~*start++) + return true; + return false; +} + static inline void __set_open_fd(int fd, struct 
fdtable *fdt) { - __set_bit(fd, fdt->open_fds); + int level; + for (level = 0; ; level++, fd /= BITS_PER_CHUNK) { + if (level == 3 || !fdt->bitmaps[level + 1]) { + __set_bit(fd, fdt->bitmaps[level]); + break; + } + if (likely(set_and_check(fdt->bitmaps[level], fd))) + break; + } } static inline void __clear_open_fd(int fd, struct fdtable *fdt) { - __clear_bit(fd, fdt->open_fds); + int level; + unsigned long *p = fdt->bitmaps[0] + fd / BITS_PER_LONG, v; + v = *p; + __clear_bit(fd % BITS_PER_LONG, p); + if (~v) // quick test to avoid looking at other cachelines + return; + for (level = 1; level <= 3; level++) { + if (!fdt->bitmaps[level]) + break; + fd /= BITS_PER_CHUNK; + if (!test_bit(fd, fdt->bitmaps[level])) + break; + __clear_bit(fd, fdt->bitmaps[level]); + } } static int count_open_files(struct fdtable *fdt) @@ -246,7 +330,7 @@ static int count_open_files(struct fdtable *fdt) /* Find the last open fd */ for (i = size / BITS_PER_LONG; i > 0; ) { - if (fdt->open_fds[--i]) + if (fdt->bitmaps[0][--i]) break; } i = (i + 1) * BITS_PER_LONG; @@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) { struct files_struct *newf; struct file **old_fds, **new_fds; - int open_files, size, i; + int open_files, size, i, n; struct fdtable *old_fdt, *new_fdt; *errorp = -ENOMEM; @@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) new_fdt = &newf->fdtab; new_fdt->max_fds = NR_OPEN_DEFAULT; new_fdt->close_on_exec = newf->close_on_exec_init; - new_fdt->open_fds = newf->open_fds_init; + new_fdt->bitmaps[0] = newf->open_fds_init; + new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL; new_fdt->fd = &newf->fd_array[0]; spin_lock(&oldf->file_lock); @@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) old_fds = old_fdt->fd; new_fds = new_fdt->fd; - memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8); memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, 
open_files / 8); + memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8); + + n = round_up(open_files, BITS_PER_CHUNK); + for (i = 1; i <= 3; i++) { + if (!new_fdt->bitmaps[i]) + break; + n /= BITS_PER_CHUNK; + n = round_up(n, BITS_PER_CHUNK); + memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8); + } for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; @@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) int left = (new_fdt->max_fds - open_files) / 8; int start = open_files / BITS_PER_LONG; - memset(&new_fdt->open_fds[start], 0, left); - memset(&new_fdt->close_on_exec[start], 0, left); + memset(new_fdt->close_on_exec + start, 0, left); + if (likely(!new_fdt->bitmaps[1])) { + memset(new_fdt->bitmaps[0] + start, 0, left); + goto done; + } + start = round_up(open_files, BITS_PER_CHUNK); + n = round_up(new_fdt->max_fds, BITS_PER_CHUNK); + for (i = 0 ; i <= 3; i++) { + char *p = (void *)new_fdt->bitmaps[i]; + if (!p) + break; + n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK); + start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK); + memset(p + start / 8, 0, (n - start) / 8); + } } +done: rcu_assign_pointer(newf->fdt, new_fdt); return newf; @@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files) i = j * BITS_PER_LONG; if (i >= fdt->max_fds) break; - set = fdt->open_fds[j++]; + set = fdt->bitmaps[0][j++]; while (set) { if (set & 1) { struct file * file = xchg(&fdt->fd[i], NULL); @@ -453,70 +561,146 @@ struct files_struct init_files = { .max_fds = NR_OPEN_DEFAULT, .fd = &init_files.fd_array[0], .close_on_exec = init_files.close_on_exec_init, - .open_fds = init_files.open_fds_init, + .bitmaps[0] = init_files.open_fds_init, }, .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock), }; -/* - * allocate a file descriptor, mark it busy. 
- */ +/* search for the next zero bit in cacheline */ +static unsigned scan(unsigned long *start, unsigned size, unsigned from, + int check_zeroes) +{ + unsigned long *p = start + from / BITS_PER_LONG, *q = p, *end; + unsigned bit = from % BITS_PER_LONG, res; + unsigned long v = *p, w = v + (1UL<<bit); + + start += (from & -BITS_PER_CHUNK) / BITS_PER_LONG; + end = start + size / BITS_PER_LONG; + + if (unlikely(!(w & ~v))) { + while (likely(++q < end)) { + v = *q; + w = v + 1; + if (likely(w)) + goto got_it; + } + return size; // not in this chunk + } +got_it: + res = __ffs(w & ~v); // log2, really - it's a power of 2 + res += (q - start) * BITS_PER_LONG; + if (!check_zeroes) + return res; + if (likely(~(v | w))) // would zeroes remain in *q? + return res; + if (p == q || !bit) // was *p fully checked? + p--; + while (++q < end) // any zeros in the tail? + if (likely(~*q)) + return res; + if (unlikely(check_zeroes > 1)) + for (q = start; q <= p; q++) + if (~*q) + return res; + return res | (1U<<31); +} + int __alloc_fd(struct files_struct *files, unsigned start, unsigned end, unsigned flags) { - unsigned int fd; + unsigned int fd, base; int error; struct fdtable *fdt; - + unsigned count, level, n; + int summary; spin_lock(&files->file_lock); repeat: fdt = files_fdtable(files); + count = fdt->max_fds; + summary = 2; fd = start; - if (fd < files->next_fd) + if (likely(fd <= files->next_fd)) { fd = files->next_fd; - - if (fd < fdt->max_fds) - fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd); - - /* - * N.B. For clone tasks sharing a files structure, this test - * will limit the total number of files that can be opened. 
- */ - error = -EMFILE; - if (fd >= end) - goto out; - - error = expand_files(files, fd); - if (error < 0) + summary = 1; + } + base = fd; + if (unlikely(base >= count)) + goto expand2; + if (likely(!fdt->bitmaps[1])) { + base = scan(fdt->bitmaps[0], count, base, 0); + if (unlikely(base == count)) + goto expand; + if (unlikely(base >= end)) { + error = -EMFILE; + goto out; + } + fd = base; + __set_bit(fd, fdt->bitmaps[0]); + goto found; + } + n = scan(fdt->bitmaps[0], BITS_PER_CHUNK, base, summary); + base &= -BITS_PER_CHUNK; + base += n & ~(1U<<31); + if (unlikely(n == BITS_PER_CHUNK)) { + int bits[3]; + level = 0; + do { + bits[level] = count; + count = DIV_ROUND_UP(count, BITS_PER_CHUNK); + base /= BITS_PER_CHUNK; + n = scan(fdt->bitmaps[++level], BITS_PER_CHUNK, base, 0); + base &= -BITS_PER_CHUNK; + base += n; + if (unlikely(base >= count)) + goto expand; + } while (unlikely(n == BITS_PER_CHUNK)); + while (level--) { + base *= BITS_PER_CHUNK; + n = scan(fdt->bitmaps[level], BITS_PER_CHUNK, base, !level); + if (WARN_ON(n == BITS_PER_CHUNK)) { + error = -EINVAL; + goto out; + } + base += n & ~(1U<<31); + if (unlikely(base >= bits[level])) + goto expand; + } + } + if (unlikely(base >= end)) { + error = -EMFILE; goto out; - - /* - * If we needed to expand the fs array we - * might have blocked - try again. 
- */ - if (error) - goto repeat; - - if (start <= files->next_fd) + } + fd = base; + __set_bit(fd, fdt->bitmaps[0]); + if (unlikely(n & (1U << 31))) { + for (level = 1; ; level++) { + base /= BITS_PER_CHUNK; + if (level == 3 || !fdt->bitmaps[level + 1]) { + __set_bit(base, fdt->bitmaps[level]); + break; + } + if (likely(set_and_check(fdt->bitmaps[level], base))) + break; + } + } +found: + if (summary == 1) files->next_fd = fd + 1; - - __set_open_fd(fd, fdt); if (flags & O_CLOEXEC) __set_close_on_exec(fd, fdt); else __clear_close_on_exec(fd, fdt); error = fd; -#if 1 - /* Sanity check */ - if (rcu_access_pointer(fdt->fd[fd]) != NULL) { - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd); - rcu_assign_pointer(fdt->fd[fd], NULL); - } -#endif - out: spin_unlock(&files->file_lock); return error; +expand: + base = fdt->max_fds; +expand2: + error = expand_files(files, base); + if (error < 0) + goto out; + goto repeat; } static int alloc_fd(unsigned start, unsigned flags) @@ -809,7 +993,8 @@ __releases(&files->file_lock) goto Ebusy; get_file(file); rcu_assign_pointer(fdt->fd[fd], file); - __set_open_fd(fd, fdt); + if (!tofree) + __set_open_fd(fd, fdt); if (flags & O_CLOEXEC) __set_close_on_exec(fd, fdt); else diff --git a/fs/select.c b/fs/select.c index 0155473..670f542 100644 --- a/fs/select.c +++ b/fs/select.c @@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds) set = ~(~0UL << (n & (BITS_PER_LONG-1))); n /= BITS_PER_LONG; fdt = files_fdtable(current->files); - open_fds = fdt->open_fds + n; + open_fds = fdt->bitmaps[0] + n; max = 0; if (set) { set &= BITS(fds, n); diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 674e3e2..6ef5274 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -25,7 +25,7 @@ struct fdtable { unsigned int max_fds; struct file __rcu **fd; /* current fd array */ unsigned long *close_on_exec; - unsigned long *open_fds; + unsigned long *bitmaps[4]; struct rcu_head rcu; }; @@ -36,7 +36,7 
@@ static inline bool close_on_exec(int fd, const struct fdtable *fdt) static inline bool fd_is_open(int fd, const struct fdtable *fdt) { - return test_bit(fd, fdt->open_fds); + return test_bit(fd, fdt->bitmaps[0]); } /* ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 19:34 ` Al Viro @ 2015-10-31 19:54 ` Linus Torvalds 2015-10-31 20:29 ` Al Viro 2015-10-31 20:45 ` Eric Dumazet 0 siblings, 2 replies; 138+ messages in thread From: Linus Torvalds @ 2015-10-31 19:54 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > ... and here's the current variant of mine. Ugh. I really liked how simple mine ended up being. Yours is definitely not. And based on the profiles from Eric, finding the fd is no longer the problem even with my simpler patch. The problem ends up being the contention on the file_lock spinlock. Eric, I assume that's not "expand_fdtable", since your test-program seems to expand the fd array at the beginning. So it's presumably all from the __alloc_fd() use, but we should double-check.. Eric, can you do a callgraph profile and see which caller is the hottest? Linus ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 19:54 ` Linus Torvalds @ 2015-10-31 20:29 ` Al Viro 2015-11-02 0:24 ` Al Viro 2015-10-31 20:45 ` Eric Dumazet 1 sibling, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-31 20:29 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote: > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > ... and here's the current variant of mine. > > Ugh. I really liked how simple mine ended up being. Yours is definitely not. Note that it's not the final variant - just something that should be testable. There are all kinds of things that still need cleaning/simplifying in there - e.g. scan() is definitely more complex than needed (if nothing else, the "small bitmap" case is simply find_next_zero_bit(), and the rest all have size equal to a full cacheline); moreover, I'd overdone the "... and check if there are other zero bits left" thing - its callers used to use that a lot, and with the exception of two of them it was absolutely worthless. So it ended up more generic than necessary and I'm going to trim that crap down. It's still going to end up more complex than yours, obviously, but not as badly as it is now. I'm not sure the speedup will be worth the extra complexity, thus asking for testing... On _some_ loads it is considerably faster (at least by a factor of 5), but whether those loads resemble anything that occurs on real systems... BTW, a considerable amount of the unpleasantness is due to ragged-right-end kind of problems - set /proc/sys/fs/nr-open to something other than a power of two and a whole lot of fun issues start happening.
I went for "if there are secondary bitmaps at all, pad all bitmaps to a multiple of cacheline", which at least somewhat mitigates that mess; hell knows, there might be a clever way to sidestep it entirely... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 20:29 ` Al Viro @ 2015-11-02 0:24 ` Al Viro 2015-11-02 0:59 ` Linus Torvalds 2015-11-02 2:14 ` Eric Dumazet 0 siblings, 2 replies; 138+ messages in thread From: Al Viro @ 2015-11-02 0:24 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sat, Oct 31, 2015 at 08:29:29PM +0000, Al Viro wrote: > On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote: > > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > ... and here's the current variant of mine. > > > > Ugh. I really liked how simple mine ended up being. Yours is definitely not. > > Note that it's not the final variant - just something that should be > testable. There are all kinds of things that still need cleaning/simplifying > in there - e.g. scan() is definitely more complex than needed (if nothing > else, the "small bitmap" case is simply find_next_zero_bit(), and the > rest all have size equal to full cacheline; moreover, I'd overdone the > "... and check if there are other zero bits left" thing - its callers > used to use that a lot, and with the execption of two of them it was > absolutely worthless. So it ended up more generic than necessary and > I'm going to trim that crap down. > > It's still going to end up more complex than yours, obviously, but not as > badly as it is now. I'm not sure if the speedup will be worth the > extra complexity, thus asking for testing... On _some_ loads it is > considerably faster (at least by factor of 5), but whether those loads > resemble anything that occurs on real systems... This ought to be a bit cleaner. Eric, could you test the variant below on your setup? 
diff --git a/fs/file.c b/fs/file.c index 6c672ad..0144920 100644 --- a/fs/file.c +++ b/fs/file.c @@ -30,6 +30,9 @@ int sysctl_nr_open_min = BITS_PER_LONG; int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; +#define BITS_PER_CHUNK 512 +#define BYTES_PER_CHUNK (BITS_PER_CHUNK / 8) + static void *alloc_fdmem(size_t size) { /* @@ -46,8 +49,10 @@ static void *alloc_fdmem(size_t size) static void __free_fdtable(struct fdtable *fdt) { + int i; kvfree(fdt->fd); - kvfree(fdt->open_fds); + for (i = 0; i <= 3; i++) + kvfree(fdt->bitmaps[i]); kfree(fdt); } @@ -56,13 +61,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu) __free_fdtable(container_of(rcu, struct fdtable, rcu)); } +static inline bool cacheline_full(unsigned long *p) +{ + int n; + + for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++) + if (likely(~p[n])) + return false; + return true; +} + /* * Expand the fdset in the files_struct. Called with the files spinlock * held for write. */ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) { - unsigned int cpy, set; + unsigned int cpy, set, to, from, level, n; BUG_ON(nfdt->max_fds < ofdt->max_fds); @@ -71,18 +86,48 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) memcpy(nfdt->fd, ofdt->fd, cpy); memset((char *)(nfdt->fd) + cpy, 0, set); - cpy = ofdt->max_fds / BITS_PER_BYTE; - set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE; - memcpy(nfdt->open_fds, ofdt->open_fds, cpy); - memset((char *)(nfdt->open_fds) + cpy, 0, set); + cpy = ofdt->max_fds / 8; + set = (nfdt->max_fds - ofdt->max_fds) / 8; memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy); memset((char *)(nfdt->close_on_exec) + cpy, 0, set); + if (likely(!nfdt->bitmaps[1])) { + // flat to flat + memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy); + memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set); + return; + } + to = round_up(nfdt->max_fds, BITS_PER_CHUNK); + set = (to - ofdt->max_fds) / 8; + // copy and pad the primary + 
memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8); + memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set); + // copy and pad the old secondaries + from = round_up(ofdt->max_fds, BITS_PER_CHUNK); + for (level = 1; level <= 3; level++) { + if (!ofdt->bitmaps[level]) + break; + to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK); + from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK); + memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8); + memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8); + } + // zero the new ones (if any) + for (n = level; n <= 3; n++) { + if (!nfdt->bitmaps[n]) + break; + to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK); + memset(nfdt->bitmaps[n], 0, to / 8); + } + // and maybe adjust bit 0 in the first new one. + if (unlikely(n != level && cacheline_full(nfdt->bitmaps[level - 1]))) + __set_bit(0, nfdt->bitmaps[level]); } static struct fdtable * alloc_fdtable(unsigned int nr) { struct fdtable *fdt; void *data; + int level = 0; /* * Figure out how many fds we actually want to support in this fdtable. 
@@ -114,16 +159,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr) goto out_fdt; fdt->fd = data; + if (nr > BITS_PER_CHUNK) + nr = round_up(nr, BITS_PER_CHUNK); data = alloc_fdmem(max_t(size_t, 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES)); if (!data) goto out_arr; - fdt->open_fds = data; + fdt->bitmaps[0] = data; data += nr / BITS_PER_BYTE; fdt->close_on_exec = data; - + fdt->bitmaps[1] = fdt->bitmaps[2] = fdt->bitmaps[3] = NULL; + while (unlikely(nr > BITS_PER_CHUNK)) { + nr = round_up(nr / BITS_PER_CHUNK, BITS_PER_CHUNK); + data = alloc_fdmem(nr); + if (!data) + goto out_bitmaps; + fdt->bitmaps[++level] = data; + } return fdt; +out_bitmaps: + while (level >= 0) + kvfree(fdt->bitmaps[level--]); out_arr: kvfree(fdt->fd); out_fdt: @@ -229,14 +286,41 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) __clear_bit(fd, fdt->close_on_exec); } -static inline void __set_open_fd(int fd, struct fdtable *fdt) +static inline void __set_open_fd(unsigned fd, struct fdtable *fdt) { - __set_bit(fd, fdt->open_fds); + int level; + for (level = 0; ; level++) { + unsigned long *p; + + __set_bit(fd, fdt->bitmaps[level]); + + if (level == 3 || !fdt->bitmaps[level + 1]) + break; + + fd /= BITS_PER_CHUNK; + + p = fdt->bitmaps[level] + BIT_WORD(fd * BITS_PER_CHUNK); + if (likely(!cacheline_full(p))) + break; + } } -static inline void __clear_open_fd(int fd, struct fdtable *fdt) +static inline void __clear_open_fd(unsigned fd, struct fdtable *fdt) { - __clear_bit(fd, fdt->open_fds); + int level; + unsigned long *p = fdt->bitmaps[0] + BIT_WORD(fd); + unsigned long v = *p; + __clear_bit(fd % BITS_PER_LONG, p); + if (~v) // quick test to avoid looking at other cachelines + return; + for (level = 1; level <= 3; level++) { + if (!fdt->bitmaps[level]) + break; + fd /= BITS_PER_CHUNK; + if (!test_bit(fd, fdt->bitmaps[level])) + break; + __clear_bit(fd, fdt->bitmaps[level]); + } } static int count_open_files(struct fdtable *fdt) @@ -246,7 +330,7 @@ static int 
count_open_files(struct fdtable *fdt) /* Find the last open fd */ for (i = size / BITS_PER_LONG; i > 0; ) { - if (fdt->open_fds[--i]) + if (fdt->bitmaps[0][--i]) break; } i = (i + 1) * BITS_PER_LONG; @@ -262,7 +346,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) { struct files_struct *newf; struct file **old_fds, **new_fds; - int open_files, size, i; + int open_files, size, i, n; struct fdtable *old_fdt, *new_fdt; *errorp = -ENOMEM; @@ -279,7 +363,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) new_fdt = &newf->fdtab; new_fdt->max_fds = NR_OPEN_DEFAULT; new_fdt->close_on_exec = newf->close_on_exec_init; - new_fdt->open_fds = newf->open_fds_init; + new_fdt->bitmaps[0] = newf->open_fds_init; + new_fdt->bitmaps[1] = new_fdt->bitmaps[2] = new_fdt->bitmaps[3] = NULL; new_fdt->fd = &newf->fd_array[0]; spin_lock(&oldf->file_lock); @@ -321,8 +406,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) old_fds = old_fdt->fd; new_fds = new_fdt->fd; - memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8); memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8); + memcpy(new_fdt->bitmaps[0], old_fdt->bitmaps[0], open_files / 8); + + n = round_up(open_files, BITS_PER_CHUNK); + for (i = 1; i <= 3; i++) { + if (!new_fdt->bitmaps[i]) + break; + n /= BITS_PER_CHUNK; + n = round_up(n, BITS_PER_CHUNK); + memcpy(new_fdt->bitmaps[i], old_fdt->bitmaps[i], n / 8); + } for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; @@ -351,10 +445,24 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) int left = (new_fdt->max_fds - open_files) / 8; int start = open_files / BITS_PER_LONG; - memset(&new_fdt->open_fds[start], 0, left); - memset(&new_fdt->close_on_exec[start], 0, left); + memset(new_fdt->close_on_exec + start, 0, left); + if (likely(!new_fdt->bitmaps[1])) { + memset(new_fdt->bitmaps[0] + start, 0, left); + goto done; + } + start = round_up(open_files, BITS_PER_CHUNK); 
+ n = round_up(new_fdt->max_fds, BITS_PER_CHUNK); + for (i = 0 ; i <= 3; i++) { + char *p = (void *)new_fdt->bitmaps[i]; + if (!p) + break; + n = round_up(n / BITS_PER_CHUNK, BITS_PER_CHUNK); + start = round_up(start / BITS_PER_CHUNK, BITS_PER_CHUNK); + memset(p + start / 8, 0, (n - start) / 8); + } } +done: rcu_assign_pointer(newf->fdt, new_fdt); return newf; @@ -380,7 +488,7 @@ static struct fdtable *close_files(struct files_struct * files) i = j * BITS_PER_LONG; if (i >= fdt->max_fds) break; - set = fdt->open_fds[j++]; + set = fdt->bitmaps[0][j++]; while (set) { if (set & 1) { struct file * file = xchg(&fdt->fd[i], NULL); @@ -453,70 +561,128 @@ struct files_struct init_files = { .max_fds = NR_OPEN_DEFAULT, .fd = &init_files.fd_array[0], .close_on_exec = init_files.close_on_exec_init, - .open_fds = init_files.open_fds_init, + .bitmaps[0] = init_files.open_fds_init, }, .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock), }; -/* - * allocate a file descriptor, mark it busy. - */ +/* search for the next zero bit in cacheline */ +#define NO_FREE (1ULL<<32) +#define LAST_FREE (2ULL<<32) +#define CHUNK_ALIGNED(p) IS_ALIGNED((uintptr_t)p, BYTES_PER_CHUNK) + +static __u64 scan(struct fdtable *fdt, int level, unsigned from) +{ + unsigned long *p = fdt->bitmaps[level] + BIT_WORD(from); + unsigned long v = *p, w = v + BIT_MASK(from); + + from = round_down(from, BITS_PER_CHUNK); + + if (unlikely(!(w & ~v))) { + while (!CHUNK_ALIGNED(++p)) { + v = *p; + w = v + 1; + if (likely(w)) + goto got_it; + } + return NO_FREE | (from + BITS_PER_CHUNK); + } +got_it: + from += __ffs(w & ~v); // log2, really - it's a power of 2 + from += 8 * ((uintptr_t)p % BYTES_PER_CHUNK); + if (level) // don't bother with looking for more + return from; + if (likely(~(v | w))) // would zeroes remain in *p? + return from; + while (!CHUNK_ALIGNED(++p)) // any zeros in the tail? 
+ if (likely(~*p)) + return from; + return LAST_FREE | from; +} + int __alloc_fd(struct files_struct *files, unsigned start, unsigned end, unsigned flags) { unsigned int fd; + __u64 base; int error; struct fdtable *fdt; + unsigned count; spin_lock(&files->file_lock); repeat: fdt = files_fdtable(files); + count = fdt->max_fds; fd = start; if (fd < files->next_fd) fd = files->next_fd; - - if (fd < fdt->max_fds) - fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd); - - /* - * N.B. For clone tasks sharing a files structure, this test - * will limit the total number of files that can be opened. - */ - error = -EMFILE; - if (fd >= end) - goto out; - - error = expand_files(files, fd); - if (error < 0) - goto out; - - /* - * If we needed to expand the fs array we - * might have blocked - try again. - */ - if (error) + if (unlikely(fd >= count)) { + error = expand_files(files, fd); + if (error < 0) + goto out; goto repeat; - + } + if (likely(!fdt->bitmaps[1])) { + base = find_next_zero_bit(fdt->bitmaps[0], count, fd); + if (unlikely(base == count)) + goto expand; + if (unlikely(base >= end)) { + error = -EMFILE; + goto out; + } + fd = base; + __set_bit(fd, fdt->bitmaps[0]); + goto found; + } + base = scan(fdt, 0, fd); + if (unlikely(base & NO_FREE)) { + int bits[3]; + int level = 0; + do { + if (unlikely((u32)base >= count)) + goto expand; + bits[level] = count; + count = DIV_ROUND_UP(count, BITS_PER_CHUNK); + base = scan(fdt, ++level, (u32)base / BITS_PER_CHUNK); + } while (unlikely(base & NO_FREE)); + while (level) { + if (unlikely((u32)base >= count)) + goto expand; + base = scan(fdt, --level, (u32)base * BITS_PER_CHUNK); + if (WARN_ON(base & NO_FREE)) { + error = -EINVAL; + goto out; + } + count = bits[level]; + } + if (unlikely((u32)base >= count)) + goto expand; + } + fd = base; + if (unlikely(fd >= end)) { + error = -EMFILE; + goto out; + } + if (likely(!(base & LAST_FREE))) + __set_bit(fd, fdt->bitmaps[0]); + else + __set_open_fd(fd, fdt); +found: if (start <= 
files->next_fd) files->next_fd = fd + 1; - - __set_open_fd(fd, fdt); if (flags & O_CLOEXEC) __set_close_on_exec(fd, fdt); else __clear_close_on_exec(fd, fdt); error = fd; -#if 1 - /* Sanity check */ - if (rcu_access_pointer(fdt->fd[fd]) != NULL) { - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd); - rcu_assign_pointer(fdt->fd[fd], NULL); - } -#endif - out: spin_unlock(&files->file_lock); return error; +expand: + error = expand_files(files, fdt->max_fds); + if (error < 0) + goto out; + goto repeat; } static int alloc_fd(unsigned start, unsigned flags) @@ -809,7 +975,8 @@ __releases(&files->file_lock) goto Ebusy; get_file(file); rcu_assign_pointer(fdt->fd[fd], file); - __set_open_fd(fd, fdt); + if (!tofree) + __set_open_fd(fd, fdt); if (flags & O_CLOEXEC) __set_close_on_exec(fd, fdt); else diff --git a/fs/select.c b/fs/select.c index 0155473..670f542 100644 --- a/fs/select.c +++ b/fs/select.c @@ -350,7 +350,7 @@ static int max_select_fd(unsigned long n, fd_set_bits *fds) set = ~(~0UL << (n & (BITS_PER_LONG-1))); n /= BITS_PER_LONG; fdt = files_fdtable(current->files); - open_fds = fdt->open_fds + n; + open_fds = fdt->bitmaps[0] + n; max = 0; if (set) { set &= BITS(fds, n); diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 674e3e2..6ef5274 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -25,7 +25,7 @@ struct fdtable { unsigned int max_fds; struct file __rcu **fd; /* current fd array */ unsigned long *close_on_exec; - unsigned long *open_fds; + unsigned long *bitmaps[4]; struct rcu_head rcu; }; @@ -36,7 +36,7 @@ static inline bool close_on_exec(int fd, const struct fdtable *fdt) static inline bool fd_is_open(int fd, const struct fdtable *fdt) { - return test_bit(fd, fdt->open_fds); + return test_bit(fd, fdt->bitmaps[0]); } /* ^ permalink raw reply related [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-02 0:24 ` Al Viro @ 2015-11-02 0:59 ` Linus Torvalds 2015-11-02 2:14 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-11-02 0:59 UTC (permalink / raw)
To: Al Viro
Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel

On Sun, Nov 1, 2015 at 4:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> This ought to be a bit cleaner. Eric, could you test the variant below on your
> setup?

I'd love to see the numbers, but at the same time I really can't say I love your patch. I've merged my own two patches for now (not actually pushed out - I don't want to distract people from just testing 4.3 for a while), because I felt that those had an unreasonably high bang-for-the-buck (ie big speedups for something that still keeps the code very simple).

I'm definitely open to improving this further, even go as far as your patch, but just looking at your version of __alloc_fd(), I just don't get the warm and fuzzies.

Linus

^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-02 0:24 ` Al Viro 2015-11-02 0:59 ` Linus Torvalds @ 2015-11-02 2:14 ` Eric Dumazet 2015-11-02 6:22 ` Al Viro 1 sibling, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-11-02 2:14 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote: > This ought to be a bit cleaner. Eric, could you test the variant below on your > setup? Sure ! 5 runs of : lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 total = 4386311 total = 4560402 total = 4437309 total = 4516227 total = 4478778 With 48 threads : ./opensock -t 48 -n 10000000 -l 10 total = 4940245 total = 4848513 total = 4813153 total = 4813946 total = 5127804 Perf output taken on the 16 threads run : 74.71% opensock opensock [.] memset | --- memset 5.64% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath | --- queued_spin_lock_slowpath | |--99.89%-- _raw_spin_lock | | | |--52.74%-- __close_fd | | sys_close | | entry_SYSCALL_64_fastpath | | __libc_close | | | | | --100.00%-- 0x0 | | | |--46.97%-- __alloc_fd | | get_unused_fd_flags | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | | | --100.00%-- 0x0 | --0.30%-- [...] --0.11%-- [...] 
1.69% opensock [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--27.37%-- get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--22.40%-- sys_close | entry_SYSCALL_64_fastpath | __libc_close | |--15.79%-- cache_alloc_refill | | | |--99.27%-- kmem_cache_alloc | | | | | |--81.25%-- sk_prot_alloc | | | sk_alloc | | | inet_create | | | __sock_create | | | sock_create | | | sys_socket | | | entry_SYSCALL_64_fastpath | | | __socket | | | | | |--9.08%-- sock_alloc_inode | | | alloc_inode | | | new_inode_pseudo | | | sock_alloc | | | __sock_create | | | sock_create | | | sys_socket | | | entry_SYSCALL_64_fastpath | | | __socket | | | | | |--4.98%-- __d_alloc | | | d_alloc_pseudo | | | sock_alloc_file | | | sock_map_fd | | | sys_socket | | | entry_SYSCALL_64_fastpath | | | __socket | | | | | --4.69%-- get_empty_filp | | alloc_file | | sock_alloc_file | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | --0.73%-- kmem_cache_alloc_trace | sock_alloc_inode | alloc_inode | new_inode_pseudo | sock_alloc | __sock_create | sock_create | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--10.80%-- sock_alloc_file | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--7.47%-- sock_alloc | __sock_create | sock_create | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--6.96%-- kmem_cache_alloc | | | |--72.94%-- sk_prot_alloc | | sk_alloc | | inet_create | | __sock_create | | sock_create | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | |--15.51%-- sock_alloc_inode | | alloc_inode | | new_inode_pseudo | | sock_alloc | | __sock_create | | sock_create | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | |--7.59%-- get_empty_filp | | alloc_file | | sock_alloc_file | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | --3.96%-- __d_alloc | d_alloc_pseudo | sock_alloc_file | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | 
|--3.74%-- d_instantiate | sock_alloc_file | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--2.03%-- iput | __dentry_kill | dput | __fput | ____fput | task_work_run | prepare_exit_to_usermode | syscall_return_slowpath | int_ret_from_sys_call | __libc_close | |--0.60%-- __fsnotify_inode_delete | __destroy_inode | destroy_inode | evict | iput | __dentry_kill | dput | __fput | ____fput | task_work_run | prepare_exit_to_usermode | syscall_return_slowpath | int_ret_from_sys_call | __libc_close | |--0.55%-- evict | iput | __dentry_kill | dput | __fput | ____fput | task_work_run | prepare_exit_to_usermode | syscall_return_slowpath | int_ret_from_sys_call | __libc_close | |--0.53%-- inet_release | sock_release | sock_close | __fput | ____fput | task_work_run | prepare_exit_to_usermode | syscall_return_slowpath | int_ret_from_sys_call | __libc_close | |--0.51%-- __fput | ____fput | task_work_run | prepare_exit_to_usermode | syscall_return_slowpath Perf taken on the 48 threads run : 48.34% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath | --- queued_spin_lock_slowpath | |--99.93%-- _raw_spin_lock | | | |--50.11%-- __close_fd | | sys_close | | entry_SYSCALL_64_fastpath | | __libc_close | | | | | --100.00%-- 0x0 | | | |--49.85%-- __alloc_fd | | get_unused_fd_flags | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | | | --100.00%-- 0x0 | --0.03%-- [...] --0.07%-- [...] 41.03% opensock opensock [.] memset | --- memset 0.69% opensock [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--30.22%-- sys_close | entry_SYSCALL_64_fastpath | __libc_close | | | --100.00%-- 0x0 | |--30.15%-- get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--14.61%-- cache_alloc_refill | | ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-02 2:14 ` Eric Dumazet @ 2015-11-02 6:22 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-11-02 6:22 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sun, Nov 01, 2015 at 06:14:43PM -0800, Eric Dumazet wrote: > On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote: > > > This ought to be a bit cleaner. Eric, could you test the variant below on your > > setup? > > Sure ! > > 5 runs of : > lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 > > total = 4386311 > total = 4560402 > total = 4437309 > total = 4516227 > total = 4478778 Umm... With Linus' variant it was what, around 4000000? +10% or so, then... > With 48 threads : > > ./opensock -t 48 -n 10000000 -l 10 > total = 4940245 > total = 4848513 > total = 4813153 > total = 4813946 > total = 5127804 And that - +40%? Interesting... And it looks like at 48 threads you are still seeing arseloads of contention, but apparently less than with Linus' variant... What if you throw the __clear_close_on_exec() patch on top of that? Looks like it's spending less time under ->files_lock... Could you get information on fs/file.o hotspots? ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 19:54 ` Linus Torvalds 2015-10-31 20:29 ` Al Viro @ 2015-10-31 20:45 ` Eric Dumazet 2015-10-31 21:23 ` Linus Torvalds 1 sibling, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-31 20:45 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sat, 2015-10-31 at 12:54 -0700, Linus Torvalds wrote: > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > ... and here's the current variant of mine. > > Ugh. I really liked how simple mine ended up being. Yours is definitely not. > > And based on the profiles from Eric, finding the fd is no longer the > problem even with my simpler patch. The problem ends up being the > contention on the file_lock spinlock. > > Eric, I assume that's not "expand_fdtable", since your test-program > seems to expand the fd array at the beginning. So it's presumably all > from the __alloc_fd() use, but we should double-check.. Eric, can you > do a callgraph profile and see which caller is the hottest? Sure : profile taken while test runs using 16 threads (Since this is probably not a too biased micro benchmark...) 
# hostname : lpaa24 # os release : 4.3.0-smp-DEV # perf version : 3.12.0-6-GOOGLE # arch : x86_64 # nrcpus online : 48 # nrcpus avail : 48 # cpudesc : Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz # cpuid : GenuineIntel,6,62,4 # total memory : 264126320 kB # cmdline : /usr/bin/perf record -a -g sleep 4 # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att # CPU_TOPOLOGY info available, use -I to display # NUMA_TOPOLOGY info available, use -I to display # pmu mappings: cpu = 4, msr = 38, uncore_cbox_10 = 35, uncore_cbox_11 = 36, software = 1, power = 7, uncore_irp = 8, uncore_pcu = 37, tracepoint = 2, uncore_ # Samples: 260K of event 'cycles' # Event count (approx.): 196742182232 # # Overhead Command Shared Object # ........ ............. ................... .............................................................................................................. # 67.15% opensock opensock [.] memset | --- memset 13.84% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath | --- queued_spin_lock_slowpath | |--99.97%-- _raw_spin_lock | | | |--53.03%-- __close_fd | | sys_close | | entry_SYSCALL_64_fastpath | | __libc_close | | | | | --100.00%-- 0x0 | | | |--46.83%-- __alloc_fd | | get_unused_fd_flags | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | | | --100.00%-- 0x0 | --0.13%-- [...] --0.03%-- [...] 1.84% opensock [kernel.kallsyms] [k] _find_next_bit.part.0 | --- _find_next_bit.part.0 | |--65.97%-- find_next_zero_bit | __alloc_fd | get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--34.01%-- __alloc_fd | get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | | | --100.00%-- 0x0 --0.02%-- [...] 
1.59% opensock [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--28.78%-- get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | |--26.53%-- sys_close | entry_SYSCALL_64_fastpath | __libc_close | |--13.95%-- cache_alloc_refill | | | |--99.48%-- kmem_cache_alloc | | | | | |--81.20%-- sk_prot_alloc | | | sk_alloc | | | inet_create | | | __sock_create | | | sock_create | | | sys_socket | | | entry_SYSCALL_64_fastpath | | | __socket | | | | | |--8.43%-- sock_alloc_inode | | | alloc_inode | | | new_inode_pseudo | | | sock_alloc | | | __sock_create | | | sock_create | | | sys_socket | | | entry_SYSCALL_64_fastpath | | | __socket | | | | | |--5.80%-- __d_alloc | | | d_alloc_pseudo | | | sock_alloc_file | | | sock_map_fd | | | sys_socket ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 20:45 ` Eric Dumazet @ 2015-10-31 21:23 ` Linus Torvalds 2015-10-31 21:51 ` Al Viro 2015-10-31 22:34 ` Eric Dumazet 0 siblings, 2 replies; 138+ messages in thread
From: Linus Torvalds @ 2015-10-31 21:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 1:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 13.84% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath
> |
> --- queued_spin_lock_slowpath
> |
> |--99.97%-- _raw_spin_lock
> | |
> | |--53.03%-- __close_fd
> | |
> | |--46.83%-- __alloc_fd

Interesting. "__close_fd" actually looks more expensive than allocation. They presumably get called equally often, so it's probably some cache effect.

__close_fd() doesn't do anything even remotely interesting as far as I can tell, but it strikes me that we probably take a *lot* of cache misses on the stupid "close-on-exec" flags, which are probably always zero anyway.

Mind testing something really stupid, and making the __clear_bit() in __clear_close_on_exec() conditional, something like this:

	static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
	{
	-	__clear_bit(fd, fdt->close_on_exec);
	+	if (test_bit(fd, fdt->close_on_exec))
	+		__clear_bit(fd, fdt->close_on_exec);
	}

and see if it makes a difference. This is the kind of thing that a single-threaded (or even single-socket) test will never actually show, because it caches well enough. But for two sockets, I could imagine the unnecessary dirtying of cachelines and ping-pong being noticeable.

The other stuff we probably can't do all that much about. Unless we decide to go for some complicated lockless optimistic file descriptor allocation scheme with retry-on-failure instead of locks. Which I'm sure is possible, but I'm equally sure is painful.
Linus ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 21:23 ` Linus Torvalds @ 2015-10-31 21:51 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread
From: Al Viro @ 2015-10-31 21:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel

On Sat, Oct 31, 2015 at 02:23:31PM -0700, Linus Torvalds wrote:
> The other stuff we probably can't do all that much about. Unless we
> decide to go for some complicated lockless optimistic file descriptor
> allocation scheme with retry-on-failure instead of locks. Which I'm
> sure is possible, but I'm equally sure is painful.

The interesting part is dup2() - we'd have to do something like

	serialize against other dup2
	was_claimed = atomically set and test bit in bitmap
	if was_claimed
		tofree = fdt->fd[fd];
		if (!tofree)
			fail with EBUSY
	install into ->fd[...]
	end of critical area

in there; __alloc_fd() could be made retry-on-failure, but I don't see how to cope with dup2 vs. dup2 without an explicit exclusion.

^ permalink raw reply [flat|nested] 138+ messages in thread
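The "atomically set and test bit in bitmap" step in Al's pseudocode above can be modelled in user space with C11 atomics. This is a toy illustration of the idea, not the kernel's implementation; the function name is invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Claim a descriptor slot by atomically setting its bit.  The previous
 * value of the bit tells a dup2()-like caller whether the slot was
 * already in use (in which case it must look at fdt->fd[fd] to decide
 * between "free the old file" and "racing claim in flight, EBUSY"). */
static bool claim_fd_bit(_Atomic unsigned long *map, unsigned fd)
{
	unsigned long mask = 1UL << (fd % BITS_PER_WORD);
	unsigned long old = atomic_fetch_or(&map[fd / BITS_PER_WORD], mask);
	return (old & mask) != 0;	/* true: bit was already set */
}
```

In the scheme sketched above, a "true" result sends dup2() to inspect the slot, while a retry-on-failure __alloc_fd() would simply move on to another bit.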
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-31 21:23 ` Linus Torvalds 2015-10-31 21:51 ` Al Viro @ 2015-10-31 22:34 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-31 22:34 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Sat, 2015-10-31 at 14:23 -0700, Linus Torvalds wrote: > Mind testing something really stupid, and making the __clear_bit() in > __clear_close_on_exec() conditiona, something like this: > > static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) > { > - __clear_bit(fd, fdt->close_on_exec); > + if (test_bit(fd, fdt->close_on_exec) > + __clear_bit(fd, fdt->close_on_exec); > } > > and see if it makes a difference. It does ;) About 4 % qps increase 3 runs : lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 total = 4176651 total = 4178012 total = 4105226 instead of : total = 3910620 total = 3874567 total = 3971028 Perf profile : 69.12% opensock opensock [.] memset | --- memset 12.37% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath | --- queued_spin_lock_slowpath | |--99.99%-- _raw_spin_lock | | | |--51.99%-- __close_fd | | sys_close | | entry_SYSCALL_64_fastpath | | __libc_close | | | | | --100.00%-- 0x0 | | | |--47.79%-- __alloc_fd | | get_unused_fd_flags | | sock_map_fd | | sys_socket | | entry_SYSCALL_64_fastpath | | __socket | | | | | --100.00%-- 0x0 | --0.21%-- [...] --0.01%-- [...] 
1.92% opensock [kernel.kallsyms] [k] _find_next_bit.part.0 | --- _find_next_bit.part.0 | |--66.93%-- find_next_zero_bit | __alloc_fd | get_unused_fd_flags | sock_map_fd | sys_socket | entry_SYSCALL_64_fastpath | __socket | --33.07%-- __alloc_fd get_unused_fd_flags sock_map_fd sys_socket entry_SYSCALL_64_fastpath __socket | --100.00%-- 0x0 1.63% opensock [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--28.66%-- get_unused_fd_flags | sock_map_fd ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 21:50 ` Linus Torvalds 2015-10-30 22:33 ` Al Viro @ 2015-10-31 1:07 ` Eric Dumazet 1 sibling, 0 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-31 1:07 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, David Miller, Stephen Hemminger, Network Development, David Howells, linux-fsdevel On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote: > On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > >> > >> Your variant has 1:64 ratio; obviously better than now, but we can actually > >> do 1:bits-per-cacheline quite easily. > > > > Ok, but in that case you end up needing a counter for each cacheline > > too (to count how many bits, in order to know when to say "cacheline > > is entirely full"). > > So here's a largely untested version of my "one bit per word" > approach. It seems to work, but looking at it, I'm unhappy with a few > things: > > - using kmalloc() for the .full_fds_bits[] array is simple, but > disgusting, since 99% of all programs just have a single word. > > I know I talked about just adding the allocation to the same one > that allocates the bitmaps themselves, but I got lazy and didn't do > it. Especially since that code seems to try fairly hard to make the > allocations nice powers of two, according to the comments. That may > actually matter from an allocation standpoint. > > - Maybe we could just use that "full_fds_bits_init" field for when a > single word is sufficient, and avoid the kmalloc that way? At least make sure the allocation uses a cache line, so that multiple processes do not share same cache line for this possibly hot field fdt->full_fds_bits = kzalloc(max_t(size_t, L1_CACHE_BYTES, BITBIT_SIZE(nr)), GFP_KERNEL); > > Anyway. 
> This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
>
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far Linus, 500 % increase, thanks a lot !

Tested using 16 threads, 8 on Socket0, 8 on Socket1

Before patch :

# ulimit -n 12000000
# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3845134 (6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 10000000 -l 10 -f
count=10000000 (check/increase ulimit -n)
total = 3505252

If I raise to 48 threads, the FAST_ALLOC wins by 5 % (3752087 instead of 3546666).

Oh, and 48 threads without any patches : 383027 -> program runs one order of magnitude faster, congrats !

^ permalink raw reply [flat|nested] 138+ messages in thread
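The "one bit per word" scheme being benchmarked here — a summary bitmap with one bit per word of the main open-fds bitmap, set when that word is completely full, so the allocator can skip 64 fds per summary bit — can be modelled outside the kernel like this. A sketch with invented names, not the actual fs/file.c code:

```c
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* map[]  : the open-fds bitmap, nwords words long
 * full[] : summary bitmap, bit i set when map[i] == ~0UL, so the scan
 *          can skip a whole word without touching its cacheline */
static long find_free_fd(const unsigned long *map,
			 const unsigned long *full, unsigned nwords)
{
	for (unsigned i = 0; i < nwords; i++) {
		if (full[i / WORD_BITS] & (1UL << (i % WORD_BITS)))
			continue;		/* word known full, skip */
		if (map[i] != ~0UL) {
			unsigned long free_bits = ~map[i];
			unsigned b = 0;		/* lowest zero bit of map[i] */
			while (!(free_bits & (1UL << b)))
				b++;
			return (long)(i * WORD_BITS + b);
		}
	}
	return -1;			/* bitmap exhausted: caller expands */
}
```

The point of the extra level is exactly what the profiles above show: with a densely populated table, `find_next_zero_bit` over the flat bitmap is O(n) from `next_fd`, while the summary bits let the scan jump over full regions.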
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 23:17 ` Al Viro 2015-10-28 0:13 ` Eric Dumazet @ 2015-10-28 16:04 ` Alan Burlison 1 sibling, 0 replies; 138+ messages in thread
From: Alan Burlison @ 2015-10-28 16:04 UTC (permalink / raw)
To: Al Viro
Cc: Casper.Dik, David Miller, eric.dumazet, stephen, netdev, dholland-tech

On 27/10/2015 23:17, Al Viro wrote:
> Frankly, as far as I'm concerned, the bottom line is
> * there are two variants of semantics in that area and there's not
> much that could be done about that.

Yes, that seems to be the case.

> * POSIX is vague enough for both variants to comply with it (it's
> also very badly written in the area in question).

On that aspect I disagree, the POSIX semantics seem clear to me, and are different to the Linux behaviour.

> * I don't see any way to implement something similar to Solaris
> behaviour without a huge increase of memory footprint or massive cacheline
> pingpong. Solaris appears to go for memory footprint from hell - cacheline
> per descriptor (instead of a pointer per descriptor).

Yes, that does seem to be the case. Thanks for the detailed explanation you've provided as to why that's so.

> * the benefits of Solaris-style behaviour are not obvious - all things
> equal it would be interesting, but the things are very much not equal. What's
> more, if your userland code is such that accept() argument could be closed by
> another thread, the caller *cannot* do anything with said argument after
> accept() returns, no matter which variant of semantics is used.

Yes, irrespective of how you terminate the accept, once it returns with an error it's unsafe to use the FD, with the exception of failures such as EAGAIN, EINTR etc. However the shutdown() behaviour of Linux is not POSIX compliant, and allowing an accept to continue on an FD that's been closed doesn't seem correct either.
> * [Linux-specific aside] our __alloc_fd() can degrade quite badly > with some use patterns. The cacheline pingpong in the bitmap is probably > inevitable, unless we accept considerably heavier memory footprint, > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > to trigger - close(3);open(...); will have the next open() after that > scanning the entire in-use bitmap. I think I see a way to improve it > without slowing the normal case down, but I'll need to experiment a > bit before I post patches. Anybody with examples of real-world loads > that make our descriptor allocator to degrade is very welcome to post > the reproducers... It looks like the remaining discussion is going to be about Linux implementation details so I'll bow out at this point. Thanks again for all the helpful explanation. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
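The cross-platform divergence discussed above is easy to observe from user space without the full two-thread setup: just call shutdown() on a listening, never-connected socket. A minimal sketch (invented helper name; per this thread, Linux returns 0 and wakes any accepter with EINVAL, while Solaris fails the shutdown with ENOTCONN):

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if shutdown() succeeded (the reported Linux behaviour),
 * the errno it failed with (ENOTCONN on Solaris), or -2 on setup error. */
static int shutdown_listener(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
				 .sin_port = 0 };	/* any free port */
	if (fd < 0 || bind(fd, (struct sockaddr *)&a, sizeof a) < 0 ||
	    listen(fd, 1) < 0)
		return -2;
	int r = shutdown(fd, SHUT_RDWR);	/* the disputed call */
	int saved = (r < 0) ? errno : 0;
	close(fd);
	return saved;
}
```

Which result is "correct" is exactly what the thread is arguing about: the POSIX shutdown() page requires a connected socket for ENOTCONN-free operation, but does not clearly mandate either behaviour for a listening one.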
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-27 10:52 ` Alan Burlison ` (2 preceding siblings ...) 2015-10-27 23:17 ` Al Viro @ 2015-10-29 14:58 ` David Holland 2015-10-29 15:18 ` Alan Burlison 2015-10-30 17:43 ` David Laight 3 siblings, 2 replies; 138+ messages in thread From: David Holland @ 2015-10-29 14:58 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev, dholland-tech On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote: > >But in general, this is basically a problem with the application: the file > >descriptor space is shared between threads and having one thread sniping > >at open files, you do have a problem and whatever the kernel does in that > >case perhaps doesn't matter all that much: the application needs to be > >fixed anyway. > > The scenario in Hadoop is that the FD is being used by a thread that's > waiting in accept and another thread wants to shut it down, e.g. because > the application is terminating and needs to stop all threads cleanly. ISTM that the best way to do this is to post a signal to the thread so accept bails with EINTR, at which point it can check to see if it's supposed to be exiting. Otherwise it sounds like the call you're looking for is not close(2) but revoke(2). Last I remember Linux doesn't have revoke because there's no way to implement it that isn't a trainwreck. -- David A. Holland dholland@netbsd.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 14:58 ` David Holland @ 2015-10-29 15:18 ` Alan Burlison 2015-10-29 16:01 ` David Holland 2015-10-30 17:43 ` David Laight 1 sibling, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-29 15:18 UTC (permalink / raw) To: David Holland Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev On 29/10/2015 14:58, David Holland wrote: > ISTM that the best way to do this is to post a signal to the thread so > accept bails with EINTR, at which point it can check to see if it's > supposed to be exiting. Yes, you could use pthread_kill, but that would require keeping a list of the tids of all the threads that were using the FD, and that really just moves the problem elsewhere rather than fixing it. > Otherwise it sounds like the call you're looking for is not close(2) > but revoke(2). Last I remember Linux doesn't have revoke because > there's no way to implement it that isn't a trainwreck. close(2) as specified by POSIX works just fine on Solaris; if that were the case everywhere then it wouldn't be an issue. And for cases where it is necessary to keep the FD assigned because of races, the dup2(2) trick works fine as well. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 15:18 ` Alan Burlison @ 2015-10-29 16:01 ` David Holland 2015-10-29 16:15 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: David Holland @ 2015-10-29 16:01 UTC (permalink / raw) To: Alan Burlison Cc: David Holland, Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev On Thu, Oct 29, 2015 at 03:18:40PM +0000, Alan Burlison wrote: > On 29/10/2015 14:58, David Holland wrote: > >ISTM that the best way to do this is to post a signal to the thread so > >accept bails with EINTR, at which point it can check to see if it's > >supposed to be exiting. > > Yes, you could use pthread_kill, but that would require keeping a list of > the tids of all the threads that were using the FD, and that really just > moves the problem elsewhere rather than fixing it. Hardly; it moves the burden of doing stupid things to the application. If as you said the goal is to shut down all threads cleanly, then it doesn't need to keep track in detail anyway; it can just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for some reason, or whatever. > >Otherwise it sounds like the call you're looking for is not close(2) > >but revoke(2). Last I remember Linux doesn't have revoke because > >there's no way to implement it that isn't a trainwreck. > > close(2) as per specified by POSIX works just fine on Solaris, if that was > the case everywhere then it wouldn't be an issue. And for cases where it is > necessary to keep the FD assigned because of races, the dup2(2) trick works > fine as well. close(2) as specified by POSIX doesn't prohibit this weird revoke-like behavior, but there's nothing in there that mandates it either. (I thought this discussion had already clarified that.) Note that while NetBSD apparently supports this behavior because someone copied it from Solaris, I'm about to go recommend it be removed. -- David A. 
Holland dholland@netbsd.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 16:01 ` David Holland @ 2015-10-29 16:15 ` Alan Burlison 2015-10-29 17:07 ` Al Viro 2015-10-30 5:44 ` David Holland 0 siblings, 2 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-29 16:15 UTC (permalink / raw) To: David Holland Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev On 29/10/2015 16:01, David Holland wrote: > Hardly; it moves the burden of doing stupid things to the > application. If as you said the goal is to shut down all threads > cleanly, then it doesn't need to keep track in detail anyway; it can > just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for > some reason, or whatever. I agree that the root issue is poor application design, but posting a signal to every thread is not a solution if you only want to shut down a subset of threads. > close(2) as specified by POSIX doesn't prohibit this weird revoke-like > behavior, but there's nothing in there that mandates it either. (I > thought this discussion had already clarified that.) There was an attempt to interpret POSIX that way, with which I still disagree. If a FD is closed or reassigned then any current pending operations on it should be terminated. > Note that while NetBSD apparently supports this behavior because > someone copied it from Solaris, I'm about to go recommend it be > removed. Which behaviour? The abort accept() on close() behaviour? -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 16:15 ` Alan Burlison @ 2015-10-29 17:07 ` Al Viro 2015-10-29 17:12 ` Alan Burlison 2015-10-30 1:55 ` David Miller 2015-10-30 5:44 ` David Holland 1 sibling, 2 replies; 138+ messages in thread From: Al Viro @ 2015-10-29 17:07 UTC (permalink / raw) To: Alan Burlison Cc: David Holland, Casper.Dik, David Miller, eric.dumazet, stephen, netdev On Thu, Oct 29, 2015 at 04:15:33PM +0000, Alan Burlison wrote: > There was an attempt to interpret POSIX that way, with which I still > disagree. If a FD is closed or reassigned then any current pending > operations on it should be terminated. Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact place in POSIX that requires such behaviour? This is getting ridiculous - if we are talking about POSIX-mandated behaviour of close(), please show where is it mandated. Using close(2) on a descriptor that might be used by other threads is a bloody bad design in userland code - I think everyone in this thread agrees on that. Making that a recommended way to do _anything_ is nuts. Now, no userland code, however lousy it might be, should be able to screw the system. But that isn't the issue - our variant is providing that just fine. BTW, "cancel accept(2) because sending a signal is hard" is bogus anyway - a thread in accept(3) just about to enter the kernel would get buggered if another thread closes that descriptor and the third one does socket(2). _IF_ you are doing that kind of "close a descriptor under other threads" thing, you need to inform the potentially affected threads anyway, and you'd better not rely on them being currently in kernel mode. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 17:07 ` Al Viro @ 2015-10-29 17:12 ` Alan Burlison 2015-10-30 1:54 ` David Miller 2015-10-30 1:55 ` David Miller 1 sibling, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-29 17:12 UTC (permalink / raw) To: Al Viro; +Cc: David Miller, eric.dumazet, stephen, netdev [off list] On 29/10/2015 17:07, Al Viro wrote: > Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact > place in POSIX that requires such behaviour? If that's the way the conversation is going to go, sorry, no. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 17:12 ` Alan Burlison @ 2015-10-30 1:54 ` David Miller 0 siblings, 0 replies; 138+ messages in thread From: David Miller @ 2015-10-30 1:54 UTC (permalink / raw) To: Alan.Burlison; +Cc: viro, eric.dumazet, stephen, netdev From: Alan Burlison <Alan.Burlison@oracle.com> Date: Thu, 29 Oct 2015 17:12:44 +0000 > On 29/10/2015 17:07, Al Viro wrote: > >> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote >> the exact >> place in POSIX that requires such behaviour? > > If that's the way the conversation is going to go, sorry, no. I find Al's request to be frankly quite reasonable, as is his frustration expressed in his tone as well. Furthermore, NetBSD's intention to try and get rid of the close() on accept() behavior shows that the Linux developers are not in the minority of being against requiring this behavior or even finding it desirable in any way. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 17:07 ` Al Viro 2015-10-29 17:12 ` Alan Burlison @ 2015-10-30 1:55 ` David Miller 1 sibling, 0 replies; 138+ messages in thread From: David Miller @ 2015-10-30 1:55 UTC (permalink / raw) To: viro Cc: Alan.Burlison, dholland-tech, Casper.Dik, eric.dumazet, stephen, netdev From: Al Viro <viro@ZenIV.linux.org.uk> Date: Thu, 29 Oct 2015 17:07:48 +0000 > _IF_ you are doing that kind of "close a descriptor under other threads" > thing, you need to inform the potentially affected threads anyway, and > you'd better not rely on them being currently in kernel mode. +1 ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 16:15 ` Alan Burlison 2015-10-29 17:07 ` Al Viro @ 2015-10-30 5:44 ` David Holland 1 sibling, 0 replies; 138+ messages in thread From: David Holland @ 2015-10-30 5:44 UTC (permalink / raw) To: Alan Burlison Cc: David Holland, Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev On Thu, Oct 29, 2015 at 04:15:33PM +0000, Alan Burlison wrote: > >close(2) as specified by POSIX doesn't prohibit this weird revoke-like > >behavior, but there's nothing in there that mandates it either. (I > >thought this discussion had already clarified that.) > > There was an attempt to interpret POSIX that way, with which I still > disagree. If a FD is closed or reassigned then any current pending > operations on it should be terminated. C&V, please. > >Note that while NetBSD apparently supports this behavior because > >someone copied it from Solaris, I'm about to go recommend it be > >removed. > > Which behaviour? The abort accept() on close() behaviour? That, and aborting anything else too. Close isn't revoke. -- David A. Holland dholland@netbsd.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-29 14:58 ` David Holland 2015-10-29 15:18 ` Alan Burlison @ 2015-10-30 17:43 ` David Laight 2015-10-30 21:09 ` Al Viro 1 sibling, 1 reply; 138+ messages in thread From: David Laight @ 2015-10-30 17:43 UTC (permalink / raw) To: 'David Holland', Alan Burlison Cc: Casper.Dik, Al Viro, David Miller, eric.dumazet, stephen, netdev From: David Holland > Sent: 29 October 2015 14:59 > On Tue, Oct 27, 2015 at 10:52:46AM +0000, Alan Burlison wrote: > > >But in general, this is basically a problem with the application: the file > > >descriptor space is shared between threads and having one thread sniping > > >at open files, you do have a problem and whatever the kernel does in that > > >case perhaps doesn't matter all that much: the application needs to be > > >fixed anyway. > > > > The scenario in Hadoop is that the FD is being used by a thread that's > > waiting in accept and another thread wants to shut it down, e.g. because > > the application is terminating and needs to stop all threads cleanly. > > ISTM that the best way to do this is to post a signal to the thread so > accept bails with EINTR, at which point it can check to see if it's > supposed to be exiting. Actually, just send it a connect indication. ISTM that the correct call should be listen(fd, 0); Although that doesn't help a thread stuck in recvmsg() for a datagram. It is also tempting to think that close(fd) should sleep until all io activities using that fd have completed - whether or not blocking calls are woken. David ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 17:43 ` David Laight @ 2015-10-30 21:09 ` Al Viro 2015-11-04 15:54 ` David Laight 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-30 21:09 UTC (permalink / raw) To: David Laight Cc: 'David Holland', Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen, netdev On Fri, Oct 30, 2015 at 05:43:21PM +0000, David Laight wrote: > ISTM that the correct call should be listen(fd, 0); > Although that doesn't help a thread stuck in recvmsg() for a datagram. > > It is also tempting to think that close(fd) should sleep until all > io activities using that fd have completed - whether or not blocking > calls are woken. Sigh... The kernel has no idea when other threads are done with "all io activities using that fd" - it can wait for them to leave the kernel mode, but there's fuck-all it can do about e.g. a userland loop doing write() until there's more data to send. And no, you can't rely upon them catching EBADF on the next iteration - by the time they get there, close() might very well have returned and open() from yet another thread might've grabbed the same descriptor. Welcome to your data being written to hell knows what... That's precisely the reason why "wait in close()" kind of semantics is worthless - the races are still there, and having them a bit harder to hit just makes them harder to debug. Worse, it might create an impression of safety where there's none to be had. ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-30 21:09 ` Al Viro @ 2015-11-04 15:54 ` David Laight 2015-11-04 16:27 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: David Laight @ 2015-11-04 15:54 UTC (permalink / raw) To: 'Al Viro' Cc: 'David Holland', Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen, netdev From: Al Viro > Sent: 30 October 2015 21:10 > On Fri, Oct 30, 2015 at 05:43:21PM +0000, David Laight wrote: > > > ISTM that the correct call should be listen(fd, 0); > > Although that doesn't help a thread stuck in recvmsg() for a datagram. > > > > It is also tempting to think that close(fd) should sleep until all > > io activities using that fd have completed - whether or not blocking > > calls are woken. > > Sigh... The kernel has no idea when other threads are done with "all > io activities using that fd" - it can wait for them to leave the > kernel mode, but there's fuck-all it can do about e.g. a userland > loop doing write() until there's more data to send. And no, you can't > rely upon them catching EBADF on the next iteration - by the time they > get there, close() might very well have returned and open() from yet > another thread might've grabbed the same descriptor. Welcome to your > data being written to hell knows what... That just means that the application must use dup2() rather than close(). It must do that anyway since the thread it is trying to stop might be sleeping in the system call stub in libc at the time a close() and open() happen. The listening (in this case) thread would need to look at its global data to determine that it is supposed to exit, and then close the fd itself. David ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-04 15:54 ` David Laight @ 2015-11-04 16:27 ` Al Viro 2015-11-06 15:07 ` David Laight 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-11-04 16:27 UTC (permalink / raw) To: David Laight Cc: 'David Holland', Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen, netdev On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote: > > Sigh... The kernel has no idea when other threads are done with "all > > io activities using that fd" - it can wait for them to leave the > > kernel mode, but there's fuck-all it can do about e.g. a userland > > loop doing write() until there's more data to send. And no, you can't > > rely upon them catching EBADF on the next iteration - by the time they > > get there, close() might very well have returned and open() from yet > > another thread might've grabbed the same descriptor. Welcome to your > > data being written to hell knows what... > > That just means that the application must use dup2() rather than close(). > It must do that anyway since the thread it is trying to stop might be > sleeping in the system call stub in libc at the time a close() and open() > happen. Oh, _lovely_. So instead of continuation of that write(2) going down the throat of something opened by unrelated thread, it (starting from a pretty arbitrary point) will go into the descriptor the closing thread passed to dup2(). Until it, in turn, gets closed, at which point we are back to square one. That, of course, makes it so much better - whatever had I been thinking about that made me miss that? > The listening (in this case) thread would need to look at its global > data to determine that it is supposed to exit, and then close the fd itself. Right until it crosses into the kernel mode and does descriptor-to-file lookup, presumably? Because prior to that point this kernel-side "protection" doesn't come into play.
In other words, this is inherently racy, and AFAICS you are the first poster in that thread who disagrees with that. _Any_ userland code that would be racy without that kludge of semantics in close()/dup2() is *still* racy with it. If that crap gets triggered at all, the userland code responsible for that is broken. Said crap makes the race windows more narrow, but it doesn't really close them. And IMO it's rather misguided, especially since it's a) quiet and b) costly as hell. ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-04 16:27 ` Al Viro @ 2015-11-06 15:07 ` David Laight 2015-11-06 19:31 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: David Laight @ 2015-11-06 15:07 UTC (permalink / raw) To: 'Al Viro' Cc: 'David Holland', Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen, netdev From: Al Viro > Sent: 04 November 2015 16:28 > On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote: > > > Sigh... The kernel has no idea when other threads are done with "all > > > io activities using that fd" - it can wait for them to leave the > > > kernel mode, but there's fuck-all it can do about e.g. a userland > > > loop doing write() until there's more data to send. And no, you can't > > > rely upon them catching EBADF on the next iteration - by the time they > > > get there, close() might very well have returned and open() from yet > > > another thread might've grabbed the same descriptor. Welcome to your > > > data being written to hell knows what... > > > > That just means that the application must use dup2() rather than close(). > > It must do that anyway since the thread it is trying to stop might be > > sleeping in the system call stub in libc at the time a close() and open() > > happen. > > Oh, _lovely_. So instead of continuation of that write(2) going down > the throat of something opened by unrelated thread, it (starting from a > pretty arbitrary point) will go into the descriptor the closing thread > passed to dup2(). Until it, in turn, gets closed, at which point we > are back to square one. That, of course, makes it so much better - > whatever had I been thinking about that made me miss that? Do I detect a note of sarcasm. You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd. So any writes (etc) would be discarded in a safe manner, and nothing will block. David ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-11-06 15:07 ` David Laight @ 2015-11-06 19:31 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-11-06 19:31 UTC (permalink / raw) To: David Laight Cc: 'David Holland', Alan Burlison, Casper.Dik, David Miller, eric.dumazet, stephen, netdev On Fri, Nov 06, 2015 at 03:07:30PM +0000, David Laight wrote: > > Oh, _lovely_. So instead of continuation of that write(2) going down > > the throat of something opened by unrelated thread, it (starting from a > > pretty arbitrary point) will go into the descriptor the closing thread > > passed to dup2(). Until it, in turn, gets closed, at which point we > > are back to square one. That, of course, makes it so much better - > > whatever had I been thinking about that made me miss that? > > Do I detect a note of sarcasm. > > You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd. > So any writes (etc) would be discarded in a safe manner, and nothing will block. ... and never close that descriptor afterwards? Then why would you care whether dup2() blocks there or not? Are you seriously saying that there is any code that would (a) rely on that kind of guarantee and (b) not be racy as hell? Until now everyone had been talking about that thing as an attempt at mitigation for broken userland code; you seem to be the first one to imply that this can be used as a deliberate technique... ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:44 ` Al Viro 2015-10-22 6:03 ` Al Viro @ 2015-10-22 6:51 ` Casper.Dik 2015-10-22 11:18 ` Alan Burlison 2015-10-22 11:15 ` Alan Burlison 2 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 6:51 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, David Miller, eric.dumazet, stephen, netdev, dholland-tech From: Al Viro <viro@ZenIV.linux.org.uk> >Except that in this case "correctness" is the matter of rather obscure and >ill-documented areas in POSIX. Don't get me wrong - this semantics isn't >inherently bad, but it's nowhere near being an absolute requirement. It would be more fruitful to have such a discussion in one of the OpenGroup mailing lists; people gathered there have a lot of experience and it is also possible to fix the standard when it turns out that it is indeed as vague as you claim it is (I don't quite agree with that). Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 6:51 ` Casper.Dik @ 2015-10-22 11:18 ` Alan Burlison 0 siblings, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-22 11:18 UTC (permalink / raw) To: Casper.Dik, Al Viro Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech On 22/10/2015 07:51, Casper.Dik@oracle.com wrote: > It would more fruitful to have such a discussion in one of the OpenGroup > mailing lists; people gathered there have a lot of experience and it is > also possible to fix the standard when it turns out that it indeed as > vague as you claim it is (I don't quite agree with that) +1. If there's interest in doing that I'll ask our POSIX rep the best way of initiating such a conversation. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:44 ` Al Viro 2015-10-22 6:03 ` Al Viro 2015-10-22 6:51 ` Casper.Dik @ 2015-10-22 11:15 ` Alan Burlison 2 siblings, 0 replies; 138+ messages in thread From: Alan Burlison @ 2015-10-22 11:15 UTC (permalink / raw) To: Al Viro Cc: David Miller, eric.dumazet, stephen, netdev, dholland-tech, casper.dik On 22/10/2015 05:44, Al Viro wrote: >> It's been said that the current mechanisms in Linux & some BSD >> variants can be subject to races > > You do realize that it goes for the entire area? And the races found > in this thread are in the BSD variant that tries to do something similar > to what you guys say Solaris is doing, so I'm not sure which way does > that argument go. A high-level description with the same level of details > will be race-free in _all_ of them. The devil is in the details, of course, > and historically they had been very easy to get wrong. And extra complexity > in that area doesn't make things better. Yes, I absolutely agree it's difficult to get it right. Modulo undetected bugs I believe the Solaris implementation is race free, and if it isn't I think it's fair to say we'd consider that to be a bug. >> , and the behaviour exhibited >> doesn't conform to POSIX, for example requiring the use of >> shutdown() on unconnected sockets because close() doesn't kick off >> other threads accept()ing on the same fd. > > Umm... The old kernels (and even more - old userland) are not going to > disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or > no POSIX. Yes, I understand the problem and in an earlier part of the discussion I said that I suspected that all that could really be done was to document the behaviour as changing it would break existing code. I still think that's probably the only workable option. 
>> I'd be interested to hear >> if there's a better and more performant way of handling the >> situation that doesn't involve doing the sort of bookkeeping Casper >> described,. > > So would a lot of other people. :-) >> To quote one of my colleague's favourite sayings: Performance is a >> goal, correctness is a constraint. > > Except that in this case "correctness" is the matter of rather obscure and > ill-documented areas in POSIX. Don't get me wrong - this semantics isn't > inherently bad, but it's nowhere near being an absolute requirement. I don't think it's *that* obscure; when I found the original shutdown() problem, Google turned up another occurrence pretty quickly - https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html. If a fd is closed then allowing other uses of it to continue in other threads seems incorrect to me, as in the dup2() scenario I posted. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 4:17 ` Alan Burlison 2015-10-22 4:44 ` Al Viro @ 2015-10-22 6:15 ` Casper.Dik 2015-10-22 11:30 ` Eric Dumazet 1 sibling, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 6:15 UTC (permalink / raw) To: Alan Burlison Cc: David Miller, viro, eric.dumazet, stephen, netdev, dholland-tech >It's been said that the current mechanisms in Linux & some BSD variants >can be subject to races, and the behaviour exhibited doesn't conform to >POSIX, for example requiring the use of shutdown() on unconnected >sockets because close() doesn't kick off other threads accept()ing on >the same fd. I'd be interested to hear if there's a better and more >performant way of handling the situation that doesn't involve doing the >sort of bookkeeping Casper described,. Of course, the implementation is now around 18 years old; clearly a lot of things have changed since then. In the particular case of Linux close() on a socket, surely it must be possible to detect at close that it is a listening socket and that you are about to close the last reference; the kernel could then do the shutdown() all by itself. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 6:15 ` Casper.Dik @ 2015-10-22 11:30 ` Eric Dumazet 2015-10-22 11:58 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Eric Dumazet @ 2015-10-22 11:30 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, David Miller, viro, stephen, netdev, dholland-tech On Thu, 2015-10-22 at 08:15 +0200, Casper.Dik@oracle.com wrote: > >It's been said that the current mechanisms in Linux & some BSD variants > >can be subject to races, and the behaviour exhibited doesn't conform to > >POSIX, for example requiring the use of shutdown() on unconnected > >sockets because close() doesn't kick off other threads accept()ing on > >the same fd. I'd be interested to hear if there's a better and more > >performant way of handling the situation that doesn't involve doing the > >sort of bookkeeping Casper described,. > > Of course, the implementation is now around 18 years old; clearly a lot of > things have changed since then. > > In the particular case of Linux close() on a socket, surely it must be > possible to detect at close that it is a listening socket and that you are > about to close the last reference; the kernel could then do the shutdown() > all by itself. We absolutely do not _want_ to do this just so that linux becomes slower to the point Solaris can compete, or you guys can avoid some work. close(fd) is very far from knowing a file is a 'listener' or even a 'socket' without extra cache line misses. To force a close of an accept() or whatever blocking socket related system call a shutdown() makes a lot of sense. This would have zero additional overhead for the fast path. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 11:30 ` Eric Dumazet @ 2015-10-22 11:58 ` Alan Burlison 2015-10-22 12:10 ` Eric Dumazet 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-22 11:58 UTC (permalink / raw) To: Eric Dumazet, Casper.Dik Cc: David Miller, viro, stephen, netdev, dholland-tech On 22/10/2015 12:30, Eric Dumazet wrote: > We absolutely do not _want_ to do this just so that linux becomes slower > to the point Solaris can compete, or you guys can avoid some work. Sentiments such as that really have no place in a discussion that's been focussed primarily on the behaviour of interfaces, albeit with digressions into the potential performance impacts. The discussion has been cordial and I for one appreciate Al Viro's posts on the subject, from which I've learned a lot. Can we please keep it that way? Thanks. > close(fd) is very far from knowing a file is a 'listener' or even a > 'socket' without extra cache line misses. > > To force a close of an accept() or whatever blocking socket related > system call a shutdown() makes a lot of sense. > > This would have zero additional overhead for the fast path. Yes, that would I believe be a significant improvement. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 11:58 ` Alan Burlison @ 2015-10-22 12:10 ` Eric Dumazet 2015-10-22 13:12 ` David Miller 2015-10-22 13:14 ` Alan Burlison 0 siblings, 2 replies; 138+ messages in thread From: Eric Dumazet @ 2015-10-22 12:10 UTC (permalink / raw) To: Alan Burlison Cc: Casper.Dik, David Miller, viro, stephen, netdev, dholland-tech On Thu, 2015-10-22 at 12:58 +0100, Alan Burlison wrote: > On 22/10/2015 12:30, Eric Dumazet wrote: > > > We absolutely do not _want_ to do this just so that linux becomes slower > > to the point Solaris can compete, or you guys can avoid some work. > > Sentiments such as that really have no place in a discussion that's been > focussed primarily on the behaviour of interfaces, albeit with > digressions into the potential performance impacts. The discussion has > been cordial and I for one appreciate Al Viro's posts on the subject, > from which I've learned a lot. Can we please keep it that way? Thanks. Certainly not. I am a major linux networking developer and won't accept linux is hijacked by guys who never contributed to it, just so it meets their unreasonable expectations. We absolutely care about performance. And I do not care you focus on POSIX crap. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 12:10 ` Eric Dumazet @ 2015-10-22 13:12 ` David Miller 2015-10-22 13:14 ` Alan Burlison 1 sibling, 0 replies; 138+ messages in thread From: David Miller @ 2015-10-22 13:12 UTC (permalink / raw) To: eric.dumazet Cc: Alan.Burlison, Casper.Dik, viro, stephen, netdev, dholland-tech From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 22 Oct 2015 05:10:58 -0700 > I am a major linux networking developper and wont accept linux is > hijacked by guys who never contributed to it, just so it meets their > unreasonable expectations. > > We absolutely care about performance. And I do not care you focus on > POSIX crap. +1 ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 12:10 ` Eric Dumazet 2015-10-22 13:12 ` David Miller @ 2015-10-22 13:14 ` Alan Burlison 2015-10-22 17:05 ` Al Viro 1 sibling, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-22 13:14 UTC (permalink / raw) To: Eric Dumazet Cc: Casper.Dik, David Miller, viro, stephen, netdev, dholland-tech On 22/10/2015 13:10, Eric Dumazet wrote: >> Sentiments such as that really have no place in a discussion that's been >> focussed primarily on the behaviour of interfaces, albeit with >> digressions into the potential performance impacts. The discussion has >> been cordial and I for one appreciate Al Viro's posts on the subject, >> from which I've leaned a lot. Can we please keep it that way? Thanks. > > Certainly not. OK, in which case I'll make this my last on-list reply to this part of the thread as I think continuing it is unlikely to be productive. If you would like to further discuss your concerns about my motivations I'm happy to do so off list, along with anyone you want to cc in. Thanks. > I am a major linux networking developper and wont accept linux is > hijacked by guys who never contributed to it, just so it meets their > unreasonable expectations. Yes, I'm aware of who you are. And if my expectations were completely unreasonable then I'd have expected the conversation to have already drawn to a close by now. > We absolutely care about performance. And I do not care you focus on > POSIX crap. Yes, I understand the concern about the potential performance impact and it's a valid concern. And I also understand that the current Linux behaviour of shutdown() on unconnected sockets probably can't be changed without causing breakage and is therefore unlikely to happen as well. The issues I hit were in the context of application porting, where the APIs in question are covered by POSIX. 
The Linux manpages for open(), close(), socket(), dup2() and shutdown() all claim POSIX.1-2001 conformance. If performance is the most important concern then it's a valid decision to prioritise that over POSIX conformance, but you then can't continue to claim that the relevant Linux APIs are fully POSIX conformant, so I believe at minimum the Linux manpages need modifying. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 13:14 ` Alan Burlison @ 2015-10-22 17:05 ` Al Viro 2015-10-22 17:39 ` Alan Burlison 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 17:05 UTC (permalink / raw) To: Alan Burlison Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 02:14:42PM +0100, Alan Burlison wrote: > The issues I hit were in the context of application porting, where > the APIs in question are covered by POSIX. The Linux manpages for > open(), close(), socket(), dup2() and shutdown() all claim > POSIX.1-2001 conformance. If performance is the most important > concern then it's a valid decision to prioritise that over POSIX > conformance, you simply can't continue to claim that the relevant > Linux APIs are fully POSIX conformant, so I believe at the minimum > the Linux manpages need modifying. Oh, for... Right in this thread an example of complete BS has been quoted from POSIX close(2). The part about closing a file when the last descriptor gets closed. _Nothing_ is POSIX-compliant in that respect (nor should it be). Semantics around the distinction between file descriptors and <barf> file descriptions is underspecified, not to mention being very poorly written. You want to add something along the lines of "if any action by another thread changes the mapping from file descriptors to file descriptions for any file descriptor passed to syscall, such and such things should happen" - go ahead and specify what should happen. As it is, I don't see anything of that sort in e.g. accept(2). And no, [EBADF] The socket argument is not a valid file descriptor. in there is nowhere near being unambiguous enough - everyone agrees that argument should be a valid descriptor at the time of call, but I would be very surprised to find _any_ implementation (including Solaris one) recheck that upon exit to userland. 
For more bullshit from the same source (issue 7, close(2)): If fildes refers to a socket, close() shall cause the socket to be destroyed. If the socket is in connection-mode, and the SO_LINGER option is set for the socket with non-zero linger time, and the socket has untransmitted data, then close() shall block for up to the current linger interval until all data is transmitted. I challenge you to find *any* implementation that would have fd = socket(...); close(dup(fd)); do what this wonder of technical prose clearly requests. In the same text we also have When all file descriptors associated with a pipe or FIFO special file are closed, any data remaining in the pipe or FIFO shall be discarded. as well as explicit (and underspecified, but perhaps they do it elsewhere) "last close" in parts related to sockets and ptys. And that is not to mention the dup2(2) wording in there: If fildes2 is already a valid open file descriptor, it shall be closed first which is (a) inviting misinterpretation that would make the damn thing non-atomic (the only mentioning of atomicity is in non-normative sections) and (b) says fsck-all about the effects of closing descriptor. The latter is a problem, since nothing in close(2) bothers making a distinction between the effects specific to particular syscall and those common to all ways of closing a descriptor. And no, it's not a nitpicking - consider e.g. the parts concerning the order of events triggered by close(2) (such and such should be completed before close(2) returns); should it be taken as "same events should be completed before newfd is associated with the file description refered to by oldfd"? It _is_ user-visible, since close(2) removes fcntl locks. Sure, there is (otherwise unexplained) The dup2() function is not intended for use in critical regions as a synchronization mechanism. down in informative sections, so one can infer that event order here isn't to be relied upon. 
With no way to guess whether the event order concerning e.g. effect on ongoing accept(newfd) is any different in that respect. The entire area in Issue 7 stinks. It might make sense to try and fix it up, but let's not pretend that what's in there right now does specify the semantics in this kind of situations. I'm not saying that Solaris approach yields an inherently bad semantics or that it's impossible to implement without high scalability price and/or high memory footprint. But waving the flag of POSIX compliance when you are actually talking about the ways your implementation plugs the holes in a badly incomplete spec... Not to contribute to pissing contest, but IIRC Solaris wasn't even the first kernel having to deal with the possibility of descriptor table getting changed by another thread under ongoing syscall. Which was completely outside of POSIX scope, not that Plan 9 folks gave a damn. For Linux that can of worms had opened in 1.3.22 (Sep 1995), for OpenBSD - next January, a month later followed by FreeBSD (ab, said to be based on OpenBSD variant) and in Jan 1998 by NetBSD (said to be partially based on FreeBSD one). All of those had been more or less inspired by Plan 9 approach (in case of *BSD the original implementation was by Ron Minnich). Not sure when Plan 9 has implemented their variant; it was definitely there by the beginning of 1993 (going by the date on Release 1 rfork(2)). That would be what, around the time of Solaris 2.1? ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 17:05 ` Al Viro @ 2015-10-22 17:39 ` Alan Burlison 2015-10-22 18:56 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Alan Burlison @ 2015-10-22 17:39 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech On 22/10/2015 18:05, Al Viro wrote: > Oh, for... Right in this thread an example of complete BS has been quoted > from POSIX close(2). The part about closing a file when the last descriptor > gets closed. _Nothing_ is POSIX-compliant in that respect (nor should > it be). That's not exactly what it says; we've already discussed this, for example in the case of pending async IO on a filehandle. > Semantics around the distinction between file descriptors and > <barf> file descriptions is underspecified, not to mention being very poorly > written. I agree that part could do with some polishing. > You want to add something along the lines of "if any action by another thread > changes the mapping from file descriptors to file descriptions for any > file descriptor passed to syscall, such and such things should happen" - go > ahead and specify what should happen. As it is, I don't see anything of > that sort in e.g. accept(2). And no, > [EBADF] > The socket argument is not a valid file descriptor. > in there is nowhere near being unambiguous enough - everyone agrees that > argument should be a valid descriptor at the time of call, but I would be > very surprised to find _any_ implementation (including Solaris one) > recheck that upon exit to userland. The scenario I described previously, where dup2() is used to modify a fd that's being used in accept, results in the accept call terminating in the same way as if close had been called on it - Casper gave details earlier. > For more bullshit from the same source (issue 7, close(2)): > If fildes refers to a socket, close() shall cause the socket to be > destroyed. 
If the socket is in connection-mode, and the SO_LINGER > option is set for the socket with non-zero linger time, and the socket > has untransmitted data, then close() shall block for up to the current > linger interval until all data is transmitted. > I challenge you to find *any* implementation that would have > fd = socket(...); > close(dup(fd)); > do what this wonder of technical prose clearly requests. In the same text we > also have > When all file descriptors associated with a pipe or FIFO special file > are closed, any data remaining in the pipe or FIFO shall be discarded. > as well as explicit (and underspecified, but perhaps they do it elsewhere) > "last close" in parts related to sockets and ptys. Yes, Casper has just reported that to TOG, see http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11573. Our assessment is that sockets should behave the same way as plain files, i.e. 'last close'. > And that is not to mention the dup2(2) wording in there: > If fildes2 is already a valid open file descriptor, it shall be > closed first > which is (a) inviting misinterpretation that would make the damn thing > non-atomic (the only mentioning of atomicity is in non-normative sections) I've already explained why I believe atomic behaviour of dup2() is required by POSIX. If you feel it's not clear then we can ask POSIX for clarification. > and (b) says fsck-all about the effects of closing descriptor. The latter > is a problem, since nothing in close(2) bothers making a distinction between > the effects specific to particular syscall and those common to all ways of > closing a descriptor. And no, it's not a nitpicking - consider e.g. the > parts concerning the order of events triggered by close(2) (such and such > should be completed before close(2) returns); should it be taken as "same > events should be completed before newfd is associated with the file description > refered to by oldfd"? It _is_ user-visible, since close(2) removes fcntl > locks. 
Sure, there is (otherwise unexplained) > The dup2() function is not intended for use in critical regions > as a synchronization mechanism. > down in informative sections, so one can infer that event order here isn't > to be relied upon. With no way to guess whether the event order concerning > e.g. effect on ongoing accept(newfd) is any different in that respect. I think "it shall be closed first" makes it pretty clear that what is expected is the same behaviour as any direct invocation of close, and that has to happen before the reassignment. What makes you believe that isn't the case? > The entire area in Issue 7 stinks. It might make sense to try and fix it > up, but let's not pretend that what's in there right now does specify the > semantics in this kind of situations. Sorry, I disagree. > I'm not saying that Solaris approach yields an inherently bad semantics or > that it's impossible to implement without high scalability price and/or > high memory footprint. But waving the flag of POSIX compliance when you > are actually talking about the ways your implementation plugs the holes in > a badly incomplete spec... Personally I believe the spec is clear enough to allow an unambiguous interpretation of the required behavior in this area. If you think there are areas where the Solaris behaviour is in disagreement with the spec then I'd be interested to hear them. -- Alan Burlison -- ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 17:39 ` Alan Burlison @ 2015-10-22 18:56 ` Al Viro 2015-10-22 19:50 ` Casper.Dik 0 siblings, 1 reply; 138+ messages in thread From: Al Viro @ 2015-10-22 18:56 UTC (permalink / raw) To: Alan Burlison Cc: Eric Dumazet, Casper.Dik, David Miller, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote: > On 22/10/2015 18:05, Al Viro wrote: > > >Oh, for... Right in this thread an example of complete BS has been quoted > >from POSIX close(2). The part about closing a file when the last descriptor > >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should > >it be). > > That's not exactly what it says, we've already discussed, for > example in the case of pending async IO on a filehandle. Sigh... It completely fails to mention descriptor-passing. Which a) is relevant to what "last close" means and b) had been there for nearly the third of a century. > I agree that part could do with some polishing. google("wire brush of enlightenment") is what comes to mind... > >and (b) says fsck-all about the effects of closing descriptor. The latter > >is a problem, since nothing in close(2) bothers making a distinction between > >the effects specific to particular syscall and those common to all ways of > >closing a descriptor. And no, it's not a nitpicking - consider e.g. the > >parts concerning the order of events triggered by close(2) (such and such > >should be completed before close(2) returns); should it be taken as "same > >events should be completed before newfd is associated with the file description > >refered to by oldfd"? It _is_ user-visible, since close(2) removes fcntl > >locks. Sure, there is (otherwise unexplained) > > The dup2() function is not intended for use in critical regions > > as a synchronization mechanism. > >down in informative sections, so one can infer that event order here isn't > >to be relied upon. 
With no way to guess whether the event order concerning > >e.g. effect on ongoing accept(newfd) is any different in that respect. > > I think "it shall be closed first" makes it pretty clear that what > is expected is the same behaviour as any direct invocation of close, > and that has to happen before the reassignment. What makes you > believe that's isn't the case? So unless I'm misparsing something, you want thread A: accept(newfd) thread B: dup2(oldfd, newfd) have accept() bugger off before the switchover happens? What should happen if thread C does accept(newfd) right as B has decided that there's nothing more to wait for? For close(newfd) it would be simple - we are going to have lookup by descriptor fail with EBADF anyway, so making it do so as soon as we go hunting for those who are currently in accept(newfd) would do the trick - no new threads like that shall appear and as long as the descriptor is not declared free for taking by descriptor allocation nobody is going to be screwed by open() picking that slot of descriptor table too early. Trying to do that for dup2() would lose atomicity. I honestly don't know how Solaris behaves in that case, BTW - the race (if any) would probably be hard to hit, so in the case of Linux I would have to go and RTFS before saying that there isn't one. I can't do that with Solaris; all I can do here is ask you guys... Moreover, see above for record locks removal. Should that happen prior to switchover? If you have dup2(fd, fd2); set a record lock on fd2 spawn a thread in child, try to grab the same lock on fd2 in parent, do some work and close(fd) you are guaranteed that child won't see fd referring to the same file after it acquires the lock. Replace close(fd) with dup2(fd3, fd); should the same hold true in that case? 
FWIW, Linux behaviour in that area is to have record locks removal done between the switchover and return to userland in case of dup2() and between the removal from descriptor table and return to userland in case of close(). > Personally I believe the spec is clear enough to allow an > unambiguous interpretation of the required behavior in this area. If > you think there are areas where the Solaris behaviour is in > disagreement with the spec then I'd be interested to hear them. The spec is so vague that I strongly suspect that *both* Solaris and Linux behaviours are not in disagreement with it (modulo shutdown(2) extension Linux-side and we are really stuck with that one). ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 18:56 ` Al Viro @ 2015-10-22 19:50 ` Casper.Dik 2015-10-23 17:09 ` Al Viro 0 siblings, 1 reply; 138+ messages in thread From: Casper.Dik @ 2015-10-22 19:50 UTC (permalink / raw) To: Al Viro Cc: Alan Burlison, Eric Dumazet, David Miller, stephen, netdev, dholland-tech From: Al Viro <viro@ZenIV.linux.org.uk> >On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote: >> On 22/10/2015 18:05, Al Viro wrote: >> >> >Oh, for... Right in this thread an example of complete BS has been quoted >> >from POSIX close(2). The part about closing a file when the last descriptor >> >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should >> >it be). >> >> That's not exactly what it says, we've already discussed, for >> example in the case of pending async IO on a filehandle. > >Sigh... It completely fails to mention descriptor-passing. Which > a) is relevant to what "last close" means and > b) had been there for nearly the third of a century. Why is that different? These clearly count as file descriptors. >> I agree that part could do with some polishing. > >google("wire brush of enlightenment") is what comes to mind... Standardese is similar to legalese; those who are not inducted in writing it may have some problem interpreting what exactly is meant by the wording of the standard. >> I think "it shall be closed first" makes it pretty clear that what >> is expected is the same behaviour as any direct invocation of close, >> and that has to happen before the reassignment. What makes you >> believe that's isn't the case? > >So unless I'm misparsing something, you want >thread A: accept(newfd) >thread B: dup2(oldfd, newfd) >have accept() bugger off before the switchover happens? Well, certainly *before* we return from dup2(). 
(and clearly only once we have determined that dup2() will return successfully) >What should happen if thread C does accept(newfd) right as B has decided that >there's nothing more to wait? For close(newfd) it would be simple - we are >going to have lookup by descriptor fail with EBADF anyway, so making it do >so as soon as we go hunting for those who are currently in accept(newfd) >would do the trick - no new threads like that shall appear and as long as >the descriptor is not declared free for taking by descriptor allocation nobody >is going to be screwed by open() picking that slot of descriptor table too >early. Trying to do that for dup2() would lose atomicity. I honestly don't >know how Solaris behaves in that case, BTW - the race (if any) would probably >be hard to hit, so in case of Linux I would have to go and RTFS before saying >that there isn't one. I can't do that in with Solaris; all I can do here >is ask you guys... Solaris dup2() behaves exactly like close(). >Moreover, see above for record locks removal. Should that happen prior to >switchover? If you have > >dup(fd, fd2); >set a record lock on fd2 >spawn a thread >in child, try to grab the same lock on fd2 >in parent, do some work and close(fd) >you are guaranteed that child won't see fd refering to the same file after it >acquires the lock. Here you are talking about a lock held by the "parent" and that the "child" will only get the lock once close(fd) is done? Yes. The final "close" is done *after* the pointer has been removed from the file descriptor table. >Replace close(fd) with dup(fd3, fd); should the same hold true in that case? Yes. >FWIW, Linux behaviour in that area is to have record locks removal done >between the switchover and return to userland in case of dup2() and between >the removal from descriptor table and return to userland in case of close(). > >> Personally I believe the spec is clear enough to allow an >> unambiguous interpretation of the required behavior in this area. 
If >> you think there are areas where the Solaris behaviour is in >> disagreement with the spec then I'd be interested to hear them. > >The spec is so vague that I strongly suspect that *both* Solaris and Linux >behaviours are not in disagreement with it (modulo shutdown(2) extension >Linux-side and we are really stuck with that one). I'm not sure if the standard allows a handful of threads in accept() for a file descriptor which has already been closed *and* can be re-issued for other uses. Casper ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-22 19:50 ` Casper.Dik @ 2015-10-23 17:09 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-10-23 17:09 UTC (permalink / raw) To: Casper.Dik Cc: Alan Burlison, Eric Dumazet, David Miller, stephen, netdev, dholland-tech On Thu, Oct 22, 2015 at 09:50:10PM +0200, Casper.Dik@oracle.com wrote: > >Sigh... It completely fails to mention descriptor-passing. Which > > a) is relevant to what "last close" means and > > b) had been there for nearly the third of a century. > > Why is that different? These clearly count as file descriptors. To quote your own posting upthread (Message-ID: <201510212033.t9LKX4G8007718@room101.nl.oracle.com>) > Well, a file descriptor really only exists in the context of a process; > in-flight it is no longer a file descriptor as there process context with ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > a file descriptor table; so pointers to file descriptions are passed > around. IMO it shows that the wording is anything but clear. The only way to claim that it's accurate is, indeed, to declare that the contents of in-flight SCM_RIGHTS datagram should be counted as file descriptors and that's too much of a stretch for the reasons you've pointed to. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-21 14:38 ` Alan Burlison ` (2 preceding siblings ...) 2015-10-21 18:51 ` Al Viro @ 2015-10-23 18:30 ` David Holland 2015-10-23 19:51 ` Al Viro 3 siblings, 1 reply; 138+ messages in thread From: David Holland @ 2015-10-23 18:30 UTC (permalink / raw) To: Alan Burlison Cc: Al Viro, Eric Dumazet, Stephen Hemminger, netdev, dholland-tech, Casper Dik On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > On 21/10/2015 04:49, Al Viro wrote: > >BTW, for real fun, consider this: > >7) > >// fd is a socket > >fd2 = dup(fd); > >in thread A: accept(fd); > >in thread B: accept(fd); > >in thread C: accept(fd2); > >in thread D: close(fd); > > > >Which threads (if any), should get hit where it hurts? > > A & B should return from the accept with an error. C should continue. Which > is what happens on Solaris. So, I'm coming late to this discussion and I don't have the original context; however, to me this cited behavior seems undesirable and if I ran across it in the wild I would probably describe it as a bug. System call processing for operations on files involves translating a file descriptor (a number) into an open-file object (or "file description"), struct file in BSD and I think also in Linux. The actual system call logic operates on the open-file object, so once the translation happens application monkeyshines involving file descriptor numbers should have no effect on calls in progress. Other behavior would violate the principle of least surprise, as this basic architecture predates POSIX. The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are supposed to be atomic, regardless of what POSIX might or might not say. I did not write that code but I'll pass the report on to those who did. -- David A. Holland dholland@netbsd.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) 2015-10-23 18:30 ` Fw: " David Holland @ 2015-10-23 19:51 ` Al Viro 0 siblings, 0 replies; 138+ messages in thread From: Al Viro @ 2015-10-23 19:51 UTC (permalink / raw) To: David Holland Cc: Alan Burlison, Eric Dumazet, Stephen Hemminger, netdev, Casper Dik On Fri, Oct 23, 2015 at 06:30:25PM +0000, David Holland wrote: > So, I'm coming late to this discussion and I don't have the original > context; however, to me this cited behavior seems undesirable and if I > ran across it in the wild I would probably describe it as a bug. Unfortunately, that's precisely what NetBSD is trying to implement (and that's what will happen if nothing else reopens fd). See the logics in fd_close(), with ->fo_restart() and waiting for all activity to settle down. As for the missing context, what fd_close() is doing is also unreliable - inducing ERESTART in other threads sitting in accept(2) and things like that and waiting for them to run into EBADF they'll get (barring races) on syscall restart; threads sitting in accept() et.al. on the same struct file, but with different descriptors will hopefully go into restart and continue unaffected. All that machinery relies on nothing having reused the descriptor for socket(2), dup2() target, etc. while those threads had been going through the syscall restart - if that happens, you are SOL, since accept(2) _will_ restart on an unexpected socket. Moreover, if you fix dup2() atomicity, this approach will reliably shit itself for situations when dup2() rather than close() is used to close the socket. It relies upon having at least some window where the victim descriptor would be yielding EBADF. > System call processing for operations on files involves translating a > file descriptor (a number) into an open-file object (or "file > description"), struct file in BSD and I think also in Linux. 
The > actual system call logic operates on the open-file object, so once the > translation happens application monkeyshines involving file descriptor > numbers should have no effect on calls in progress. Other behavior > would violate the principle of least surprise, as this basic > architecture predates POSIX. Well, to be fair, until '93 there was no way to have the descriptor table changed under a syscall in the first place. The old model (everything up to and including 4.4BSD final) simply didn't include anything of that sort - mapping from descriptors to open files was not shared and all changes a syscall might see were ones done by the syscall itself. So this thing isn't covered by the basic architecture - it's something that had been significantly new merely two decades ago. And POSIX still hasn't quite caught up with that newfangled 4.2BSD thing... IMO what you've described above is fine - that's how Linux works, that's how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008 or so. "Cancel syscall if any of the descriptors got dissociated from opened files by action of another thread, have the dissociating operation wait for all affected syscalls to run down" thing had been introduced then and it is similar to what Solaris is doing. AFAICS, the main issue with that is the memory footprint from hell and/or cacheline clusterfuck. Having accept(2) bugger off with e.g. EINTR in such a situation isn't inherently worse or better than having it sit there as if close() or dup2() has not happened - matter of taste, and if there had been a way to do it without inflicting the price on processes that do not pull that kind of crap in the first place... might be worth considering. As it is, the memory footprint seems to be too heavy. I'm not entirely convinced that there's no clever way to avoid that, but right now I don't see anything that would look like a good approach. ^ permalink raw reply [flat|nested] 138+ messages in thread