All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Miller <davem@davemloft.net>,
	stephen@networkplumber.org, netdev@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	dhowells@redhat.com, linux-fsdevel@vger.kernel.org
Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
Date: Wed, 28 Oct 2015 14:44:28 -0700	[thread overview]
Message-ID: <1446068668.7476.89.camel@edumazet-glaptop2.roam.corp.google.com> (raw)
In-Reply-To: <20151028211347.GC22011@ZenIV.linux.org.uk>

On Wed, 2015-10-28 at 21:13 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> > 
> > > Before I take a deep look at your suggestion, are you sure plain use of
> > > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> > 
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> > 
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> > 
> > The cache line trashing in struct cred is a minor one ;)
> 
> percpu-refcount isn't convenient - the only such candidate for ref_kill in
> there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit, TBH...

OK then ...

> Cacheline pingpong on the descriptors bitmap is probably inevitable, but
> it's not the only problem in the existing implementation - close a small
> descriptor when you've got a lot of them and look for the second open
> after that.  _That_ can lead to thousands of cachelines being read through,
> all under the table spinlock.  It's literally orders of magnitude worse.
> And if the first open after that close happens to be for a short-living
> descriptor, you'll get the same situation back in your face as soon as you
> close it.
> 
> I think we can seriously improve that without screwing the fast path by
> adding "summary" bitmaps once the primary grows past the cacheline worth
> of bits.  With bits in the summary bitmap corresponding to cacheline-sized
> chunks of the primary, being set iff all bits in the corresponding chunk
> are set.  If the summary map grows larger than one cacheline, add the
> second-order one (that happens at quarter million descriptors and serves
> until 128 million; adding the third-order map is probably worthless).
> 
> I want to maintain the same kind of "everything below this is known to be
> in use" thing as we do now.  Allocation would start with looking into the
> same place in primary bitmap where we'd looked now and similar search
> forward for zero bit.  _However_, it would stop at cacheline boundary.
> If nothing had been found, we look in the corresponding place in the
> summary bitmap and search for zero bit there.  Again, no more than up
> to the cacheline boundary.  If something is found, we've got a chunk in
> the primary known to contain a zero bit; if not - go to the second-level
> and search there, etc.
> 
> When a zero bit in the primary had been found, check if it's within the
> rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
> that bit.  If there are zero bits left in the same word - we are done,
> otherwise check the still unread words in the cacheline and see if all
> of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.
> 
> Normal case is exactly the same as now - one cacheline accessed and modified.
> We might end up touching more than that, but it's going to be rare and
> the cases when it happens are very likely to lead to much worse amount of
> memory traffic with the current code.
> 
> Freeing is done by zeroing the bit in primary, checking for other zero bits
> nearby and buggering off if there are such.  If the entire cacheline used
> to be all-bits-set, clear the bit in summary and, if there's a second-order
> summary, get the bit in there clear as well - it's probably not worth
> bothering with checking that all the cacheline in summary bitmap had been
> all-bits-set.  Again, the normal case is the same as now.
> 
> It'll need profiling and tuning, but AFAICS it's doable without making the
> things worse than they are now, and it should get rid of those O(N) fetches
> under spinlock cases.  And yes, those are triggerable and visible in
> profiles.  IMO it's worth trying to fix...
> 

Well, all this complexity goes away with a O_FD_FASTALLOC /
SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
tells the kernel we do not care getting the lowest possible fd as POSIX
mandates.

With this bit set, the bitmap search can start at a random point, and we
find a lot in O(1) : one cache line miss, if you have at least one free
bit/slot per 512 bits (64 bytes cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

... // active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
... // passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt
lot of applications absolutely rely on the POSIX thing...

  reply	other threads:[~2015-10-28 21:44 UTC|newest]

Thread overview: 138+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-19 16:59 Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Stephen Hemminger
2015-10-19 23:33 ` Eric Dumazet
2015-10-20  1:12   ` Alan Burlison
2015-10-20  1:45     ` Eric Dumazet
2015-10-20  9:59       ` Alan Burlison
2015-10-20 11:24         ` David Miller
2015-10-20 11:39           ` Alan Burlison
2015-10-20 13:19         ` Fw: " Eric Dumazet
2015-10-20 13:45           ` Alan Burlison
2015-10-20 15:30             ` Eric Dumazet
2015-10-20 18:31               ` Alan Burlison
2015-10-20 18:42                 ` Eric Dumazet
2015-10-21 10:25                 ` David Laight
2015-10-21 10:49                   ` Alan Burlison
2015-10-21 11:28                     ` Eric Dumazet
2015-10-21 13:03                       ` Alan Burlison
2015-10-21 13:29                         ` Eric Dumazet
2015-10-21  3:49       ` Al Viro
2015-10-21 14:38         ` Alan Burlison
2015-10-21 15:30           ` David Miller
2015-10-21 16:04             ` Casper.Dik
2015-10-21 21:18               ` Eric Dumazet
2015-10-21 21:28                 ` Al Viro
2015-10-21 16:32           ` Fw: " Eric Dumazet
2015-10-21 18:51           ` Al Viro
2015-10-21 20:33             ` Casper.Dik
2015-10-22  4:21               ` Al Viro
2015-10-22 10:55                 ` Alan Burlison
2015-10-22 18:16                   ` Al Viro
2015-10-22 20:15                     ` Alan Burlison
2015-11-02 10:03               ` David Laight
2015-11-02 10:29                 ` Al Viro
2015-10-21 22:28             ` Alan Burlison
2015-10-22  1:29             ` David Miller
2015-10-22  4:17               ` Alan Burlison
2015-10-22  4:44                 ` Al Viro
2015-10-22  6:03                   ` Al Viro
2015-10-22  6:34                     ` Casper.Dik
2015-10-22 17:21                       ` Al Viro
2015-10-22 18:24                         ` Casper.Dik
2015-10-22 19:07                           ` Al Viro
2015-10-22 19:51                             ` Casper.Dik
2015-10-22 21:57                               ` Al Viro
2015-10-23  9:52                                 ` Casper.Dik
2015-10-23 13:02                                   ` Eric Dumazet
2015-10-23 13:20                                     ` Casper.Dik
2015-10-23 13:48                                       ` Eric Dumazet
2015-10-23 14:13                                       ` Eric Dumazet
2015-10-23 13:35                                     ` Alan Burlison
2015-10-23 14:21                                       ` Eric Dumazet
2015-10-23 15:46                                         ` Alan Burlison
2015-10-23 16:00                                           ` Eric Dumazet
2015-10-23 16:07                                             ` Alan Burlison
2015-10-23 16:19                                             ` Eric Dumazet
2015-10-23 16:40                                               ` Alan Burlison
2015-10-23 17:47                                                 ` Eric Dumazet
2015-10-23 17:59                                                   ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet
2015-10-25 13:45                                                     ` David Miller
2015-10-24  2:30                                   ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
2015-10-27  9:08                                     ` Casper.Dik
2015-10-27 10:52                                       ` Alan Burlison
2015-10-27 12:01                                         ` Eric Dumazet
2015-10-27 12:27                                           ` Alan Burlison
2015-10-27 12:44                                             ` Eric Dumazet
2015-10-27 13:42                                         ` David Miller
2015-10-27 13:37                                           ` Alan Burlison
2015-10-27 13:59                                             ` David Miller
2015-10-27 14:13                                               ` Alan Burlison
2015-10-27 14:39                                                 ` David Miller
2015-10-27 14:39                                                   ` Alan Burlison
2015-10-27 15:04                                                     ` David Miller
2015-10-27 15:53                                                       ` Alan Burlison
2015-10-27 23:17                                         ` Al Viro
2015-10-28  0:13                                           ` Eric Dumazet
2015-10-28 12:35                                             ` Al Viro
2015-10-28 13:24                                               ` Eric Dumazet
2015-10-28 14:47                                                 ` Eric Dumazet
2015-10-28 21:13                                                   ` Al Viro
2015-10-28 21:44                                                     ` Eric Dumazet [this message]
2015-10-28 22:33                                                       ` Al Viro
2015-10-28 23:08                                                         ` Eric Dumazet
2015-10-29  0:15                                                           ` Al Viro
2015-10-29  3:29                                                             ` Eric Dumazet
2015-10-29  4:16                                                               ` Al Viro
2015-10-29 12:35                                                                 ` Eric Dumazet
2015-10-29 13:48                                                                   ` Eric Dumazet
2015-10-30 17:18                                                                   ` Linus Torvalds
2015-10-30 21:02                                                                     ` Al Viro
2015-10-30 21:23                                                                       ` Linus Torvalds
2015-10-30 21:50                                                                         ` Linus Torvalds
2015-10-30 22:33                                                                           ` Al Viro
2015-10-30 23:52                                                                             ` Linus Torvalds
2015-10-31  0:09                                                                               ` Al Viro
2015-10-31 15:59                                                                               ` Eric Dumazet
2015-10-31 19:34                                                                               ` Al Viro
2015-10-31 19:54                                                                                 ` Linus Torvalds
2015-10-31 20:29                                                                                   ` Al Viro
2015-11-02  0:24                                                                                     ` Al Viro
2015-11-02  0:59                                                                                       ` Linus Torvalds
2015-11-02  2:14                                                                                       ` Eric Dumazet
2015-11-02  6:22                                                                                         ` Al Viro
2015-10-31 20:45                                                                                   ` Eric Dumazet
2015-10-31 21:23                                                                                     ` Linus Torvalds
2015-10-31 21:51                                                                                       ` Al Viro
2015-10-31 22:34                                                                                       ` Eric Dumazet
2015-10-31  1:07                                                                           ` Eric Dumazet
2015-10-28 16:04                                           ` Alan Burlison
2015-10-29 14:58                                         ` David Holland
2015-10-29 15:18                                           ` Alan Burlison
2015-10-29 16:01                                             ` David Holland
2015-10-29 16:15                                               ` Alan Burlison
2015-10-29 17:07                                                 ` Al Viro
2015-10-29 17:12                                                   ` Alan Burlison
2015-10-30  1:54                                                     ` David Miller
2015-10-30  1:55                                                   ` David Miller
2015-10-30  5:44                                                 ` David Holland
2015-10-30 17:43                                           ` David Laight
2015-10-30 21:09                                             ` Al Viro
2015-11-04 15:54                                               ` David Laight
2015-11-04 16:27                                                 ` Al Viro
2015-11-06 15:07                                                   ` David Laight
2015-11-06 19:31                                                     ` Al Viro
2015-10-22  6:51                   ` Casper.Dik
2015-10-22 11:18                     ` Alan Burlison
2015-10-22 11:15                   ` Alan Burlison
2015-10-22  6:15                 ` Casper.Dik
2015-10-22 11:30                   ` Eric Dumazet
2015-10-22 11:58                     ` Alan Burlison
2015-10-22 12:10                       ` Eric Dumazet
2015-10-22 13:12                         ` David Miller
2015-10-22 13:14                         ` Alan Burlison
2015-10-22 17:05                           ` Al Viro
2015-10-22 17:39                             ` Alan Burlison
2015-10-22 18:56                               ` Al Viro
2015-10-22 19:50                                 ` Casper.Dik
2015-10-23 17:09                                   ` Al Viro
2015-10-23 18:30           ` Fw: " David Holland
2015-10-23 19:51             ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1446068668.7476.89.camel@edumazet-glaptop2.roam.corp.google.com \
    --to=eric.dumazet@gmail.com \
    --cc=davem@davemloft.net \
    --cc=dhowells@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=stephen@networkplumber.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.