All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@kernel.org>
To: Eric Dumazet <edumazet@google.com>,
	"David S . Miller" <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	linux-mm <linux-mm@kvack.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
Date: Mon, 23 Apr 2018 14:14:33 -0700	[thread overview]
Message-ID: <9ed6083f-d731-945c-dbcd-f977c5600b03@kernel.org> (raw)
In-Reply-To: <20180420155542.122183-1-edumazet@google.com>

On 04/20/2018 08:55 AM, Eric Dumazet wrote:
> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
> 
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability.
> 

I think that the right solution is to rework mmap() on TCP sockets a 
bit.  The current approach in net-next is very strange for a few reasons:

1. It uses mmap() as an operation that has side effects besides just 
creating a mapping.  If nothing else, it's surprising, since mmap() 
doesn't usually do that.  But it's also causing problems like what 
you're seeing.

2. The performance is worse than it needs to be.  mmap() is slow, and I 
doubt you'll find many mm developers who consider this particular abuse 
of mmap() to be a valid thing to optimize for.

3. I'm not at all convinced the accounting is sane.  As far as I can 
tell, you're allowing unprivileged users to increment the count on 
network-owned pages, limited only by available virtual memory, without 
obviously charging it to the socket buffer limits.  It looks like a 
program that simply forgot to call munmap() would cause the system to 
run out of memory, and I see no reason to expect the OOM killer to have 
any real chance of killing the right task.

4. Error handling sucks.  If I try to mmap() a large range (which is the 
whole point -- using a small range will kill performance) and not quite 
all of it can be mapped, then I waste a bunch of time in the kernel and 
get *none* of the range mapped.

I would suggest that you rework the interface a bit.  First a user would 
call mmap() on a TCP socket, which would create an empty VMA.  (It would 
set vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
recognize it, but it would have no effect whatsoever on the TCP state 
machine.  Reading the VMA would get SIGBUS.)  Then a user would call a 
new ioctl() or setsockopt() function and pass something like:

struct tcp_zerocopy_receive {
   void *address;
   size_t length;
};

The kernel would verify that [address, address+length) is entirely 
inside a single TCP VMA and then would do the vm_insert_range magic.  On 
success, length is changed to the length that actually got mapped.  The 
kernel could do this while holding mmap_sem for *read*, and it could get 
the lock ordering right.  If and when mm range locks ever get merged, it 
could switch to using a range lock.

Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the 
part of the mapping that you're done with.

Does this seem reasonable?  It should involve very little code change, 
it will run faster, it will scale better, and it is much less weird IMO.

  parent reply	other threads:[~2018-04-23 21:14 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-20 15:55 [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue Eric Dumazet
2018-04-20 15:55 ` [PATCH net-next 1/4] mm: provide a mmap_hook infrastructure Eric Dumazet
2018-04-20 15:55 ` [PATCH net-next 2/4] net: implement sock_mmap_hook() Eric Dumazet
2018-04-20 15:55 ` [PATCH net-next 3/4] tcp: provide tcp_mmap_hook() Eric Dumazet
2018-04-20 15:55 ` [PATCH net-next 4/4] tcp: mmap: move the skb cleanup to tcp_mmap_hook() Eric Dumazet
2018-04-21  9:07 ` [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue Christoph Hellwig
2018-04-21 16:55   ` Eric Dumazet
2018-04-23 21:14 ` Andy Lutomirski [this message]
2018-04-23 21:38   ` Eric Dumazet
2018-04-23 21:38     ` Eric Dumazet
2018-04-24  2:04     ` Andy Lutomirski
2018-04-24  4:30       ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9ed6083f-d731-945c-dbcd-f977c5600b03@kernel.org \
    --to=luto@kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=soheil@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.