linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	qemu-devel@nongnu.org, KVM list <kvm@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Linux API <linux-api@vger.kernel.org>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Dave Hansen <dave@sr71.net>, Paolo Bonzini <pbonzini@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Andy Lutomirski <luto@amacapital.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Peter Feiner <pfeiner@google.com>,
	Christopher Covington <cov@codeaurora.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Mike Hommey <mh@glandium.org>,
	Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Keith Packard <keithp@keithp.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	Andrew Jones <drjones@redhat.com>,
	Juan Quintela <quintela@redhat.com>
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
Date: Tue, 7 Oct 2014 12:56:18 -0400	[thread overview]
Message-ID: <CA+55aFwJP_suQWhgSp1NvYOPGB5QrBYS57LT14n7TWXqCEajdg@mail.gmail.com> (raw)
In-Reply-To: <20141007141913.GC2342@redhat.com>

On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I see what you mean. The only cons I see is that we couldn't use then
> recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
> PAGE_SIZE, ..)  and retain the zerocopy behavior. Or how could we?
> There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

You're doing completelt faulty math, and you haven't thought it through.

Your "zero-copy" case is no such thing. Who cares if some packet
receive is zero-copy, when you need to set up the anonymous page to
*receive* the zero copy into, which involves page allocation, page
zeroing, page table setup with VM and page table locking, etc etc.

The thing is, the whole concept of "zero-copy" is pure and utter
bullshit. Sun made a big deal about the whole concept back in the
nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's
crap. It's made-up and not real.

Then, once you've allocated and cleared the page, mapped it in, your
"zero-copy" model involves looking up the page in the page tables
again (locking etc), then doing that zero-copy to the page. Then, when
you remap it, you look it up in the page tables AGAIN, with locking,
move it around, have to clear the old page table entry (which involves
a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI -
since the people who do this are all threaded and want many CPU's, and
then you insert the page into the new place.

That's *insane*. It's crap. All just to try to avoid one page copy.

Don't do it. remapping games really are complete BS. They never beat
just copying the data. It's that simple.

> As things stands now, I'm afraid with a write() syscall we couldn't do
> it zerocopy.

Really, you need to rethink your whole "zerocopy" model. It's broken.
Nobody sane cares. You've bought into a model that Sun already showed
doesn't work.

The only time zero-copy works is in random benchmarks that are very
careful to not touch the data at all at any point, and also try to
make sure that the benchmark is very much single-threaded so that you
never have the whole cross-CPU IPI issue for the TLB invalidate. Then,
and only then, can zero-copy win. And it's just not a realistic
scenario.

> If it wasn't for the TLB flush of the old page, the remap_anon_pages
> variant would be more optimal than doing a copy through a write
> syscall. Is the copy cheaper than a TLB flush? I probably naively
> assumed the TLB flush was always cheaper.

A local TLB flush is cheap. That's not the problem. The problem is the
setup of the page, and the clearing of the page, and the cross-CPU TLB
flush. And the page table locking, etc etc.

So no, I'm not AT ALL worried about a single "invlpg" instruction.
That's nothing. Local CPU TLB flushes of single pages are basically
free. But that really isn't where the costs are.

Quite frankly, the *only* time page remapping has ever made sense is
when it is used for realloc() kind of purposes, where you need to
remap pages not because of zero-copy, but because you need to change
the virtual address space layout. And then you make sure it's not a
common operation, because you're not doing it as a performance win,
you're doing it because you're changing your virtual layout.

Really. Zerocopy is for benchmarks, and for things like "splice()"
when you can avoid the page tables *entirely*. But the notion that
page remapping of user pages is faster than a copy is pure and utter
garbage. It's simply not true.

So I really think you should aim for a "write()": kind of interface.

With write, you may not get the zero-copy, but on the other hand it
allows you to re-use the source buffer many times without having to
allocate new pages and map it in etc. So a "read()+write()" loop (or,
quite commonly a "generate data computationally from a compressed
source + write()" loop) is actually much more efficient than the
zero-copy remapping, because you don't have all the complexities and
overheads in creating the source page.

It is possible that that could involve "splice()" too, although I
don't really think the source data tends to be in page-aligned chunks.
But hey, splice() at least *can* be faster than copying (and then we
have vmsplice() not because it's magically faster, but because it can
under certain circumstances be worth it, and it kind of made sense to
allow the interface, but I really don't think it's used very much or
very useful).

               Linus

  parent reply	other threads:[~2014-10-07 16:56 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
2014-10-03 18:15   ` Linus Torvalds
2014-10-03 20:55     ` Paolo Bonzini
2014-10-03 17:07 ` [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
2014-10-03 18:23   ` Linus Torvalds
2014-10-06 14:14     ` Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-10-07  9:03   ` Kirill A. Shutemov
2014-11-06 20:08   ` Konstantin Khlebnikov
2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-10-03 23:13   ` Mike Hommey
2014-10-06 17:24     ` Andrea Arcangeli
2014-10-07 10:36   ` Kirill A. Shutemov
2014-10-07 10:46     ` Dr. David Alan Gilbert
2014-10-07 10:52       ` [Qemu-devel] " Kirill A. Shutemov
2014-10-07 11:01         ` Dr. David Alan Gilbert
2014-10-07 11:30           ` Kirill A. Shutemov
2014-10-07 13:24     ` Andrea Arcangeli
2014-10-07 15:21       ` Kirill A. Shutemov
2014-10-03 17:07 ` [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-10-03 18:31   ` Linus Torvalds
2014-10-06  8:55     ` Dr. David Alan Gilbert
2014-10-06 16:41       ` Andrea Arcangeli
2014-10-07 12:47         ` Linus Torvalds
2014-10-07 14:19           ` Andrea Arcangeli
2014-10-07 15:52             ` Andrea Arcangeli
2014-10-07 15:54               ` Andy Lutomirski
2014-10-07 16:13               ` Peter Feiner
2014-10-07 16:56             ` Linus Torvalds [this message]
2014-10-07 17:07           ` Dr. David Alan Gilbert
2014-10-07 17:14             ` Paolo Bonzini
2014-10-07 17:25               ` Dr. David Alan Gilbert
2014-10-07 11:10   ` [Qemu-devel] " Kirill A. Shutemov
2014-10-07 13:37     ` Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 11/17] mm: swp_entry_swapcount Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
2014-10-04 13:13   ` Andi Kleen
2014-10-06 17:00     ` Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
2014-10-27  9:32 ` [PATCH 00/17] RFC: userfault v2 zhanghailiang
2014-10-29 17:46   ` Andrea Arcangeli
2014-10-29 17:56     ` [Qemu-devel] " Peter Maydell
2014-11-21 20:14       ` Andrea Arcangeli
2014-11-21 23:05         ` Peter Maydell
2014-11-25 19:45           ` Andrea Arcangeli
2014-10-30 11:31     ` zhanghailiang
2014-10-30 12:49       ` Dr. David Alan Gilbert
2014-10-31  1:26         ` zhanghailiang
2014-11-19 18:49           ` Andrea Arcangeli
2014-11-20  2:54             ` zhanghailiang
2014-11-20 17:38               ` Andrea Arcangeli
2014-11-21  7:19                 ` zhanghailiang
2014-10-31  2:23       ` Peter Feiner
2014-10-31  3:29         ` zhanghailiang
2014-10-31  4:38           ` zhanghailiang
2014-10-31  5:17             ` Andres Lagar-Cavilla
2014-10-31  8:11               ` zhanghailiang
2014-10-31 19:39           ` Peter Feiner
2014-11-01  8:48             ` zhanghailiang
2014-11-20 17:29             ` Andrea Arcangeli
2014-11-12  7:18       ` zhanghailiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFwJP_suQWhgSp1NvYOPGB5QrBYS57LT14n7TWXqCEajdg@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreslc@google.com \
    --cc=anthony@codemonkey.ws \
    --cc=cov@codeaurora.org \
    --cc=dave@sr71.net \
    --cc=dgilbert@redhat.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=drjones@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=keithp@keithp.com \
    --cc=kernel-team@android.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=mh@glandium.org \
    --cc=minchan@kernel.org \
    --cc=neilb@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peter.huangpeng@huawei.com \
    --cc=pfeiner@google.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=riel@redhat.com \
    --cc=rlove@google.com \
    --cc=sasha.levin@oracle.com \
    --cc=stefanha@gmail.com \
    --cc=walken@google.com \
    --cc=wenchaoqemu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).