linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, linux-mm@kvack.org, Jan Kara <jack@suse.cz>
Subject: Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
Date: Thu, 7 Mar 2019 16:27:17 -0500	[thread overview]
Message-ID: <20190307212717.GS23850@redhat.com> (raw)
In-Reply-To: <20190307201722.GG3835@redhat.com>

Hello Jerome,

On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse wrote:
> So for the above the easiest thing is to call set_page_dirty() from
> the mmu notifier callback. It is always safe to use the non locking
> variant from such callback. Well it is safe only if the page was
> map with write permission prior to the callback so here i assume
> nothing stupid is going on and that you only vmap page with write
> if they have a CPU pte with write and if not then you force a write
> page fault.

So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply shouldn't
be called in such case. It only ever makes sense if the pte is
writable.

On a side note, the reason the write bit on the pte enabled avoids the
need of the _lock suffix is because of the stable page writeback
guarantees?

> Basicly from mmu notifier callback you have the same right as zap
> pte has.

Good point.

Related to this I already was wondering why the set_page_dirty is not
done in the invalidate. Reading the patch it looks like the dirty is
marked dirty when the ring wraps around, not in the invalidate, Jeson
can tell if I misread something there.

For transient data passing through the ring, nobody should care if
it's lost. It's not user-journaled anyway so it could hit the disk in
any order. The only reason to flush it to do disk is if there's memory
pressure (to pageout like a swapout) and in such case it's enough to
mark it dirty only in the mmu notifier invalidate like you pointed out
(and only if GUP was called with FOLL_WRITE).

> O_DIRECT can suffer from the same issue but the race window for that
> is small enough that it is unlikely it ever happened. But for device

Ok that clarifies things.

> driver that GUP page for hours/days/weeks/months ... obviously the
> race window is big enough here. It affects many fs (ext4, xfs, ...)
> in different ways. I think ext4 is the most obvious because of the
> kernel log trace it leaves behind.
> 
> Bottom line is for set_page_dirty to be safe you need the following:
>     lock_page()
>     page_mkwrite()
>     set_pte_with_write()
>     unlock_page()

I also wondered why ext4 writepage doesn't recreate the bh if they got
dropped by the VM and page->private is 0. I mean, page->index and
page->mapping are still there, that's enough info for writepage itself
to take a slow path and calls page_mkwrite to find where to write the
page on disk.

> Now when loosing the write permission on the pte you will first get
> a mmu notifier callback so anyone that abide by mmu notifier is fine
> as long as they only write to the page if they found a pte with
> write as it means the above sequence did happen and page is write-
> able until the mmu notifier callback happens.
>
> When you lookup a page into the page cache you still need to call
> page_mkwrite() before installing a write-able pte.
> 
> Here for this vmap thing all you need is that the original user
> pte had the write flag. If you only allow write in the vmap when
> the original pte had write and you abide by mmu notifier then it
> is ok to call set_page_dirty from the mmu notifier (but not after).
> 
> Hence why my suggestion is a special vunmap that call set_page_dirty
> on the page from the mmu notifier.

Agreed, that will solve all issues in vhost context with regard to
set_page_dirty, including the case the memory is backed by VM_SHARED ext4.

Thanks!
Andrea

  reply	other threads:[~2019-03-07 21:27 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-06  7:18 [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap() Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 1/5] vhost: generalize adding used elem Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors Jason Wang
2019-03-06 10:45   ` Christophe de Dinechin
2019-03-07  2:38     ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area Jason Wang
2019-03-06 10:56   ` Christophe de Dinechin
2019-03-07  2:40     ` Jason Wang
2019-03-06 18:43   ` Souptick Joarder
2019-03-07  2:42     ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
2019-03-06 16:31   ` Michael S. Tsirkin
2019-03-07  2:45     ` Jason Wang
2019-03-07 15:34       ` Michael S. Tsirkin
2019-03-07 19:09         ` Jerome Glisse
2019-03-07 19:38           ` Andrea Arcangeli
2019-03-07 20:17             ` Jerome Glisse
2019-03-07 21:27               ` Andrea Arcangeli [this message]
2019-03-08  9:13                 ` Jason Wang
2019-03-08 19:11                   ` Andrea Arcangeli
2019-03-11  7:21                     ` Jason Wang
2019-03-11 14:45                 ` Jan Kara
2019-03-08  8:31         ` Jason Wang
2019-03-07 15:47   ` Michael S. Tsirkin
2019-03-07 17:56     ` Michael S. Tsirkin
2019-03-07 19:16       ` Andrea Arcangeli
2019-03-08  8:50         ` Jason Wang
2019-03-08 14:58           ` Jerome Glisse
2019-03-11  7:18             ` Jason Wang
2019-03-08 19:48           ` Andrea Arcangeli
2019-03-08 20:06             ` Jerome Glisse
2019-03-11  7:40             ` Jason Wang
2019-03-11 12:48               ` Michael S. Tsirkin
2019-03-11 13:43                 ` Andrea Arcangeli
2019-03-12  2:56                   ` Jason Wang
2019-03-12  3:51                     ` Michael S. Tsirkin
2019-03-12  2:52                 ` Jason Wang
2019-03-12  3:50                   ` Michael S. Tsirkin
2019-03-12  7:15                     ` Jason Wang
2019-03-07 19:17       ` Jerome Glisse
2019-03-08  2:21         ` Michael S. Tsirkin
2019-03-08  2:55           ` Jerome Glisse
2019-03-08  3:16             ` Michael S. Tsirkin
2019-03-08  3:40               ` Jerome Glisse
2019-03-08  3:43                 ` Michael S. Tsirkin
2019-03-08  3:45                   ` Jerome Glisse
2019-03-08  9:15                     ` Jason Wang
2019-03-08  8:58         ` Jason Wang
2019-03-08 12:56           ` Michael S. Tsirkin
2019-03-08 15:02             ` Jerome Glisse
2019-03-08 19:13           ` Andrea Arcangeli
2019-03-08 14:12 ` [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap() Christoph Hellwig
2019-03-11  7:13   ` Jason Wang
2019-03-11 13:59     ` Michael S. Tsirkin
2019-03-11 18:14       ` David Miller
2019-03-12  2:59         ` Jason Wang
2019-03-12  3:52           ` Michael S. Tsirkin
2019-03-12  7:17             ` Jason Wang
2019-03-12 11:54               ` Michael S. Tsirkin
2019-03-12 15:46                 ` James Bottomley
2019-03-12 20:04                   ` Andrea Arcangeli
2019-03-12 20:53                     ` James Bottomley
2019-03-12 21:11                       ` Andrea Arcangeli
2019-03-12 21:19                         ` James Bottomley
2019-03-12 21:53                           ` Andrea Arcangeli
2019-03-12 22:02                             ` James Bottomley
2019-03-12 22:50                               ` Andrea Arcangeli
2019-03-12 22:57                                 ` James Bottomley
2019-03-13 16:05                       ` Christoph Hellwig
2019-03-13 16:37                         ` James Bottomley
2019-03-14 10:42                           ` Michael S. Tsirkin
2019-03-14 13:49                             ` Jason Wang
2019-03-14 19:33                               ` Andrea Arcangeli
2019-03-15  4:39                                 ` Jason Wang
2019-03-12  5:14           ` James Bottomley
2019-03-12  7:51             ` Jason Wang
2019-03-12  7:53               ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190307212717.GS23850@redhat.com \
    --to=aarcange@redhat.com \
    --cc=jack@suse.cz \
    --cc=jasowang@redhat.com \
    --cc=jglisse@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterx@redhat.com \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).