All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jerome Glisse <jglisse@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	kvm@vger.kernel.org, "Michael S. Tsirkin" <mst@redhat.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
Date: Fri, 8 Mar 2019 09:58:01 -0500	[thread overview]
Message-ID: <20190308145800.GA3661__22704.5102788665$1555717422$gmane$org@redhat.com> (raw)
In-Reply-To: <e2fad6ed-9257-b53c-394b-bc913fc444c0@redhat.com>

On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> 
> On 2019/3/8 上午3:16, Andrea Arcangeli wrote:
> > On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > > > +	.invalidate_range = vhost_invalidate_range,
> > > > > +};
> > > > > +
> > > > >   void vhost_dev_init(struct vhost_dev *dev,
> > > > >   		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> > > > >   {
> > > > I also wonder here: when page is write protected then
> > > > it does not look like .invalidate_range is invoked.
> > > > 
> > > > E.g. mm/ksm.c calls
> > > > 
> > > > mmu_notifier_invalidate_range_start and
> > > > mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.
> > > > 
> > > > Similarly, rmap in page_mkclean_one will not call
> > > > mmu_notifier_invalidate_range.
> > > > 
> > > > If I'm right vhost won't get notified when page is write-protected since you
> > > > didn't install start/end notifiers. Note that end notifier can be called
> > > > with page locked, so it's not as straight-forward as just adding a call.
> > > > Writing into a write-protected page isn't a good idea.
> > > > 
> > > > Note that documentation says:
> > > > 	it is fine to delay the mmu_notifier_invalidate_range
> > > > 	call to mmu_notifier_invalidate_range_end() outside the page table lock.
> > > > implying it's called just later.
> > > OK I missed the fact that _end actually calls
> > > mmu_notifier_invalidate_range internally. So that part is fine but the
> > > fact that you are trying to take page lock under VQ mutex and take same
> > > mutex within notifier probably means it's broken for ksm and rmap at
> > > least since these call invalidate with lock taken.
> > Yes this lock inversion needs more thoughts.
> > 
> > > And generally, Andrea told me offline one can not take mutex under
> > > the notifier callback. I CC'd Andrea for why.
> > Yes, the problem then is the ->invalidate_page is called then under PT
> > lock so it cannot take mutex, you also cannot take the page_lock, it
> > can at most take a spinlock or trylock_page.
> > 
> > So it must switch back to the _start/_end methods unless you rewrite
> > the locking.
> > 
> > The difference with _start/_end, is that ->invalidate_range avoids the
> > _start callback basically, but to avoid the _start callback safely, it
> > has to be called in between the ptep_clear_flush and the set_pte_at
> > whenever the pfn changes like during a COW. So it cannot be coalesced
> > in a single TLB flush that invalidates all sptes in a range like we
> > prefer for performance reasons for example in KVM. It also cannot
> > sleep.
> > 
> > In short ->invalidate_range must be really fast (it shouldn't require
> > to send IPI to all other CPUs like KVM may require during an
> > invalidate_range_start) and it must not sleep, in order to prefer it
> > to _start/_end.
> > 
> > I.e. the invalidate of the secondary MMU that walks the linux
> > pagetables in hardware (in vhost case with GUP in software) has to
> > happen while the linux pagetable is zero, otherwise a concurrent
> > hardware pagetable lookup could re-instantiate a mapping to the old
> > page in between the set_pte_at and the invalidate_range_end (which
> > internally calls ->invalidate_range). Jerome documented it nicely in
> > Documentation/vm/mmu_notifier.rst .
> 
> 
> Right, I've actually gone through this several times but some details were
> missed by me obviously.
> 
> 
> > 
> > Now you don't really walk the pagetable in hardware in vhost, but if
> > you use gup_fast after usemm() it's similar.
> > 
> > For vhost the invalidate would be really fast, there are no IPI to
> > deliver at all, the problem is just the mutex.
> 
> 
> Yes. A possible solution is to introduce a valid flag for VA. Vhost may only
> try to access kernel VA when it was valid. Invalidate_range_start() will
> clear this under the protection of the vq mutex when it can block. Then
> invalidate_range_end() then can clear this flag. An issue is blockable is 
> always false for range_end().
> 

Note that there can be multiple asynchronous concurrent invalidate_range
callbacks. So a flag does not work but a counter of number of active
invalidation would. See how KSM is doing it for instance in kvm_main.c

The pattern for this kind of thing is:
    my_invalidate_range_start(start,end) {
        ...
        if (mystruct_overlap(mystruct, start, end)) {
            mystruct_lock();
            mystruct->invalidate_count++;
            ...
            mystruct_unlock();
        }
    }

    my_invalidate_range_end(start,end) {
        ...
        if (mystruct_overlap(mystruct, start, end)) {
            mystruct_lock();
            mystruct->invalidate_count--;
            ...
            mystruct_unlock();
        }
    }

    my_access_va(mystruct) {
    again:
        wait_on(!mystruct->invalidate_count)
        mystruct_lock();
        if (mystruct->invalidate_count) {
            mystruct_unlock();
            goto again;
        }
        GUP();
        ...
        mystruct_unlock();
    }

Cheers,
Jérôme
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

  parent reply	other threads:[~2019-03-08 14:58 UTC|newest]

Thread overview: 185+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-06  7:18 [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap() Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 1/5] vhost: generalize adding used elem Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 2/5] vhost: fine grain userspace memory accessors Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06 10:45   ` Christophe de Dinechin
2019-03-06 10:45   ` Christophe de Dinechin
2019-03-06 10:45     ` Christophe de Dinechin
2019-03-07  2:38     ` Jason Wang
2019-03-07  2:38       ` Jason Wang
2019-03-07  2:38       ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch() Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 4/5] vhost: introduce helpers to get the size of metadata area Jason Wang
2019-03-06 10:56   ` Christophe de Dinechin
2019-03-06 10:56   ` Christophe de Dinechin
2019-03-07  2:40     ` Jason Wang
2019-03-07  2:40     ` Jason Wang
2019-03-06 18:43   ` Souptick Joarder
2019-03-06 18:43     ` Souptick Joarder
2019-03-07  2:42     ` Jason Wang
2019-03-07  2:42     ` Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06  7:18 ` [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address Jason Wang
2019-03-06  7:18 ` Jason Wang
2019-03-06 16:31   ` Michael S. Tsirkin
2019-03-06 16:31   ` Michael S. Tsirkin
2019-03-07  2:45     ` Jason Wang
2019-03-07 15:34       ` Michael S. Tsirkin
2019-03-07 15:34       ` Michael S. Tsirkin
2019-03-07 19:09         ` Jerome Glisse
2019-03-07 19:09         ` Jerome Glisse
2019-03-07 19:38           ` Andrea Arcangeli
2019-03-07 19:38             ` Andrea Arcangeli
2019-03-07 20:17             ` Jerome Glisse
2019-03-07 20:17             ` Jerome Glisse
2019-03-07 21:27               ` Andrea Arcangeli
2019-03-08  9:13                 ` Jason Wang
2019-03-08 19:11                   ` Andrea Arcangeli
2019-03-11  7:21                     ` Jason Wang
2019-03-11  7:21                     ` Jason Wang
2019-03-08 19:11                   ` Andrea Arcangeli
2019-03-08  9:13                 ` Jason Wang
2019-03-11 14:45                 ` Jan Kara
2019-03-11 14:45                 ` Jan Kara
2019-03-07 21:27               ` Andrea Arcangeli
2019-03-08  8:31         ` Jason Wang
2019-03-08  8:31         ` Jason Wang
2019-03-07  2:45     ` Jason Wang
2019-03-07 15:47   ` Michael S. Tsirkin
2019-03-07 17:56     ` Michael S. Tsirkin
2019-03-07 17:56     ` Michael S. Tsirkin
2019-03-07 19:16       ` Andrea Arcangeli
2019-03-08  8:50         ` Jason Wang
2019-03-08  8:50           ` Jason Wang
2019-03-08 14:58           ` Jerome Glisse
2019-03-11  7:18             ` Jason Wang
2019-03-11  7:18             ` Jason Wang
2019-03-08 14:58           ` Jerome Glisse [this message]
2019-03-08 19:48           ` Andrea Arcangeli
2019-03-08 19:48           ` Andrea Arcangeli
2019-03-08 20:06             ` Jerome Glisse
2019-03-08 20:06             ` Jerome Glisse
2019-03-11  7:40             ` Jason Wang
2019-03-11  7:40               ` Jason Wang
2019-03-11 12:48               ` Michael S. Tsirkin
2019-03-11 13:43                 ` Andrea Arcangeli
2019-03-12  2:56                   ` Jason Wang
2019-03-12  2:56                   ` Jason Wang
2019-03-12  3:51                     ` Michael S. Tsirkin
2019-03-12  3:51                     ` Michael S. Tsirkin
2019-03-11 13:43                 ` Andrea Arcangeli
2019-03-12  2:52                 ` Jason Wang
2019-03-12  2:52                   ` Jason Wang
2019-03-12  3:50                   ` Michael S. Tsirkin
2019-03-12  3:50                   ` Michael S. Tsirkin
2019-03-12  7:15                     ` Jason Wang
2019-03-12  7:15                     ` Jason Wang
2019-03-11 12:48               ` Michael S. Tsirkin
2019-03-07 19:16       ` Andrea Arcangeli
2019-03-07 19:17       ` Jerome Glisse
2019-03-07 19:17       ` Jerome Glisse
2019-03-08  2:21         ` Michael S. Tsirkin
2019-03-08  2:21           ` Michael S. Tsirkin
2019-03-08  2:55           ` Jerome Glisse
2019-03-08  2:55           ` Jerome Glisse
2019-03-08  3:16             ` Michael S. Tsirkin
2019-03-08  3:16             ` Michael S. Tsirkin
2019-03-08  3:40               ` Jerome Glisse
2019-03-08  3:40               ` Jerome Glisse
2019-03-08  3:43                 ` Michael S. Tsirkin
2019-03-08  3:43                 ` Michael S. Tsirkin
2019-03-08  3:45                   ` Jerome Glisse
2019-03-08  9:15                     ` Jason Wang
2019-03-08  9:15                     ` Jason Wang
2019-03-08  3:45                   ` Jerome Glisse
2019-03-08  8:58         ` Jason Wang
2019-03-08  8:58         ` Jason Wang
2019-03-08 12:56           ` Michael S. Tsirkin
2019-03-08 12:56           ` Michael S. Tsirkin
2019-03-08 15:02             ` Jerome Glisse
2019-03-08 15:02             ` Jerome Glisse
2019-03-08 19:13           ` Andrea Arcangeli
2019-03-08 19:13           ` Andrea Arcangeli
2019-03-07 15:47   ` Michael S. Tsirkin
2019-03-08 14:12 ` [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap() Christoph Hellwig
2019-03-08 14:12 ` Christoph Hellwig
2019-03-08 14:12   ` Christoph Hellwig
2019-03-11  7:13   ` Jason Wang
2019-03-11  7:13   ` Jason Wang
2019-03-11  7:13     ` Jason Wang
2019-03-11 13:59     ` Michael S. Tsirkin
2019-03-11 13:59     ` Michael S. Tsirkin
2019-03-11 13:59       ` Michael S. Tsirkin
2019-03-11 18:14       ` David Miller
2019-03-11 18:14       ` David Miller
2019-03-11 18:14         ` David Miller
2019-03-12  2:59         ` Jason Wang
2019-03-12  2:59           ` Jason Wang
2019-03-12  3:52           ` Michael S. Tsirkin
2019-03-12  3:52             ` Michael S. Tsirkin
2019-03-12  3:52             ` Michael S. Tsirkin
2019-03-12  7:17             ` Jason Wang
2019-03-12  7:17               ` Jason Wang
2019-03-12 11:54               ` Michael S. Tsirkin
2019-03-12 11:54               ` Michael S. Tsirkin
2019-03-12 11:54                 ` Michael S. Tsirkin
2019-03-12 15:46                 ` James Bottomley
2019-03-12 15:46                   ` James Bottomley
2019-03-12 15:46                   ` James Bottomley
2019-03-12 20:04                   ` Andrea Arcangeli
2019-03-12 20:04                     ` Andrea Arcangeli
2019-03-12 20:04                     ` Andrea Arcangeli
2019-03-12 20:53                     ` James Bottomley
2019-03-12 20:53                       ` James Bottomley
2019-03-12 20:53                       ` James Bottomley
2019-03-12 21:11                       ` Andrea Arcangeli
2019-03-12 21:11                       ` Andrea Arcangeli
2019-03-12 21:11                         ` Andrea Arcangeli
2019-03-12 21:19                         ` James Bottomley
2019-03-12 21:19                           ` James Bottomley
2019-03-12 21:19                           ` James Bottomley
2019-03-12 21:53                           ` Andrea Arcangeli
2019-03-12 21:53                             ` Andrea Arcangeli
2019-03-12 22:02                             ` James Bottomley
2019-03-12 22:02                               ` James Bottomley
2019-03-12 22:02                               ` James Bottomley
2019-03-12 22:50                               ` Andrea Arcangeli
2019-03-12 22:50                               ` Andrea Arcangeli
2019-03-12 22:50                                 ` Andrea Arcangeli
2019-03-12 22:57                                 ` James Bottomley
2019-03-12 22:57                                   ` James Bottomley
2019-03-12 22:57                                   ` James Bottomley
2019-03-12 21:53                           ` Andrea Arcangeli
2019-03-13 16:05                       ` Christoph Hellwig
2019-03-13 16:05                         ` Christoph Hellwig
2019-03-13 16:05                         ` Christoph Hellwig
2019-03-13 16:37                         ` James Bottomley
2019-03-13 16:37                           ` James Bottomley
2019-03-13 16:37                           ` James Bottomley
2019-03-14 10:42                           ` Michael S. Tsirkin
2019-03-14 10:42                           ` Michael S. Tsirkin
2019-03-14 10:42                             ` Michael S. Tsirkin
2019-03-14 13:49                             ` Jason Wang
2019-03-14 13:49                             ` Jason Wang
2019-03-14 13:49                               ` Jason Wang
2019-03-14 19:33                               ` Andrea Arcangeli
2019-03-14 19:33                                 ` Andrea Arcangeli
2019-03-15  4:39                                 ` Jason Wang
2019-03-15  4:39                                   ` Jason Wang
2019-03-15  4:39                                 ` Jason Wang
2019-03-14 19:33                               ` Andrea Arcangeli
2019-03-13 16:05                       ` Christoph Hellwig
2019-03-12  7:17             ` Jason Wang
2019-03-12  5:14           ` James Bottomley
2019-03-12  5:14             ` James Bottomley
2019-03-12  5:14             ` James Bottomley
2019-03-12  7:51             ` Jason Wang
2019-03-12  7:51             ` Jason Wang
2019-03-12  7:51               ` Jason Wang
2019-03-12  7:53               ` Jason Wang
2019-03-12  7:53                 ` Jason Wang
2019-03-12  7:53               ` Jason Wang
2019-03-12  2:59         ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='20190308145800.GA3661__22704.5102788665$1555717422$gmane$org@redhat.com' \
    --to=jglisse@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.