From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, jgg@ziepe.ca,
"Paul E. McKenney" <paulmck@linux.ibm.com>
Subject: Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
Date: Wed, 31 Jul 2019 14:29:50 -0400 [thread overview]
Message-ID: <20190731132438-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20190731084655.7024-8-jasowang@redhat.com>
On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.
Unfortunately this needs a lot of review and testing, so this can't make
rc2, and I don't think this is the kind of patch I can merge after rc3.
Subtle memory barrier tricks like this can introduce new bugs while they
are fixing old ones.
>
> Consider the read critical section is pretty small the synchronization
> should be done very fast.
>
> Note the patch lead about 3% PPS dropping.
Sorry what do you mean by this last sentence? This degrades performance
compared to what?
>
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
> drivers/vhost/vhost.h | 7 +-
> 2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>
> spin_lock(&vq->mmu_lock);
> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - map[i] = rcu_dereference_protected(vq->maps[i],
> - lockdep_is_held(&vq->mmu_lock));
> + map[i] = vq->maps[i];
> if (map[i]) {
> vhost_set_map_dirty(vq, map[i], i);
> - rcu_assign_pointer(vq->maps[i], NULL);
> + vq->maps[i] = NULL;
> }
> }
> spin_unlock(&vq->mmu_lock);
>
> - /* No need for synchronize_rcu() or kfree_rcu() since we are
> - * serialized with memory accessors (e.g vq mutex held).
> + /* No need for synchronization since we are serialized with
> + * memory accessors (e.g vq mutex held).
> */
>
> for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> }
>
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);
> +
> + smp_store_release(&vq->ref, ref + 1);
> + /* Make sure ref counter is visible before accessing the map */
> + smp_load_acquire(&vq->ref);
The map access is after this sequence, correct?
Just going by the rules in Documentation/memory-barriers.txt,
I think that this pair will not order following accesses with ref store.
Documentation/memory-barriers.txt says:
+ In addition, a RELEASE+ACQUIRE
+ pair is -not- guaranteed to act as a full memory barrier.
The guarantee that is made is this:
after
an ACQUIRE on a given variable, all memory accesses preceding any prior
RELEASE on that same variable are guaranteed to be visible.
And if we also had the reverse rule we'd end up with a full barrier,
won't we?
Cc Paul in case I missed something here. And if I'm right,
maybe we should call this out, adding
"The opposite is not true: a prior RELEASE is not
guaranteed to be visible before memory accesses following
the subsequent ACQUIRE".
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> + int ref = READ_ONCE(vq->ref);
> +
> + /* Make sure vq access is done before increasing ref counter */
> + smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> + int ref;
> +
> + /* Make sure map change was done before checking ref counter */
> + smp_mb();
> +
> + ref = READ_ONCE(vq->ref);
> + if (ref & 0x1) {
Please document the even/odd trick here too, not just in the commit log.
> + /* When ref change,
changes
> we are sure no reader can see
> + * previous map */
> + while (READ_ONCE(vq->ref) == ref) {
what is the below line in aid of?
> + set_current_state(TASK_RUNNING);
> + schedule();
if (need_resched())
schedule();
?
> + }
On an interruptible kernel, there's a risk here is that
a task got preempted with an odd ref.
So I suspect we'll have to disable preemption when we
make ref odd.
> + }
> + /* Make sure ref counter was checked before any other
> + * operations that was dene on map. */
was dene -> were done?
> + smp_mb();
> +}
> +
> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> int index,
> unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> spin_lock(&vq->mmu_lock);
> ++vq->invalidate_count;
>
> - map = rcu_dereference_protected(vq->maps[index],
> - lockdep_is_held(&vq->mmu_lock));
> + map = vq->maps[index];
> if (map) {
> vhost_set_map_dirty(vq, map, index);
> - rcu_assign_pointer(vq->maps[index], NULL);
> + vq->maps[index] = NULL;
> }
> spin_unlock(&vq->mmu_lock);
>
> if (map) {
> - synchronize_rcu();
> + vhost_vq_sync_access(vq);
> vhost_map_unprefetch(map);
> }
> }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> for (j = 0; j < VHOST_NUM_ADDRS; j++)
> - RCU_INIT_POINTER(vq->maps[j], NULL);
> + vq->maps[j] = NULL;
> }
> }
> #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> vq->indirect = NULL;
> vq->heads = NULL;
> vq->dev = dev;
> + vq->ref = 0;
> mutex_init(&vq->mutex);
> spin_lock_init(&vq->mmu_lock);
> vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> map->npages = npages;
> map->pages = pages;
>
> - rcu_assign_pointer(vq->maps[index], map);
> + vq->maps[index] = map;
> /* No need for a synchronize_rcu(). This function should be
> * called by dev->worker so we are serialized with all
> * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> *((__virtio16 *)&used->ring[vq->num]) =
> cpu_to_vhost16(vq, vq->avail_idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> size_t size;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> size = count * sizeof(*head);
> memcpy(used->ring + idx, head, size);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> used->flags = cpu_to_vhost16(vq, vq->used_flags);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *idx = avail->idx;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *head = avail->ring[idx & (vq->num - 1)];
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *flags = avail->flags;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> struct vring_avail *avail;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> + vhost_vq_access_map_begin(vq);
> + map = vq->maps[VHOST_ADDR_AVAIL];
> if (likely(map)) {
> avail = map->addr;
> *event = (__virtio16)avail->ring[vq->num];
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> struct vring_used *used;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> + map = vq->maps[VHOST_ADDR_USED];
> if (likely(map)) {
> used = map->addr;
> *idx = used->idx;
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> struct vring_desc *d;
>
> if (!vq->iotlb) {
> - rcu_read_lock();
> + vhost_vq_access_map_begin(vq);
>
> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> + map = vq->maps[VHOST_ADDR_DESC];
> if (likely(map)) {
> d = map->addr;
> *desc = *(d + idx);
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> return 0;
> }
>
> - rcu_read_unlock();
> + vhost_vq_access_map_end(vq);
> }
> #endif
>
> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> #if VHOST_ARCH_CAN_ACCEL_UACCESS
> static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
> {
> - struct vhost_map __rcu *map;
> + struct vhost_map *map;
> int i;
>
> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> - rcu_read_lock();
> - map = rcu_dereference(vq->maps[i]);
> - rcu_read_unlock();
> + map = vq->maps[i];
> if (unlikely(!map))
> vhost_map_prefetch(vq, i);
> }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a9a2a93857d2..f9e9558a529d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
> #if VHOST_ARCH_CAN_ACCEL_UACCESS
> /* Read by memory accessors, modified by meta data
> * prefetching, MMU notifier and vring ioctl().
> - * Synchonrized through mmu_lock (writers) and RCU (writers
> - * and readers).
> + * Synchonrized through mmu_lock (writers) and ref counters,
> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
> */
> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> + struct vhost_map *maps[VHOST_NUM_ADDRS];
> /* Read by MMU notifier, modified by vring ioctl(),
> * synchronized through MMU notifier
> * registering/unregistering.
> */
> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
> #endif
> + int ref;
Is it important that this is signed? If not I'd do unsigned here:
even though kernel does compile with 2s complement sign overflow,
it seems cleaner not to depend on that.
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>
> struct file *kick;
> --
> 2.18.1
next prev parent reply other threads:[~2019-07-31 18:30 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang
2019-07-31 8:46 ` [PATCH V2 1/9] vhost: don't set uaddr for invalid address Jason Wang
2019-07-31 8:46 ` [PATCH V2 2/9] vhost: validate MMU notifier registration Jason Wang
2019-07-31 8:46 ` [PATCH V2 3/9] vhost: fix vhost map leak Jason Wang
2019-07-31 8:46 ` [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() Jason Wang
2019-07-31 12:41 ` Jason Gunthorpe
2019-07-31 13:29 ` Jason Wang
2019-07-31 19:32 ` Jason Gunthorpe
2019-07-31 19:37 ` Michael S. Tsirkin
2019-08-01 5:03 ` Jason Wang
2019-07-31 8:46 ` [PATCH V2 5/9] vhost: mark dirty pages during map uninit Jason Wang
2019-07-31 8:46 ` [PATCH V2 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps() Jason Wang
2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang
2019-07-31 8:50 ` Jason Wang
2019-07-31 12:39 ` Jason Gunthorpe
2019-07-31 13:28 ` Jason Wang
2019-07-31 19:30 ` Jason Gunthorpe
2019-08-01 5:02 ` Jason Wang
2019-08-01 14:15 ` Jason Gunthorpe
2019-08-02 9:40 ` Jason Wang
2019-08-02 12:46 ` Jason Gunthorpe
2019-08-02 14:27 ` Michael S. Tsirkin
2019-08-02 17:24 ` Jason Gunthorpe
2019-08-03 21:36 ` Michael S. Tsirkin
2019-08-04 0:14 ` Jason Gunthorpe
2019-08-04 8:07 ` Michael S. Tsirkin
2019-08-05 4:39 ` Jason Wang
2019-08-06 11:53 ` Jason Gunthorpe
2019-08-06 13:36 ` Michael S. Tsirkin
2019-08-06 13:40 ` Jason Gunthorpe
2019-08-05 4:36 ` Jason Wang
2019-08-05 4:41 ` Jason Wang
2019-08-05 6:40 ` Michael S. Tsirkin
2019-08-05 8:24 ` Jason Wang
2019-08-05 6:30 ` Michael S. Tsirkin
2019-08-05 8:22 ` Jason Wang
2019-08-05 4:20 ` Jason Wang
2019-08-06 12:04 ` Jason Gunthorpe
2019-08-07 6:49 ` Jason Wang
2019-08-02 14:03 ` Michael S. Tsirkin
2019-08-05 4:33 ` Jason Wang
2019-08-05 6:28 ` Michael S. Tsirkin
2019-08-05 8:21 ` Jason Wang
2019-07-31 18:29 ` Michael S. Tsirkin [this message]
2019-08-01 8:06 ` Jason Wang
2019-08-03 21:54 ` Michael S. Tsirkin
2019-08-05 8:18 ` Jason Wang
2019-07-31 8:46 ` [PATCH V2 8/9] vhost: correctly set dirty pages in MMU notifiers callback Jason Wang
2019-07-31 8:46 ` [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early Jason Wang
2019-07-31 9:59 ` Stefano Garzarella
2019-07-31 10:05 ` Jason Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190731132438-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=jasowang@redhat.com \
--cc=jgg@ziepe.ca \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=netdev@vger.kernel.org \
--cc=paulmck@linux.ibm.com \
--cc=virtualization@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).