* [PATCH V2 0/9] Fixes for metadata accelreation @ 2019-07-31 8:46 Jason Wang 2019-07-31 8:46 ` [PATCH V2 1/9] vhost: don't set uaddr for invalid address Jason Wang ` (8 more replies) 0 siblings, 9 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg Hi all: This series tries to fix several issues introduced by the metadata acceleration series. Please review. Changes from V1: - Try not to use RCU to synchronize MMU notifier with vhost worker - Set dirty pages only after there are no readers - Return -EAGAIN only when we find the range overlaps with the metadata Jason Wang (9): vhost: don't set uaddr for invalid address vhost: validate MMU notifier registration vhost: fix vhost map leak vhost: reset invalidate_count in vhost_set_vring_num_addr() vhost: mark dirty pages during map uninit vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps() vhost: do not use RCU to synchronize MMU notifier with worker vhost: correctly set dirty pages in MMU notifiers callback vhost: do not return -EAGIAN for non blocking invalidation too early drivers/vhost/vhost.c | 232 +++++++++++++++++++++++++++--------------- drivers/vhost/vhost.h | 8 +- 2 files changed, 154 insertions(+), 86 deletions(-) -- 2.18.1 ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH V2 1/9] vhost: don't set uaddr for invalid address 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 2/9] vhost: validate MMU notifier registration Jason Wang ` (7 subsequent siblings) 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg We should not set up uaddr for an invalid address; otherwise we may try to pin or prefetch mappings of the wrong pages. Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 0536f8526359..488380a581dc 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2082,7 +2082,8 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, } #if VHOST_ARCH_CAN_ACCEL_UACCESS - vhost_setup_vq_uaddr(vq); + if (r == 0) + vhost_setup_vq_uaddr(vq); if (d->mm) mmu_notifier_register(&d->mmu_notifier, d->mm); -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH V2 2/9] vhost: validate MMU notifier registration 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang 2019-07-31 8:46 ` [PATCH V2 1/9] vhost: don't set uaddr for invalid address Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 3/9] vhost: fix vhost map leak Jason Wang ` (6 subsequent siblings) 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg The return value of mmu_notifier_register() is not checked in vhost_vring_set_num_addr(). This can leave the mm and the MMU notifier out of sync, resulting in a double free. To solve this, introduce a boolean flag to track whether the MMU notifier is registered, and only unregister when it is true. Reported-and-tested-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 19 +++++++++++++++---- drivers/vhost/vhost.h | 1 + 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 488380a581dc..17f6abea192e 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -629,6 +629,7 @@ void vhost_dev_init(struct vhost_dev *dev, dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; + dev->has_notifier = false; init_llist_head(&dev->work_list); init_waitqueue_head(&dev->wait); INIT_LIST_HEAD(&dev->read_list); @@ -730,6 +731,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev) if (err) goto err_mmu_notifier; #endif + dev->has_notifier = true; return 0; @@ -959,7 +961,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev) } if (dev->mm) { #if VHOST_ARCH_CAN_ACCEL_UACCESS - mmu_notifier_unregister(&dev->mmu_notifier, dev->mm); + if (dev->has_notifier) { + mmu_notifier_unregister(&dev->mmu_notifier, + dev->mm); + 
dev->has_notifier = false; + } #endif mmput(dev->mm); } @@ -2064,8 +2070,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, /* Unregister MMU notifer to allow invalidation callback * can access vq->uaddrs[] without holding a lock. */ - if (d->mm) + if (d->has_notifier) { mmu_notifier_unregister(&d->mmu_notifier, d->mm); + d->has_notifier = false; + } vhost_uninit_vq_maps(vq); #endif @@ -2085,8 +2093,11 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, if (r == 0) vhost_setup_vq_uaddr(vq); - if (d->mm) - mmu_notifier_register(&d->mmu_notifier, d->mm); + if (d->mm) { + r = mmu_notifier_register(&d->mmu_notifier, d->mm); + if (!r) + d->has_notifier = true; + } #endif mutex_unlock(&vq->mutex); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 42a8c2a13ab1..a9a2a93857d2 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -214,6 +214,7 @@ struct vhost_dev { int iov_limit; int weight; int byte_weight; + bool has_notifier; }; bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len); -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH V2 3/9] vhost: fix vhost map leak 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang 2019-07-31 8:46 ` [PATCH V2 1/9] vhost: don't set uaddr for invalid address Jason Wang 2019-07-31 8:46 ` [PATCH V2 2/9] vhost: validate MMU notifier registration Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() Jason Wang ` (5 subsequent siblings) 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg We don't free the map during vhost_map_unprefetch(), which means it can be leaked. Fix this by freeing the map. Reported-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 17f6abea192e..2a3154976277 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -302,9 +302,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d) static void vhost_map_unprefetch(struct vhost_map *map) { kfree(map->pages); - map->pages = NULL; - map->npages = 0; - map->addr = NULL; + kfree(map); } static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (2 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 3/9] vhost: fix vhost map leak Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 12:41 ` Jason Gunthorpe 2019-07-31 8:46 ` [PATCH V2 5/9] vhost: mark dirty pages during map uninit Jason Wang ` (4 subsequent siblings) 8 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg vhost_set_vring_num_addr() can be called in the middle of invalidate_range_start() and invalidate_range_end(). If we don't reset invalidate_count after unregistering the MMU notifier, invalidate_count can go out of sync (e.g. never reach zero), which in fact disables the fast accessor path. Fix this by resetting the count to zero. Reported-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2a3154976277..2a7217c33668 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, d->has_notifier = false; } + /* reset invalidate_count in case we are in the middle of + * invalidate_start() and invalidate_end(). + */ + vq->invalidate_count = 0; vhost_uninit_vq_maps(vq); #endif -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 8:46 ` [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() Jason Wang @ 2019-07-31 12:41 ` Jason Gunthorpe 2019-07-31 13:29 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-07-31 12:41 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote: > The vhost_set_vring_num_addr() could be called in the middle of > invalidate_range_start() and invalidate_range_end(). If we don't reset > invalidate_count after the un-registering of MMU notifier, the > invalidate_cont will run out of sync (e.g never reach zero). This will > in fact disable the fast accessor path. Fixing by reset the count to > zero. > > Reported-by: Michael S. Tsirkin <mst@redhat.com> Did Michael report this as well? > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") > Signed-off-by: Jason Wang <jasowang@redhat.com> > drivers/vhost/vhost.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 2a3154976277..2a7217c33668 100644 > +++ b/drivers/vhost/vhost.c > @@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, > d->has_notifier = false; > } > > + /* reset invalidate_count in case we are in the middle of > + * invalidate_start() and invalidate_end(). > + */ > + vq->invalidate_count = 0; > vhost_uninit_vq_maps(vq); > #endif > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 12:41 ` Jason Gunthorpe @ 2019-07-31 13:29 ` Jason Wang 2019-07-31 19:32 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-07-31 13:29 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/7/31 下午8:41, Jason Gunthorpe wrote: > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote: >> The vhost_set_vring_num_addr() could be called in the middle of >> invalidate_range_start() and invalidate_range_end(). If we don't reset >> invalidate_count after the un-registering of MMU notifier, the >> invalidate_cont will run out of sync (e.g never reach zero). This will >> in fact disable the fast accessor path. Fixing by reset the count to >> zero. >> >> Reported-by: Michael S. Tsirkin <mst@redhat.com> > Did Michael report this as well? Correct me if I was wrong. I think it's point 4 described in https://lkml.org/lkml/2019/7/21/25. Thanks > >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") >> Signed-off-by: Jason Wang <jasowang@redhat.com> >> drivers/vhost/vhost.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index 2a3154976277..2a7217c33668 100644 >> +++ b/drivers/vhost/vhost.c >> @@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d, >> d->has_notifier = false; >> } >> >> + /* reset invalidate_count in case we are in the middle of >> + * invalidate_start() and invalidate_end(). >> + */ >> + vq->invalidate_count = 0; >> vhost_uninit_vq_maps(vq); >> #endif >> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 13:29 ` Jason Wang @ 2019-07-31 19:32 ` Jason Gunthorpe 2019-07-31 19:37 ` Michael S. Tsirkin 2019-08-01 5:03 ` Jason Wang 0 siblings, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2019-07-31 19:32 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote: > > On 2019/7/31 下午8:41, Jason Gunthorpe wrote: > > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote: > > > The vhost_set_vring_num_addr() could be called in the middle of > > > invalidate_range_start() and invalidate_range_end(). If we don't reset > > > invalidate_count after the un-registering of MMU notifier, the > > > invalidate_cont will run out of sync (e.g never reach zero). This will > > > in fact disable the fast accessor path. Fixing by reset the count to > > > zero. > > > > > > Reported-by: Michael S. Tsirkin <mst@redhat.com> > > Did Michael report this as well? > > > Correct me if I was wrong. I think it's point 4 described in > https://lkml.org/lkml/2019/7/21/25. I'm not sure what that is talking about But this fixes what I described: https://lkml.org/lkml/2019/7/22/554 Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 19:32 ` Jason Gunthorpe @ 2019-07-31 19:37 ` Michael S. Tsirkin 2019-08-01 5:03 ` Jason Wang 1 sibling, 0 replies; 51+ messages in thread From: Michael S. Tsirkin @ 2019-07-31 19:37 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Wed, Jul 31, 2019 at 04:32:52PM -0300, Jason Gunthorpe wrote: > On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote: > > > > On 2019/7/31 下午8:41, Jason Gunthorpe wrote: > > > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote: > > > > The vhost_set_vring_num_addr() could be called in the middle of > > > > invalidate_range_start() and invalidate_range_end(). If we don't reset > > > > invalidate_count after the un-registering of MMU notifier, the > > > > invalidate_cont will run out of sync (e.g never reach zero). This will > > > > in fact disable the fast accessor path. Fixing by reset the count to > > > > zero. > > > > > > > > Reported-by: Michael S. Tsirkin <mst@redhat.com> > > > Did Michael report this as well? > > > > > > Correct me if I was wrong. I think it's point 4 described in > > https://lkml.org/lkml/2019/7/21/25. > > I'm not sure what that is talking about > > But this fixes what I described: > > https://lkml.org/lkml/2019/7/22/554 > > Jason These are two reasons for a possible counter imbalance. Unsurprisingly they are both fixed if you reset the counter to 0. -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() 2019-07-31 19:32 ` Jason Gunthorpe 2019-07-31 19:37 ` Michael S. Tsirkin @ 2019-08-01 5:03 ` Jason Wang 1 sibling, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-01 5:03 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/1 3:32 AM, Jason Gunthorpe wrote: > On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote: >> On 2019/7/31 8:41 PM, Jason Gunthorpe wrote: >>> On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote: >>>> The vhost_set_vring_num_addr() could be called in the middle of >>>> invalidate_range_start() and invalidate_range_end(). If we don't reset >>>> invalidate_count after the un-registering of MMU notifier, the >>>> invalidate_cont will run out of sync (e.g never reach zero). This will >>>> in fact disable the fast accessor path. Fixing by reset the count to >>>> zero. >>>> >>>> Reported-by: Michael S. Tsirkin <mst@redhat.com> >>> Did Michael report this as well? >> >> Correct me if I was wrong. I think it's point 4 described in >> https://lkml.org/lkml/2019/7/21/25. > I'm not sure what that is talking about > > But this fixes what I described: > > https://lkml.org/lkml/2019/7/22/554 > > Jason I'm sorry I missed this; I will add your name as Reported-by in the next version. Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH V2 5/9] vhost: mark dirty pages during map uninit 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (3 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr() Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps() Jason Wang ` (3 subsequent siblings) 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg We don't mark dirty pages if the map is torn down outside the MMU notifier. This leads to untracked dirty pages. Fix this by marking dirty pages during map uninit. Reported-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2a7217c33668..c12cdadb0855 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -305,6 +305,18 @@ static void vhost_map_unprefetch(struct vhost_map *map) kfree(map); } +static void vhost_set_map_dirty(struct vhost_virtqueue *vq, + struct vhost_map *map, int index) +{ + struct vhost_uaddr *uaddr = &vq->uaddrs[index]; + int i; + + if (uaddr->write) { + for (i = 0; i < map->npages; i++) + set_page_dirty(map->pages[i]); + } +} + static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) { struct vhost_map *map[VHOST_NUM_ADDRS]; @@ -314,8 +326,10 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) for (i = 0; i < VHOST_NUM_ADDRS; i++) { map[i] = rcu_dereference_protected(vq->maps[i], lockdep_is_held(&vq->mmu_lock)); - if (map[i]) + if (map[i]) { + vhost_set_map_dirty(vq, map[i], i); rcu_assign_pointer(vq->maps[i], NULL); + } } 
spin_unlock(&vq->mmu_lock); @@ -353,7 +367,6 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, { struct vhost_uaddr *uaddr = &vq->uaddrs[index]; struct vhost_map *map; - int i; if (!vhost_map_range_overlap(uaddr, start, end)) return; @@ -364,10 +377,7 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, map = rcu_dereference_protected(vq->maps[index], lockdep_is_held(&vq->mmu_lock)); if (map) { - if (uaddr->write) { - for (i = 0; i < map->npages; i++) - set_page_dirty(map->pages[i]); - } + vhost_set_map_dirty(vq, map, index); rcu_assign_pointer(vq->maps[index], NULL); } spin_unlock(&vq->mmu_lock); -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH V2 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps() 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (4 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 5/9] vhost: mark dirty pages during map uninit Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang ` (2 subsequent siblings) 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg There's no need for RCU synchronization in vhost_uninit_vq_maps() since we've already serialized with readers (memory accessors). This also avoids a possible userspace DoS through ioctl() due to the high latency that synchronize_rcu() can incur. Reported-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index c12cdadb0855..cfc11f9ed9c9 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -333,7 +333,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) } spin_unlock(&vq->mmu_lock); - synchronize_rcu(); + /* No need for synchronize_rcu() or kfree_rcu() since we are + * serialized with memory accessors (e.g vq mutex held). + */ for (i = 0; i < VHOST_NUM_ADDRS; i++) if (map[i]) -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (5 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps() Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:50 ` Jason Wang ` (2 more replies) 2019-07-31 8:46 ` [PATCH V2 8/9] vhost: correctly set dirty pages in MMU notifiers callback Jason Wang 2019-07-31 8:46 ` [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early Jason Wang 8 siblings, 3 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg We used to use RCU to synchronize the MMU notifier with the worker. This leads to calling synchronize_rcu() in invalidate_range_start(). But on a busy system, many factors may slow down synchronize_rcu(), which makes it unsuitable to be called in an MMU notifier. One solution is SRCU, but its overhead is obvious due to the expensive full memory barrier. Another choice is seqlock, but it doesn't provide a synchronization method between readers and writers. The last choice is the vq mutex, but it needs to deal with the worst case where the MMU notifier must block and wait for swap-in to finish. So this patch switches to using a counter to track whether or not the map is in use. The counter is increased when the vq starts or finishes using the map. This means that when it is even, we're sure there are no readers and the MMU notifier is synchronized. When it is odd, there is a reader, and we need to wait for it to become even again; then we are synchronized. To avoid a full memory barrier, store_release + load_acquire on the counter is used. Considering the read critical section is pretty small, the synchronization should be done very fast. Note this patch leads to about a 3% PPS drop. Reported-by: Michael S. 
Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- drivers/vhost/vhost.h | 7 +- 2 files changed, 94 insertions(+), 58 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index cfc11f9ed9c9..db2c81cb1e90 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) spin_lock(&vq->mmu_lock); for (i = 0; i < VHOST_NUM_ADDRS; i++) { - map[i] = rcu_dereference_protected(vq->maps[i], - lockdep_is_held(&vq->mmu_lock)); + map[i] = vq->maps[i]; if (map[i]) { vhost_set_map_dirty(vq, map[i], i); - rcu_assign_pointer(vq->maps[i], NULL); + vq->maps[i] = NULL; } } spin_unlock(&vq->mmu_lock); - /* No need for synchronize_rcu() or kfree_rcu() since we are - * serialized with memory accessors (e.g vq mutex held). + /* No need for synchronization since we are serialized with + * memory accessors (e.g vq mutex held). 
*/ for (i = 0; i < VHOST_NUM_ADDRS; i++) @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); } +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) +{ + int ref = READ_ONCE(vq->ref); + + smp_store_release(&vq->ref, ref + 1); + /* Make sure ref counter is visible before accessing the map */ + smp_load_acquire(&vq->ref); +} + +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) +{ + int ref = READ_ONCE(vq->ref); + + /* Make sure vq access is done before increasing ref counter */ + smp_store_release(&vq->ref, ref + 1); +} + +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) +{ + int ref; + + /* Make sure map change was done before checking ref counter */ + smp_mb(); + + ref = READ_ONCE(vq->ref); + if (ref & 0x1) { + /* When ref changes, we are sure no reader can see + * previous map */ + while (READ_ONCE(vq->ref) == ref) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + /* Make sure ref counter was checked before any other + * operations that were done on the map. 
*/ + smp_mb(); +} + static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, int index, unsigned long start, @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, spin_lock(&vq->mmu_lock); ++vq->invalidate_count; - map = rcu_dereference_protected(vq->maps[index], - lockdep_is_held(&vq->mmu_lock)); + map = vq->maps[index]; if (map) { vhost_set_map_dirty(vq, map, index); - rcu_assign_pointer(vq->maps[index], NULL); + vq->maps[index] = NULL; } spin_unlock(&vq->mmu_lock); if (map) { - synchronize_rcu(); + vhost_vq_sync_access(vq); vhost_map_unprefetch(map); } } @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) for (i = 0; i < dev->nvqs; ++i) { vq = dev->vqs[i]; for (j = 0; j < VHOST_NUM_ADDRS; j++) - RCU_INIT_POINTER(vq->maps[j], NULL); + vq->maps[j] = NULL; } } #endif @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, vq->indirect = NULL; vq->heads = NULL; vq->dev = dev; + vq->ref = 0; mutex_init(&vq->mutex); spin_lock_init(&vq->mmu_lock); vhost_vq_reset(dev, vq); @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, map->npages = npages; map->pages = pages; - rcu_assign_pointer(vq->maps[index], map); + vq->maps[index] = map; /* No need for a synchronize_rcu(). This function should be * called by dev->worker so we are serialized with all * readers. 
@@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) struct vring_used *used; if (!vq->iotlb) { - rcu_read_lock(); + vhost_vq_access_map_begin(vq); - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); + map = vq->maps[VHOST_ADDR_USED]; if (likely(map)) { used = map->addr; *((__virtio16 *)&used->ring[vq->num]) = cpu_to_vhost16(vq, vq->avail_idx); - rcu_read_unlock(); + vhost_vq_access_map_end(vq); return 0; } - rcu_read_unlock(); + vhost_vq_access_map_end(vq); } #endif @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, size_t size; if (!vq->iotlb) { - rcu_read_lock(); + vhost_vq_access_map_begin(vq); - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); + map = vq->maps[VHOST_ADDR_USED]; if (likely(map)) { used = map->addr; size = count * sizeof(*head); memcpy(used->ring + idx, head, size); - rcu_read_unlock(); + vhost_vq_access_map_end(vq); return 0; } - rcu_read_unlock(); + vhost_vq_access_map_end(vq); } #endif @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) struct vring_used *used; if (!vq->iotlb) { - rcu_read_lock(); + vhost_vq_access_map_begin(vq); - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); + map = vq->maps[VHOST_ADDR_USED]; if (likely(map)) { used = map->addr; used->flags = cpu_to_vhost16(vq, vq->used_flags); - rcu_read_unlock(); + vhost_vq_access_map_end(vq); return 0; } - rcu_read_unlock(); + vhost_vq_access_map_end(vq); } #endif @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) struct vring_used *used; if (!vq->iotlb) { - rcu_read_lock(); + vhost_vq_access_map_begin(vq); - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); + map = vq->maps[VHOST_ADDR_USED]; if (likely(map)) { used = map->addr; used->idx = cpu_to_vhost16(vq, vq->last_used_idx); - rcu_read_unlock(); + vhost_vq_access_map_end(vq); return 0; } - rcu_read_unlock(); + vhost_vq_access_map_end(vq); } #endif @@ -1362,17 +1399,17 @@ static 
inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*idx = avail->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*head = avail->ring[idx & (vq->num - 1)];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*flags = avail->flags;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		vhost_vq_access_map_begin(vq);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*event = (__virtio16)avail->ring[vq->num];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*idx = used->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 	struct vring_desc *d;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
+		map = vq->maps[VHOST_ADDR_DESC];
 		if (likely(map)) {
 			d = map->addr;
 			*desc = *(d + idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 {
-	struct vhost_map __rcu *map;
+	struct vhost_map *map;
 	int i;
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[i]);
-		rcu_read_unlock();
+		map = vq->maps[i];
 		if (unlikely(!map))
 			vhost_map_prefetch(vq, i);
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a9a2a93857d2..f9e9558a529d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,16 +115,17 @@ struct vhost_virtqueue {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 	/* Read by memory accessors, modified by meta data
 	 * prefetching, MMU notifier and vring ioctl().
-	 * Synchonrized through mmu_lock (writers) and RCU (writers
-	 * and readers).
+	 * Synchronized through mmu_lock (writers) and ref counters,
+	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
 	 */
-	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
+	struct vhost_map *maps[VHOST_NUM_ADDRS];
 
 	/* Read by MMU notifier, modified by vring ioctl(),
 	 * synchronized through MMU notifier
 	 * registering/unregistering.
 	 */
 	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
 #endif
+	int ref;
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 
 	struct file *kick;
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
  2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang
@ 2019-07-31 8:50   ` Jason Wang
  2019-07-31 12:39   ` Jason Gunthorpe
  2019-07-31 18:29   ` Michael S. Tsirkin
  2 siblings, 0 replies; 51+ messages in thread
From: Jason Wang @ 2019-07-31 8:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg

On 2019/7/31 4:46 PM, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.

For reviewers, I tried hard to avoid e.g. smp_mb(); please double check
whether or not this trick works.

Thanks

>
> Consider the read critical section is pretty small the synchronization
> should be done very fast.
>
> Note the patch lead about 3% PPS dropping.
>
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>  drivers/vhost/vhost.h |   7 +-
>  2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>
>  	spin_lock(&vq->mmu_lock);
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		map[i] = rcu_dereference_protected(vq->maps[i],
> -				  lockdep_is_held(&vq->mmu_lock));
> +		map[i] = vq->maps[i];
>  		if (map[i]) {
>  			vhost_set_map_dirty(vq, map[i], i);
> -			rcu_assign_pointer(vq->maps[i], NULL);
> +			vq->maps[i] = NULL;
>  		}
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> -	 * serialized with memory accessors (e.g vq mutex held).
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
>  	 */
>
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	smp_store_release(&vq->ref, ref + 1);
> +	/* Make sure ref counter is visible before accessing the map */
> +	smp_load_acquire(&vq->ref);
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	/* Make sure vq access is done before increasing ref counter */
> +	smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> +	int ref;
> +
> +	/* Make sure map change was done before checking ref counter */
> +	smp_mb();
> +
> +	ref = READ_ONCE(vq->ref);
> +	if (ref & 0x1) {
> +		/* When ref change, we are sure no reader can see
> +		 * previous map */
> +		while (READ_ONCE(vq->ref) == ref) {
> +			set_current_state(TASK_RUNNING);
> +			schedule();
> +		}
> +	}
> +	/* Make sure ref counter was checked before any other
> +	 * operations that were done on map. */
> +	smp_mb();
> +}
> +
>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  				      int index,
>  				      unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  	spin_lock(&vq->mmu_lock);
>  	++vq->invalidate_count;
>
> -	map = rcu_dereference_protected(vq->maps[index],
> -					lockdep_is_held(&vq->mmu_lock));
> +	map = vq->maps[index];
>  	if (map) {
>  		vhost_set_map_dirty(vq, map, index);
> -		rcu_assign_pointer(vq->maps[index], NULL);
> +		vq->maps[index] = NULL;
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
>  	if (map) {
> -		synchronize_rcu();
> +		vhost_vq_sync_access(vq);
>  		vhost_map_unprefetch(map);
>  	}
>  }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		vq = dev->vqs[i];
>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> +			vq->maps[j] = NULL;
>  	}
>  }
>  #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		vq->ref = 0;
>  		mutex_init(&vq->mutex);
>  		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>  	map->npages = npages;
>  	map->pages = pages;
>
> -	rcu_assign_pointer(vq->maps[index], map);
> +	vq->maps[index] = map;
>  	/* No need for a synchronize_rcu(). This function should be
>  	 * called by dev->worker so we are serialized with all
>  	 * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			*((__virtio16 *)&used->ring[vq->num]) =
>  				cpu_to_vhost16(vq, vq->avail_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  	size_t size;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			size = count * sizeof(*head);
>  			memcpy(used->ring + idx, head, size);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*idx = avail->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*head = avail->ring[idx & (vq->num - 1)];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*flags = avail->flags;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		vhost_vq_access_map_begin(vq);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*event = (__virtio16)avail->ring[vq->num];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  	struct vring_used *used;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			*idx = used->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  	struct vring_desc *d;
>
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> +		map = vq->maps[VHOST_ADDR_DESC];
>  		if (likely(map)) {
>  			d = map->addr;
>  			*desc = *(d + idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>
> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>  static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>  {
> -	struct vhost_map __rcu *map;
> +	struct vhost_map *map;
>  	int i;
>
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[i]);
> -		rcu_read_unlock();
> +		map = vq->maps[i];
>  		if (unlikely(!map))
>  			vhost_map_prefetch(vq, i);
>  	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a9a2a93857d2..f9e9558a529d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>  	/* Read by memory accessors, modified by meta data
>  	 * prefetching, MMU notifier and vring ioctl().
> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> -	 * and readers).
> +	 * Synchronized through mmu_lock (writers) and ref counters,
> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>  	 */
> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>
>  	/* Read by MMU notifier, modified by vring ioctl(),
>  	 * synchronized through MMU notifier
>  	 * registering/unregistering.
>  	 */
>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>  #endif
> +	int ref;
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>
>  	struct file *kick;

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
  2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang
  2019-07-31 8:50   ` Jason Wang
@ 2019-07-31 12:39   ` Jason Gunthorpe
  2019-07-31 13:28     ` Jason Wang
  2019-07-31 18:29   ` Michael S. Tsirkin
  2 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2019-07-31 12:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm

On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized.

You just described a seqlock.

We've been talking about providing this as some core service from mmu
notifiers because nearly every use of this API needs it.

IMHO this gets the whole thing backwards, the common pattern is to
protect the 'shadow pte' data with a seqlock (usually open coded),
such that the mmu notifier side has the write side of that lock and
the read side is consumed by the thread accessing or updating the SPTE.

> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>  drivers/vhost/vhost.h |   7 +-
>  2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>
>  	spin_lock(&vq->mmu_lock);
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		map[i] = rcu_dereference_protected(vq->maps[i],
> -				  lockdep_is_held(&vq->mmu_lock));
> +		map[i] = vq->maps[i];
>  		if (map[i]) {
>  			vhost_set_map_dirty(vq, map[i], i);
> -			rcu_assign_pointer(vq->maps[i], NULL);
> +			vq->maps[i] = NULL;
>  		}
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> -	 * serialized with memory accessors (e.g vq mutex held).
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
>  	 */
>
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);

Is a lock/single threaded supposed to be held for this?

> +
> +	smp_store_release(&vq->ref, ref + 1);
> +	/* Make sure ref counter is visible before accessing the map */
> +	smp_load_acquire(&vq->ref);

release/acquire semantics are intended to protect blocks of related
data, so reading something with acquire and throwing away the result
is nonsense.

> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);

If the write to vq->ref is not locked this algorithm won't work, if it
is locked the READ_ONCE is not needed.

> +	/* Make sure vq access is done before increasing ref counter */
> +	smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> +	int ref;
> +
> +	/* Make sure map change was done before checking ref counter */
> +	smp_mb();

This is probably smp_rmb after reading ref, and if you are setting ref
with smp_store_release then this should be smp_load_acquire() without
an explicit mb.

> +	ref = READ_ONCE(vq->ref);
> +	if (ref & 0x1) {
> +		/* When ref change, we are sure no reader can see
> +		 * previous map */
> +		while (READ_ONCE(vq->ref) == ref) {
> +			set_current_state(TASK_RUNNING);
> +			schedule();
> +		}
> +	}

This is basically read_seqcount_begin() with a schedule instead of
cpu_relax

> +	/* Make sure ref counter was checked before any other
> +	 * operations that were done on map. */
> +	smp_mb();

should be in a smp_load_acquire()

> +}
> +
>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  				      int index,
>  				      unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  	spin_lock(&vq->mmu_lock);
>  	++vq->invalidate_count;
>
> -	map = rcu_dereference_protected(vq->maps[index],
> -					lockdep_is_held(&vq->mmu_lock));
> +	map = vq->maps[index];
>  	if (map) {
>  		vhost_set_map_dirty(vq, map, index);
> -		rcu_assign_pointer(vq->maps[index], NULL);
> +		vq->maps[index] = NULL;
>  	}
>  	spin_unlock(&vq->mmu_lock);
>
>  	if (map) {
> -		synchronize_rcu();
> +		vhost_vq_sync_access(vq);

What prevents racing with vhost_vq_access_map_end here?

>  		vhost_map_unprefetch(map);
>  	}
>  }

Overall I don't like it.

We are trying to get rid of these boutique mmu notifier patterns in
drivers.

Jason

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
  2019-07-31 12:39   ` Jason Gunthorpe
@ 2019-07-31 13:28     ` Jason Wang
  2019-07-31 19:30       ` Jason Gunthorpe
  0 siblings, 1 reply; 51+ messages in thread
From: Jason Wang @ 2019-07-31 13:28 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm

On 2019/7/31 8:39 PM, Jason Gunthorpe wrote:
> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>> We used to use RCU to synchronize MMU notifier with worker. This leads
>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>> system, there would be many factors that may slow down the
>> synchronize_rcu() which makes it unsuitable to be called in MMU
>> notifier.
>>
>> A solution is SRCU but its overhead is obvious with the expensive full
>> memory barrier. Another choice is to use seqlock, but it doesn't
>> provide a synchronization method between readers and writers. The last
>> choice is to use vq mutex, but it need to deal with the worst case
>> that MMU notifier must be blocked and wait for the finish of swap in.
>>
>> So this patch switches use a counter to track whether or not the map
>> was used. The counter was increased when vq try to start or finish
>> uses the map. This means, when it was even, we're sure there's no
>> readers and MMU notifier is synchronized. When it was odd, it means
>> there's a reader we need to wait it to be even again then we are
>> synchronized.
> You just described a seqlock.

Kind of, see my explanation below.

>
> We've been talking about providing this as some core service from mmu
> notifiers because nearly every use of this API needs it.

That would be very helpful.

>
> IMHO this gets the whole thing backwards, the common pattern is to
> protect the 'shadow pte' data with a seqlock (usually open coded),
> such that the mmu notifier side has the write side of that lock and
> the read side is consumed by the thread accessing or updating the SPTE.

Yes, I've considered something like that. But the problem is, the mmu
notifier (writer) needs to wait for the vhost worker to finish the read
before it can do things like setting dirty pages and unmapping pages. It
looks to me seqlock doesn't provide things like this.

Or are you suggesting taking the writer seq lock in the vhost worker and
busy waiting for the seqcount to be even in the MMU notifier (something
similar to what this patch did)? I don't do this because e.g:

write_seqcount_begin()

map = vq->map[X]

write or read through map->addr directly

write_seqcount_end()

There's no rmb() in write_seqcount_begin(), so map could be read before
write_seqcount_begin(), but it looks to me now that this doesn't harm at
all, maybe we can try this way.

>
>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>   drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>>   drivers/vhost/vhost.h |   7 +-
>>   2 files changed, 94 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index cfc11f9ed9c9..db2c81cb1e90 100644
>> +++ b/drivers/vhost/vhost.c
>> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>>
>>   	spin_lock(&vq->mmu_lock);
>>   	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> -		map[i] = rcu_dereference_protected(vq->maps[i],
>> -				  lockdep_is_held(&vq->mmu_lock));
>> +		map[i] = vq->maps[i];
>>   		if (map[i]) {
>>   			vhost_set_map_dirty(vq, map[i], i);
>> -			rcu_assign_pointer(vq->maps[i], NULL);
>> +			vq->maps[i] = NULL;
>>   		}
>>   	}
>>   	spin_unlock(&vq->mmu_lock);
>>
>> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
>> -	 * serialized with memory accessors (e.g vq mutex held).
>> +	/* No need for synchronization since we are serialized with
>> +	 * memory accessors (e.g vq mutex held).
>>   	 */
>>
>>   	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>   	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>>   }
>>
>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>> +{
>> +	int ref = READ_ONCE(vq->ref);
> Is a lock/single threaded supposed to be held for this?

Yes, only the vhost worker kthread can access this.

>
>> +
>> +	smp_store_release(&vq->ref, ref + 1);
>> +	/* Make sure ref counter is visible before accessing the map */
>> +	smp_load_acquire(&vq->ref);
> release/acquire semantics are intended to protect blocks of related
> data, so reading something with acquire and throwing away the result
> is nonsense.

Actually I wanted to use smp_mb() here, so I admit it's a trick that
won't even work. But now I think I can just use write_seqcount_begin()
here.

>
>> +}
>> +
>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>> +{
>> +	int ref = READ_ONCE(vq->ref);
> If the write to vq->ref is not locked this algorithm won't work, if it
> is locked the READ_ONCE is not needed.

Yes.

>
>> +	/* Make sure vq access is done before increasing ref counter */
>> +	smp_store_release(&vq->ref, ref + 1);
>> +}
>> +
>> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>> +{
>> +	int ref;
>> +
>> +	/* Make sure map change was done before checking ref counter */
>> +	smp_mb();
> This is probably smp_rmb after reading ref, and if you are setting ref
> with smp_store_release then this should be smp_load_acquire() without
> an explicit mb.

We had something like:

spin_lock();
vq->maps[index] = NULL;
spin_unlock();

vhost_vq_sync_access(vq);

we need to make sure the read of ref is done after setting
vq->maps[index] to NULL. It looks to me neither smp_load_acquire() nor
smp_store_release() can help in this case.

>
>> +	ref = READ_ONCE(vq->ref);
>> +	if (ref & 0x1) {
>> +		/* When ref change, we are sure no reader can see
>> +		 * previous map */
>> +		while (READ_ONCE(vq->ref) == ref) {
>> +			set_current_state(TASK_RUNNING);
>> +			schedule();
>> +		}
>> +	}
> This is basically read_seqcount_begin() with a schedule instead of
> cpu_relax

Yes it is.

>
>> +	/* Make sure ref counter was checked before any other
>> +	 * operations that were done on map. */
>> +	smp_mb();
> should be in a smp_load_acquire()

Right, if we use smp_load_acquire() to load the counter.

>
>> +}
>> +
>>   static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>   				      int index,
>>   				      unsigned long start,
>> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>   	spin_lock(&vq->mmu_lock);
>>   	++vq->invalidate_count;
>>
>> -	map = rcu_dereference_protected(vq->maps[index],
>> -					lockdep_is_held(&vq->mmu_lock));
>> +	map = vq->maps[index];
>>   	if (map) {
>>   		vhost_set_map_dirty(vq, map, index);
>> -		rcu_assign_pointer(vq->maps[index], NULL);
>> +		vq->maps[index] = NULL;
>>   	}
>>   	spin_unlock(&vq->mmu_lock);
>>
>>   	if (map) {
>> -		synchronize_rcu();
>> +		vhost_vq_sync_access(vq);
> What prevents racing with vhost_vq_access_map_end here?

vhost_vq_access_map_end() uses smp_store_release() for the counter. Is
this not sufficient?

>
>>   		vhost_map_unprefetch(map);
>>   	}
>>   }
> Overall I don't like it.
>
> We are trying to get rid of these boutique mmu notifier patterns in
> drivers.

I agree, so do you think we can take the write lock in the vhost worker
then wait for the counter to be even in the MMU notifier? It looks much
cleaner than this patch.

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
  2019-07-31 13:28     ` Jason Wang
@ 2019-07-31 19:30       ` Jason Gunthorpe
  2019-08-01  5:02         ` Jason Wang
  0 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2019-07-31 19:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm

On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
>
> On 2019/7/31 8:39 PM, Jason Gunthorpe wrote:
>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>>> We used to use RCU to synchronize MMU notifier with worker. This leads
>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>>> system, there would be many factors that may slow down the
>>> synchronize_rcu() which makes it unsuitable to be called in MMU
>>> notifier.
>>>
>>> A solution is SRCU but its overhead is obvious with the expensive full
>>> memory barrier. Another choice is to use seqlock, but it doesn't
>>> provide a synchronization method between readers and writers. The last
>>> choice is to use vq mutex, but it need to deal with the worst case
>>> that MMU notifier must be blocked and wait for the finish of swap in.
>>>
>>> So this patch switches use a counter to track whether or not the map
>>> was used. The counter was increased when vq try to start or finish
>>> uses the map. This means, when it was even, we're sure there's no
>>> readers and MMU notifier is synchronized. When it was odd, it means
>>> there's a reader we need to wait it to be even again then we are
>>> synchronized.
>> You just described a seqlock.
>
> Kind of, see my explanation below.
>
>> We've been talking about providing this as some core service from mmu
>> notifiers because nearly every use of this API needs it.
>
> That would be very helpful.
>
>> IMHO this gets the whole thing backwards, the common pattern is to
>> protect the 'shadow pte' data with a seqlock (usually open coded),
>> such that the mmu notifier side has the write side of that lock and
>> the read side is consumed by the thread accessing or updating the SPTE.
>
> Yes, I've considered something like that. But the problem is, the mmu
> notifier (writer) needs to wait for the vhost worker to finish the read
> before it can do things like setting dirty pages and unmapping pages. It
> looks to me seqlock doesn't provide things like this.

The seqlock is usually used to prevent a 2nd thread from accessing the
VA while it is being changed by the mm. ie you use something seqlocky
instead of the ugly mmu_notifier_unregister/register cycle.

You are supposed to use something simple like a spinlock or mutex
inside the invalidate_range_start to serialize tear down of the SPTEs
with their accessors.

> write_seqcount_begin()
>
> map = vq->map[X]
>
> write or read through map->addr directly
>
> write_seqcount_end()
>
> There's no rmb() in write_seqcount_begin(), so map could be read before
> write_seqcount_begin(), but it looks to me now that this doesn't harm at
> all, maybe we can try this way.

That is because it is a write side lock, not a read lock. IIRC
seqlocks have weaker barriers because the write side needs to be
serialized in some other way.

The requirement I see is you need invalidate_range_start to block
until another thread exits its critical section (ie stops accessing
the SPTEs).

That is a spinlock/mutex.

You just can't invent a faster spinlock by open coding something with
barriers, it doesn't work.

Jason

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-07-31 19:30 ` Jason Gunthorpe @ 2019-08-01 5:02 ` Jason Wang 2019-08-01 14:15 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-08-01 5:02 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/1 上午3:30, Jason Gunthorpe wrote: > On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote: >> On 2019/7/31 下午8:39, Jason Gunthorpe wrote: >>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: >>>> We used to use RCU to synchronize MMU notifier with worker. This leads >>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy >>>> system, there would be many factors that may slow down the >>>> synchronize_rcu() which makes it unsuitable to be called in MMU >>>> notifier. >>>> >>>> A solution is SRCU but its overhead is obvious with the expensive full >>>> memory barrier. Another choice is to use seqlock, but it doesn't >>>> provide a synchronization method between readers and writers. The last >>>> choice is to use vq mutex, but it need to deal with the worst case >>>> that MMU notifier must be blocked and wait for the finish of swap in. >>>> >>>> So this patch switches use a counter to track whether or not the map >>>> was used. The counter was increased when vq try to start or finish >>>> uses the map. This means, when it was even, we're sure there's no >>>> readers and MMU notifier is synchronized. When it was odd, it means >>>> there's a reader we need to wait it to be even again then we are >>>> synchronized. >>> You just described a seqlock. >> >> Kind of, see my explanation below. >> >> >>> We've been talking about providing this as some core service from mmu >>> notifiers because nearly every use of this API needs it. >> >> That would be very helpful. 
>> >> >>> IMHO this gets the whole thing backwards, the common pattern is to >>> protect the 'shadow pte' data with a seqlock (usually open coded), >>> such that the mmu notififer side has the write side of that lock and >>> the read side is consumed by the thread accessing or updating the SPTE. >> >> Yes, I've considered something like that. But the problem is, mmu notifier >> (writer) need to wait for the vhost worker to finish the read before it can >> do things like setting dirty pages and unmapping page. It looks to me >> seqlock doesn't provide things like this. > The seqlock is usually used to prevent a 2nd thread from accessing the > VA while it is being changed by the mm. ie you use something seqlocky > instead of the ugly mmu_notifier_unregister/register cycle. Yes, so we have two mappings: [1] vring address to VA [2] VA to PA And have several readers and writers 1) set_vring_num_addr(): writer of both [1] and [2] 2) MMU notifier: reader of [1] writer of [2] 3) GUP: reader of [1] writer of [2] 4) memory accessors: reader of [1] and [2] Fortunately, 1) 3) and 4) have already synchronized through vq->mutex. We only need to deal with synchronization between 2) and each of the reset: Sync between 1) and 2): For mapping [1], I do mmu_notifier_unregister/register. This help to avoid holding any lock to do overlap check. Anyway we only care about one or three pages , but the whole guest memory could be several TBs. For mapping [2], both 1) and 2) are writers, so use spinlock (mmu_lock) to synchronize. Sync between 2) and 3): For mapping [1], both are readers, no need any synchronization. For mapping [2], both 2) and 3) are writers, so synchronize through spinlock (mmu_lock); Sync between 2) and 4): For mapping [1], both are readers, no need any synchronization. For mapping [2], synchronize through RCU (or something simliar to seqlock). You suggestion is about the synchronization of [1] which may make sense, but it could be done on top as an optimization. 
What this patch tries to do is to not use RCU for [2]. Of course, the simplest way is to use the vq mutex in 2), but it means: - we must hold the vq lock to check range overlap - since the critical section is enlarged, the worst case is to wait for guest memory to be swapped in, which could be even slower than synchronize_rcu(). > > You are supposed to use something simple like a spinlock or mutex > inside the invalidate_range_start to serialized tear down of the SPTEs > with their accessors. Technically yes, but we probably can't afford that for the vhost fast path: the atomics eliminate almost all of the performance improvement brought by this patch on a machine without SMAP. > >> write_seqcount_begin() >> >> map = vq->map[X] >> >> write or read through map->addr directly >> >> write_seqcount_end() >> >> >> There's no rmb() in write_seqcount_begin(), so map could be read before >> write_seqcount_begin(), but it looks to me now that this doesn't harm at >> all, maybe we can try this way. > That is because it is a write side lock, not a read lock. IIRC > seqlocks have weaker barriers because the write side needs to be > serialized in some other way. Yes. After thinking hard about the code, it looks to me that write_seqcount_begin()/end() is sufficient here: - The notifier will only assign NULL to map, so it doesn't harm to read map before the seqcount; we will just fall back to the normal copy_from/to_user() slow path earlier - if we write through map->addr, it should be done before increasing the seqcount because of the smp_wmb() in write_seqcount_end() - if we read through map->addr, which also contains a store to a pointer, we have a good data dependency so smp_wmb() also works here. > > The requirement I see is you need invalidate_range_start to block > until another thread exits its critical section (ie stops accessing > the SPTEs). Yes. > > That is a spinlock/mutex. Or semantics similar to RCU. > > You just can't invent a faster spinlock by open coding something with > barriers, it doesn't work. 
> > Jason If write_seqlock() works here, we can simply wait for the seqcount to advance in the MMU notifier. The original idea was to use RCU, which solves this perfectly. But as pointed out, it could be slow. Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-01 5:02 ` Jason Wang @ 2019-08-01 14:15 ` Jason Gunthorpe 2019-08-02 9:40 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-01 14:15 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang wrote: > > On 2019/8/1 上午3:30, Jason Gunthorpe wrote: > > On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote: > > > On 2019/7/31 下午8:39, Jason Gunthorpe wrote: > > > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: > > > > > We used to use RCU to synchronize MMU notifier with worker. This leads > > > > > calling synchronize_rcu() in invalidate_range_start(). But on a busy > > > > > system, there would be many factors that may slow down the > > > > > synchronize_rcu() which makes it unsuitable to be called in MMU > > > > > notifier. > > > > > > > > > > A solution is SRCU but its overhead is obvious with the expensive full > > > > > memory barrier. Another choice is to use seqlock, but it doesn't > > > > > provide a synchronization method between readers and writers. The last > > > > > choice is to use vq mutex, but it need to deal with the worst case > > > > > that MMU notifier must be blocked and wait for the finish of swap in. > > > > > > > > > > So this patch switches use a counter to track whether or not the map > > > > > was used. The counter was increased when vq try to start or finish > > > > > uses the map. This means, when it was even, we're sure there's no > > > > > readers and MMU notifier is synchronized. When it was odd, it means > > > > > there's a reader we need to wait it to be even again then we are > > > > > synchronized. > > > > You just described a seqlock. > > > > > > Kind of, see my explanation below. 
> > > > > > > > > > We've been talking about providing this as some core service from mmu > > > > notifiers because nearly every use of this API needs it. > > > > > > That would be very helpful. > > > > > > > > > > IMHO this gets the whole thing backwards, the common pattern is to > > > > protect the 'shadow pte' data with a seqlock (usually open coded), > > > > such that the mmu notififer side has the write side of that lock and > > > > the read side is consumed by the thread accessing or updating the SPTE. > > > > > > Yes, I've considered something like that. But the problem is, mmu notifier > > > (writer) need to wait for the vhost worker to finish the read before it can > > > do things like setting dirty pages and unmapping page. It looks to me > > > seqlock doesn't provide things like this. > > The seqlock is usually used to prevent a 2nd thread from accessing the > > VA while it is being changed by the mm. ie you use something seqlocky > > instead of the ugly mmu_notifier_unregister/register cycle. > > > Yes, so we have two mappings: > > [1] vring address to VA > [2] VA to PA > > And have several readers and writers > > 1) set_vring_num_addr(): writer of both [1] and [2] > 2) MMU notifier: reader of [1] writer of [2] > 3) GUP: reader of [1] writer of [2] > 4) memory accessors: reader of [1] and [2] > > Fortunately, 1) 3) and 4) have already synchronized through vq->mutex. We > only need to deal with synchronization between 2) and each of the reset: > Sync between 1) and 2): For mapping [1], I do > mmu_notifier_unregister/register. This help to avoid holding any lock to do > overlap check. I suspect you could have done this with a RCU technique instead of register/unregister. > Sync between 2) and 4): For mapping [1], both are readers, no need any > synchronization. For mapping [2], synchronize through RCU (or something > simliar to seqlock). 
You can't really use a seqlock; seqlocks are collision-retry locks, and the semantic here is that invalidate_range_start *MUST* not continue until the thread doing #4 above is guaranteed to no longer be touching the memory. This must be a proper barrier, like a spinlock, mutex, or synchronize_rcu. And, again, you can't re-invent a spinlock with open coding and get something better. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-01 14:15 ` Jason Gunthorpe @ 2019-08-02 9:40 ` Jason Wang 2019-08-02 12:46 ` Jason Gunthorpe 2019-08-02 14:03 ` Michael S. Tsirkin 0 siblings, 2 replies; 51+ messages in thread From: Jason Wang @ 2019-08-02 9:40 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/1 下午10:15, Jason Gunthorpe wrote: > On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang wrote: >> On 2019/8/1 上午3:30, Jason Gunthorpe wrote: >>> On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote: >>>> On 2019/7/31 下午8:39, Jason Gunthorpe wrote: >>>>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: >>>>>> We used to use RCU to synchronize MMU notifier with worker. This leads >>>>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy >>>>>> system, there would be many factors that may slow down the >>>>>> synchronize_rcu() which makes it unsuitable to be called in MMU >>>>>> notifier. >>>>>> >>>>>> A solution is SRCU but its overhead is obvious with the expensive full >>>>>> memory barrier. Another choice is to use seqlock, but it doesn't >>>>>> provide a synchronization method between readers and writers. The last >>>>>> choice is to use vq mutex, but it need to deal with the worst case >>>>>> that MMU notifier must be blocked and wait for the finish of swap in. >>>>>> >>>>>> So this patch switches use a counter to track whether or not the map >>>>>> was used. The counter was increased when vq try to start or finish >>>>>> uses the map. This means, when it was even, we're sure there's no >>>>>> readers and MMU notifier is synchronized. When it was odd, it means >>>>>> there's a reader we need to wait it to be even again then we are >>>>>> synchronized. >>>>> You just described a seqlock. >>>> Kind of, see my explanation below. 
>>>> >>>> >>>>> We've been talking about providing this as some core service from mmu >>>>> notifiers because nearly every use of this API needs it. >>>> That would be very helpful. >>>> >>>> >>>>> IMHO this gets the whole thing backwards, the common pattern is to >>>>> protect the 'shadow pte' data with a seqlock (usually open coded), >>>>> such that the mmu notififer side has the write side of that lock and >>>>> the read side is consumed by the thread accessing or updating the SPTE. >>>> Yes, I've considered something like that. But the problem is, mmu notifier >>>> (writer) need to wait for the vhost worker to finish the read before it can >>>> do things like setting dirty pages and unmapping page. It looks to me >>>> seqlock doesn't provide things like this. >>> The seqlock is usually used to prevent a 2nd thread from accessing the >>> VA while it is being changed by the mm. ie you use something seqlocky >>> instead of the ugly mmu_notifier_unregister/register cycle. >> >> Yes, so we have two mappings: >> >> [1] vring address to VA >> [2] VA to PA >> >> And have several readers and writers >> >> 1) set_vring_num_addr(): writer of both [1] and [2] >> 2) MMU notifier: reader of [1] writer of [2] >> 3) GUP: reader of [1] writer of [2] >> 4) memory accessors: reader of [1] and [2] >> >> Fortunately, 1) 3) and 4) have already synchronized through vq->mutex. We >> only need to deal with synchronization between 2) and each of the reset: >> Sync between 1) and 2): For mapping [1], I do >> mmu_notifier_unregister/register. This help to avoid holding any lock to do >> overlap check. > I suspect you could have done this with a RCU technique instead of > register/unregister. Probably. But the issue to be addressed by this patch is the synchronization between MMU notifier and vhost worker. > >> Sync between 2) and 4): For mapping [1], both are readers, no need any >> synchronization. For mapping [2], synchronize through RCU (or something >> simliar to seqlock). 
> You can't really use a seqlock, seqlocks are collision-retry locks, > and the semantic here is that invalidate_range_start *MUST* not > continue until thread doing #4 above is guarenteed no longer touching > the memory. Yes, that's the tricky part. For hardware like the CPU, kicking through an IPI is sufficient for synchronization. But for a vhost kthread, it requires a low-overhead synchronization method. > > This must be a proper barrier, like a spinlock, mutex, or > synchronize_rcu. I started with synchronize_rcu() but both you and Michael raised some concerns. Then I tried spinlock and mutex: 1) spinlock: adds lots of overhead on the datapath; this leads to 0 performance improvement. 2) SRCU: a full memory barrier is required by srcu_read_lock(), which still yields little performance improvement 3) mutex: a possible issue is the need to wait for the page to be swapped in (is this unacceptable?), another issue is that we need to hold the vq lock during the range overlap check. 4) using vhost_flush_work() instead of synchronize_rcu(): still need to wait for swap, but can do the overlap checking without the lock > > And, again, you can't re-invent a spinlock with open coding and get > something better. So the question is whether waiting for swap is considered unsuitable for MMU notifiers. If not, it would simplify the code. If it is, we still need to figure out a possible solution. Btw, I came up with another idea: disable preemption when the vhost thread needs to access the memory, then register a preempt notifier; if the vhost thread is preempted, we're sure no one will access the memory and we can do the cleanup. Thanks > > Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 9:40 ` Jason Wang @ 2019-08-02 12:46 ` Jason Gunthorpe 2019-08-02 14:27 ` Michael S. Tsirkin 2019-08-05 4:20 ` Jason Wang 2019-08-02 14:03 ` Michael S. Tsirkin 1 sibling, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-02 12:46 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > This must be a proper barrier, like a spinlock, mutex, or > > synchronize_rcu. > > > I start with synchronize_rcu() but both you and Michael raise some > concern. I've also idly wondered if calling synchronize_rcu() under the various mm locks is a deadlock situation. > Then I try spinlock and mutex: > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > improvement. I think the topic here is correctness, not performance improvement. > 2) SRCU: full memory barrier requires on srcu_read_lock(), which still leads > little performance improvement > 3) mutex: a possible issue is need to wait for the page to be swapped in (is > this unacceptable ?), another issue is that we need hold vq lock during > range overlap check. I have a feeling that mmu notifiers cannot safely become dependent on progress of swap without causing deadlock. You probably should avoid this. > > And, again, you can't re-invent a spinlock with open coding and get > something better. > > So the question is if waiting for swap is considered to be unsuitable for > MMU notifiers. If not, it would simplify codes. If not, we still need to > figure out a possible solution. > > Btw, I come up another idea, that is to disable preemption when vhost thread > need to access the memory. Then register preempt notifier and if vhost > thread is preempted, we're sure no one will access the memory and can do the > cleanup. 
I think you should use the spinlock so that at least the code is obviously functionally correct, and worry about designing a properly justified performance change afterwards. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 12:46 ` Jason Gunthorpe @ 2019-08-02 14:27 ` Michael S. Tsirkin 2019-08-02 17:24 ` Jason Gunthorpe 2019-08-05 4:36 ` Jason Wang 1 sibling, 2 replies; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-02 14:27 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > This must be a proper barrier, like a spinlock, mutex, or > > > synchronize_rcu. > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > concern. > > I've also idly wondered if calling synchronize_rcu() under the various > mm locks is a deadlock situation. > > > Then I try spinlock and mutex: > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > improvement. > > I think the topic here is correctness not performance improvement The topic is whether we should revert commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") or keep it in. The only reason to keep it is performance. Now as long as all this code is disabled anyway, we can experiment a bit. I personally feel we would be best served by having two code paths: - Access to VM memory directly mapped into kernel - Access to userspace Having it all cleanly split will allow a bunch of optimizations, for example for years now we planned to be able to process an incoming short packet directly on softirq path, or an outgoing one directly within eventfd. -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 14:27 ` Michael S. Tsirkin @ 2019-08-02 17:24 ` Jason Gunthorpe 2019-08-03 21:36 ` Michael S. Tsirkin 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-02 17:24 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote: > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > synchronize_rcu. > > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > concern. > > > > I've also idly wondered if calling synchronize_rcu() under the various > > mm locks is a deadlock situation. > > > > > Then I try spinlock and mutex: > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > improvement. > > > > I think the topic here is correctness not performance improvement > > The topic is whether we should revert > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > or keep it in. The only reason to keep it is performance. Yikes, I'm not sure you can ever win against copy_from_user using mmu_notifiers? The synchronization requirements are likely always more expensive unless large and scattered copies are being done.. The rcu is about the only simple approach that could be less expensive, and that gets back to the question if you can block an invalidate_range_start in synchronize_rcu or not.. So, frankly, I'd revert it until someone could prove the rcu solution is OK.. BTW, how do you get copy_from_user to work outside a syscall? Also, why can't this just permanently GUP the pages? In fact, where does it put_page them anyhow? 
Worrying that 7f466 adds a get_user page but does not add a put_page?? Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 17:24 ` Jason Gunthorpe @ 2019-08-03 21:36 ` Michael S. Tsirkin 2019-08-04 0:14 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-03 21:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote: > On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote: > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > > synchronize_rcu. > > > > > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > > concern. > > > > > > I've also idly wondered if calling synchronize_rcu() under the various > > > mm locks is a deadlock situation. > > > > > > > Then I try spinlock and mutex: > > > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > > improvement. > > > > > > I think the topic here is correctness not performance improvement > > > > The topic is whether we should revert > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > > > or keep it in. The only reason to keep it is performance. > > Yikes, I'm not sure you can ever win against copy_from_user using > mmu_notifiers? Ever since copy_from_user started playing with flags (for SMAP) and added speculation barriers there's a chance we can win by accessing memory through the kernel address. Another reason would be to access it from e.g. softirq context. copy_from_user will only work if the correct mmu is active. > The synchronization requirements are likely always > more expensive unless large and scattered copies are being done.. 
> > The rcu is about the only simple approach that could be less > expensive, and that gets back to the question if you can block an > invalidate_start_range in synchronize_rcu or not.. > > So, frankly, I'd revert it until someone could prove the rcu solution is > OK.. I have it all disabled at compile time, so reverting isn't urgent anymore. I'll wait a couple more days to decide what's cleanest. > BTW, how do you get copy_from_user to work outside a syscall? By switching to the correct mm. > > Also, why can't this just permanently GUP the pages? In fact, where > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > but does not add a put_page?? > > Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-03 21:36 ` Michael S. Tsirkin @ 2019-08-04 0:14 ` Jason Gunthorpe 2019-08-04 8:07 ` Michael S. Tsirkin 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-04 0:14 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote: > On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote: > > On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote: > > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > > > synchronize_rcu. > > > > > > > > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > > > concern. > > > > > > > > I've also idly wondered if calling synchronize_rcu() under the various > > > > mm locks is a deadlock situation. > > > > > > > > > Then I try spinlock and mutex: > > > > > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > > > improvement. > > > > > > > > I think the topic here is correctness not performance improvement > > > > > > The topic is whether we should revert > > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > > > > > or keep it in. The only reason to keep it is performance. > > > > Yikes, I'm not sure you can ever win against copy_from_user using > > mmu_notifiers? > > Ever since copy_from_user started playing with flags (for SMAP) and > added speculation barriers there's a chance we can win by accessing > memory through the kernel address. You think copy_to_user will be more expensive than the minimum two atomics required to synchronize with another thread? > > Also, why can't this just permanently GUP the pages? 
In fact, where > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > > but does not add a put_page?? You didn't answer this.. Why not just use GUP? Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-04 0:14 ` Jason Gunthorpe @ 2019-08-04 8:07 ` Michael S. Tsirkin 2019-08-05 4:39 ` Jason Wang 2019-08-06 11:53 ` Jason Gunthorpe 0 siblings, 2 replies; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-04 8:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Sat, Aug 03, 2019 at 09:14:00PM -0300, Jason Gunthorpe wrote: > On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote: > > On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote: > > > On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote: > > > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > > > > synchronize_rcu. > > > > > > > > > > > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > > > > concern. > > > > > > > > > > I've also idly wondered if calling synchronize_rcu() under the various > > > > > mm locks is a deadlock situation. > > > > > > > > > > > Then I try spinlock and mutex: > > > > > > > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > > > > improvement. > > > > > > > > > > I think the topic here is correctness not performance improvement > > > > > > > > The topic is whether we should revert > > > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > > > > > > > or keep it in. The only reason to keep it is performance. > > > > > > Yikes, I'm not sure you can ever win against copy_from_user using > > > mmu_notifiers? > > > > Ever since copy_from_user started playing with flags (for SMAP) and > > added speculation barriers there's a chance we can win by accessing > > memory through the kernel address. 
> > You think copy_to_user will be more expensive than the minimum two > atomics required to synchronize with another thread? I frankly don't know. With SMAP you flip flags twice, and with spectre you flush the pipeline. Is that cheaper or more expensive than an atomic operation? Testing is the only way to tell. > > > Also, why can't this just permanently GUP the pages? In fact, where > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > > > but does not add a put_page?? > > You didn't answer this.. Why not just use GUP? > > Jason Sorry I misunderstood the question. Permanent GUP breaks lots of functionality we need such as THP and numa balancing. release_pages is used instead of put_page. -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-04 8:07 ` Michael S. Tsirkin @ 2019-08-05 4:39 ` Jason Wang 2019-08-06 11:53 ` Jason Gunthorpe 1 sibling, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 4:39 UTC (permalink / raw) To: Michael S. Tsirkin, Jason Gunthorpe Cc: kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/4 下午4:07, Michael S. Tsirkin wrote: > On Sat, Aug 03, 2019 at 09:14:00PM -0300, Jason Gunthorpe wrote: >> On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote: >>> On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote: >>>> On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote: >>>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: >>>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>>>>>> This must be a proper barrier, like a spinlock, mutex, or >>>>>>>> synchronize_rcu. >>>>>>> >>>>>>> I start with synchronize_rcu() but both you and Michael raise some >>>>>>> concern. >>>>>> I've also idly wondered if calling synchronize_rcu() under the various >>>>>> mm locks is a deadlock situation. >>>>>> >>>>>>> Then I try spinlock and mutex: >>>>>>> >>>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance >>>>>>> improvement. >>>>>> I think the topic here is correctness not performance improvement >>>>> The topic is whether we should revert >>>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") >>>>> >>>>> or keep it in. The only reason to keep it is performance. >>>> Yikes, I'm not sure you can ever win against copy_from_user using >>>> mmu_notifiers? >>> Ever since copy_from_user started playing with flags (for SMAP) and >>> added speculation barriers there's a chance we can win by accessing >>> memory through the kernel address. >> You think copy_to_user will be more expensive than the minimum two >> atomics required to synchronize with another thread? 
> I frankly don't know. With SMAP you flip flags twice, and with spectre > you flush the pipeline. Is that cheaper or more expensive than an atomic > operation? Testing is the only way to tell. Let me test; so far I have only tested on a non-SMAP machine. Switching to a spinlock kills all of the performance improvement. Thanks > >>>> Also, why can't this just permanently GUP the pages? In fact, where >>>> does it put_page them anyhow? Worrying that 7f466 adds a get_user page >>>> but does not add a put_page?? >> You didn't answer this.. Why not just use GUP? >> >> Jason > Sorry I misunderstood the question. Permanent GUP breaks lots of > functionality we need such as THP and numa balancing. > > release_pages is used instead of put_page. > > > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-04 8:07 ` Michael S. Tsirkin 2019-08-05 4:39 ` Jason Wang @ 2019-08-06 11:53 ` Jason Gunthorpe 2019-08-06 13:36 ` Michael S. Tsirkin 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-06 11:53 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Sun, Aug 04, 2019 at 04:07:17AM -0400, Michael S. Tsirkin wrote: > > > > Also, why can't this just permanently GUP the pages? In fact, where > > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > > > > but does not add a put_page?? > > > > You didn't answer this.. Why not just use GUP? > > > > Jason > > Sorry I misunderstood the question. Permanent GUP breaks lots of > functionality we need such as THP and numa balancing. Really? It doesn't look like that many pages are involved.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-06 11:53 ` Jason Gunthorpe @ 2019-08-06 13:36 ` Michael S. Tsirkin 2019-08-06 13:40 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-06 13:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Tue, Aug 06, 2019 at 08:53:17AM -0300, Jason Gunthorpe wrote: > On Sun, Aug 04, 2019 at 04:07:17AM -0400, Michael S. Tsirkin wrote: > > > > > Also, why can't this just permanently GUP the pages? In fact, where > > > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > > > > > but does not add a put_page?? > > > > > > You didn't answer this.. Why not just use GUP? > > > > > > Jason > > > > Sorry I misunderstood the question. Permanent GUP breaks lots of > > functionality we need such as THP and numa balancing. > > Really? It doesn't look like that many pages are involved.. > > Jason Yea. But they just might happen to be heavily accessed ones.... -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-06 13:36 ` Michael S. Tsirkin @ 2019-08-06 13:40 ` Jason Gunthorpe 0 siblings, 0 replies; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-06 13:40 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm On Tue, Aug 06, 2019 at 09:36:58AM -0400, Michael S. Tsirkin wrote: > On Tue, Aug 06, 2019 at 08:53:17AM -0300, Jason Gunthorpe wrote: > > On Sun, Aug 04, 2019 at 04:07:17AM -0400, Michael S. Tsirkin wrote: > > > > > > Also, why can't this just permanently GUP the pages? In fact, where > > > > > > does it put_page them anyhow? Worrying that 7f466 adds a get_user page > > > > > > but does not add a put_page?? > > > > > > > > You didn't answer this.. Why not just use GUP? > > > > > > > > Jason > > > > > > Sorry I misunderstood the question. Permanent GUP breaks lots of > > > functionality we need such as THP and numa balancing. > > > > Really? It doesn't look like that many pages are involved.. > > > > Jason > > Yea. But they just might happen to be heavily accessed ones.... Maybe you can solve the numa balance problem some other way and use normal GUP.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 14:27 ` Michael S. Tsirkin 2019-08-02 17:24 ` Jason Gunthorpe @ 2019-08-05 4:36 ` Jason Wang 2019-08-05 4:41 ` Jason Wang 2019-08-05 6:30 ` Michael S. Tsirkin 1 sibling, 2 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 4:36 UTC (permalink / raw) To: Michael S. Tsirkin, Jason Gunthorpe Cc: kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: >> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>> This must be a proper barrier, like a spinlock, mutex, or >>>> synchronize_rcu. >>> >>> I start with synchronize_rcu() but both you and Michael raise some >>> concern. >> I've also idly wondered if calling synchronize_rcu() under the various >> mm locks is a deadlock situation. >> >>> Then I try spinlock and mutex: >>> >>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance >>> improvement. >> I think the topic here is correctness not performance improvement > The topic is whether we should revert > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > or keep it in. The only reason to keep it is performance. Maybe it's time to introduce the config option? > > Now as long as all this code is disabled anyway, we can experiment a > bit. > > I personally feel we would be best served by having two code paths: > > - Access to VM memory directly mapped into kernel > - Access to userspace > > > Having it all cleanly split will allow a bunch of optimizations, for > example for years now we planned to be able to process an incoming short > packet directly on softirq path, or an outgoing on directly within > eventfd. It's not hard, considering we already have our own accessors. But the question is (as asked in another thread): do you want permanent GUP, or to still use MMU notifiers? 
Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 4:36 ` Jason Wang @ 2019-08-05 4:41 ` Jason Wang 2019-08-05 6:40 ` Michael S. Tsirkin 2019-08-05 6:30 ` Michael S. Tsirkin 1 sibling, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-08-05 4:41 UTC (permalink / raw) To: Michael S. Tsirkin, Jason Gunthorpe Cc: linux-mm, netdev, linux-kernel, kvm, virtualization On 2019/8/5 下午12:36, Jason Wang wrote: > > On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: >> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: >>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>>> This must be a proper barrier, like a spinlock, mutex, or >>>>> synchronize_rcu. >>>> >>>> I start with synchronize_rcu() but both you and Michael raise some >>>> concern. >>> I've also idly wondered if calling synchronize_rcu() under the various >>> mm locks is a deadlock situation. >>> >>>> Then I try spinlock and mutex: >>>> >>>> 1) spinlock: add lots of overhead on datapath, this leads 0 >>>> performance >>>> improvement. >>> I think the topic here is correctness not performance improvement >> The topic is whether we should revert >> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual >> address") >> >> or keep it in. The only reason to keep it is performance. > > > Maybe it's time to introduce the config option? Or does it make sense if I post a V3 with: - introduce config option and disable the optimization by default - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the same This can give us some breath to decide which way should go for next release? Thanks > > >> >> Now as long as all this code is disabled anyway, we can experiment a >> bit. 
>> >> I personally feel we would be best served by having two code paths: >> >> - Access to VM memory directly mapped into kernel >> - Access to userspace >> >> >> Having it all cleanly split will allow a bunch of optimizations, for >> example for years now we planned to be able to process an incoming short >> packet directly on softirq path, or an outgoing on directly within >> eventfd. > > > It's not hard consider we've already had our own accssors. But the > question is (as asked in another thread), do you want permanent GUP or > still use MMU notifiers. > > Thanks > > _______________________________________________ > Virtualization mailing list > Virtualization@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 4:41 ` Jason Wang @ 2019-08-05 6:40 ` Michael S. Tsirkin 2019-08-05 8:24 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-05 6:40 UTC (permalink / raw) To: Jason Wang Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm, virtualization On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote: > > On 2019/8/5 下午12:36, Jason Wang wrote: > > > > On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: > > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > > > synchronize_rcu. > > > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > > > concern. > > > > I've also idly wondered if calling synchronize_rcu() under the various > > > > mm locks is a deadlock situation. > > > > > > > > > Then I try spinlock and mutex: > > > > > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 > > > > > performance > > > > > improvement. > > > > I think the topic here is correctness not performance improvement > > > The topic is whether we should revert > > > commit 7f466032dc9 ("vhost: access vq metadata through kernel > > > virtual address") > > > > > > or keep it in. The only reason to keep it is performance. > > > > > > Maybe it's time to introduce the config option? > > > Or does it make sense if I post a V3 with: > > - introduce config option and disable the optimization by default > > - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the > same > > This can give us some breath to decide which way should go for next release? > > Thanks As is, with preempt enabled? Nope I don't think blocking an invalidator on swap IO is ok, so I don't believe this stuff is going into this release at this point. 
So it's more a question of whether it's better to revert and apply a clean patch on top, or just keep the code around but disabled with an ifdef as is. I'm open to both options, and would like your opinion on this. > > > > > > > > > > > Now as long as all this code is disabled anyway, we can experiment a > > > bit. > > > > > > I personally feel we would be best served by having two code paths: > > > > > > - Access to VM memory directly mapped into kernel > > > - Access to userspace > > > > > > > > > Having it all cleanly split will allow a bunch of optimizations, for > > > example for years now we planned to be able to process an incoming short > > > packet directly on softirq path, or an outgoing on directly within > > > eventfd. > > > > > > It's not hard consider we've already had our own accssors. But the > > question is (as asked in another thread), do you want permanent GUP or > > still use MMU notifiers. > > > > Thanks > > > > _______________________________________________ > > Virtualization mailing list > > Virtualization@lists.linux-foundation.org > > https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 6:40 ` Michael S. Tsirkin @ 2019-08-05 8:24 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 8:24 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm, virtualization On 2019/8/5 下午2:40, Michael S. Tsirkin wrote: > On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote: >> On 2019/8/5 下午12:36, Jason Wang wrote: >>> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: >>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: >>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>>>>> This must be a proper barrier, like a spinlock, mutex, or >>>>>>> synchronize_rcu. >>>>>> I start with synchronize_rcu() but both you and Michael raise some >>>>>> concern. >>>>> I've also idly wondered if calling synchronize_rcu() under the various >>>>> mm locks is a deadlock situation. >>>>> >>>>>> Then I try spinlock and mutex: >>>>>> >>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 >>>>>> performance >>>>>> improvement. >>>>> I think the topic here is correctness not performance improvement >>>> The topic is whether we should revert >>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel >>>> virtual address") >>>> >>>> or keep it in. The only reason to keep it is performance. >>> >>> Maybe it's time to introduce the config option? >> >> Or does it make sense if I post a V3 with: >> >> - introduce config option and disable the optimization by default >> >> - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the >> same >> >> This can give us some breath to decide which way should go for next release? >> >> Thanks > As is, with preempt enabled? Nope I don't think blocking an invalidator > on swap IO is ok, so I don't believe this stuff is going into this > release at this point. 
> > So it's more a question of whether it's better to revert and apply a clean > patch on top, or just keep the code around but disabled with an ifdef as is. > I'm open to both options, and would like your opinion on this. Then I prefer to leave current code (VHOST_ARCH_CAN_ACCEL to 0) as is. This can also save efforts on rebasing packed virtqueues. Thanks > >>> >>>> Now as long as all this code is disabled anyway, we can experiment a >>>> bit. >>>> >>>> I personally feel we would be best served by having two code paths: >>>> >>>> - Access to VM memory directly mapped into kernel >>>> - Access to userspace >>>> >>>> >>>> Having it all cleanly split will allow a bunch of optimizations, for >>>> example for years now we planned to be able to process an incoming short >>>> packet directly on softirq path, or an outgoing on directly within >>>> eventfd. >>> >>> It's not hard consider we've already had our own accssors. But the >>> question is (as asked in another thread), do you want permanent GUP or >>> still use MMU notifiers. >>> >>> Thanks >>> >>> _______________________________________________ >>> Virtualization mailing list >>> Virtualization@lists.linux-foundation.org >>> https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 4:36 ` Jason Wang 2019-08-05 4:41 ` Jason Wang @ 2019-08-05 6:30 ` Michael S. Tsirkin 2019-08-05 8:22 ` Jason Wang 1 sibling, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-05 6:30 UTC (permalink / raw) To: Jason Wang Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote: > > On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > > synchronize_rcu. > > > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > > concern. > > > I've also idly wondered if calling synchronize_rcu() under the various > > > mm locks is a deadlock situation. > > > > > > > Then I try spinlock and mutex: > > > > > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > > improvement. > > > I think the topic here is correctness not performance improvement > > The topic is whether we should revert > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") > > > > or keep it in. The only reason to keep it is performance. > > > Maybe it's time to introduce the config option? Depending on CONFIG_BROKEN? I'm not sure it's a good idea. > > > > > Now as long as all this code is disabled anyway, we can experiment a > > bit. > > > > I personally feel we would be best served by having two code paths: > > > > - Access to VM memory directly mapped into kernel > > - Access to userspace > > > > > > Having it all cleanly split will allow a bunch of optimizations, for > > example for years now we planned to be able to process an incoming short > > packet directly on softirq path, or an outgoing on directly within > > eventfd. 
> > > It's not hard consider we've already had our own accssors. But the question > is (as asked in another thread), do you want permanent GUP or still use MMU > notifiers. > > Thanks We want THP and NUMA to work. Both are important for performance. -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 6:30 ` Michael S. Tsirkin @ 2019-08-05 8:22 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 8:22 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/5 下午2:30, Michael S. Tsirkin wrote: > On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote: >> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote: >>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote: >>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>>>> This must be a proper barrier, like a spinlock, mutex, or >>>>>> synchronize_rcu. >>>>> I start with synchronize_rcu() but both you and Michael raise some >>>>> concern. >>>> I've also idly wondered if calling synchronize_rcu() under the various >>>> mm locks is a deadlock situation. >>>> >>>>> Then I try spinlock and mutex: >>>>> >>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance >>>>> improvement. >>>> I think the topic here is correctness not performance improvement >>> The topic is whether we should revert >>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address") >>> >>> or keep it in. The only reason to keep it is performance. >> >> Maybe it's time to introduce the config option? > Depending on CONFIG_BROKEN? I'm not sure it's a good idea. Ok. >>> Now as long as all this code is disabled anyway, we can experiment a >>> bit. >>> >>> I personally feel we would be best served by having two code paths: >>> >>> - Access to VM memory directly mapped into kernel >>> - Access to userspace >>> >>> >>> Having it all cleanly split will allow a bunch of optimizations, for >>> example for years now we planned to be able to process an incoming short >>> packet directly on softirq path, or an outgoing on directly within >>> eventfd. 
>> >> It's not hard consider we've already had our own accssors. But the question >> is (as asked in another thread), do you want permanent GUP or still use MMU >> notifiers. >> >> Thanks > We want THP and NUMA to work. Both are important for performance. > Yes. Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 12:46 ` Jason Gunthorpe 2019-08-02 14:27 ` Michael S. Tsirkin @ 2019-08-05 4:20 ` Jason Wang 2019-08-06 12:04 ` Jason Gunthorpe 1 sibling, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-08-05 4:20 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/2 8:46 PM, Jason Gunthorpe wrote: > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>> This must be a proper barrier, like a spinlock, mutex, or >>> synchronize_rcu. >> >> I start with synchronize_rcu() but both you and Michael raise some >> concern. > I've also idly wondered if calling synchronize_rcu() under the various > mm locks is a deadlock situation. Maybe, and that's why I suggested using vhost_work_flush(), which is much more lightweight and can achieve the same function. It can guarantee all previous work has been processed after vhost_work_flush() returns. > >> Then I try spinlock and mutex: >> >> 1) spinlock: add lots of overhead on datapath, this leads 0 performance >> improvement. > I think the topic here is correctness not performance improvement But the whole series is to speed up vhost. > >> 2) SRCU: full memory barrier requires on srcu_read_lock(), which still leads >> little performance improvement > >> 3) mutex: a possible issue is need to wait for the page to be swapped in (is >> this unacceptable ?), another issue is that we need hold vq lock during >> range overlap check. > I have a feeling that mmu notififers cannot safely become dependent on > progress of swap without causing deadlock. You probably should avoid > this. Yes, and that's why I tried to synchronize the critical region myself. >>> And, again, you can't re-invent a spinlock with open coding and get >>> something better. >> So the question is if waiting for swap is considered to be unsuitable for >> MMU notifiers. If yes, it would simplify codes. If not, we still need to >> figure out a possible solution. >> >> Btw, I come up another idea, that is to disable preemption when vhost thread >> need to access the memory. Then register preempt notifier and if vhost >> thread is preempted, we're sure no one will access the memory and can do the >> cleanup. > I think you should use the spinlock so at least the code is obviously > functionally correct and worry about designing some properly justified > performance change after. > > Jason A spinlock is correct but makes the whole series meaningless, considering it won't bring any performance improvement. Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 4:20 ` Jason Wang @ 2019-08-06 12:04 ` Jason Gunthorpe 2019-08-07 6:49 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2019-08-06 12:04 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On Mon, Aug 05, 2019 at 12:20:45PM +0800, Jason Wang wrote: > > On 2019/8/2 下午8:46, Jason Gunthorpe wrote: > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > > This must be a proper barrier, like a spinlock, mutex, or > > > > synchronize_rcu. > > > > > > I start with synchronize_rcu() but both you and Michael raise some > > > concern. > > I've also idly wondered if calling synchronize_rcu() under the various > > mm locks is a deadlock situation. > > > Maybe, that's why I suggest to use vhost_work_flush() which is much > lightweight can can achieve the same function. It can guarantee all previous > work has been processed after vhost_work_flush() return. If things are already running in a work, then yes, you can piggyback on the existing spinlocks inside the workqueue and be Ok However, if that work is doing any copy_from_user, then the flush becomes dependent on swap and it won't work again... > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance > > > improvement. > > I think the topic here is correctness not performance improvement> > But the whole series is to speed up vhost. So? Starting with a whole bunch of crazy, possibly broken, locking and claiming a performance win is not reasonable. > Spinlock is correct but make the whole series meaningless consider it won't > bring any performance improvement. You can't invent a faster spinlock by opencoding some wild scheme. There is nothing special about the usage here, it needs a blocking lock, plain and simple. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-06 12:04 ` Jason Gunthorpe @ 2019-08-07 6:49 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-07 6:49 UTC (permalink / raw) To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/6 8:04 PM, Jason Gunthorpe wrote: > On Mon, Aug 05, 2019 at 12:20:45PM +0800, Jason Wang wrote: >> On 2019/8/2 8:46 PM, Jason Gunthorpe wrote: >>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>>> This must be a proper barrier, like a spinlock, mutex, or >>>>> synchronize_rcu. >>>> I start with synchronize_rcu() but both you and Michael raise some >>>> concern. >>> I've also idly wondered if calling synchronize_rcu() under the various >>> mm locks is a deadlock situation. >> >> Maybe, that's why I suggest to use vhost_work_flush() which is much >> lightweight can can achieve the same function. It can guarantee all previous >> work has been processed after vhost_work_flush() return. > If things are already running in a work, then yes, you can piggyback > on the existing spinlocks inside the workqueue and be Ok > > However, if that work is doing any copy_from_user, then the flush > becomes dependent on swap and it won't work again... Yes, it does copy_from_user(), so we can't do this. > >>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance >>>> improvement. >>> I think the topic here is correctness not performance improvement> > >> But the whole series is to speed up vhost. > So? Starting with a whole bunch of crazy, possibly broken, locking and > claiming a performance win is not reasonable. Yes, I admit this patch is tricky; I'm not going to push it. Will post a V3. > >> Spinlock is correct but make the whole series meaningless consider it won't >> bring any performance improvement. > You can't invent a faster spinlock by opencoding some wild > scheme.
There is nothing special about the usage here, it needs a > blocking lock, plain and simple. > > Jason Will post V3. Let's see if you are happy with that version. Thanks ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 9:40 ` Jason Wang 2019-08-02 12:46 ` Jason Gunthorpe @ 2019-08-02 14:03 ` Michael S. Tsirkin 2019-08-05 4:33 ` Jason Wang 1 sibling, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-02 14:03 UTC (permalink / raw) To: Jason Wang Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > Btw, I come up another idea, that is to disable preemption when vhost thread > need to access the memory. Then register preempt notifier and if vhost > thread is preempted, we're sure no one will access the memory and can do the > cleanup. Great, more notifiers :( Maybe can live with 1- disable preemption while using the cached pointer 2- teach vhost to recover from memory access failures, by switching to regular from/to user path So if you want to try that, fine since it's a step in the right direction. But I think fundamentally it's not what we want to do long term. It's always been a fundamental problem with this patch series that only metadata is accessed through a direct pointer. The difference in ways you handle metadata and data is what is now coming and messing everything up. So if continuing the direct map approach, what is needed is a cache of mapped VM memory, then on a cache miss we'd queue work along the lines of 1-2 above. That's one direction to take. Another one is to give up on that and write our own version of uaccess macros. Add a "high security" flag to the vhost module and if not active use these for userspace memory access. -- MST ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-02 14:03 ` Michael S. Tsirkin @ 2019-08-05 4:33 ` Jason Wang 2019-08-05 6:28 ` Michael S. Tsirkin 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-08-05 4:33 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/2 10:03 PM, Michael S. Tsirkin wrote: > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >> Btw, I come up another idea, that is to disable preemption when vhost thread >> need to access the memory. Then register preempt notifier and if vhost >> thread is preempted, we're sure no one will access the memory and can do the >> cleanup. > Great, more notifiers :( > > Maybe can live with > 1- disable preemption while using the cached pointer > 2- teach vhost to recover from memory access failures, > by switching to regular from/to user path I don't get this; I believe we want to recover via the regular from/to user path, don't we? > > So if you want to try that, fine since it's a step in > the right direction. > > But I think fundamentally it's not what we want to do long term. Yes. > > It's always been a fundamental problem with this patch series that only > metadata is accessed through a direct pointer. > > The difference in ways you handle metadata and data is what is > now coming and messing everything up. I did propose something like this in the past: https://www.spinics.net/lists/linux-virtualization/msg36824.html. But it looks like you have some concerns about its locality. And the problem is still there: GUP can page fault, so we still need to synchronize it with MMU notifiers. The solution might be something like moving GUP to a dedicated kind of vhost work. > > So if continuing the direct map approach, > what is needed is a cache of mapped VM memory, then on a cache miss > we'd queue work along the lines of 1-2 above. > > That's one direction to take.
Another one is to give up on that and > write our own version of uaccess macros. Add a "high security" flag to > the vhost module and if not active use these for userspace memory > access. Or using SET_BACKEND_FEATURES? But do you mean permanent GUP as I did in original RFC https://lkml.org/lkml/2018/12/13/218? Thanks > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 4:33 ` Jason Wang @ 2019-08-05 6:28 ` Michael S. Tsirkin 2019-08-05 8:21 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-05 6:28 UTC (permalink / raw) To: Jason Wang Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote: > > On 2019/8/2 下午10:03, Michael S. Tsirkin wrote: > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: > > > Btw, I come up another idea, that is to disable preemption when vhost thread > > > need to access the memory. Then register preempt notifier and if vhost > > > thread is preempted, we're sure no one will access the memory and can do the > > > cleanup. > > Great, more notifiers :( > > > > Maybe can live with > > 1- disable preemption while using the cached pointer > > 2- teach vhost to recover from memory access failures, > > by switching to regular from/to user path > > > I don't get this, I believe we want to recover from regular from/to user > path, isn't it? That (disable copy to/from user completely) would be a nice to have since it would reduce the attack surface of the driver, but e.g. your code already doesn't do that. > > > > > So if you want to try that, fine since it's a step in > > the right direction. > > > > But I think fundamentally it's not what we want to do long term. > > > Yes. > > > > > > It's always been a fundamental problem with this patch series that only > > metadata is accessed through a direct pointer. > > > > The difference in ways you handle metadata and data is what is > > now coming and messing everything up. > > > I do propose soemthing like this in the past: > https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks > like you have some concern about its locality. Right and it doesn't go away. 
You'll need to come up with a test that messes it up and triggers a worst-case scenario, so we can measure how bad that worst case is. > But the problem still there, GUP can do page fault, so still need to > synchronize it with MMU notifiers. I think the idea was, if GUP would need a page fault, don't do a GUP and do to/from user instead. Hopefully that will fault the page in and the next access will go through. > The solution might be something like > moving GUP to a dedicated kind of vhost work. Right, generally GUP. > >>> So if continuing the direct map approach, >>> what is needed is a cache of mapped VM memory, then on a cache miss >>> we'd queue work along the lines of 1-2 above. >>> >>> That's one direction to take. Another one is to give up on that and >>> write our own version of uaccess macros. Add a "high security" flag to >>> the vhost module and if not active use these for userspace memory >>> access. >> >> Or using SET_BACKEND_FEATURES? No, I don't think it's considered best practice to allow unprivileged userspace control over whether the kernel enables security features. > But do you mean permanent GUP as I did in > original RFC https://lkml.org/lkml/2018/12/13/218? > > Thanks Permanent GUP breaks THP and NUMA. > > > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-05 6:28 ` Michael S. Tsirkin @ 2019-08-05 8:21 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 8:21 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel, linux-mm On 2019/8/5 下午2:28, Michael S. Tsirkin wrote: > On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote: >> On 2019/8/2 下午10:03, Michael S. Tsirkin wrote: >>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote: >>>> Btw, I come up another idea, that is to disable preemption when vhost thread >>>> need to access the memory. Then register preempt notifier and if vhost >>>> thread is preempted, we're sure no one will access the memory and can do the >>>> cleanup. >>> Great, more notifiers :( >>> >>> Maybe can live with >>> 1- disable preemption while using the cached pointer >>> 2- teach vhost to recover from memory access failures, >>> by switching to regular from/to user path >> >> I don't get this, I believe we want to recover from regular from/to user >> path, isn't it? > That (disable copy to/from user completely) would be a nice to have > since it would reduce the attack surface of the driver, but e.g. your > code already doesn't do that. > Yes since it requires a lot of changes. > >>> So if you want to try that, fine since it's a step in >>> the right direction. >>> >>> But I think fundamentally it's not what we want to do long term. >> >> Yes. >> >> >>> It's always been a fundamental problem with this patch series that only >>> metadata is accessed through a direct pointer. >>> >>> The difference in ways you handle metadata and data is what is >>> now coming and messing everything up. >> >> I do propose soemthing like this in the past: >> https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks >> like you have some concern about its locality. > Right and it doesn't go away. 
You'll need to come up > with a test that messes it up and triggers a worst-case > scenario, so we can measure how bad is that worst-case. > >> But the problem still there, GUP can do page fault, so still need to >> synchronize it with MMU notifiers. > I think the idea was, if GUP would need a pagefault, don't > do a GUP and do to/from user instead. But this still need to be synchronized with MMU notifiers (or using dedicated work for GUP). > Hopefully that > will fault the page in and the next access will go through. > >> The solution might be something like >> moving GUP to a dedicated kind of vhost work. > Right, generally GUP. > >>> So if continuing the direct map approach, >>> what is needed is a cache of mapped VM memory, then on a cache miss >>> we'd queue work along the lines of 1-2 above. >>> >>> That's one direction to take. Another one is to give up on that and >>> write our own version of uaccess macros. Add a "high security" flag to >>> the vhost module and if not active use these for userspace memory >>> access. >> >> Or using SET_BACKEND_FEATURES? > No, I don't think it's considered best practice to allow unpriveledged > userspace control over whether kernel enables security features. Get this. > >> But do you mean permanent GUP as I did in >> original RFC https://lkml.org/lkml/2018/12/13/218? >> >> Thanks > Permanent GUP breaks THP and NUMA. Yes. Thanks > >>> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang 2019-07-31 8:50 ` Jason Wang 2019-07-31 12:39 ` Jason Gunthorpe @ 2019-07-31 18:29 ` Michael S. Tsirkin 2019-08-01 8:06 ` Jason Wang 2 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-07-31 18:29 UTC (permalink / raw) To: Jason Wang Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg, Paul E. McKenney On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: > We used to use RCU to synchronize MMU notifier with worker. This leads > calling synchronize_rcu() in invalidate_range_start(). But on a busy > system, there would be many factors that may slow down the > synchronize_rcu() which makes it unsuitable to be called in MMU > notifier. > > A solution is SRCU but its overhead is obvious with the expensive full > memory barrier. Another choice is to use seqlock, but it doesn't > provide a synchronization method between readers and writers. The last > choice is to use vq mutex, but it need to deal with the worst case > that MMU notifier must be blocked and wait for the finish of swap in. > > So this patch switches use a counter to track whether or not the map > was used. The counter was increased when vq try to start or finish > uses the map. This means, when it was even, we're sure there's no > readers and MMU notifier is synchronized. When it was odd, it means > there's a reader we need to wait it to be even again then we are > synchronized. To avoid full memory barrier, store_release + > load_acquire on the counter is used. Unfortunately this needs a lot of review and testing, so this can't make rc2, and I don't think this is the kind of patch I can merge after rc3. Subtle memory barrier tricks like this can introduce new bugs while they are fixing old ones. 
> > Consider the read critical section is pretty small the synchronization > should be done very fast. > > Note the patch lead about 3% PPS dropping. Sorry what do you mean by this last sentence? This degrades performance compared to what? > > Reported-by: Michael S. Tsirkin <mst@redhat.com> > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") > Signed-off-by: Jason Wang <jasowang@redhat.com> > --- > drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- > drivers/vhost/vhost.h | 7 +- > 2 files changed, 94 insertions(+), 58 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index cfc11f9ed9c9..db2c81cb1e90 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) > > spin_lock(&vq->mmu_lock); > for (i = 0; i < VHOST_NUM_ADDRS; i++) { > - map[i] = rcu_dereference_protected(vq->maps[i], > - lockdep_is_held(&vq->mmu_lock)); > + map[i] = vq->maps[i]; > if (map[i]) { > vhost_set_map_dirty(vq, map[i], i); > - rcu_assign_pointer(vq->maps[i], NULL); > + vq->maps[i] = NULL; > } > } > spin_unlock(&vq->mmu_lock); > > - /* No need for synchronize_rcu() or kfree_rcu() since we are > - * serialized with memory accessors (e.g vq mutex held). > + /* No need for synchronization since we are serialized with > + * memory accessors (e.g vq mutex held). > */ > > for (i = 0; i < VHOST_NUM_ADDRS; i++) > @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, > return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); > } > > +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) > +{ > + int ref = READ_ONCE(vq->ref); > + > + smp_store_release(&vq->ref, ref + 1); > + /* Make sure ref counter is visible before accessing the map */ > + smp_load_acquire(&vq->ref); The map access is after this sequence, correct? 
Just going by the rules in Documentation/memory-barriers.txt, I think that this pair will not order the following accesses with the ref store. Documentation/memory-barriers.txt says: + In addition, a RELEASE+ACQUIRE + pair is -not- guaranteed to act as a full memory barrier. The guarantee that is made is this: after an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible. And if we also had the reverse rule we'd end up with a full barrier, wouldn't we? Cc Paul in case I missed something here. And if I'm right, maybe we should call this out, adding "The opposite is not true: a prior RELEASE is not guaranteed to be visible before memory accesses following the subsequent ACQUIRE". > +} > + > +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) > +{ > + int ref = READ_ONCE(vq->ref); > + > + /* Make sure vq access is done before increasing ref counter */ > + smp_store_release(&vq->ref, ref + 1); > +} > + > +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) > +{ > + int ref; > + > + /* Make sure map change was done before checking ref counter */ > + smp_mb(); > + > + ref = READ_ONCE(vq->ref); > + if (ref & 0x1) { Please document the even/odd trick here too, not just in the commit log. > + /* When ref change, changes > we are sure no reader can see > + * previous map */ > + while (READ_ONCE(vq->ref) == ref) { what is the below line in aid of? > + set_current_state(TASK_RUNNING); > + schedule(); if (need_resched()) schedule(); ? > + } On an interruptible kernel, the risk here is that a task gets preempted with an odd ref. So I suspect we'll have to disable preemption when we make ref odd. > + } > + /* Make sure ref counter was checked before any other > + * operations that was dene on map. */ was dene -> were done? 
> + smp_mb(); > +} > + > static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > int index, > unsigned long start, > @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > spin_lock(&vq->mmu_lock); > ++vq->invalidate_count; > > - map = rcu_dereference_protected(vq->maps[index], > - lockdep_is_held(&vq->mmu_lock)); > + map = vq->maps[index]; > if (map) { > vhost_set_map_dirty(vq, map, index); > - rcu_assign_pointer(vq->maps[index], NULL); > + vq->maps[index] = NULL; > } > spin_unlock(&vq->mmu_lock); > > if (map) { > - synchronize_rcu(); > + vhost_vq_sync_access(vq); > vhost_map_unprefetch(map); > } > } > @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) > for (i = 0; i < dev->nvqs; ++i) { > vq = dev->vqs[i]; > for (j = 0; j < VHOST_NUM_ADDRS; j++) > - RCU_INIT_POINTER(vq->maps[j], NULL); > + vq->maps[j] = NULL; > } > } > #endif > @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, > vq->indirect = NULL; > vq->heads = NULL; > vq->dev = dev; > + vq->ref = 0; > mutex_init(&vq->mutex); > spin_lock_init(&vq->mmu_lock); > vhost_vq_reset(dev, vq); > @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, > map->npages = npages; > map->pages = pages; > > - rcu_assign_pointer(vq->maps[index], map); > + vq->maps[index] = map; > /* No need for a synchronize_rcu(). This function should be > * called by dev->worker so we are serialized with all > * readers. 
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > *((__virtio16 *)&used->ring[vq->num]) = > cpu_to_vhost16(vq, vq->avail_idx); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, > size_t size; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > size = count * sizeof(*head); > memcpy(used->ring + idx, head, size); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > used->flags = cpu_to_vhost16(vq, vq->used_flags); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > used->idx = cpu_to_vhost16(vq, vq->last_used_idx); 
> - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *idx = avail->idx; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *head = avail->ring[idx & (vq->num - 1)]; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > avail = map->addr; > *flags = avail->flags; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, > struct vring_avail *avail; > > if (!vq->iotlb) { > - rcu_read_lock(); > - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > + vhost_vq_access_map_begin(vq); > + map = vq->maps[VHOST_ADDR_AVAIL]; > if (likely(map)) { > 
avail = map->addr; > *event = (__virtio16)avail->ring[vq->num]; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, > struct vring_used *used; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > + map = vq->maps[VHOST_ADDR_USED]; > if (likely(map)) { > used = map->addr; > *idx = used->idx; > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, > struct vring_desc *d; > > if (!vq->iotlb) { > - rcu_read_lock(); > + vhost_vq_access_map_begin(vq); > > - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); > + map = vq->maps[VHOST_ADDR_DESC]; > if (likely(map)) { > d = map->addr; > *desc = *(d + idx); > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > return 0; > } > > - rcu_read_unlock(); > + vhost_vq_access_map_end(vq); > } > #endif > > @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, > #if VHOST_ARCH_CAN_ACCEL_UACCESS > static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) > { > - struct vhost_map __rcu *map; > + struct vhost_map *map; > int i; > > for (i = 0; i < VHOST_NUM_ADDRS; i++) { > - rcu_read_lock(); > - map = rcu_dereference(vq->maps[i]); > - rcu_read_unlock(); > + map = vq->maps[i]; > if (unlikely(!map)) > vhost_map_prefetch(vq, i); > } > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > index a9a2a93857d2..f9e9558a529d 100644 > --- a/drivers/vhost/vhost.h > +++ b/drivers/vhost/vhost.h > @@ -115,16 +115,17 @@ struct vhost_virtqueue { > #if VHOST_ARCH_CAN_ACCEL_UACCESS > /* Read by memory accessors, modified by meta data > * prefetching, MMU notifier and vring ioctl(). 
> - * Synchonrized through mmu_lock (writers) and RCU (writers > - * and readers). > + * Synchonrized through mmu_lock (writers) and ref counters, > + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). > */ > - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; > + struct vhost_map *maps[VHOST_NUM_ADDRS]; > /* Read by MMU notifier, modified by vring ioctl(), > * synchronized through MMU notifier > * registering/unregistering. > */ > struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; > #endif > + int ref; Is it important that this is signed? If not I'd do unsigned here: even though kernel does compile with 2s complement sign overflow, it seems cleaner not to depend on that. > const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; > > struct file *kick; > -- > 2.18.1 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-07-31 18:29 ` Michael S. Tsirkin @ 2019-08-01 8:06 ` Jason Wang 2019-08-03 21:54 ` Michael S. Tsirkin 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-08-01 8:06 UTC (permalink / raw) To: Michael S. Tsirkin Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg, Paul E. McKenney On 2019/8/1 2:29 AM, Michael S. Tsirkin wrote: > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: >> We used to use RCU to synchronize MMU notifier with worker. This leads >> calling synchronize_rcu() in invalidate_range_start(). But on a busy >> system, there would be many factors that may slow down the >> synchronize_rcu() which makes it unsuitable to be called in MMU >> notifier. >> >> A solution is SRCU but its overhead is obvious with the expensive full >> memory barrier. Another choice is to use seqlock, but it doesn't >> provide a synchronization method between readers and writers. The last >> choice is to use vq mutex, but it need to deal with the worst case >> that MMU notifier must be blocked and wait for the finish of swap in. >> >> So this patch switches use a counter to track whether or not the map >> was used. The counter was increased when vq try to start or finish >> uses the map. This means, when it was even, we're sure there's no >> readers and MMU notifier is synchronized. When it was odd, it means >> there's a reader we need to wait it to be even again then we are >> synchronized. To avoid full memory barrier, store_release + >> load_acquire on the counter is used. > > Unfortunately this needs a lot of review and testing, so this can't make > rc2, and I don't think this is the kind of patch I can merge after rc3. > Subtle memory barrier tricks like this can introduce new bugs while they > are fixing old ones. I admit the patch is tricky. Some questions: - Must we address the case of, e.g., swap-in? 
If not, a simple vhost_work_flush() instead of synchronize_rcu() may work. - Having given it some hard thought, I think we can use seqlock; it looks to me that the smp_wmb() in write_seqcount_begin() is sufficient: we don't care about vq->map reads before the smp_wmb(), and for the others we all have good data dependency, so the smp_wmb() in the write_seqbegin_end() is sufficient. diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index db2c81cb1e90..6d9501303258 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) { - int ref = READ_ONCE(vq->ref); - - smp_store_release(&vq->ref, ref + 1); - /* Make sure ref counter is visible before accessing the map */ - smp_load_acquire(&vq->ref); + write_seqcount_begin(&vq->seq); } static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) { - int ref = READ_ONCE(vq->ref); - - /* Make sure vq access is done before increasing ref counter */ - smp_store_release(&vq->ref, ref + 1); + write_seqcount_end(&vq->seq); } static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) { - int ref; + unsigned int ret; /* Make sure map change was done before checking ref counter */ smp_mb(); - - ref = READ_ONCE(vq->ref); - if (ref & 0x1) { - /* When ref change, we are sure no reader can see + ret = raw_read_seqcount(&vq->seq); + if (ret & 0x1) { + /* When seq changes, we are sure no reader can see * previous map */ - while (READ_ONCE(vq->ref) == ref) { - set_current_state(TASK_RUNNING); + while (raw_read_seqcount(&vq->seq) == ret) schedule(); - } } - /* Make sure ref counter was checked before any other - * operations that was dene on map. */ + /* Make sure seq was checked before any other operations that + * was dene on map. 
*/ smp_mb(); } @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev, vq->indirect = NULL; vq->heads = NULL; vq->dev = dev; - vq->ref = 0; + seqcount_init(&vq->seq); mutex_init(&vq->mutex); spin_lock_init(&vq->mmu_lock); vhost_vq_reset(dev, vq); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 3d10da0ae511..1a705e181a84 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -125,7 +125,7 @@ struct vhost_virtqueue { */ struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; #endif - int ref; + seqcount_t seq; const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; struct file *kick; -- 2.18.1 > > > > > >> >> Consider the read critical section is pretty small the synchronization >> should be done very fast. >> >> Note the patch lead about 3% PPS dropping. > > Sorry what do you mean by this last sentence? This degrades performance > compared to what? Compare to without this patch. > >> >> Reported-by: Michael S. Tsirkin <mst@redhat.com> >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") >> Signed-off-by: Jason Wang <jasowang@redhat.com> >> --- >> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- >> drivers/vhost/vhost.h | 7 +- >> 2 files changed, 94 insertions(+), 58 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index cfc11f9ed9c9..db2c81cb1e90 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) >> >> spin_lock(&vq->mmu_lock); >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >> - map[i] = rcu_dereference_protected(vq->maps[i], >> - lockdep_is_held(&vq->mmu_lock)); >> + map[i] = vq->maps[i]; >> if (map[i]) { >> vhost_set_map_dirty(vq, map[i], i); >> - rcu_assign_pointer(vq->maps[i], NULL); >> + vq->maps[i] = NULL; >> } >> } >> spin_unlock(&vq->mmu_lock); >> >> - /* No need for synchronize_rcu() or kfree_rcu() since we are >> - * serialized with memory accessors 
(e.g vq mutex held). >> + /* No need for synchronization since we are serialized with >> + * memory accessors (e.g vq mutex held). >> */ >> >> for (i = 0; i < VHOST_NUM_ADDRS; i++) >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, >> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); >> } >> >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) >> +{ >> + int ref = READ_ONCE(vq->ref); >> + >> + smp_store_release(&vq->ref, ref + 1); >> + /* Make sure ref counter is visible before accessing the map */ >> + smp_load_acquire(&vq->ref); > > The map access is after this sequence, correct? Yes. > > Just going by the rules in Documentation/memory-barriers.txt, > I think that this pair will not order following accesses with ref store. > > Documentation/memory-barriers.txt says: > > > + In addition, a RELEASE+ACQUIRE > + pair is -not- guaranteed to act as a full memory barrier. > > > > The guarantee that is made is this: > after > an ACQUIRE on a given variable, all memory accesses preceding any prior > RELEASE on that same variable are guaranteed to be visible. Yes, but it's not clear about the order of ACQUIRE the same location of previous RELEASE. And it only has a example like: " *A = a; RELEASE M ACQUIRE N *B = b; could occur as: ACQUIRE N, STORE *B, STORE *A, RELEASE M " But it doesn't explain what happen when *A = a RELEASE M ACQUIRE M *B = b; And tools/memory-model/Documentation said " First, when a lock-acquire reads from a lock-release, the LKMM requires that every instruction po-before the lock-release must execute before any instruction po-after the lock-acquire. " Is this a hint that I was correct? > > > And if we also had the reverse rule we'd end up with a full barrier, > won't we? > > Cc Paul in case I missed something here. 
And if I'm right, > maybe we should call this out, adding > > "The opposite is not true: a prior RELEASE is not > guaranteed to be visible before memory accesses following > the subsequent ACQUIRE". That kinds of violates the RELEASE? " This also acts as a one-way permeable barrier. It guarantees that all memory operations before the RELEASE operation will appear to happen before the RELEASE operation with respect to the other components of the " > > > >> +} >> + >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) >> +{ >> + int ref = READ_ONCE(vq->ref); >> + >> + /* Make sure vq access is done before increasing ref counter */ >> + smp_store_release(&vq->ref, ref + 1); >> +} >> + >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) >> +{ >> + int ref; >> + >> + /* Make sure map change was done before checking ref counter */ >> + smp_mb(); >> + >> + ref = READ_ONCE(vq->ref); >> + if (ref & 0x1) { > > Please document the even/odd trick here too, not just in the commit log. > Ok. >> + /* When ref change, > > changes > >> we are sure no reader can see >> + * previous map */ >> + while (READ_ONCE(vq->ref) == ref) { > > > what is the below line in aid of? > >> + set_current_state(TASK_RUNNING); >> + schedule(); > > if (need_resched()) > schedule(); > > ? Yes, better. > >> + } > > On an interruptible kernel, there's a risk here is that > a task got preempted with an odd ref. > So I suspect we'll have to disable preemption when we > make ref odd. I'm not sure I get, if the odd is not the original value we read, we're sure it won't read the new map here I believe. > > >> + } >> + /* Make sure ref counter was checked before any other >> + * operations that was dene on map. */ > > was dene -> were done? > Yes. 
>> + smp_mb(); >> +} >> + >> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> int index, >> unsigned long start, >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> spin_lock(&vq->mmu_lock); >> ++vq->invalidate_count; >> >> - map = rcu_dereference_protected(vq->maps[index], >> - lockdep_is_held(&vq->mmu_lock)); >> + map = vq->maps[index]; >> if (map) { >> vhost_set_map_dirty(vq, map, index); >> - rcu_assign_pointer(vq->maps[index], NULL); >> + vq->maps[index] = NULL; >> } >> spin_unlock(&vq->mmu_lock); >> >> if (map) { >> - synchronize_rcu(); >> + vhost_vq_sync_access(vq); >> vhost_map_unprefetch(map); >> } >> } >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) >> for (i = 0; i < dev->nvqs; ++i) { >> vq = dev->vqs[i]; >> for (j = 0; j < VHOST_NUM_ADDRS; j++) >> - RCU_INIT_POINTER(vq->maps[j], NULL); >> + vq->maps[j] = NULL; >> } >> } >> #endif >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, >> vq->indirect = NULL; >> vq->heads = NULL; >> vq->dev = dev; >> + vq->ref = 0; >> mutex_init(&vq->mutex); >> spin_lock_init(&vq->mmu_lock); >> vhost_vq_reset(dev, vq); >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, >> map->npages = npages; >> map->pages = pages; >> >> - rcu_assign_pointer(vq->maps[index], map); >> + vq->maps[index] = map; >> /* No need for a synchronize_rcu(). This function should be >> * called by dev->worker so we are serialized with all >> * readers. 
>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> *((__virtio16 *)&used->ring[vq->num]) = >> cpu_to_vhost16(vq, vq->avail_idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> size_t size; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> size = count * sizeof(*head); >> memcpy(used->ring + idx, head, size); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> used->flags = cpu_to_vhost16(vq, vq->used_flags); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if 
(likely(map)) { >> used = map->addr; >> used->idx = cpu_to_vhost16(vq, vq->last_used_idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *idx = avail->idx; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *head = avail->ring[idx & (vq->num - 1)]; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *flags = avail->flags; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, >> struct vring_avail *avail; >> >> if (!vq->iotlb) { >> 
- rcu_read_lock(); >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >> + vhost_vq_access_map_begin(vq); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> if (likely(map)) { >> avail = map->addr; >> *event = (__virtio16)avail->ring[vq->num]; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, >> struct vring_used *used; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >> + map = vq->maps[VHOST_ADDR_USED]; >> if (likely(map)) { >> used = map->addr; >> *idx = used->idx; >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, >> struct vring_desc *d; >> >> if (!vq->iotlb) { >> - rcu_read_lock(); >> + vhost_vq_access_map_begin(vq); >> >> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); >> + map = vq->maps[VHOST_ADDR_DESC]; >> if (likely(map)) { >> d = map->addr; >> *desc = *(d + idx); >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> return 0; >> } >> >> - rcu_read_unlock(); >> + vhost_vq_access_map_end(vq); >> } >> #endif >> >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, >> #if VHOST_ARCH_CAN_ACCEL_UACCESS >> static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) >> { >> - struct vhost_map __rcu *map; >> + struct vhost_map *map; >> int i; >> >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >> - rcu_read_lock(); >> - map = rcu_dereference(vq->maps[i]); >> - rcu_read_unlock(); >> + map = vq->maps[i]; >> if (unlikely(!map)) >> vhost_map_prefetch(vq, i); >> } >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h >> index a9a2a93857d2..f9e9558a529d 100644 
>> --- a/drivers/vhost/vhost.h >> +++ b/drivers/vhost/vhost.h >> @@ -115,16 +115,17 @@ struct vhost_virtqueue { >> #if VHOST_ARCH_CAN_ACCEL_UACCESS >> /* Read by memory accessors, modified by meta data >> * prefetching, MMU notifier and vring ioctl(). >> - * Synchonrized through mmu_lock (writers) and RCU (writers >> - * and readers). >> + * Synchonrized through mmu_lock (writers) and ref counters, >> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). >> */ >> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; >> + struct vhost_map *maps[VHOST_NUM_ADDRS]; >> /* Read by MMU notifier, modified by vring ioctl(), >> * synchronized through MMU notifier >> * registering/unregistering. >> */ >> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; >> #endif >> + int ref; > > Is it important that this is signed? If not I'd do unsigned here: > even though kernel does compile with 2s complement sign overflow, > it seems cleaner not to depend on that. Not a must, let me fix. Thanks > >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; >> >> struct file *kick; >> -- >> 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-01 8:06 ` Jason Wang @ 2019-08-03 21:54 ` Michael S. Tsirkin 2019-08-05 8:18 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Michael S. Tsirkin @ 2019-08-03 21:54 UTC (permalink / raw) To: Jason Wang Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg, Paul E. McKenney On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote: > On 2019/8/1 上午2:29, Michael S. Tsirkin wrote: > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: > >> We used to use RCU to synchronize MMU notifier with worker. This leads > >> calling synchronize_rcu() in invalidate_range_start(). But on a busy > >> system, there would be many factors that may slow down the > >> synchronize_rcu() which makes it unsuitable to be called in MMU > >> notifier. > >> > >> A solution is SRCU but its overhead is obvious with the expensive full > >> memory barrier. Another choice is to use seqlock, but it doesn't > >> provide a synchronization method between readers and writers. The last > >> choice is to use vq mutex, but it need to deal with the worst case > >> that MMU notifier must be blocked and wait for the finish of swap in. > >> > >> So this patch switches use a counter to track whether or not the map > >> was used. The counter was increased when vq try to start or finish > >> uses the map. This means, when it was even, we're sure there's no > >> readers and MMU notifier is synchronized. When it was odd, it means > >> there's a reader we need to wait it to be even again then we are > >> synchronized. To avoid full memory barrier, store_release + > >> load_acquire on the counter is used. > > > > Unfortunately this needs a lot of review and testing, so this can't make > > rc2, and I don't think this is the kind of patch I can merge after rc3. > > Subtle memory barrier tricks like this can introduce new bugs while they > > are fixing old ones. > > I admit the patch is tricky. 
Some questions: > > - Do we must address the case of e.g swap in? If not, a simple > vhost_work_flush() instead of synchronize_rcu() may work. > - Having some hard thought, I think we can use seqlock, it looks > to me smp_wmb() is in write_segcount_begin() is sufficient, we don't > care vq->map read before smp_wmb(), and for the other we all have > good data devendency so smp_wmb() in the write_seqbegin_end() is > sufficient. If we need an mb in the begin() we can switch to dependent_ptr_mb. if you need me to fix it up and repost, let me know. Why isn't it a problem if the map is accessed outside the lock? > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index db2c81cb1e90..6d9501303258 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, > > static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) > { > - int ref = READ_ONCE(vq->ref); > - > - smp_store_release(&vq->ref, ref + 1); > - /* Make sure ref counter is visible before accessing the map */ > - smp_load_acquire(&vq->ref); > + write_seqcount_begin(&vq->seq); > } > > static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) > { > - int ref = READ_ONCE(vq->ref); > - > - /* Make sure vq access is done before increasing ref counter */ > - smp_store_release(&vq->ref, ref + 1); > + write_seqcount_end(&vq->seq); > } > > static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) > { > - int ref; > + unsigned int ret; > > /* Make sure map change was done before checking ref counter */ > smp_mb(); > - > - ref = READ_ONCE(vq->ref); > - if (ref & 0x1) { > - /* When ref change, we are sure no reader can see > + ret = raw_read_seqcount(&vq->seq); > + if (ret & 0x1) { > + /* When seq changes, we are sure no reader can see > * previous map */ > - while (READ_ONCE(vq->ref) == ref) { > - set_current_state(TASK_RUNNING); > + while (raw_read_seqcount(&vq->seq) == ret) > 
schedule(); So why do we set state here? And should not we check need_sched? > - } > } > - /* Make sure ref counter was checked before any other > - * operations that was dene on map. */ > + /* Make sure seq was checked before any other operations that > + * was dene on map. */ > smp_mb(); > } > > @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev, > vq->indirect = NULL; > vq->heads = NULL; > vq->dev = dev; > - vq->ref = 0; > + seqcount_init(&vq->seq); > mutex_init(&vq->mutex); > spin_lock_init(&vq->mmu_lock); > vhost_vq_reset(dev, vq); > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > index 3d10da0ae511..1a705e181a84 100644 > --- a/drivers/vhost/vhost.h > +++ b/drivers/vhost/vhost.h > @@ -125,7 +125,7 @@ struct vhost_virtqueue { > */ > struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; > #endif > - int ref; > + seqcount_t seq; > const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; > > struct file *kick; > -- > 2.18.1 > > > > > > > > > > > > >> > >> Consider the read critical section is pretty small the synchronization > >> should be done very fast. > >> > >> Note the patch lead about 3% PPS dropping. > > > > Sorry what do you mean by this last sentence? This degrades performance > > compared to what? > > Compare to without this patch. OK is the feature still a performance win? or should we drop it for now? > > > >> > >> Reported-by: Michael S. 
Tsirkin <mst@redhat.com> > >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") > >> Signed-off-by: Jason Wang <jasowang@redhat.com> > >> --- > >> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- > >> drivers/vhost/vhost.h | 7 +- > >> 2 files changed, 94 insertions(+), 58 deletions(-) > >> > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > >> index cfc11f9ed9c9..db2c81cb1e90 100644 > >> --- a/drivers/vhost/vhost.c > >> +++ b/drivers/vhost/vhost.c > >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) > >> > >> spin_lock(&vq->mmu_lock); > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { > >> - map[i] = rcu_dereference_protected(vq->maps[i], > >> - lockdep_is_held(&vq->mmu_lock)); > >> + map[i] = vq->maps[i]; > >> if (map[i]) { > >> vhost_set_map_dirty(vq, map[i], i); > >> - rcu_assign_pointer(vq->maps[i], NULL); > >> + vq->maps[i] = NULL; > >> } > >> } > >> spin_unlock(&vq->mmu_lock); > >> > >> - /* No need for synchronize_rcu() or kfree_rcu() since we are > >> - * serialized with memory accessors (e.g vq mutex held). > >> + /* No need for synchronization since we are serialized with > >> + * memory accessors (e.g vq mutex held). > >> */ > >> > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) > >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, > >> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); > >> } > >> > >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) > >> +{ > >> + int ref = READ_ONCE(vq->ref); > >> + > >> + smp_store_release(&vq->ref, ref + 1); > >> + /* Make sure ref counter is visible before accessing the map */ > >> + smp_load_acquire(&vq->ref); > > > > The map access is after this sequence, correct? > > Yes. > > > > > Just going by the rules in Documentation/memory-barriers.txt, > > I think that this pair will not order following accesses with ref store. 
> > > > Documentation/memory-barriers.txt says: > > > > > > + In addition, a RELEASE+ACQUIRE > > + pair is -not- guaranteed to act as a full memory barrier. > > > > > > > > The guarantee that is made is this: > > after > > an ACQUIRE on a given variable, all memory accesses preceding any prior > > RELEASE on that same variable are guaranteed to be visible. > > Yes, but it's not clear about the order of ACQUIRE the same location > of previous RELEASE. And it only has a example like: > > " > *A = a; > RELEASE M > ACQUIRE N > *B = b; > > could occur as: > > ACQUIRE N, STORE *B, STORE *A, RELEASE M > " > > But it doesn't explain what happen when > > *A = a > RELEASE M > ACQUIRE M > *B = b; > > And tools/memory-model/Documentation said > > " > First, when a lock-acquire reads from a lock-release, the LKMM > requires that every instruction po-before the lock-release must > execute before any instruction po-after the lock-acquire. > " > > Is this a hint that I was correct? I don't think it's correct since by this logic memory barriers can be nops on x86. > > > > > > And if we also had the reverse rule we'd end up with a full barrier, > > won't we? > > > > Cc Paul in case I missed something here. And if I'm right, > > maybe we should call this out, adding > > > > "The opposite is not true: a prior RELEASE is not > > guaranteed to be visible before memory accesses following > > the subsequent ACQUIRE". > > That kinds of violates the RELEASE? > > " > This also acts as a one-way permeable barrier. It guarantees that all > memory operations before the RELEASE operation will appear to happen > before the RELEASE operation with respect to the other components of the > " yes but we are talking about RELEASE itself versus stuff that comes after it. 
> > > > > > > >> +} > >> + > >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) > >> +{ > >> + int ref = READ_ONCE(vq->ref); > >> + > >> + /* Make sure vq access is done before increasing ref counter */ > >> + smp_store_release(&vq->ref, ref + 1); > >> +} > >> + > >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) > >> +{ > >> + int ref; > >> + > >> + /* Make sure map change was done before checking ref counter */ > >> + smp_mb(); > >> + > >> + ref = READ_ONCE(vq->ref); > >> + if (ref & 0x1) { > > > > Please document the even/odd trick here too, not just in the commit log. > > > > Ok. > > >> + /* When ref change, > > > > changes > > > >> we are sure no reader can see > >> + * previous map */ > >> + while (READ_ONCE(vq->ref) == ref) { > > > > > > what is the below line in aid of? > > > >> + set_current_state(TASK_RUNNING); any answers here? > >> + schedule(); > > > > if (need_resched()) > > schedule(); > > > > ? > > Yes, better. > > > > >> + } > > > > On an interruptible kernel, there's a risk here is that > > a task got preempted with an odd ref. > > So I suspect we'll have to disable preemption when we > > make ref odd. > > I'm not sure I get, if the odd is not the original value we read, > we're sure it won't read the new map here I believe. But we will spin for a very long time in this case. > > > > > >> + } > >> + /* Make sure ref counter was checked before any other > >> + * operations that was dene on map. */ > > > > was dene -> were done? > > > > Yes. 
> > >> + smp_mb(); > >> +} > >> + > >> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > >> int index, > >> unsigned long start, > >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > >> spin_lock(&vq->mmu_lock); > >> ++vq->invalidate_count; > >> > >> - map = rcu_dereference_protected(vq->maps[index], > >> - lockdep_is_held(&vq->mmu_lock)); > >> + map = vq->maps[index]; > >> if (map) { > >> vhost_set_map_dirty(vq, map, index); > >> - rcu_assign_pointer(vq->maps[index], NULL); > >> + vq->maps[index] = NULL; > >> } > >> spin_unlock(&vq->mmu_lock); > >> > >> if (map) { > >> - synchronize_rcu(); > >> + vhost_vq_sync_access(vq); > >> vhost_map_unprefetch(map); > >> } > >> } > >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) > >> for (i = 0; i < dev->nvqs; ++i) { > >> vq = dev->vqs[i]; > >> for (j = 0; j < VHOST_NUM_ADDRS; j++) > >> - RCU_INIT_POINTER(vq->maps[j], NULL); > >> + vq->maps[j] = NULL; > >> } > >> } > >> #endif > >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, > >> vq->indirect = NULL; > >> vq->heads = NULL; > >> vq->dev = dev; > >> + vq->ref = 0; > >> mutex_init(&vq->mutex); > >> spin_lock_init(&vq->mmu_lock); > >> vhost_vq_reset(dev, vq); > >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, > >> map->npages = npages; > >> map->pages = pages; > >> > >> - rcu_assign_pointer(vq->maps[index], map); > >> + vq->maps[index] = map; > >> /* No need for a synchronize_rcu(). This function should be > >> * called by dev->worker so we are serialized with all > >> * readers. 
> >> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> *((__virtio16 *)&used->ring[vq->num]) = > >> cpu_to_vhost16(vq, vq->avail_idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, > >> size_t size; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> size = count * sizeof(*head); > >> memcpy(used->ring + idx, head, size); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> used->flags = cpu_to_vhost16(vq, vq->used_flags); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > 
>> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> used->idx = cpu_to_vhost16(vq, vq->last_used_idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *idx = avail->idx; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *head = avail->ring[idx & (vq->num - 1)]; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *flags = avail->flags; > >> - 
rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, > >> struct vring_avail *avail; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); > >> + vhost_vq_access_map_begin(vq); > >> + map = vq->maps[VHOST_ADDR_AVAIL]; > >> if (likely(map)) { > >> avail = map->addr; > >> *event = (__virtio16)avail->ring[vq->num]; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, > >> struct vring_used *used; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); > >> + map = vq->maps[VHOST_ADDR_USED]; > >> if (likely(map)) { > >> used = map->addr; > >> *idx = used->idx; > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, > >> struct vring_desc *d; > >> > >> if (!vq->iotlb) { > >> - rcu_read_lock(); > >> + vhost_vq_access_map_begin(vq); > >> > >> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); > >> + map = vq->maps[VHOST_ADDR_DESC]; > >> if (likely(map)) { > >> d = map->addr; > >> *desc = *(d + idx); > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> return 0; > >> } > >> > >> - rcu_read_unlock(); > >> + vhost_vq_access_map_end(vq); > >> } > >> #endif > >> > >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, > >> #if VHOST_ARCH_CAN_ACCEL_UACCESS > >> static 
void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) > >> { > >> - struct vhost_map __rcu *map; > >> + struct vhost_map *map; > >> int i; > >> > >> for (i = 0; i < VHOST_NUM_ADDRS; i++) { > >> - rcu_read_lock(); > >> - map = rcu_dereference(vq->maps[i]); > >> - rcu_read_unlock(); > >> + map = vq->maps[i]; > >> if (unlikely(!map)) > >> vhost_map_prefetch(vq, i); > >> } > >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h > >> index a9a2a93857d2..f9e9558a529d 100644 > >> --- a/drivers/vhost/vhost.h > >> +++ b/drivers/vhost/vhost.h > >> @@ -115,16 +115,17 @@ struct vhost_virtqueue { > >> #if VHOST_ARCH_CAN_ACCEL_UACCESS > >> /* Read by memory accessors, modified by meta data > >> * prefetching, MMU notifier and vring ioctl(). > >> - * Synchonrized through mmu_lock (writers) and RCU (writers > >> - * and readers). > >> + * Synchonrized through mmu_lock (writers) and ref counters, > >> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). > >> */ > >> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; > >> + struct vhost_map *maps[VHOST_NUM_ADDRS]; > >> /* Read by MMU notifier, modified by vring ioctl(), > >> * synchronized through MMU notifier > >> * registering/unregistering. > >> */ > >> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; > >> #endif > >> + int ref; > > > > Is it important that this is signed? If not I'd do unsigned here: > > even though kernel does compile with 2s complement sign overflow, > > it seems cleaner not to depend on that. > > Not a must, let me fix. > > Thanks > > > > >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; > >> > >> struct file *kick; > >> -- > >> 2.18.1 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker 2019-08-03 21:54 ` Michael S. Tsirkin @ 2019-08-05 8:18 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-08-05 8:18 UTC (permalink / raw) To: Michael S. Tsirkin Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg, Paul E. McKenney On 2019/8/4 5:54 AM, Michael S. Tsirkin wrote: > On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote: >> On 2019/8/1 2:29 AM, Michael S. Tsirkin wrote: >>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote: >>>> We used to use RCU to synchronize the MMU notifier with the worker. This leads >>>> to calling synchronize_rcu() in invalidate_range_start(), but on a busy >>>> system many factors may slow down >>>> synchronize_rcu(), which makes it unsuitable to be called from an MMU >>>> notifier. >>>> >>>> One solution is SRCU, but its overhead is obvious because of its expensive full >>>> memory barrier. Another choice is seqlock, but it doesn't >>>> provide a way for readers and writers to synchronize with each other. The last >>>> choice is the vq mutex, but then we would need to handle the worst case, >>>> where the MMU notifier is blocked waiting for a swap-in to finish. >>>> >>>> So this patch switches to using a counter to track whether the map >>>> is in use. The counter is incremented when the vq starts or finishes >>>> using the map. This means that when it is even, we are sure there are no >>>> readers and the MMU notifier is synchronized. When it is odd, there is >>>> a reader, and we need to wait for the counter to become even again; then we are >>>> synchronized. To avoid a full memory barrier, store_release + >>>> load_acquire on the counter is used. >>> Unfortunately this needs a lot of review and testing, so this can't make >>> rc2, and I don't think this is the kind of patch I can merge after rc3. >>> Subtle memory barrier tricks like this can introduce new bugs while they >>> are fixing old ones. 
>> I admit the patch is tricky. Some questions: >> >> - Must we address the case of e.g. swap-in? If not, a simple >> vhost_work_flush() instead of synchronize_rcu() may work. >> - Having thought about it some more, I think we can use a seqlock: it looks >> to me like the smp_wmb() in write_seqcount_begin() is sufficient; we don't >> care about vq->map reads before the smp_wmb(), and for the rest we have >> good data dependencies, so the smp_wmb() in write_seqcount_begin()/end() is >> sufficient. > If we need an mb in the begin() we can switch to > dependent_ptr_mb. If you need me to fix it up > and repost, let me know. Yes, but please let me figure out whether the mb is really necessary here. > > Why isn't it a problem if the map is > accessed outside the lock? Correct me if I am wrong. E.g. for vhost_put_avail_event(): vhost_vq_access_map_begin(vq); map = vq->maps[VHOST_ADDR_USED]; if (likely(map)) { used = map->addr; *((__virtio16 *)&used->ring[vq->num]) = cpu_to_vhost16(vq, vq->avail_idx); vhost_vq_access_map_end(vq); return 0; } vhost_vq_access_map_end(vq); We don't care whether the map is accessed before vhost_vq_access_map_begin(), since the MMU notifier can only change the map from non-NULL to NULL. If we read it too early, we will only get NULL and won't use the map at all. And the smp_wmb() in vhost_vq_access_map_begin() makes sure the real access to map->addr is done after we increase the counter. 
> > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index db2c81cb1e90..6d9501303258 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, >> >> static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) >> { >> - int ref = READ_ONCE(vq->ref); >> - >> - smp_store_release(&vq->ref, ref + 1); >> - /* Make sure ref counter is visible before accessing the map */ >> - smp_load_acquire(&vq->ref); >> + write_seqcount_begin(&vq->seq); >> } >> >> static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) >> { >> - int ref = READ_ONCE(vq->ref); >> - >> - /* Make sure vq access is done before increasing ref counter */ >> - smp_store_release(&vq->ref, ref + 1); >> + write_seqcount_end(&vq->seq); >> } >> >> static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) >> { >> - int ref; >> + unsigned int ret; >> >> /* Make sure map change was done before checking ref counter */ >> smp_mb(); >> - >> - ref = READ_ONCE(vq->ref); >> - if (ref & 0x1) { >> - /* When ref change, we are sure no reader can see >> + ret = raw_read_seqcount(&vq->seq); >> + if (ret & 0x1) { >> + /* When seq changes, we are sure no reader can see >> * previous map */ >> - while (READ_ONCE(vq->ref) == ref) { >> - set_current_state(TASK_RUNNING); >> + while (raw_read_seqcount(&vq->seq) == ret) >> schedule(); > > So why do we set state here? No need, just a artifact of previous patch. > And should not we > check need_sched? We need use need_sched(). > > >> - } >> } >> - /* Make sure ref counter was checked before any other >> - * operations that was dene on map. */ >> + /* Make sure seq was checked before any other operations that >> + * was dene on map. 
*/ >> smp_mb(); >> } >> >> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev, >> vq->indirect = NULL; >> vq->heads = NULL; >> vq->dev = dev; >> - vq->ref = 0; >> + seqcount_init(&vq->seq); >> mutex_init(&vq->mutex); >> spin_lock_init(&vq->mmu_lock); >> vhost_vq_reset(dev, vq); >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h >> index 3d10da0ae511..1a705e181a84 100644 >> --- a/drivers/vhost/vhost.h >> +++ b/drivers/vhost/vhost.h >> @@ -125,7 +125,7 @@ struct vhost_virtqueue { >> */ >> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; >> #endif >> - int ref; >> + seqcount_t seq; >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; >> >> struct file *kick; >> -- >> 2.18.1 >> >>> >>> >>> >>> >>>> Consider the read critical section is pretty small the synchronization >>>> should be done very fast. >>>> >>>> Note the patch lead about 3% PPS dropping. >>> Sorry what do you mean by this last sentence? This degrades performance >>> compared to what? >> Compare to without this patch. > OK is the feature still a performance win? or should we drop it for now? Still a win, just a drop from 23% improvement to 20% improvement. >>>> Reported-by: Michael S. 
Tsirkin <mst@redhat.com> >>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") >>>> Signed-off-by: Jason Wang <jasowang@redhat.com> >>>> --- >>>> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++---------------- >>>> drivers/vhost/vhost.h | 7 +- >>>> 2 files changed, 94 insertions(+), 58 deletions(-) >>>> >>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >>>> index cfc11f9ed9c9..db2c81cb1e90 100644 >>>> --- a/drivers/vhost/vhost.c >>>> +++ b/drivers/vhost/vhost.c >>>> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq) >>>> >>>> spin_lock(&vq->mmu_lock); >>>> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >>>> - map[i] = rcu_dereference_protected(vq->maps[i], >>>> - lockdep_is_held(&vq->mmu_lock)); >>>> + map[i] = vq->maps[i]; >>>> if (map[i]) { >>>> vhost_set_map_dirty(vq, map[i], i); >>>> - rcu_assign_pointer(vq->maps[i], NULL); >>>> + vq->maps[i] = NULL; >>>> } >>>> } >>>> spin_unlock(&vq->mmu_lock); >>>> >>>> - /* No need for synchronize_rcu() or kfree_rcu() since we are >>>> - * serialized with memory accessors (e.g vq mutex held). >>>> + /* No need for synchronization since we are serialized with >>>> + * memory accessors (e.g vq mutex held). >>>> */ >>>> >>>> for (i = 0; i < VHOST_NUM_ADDRS; i++) >>>> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, >>>> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); >>>> } >>>> >>>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) >>>> +{ >>>> + int ref = READ_ONCE(vq->ref); >>>> + >>>> + smp_store_release(&vq->ref, ref + 1); >>>> + /* Make sure ref counter is visible before accessing the map */ >>>> + smp_load_acquire(&vq->ref); >>> The map access is after this sequence, correct? >> Yes. >> >>> Just going by the rules in Documentation/memory-barriers.txt, >>> I think that this pair will not order following accesses with ref store. 
>>> >>> Documentation/memory-barriers.txt says: >>> >>> >>> + In addition, a RELEASE+ACQUIRE >>> + pair is -not- guaranteed to act as a full memory barrier. >>> >>> >>> >>> The guarantee that is made is this: >>> after >>> an ACQUIRE on a given variable, all memory accesses preceding any prior >>> RELEASE on that same variable are guaranteed to be visible. >> Yes, but it's not clear about the order of ACQUIRE the same location >> of previous RELEASE. And it only has a example like: >> >> " >> *A = a; >> RELEASE M >> ACQUIRE N >> *B = b; >> >> could occur as: >> >> ACQUIRE N, STORE *B, STORE *A, RELEASE M >> " >> >> But it doesn't explain what happen when >> >> *A = a >> RELEASE M >> ACQUIRE M >> *B = b; >> >> And tools/memory-model/Documentation said >> >> " >> First, when a lock-acquire reads from a lock-release, the LKMM >> requires that every instruction po-before the lock-release must >> execute before any instruction po-after the lock-acquire. >> " >> >> Is this a hint that I was correct? > I don't think it's correct since by this logic > memory barriers can be nops on x86. It not a nop, instead, it goes to a write and then read to one same location of memory. > >>> >>> And if we also had the reverse rule we'd end up with a full barrier, >>> won't we? >>> >>> Cc Paul in case I missed something here. And if I'm right, >>> maybe we should call this out, adding >>> >>> "The opposite is not true: a prior RELEASE is not >>> guaranteed to be visible before memory accesses following >>> the subsequent ACQUIRE". >> That kinds of violates the RELEASE? >> >> " >> This also acts as a one-way permeable barrier. It guarantees that all >> memory operations before the RELEASE operation will appear to happen >> before the RELEASE operation with respect to the other components of the >> " > > yes but we are talking about RELEASE itself versus stuff > that comes after it. Unless RELEASE and ACQUIRE on the same address can be reordered (at least doesn't happen x86). 
The following ACQUIRE can make sure stuff after ACQUIRE id done after RELEASE. > >>> >>> >>>> +} >>>> + >>>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) >>>> +{ >>>> + int ref = READ_ONCE(vq->ref); >>>> + >>>> + /* Make sure vq access is done before increasing ref counter */ >>>> + smp_store_release(&vq->ref, ref + 1); >>>> +} >>>> + >>>> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) >>>> +{ >>>> + int ref; >>>> + >>>> + /* Make sure map change was done before checking ref counter */ >>>> + smp_mb(); >>>> + >>>> + ref = READ_ONCE(vq->ref); >>>> + if (ref & 0x1) { >>> Please document the even/odd trick here too, not just in the commit log. >>> >> Ok. >> >>>> + /* When ref change, >>> changes >>> >>>> we are sure no reader can see >>>> + * previous map */ >>>> + while (READ_ONCE(vq->ref) == ref) { >>> >>> what is the below line in aid of? >>> >>>> + set_current_state(TASK_RUNNING); > any answers here? It's unecessary. > >>>> + schedule(); >>> if (need_resched()) >>> schedule(); >>> >>> ? >> Yes, better. >> >>>> + } >>> On an interruptible kernel, there's a risk here is that >>> a task got preempted with an odd ref. >>> So I suspect we'll have to disable preemption when we >>> make ref odd. >> I'm not sure I get, if the odd is not the original value we read, >> we're sure it won't read the new map here I believe. > But we will spin for a very long time in this case. Yes, but do we disable preemption in MMU notifier callback. If not it should be the same as MMU notifier was preempted for long time. We can disable the preempt count here, but it needs an extra cacheline. Thanks > >>> >>>> + } >>>> + /* Make sure ref counter was checked before any other >>>> + * operations that was dene on map. */ >>> was dene -> were done? >>> >> Yes. 
>> >>>> + smp_mb(); >>>> +} >>>> + >>>> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >>>> int index, >>>> unsigned long start, >>>> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >>>> spin_lock(&vq->mmu_lock); >>>> ++vq->invalidate_count; >>>> >>>> - map = rcu_dereference_protected(vq->maps[index], >>>> - lockdep_is_held(&vq->mmu_lock)); >>>> + map = vq->maps[index]; >>>> if (map) { >>>> vhost_set_map_dirty(vq, map, index); >>>> - rcu_assign_pointer(vq->maps[index], NULL); >>>> + vq->maps[index] = NULL; >>>> } >>>> spin_unlock(&vq->mmu_lock); >>>> >>>> if (map) { >>>> - synchronize_rcu(); >>>> + vhost_vq_sync_access(vq); >>>> vhost_map_unprefetch(map); >>>> } >>>> } >>>> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev) >>>> for (i = 0; i < dev->nvqs; ++i) { >>>> vq = dev->vqs[i]; >>>> for (j = 0; j < VHOST_NUM_ADDRS; j++) >>>> - RCU_INIT_POINTER(vq->maps[j], NULL); >>>> + vq->maps[j] = NULL; >>>> } >>>> } >>>> #endif >>>> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev, >>>> vq->indirect = NULL; >>>> vq->heads = NULL; >>>> vq->dev = dev; >>>> + vq->ref = 0; >>>> mutex_init(&vq->mutex); >>>> spin_lock_init(&vq->mmu_lock); >>>> vhost_vq_reset(dev, vq); >>>> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq, >>>> map->npages = npages; >>>> map->pages = pages; >>>> >>>> - rcu_assign_pointer(vq->maps[index], map); >>>> + vq->maps[index] = map; >>>> /* No need for a synchronize_rcu(). This function should be >>>> * called by dev->worker so we are serialized with all >>>> * readers. 
>>>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) >>>> struct vring_used *used; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >>>> + map = vq->maps[VHOST_ADDR_USED]; >>>> if (likely(map)) { >>>> used = map->addr; >>>> *((__virtio16 *)&used->ring[vq->num]) = >>>> cpu_to_vhost16(vq, vq->avail_idx); >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >>>> size_t size; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >>>> + map = vq->maps[VHOST_ADDR_USED]; >>>> if (likely(map)) { >>>> used = map->addr; >>>> size = count * sizeof(*head); >>>> memcpy(used->ring + idx, head, size); >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >>>> struct vring_used *used; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >>>> + map = vq->maps[VHOST_ADDR_USED]; >>>> if (likely(map)) { >>>> used = map->addr; >>>> used->flags = cpu_to_vhost16(vq, vq->used_flags); >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >>>> struct vring_used *used; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); 
>>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >>>> + map = vq->maps[VHOST_ADDR_USED]; >>>> if (likely(map)) { >>>> used = map->addr; >>>> used->idx = cpu_to_vhost16(vq, vq->last_used_idx); >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, >>>> struct vring_avail *avail; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >>>> + map = vq->maps[VHOST_ADDR_AVAIL]; >>>> if (likely(map)) { >>>> avail = map->addr; >>>> *idx = avail->idx; >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >>>> struct vring_avail *avail; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >>>> + map = vq->maps[VHOST_ADDR_AVAIL]; >>>> if (likely(map)) { >>>> avail = map->addr; >>>> *head = avail->ring[idx & (vq->num - 1)]; >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, >>>> struct vring_avail *avail; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >>>> + map = vq->maps[VHOST_ADDR_AVAIL]; >>>> if (likely(map)) { >>>> avail = map->addr; >>>> *flags = avail->flags; >>>> - 
rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq, >>>> struct vring_avail *avail; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]); >>>> + vhost_vq_access_map_begin(vq); >>>> + map = vq->maps[VHOST_ADDR_AVAIL]; >>>> if (likely(map)) { >>>> avail = map->addr; >>>> *event = (__virtio16)avail->ring[vq->num]; >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, >>>> struct vring_used *used; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]); >>>> + map = vq->maps[VHOST_ADDR_USED]; >>>> if (likely(map)) { >>>> used = map->addr; >>>> *idx = used->idx; >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq, >>>> struct vring_desc *d; >>>> >>>> if (!vq->iotlb) { >>>> - rcu_read_lock(); >>>> + vhost_vq_access_map_begin(vq); >>>> >>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]); >>>> + map = vq->maps[VHOST_ADDR_DESC]; >>>> if (likely(map)) { >>>> d = map->addr; >>>> *desc = *(d + idx); >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> return 0; >>>> } >>>> >>>> - rcu_read_unlock(); >>>> + vhost_vq_access_map_end(vq); >>>> } >>>> #endif >>>> >>>> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq, >>>> #if VHOST_ARCH_CAN_ACCEL_UACCESS >>>> static 
void vhost_vq_map_prefetch(struct vhost_virtqueue *vq) >>>> { >>>> - struct vhost_map __rcu *map; >>>> + struct vhost_map *map; >>>> int i; >>>> >>>> for (i = 0; i < VHOST_NUM_ADDRS; i++) { >>>> - rcu_read_lock(); >>>> - map = rcu_dereference(vq->maps[i]); >>>> - rcu_read_unlock(); >>>> + map = vq->maps[i]; >>>> if (unlikely(!map)) >>>> vhost_map_prefetch(vq, i); >>>> } >>>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h >>>> index a9a2a93857d2..f9e9558a529d 100644 >>>> --- a/drivers/vhost/vhost.h >>>> +++ b/drivers/vhost/vhost.h >>>> @@ -115,16 +115,17 @@ struct vhost_virtqueue { >>>> #if VHOST_ARCH_CAN_ACCEL_UACCESS >>>> /* Read by memory accessors, modified by meta data >>>> * prefetching, MMU notifier and vring ioctl(). >>>> - * Synchonrized through mmu_lock (writers) and RCU (writers >>>> - * and readers). >>>> + * Synchonrized through mmu_lock (writers) and ref counters, >>>> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end(). >>>> */ >>>> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS]; >>>> + struct vhost_map *maps[VHOST_NUM_ADDRS]; >>>> /* Read by MMU notifier, modified by vring ioctl(), >>>> * synchronized through MMU notifier >>>> * registering/unregistering. >>>> */ >>>> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS]; >>>> #endif >>>> + int ref; >>> Is it important that this is signed? If not I'd do unsigned here: >>> even though kernel does compile with 2s complement sign overflow, >>> it seems cleaner not to depend on that. >> Not a must, let me fix. >> >> Thanks >> >>>> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; >>>> >>>> struct file *kick; >>>> -- >>>> 2.18.1 ^ permalink raw reply [flat|nested] 51+ messages in thread
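The begin/end/sync accessors that replace RCU in patch 7 can be modeled in plain C11 atomics. The sketch below is illustrative only: the names mirror the patch, seq_cst atomics stand in for the kernel's smp_mb(), and it relies on the property the thread above states explicitly — all metadata readers run in the single vhost worker, so an odd counter value reliably means "a reader is inside its critical section":

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative userspace model of vhost_vq_access_map_begin()/end()
 * and vhost_vq_sync_access(); not the real kernel implementation.
 * There is exactly one reader (the vhost worker), so counter parity
 * is a reliable in-critical-section flag. */
struct vq_model {
	atomic_int ref;   /* odd => the reader is mid-access */
	void *map;        /* stands in for vq->maps[index] */
};

static void access_map_begin(struct vq_model *vq)
{
	atomic_fetch_add(&vq->ref, 1);   /* counter goes odd; seq_cst orders
					  * later map accesses after it */
}

static void access_map_end(struct vq_model *vq)
{
	atomic_fetch_add(&vq->ref, 1);   /* counter returns to even */
}

static void sync_access(struct vq_model *vq)
{
	/* Invalidation side: wait until the reader has left.  The reader
	 * only holds this across a few loads/stores, so the busy-wait is
	 * short and needs no sleeping primitive. */
	while (atomic_load(&vq->ref) & 1)
		;
}

static bool reader_sees_map(struct vq_model *vq)
{
	bool ok;

	access_map_begin(vq);
	ok = (vq->map != (void *)0);  /* only dereference while inside */
	access_map_end(vq);
	return ok;
}
```

This models the intent rather than the letter of the patch; the real code uses `vq->ref` with explicit `smp_mb()` under the constraints debated later in the thread.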
* [PATCH V2 8/9] vhost: correctly set dirty pages in MMU notifiers callback 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (6 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 8:46 ` [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early Jason Wang 8 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg We need to make sure there is no reference on the map before trying to set dirty pages. Reported-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index db2c81cb1e90..fc2da8a0c671 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -414,14 +414,13 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, ++vq->invalidate_count; map = vq->maps[index]; - if (map) { - vhost_set_map_dirty(vq, map, index); + if (map) vq->maps[index] = NULL; - } spin_unlock(&vq->mmu_lock); if (map) { vhost_vq_sync_access(vq); + vhost_set_map_dirty(vq, map, index); vhost_map_unprefetch(map); } } -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
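The ordering patch 8 enforces — unpublish the map, drain readers, and only then mark pages dirty — can be captured in a tiny model. None of these names are the real vhost API; the event log simply makes the ordering checkable:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the teardown ordering fixed by this patch.
 * Marking pages dirty before draining readers would race with a CPU
 * still writing through the map, so the dirty step must follow the
 * drain. */
enum event { EV_UNPUBLISH, EV_DRAIN, EV_DIRTY, EV_RELEASE };

static enum event ev_log[4];
static int ev_n;

static void record(enum event ev)
{
	ev_log[ev_n++] = ev;
}

struct map { int pages; };

static struct map backing = { .pages = 1 };
static struct map *slot = &backing;      /* stands in for vq->maps[index] */

static void invalidate_start(void)
{
	struct map *map = slot;          /* taken under mmu_lock in the kernel */

	if (map) {
		slot = NULL;             /* new readers can no longer find it */
		record(EV_UNPUBLISH);
	}
	/* mmu_lock dropped here */
	if (map) {
		record(EV_DRAIN);        /* vhost_vq_sync_access(): wait for readers */
		record(EV_DIRTY);        /* safe now: no CPU can still write via map */
		record(EV_RELEASE);      /* vhost_map_unprefetch() */
	}
}
```

A second invocation is a no-op, mirroring the `if (map)` guards in the diff above.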
* [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early 2019-07-31 8:46 [PATCH V2 0/9] Fixes for metadata accelreation Jason Wang ` (7 preceding siblings ...) 2019-07-31 8:46 ` [PATCH V2 8/9] vhost: correctly set dirty pages in MMU notifiers callback Jason Wang @ 2019-07-31 8:46 ` Jason Wang 2019-07-31 9:59 ` Stefano Garzarella 8 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2019-07-31 8:46 UTC (permalink / raw) To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg Instead of returning -EAGAIN unconditionally, we'd better do that only when we're sure the range is overlapped with the metadata area. Reported-by: Jason Gunthorpe <jgg@ziepe.ca> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") Signed-off-by: Jason Wang <jasowang@redhat.com> --- drivers/vhost/vhost.c | 32 +++++++++++++++++++------------- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index fc2da8a0c671..96c6aeb1871f 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) smp_mb(); } -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, - int index, - unsigned long start, - unsigned long end) +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq, + int index, + unsigned long start, + unsigned long end, + bool blockable) { struct vhost_uaddr *uaddr = &vq->uaddrs[index]; struct vhost_map *map; if (!vhost_map_range_overlap(uaddr, start, end)) - return; + return 0; + else if (!blockable) + return -EAGAIN; spin_lock(&vq->mmu_lock); ++vq->invalidate_count; @@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, vhost_set_map_dirty(vq, map, index); vhost_map_unprefetch(map); } + + return 0; } static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq, @@ -443,18 +448,19 @@ static int
vhost_invalidate_range_start(struct mmu_notifier *mn, { struct vhost_dev *dev = container_of(mn, struct vhost_dev, mmu_notifier); - int i, j; - - if (!mmu_notifier_range_blockable(range)) - return -EAGAIN; + bool blockable = mmu_notifier_range_blockable(range); + int i, j, ret; for (i = 0; i < dev->nvqs; i++) { struct vhost_virtqueue *vq = dev->vqs[i]; - for (j = 0; j < VHOST_NUM_ADDRS; j++) - vhost_invalidate_vq_start(vq, j, - range->start, - range->end); + for (j = 0; j < VHOST_NUM_ADDRS; j++) { + ret = vhost_invalidate_vq_start(vq, j, + range->start, + range->end, blockable); + if (ret) + return ret; + } } return 0; -- 2.18.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
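Patch 9's policy — a non-blockable invalidation only fails with -EAGAIN when the range actually overlaps a metadata mapping — is easy to state as a small standalone function. The types and the half-open overlap rule below are illustrative stand-ins for `vhost_uaddr` and `vhost_map_range_overlap()`, not the real definitions:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for struct vhost_uaddr: a metadata mapping
 * covering the half-open range [uaddr, uaddr + size). */
struct uaddr_model {
	unsigned long uaddr;
	unsigned long size;
};

static bool range_overlap(const struct uaddr_model *u,
			  unsigned long start, unsigned long end)
{
	return u->size && u->uaddr < end && u->uaddr + u->size > start;
}

static int invalidate_vq_start(const struct uaddr_model *u,
			       unsigned long start, unsigned long end,
			       bool blockable)
{
	if (!range_overlap(u, start, end))
		return 0;            /* unrelated range: never fail */
	if (!blockable)
		return -EAGAIN;      /* caller must retry from a blockable context */
	/* ... blockable path: unpublish the map and drain readers ... */
	return 0;
}
```

Compared with V1's unconditional `-EAGAIN`, only invalidations that truly touch the metadata area now force the MMU notifier caller to retry.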
* Re: [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early 2019-07-31 8:46 ` [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early Jason Wang @ 2019-07-31 9:59 ` Stefano Garzarella 2019-07-31 10:05 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Stefano Garzarella @ 2019-07-31 9:59 UTC (permalink / raw) To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm, jgg A little typo in the title: s/EAGIAN/EAGAIN Thanks, Stefano On Wed, Jul 31, 2019 at 04:46:55AM -0400, Jason Wang wrote: > Instead of returning -EAGAIN unconditionally, we'd better do that only > we're sure the range is overlapped with the metadata area. > > Reported-by: Jason Gunthorpe <jgg@ziepe.ca> > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") > Signed-off-by: Jason Wang <jasowang@redhat.com> > --- > drivers/vhost/vhost.c | 32 +++++++++++++++++++------------- > 1 file changed, 19 insertions(+), 13 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index fc2da8a0c671..96c6aeb1871f 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) > smp_mb(); > } > > -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > - int index, > - unsigned long start, > - unsigned long end) > +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > + int index, > + unsigned long start, > + unsigned long end, > + bool blockable) > { > struct vhost_uaddr *uaddr = &vq->uaddrs[index]; > struct vhost_map *map; > > if (!vhost_map_range_overlap(uaddr, start, end)) > - return; > + return 0; > + else if (!blockable) > + return -EAGAIN; > > spin_lock(&vq->mmu_lock); > ++vq->invalidate_count; > @@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, > vhost_set_map_dirty(vq, map, index); > vhost_map_unprefetch(map); > } > + 
> + return 0; > } > > static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq, > @@ -443,18 +448,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn, > { > struct vhost_dev *dev = container_of(mn, struct vhost_dev, > mmu_notifier); > - int i, j; > - > - if (!mmu_notifier_range_blockable(range)) > - return -EAGAIN; > + bool blockable = mmu_notifier_range_blockable(range); > + int i, j, ret; > > for (i = 0; i < dev->nvqs; i++) { > struct vhost_virtqueue *vq = dev->vqs[i]; > > - for (j = 0; j < VHOST_NUM_ADDRS; j++) > - vhost_invalidate_vq_start(vq, j, > - range->start, > - range->end); > + for (j = 0; j < VHOST_NUM_ADDRS; j++) { > + ret = vhost_invalidate_vq_start(vq, j, > + range->start, > + range->end, blockable); > + if (ret) > + return ret; > + } > } > > return 0; > -- > 2.18.1 > -- ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early 2019-07-31 9:59 ` Stefano Garzarella @ 2019-07-31 10:05 ` Jason Wang 0 siblings, 0 replies; 51+ messages in thread From: Jason Wang @ 2019-07-31 10:05 UTC (permalink / raw) To: Stefano Garzarella Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm, jgg On 2019/7/31 下午5:59, Stefano Garzarella wrote: > A little typo in the title: s/EAGIAN/EAGAIN > > Thanks, > Stefano Right, will fix if need respin or Michael can help to fix. Thanks > > On Wed, Jul 31, 2019 at 04:46:55AM -0400, Jason Wang wrote: >> Instead of returning -EAGAIN unconditionally, we'd better do that only >> we're sure the range is overlapped with the metadata area. >> >> Reported-by: Jason Gunthorpe <jgg@ziepe.ca> >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address") >> Signed-off-by: Jason Wang <jasowang@redhat.com> >> --- >> drivers/vhost/vhost.c | 32 +++++++++++++++++++------------- >> 1 file changed, 19 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index fc2da8a0c671..96c6aeb1871f 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq) >> smp_mb(); >> } >> >> -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> - int index, >> - unsigned long start, >> - unsigned long end) >> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> + int index, >> + unsigned long start, >> + unsigned long end, >> + bool blockable) >> { >> struct vhost_uaddr *uaddr = &vq->uaddrs[index]; >> struct vhost_map *map; >> >> if (!vhost_map_range_overlap(uaddr, start, end)) >> - return; >> + return 0; >> + else if (!blockable) >> + return -EAGAIN; >> >> spin_lock(&vq->mmu_lock); >> ++vq->invalidate_count; >> @@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> 
vhost_set_map_dirty(vq, map, index); >> vhost_map_unprefetch(map); >> } >> + >> + return 0; >> } >> >> static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq, >> @@ -443,18 +448,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn, >> { >> struct vhost_dev *dev = container_of(mn, struct vhost_dev, >> mmu_notifier); >> - int i, j; >> - >> - if (!mmu_notifier_range_blockable(range)) >> - return -EAGAIN; >> + bool blockable = mmu_notifier_range_blockable(range); >> + int i, j, ret; >> >> for (i = 0; i < dev->nvqs; i++) { >> struct vhost_virtqueue *vq = dev->vqs[i]; >> >> - for (j = 0; j < VHOST_NUM_ADDRS; j++) >> - vhost_invalidate_vq_start(vq, j, >> - range->start, >> - range->end); >> + for (j = 0; j < VHOST_NUM_ADDRS; j++) { >> + ret = vhost_invalidate_vq_start(vq, j, >> + range->start, >> + range->end, blockable); >> + if (ret) >> + return ret; >> + } >> } >> >> return 0; >> -- >> 2.18.1 >> ^ permalink raw reply [flat|nested] 51+ messages in thread