linux-parisc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: WARNING in __mmdrop
       [not found] <0000000000008dd6bb058e006938@google.com>
@ 2019-07-20 10:08 ` syzbot
  2019-07-21 10:02   ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: syzbot @ 2019-07-20 10:08 UTC (permalink / raw)
  To: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, mst, namit, peterz, syzkaller-bugs, viro, wad

syzbot has bisected this bug to:

commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
Author: Jason Wang <jasowang@redhat.com>
Date:   Fri May 24 08:12:18 2019 +0000

     vhost: access vq metadata through kernel virtual address

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
start commit:   6d21a41b Add linux-next specific files for 20190718
git tree:       linux-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000

Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual  
address")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-20 10:08 ` WARNING in __mmdrop syzbot
@ 2019-07-21 10:02   ` Michael S. Tsirkin
  2019-07-21 12:18     ` Michael S. Tsirkin
                       ` (3 more replies)
  0 siblings, 4 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-21 10:02 UTC (permalink / raw)
  To: syzbot
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Fri May 24 08:12:18 2019 +0000
> 
>     vhost: access vq metadata through kernel virtual address
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> start commit:   6d21a41b Add linux-next specific files for 20190718
> git tree:       linux-next
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> 
> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> address")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection


OK I poked at this for a bit, I see several things that
we need to fix, though I'm not yet sure it's the reason for
the failures:


1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
   That's just a bad hack, in particular I don't think device
   mutex is taken and so poking at two VQs will corrupt
   memory.
   So what to do? How about a per vq notifier?
   Of course we also have synchronize_rcu
   in the notifier which is slow and is now going to be called twice.
   I think call_rcu would be more appropriate here.
   We then need rcu_barrier on module unload.
   OTOH if we make pages linear with map then we are good
   with kfree_rcu which is even nicer.

2. Doesn't map leak after vhost_map_unprefetch?
   And why does it poke at contents of the map?
   No one should use it right?

3. notifier unregister happens last in vhost_dev_cleanup,
   but register happens first. This looks wrong to me.

4. OK so we use the invalidate count to try and detect that
   some invalidate is in progress.
   I am not 100% sure why do we care.
   Assuming we do, uaddr can change between start and end
   and then the counter can get negative, or generally
   out of sync.

So what to do about all this?
I am inclined to say let's just drop the uaddr optimization
for now. E.g. kvm invalidates unconditionally.
3 should be fixed independently.


-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-21 10:02   ` Michael S. Tsirkin
@ 2019-07-21 12:18     ` Michael S. Tsirkin
  2019-07-22  5:24       ` Jason Wang
  2019-07-21 12:28     ` RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop) Michael S. Tsirkin
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-21 12:18 UTC (permalink / raw)
  To: syzbot
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > syzbot has bisected this bug to:
> > 
> > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Fri May 24 08:12:18 2019 +0000
> > 
> >     vhost: access vq metadata through kernel virtual address
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > start commit:   6d21a41b Add linux-next specific files for 20190718
> > git tree:       linux-next
> > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > 
> > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > address")
> > 
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> 
> 
> OK I poked at this for a bit, I see several things that
> we need to fix, though I'm not yet sure it's the reason for
> the failures:
> 
> 
> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>    That's just a bad hack, in particular I don't think device
>    mutex is taken and so poking at two VQs will corrupt
>    memory.
>    So what to do? How about a per vq notifier?
>    Of course we also have synchronize_rcu
>    in the notifier which is slow and is now going to be called twice.
>    I think call_rcu would be more appropriate here.
>    We then need rcu_barrier on module unload.
>    OTOH if we make pages linear with map then we are good
>    with kfree_rcu which is even nicer.
> 
> 2. Doesn't map leak after vhost_map_unprefetch?
>    And why does it poke at contents of the map?
>    No one should use it right?
> 
> 3. notifier unregister happens last in vhost_dev_cleanup,
>    but register happens first. This looks wrong to me.
> 
> 4. OK so we use the invalidate count to try and detect that
>    some invalidate is in progress.
>    I am not 100% sure why do we care.
>    Assuming we do, uaddr can change between start and end
>    and then the counter can get negative, or generally
>    out of sync.
> 
> So what to do about all this?
> I am inclined to say let's just drop the uaddr optimization
> for now. E.g. kvm invalidates unconditionally.
> 3 should be fixed independently.


Above implements this but is only build-tested.
Jason, pls take a look. If you like the approach feel
free to take it from here.

One thing the below does not have is any kind of rate-limiting.
Given it's so easy to restart I'm thinking it makes sense
to add a generic infrastructure for this.
Can be a separate patch I guess.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..1d89715af89d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -299,53 +299,30 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 }
 
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-static void vhost_map_unprefetch(struct vhost_map *map)
-{
-	kfree(map->pages);
-	map->pages = NULL;
-	map->npages = 0;
-	map->addr = NULL;
-}
-
-static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
+static void __vhost_cleanup_vq_maps(struct vhost_virtqueue *vq)
 {
 	struct vhost_map *map[VHOST_NUM_ADDRS];
 	int i;
 
-	spin_lock(&vq->mmu_lock);
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
 		map[i] = rcu_dereference_protected(vq->maps[i],
 				  lockdep_is_held(&vq->mmu_lock));
-		if (map[i])
+		if (map[i]) {
+			if (vq->uaddrs[i].write) {
+				for (i = 0; i < map[i]->npages; i++)
+					set_page_dirty(map[i]->pages[i]);
+			}
 			rcu_assign_pointer(vq->maps[i], NULL);
+			kfree_rcu(map[i], head);
+		}
 	}
+}
+
+static void vhost_cleanup_vq_maps(struct vhost_virtqueue *vq)
+{
+	spin_lock(&vq->mmu_lock);
+	__vhost_cleanup_vq_maps(vq);
 	spin_unlock(&vq->mmu_lock);
-
-	synchronize_rcu();
-
-	for (i = 0; i < VHOST_NUM_ADDRS; i++)
-		if (map[i])
-			vhost_map_unprefetch(map[i]);
-
-}
-
-static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
-{
-	int i;
-
-	vhost_uninit_vq_maps(vq);
-	for (i = 0; i < VHOST_NUM_ADDRS; i++)
-		vq->uaddrs[i].size = 0;
-}
-
-static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
-				     unsigned long start,
-				     unsigned long end)
-{
-	if (unlikely(!uaddr->size))
-		return false;
-
-	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
 }
 
 static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
@@ -353,31 +330,11 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 				      unsigned long start,
 				      unsigned long end)
 {
-	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
-	struct vhost_map *map;
-	int i;
-
-	if (!vhost_map_range_overlap(uaddr, start, end))
-		return;
-
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
 
-	map = rcu_dereference_protected(vq->maps[index],
-					lockdep_is_held(&vq->mmu_lock));
-	if (map) {
-		if (uaddr->write) {
-			for (i = 0; i < map->npages; i++)
-				set_page_dirty(map->pages[i]);
-		}
-		rcu_assign_pointer(vq->maps[index], NULL);
-	}
+	__vhost_cleanup_vq_maps(vq);
 	spin_unlock(&vq->mmu_lock);
-
-	if (map) {
-		synchronize_rcu();
-		vhost_map_unprefetch(map);
-	}
 }
 
 static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
@@ -385,9 +342,6 @@ static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
 				    unsigned long start,
 				    unsigned long end)
 {
-	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
-		return;
-
 	spin_lock(&vq->mmu_lock);
 	--vq->invalidate_count;
 	spin_unlock(&vq->mmu_lock);
@@ -483,7 +437,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->invalidate_count = 0;
 	__vhost_vq_meta_reset(vq);
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-	vhost_reset_vq_maps(vq);
+	vhost_cleanup_vq_maps(vq);
 #endif
 }
 
@@ -833,6 +787,7 @@ static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
 			      size_t size, bool write)
 {
 	struct vhost_uaddr *addr = &vq->uaddrs[index];
+	spin_lock(&vq->mmu_lock);
 
 	addr->uaddr = uaddr;
 	addr->size = size;
@@ -841,6 +796,8 @@ static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
 
 static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
 {
+	spin_lock(&vq->mmu_lock);
+
 	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
 			  (unsigned long)vq->desc,
 			  vhost_get_desc_size(vq, vq->num),
@@ -853,6 +810,8 @@ static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
 			  (unsigned long)vq->used,
 			  vhost_get_used_size(vq, vq->num),
 			  true);
+
+	spin_unlock(&vq->mmu_lock);
 }
 
 static int vhost_map_prefetch(struct vhost_virtqueue *vq,
@@ -874,13 +833,11 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 		goto err;
 
 	err = -ENOMEM;
-	map = kmalloc(sizeof(*map), GFP_ATOMIC);
+	map = kmalloc(sizeof(*map) + sizeof(*map->pages) * npages, GFP_ATOMIC);
 	if (!map)
 		goto err;
 
-	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
-	if (!pages)
-		goto err_pages;
+	pages = map->pages;
 
 	err = EFAULT;
 	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
@@ -907,7 +864,6 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 
 	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
 	map->npages = npages;
-	map->pages = pages;
 
 	rcu_assign_pointer(vq->maps[index], map);
 	/* No need for a synchronize_rcu(). This function should be
@@ -919,8 +875,6 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 	return 0;
 
 err_gup:
-	kfree(pages);
-err_pages:
 	kfree(map);
 err:
 	spin_unlock(&vq->mmu_lock);
@@ -942,6 +896,10 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
+#if VHOST_ARCH_CAN_ACCEL_UACCESS
+	if (dev->mm)
+		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
+#endif
 	if (dev->log_ctx)
 		eventfd_ctx_put(dev->log_ctx);
 	dev->log_ctx = NULL;
@@ -957,16 +915,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 		kthread_stop(dev->worker);
 		dev->worker = NULL;
 	}
-	if (dev->mm) {
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
-		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
-#endif
+	if (dev->mm)
 		mmput(dev->mm);
-	}
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
-	for (i = 0; i < dev->nvqs; i++)
-		vhost_uninit_vq_maps(dev->vqs[i]);
-#endif
 	dev->mm = NULL;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
@@ -1426,7 +1376,7 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
 		if (likely(map)) {
 			avail = map->addr;
-			*event = (__virtio16)avail->ring[vq->num];
+			*event = avail->ring[vq->num];
 			rcu_read_unlock();
 			return 0;
 		}
@@ -1830,6 +1780,8 @@ static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 	struct vhost_map __rcu *map;
 	int i;
 
+	vhost_setup_vq_uaddr(vq);
+
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
 		rcu_read_lock();
 		map = rcu_dereference(vq->maps[i]);
@@ -1838,6 +1790,10 @@ static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 			vhost_map_prefetch(vq, i);
 	}
 }
+#else
+static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
+{
+}
 #endif
 
 int vq_meta_prefetch(struct vhost_virtqueue *vq)
@@ -1845,9 +1801,7 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
 	unsigned int num = vq->num;
 
 	if (!vq->iotlb) {
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
 		vhost_vq_map_prefetch(vq);
-#endif
 		return 1;
 	}
 
@@ -2060,16 +2014,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 
 	mutex_lock(&vq->mutex);
 
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
-	/* Unregister MMU notifer to allow invalidation callback
-	 * can access vq->uaddrs[] without holding a lock.
-	 */
-	if (d->mm)
-		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
-
-	vhost_uninit_vq_maps(vq);
-#endif
-
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		r = vhost_vring_set_num(d, vq, argp);
@@ -2081,13 +2025,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 		BUG();
 	}
 
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
-	vhost_setup_vq_uaddr(vq);
-
-	if (d->mm)
-		mmu_notifier_register(&d->mmu_notifier, d->mm);
-#endif
-
 	mutex_unlock(&vq->mutex);
 
 	return r;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 819296332913..584bb13c4d6d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -86,7 +86,8 @@ enum vhost_uaddr_type {
 struct vhost_map {
 	int npages;
 	void *addr;
-	struct page **pages;
+	struct rcu_head head;
+	struct page *pages[];
 };
 
 struct vhost_uaddr {

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 10:02   ` Michael S. Tsirkin
  2019-07-21 12:18     ` Michael S. Tsirkin
@ 2019-07-21 12:28     ` Michael S. Tsirkin
  2019-07-21 13:17       ` Paul E. McKenney
  2019-07-22  5:21     ` WARNING in __mmdrop Jason Wang
  2019-07-22 14:11     ` Jason Gunthorpe
  3 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-21 12:28 UTC (permalink / raw)
  To: paulmck
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

Hi Paul, others,

So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
is what happens if userspace starts cycling through lots of these
ioctls.  Given we actually use rcu as an optimization, we could just
disable the optimization temporarily - but the question would be how to
detect an excessive rate without working too hard :) .

I guess we could define as excessive any rate where callback is
outstanding at the time when new structure is allocated.  I have very
little understanding of rcu internals - so I wanted to check that the
following more or less implements this heuristic before I spend time
actually testing it.

Could others pls take a look and let me know?

Thanks!

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 477b4eb44af5..067909521d72 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -125,6 +125,25 @@ void synchronize_rcu(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	struct rcu_data *rdp;
+	bool outstanding;
+
+	local_irq_save(flags);
+	rdp = this_cpu_ptr(&rcu_data);
+	outstanding = rcu_segcblist_empty(&rdp->cblist);
+	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Post an RCU callback to be invoked after the end of an RCU grace
  * period.  But since we have but one CPU, that would be after any
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..d4b9d61e637d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
 {
 }
 
+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	struct rcu_data *rdp;
+	bool outstanding;
+
+	local_irq_save(flags);
+	rdp = this_cpu_ptr(&rcu_data);
+	outstanding = rcu_segcblist_empty(&rdp->cblist);
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Helper function for call_rcu() and friends.  The cpu argument will
  * normally be -1, indicating "currently running CPU".  It may specify

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 12:28     ` RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop) Michael S. Tsirkin
@ 2019-07-21 13:17       ` Paul E. McKenney
  2019-07-21 17:53         ` Michael S. Tsirkin
  2019-07-21 21:08         ` Matthew Wilcox
  0 siblings, 2 replies; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-21 13:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> Hi Paul, others,
> 
> So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> is what happens if userspace starts cycling through lots of these
> ioctls.  Given we actually use rcu as an optimization, we could just
> disable the optimization temporarily - but the question would be how to
> detect an excessive rate without working too hard :) .
> 
> I guess we could define as excessive any rate where callback is
> outstanding at the time when new structure is allocated.  I have very
> little understanding of rcu internals - so I wanted to check that the
> following more or less implements this heuristic before I spend time
> actually testing it.
> 
> Could others pls take a look and let me know?

These look good as a way of seeing if there are any outstanding callbacks,
but in the case of Tree RCU, call_rcu_outstanding() would almost never
return false on a busy system.

Here are some alternatives:

o	RCU uses some pieces of Rao Shoaib kfree_rcu() patches.
	The idea is to make kfree_rcu() locally buffer requests into
	batches of (say) 1,000, but processing smaller batches when RCU
	is idle, or when some smallish amout of time has passed with
	no more kfree_rcu() request from that CPU.  RCU than takes in
	the batch using not call_rcu(), but rather queue_rcu_work().
	The resulting batch of kfree() calls would therefore execute in
	workqueue context rather than in softirq context, which should
	be much easier on the system.

	In theory, this would allow people to use kfree_rcu() without
	worrying quite so much about overload.  It would also not be
	that hard to implement.

o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
	call_rcu() instead of kfree_rcu().  Keep a count of the number
	of things waiting for a grace period, and when this gets too
	large, disable the optimization.  It will then drain down, at
	which point the optimization can be re-enabled.

	But please note that callbacks are -not- guaranteed to run on
	the CPU that queued them.  So yes, you would need a per-CPU
	counter, but you would need to periodically sum it up to check
	against the global state.  Or keep track of the CPU that
	did the call_rcu() so that you can atomically decrement in
	the callback the same counter that was atomically incremented
	just before the call_rcu().  Or any number of other approaches.

Also, the overhead is important.  For example, as far as I know,
current RCU gracefully handles close(open(...)) in a tight userspace
loop.  But there might be trouble due to tight userspace loops around
lighter-weight operations.

So an important question is "Just how fast is your ioctl?"  If it takes
(say) 100 microseconds to execute, there should be absolutely no problem.
On the other hand, if it can execute in 50 nanoseconds, this very likely
does need serious attention.

Other thoughts?

							Thanx, Paul

> Thanks!
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index 477b4eb44af5..067909521d72 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -125,6 +125,25 @@ void synchronize_rcu(void)
>  }
>  EXPORT_SYMBOL_GPL(synchronize_rcu);
> 
> +/*
> + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> + */
> +bool call_rcu_outstanding(void)
> +{
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	bool outstanding;
> +
> +	local_irq_save(flags);
> +	rdp = this_cpu_ptr(&rcu_data);
> +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> +	local_irq_restore(flags);
> +
> +	return outstanding;
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> +
>  /*
>   * Post an RCU callback to be invoked after the end of an RCU grace
>   * period.  But since we have but one CPU, that would be after any
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a14e5fbbea46..d4b9d61e637d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
>  {
>  }
> 
> +/*
> + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> + */
> +bool call_rcu_outstanding(void)
> +{
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	bool outstanding;
> +
> +	local_irq_save(flags);
> +	rdp = this_cpu_ptr(&rcu_data);
> +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> +	local_irq_restore(flags);
> +
> +	return outstanding;
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> +
>  /*
>   * Helper function for call_rcu() and friends.  The cpu argument will
>   * normally be -1, indicating "currently running CPU".  It may specify

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 13:17       ` Paul E. McKenney
@ 2019-07-21 17:53         ` Michael S. Tsirkin
  2019-07-21 19:28           ` Paul E. McKenney
  2019-07-21 21:08         ` Matthew Wilcox
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-21 17:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > Hi Paul, others,
> > 
> > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > is what happens if userspace starts cycling through lots of these
> > ioctls.  Given we actually use rcu as an optimization, we could just
> > disable the optimization temporarily - but the question would be how to
> > detect an excessive rate without working too hard :) .
> > 
> > I guess we could define as excessive any rate where callback is
> > outstanding at the time when new structure is allocated.  I have very
> > little understanding of rcu internals - so I wanted to check that the
> > following more or less implements this heuristic before I spend time
> > actually testing it.
> > 
> > Could others pls take a look and let me know?
> 
> These look good as a way of seeing if there are any outstanding callbacks,
> but in the case of Tree RCU, call_rcu_outstanding() would almost never
> return false on a busy system.


Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

> 
> Here are some alternatives:
> 
> o	RCU uses some pieces of Rao Shoaib kfree_rcu() patches.
> 	The idea is to make kfree_rcu() locally buffer requests into
> 	batches of (say) 1,000, but processing smaller batches when RCU
> 	is idle, or when some smallish amout of time has passed with
> 	no more kfree_rcu() request from that CPU.  RCU than takes in
> 	the batch using not call_rcu(), but rather queue_rcu_work().
> 	The resulting batch of kfree() calls would therefore execute in
> 	workqueue context rather than in softirq context, which should
> 	be much easier on the system.
> 
> 	In theory, this would allow people to use kfree_rcu() without
> 	worrying quite so much about overload.  It would also not be
> 	that hard to implement.
> 
> o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> 	of things waiting for a grace period, and when this gets too
> 	large, disable the optimization.  It will then drain down, at
> 	which point the optimization can be re-enabled.
> 
> 	But please note that callbacks are -not- guaranteed to run on
> 	the CPU that queued them.  So yes, you would need a per-CPU
> 	counter, but you would need to periodically sum it up to check
> 	against the global state.  Or keep track of the CPU that
> 	did the call_rcu() so that you can atomically decrement in
> 	the callback the same counter that was atomically incremented
> 	just before the call_rcu().  Or any number of other approaches.

I'm really looking for something we can do this merge window
and without adding too much code, and kfree_rcu is intended to
fix a bug.
Adding call_rcu and careful accounting is something that I'm not
happy adding with merge window already open.

> 
> Also, the overhead is important.  For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop.  But there might be trouble due to tight userspace loops around
> lighter-weight operations.
> 
> So an important question is "Just how fast is your ioctl?"  If it takes
> (say) 100 microseconds to execute, there should be absolutely no problem.
> On the other hand, if it can execute in 50 nanoseconds, this very likely
> does need serious attention.
> 
> Other thoughts?
> 
> 							Thanx, Paul

Hmm the answer to this would be I'm not sure.
It's setup time stuff we never tested it.

> > Thanks!
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > 
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index 477b4eb44af5..067909521d72 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -125,6 +125,25 @@ void synchronize_rcu(void)
> >  }
> >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > 
> > +/*
> > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > + */
> > +bool call_rcu_outstanding(void)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_data *rdp;
> > +	bool outstanding;
> > +
> > +	local_irq_save(flags);
> > +	rdp = this_cpu_ptr(&rcu_data);
> > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > +	local_irq_restore(flags);
> > +
> > +	return outstanding;
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > +
> >  /*
> >   * Post an RCU callback to be invoked after the end of an RCU grace
> >   * period.  But since we have but one CPU, that would be after any
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a14e5fbbea46..d4b9d61e637d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> >  {
> >  }
> > 
> > +/*
> > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > + */
> > +bool call_rcu_outstanding(void)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_data *rdp;
> > +	bool outstanding;
> > +
> > +	local_irq_save(flags);
> > +	rdp = this_cpu_ptr(&rcu_data);
> > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > +	local_irq_restore(flags);
> > +
> > +	return outstanding;
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > +
> >  /*
> >   * Helper function for call_rcu() and friends.  The cpu argument will
> >   * normally be -1, indicating "currently running CPU".  It may specify

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 17:53         ` Michael S. Tsirkin
@ 2019-07-21 19:28           ` Paul E. McKenney
  2019-07-22  7:56             ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-21 19:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > Hi Paul, others,
> > > 
> > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > is what happens if userspace starts cycling through lots of these
> > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > disable the optimization temporarily - but the question would be how to
> > > detect an excessive rate without working too hard :) .
> > > 
> > > I guess we could define as excessive any rate where callback is
> > > outstanding at the time when new structure is allocated.  I have very
> > > little understanding of rcu internals - so I wanted to check that the
> > > following more or less implements this heuristic before I spend time
> > > actually testing it.
> > > 
> > > Could others pls take a look and let me know?
> > 
> > These look good as a way of seeing if there are any outstanding callbacks,
> > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > return false on a busy system.
> 
> Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

Or the function could simply return the number of callbacks queued
on the current CPU, and let the caller decide how many is too many.

> > Here are some alternatives:
> > 
> > o	RCU uses some pieces of Rao Shoaib kfree_rcu() patches.
> > 	The idea is to make kfree_rcu() locally buffer requests into
> > 	batches of (say) 1,000, but processing smaller batches when RCU
> > 	is idle, or when some smallish amout of time has passed with
> > 	no more kfree_rcu() request from that CPU.  RCU than takes in
> > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > 	The resulting batch of kfree() calls would therefore execute in
> > 	workqueue context rather than in softirq context, which should
> > 	be much easier on the system.
> > 
> > 	In theory, this would allow people to use kfree_rcu() without
> > 	worrying quite so much about overload.  It would also not be
> > 	that hard to implement.
> > 
> > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > 	of things waiting for a grace period, and when this gets too
> > 	large, disable the optimization.  It will then drain down, at
> > 	which point the optimization can be re-enabled.
> > 
> > 	But please note that callbacks are -not- guaranteed to run on
> > 	the CPU that queued them.  So yes, you would need a per-CPU
> > 	counter, but you would need to periodically sum it up to check
> > 	against the global state.  Or keep track of the CPU that
> > 	did the call_rcu() so that you can atomically decrement in
> > 	the callback the same counter that was atomically incremented
> > 	just before the call_rcu().  Or any number of other approaches.
> 
> I'm really looking for something we can do this merge window
> and without adding too much code, and kfree_rcu is intended to
> fix a bug.
> Adding call_rcu and careful accounting is something that I'm not
> happy adding with merge window already open.

OK, then I suggest having the interface return you the number of
callbacks.  That allows you to experiment with the cutoff.

Give or take the ioctl overhead...

> > Also, the overhead is important.  For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop.  But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
> > 
> > So an important question is "Just how fast is your ioctl?"  If it takes
> > (say) 100 microseconds to execute, there should be absolutely no problem.
> > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > does need serious attention.
> > 
> > Other thoughts?
> > 
> > 							Thanx, Paul
> 
> Hmm the answer to this would be I'm not sure.
> It's setup time stuff we never tested it.

Is it possible to measure it easily?

							Thanx, Paul

> > > Thanks!
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > 
> > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > index 477b4eb44af5..067909521d72 100644
> > > --- a/kernel/rcu/tiny.c
> > > +++ b/kernel/rcu/tiny.c
> > > @@ -125,6 +125,25 @@ void synchronize_rcu(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > 
> > > +/*
> > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > + */
> > > +bool call_rcu_outstanding(void)
> > > +{
> > > +	unsigned long flags;
> > > +	struct rcu_data *rdp;
> > > +	bool outstanding;
> > > +
> > > +	local_irq_save(flags);
> > > +	rdp = this_cpu_ptr(&rcu_data);
> > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > +	local_irq_restore(flags);
> > > +
> > > +	return outstanding;
> > > +}
> > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > +
> > >  /*
> > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > >   * period.  But since we have but one CPU, that would be after any
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a14e5fbbea46..d4b9d61e637d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > >  {
> > >  }
> > > 
> > > +/*
> > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > + */
> > > +bool call_rcu_outstanding(void)
> > > +{
> > > +	unsigned long flags;
> > > +	struct rcu_data *rdp;
> > > +	bool outstanding;
> > > +
> > > +	local_irq_save(flags);
> > > +	rdp = this_cpu_ptr(&rcu_data);
> > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > +	local_irq_restore(flags);
> > > +
> > > +	return outstanding;
> > > +}
> > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > +
> > >  /*
> > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > >   * normally be -1, indicating "currently running CPU".  It may specify
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 13:17       ` Paul E. McKenney
  2019-07-21 17:53         ` Michael S. Tsirkin
@ 2019-07-21 21:08         ` Matthew Wilcox
  2019-07-21 23:31           ` Paul E. McKenney
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2019-07-21 21:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael S. Tsirkin, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> Also, the overhead is important.  For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop.  But there might be trouble due to tight userspace loops around
> lighter-weight operations.

I thought you believed that RCU was antifragile, in that it would scale
better as it was used more heavily?

Would it make sense to have call_rcu() check to see if there are many
outstanding requests on this CPU and if so process them before returning?
That would ensure that frequent callers usually ended up doing their
own processing.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 21:08         ` Matthew Wilcox
@ 2019-07-21 23:31           ` Paul E. McKenney
  2019-07-22  7:52             ` Michael S. Tsirkin
  2019-07-22 15:14             ` Joel Fernandes
  0 siblings, 2 replies; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-21 23:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > Also, the overhead is important.  For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop.  But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
> 
> I thought you believed that RCU was antifragile, in that it would scale
> better as it was used more heavily?

You are referring to this?  https://paulmck.livejournal.com/47933.html

If so, the last few paragraphs might be worth re-reading.   ;-)

And in this case, the heuristics RCU uses to decide when to schedule
invocation of the callbacks needs some help.  One component of that help
is a time-based limit to the number of consecutive callback invocations
(see my crude prototype and Eric Dumazet's more polished patch).  Another
component is an overload warning.

Why would an overload warning be needed if RCU's callback-invocation
scheduling heurisitics were upgraded?  Because someone could boot a
100-CPU system with the rcu_nocbs=0-99, bind all of the resulting
rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.

So warnings are required as well.

> Would it make sense to have call_rcu() check to see if there are many
> outstanding requests on this CPU and if so process them before returning?
> That would ensure that frequent callers usually ended up doing their
> own processing.

Unfortunately, no.  Here is a code fragment illustrating why:

	void my_cb(struct rcu_head *rhp)
	{
		unsigned long flags;

		spin_lock_irqsave(&my_lock, flags);
		handle_cb(rhp);
		spin_unlock_irqrestore(&my_lock, flags);
	}

	. . .

	spin_lock_irqsave(&my_lock, flags);
	p = look_something_up();
	remove_that_something(p);
	call_rcu(p, my_cb);
	spin_unlock_irqrestore(&my_lock, flags);

Invoking the extra callbacks directly from call_rcu() would thus result
in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
along these lines.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-21 10:02   ` Michael S. Tsirkin
  2019-07-21 12:18     ` Michael S. Tsirkin
  2019-07-21 12:28     ` RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop) Michael S. Tsirkin
@ 2019-07-22  5:21     ` Jason Wang
  2019-07-22  8:02       ` Michael S. Tsirkin
  2019-07-22 14:11     ` Jason Gunthorpe
  3 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-22  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin, syzbot
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>> syzbot has bisected this bug to:
>>
>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>> Author: Jason Wang <jasowang@redhat.com>
>> Date:   Fri May 24 08:12:18 2019 +0000
>>
>>      vhost: access vq metadata through kernel virtual address
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>> start commit:   6d21a41b Add linux-next specific files for 20190718
>> git tree:       linux-next
>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>
>> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>> address")
>>
>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
> OK I poked at this for a bit, I see several things that
> we need to fix, though I'm not yet sure it's the reason for
> the failures:
>
>
> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>     That's just a bad hack,


This is used to avoid holding lock when checking whether the addresses 
are overlapped. Otherwise we need to take spinlock for each invalidation 
request even if it was the va range that is not interested for us. This 
will be very slow e.g during guest boot.


>   in particular I don't think device
>     mutex is taken and so poking at two VQs will corrupt
>     memory.


The caller vhost_net_ioctl() (or scsi and vsock) will hold device mutex 
before calling us.


>     So what to do? How about a per vq notifier?
>     Of course we also have synchronize_rcu
>     in the notifier which is slow and is now going to be called twice.
>     I think call_rcu would be more appropriate here.
>     We then need rcu_barrier on module unload.


So this seems unnecessary.


>     OTOH if we make pages linear with map then we are good
>     with kfree_rcu which is even nicer.


It could be an optimization on top.


>
> 2. Doesn't map leak after vhost_map_unprefetch?
>     And why does it poke at contents of the map?
>     No one should use it right?


Yes, it's not hard to fix just kfree map in this function.


>
> 3. notifier unregister happens last in vhost_dev_cleanup,
>     but register happens first. This looks wrong to me.


I'm not sure I get the the exact issue here.


>
> 4. OK so we use the invalidate count to try and detect that
>     some invalidate is in progress.
>     I am not 100% sure why do we care.
>     Assuming we do, uaddr can change between start and end
>     and then the counter can get negative, or generally
>     out of sync.


Yes, so the fix is as simple as zero the invalidate_count after 
unregister  the mmu notifier in vhost_set_vring_num_addr().


>
> So what to do about all this?
> I am inclined to say let's just drop the uaddr optimization
> for now. E.g. kvm invalidates unconditionally.
> 3 should be fixed independently.


Maybe it's better to try to fix with the exist uaddr optimization first.

I did spot two other issues:

1) we don't check the return value mmu_register in vhost_set_vring_num()

2) we try to setup vq address even if set_vring_addr() fail


For the bug it self, it looks to me that the mm refcount was messed up 
since we try to register and unregister MMU notifier. But I haven't 
figured out why, will do more investigation.

Thanks


>
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-21 12:18     ` Michael S. Tsirkin
@ 2019-07-22  5:24       ` Jason Wang
  2019-07-22  8:08         ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-22  5:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, syzbot
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>> syzbot has bisected this bug to:
>>>
>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>> Author: Jason Wang<jasowang@redhat.com>
>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>
>>>      vhost: access vq metadata through kernel virtual address
>>>
>>> bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>> git tree:       linux-next
>>> final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>> console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>> kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>> dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>> syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>
>>> Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>> address")
>>>
>>> For information about bisection process see:https://goo.gl/tpsmEJ#bisection
>> OK I poked at this for a bit, I see several things that
>> we need to fix, though I'm not yet sure it's the reason for
>> the failures:
>>
>>
>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>     That's just a bad hack, in particular I don't think device
>>     mutex is taken and so poking at two VQs will corrupt
>>     memory.
>>     So what to do? How about a per vq notifier?
>>     Of course we also have synchronize_rcu
>>     in the notifier which is slow and is now going to be called twice.
>>     I think call_rcu would be more appropriate here.
>>     We then need rcu_barrier on module unload.
>>     OTOH if we make pages linear with map then we are good
>>     with kfree_rcu which is even nicer.
>>
>> 2. Doesn't map leak after vhost_map_unprefetch?
>>     And why does it poke at contents of the map?
>>     No one should use it right?
>>
>> 3. notifier unregister happens last in vhost_dev_cleanup,
>>     but register happens first. This looks wrong to me.
>>
>> 4. OK so we use the invalidate count to try and detect that
>>     some invalidate is in progress.
>>     I am not 100% sure why do we care.
>>     Assuming we do, uaddr can change between start and end
>>     and then the counter can get negative, or generally
>>     out of sync.
>>
>> So what to do about all this?
>> I am inclined to say let's just drop the uaddr optimization
>> for now. E.g. kvm invalidates unconditionally.
>> 3 should be fixed independently.
> Above implements this but is only build-tested.
> Jason, pls take a look. If you like the approach feel
> free to take it from here.
>
> One thing the below does not have is any kind of rate-limiting.
> Given it's so easy to restart I'm thinking it makes sense
> to add a generic infrastructure for this.
> Can be a separate patch I guess.


I don't get why must use kfree_rcu() instead of synchronize_rcu() here.


>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>


Let me try to figure out the root cause then decide whether or not to go 
for this way.

Thanks



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 23:31           ` Paul E. McKenney
@ 2019-07-22  7:52             ` Michael S. Tsirkin
  2019-07-22 11:51               ` Paul E. McKenney
  2019-07-22 15:14             ` Joel Fernandes
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22  7:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > Also, the overhead is important.  For example, as far as I know,
> > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > loop.  But there might be trouble due to tight userspace loops around
> > > lighter-weight operations.
> > 
> > I thought you believed that RCU was antifragile, in that it would scale
> > better as it was used more heavily?
> 
> You are referring to this?  https://paulmck.livejournal.com/47933.html
> 
> If so, the last few paragraphs might be worth re-reading.   ;-)
> 
> And in this case, the heuristics RCU uses to decide when to schedule
> invocation of the callbacks needs some help.  One component of that help
> is a time-based limit to the number of consecutive callback invocations
> (see my crude prototype and Eric Dumazet's more polished patch).  Another
> component is an overload warning.
> 
> Why would an overload warning be needed if RCU's callback-invocation
> scheduling heurisitics were upgraded?  Because someone could boot a
> 100-CPU system with the rcu_nocbs=0-99, bind all of the resulting
> rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
> on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.
> 
> So warnings are required as well.
> 
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.
> 
> Unfortunately, no.  Here is a code fragment illustrating why:
> 
> 	void my_cb(struct rcu_head *rhp)
> 	{
> 		unsigned long flags;
> 
> 		spin_lock_irqsave(&my_lock, flags);
> 		handle_cb(rhp);
> 		spin_unlock_irqrestore(&my_lock, flags);
> 	}
> 
> 	. . .
> 
> 	spin_lock_irqsave(&my_lock, flags);
> 	p = look_something_up();
> 	remove_that_something(p);
> 	call_rcu(p, my_cb);
> 	spin_unlock_irqrestore(&my_lock, flags);
> 
> Invoking the extra callbacks directly from call_rcu() would thus result
> in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
> along these lines.

We could add an option that simply fails if overloaded, right?
Have caller recover...

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 19:28           ` Paul E. McKenney
@ 2019-07-22  7:56             ` Michael S. Tsirkin
  2019-07-22 11:57               ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22  7:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > > Hi Paul, others,
> > > > 
> > > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > > is what happens if userspace starts cycling through lots of these
> > > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > > disable the optimization temporarily - but the question would be how to
> > > > detect an excessive rate without working too hard :) .
> > > > 
> > > > I guess we could define as excessive any rate where callback is
> > > > outstanding at the time when new structure is allocated.  I have very
> > > > little understanding of rcu internals - so I wanted to check that the
> > > > following more or less implements this heuristic before I spend time
> > > > actually testing it.
> > > > 
> > > > Could others pls take a look and let me know?
> > > 
> > > These look good as a way of seeing if there are any outstanding callbacks,
> > > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > > return false on a busy system.
> > 
> > Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> > and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
> 
> Or the function could simply return the number of callbacks queued
> on the current CPU, and let the caller decide how many is too many.
> 
> > > Here are some alternatives:
> > > 
> > > o	RCU uses some pieces of Rao Shoaib kfree_rcu() patches.
> > > 	The idea is to make kfree_rcu() locally buffer requests into
> > > 	batches of (say) 1,000, but processing smaller batches when RCU
> > > 	is idle, or when some smallish amout of time has passed with
> > > 	no more kfree_rcu() request from that CPU.  RCU than takes in
> > > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > > 	The resulting batch of kfree() calls would therefore execute in
> > > 	workqueue context rather than in softirq context, which should
> > > 	be much easier on the system.
> > > 
> > > 	In theory, this would allow people to use kfree_rcu() without
> > > 	worrying quite so much about overload.  It would also not be
> > > 	that hard to implement.
> > > 
> > > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > > 	of things waiting for a grace period, and when this gets too
> > > 	large, disable the optimization.  It will then drain down, at
> > > 	which point the optimization can be re-enabled.
> > > 
> > > 	But please note that callbacks are -not- guaranteed to run on
> > > 	the CPU that queued them.  So yes, you would need a per-CPU
> > > 	counter, but you would need to periodically sum it up to check
> > > 	against the global state.  Or keep track of the CPU that
> > > 	did the call_rcu() so that you can atomically decrement in
> > > 	the callback the same counter that was atomically incremented
> > > 	just before the call_rcu().  Or any number of other approaches.
> > 
> > I'm really looking for something we can do this merge window
> > and without adding too much code, and kfree_rcu is intended to
> > fix a bug.
> > Adding call_rcu and careful accounting is something that I'm not
> > happy adding with merge window already open.
> 
> OK, then I suggest having the interface return you the number of
> callbacks.  That allows you to experiment with the cutoff.
> 
> Give or take the ioctl overhead...

OK - and for tiny just assume 1 is too much?


> > > Also, the overhead is important.  For example, as far as I know,
> > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > loop.  But there might be trouble due to tight userspace loops around
> > > lighter-weight operations.
> > > 
> > > So an important question is "Just how fast is your ioctl?"  If it takes
> > > (say) 100 microseconds to execute, there should be absolutely no problem.
> > > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > > does need serious attention.
> > > 
> > > Other thoughts?
> > > 
> > > 							Thanx, Paul
> > 
> > Hmm the answer to this would be I'm not sure.
> > It's setup time stuff we never tested it.
> 
> Is it possible to measure it easily?
> 
> 							Thanx, Paul
> 
> > > > Thanks!
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > 
> > > > 
> > > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > > index 477b4eb44af5..067909521d72 100644
> > > > --- a/kernel/rcu/tiny.c
> > > > +++ b/kernel/rcu/tiny.c
> > > > @@ -125,6 +125,25 @@ void synchronize_rcu(void)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > > 
> > > > +/*
> > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > + */
> > > > +bool call_rcu_outstanding(void)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	struct rcu_data *rdp;
> > > > +	bool outstanding;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > > +	local_irq_restore(flags);
> > > > +
> > > > +	return outstanding;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > +
> > > >  /*
> > > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > > >   * period.  But since we have but one CPU, that would be after any
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a14e5fbbea46..d4b9d61e637d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > > >  {
> > > >  }
> > > > 
> > > > +/*
> > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > + */
> > > > +bool call_rcu_outstanding(void)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	struct rcu_data *rdp;
> > > > +	bool outstanding;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > > +	local_irq_restore(flags);
> > > > +
> > > > +	return outstanding;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > +
> > > >  /*
> > > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > > >   * normally be -1, indicating "currently running CPU".  It may specify
> > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-22  5:21     ` WARNING in __mmdrop Jason Wang
@ 2019-07-22  8:02       ` Michael S. Tsirkin
  2019-07-23  3:55         ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22  8:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
> 
> On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > syzbot has bisected this bug to:
> > > 
> > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Fri May 24 08:12:18 2019 +0000
> > > 
> > >      vhost: access vq metadata through kernel virtual address
> > > 
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > git tree:       linux-next
> > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > 
> > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address")
> > > 
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > 
> > OK I poked at this for a bit, I see several things that
> > we need to fix, though I'm not yet sure it's the reason for
> > the failures:
> > 
> > 
> > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> >     That's just a bad hack,
> 
> 
> This is used to avoid holding lock when checking whether the addresses are
> overlapped. Otherwise we need to take spinlock for each invalidation request
> even if it was the va range that is not interested for us. This will be very
> slow e.g during guest boot.

KVM seems to do exactly that.
I tried and guest does not seem to boot any slower.
Do you observe any slowdown?

Now I took a hard look at the uaddr hackery it really makes
me nervious. So I think for this release we want something
safe, and optimizations on top. As an alternative revert the
optimization and try again for next merge window.


-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-22  5:24       ` Jason Wang
@ 2019-07-22  8:08         ` Michael S. Tsirkin
  2019-07-23  4:01           ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22  8:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
> 
> On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
> > On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > syzbot has bisected this bug to:
> > > > 
> > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > Author: Jason Wang<jasowang@redhat.com>
> > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > 
> > > >      vhost: access vq metadata through kernel virtual address
> > > > 
> > > > bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > git tree:       linux-next
> > > > final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > 
> > > > Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > address")
> > > > 
> > > > For information about bisection process see:https://goo.gl/tpsmEJ#bisection
> > > OK I poked at this for a bit, I see several things that
> > > we need to fix, though I'm not yet sure it's the reason for
> > > the failures:
> > > 
> > > 
> > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > >     That's just a bad hack, in particular I don't think device
> > >     mutex is taken and so poking at two VQs will corrupt
> > >     memory.
> > >     So what to do? How about a per vq notifier?
> > >     Of course we also have synchronize_rcu
> > >     in the notifier which is slow and is now going to be called twice.
> > >     I think call_rcu would be more appropriate here.
> > >     We then need rcu_barrier on module unload.
> > >     OTOH if we make pages linear with map then we are good
> > >     with kfree_rcu which is even nicer.
> > > 
> > > 2. Doesn't map leak after vhost_map_unprefetch?
> > >     And why does it poke at contents of the map?
> > >     No one should use it right?
> > > 
> > > 3. notifier unregister happens last in vhost_dev_cleanup,
> > >     but register happens first. This looks wrong to me.
> > > 
> > > 4. OK so we use the invalidate count to try and detect that
> > >     some invalidate is in progress.
> > >     I am not 100% sure why do we care.
> > >     Assuming we do, uaddr can change between start and end
> > >     and then the counter can get negative, or generally
> > >     out of sync.
> > > 
> > > So what to do about all this?
> > > I am inclined to say let's just drop the uaddr optimization
> > > for now. E.g. kvm invalidates unconditionally.
> > > 3 should be fixed independently.
> > Above implements this but is only build-tested.
> > Jason, pls take a look. If you like the approach feel
> > free to take it from here.
> > 
> > One thing the below does not have is any kind of rate-limiting.
> > Given it's so easy to restart I'm thinking it makes sense
> > to add a generic infrastructure for this.
> > Can be a separate patch I guess.
> 
> 
> I don't get why must use kfree_rcu() instead of synchronize_rcu() here.

synchronize_rcu has very high latency on busy systems.
It is not something that should be used on a syscall path.
KVM had to switch to SRCU to keep it sane.
Otherwise one guest can trivially slow down another one.

> 
> > 
> > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> 
> 
> Let me try to figure out the root cause then decide whether or not to go for
> this way.
> 
> Thanks

The root cause of the crash is relevant, but we still need
to fix issues 1-4.

More issues (my patch tries to fix them too):

5. page not dirtied when mappings are torn down outside
   of invalidate callback

6. potential cross-VM DOS by one guest keeping system busy
   and increasing synchronize_rcu latency to the point where
   another guest stars timing out and crashes



-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22  7:52             ` Michael S. Tsirkin
@ 2019-07-22 11:51               ` Paul E. McKenney
  2019-07-22 13:41                 ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 11:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Matthew Wilcox, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 03:52:05AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> > > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > > Also, the overhead is important.  For example, as far as I know,
> > > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > > loop.  But there might be trouble due to tight userspace loops around
> > > > lighter-weight operations.
> > > 
> > > I thought you believed that RCU was antifragile, in that it would scale
> > > better as it was used more heavily?
> > 
> > You are referring to this?  https://paulmck.livejournal.com/47933.html
> > 
> > If so, the last few paragraphs might be worth re-reading.   ;-)
> > 
> > And in this case, the heuristics RCU uses to decide when to schedule
> > invocation of the callbacks needs some help.  One component of that help
> > is a time-based limit to the number of consecutive callback invocations
> > (see my crude prototype and Eric Dumazet's more polished patch).  Another
> > component is an overload warning.
> > 
> > Why would an overload warning be needed if RCU's callback-invocation
> > scheduling heurisitics were upgraded?  Because someone could boot a
> > 100-CPU system with the rcu_nocbs=0-99, bind all of the resulting
> > rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
> > on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.
> > 
> > So warnings are required as well.
> > 
> > > Would it make sense to have call_rcu() check to see if there are many
> > > outstanding requests on this CPU and if so process them before returning?
> > > That would ensure that frequent callers usually ended up doing their
> > > own processing.
> > 
> > Unfortunately, no.  Here is a code fragment illustrating why:
> > 
> > 	void my_cb(struct rcu_head *rhp)
> > 	{
> > 		unsigned long flags;
> > 
> > 		spin_lock_irqsave(&my_lock, flags);
> > 		handle_cb(rhp);
> > 		spin_unlock_irqrestore(&my_lock, flags);
> > 	}
> > 
> > 	. . .
> > 
> > 	spin_lock_irqsave(&my_lock, flags);
> > 	p = look_something_up();
> > 	remove_that_something(p);
> > 	call_rcu(p, my_cb);
> > 	spin_unlock_irqrestore(&my_lock, flags);
> > 
> > Invoking the extra callbacks directly from call_rcu() would thus result
> > in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
> > along these lines.
> 
> We could add an option that simply fails if overloaded, right?
> Have caller recover...

For example, return EBUSY from your ioctl?  That should work.  You could
also sleep for a jiffy or two to let things catch up in this BUSY (or
similar) case.  Or try three times, waiting a jiffy between each try,
and return EBUSY if all three tries failed.

Or just keep it simple and return EBUSY on the first try.  ;-)

All of this assumes that this ioctl is the cause of the overload, which
during early boot seems to me to be a safe assumption.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22  7:56             ` Michael S. Tsirkin
@ 2019-07-22 11:57               ` Paul E. McKenney
  0 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 11:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, akpm, christian, davem, ebiederm, elena.reshetova,
	guro, hch, james.bottomley, jasowang, jglisse, keescook, ldv,
	linux-arm-kernel, linux-kernel, linux-mm, linux-parisc, luto,
	mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 03:56:22AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> > > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > > > Hi Paul, others,
> > > > > 
> > > > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > > > is what happens if userspace starts cycling through lots of these
> > > > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > > > disable the optimization temporarily - but the question would be how to
> > > > > detect an excessive rate without working too hard :) .
> > > > > 
> > > > > I guess we could define as excessive any rate where callback is
> > > > > outstanding at the time when new structure is allocated.  I have very
> > > > > little understanding of rcu internals - so I wanted to check that the
> > > > > following more or less implements this heuristic before I spend time
> > > > > actually testing it.
> > > > > 
> > > > > Could others pls take a look and let me know?
> > > > 
> > > > These look good as a way of seeing if there are any outstanding callbacks,
> > > > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > > > return false on a busy system.
> > > 
> > > Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> > > and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
> > 
> > Or the function could simply return the number of callbacks queued
> > on the current CPU, and let the caller decide how many is too many.
> > 
> > > > Here are some alternatives:
> > > > 
> > > > o	RCU uses some pieces of Rao Shoaib kfree_rcu() patches.
> > > > 	The idea is to make kfree_rcu() locally buffer requests into
> > > > 	batches of (say) 1,000, but processing smaller batches when RCU
> > > > 	is idle, or when some smallish amout of time has passed with
> > > > 	no more kfree_rcu() request from that CPU.  RCU than takes in
> > > > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > > > 	The resulting batch of kfree() calls would therefore execute in
> > > > 	workqueue context rather than in softirq context, which should
> > > > 	be much easier on the system.
> > > > 
> > > > 	In theory, this would allow people to use kfree_rcu() without
> > > > 	worrying quite so much about overload.  It would also not be
> > > > 	that hard to implement.
> > > > 
> > > > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > > > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > > > 	of things waiting for a grace period, and when this gets too
> > > > 	large, disable the optimization.  It will then drain down, at
> > > > 	which point the optimization can be re-enabled.
> > > > 
> > > > 	But please note that callbacks are -not- guaranteed to run on
> > > > 	the CPU that queued them.  So yes, you would need a per-CPU
> > > > 	counter, but you would need to periodically sum it up to check
> > > > 	against the global state.  Or keep track of the CPU that
> > > > 	did the call_rcu() so that you can atomically decrement in
> > > > 	the callback the same counter that was atomically incremented
> > > > 	just before the call_rcu().  Or any number of other approaches.
> > > 
> > > I'm really looking for something we can do this merge window
> > > and without adding too much code, and kfree_rcu is intended to
> > > fix a bug.
> > > Adding call_rcu and careful accounting is something that I'm not
> > > happy adding with merge window already open.
> > 
> > OK, then I suggest having the interface return you the number of
> > callbacks.  That allows you to experiment with the cutoff.
> > 
> > Give or take the ioctl overhead...
> 
> OK - and for tiny just assume 1 is too much?

I bet that for tiny you won't need to rate-limit at all.  The reason
is that grace periods are quite short.

In fact, for TINY (that is, !SMP && !PREEMPT), synchronize_rcu() is a
no-op.  So in TINY, given that your ioctl is executing at process level,
you could just invoke synchronize_rcu() and then kfree():

#ifdef CONFIG_TINY_RCU
	synchronize_rcu();  /* No other CPUs, so a QS is a GP! */
	kfree(whatever);
	return; /* Or whatever control flow is appropriate. */
#endif
	/* More complicated stuff for !TINY. */

							Thanx, Paul

> > > > Also, the overhead is important.  For example, as far as I know,
> > > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > > loop.  But there might be trouble due to tight userspace loops around
> > > > lighter-weight operations.
> > > > 
> > > > So an important question is "Just how fast is your ioctl?"  If it takes
> > > > (say) 100 microseconds to execute, there should be absolutely no problem.
> > > > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > > > does need serious attention.
> > > > 
> > > > Other thoughts?
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > Hmm the answer to this would be I'm not sure.
> > > It's setup time stuff we never tested it.
> > 
> > Is it possible to measure it easily?
> > 
> > 							Thanx, Paul
> > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > 
> > > > > 
> > > > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > > > index 477b4eb44af5..067909521d72 100644
> > > > > --- a/kernel/rcu/tiny.c
> > > > > +++ b/kernel/rcu/tiny.c
> > > > > @@ -125,6 +125,25 @@ void synchronize_rcu(void)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > > > 
> > > > > +/*
> > > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > > + */
> > > > > +bool call_rcu_outstanding(void)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +	struct rcu_data *rdp;
> > > > > +	bool outstanding;
> > > > > +
> > > > > +	local_irq_save(flags);
> > > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > > > +	local_irq_restore(flags);
> > > > > +
> > > > > +	return outstanding;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > > +
> > > > >  /*
> > > > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > > > >   * period.  But since we have but one CPU, that would be after any
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index a14e5fbbea46..d4b9d61e637d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > > > >  {
> > > > >  }
> > > > > 
> > > > > +/*
> > > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > > + */
> > > > > +bool call_rcu_outstanding(void)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +	struct rcu_data *rdp;
> > > > > +	bool outstanding;
> > > > > +
> > > > > +	local_irq_save(flags);
> > > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > > +	outstanding = rcu_segcblist_empty(&rdp->cblist);
> > > > > +	local_irq_restore(flags);
> > > > > +
> > > > > +	return outstanding;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > > +
> > > > >  /*
> > > > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > > > >   * normally be -1, indicating "currently running CPU".  It may specify
> > > 
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 11:51               ` Paul E. McKenney
@ 2019-07-22 13:41                 ` Jason Gunthorpe
  2019-07-22 15:52                   ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-22 13:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael S. Tsirkin, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:

> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > > 
> > > Unfortunately, no.  Here is a code fragment illustrating why:

That is only true in the general case though, kfree_rcu() doesn't have
this problem since we know what the callback is doing. In general a
caller of kfree_rcu() should not need to hold any locks while calling
it.

We could apply the same idea more generally and have some
'call_immediate_or_rcu()' which has restrictions on the caller's
context.

I think if we have some kind of problem here it would be better to
handle it inside the core code and only require that callers use the
correct RCU API.

I can think of many places where kfree_rcu() is being used under user
control..

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-21 10:02   ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  2019-07-22  5:21     ` WARNING in __mmdrop Jason Wang
@ 2019-07-22 14:11     ` Jason Gunthorpe
  2019-07-25  6:02       ` Michael S. Tsirkin
  3 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-22 14:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > syzbot has bisected this bug to:
> > 
> > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Fri May 24 08:12:18 2019 +0000
> > 
> >     vhost: access vq metadata through kernel virtual address
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > start commit:   6d21a41b Add linux-next specific files for 20190718
> > git tree:       linux-next
> > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > 
> > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > address")
> > 
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> 
> 
> OK I poked at this for a bit, I see several things that
> we need to fix, though I'm not yet sure it's the reason for
> the failures:

This stuff looks quite similar to the hmm_mirror use model and other
places in the kernel. I'm still hoping we can share this code a bit more.

There is another bug, this sequence here:

vhost_vring_set_num_addr()
   mmu_notifier_unregister()
   [..]
   mmu_notifier_register()

Which I think is trying to create a lock to protect dev->vqs..

Has the problem that mmu_notifier_unregister() doesn't guarantee that
invalidate_start/end are fully paired.

So after any unregister the code has to clean up any resulting
unbalanced invalidate_count before it can call mmu_notifier_register
again. ie zero the invalidate_count.

It also seems really weird that vhost_map_prefetch() can fail, ie due
to __get_user_pages_fast needing to block, but that just silently
(permanently?) disables the optimization?? At least the usage here
would be better done with a seqcount lock and a normal blocking call
to get_user_pages_fast()...

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-21 23:31           ` Paul E. McKenney
  2019-07-22  7:52             ` Michael S. Tsirkin
@ 2019-07-22 15:14             ` Joel Fernandes
  2019-07-22 15:47               ` Michael S. Tsirkin
  1 sibling, 1 reply; 87+ messages in thread
From: Joel Fernandes @ 2019-07-22 15:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Michael S. Tsirkin, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

[snip]
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.

Other than what Paul already mentioned about deadlocks, I am not sure if this
would even work for all cases since call_rcu() has to wait for a grace
period.

So, if the number of outstanding requests are higher than a certain amount,
then you *still* have to wait for some RCU configurations for the grace
period duration and cannot just execute the callback in-line. Did I miss
something?

Can waiting in-line for a grace period duration be tolerated in the vhost case?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 15:14             ` Joel Fernandes
@ 2019-07-22 15:47               ` Michael S. Tsirkin
  2019-07-22 15:55                 ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22 15:47 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> [snip]
> > > Would it make sense to have call_rcu() check to see if there are many
> > > outstanding requests on this CPU and if so process them before returning?
> > > That would ensure that frequent callers usually ended up doing their
> > > own processing.
> 
> Other than what Paul already mentioned about deadlocks, I am not sure if this
> would even work for all cases since call_rcu() has to wait for a grace
> period.
> 
> So, if the number of outstanding requests are higher than a certain amount,
> then you *still* have to wait for some RCU configurations for the grace
> period duration and cannot just execute the callback in-line. Did I miss
> something?
> 
> Can waiting in-line for a grace period duration be tolerated in the vhost case?
> 
> thanks,
> 
>  - Joel

No, but it has many other ways to recover (try again later, drop a
packet, use a slower copy to/from user).

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 13:41                 ` Jason Gunthorpe
@ 2019-07-22 15:52                   ` Paul E. McKenney
  2019-07-22 16:04                     ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 15:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michael S. Tsirkin, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 10:41:52AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:
> 
> > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > That would ensure that frequent callers usually ended up doing their
> > > > > own processing.
> > > > 
> > > > Unfortunately, no.  Here is a code fragment illustrating why:
> 
> That is only true in the general case though, kfree_rcu() doesn't have
> this problem since we know what the callback is doing. In general a
> caller of kfree_rcu() should not need to hold any locks while calling
> it.

Good point, at least as long as the slab allocators don't call kfree_rcu()
while holding any of the slab locks.

However, that would require a separate list for the kfree_rcu() callbacks,
and concurrent access to those lists of kfree_rcu() callbacks.  So this
might work, but would add some complexity and also yet another restriction
between RCU and another kernel subsystem.  So I would like to try the
other approaches first, for example, the time-based approach in my
prototype and Eric Dumazet's more polished patch.

But the immediate-invocation possibility is still there if needed.

> We could apply the same idea more generally and have some
> 'call_immediate_or_rcu()' which has restrictions on the caller's
> context.
> 
> I think if we have some kind of problem here it would be better to
> handle it inside the core code and only require that callers use the
> correct RCU API.

Agreed.  Especially given that there are a number of things that can
be done within RCU.

> I can think of many places where kfree_rcu() is being used under user
> control..

And same for call_rcu().

And this is not the first time we have run into this.  The last time
was about 15 years ago, if I remember correctly, and that one led to
some of the quiescent-state forcing and callback-invocation batch size
tricks still in use today.  My only real surprise is that it took so
long for this to come up again.  ;-)

Please note also that in the common case on default configurations,
callback invocation is done on the CPU that posted the callback.
This means that callback invocation normally applies backpressure
to the callback-happy workload.

So why then is there a problem?

The problem is not the lack of backpressure, but rather that the
scheduling of callback invocation needs to be a bit more considerate
of the needs of the rest of the system.  In the common case, that is.
Except that the uncommon case is real-time configurations, in which care
is needed anyway.  But I am in the midst of helping those out as well,
details on the "dev" branch of -rcu.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 15:47               ` Michael S. Tsirkin
@ 2019-07-22 15:55                 ` Paul E. McKenney
  2019-07-22 16:13                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 15:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Joel Fernandes, Matthew Wilcox, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jasowang,
	jglisse, keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > [snip]
> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > 
> > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > would even work for all cases since call_rcu() has to wait for a grace
> > period.
> > 
> > So, if the number of outstanding requests are higher than a certain amount,
> > then you *still* have to wait for some RCU configurations for the grace
> > period duration and cannot just execute the callback in-line. Did I miss
> > something?
> > 
> > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > 
> > thanks,
> > 
> >  - Joel
> 
> No, but it has many other ways to recover (try again later, drop a
> packet, use a slower copy to/from user).

True enough!  And your idea of taking recovery action based on the number
of callbacks seems like a good one while we are getting RCU's callback
scheduling improved.

By the way, was this a real problem that you could make happen on real
hardware?  If not, I would suggest just letting RCU get improved over
the next couple of releases.

If it is something that you actually made happen, please let me know
what (if anything) you need from me for your callback-counting EBUSY
scheme.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 15:52                   ` Paul E. McKenney
@ 2019-07-22 16:04                     ` Jason Gunthorpe
  2019-07-22 16:15                       ` Michael S. Tsirkin
  2019-07-22 16:15                       ` Paul E. McKenney
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-22 16:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael S. Tsirkin, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> So why then is there a problem?

I'm not sure there is a real problem, I thought Michael was just
asking how to design with RCU in the case where the user controls the
kfree_rcu??

Sounds like the answer is "don't worry about it" ?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 15:55                 ` Paul E. McKenney
@ 2019-07-22 16:13                   ` Michael S. Tsirkin
  2019-07-22 16:25                     ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22 16:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Matthew Wilcox, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jasowang,
	jglisse, keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > [snip]
> > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > That would ensure that frequent callers usually ended up doing their
> > > > > own processing.
> > > 
> > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > would even work for all cases since call_rcu() has to wait for a grace
> > > period.
> > > 
> > > So, if the number of outstanding requests are higher than a certain amount,
> > > then you *still* have to wait for some RCU configurations for the grace
> > > period duration and cannot just execute the callback in-line. Did I miss
> > > something?
> > > 
> > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > 
> > > thanks,
> > > 
> > >  - Joel
> > 
> > No, but it has many other ways to recover (try again later, drop a
> > packet, use a slower copy to/from user).
> 
> True enough!  And your idea of taking recovery action based on the number
> of callbacks seems like a good one while we are getting RCU's callback
> scheduling improved.
> 
> By the way, was this a real problem that you could make happen on real
> hardware?


>  If not, I would suggest just letting RCU get improved over
> the next couple of releases.


So basically use kfree_rcu but add a comment saying e.g. "WARNING:
in the future callers of kfree_rcu might need to check that
not too many callbacks get queued. In that case, we can
disable the optimization, or recover in some other way.
Watch this space."


> If it is something that you actually made happen, please let me know
> what (if anything) you need from me for your callback-counting EBUSY
> scheme.
> 
> 							Thanx, Paul

If you mean kfree_rcu causing OOM then no, it's all theoretical.
If you mean synchronize_rcu stalling to the point where guest will OOPs,
then yes, that's not too hard to trigger.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 16:04                     ` Jason Gunthorpe
@ 2019-07-22 16:15                       ` Michael S. Tsirkin
  2019-07-22 16:15                       ` Paul E. McKenney
  1 sibling, 0 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22 16:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> > So why then is there a problem?
> 
> I'm not sure there is a real problem, I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??


Right it's all based on documentation saying we should worry :)

> Sounds like the answer is "don't worry about it" ?
> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 16:04                     ` Jason Gunthorpe
  2019-07-22 16:15                       ` Michael S. Tsirkin
@ 2019-07-22 16:15                       ` Paul E. McKenney
  1 sibling, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 16:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michael S. Tsirkin, Matthew Wilcox, aarcange, akpm, christian,
	davem, ebiederm, elena.reshetova, guro, hch, james.bottomley,
	jasowang, jglisse, keescook, ldv, linux-arm-kernel, linux-kernel,
	linux-mm, linux-parisc, luto, mhocko, mingo, namit, peterz,
	syzkaller-bugs, viro, wad

On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> > So why then is there a problem?
> 
> I'm not sure there is a real problem, I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??
> 
> Sounds like the answer is "don't worry about it" ?

Unless you can force failures, you should be good.

And either way, improvements to RCU's handling of this sort of situation
are in the works.  And rcutorture has gained tests of this stuff in the
last year or so as well, see its "fwd_progress" module parameter and
the related code.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 16:13                   ` Michael S. Tsirkin
@ 2019-07-22 16:25                     ` Paul E. McKenney
  2019-07-22 16:32                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Joel Fernandes, Matthew Wilcox, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jasowang,
	jglisse, keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > [snip]
> > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > own processing.
> > > > 
> > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > period.
> > > > 
> > > > So, if the number of outstanding requests are higher than a certain amount,
> > > > then you *still* have to wait for some RCU configurations for the grace
> > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > something?
> > > > 
> > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > 
> > > > thanks,
> > > > 
> > > >  - Joel
> > > 
> > > No, but it has many other ways to recover (try again later, drop a
> > > packet, use a slower copy to/from user).
> > 
> > True enough!  And your idea of taking recovery action based on the number
> > of callbacks seems like a good one while we are getting RCU's callback
> > scheduling improved.
> > 
> > By the way, was this a real problem that you could make happen on real
> > hardware?
> 
> >  If not, I would suggest just letting RCU get improved over
> > the next couple of releases.
> 
> So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> in the future callers of kfree_rcu might need to check that
> not too many callbacks get queued. In that case, we can
> disable the optimization, or recover in some other way.
> Watch this space."

That sounds fair.

> > If it is something that you actually made happen, please let me know
> > what (if anything) you need from me for your callback-counting EBUSY
> > scheme.
> > 
> > 							Thanx, Paul
> 
> If you mean kfree_rcu causing OOM then no, it's all theoretical.
> If you mean synchronize_rcu stalling to the point where guest will OOPs,
> then yes, that's not too hard to trigger.

Is synchronize_rcu() being stalled by the userspace loop that is invoking
your ioctl that does kfree_rcu()?  Or instead by the resulting callback
invocation?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 16:25                     ` Paul E. McKenney
@ 2019-07-22 16:32                       ` Michael S. Tsirkin
  2019-07-22 18:58                         ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-22 16:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Matthew Wilcox, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jasowang,
	jglisse, keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 09:25:51AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > > [snip]
> > > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > > own processing.
> > > > > 
> > > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > > period.
> > > > > 
> > > > > So, if the number of outstanding requests are higher than a certain amount,
> > > > > then you *still* have to wait for some RCU configurations for the grace
> > > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > > something?
> > > > > 
> > > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > > 
> > > > > thanks,
> > > > > 
> > > > >  - Joel
> > > > 
> > > > No, but it has many other ways to recover (try again later, drop a
> > > > packet, use a slower copy to/from user).
> > > 
> > > True enough!  And your idea of taking recovery action based on the number
> > > of callbacks seems like a good one while we are getting RCU's callback
> > > scheduling improved.
> > > 
> > > By the way, was this a real problem that you could make happen on real
> > > hardware?
> > 
> > >  If not, I would suggest just letting RCU get improved over
> > > the next couple of releases.
> > 
> > So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> > in the future callers of kfree_rcu might need to check that
> > not too many callbacks get queued. In that case, we can
> > disable the optimization, or recover in some other way.
> > Watch this space."
> 
> That sounds fair.
> 
> > > If it is something that you actually made happen, please let me know
> > > what (if anything) you need from me for your callback-counting EBUSY
> > > scheme.
> > > 
> > > 							Thanx, Paul
> > 
> > If you mean kfree_rcu causing OOM then no, it's all theoretical.
> > If you mean synchronize_rcu stalling to the point where guest will OOPs,
> > then yes, that's not too hard to trigger.
> 
> Is synchronize_rcu() being stalled by the userspace loop that is invoking
> your ioctl that does kfree_rcu()?  Or instead by the resulting callback
> invocation?
> 
> 							Thanx, Paul

Sorry, let me clarify.  We currently have synchronize_rcu in a userspace
loop. I have a patch replacing that with kfree_rcu.  This isn't the
first time synchronize_rcu is stalling a VM for a long while so I didn't
investigate further.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)
  2019-07-22 16:32                       ` Michael S. Tsirkin
@ 2019-07-22 18:58                         ` Paul E. McKenney
  0 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2019-07-22 18:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Joel Fernandes, Matthew Wilcox, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jasowang,
	jglisse, keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 12:32:17PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 09:25:51AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > > > [snip]
> > > > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > > > own processing.
> > > > > > 
> > > > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > > > period.
> > > > > > 
> > > > > > So, if the number of outstanding requests are higher than a certain amount,
> > > > > > then you *still* have to wait for some RCU configurations for the grace
> > > > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > > > something?
> > > > > > 
> > > > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > >  - Joel
> > > > > 
> > > > > No, but it has many other ways to recover (try again later, drop a
> > > > > packet, use a slower copy to/from user).
> > > > 
> > > > True enough!  And your idea of taking recovery action based on the number
> > > > of callbacks seems like a good one while we are getting RCU's callback
> > > > scheduling improved.
> > > > 
> > > > By the way, was this a real problem that you could make happen on real
> > > > hardware?
> > > 
> > > >  If not, I would suggest just letting RCU get improved over
> > > > the next couple of releases.
> > > 
> > > So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> > > in the future callers of kfree_rcu might need to check that
> > > not too many callbacks get queued. In that case, we can
> > > disable the optimization, or recover in some other way.
> > > Watch this space."
> > 
> > That sounds fair.
> > 
> > > > If it is something that you actually made happen, please let me know
> > > > what (if anything) you need from me for your callback-counting EBUSY
> > > > scheme.
> > > 
> > > If you mean kfree_rcu causing OOM then no, it's all theoretical.
> > > If you mean synchronize_rcu stalling to the point where guest will OOPs,
> > > then yes, that's not too hard to trigger.
> > 
> > Is synchronize_rcu() being stalled by the userspace loop that is invoking
> > your ioctl that does kfree_rcu()?  Or instead by the resulting callback
> > invocation?
> 
> Sorry, let me clarify.  We currently have synchronize_rcu in a userspace
> loop. I have a patch replacing that with kfree_rcu.  This isn't the
> first time synchronize_rcu is stalling a VM for a long while so I didn't
> investigate further.

Ah, so a bunch of synchronize_rcu() calls within a single system call
inside the host is stalling the guest, correct?

If so, one straightforward approach is to do an rcu_barrier() every
(say) 1000 kfree_rcu() calls within that loop in the system call.
This will decrease the overhead by almost a factor of 1000 compared to
a synchronize_rcu() on each trip through that loop, and will prevent
callback overload.

Or if the situation is different (for example, the guest does a long
sequence of system calls, each of which does a single kfree_rcu() or
some such), please let me know what the situation is.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-22  8:02       ` Michael S. Tsirkin
@ 2019-07-23  3:55         ` Jason Wang
  2019-07-23  5:02           ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23  3:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
>> On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>> syzbot has bisected this bug to:
>>>>
>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>> Author: Jason Wang <jasowang@redhat.com>
>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>
>>>>       vhost: access vq metadata through kernel virtual address
>>>>
>>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>> git tree:       linux-next
>>>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>
>>>> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>> address")
>>>>
>>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>> OK I poked at this for a bit, I see several things that
>>> we need to fix, though I'm not yet sure it's the reason for
>>> the failures:
>>>
>>>
>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>      That's just a bad hack,
>>
>> This is used to avoid holding lock when checking whether the addresses are
>> overlapped. Otherwise we need to take spinlock for each invalidation request
>> even if it was the va range that is not interested for us. This will be very
>> slow e.g during guest boot.
> KVM seems to do exactly that.
> I tried and guest does not seem to boot any slower.
> Do you observe any slowdown?


Yes I do.


>
> Now I took a hard look at the uaddr hackery it really makes
> me nervious. So I think for this release we want something
> safe, and optimizations on top. As an alternative revert the
> optimization and try again for next merge window.


Will post a series of fixes, let me know if you're ok with that.

Thanks


>
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-22  8:08         ` Michael S. Tsirkin
@ 2019-07-23  4:01           ` Jason Wang
  2019-07-23  5:01             ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23  4:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/22 下午4:08, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
>> On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
>>> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
>>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>>> syzbot has bisected this bug to:
>>>>>
>>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>>> Author: Jason Wang<jasowang@redhat.com>
>>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>>
>>>>>       vhost: access vq metadata through kernel virtual address
>>>>>
>>>>> bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>>> git tree:       linux-next
>>>>> final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>>> console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>>> kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>>> dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>>> syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>>
>>>>> Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>>> address")
>>>>>
>>>>> For information about bisection process see:https://goo.gl/tpsmEJ#bisection
>>>> OK I poked at this for a bit, I see several things that
>>>> we need to fix, though I'm not yet sure it's the reason for
>>>> the failures:
>>>>
>>>>
>>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>>      That's just a bad hack, in particular I don't think device
>>>>      mutex is taken and so poking at two VQs will corrupt
>>>>      memory.
>>>>      So what to do? How about a per vq notifier?
>>>>      Of course we also have synchronize_rcu
>>>>      in the notifier which is slow and is now going to be called twice.
>>>>      I think call_rcu would be more appropriate here.
>>>>      We then need rcu_barrier on module unload.
>>>>      OTOH if we make pages linear with map then we are good
>>>>      with kfree_rcu which is even nicer.
>>>>
>>>> 2. Doesn't map leak after vhost_map_unprefetch?
>>>>      And why does it poke at contents of the map?
>>>>      No one should use it right?
>>>>
>>>> 3. notifier unregister happens last in vhost_dev_cleanup,
>>>>      but register happens first. This looks wrong to me.
>>>>
>>>> 4. OK so we use the invalidate count to try and detect that
>>>>      some invalidate is in progress.
>>>>      I am not 100% sure why do we care.
>>>>      Assuming we do, uaddr can change between start and end
>>>>      and then the counter can get negative, or generally
>>>>      out of sync.
>>>>
>>>> So what to do about all this?
>>>> I am inclined to say let's just drop the uaddr optimization
>>>> for now. E.g. kvm invalidates unconditionally.
>>>> 3 should be fixed independently.
>>> Above implements this but is only build-tested.
>>> Jason, pls take a look. If you like the approach feel
>>> free to take it from here.
>>>
>>> One thing the below does not have is any kind of rate-limiting.
>>> Given it's so easy to restart I'm thinking it makes sense
>>> to add a generic infrastructure for this.
>>> Can be a separate patch I guess.
>>
>> I don't get why must use kfree_rcu() instead of synchronize_rcu() here.
> synchronize_rcu has very high latency on busy systems.
> It is not something that should be used on a syscall path.
> KVM had to switch to SRCU to keep it sane.
> Otherwise one guest can trivially slow down another one.


I think you mean the synchronize_rcu_expedited()? Rethink of the code, 
the synchronize_rcu() in ioctl() could be removed, since it was 
serialized with memory accessor.

Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(), 
(just a little bit more hard to trigger):


     case KVM_RUN: {
...
         if (unlikely(oldpid != task_pid(current))) {
             /* The thread running this VCPU changed. */
             struct pid *newpid;

             r = kvm_arch_vcpu_run_pid_change(vcpu);
             if (r)
                 break;

             newpid = get_task_pid(current, PIDTYPE_PID);
             rcu_assign_pointer(vcpu->pid, newpid);
             if (oldpid)
                 synchronize_rcu();
             put_pid(oldpid);
         }
...
         break;


>
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>
>> Let me try to figure out the root cause then decide whether or not to go for
>> this way.
>>
>> Thanks
> The root cause of the crash is relevant, but we still need
> to fix issues 1-4.
>
> More issues (my patch tries to fix them too):
>
> 5. page not dirtied when mappings are torn down outside
>     of invalidate callback


Yes.


>
> 6. potential cross-VM DOS by one guest keeping system busy
>     and increasing synchronize_rcu latency to the point where
>     another guest stars timing out and crashes
>
>
>

This will be addressed after I remove the synchronize_rcu() from ioctl path.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  4:01           ` Jason Wang
@ 2019-07-23  5:01             ` Michael S. Tsirkin
  2019-07-23  5:47               ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  5:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 12:01:40PM +0800, Jason Wang wrote:
> 
> On 2019/7/22 下午4:08, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
> > > On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
> > > > On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> > > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > > syzbot has bisected this bug to:
> > > > > > 
> > > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > > Author: Jason Wang<jasowang@redhat.com>
> > > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > > 
> > > > > >       vhost: access vq metadata through kernel virtual address
> > > > > > 
> > > > > > bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > > git tree:       linux-next
> > > > > > final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > > console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > > kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > > dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > > syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > > 
> > > > > > Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > > address")
> > > > > > 
> > > > > > For information about bisection process see:https://goo.gl/tpsmEJ#bisection
> > > > > OK I poked at this for a bit, I see several things that
> > > > > we need to fix, though I'm not yet sure it's the reason for
> > > > > the failures:
> > > > > 
> > > > > 
> > > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > > >      That's just a bad hack, in particular I don't think device
> > > > >      mutex is taken and so poking at two VQs will corrupt
> > > > >      memory.
> > > > >      So what to do? How about a per vq notifier?
> > > > >      Of course we also have synchronize_rcu
> > > > >      in the notifier which is slow and is now going to be called twice.
> > > > >      I think call_rcu would be more appropriate here.
> > > > >      We then need rcu_barrier on module unload.
> > > > >      OTOH if we make pages linear with map then we are good
> > > > >      with kfree_rcu which is even nicer.
> > > > > 
> > > > > 2. Doesn't map leak after vhost_map_unprefetch?
> > > > >      And why does it poke at contents of the map?
> > > > >      No one should use it right?
> > > > > 
> > > > > 3. notifier unregister happens last in vhost_dev_cleanup,
> > > > >      but register happens first. This looks wrong to me.
> > > > > 
> > > > > 4. OK so we use the invalidate count to try and detect that
> > > > >      some invalidate is in progress.
> > > > >      I am not 100% sure why do we care.
> > > > >      Assuming we do, uaddr can change between start and end
> > > > >      and then the counter can get negative, or generally
> > > > >      out of sync.
> > > > > 
> > > > > So what to do about all this?
> > > > > I am inclined to say let's just drop the uaddr optimization
> > > > > for now. E.g. kvm invalidates unconditionally.
> > > > > 3 should be fixed independently.
> > > > Above implements this but is only build-tested.
> > > > Jason, pls take a look. If you like the approach feel
> > > > free to take it from here.
> > > > 
> > > > One thing the below does not have is any kind of rate-limiting.
> > > > Given it's so easy to restart I'm thinking it makes sense
> > > > to add a generic infrastructure for this.
> > > > Can be a separate patch I guess.
> > > 
> > > I don't get why must use kfree_rcu() instead of synchronize_rcu() here.
> > synchronize_rcu has very high latency on busy systems.
> > It is not something that should be used on a syscall path.
> > KVM had to switch to SRCU to keep it sane.
> > Otherwise one guest can trivially slow down another one.
> 
> 
> I think you mean the synchronize_rcu_expedited()? Rethink of the code, the
> synchronize_rcu() in ioctl() could be removed, since it was serialized with
> memory accessor.


Really let's just use kfree_rcu. It's way cleaner: fire and forget.

> 
> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> (just a little bit more hard to trigger):


AFAIK these never run in response to guest events.
So they can take very long and guests still won't crash.


> 
>     case KVM_RUN: {
> ...
>         if (unlikely(oldpid != task_pid(current))) {
>             /* The thread running this VCPU changed. */
>             struct pid *newpid;
> 
>             r = kvm_arch_vcpu_run_pid_change(vcpu);
>             if (r)
>                 break;
> 
>             newpid = get_task_pid(current, PIDTYPE_PID);
>             rcu_assign_pointer(vcpu->pid, newpid);
>             if (oldpid)
>                 synchronize_rcu();
>             put_pid(oldpid);
>         }
> ...
>         break;
> 
> 
> > 
> > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > 
> > > Let me try to figure out the root cause then decide whether or not to go for
> > > this way.
> > > 
> > > Thanks
> > The root cause of the crash is relevant, but we still need
> > to fix issues 1-4.
> > 
> > More issues (my patch tries to fix them too):
> > 
> > 5. page not dirtied when mappings are torn down outside
> >     of invalidate callback
> 
> 
> Yes.
> 
> 
> > 
> > 6. potential cross-VM DOS by one guest keeping system busy
> >     and increasing synchronize_rcu latency to the point where
> >     another guest stars timing out and crashes
> > 
> > 
> > 
> 
> This will be addressed after I remove the synchronize_rcu() from ioctl path.
> 
> Thanks

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  3:55         ` Jason Wang
@ 2019-07-23  5:02           ` Michael S. Tsirkin
  2019-07-23  5:48             ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  5:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
> 
> On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
> > > On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > syzbot has bisected this bug to:
> > > > > 
> > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > 
> > > > >       vhost: access vq metadata through kernel virtual address
> > > > > 
> > > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > git tree:       linux-next
> > > > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > 
> > > > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > address")
> > > > > 
> > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > > > OK I poked at this for a bit, I see several things that
> > > > we need to fix, though I'm not yet sure it's the reason for
> > > > the failures:
> > > > 
> > > > 
> > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > >      That's just a bad hack,
> > > 
> > > This is used to avoid holding lock when checking whether the addresses are
> > > overlapped. Otherwise we need to take spinlock for each invalidation request
> > > even if it was the va range that is not interested for us. This will be very
> > > slow e.g during guest boot.
> > KVM seems to do exactly that.
> > I tried and guest does not seem to boot any slower.
> > Do you observe any slowdown?
> 
> 
> Yes I do.
> 
> 
> > 
> > Now I took a hard look at the uaddr hackery it really makes
> > me nervious. So I think for this release we want something
> > safe, and optimizations on top. As an alternative revert the
> > optimization and try again for next merge window.
> 
> 
> Will post a series of fixes, let me know if you're ok with that.
> 
> Thanks

I'd prefer you to take a hard look at the patch I posted
which makes code cleaner, and ad optimizations on top.
But other ways could be ok too.

> 
> > 
> > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  5:01             ` Michael S. Tsirkin
@ 2019-07-23  5:47               ` Jason Wang
  2019-07-23  7:23                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23  5:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午1:01, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 12:01:40PM +0800, Jason Wang wrote:
>> On 2019/7/22 下午4:08, Michael S. Tsirkin wrote:
>>> On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
>>>> On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
>>>>> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
>>>>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>>>>> syzbot has bisected this bug to:
>>>>>>>
>>>>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>>>>> Author: Jason Wang<jasowang@redhat.com>
>>>>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>>>>
>>>>>>>        vhost: access vq metadata through kernel virtual address
>>>>>>>
>>>>>>> bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>>>>> git tree:       linux-next
>>>>>>> final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>>>>> console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>>>>> kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>>>>> dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>>>>> syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>>>>
>>>>>>> Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>>>>> address")
>>>>>>>
>>>>>>> For information about bisection process see:https://goo.gl/tpsmEJ#bisection
>>>>>> OK I poked at this for a bit, I see several things that
>>>>>> we need to fix, though I'm not yet sure it's the reason for
>>>>>> the failures:
>>>>>>
>>>>>>
>>>>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>>>>       That's just a bad hack, in particular I don't think device
>>>>>>       mutex is taken and so poking at two VQs will corrupt
>>>>>>       memory.
>>>>>>       So what to do? How about a per vq notifier?
>>>>>>       Of course we also have synchronize_rcu
>>>>>>       in the notifier which is slow and is now going to be called twice.
>>>>>>       I think call_rcu would be more appropriate here.
>>>>>>       We then need rcu_barrier on module unload.
>>>>>>       OTOH if we make pages linear with map then we are good
>>>>>>       with kfree_rcu which is even nicer.
>>>>>>
>>>>>> 2. Doesn't map leak after vhost_map_unprefetch?
>>>>>>       And why does it poke at contents of the map?
>>>>>>       No one should use it right?
>>>>>>
>>>>>> 3. notifier unregister happens last in vhost_dev_cleanup,
>>>>>>       but register happens first. This looks wrong to me.
>>>>>>
>>>>>> 4. OK so we use the invalidate count to try and detect that
>>>>>>       some invalidate is in progress.
>>>>>>       I am not 100% sure why do we care.
>>>>>>       Assuming we do, uaddr can change between start and end
>>>>>>       and then the counter can get negative, or generally
>>>>>>       out of sync.
>>>>>>
>>>>>> So what to do about all this?
>>>>>> I am inclined to say let's just drop the uaddr optimization
>>>>>> for now. E.g. kvm invalidates unconditionally.
>>>>>> 3 should be fixed independently.
>>>>> Above implements this but is only build-tested.
>>>>> Jason, pls take a look. If you like the approach feel
>>>>> free to take it from here.
>>>>>
>>>>> One thing the below does not have is any kind of rate-limiting.
>>>>> Given it's so easy to restart I'm thinking it makes sense
>>>>> to add a generic infrastructure for this.
>>>>> Can be a separate patch I guess.
>>>> I don't get why must use kfree_rcu() instead of synchronize_rcu() here.
>>> synchronize_rcu has very high latency on busy systems.
>>> It is not something that should be used on a syscall path.
>>> KVM had to switch to SRCU to keep it sane.
>>> Otherwise one guest can trivially slow down another one.
>>
>> I think you mean the synchronize_rcu_expedited()? Rethink of the code, the
>> synchronize_rcu() in ioctl() could be removed, since it was serialized with
>> memory accessor.
>
> Really let's just use kfree_rcu. It's way cleaner: fire and forget.


Looks not, you need rate limit the fire as you've figured out? And in 
fact, the synchronization is not even needed, does it help if I leave a 
comment to explain?


>
>> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
>> (just a little bit more hard to trigger):
>
> AFAIK these never run in response to guest events.
> So they can take very long and guests still won't crash.


What if guest manages to escape to qemu?

Thanks


>
>
>>      case KVM_RUN: {
>> ...
>>          if (unlikely(oldpid != task_pid(current))) {
>>              /* The thread running this VCPU changed. */
>>              struct pid *newpid;
>>
>>              r = kvm_arch_vcpu_run_pid_change(vcpu);
>>              if (r)
>>                  break;
>>
>>              newpid = get_task_pid(current, PIDTYPE_PID);
>>              rcu_assign_pointer(vcpu->pid, newpid);
>>              if (oldpid)
>>                  synchronize_rcu();
>>              put_pid(oldpid);
>>          }
>> ...
>>          break;
>>
>>
>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>> Let me try to figure out the root cause then decide whether or not to go for
>>>> this way.
>>>>
>>>> Thanks
>>> The root cause of the crash is relevant, but we still need
>>> to fix issues 1-4.
>>>
>>> More issues (my patch tries to fix them too):
>>>
>>> 5. page not dirtied when mappings are torn down outside
>>>      of invalidate callback
>>
>> Yes.
>>
>>
>>> 6. potential cross-VM DOS by one guest keeping system busy
>>>      and increasing synchronize_rcu latency to the point where
>>>      another guest stars timing out and crashes
>>>
>>>
>>>
>> This will be addressed after I remove the synchronize_rcu() from ioctl path.
>>
>> Thanks

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  5:02           ` Michael S. Tsirkin
@ 2019-07-23  5:48             ` Jason Wang
  2019-07-23  7:25               ` Michael S. Tsirkin
  2019-07-23  7:56               ` Michael S. Tsirkin
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-23  5:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
>> On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
>>> On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
>>>> On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
>>>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>>>> syzbot has bisected this bug to:
>>>>>>
>>>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>>>> Author: Jason Wang <jasowang@redhat.com>
>>>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>>>
>>>>>>        vhost: access vq metadata through kernel virtual address
>>>>>>
>>>>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>>>> git tree:       linux-next
>>>>>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>>>
>>>>>> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>>>> address")
>>>>>>
>>>>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>>>> OK I poked at this for a bit, I see several things that
>>>>> we need to fix, though I'm not yet sure it's the reason for
>>>>> the failures:
>>>>>
>>>>>
>>>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>>>       That's just a bad hack,
>>>> This is used to avoid holding lock when checking whether the addresses are
>>>> overlapped. Otherwise we need to take spinlock for each invalidation request
>>>> even if it was the va range that is not interested for us. This will be very
>>>> slow e.g during guest boot.
>>> KVM seems to do exactly that.
>>> I tried and guest does not seem to boot any slower.
>>> Do you observe any slowdown?
>>
>> Yes I do.
>>
>>
>>> Now I took a hard look at the uaddr hackery it really makes
>>> me nervious. So I think for this release we want something
>>> safe, and optimizations on top. As an alternative revert the
>>> optimization and try again for next merge window.
>>
>> Will post a series of fixes, let me know if you're ok with that.
>>
>> Thanks
> I'd prefer you to take a hard look at the patch I posted
> which makes code cleaner,


I did. But it looks to me a series that is only about 60 lines of code 
can fix all the issues we found without reverting the uaddr optimization.


>   and ad optimizations on top.
> But other ways could be ok too.


I'm waiting for the test result from syzbot and will post. Let's see if 
you are OK with that.

Thanks


>>>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  5:47               ` Jason Wang
@ 2019-07-23  7:23                 ` Michael S. Tsirkin
  2019-07-23  7:53                   ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  7:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 01:47:04PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午1:01, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 12:01:40PM +0800, Jason Wang wrote:
> > > On 2019/7/22 下午4:08, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 22, 2019 at 01:24:24PM +0800, Jason Wang wrote:
> > > > > On 2019/7/21 下午8:18, Michael S. Tsirkin wrote:
> > > > > > On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > > > > syzbot has bisected this bug to:
> > > > > > > > 
> > > > > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > > > > Author: Jason Wang<jasowang@redhat.com>
> > > > > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > > > > 
> > > > > > > >        vhost: access vq metadata through kernel virtual address
> > > > > > > > 
> > > > > > > > bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > > > > git tree:       linux-next
> > > > > > > > final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > > > > console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > > > > kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > > > > dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > > > > syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > > > > 
> > > > > > > > Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > > > > address")
> > > > > > > > 
> > > > > > > > For information about bisection process see:https://goo.gl/tpsmEJ#bisection
> > > > > > > OK I poked at this for a bit, I see several things that
> > > > > > > we need to fix, though I'm not yet sure it's the reason for
> > > > > > > the failures:
> > > > > > > 
> > > > > > > 
> > > > > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > > > > >       That's just a bad hack, in particular I don't think device
> > > > > > >       mutex is taken and so poking at two VQs will corrupt
> > > > > > >       memory.
> > > > > > >       So what to do? How about a per vq notifier?
> > > > > > >       Of course we also have synchronize_rcu
> > > > > > >       in the notifier which is slow and is now going to be called twice.
> > > > > > >       I think call_rcu would be more appropriate here.
> > > > > > >       We then need rcu_barrier on module unload.
> > > > > > >       OTOH if we make pages linear with map then we are good
> > > > > > >       with kfree_rcu which is even nicer.
> > > > > > > 
> > > > > > > 2. Doesn't map leak after vhost_map_unprefetch?
> > > > > > >       And why does it poke at contents of the map?
> > > > > > >       No one should use it right?
> > > > > > > 
> > > > > > > 3. notifier unregister happens last in vhost_dev_cleanup,
> > > > > > >       but register happens first. This looks wrong to me.
> > > > > > > 
> > > > > > > 4. OK so we use the invalidate count to try and detect that
> > > > > > >       some invalidate is in progress.
> > > > > > >       I am not 100% sure why do we care.
> > > > > > >       Assuming we do, uaddr can change between start and end
> > > > > > >       and then the counter can get negative, or generally
> > > > > > >       out of sync.
> > > > > > > 
> > > > > > > So what to do about all this?
> > > > > > > I am inclined to say let's just drop the uaddr optimization
> > > > > > > for now. E.g. kvm invalidates unconditionally.
> > > > > > > 3 should be fixed independently.
> > > > > > Above implements this but is only build-tested.
> > > > > > Jason, pls take a look. If you like the approach feel
> > > > > > free to take it from here.
> > > > > > 
> > > > > > One thing the below does not have is any kind of rate-limiting.
> > > > > > Given it's so easy to restart I'm thinking it makes sense
> > > > > > to add a generic infrastructure for this.
> > > > > > Can be a separate patch I guess.
> > > > > I don't get why must use kfree_rcu() instead of synchronize_rcu() here.
> > > > synchronize_rcu has very high latency on busy systems.
> > > > It is not something that should be used on a syscall path.
> > > > KVM had to switch to SRCU to keep it sane.
> > > > Otherwise one guest can trivially slow down another one.
> > > 
> > > I think you mean the synchronize_rcu_expedited()? Rethink of the code, the
> > > synchronize_rcu() in ioctl() could be removed, since it was serialized with
> > > memory accessor.
> > 
> > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> 
> 
> Looks not, you need rate limit the fire as you've figured out?

See the discussion that followed. Basically no, it's good enough
already and is only going to be better.

> And in fact,
> the synchronization is not even needed, does it help if I leave a comment to
> explain?

Let's try to figure it out in the mail first. I'm pretty sure the
current logic is wrong.

> 
> > 
> > > Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> > > (just a little bit more hard to trigger):
> > 
> > AFAIK these never run in response to guest events.
> > So they can take very long and guests still won't crash.
> 
> 
> What if guest manages to escape to qemu?
> 
> Thanks

Then it's going to be slow. Why do we care?
What we do not want is synchronize_rcu that guest is blocked on.

> 
> > 
> > 
> > >      case KVM_RUN: {
> > > ...
> > >          if (unlikely(oldpid != task_pid(current))) {
> > >              /* The thread running this VCPU changed. */
> > >              struct pid *newpid;
> > > 
> > >              r = kvm_arch_vcpu_run_pid_change(vcpu);
> > >              if (r)
> > >                  break;
> > > 
> > >              newpid = get_task_pid(current, PIDTYPE_PID);
> > >              rcu_assign_pointer(vcpu->pid, newpid);
> > >              if (oldpid)
> > >                  synchronize_rcu();
> > >              put_pid(oldpid);
> > >          }
> > > ...
> > >          break;
> > > 
> > > 
> > > > > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > > > Let me try to figure out the root cause then decide whether or not to go for
> > > > > this way.
> > > > > 
> > > > > Thanks
> > > > The root cause of the crash is relevant, but we still need
> > > > to fix issues 1-4.
> > > > 
> > > > More issues (my patch tries to fix them too):
> > > > 
> > > > 5. page not dirtied when mappings are torn down outside
> > > >      of invalidate callback
> > > 
> > > Yes.
> > > 
> > > 
> > > > 6. potential cross-VM DOS by one guest keeping system busy
> > > >      and increasing synchronize_rcu latency to the point where
> > > >      another guest stars timing out and crashes
> > > > 
> > > > 
> > > > 
> > > This will be addressed after I remove the synchronize_rcu() from ioctl path.
> > > 
> > > Thanks

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  5:48             ` Jason Wang
@ 2019-07-23  7:25               ` Michael S. Tsirkin
  2019-07-23  7:55                 ` Jason Wang
  2019-07-23  7:56               ` Michael S. Tsirkin
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  7:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 01:48:52PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
> > > On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
> > > > > On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> > > > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > > > syzbot has bisected this bug to:
> > > > > > > 
> > > > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > > > 
> > > > > > >        vhost: access vq metadata through kernel virtual address
> > > > > > > 
> > > > > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > > > git tree:       linux-next
> > > > > > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > > > 
> > > > > > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > > > address")
> > > > > > > 
> > > > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > > > > > OK I poked at this for a bit, I see several things that
> > > > > > we need to fix, though I'm not yet sure it's the reason for
> > > > > > the failures:
> > > > > > 
> > > > > > 
> > > > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > > > >       That's just a bad hack,
> > > > > This is used to avoid holding lock when checking whether the addresses are
> > > > > overlapped. Otherwise we need to take spinlock for each invalidation request
> > > > > even if it was the va range that is not interested for us. This will be very
> > > > > slow e.g during guest boot.
> > > > KVM seems to do exactly that.
> > > > I tried and guest does not seem to boot any slower.
> > > > Do you observe any slowdown?
> > > 
> > > Yes I do.
> > > 
> > > 
> > > > Now I took a hard look at the uaddr hackery it really makes
> > > > me nervious. So I think for this release we want something
> > > > safe, and optimizations on top. As an alternative revert the
> > > > optimization and try again for next merge window.
> > > 
> > > Will post a series of fixes, let me know if you're ok with that.
> > > 
> > > Thanks
> > I'd prefer you to take a hard look at the patch I posted
> > which makes code cleaner,
> 
> 
> I did. But it looks to me a series that is only about 60 lines of code can
> fix all the issues we found without reverting the uaddr optimization.
> 
> 
> >   and ad optimizations on top.
> > But other ways could be ok too.
> 
> 
> I'm waiting for the test result from syzbot and will post. Let's see if you
> are OK with that.
> 
> Thanks

Oh I didn't know one can push a test to syzbot and get back
a result. How does one do that?


> 
> > > > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  7:23                 ` Michael S. Tsirkin
@ 2019-07-23  7:53                   ` Jason Wang
  2019-07-23  8:10                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23  7:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>> Looks not, you need rate limit the fire as you've figured out?
> See the discussion that followed. Basically no, it's good enough
> already and is only going to be better.
>
>> And in fact,
>> the synchronization is not even needed, does it help if I leave a comment to
>> explain?
> Let's try to figure it out in the mail first. I'm pretty sure the
> current logic is wrong.


Here is what the code what to achieve:

- The map was protected by RCU

- Writers are: MMU notifier invalidation callbacks, file operations 
(ioctls etc), meta_prefetch (datapath)

- Readers are: memory accessor

Writer are synchronized through mmu_lock. RCU is used to synchronized 
between writers and readers.

The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized 
it with readers (memory accessors) in the path of file operations. But 
in this case, vq->mutex was already held, this means it has been 
serialized with memory accessor. That's why I think it could be removed 
safely.

Anything I miss here?


>
>>>> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
>>>> (just a little bit more hard to trigger):
>>> AFAIK these never run in response to guest events.
>>> So they can take very long and guests still won't crash.
>> What if guest manages to escape to qemu?
>>
>> Thanks
> Then it's going to be slow. Why do we care?
> What we do not want is synchronize_rcu that guest is blocked on.
>

Ok, this looks like that I have some misunderstanding here of the reason 
why synchronize_rcu() is not preferable in the path of ioctl. But in kvm 
case, if rcu_expedited is set, it can triggers IPIs AFAIK.

Thanks



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  7:25               ` Michael S. Tsirkin
@ 2019-07-23  7:55                 ` Jason Wang
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-23  7:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午3:25, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 01:48:52PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
>>>> On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
>>>>> On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>>>>>> syzbot has bisected this bug to:
>>>>>>>>
>>>>>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>>>>>> Author: Jason Wang<jasowang@redhat.com>
>>>>>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>>>>>
>>>>>>>>         vhost: access vq metadata through kernel virtual address
>>>>>>>>
>>>>>>>> bisection log:https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>>>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>>>>>> git tree:       linux-next
>>>>>>>> final crash:https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>>>>>> console output:https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>>>>>> kernel config:https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>>>>>> dashboard link:https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>>>>>> syz repro:https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>>>>>
>>>>>>>> Reported-by:syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>>>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>>>>>> address")
>>>>>>>>
>>>>>>>> For information about bisection process see:https://goo.gl/tpsmEJ#bisection
>>>>>>> OK I poked at this for a bit, I see several things that
>>>>>>> we need to fix, though I'm not yet sure it's the reason for
>>>>>>> the failures:
>>>>>>>
>>>>>>>
>>>>>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>>>>>        That's just a bad hack,
>>>>>> This is used to avoid holding lock when checking whether the addresses are
>>>>>> overlapped. Otherwise we need to take spinlock for each invalidation request
>>>>>> even if it was the va range that is not interested for us. This will be very
>>>>>> slow e.g during guest boot.
>>>>> KVM seems to do exactly that.
>>>>> I tried and guest does not seem to boot any slower.
>>>>> Do you observe any slowdown?
>>>> Yes I do.
>>>>
>>>>
>>>>> Now I took a hard look at the uaddr hackery it really makes
>>>>> me nervious. So I think for this release we want something
>>>>> safe, and optimizations on top. As an alternative revert the
>>>>> optimization and try again for next merge window.
>>>> Will post a series of fixes, let me know if you're ok with that.
>>>>
>>>> Thanks
>>> I'd prefer you to take a hard look at the patch I posted
>>> which makes code cleaner,
>> I did. But it looks to me a series that is only about 60 lines of code can
>> fix all the issues we found without reverting the uaddr optimization.
>>
>>
>>>    and ad optimizations on top.
>>> But other ways could be ok too.
>> I'm waiting for the test result from syzbot and will post. Let's see if you
>> are OK with that.
>>
>> Thanks
> Oh I didn't know one can push a test to syzbot and get back
> a result. How does one do that?


See here https://github.com/google/syzkaller/blob/master/docs/syzbot.md

Just reply this thread by attaching a fix with command like: "#syz test: 
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
7f466032dc9e5a61217f22ea34b2df932786bbfc"

Btw, I've let syzbot test you patch, and it passes.

Thanks


>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  5:48             ` Jason Wang
  2019-07-23  7:25               ` Michael S. Tsirkin
@ 2019-07-23  7:56               ` Michael S. Tsirkin
  2019-07-23  8:42                 ` Jason Wang
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  7:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 01:48:52PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
> > > On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
> > > > > On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> > > > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > > > syzbot has bisected this bug to:
> > > > > > > 
> > > > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > > > 
> > > > > > >        vhost: access vq metadata through kernel virtual address
> > > > > > > 
> > > > > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > > > git tree:       linux-next
> > > > > > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > > > 
> > > > > > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > > > address")
> > > > > > > 
> > > > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > > > > > OK I poked at this for a bit, I see several things that
> > > > > > we need to fix, though I'm not yet sure it's the reason for
> > > > > > the failures:
> > > > > > 
> > > > > > 
> > > > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > > > >       That's just a bad hack,
> > > > > This is used to avoid holding lock when checking whether the addresses are
> > > > > overlapped. Otherwise we need to take spinlock for each invalidation request
> > > > > even if it was the va range that is not interested for us. This will be very
> > > > > slow e.g during guest boot.
> > > > KVM seems to do exactly that.
> > > > I tried and guest does not seem to boot any slower.
> > > > Do you observe any slowdown?
> > > 
> > > Yes I do.
> > > 
> > > 
> > > > Now I took a hard look at the uaddr hackery it really makes
> > > > me nervious. So I think for this release we want something
> > > > safe, and optimizations on top. As an alternative revert the
> > > > optimization and try again for next merge window.
> > > 
> > > Will post a series of fixes, let me know if you're ok with that.
> > > 
> > > Thanks
> > I'd prefer you to take a hard look at the patch I posted
> > which makes code cleaner,
> 
> 
> I did. But it looks to me a series that is only about 60 lines of code can
> fix all the issues we found without reverting the uaddr optimization.

Another thing I like about the patch I posted is that
it removes 60 lines of code, instead of adding more :)
Mostly because of unifying everything into
a single cleanup function and using kfree_rcu.

So how about this: do exactly what you propose but as a 2 patch series:
start with the slow safe patch, and add then return uaddr optimizations
on top. We can then more easily reason about whether they are safe.

Basically you are saying this:
	- notifiers are only needed to invalidate maps
	- we make sure any uaddr change invalidates maps anyway
	- thus it's ok not to have notifiers since we do
	  not have maps

All this looks ok but the question is why do we
bother unregistering them. And the answer seems to
be that this is so we can start with a balanced
counter: otherwise we can be between _start and
_end calls.

I also wonder about ordering. kvm has this:
       /*
         * Used to check for invalidations in progress, of the pfn that is
         * returned by pfn_to_pfn_prot below.
         */
        mmu_seq = kvm->mmu_notifier_seq;
        /*
         * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
         * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
         * risk the page we get a reference to getting unmapped before we have a
         * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
         *
         * This smp_rmb() pairs with the effective smp_wmb() of the combination
         * of the pte_unmap_unlock() after the PTE is zapped, and the
         * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
         * mmu_notifier_seq is incremented.
         */
        smp_rmb();

does this apply to us? Can't we use a seqlock instead so we do
not need to worry?

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  7:53                   ` Jason Wang
@ 2019-07-23  8:10                     ` Michael S. Tsirkin
  2019-07-23  8:49                       ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  8:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
> > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > Looks not, you need rate limit the fire as you've figured out?
> > See the discussion that followed. Basically no, it's good enough
> > already and is only going to be better.
> > 
> > > And in fact,
> > > the synchronization is not even needed, does it help if I leave a comment to
> > > explain?
> > Let's try to figure it out in the mail first. I'm pretty sure the
> > current logic is wrong.
> 
> 
> Here is what the code what to achieve:
> 
> - The map was protected by RCU
> 
> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> etc), meta_prefetch (datapath)
> 
> - Readers are: memory accessor
> 
> Writer are synchronized through mmu_lock. RCU is used to synchronized
> between writers and readers.
> 
> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
> with readers (memory accessors) in the path of file operations. But in this
> case, vq->mutex was already held, this means it has been serialized with
> memory accessor. That's why I think it could be removed safely.
> 
> Anything I miss here?
> 

So invalidate callbacks need to reset the map, and they do
not have vq mutex. How can they do this and free
the map safely? They need synchronize_rcu or kfree_rcu right?

And I worry somewhat that synchronize_rcu in an MMU notifier
is a problem, MMU notifiers are supposed to be quick:
they are on a read side critical section of SRCU.

If we could get rid of RCU that would be even better.

But now I wonder:
	invalidate_start has to mark page as dirty
	(this is what my patch added, current code misses this).

	at that point kernel can come and make the page clean again.

	At that point VQ handlers can keep a copy of the map
	and change the page again.


At this point I don't understand how we can mark page dirty
safely.

> > 
> > > > > Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> > > > > (just a little bit more hard to trigger):
> > > > AFAIK these never run in response to guest events.
> > > > So they can take very long and guests still won't crash.
> > > What if guest manages to escape to qemu?
> > > 
> > > Thanks
> > Then it's going to be slow. Why do we care?
> > What we do not want is synchronize_rcu that guest is blocked on.
> > 
> 
> Ok, this looks like that I have some misunderstanding here of the reason why
> synchronize_rcu() is not preferable in the path of ioctl. But in kvm case,
> if rcu_expedited is set, it can triggers IPIs AFAIK.
> 
> Thanks
>

Yes, expedited is not good for something guest can trigger.
Let's just use kfree_rcu if we can. Paul said even though
documentation still says it needs to be rate-limited, that
part is basically stale and will get updated.

-- 
MST 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  7:56               ` Michael S. Tsirkin
@ 2019-07-23  8:42                 ` Jason Wang
  2019-07-23 10:27                   ` Michael S. Tsirkin
  2019-07-23 10:42                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-23  8:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午3:56, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 01:48:52PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
>>>> On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
>>>>> On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>>>>>> syzbot has bisected this bug to:
>>>>>>>>
>>>>>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>>>>>> Author: Jason Wang <jasowang@redhat.com>
>>>>>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>>>>>
>>>>>>>>         vhost: access vq metadata through kernel virtual address
>>>>>>>>
>>>>>>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>>>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>>>>>> git tree:       linux-next
>>>>>>>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>>>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>>>>>
>>>>>>>> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>>>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>>>>>> address")
>>>>>>>>
>>>>>>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>>>>>> OK I poked at this for a bit, I see several things that
>>>>>>> we need to fix, though I'm not yet sure it's the reason for
>>>>>>> the failures:
>>>>>>>
>>>>>>>
>>>>>>> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
>>>>>>>        That's just a bad hack,
>>>>>> This is used to avoid holding lock when checking whether the addresses are
>>>>>> overlapped. Otherwise we need to take spinlock for each invalidation request
>>>>>> even if it was the va range that is not interested for us. This will be very
>>>>>> slow e.g during guest boot.
>>>>> KVM seems to do exactly that.
>>>>> I tried and guest does not seem to boot any slower.
>>>>> Do you observe any slowdown?
>>>> Yes I do.
>>>>
>>>>
>>>>> Now I took a hard look at the uaddr hackery it really makes
>>>>> me nervious. So I think for this release we want something
>>>>> safe, and optimizations on top. As an alternative revert the
>>>>> optimization and try again for next merge window.
>>>> Will post a series of fixes, let me know if you're ok with that.
>>>>
>>>> Thanks
>>> I'd prefer you to take a hard look at the patch I posted
>>> which makes code cleaner,
>>
>> I did. But it looks to me a series that is only about 60 lines of code can
>> fix all the issues we found without reverting the uaddr optimization.
> Another thing I like about the patch I posted is that
> it removes 60 lines of code, instead of adding more :)
> Mostly because of unifying everything into
> a single cleanup function and using kfree_rcu.


Yes.


>
> So how about this: do exactly what you propose but as a 2 patch series:
> start with the slow safe patch, and add then return uaddr optimizations
> on top. We can then more easily reason about whether they are safe.


If you stick, I can do this.


> Basically you are saying this:
> 	- notifiers are only needed to invalidate maps
> 	- we make sure any uaddr change invalidates maps anyway
> 	- thus it's ok not to have notifiers since we do
> 	  not have maps
>
> All this looks ok but the question is why do we
> bother unregistering them. And the answer seems to
> be that this is so we can start with a balanced
> counter: otherwise we can be between _start and
> _end calls.


Yes, since there could be multiple co-current invalidation requests. We 
need count them to make sure we don't pin wrong pages.


>
> I also wonder about ordering. kvm has this:
>         /*
>           * Used to check for invalidations in progress, of the pfn that is
>           * returned by pfn_to_pfn_prot below.
>           */
>          mmu_seq = kvm->mmu_notifier_seq;
>          /*
>           * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
>           * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>           * risk the page we get a reference to getting unmapped before we have a
>           * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
>           *
>           * This smp_rmb() pairs with the effective smp_wmb() of the combination
>           * of the pte_unmap_unlock() after the PTE is zapped, and the
>           * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
>           * mmu_notifier_seq is incremented.
>           */
>          smp_rmb();
>
> does this apply to us? Can't we use a seqlock instead so we do
> not need to worry?


I'm not familiar with kvm MMU internals, but we do everything under of 
mmu_lock.

Thanks



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  8:10                     ` Michael S. Tsirkin
@ 2019-07-23  8:49                       ` Jason Wang
  2019-07-23  9:26                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23  8:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
>>>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>>>> Looks not, you need rate limit the fire as you've figured out?
>>> See the discussion that followed. Basically no, it's good enough
>>> already and is only going to be better.
>>>
>>>> And in fact,
>>>> the synchronization is not even needed, does it help if I leave a comment to
>>>> explain?
>>> Let's try to figure it out in the mail first. I'm pretty sure the
>>> current logic is wrong.
>>
>> Here is what the code what to achieve:
>>
>> - The map was protected by RCU
>>
>> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
>> etc), meta_prefetch (datapath)
>>
>> - Readers are: memory accessor
>>
>> Writer are synchronized through mmu_lock. RCU is used to synchronized
>> between writers and readers.
>>
>> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
>> with readers (memory accessors) in the path of file operations. But in this
>> case, vq->mutex was already held, this means it has been serialized with
>> memory accessor. That's why I think it could be removed safely.
>>
>> Anything I miss here?
>>
> So invalidate callbacks need to reset the map, and they do
> not have vq mutex. How can they do this and free
> the map safely? They need synchronize_rcu or kfree_rcu right?


Invalidation callbacks need but file operations (e.g ioctl) not.


>
> And I worry somewhat that synchronize_rcu in an MMU notifier
> is a problem, MMU notifiers are supposed to be quick:


Looks not, since it can allow to be blocked and lots of driver depends 
on this. (E.g mmu_notifier_range_blockable()).


> they are on a read side critical section of SRCU.
>
> If we could get rid of RCU that would be even better.
>
> But now I wonder:
> 	invalidate_start has to mark page as dirty
> 	(this is what my patch added, current code misses this).


Nope, current code did this but not the case when map need to be 
invalidated in the vhost control path (ioctl etc).


>
> 	at that point kernel can come and make the page clean again.
>
> 	At that point VQ handlers can keep a copy of the map
> 	and change the page again.


We will increase invalidate_count which prevent the page being used by map.

Thanks


>
>
> At this point I don't understand how we can mark page dirty
> safely.
>
>>>>>> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
>>>>>> (just a little bit more hard to trigger):
>>>>> AFAIK these never run in response to guest events.
>>>>> So they can take very long and guests still won't crash.
>>>> What if guest manages to escape to qemu?
>>>>
>>>> Thanks
>>> Then it's going to be slow. Why do we care?
>>> What we do not want is synchronize_rcu that guest is blocked on.
>>>
>> Ok, this looks like that I have some misunderstanding here of the reason why
>> synchronize_rcu() is not preferable in the path of ioctl. But in kvm case,
>> if rcu_expedited is set, it can triggers IPIs AFAIK.
>>
>> Thanks
>>
> Yes, expedited is not good for something guest can trigger.
> Let's just use kfree_rcu if we can. Paul said even though
> documentation still says it needs to be rate-limited, that
> part is basically stale and will get updated.
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  8:49                       ` Jason Wang
@ 2019-07-23  9:26                         ` Michael S. Tsirkin
  2019-07-23 13:31                           ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23  9:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
> > > > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > > > Looks not, you need rate limit the fire as you've figured out?
> > > > See the discussion that followed. Basically no, it's good enough
> > > > already and is only going to be better.
> > > > 
> > > > > And in fact,
> > > > > the synchronization is not even needed, does it help if I leave a comment to
> > > > > explain?
> > > > Let's try to figure it out in the mail first. I'm pretty sure the
> > > > current logic is wrong.
> > > 
> > > Here is what the code what to achieve:
> > > 
> > > - The map was protected by RCU
> > > 
> > > - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> > > etc), meta_prefetch (datapath)
> > > 
> > > - Readers are: memory accessor
> > > 
> > > Writer are synchronized through mmu_lock. RCU is used to synchronized
> > > between writers and readers.
> > > 
> > > The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
> > > with readers (memory accessors) in the path of file operations. But in this
> > > case, vq->mutex was already held, this means it has been serialized with
> > > memory accessor. That's why I think it could be removed safely.
> > > 
> > > Anything I miss here?
> > > 
> > So invalidate callbacks need to reset the map, and they do
> > not have vq mutex. How can they do this and free
> > the map safely? They need synchronize_rcu or kfree_rcu right?
> 
> 
> Invalidation callbacks need but file operations (e.g ioctl) not.
> 
> 
> > 
> > And I worry somewhat that synchronize_rcu in an MMU notifier
> > is a problem, MMU notifiers are supposed to be quick:
> 
> 
> Looks not, since it can allow to be blocked and lots of driver depends on
> this. (E.g mmu_notifier_range_blockable()).

Right, they can block. So why don't we take a VQ mutex and be
done with it then? No RCU tricks.

> 
> > they are on a read side critical section of SRCU.
> > 
> > If we could get rid of RCU that would be even better.
> > 
> > But now I wonder:
> > 	invalidate_start has to mark page as dirty
> > 	(this is what my patch added, current code misses this).
> 
> 
> Nope, current code did this but not the case when map need to be invalidated
> in the vhost control path (ioctl etc).
> 
> 
> > 
> > 	at that point kernel can come and make the page clean again.
> > 
> > 	At that point VQ handlers can keep a copy of the map
> > 	and change the page again.
> 
> 
> We will increase invalidate_count which prevent the page being used by map.
> 
> Thanks

OK I think I got it, thanks!


> 
> > 
> > 
> > At this point I don't understand how we can mark page dirty
> > safely.
> > 
> > > > > > > Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> > > > > > > (just a little bit more hard to trigger):
> > > > > > AFAIK these never run in response to guest events.
> > > > > > So they can take very long and guests still won't crash.
> > > > > What if guest manages to escape to qemu?
> > > > > 
> > > > > Thanks
> > > > Then it's going to be slow. Why do we care?
> > > > What we do not want is synchronize_rcu that guest is blocked on.
> > > > 
> > > Ok, this looks like that I have some misunderstanding here of the reason why
> > > synchronize_rcu() is not preferable in the path of ioctl. But in kvm case,
> > > if rcu_expedited is set, it can triggers IPIs AFAIK.
> > > 
> > > Thanks
> > > 
> > Yes, expedited is not good for something guest can trigger.
> > Let's just use kfree_rcu if we can. Paul said even though
> > documentation still says it needs to be rate-limited, that
> > part is basically stale and will get updated.
> > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  8:42                 ` Jason Wang
@ 2019-07-23 10:27                   ` Michael S. Tsirkin
  2019-07-23 13:34                     ` Jason Wang
  2019-07-23 10:42                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23 10:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 04:42:19PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午3:56, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 01:48:52PM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午1:02, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 23, 2019 at 11:55:28AM +0800, Jason Wang wrote:
> > > > > On 2019/7/22 下午4:02, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jul 22, 2019 at 01:21:59PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/21 下午6:02, Michael S. Tsirkin wrote:
> > > > > > > > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > > > > > > > syzbot has bisected this bug to:
> > > > > > > > > 
> > > > > > > > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > > > > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Date:   Fri May 24 08:12:18 2019 +0000
> > > > > > > > > 
> > > > > > > > >         vhost: access vq metadata through kernel virtual address
> > > > > > > > > 
> > > > > > > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > > > > > > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > > > > > > > git tree:       linux-next
> > > > > > > > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > > > > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > > > > > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > > > > > > > 
> > > > > > > > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > > > > > > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > > > > > > > address")
> > > > > > > > > 
> > > > > > > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > > > > > > > OK I poked at this for a bit, I see several things that
> > > > > > > > we need to fix, though I'm not yet sure it's the reason for
> > > > > > > > the failures:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> > > > > > > >        That's just a bad hack,
> > > > > > > This is used to avoid holding lock when checking whether the addresses are
> > > > > > > overlapped. Otherwise we need to take spinlock for each invalidation request
> > > > > > > even if it was the va range that is not interested for us. This will be very
> > > > > > > slow e.g during guest boot.
> > > > > > KVM seems to do exactly that.
> > > > > > I tried and guest does not seem to boot any slower.
> > > > > > Do you observe any slowdown?
> > > > > Yes I do.
> > > > > 
> > > > > 
> > > > > > Now I took a hard look at the uaddr hackery it really makes
> > > > > > me nervious. So I think for this release we want something
> > > > > > safe, and optimizations on top. As an alternative revert the
> > > > > > optimization and try again for next merge window.
> > > > > Will post a series of fixes, let me know if you're ok with that.
> > > > > 
> > > > > Thanks
> > > > I'd prefer you to take a hard look at the patch I posted
> > > > which makes code cleaner,
> > > 
> > > I did. But it looks to me a series that is only about 60 lines of code can
> > > fix all the issues we found without reverting the uaddr optimization.
> > Another thing I like about the patch I posted is that
> > it removes 60 lines of code, instead of adding more :)
> > Mostly because of unifying everything into
> > a single cleanup function and using kfree_rcu.
> 
> 
> Yes.
> 
> 
> > 
> > So how about this: do exactly what you propose but as a 2 patch series:
> > start with the slow safe patch, and add then return uaddr optimizations
> > on top. We can then more easily reason about whether they are safe.
> 
> 
> If you stick, I can do this.

Given I realized my patch is buggy in that
it does not wait for outstanding maps, I don't
insist.

> 
> > Basically you are saying this:
> > 	- notifiers are only needed to invalidate maps
> > 	- we make sure any uaddr change invalidates maps anyway
> > 	- thus it's ok not to have notifiers since we do
> > 	  not have maps
> > 
> > All this looks ok but the question is why do we
> > bother unregistering them. And the answer seems to
> > be that this is so we can start with a balanced
> > counter: otherwise we can be between _start and
> > _end calls.
> 
> 
> Yes, since there could be multiple co-current invalidation requests. We need
> count them to make sure we don't pin wrong pages.
> 
> 
> > 
> > I also wonder about ordering. kvm has this:
> >         /*
> >           * Used to check for invalidations in progress, of the pfn that is
> >           * returned by pfn_to_pfn_prot below.
> >           */
> >          mmu_seq = kvm->mmu_notifier_seq;
> >          /*
> >           * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> >           * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> >           * risk the page we get a reference to getting unmapped before we have a
> >           * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> >           *
> >           * This smp_rmb() pairs with the effective smp_wmb() of the combination
> >           * of the pte_unmap_unlock() after the PTE is zapped, and the
> >           * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> >           * mmu_notifier_seq is incremented.
> >           */
> >          smp_rmb();
> > 
> > does this apply to us? Can't we use a seqlock instead so we do
> > not need to worry?
> 
> 
> I'm not familiar with kvm MMU internals, but we do everything under of
> mmu_lock.
> 
> Thanks

I don't think this helps at all.

There's no lock between checking the invalidate counter and
get user pages fast within vhost_map_prefetch. So it's possible
that get user pages fast reads PTEs speculatively before
invalidate is read.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  8:42                 ` Jason Wang
  2019-07-23 10:27                   ` Michael S. Tsirkin
@ 2019-07-23 10:42                   ` Michael S. Tsirkin
  2019-07-23 13:37                     ` Jason Wang
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23 10:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 04:42:19PM +0800, Jason Wang wrote:
> > So how about this: do exactly what you propose but as a 2 patch series:
> > start with the slow safe patch, and add then return uaddr optimizations
> > on top. We can then more easily reason about whether they are safe.
> 
> 
> If you stick, I can do this.

So I definitely don't insist but I'd like us to get back to where
we know existing code is very safe (if not super fast) and
optimizing from there.  Bugs happen but I'd like to see a bisect
giving us "oh it's because of XYZ optimization" and not the
general "it's somewhere within this driver" that we are getting
now.

Maybe the way to do this is to revert for this release cycle
and target the next one. What do you think?

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23  9:26                         ` Michael S. Tsirkin
@ 2019-07-23 13:31                           ` Jason Wang
  2019-07-25  5:52                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23 13:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
>>>> On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
>>>>>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>>>>>> Looks not, you need rate limit the fire as you've figured out?
>>>>> See the discussion that followed. Basically no, it's good enough
>>>>> already and is only going to be better.
>>>>>
>>>>>> And in fact,
>>>>>> the synchronization is not even needed, does it help if I leave a comment to
>>>>>> explain?
>>>>> Let's try to figure it out in the mail first. I'm pretty sure the
>>>>> current logic is wrong.
>>>> Here is what the code what to achieve:
>>>>
>>>> - The map was protected by RCU
>>>>
>>>> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
>>>> etc), meta_prefetch (datapath)
>>>>
>>>> - Readers are: memory accessor
>>>>
>>>> Writer are synchronized through mmu_lock. RCU is used to synchronized
>>>> between writers and readers.
>>>>
>>>> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
>>>> with readers (memory accessors) in the path of file operations. But in this
>>>> case, vq->mutex was already held, this means it has been serialized with
>>>> memory accessor. That's why I think it could be removed safely.
>>>>
>>>> Anything I miss here?
>>>>
>>> So invalidate callbacks need to reset the map, and they do
>>> not have vq mutex. How can they do this and free
>>> the map safely? They need synchronize_rcu or kfree_rcu right?
>> Invalidation callbacks need but file operations (e.g ioctl) not.
>>
>>
>>> And I worry somewhat that synchronize_rcu in an MMU notifier
>>> is a problem, MMU notifiers are supposed to be quick:
>> Looks not, since it can allow to be blocked and lots of driver depends on
>> this. (E.g mmu_notifier_range_blockable()).
> Right, they can block. So why don't we take a VQ mutex and be
> done with it then? No RCU tricks.


This is how I want to go with RFC and V1. But I end up with deadlock 
between vq locks and some MM internal locks. So I decide to use RCU 
which is 100% under the control of vhost.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23 10:27                   ` Michael S. Tsirkin
@ 2019-07-23 13:34                     ` Jason Wang
  2019-07-23 15:02                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-23 13:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
>> Yes, since there could be multiple co-current invalidation requests. We need
>> count them to make sure we don't pin wrong pages.
>>
>>
>>> I also wonder about ordering. kvm has this:
>>>          /*
>>>            * Used to check for invalidations in progress, of the pfn that is
>>>            * returned by pfn_to_pfn_prot below.
>>>            */
>>>           mmu_seq = kvm->mmu_notifier_seq;
>>>           /*
>>>            * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
>>>            * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>>>            * risk the page we get a reference to getting unmapped before we have a
>>>            * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
>>>            *
>>>            * This smp_rmb() pairs with the effective smp_wmb() of the combination
>>>            * of the pte_unmap_unlock() after the PTE is zapped, and the
>>>            * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
>>>            * mmu_notifier_seq is incremented.
>>>            */
>>>           smp_rmb();
>>>
>>> does this apply to us? Can't we use a seqlock instead so we do
>>> not need to worry?
>> I'm not familiar with kvm MMU internals, but we do everything under of
>> mmu_lock.
>>
>> Thanks
> I don't think this helps at all.
>
> There's no lock between checking the invalidate counter and
> get user pages fast within vhost_map_prefetch. So it's possible
> that get user pages fast reads PTEs speculatively before
> invalidate is read.
>
> -- 


In vhost_map_prefetch() we do:

         spin_lock(&vq->mmu_lock);

         ...

         err = -EFAULT;
         if (vq->invalidate_count)
                 goto err;

         ...

         npinned = __get_user_pages_fast(uaddr->uaddr, npages,
                                         uaddr->write, pages);

         ...

         spin_unlock(&vq->mmu_lock);

Is this not sufficient?

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23 10:42                   ` Michael S. Tsirkin
@ 2019-07-23 13:37                     ` Jason Wang
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-23 13:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午6:42, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 04:42:19PM +0800, Jason Wang wrote:
>>> So how about this: do exactly what you propose but as a 2 patch series:
>>> start with the slow safe patch, and add then return uaddr optimizations
>>> on top. We can then more easily reason about whether they are safe.
>>
>> If you stick, I can do this.
> So I definitely don't insist but I'd like us to get back to where
> we know existing code is very safe (if not super fast) and
> optimizing from there.  Bugs happen but I'd like to see a bisect
> giving us "oh it's because of XYZ optimization" and not the
> general "it's somewhere within this driver" that we are getting
> now.


Syzbot has bisected to the commit of metadata acceleration in fact :)


>
> Maybe the way to do this is to revert for this release cycle
> and target the next one. What do you think?


I would try to fix the issues consider packed virtqueue which may use 
this for a good performance number. But if you insist, I'm ok to revert. 
Or maybe introduce a config option to disable it by default (almost all 
optimized could be ruled out).

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23 13:34                     ` Jason Wang
@ 2019-07-23 15:02                       ` Michael S. Tsirkin
  2019-07-24  2:17                         ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23 15:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
> > > Yes, since there could be multiple co-current invalidation requests. We need
> > > count them to make sure we don't pin wrong pages.
> > > 
> > > 
> > > > I also wonder about ordering. kvm has this:
> > > >          /*
> > > >            * Used to check for invalidations in progress, of the pfn that is
> > > >            * returned by pfn_to_pfn_prot below.
> > > >            */
> > > >           mmu_seq = kvm->mmu_notifier_seq;
> > > >           /*
> > > >            * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> > > >            * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> > > >            * risk the page we get a reference to getting unmapped before we have a
> > > >            * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> > > >            *
> > > >            * This smp_rmb() pairs with the effective smp_wmb() of the combination
> > > >            * of the pte_unmap_unlock() after the PTE is zapped, and the
> > > >            * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> > > >            * mmu_notifier_seq is incremented.
> > > >            */
> > > >           smp_rmb();
> > > > 
> > > > does this apply to us? Can't we use a seqlock instead so we do
> > > > not need to worry?
> > > I'm not familiar with kvm MMU internals, but we do everything under of
> > > mmu_lock.
> > > 
> > > Thanks
> > I don't think this helps at all.
> > 
> > There's no lock between checking the invalidate counter and
> > get user pages fast within vhost_map_prefetch. So it's possible
> > that get user pages fast reads PTEs speculatively before
> > invalidate is read.
> > 
> > -- 
> 
> 
> In vhost_map_prefetch() we do:
> 
>         spin_lock(&vq->mmu_lock);
> 
>         ...
> 
>         err = -EFAULT;
>         if (vq->invalidate_count)
>                 goto err;
> 
>         ...
> 
>         npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>                                         uaddr->write, pages);
> 
>         ...
> 
>         spin_unlock(&vq->mmu_lock);
> 
> Is this not sufficient?
> 
> Thanks

So what orders __get_user_pages_fast wrt invalidate_count read?

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23 15:02                       ` Michael S. Tsirkin
@ 2019-07-24  2:17                         ` Jason Wang
  2019-07-24  8:05                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-24  2:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
>>>> Yes, since there could be multiple co-current invalidation requests. We need
>>>> count them to make sure we don't pin wrong pages.
>>>>
>>>>
>>>>> I also wonder about ordering. kvm has this:
>>>>>           /*
>>>>>             * Used to check for invalidations in progress, of the pfn that is
>>>>>             * returned by pfn_to_pfn_prot below.
>>>>>             */
>>>>>            mmu_seq = kvm->mmu_notifier_seq;
>>>>>            /*
>>>>>             * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
>>>>>             * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>>>>>             * risk the page we get a reference to getting unmapped before we have a
>>>>>             * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
>>>>>             *
>>>>>             * This smp_rmb() pairs with the effective smp_wmb() of the combination
>>>>>             * of the pte_unmap_unlock() after the PTE is zapped, and the
>>>>>             * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
>>>>>             * mmu_notifier_seq is incremented.
>>>>>             */
>>>>>            smp_rmb();
>>>>>
>>>>> does this apply to us? Can't we use a seqlock instead so we do
>>>>> not need to worry?
>>>> I'm not familiar with kvm MMU internals, but we do everything under of
>>>> mmu_lock.
>>>>
>>>> Thanks
>>> I don't think this helps at all.
>>>
>>> There's no lock between checking the invalidate counter and
>>> get user pages fast within vhost_map_prefetch. So it's possible
>>> that get user pages fast reads PTEs speculatively before
>>> invalidate is read.
>>>
>>> -- 
>>
>> In vhost_map_prefetch() we do:
>>
>>          spin_lock(&vq->mmu_lock);
>>
>>          ...
>>
>>          err = -EFAULT;
>>          if (vq->invalidate_count)
>>                  goto err;
>>
>>          ...
>>
>>          npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>>                                          uaddr->write, pages);
>>
>>          ...
>>
>>          spin_unlock(&vq->mmu_lock);
>>
>> Is this not sufficient?
>>
>> Thanks
> So what orders __get_user_pages_fast wrt invalidate_count read?


So in invalidate_end() callback we have:

spin_lock(&vq->mmu_lock);
--vq->invalidate_count;
         spin_unlock(&vq->mmu_lock);


So even PTE is read speculatively before reading invalidate_count (only 
in the case of invalidate_count is zero). The spinlock has guaranteed 
that we won't read any stale PTEs.

Thanks


>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24  2:17                         ` Jason Wang
@ 2019-07-24  8:05                           ` Michael S. Tsirkin
  2019-07-24 10:08                             ` Jason Wang
  2019-07-24 16:53                             ` Jason Gunthorpe
  0 siblings, 2 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-24  8:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
> > > > > Yes, since there could be multiple co-current invalidation requests. We need
> > > > > count them to make sure we don't pin wrong pages.
> > > > > 
> > > > > 
> > > > > > I also wonder about ordering. kvm has this:
> > > > > >           /*
> > > > > >             * Used to check for invalidations in progress, of the pfn that is
> > > > > >             * returned by pfn_to_pfn_prot below.
> > > > > >             */
> > > > > >            mmu_seq = kvm->mmu_notifier_seq;
> > > > > >            /*
> > > > > >             * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> > > > > >             * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> > > > > >             * risk the page we get a reference to getting unmapped before we have a
> > > > > >             * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> > > > > >             *
> > > > > >             * This smp_rmb() pairs with the effective smp_wmb() of the combination
> > > > > >             * of the pte_unmap_unlock() after the PTE is zapped, and the
> > > > > >             * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> > > > > >             * mmu_notifier_seq is incremented.
> > > > > >             */
> > > > > >            smp_rmb();
> > > > > > 
> > > > > > does this apply to us? Can't we use a seqlock instead so we do
> > > > > > not need to worry?
> > > > > I'm not familiar with kvm MMU internals, but we do everything under of
> > > > > mmu_lock.
> > > > > 
> > > > > Thanks
> > > > I don't think this helps at all.
> > > > 
> > > > There's no lock between checking the invalidate counter and
> > > > get user pages fast within vhost_map_prefetch. So it's possible
> > > > that get user pages fast reads PTEs speculatively before
> > > > invalidate is read.
> > > > 
> > > > -- 
> > > 
> > > In vhost_map_prefetch() we do:
> > > 
> > >          spin_lock(&vq->mmu_lock);
> > > 
> > >          ...
> > > 
> > >          err = -EFAULT;
> > >          if (vq->invalidate_count)
> > >                  goto err;
> > > 
> > >          ...
> > > 
> > >          npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > >                                          uaddr->write, pages);
> > > 
> > >          ...
> > > 
> > >          spin_unlock(&vq->mmu_lock);
> > > 
> > > Is this not sufficient?
> > > 
> > > Thanks
> > So what orders __get_user_pages_fast wrt invalidate_count read?
> 
> 
> So in invalidate_end() callback we have:
> 
> spin_lock(&vq->mmu_lock);
> --vq->invalidate_count;
>         spin_unlock(&vq->mmu_lock);
> 
> 
> So even PTE is read speculatively before reading invalidate_count (only in
> the case of invalidate_count is zero). The spinlock has guaranteed that we
> won't read any stale PTEs.
> 
> Thanks

I'm sorry I just do not get the argument.
If you want to order two reads you need an smp_rmb
or stronger between them executed on the same CPU.

Executing any kind of barrier on another CPU
will have no ordering effect on the 1st one.


So if CPU1 runs the prefetch, and CPU2 runs invalidate
callback, read of invalidate counter on CPU1 can bypass
read of PTE on CPU1 unless there's a barrier
in between, and nothing CPU2 does can affect that outcome.


What did I miss?

> 
> > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24  8:05                           ` Michael S. Tsirkin
@ 2019-07-24 10:08                             ` Jason Wang
  2019-07-24 18:25                               ` Michael S. Tsirkin
  2019-07-24 16:53                             ` Jason Gunthorpe
  1 sibling, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-24 10:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/24 下午4:05, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
>> On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
>>>> On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
>>>>>> Yes, since there could be multiple co-current invalidation requests. We need
>>>>>> count them to make sure we don't pin wrong pages.
>>>>>>
>>>>>>
>>>>>>> I also wonder about ordering. kvm has this:
>>>>>>>            /*
>>>>>>>              * Used to check for invalidations in progress, of the pfn that is
>>>>>>>              * returned by pfn_to_pfn_prot below.
>>>>>>>              */
>>>>>>>             mmu_seq = kvm->mmu_notifier_seq;
>>>>>>>             /*
>>>>>>>              * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
>>>>>>>              * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>>>>>>>              * risk the page we get a reference to getting unmapped before we have a
>>>>>>>              * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
>>>>>>>              *
>>>>>>>              * This smp_rmb() pairs with the effective smp_wmb() of the combination
>>>>>>>              * of the pte_unmap_unlock() after the PTE is zapped, and the
>>>>>>>              * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
>>>>>>>              * mmu_notifier_seq is incremented.
>>>>>>>              */
>>>>>>>             smp_rmb();
>>>>>>>
>>>>>>> does this apply to us? Can't we use a seqlock instead so we do
>>>>>>> not need to worry?
>>>>>> I'm not familiar with kvm MMU internals, but we do everything under of
>>>>>> mmu_lock.
>>>>>>
>>>>>> Thanks
>>>>> I don't think this helps at all.
>>>>>
>>>>> There's no lock between checking the invalidate counter and
>>>>> get user pages fast within vhost_map_prefetch. So it's possible
>>>>> that get user pages fast reads PTEs speculatively before
>>>>> invalidate is read.
>>>>>
>>>>> -- 
>>>> In vhost_map_prefetch() we do:
>>>>
>>>>           spin_lock(&vq->mmu_lock);
>>>>
>>>>           ...
>>>>
>>>>           err = -EFAULT;
>>>>           if (vq->invalidate_count)
>>>>                   goto err;
>>>>
>>>>           ...
>>>>
>>>>           npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>>>>                                           uaddr->write, pages);
>>>>
>>>>           ...
>>>>
>>>>           spin_unlock(&vq->mmu_lock);
>>>>
>>>> Is this not sufficient?
>>>>
>>>> Thanks
>>> So what orders __get_user_pages_fast wrt invalidate_count read?
>>
>> So in invalidate_end() callback we have:
>>
>> spin_lock(&vq->mmu_lock);
>> --vq->invalidate_count;
>>          spin_unlock(&vq->mmu_lock);
>>
>>
>> So even PTE is read speculatively before reading invalidate_count (only in
>> the case of invalidate_count is zero). The spinlock has guaranteed that we
>> won't read any stale PTEs.
>>
>> Thanks
> I'm sorry I just do not get the argument.
> If you want to order two reads you need an smp_rmb
> or stronger between them executed on the same CPU.
>
> Executing any kind of barrier on another CPU
> will have no ordering effect on the 1st one.
>
>
> So if CPU1 runs the prefetch, and CPU2 runs invalidate
> callback, read of invalidate counter on CPU1 can bypass
> read of PTE on CPU1 unless there's a barrier
> in between, and nothing CPU2 does can affect that outcome.
>
>
> What did I miss?


It doesn't harm if PTE is read before invalidate_count, this is because:

1) This speculation is serialized with invalidate_range_end() because of 
the spinlock

2) This speculation can only make effect when we read invalidate_count 
as zero.

3) This means the speculation is done after the last 
invalidate_range_end() and because of the spinlock, when we enter the 
critical section of spinlock in prefetch, we can not see any stale PTE 
that was unmapped before.

Am I wrong?

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24  8:05                           ` Michael S. Tsirkin
  2019-07-24 10:08                             ` Jason Wang
@ 2019-07-24 16:53                             ` Jason Gunthorpe
  2019-07-24 18:25                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-24 16:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Wed, Jul 24, 2019 at 04:05:17AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> > So even PTE is read speculatively before reading invalidate_count (only in
> > the case of invalidate_count is zero). The spinlock has guaranteed that we
> > won't read any stale PTEs.
> 
> I'm sorry I just do not get the argument.
> If you want to order two reads you need an smp_rmb
> or stronger between them executed on the same CPU.

No, that is only for unlocked algorithms.

In this case the spinlock provides all the 'or stronger' ordering
required.

For invalidate_count going 0->1 the spin_lock ensures that any
following PTE update during invalidation does not order before the
spin_lock()

While holding the lock and observing 1 in invalidate_count the PTE
values might be changing, but are ignored. C's rules about sequencing
make this safe.

For invalidate_count going 1->0 the spin_unlock ensures that any
preceeding PTE update during invalidation does not order after the
spin_unlock

While holding the lock and observing 0 in invalidating_count the PTE
values cannot be changing.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24 16:53                             ` Jason Gunthorpe
@ 2019-07-24 18:25                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-24 18:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Wed, Jul 24, 2019 at 01:53:17PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 04:05:17AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> > > So even PTE is read speculatively before reading invalidate_count (only in
> > > the case of invalidate_count is zero). The spinlock has guaranteed that we
> > > won't read any stale PTEs.
> > 
> > I'm sorry I just do not get the argument.
> > If you want to order two reads you need an smp_rmb
> > or stronger between them executed on the same CPU.
> 
> No, that is only for unlocked algorithms.
> 
> In this case the spinlock provides all the 'or stronger' ordering
> required.
> 
> For invalidate_count going 0->1 the spin_lock ensures that any
> following PTE update during invalidation does not order before the
> spin_lock()
> 
> While holding the lock and observing 1 in invalidate_count the PTE
> values might be changing, but are ignored. C's rules about sequencing
> make this safe.
> 
> For invalidate_count going 1->0 the spin_unlock ensures that any
> preceeding PTE update during invalidation does not order after the
> spin_unlock
> 
> While holding the lock and observing 0 in invalidating_count the PTE
> values cannot be changing.
> 
> Jason

Oh right. So prefetch holds the spinlock the whole time.
Sorry about the noise.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24 10:08                             ` Jason Wang
@ 2019-07-24 18:25                               ` Michael S. Tsirkin
  2019-07-25  3:44                                 ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-24 18:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Wed, Jul 24, 2019 at 06:08:05PM +0800, Jason Wang wrote:
> 
> On 2019/7/24 下午4:05, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
> > > > > On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
> > > > > > > Yes, since there could be multiple co-current invalidation requests. We need
> > > > > > > count them to make sure we don't pin wrong pages.
> > > > > > > 
> > > > > > > 
> > > > > > > > I also wonder about ordering. kvm has this:
> > > > > > > >            /*
> > > > > > > >              * Used to check for invalidations in progress, of the pfn that is
> > > > > > > >              * returned by pfn_to_pfn_prot below.
> > > > > > > >              */
> > > > > > > >             mmu_seq = kvm->mmu_notifier_seq;
> > > > > > > >             /*
> > > > > > > >              * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> > > > > > > >              * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> > > > > > > >              * risk the page we get a reference to getting unmapped before we have a
> > > > > > > >              * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> > > > > > > >              *
> > > > > > > >              * This smp_rmb() pairs with the effective smp_wmb() of the combination
> > > > > > > >              * of the pte_unmap_unlock() after the PTE is zapped, and the
> > > > > > > >              * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> > > > > > > >              * mmu_notifier_seq is incremented.
> > > > > > > >              */
> > > > > > > >             smp_rmb();
> > > > > > > > 
> > > > > > > > does this apply to us? Can't we use a seqlock instead so we do
> > > > > > > > not need to worry?
> > > > > > > I'm not familiar with kvm MMU internals, but we do everything under of
> > > > > > > mmu_lock.
> > > > > > > 
> > > > > > > Thanks
> > > > > > I don't think this helps at all.
> > > > > > 
> > > > > > There's no lock between checking the invalidate counter and
> > > > > > get user pages fast within vhost_map_prefetch. So it's possible
> > > > > > that get user pages fast reads PTEs speculatively before
> > > > > > invalidate is read.
> > > > > > 
> > > > > > -- 
> > > > > In vhost_map_prefetch() we do:
> > > > > 
> > > > >           spin_lock(&vq->mmu_lock);
> > > > > 
> > > > >           ...
> > > > > 
> > > > >           err = -EFAULT;
> > > > >           if (vq->invalidate_count)
> > > > >                   goto err;
> > > > > 
> > > > >           ...
> > > > > 
> > > > >           npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > > >                                           uaddr->write, pages);
> > > > > 
> > > > >           ...
> > > > > 
> > > > >           spin_unlock(&vq->mmu_lock);
> > > > > 
> > > > > Is this not sufficient?
> > > > > 
> > > > > Thanks
> > > > So what orders __get_user_pages_fast wrt invalidate_count read?
> > > 
> > > So in invalidate_end() callback we have:
> > > 
> > > spin_lock(&vq->mmu_lock);
> > > --vq->invalidate_count;
> > >          spin_unlock(&vq->mmu_lock);
> > > 
> > > 
> > > So even PTE is read speculatively before reading invalidate_count (only in
> > > the case of invalidate_count is zero). The spinlock has guaranteed that we
> > > won't read any stale PTEs.
> > > 
> > > Thanks
> > I'm sorry I just do not get the argument.
> > If you want to order two reads you need an smp_rmb
> > or stronger between them executed on the same CPU.
> > 
> > Executing any kind of barrier on another CPU
> > will have no ordering effect on the 1st one.
> > 
> > 
> > So if CPU1 runs the prefetch, and CPU2 runs invalidate
> > callback, read of invalidate counter on CPU1 can bypass
> > read of PTE on CPU1 unless there's a barrier
> > in between, and nothing CPU2 does can affect that outcome.
> > 
> > 
> > What did I miss?
> 
> 
> It doesn't harm if PTE is read before invalidate_count, this is because:
> 
> 1) This speculation is serialized with invalidate_range_end() because of the
> spinlock
> 
> 2) This speculation can only make effect when we read invalidate_count as
> zero.
> 
> 3) This means the speculation is done after the last invalidate_range_end()
> and because of the spinlock, when we enter the critical section of spinlock
> in prefetch, we can not see any stale PTE that was unmapped before.
> 
> Am I wrong?
> 
> Thanks

OK I think you are right. Sorry it took me a while to figure out.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-24 18:25                               ` Michael S. Tsirkin
@ 2019-07-25  3:44                                 ` Jason Wang
  2019-07-25  5:09                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-25  3:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/25 上午2:25, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 06:08:05PM +0800, Jason Wang wrote:
>> On 2019/7/24 下午4:05, Michael S. Tsirkin wrote:
>>> On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
>>>> On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
>>>>> On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
>>>>>>>> Yes, since there could be multiple co-current invalidation requests. We need
>>>>>>>> count them to make sure we don't pin wrong pages.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I also wonder about ordering. kvm has this:
>>>>>>>>>             /*
>>>>>>>>>               * Used to check for invalidations in progress, of the pfn that is
>>>>>>>>>               * returned by pfn_to_pfn_prot below.
>>>>>>>>>               */
>>>>>>>>>              mmu_seq = kvm->mmu_notifier_seq;
>>>>>>>>>              /*
>>>>>>>>>               * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
>>>>>>>>>               * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>>>>>>>>>               * risk the page we get a reference to getting unmapped before we have a
>>>>>>>>>               * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
>>>>>>>>>               *
>>>>>>>>>               * This smp_rmb() pairs with the effective smp_wmb() of the combination
>>>>>>>>>               * of the pte_unmap_unlock() after the PTE is zapped, and the
>>>>>>>>>               * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
>>>>>>>>>               * mmu_notifier_seq is incremented.
>>>>>>>>>               */
>>>>>>>>>              smp_rmb();
>>>>>>>>>
>>>>>>>>> does this apply to us? Can't we use a seqlock instead so we do
>>>>>>>>> not need to worry?
>>>>>>>> I'm not familiar with kvm MMU internals, but we do everything under of
>>>>>>>> mmu_lock.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> I don't think this helps at all.
>>>>>>>
>>>>>>> There's no lock between checking the invalidate counter and
>>>>>>> get user pages fast within vhost_map_prefetch. So it's possible
>>>>>>> that get user pages fast reads PTEs speculatively before
>>>>>>> invalidate is read.
>>>>>>>
>>>>>>> -- 
>>>>>> In vhost_map_prefetch() we do:
>>>>>>
>>>>>>            spin_lock(&vq->mmu_lock);
>>>>>>
>>>>>>            ...
>>>>>>
>>>>>>            err = -EFAULT;
>>>>>>            if (vq->invalidate_count)
>>>>>>                    goto err;
>>>>>>
>>>>>>            ...
>>>>>>
>>>>>>            npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>>>>>>                                            uaddr->write, pages);
>>>>>>
>>>>>>            ...
>>>>>>
>>>>>>            spin_unlock(&vq->mmu_lock);
>>>>>>
>>>>>> Is this not sufficient?
>>>>>>
>>>>>> Thanks
>>>>> So what orders __get_user_pages_fast wrt invalidate_count read?
>>>> So in invalidate_end() callback we have:
>>>>
>>>> spin_lock(&vq->mmu_lock);
>>>> --vq->invalidate_count;
>>>>           spin_unlock(&vq->mmu_lock);
>>>>
>>>>
>>>> So even PTE is read speculatively before reading invalidate_count (only in
>>>> the case of invalidate_count is zero). The spinlock has guaranteed that we
>>>> won't read any stale PTEs.
>>>>
>>>> Thanks
>>> I'm sorry I just do not get the argument.
>>> If you want to order two reads you need an smp_rmb
>>> or stronger between them executed on the same CPU.
>>>
>>> Executing any kind of barrier on another CPU
>>> will have no ordering effect on the 1st one.
>>>
>>>
>>> So if CPU1 runs the prefetch, and CPU2 runs invalidate
>>> callback, read of invalidate counter on CPU1 can bypass
>>> read of PTE on CPU1 unless there's a barrier
>>> in between, and nothing CPU2 does can affect that outcome.
>>>
>>>
>>> What did I miss?
>>
>> It doesn't harm if PTE is read before invalidate_count, this is because:
>>
>> 1) This speculation is serialized with invalidate_range_end() because of the
>> spinlock
>>
>> 2) This speculation can only make effect when we read invalidate_count as
>> zero.
>>
>> 3) This means the speculation is done after the last invalidate_range_end()
>> and because of the spinlock, when we enter the critical section of spinlock
>> in prefetch, we can not see any stale PTE that was unmapped before.
>>
>> Am I wrong?
>>
>> Thanks
> OK I think you are right. Sorry it took me a while to figure out.


No problem. So do you want me to send a V2 of the fixes (e.g with the 
conversion from synchronize_rcu() to kfree_rcu()). Or you want something 
else. (e.g revert or a config option)?

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25  3:44                                 ` Jason Wang
@ 2019-07-25  5:09                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25  5:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Thu, Jul 25, 2019 at 11:44:27AM +0800, Jason Wang wrote:
> 
> On 2019/7/25 上午2:25, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 06:08:05PM +0800, Jason Wang wrote:
> > > On 2019/7/24 下午4:05, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> > > > > On 2019/7/23 下午11:02, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 23, 2019 at 09:34:29PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/23 下午6:27, Michael S. Tsirkin wrote:
> > > > > > > > > Yes, since there could be multiple co-current invalidation requests. We need
> > > > > > > > > count them to make sure we don't pin wrong pages.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > I also wonder about ordering. kvm has this:
> > > > > > > > > >             /*
> > > > > > > > > >               * Used to check for invalidations in progress, of the pfn that is
> > > > > > > > > >               * returned by pfn_to_pfn_prot below.
> > > > > > > > > >               */
> > > > > > > > > >              mmu_seq = kvm->mmu_notifier_seq;
> > > > > > > > > >              /*
> > > > > > > > > >               * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> > > > > > > > > >               * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> > > > > > > > > >               * risk the page we get a reference to getting unmapped before we have a
> > > > > > > > > >               * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> > > > > > > > > >               *
> > > > > > > > > >               * This smp_rmb() pairs with the effective smp_wmb() of the combination
> > > > > > > > > >               * of the pte_unmap_unlock() after the PTE is zapped, and the
> > > > > > > > > >               * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> > > > > > > > > >               * mmu_notifier_seq is incremented.
> > > > > > > > > >               */
> > > > > > > > > >              smp_rmb();
> > > > > > > > > > 
> > > > > > > > > > does this apply to us? Can't we use a seqlock instead so we do
> > > > > > > > > > not need to worry?
> > > > > > > > > I'm not familiar with kvm MMU internals, but we do everything under of
> > > > > > > > > mmu_lock.
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > I don't think this helps at all.
> > > > > > > > 
> > > > > > > > There's no lock between checking the invalidate counter and
> > > > > > > > get user pages fast within vhost_map_prefetch. So it's possible
> > > > > > > > that get user pages fast reads PTEs speculatively before
> > > > > > > > invalidate is read.
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > In vhost_map_prefetch() we do:
> > > > > > > 
> > > > > > >            spin_lock(&vq->mmu_lock);
> > > > > > > 
> > > > > > >            ...
> > > > > > > 
> > > > > > >            err = -EFAULT;
> > > > > > >            if (vq->invalidate_count)
> > > > > > >                    goto err;
> > > > > > > 
> > > > > > >            ...
> > > > > > > 
> > > > > > >            npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > > > > >                                            uaddr->write, pages);
> > > > > > > 
> > > > > > >            ...
> > > > > > > 
> > > > > > >            spin_unlock(&vq->mmu_lock);
> > > > > > > 
> > > > > > > Is this not sufficient?
> > > > > > > 
> > > > > > > Thanks
> > > > > > So what orders __get_user_pages_fast wrt invalidate_count read?
> > > > > So in invalidate_end() callback we have:
> > > > > 
> > > > > spin_lock(&vq->mmu_lock);
> > > > > --vq->invalidate_count;
> > > > >           spin_unlock(&vq->mmu_lock);
> > > > > 
> > > > > 
> > > > > So even PTE is read speculatively before reading invalidate_count (only in
> > > > > the case of invalidate_count is zero). The spinlock has guaranteed that we
> > > > > won't read any stale PTEs.
> > > > > 
> > > > > Thanks
> > > > I'm sorry I just do not get the argument.
> > > > If you want to order two reads you need an smp_rmb
> > > > or stronger between them executed on the same CPU.
> > > > 
> > > > Executing any kind of barrier on another CPU
> > > > will have no ordering effect on the 1st one.
> > > > 
> > > > 
> > > > So if CPU1 runs the prefetch, and CPU2 runs invalidate
> > > > callback, read of invalidate counter on CPU1 can bypass
> > > > read of PTE on CPU1 unless there's a barrier
> > > > in between, and nothing CPU2 does can affect that outcome.
> > > > 
> > > > 
> > > > What did I miss?
> > > 
> > > It doesn't harm if PTE is read before invalidate_count, this is because:
> > > 
> > > 1) This speculation is serialized with invalidate_range_end() because of the
> > > spinlock
> > > 
> > > 2) This speculation can only make effect when we read invalidate_count as
> > > zero.
> > > 
> > > 3) This means the speculation is done after the last invalidate_range_end()
> > > and because of the spinlock, when we enter the critical section of spinlock
> > > in prefetch, we can not see any stale PTE that was unmapped before.
> > > 
> > > Am I wrong?
> > > 
> > > Thanks
> > OK I think you are right. Sorry it took me a while to figure out.
> 
> 
> No problem. So do you want me to send a V2 of the fixes (e.g with the
> conversion from synchronize_rcu() to kfree_rcu()). Or you want something
> else. (e.g revert or a config option)?
> 
> Thanks

Pls post V2 and I'll do my best to do a thorough review.  We can then
decide, if we find more issues then patch revert makes more sense IMHO.
If we don't let's keep it in and if issues surface close to release
we can flip the config option.



-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-23 13:31                           ` Jason Wang
@ 2019-07-25  5:52                             ` Michael S. Tsirkin
  2019-07-25  7:43                               ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25  5:52 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 23, 2019 at 09:31:35PM +0800, Jason Wang wrote:
> 
> On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
> > > > > On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
> > > > > > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > > > > > Looks not, you need rate limit the fire as you've figured out?
> > > > > > See the discussion that followed. Basically no, it's good enough
> > > > > > already and is only going to be better.
> > > > > > 
> > > > > > > And in fact,
> > > > > > > the synchronization is not even needed, does it help if I leave a comment to
> > > > > > > explain?
> > > > > > Let's try to figure it out in the mail first. I'm pretty sure the
> > > > > > current logic is wrong.
> > > > > Here is what the code what to achieve:
> > > > > 
> > > > > - The map was protected by RCU
> > > > > 
> > > > > - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> > > > > etc), meta_prefetch (datapath)
> > > > > 
> > > > > - Readers are: memory accessor
> > > > > 
> > > > > Writer are synchronized through mmu_lock. RCU is used to synchronized
> > > > > between writers and readers.
> > > > > 
> > > > > The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
> > > > > with readers (memory accessors) in the path of file operations. But in this
> > > > > case, vq->mutex was already held, this means it has been serialized with
> > > > > memory accessor. That's why I think it could be removed safely.
> > > > > 
> > > > > Anything I miss here?
> > > > > 
> > > > So invalidate callbacks need to reset the map, and they do
> > > > not have vq mutex. How can they do this and free
> > > > the map safely? They need synchronize_rcu or kfree_rcu right?
> > > Invalidation callbacks need but file operations (e.g ioctl) not.
> > > 
> > > 
> > > > And I worry somewhat that synchronize_rcu in an MMU notifier
> > > > is a problem, MMU notifiers are supposed to be quick:
> > > Looks not, since it can allow to be blocked and lots of driver depends on
> > > this. (E.g mmu_notifier_range_blockable()).
> > Right, they can block. So why don't we take a VQ mutex and be
> > done with it then? No RCU tricks.
> 
> 
> This is how I want to go with RFC and V1. But I end up with deadlock between
> vq locks and some MM internal locks. So I decide to use RCU which is 100%
> under the control of vhost.
> 
> Thanks

And I guess the deadlock is because GUP is taking mmu locks which are
taken on mmu notifier path, right?  How about we add a seqlock and take
that in invalidate callbacks?  We can then drop the VQ lock before GUP,
and take it again immediately after.

something like
	if (!vq_meta_mapped(vq)) {
		vq_meta_setup(&uaddrs);
		mutex_unlock(vq->mutex)
		vq_meta_map(&uaddrs);
		mutex_lock(vq->mutex)

		/* recheck both sock->private_data and seqlock count. */
		if changed - bail out
	}

And also requires that VQ uaddrs is defined like this:
- writers must have both vq mutex and dev mutex
- readers must have either vq mutex or dev mutex


That's a big change though. For now, how about switching to a per-vq SRCU?
That is only a little bit more expensive than RCU, and we
can use synchronize_srcu_expedited.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-22 14:11     ` Jason Gunthorpe
@ 2019-07-25  6:02       ` Michael S. Tsirkin
  2019-07-25  7:44         ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25  6:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jasowang, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Mon, Jul 22, 2019 at 11:11:52AM -0300, Jason Gunthorpe wrote:
> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > syzbot has bisected this bug to:
> > > 
> > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Fri May 24 08:12:18 2019 +0000
> > > 
> > >     vhost: access vq metadata through kernel virtual address
> > > 
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > start commit:   6d21a41b Add linux-next specific files for 20190718
> > > git tree:       linux-next
> > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > > 
> > > Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
> > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address")
> > > 
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > 
> > 
> > OK I poked at this for a bit, I see several things that
> > we need to fix, though I'm not yet sure it's the reason for
> > the failures:
> 
> This stuff looks quite similar to the hmm_mirror use model and other
> places in the kernel. I'm still hoping we can share this code a bit more.

Right. I think hmm is something we should look at.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25  5:52                             ` Michael S. Tsirkin
@ 2019-07-25  7:43                               ` Jason Wang
  2019-07-25  8:28                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-25  7:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/25 下午1:52, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 09:31:35PM +0800, Jason Wang wrote:
>> On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
>>>> On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
>>>>> On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
>>>>>>>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>>>>>>>> Looks not, you need rate limit the fire as you've figured out?
>>>>>>> See the discussion that followed. Basically no, it's good enough
>>>>>>> already and is only going to be better.
>>>>>>>
>>>>>>>> And in fact,
>>>>>>>> the synchronization is not even needed, does it help if I leave a comment to
>>>>>>>> explain?
>>>>>>> Let's try to figure it out in the mail first. I'm pretty sure the
>>>>>>> current logic is wrong.
>>>>>> Here is what the code what to achieve:
>>>>>>
>>>>>> - The map was protected by RCU
>>>>>>
>>>>>> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
>>>>>> etc), meta_prefetch (datapath)
>>>>>>
>>>>>> - Readers are: memory accessor
>>>>>>
>>>>>> Writer are synchronized through mmu_lock. RCU is used to synchronized
>>>>>> between writers and readers.
>>>>>>
>>>>>> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
>>>>>> with readers (memory accessors) in the path of file operations. But in this
>>>>>> case, vq->mutex was already held, this means it has been serialized with
>>>>>> memory accessor. That's why I think it could be removed safely.
>>>>>>
>>>>>> Anything I miss here?
>>>>>>
>>>>> So invalidate callbacks need to reset the map, and they do
>>>>> not have vq mutex. How can they do this and free
>>>>> the map safely? They need synchronize_rcu or kfree_rcu right?
>>>> Invalidation callbacks need but file operations (e.g ioctl) not.
>>>>
>>>>
>>>>> And I worry somewhat that synchronize_rcu in an MMU notifier
>>>>> is a problem, MMU notifiers are supposed to be quick:
>>>> Looks not, since it can allow to be blocked and lots of driver depends on
>>>> this. (E.g mmu_notifier_range_blockable()).
>>> Right, they can block. So why don't we take a VQ mutex and be
>>> done with it then? No RCU tricks.
>>
>> This is how I want to go with RFC and V1. But I end up with deadlock between
>> vq locks and some MM internal locks. So I decide to use RCU which is 100%
>> under the control of vhost.
>>
>> Thanks
> And I guess the deadlock is because GUP is taking mmu locks which are
> taken on mmu notifier path, right?


Yes, but it's not the only lock. I don't remember the details, but I can 
confirm I meet issues with one or two other locks.


>    How about we add a seqlock and take
> that in invalidate callbacks?  We can then drop the VQ lock before GUP,
> and take it again immediately after.
>
> something like
> 	if (!vq_meta_mapped(vq)) {
> 		vq_meta_setup(&uaddrs);
> 		mutex_unlock(vq->mutex)
> 		vq_meta_map(&uaddrs);


The problem is the vq address could be changed at this time.


> 		mutex_lock(vq->mutex)
>
> 		/* recheck both sock->private_data and seqlock count. */
> 		if changed - bail out
> 	}
>
> And also requires that VQ uaddrs is defined like this:
> - writers must have both vq mutex and dev mutex
> - readers must have either vq mutex or dev mutex
>
>
> That's a big change though. For now, how about switching to a per-vq SRCU?
> That is only a little bit more expensive than RCU, and we
> can use synchronize_srcu_expedited.
>

Consider we switch to use kfree_rcu(), what's the advantage of per-vq SRCU?

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25  6:02       ` Michael S. Tsirkin
@ 2019-07-25  7:44         ` Jason Wang
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-25  7:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/25 下午2:02, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 11:11:52AM -0300, Jason Gunthorpe wrote:
>> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
>>> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>>>> syzbot has bisected this bug to:
>>>>
>>>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>>>> Author: Jason Wang <jasowang@redhat.com>
>>>> Date:   Fri May 24 08:12:18 2019 +0000
>>>>
>>>>      vhost: access vq metadata through kernel virtual address
>>>>
>>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>>>> start commit:   6d21a41b Add linux-next specific files for 20190718
>>>> git tree:       linux-next
>>>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>>>
>>>> Reported-by: syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>>>> address")
>>>>
>>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>>
>>> OK I poked at this for a bit, I see several things that
>>> we need to fix, though I'm not yet sure it's the reason for
>>> the failures:
>> This stuff looks quite similar to the hmm_mirror use model and other
>> places in the kernel. I'm still hoping we can share this code a bit more.
> Right. I think hmm is something we should look at.


Exactly. I plan to do that.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25  7:43                               ` Jason Wang
@ 2019-07-25  8:28                                 ` Michael S. Tsirkin
  2019-07-25 13:21                                   ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25  8:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Thu, Jul 25, 2019 at 03:43:41PM +0800, Jason Wang wrote:
> 
> On 2019/7/25 下午1:52, Michael S. Tsirkin wrote:
> > On Tue, Jul 23, 2019 at 09:31:35PM +0800, Jason Wang wrote:
> > > On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
> > > > > On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
> > > > > > > > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > > > > > > > Looks not, you need rate limit the fire as you've figured out?
> > > > > > > > See the discussion that followed. Basically no, it's good enough
> > > > > > > > already and is only going to be better.
> > > > > > > > 
> > > > > > > > > And in fact,
> > > > > > > > > the synchronization is not even needed, does it help if I leave a comment to
> > > > > > > > > explain?
> > > > > > > > Let's try to figure it out in the mail first. I'm pretty sure the
> > > > > > > > current logic is wrong.
> > > > > > > Here is what the code what to achieve:
> > > > > > > 
> > > > > > > - The map was protected by RCU
> > > > > > > 
> > > > > > > - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> > > > > > > etc), meta_prefetch (datapath)
> > > > > > > 
> > > > > > > - Readers are: memory accessor
> > > > > > > 
> > > > > > > Writer are synchronized through mmu_lock. RCU is used to synchronized
> > > > > > > between writers and readers.
> > > > > > > 
> > > > > > > The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
> > > > > > > with readers (memory accessors) in the path of file operations. But in this
> > > > > > > case, vq->mutex was already held, this means it has been serialized with
> > > > > > > memory accessor. That's why I think it could be removed safely.
> > > > > > > 
> > > > > > > Anything I miss here?
> > > > > > > 
> > > > > > So invalidate callbacks need to reset the map, and they do
> > > > > > not have vq mutex. How can they do this and free
> > > > > > the map safely? They need synchronize_rcu or kfree_rcu right?
> > > > > Invalidation callbacks need but file operations (e.g ioctl) not.
> > > > > 
> > > > > 
> > > > > > And I worry somewhat that synchronize_rcu in an MMU notifier
> > > > > > is a problem, MMU notifiers are supposed to be quick:
> > > > > Looks not, since it can allow to be blocked and lots of driver depends on
> > > > > this. (E.g mmu_notifier_range_blockable()).
> > > > Right, they can block. So why don't we take a VQ mutex and be
> > > > done with it then? No RCU tricks.
> > > 
> > > This is how I want to go with RFC and V1. But I end up with deadlock between
> > > vq locks and some MM internal locks. So I decide to use RCU which is 100%
> > > under the control of vhost.
> > > 
> > > Thanks
> > And I guess the deadlock is because GUP is taking mmu locks which are
> > taken on mmu notifier path, right?
> 
> 
> Yes, but it's not the only lock. I don't remember the details, but I can
> confirm I meet issues with one or two other locks.
> 
> 
> >    How about we add a seqlock and take
> > that in invalidate callbacks?  We can then drop the VQ lock before GUP,
> > and take it again immediately after.
> > 
> > something like
> > 	if (!vq_meta_mapped(vq)) {
> > 		vq_meta_setup(&uaddrs);
> > 		mutex_unlock(vq->mutex)
> > 		vq_meta_map(&uaddrs);
> 
> 
> The problem is the vq address could be changed at this time.
> 
> 
> > 		mutex_lock(vq->mutex)
> > 
> > 		/* recheck both sock->private_data and seqlock count. */
> > 		if changed - bail out
> > 	}
> > 
> > And also requires that VQ uaddrs is defined like this:
> > - writers must have both vq mutex and dev mutex
> > - readers must have either vq mutex or dev mutex
> > 
> > 
> > That's a big change though. For now, how about switching to a per-vq SRCU?
> > That is only a little bit more expensive than RCU, and we
> > can use synchronize_srcu_expedited.
> > 
> 
> Consider we switch to use kfree_rcu(), what's the advantage of per-vq SRCU?
> 
> Thanks


I thought we established that notifiers must wait for
all readers to finish before they mark page dirty, to
prevent page from becoming dirty after address
has been invalidated.
Right?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25  8:28                                 ` Michael S. Tsirkin
@ 2019-07-25 13:21                                   ` Jason Wang
  2019-07-25 13:26                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-25 13:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/25 下午4:28, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2019 at 03:43:41PM +0800, Jason Wang wrote:
>> On 2019/7/25 下午1:52, Michael S. Tsirkin wrote:
>>> On Tue, Jul 23, 2019 at 09:31:35PM +0800, Jason Wang wrote:
>>>> On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
>>>>> On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
>>>>>>>> On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
>>>>>>>>>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>>>>>>>>>> Looks not, you need rate limit the fire as you've figured out?
>>>>>>>>> See the discussion that followed. Basically no, it's good enough
>>>>>>>>> already and is only going to be better.
>>>>>>>>>
>>>>>>>>>> And in fact,
>>>>>>>>>> the synchronization is not even needed, does it help if I leave a comment to
>>>>>>>>>> explain?
>>>>>>>>> Let's try to figure it out in the mail first. I'm pretty sure the
>>>>>>>>> current logic is wrong.
>>>>>>>> Here is what the code what to achieve:
>>>>>>>>
>>>>>>>> - The map was protected by RCU
>>>>>>>>
>>>>>>>> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
>>>>>>>> etc), meta_prefetch (datapath)
>>>>>>>>
>>>>>>>> - Readers are: memory accessor
>>>>>>>>
>>>>>>>> Writer are synchronized through mmu_lock. RCU is used to synchronized
>>>>>>>> between writers and readers.
>>>>>>>>
>>>>>>>> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
>>>>>>>> with readers (memory accessors) in the path of file operations. But in this
>>>>>>>> case, vq->mutex was already held, this means it has been serialized with
>>>>>>>> memory accessor. That's why I think it could be removed safely.
>>>>>>>>
>>>>>>>> Anything I miss here?
>>>>>>>>
>>>>>>> So invalidate callbacks need to reset the map, and they do
>>>>>>> not have vq mutex. How can they do this and free
>>>>>>> the map safely? They need synchronize_rcu or kfree_rcu right?
>>>>>> Invalidation callbacks need but file operations (e.g ioctl) not.
>>>>>>
>>>>>>
>>>>>>> And I worry somewhat that synchronize_rcu in an MMU notifier
>>>>>>> is a problem, MMU notifiers are supposed to be quick:
>>>>>> Looks not, since it can allow to be blocked and lots of driver depends on
>>>>>> this. (E.g mmu_notifier_range_blockable()).
>>>>> Right, they can block. So why don't we take a VQ mutex and be
>>>>> done with it then? No RCU tricks.
>>>> This is how I want to go with RFC and V1. But I end up with deadlock between
>>>> vq locks and some MM internal locks. So I decide to use RCU which is 100%
>>>> under the control of vhost.
>>>>
>>>> Thanks
>>> And I guess the deadlock is because GUP is taking mmu locks which are
>>> taken on mmu notifier path, right?
>>
>> Yes, but it's not the only lock. I don't remember the details, but I can
>> confirm I meet issues with one or two other locks.
>>
>>
>>>     How about we add a seqlock and take
>>> that in invalidate callbacks?  We can then drop the VQ lock before GUP,
>>> and take it again immediately after.
>>>
>>> something like
>>> 	if (!vq_meta_mapped(vq)) {
>>> 		vq_meta_setup(&uaddrs);
>>> 		mutex_unlock(vq->mutex)
>>> 		vq_meta_map(&uaddrs);
>>
>> The problem is the vq address could be changed at this time.
>>
>>
>>> 		mutex_lock(vq->mutex)
>>>
>>> 		/* recheck both sock->private_data and seqlock count. */
>>> 		if changed - bail out
>>> 	}
>>>
>>> And also requires that VQ uaddrs is defined like this:
>>> - writers must have both vq mutex and dev mutex
>>> - readers must have either vq mutex or dev mutex
>>>
>>>
>>> That's a big change though. For now, how about switching to a per-vq SRCU?
>>> That is only a little bit more expensive than RCU, and we
>>> can use synchronize_srcu_expedited.
>>>
>> Consider we switch to use kfree_rcu(), what's the advantage of per-vq SRCU?
>>
>> Thanks
>
> I thought we established that notifiers must wait for
> all readers to finish before they mark page dirty, to
> prevent page from becoming dirty after address
> has been invalidated.
> Right?


Exactly, and that's the reason actually I use synchronize_rcu() there.

So the concern is still the possible synchronize_expedited()? Can I do 
this on through another series on top of the incoming V2?

Thanks



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25 13:21                                   ` Jason Wang
@ 2019-07-25 13:26                                     ` Michael S. Tsirkin
  2019-07-25 14:25                                       ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25 13:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Thu, Jul 25, 2019 at 09:21:22PM +0800, Jason Wang wrote:
> 
> On 2019/7/25 下午4:28, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2019 at 03:43:41PM +0800, Jason Wang wrote:
> > > On 2019/7/25 下午1:52, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 23, 2019 at 09:31:35PM +0800, Jason Wang wrote:
> > > > > On 2019/7/23 下午5:26, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 23, 2019 at 04:49:01PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/23 下午4:10, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/7/23 下午3:23, Michael S. Tsirkin wrote:
> > > > > > > > > > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > > > > > > > > > Looks not, you need rate limit the fire as you've figured out?
> > > > > > > > > > See the discussion that followed. Basically no, it's good enough
> > > > > > > > > > already and is only going to be better.
> > > > > > > > > > 
> > > > > > > > > > > And in fact,
> > > > > > > > > > > the synchronization is not even needed, does it help if I leave a comment to
> > > > > > > > > > > explain?
> > > > > > > > > > Let's try to figure it out in the mail first. I'm pretty sure the
> > > > > > > > > > current logic is wrong.
> > > > > > > > > Here is what the code what to achieve:
> > > > > > > > > 
> > > > > > > > > - The map was protected by RCU
> > > > > > > > > 
> > > > > > > > > - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> > > > > > > > > etc), meta_prefetch (datapath)
> > > > > > > > > 
> > > > > > > > > - Readers are: memory accessor
> > > > > > > > > 
> > > > > > > > > Writer are synchronized through mmu_lock. RCU is used to synchronized
> > > > > > > > > between writers and readers.
> > > > > > > > > 
> > > > > > > > > The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronized it
> > > > > > > > > with readers (memory accessors) in the path of file operations. But in this
> > > > > > > > > case, vq->mutex was already held, this means it has been serialized with
> > > > > > > > > memory accessor. That's why I think it could be removed safely.
> > > > > > > > > 
> > > > > > > > > Anything I miss here?
> > > > > > > > > 
> > > > > > > > So invalidate callbacks need to reset the map, and they do
> > > > > > > > not have vq mutex. How can they do this and free
> > > > > > > > the map safely? They need synchronize_rcu or kfree_rcu right?
> > > > > > > Invalidation callbacks need but file operations (e.g ioctl) not.
> > > > > > > 
> > > > > > > 
> > > > > > > > And I worry somewhat that synchronize_rcu in an MMU notifier
> > > > > > > > is a problem, MMU notifiers are supposed to be quick:
> > > > > > > Looks not, since it can allow to be blocked and lots of driver depends on
> > > > > > > this. (E.g mmu_notifier_range_blockable()).
> > > > > > Right, they can block. So why don't we take a VQ mutex and be
> > > > > > done with it then? No RCU tricks.
> > > > > This is how I want to go with RFC and V1. But I end up with deadlock between
> > > > > vq locks and some MM internal locks. So I decide to use RCU which is 100%
> > > > > under the control of vhost.
> > > > > 
> > > > > Thanks
> > > > And I guess the deadlock is because GUP is taking mmu locks which are
> > > > taken on mmu notifier path, right?
> > > 
> > > Yes, but it's not the only lock. I don't remember the details, but I can
> > > confirm I meet issues with one or two other locks.
> > > 
> > > 
> > > >     How about we add a seqlock and take
> > > > that in invalidate callbacks?  We can then drop the VQ lock before GUP,
> > > > and take it again immediately after.
> > > > 
> > > > something like
> > > > 	if (!vq_meta_mapped(vq)) {
> > > > 		vq_meta_setup(&uaddrs);
> > > > 		mutex_unlock(vq->mutex)
> > > > 		vq_meta_map(&uaddrs);
> > > 
> > > The problem is the vq address could be changed at this time.
> > > 
> > > 
> > > > 		mutex_lock(vq->mutex)
> > > > 
> > > > 		/* recheck both sock->private_data and seqlock count. */
> > > > 		if changed - bail out
> > > > 	}
> > > > 
> > > > And also requires that VQ uaddrs is defined like this:
> > > > - writers must have both vq mutex and dev mutex
> > > > - readers must have either vq mutex or dev mutex
> > > > 
> > > > 
> > > > That's a big change though. For now, how about switching to a per-vq SRCU?
> > > > That is only a little bit more expensive than RCU, and we
> > > > can use synchronize_srcu_expedited.
> > > > 
> > > Consider we switch to use kfree_rcu(), what's the advantage of per-vq SRCU?
> > > 
> > > Thanks
> > 
> > I thought we established that notifiers must wait for
> > all readers to finish before they mark page dirty, to
> > prevent page from becoming dirty after address
> > has been invalidated.
> > Right?
> 
> 
> Exactly, and that's the reason actually I use synchronize_rcu() there.
> 
> So the concern is still the possible synchronize_expedited()?


I think synchronize_srcu_expedited.

synchronize_expedited sends lots of IPI and is bad for realtime VMs.

> Can I do this
> on through another series on top of the incoming V2?
> 
> Thanks
> 

The question is this: is this still a gain if we switch to the
more expensive srcu? If yes then we can keep the feature on,
if not we'll put it off until next release and think
of better solutions. rcu->srcu is just a find and replace,
don't see why we need to defer that. can be a separate patch
for sure, but we need to know how well it works.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25 13:26                                     ` Michael S. Tsirkin
@ 2019-07-25 14:25                                       ` Jason Wang
  2019-07-26 11:49                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-25 14:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
>> Exactly, and that's the reason actually I use synchronize_rcu() there.
>>
>> So the concern is still the possible synchronize_expedited()?
> I think synchronize_srcu_expedited.
>
> synchronize_expedited sends lots of IPI and is bad for realtime VMs.
>
>> Can I do this
>> on through another series on top of the incoming V2?
>>
>> Thanks
>>
> The question is this: is this still a gain if we switch to the
> more expensive srcu? If yes then we can keep the feature on,


I think we only care about the cost on srcu_read_lock() which looks 
pretty tiny form my point of view. Which is basically a READ_ONCE() + 
WRITE_ONCE().

Of course I can benchmark to see the difference.


> if not we'll put it off until next release and think
> of better solutions. rcu->srcu is just a find and replace,
> don't see why we need to defer that. can be a separate patch
> for sure, but we need to know how well it works.


I think I get here, let me try to do that in V2 and let's see the numbers.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-25 14:25                                       ` Jason Wang
@ 2019-07-26 11:49                                         ` Michael S. Tsirkin
  2019-07-26 12:00                                           ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 11:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> 
> On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
> > > Exactly, and that's the reason actually I use synchronize_rcu() there.
> > > 
> > > So the concern is still the possible synchronize_expedited()?
> > I think synchronize_srcu_expedited.
> > 
> > synchronize_expedited sends lots of IPI and is bad for realtime VMs.
> > 
> > > Can I do this
> > > on through another series on top of the incoming V2?
> > > 
> > > Thanks
> > > 
> > The question is this: is this still a gain if we switch to the
> > more expensive srcu? If yes then we can keep the feature on,
> 
> 
> I think we only care about the cost on srcu_read_lock() which looks pretty
> tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
> 
> Of course I can benchmark to see the difference.
> 
> 
> > if not we'll put it off until next release and think
> > of better solutions. rcu->srcu is just a find and replace,
> > don't see why we need to defer that. can be a separate patch
> > for sure, but we need to know how well it works.
> 
> 
> I think I get here, let me try to do that in V2 and let's see the numbers.
> 
> Thanks

There's one other thing that bothers me, and that is that
for large rings which are not physically contiguous
we don't implement the optimization.

For sure, that can wait, but I think eventually we should
vmap large rings.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 11:49                                         ` Michael S. Tsirkin
@ 2019-07-26 12:00                                           ` Jason Wang
  2019-07-26 12:38                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-26 12:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
>> On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
>>>> Exactly, and that's the reason actually I use synchronize_rcu() there.
>>>>
>>>> So the concern is still the possible synchronize_expedited()?
>>> I think synchronize_srcu_expedited.
>>>
>>> synchronize_expedited sends lots of IPI and is bad for realtime VMs.
>>>
>>>> Can I do this
>>>> on through another series on top of the incoming V2?
>>>>
>>>> Thanks
>>>>
>>> The question is this: is this still a gain if we switch to the
>>> more expensive srcu? If yes then we can keep the feature on,
>>
>> I think we only care about the cost on srcu_read_lock() which looks pretty
>> tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
>>
>> Of course I can benchmark to see the difference.
>>
>>
>>> if not we'll put it off until next release and think
>>> of better solutions. rcu->srcu is just a find and replace,
>>> don't see why we need to defer that. can be a separate patch
>>> for sure, but we need to know how well it works.
>>
>> I think I get here, let me try to do that in V2 and let's see the numbers.
>>
>> Thanks


It looks to me for tree rcu, its srcu_read_lock() have a mb() which is 
too expensive for us.

If we just worry about the IPI, can we do something like in 
vhost_invalidate_vq_start()?

         if (map) {
                 /* In order to avoid possible IPIs with
                  * synchronize_rcu_expedited() we use call_rcu() +
                  * completion.
*/
init_completion(&c.completion);
                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
wait_for_completion(&c.completion);
                 vhost_set_map_dirty(vq, map, index);
vhost_map_unprefetch(map);
         }

?


> There's one other thing that bothers me, and that is that
> for large rings which are not physically contiguous
> we don't implement the optimization.
>
> For sure, that can wait, but I think eventually we should
> vmap large rings.


Yes, worth to try. But using direct map has its own advantage: it can 
use hugepage that vmap can't

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 12:00                                           ` Jason Wang
@ 2019-07-26 12:38                                             ` Michael S. Tsirkin
  2019-07-26 12:53                                               ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 12:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
> 
> On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> > > On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
> > > > > Exactly, and that's the reason actually I use synchronize_rcu() there.
> > > > > 
> > > > > So the concern is still the possible synchronize_expedited()?
> > > > I think synchronize_srcu_expedited.
> > > > 
> > > > synchronize_expedited sends lots of IPI and is bad for realtime VMs.
> > > > 
> > > > > Can I do this
> > > > > on through another series on top of the incoming V2?
> > > > > 
> > > > > Thanks
> > > > > 
> > > > The question is this: is this still a gain if we switch to the
> > > > more expensive srcu? If yes then we can keep the feature on,
> > > 
> > > I think we only care about the cost on srcu_read_lock() which looks pretty
> > > tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
> > > 
> > > Of course I can benchmark to see the difference.
> > > 
> > > 
> > > > if not we'll put it off until next release and think
> > > > of better solutions. rcu->srcu is just a find and replace,
> > > > don't see why we need to defer that. can be a separate patch
> > > > for sure, but we need to know how well it works.
> > > 
> > > I think I get here, let me try to do that in V2 and let's see the numbers.
> > > 
> > > Thanks
> 
> 
> It looks to me for tree rcu, its srcu_read_lock() have a mb() which is too
> expensive for us.

I will try to ponder using vq lock in some way.
Maybe with trylock somehow ...


> If we just worry about the IPI,

With synchronize_rcu what I would worry about is that guest is stalled
because system is busy because of other guests.
With expedited it's the IPIs...


> can we do something like in
> vhost_invalidate_vq_start()?
> 
>         if (map) {
>                 /* In order to avoid possible IPIs with
>                  * synchronize_rcu_expedited() we use call_rcu() +
>                  * completion.
> */
> init_completion(&c.completion);
>                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
> wait_for_completion(&c.completion);
>                 vhost_set_map_dirty(vq, map, index);
> vhost_map_unprefetch(map);
>         }
> 
> ?

Why would that be faster than synchronize_rcu?



> 
> > There's one other thing that bothers me, and that is that
> > for large rings which are not physically contiguous
> > we don't implement the optimization.
> > 
> > For sure, that can wait, but I think eventually we should
> > vmap large rings.
> 
> 
> Yes, worth to try. But using direct map has its own advantage: it can use
> hugepage that vmap can't
> 
> Thanks

Sure, so we can do that for small rings.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 12:38                                             ` Michael S. Tsirkin
@ 2019-07-26 12:53                                               ` Jason Wang
  2019-07-26 13:36                                                 ` Jason Wang
  2019-07-26 13:47                                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-26 12:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
>> On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
>>> On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
>>>> On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
>>>>>> Exactly, and that's the reason actually I use synchronize_rcu() there.
>>>>>>
>>>>>> So the concern is still the possible synchronize_expedited()?
>>>>> I think synchronize_srcu_expedited.
>>>>>
>>>>> synchronize_expedited sends lots of IPI and is bad for realtime VMs.
>>>>>
>>>>>> Can I do this
>>>>>> on through another series on top of the incoming V2?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>> The question is this: is this still a gain if we switch to the
>>>>> more expensive srcu? If yes then we can keep the feature on,
>>>> I think we only care about the cost on srcu_read_lock() which looks pretty
>>>> tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
>>>>
>>>> Of course I can benchmark to see the difference.
>>>>
>>>>
>>>>> if not we'll put it off until next release and think
>>>>> of better solutions. rcu->srcu is just a find and replace,
>>>>> don't see why we need to defer that. can be a separate patch
>>>>> for sure, but we need to know how well it works.
>>>> I think I get here, let me try to do that in V2 and let's see the numbers.
>>>>
>>>> Thanks
>>
>> It looks to me for tree rcu, its srcu_read_lock() have a mb() which is too
>> expensive for us.
> I will try to ponder using vq lock in some way.
> Maybe with trylock somehow ...


Ok, let me retry if necessary (but I do remember I end up with deadlocks 
last try).


>
>
>> If we just worry about the IPI,
> With synchronize_rcu what I would worry about is that guest is stalled


Can this synchronize_rcu() be triggered by guest? If yes, there are 
several other MMU notifiers that can block. Is vhost something special here?


> because system is busy because of other guests.
> With expedited it's the IPIs...
>

The current synchronize_rcu()  can force a expedited grace period:

void synchronize_rcu(void)
{
         ...
         if (rcu_blocking_is_gp())
return;
         if (rcu_gp_is_expedited())
synchronize_rcu_expedited();
else
wait_rcu_gp(call_rcu);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);


>> can we do something like in
>> vhost_invalidate_vq_start()?
>>
>>          if (map) {
>>                  /* In order to avoid possible IPIs with
>>                   * synchronize_rcu_expedited() we use call_rcu() +
>>                   * completion.
>> */
>> init_completion(&c.completion);
>>                  call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
>> wait_for_completion(&c.completion);
>>                  vhost_set_map_dirty(vq, map, index);
>> vhost_map_unprefetch(map);
>>          }
>>
>> ?
> Why would that be faster than synchronize_rcu?


No faster but no IPI.


>
>
>>> There's one other thing that bothers me, and that is that
>>> for large rings which are not physically contiguous
>>> we don't implement the optimization.
>>>
>>> For sure, that can wait, but I think eventually we should
>>> vmap large rings.
>>
>> Yes, worth to try. But using direct map has its own advantage: it can use
>> hugepage that vmap can't
>>
>> Thanks
> Sure, so we can do that for small rings.


Yes, that's possible but should be done on top.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 12:53                                               ` Jason Wang
@ 2019-07-26 13:36                                                 ` Jason Wang
  2019-07-26 13:49                                                   ` Michael S. Tsirkin
  2019-07-26 13:47                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-26 13:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/26 下午8:53, Jason Wang wrote:
>
> On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
>> On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
>>> On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
>>>> On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
>>>>> On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
>>>>>>> Exactly, and that's the reason actually I use synchronize_rcu() 
>>>>>>> there.
>>>>>>>
>>>>>>> So the concern is still the possible synchronize_expedited()?
>>>>>> I think synchronize_srcu_expedited.
>>>>>>
>>>>>> synchronize_expedited sends lots of IPI and is bad for realtime VMs.
>>>>>>
>>>>>>> Can I do this
>>>>>>> on through another series on top of the incoming V2?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>> The question is this: is this still a gain if we switch to the
>>>>>> more expensive srcu? If yes then we can keep the feature on,
>>>>> I think we only care about the cost on srcu_read_lock() which 
>>>>> looks pretty
>>>>> tiny form my point of view. Which is basically a READ_ONCE() + 
>>>>> WRITE_ONCE().
>>>>>
>>>>> Of course I can benchmark to see the difference.
>>>>>
>>>>>
>>>>>> if not we'll put it off until next release and think
>>>>>> of better solutions. rcu->srcu is just a find and replace,
>>>>>> don't see why we need to defer that. can be a separate patch
>>>>>> for sure, but we need to know how well it works.
>>>>> I think I get here, let me try to do that in V2 and let's see the 
>>>>> numbers.
>>>>>
>>>>> Thanks
>>>
>>> It looks to me for tree rcu, its srcu_read_lock() have a mb() which 
>>> is too
>>> expensive for us.
>> I will try to ponder using vq lock in some way.
>> Maybe with trylock somehow ...
>
>
> Ok, let me retry if necessary (but I do remember I end up with 
> deadlocks last try). 


Ok, I play a little with this. And it works so far. Will do more testing 
tomorrow.

One reason could be I switch to use get_user_pages_fast() to 
__get_user_pages_fast() which doesn't need mmap_sem.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 12:53                                               ` Jason Wang
  2019-07-26 13:36                                                 ` Jason Wang
@ 2019-07-26 13:47                                                 ` Michael S. Tsirkin
  2019-07-26 14:00                                                   ` Jason Wang
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 13:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Fri, Jul 26, 2019 at 08:53:18PM +0800, Jason Wang wrote:
> 
> On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
> > On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
> > > On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> > > > > On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
> > > > > > > Exactly, and that's the reason actually I use synchronize_rcu() there.
> > > > > > > 
> > > > > > > So the concern is still the possible synchronize_expedited()?
> > > > > > I think synchronize_srcu_expedited.
> > > > > > 
> > > > > > synchronize_expedited sends lots of IPI and is bad for realtime VMs.
> > > > > > 
> > > > > > > Can I do this
> > > > > > > on through another series on top of the incoming V2?
> > > > > > > 
> > > > > > > Thanks
> > > > > > > 
> > > > > > The question is this: is this still a gain if we switch to the
> > > > > > more expensive srcu? If yes then we can keep the feature on,
> > > > > I think we only care about the cost on srcu_read_lock() which looks pretty
> > > > > tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
> > > > > 
> > > > > Of course I can benchmark to see the difference.
> > > > > 
> > > > > 
> > > > > > if not we'll put it off until next release and think
> > > > > > of better solutions. rcu->srcu is just a find and replace,
> > > > > > don't see why we need to defer that. can be a separate patch
> > > > > > for sure, but we need to know how well it works.
> > > > > I think I get here, let me try to do that in V2 and let's see the numbers.
> > > > > 
> > > > > Thanks
> > > 
> > > It looks to me for tree rcu, its srcu_read_lock() have a mb() which is too
> > > expensive for us.
> > I will try to ponder using vq lock in some way.
> > Maybe with trylock somehow ...
> 
> 
> Ok, let me retry if necessary (but I do remember I end up with deadlocks
> last try).
> 
> 
> > 
> > 
> > > If we just worry about the IPI,
> > With synchronize_rcu what I would worry about is that guest is stalled
> 
> 
> Can this synchronize_rcu() be triggered by guest? If yes, there are several
> other MMU notifiers that can block. Is vhost something special here?

Sorry, let me explain: guests (and tasks in general)
can trigger activity that will
make synchronize_rcu take a long time. Thus blocking
an mmu notifier until synchronize_rcu finishes
is a bad idea.

> 
> > because system is busy because of other guests.
> > With expedited it's the IPIs...
> > 
> 
> The current synchronize_rcu()  can force a expedited grace period:
> 
> void synchronize_rcu(void)
> {
>         ...
>         if (rcu_blocking_is_gp())
> return;
>         if (rcu_gp_is_expedited())
> synchronize_rcu_expedited();
> else
> wait_rcu_gp(call_rcu);
> }
> EXPORT_SYMBOL_GPL(synchronize_rcu);


An admin can force rcu to finish faster, trading
interrupts for responsiveness.

> 
> > > can we do something like in
> > > vhost_invalidate_vq_start()?
> > > 
> > >          if (map) {
> > >                  /* In order to avoid possible IPIs with
> > >                   * synchronize_rcu_expedited() we use call_rcu() +
> > >                   * completion.
> > > */
> > > init_completion(&c.completion);
> > >                  call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
> > > wait_for_completion(&c.completion);
> > >                  vhost_set_map_dirty(vq, map, index);
> > > vhost_map_unprefetch(map);
> > >          }
> > > 
> > > ?
> > Why would that be faster than synchronize_rcu?
> 
> 
> No faster but no IPI.
> 

Sorry I still don't see the point.
synchronize_rcu doesn't normally do an IPI either.


> > 
> > 
> > > > There's one other thing that bothers me, and that is that
> > > > for large rings which are not physically contiguous
> > > > we don't implement the optimization.
> > > > 
> > > > For sure, that can wait, but I think eventually we should
> > > > vmap large rings.
> > > 
> > > Yes, worth to try. But using direct map has its own advantage: it can use
> > > hugepage that vmap can't
> > > 
> > > Thanks
> > Sure, so we can do that for small rings.
> 
> 
> Yes, that's possible but should be done on top.
> 
> Thanks

Absolutely. Need to fix up the bugs first.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 13:36                                                 ` Jason Wang
@ 2019-07-26 13:49                                                   ` Michael S. Tsirkin
  2019-07-29  5:54                                                     ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 13:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Fri, Jul 26, 2019 at 09:36:18PM +0800, Jason Wang wrote:
> 
> On 2019/7/26 下午8:53, Jason Wang wrote:
> > 
> > On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
> > > On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
> > > > On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> > > > > > On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
> > > > > > > > Exactly, and that's the reason actually I use
> > > > > > > > synchronize_rcu() there.
> > > > > > > > 
> > > > > > > > So the concern is still the possible synchronize_expedited()?
> > > > > > > I think synchronize_srcu_expedited.
> > > > > > > 
> > > > > > > synchronize_expedited sends lots of IPI and is bad for realtime VMs.
> > > > > > > 
> > > > > > > > Can I do this
> > > > > > > > on through another series on top of the incoming V2?
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > 
> > > > > > > The question is this: is this still a gain if we switch to the
> > > > > > > more expensive srcu? If yes then we can keep the feature on,
> > > > > > I think we only care about the cost on srcu_read_lock()
> > > > > > which looks pretty
> > > > > > tiny form my point of view. Which is basically a
> > > > > > READ_ONCE() + WRITE_ONCE().
> > > > > > 
> > > > > > Of course I can benchmark to see the difference.
> > > > > > 
> > > > > > 
> > > > > > > if not we'll put it off until next release and think
> > > > > > > of better solutions. rcu->srcu is just a find and replace,
> > > > > > > don't see why we need to defer that. can be a separate patch
> > > > > > > for sure, but we need to know how well it works.
> > > > > > I think I get here, let me try to do that in V2 and
> > > > > > let's see the numbers.
> > > > > > 
> > > > > > Thanks
> > > > 
> > > > It looks to me for tree rcu, its srcu_read_lock() have a mb()
> > > > which is too
> > > > expensive for us.
> > > I will try to ponder using vq lock in some way.
> > > Maybe with trylock somehow ...
> > 
> > 
> > Ok, let me retry if necessary (but I do remember I end up with deadlocks
> > last try).
> 
> 
> Ok, I play a little with this. And it works so far. Will do more testing
> tomorrow.
> 
> One reason could be I switch to use get_user_pages_fast() to
> __get_user_pages_fast() which doesn't need mmap_sem.
> 
> Thanks

OK that sounds good. If we also set a flag to make
vhost_exceeds_weight exit, then I think it will be all good.

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 13:47                                                 ` Michael S. Tsirkin
@ 2019-07-26 14:00                                                   ` Jason Wang
  2019-07-26 14:10                                                     ` Michael S. Tsirkin
  2019-07-26 15:03                                                     ` Jason Gunthorpe
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-26 14:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/26 下午9:47, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 08:53:18PM +0800, Jason Wang wrote:
>> On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
>>> On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
>>>> On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
>>>>> On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
>>>>>>>> Exactly, and that's the reason actually I use synchronize_rcu() there.
>>>>>>>>
>>>>>>>> So the concern is still the possible synchronize_expedited()?
>>>>>>> I think synchronize_srcu_expedited.
>>>>>>>
>>>>>>> synchronize_expedited sends lots of IPI and is bad for realtime VMs.
>>>>>>>
>>>>>>>> Can I do this
>>>>>>>> on through another series on top of the incoming V2?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>> The question is this: is this still a gain if we switch to the
>>>>>>> more expensive srcu? If yes then we can keep the feature on,
>>>>>> I think we only care about the cost on srcu_read_lock() which looks pretty
>>>>>> tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
>>>>>>
>>>>>> Of course I can benchmark to see the difference.
>>>>>>
>>>>>>
>>>>>>> if not we'll put it off until next release and think
>>>>>>> of better solutions. rcu->srcu is just a find and replace,
>>>>>>> don't see why we need to defer that. can be a separate patch
>>>>>>> for sure, but we need to know how well it works.
>>>>>> I think I get here, let me try to do that in V2 and let's see the numbers.
>>>>>>
>>>>>> Thanks
>>>> It looks to me for tree rcu, its srcu_read_lock() have a mb() which is too
>>>> expensive for us.
>>> I will try to ponder using vq lock in some way.
>>> Maybe with trylock somehow ...
>>
>> Ok, let me retry if necessary (but I do remember I end up with deadlocks
>> last try).
>>
>>
>>>
>>>> If we just worry about the IPI,
>>> With synchronize_rcu what I would worry about is that guest is stalled
>>
>> Can this synchronize_rcu() be triggered by guest? If yes, there are several
>> other MMU notifiers that can block. Is vhost something special here?
> Sorry, let me explain: guests (and tasks in general)
> can trigger activity that will
> make synchronize_rcu take a long time.


Yes, I get this.


>   Thus blocking
> an mmu notifier until synchronize_rcu finishes
> is a bad idea.


The question is, MMU notifier are allowed to be blocked on 
invalidate_range_start() which could be much slower than 
synchronize_rcu() to finish.

Looking at amdgpu_mn_invalidate_range_start_gfx() which calls 
amdgpu_mn_invalidate_node() which did:

                 r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
                         true, false, MAX_SCHEDULE_TIMEOUT);

...


>>> because system is busy because of other guests.
>>> With expedited it's the IPIs...
>>>
>> The current synchronize_rcu()  can force a expedited grace period:
>>
>> void synchronize_rcu(void)
>> {
>>          ...
>>          if (rcu_blocking_is_gp())
>> return;
>>          if (rcu_gp_is_expedited())
>> synchronize_rcu_expedited();
>> else
>> wait_rcu_gp(call_rcu);
>> }
>> EXPORT_SYMBOL_GPL(synchronize_rcu);
>
> An admin can force rcu to finish faster, trading
> interrupts for responsiveness.


Yes, so when set, all each synchronize_rcu() will go for 
synchronize_rcu_expedited().


>
>>>> can we do something like in
>>>> vhost_invalidate_vq_start()?
>>>>
>>>>           if (map) {
>>>>                   /* In order to avoid possible IPIs with
>>>>                    * synchronize_rcu_expedited() we use call_rcu() +
>>>>                    * completion.
>>>> */
>>>> init_completion(&c.completion);
>>>>                   call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
>>>> wait_for_completion(&c.completion);
>>>>                   vhost_set_map_dirty(vq, map, index);
>>>> vhost_map_unprefetch(map);
>>>>           }
>>>>
>>>> ?
>>> Why would that be faster than synchronize_rcu?
>>
>> No faster but no IPI.
>>
> Sorry I still don't see the point.
> synchronize_rcu doesn't normally do an IPI either.
>

Not the case of when rcu_expedited is set. This can just 100% make sure 
there's no IPI.


>>>
>>>>> There's one other thing that bothers me, and that is that
>>>>> for large rings which are not physically contiguous
>>>>> we don't implement the optimization.
>>>>>
>>>>> For sure, that can wait, but I think eventually we should
>>>>> vmap large rings.
>>>> Yes, worth to try. But using direct map has its own advantage: it can use
>>>> hugepage that vmap can't
>>>>
>>>> Thanks
>>> Sure, so we can do that for small rings.
>>
>> Yes, that's possible but should be done on top.
>>
>> Thanks
> Absolutely. Need to fix up the bugs first.
>

Yes.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 14:00                                                   ` Jason Wang
@ 2019-07-26 14:10                                                     ` Michael S. Tsirkin
  2019-07-26 15:03                                                     ` Jason Gunthorpe
  1 sibling, 0 replies; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 14:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Fri, Jul 26, 2019 at 10:00:20PM +0800, Jason Wang wrote:
> 
> On 2019/7/26 下午9:47, Michael S. Tsirkin wrote:
> > On Fri, Jul 26, 2019 at 08:53:18PM +0800, Jason Wang wrote:
> > > On 2019/7/26 下午8:38, Michael S. Tsirkin wrote:
> > > > On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
> > > > > On 2019/7/26 下午7:49, Michael S. Tsirkin wrote:
> > > > > > On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> > > > > > > On 2019/7/25 下午9:26, Michael S. Tsirkin wrote:
> > > > > > > > > Exactly, and that's the reason actually I use synchronize_rcu() there.
> > > > > > > > > 
> > > > > > > > > So the concern is still the possible synchronize_expedited()?
> > > > > > > > I think synchronize_srcu_expedited.
> > > > > > > > 
> > > > > > > > synchronize_expedited sends lots of IPI and is bad for realtime VMs.
> > > > > > > > 
> > > > > > > > > Can I do this
> > > > > > > > > on through another series on top of the incoming V2?
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > > 
> > > > > > > > The question is this: is this still a gain if we switch to the
> > > > > > > > more expensive srcu? If yes then we can keep the feature on,
> > > > > > > I think we only care about the cost on srcu_read_lock() which looks pretty
> > > > > > > tiny form my point of view. Which is basically a READ_ONCE() + WRITE_ONCE().
> > > > > > > 
> > > > > > > Of course I can benchmark to see the difference.
> > > > > > > 
> > > > > > > 
> > > > > > > > if not we'll put it off until next release and think
> > > > > > > > of better solutions. rcu->srcu is just a find and replace,
> > > > > > > > don't see why we need to defer that. can be a separate patch
> > > > > > > > for sure, but we need to know how well it works.
> > > > > > > I think I get here, let me try to do that in V2 and let's see the numbers.
> > > > > > > 
> > > > > > > Thanks
> > > > > It looks to me for tree rcu, its srcu_read_lock() have a mb() which is too
> > > > > expensive for us.
> > > > I will try to ponder using vq lock in some way.
> > > > Maybe with trylock somehow ...
> > > 
> > > Ok, let me retry if necessary (but I do remember I end up with deadlocks
> > > last try).
> > > 
> > > 
> > > > 
> > > > > If we just worry about the IPI,
> > > > With synchronize_rcu what I would worry about is that guest is stalled
> > > 
> > > Can this synchronize_rcu() be triggered by guest? If yes, there are several
> > > other MMU notifiers that can block. Is vhost something special here?
> > Sorry, let me explain: guests (and tasks in general)
> > can trigger activity that will
> > make synchronize_rcu take a long time.
> 
> 
> Yes, I get this.
> 
> 
> >   Thus blocking
> > an mmu notifier until synchronize_rcu finishes
> > is a bad idea.
> 
> 
> The question is, MMU notifier are allowed to be blocked on
> invalidate_range_start() which could be much slower than synchronize_rcu()
> to finish.
> 
> Looking at amdgpu_mn_invalidate_range_start_gfx() which calls
> amdgpu_mn_invalidate_node() which did:
> 
>                 r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
>                         true, false, MAX_SCHEDULE_TIMEOUT);
> 
> ...
> 

Right. And the result will probably be VMs freezing/timing out, too.
It's just that we care about VMs more than the GPU guys :)


> > > > because system is busy because of other guests.
> > > > With expedited it's the IPIs...
> > > > 
> > > The current synchronize_rcu()  can force a expedited grace period:
> > > 
> > > void synchronize_rcu(void)
> > > {
> > >          ...
> > >          if (rcu_blocking_is_gp())
> > > return;
> > >          if (rcu_gp_is_expedited())
> > > synchronize_rcu_expedited();
> > > else
> > > wait_rcu_gp(call_rcu);
> > > }
> > > EXPORT_SYMBOL_GPL(synchronize_rcu);
> > 
> > An admin can force rcu to finish faster, trading
> > interrupts for responsiveness.
> 
> 
> Yes, so when set, all each synchronize_rcu() will go for
> synchronize_rcu_expedited().

And that's bad for realtime things. I understand what you are saying,
host admin can set this and VMs won't time-out.  What I'm saying is we
should not make admins choose between two types of bugs. Tuning for
performance is fine.

> 
> > 
> > > > > can we do something like in
> > > > > vhost_invalidate_vq_start()?
> > > > > 
> > > > >           if (map) {
> > > > >                   /* In order to avoid possible IPIs with
> > > > >                    * synchronize_rcu_expedited() we use call_rcu() +
> > > > >                    * completion.
> > > > > */
> > > > > init_completion(&c.completion);
> > > > >                   call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
> > > > > wait_for_completion(&c.completion);
> > > > >                   vhost_set_map_dirty(vq, map, index);
> > > > > vhost_map_unprefetch(map);
> > > > >           }
> > > > > 
> > > > > ?
> > > > Why would that be faster than synchronize_rcu?
> > > 
> > > No faster but no IPI.
> > > 
> > Sorry I still don't see the point.
> > synchronize_rcu doesn't normally do an IPI either.
> > 
> 
> Not the case of when rcu_expedited is set. This can just 100% make sure
> there's no IPI.

Right but then the latency can be pretty big.

> 
> > > > 
> > > > > > There's one other thing that bothers me, and that is that
> > > > > > for large rings which are not physically contiguous
> > > > > > we don't implement the optimization.
> > > > > > 
> > > > > > For sure, that can wait, but I think eventually we should
> > > > > > vmap large rings.
> > > > > Yes, worth to try. But using direct map has its own advantage: it can use
> > > > > hugepage that vmap can't
> > > > > 
> > > > > Thanks
> > > > Sure, so we can do that for small rings.
> > > 
> > > Yes, that's possible but should be done on top.
> > > 
> > > Thanks
> > Absolutely. Need to fix up the bugs first.
> > 
> 
> Yes.
> 
> Thanks

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 14:00                                                   ` Jason Wang
  2019-07-26 14:10                                                     ` Michael S. Tsirkin
@ 2019-07-26 15:03                                                     ` Jason Gunthorpe
  2019-07-29  5:56                                                       ` Jason Wang
  1 sibling, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-26 15:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, syzbot, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Fri, Jul 26, 2019 at 10:00:20PM +0800, Jason Wang wrote:
> The question is, MMU notifier are allowed to be blocked on
> invalidate_range_start() which could be much slower than synchronize_rcu()
> to finish.
> 
> Looking at amdgpu_mn_invalidate_range_start_gfx() which calls
> amdgpu_mn_invalidate_node() which did:
> 
>                 r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
>                         true, false, MAX_SCHEDULE_TIMEOUT);
> 
> ...

The general guidance has been that invalidate_start should block
minimally, if at all.

I would say synchronize_rcu is outside that guidance.

BTW, always returning EAGAIN for mmu_notifier_range_blockable() is not
good either, it should instead only return EAGAIN if any
vhost_map_range_overlap() is true.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 13:49                                                   ` Michael S. Tsirkin
@ 2019-07-29  5:54                                                     ` Jason Wang
  2019-07-29  8:59                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-29  5:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
>>> Ok, let me retry if necessary (but I do remember I end up with deadlocks
>>> last try).
>> Ok, I play a little with this. And it works so far. Will do more testing
>> tomorrow.
>>
>> One reason could be I switch to use get_user_pages_fast() to
>> __get_user_pages_fast() which doesn't need mmap_sem.
>>
>> Thanks
> OK that sounds good. If we also set a flag to make
> vhost_exceeds_weight exit, then I think it will be all good.


After some experiments, I came up two methods:

1) switch to use vq->mutex, then we must take the vq lock during range 
checking (but I don't see obvious slowdown for 16vcpus + 16queues). 
Setting flags during weight check should work but it still can't address 
the worst case: wait for the page to be swapped in. Is this acceptable?

2) using current RCU but replace synchronize_rcu() with 
vhost_work_flush(). The worst case is the same as 1) but we can check 
range without holding any locks.

Which one did you prefer?

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-26 15:03                                                     ` Jason Gunthorpe
@ 2019-07-29  5:56                                                       ` Jason Wang
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-29  5:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michael S. Tsirkin, syzbot, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad


On 2019/7/26 下午11:03, Jason Gunthorpe wrote:
> On Fri, Jul 26, 2019 at 10:00:20PM +0800, Jason Wang wrote:
>> The question is, MMU notifier are allowed to be blocked on
>> invalidate_range_start() which could be much slower than synchronize_rcu()
>> to finish.
>>
>> Looking at amdgpu_mn_invalidate_range_start_gfx() which calls
>> amdgpu_mn_invalidate_node() which did:
>>
>>                  r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
>>                          true, false, MAX_SCHEDULE_TIMEOUT);
>>
>> ...
> The general guidance has been that invalidate_start should block
> minimally, if at all.
>
> I would say synchronize_rcu is outside that guidance.


Yes, I get this.


>
> BTW, always returning EAGAIN for mmu_notifier_range_blockable() is not
> good either, it should instead only return EAGAIN if any
> vhost_map_range_overlap() is true.


Right, let me optimize that.

Thanks


>
> Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-29  5:54                                                     ` Jason Wang
@ 2019-07-29  8:59                                                       ` Michael S. Tsirkin
  2019-07-29 14:24                                                         ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-29  8:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
> 
> On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
> > > > Ok, let me retry if necessary (but I do remember I end up with deadlocks
> > > > last try).
> > > Ok, I play a little with this. And it works so far. Will do more testing
> > > tomorrow.
> > > 
> > > One reason could be I switch to use get_user_pages_fast() to
> > > __get_user_pages_fast() which doesn't need mmap_sem.
> > > 
> > > Thanks
> > OK that sounds good. If we also set a flag to make
> > vhost_exceeds_weight exit, then I think it will be all good.
> 
> 
> After some experiments, I came up two methods:
> 
> 1) switch to use vq->mutex, then we must take the vq lock during range
> checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
> flags during weight check should work but it still can't address the worst
> case: wait for the page to be swapped in. Is this acceptable?
> 
> 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
> The worst case is the same as 1) but we can check range without holding any
> locks.
> 
> Which one did you prefer?
> 
> Thanks

I would rather we start with 1 and switch to 2 after we
can show some gain.

But the worst case needs to be addressed.  How about sending a signal to
the vhost thread?  We will need to fix up error handling (I think that
at the moment it will error out in that case, handling this as EFAULT -
and we don't want to drop packets if we can help it, and surely not
enter any error states.  In particular it might be especially tricky if
we wrote into userspace memory and are now trying to log the write.
I guess we can disable the optimization if log is enabled?).

-- 
MST

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-29  8:59                                                       ` Michael S. Tsirkin
@ 2019-07-29 14:24                                                         ` Jason Wang
  2019-07-29 14:44                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-29 14:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/29 下午4:59, Michael S. Tsirkin wrote:
> On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
>> On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
>>>>> Ok, let me retry if necessary (but I do remember I end up with deadlocks
>>>>> last try).
>>>> Ok, I play a little with this. And it works so far. Will do more testing
>>>> tomorrow.
>>>>
>>>> One reason could be I switch to use get_user_pages_fast() to
>>>> __get_user_pages_fast() which doesn't need mmap_sem.
>>>>
>>>> Thanks
>>> OK that sounds good. If we also set a flag to make
>>> vhost_exceeds_weight exit, then I think it will be all good.
>>
>> After some experiments, I came up two methods:
>>
>> 1) switch to use vq->mutex, then we must take the vq lock during range
>> checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
>> flags during weight check should work but it still can't address the worst
>> case: wait for the page to be swapped in. Is this acceptable?
>>
>> 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
>> The worst case is the same as 1) but we can check range without holding any
>> locks.
>>
>> Which one did you prefer?
>>
>> Thanks
> I would rather we start with 1 and switch to 2 after we
> can show some gain.
>
> But the worst case needs to be addressed.


Yes.


> How about sending a signal to
> the vhost thread?  We will need to fix up error handling (I think that
> at the moment it will error out in that case, handling this as EFAULT -
> and we don't want to drop packets if we can help it, and surely not
> enter any error states.  In particular it might be especially tricky if
> we wrote into userspace memory and are now trying to log the write.
> I guess we can disable the optimization if log is enabled?).


This may work but requires a lot of changes. And actually it's the price 
of using vq mutex. Actually, the critical section should be rather 
small, e.g just inside memory accessors.

I wonder whether or not just do synchronize our self like:

static void inline vhost_inc_vq_ref(struct vhost_virtqueue *vq)
{
         int ref = READ_ONCE(vq->ref);

         WRITE_ONCE(vq->ref, ref + 1);
smp_rmb();
}

static void inline vhost_dec_vq_ref(struct vhost_virtqueue *vq)
{
         int ref = READ_ONCE(vq->ref);

smp_wmb();
         WRITE_ONCE(vq->ref, ref - 1);
}

static void inline vhost_wait_for_ref(struct vhost_virtqueue *vq)
{
         while (READ_ONCE(vq->ref));
mb();
}


Or using smp_load_acquire()/smp_store_release() instead?

Thanks

>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-29 14:24                                                         ` Jason Wang
@ 2019-07-29 14:44                                                           ` Michael S. Tsirkin
  2019-07-30  7:44                                                             ` Jason Wang
  0 siblings, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-29 14:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Mon, Jul 29, 2019 at 10:24:43PM +0800, Jason Wang wrote:
> 
> On 2019/7/29 下午4:59, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
> > > On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
> > > > > > Ok, let me retry if necessary (but I do remember I end up with deadlocks
> > > > > > last try).
> > > > > Ok, I play a little with this. And it works so far. Will do more testing
> > > > > tomorrow.
> > > > > 
> > > > > One reason could be I switch to use get_user_pages_fast() to
> > > > > __get_user_pages_fast() which doesn't need mmap_sem.
> > > > > 
> > > > > Thanks
> > > > OK that sounds good. If we also set a flag to make
> > > > vhost_exceeds_weight exit, then I think it will be all good.
> > > 
> > > After some experiments, I came up two methods:
> > > 
> > > 1) switch to use vq->mutex, then we must take the vq lock during range
> > > checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
> > > flags during weight check should work but it still can't address the worst
> > > case: wait for the page to be swapped in. Is this acceptable?
> > > 
> > > 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
> > > The worst case is the same as 1) but we can check range without holding any
> > > locks.
> > > 
> > > Which one did you prefer?
> > > 
> > > Thanks
> > I would rather we start with 1 and switch to 2 after we
> > can show some gain.
> > 
> > But the worst case needs to be addressed.
> 
> 
> Yes.
> 
> 
> > How about sending a signal to
> > the vhost thread?  We will need to fix up error handling (I think that
> > at the moment it will error out in that case, handling this as EFAULT -
> > and we don't want to drop packets if we can help it, and surely not
> > enter any error states.  In particular it might be especially tricky if
> > we wrote into userspace memory and are now trying to log the write.
> > I guess we can disable the optimization if log is enabled?).
> 
> 
> This may work but requires a lot of changes.

I agree.

> And actually it's the price of
> using vq mutex. 

Not sure what's meant here.

> Actually, the critical section should be rather small, e.g
> just inside memory accessors.

Also true.

> 
> I wonder whether or not just do synchronize our self like:
> 
> static void inline vhost_inc_vq_ref(struct vhost_virtqueue *vq)
> {
>         int ref = READ_ONCE(vq->ref);
> 
>         WRITE_ONCE(vq->ref, ref + 1);
> smp_rmb();
> }
> 
> static void inline vhost_dec_vq_ref(struct vhost_virtqueue *vq)
> {
>         int ref = READ_ONCE(vq->ref);
> 
> smp_wmb();
>         WRITE_ONCE(vq->ref, ref - 1);
> }
> 
> static void inline vhost_wait_for_ref(struct vhost_virtqueue *vq)
> {
>         while (READ_ONCE(vq->ref));
> mb();
> }

Looks good but I'd like to think of a strategy/existing lock that let us
block properly as opposed to spinning, that would be more friendly to
e.g. the realtime patch.

> 
> Or using smp_load_acquire()/smp_store_release() instead?
> 
> Thanks

These are cheaper on x86, yes.

> > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-29 14:44                                                           ` Michael S. Tsirkin
@ 2019-07-30  7:44                                                             ` Jason Wang
  2019-07-30  8:03                                                               ` Jason Wang
  2019-07-30 15:08                                                               ` Michael S. Tsirkin
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-30  7:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/29 下午10:44, Michael S. Tsirkin wrote:
> On Mon, Jul 29, 2019 at 10:24:43PM +0800, Jason Wang wrote:
>> On 2019/7/29 下午4:59, Michael S. Tsirkin wrote:
>>> On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
>>>> On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
>>>>>>> Ok, let me retry if necessary (but I do remember I end up with deadlocks
>>>>>>> last try).
>>>>>> Ok, I play a little with this. And it works so far. Will do more testing
>>>>>> tomorrow.
>>>>>>
>>>>>> One reason could be I switch to use get_user_pages_fast() to
>>>>>> __get_user_pages_fast() which doesn't need mmap_sem.
>>>>>>
>>>>>> Thanks
>>>>> OK that sounds good. If we also set a flag to make
>>>>> vhost_exceeds_weight exit, then I think it will be all good.
>>>> After some experiments, I came up two methods:
>>>>
>>>> 1) switch to use vq->mutex, then we must take the vq lock during range
>>>> checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
>>>> flags during weight check should work but it still can't address the worst
>>>> case: wait for the page to be swapped in. Is this acceptable?
>>>>
>>>> 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
>>>> The worst case is the same as 1) but we can check range without holding any
>>>> locks.
>>>>
>>>> Which one did you prefer?
>>>>
>>>> Thanks
>>> I would rather we start with 1 and switch to 2 after we
>>> can show some gain.
>>>
>>> But the worst case needs to be addressed.
>>
>> Yes.
>>
>>
>>> How about sending a signal to
>>> the vhost thread?  We will need to fix up error handling (I think that
>>> at the moment it will error out in that case, handling this as EFAULT -
>>> and we don't want to drop packets if we can help it, and surely not
>>> enter any error states.  In particular it might be especially tricky if
>>> we wrote into userspace memory and are now trying to log the write.
>>> I guess we can disable the optimization if log is enabled?).
>>
>> This may work but requires a lot of changes.
> I agree.
>
>> And actually it's the price of
>> using vq mutex.
> Not sure what's meant here.


I mean if we use vq mutex, it means the critical section was increased 
and we need to deal with swapping then.


>
>> Actually, the critical section should be rather small, e.g
>> just inside memory accessors.
> Also true.
>
>> I wonder whether or not just do synchronize our self like:
>>
>> static void inline vhost_inc_vq_ref(struct vhost_virtqueue *vq)
>> {
>>          int ref = READ_ONCE(vq->ref);
>>
>>          WRITE_ONCE(vq->ref, ref + 1);
>> smp_rmb();
>> }
>>
>> static void inline vhost_dec_vq_ref(struct vhost_virtqueue *vq)
>> {
>>          int ref = READ_ONCE(vq->ref);
>>
>> smp_wmb();
>>          WRITE_ONCE(vq->ref, ref - 1);
>> }
>>
>> static void inline vhost_wait_for_ref(struct vhost_virtqueue *vq)
>> {
>>          while (READ_ONCE(vq->ref));
>> mb();
>> }
> Looks good but I'd like to think of a strategy/existing lock that let us
> block properly as opposed to spinning, that would be more friendly to
> e.g. the realtime patch.


Does it make sense to disable preemption in the critical section? Then 
we don't need to block and we have a deterministic time spent on memory 
accssors?


>
>> Or using smp_load_acquire()/smp_store_release() instead?
>>
>> Thanks
> These are cheaper on x86, yes.


Will use this.

Thanks


>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-30  7:44                                                             ` Jason Wang
@ 2019-07-30  8:03                                                               ` Jason Wang
  2019-07-30 15:08                                                               ` Michael S. Tsirkin
  1 sibling, 0 replies; 87+ messages in thread
From: Jason Wang @ 2019-07-30  8:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/30 下午3:44, Jason Wang wrote:
>>>
>>> }
>> Looks good but I'd like to think of a strategy/existing lock that let us
>> block properly as opposed to spinning, that would be more friendly to
>> e.g. the realtime patch.
>
>
> Does it make sense to disable preemption in the critical section? Then 
> we don't need to block and we have a deterministic time spent on 
> memory accssors?


Ok, touching preempt counter seems a little bit expensive in the fast 
path. Will try for blocking.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-30  7:44                                                             ` Jason Wang
  2019-07-30  8:03                                                               ` Jason Wang
@ 2019-07-30 15:08                                                               ` Michael S. Tsirkin
  2019-07-31  8:49                                                                 ` Jason Wang
  1 sibling, 1 reply; 87+ messages in thread
From: Michael S. Tsirkin @ 2019-07-30 15:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad

On Tue, Jul 30, 2019 at 03:44:47PM +0800, Jason Wang wrote:
> 
> On 2019/7/29 下午10:44, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 10:24:43PM +0800, Jason Wang wrote:
> > > On 2019/7/29 下午4:59, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
> > > > > On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
> > > > > > > > Ok, let me retry if necessary (but I do remember I end up with deadlocks
> > > > > > > > last try).
> > > > > > > Ok, I play a little with this. And it works so far. Will do more testing
> > > > > > > tomorrow.
> > > > > > > 
> > > > > > > One reason could be I switch to use get_user_pages_fast() to
> > > > > > > __get_user_pages_fast() which doesn't need mmap_sem.
> > > > > > > 
> > > > > > > Thanks
> > > > > > OK that sounds good. If we also set a flag to make
> > > > > > vhost_exceeds_weight exit, then I think it will be all good.
> > > > > After some experiments, I came up two methods:
> > > > > 
> > > > > 1) switch to use vq->mutex, then we must take the vq lock during range
> > > > > checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
> > > > > flags during weight check should work but it still can't address the worst
> > > > > case: wait for the page to be swapped in. Is this acceptable?
> > > > > 
> > > > > 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
> > > > > The worst case is the same as 1) but we can check range without holding any
> > > > > locks.
> > > > > 
> > > > > Which one did you prefer?
> > > > > 
> > > > > Thanks
> > > > I would rather we start with 1 and switch to 2 after we
> > > > can show some gain.
> > > > 
> > > > But the worst case needs to be addressed.
> > > 
> > > Yes.
> > > 
> > > 
> > > > How about sending a signal to
> > > > the vhost thread?  We will need to fix up error handling (I think that
> > > > at the moment it will error out in that case, handling this as EFAULT -
> > > > and we don't want to drop packets if we can help it, and surely not
> > > > enter any error states.  In particular it might be especially tricky if
> > > > we wrote into userspace memory and are now trying to log the write.
> > > > I guess we can disable the optimization if log is enabled?).
> > > 
> > > This may work but requires a lot of changes.
> > I agree.
> > 
> > > And actually it's the price of
> > > using vq mutex.
> > Not sure what's meant here.
> 
> 
> I mean if we use vq mutex, it means the critical section was increased and
> we need to deal with swapping then.
> 
> 
> > 
> > > Actually, the critical section should be rather small, e.g
> > > just inside memory accessors.
> > Also true.
> > 
> > > I wonder whether or not just do synchronize our self like:
> > > 
> > > static void inline vhost_inc_vq_ref(struct vhost_virtqueue *vq)
> > > {
> > >          int ref = READ_ONCE(vq->ref);
> > > 
> > >          WRITE_ONCE(vq->ref, ref + 1);
> > > smp_rmb();
> > > }
> > > 
> > > static void inline vhost_dec_vq_ref(struct vhost_virtqueue *vq)
> > > {
> > >          int ref = READ_ONCE(vq->ref);
> > > 
> > > smp_wmb();
> > >          WRITE_ONCE(vq->ref, ref - 1);
> > > }
> > > 
> > > static void inline vhost_wait_for_ref(struct vhost_virtqueue *vq)
> > > {
> > >          while (READ_ONCE(vq->ref));
> > > mb();
> > > }
> > Looks good but I'd like to think of a strategy/existing lock that let us
> > block properly as opposed to spinning, that would be more friendly to
> > e.g. the realtime patch.
> 
> 
> Does it make sense to disable preemption in the critical section? Then we
> don't need to block and we have a deterministic time spent on memory
> accssors?

Hmm maybe. I'm getting really nervious at this point - we
seem to be using every trick in the book.

> 
> > 
> > > Or using smp_load_acquire()/smp_store_release() instead?
> > > 
> > > Thanks
> > These are cheaper on x86, yes.
> 
> 
> Will use this.
> 
> Thanks
> 
> 

This looks suspiciously like a seqlock though.
Can that be used somehow?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-30 15:08                                                               ` Michael S. Tsirkin
@ 2019-07-31  8:49                                                                 ` Jason Wang
  2019-07-31 23:00                                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Wang @ 2019-07-31  8:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: syzbot, aarcange, akpm, christian, davem, ebiederm,
	elena.reshetova, guro, hch, james.bottomley, jglisse, keescook,
	ldv, linux-arm-kernel, linux-kernel, linux-mm, linux-parisc,
	luto, mhocko, mingo, namit, peterz, syzkaller-bugs, viro, wad


On 2019/7/30 下午11:08, Michael S. Tsirkin wrote:
> On Tue, Jul 30, 2019 at 03:44:47PM +0800, Jason Wang wrote:
>> On 2019/7/29 下午10:44, Michael S. Tsirkin wrote:
>>> On Mon, Jul 29, 2019 at 10:24:43PM +0800, Jason Wang wrote:
>>>> On 2019/7/29 下午4:59, Michael S. Tsirkin wrote:
>>>>> On Mon, Jul 29, 2019 at 01:54:49PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/26 下午9:49, Michael S. Tsirkin wrote:
>>>>>>>>> Ok, let me retry if necessary (but I do remember I end up with deadlocks
>>>>>>>>> last try).
>>>>>>>> Ok, I play a little with this. And it works so far. Will do more testing
>>>>>>>> tomorrow.
>>>>>>>>
>>>>>>>> One reason could be I switch to use get_user_pages_fast() to
>>>>>>>> __get_user_pages_fast() which doesn't need mmap_sem.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> OK that sounds good. If we also set a flag to make
>>>>>>> vhost_exceeds_weight exit, then I think it will be all good.
>>>>>> After some experiments, I came up two methods:
>>>>>>
>>>>>> 1) switch to use vq->mutex, then we must take the vq lock during range
>>>>>> checking (but I don't see obvious slowdown for 16vcpus + 16queues). Setting
>>>>>> flags during weight check should work but it still can't address the worst
>>>>>> case: wait for the page to be swapped in. Is this acceptable?
>>>>>>
>>>>>> 2) using current RCU but replace synchronize_rcu() with vhost_work_flush().
>>>>>> The worst case is the same as 1) but we can check range without holding any
>>>>>> locks.
>>>>>>
>>>>>> Which one did you prefer?
>>>>>>
>>>>>> Thanks
>>>>> I would rather we start with 1 and switch to 2 after we
>>>>> can show some gain.
>>>>>
>>>>> But the worst case needs to be addressed.
>>>> Yes.
>>>>
>>>>
>>>>> How about sending a signal to
>>>>> the vhost thread?  We will need to fix up error handling (I think that
>>>>> at the moment it will error out in that case, handling this as EFAULT -
>>>>> and we don't want to drop packets if we can help it, and surely not
>>>>> enter any error states.  In particular it might be especially tricky if
>>>>> we wrote into userspace memory and are now trying to log the write.
>>>>> I guess we can disable the optimization if log is enabled?).
>>>> This may work but requires a lot of changes.
>>> I agree.
>>>
>>>> And actually it's the price of
>>>> using vq mutex.
>>> Not sure what's meant here.
>>
>> I mean if we use vq mutex, it means the critical section was increased and
>> we need to deal with swapping then.
>>
>>
>>>> Actually, the critical section should be rather small, e.g
>>>> just inside memory accessors.
>>> Also true.
>>>
>>>> I wonder whether or not just do synchronize our self like:
>>>>
>>>> static void inline vhost_inc_vq_ref(struct vhost_virtqueue *vq)
>>>> {
>>>>           int ref = READ_ONCE(vq->ref);
>>>>
>>>>           WRITE_ONCE(vq->ref, ref + 1);
>>>> smp_rmb();
>>>> }
>>>>
>>>> static void inline vhost_dec_vq_ref(struct vhost_virtqueue *vq)
>>>> {
>>>>           int ref = READ_ONCE(vq->ref);
>>>>
>>>> smp_wmb();
>>>>           WRITE_ONCE(vq->ref, ref - 1);
>>>> }
>>>>
>>>> static void inline vhost_wait_for_ref(struct vhost_virtqueue *vq)
>>>> {
>>>>           while (READ_ONCE(vq->ref));
>>>> mb();
>>>> }
>>> Looks good but I'd like to think of a strategy/existing lock that let us
>>> block properly as opposed to spinning, that would be more friendly to
>>> e.g. the realtime patch.
>>
>> Does it make sense to disable preemption in the critical section? Then we
>> don't need to block and we have a deterministic time spent on memory
>> accssors?
> Hmm maybe. I'm getting really nervious at this point - we
> seem to be using every trick in the book.
>

Yes, looking at the synchronization implemented by other MMU notifiers. 
Vhost is even the simplest.


>>>> Or using smp_load_acquire()/smp_store_release() instead?
>>>>
>>>> Thanks
>>> These are cheaper on x86, yes.
>>
>> Will use this.
>>
>> Thanks
>>
>>
> This looks suspiciously like a seqlock though.
> Can that be used somehow?
>

seqlock does not provide a way to synchronize with readers. But I did 
borrow some ideas from seqlock and post a new version.

Please review.

Thanks


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: WARNING in __mmdrop
  2019-07-31  8:49                                                                 ` Jason Wang
@ 2019-07-31 23:00                                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2019-07-31 23:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, syzbot, aarcange, akpm, christian, davem,
	ebiederm, elena.reshetova, guro, hch, james.bottomley, jglisse,
	keescook, ldv, linux-arm-kernel, linux-kernel, linux-mm,
	linux-parisc, luto, mhocko, mingo, namit, peterz, syzkaller-bugs,
	viro, wad

On Wed, Jul 31, 2019 at 04:49:32PM +0800, Jason Wang wrote:
> Yes, looking at the synchronization implemented by other MMU notifiers.
> Vhost is even the simplest.

I think that is only because it calls gup under a spinlock, which is,
IMHO, not great.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2019-07-31 23:01 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0000000000008dd6bb058e006938@google.com>
2019-07-20 10:08 ` WARNING in __mmdrop syzbot
2019-07-21 10:02   ` Michael S. Tsirkin
2019-07-21 12:18     ` Michael S. Tsirkin
2019-07-22  5:24       ` Jason Wang
2019-07-22  8:08         ` Michael S. Tsirkin
2019-07-23  4:01           ` Jason Wang
2019-07-23  5:01             ` Michael S. Tsirkin
2019-07-23  5:47               ` Jason Wang
2019-07-23  7:23                 ` Michael S. Tsirkin
2019-07-23  7:53                   ` Jason Wang
2019-07-23  8:10                     ` Michael S. Tsirkin
2019-07-23  8:49                       ` Jason Wang
2019-07-23  9:26                         ` Michael S. Tsirkin
2019-07-23 13:31                           ` Jason Wang
2019-07-25  5:52                             ` Michael S. Tsirkin
2019-07-25  7:43                               ` Jason Wang
2019-07-25  8:28                                 ` Michael S. Tsirkin
2019-07-25 13:21                                   ` Jason Wang
2019-07-25 13:26                                     ` Michael S. Tsirkin
2019-07-25 14:25                                       ` Jason Wang
2019-07-26 11:49                                         ` Michael S. Tsirkin
2019-07-26 12:00                                           ` Jason Wang
2019-07-26 12:38                                             ` Michael S. Tsirkin
2019-07-26 12:53                                               ` Jason Wang
2019-07-26 13:36                                                 ` Jason Wang
2019-07-26 13:49                                                   ` Michael S. Tsirkin
2019-07-29  5:54                                                     ` Jason Wang
2019-07-29  8:59                                                       ` Michael S. Tsirkin
2019-07-29 14:24                                                         ` Jason Wang
2019-07-29 14:44                                                           ` Michael S. Tsirkin
2019-07-30  7:44                                                             ` Jason Wang
2019-07-30  8:03                                                               ` Jason Wang
2019-07-30 15:08                                                               ` Michael S. Tsirkin
2019-07-31  8:49                                                                 ` Jason Wang
2019-07-31 23:00                                                                   ` Jason Gunthorpe
2019-07-26 13:47                                                 ` Michael S. Tsirkin
2019-07-26 14:00                                                   ` Jason Wang
2019-07-26 14:10                                                     ` Michael S. Tsirkin
2019-07-26 15:03                                                     ` Jason Gunthorpe
2019-07-29  5:56                                                       ` Jason Wang
2019-07-21 12:28     ` RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop) Michael S. Tsirkin
2019-07-21 13:17       ` Paul E. McKenney
2019-07-21 17:53         ` Michael S. Tsirkin
2019-07-21 19:28           ` Paul E. McKenney
2019-07-22  7:56             ` Michael S. Tsirkin
2019-07-22 11:57               ` Paul E. McKenney
2019-07-21 21:08         ` Matthew Wilcox
2019-07-21 23:31           ` Paul E. McKenney
2019-07-22  7:52             ` Michael S. Tsirkin
2019-07-22 11:51               ` Paul E. McKenney
2019-07-22 13:41                 ` Jason Gunthorpe
2019-07-22 15:52                   ` Paul E. McKenney
2019-07-22 16:04                     ` Jason Gunthorpe
2019-07-22 16:15                       ` Michael S. Tsirkin
2019-07-22 16:15                       ` Paul E. McKenney
2019-07-22 15:14             ` Joel Fernandes
2019-07-22 15:47               ` Michael S. Tsirkin
2019-07-22 15:55                 ` Paul E. McKenney
2019-07-22 16:13                   ` Michael S. Tsirkin
2019-07-22 16:25                     ` Paul E. McKenney
2019-07-22 16:32                       ` Michael S. Tsirkin
2019-07-22 18:58                         ` Paul E. McKenney
2019-07-22  5:21     ` WARNING in __mmdrop Jason Wang
2019-07-22  8:02       ` Michael S. Tsirkin
2019-07-23  3:55         ` Jason Wang
2019-07-23  5:02           ` Michael S. Tsirkin
2019-07-23  5:48             ` Jason Wang
2019-07-23  7:25               ` Michael S. Tsirkin
2019-07-23  7:55                 ` Jason Wang
2019-07-23  7:56               ` Michael S. Tsirkin
2019-07-23  8:42                 ` Jason Wang
2019-07-23 10:27                   ` Michael S. Tsirkin
2019-07-23 13:34                     ` Jason Wang
2019-07-23 15:02                       ` Michael S. Tsirkin
2019-07-24  2:17                         ` Jason Wang
2019-07-24  8:05                           ` Michael S. Tsirkin
2019-07-24 10:08                             ` Jason Wang
2019-07-24 18:25                               ` Michael S. Tsirkin
2019-07-25  3:44                                 ` Jason Wang
2019-07-25  5:09                                   ` Michael S. Tsirkin
2019-07-24 16:53                             ` Jason Gunthorpe
2019-07-24 18:25                               ` Michael S. Tsirkin
2019-07-23 10:42                   ` Michael S. Tsirkin
2019-07-23 13:37                     ` Jason Wang
2019-07-22 14:11     ` Jason Gunthorpe
2019-07-25  6:02       ` Michael S. Tsirkin
2019-07-25  7:44         ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).