* kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
@ 2018-12-13 23:24 Steven Luong (sluong) via Virtualization
  2018-12-17 22:55 ` Michael S. Tsirkin
  0 siblings, 1 reply; 5+ messages in thread
From: Steven Luong (sluong) via Virtualization @ 2018-12-13 23:24 UTC (permalink / raw)
  To: virtualization; +Cc: Damjan Marion (damarion), Jerome Tollet (jtollet)



Folks,

We came across a memory race condition between the VPP vhost driver and the kernel vhost. VPP is running a tap interface over the vhost backend; in this case, VPP acts in vhost driver mode and the kernel vhost acts in vhost device mode.

In the kernel vhost’s TX traffic direction, which is VPP’s RX traffic direction, kernel vhost is the producer and VPP is the consumer. Kernel vhost places traffic in its TX vring; VPP removes traffic from its RX vring. Under heavy traffic it is inevitable that the vring becomes full, and the kernel vhost may have to stop and wait for the guest (VPP) to empty the vring and refill it with descriptors. When that happens, kernel vhost clears the bit in the vring’s used flags to request an interrupt/notification. Due to a shared-memory race condition, VPP may miss the clearing of the vring’s used flags by the kernel vhost and fail to send kernel vhost an interrupt after VPP empties and refills the vring with new descriptors. Unfortunately, this missed notification leaves the kernel vhost stuck: once the kernel vhost is waiting for an interrupt/notification from the guest, only such an interrupt/notification can re-enable the guest to submit new packets to the vring. This design seems vulnerable. Should the kernel vhost depend entirely on the notification from the guest? Quoting the text from

http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html

/* The device uses this in used->flags to advise the driver: don’t kick me
 * when you add a buffer.  It’s unreliable, so it’s simply an
 * optimization. */
#define VIRTQ_USED_F_NO_NOTIFY  1

I interpret that to mean the notification is simply an optimization, not a reliable notification mechanism. So the kernel vhost should not bet the farm on it.

We encountered the same race condition in the kernel vhost’s RX traffic direction, which also causes the kernel vhost to get stuck due to a missed interrupt/notification, although it is less frequent.

Steven


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
  2018-12-13 23:24 kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue Steven Luong (sluong) via Virtualization
@ 2018-12-17 22:55 ` Michael S. Tsirkin
  2018-12-17 23:56   ` Steven Luong (sluong) via Virtualization
  0 siblings, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2018-12-17 22:55 UTC (permalink / raw)
  To: Steven Luong (sluong)
  Cc: Jerome Tollet (jtollet), Damjan Marion (damarion), virtualization

On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
> Folks,
> 
>  
> 
> We came across a memory race condition between VPP vhost driver and the kernel
> vhost. VPP is running a tap interface over vhost backend. In this case, VPP is
> acting as the vhost driver mode and the kernel vhost is acting as the vhost
> device mode.
> 
>  
> 
> In the kernel vhost’s TX traffic direction which is VPP’s RX traffic direction,
> kernel vhost is the producer and VPP is the consumer.

Looking at vhost net kernel code, it seems to use the
same terminology as virtio net.
Can you pls clarify which ring number do you use 0 or 1?

I will assume ring 0 below.



> Kernel vhost places
> traffic in kernel vhost’s TX vring. VPP removes traffic in VPP’s RX vring.


So VPP makes buffers available and vhost kernel uses them?


> It
> is inevitable that the vring may become full under heavy traffic and the kernel
> vhost may have to stop and wait for the guest (VPP) to empty the vring and to
> refill the vring with descriptors. When that case happens, kernel vhost clears
> the bit in the vring’s used flag to request an interrupt/notification. Due to
> shared memory race condition, VPP may miss the clearing of the vring’s used
> flag from kernel vhost and didn’t send kernel vhost an interrupt after VPP
> empties and refills the vring with new descriptors.

So kernel vhost's wakeups are signalled through eventfd - I assume
this is what you mean by an interrupt?


To prevent the race that you describe, vhost has this code:

                /* OK, now we need to know about added descriptors. */
                if (!headcount) {
                        if (unlikely(busyloop_intr)) {
                                vhost_poll_queue(&vq->poll);
                        } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                /* They have slipped one in as we were
                                 * doing that: check again. */
                                vhost_disable_notify(&net->dev, vq);
                                continue;
                        }
                        /* Nothing new?  Wait for eventfd to tell us

So ring should be re-checked after notifications are enabled.
If ring is no longer empty, vhost will proceed to handle the ring.  
This code has been around for many years.

Thus if VPP doesn't work with it then I suspect it does something
differently, e.g. is it possible that it is missing a memory
barrier after writing out the available buffers and before
checking the flags?
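The ordering being suggested here can be sketched with C11 atomics. This is a minimal illustration, not VPP's actual code: the `vring_shared` layout and `publish_and_maybe_kick` name are invented for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_USED_F_NO_NOTIFY 1

/* Hypothetical driver-side view of the two split-ring fields involved
 * in the race; the real fields live in guest-shared memory. */
struct vring_shared {
    _Atomic uint16_t avail_idx;   /* written by the driver (VPP side) */
    _Atomic uint16_t used_flags;  /* written by the device (vhost side) */
};

/* Publish newly refilled descriptors, then decide whether to kick.
 * The full fence between the avail_idx store and the used_flags load is
 * the barrier in question: without it the CPU may hoist the load above
 * the store, so the driver can read a stale used_flags that still has
 * VRING_USED_F_NO_NOTIFY set and skip a kick the device is waiting for. */
static bool publish_and_maybe_kick(struct vring_shared *vr,
                                   uint16_t new_avail_idx)
{
    atomic_store_explicit(&vr->avail_idx, new_avail_idx,
                          memory_order_release);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier (mfence-class) */
    uint16_t flags = atomic_load_explicit(&vr->used_flags,
                                          memory_order_acquire);
    return (flags & VRING_USED_F_NO_NOTIFY) == 0; /* kick unless suppressed */
}
```

A release store alone does not order the store against the later load; only the full fence (or an equivalent locked instruction) does.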




> Unfortunately, this missed
> notification causes the kernel vhost to be stuck because once the kernel vhost
> is waiting for an interrupt/notification from the guest, only an interrupt/
> notification from the guest can resume/re-enable the guest from submitting new
> packets to the vring. This design seems vulnerable.

The protocol right now relies on notifications never being lost.

I can imagine an alternative where device also re-checks
the rings periodically, but that would involve running
timers host side which is only possible if there's a
free processor that can handle them. Further,
it will still lead to stalls and packet drops, which will
be harder to debug.

That's just my $.02, pls feel free to take it with the virtio tc
maybe others will feel differently.

> Should the kernel vhost
> totally depend on the notification from the guest? Quoting the text from
> 
>  
> 
> http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
> 
>  
> 
> /* The device uses this in used->flags to advise the driver: don’t kick me 
>  * when you add a buffer.  It’s unreliable, so it’s simply an 
>  * optimization. */ 
> #define VIRTQ_USED_F_NO_NOTIFY  1 
> 
>  
> 
> I interpret that the notification is simply an optimization, not a reliable
> notification mechanism. 

What was meant I think is that suppression is unreliable.

>So the kernel vhost should not bet the farm on it.
> 
>  
> 
> We encounter the same race condition in kernel vhost’s RX traffic direction
> which causes the kernel vhost to be stuck due to missed interrupt/notification
> although it is less frequent.


For the reverse direction the spec actually has some pseudo-code and a suggestion
about handling that race:

	For optimal performance, a driver MAY disable used buffer notifications while processing the used
	ring, but beware the problem of missing notifications between emptying the ring and reenabling
	notifications. This is usually handled by re-checking for more used buffers after notifications are
	re-enabled:

		virtq_disable_used_buffer_notifications(vq);
		for (;;) {
			if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) {
				virtq_enable_used_buffer_notifications(vq);
				mb();
				if (vq->last_seen_used == le16_to_cpu(virtq->used.idx))
					break;
				virtq_disable_used_buffer_notifications(vq);
			}
			struct virtq_used_elem *e = &virtq.used->ring[vq->last_seen_used % vsz];
			process_buffer(e);
			vq->last_seen_used++;
		}
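The spec's re-check pattern can be exercised with a small single-threaded simulation. Everything here is a sketch: `struct sim_vq`, `device_maybe_produce`, and `drain_used_ring` are invented names, and the device is scripted to slip one buffer in exactly inside the race window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the driver's view of a used ring. */
struct sim_vq {
    uint16_t last_seen_used;
    uint16_t used_idx;             /* advanced by the "device" */
    bool notifications_enabled;
    int processed;                 /* buffers consumed so far */
};

/* Simulated device: may slip one more used buffer in right after the
 * driver re-enables notifications -- exactly the window the spec's
 * re-check is there to close. */
static void device_maybe_produce(struct sim_vq *vq, int *budget)
{
    if (*budget > 0) {
        vq->used_idx++;
        (*budget)--;
    }
}

/* Driver loop following the spec's pattern: disable notifications,
 * drain, re-enable, barrier (elided: this sketch is single-threaded),
 * then re-check before going to sleep. */
static void drain_used_ring(struct sim_vq *vq, int *late_buffers)
{
    vq->notifications_enabled = false;
    for (;;) {
        if (vq->last_seen_used == vq->used_idx) {
            vq->notifications_enabled = true;
            /* mb() would go here on real hardware */
            device_maybe_produce(vq, late_buffers); /* the race window */
            if (vq->last_seen_used == vq->used_idx)
                break;                  /* really empty: wait for notify */
            vq->notifications_enabled = false;
        }
        /* process_buffer(&used.ring[vq->last_seen_used % vsz]); */
        vq->processed++;
        vq->last_seen_used++;
    }
}
```

Even a buffer produced inside the window is consumed before the loop exits, so the notification the driver never receives for it is harmless.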



>  
> 
> Steven
> 



* Re: kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
  2018-12-17 22:55 ` Michael S. Tsirkin
@ 2018-12-17 23:56   ` Steven Luong (sluong) via Virtualization
  2018-12-18  0:15     ` Michael S. Tsirkin
  0 siblings, 1 reply; 5+ messages in thread
From: Steven Luong (sluong) via Virtualization @ 2018-12-17 23:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jerome Tollet (jtollet), Damjan Marion (damarion), virtualization



On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:

    On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
    > Folks,
    > 
    >  
    > 
    > We came across a memory race condition between VPP vhost driver and the kernel
    > vhost. VPP is running a tap interface over vhost backend. In this case, VPP is
    > acting as the vhost driver mode and the kernel vhost is acting as the vhost
    > device mode.
    > 
    >  
    > 
    > In the kernel vhost’s TX traffic direction which is VPP’s RX traffic direction,
    > kernel vhost is the producer and VPP is the consumer.
    
    Looking at vhost net kernel code, it seems to use the
    same terminology as virtio net.
    Can you pls clarify which ring number do you use 0 or 1?

<SVL> 0. </SVL>
    
    I will assume ring 0 below.
    
    
    
    > Kernel vhost places
    > traffic in kernel vhost’s TX vring. VPP removes traffic in VPP’s RX vring.
    
    
    So VPP makes buffers available and vhost kernel uses them?
    
<SVL> Correct. </SVL>
    
    > It
    > is inevitable that the vring may become full under heavy traffic and the kernel
    > vhost may have to stop and wait for the guest (VPP) to empty the vring and to
    > refill the vring with descriptors. When that case happens, kernel vhost clears
    > the bit in the vring’s used flag to request an interrupt/notification. Due to
    > shared memory race condition, VPP may miss the clearing of the vring’s used
    > flag from kernel vhost and didn’t send kernel vhost an interrupt after VPP
    > empties and refills the vring with new descriptors.
    
    So kernel vhost's wakeups are signalled through eventfd - I assume
    this is what you mean by an interrupt?
    
    
    To prevent the race that you describe, vhost has this code:
    
                    /* OK, now we need to know about added descriptors. */
                    if (!headcount) {
                            if (unlikely(busyloop_intr)) {
                                    vhost_poll_queue(&vq->poll);
                            } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                    /* They have slipped one in as we were
                                     * doing that: check again. */
                                    vhost_disable_notify(&net->dev, vq);
                                    continue;
                            }
                            /* Nothing new?  Wait for eventfd to tell us
    
    So ring should be re-checked after notifications are enabled.
    If ring is no longer empty, vhost will proceed to handle the ring.  
    This code has been around for many years.
    
    Thus if VPP doesn't work with it then I suspect it does something
    differently, e.g. is it possible that it is missing a memory
    barrier after writing out the available buffers and before
    checking the flags?
    
    <SVL> Can the race condition be avoided totally? Here is the scenario.

t0: VPP refills the vring with new descriptors but has not yet set avail->idx to make them available to kernel vhost.
t1: kernel vhost checks the vring, still sees no space available, but has not yet executed the line of code that clears used->flags.
t2: VPP executes sfence and sets avail->idx to make the descriptors available to kernel vhost.
t3: VPP checks used->flags; it is still 1, so VPP skips sending the interrupt.
t4: kernel vhost clears used->flags to indicate an interrupt is required, but VPP has just missed it. The result is that kernel vhost gets stuck even though the vring is filled with new descriptors.

Steven

</SVL>
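The interleaving above can be written out as a single-threaded replay. The names (`struct race`, `replay_race`) are invented for this sketch, and the kernel-side re-check is deliberately left out to show what would happen if the notification were the only wakeup mechanism:

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared-memory fields involved in the t0-t4 interleaving;
 * names are illustrative only. */
struct race {
    uint16_t avail_idx;   /* written by VPP */
    uint16_t used_flags;  /* written by kernel vhost; 1 = no notify */
    bool interrupt_sent;
};

/* Replay t0-t4 in program order.  With no re-check after clearing
 * used_flags, this interleaving would leave the device waiting for
 * an interrupt that never comes. */
static bool replay_race(struct race *r)
{
    /* t0-t2: VPP refills the ring, sfence, publishes avail_idx */
    r->avail_idx++;
    /* t3: VPP reads used_flags -- still 1, so no interrupt is sent */
    if (r->used_flags == 0)
        r->interrupt_sent = true;
    /* t4: kernel vhost clears used_flags, too late for VPP to see */
    r->used_flags = 0;
    /* if notification were the only wakeup, the device is now stuck */
    return r->interrupt_sent;
}
```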
    
    
    > Unfortunately, this missed
    > notification causes the kernel vhost to be stuck because once the kernel vhost
    > is waiting for an interrupt/notification from the guest, only an interrupt/
    > notification from the guest can resume/re-enable the guest from submitting new
    > packets to the vring. This design seems vulnerable.
    
    The protocol right now relies on notifications never being lost.
    
    I can imagine an alternative where device also re-checks
    the rings periodically, but that would involve running
    timers host side which is only possible if there's a
    free processor that can handle them. Further,
    it will still lead to stalls and packet drops, which will
    be harder to debug.
    
    That's just my $.02, pls feel free to take it with the virtio tc
    maybe others will feel differently.
    
    > Should the kernel vhost
    > totally depend on the notification from the guest? Quoting the text from
    > 
    >  
    > 
    > http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
    > 
    >  
    > 
    > /* The device uses this in used->flags to advise the driver: don’t kick me 
    >  * when you add a buffer.  It’s unreliable, so it’s simply an 
    >  * optimization. */ 
    > #define VIRTQ_USED_F_NO_NOTIFY  1 
    > 
    >  
    > 
    > I interpret that the notification is simply an optimization, not a reliable
    > notification mechanism. 
    
    What was meant I think is that suppression is unreliable.
    
    >So the kernel vhost should not bet the farm on it.
    > 
    >  
    > 
    > We encounter the same race condition in kernel vhost’s RX traffic direction
    > which causes the kernel vhost to be stuck due to missed interrupt/notification
    > although it is less frequent.
    
    
    For the reverse direction the spec actually has some pseudo-code and a suggestion
    about handling that race:
    
    	For optimal performance, a driver MAY disable used buffer notifications while processing the used
    	ring, but beware the problem of missing notifications between emptying the ring and reenabling
    	notifications. This is usually handled by re-checking for more used buffers after notifications are
    	re-enabled:

    		virtq_disable_used_buffer_notifications(vq);
    		for (;;) {
    			if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) {
    				virtq_enable_used_buffer_notifications(vq);
    				mb();
    				if (vq->last_seen_used == le16_to_cpu(virtq->used.idx))
    					break;
    				virtq_disable_used_buffer_notifications(vq);
    			}
    			struct virtq_used_elem *e = &virtq.used->ring[vq->last_seen_used % vsz];
    			process_buffer(e);
    			vq->last_seen_used++;
    		}
    
    
    
    >  
    > 
    > Steven
    > 
    


* Re: kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
  2018-12-17 23:56   ` Steven Luong (sluong) via Virtualization
@ 2018-12-18  0:15     ` Michael S. Tsirkin
  2019-01-08 19:05       ` Steven Luong (sluong) via Virtualization
  0 siblings, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2018-12-18  0:15 UTC (permalink / raw)
  To: Steven Luong (sluong)
  Cc: Jerome Tollet (jtollet), Damjan Marion (damarion), virtualization

On Mon, Dec 17, 2018 at 11:56:59PM +0000, Steven Luong (sluong) wrote:
> 
> 
> On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>     On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
>     > Folks,
>     > 
>     >  
>     > 
>     > We came across a memory race condition between VPP vhost driver and the kernel
>     > vhost. VPP is running a tap interface over vhost backend. In this case, VPP is
>     > acting as the vhost driver mode and the kernel vhost is acting as the vhost
>     > device mode.
>     > 
>     >  
>     > 
>     > In the kernel vhost’s TX traffic direction which is VPP’s RX traffic direction,
>     > kernel vhost is the producer and VPP is the consumer.
>     
>     Looking at vhost net kernel code, it seems to use the
>     same terminology as virtio net.
>     Can you pls clarify which ring number do you use 0 or 1?
> 
> <SVL> 0. </SVL>
>     
>     I will assume ring 0 below.
>     
>     
>     
>     > Kernel vhost places
>     > traffic in kernel vhost’s TX vring. VPP removes traffic in VPP’s RX vring.
>     
>     
>     So VPP makes buffers available and vhost kernel uses them?
>     
> <SVL> Correct. </SVL>
>     
>     > It
>     > is inevitable that the vring may become full under heavy traffic and the kernel
>     > vhost may have to stop and wait for the guest (VPP) to empty the vring and to
>     > refill the vring with descriptors. When that case happens, kernel vhost clears
>     > the bit in the vring’s used flag to request an interrupt/notification. Due to
>     > shared memory race condition, VPP may miss the clearing of the vring’s used
>     > flag from kernel vhost and didn’t send kernel vhost an interrupt after VPP
>     > empties and refills the vring with new descriptors.
>     
>     So kernel vhost's wakeups are signalled through eventfd - I assume
>     this is what you mean by an interrupt?
>     
>     
>     To prevent the race that you describe, vhost has this code:
>     
>                     /* OK, now we need to know about added descriptors. */
>                     if (!headcount) {
>                             if (unlikely(busyloop_intr)) {
>                                     vhost_poll_queue(&vq->poll);
>                             } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                                     /* They have slipped one in as we were
>                                      * doing that: check again. */
>                                     vhost_disable_notify(&net->dev, vq);
>                                     continue;
>                             }
>                             /* Nothing new?  Wait for eventfd to tell us
>     
>     So ring should be re-checked after notifications are enabled.
>     If ring is no longer empty, vhost will proceed to handle the ring.  
>     This code has been around for many years.
>     
>     Thus if VPP doesn't work with it then I suspect it does something
>     differently, e.g. is it possible that it is missing a memory
>     barrier after writing out the available buffers and before
>     checking the flags?
>     
>     <SVL> Can the race condition be avoided totally? Here is the scenario.
> 
> t0: VPP refills the vring with new descriptors, not yet sets avail->idx to make it available to kernel vhost. 
> t1: kernel vhost checks the vring, still sees no space available, but does not yet execute the line of code to clear the used->flags
> t2: vpp executes sfence, and sets avail->idx to make it available to kernel vhost

At this point you need a full memory barrier, e.g. mfence or a locked instruction such as lock xor.
Otherwise the read can get re-ordered before the write.

> t3: vpp checks used->flags, it is still 1, vpp skips sending the interrupt
> t4: kernel vhost clears used->flags to indicate interrupt is required. But VPP just missed it.

fine but at this point kernel vhost will recheck using vhost_get_avail
and process the buffers. So the lack of notification isn't a problem:

bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
        __virtio16 avail_idx;
        int r;

        if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
                return false;
        vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
        if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
                r = vhost_update_used_flags(vq);
                if (r) {
                        vq_err(vq, "Failed to enable notification at %p: %d\n",
                               &vq->used->flags, r);
                        return false;
                }
        } else {
                r = vhost_update_avail_event(vq, vq->avail_idx);
                if (r) {
                        vq_err(vq, "Failed to update avail event index at %p: %d\n",
                               vhost_avail_event(vq), r);
                        return false;
                }
        }
        /* They could have slipped one in as we were doing that: make
         * sure it's written, then check again. */
        smp_mb();
        r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
        if (r) {
                vq_err(vq, "Failed to check avail idx at %p: %d\n",
                       &vq->avail->idx, r);
                return false;
        }

        return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
}

>The result is kernel vhost got stuck even though the vring is filled with new descriptors.

Shouldn't be. Maybe you are missing a fence before reading used->flags?
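Replaying the t0-t4 interleaving against that re-check, as a single-threaded sketch with invented names (`struct replay`, `run_t0_to_t4`), shows why the missed kick is recoverable:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal state for replaying the t0-t4 interleaving; the field
 * names are illustrative, not the kernel's. */
struct replay {
    uint16_t avail_idx;       /* driver-owned */
    uint16_t seen_avail_idx;  /* device's cached copy */
    uint16_t used_flags;      /* device-owned; 1 = don't kick me */
    bool kicked;
    bool device_stuck;
};

/* Device side, modeled on vhost_enable_notify(): clear NO_NOTIFY,
 * then (after the smp_mb() in the real code) re-read avail->idx.
 * A changed index means buffers slipped in while notifications were
 * suppressed, so the device reprocesses instead of sleeping. */
static void device_enable_notify_and_recheck(struct replay *r)
{
    r->used_flags = 0;                       /* t4: ask for notifications */
    /* smp_mb() sits here in vhost_enable_notify() */
    if (r->avail_idx != r->seen_avail_idx) { /* re-check catches t2's update */
        r->seen_avail_idx = r->avail_idx;
        r->device_stuck = false;             /* proceed to handle the ring */
    } else {
        r->device_stuck = true;              /* truly empty: wait for kick */
    }
}

static void run_t0_to_t4(struct replay *r)
{
    r->avail_idx = 1;                 /* t0+t2: driver publishes descriptors */
    if (r->used_flags == 0)           /* t3: flags still 1, kick is skipped */
        r->kicked = true;
    device_enable_notify_and_recheck(r); /* t4 plus the re-check */
}
```

The kick is indeed lost, but the avail-index re-read after the barrier makes the loss harmless.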


> Steven
> 
> </SVL>
>     
>     
>     > Unfortunately, this missed
>     > notification causes the kernel vhost to be stuck because once the kernel vhost
>     > is waiting for an interrupt/notification from the guest, only an interrupt/
>     > notification from the guest can resume/re-enable the guest from submitting new
>     > packets to the vring. This design seems vulnerable.
>     
>     The protocol right now relies on notifications never being lost.
>     
>     I can imagine an alternative where device also re-checks
>     the rings periodically, but that would involve running
>     timers host side which is only possible if there's a
>     free processor that can handle them. Further,
>     it will still lead to stalls and packet drops, which will
>     be harder to debug.
>     
>     That's just my $.02, pls feel free to take it with the virtio tc
>     maybe others will feel differently.
>     
>     > Should the kernel vhost
>     > totally depend on the notification from the guest? Quoting the text from
>     > 
>     >  
>     > 
>     > http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
>     > 
>     >  
>     > 
>     > /* The device uses this in used->flags to advise the driver: don’t kick me 
>     >  * when you add a buffer.  It’s unreliable, so it’s simply an 
>     >  * optimization. */ 
>     > #define VIRTQ_USED_F_NO_NOTIFY  1 
>     > 
>     >  
>     > 
>     > I interpret that the notification is simply an optimization, not a reliable
>     > notification mechanism. 
>     
>     What was meant I think is that suppression is unreliable.
>     
>     >So the kernel vhost should not bet the farm on it.
>     > 
>     >  
>     > 
>     > We encounter the same race condition in kernel vhost’s RX traffic direction
>     > which causes the kernel vhost to be stuck due to missed interrupt/notification
>     > although it is less frequent.
>     
>     
>     For the reverse direction the spec actually has some pseudo-code and a suggestion
>     about handling that race:
>     
>     	For optimal performance, a driver MAY disable used buffer notifications while processing the used
>     	ring, but beware the problem of missing notifications between emptying the ring and reenabling
>     	notifications. This is usually handled by re-checking for more used buffers after notifications are
>     	re-enabled:
> 
>     		virtq_disable_used_buffer_notifications(vq);
>     		for (;;) {
>     			if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) {
>     				virtq_enable_used_buffer_notifications(vq);
>     				mb();
>     				if (vq->last_seen_used == le16_to_cpu(virtq->used.idx))
>     					break;
>     				virtq_disable_used_buffer_notifications(vq);
>     			}
>     			struct virtq_used_elem *e = &virtq.used->ring[vq->last_seen_used % vsz];
>     			process_buffer(e);
>     			vq->last_seen_used++;
>     		}
>     
>     
>     
>     >  
>     > 
>     > Steven
>     > 
>     


* Re: kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
  2018-12-18  0:15     ` Michael S. Tsirkin
@ 2019-01-08 19:05       ` Steven Luong (sluong) via Virtualization
  0 siblings, 0 replies; 5+ messages in thread
From: Steven Luong (sluong) via Virtualization @ 2019-01-08 19:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jerome Tollet (jtollet), Damjan Marion (damarion), virtualization

Michael,

Thank you very much for your help. I greatly appreciate it.

Steven

On 12/17/18, 4:16 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:

    On Mon, Dec 17, 2018 at 11:56:59PM +0000, Steven Luong (sluong) wrote:
    > 
    > 
    > On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > 
    >     On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
    >     > Folks,
    >     > 
    >     >  
    >     > 
    >     > We came across a memory race condition between VPP vhost driver and the kernel
    >     > vhost. VPP is running a tap interface over vhost backend. In this case, VPP is
    >     > acting as the vhost driver mode and the kernel vhost is acting as the vhost
    >     > device mode.
    >     > 
    >     >  
    >     > 
    >     > In the kernel vhost’s TX traffic direction which is VPP’s RX traffic direction,
    >     > kernel vhost is the producer and VPP is the consumer.
    >     
    >     Looking at vhost net kernel code, it seems to use the
    >     same terminology as virtio net.
    >     Can you pls clarify which ring number do you use 0 or 1?
    > 
    > <SVL> 0. </SVL>
    >     
    >     I will assume ring 0 below.
    >     
    >     
    >     
    >     > Kernel vhost places
    >     > traffic in kernel vhost’s TX vring. VPP removes traffic in VPP’s RX vring.
    >     
    >     
    >     So VPP makes buffers available and vhost kernel uses them?
    >     
    > <SVL> Correct. </SVL>
    >     
    >     > It
    >     > is inevitable that the vring may become full under heavy traffic and the kernel
    >     > vhost may have to stop and wait for the guest (VPP) to empty the vring and to
    >     > refill the vring with descriptors. When that case happens, kernel vhost clears
    >     > the bit in the vring’s used flag to request an interrupt/notification. Due to
    >     > shared memory race condition, VPP may miss the clearing of the vring’s used
    >     > flag from kernel vhost and didn’t send kernel vhost an interrupt after VPP
    >     > empties and refills the vring with new descriptors.
    >     
    >     So kernel vhost's wakeups are signalled through eventfd - I assume
    >     this is what you mean by an interrupt?
    >     
    >     
    >     To prevent the race that you describe, vhost has this code:
    >     
    >                     /* OK, now we need to know about added descriptors. */
    >                     if (!headcount) {
    >                             if (unlikely(busyloop_intr)) {
    >                                     vhost_poll_queue(&vq->poll);
    >                             } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
    >                                     /* They have slipped one in as we were
    >                                      * doing that: check again. */
    >                                     vhost_disable_notify(&net->dev, vq);
    >                                     continue;
    >                             }
    >                             /* Nothing new?  Wait for eventfd to tell us
    >     
    >     So ring should be re-checked after notifications are enabled.
    >     If ring is no longer empty, vhost will proceed to handle the ring.  
    >     This code has been around for many years.
    >     
    >     Thus if VPP doesn't work with it then I suspect it does something
    >     differently, e.g. is it possible that it is missing a memory
    >     barrier after writing out the available buffers and before
    >     checking the flags?
    >     
    >     <SVL> Can the race condition be avoided totally? Here is the scenario.
    > 
    > t0: VPP refills the vring with new descriptors, not yet sets avail->idx to make it available to kernel vhost. 
    > t1: kernel vhost checks the vring, still sees no space available, but does not yet execute the line of code to clear the used->flags
    > t2: vpp executes sfence, and sets avail->idx to make it available to kernel vhost
    
    At this point you need a full memory barrier. E.g. mfence, or xor.
    Otherwise the read can get re-ordered before the write.
    
    > t3: vpp checks used->flags, it is still 1, vpp skips sending the interrupt
    > t4: kernel vhost clears used->flags to indicate interrupt is required. But VPP just missed it.
    
    fine but at this point kernel vhost will recheck using vhost_get_avail
    and process the buffers. So the lack of notification isn't a problem:
    
    bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;
            int r;
    
            if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
                    return false;
            vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
            if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
                    r = vhost_update_used_flags(vq);
                    if (r) {
                            vq_err(vq, "Failed to enable notification at %p: %d\n",
                                   &vq->used->flags, r);
                            return false;
                    }
            } else {
                    r = vhost_update_avail_event(vq, vq->avail_idx);
                    if (r) {
                            vq_err(vq, "Failed to update avail event index at %p: %d\n",
                                   vhost_avail_event(vq), r);
                            return false;
                    }
            }
            /* They could have slipped one in as we were doing that: make
             * sure it's written, then check again. */
            smp_mb();
            r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
            if (r) {
                    vq_err(vq, "Failed to check avail idx at %p: %d\n",
                           &vq->avail->idx, r);
                    return false;
            }
    
            return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
    }
    
    > The result is that kernel vhost gets stuck even though the vring is filled with new descriptors.
    
    Shouldn't be. Maybe you are missing a fence before reading used->flags?
    
    
    > Steven
    > 
    > </SVL>
    >     
    >     
    >     > Unfortunately, this missed notification causes the kernel vhost to
    >     > be stuck: once the kernel vhost is waiting for an interrupt/
    >     > notification from the guest, only such a notification can make it
    >     > resume and re-enable the guest to submit new packets to the vring.
    >     > This design seems vulnerable.
    >     
    >     The protocol right now relies on notifications never being lost.
    >     
    >     I can imagine an alternative where device also re-checks
    >     the rings periodically, but that would involve running
    >     timers host side which is only possible if there's a
    >     free processor that can handle them. Further,
    >     it will still lead to stalls and packet drops, which will
    >     be harder to debug.
    >     
    >     That's just my $.02, pls feel free to take it with the virtio tc
    >     maybe others will feel differently.
    >     
    >     > Should the kernel vhost
    >     > totally depend on the notification from the guest? Quoting the text from
    >     > 
    >     >  
    >     > 
    >     > http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
    >     > 
    >     >  
    >     > 
    >     > /* The device uses this in used->flags to advise the driver: don’t kick me 
    >     >  * when you add a buffer.  It’s unreliable, so it’s simply an 
    >     >  * optimization. */ 
    >     > #define VIRTQ_USED_F_NO_NOTIFY  1 
    >     > 
    >     >  
    >     > 
    >     > I interpret this as meaning the notification is simply an
    >     > optimization, not a reliable notification mechanism.
    >     
    >     What was meant I think is that suppression is unreliable.
    >     
    >     > So the kernel vhost should not bet the farm on it.
    >     > 
    >     >  
    >     > 
    >     > We encounter the same race condition in kernel vhost’s RX traffic direction
    >     > which causes the kernel vhost to be stuck due to missed interrupt/notification
    >     > although it is less frequent.
    >     
    >     
    >     For the reverse direction the spec actually has some pseudo-code and a suggestion
    >     about handling that race:
    >     
    >     	For optimal performance, a driver MAY disable used buffer
    >     	notifications while processing the used ring, but beware the
    >     	problem of missing notifications between emptying the ring and
    >     	reenabling notifications. This is usually handled by re-checking
    >     	for more used buffers after notifications are re-enabled:
    >     
    >     		virtq_disable_used_buffer_notifications(vq);
    >     		for (;;) {
    >     			if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) {
    >     				virtq_enable_used_buffer_notifications(vq);
    >     				mb();
    >     				if (vq->last_seen_used == le16_to_cpu(virtq->used.idx))
    >     					break;
    >     				virtq_disable_used_buffer_notifications(vq);
    >     			}
    >     			struct virtq_used_elem *e = &virtq.used->ring[vq->last_seen_used % vsz];
    >     			process_buffer(e);
    >     			vq->last_seen_used++;
    >     		}
    >     
    >     
    >     
    >     >  
    >     > 
    >     > Steven
    >     > 
    >     
    >     > _______________________________________________
    >     > Virtualization mailing list
    >     > Virtualization@lists.linux-foundation.org
    >     > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
    >     
    >     
    > 
    




Thread overview: 5+ messages
2018-12-13 23:24 kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue Steven Luong (sluong) via Virtualization
2018-12-17 22:55 ` Michael S. Tsirkin
2018-12-17 23:56   ` Steven Luong (sluong) via Virtualization
2018-12-18  0:15     ` Michael S. Tsirkin
2019-01-08 19:05       ` Steven Luong (sluong) via Virtualization
