* [PATCH] virtio_ring: Fix the stale index in available ring
@ 2024-03-14  7:49 Gavin Shan
  2024-03-14  8:05 ` Michael S. Tsirkin
                   ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-14  7:49 UTC (permalink / raw)
  To: virtualization; +Cc: linux-kernel, mst, jasowang, xuanzhuo, yihyu, shan.gavin

The issue was reported by Yihuang Yu, who ran 'netperf' tests on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in a VM hosted on the grace-hopper machine,
while the 'netperf' server runs on the grace-grace machine.

The VM is started with virtio-net and vhost enabled. We observe an
error message spew from the VM, followed by a soft-lockup report.
The error message indicates that the data associated with the
descriptor (index: 135) has been released and the queue is marked
as broken. This eventually leads to endless attempts to fetch a
free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
the soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning the writes to
the available ring element and the available index were reordered.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
  -accel kvm -machine virt,gic-version=host                            \
     :                                                                 \
  -netdev tap,id=vnet0,vhost=on                                        \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

  [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
stronger virtio_mb(false), equivalent to replacing the 'dmb'
instruction with 'dsb' on ARM64. It should work for other
architectures, but a performance loss is expected.
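
To illustrate, a simplified sketch of the ordering requirement (the
vhost side is shown schematically, not as its actual code):

  /* Guest: virtqueue_add_split() */
  avail->ring[avail_idx & mask] = head;   /* store A: ring element   */
  virtio_wmb(vq->weak_barriers);          /* A must be visible ...   */
  avail->idx = avail_idx + 1;             /* ... before store B: idx */

  /* Host: vhost worker, schematically */
  idx = avail->idx;                       /* observes store B ...    */
  head = avail->ring[(idx - 1) & mask];   /* ... must observe A too  */

The reported failure means vhost observed the new avail->idx (store B)
while still reading a stale ring element (store A).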

Cc: stable@vger.kernel.org
Reported-by: Yihuang Yu <yihyu@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 drivers/virtio/virtio_ring.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
 	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
 
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb(vq->weak_barriers);
+	/*
+	 * Descriptors and available array need to be set before we expose
+	 * the new available array entries. virtio_wmb() should be enough
+	 * to ensure the ordering theoretically. However, a stronger barrier
+	 * is needed by ARM64. Otherwise, stale data can be observed by
+	 * the host (vhost). A stronger barrier should work for other
+	 * architectures, but a performance loss is expected.
+	 */
+	virtio_mb(false);
 	vq->split.avail_idx_shadow++;
 	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
 						vq->split.avail_idx_shadow);
-- 
2.44.0



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14  7:49 [PATCH] virtio_ring: Fix the stale index in available ring Gavin Shan
@ 2024-03-14  8:05 ` Michael S. Tsirkin
  2024-03-14 10:15   ` Gavin Shan
  2024-03-18 16:59 ` Will Deacon
  2024-03-19  6:14 ` Michael S. Tsirkin
  2 siblings, 1 reply; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-14  8:05 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in a VM hosted on the grace-hopper machine,
> while the 'netperf' server runs on the grace-grace machine.
>
> The VM is started with virtio-net and vhost enabled. We observe an
> error message spew from the VM, followed by a soft-lockup report.
> The error message indicates that the data associated with the
> descriptor (index: 135) has been released and the queue is marked
> as broken. This eventually leads to endless attempts to fetch a
> free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
> the soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning the writes to
> the available ring element and the available index were reordered.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>   -accel kvm -machine virt,gic-version=host                            \
>      :                                                                 \
>   -netdev tap,id=vnet0,vhost=on                                        \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> stronger virtio_mb(false), equivalent to replacing the 'dmb'
> instruction with 'dsb' on ARM64. It should work for other
> architectures, but a performance loss is expected.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Yihuang Yu <yihyu@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  drivers/virtio/virtio_ring.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>  	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> -	/* Descriptors and available array need to be set before we expose the
> -	 * new available array entries. */
> -	virtio_wmb(vq->weak_barriers);
> +	/*
> +	 * Descriptors and available array need to be set before we expose
> +	 * the new available array entries. virtio_wmb() should be enough
> +	 * to ensure the ordering theoretically. However, a stronger barrier
> +	 * is needed by ARM64. Otherwise, stale data can be observed by
> +	 * the host (vhost). A stronger barrier should work for other
> +	 * architectures, but a performance loss is expected.
> +	 */
> +	virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?

>  	vq->split.avail_idx_shadow++;
>  	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>  						vq->split.avail_idx_shadow);
> -- 
> 2.44.0



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14  8:05 ` Michael S. Tsirkin
@ 2024-03-14 10:15   ` Gavin Shan
  2024-03-14 11:50     ` Michael S. Tsirkin
  0 siblings, 1 reply; 78+ messages in thread
From: Gavin Shan @ 2024-03-14 10:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On 3/14/24 18:05, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>> client is started in a VM hosted on the grace-hopper machine,
>> while the 'netperf' server runs on the grace-grace machine.
>>
>> The VM is started with virtio-net and vhost enabled. We observe an
>> error message spew from the VM, followed by a soft-lockup report.
>> The error message indicates that the data associated with the
>> descriptor (index: 135) has been released and the queue is marked
>> as broken. This eventually leads to endless attempts to fetch a
>> free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
>> the soft-lockup. The stale index 135 is fetched from the available
>> ring and published to the used ring by vhost, meaning the writes to
>> the available ring element and the available index were reordered.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>>    -accel kvm -machine virt,gic-version=host                            \
>>       :                                                                 \
>>    -netdev tap,id=vnet0,vhost=on                                        \
>>    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>
>>    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>
>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
>> stronger virtio_mb(false), equivalent to replacing the 'dmb'
>> instruction with 'dsb' on ARM64. It should work for other
>> architectures, but a performance loss is expected.
>>
>> Cc: stable@vger.kernel.org
>> Reported-by: Yihuang Yu <yihyu@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   drivers/virtio/virtio_ring.c | 12 +++++++++---
>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 49299b1f9ec7..7d852811c912 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>   
>> -	/* Descriptors and available array need to be set before we expose the
>> -	 * new available array entries. */
>> -	virtio_wmb(vq->weak_barriers);
>> +	/*
>> +	 * Descriptors and available array need to be set before we expose
>> +	 * the new available array entries. virtio_wmb() should be enough
>> +	 * to ensure the ordering theoretically. However, a stronger barrier
>> +	 * is needed by ARM64. Otherwise, stale data can be observed by
>> +	 * the host (vhost). A stronger barrier should work for other
>> +	 * architectures, but a performance loss is expected.
>> +	 */
>> +	virtio_mb(false);
> 
> 
> I don't get what is going on here. Any explanation why virtio_wmb is not
> enough besides "it does not work"?
> 

The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
barrier than "dmb" because "dsb" ensures that all memory accesses issued
before it have completed when the 'dsb' instruction completes. However,
"dmb" doesn't guarantee the order of completion of the memory accesses.

So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
can complete before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
The stronger barrier 'dsb' ensures the completion order we expect.

     virtio_wmb(true)         virtio_mb(false)
       virt_wmb                 mb
         __smp_wmb               __mb
           dmb(ishst)              dsb(sy)

Extracted from the ARMv9 specification
======================================
The DMB instruction is a memory barrier instruction that ensures the relative
order of memory accesses before the barrier with memory accesses after the
barrier. The DMB instruction _does not_ ensure the completion of any of the
memory accesses for which it ensures relative order.

A DSB instruction is a memory barrier that ensures that memory accesses that
occur before the DSB instruction have __completed__ before the completion of
the DSB instruction. In doing this, it acts as a stronger barrier than a DMB
and all ordering that is created by a DMB with specific options is also generated
by a DSB with the same options.
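
For reference, a minimal sketch of the two variants on ARM64 (the
instructions in the comments are what the kernel macros expand to here;
only the barrier differs):

	/* current code: ordering only */
	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
	virtio_wmb(vq->weak_barriers);          /* dmb ishst */
	vq->split.avail_idx_shadow++;
	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
						vq->split.avail_idx_shadow);

	/* patched code: ordering plus completion */
	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
	virtio_mb(false);                       /* dsb sy */
	vq->split.avail_idx_shadow++;
	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
						vq->split.avail_idx_shadow);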

>>   	vq->split.avail_idx_shadow++;
>>   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>   						vq->split.avail_idx_shadow);
>> -- 
>> 2.44.0
> 

Thanks,
Gavin



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14 10:15   ` Gavin Shan
@ 2024-03-14 11:50     ` Michael S. Tsirkin
  2024-03-14 12:50       ` Gavin Shan
  0 siblings, 1 reply; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-14 11:50 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in a VM hosted on the grace-hopper machine,
> > > while the 'netperf' server runs on the grace-grace machine.
> > >
> > > The VM is started with virtio-net and vhost enabled. We observe an
> > > error message spew from the VM, followed by a soft-lockup report.
> > > The error message indicates that the data associated with the
> > > descriptor (index: 135) has been released and the queue is marked
> > > as broken. This eventually leads to endless attempts to fetch a
> > > free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
> > > the soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning the writes to
> > > the available ring element and the available index were reordered.
> > > 
> > >    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > >    -accel kvm -machine virt,gic-version=host                            \
> > >       :                                                                 \
> > >    -netdev tap,id=vnet0,vhost=on                                        \
> > >    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > > instruction with 'dsb' on ARM64. It should work for other
> > > architectures, but a performance loss is expected.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +++++++++---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > -	/* Descriptors and available array need to be set before we expose the
> > > -	 * new available array entries. */
> > > -	virtio_wmb(vq->weak_barriers);
> > > +	/*
> > > +	 * Descriptors and available array need to be set before we expose
> > > +	 * the new available array entries. virtio_wmb() should be enough
> > > +	 * to ensure the ordering theoretically. However, a stronger barrier
> > > +	 * is needed by ARM64. Otherwise, stale data can be observed by
> > > +	 * the host (vhost). A stronger barrier should work for other
> > > +	 * architectures, but a performance loss is expected.
> > > +	 */
> > > +	virtio_mb(false);
> > 
> > 
> > I don't get what is going on here. Any explanation why virtio_wmb is not
> > enough besides "it does not work"?
> > 
> 
> The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
> barrier than "dmb" because "dsb" ensures that all memory accesses issued
> before it have completed when the 'dsb' instruction completes. However,
> "dmb" doesn't guarantee the order of completion of the memory accesses.
>
> So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> can complete before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if a CPU observes a new value of idx, we want it to see the
new value in the ring.
This is the standard use of smp_wmb().
How are these 2 writes different?

What DMB does, it seems, is ensure that the effects of
'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
are observed after the effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
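
For the record, this is the classic message-passing pattern (a sketch
with made-up variable names, not the virtio code itself):

	/* writer (guest) */
	WRITE_ONCE(data, x);
	smp_wmb();
	WRITE_ONCE(flag, 1);

	/* reader (vhost) */
	r1 = READ_ONCE(flag);
	smp_rmb();
	r2 = READ_ONCE(data);

	/* if r1 == 1, then r2 must be x */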




> The stronger barrier 'dsb' ensures the completion order we expect.
> 
>     virtio_wmb(true)         virtio_mb(false)
>       virt_wmb                 mb
>         __smp_wmb               __mb
>           dmb(ishst)              dsb(sy)

First, why would you want a non-smp barrier when you are
synchronizing with another CPU? This is what smp_ means ...


> Extracted from the ARMv9 specification
> ======================================
> The DMB instruction is a memory barrier instruction that ensures the relative
> order of memory accesses before the barrier with memory accesses after the
> barrier. The DMB instruction _does not_ ensure the completion of any of the
> memory accesses for which it ensures relative order.

Isn't this exactly what we need?

> A DSB instruction is a memory barrier that ensures that memory accesses that
> occur before the DSB instruction have __completed__ before the completion of
> the DSB instruction.

This seems appropriate for when you want to order things more
strongly. I do not get why it's necessary here.

> In doing this, it acts as a stronger barrier than a DMB
> and all ordering that is created by a DMB with specific options is also generated
> by a DSB with the same options.
> 
> > >   	vq->split.avail_idx_shadow++;
> > >   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   						vq->split.avail_idx_shadow);
> > > -- 
> > > 2.44.0
> > 
> 
> Thanks,
> Gavin



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14 11:50     ` Michael S. Tsirkin
@ 2024-03-14 12:50       ` Gavin Shan
  2024-03-14 12:59         ` Michael S. Tsirkin
  0 siblings, 1 reply; 78+ messages in thread
From: Gavin Shan @ 2024-03-14 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On 3/14/24 21:50, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
>> On 3/14/24 18:05, Michael S. Tsirkin wrote:
>>> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>>>> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
>>>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>>>> client is started in a VM hosted on the grace-hopper machine,
>>>> while the 'netperf' server runs on the grace-grace machine.
>>>>
>>>> The VM is started with virtio-net and vhost enabled. We observe an
>>>> error message spew from the VM, followed by a soft-lockup report.
>>>> The error message indicates that the data associated with the
>>>> descriptor (index: 135) has been released and the queue is marked
>>>> as broken. This eventually leads to endless attempts to fetch a
>>>> free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
>>>> the soft-lockup. The stale index 135 is fetched from the available
>>>> ring and published to the used ring by vhost, meaning the writes to
>>>> the available ring element and the available index were reordered.
>>>>
>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>>>>     -accel kvm -machine virt,gic-version=host                            \
>>>>        :                                                                 \
>>>>     -netdev tap,id=vnet0,vhost=on                                        \
>>>>     -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>>>
>>>>     [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>>>
>>>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
>>>> stronger virtio_mb(false), equivalent to replacing the 'dmb'
>>>> instruction with 'dsb' on ARM64. It should work for other
>>>> architectures, but a performance loss is expected.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Reported-by: Yihuang Yu <yihyu@redhat.com>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>>    1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> -	/* Descriptors and available array need to be set before we expose the
>>>> -	 * new available array entries. */
>>>> -	virtio_wmb(vq->weak_barriers);
>>>> +	/*
>>>> +	 * Descriptors and available array need to be set before we expose
>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>> +	 * to ensure the ordering theoretically. However, a stronger barrier
>>>> +	 * is needed by ARM64. Otherwise, stale data can be observed by
>>>> +	 * the host (vhost). A stronger barrier should work for other
>>>> +	 * architectures, but a performance loss is expected.
>>>> +	 */
>>>> +	virtio_mb(false);
>>>
>>>
>>> I don't get what is going on here. Any explanation why virtio_wmb is not
>>> enough besides "it does not work"?
>>>
>>
>> The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
>> barrier than "dmb" because "dsb" ensures that all memory accesses issued
>> before it have completed when the 'dsb' instruction completes. However,
>> "dmb" doesn't guarantee the order of completion of the memory accesses.
>>
>> So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
>> can complete before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> 
> Completed as observed by which CPU?
> We have 2 writes that we want observed by another CPU in order.
> So if a CPU observes a new value of idx, we want it to see the
> new value in the ring.
> This is the standard use of smp_wmb().
> How are these 2 writes different?
> 
> What DMB does, it seems, is ensure that the effects of
> 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> are observed after the effects of
> 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> 
>

Completed as observed by the CPU where the vhost worker is running. I don't
think DMB does the work here. If I'm understanding correctly, DMB ensures
the order of these two writes from the local CPU's standpoint. The written
data can be stored in the local CPU's cache, not yet flushed to DRAM and
propagated to the cache of the far CPU where the vhost worker is running.
So DMB doesn't ensure the written data is observed by the far CPU.

DSB ensures that the written data is observable from the far CPU immediately.
  
> 
> 
>> The stronger barrier 'dsb' ensures the completion order we expect.
>>
>>      virtio_wmb(true)         virtio_mb(false)
>>        virt_wmb                 mb
>>          __smp_wmb               __mb
>>            dmb(ishst)              dsb(sy)
> 
> First, why would you want a non-smp barrier when you are
> synchronizing with another CPU? This is what smp_ means ...
> 
> 
>> Extracted from the ARMv9 specification
>> ======================================
>> The DMB instruction is a memory barrier instruction that ensures the relative
>> order of memory accesses before the barrier with memory accesses after the
>> barrier. The DMB instruction _does not_ ensure the completion of any of the
>> memory accesses for which it ensures relative order.
> 
> Isn't this exactly what we need?
> 

I don't think so. DMB guarantees the order of EXECUTION, but DSB guarantees
the order of COMPLETION. After the data store is executed, the written data
can sit in the local CPU's cache, far from completion. Only when the
instruction completes is the written data synchronized to the far CPU where
the vhost worker is running.

Here is another paper talking about the difference between DMB and DSB. It
explains the difference between DMB/DSB better than the ARMv9 specification.
However, it's hard to understand how DMB/DSB work at the hardware level
based on the specification and the paper alone.

https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
(section 2.2  Order-preserving Options)
(I'm extracting the relevant part as below)

Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
Instead of waiting for previous accesses to become observable in the specified domain,
DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
not block any non-memory access operations.

Data Synchronization Barrier (DSB) prevents reordering of any instructions across
the barrier. DSB will make sure that all masters in the specified domain can observe
the previous operations before issuing any subsequent instructions.

>> A DSB instruction is a memory barrier that ensures that memory accesses that
>> occur before the DSB instruction have __completed__ before the completion of
>> the DSB instruction.
> 
> This seems appropriate for when you want to order things more
> strongly. I do not get why it's necessary here.
> 

Please refer to above explanations.

>> In doing this, it acts as a stronger barrier than a DMB
>> and all ordering that is created by a DMB with specific options is also generated
>> by a DSB with the same options.
>>
>>>>    	vq->split.avail_idx_shadow++;
>>>>    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>    						vq->split.avail_idx_shadow);
>>>> -- 
>>>> 2.44.0
>>>

Thanks,
Gavin



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14 12:50       ` Gavin Shan
@ 2024-03-14 12:59         ` Michael S. Tsirkin
  2024-03-15 10:45             ` Gavin Shan
  0 siblings, 1 reply; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-14 12:59 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > client is started in a VM hosted on the grace-hopper machine,
> > > > > while the 'netperf' server runs on the grace-grace machine.
> > > > >
> > > > > The VM is started with virtio-net and vhost enabled. We observe an
> > > > > error message spew from the VM, followed by a soft-lockup report.
> > > > > The error message indicates that the data associated with the
> > > > > descriptor (index: 135) has been released and the queue is marked
> > > > > as broken. This eventually leads to endless attempts to fetch a
> > > > > free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
> > > > > the soft-lockup. The stale index 135 is fetched from the available
> > > > > ring and published to the used ring by vhost, meaning the writes to
> > > > > the available ring element and the available index were reordered.
> > > > > 
> > > > >     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > > > >     -accel kvm -machine virt,gic-version=host                            \
> > > > >        :                                                                 \
> > > > >     -netdev tap,id=vnet0,vhost=on                                        \
> > > > >     -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > 
> > > > >     [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > 
> > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > > > > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > > > > instruction with 'dsb' on ARM64. It should work for other
> > > > > architectures, but a performance loss is expected.
> > > > > 
> > > > > Cc: stable@vger.kernel.org
> > > > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > ---
> > > > >    drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > >    1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > -	 * new available array entries. */
> > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > +	/*
> > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > +	 * to ensure the ordering theoretically. However, a stronger barrier
> > > > > +	 * is needed by ARM64. Otherwise, stale data can be observed by
> > > > > +	 * the host (vhost). A stronger barrier should work for other
> > > > > +	 * architectures, but a performance loss is expected.
> > > > > +	 */
> > > > > +	virtio_mb(false);
> > > > 
> > > > 
> > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > enough besides "it does not work"?
> > > > 
> > > 
> > > The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
> > > barrier than "dmb" because "dsb" ensures that all memory accesses issued
> > > before it have completed when the 'dsb' instruction completes. However,
> > > "dmb" doesn't guarantee the order of completion of the memory accesses.
> > >
> > > So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> > > can complete before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > 
> > Completed as observed by which CPU?
> > We have 2 writes that we want observed by another CPU in order.
> > So if a CPU observes a new value of idx, we want it to see the
> > new value in the ring.
> > This is the standard use of smp_wmb().
> > How are these 2 writes different?
> > 
> > What DMB does, it seems, is ensure that the effects of
> > 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> > are observed after the effects of
> > 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > 
> > 
> 
> Completed as observed by the CPU where the vhost worker is running. I don't
> think DMB does the work here. If I'm understanding correctly, DMB ensures
> the order of these two writes from the local CPU's standpoint.

No, this makes no sense at all. All memory accesses are in order from
the local CPU's standpoint.

> The written data can be stored in the local
> CPU's cache, not yet flushed to DRAM and propagated to the cache of the far
> CPU where the vhost worker is running. So DMB doesn't ensure the written
> data is observed by the far CPU.

No, but it has to ensure *ordering*. So if the index is observed then the
ring is observed too, because there is a MB there when reading.
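
Schematically, the read side is expected to pair like this (a sketch;
vhost's actual code may use different helpers):

	idx = READ_ONCE(avail->idx);
	smp_rmb();                /* pairs with the writer's smp_wmb() */
	head = READ_ONCE(avail->ring[(idx - 1) & mask]);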


> DSB ensures that the written data is observable from the far CPU immediately.




> > 
> > 
> > > The stronger barrier 'dsb' ensures the completion order we expect.
> > > 
> > >      virtio_wmb(true)         virtio_mb(false)
> > >        virt_wmb                 mb
> > >          __smp_wmb               __mb
> > >            dmb(ishst)              dsb(sy)
> > 
> > First, why would you want a non-smp barrier when you are
> > synchronizing with another CPU? This is what smp_ means ...
> > 
> > 
> > > Extracted from the ARMv9 specification
> > > ======================================
> > > The DMB instruction is a memory barrier instruction that ensures the relative
> > > order of memory accesses before the barrier with memory accesses after the
> > > barrier. The DMB instruction _does not_ ensure the completion of any of the
> > > memory accesses for which it ensures relative order.
> > 
> > Isn't this exactly what we need?
> > 
> 
> I don't think so. DMB guarantees the order of EXECUTION, but DSB guarantees
> the order of COMPLETION. After the data store is executed, the written data
> can sit in the local CPU's cache, far from completion. Only when the
> instruction completes is the written data synchronized to the far CPU where
> the vhost worker is running.
> 
> Here is another paper talking about the difference between DMB and DSB. It
> explains the difference between DMB/DSB better than the ARMv9 specification.
> However, it's hard to understand how DMB/DSB work at the hardware level
> based on the specification and the paper alone.
> 
> https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
> (section 2.2  Order-preserving Options)
> (I'm extracting the relevant part as below)
> 
> Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
> Instead of waiting for previous accesses to become observable in the specified domain,
> DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
> not block any non-memory access operations.
> 
> Data Synchronization Barrier (DSB) prevents reordering of any instructions across
> the barrier. DSB will make sure that all masters in the specified domain can observe
> the previous operations before issuing any subsequent instructions.
> 

What you are describing is that smp_wmb is buggy on this platform.
Possible, but are you sure?
Write a small test and check.
Leave virtio alone.


> > > A DSB instruction is a memory barrier that ensures that memory accesses that
> > > occur before the DSB instruction have __completed__ before the completion of
> > > the DSB instruction.
> > 
> > This seems appropriate for when you want to order things more
> > strongly. I do not get why it's necessary here.
> > 
> 
> Please refer to above explanations.
> 
> > > In doing this, it acts as a stronger barrier than a DMB
> > > and all ordering that is created by a DMB with specific options is also generated
> > > by a DSB with the same options.
> > > 
> > > > >    	vq->split.avail_idx_shadow++;
> > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >    						vq->split.avail_idx_shadow);
> > > > > -- 
> > > > > 2.44.0
> > > > 
> 
> Thanks,
> Gavin



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14 12:59         ` Michael S. Tsirkin
@ 2024-03-15 10:45             ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-15 10:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs


+ Will, Catalin and Matt from Nvidia

On 3/14/24 22:59, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
>> On 3/14/24 21:50, Michael S. Tsirkin wrote:
>>> On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
>>>> On 3/14/24 18:05, Michael S. Tsirkin wrote:
>>>>> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>>>>>> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
>>>>>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>>>>>> client is started in a VM hosted on the grace-hopper machine,
>>>>>> while the 'netperf' server runs on the grace-grace machine.
>>>>>>
>>>>>> The VM is started with virtio-net and vhost enabled. We observe an
>>>>>> error message spew from the VM, followed by a soft-lockup report.
>>>>>> The error message indicates that the data associated with the
>>>>>> descriptor (index: 135) has been released and the queue is marked
>>>>>> as broken. This eventually leads to endless attempts to fetch a
>>>>>> free buffer (skb) in drivers/net/virtio_net.c::start_xmit() and
>>>>>> the soft-lockup. The stale index 135 is fetched from the available
>>>>>> ring and published to the used ring by vhost, meaning the writes to
>>>>>> the available ring element and the available index were reordered.
>>>>>>
>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>>>>>>      -accel kvm -machine virt,gic-version=host                            \
>>>>>>         :                                                                 \
>>>>>>      -netdev tap,id=vnet0,vhost=on                                        \
>>>>>>      -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>>>>>
>>>>>>      [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>>>>>
>>>>>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
>>>>>> stronger virtio_mb(false), equivalent to replacing the 'dmb'
>>>>>> instruction with 'dsb' on ARM64. It should work for other
>>>>>> architectures, but a performance loss is expected.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Reported-by: Yihuang Yu <yihyu@redhat.com>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>> ---
>>>>>>     drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>>>>     1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>     	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>     	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>> -	/* Descriptors and available array need to be set before we expose the
>>>>>> -	 * new available array entries. */
>>>>>> -	virtio_wmb(vq->weak_barriers);
>>>>>> +	/*
>>>>>> +	 * Descriptors and available array need to be set before we expose
>>>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>>>> +	 * to ensure the ordering theoretically. However, a stronger barrier
>>>>>> +	 * is needed by ARM64. Otherwise, stale data can be observed by
>>>>>> +	 * the host (vhost). A stronger barrier should work for other
>>>>>> +	 * architectures, but a performance loss is expected.
>>>>>> +	 */
>>>>>> +	virtio_mb(false);
>>>>>
>>>>>
>>>>> I don't get what is going on here. Any explanation why virtio_wmb is not
>>>>> enough besides "it does not work"?
>>>>>
>>>>
>>>> The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
>>>> barrier than "dmb" because "dsb" ensures that all memory accesses issued
>>>> before it have completed when the 'dsb' instruction completes. However,
>>>> "dmb" doesn't guarantee the order of completion of the memory accesses.
>>>>
>>>> So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
>>>> can complete before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>>>
>>> Completed as observed by which CPU?
>>> We have 2 writes that we want observed by another CPU in order.
>>> So if a CPU observes a new value of idx, we want it to see the
>>> new value in the ring.
>>> This is the standard use of smp_wmb().
>>> How are these 2 writes different?
>>>
>>> What DMB does, it seems, is ensure that the effects of
>>> 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
>>> are observed after the effects of
>>> 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>>>
>>>
>>
>> Completed as observed by the CPU where the vhost worker is running. I don't
>> think DMB does the work here. If I'm understanding correctly, DMB ensures
>> the order of these two writes from the local CPU's standpoint.
> 
> No, this makes no sense at all. All memory accesses are in order from
> the local CPU's standpoint.
> 

It's true if the compiler doesn't reorder the accesses and a light-weight
barrier like 'dmb' or 'isb' is used. Otherwise, the accesses can still be
reordered on the local CPU, correct?

>> The written data can be stored in the local
>> CPU's cache, not yet flushed to DRAM and propagated to the cache of the far
>> CPU where the vhost worker is running. So DMB doesn't ensure the written
>> data is observed by the far CPU.
> 
> No, but it has to ensure *ordering*. So if the index is observed then the
> ring is observed too, because there is a MB there when reading.
> 

Right, it should work that way theoretically. I don't know the difference
between 'dmb' and 'dsb' at the hardware level, and the description in the
ARMv9 spec is also ambiguous. Let's see if Matt has comments. Besides, this
issue is only reproducible on NVidia's grace-hopper; we don't hit the issue
on Ampere's CPUs.

> 
>> DSB ensures that the written data is observable from the far CPU immediately.
>  
>>>
>>>
>>>> The stronger barrier 'dsb' ensures the completion order we expect.
>>>>
>>>>       virtio_wmb(true)         virtio_mb(false)
>>>>         virt_wmb                 mb
>>>>           __smp_wmb               __mb
>>>>             dmb(ishst)              dsb(sy)
>>>
>>> First, why would you want a non-smp barrier when you are
>>> synchronizing with another CPU? This is what smp_ means ...
>>>
>>>
>>>> Extracted from the ARMv9 specification
>>>> ======================================
>>>> The DMB instruction is a memory barrier instruction that ensures the relative
>>>> order of memory accesses before the barrier with memory accesses after the
>>>> barrier. The DMB instruction _does not_ ensure the completion of any of the
>>>> memory accesses for which it ensures relative order.
>>>
>>> Isn't this exactly what we need?
>>>
>>
>> I don't think so. DMB guarantees the order of EXECUTION, but DSB guarantees
>> the order of COMPLETION. After the data store is executed, the written data
>> can sit in the local CPU's cache, far from completion. Only when the
>> instruction completes is the written data synchronized to the far CPU where
>> the vhost worker is running.
>>
>> Here is another paper talking about the difference between DMB and DSB. It
>> explains the difference between DMB/DSB better than the ARMv9 specification.
>> However, it's hard to understand how DMB/DSB work at the hardware level
>> based on the specification and the paper alone.
>>
>> https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
>> (section 2.2  Order-preserving Options)
>> (I'm extracting the relevant part as below)
>>
>> Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
>> Instead of waiting for previous accesses to become observable in the specified domain,
>> DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
>> not block any non-memory access operations.
>>
>> Data Synchronization Barrier (DSB) prevents reordering of any instructions across
>> the barrier. DSB will make sure that all masters in the specified domain can observe
>> the previous operations before issuing any subsequent instructions.
>>
> 
> What you are describing is that smp_wmb is buggy on this platform.
> Possible, but are you sure?
> Write a small test and check.
> Leave virtio alone.
> 

Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I
tried to reproduce it with my own driver, where one thread writes to a shared
buffer and another thread reads from it. I haven't hit the out-of-order issue
so far. My driver may be incorrect somewhere, and I will update if I can
reproduce the issue with it in the future.
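
For reference, the shape of the test is roughly this (a minimal sketch,
not the actual driver; thread setup, CPU pinning and error handling are
omitted):

	static u32 data, flag;

	static int writer_fn(void *unused)
	{
		u32 i = 0;

		while (!kthread_should_stop()) {
			WRITE_ONCE(data, ++i);
			smp_wmb();              /* dmb ishst on ARM64 */
			WRITE_ONCE(flag, i);
		}
		return 0;
	}

	static int reader_fn(void *unused)
	{
		u32 f, d;

		while (!kthread_should_stop()) {
			f = READ_ONCE(flag);
			smp_rmb();
			d = READ_ONCE(data);
			/* with correct ordering, data never lags flag */
			WARN_ONCE((s32)(f - d) > 0,
				  "stale data: flag=%u data=%u\n", f, d);
		}
		return 0;
	}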

> 
>>>> A DSB instruction is a memory barrier that ensures that memory accesses that
>>>> occur before the DSB instruction have __completed__ before the completion of
>>>> the DSB instruction.
>>>
>>> This seems appropriate for when you want to order things more
>>> strongly. I do not get why it's necessary here.
>>>
>>
>> Please refer to above explanations.
>>
>>>> In doing this, it acts as a stronger barrier than a DMB
>>>> and all ordering that is created by a DMB with specific options is also generated
>>>> by a DSB with the same options.
>>>>
>>>>>>     	vq->split.avail_idx_shadow++;
>>>>>>     	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>     						vq->split.avail_idx_shadow);
>>>>>> -- 
>>>>>> 2.44.0
>>>>>

Thanks,
Gavin



* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-15 10:45             ` Gavin Shan
@ 2024-03-15 11:05               ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-15 11:05 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs

On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> 
> + Will, Catalin and Matt from Nvidia
> 
> On 3/14/24 22:59, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> > > On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > > > 
> > > > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > > > We observe a error message spew from VM and then soft-lockup
> > > > > > > report. The error message indicates the data associated with
> > > > > > > the descriptor (index: 135) has been released, and the queue
> > > > > > > is marked as broken. It eventually leads to the endless effort
> > > > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > > > ring and published to the used ring by vhost, meaning we have
> > > > > > > a disordered write to the available ring element and available index.
> > > > > > > 
> > > > > > >      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > > > > > >      -accel kvm -machine virt,gic-version=host                            \
> > > > > > >         :                                                                 \
> > > > > > >      -netdev tap,id=vnet0,vhost=on                                        \
> > > > > > >      -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > > > 
> > > > > > >      [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > > > 
> > > > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > > > > > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > > > > > ARM64. It should work for other architectures, but performance loss is
> > > > > > > expected.
> > > > > > > 
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > > > ---
> > > > > > >     drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > > > >     1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > >     	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > >     	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > > -	 * new available array entries. */
> > > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > > +	/*
> > > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > > +	 * to ensure the order theoretically. However, a stronger barrier
> > > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > > +	 * architectures, but performance loss is expected.
> > > > > > > +	 */
> > > > > > > +	virtio_mb(false);
> > > > > > 
> > > > > > 
> > > > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > > > enough besides "it does not work"?
> > > > > > 
> > > > > 
> > > > > The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger barrier
> > > > > than "dmb" because "dsb" ensures that all memory accesses issued before this
> > > > > instruction have completed when the 'dsb' instruction completes. However, "dmb"
> > > > > doesn't guarantee the order of completion of the memory accesses.
> > > > > 
> > > > > So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> > > > > can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > > > 
> > > > Completed as observed by which CPU?
> > > > We have 2 writes that we want observed by another CPU in order.
> > > > So if CPU observes a new value of idx we want it to see
> > > > new value in ring.
> > > > This is standard use of smp_wmb()
> > > > How are these 2 writes different?
> > > > 
> > > > What DMB does, it seems, is ensure that the effects of
> > > > 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
> > > > are observed after the effects of
> > > > 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > > > 
> > > > 
> > > 
> > > Completed as observed by the CPU where the vhost worker is running. I don't think DMB
> > > does the work here. If I'm understanding correctly, DMB ensures the order of these
> > > two writes from the local CPU's standpoint.
> > 
> > No this makes no sense at all. All memory accesses are in order from
> > local CPU standpoint.
> > 
> 
> It's true if the compiler doesn't reorder the accesses and a light-weight
> barrier like 'dmb' or 'isb' is used. Otherwise, the accesses can still be
> reordered on the local CPU, correct?

Compiler issues aside (you can look at binary and see if they got
reordered) no, the local CPU always observes all writes in the
same way they were initiated, and on all architectures.

> > > The written data can be stored in the local
> > > CPU's cache, not flushed to DRAM and propagated to the cache of the far CPU where
> > > the vhost worker is running. So DMB doesn't ensure the written data is observed by
> > > the far CPU.
> > 
> > No, but it has to ensure *ordering*. So if the index is observed then the ring
> > is observed too, because there is an MB there when reading.
> > 
> 
> Right. It should work that way theoretically. I don't know the difference
> between 'dmb' and 'dsb' at the hardware level. The description in the ARMv9 spec
> is also ambiguous. Let's see if Matt has comments. Besides, this issue is
> only reproducible on NVidia's grace-hopper. We don't hit the issue on Ampere's
> CPUs.
> 
> > 
> > > DSB ensures that the written data is observable by the far CPU immediately.
> > > > 
> > > > 
> > > > > The stronger barrier 'dsb' ensures the completion order as we expected.
> > > > > 
> > > > >       virtio_wmb(true)         virtio_mb(false)
> > > > >         virt_wmb                 mb
> > > > >           __smp_wmb               __mb
> > > > >             dmb(ishst)              dsb(sy)
> > > > 
> > > > First, why would you want a non-smp barrier when you are
> > > > synchronizing with another CPU? This is what smp_ means ...
> > > > 
> > > > 
> > > > > Extracted from the ARMv9 specification
> > > > > ================================
> > > > > The DMB instruction is a memory barrier instruction that ensures the relative
> > > > > order of memory accesses before the barrier with memory accesses after the
> > > > > barrier. The DMB instruction _does not_ ensure the completion of any of the
> > > > > memory accesses for which it ensures relative order.
> > > > 
> > > > Isn't this exactly what we need?
> > > > 
> > > 
> > > I don't think so. DMB guarantees the order of EXECUTION, but DSB guarantees the
> > > order of COMPLETION. After the data store is executed, the written data can
> > > be stored in the local CPU's cache, far from completion. Only when the instruction
> > > has completed is the written data synchronized to the far CPU where the vhost
> > > worker is running.
> > > 
> > > Here is another paper discussing the difference between DMB and DSB. It
> > > explains the difference better than the ARMv9 specification does. However,
> > > it's still hard to understand how DMB/DSB work at the hardware level based
> > > on the specification and the paper.
> > > 
> > > https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
> > > (section 2.2  Order-preserving Options)
> > > (I'm extracting the relevant part as below)
> > > 
> > > Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
> > > Instead of waiting for previous accesses to become observable in the specified domain,
> > > DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
> > > not block any non-memory access operations.
> > > 
> > > Data Synchronization Barrier (DSB) prevents reordering of any instructions across
> > > the barrier. DSB will make sure that all masters in the specified domain can observe
> > > the previous operations before issuing any subsequent instructions.
> > > 
> > 
> > What you are describing is that smp_wmb is buggy on this platform.
> > Possible, but are you sure?
> > Write a small test and check.
> > Leave virtio alone.
> > 
> 
> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> to reproduce it with my own driver where one thread writes to the shared buffer
> and another thread reads from the buffer. I don't hit the out-of-order issue so
> far.

Make sure the 2 areas you are accessing are in different cache lines.


> My driver may not be correct somewhere and I will update if I can reproduce
> the issue with my driver in the future.

Then maybe your change is just making virtio slower and masking the bug
that is actually elsewhere?

You don't really need a driver. Here's a simple test: without barriers
assertion will fail. With barriers it will not.
(Warning: didn't bother testing too much, could be buggy.)

---

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define FIRST values[0]
#define SECOND values[64]
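// FIRST and SECOND are 256 bytes apart, so they sit on different
// cache lines (assuming 64-byte cache lines).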

volatile int values[100] = {};

void* writer_thread(void* arg) {
	while (1) {
	FIRST++;
	// NEED smp_wmb here
	SECOND++;
	}
}

void* reader_thread(void* arg) {
    while (1) {
	int first = FIRST;
	// NEED smp_rmb here
	int second = SECOND;
	assert(first - second == 1 || first - second == 0);
    }
}

int main() {
    pthread_t writer, reader;

    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);

    pthread_join(writer, NULL);
    pthread_join(reader, NULL);

    return 0;
}
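
One way to build and run it, assuming it is saved as test.c:

  gcc -O2 -pthread test.c -o test && ./test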

-- 
MST


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-15 11:05               ` Michael S. Tsirkin
@ 2024-03-15 11:24                 ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-15 11:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs


On 3/15/24 21:05, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
>> to reproduce it with my own driver where one thread writes to the shared buffer
>> and another thread reads from the buffer. I don't hit the out-of-order issue so
>> far.
> 
> Make sure the 2 areas you are accessing are in different cache lines.
> 

Yes, I already put those 2 areas in separate cache lines.

> 
>> My driver may not be correct somewhere and I will update if I can reproduce
>> the issue with my driver in the future.
> 
> Then maybe your change is just making virtio slower and masking the bug
> that is actually elsewhere?
> 
> You don't really need a driver. Here's a simple test: without barriers
> assertion will fail. With barriers it will not.
> (Warning: didn't bother testing too much, could be buggy.)
> 
> ---
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> 
> #define FIRST values[0]
> #define SECOND values[64]
> 
> volatile int values[100] = {};
> 
> void* writer_thread(void* arg) {
> 	while (1) {
> 	FIRST++;
> 	// NEED smp_wmb here
         __asm__ volatile("dmb ishst" : : : "memory");
> 	SECOND++;
> 	}
> }
> 
> void* reader_thread(void* arg) {
>      while (1) {
> 	int first = FIRST;
> 	// NEED smp_rmb here
         __asm__ volatile("dmb ishld" : : : "memory");
> 	int second = SECOND;
> 	assert(first - second == 1 || first - second == 0);
>      }
> }
> 
> int main() {
>      pthread_t writer, reader;
> 
>      pthread_create(&writer, NULL, writer_thread, NULL);
>      pthread_create(&reader, NULL, reader_thread, NULL);
> 
>      pthread_join(writer, NULL);
>      pthread_join(reader, NULL);
> 
>      return 0;
> }
> 

Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
the assert on both of them. After replacing 'dmb' with 'dsb', I can still
hit the assert on both of them. I need to look at the code more closely.

[root@virt-mtcollins-02 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
Aborted (core dumped)

[root@nvidia-grace-hopper-05 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
Aborted (core dumped)

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-15 11:24                 ` Gavin Shan
@ 2024-03-17 16:50                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-17 16:50 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs

On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> 
> On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> > > to reproduce it with my own driver where one thread writes to the shared buffer
> > > and another thread reads from the buffer. I don't hit the out-of-order issue so
> > > far.
> > 
> > Make sure the 2 areas you are accessing are in different cache lines.
> > 
> 
> Yes, I already put those 2 areas in separate cache lines.
> 
> > 
> > > My driver may not be correct somewhere and I will update if I can reproduce
> > > the issue with my driver in the future.
> > 
> > Then maybe your change is just making virtio slower and masking the bug
> > that is actually elsewhere?
> > 
> > You don't really need a driver. Here's a simple test: without barriers
> > assertion will fail. With barriers it will not.
> > (Warning: didn't bother testing too much, could be buggy.)
> > 
> > ---
> > 
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <assert.h>
> > 
> > #define FIRST values[0]
> > #define SECOND values[64]
> > 
> > volatile int values[100] = {};
> > 
> > void* writer_thread(void* arg) {
> > 	while (1) {
> > 	FIRST++;
> > 	// NEED smp_wmb here
>         __asm__ volatile("dmb ishst" : : : "memory");
> > 	SECOND++;
> > 	}
> > }
> > 
> > void* reader_thread(void* arg) {
> >      while (1) {
> > 	int first = FIRST;
> > 	// NEED smp_rmb here
>         __asm__ volatile("dmb ishld" : : : "memory");
> > 	int second = SECOND;
> > 	assert(first - second == 1 || first - second == 0);
> >      }
> > }
> > 
> > int main() {
> >      pthread_t writer, reader;
> > 
> >      pthread_create(&writer, NULL, writer_thread, NULL);
> >      pthread_create(&reader, NULL, reader_thread, NULL);
> > 
> >      pthread_join(writer, NULL);
> >      pthread_join(reader, NULL);
> > 
> >      return 0;
> > }
> > 
> 
> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> the assert on both of them. After replacing 'dmb' with 'dsb', I can still
> hit the assert on both of them. I need to look at the code more closely.
> 
> [root@virt-mtcollins-02 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> Aborted (core dumped)
> 
> [root@nvidia-grace-hopper-05 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> Aborted (core dumped)
> 
> Thanks,
> Gavin


Actually this test is broken. No ordering is needed; it's a simple race:
the writer can run many iterations between the reader's two loads, so the
assertion can fail no matter which barriers are used. The following works
on x86 though (x86 does not need the barriers).


#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* 0: fences compiled out (compiler barriers only); change to 1 for real fences */
#if 0
#define x86_rmb()  asm volatile("lfence":::"memory")
#define x86_mb()  asm volatile("mfence":::"memory")
#define x86_smb()  asm volatile("sfence":::"memory")
#else
#define x86_rmb()  asm volatile("":::"memory")
#define x86_mb()  asm volatile("":::"memory")
#define x86_smb()  asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
	while (1) {
	/* Now synchronize with reader */
	while(FLAG);
	FIRST++;
	x86_smb();
	SECOND++;
	x86_smb();
	FLAG = 1;
	}
}

void* reader_thread(void* arg) {
    while (1) {
	/* Now synchronize with writer */
	while(!FLAG);
	x86_rmb();
	unsigned first = FIRST;
	x86_rmb();
	unsigned second = SECOND;
	assert(first - second == 1 || first - second == 0);
	FLAG = 0;

	if (!(first %1000000))
		printf("%d\n", first);
   }
}

int main() {
    pthread_t writer, reader;

    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);

    pthread_join(writer, NULL);
    pthread_join(reader, NULL);

    return 0;
}


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-17 16:50                   ` Michael S. Tsirkin
@ 2024-03-17 23:41                     ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-17 23:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs

On 3/18/24 02:50, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
>>
>> On 3/15/24 21:05, Michael S. Tsirkin wrote:
>>> On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>>>> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
>>>> to reproduce it with my own driver where one thread writes to the shared buffer
>>>> and another thread reads from the buffer. I don't hit the out-of-order issue so
>>>> far.
>>>
>>> Make sure the 2 areas you are accessing are in different cache lines.
>>>
>>
>> Yes, I already put those 2 areas in separate cache lines.
>>
>>>
>>>> My driver may not be correct somewhere and I will update if I can reproduce
>>>> the issue with my driver in the future.
>>>
>>> Then maybe your change is just making virtio slower and masking the bug
>>> that is actually elsewhere?
>>>
>>> You don't really need a driver. Here's a simple test: without barriers
>>> assertion will fail. With barriers it will not.
>>> (Warning: didn't bother testing too much, could be buggy.)
>>>
>>> ---
>>>
>>> #include <pthread.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <assert.h>
>>>
>>> #define FIRST values[0]
>>> #define SECOND values[64]
>>>
>>> volatile int values[100] = {};
>>>
>>> void* writer_thread(void* arg) {
>>> 	while (1) {
>>> 	FIRST++;
>>> 	// NEED smp_wmb here
>>          __asm__ volatile("dmb ishst" : : : "memory");
>>> 	SECOND++;
>>> 	}
>>> }
>>>
>>> void* reader_thread(void* arg) {
>>>       while (1) {
>>> 	int first = FIRST;
>>> 	// NEED smp_rmb here
>>          __asm__ volatile("dmb ishld" : : : "memory");
>>> 	int second = SECOND;
>>> 	assert(first - second == 1 || first - second == 0);
>>>       }
>>> }
>>>
>>> int main() {
>>>       pthread_t writer, reader;
>>>
>>>       pthread_create(&writer, NULL, writer_thread, NULL);
>>>       pthread_create(&reader, NULL, reader_thread, NULL);
>>>
>>>       pthread_join(writer, NULL);
>>>       pthread_join(reader, NULL);
>>>
>>>       return 0;
>>> }
>>>
>>
>> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
>> the assert on both of them. After replacing 'dmb' with 'dsb', I can still
>> hit the assert on both of them. I need to look at the code more closely.
>>
>> [root@virt-mtcollins-02 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> [root@nvidia-grace-hopper-05 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> Thanks,
>> Gavin
> 
> 
> Actually this test is broken. No ordering is needed; it's a simple race:
> the writer can run many iterations between the reader's two loads, so the
> assertion can fail no matter which barriers are used. The following works
> on x86 though (x86 does not need the barriers).
> 
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> 
> #if 0
> #define x86_rmb()  asm volatile("lfence":::"memory")
> #define x86_mb()  asm volatile("mfence":::"memory")
> #define x86_smb()  asm volatile("sfence":::"memory")
> #else
> #define x86_rmb()  asm volatile("":::"memory")
> #define x86_mb()  asm volatile("":::"memory")
> #define x86_smb()  asm volatile("":::"memory")
> #endif
> 
> #define FIRST values[0]
> #define SECOND values[640]
> #define FLAG values[1280]
> 
> volatile unsigned values[2000] = {};
> 
> void* writer_thread(void* arg) {
> 	while (1) {
> 	/* Now synchronize with reader */
> 	while(FLAG);
> 	FIRST++;
> 	x86_smb();
> 	SECOND++;
> 	x86_smb();
> 	FLAG = 1;
> 	}
> }
> 
> void* reader_thread(void* arg) {
>      while (1) {
> 	/* Now synchronize with writer */
> 	while(!FLAG);
> 	x86_rmb();
> 	unsigned first = FIRST;
> 	x86_rmb();
> 	unsigned second = SECOND;
> 	assert(first - second == 1 || first - second == 0);
> 	FLAG = 0;
> 
> 	if (!(first %1000000))
> 		printf("%d\n", first);
>     }
> }
> 
> int main() {
>      pthread_t writer, reader;
> 
>      pthread_create(&writer, NULL, writer_thread, NULL);
>      pthread_create(&reader, NULL, reader_thread, NULL);
> 
>      pthread_join(writer, NULL);
>      pthread_join(reader, NULL);
> 
>      return 0;
> }
> 

I tried it on the host and in a VM on NVidia's grace-hopper. Without the
barriers, I can hit the assert. With the barriers, it works fine without
hitting the assert.

I also wrote some code to mimic the virtio vring last weekend, and it's just
working well. Back to our original issue, __smp_wmb() is issued by the guest
while __smp_rmb() is executed on the host. The VM and host are running at
different exception levels: EL2 vs EL1. I'm not sure whether that's the
cause. I need to modify my code so that __smp_wmb() and __smp_rmb() can be
executed from the guest and the host respectively.

[gshan@gshan code]$ cat test.h
#ifndef __TEST_H
#define __TEST_H

struct vring_desc {
	uint64_t	addr;
	uint32_t	len;
	uint16_t	flags;
	uint16_t	next;
} __attribute__((aligned(4)));

struct vring_avail {
	uint16_t	flags;
	uint16_t	idx;
	uint16_t	ring[];
} __attribute__((aligned(4)));

struct vring_used_elem {
	uint32_t	id;
	uint32_t	len;
} __attribute__((aligned(4)));

struct vring_used {
	uint16_t		flags;
	uint16_t		idx;
	struct vring_used_elem	ring[];
} __attribute__((aligned(4)));

struct vring {
	struct vring_desc	*desc;
	struct vring_avail	*avail;
	struct vring_used	*used;
	uint8_t			pad0[64];

	/* Writer */
	uint32_t		num;
	uint32_t		w_num_free;
	uint32_t		w_free_head;
	uint16_t		w_avail_idx;
	uint16_t		w_last_used_idx;
	uint16_t		*w_extra_data;
	uint16_t		*w_extra_next;
	uint8_t			pad1[64];

	/* Reader */
	uint16_t		r_avail_idx;
	uint16_t		r_last_avail_idx;
	uint16_t		r_last_used_idx;
	uint8_t			pad2[64];
};

static inline unsigned int vring_size(unsigned int num, unsigned long align)
{
	return ((sizeof(struct vring_desc) * num +
		 sizeof(uint16_t) * (3 + num) + (align - 1)) & ~(align - 1)) +
	       sizeof(uint16_t) * 3 + sizeof(struct vring_used_elem) * num;
}

static inline void __smp_rmb(void)
{
#ifdef WEAK_BARRIER
	__asm__ volatile("dmb ishld" : : : "memory");
#else
	__asm__ volatile("dsb sy"    : : : "memory");
#endif
}

static inline void __smp_wmb(void)
{
#ifdef WEAK_BARRIER
	__asm__ volatile("dmb ishst" : : : "memory");
#else
	__asm__ volatile("dsb sy"    : : : "memory");
#endif
}

#endif /* __TEST_H */


[gshan@gshan code]$ cat test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <assert.h>
#include <sched.h>
#include <pthread.h>

#include "test.h"

static struct vring *vring;

static int bind_cpu(int cpuid)
{
	cpu_set_t cpuset;

	CPU_ZERO(&cpuset);
	CPU_SET(cpuid, &cpuset);

	return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

static void write_free_used_desc(void)
{
	uint16_t last_used;
	uint32_t idx;

	if ((uint16_t)(vring->used->idx - vring->w_last_used_idx) < 64)
		return;

	while (true) {
		if (vring->w_last_used_idx == vring->used->idx)
			return;

		/* Order the read of used->idx above before reading the used ring entry */
		__smp_rmb();

		/* Retrieve the head */
		last_used = vring->w_last_used_idx & (vring->num - 1);
		idx = vring->used->ring[last_used].id;
		assert(idx < vring->num);
		assert(vring->w_extra_data[idx]);

		/* Reclaim the descriptor */
		vring->w_extra_data[idx] = 0;
		vring->w_extra_next[idx] = vring->w_free_head;
		vring->w_free_head = idx;

		/* Update statistics */
		vring->w_num_free++;
		vring->w_last_used_idx++;
	}
}

static void write_push_desc(void)
{
	uint32_t head = vring->w_free_head;
	uint32_t avail_idx;

	if (vring->w_num_free < 1)
		return;

	/*
	 * The data in the descriptor doesn't matter. The idea here
	 * is to dirty the cache line.
	 */
	vring->desc[head].flags = 1;
	vring->desc[head].addr  = 0xffffffffffffffff;
	vring->desc[head].len   = 0xffffffff;
	vring->desc[head].next  = vring->w_extra_next[head];
	vring->desc[head].flags = 0;

	vring->w_num_free--;
	vring->w_free_head = vring->w_extra_next[head];
	vring->w_extra_data[head] = 1;

	avail_idx = vring->w_avail_idx & (vring->num - 1);
	vring->avail->ring[avail_idx] = head;

	/* Publish the ring[] entry before exposing the new avail->idx */
	__smp_wmb();

	vring->w_avail_idx++;
	vring->avail->idx = vring->w_avail_idx;
}

static void *write_worker(void *arg)
{
	assert(!bind_cpu(10));

	while (true) {
		write_free_used_desc();
		write_push_desc();
	}

	return NULL;
}

static void read_pull_desc(void)
{
	uint16_t last_avail_idx;
	uint32_t head;

	last_avail_idx = vring->r_last_avail_idx;
	if (vring->r_avail_idx == vring->r_last_avail_idx) {
		vring->r_avail_idx = vring->avail->idx;
		if (vring->r_avail_idx == last_avail_idx)
			return;

		/* Order the read of avail->idx above before reading the ring entry */
		__smp_rmb();
	}

	head = vring->avail->ring[last_avail_idx & (vring->num - 1)];
	assert(head < vring->num);
	vring->r_last_avail_idx++;

	vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].id  = head;
	vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].len = 0;
	vring->r_last_used_idx++;

	/* Make the used ring entry visible before updating used->idx */
	__smp_wmb();

	vring->used->idx = vring->r_last_used_idx;
}

static void *read_worker(void *arg)
{
	assert(!bind_cpu(60));

	while (true) {
		read_pull_desc();
	}

	return NULL;
}

static void init_vring(unsigned int num, unsigned long align)
{
	unsigned int size, i;

	/* vring */
	vring = malloc(sizeof(*vring));
	assert(vring);
	memset(vring, 0, sizeof(*vring));

	/* Descriptors */
	size = vring_size(num, align);
	vring->desc = (struct vring_desc *)malloc(size);
	assert(vring->desc);
	memset(vring->desc, 0, size);
	vring->avail = (struct vring_avail *)((void *)vring->desc +
					      num * sizeof(struct vring_desc));
	vring->used = (struct vring_used *)(((unsigned long)&vring->avail->ring[num] +
					    sizeof(uint16_t) + (align - 1)) & ~(align - 1));

	/* Writer's extra data */
	vring->w_extra_data = malloc(sizeof(uint16_t) * num);
	assert(vring->w_extra_data);
	memset(vring->w_extra_data, 0, sizeof(uint16_t) * num);
	vring->w_extra_next = malloc(sizeof(uint16_t) * num);
	assert(vring->w_extra_next);
	memset(vring->w_extra_next, 0, sizeof(uint16_t) * num);
	for (i = 0; i < num - 1; i++)
		vring->w_extra_next[i] = i + 1;

	/* Statistics */
	vring->num = num;
	vring->w_num_free = num;
	vring->w_free_head = 0;
	vring->w_avail_idx = 0;
	vring->w_last_used_idx = 0;
	vring->r_avail_idx = 0;
	vring->r_last_avail_idx = 0;
	vring->r_last_used_idx = 0;
}

int main(int argc, char **argv)
{
	pthread_t w_tid, r_tid;
	int ret;

	assert(!bind_cpu(0));

	init_vring(256, 64);

	ret = pthread_create(&w_tid, NULL, write_worker, NULL);
	assert(!ret);
	ret = pthread_create(&r_tid, NULL, read_worker, NULL);
	assert(!ret);
	ret = pthread_join(w_tid, NULL);
	assert(!ret);
	ret = pthread_join(r_tid, NULL);
	assert(!ret);

	while (true) {
		sleep(1);
	}

	return 0;
}
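
One possible way to build both variants for comparison, assuming the two
files above are saved as test.h and test.c:

  gcc -O2 -pthread -DWEAK_BARRIER test.c -o test-dmb   # 'dmb'-based barriers
  gcc -O2 -pthread test.c -o test-dsb                  # 'dsb'-based barriers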

Thanks,
Gavin





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
@ 2024-03-17 23:41                     ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-17 23:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs

On 3/18/24 02:50, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
>>
>> On 3/15/24 21:05, Michael S. Tsirkin wrote:
>>> On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>>>>>> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
>>>> to reproduce it with my own driver where one thread writes to the shared buffer
>>>> and another thread reads from the buffer. I don't hit the out-of-order issue so
>>>> far.
>>>
>>> Make sure the 2 areas you are accessing are in different cache lines.
>>>
>>
>> Yes, I already put those 2 areas to separate cache lines.
>>
>>>
>>>> My driver may be not correct somewhere and I will update if I can reproduce
>>>> the issue with my driver in the future.
>>>
>>> Then maybe your change is just making virtio slower and masks the bug
>>> that is actually elsewhere?
>>>
>>> You don't really need a driver. Here's a simple test: without barriers
>>> assertion will fail. With barriers it will not.
>>> (Warning: didn't bother testing too much, could be buggy.
>>>
>>> ---
>>>
>>> #include <pthread.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <assert.h>
>>>
>>> #define FIRST values[0]
>>> #define SECOND values[64]
>>>
>>> volatile int values[100] = {};
>>>
>>> void* writer_thread(void* arg) {
>>> 	while (1) {
>>> 	FIRST++;
>>> 	// NEED smp_wmb here
>>          __asm__ volatile("dmb ishst" : : : "memory");
>>> 	SECOND++;
>>> 	}
>>> }
>>>
>>> void* reader_thread(void* arg) {
>>>       while (1) {
>>> 	int first = FIRST;
>>> 	// NEED smp_rmb here
>>          __asm__ volatile("dmb ishld" : : : "memory");
>>> 	int second = SECOND;
>>> 	assert(first - second == 1 || first - second == 0);
>>>       }
>>> }
>>>
>>> int main() {
>>>       pthread_t writer, reader;
>>>
>>>       pthread_create(&writer, NULL, writer_thread, NULL);
>>>       pthread_create(&reader, NULL, reader_thread, NULL);
>>>
>>>       pthread_join(writer, NULL);
>>>       pthread_join(reader, NULL);
>>>
>>>       return 0;
>>> }
>>>
>>
>> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
>> the assert on both of them. After replacing 'dmb' with 'dsb', I can
>> hit assert on both of them too. I need to look at the code closely.
>>
>> [root@virt-mtcollins-02 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> [root@nvidia-grace-hopper-05 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> Thanks,
>> Gavin
> 
> 
> Actually this test is broken. No need for ordering it's a simple race.
> The following works on x86 though (x86 does not need barriers
> though).
> 
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> 
> #if 0
> #define x86_rmb()  asm volatile("lfence":::"memory")
> #define x86_mb()  asm volatile("mfence":::"memory")
> #define x86_smb()  asm volatile("sfence":::"memory")
> #else
> #define x86_rmb()  asm volatile("":::"memory")
> #define x86_mb()  asm volatile("":::"memory")
> #define x86_smb()  asm volatile("":::"memory")
> #endif
> 
> #define FIRST values[0]
> #define SECOND values[640]
> #define FLAG values[1280]
> 
> volatile unsigned values[2000] = {};
> 
> void* writer_thread(void* arg) {
> 	while (1) {
> 	/* Now synchronize with reader */
> 	while(FLAG);
> 	FIRST++;
> 	x86_smb();
> 	SECOND++;
> 	x86_smb();
> 	FLAG = 1;
> 	}
> }
> 
> void* reader_thread(void* arg) {
>      while (1) {
> 	/* Now synchronize with writer */
> 	while(!FLAG);
> 	x86_rmb();
> 	unsigned first = FIRST;
> 	x86_rmb();
> 	unsigned second = SECOND;
> 	assert(first - second == 1 || first - second == 0);
> 	FLAG = 0;
> 
> 	if (!(first %1000000))
> 		printf("%d\n", first);
>     }
> }
> 
> int main() {
>      pthread_t writer, reader;
> 
>      pthread_create(&writer, NULL, writer_thread, NULL);
>      pthread_create(&reader, NULL, reader_thread, NULL);
> 
>      pthread_join(writer, NULL);
>      pthread_join(reader, NULL);
> 
>      return 0;
> }
> 

I tried it on host and VM of NVidia's grace-hopper. Without the barriers, I
can hit assert. With the barriers, it's working fine without hitting the
assert.

I also had some code to mimic virtio vring last weekend, and it's just
working well. Back to our original issue, __smb_wmb() is issued by guest
while __smb_rmb() is executed on host. The VM and host are running at
different exception level: EL2 vs EL1. I'm not sure it's the cause. I
need to modify my code so that __smb_wmb() and __smb_rmb() can be executed
from guest and host.

[gshan@gshan code]$ cat test.h
#ifndef __TEST_H
#define __TEST_H

struct vring_desc {
	uint64_t	addr;
	uint32_t	len;
	uint16_t	flags;
	uint16_t	next;
} __attribute__((aligned(4)));

struct vring_avail {
	uint16_t	flags;
	uint16_t	idx;
	uint16_t	ring[];
} __attribute__((aligned(4)));

struct vring_used_elem {
	uint32_t	id;
	uint32_t	len;
} __attribute__((aligned(4)));

struct vring_used {
	uint16_t 		flags;
         uint16_t 		idx;
         struct vring_used_elem	ring[];
} __attribute__((aligned(4)));

struct vring {
	struct vring_desc	*desc;
	struct vring_avail	*avail;
	struct vring_used	*used;
	uint8_t			pad0[64];

	/* Writer */
	uint32_t		num;
	uint32_t		w_num_free;
	uint32_t		w_free_head;
	uint16_t		w_avail_idx;
	uint16_t		w_last_used_idx;
	uint16_t		*w_extra_data;
	uint16_t		*w_extra_next;
	uint8_t			pad1[64];

	/* Reader */
	uint16_t		r_avail_idx;
	uint16_t		r_last_avail_idx;
	uint16_t		r_last_used_idx;
	uint8_t			pad2[64];
};

static inline unsigned int vring_size(unsigned int num, unsigned long align)
{
	return ((sizeof(struct vring_desc) * num +
		 sizeof(uint16_t) * (3 + num) + (align - 1)) & ~(align - 1)) +
	       sizeof(uint16_t) * 3 + sizeof(struct vring_used_elem) * num;
		
}

static inline void __smp_rmb(void)
{
#ifdef WEAK_BARRIER
	__asm__ volatile("dmb ishld" : : : "memory");
#else
	__asm__ volatile("dsb sy"    : : : "memory");
#endif
}

static inline void __smp_wmb(void)
{
#ifdef WEAK_BARRIER
	__asm__ volatile("dmb ishst" : : : "memory");
#else
	__asm__ volatile("dsb sy"    : : : "memory");
#endif
}

#endif /* __TEST_H */


[gshan@gshan code]$ cat test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <assert.h>
#include <sched.h>
#include <pthread.h>

#include "test.h"

static struct vring *vring;

static int bind_cpu(int cpuid)
{
	cpu_set_t cpuset;

	CPU_ZERO(&cpuset);
	CPU_SET(cpuid, &cpuset);

	return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

static void write_free_used_desc(void)
{
	uint16_t last_used;
	uint32_t idx;

	if ((uint16_t)(vring->used->idx - vring->w_last_used_idx) < 64)
		return;

	while (true) {
		if (vring->w_last_used_idx == vring->used->idx)
			return;

		__smp_rmb();

		/* Retrieve the head */
		last_used = vring->w_last_used_idx & (vring->num - 1);
		idx = vring->used->ring[last_used].id;
		assert(idx < vring->num);
		assert(vring->w_extra_data[idx]);

		/* Reclaim the descriptor */
		vring->w_extra_data[idx] = 0;
		vring->w_extra_next[idx] = vring->w_free_head;
		vring->w_free_head = idx;

		/* Update statistics */
		vring->w_num_free++;
		vring->w_last_used_idx++;
	}
}

static void write_push_desc(void)
{
	uint32_t head = vring->w_free_head;
	uint32_t avail_idx;

	if (vring->w_num_free < 1)
		return;

	/*
	 * The data in the descriptor doesn't matter. The idea here
	 * is to dirty the cache line.
	 */
	vring->desc[head].flags = 1;
	vring->desc[head].addr  = 0xffffffffffffffff;
	vring->desc[head].len   = 0xffffffff;
	vring->desc[head].next  = vring->w_extra_next[head];
	vring->desc[head].flags = 0;

	vring->w_num_free--;
	vring->w_free_head = vring->w_extra_next[head];
	vring->w_extra_data[head] = 1;

	avail_idx = vring->w_avail_idx & (vring->num - 1);
	vring->avail->ring[avail_idx] = head;

	__smp_wmb();

	vring->w_avail_idx++;
	vring->avail->idx = vring->w_avail_idx;
}

static void *write_worker(void *arg)
{
	assert(!bind_cpu(10));

	while (true) {
		write_free_used_desc();
		write_push_desc();
	}

	return NULL;
}

static void read_pull_desc(void)
{
	uint16_t avail_idx, last_avail_idx;
	uint32_t head;

	last_avail_idx = vring->r_last_avail_idx;
	if (vring->r_avail_idx == vring->r_last_avail_idx) {
		vring->r_avail_idx = vring->avail->idx;
		if (vring->r_avail_idx == last_avail_idx)
			return;

		__smp_rmb();
	}

	head = vring->avail->ring[last_avail_idx & (vring->num - 1)];
	assert(head < vring->num);
	vring->r_last_avail_idx++;

	vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].id  = head;
	vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].len = 0;
	vring->r_last_used_idx++;

	__smp_wmb();

	vring->used->idx = vring->r_last_used_idx;
}

static void *read_worker(void *arg)
{
	assert(!bind_cpu(60));

	while (true) {
		read_pull_desc();
	}

	return NULL;
}

static void init_vring(unsigned int num, unsigned long align)
{
	unsigned int size, i;

	/* vring */
	vring = malloc(sizeof(*vring));
	assert(vring);
	memset(vring, 0, sizeof(*vring));

	/* Descriptors */
	size = vring_size(num, align);
	vring->desc = (struct vring_desc *)malloc(size);
	assert(vring->desc);
	memset(vring->desc, 0, size);
	vring->avail = (struct vring_avail *)((void *)vring->desc +
					      num * sizeof(struct vring_desc));
	vring->used = (struct vring_used *)(((unsigned long)&vring->avail->ring[num] +
					    sizeof(uint16_t) + (align - 1)) & ~(align - 1));

	/* Writer's extra data */
	vring->w_extra_data = malloc(sizeof(uint16_t) * num);
	assert(vring->w_extra_data);
	memset(vring->w_extra_data, 0, sizeof(uint16_t) * num);
	vring->w_extra_next = malloc(sizeof(uint16_t) * num);
	assert(vring->w_extra_next);
	memset(vring->w_extra_next, 0, sizeof(uint16_t) * num);
	for (i = 0; i < num - 1; i++)
		vring->w_extra_next[i] = i + 1;

	/* Statistics */
	vring->num = num;
	vring->w_num_free = num;
	vring->w_free_head = 0;
	vring->w_avail_idx = 0;
	vring->w_last_used_idx = 0;
	vring->r_avail_idx = 0;
	vring->r_last_avail_idx = 0;
	vring->r_last_used_idx = 0;
}

int main(int argc, char **argv)
{
	pthread_t w_tid, r_tid;
	int ret;

	assert(!bind_cpu(0));

	init_vring(256, 64);

	ret = pthread_create(&w_tid, NULL, write_worker, NULL);
	assert(!ret);
	ret = pthread_create(&r_tid, NULL, read_worker, NULL);
	assert(!ret);
	/* Both workers loop forever, so these joins block indefinitely. */
	ret = pthread_join(w_tid, NULL);
	assert(!ret);
	ret = pthread_join(r_tid, NULL);
	assert(!ret);

	return 0;
}
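
A sketch of how this reproducer would be built and run on the machines
above, assuming GCC with pthreads (the file name is illustrative):

  gcc -O2 -pthread vring-test.c -o vring-test
  ./vring-test

Note that the hardcoded CPUs (0, 10 and 60) assume a large system; the
bind_cpu() arguments would need adjusting on smaller machines.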

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-17 23:41                     ` Gavin Shan
@ 2024-03-18  7:50                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-18  7:50 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu,
	shan.gavin, Will Deacon, Catalin Marinas, linux-arm-kernel,
	mochs

On Mon, Mar 18, 2024 at 09:41:45AM +1000, Gavin Shan wrote:
> On 3/18/24 02:50, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > > > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> > > > > to reproduce it with my own driver where one thread writes to the shared buffer
> > > > > and another thread reads from the buffer. I haven't hit the out-of-order issue so
> > > > > far.
> > > > 
> > > > Make sure the 2 areas you are accessing are in different cache lines.
> > > > 
> > > 
> > > Yes, I already put those 2 areas to separate cache lines.
> > > 
> > > > 
> > > > > My driver may not be correct somewhere, and I will update if I can reproduce
> > > > > the issue with my driver in the future.
> > > > 
> > > > Then maybe your change is just making virtio slower and masking the bug
> > > > that is actually elsewhere?
> > > > 
> > > > You don't really need a driver. Here's a simple test: without barriers
> > > > the assertion will fail. With barriers it will not.
> > > > (Warning: didn't bother testing too much, could be buggy.)
> > > > 
> > > > ---
> > > > 
> > > > #include <pthread.h>
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <assert.h>
> > > > 
> > > > #define FIRST values[0]
> > > > #define SECOND values[64]
> > > > 
> > > > volatile int values[100] = {};
> > > > 
> > > > void* writer_thread(void* arg) {
> > > > 	while (1) {
> > > > 	FIRST++;
> > > > 	// NEED smp_wmb here
> > >          __asm__ volatile("dmb ishst" : : : "memory");
> > > > 	SECOND++;
> > > > 	}
> > > > }
> > > > 
> > > > void* reader_thread(void* arg) {
> > > >       while (1) {
> > > > 	int first = FIRST;
> > > > 	// NEED smp_rmb here
> > >          __asm__ volatile("dmb ishld" : : : "memory");
> > > > 	int second = SECOND;
> > > > 	assert(first - second == 1 || first - second == 0);
> > > >       }
> > > > }
> > > > 
> > > > int main() {
> > > >       pthread_t writer, reader;
> > > > 
> > > >       pthread_create(&writer, NULL, writer_thread, NULL);
> > > >       pthread_create(&reader, NULL, reader_thread, NULL);
> > > > 
> > > >       pthread_join(writer, NULL);
> > > >       pthread_join(reader, NULL);
> > > > 
> > > >       return 0;
> > > > }
> > > > 
> > > 
> > > I had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> > > the assert on both of them. After replacing 'dmb' with 'dsb', I can
> > > still hit the assert on both of them. I need to look at the code closely.
> > > 
> > > [root@virt-mtcollins-02 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > [root@nvidia-grace-hopper-05 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > Thanks,
> > > Gavin
> > 
> > 
> > Actually this test is broken. There is no ordering requirement; it's a simple race.
> > The following works on x86, though (x86 does not actually need the
> > barriers).
> > 
> > 
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <assert.h>
> > 
> > #if 0
> > #define x86_rmb()  asm volatile("lfence":::"memory")
> > #define x86_mb()  asm volatile("mfence":::"memory")
> > #define x86_smb()  asm volatile("sfence":::"memory")
> > #else
> > #define x86_rmb()  asm volatile("":::"memory")
> > #define x86_mb()  asm volatile("":::"memory")
> > #define x86_smb()  asm volatile("":::"memory")
> > #endif
> > 
> > #define FIRST values[0]
> > #define SECOND values[640]
> > #define FLAG values[1280]
> > 
> > volatile unsigned values[2000] = {};
> > 
> > void* writer_thread(void* arg) {
> > 	while (1) {
> > 	/* Now synchronize with reader */
> > 	while(FLAG);
> > 	FIRST++;
> > 	x86_smb();
> > 	SECOND++;
> > 	x86_smb();
> > 	FLAG = 1;
> > 	}
> > }
> > 
> > void* reader_thread(void* arg) {
> >      while (1) {
> > 	/* Now synchronize with writer */
> > 	while(!FLAG);
> > 	x86_rmb();
> > 	unsigned first = FIRST;
> > 	x86_rmb();
> > 	unsigned second = SECOND;
> > 	assert(first - second == 1 || first - second == 0);
> > 	FLAG = 0;
> > 
> > 	if (!(first % 1000000))
> > 		printf("%u\n", first);
> >     }
> > }
> > 
> > int main() {
> >      pthread_t writer, reader;
> > 
> >      pthread_create(&writer, NULL, writer_thread, NULL);
> >      pthread_create(&reader, NULL, reader_thread, NULL);
> > 
> >      pthread_join(writer, NULL);
> >      pthread_join(reader, NULL);
> > 
> >      return 0;
> > }
> > 
> 
> I tried it on the host and in a VM on NVidia's grace-hopper. Without the
> barriers, I can hit the assert. With the barriers, it works fine without
> hitting the assert.
> 
> I also wrote some code to mimic the virtio vring last weekend, and it just
> works. Back to our original issue, __smp_wmb() is issued by the guest
> while __smp_rmb() is executed on the host. The VM and the host run at
> different exception levels: EL2 vs EL1. I'm not sure that's the cause. I
> need to modify my code so that __smp_wmb() and __smp_rmb() can be executed
> from both the guest and the host.

It is conceivably possible that barriers somehow work differently
on grace-hopper. We need to find out more, though.
Can anyone from Nvidia chime in?

-- 
MST


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14  7:49 [PATCH] virtio_ring: Fix the stale index in available ring Gavin Shan
  2024-03-14  8:05 ` Michael S. Tsirkin
@ 2024-03-18 16:59 ` Will Deacon
  2024-03-19  4:59     ` Gavin Shan
  2024-03-19  7:36   ` Michael S. Tsirkin
  2024-03-19  6:14 ` Michael S. Tsirkin
  2 siblings, 2 replies; 78+ messages in thread
From: Will Deacon @ 2024-03-18 16:59 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, mst, jasowang, xuanzhuo, yihyu, shan.gavin

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in a VM hosted by the grace-hopper machine,
> while the 'netperf' server runs on the grace-grace machine.
> 
> The VM is started with virtio-net, and vhost has been enabled.
> We observed an error message spew from the VM and then a soft-lockup
> report. The error message indicates that the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to an endless effort
> to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and a soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and the available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>   -accel kvm -machine virt,gic-version=host                            \
>      :                                                                 \
>   -netdev tap,id=vnet0,vhost=on                                        \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> on ARM64. It should work for other architectures, but a performance loss is
> expected.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Yihuang Yu <yihyu@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  drivers/virtio/virtio_ring.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>  	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> -	/* Descriptors and available array need to be set before we expose the
> -	 * new available array entries. */
> -	virtio_wmb(vq->weak_barriers);
> +	/*
> +	 * Descriptors and available array need to be set before we expose
> +	 * the new available array entries. virtio_wmb() should be enough
> +	 * to ensure the ordering theoretically. However, a stronger barrier
> +	 * is needed by ARM64. Otherwise, the stale data can be observed
> +	 * by the host (vhost). A stronger barrier should work for other
> +	 * architectures, but performance loss is expected.
> +	 */
> +	virtio_mb(false);
>  	vq->split.avail_idx_shadow++;
>  	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>  						vq->split.avail_idx_shadow);

Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing difference from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smp_mb())?

We definitely shouldn't take changes like this without a proper
explanation of what is going on.

Will
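
For context, the arm64 barrier macros under discussion map roughly to the
following instructions (a sketch based on arch/arm64/include/asm/barrier.h
around this kernel version; exact definitions may vary):

	__smp_wmb()	/* dmb ishst - store/store, inner shareable domain */
	__smp_rmb()	/* dmb ishld - load/load+load/store, inner shareable */
	__smp_mb()	/* dmb ish   - full barrier, inner shareable */
	__dma_mb()	/* dmb osh   - full barrier, outer shareable */
	__mb()		/* dsb sy    - full system barrier, waits for completion */

With weak barriers enabled, virtio_wmb() reduces to the 'dmb ishst' form,
while the virtio_mb(false) in the patch reduces to the DSB-based __mb().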

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-18 16:59 ` Will Deacon
@ 2024-03-19  4:59     ` Gavin Shan
  2024-03-19  7:36   ` Michael S. Tsirkin
  1 sibling, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  4:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: virtualization, linux-kernel, mst, jasowang, xuanzhuo, yihyu,
	shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/19/24 02:59, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>> client is started in a VM hosted by the grace-hopper machine,
>> while the 'netperf' server runs on the grace-grace machine.
>>
>> The VM is started with virtio-net, and vhost has been enabled.
>> We observed an error message spew from the VM and then a soft-lockup
>> report. The error message indicates that the data associated with
>> the descriptor (index: 135) has been released, and the queue
>> is marked as broken. It eventually leads to an endless effort
>> to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
>> and a soft-lockup. The stale index 135 is fetched from the available
>> ring and published to the used ring by vhost, meaning we have
>> disordered writes to the available ring element and the available index.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>>    -accel kvm -machine virt,gic-version=host                            \
>>       :                                                                 \
>>    -netdev tap,id=vnet0,vhost=on                                        \
>>    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>
>>    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>
>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
>> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
>> on ARM64. It should work for other architectures, but a performance loss is
>> expected.
>>
>> Cc: stable@vger.kernel.org
>> Reported-by: Yihuang Yu <yihyu@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   drivers/virtio/virtio_ring.c | 12 +++++++++---
>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 49299b1f9ec7..7d852811c912 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>   
>> -	/* Descriptors and available array need to be set before we expose the
>> -	 * new available array entries. */
>> -	virtio_wmb(vq->weak_barriers);
>> +	/*
>> +	 * Descriptors and available array need to be set before we expose
>> +	 * the new available array entries. virtio_wmb() should be enough
>> +	 * to ensure the ordering theoretically. However, a stronger barrier
>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>> +	 * by the host (vhost). A stronger barrier should work for other
>> +	 * architectures, but performance loss is expected.
>> +	 */
>> +	virtio_mb(false);
>>   	vq->split.avail_idx_shadow++;
>>   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>   						vq->split.avail_idx_shadow);
> 
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
> 
> In practice, either the larger timing difference from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smp_mb())?
> 
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
> 

Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
at the hardware level. I agree that replacing DMB with DSB is not the solution
before we fully understand the root cause.

I tried the possible replacements below. __smp_mb() can avoid the issue like
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
     :
         /* Put entry in available array (but don't update avail->idx until they
          * do sync). */
         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

         /* Descriptors and available array need to be set before we expose the
          * new available array entries. */
         // Broken: virtio_wmb(vq->weak_barriers);
         // Broken: __dma_mb();
         // Work:   __mb();
         // Work:   __smp_mb();
         // Work:   __ndelay(100);
         // Work:   __ndelay(10);
         // Broken: __ndelay(9);

        vq->split.avail_idx_shadow++;
         vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
                                                 vq->split.avail_idx_shadow);
         vq->num_added++;

         pr_debug("Added buffer head %i to %p\n", head, vq);
         END_USE(vq);
         :
}

I also tried to measure the time consumed by various barrier-related instructions using
ktime_get_ns(), which itself should account for most of the measured time. __smp_mb() is
slower than __smp_wmb() but faster than __mb():

     Instruction           Range of measured time [min  max] (ns)
     -----------------------------------------------------------
     __smp_wmb()           [32  1128032]
     __smp_mb()            [32  1160096]
     __mb()                [32  1162496]
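
A sketch of the measurement described above (hypothetical; the actual
measurement code was not posted in this thread):

	u64 start, delta;

	start = ktime_get_ns();
	__smp_mb();		/* or __smp_wmb() / __mb() */
	delta = ktime_get_ns() - start;
	/*
	 * delta includes the cost of ktime_get_ns() itself, which likely
	 * dominates and would explain the wide [min  max] ranges above.
	 */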

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  4:59     ` Gavin Shan
@ 2024-03-19  6:09       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  6:09 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in a VM hosted by the grace-hopper machine,
> > > while the 'netperf' server runs on the grace-grace machine.
> > >
> > > The VM is started with virtio-net, and vhost has been enabled.
> > > We observed an error message spew from the VM and then a soft-lockup
> > > report. The error message indicates that the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to an endless effort
> > > to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and a soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered writes to the available ring element and the available index.
> > > 
> > >    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > >    -accel kvm -machine virt,gic-version=host                            \
> > >       :                                                                 \
> > >    -netdev tap,id=vnet0,vhost=on                                        \
> > >    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> > > on ARM64. It should work for other architectures, but a performance loss is
> > > expected.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +++++++++---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > -	/* Descriptors and available array need to be set before we expose the
> > > -	 * new available array entries. */
> > > -	virtio_wmb(vq->weak_barriers);
> > > +	/*
> > > +	 * Descriptors and available array need to be set before we expose
> > > +	 * the new available array entries. virtio_wmb() should be enough
> > > +	 * to ensure the ordering theoretically. However, a stronger barrier
> > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > +	 * by the host (vhost). A stronger barrier should work for other
> > > +	 * architectures, but performance loss is expected.
> > > +	 */
> > > +	virtio_mb(false);
> > >   	vq->split.avail_idx_shadow++;
> > >   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   						vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing difference from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb())?
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> at the hardware level. I agree that replacing DMB with DSB is not the solution
> before we fully understand the root cause.
> 
> I tried the possible replacements below. __smp_mb() can avoid the issue like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> 
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
>     :
>         /* Put entry in available array (but don't update avail->idx until they
>          * do sync). */
>         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
>         /* Descriptors and available array need to be set before we expose the
>          * new available array entries. */
>         // Broken: virtio_wmb(vq->weak_barriers);
>         // Broken: __dma_mb();
>         // Work:   __mb();
>         // Work:   __smp_mb();
>         // Work:   __ndelay(100);
>         // Work:   __ndelay(10);
>         // Broken: __ndelay(9);
> 
>        vq->split.avail_idx_shadow++;
>         vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>                                                 vq->split.avail_idx_shadow);

What if you stick __ndelay here?


>         vq->num_added++;
> 
>         pr_debug("Added buffer head %i to %p\n", head, vq);
>         END_USE(vq);
>         :
> }
> 
> I also tried to measure the time consumed by various barrier-related instructions using
> ktime_get_ns(), which itself should account for most of the measured time. __smp_mb() is
> slower than __smp_wmb() but faster than __mb():
> 
>     Instruction           Range of measured time [min  max] (ns)
>     -----------------------------------------------------------
>     __smp_wmb()           [32  1128032]
>     __smp_mb()            [32  1160096]
>     __mb()                [32  1162496]
> 
> Thanks,
> Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:09       ` Michael S. Tsirkin
@ 2024-03-19  6:10         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  6:10 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > On 3/19/24 02:59, Will Deacon wrote:
> > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > client is started in a VM hosted by the grace-hopper machine,
> > > > while the 'netperf' server runs on the grace-grace machine.
> > > >
> > > > The VM is started with virtio-net, and vhost has been enabled.
> > > > We observed an error message spew from the VM and then a soft-lockup
> > > > report. The error message indicates that the data associated with
> > > > the descriptor (index: 135) has been released, and the queue
> > > > is marked as broken. It eventually leads to an endless effort
> > > > to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > and a soft-lockup. The stale index 135 is fetched from the available
> > > > ring and published to the used ring by vhost, meaning we have
> > > > disordered writes to the available ring element and the available index.
> > > > 
> > > >    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > > >    -accel kvm -machine virt,gic-version=host                            \
> > > >       :                                                                 \
> > > >    -netdev tap,id=vnet0,vhost=on                                        \
> > > >    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > 
> > > >    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > 
> > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> > > > on ARM64. It should work for other architectures, but a performance loss is
> > > > expected.
> > > > 
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > ---
> > > >   drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 49299b1f9ec7..7d852811c912 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > -	/* Descriptors and available array need to be set before we expose the
> > > > -	 * new available array entries. */
> > > > -	virtio_wmb(vq->weak_barriers);
> > > > +	/*
> > > > +	 * Descriptors and available array need to be set before we expose
> > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > +	 * to ensure the ordering theoretically. However, a stronger barrier
> > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > +	 * architectures, but performance loss is expected.
> > > > +	 */
> > > > +	virtio_mb(false);
> > > >   	vq->split.avail_idx_shadow++;
> > > >   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > >   						vq->split.avail_idx_shadow);
> > > 
> > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > here, especially when ordering accesses to coherent memory.
> > > 
> > > In practice, either the larger timing difference from the DSB or the fact
> > > that you're going from a Store->Store barrier to a full barrier is what
> > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > (e.g. via __smp_mb())?
> > > 
> > > We definitely shouldn't take changes like this without a proper
> > > explanation of what is going on.
> > > 
> > 
> > Thanks for your comments, Will.
> > 
> > Yes, DMB should work for us. However, it seems this instruction has issues on
> > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > at the hardware level. I agree that replacing DMB with DSB is not the solution
> > before we fully understand the root cause.
> >
> > I tried the possible replacements below. __smp_mb() can avoid the issue like
> > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > 
> > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > {
> >     :
> >         /* Put entry in available array (but don't update avail->idx until they
> >          * do sync). */
> >         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> >         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > 
> >         /* Descriptors and available array need to be set before we expose the
> >          * new available array entries. */
> >         // Broken: virtio_wmb(vq->weak_barriers);
> >         // Broken: __dma_mb();
> >         // Work:   __mb();
> >         // Work:   __smp_mb();

Did you try __smp_wmb()? And wmb()?

> >         // Work:   __ndelay(100);
> >         // Work:   __ndelay(10);
> >         // Broken: __ndelay(9);
> > 
> >        vq->split.avail_idx_shadow++;
> >         vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> >                                                 vq->split.avail_idx_shadow);
> 
> What if you stick __ndelay here?

And keep virtio_wmb above?

> 
> >         vq->num_added++;
> > 
> >         pr_debug("Added buffer head %i to %p\n", head, vq);
> >         END_USE(vq);
> >         :
> > }
> > 
> > I also tried to measure the time consumed by various barrier-related instructions using
> > ktime_get_ns(), which itself should account for most of the measured time. __smp_mb() is
> > slower than __smp_wmb() but faster than __mb():
> > 
> >     Instruction           Range of measured time [min  max] (ns)
> >     -----------------------------------------------------------
> >     __smp_wmb()           [32  1128032]
> >     __smp_mb()            [32  1160096]
> >     __mb()                [32  1162496]
> > 
> > Thanks,
> > Gavin
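
For clarity, a sketch of the experiment being suggested here (hypothetical,
combining the two hints above; not a posted patch): keep the original weak
barrier, and add the delay only after publishing the index:

	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

	/* Keep the original weak barrier... */
	virtio_wmb(vq->weak_barriers);

	vq->split.avail_idx_shadow++;
	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
						vq->split.avail_idx_shadow);

	/* ...and test whether a delay after publishing the index, rather
	 * than a stronger barrier before it, also hides the problem. */
	__ndelay(10);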


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-14  7:49 [PATCH] virtio_ring: Fix the stale index in available ring Gavin Shan
  2024-03-14  8:05 ` Michael S. Tsirkin
  2024-03-18 16:59 ` Will Deacon
@ 2024-03-19  6:14 ` Michael S. Tsirkin
  2 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  6:14 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, jasowang, xuanzhuo, yihyu, shan.gavin

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in a VM hosted by the grace-hopper machine,
> while the 'netperf' server runs on the grace-grace machine.
>
> The VM is started with virtio-net, and vhost has been enabled.
> We observed an error message spew from the VM and then a soft-lockup
> report. The error message indicates that the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to an endless effort
> to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and a soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and the available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
>   -accel kvm -machine virt,gic-version=host                            \
>      :                                                                 \
>   -netdev tap,id=vnet0,vhost=on                                        \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb'
> on ARM64. It should work for other architectures, but a performance loss is
> expected.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Yihuang Yu <yihyu@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  drivers/virtio/virtio_ring.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>  	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> -	/* Descriptors and available array need to be set before we expose the
> -	 * new available array entries. */
> -	virtio_wmb(vq->weak_barriers);
> +	/*
> +	 * Descriptors and available array need to be set before we expose
> +	 * the new available array entries. virtio_wmb() should be enough
> +	 * to ensure the order theoretically. However, a stronger barrier
> +	 * is needed by ARM64. Otherwise, the stale data can be observed
> +	 * by the host (vhost). A stronger barrier should work for other
> +	 * architectures, but performance loss is expected.
> +	 */
> +	virtio_mb(false);
>  	vq->split.avail_idx_shadow++;
>  	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>  						vq->split.avail_idx_shadow);



Something else to try is to disassemble the code and check that the compiler is not broken.

It also might help to replace the assignment above with WRITE_ONCE -
it has technically always been the right thing to do; it's just a big
change (it has to be done everywhere if done at all), so we never bothered,
and we never hit a compiler that would split or speculate stores ...
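
Concretely, the idea would look something like this (an illustrative,
untested sketch only, not a posted patch):

        /* WRITE_ONCE() keeps the compiler from splitting, merging or
         * speculating the stores that the host reads concurrently. */
        WRITE_ONCE(vq->split.vring.avail->ring[avail],
                   cpu_to_virtio16(_vq->vdev, head));

        virtio_wmb(vq->weak_barriers);

        vq->split.avail_idx_shadow++;
        WRITE_ONCE(vq->split.vring.avail->idx,
                   cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow));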


> -- 
> 2.44.0


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:09       ` Michael S. Tsirkin
@ 2024-03-19  6:38         ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  6:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/19/24 16:09, Michael S. Tsirkin wrote:

>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> -	/* Descriptors and available array need to be set before we expose the
>>>> -	 * new available array entries. */
>>>> -	virtio_wmb(vq->weak_barriers);
>>>> +	/*
>>>> +	 * Descriptors and available array need to be set before we expose
>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>> +	 * to ensure the order theoretically. However, a stronger barrier
>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>> +	 * architectures, but performance loss is expected.
>>>> +	 */
>>>> +	virtio_mb(false);
>>>>    	vq->split.avail_idx_shadow++;
>>>>    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>    						vq->split.avail_idx_shadow);
>>>
>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>> here, especially when ordering accesses to coherent memory.
>>>
>>> In practice, either the larger timing difference from the DSB or the fact
>>> that you're going from a Store->Store barrier to a full barrier is what
>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>> (e.g. via __smp_mb())?
>>>
>>> We definitely shouldn't take changes like this without a proper
>>> explanation of what is going on.
>>>
>>
>> Thanks for your comments, Will.
>>
>> Yes, DMB should work for us. However, it seems this instruction has issues on
>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
>> at the hardware level. I agree it's not the solution to replace DMB with DSB
>> before we fully understand the root cause.
>>
>> I tried possible replacements like the ones below. __smp_mb() can avoid the issue like
>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>
>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>> {
>>      :
>>          /* Put entry in available array (but don't update avail->idx until they
>>           * do sync). */
>>          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>>          /* Descriptors and available array need to be set before we expose the
>>           * new available array entries. */
>>          // Broken: virtio_wmb(vq->weak_barriers);
>>          // Broken: __dma_mb();
>>          // Work:   __mb();
>>          // Work:   __smp_mb();
>>          // Work:   __ndelay(100);
>>          // Work:   __ndelay(10);
>>          // Broken: __ndelay(9);
>>
>>         vq->split.avail_idx_shadow++;
>>          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>                                                  vq->split.avail_idx_shadow);
> 
> What if you stick __ndelay here?
> 

        /* Put entry in available array (but don't update avail->idx until they
          * do sync). */
         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

         /* Descriptors and available array need to be set before we expose the
          * new available array entries. */
         virtio_wmb(vq->weak_barriers);
         vq->split.avail_idx_shadow++;
         vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
                                                 vq->split.avail_idx_shadow);
         /* Try __ndelay(x) here as Michael suggested
          *
          * Work:      __ndelay(200);    possibly makes it hard to reproduce
          * Broken:    __ndelay(100);
          * Broken:    __ndelay(20);
          * Broken:    __ndelay(10);
          */
         __ndelay(200);
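
Note that a delay placed after the avail->idx store cannot order the two
stores at all; if it changes the outcome, it does so only by slowing the
producer down and making the race with vhost less likely.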


> 
>>          vq->num_added++;
>>
>>          pr_debug("Added buffer head %i to %p\n", head, vq);
>>          END_USE(vq);
>>          :
>> }
>>
>> I also tried to measure the time consumed by various barrier-related instructions using
>> ktime_get_ns(), which itself likely accounts for most of the measured time. __smp_mb() is slower than
>> __smp_wmb() but faster than __mb()
>>
>>      Instruction           Range of used time in ns
>>      ----------------------------------------------
>>      __smp_wmb()           [32  1128032]
>>      __smp_mb()            [32  1160096]
>>      __mb()                [32  1162496]
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:38         ` Gavin Shan
@ 2024-03-19  6:43           ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  6:43 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> On 3/19/24 16:09, Michael S. Tsirkin wrote:
> 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > -	 * new available array entries. */
> > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > +	/*
> > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > +	 * to ensure the order theoretically. However, a stronger barrier
> > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > +	 * architectures, but performance loss is expected.
> > > > > +	 */
> > > > > +	virtio_mb(false);
> > > > >    	vq->split.avail_idx_shadow++;
> > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >    						vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing difference from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smp_mb())?
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > > at the hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried possible replacements like the ones below. __smp_mb() can avoid the issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >      :
> > >          /* Put entry in available array (but don't update avail->idx until they
> > >           * do sync). */
> > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > 
> > >          /* Descriptors and available array need to be set before we expose the
> > >           * new available array entries. */
> > >          // Broken: virtio_wmb(vq->weak_barriers);
> > >          // Broken: __dma_mb();
> > >          // Work:   __mb();
> > >          // Work:   __smp_mb();
> > >          // Work:   __ndelay(100);
> > >          // Work:   __ndelay(10);
> > >          // Broken: __ndelay(9);
> > > 
> > >         vq->split.avail_idx_shadow++;
> > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >                                                  vq->split.avail_idx_shadow);
> > 
> > What if you stick __ndelay here?
> > 
> 
>        /* Put entry in available array (but don't update avail->idx until they
>          * do sync). */
>         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
>         /* Descriptors and available array need to be set before we expose the
>          * new available array entries. */
>         virtio_wmb(vq->weak_barriers);
>         vq->split.avail_idx_shadow++;
>         vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>                                                 vq->split.avail_idx_shadow);
>         /* Try __ndelay(x) here as Michael suggested
>          *
>          * Work:      __ndelay(200);    possibly makes it hard to reproduce
>          * Broken:    __ndelay(100);
>          * Broken:    __ndelay(20);
>          * Broken:    __ndelay(10);
>          */
>         __ndelay(200);

So we see that just changing the timing masks the race.
What are you using on the host side? vhost or qemu?

> 
> > 
> > >          vq->num_added++;
> > > 
> > >          pr_debug("Added buffer head %i to %p\n", head, vq);
> > >          END_USE(vq);
> > >          :
> > > }
> > > 
> > > I also tried to measure the time consumed by various barrier-related instructions using
> > > ktime_get_ns(), which itself likely accounts for most of the measured time. __smp_mb() is slower than
> > > __smp_wmb() but faster than __mb()
> > > 
> > >      Instruction           Range of used time in ns
> > >      ----------------------------------------------
> > >      __smp_wmb()           [32  1128032]
> > >      __smp_mb()            [32  1160096]
> > >      __mb()                [32  1162496]
> > > 
> 
> Thanks,
> Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:43           ` Michael S. Tsirkin
@ 2024-03-19  6:49             ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  6:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs


On 3/19/24 16:43, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
>> On 3/19/24 16:09, Michael S. Tsirkin wrote:
>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>     	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>     	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>> -	/* Descriptors and available array need to be set before we expose the
>>>>>> -	 * new available array entries. */
>>>>>> -	virtio_wmb(vq->weak_barriers);
>>>>>> +	/*
>>>>>> +	 * Descriptors and available array need to be set before we expose
>>>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>>>> +	 * to ensure the order theoretically. However, a stronger barrier
>>>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>>>> +	 * architectures, but performance loss is expected.
>>>>>> +	 */
>>>>>> +	virtio_mb(false);
>>>>>>     	vq->split.avail_idx_shadow++;
>>>>>>     	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>     						vq->split.avail_idx_shadow);
>>>>>
>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>> here, especially when ordering accesses to coherent memory.
>>>>>
>>>>> In practice, either the larger timing difference from the DSB or the fact
>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>> (e.g. via __smp_mb())?
>>>>>
>>>>> We definitely shouldn't take changes like this without a proper
>>>>> explanation of what is going on.
>>>>>
>>>>
>>>> Thanks for your comments, Will.
>>>>
>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
>>>> at the hardware level. I agree it's not the solution to replace DMB with DSB
>>>> before we fully understand the root cause.
>>>>
>>>> I tried possible replacements like the ones below. __smp_mb() can avoid the issue like
>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>
>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>> {
>>>>       :
>>>>           /* Put entry in available array (but don't update avail->idx until they
>>>>            * do sync). */
>>>>           avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>           vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>
>>>>           /* Descriptors and available array need to be set before we expose the
>>>>            * new available array entries. */
>>>>           // Broken: virtio_wmb(vq->weak_barriers);
>>>>           // Broken: __dma_mb();
>>>>           // Work:   __mb();
>>>>           // Work:   __smp_mb();
>>>>           // Work:   __ndelay(100);
>>>>           // Work:   __ndelay(10);
>>>>           // Broken: __ndelay(9);
>>>>
>>>>          vq->split.avail_idx_shadow++;
>>>>           vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>                                                   vq->split.avail_idx_shadow);
>>>
>>> What if you stick __ndelay here?
>>>
>>
>>         /* Put entry in available array (but don't update avail->idx until they
>>           * do sync). */
>>          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>>          /* Descriptors and available array need to be set before we expose the
>>           * new available array entries. */
>>          virtio_wmb(vq->weak_barriers);
>>          vq->split.avail_idx_shadow++;
>>          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>                                                  vq->split.avail_idx_shadow);
>>          /* Try __ndelay(x) here as Michael suggested
>>           *
>>           * Work:      __ndelay(200);    possibly makes it hard to reproduce
>>           * Broken:    __ndelay(100);
>>           * Broken:    __ndelay(20);
>>           * Broken:    __ndelay(10);
>>           */
>>          __ndelay(200);
> 
> So we see that just changing the timing masks the race.
> What are you using on the host side? vhost or qemu?
> 

__ndelay(200) may make the issue harder to reproduce, as I understand it.
More delay here gives vhost more slack, reducing the race.

The issue is only reproducible when vhost is turned on. Otherwise, we
aren't able to hit the issue.

    -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
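
For clarity, the suspected interleaving is roughly the following
(illustrative; guest on the left, vhost worker on the right):

        guest: store avail->ring[avail] = head
        guest: virtio_wmb()
        guest: store avail->idx = shadow
                                            vhost: load avail->idx      (new)
                                            vhost: load avail->ring[..] (stale!)

i.e. the host observes the new index before the ring element it guards,
despite the barrier between the two stores.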

>>
>>>
>>>>           vq->num_added++;
>>>>
>>>>           pr_debug("Added buffer head %i to %p\n", head, vq);
>>>>           END_USE(vq);
>>>>           :
>>>> }
>>>>
>>>> I also tried to measure the time consumed by various barrier-related instructions using
>>>> ktime_get_ns(), which itself likely accounts for most of the measured time. __smp_mb() is slower than
>>>> __smp_wmb() but faster than __mb()
>>>>
>>>>       Instruction           Range of used time in ns
>>>>       ----------------------------------------------
>>>>       __smp_wmb()           [32  1128032]
>>>>       __smp_mb()            [32  1160096]
>>>>       __mb()                [32  1162496]
>>>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:10         ` Michael S. Tsirkin
@ 2024-03-19  6:54           ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  6:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/19/24 16:10, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
>> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>>> On 3/19/24 02:59, Will Deacon wrote:
[...]
>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>> -	/* Descriptors and available array need to be set before we expose the
>>>>> -	 * new available array entries. */
>>>>> -	virtio_wmb(vq->weak_barriers);
>>>>> +	/*
>>>>> +	 * Descriptors and available array need to be set before we expose
>>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>>> +	 * to ensure the order theoretically. However, a stronger barrier
>>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>>> +	 * architectures, but performance loss is expected.
>>>>> +	 */
>>>>> +	virtio_mb(false);
>>>>>    	vq->split.avail_idx_shadow++;
>>>>>    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>    						vq->split.avail_idx_shadow);
>>>>
>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>> here, especially when ordering accesses to coherent memory.
>>>>
>>>> In practice, either the larger timing difference from the DSB or the fact
>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>> (e.g. via __smp_mb())?
>>>>
>>>> We definitely shouldn't take changes like this without a proper
>>>> explanation of what is going on.
>>>>
>>>
>>> Thanks for your comments, Will.
>>>
>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
>>> at the hardware level. I agree it's not the solution to replace DMB with DSB
>>> before we fully understand the root cause.
>>>
>>> I tried possible replacements like the ones below. __smp_mb() can avoid the issue like
>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>
>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>> {
>>>      :
>>>          /* Put entry in available array (but don't update avail->idx until they
>>>           * do sync). */
>>>          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>
>>>          /* Descriptors and available array need to be set before we expose the
>>>           * new available array entries. */
>>>          // Broken: virtio_wmb(vq->weak_barriers);
>>>          // Broken: __dma_mb();
>>>          // Work:   __mb();
>>>          // Work:   __smp_mb();
> 
> Did you try __smp_wmb? And wmb?
> 

virtio_wmb(false) is equivalent to __smp_wmb(), which is broken.

__wmb() works too; no issue found with it.
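
For reference, on arm64 these barriers expand as follows (per
arch/arm64/include/asm/barrier.h):

        __smp_wmb()  ->  dmb(ishst)
        __smp_mb()   ->  dmb(ish)
        __dma_mb()   ->  dmb(osh)
        __wmb()      ->  dsb(st)
        __mb()       ->  dsb(sy)

Note the broken/working split does not line up with barrier strength alone:
the outer-shareable full DMB (__dma_mb()) fails while the inner-shareable
full DMB (__smp_mb()) works.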

>>>          // Work:   __ndelay(100);
>>>          // Work:   __ndelay(10);
>>>          // Broken: __ndelay(9);
>>>
>>>         vq->split.avail_idx_shadow++;
>>>          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>                                                  vq->split.avail_idx_shadow);
>>
>> What if you stick __ndelay here?
> 
> And keep virtio_wmb above?
> 

The result has been shared through a separate reply.

>>
>>>          vq->num_added++;
>>>
>>>          pr_debug("Added buffer head %i to %p\n", head, vq);
>>>          END_USE(vq);
>>>          :
>>> }
>>>
>>> I also tried to measure the time consumed by various barrier-related instructions using
>>> ktime_get_ns(), which itself likely accounts for most of the measured time. __smp_mb() is slower than
>>> __smp_wmb() but faster than __mb()
>>>
>>>      Instruction           Range of used time in ns
>>>      ----------------------------------------------
>>>      __smp_wmb()           [32  1128032]
>>>      __smp_mb()            [32  1160096]
>>>      __mb()                [32  1162496]
>>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:54           ` Gavin Shan
@ 2024-03-19  7:04             ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  7:04 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > -	 * new available array entries. */
> > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > +	/*
> > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > +	 * to ensure the order theoretically. However, a stronger barrier
> > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > +	 * architectures, but performance loss is expected.
> > > > > > +	 */
> > > > > > +	virtio_mb(false);
> > > > > >    	vq->split.avail_idx_shadow++;
> > > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > >    						vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing difference from the DSB or the fact
> > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smp_mb())?
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > > > at the hardware level. I agree it's not the solution to replace DMB with DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried possible replacements like the ones below. __smp_mb() can avoid the issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >      :
> > > >          /* Put entry in available array (but don't update avail->idx until they
> > > >           * do sync). */
> > > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >          /* Descriptors and available array need to be set before we expose the
> > > >           * new available array entries. */
> > > >          // Broken: virtio_wmb(vq->weak_barriers);
> > > >          // Broken: __dma_mb();
> > > >          // Work:   __mb();
> > > >          // Work:   __smp_mb();
> > 
> > Did you try __smp_wmb? And wmb?
> > 
> 
> virtio_wmb(false) is equivalent to __smp_wmb(), which is broken.
> 
> __wmb() works too; no issue found with it.

Oh interesting. So how do smp_mb() and wmb() disassemble on this
platform? Can you please check?
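
(For example, assuming the built guest tree is at hand: run
aarch64-linux-gnu-objdump -d drivers/virtio/virtio_ring.o and check which
dmb/dsb variant the compiler emitted around the avail->idx store.)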


> > > >          // Work:   __ndelay(100);
> > > >          // Work:   __ndelay(10);
> > > >          // Broken: __ndelay(9);
> > > > 
> > > >         vq->split.avail_idx_shadow++;
> > > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > >                                                  vq->split.avail_idx_shadow);
> > > 
> > > What if you stick __ndelay here?
> > 
> > And keep virtio_wmb above?
> > 
> 
> The result has been shared through a separate reply.
> 
> > > 
> > > >          vq->num_added++;
> > > > 
> > > >          pr_debug("Added buffer head %i to %p\n", head, vq);
> > > >          END_USE(vq);
> > > >          :
> > > > }
> > > > 
> > > > I also tried to measure the time consumed by various barrier-related instructions using
> > > > ktime_get_ns(), which itself likely accounts for most of the measured time. __smp_mb() is slower than
> > > > __smp_wmb() but faster than __mb()
> > > > 
> > > >      Instruction           Range of used time in ns
> > > >      ----------------------------------------------
> > > >      __smp_wmb()           [32  1128032]
> > > >      __smp_mb()            [32  1160096]
> > > >      __mb()                [32  1162496]
> > > > 
> 
> Thanks,
> Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
@ 2024-03-19  7:04             ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  7:04 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > -	 * new available array entries. */
> > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > +	/*
> > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > +	 * to ensuere the order theoretically. However, a stronger barrier
> > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > +	 * architectures, but performance loss is expected.
> > > > > > +	 */
> > > > > > +	virtio_mb(false);
> > > > > >    	vq->split.avail_idx_shadow++;
> > > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > >    						vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >      :
> > > >          /* Put entry in available array (but don't update avail->idx until they
> > > >           * do sync). */
> > > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >          /* Descriptors and available array need to be set before we expose the
> > > >           * new available array entries. */
> > > >          // Broken: virtio_wmb(vq->weak_barriers);
> > > >          // Broken: __dma_mb();
> > > >          // Work:   __mb();
> > > >          // Work:   __smp_mb();
> > 
> > Did you try __smp_wmb ? And wmb?
> > 
> 
> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.
> 
> __wmb() works either. No issue found with it.

Oh interesting. So how do smp_mb() and wmb() disassemble on this
platform? Can you please check?


> > > >          // Work:   __ndelay(100);
> > > >          // Work:   __ndelay(10);
> > > >          // Broken: __ndelay(9);
> > > > 
> > > >         vq->split.avail_idx_shadow++;
> > > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > >                                                  vq->split.avail_idx_shadow);
> > > 
> > > What if you stick __ndelay here?
> > 
> > And keep virtio_wmb above?
> > 
> 
> The result has been shared through a separate reply.
> 
> > > 
> > > >          vq->num_added++;
> > > > 
> > > >          pr_debug("Added buffer head %i to %p\n", head, vq);
> > > >          END_USE(vq);
> > > >          :
> > > > }
> > > > 
> > > > I also tried to measure the time consumed by various barrier-related instructions using
> > > > ktime_get_ns(), which should have consumed most of the time. __smp_mb() is slower than
> > > > __smp_wmb() but faster than __mb():
> > > > 
> > > >      Instruction           Range of used time in ns
> > > >      ----------------------------------------------
> > > >      __smp_wmb()           [32  1128032]
> > > >      __smp_mb()            [32  1160096]
> > > >      __mb()                [32  1162496]
> > > > 
> 
> Thanks,
> Gavin



^ permalink raw reply	[flat|nested] 78+ messages in thread
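
For reference, a minimal sketch of how such per-barrier timings could be
taken from kernel context. The loop bound and the pr_info() reporting are
illustrative assumptions, not the actual harness used above:

    /* Sketch: time one barrier with ktime_get_ns(); not the real harness. */
    static void measure_barrier_cost(void)
    {
            u64 t, delta, lo = U64_MAX, hi = 0;
            int i;

            for (i = 0; i < 1000; i++) {
                    t = ktime_get_ns();
                    __smp_wmb();                    /* barrier under test */
                    delta = ktime_get_ns() - t;
                    lo = min(lo, delta);
                    hi = max(hi, delta);
            }

            pr_info("__smp_wmb(): [%llu %llu] ns\n", lo, hi);
    }

The very large upper bounds in the table are consistent with interrupts or
preemption occasionally landing inside the timed window, rather than a
barrier alone costing a millisecond.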

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:49             ` Gavin Shan
@ 2024-03-19  7:09               ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  7:09 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> 
> On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > >     	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > >     	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > > -	 * new available array entries. */
> > > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > > +	/*
> > > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > > +	 * to ensure the order theoretically. However, a stronger barrier
> > > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > > +	 * architectures, but performance loss is expected.
> > > > > > > +	 */
> > > > > > > +	virtio_mb(false);
> > > > > > >     	vq->split.avail_idx_shadow++;
> > > > > > >     	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >     						vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > 
> > > > > > In practice, either the larger timing difference from the DSB or the fact
> > > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > (e.g. via __smp_mb())?
> > > > > > 
> > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > explanation of what is going on.
> > > > > > 
> > > > > 
> > > > > Thanks for your comments, Will.
> > > > > 
> > > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > > > > at the hardware level. I agree replacing DMB with DSB is not the solution
> > > > > before we fully understand the root cause.
> > > > > 
> > > > > I tried the possible replacements below. __smp_mb() can avoid the issue like
> > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > > 
> > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > {
> > > > >       :
> > > > >           /* Put entry in available array (but don't update avail->idx until they
> > > > >            * do sync). */
> > > > >           avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >           vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > 
> > > > >           /* Descriptors and available array need to be set before we expose the
> > > > >            * new available array entries. */
> > > > >           // Broken: virtio_wmb(vq->weak_barriers);
> > > > >           // Broken: __dma_mb();
> > > > >           // Work:   __mb();
> > > > >           // Work:   __smp_mb();
> > > > >           // Work:   __ndelay(100);
> > > > >           // Work:   __ndelay(10);
> > > > >           // Broken: __ndelay(9);
> > > > > 
> > > > >          vq->split.avail_idx_shadow++;
> > > > >           vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >                                                   vq->split.avail_idx_shadow);
> > > > 
> > > > What if you stick __ndelay here?
> > > > 
> > > 
> > >         /* Put entry in available array (but don't update avail->idx until they
> > >           * do sync). */
> > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > 
> > >          /* Descriptors and available array need to be set before we expose the
> > >           * new available array entries. */
> > >          virtio_wmb(vq->weak_barriers);
> > >          vq->split.avail_idx_shadow++;
> > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >                                                  vq->split.avail_idx_shadow);
> > >          /* Try __ndelay(x) here as Michael suggested
> > >           *
> > >           * Work:      __ndelay(200);    possibly makes it harder to reproduce
> > >           * Broken:    __ndelay(100);
> > >           * Broken:    __ndelay(20);
> > >           * Broken:    __ndelay(10);
> > >           */
> > >          __ndelay(200);
> > 
> > So we see that just changing the timing masks the race.
> > What are you using on the host side? vhost or qemu?
> > 
> 
> __ndelay(200) may make the issue harder to reproduce, as I understand it.
> More delays here will give vhost relief, reducing the race.
> 
> The issue is only reproducible when vhost is turned on. Otherwise, we
> aren't able to hit the issue.
> 
>    -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
>    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0


Given it's vhost, it's also possible that the issue is host side.
I wonder what happens if we stick a delay or a stronger barrier
in vhost.c - either here:

        /* Make sure buffer is written before we update index. */
        smp_wmb();      


or here:

                /* Only get avail ring entries after they have been
                 * exposed by guest.
                 */
                smp_rmb();

?


> > > 
> > > > 
> > > > >           vq->num_added++;
> > > > > 
> > > > >           pr_debug("Added buffer head %i to %p\n", head, vq);
> > > > >           END_USE(vq);
> > > > >           :
> > > > > }
> > > > > 
> > > > > I also tried to measure the time consumed by various barrier-related instructions using
> > > > > ktime_get_ns(), which should have consumed most of the time. __smp_mb() is slower than
> > > > > __smp_wmb() but faster than __mb():
> > > > > 
> > > > >       Instruction           Range of used time in ns
> > > > >       ----------------------------------------------
> > > > >       __smp_wmb()           [32  1128032]
> > > > >       __smp_mb()            [32  1160096]
> > > > >       __mb()                [32  1162496]
> > > > > 
> 
> Thanks,
> Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread
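
To make the pairing under discussion explicit: the split virtqueue is a
message-passing pattern, with the guest's virtio_wmb() in
virtqueue_add_split() pairing against the smp_rmb() in vhost's
vhost_get_vq_desc(). A simplified sketch of the two sides (field accesses
abridged from the actual kernel code):

    /* Guest (producer), drivers/virtio/virtio_ring.c, simplified: */
    avail->ring[avail_idx & mask] = head;   /* W1: ring entry */
    virtio_wmb(vq->weak_barriers);          /* order W1 before W2 */
    avail->idx = ++avail_idx;               /* W2: publish index */

    /* Host (consumer), drivers/vhost/vhost.c, simplified: */
    idx = avail->idx;                       /* R1: observe index */
    smp_rmb();                              /* order R1 before R2 */
    head = avail->ring[last_avail & mask];  /* R2: ring entry */

The reported failure is the forbidden outcome of this pattern: vhost
observes the new idx (W2) but a stale ring[] entry (W1), and then
publishes the stale head (135) to the used ring.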

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-18 16:59 ` Will Deacon
  2024-03-19  4:59     ` Gavin Shan
@ 2024-03-19  7:36   ` Michael S. Tsirkin
  2024-03-19 18:21     ` Will Deacon
  1 sibling, 1 reply; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  7:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gavin Shan, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin

On Mon, Mar 18, 2024 at 04:59:24PM +0000, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > The issue was reported by Yihuang Yu, who ran 'netperf' tests on
> > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > client is started in a VM hosted by the grace-hopper machine,
> > while the 'netperf' server runs on the grace-grace machine.
> > 
> > The VM is started with virtio-net, and vhost has been enabled.
> > We observe an error message spewed from the VM, followed by a
> > soft-lockup report. The error message indicates the data associated
> > with the descriptor (index: 135) has been released, and the queue
> > is marked as broken. It eventually leads to an endless effort to
> > fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > and a soft-lockup. The stale index 135 is fetched from the available
> > ring and published to the used ring by vhost, meaning we have a
> > disordered write to the available ring element and available index.
> > 
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> >   -accel kvm -machine virt,gic-version=host                            \
> >      :                                                                 \
> >   -netdev tap,id=vnet0,vhost=on                                        \
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > 
> >   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > 
> > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the stronger
> > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > ARM64. It should work for other architectures, but performance loss is
> > expected.
> > 
> > Cc: stable@vger.kernel.org
> > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > ---
> >  drivers/virtio/virtio_ring.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 49299b1f9ec7..7d852811c912 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> >  	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> >  	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >  
> > -	/* Descriptors and available array need to be set before we expose the
> > -	 * new available array entries. */
> > -	virtio_wmb(vq->weak_barriers);
> > +	/*
> > +	 * Descriptors and available array need to be set before we expose
> > +	 * the new available array entries. virtio_wmb() should be enough
> > +	 * to ensure the order theoretically. However, a stronger barrier
> > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > +	 * by the host (vhost). A stronger barrier should work for other
> > +	 * architectures, but performance loss is expected.
> > +	 */
> > +	virtio_mb(false);
> >  	vq->split.avail_idx_shadow++;
> >  	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> >  						vq->split.avail_idx_shadow);
> 
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
> 
> In practice, either the larger timing difference from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smp_mb())?
> 
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
> 
> Will

Just making sure: so on this system, how do
smp_wmb() and wmb() differ? smp_wmb is normally for synchronizing
with a kernel running on another CPU, and we are doing something
unusual in virtio when we use it to synchronize with the host
as opposed to the guest - e.g. CONFIG_SMP is special cased
because of this:

#define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)

Note __smp_wmb not smp_wmb which would be a NOP on UP.


-- 
MST


^ permalink raw reply	[flat|nested] 78+ messages in thread
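
For readers tracking which instruction each helper emits: the wrappers
being debated live in include/linux/virtio_ring.h and select between the
SMP and DMA barrier flavours, roughly as follows (arm64 expansions noted
in the comments):

    static inline void virtio_wmb(bool weak_barriers)
    {
            if (weak_barriers)
                    virt_wmb();     /* kcsan_wmb(); __smp_wmb() -> dmb ishst */
            else
                    dma_wmb();      /* dmb oshst */
    }

    static inline void virtio_mb(bool weak_barriers)
    {
            if (weak_barriers)
                    virt_mb();      /* __smp_mb() -> dmb ish */
            else
                    mb();           /* dsb sy */
    }

So the patch's virtio_mb(false) resolves to mb(), a DSB, which is why the
change reads as swapping a DMB for a DSB on arm64.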

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  7:04             ` Michael S. Tsirkin
@ 2024-03-19  7:41               ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  7:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/19/24 17:04, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
>> On 3/19/24 16:10, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
>>>> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>>>>> On 3/19/24 02:59, Will Deacon wrote:
>> [...]
>>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>>     	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>>     	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>> -	/* Descriptors and available array need to be set before we expose the
>>>>>>> -	 * new available array entries. */
>>>>>>> -	virtio_wmb(vq->weak_barriers);
>>>>>>> +	/*
>>>>>>> +	 * Descriptors and available array need to be set before we expose
>>>>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>>>>> +	 * to ensure the order theoretically. However, a stronger barrier
>>>>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>>>>> +	 * architectures, but performance loss is expected.
>>>>>>> +	 */
>>>>>>> +	virtio_mb(false);
>>>>>>>     	vq->split.avail_idx_shadow++;
>>>>>>>     	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>>     						vq->split.avail_idx_shadow);
>>>>>>
>>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>>> here, especially when ordering accesses to coherent memory.
>>>>>>
>>>>>> In practice, either the larger timing difference from the DSB or the fact
>>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>>> (e.g. via __smp_mb())?
>>>>>>
>>>>>> We definitely shouldn't take changes like this without a proper
>>>>>> explanation of what is going on.
>>>>>>
>>>>>
>>>>> Thanks for your comments, Will.
>>>>>
>>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
>>>>> at the hardware level. I agree replacing DMB with DSB is not the solution
>>>>> before we fully understand the root cause.
>>>>>
>>>>> I tried the possible replacements below. __smp_mb() can avoid the issue like
>>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>>
>>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>>> {
>>>>>       :
>>>>>           /* Put entry in available array (but don't update avail->idx until they
>>>>>            * do sync). */
>>>>>           avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>           vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>
>>>>>           /* Descriptors and available array need to be set before we expose the
>>>>>            * new available array entries. */
>>>>>           // Broken: virtio_wmb(vq->weak_barriers);
>>>>>           // Broken: __dma_mb();
>>>>>           // Work:   __mb();
>>>>>           // Work:   __smp_mb();
>>>
>>> Did you try __smp_wmb ? And wmb?
>>>
>>
>> virtio_wmb(false) is equivalent to __smp_wmb(), which is broken.
>>
>> __wmb() also works. No issue found with it.
> 
> Oh interesting. So how do smp_mb() and wmb() disassemble on this
> platform? Can you please check?
> 

I don't see that they have been translated incorrectly on Nvidia's grace-hopper:

===> virtio_wmb(vq->weak_barriers)

0xffff8000807b07c8 <+1168>:  ldrb    w0, [x20, #66]
0xffff8000807b07cc <+1172>:  cbz     w0, 0xffff8000807b089c <virtqueue_add_split+1380>
0xffff8000807b07d0 <+1176>:  dmb     ishst     // same to __smp_wmb()
     :
0xffff8000807b089c <+1380>:  dmb     oshst     // same to __dma_wmb()
0xffff8000807b08a0 <+1384>:  b       0xffff8000807b07d4 <virtqueue_add_split+1180>

===> wmb()

0xffff8000807b07c8 <+1168>:  dsb     st


> 
>>>>>           // Work:   __ndelay(100);
>>>>>           // Work:   __ndelay(10);
>>>>>           // Broken: __ndelay(9);
>>>>>
>>>>>          vq->split.avail_idx_shadow++;
>>>>>           vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>                                                   vq->split.avail_idx_shadow);
>>>>
>>>> What if you stick __ndelay here?
>>>
>>> And keep virtio_wmb above?
>>>
>>
>> The result has been shared through a separate reply.
>>
>>>>
>>>>>           vq->num_added++;
>>>>>
>>>>>           pr_debug("Added buffer head %i to %p\n", head, vq);
>>>>>           END_USE(vq);
>>>>>           :
>>>>> }
>>>>>
>>>>> I also tried to measure the time consumed by various barrier-related instructions using
>>>>> ktime_get_ns(), which should have consumed most of the time. __smp_mb() is slower than
>>>>> __smp_wmb() but faster than __mb():
>>>>>
>>>>>       Instruction           Range of used time in ns
>>>>>       ----------------------------------------------
>>>>>       __smp_wmb()           [32  1128032]
>>>>>       __smp_mb()            [32  1160096]
>>>>>       __mb()                [32  1162496]
>>>>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread
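
Summarizing the arm64 barrier-to-instruction mappings that appear in this
thread (from arch/arm64/include/asm/barrier.h), annotated with the
behaviour reported in Gavin's experiments:

    __smp_wmb()     dmb(ishst)      inner shareable, stores only      broken
    __dma_wmb()     dmb(oshst)      outer shareable, stores only
    __dma_mb()      dmb(osh)        outer shareable, full             broken
    __smp_mb()      dmb(ish)        inner shareable, full             works
    __wmb()         dsb(st)         full system, stores, completion   works
    __mb()          dsb(sy)         full system, full, completion     works

A DMB only orders memory accesses, while a DSB additionally waits for
outstanding accesses to complete, which is why the timing difference
rather than the ordering semantics is suspected above.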

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  7:09               ` Michael S. Tsirkin
@ 2024-03-19  8:08                 ` Gavin Shan
  0 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19  8:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/19/24 17:09, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
>>
>> On 3/19/24 16:43, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
>>>> On 3/19/24 16:09, Michael S. Tsirkin wrote:
>>>>
>>>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>>>      	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>>>      	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>>> -	/* Descriptors and available array need to be set before we expose the
>>>>>>>> -	 * new available array entries. */
>>>>>>>> -	virtio_wmb(vq->weak_barriers);
>>>>>>>> +	/*
>>>>>>>> +	 * Descriptors and available array need to be set before we expose
>>>>>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>>>>>> +	 * to ensure the order theoretically. However, a stronger barrier
>>>>>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>>>>>> +	 * architectures, but performance loss is expected.
>>>>>>>> +	 */
>>>>>>>> +	virtio_mb(false);
>>>>>>>>      	vq->split.avail_idx_shadow++;
>>>>>>>>      	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>>>      						vq->split.avail_idx_shadow);
>>>>>>>
>>>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>>>> here, especially when ordering accesses to coherent memory.
>>>>>>>
>>>>>>> In practice, either the larger timing difference from the DSB or the fact
>>>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>>>> (e.g. via __smp_mb())?
>>>>>>>
>>>>>>> We definitely shouldn't take changes like this without a proper
>>>>>>> explanation of what is going on.
>>>>>>>
>>>>>>
>>>>>> Thanks for your comments, Will.
>>>>>>
>>>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
>>>>>> at the hardware level. I agree replacing DMB with DSB is not the solution
>>>>>> before we fully understand the root cause.
>>>>>>
>>>>>> I tried the possible replacements below. __smp_mb() can avoid the issue like
>>>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>>>
>>>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>>>> {
>>>>>>        :
>>>>>>            /* Put entry in available array (but don't update avail->idx until they
>>>>>>             * do sync). */
>>>>>>            avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>            vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>
>>>>>>            /* Descriptors and available array need to be set before we expose the
>>>>>>             * new available array entries. */
>>>>>>            // Broken: virtio_wmb(vq->weak_barriers);
>>>>>>            // Broken: __dma_mb();
>>>>>>            // Work:   __mb();
>>>>>>            // Work:   __smp_mb();
>>>>>>            // Work:   __ndelay(100);
>>>>>>            // Work:   __ndelay(10);
>>>>>>            // Broken: __ndelay(9);
>>>>>>
>>>>>>           vq->split.avail_idx_shadow++;
>>>>>>            vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>                                                    vq->split.avail_idx_shadow);
>>>>>
>>>>> What if you stick __ndelay here?
>>>>>
>>>>
>>>>          /* Put entry in available array (but don't update avail->idx until they
>>>>            * do sync). */
>>>>           avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>           vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>
>>>>           /* Descriptors and available array need to be set before we expose the
>>>>            * new available array entries. */
>>>>           virtio_wmb(vq->weak_barriers);
>>>>           vq->split.avail_idx_shadow++;
>>>>           vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>                                                   vq->split.avail_idx_shadow);
>>>>           /* Try __ndelay(x) here as Michael suggested
>>>>            *
>>>>            * Work:      __ndelay(200);    possibly makes it harder to reproduce
>>>>            * Broken:    __ndelay(100);
>>>>            * Broken:    __ndelay(20);
>>>>            * Broken:    __ndelay(10);
>>>>            */
>>>>           __ndelay(200);
>>>
>>> So we see that just changing the timing masks the race.
>>> What are you using on the host side? vhost or qemu?
>>>
>>
>> __ndelay(200) may make the issue harder to reproduce, as I understand it.
>> More delays here will give vhost relief, reducing the race.
>>
>> The issue is only reproducible when vhost is turned on. Otherwise, we
>> aren't able to hit the issue.
>>
>>     -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
>>     -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> 
> 
> Given it's vhost, it's also possible that the issue is host side.
> I wonder what happens if we stick a delay or a stronger barrier
> in vhost.c - either here:
> 
>          /* Make sure buffer is written before we update index. */
>          smp_wmb();
> 
> 
> or here:
> 
>                  /* Only get avail ring entries after they have been
>                   * exposed by guest.
>                   */
>                  smp_rmb();
> 
> ?
> 

It's possible. However, I can still hit the issue after both of them are
replaced with '__mb()'. So the issue seems to be on the guest side, where the
written data isn't observed in time by the CPU at the far end (the vhost worker).

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..327b68d21b02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2529,7 +2529,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
                 /* Only get avail ring entries after they have been
                  * exposed by guest.
                  */
-               smp_rmb();
+               // smp_rmb();
+               __mb();
         }
  
         /* Grab the next descriptor number they're advertising, and increment
@@ -2703,7 +2704,9 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
         r = __vhost_add_used_n(vq, heads, count);
  
         /* Make sure buffer is written before we update index. */
-       smp_wmb();
+       // smp_wmb();
+       __mb();
+

 From the guest:

[   14.102608] virtio_net virtio0: output.0:id 80 is not a head!


>>>>
>>>>>
>>>>>>            vq->num_added++;
>>>>>>
>>>>>>            pr_debug("Added buffer head %i to %p\n", head, vq);
>>>>>>            END_USE(vq);
>>>>>>            :
>>>>>> }
>>>>>>
>>>>>> I also tried to measure the time consumed by various barrier-related instructions using
>>>>>> ktime_get_ns(), which should have consumed most of the time. __smp_mb() is slower than
>>>>>> __smp_wmb() but faster than __mb():
>>>>>>
>>>>>>        Instruction           Range of used time in ns
>>>>>>        ----------------------------------------------
>>>>>>        __smp_wmb()           [32  1128032]
>>>>>>        __smp_mb()            [32  1160096]
>>>>>>        __mb()                [32  1162496]
>>>>>>

Thanks,
Gavin


^ permalink raw reply related	[flat|nested] 78+ messages in thread
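
For reference, the two guest stores being reordered target adjacent
fields of the split ring's available ring, defined in
include/uapi/linux/virtio_ring.h:

    struct vring_avail {
            __virtio16 flags;
            __virtio16 idx;         /* published second, after the barrier */
            __virtio16 ring[];      /* ring[avail] written first */
    };

Both fields live in ordinary coherent guest memory shared with the vhost
worker, which is what makes the apparent failure of a plain DMB so
surprising.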

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  6:54           ` Gavin Shan
@ 2024-03-19  8:28             ` Michael S. Tsirkin
  0 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  8:28 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > -	 * new available array entries. */
> > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > +	/*
> > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > +	 * to ensure the order theoretically. However, a stronger barrier
> > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > +	 * architectures, but performance loss is expected.
> > > > > > +	 */
> > > > > > +	virtio_mb(false);
> > > > > >    	vq->split.avail_idx_shadow++;
> > > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > >    						vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing difference from the DSB or the fact
> > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smp_mb())?
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > > > at the hardware level. I agree replacing DMB with DSB is not the solution
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacements below. __smp_mb() can avoid the issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >      :
> > > >          /* Put entry in available array (but don't update avail->idx until they
> > > >           * do sync). */
> > > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >          /* Descriptors and available array need to be set before we expose the
> > > >           * new available array entries. */
> > > >          // Broken: virtio_wmb(vq->weak_barriers);
> > > >          // Broken: __dma_mb();
> > > >          // Work:   __mb();
> > > >          // Work:   __smp_mb();
> > 
> > Did you try __smp_wmb ? And wmb?
> > 
> 
> virtio_wmb(false) is equivalent to __smp_wmb(), which is broken.


> __wmb() also works. No issue found with it.

So this is 
arch/arm64/include/asm/barrier.h:#define __smp_wmb()    dmb(ishst)

versus

arch/arm64/include/asm/barrier.h:#define __wmb()                dsb(st)


right?

Really interesting. And you are saying dma_wmb does not work either:

arch/arm64/include/asm/barrier.h:#define __dma_wmb()    dmb(oshst)


Really strange.




However, I found this:
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-attributes/Cacheable-and-shareable-memory-attributes



Going by this picture, all CPUs are in the inner shareable domain, so
ishst should be enough to synchronize, right?


However, there are two points that give me pause here:


Inner shareable
This represents a shareability domain that can be shared by multiple
processors, but not necessarily all of the agents in the system. A
system might have multiple Inner Shareable domains. An operation that
affects one Inner Shareable domain does not affect other Inner Shareable
domains in the system. An example of such a domain might be a quad-core
Cortex-A57 cluster.


Point 1 - so is it possible that there are multiple inner
shareable domains in this system? With vhost running
inside one and the guest inside another? Does anyone know if that
is the case on nvidia grace hopper and how to find out?



Outer shareable
An outer shareable (OSH) domain is shared by multiple agents and
can consist of one or more inner shareable domains. An operation that
affects an outer shareable domain also implicitly affects all inner
shareable domains inside it. However, it does not otherwise behave as an
inner shareable operation.


I do not get this last sentence. If it affects all inner domains, then
how does it "not otherwise behave as an inner shareable operation"?






> 
> > > >          // Work:   __ndelay(100);
> > > >          // Work:   __ndelay(10);
> > > >          // Broken: __ndelay(9);
> > > > 
> > > >         vq->split.avail_idx_shadow++;
> > > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > >                                                  vq->split.avail_idx_shadow);
> > > 
> > > What if you stick __ndelay here?
> > 
> > And keep virtio_wmb above?
> > 
> 
> The result has been shared through a separate reply.
> 
> > > 
> > > >          vq->num_added++;
> > > > 
> > > >          pr_debug("Added buffer head %i to %p\n", head, vq);
> > > >          END_USE(vq);
> > > >          :
> > > > }
> > > > 
> > > > I also tried to measure the time consumed by various barrier-related instructions using
> > > > ktime_get_ns(), which should have consumed most of the time. __smp_mb() is slower than
> > > > __smp_wmb() but faster than __mb():
> > > > 
> > > >      Instruction           Range of used time in ns
> > > >      ----------------------------------------------
> > > >      __smp_wmb()           [32  1128032]
> > > >      __smp_mb()            [32  1160096]
> > > >      __mb()                [32  1162496]
> > > > 
> 
> Thanks,
> Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread
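
The shareability question maps onto a standard message-passing litmus
test. In Linux-kernel memory model form (runnable with herd7 against
tools/memory-model; here x stands for the avail ring slot and y for
avail->idx, both hypothetical names), the outcome being reported as
observed is one the model forbids:

    C MP+avail-ring

    {}

    P0(int *x, int *y)
    {
            WRITE_ONCE(*x, 135);    /* guest: store ring entry */
            smp_wmb();
            WRITE_ONCE(*y, 1);      /* guest: publish avail->idx */
    }

    P1(int *x, int *y)
    {
            int r0;
            int r1;

            r0 = READ_ONCE(*y);     /* vhost: read avail->idx */
            smp_rmb();
            r1 = READ_ONCE(*x);     /* vhost: read ring entry */
    }

    exists (1:r0=1 /\ 1:r1=0)

Both the LKMM and the Armv8 memory model forbid this outcome with
dmb ishst / dmb ishld in place, which is why the observation points at
either a measurement artifact or something machine-specific rather than
a missing barrier in the code.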

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  8:08                 ` Gavin Shan
@ 2024-03-19  8:49                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-19  8:49 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 06:08:27PM +1000, Gavin Shan wrote:
> On 3/19/24 17:09, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > > >      	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > > > >      	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > > > > > -	 * new available array entries. */
> > > > > > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > > > > > +	/*
> > > > > > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > > > > > +	 * to ensuere the order theoretically. However, a stronger barrier
> > > > > > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > > > > > +	 * architectures, but performance loss is expected.
> > > > > > > > > +	 */
> > > > > > > > > +	virtio_mb(false);
> > > > > > > > >      	vq->split.avail_idx_shadow++;
> > > > > > > > >      	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > > >      						vq->split.avail_idx_shadow);
> > > > > > > > 
> > > > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > > > 
> > > > > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > > > (e.g. via __smp_mb()).
> > > > > > > > 
> > > > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > > > explanation of what is going on.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks for your comments, Will.
> > > > > > > 
> > > > > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > > > > before we fully understand the root cause.
> > > > > > > 
> > > > > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > > > > 
> > > > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > > > {
> > > > > > >        :
> > > > > > >            /* Put entry in available array (but don't update avail->idx until they
> > > > > > >             * do sync). */
> > > > > > >            avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > >            vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > 
> > > > > > >            /* Descriptors and available array need to be set before we expose the
> > > > > > >             * new available array entries. */
> > > > > > >            // Broken: virtio_wmb(vq->weak_barriers);
> > > > > > >            // Broken: __dma_mb();
> > > > > > >            // Work:   __mb();
> > > > > > >            // Work:   __smp_mb();
> > > > > > >            // Work:   __ndelay(100);
> > > > > > >            // Work:   __ndelay(10);
> > > > > > >            // Broken: __ndelay(9);
> > > > > > > 
> > > > > > >           vq->split.avail_idx_shadow++;
> > > > > > >            vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >                                                    vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > What if you stick __ndelay here?
> > > > > > 
> > > > > 
> > > > >          /* Put entry in available array (but don't update avail->idx until they
> > > > >            * do sync). */
> > > > >           avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >           vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > 
> > > > >           /* Descriptors and available array need to be set before we expose the
> > > > >            * new available array entries. */
> > > > >           virtio_wmb(vq->weak_barriers);
> > > > >           vq->split.avail_idx_shadow++;
> > > > >           vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >                                                   vq->split.avail_idx_shadow);
> > > > >           /* Try __ndelay(x) here as Michael suggested
> > > > >            *
> > > > >            * Work:      __ndelay(200);    possibly makes it harder to reproduce
> > > > >            * Broken:    __ndelay(100);
> > > > >            * Broken:    __ndelay(20);
> > > > >            * Broken:    __ndelay(10);
> > > > >            */
> > > > >           __ndelay(200);
> > > > 
> > > > So we see that just changing the timing masks the race.
> > > > What are you using on the host side? vhost or qemu?
> > > > 
> > > 
> > > __ndelay(200) may make the issue harder to reproduce, as I understand it.
> > > More delays here will give vhost relief, reducing the race.
> > > 
> > > The issue is only reproducible when vhost is turned on. Otherwise, we
> > > aren't able to hit the issue.
> > > 
> > >     -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
> > >     -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> > 
> > 
> > Given it's vhost, it's also possible that the issue is host side.
> > I wonder what happens if we stick a delay or a stronger barrier
> > in vhost.c - either here:
> > 
> >          /* Make sure buffer is written before we update index. */
> >          smp_wmb();
> > 
> > 
> > or here:
> > 
> >                  /* Only get avail ring entries after they have been
> >                   * exposed by guest.
> >                   */
> >                  smp_rmb();
> > 
> > ?
> > 
> 
> It's possible. However, I can still hit the issue after both of them are
> replaced with '__mb()'. So the issue seems to be on the guest side, where the
> written data isn't observed in time by the CPU on the far end (the vhost worker).
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..327b68d21b02 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2529,7 +2529,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>                 /* Only get avail ring entries after they have been
>                  * exposed by guest.
>                  */
> -               smp_rmb();
> +               // smp_rmb();
> +               __mb();
>         }
>         /* Grab the next descriptor number they're advertising, and increment
> @@ -2703,7 +2704,9 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
>         r = __vhost_add_used_n(vq, heads, count);
>         /* Make sure buffer is written before we update index. */
> -       smp_wmb();
> +       // smp_wmb();
> +       __mb();
> +
> 
> From the guest:
> 
> [   14.102608] virtio_net virtio0: output.0:id 80 is not a head!


You realize, however, that for that to happen, vhost first has to
use the wrong index?
Again, the fact that tweaking the code helps is suggestive, but it does
not lead us to the root cause.

So you should be able to detect the bad index when vhost gets it.
Here's a suggestion: stick a "valid" flag in each desc;
then vhost should notice if it gets a desc with that flag clear.
Like this, maybe? (A sketch of the vhost-side check follows the diff.)


diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..2789e51b57e6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -524,7 +524,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
 	struct vring_desc_extra *extra = vring->split.desc_extra;
 	u16 next;
 
-	desc[i].flags = cpu_to_virtio16(vq->vdev, flags);
+	desc[i].flags = cpu_to_virtio16(vq->vdev, flags | (0x1 << 5)); /* Mark it valid */
 	desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
 	desc[i].len = cpu_to_virtio32(vq->vdev, len);
 
@@ -721,6 +721,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 			i = virtio16_to_cpu(_vq->vdev, desc[i].next);
 		} else
 			i = vring_unmap_one_split(vq, i);
+		desc[i].flags = 0x0;
 	}
 
 free_indirect:
@@ -775,11 +776,13 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 	i = head;
 
 	while (vq->split.vring.desc[i].flags & nextflag) {
+		vq->split.vring.desc[i].flags = 0x0;
 		vring_unmap_one_split(vq, i);
 		i = vq->split.desc_extra[i].next;
 		vq->vq.num_free++;
 	}
 
+	vq->split.vring.desc[i].flags = 0x0;
 	vring_unmap_one_split(vq, i);
 	vq->split.desc_extra[i].next = vq->free_head;
 	vq->free_head = head;
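
The vhost side of the check is not shown above; it might look something
like this (a hypothetical sketch only - bit 5 matches the guest-side hunk,
and the placement in vhost_get_vq_desc(), right after the descriptor has
been copied from guest memory, is just for illustration):

	/* desc has just been read from the guest's ring at slot i */
	if (!(vhost16_to_cpu(vq, desc.flags) & (0x1 << 5))) {
		vq_err(vq, "desc %d fetched with valid flag clear\n", i);
		return -EFAULT;
	}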


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  7:36   ` Michael S. Tsirkin
@ 2024-03-19 18:21     ` Will Deacon
  0 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2024-03-19 18:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin

On Tue, Mar 19, 2024 at 03:36:31AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2024 at 04:59:24PM +0000, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > >  	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >  
> > > -	/* Descriptors and available array need to be set before we expose the
> > > -	 * new available array entries. */
> > > -	virtio_wmb(vq->weak_barriers);
> > > +	/*
> > > +	 * Descriptors and available array need to be set before we expose
> > > +	 * the new available array entries. virtio_wmb() should be enough
> > > +	 * to ensuere the order theoretically. However, a stronger barrier
> > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > +	 * by the host (vhost). A stronger barrier should work for other
> > > +	 * architectures, but performance loss is expected.
> > > +	 */
> > > +	virtio_mb(false);
> > >  	vq->split.avail_idx_shadow++;
> > >  	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >  						vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> 
> Just making sure: so on this system, how do
> smp_wmb() and wmb() differ? smp_wmb is normally for synchronizing
> with kernel running on another CPU and we are doing something
> unusual in virtio when we use it to synchronize with host
> as opposed to the guest - e.g. CONFIG_SMP is special cased
> because of this:
> 
> #define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)
> 
> Note __smp_wmb not smp_wmb which would be a NOP on UP.

I think that should be fine (as long as the NOP is a barrier() for the
compiler).

wmb() uses a DSB, but that is only really relevant to endpoint ordering
with real I/O devices, e.g. if for some reason you need to guarantee
that a write has made it all the way to the device before proceeding.
Even then you're slightly at the mercy of the memory type and the
interconnect not giving an early acknowledgement, so the extra ordering
is rarely needed in practice and we don't even provide it for our I/O
accessors (e.g. writel() and readl()).

So for virtio between two CPUs using coherent memory, DSB is a red
herring.
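
Reduced to the pattern at issue, as a sketch (ring, idx and mask stand in
for the avail ring, the avail index and the ring size mask):

	/* CPU0: guest, virtqueue_add_split() */
	WRITE_ONCE(ring[i & mask], head);
	smp_wmb();			/* dmb ishst on arm64 */
	WRITE_ONCE(idx, i + 1);

	/* CPU1: vhost worker */
	j = READ_ONCE(idx);
	smp_rmb();			/* pairs with the smp_wmb() above */
	h = READ_ONCE(ring[(j - 1) & mask]);

For coherent memory, if CPU1 observes idx == i + 1 then it must also
observe ring[i & mask] == head. A DSB adds completion semantics for
endpoint I/O on top of that, not extra ordering between the two CPUs.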

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19  4:59     ` Gavin Shan
@ 2024-03-19 18:22       ` Will Deacon
  -1 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2024-03-19 18:22 UTC (permalink / raw)
  To: Gavin Shan
  Cc: virtualization, linux-kernel, mst, jasowang, xuanzhuo, yihyu,
	shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > > 
> > >    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
> > >    -accel kvm -machine virt,gic-version=host                            \
> > >       :                                                                 \
> > >    -netdev tap,id=vnet0,vhost=on                                        \
> > >    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +++++++++---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > -	/* Descriptors and available array need to be set before we expose the
> > > -	 * new available array entries. */
> > > -	virtio_wmb(vq->weak_barriers);
> > > +	/*
> > > +	 * Descriptors and available array need to be set before we expose
> > > +	 * the new available array entries. virtio_wmb() should be enough
> > > +	 * to ensuere the order theoretically. However, a stronger barrier
> > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > +	 * by the host (vhost). A stronger barrier should work for other
> > > +	 * architectures, but performance loss is expected.
> > > +	 */
> > > +	virtio_mb(false);
> > >   	vq->split.avail_idx_shadow++;
> > >   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   						vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
> 
> I tried the possible replacement like below. __smp_mb() can avoid the issue like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> 
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
>     :
>         /* Put entry in available array (but don't update avail->idx until they
>          * do sync). */
>         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>         vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
>         /* Descriptors and available array need to be set before we expose the
>          * new available array entries. */
>         // Broken: virtio_wmb(vq->weak_barriers);
>         // Broken: __dma_mb();
>         // Work:   __mb();
>         // Work:   __smp_mb();

It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
confident are you in that result?

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19 18:22       ` Will Deacon
@ 2024-03-19 23:56         ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-19 23:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: virtualization, linux-kernel, mst, jasowang, xuanzhuo, yihyu,
	shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/20/24 04:22, Will Deacon wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>> On 3/19/24 02:59, Will Deacon wrote:
>>>>    drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>>    1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> -	/* Descriptors and available array need to be set before we expose the
>>>> -	 * new available array entries. */
>>>> -	virtio_wmb(vq->weak_barriers);
>>>> +	/*
>>>> +	 * Descriptors and available array need to be set before we expose
>>>> +	 * the new available array entries. virtio_wmb() should be enough
>>>> +	 * to ensuere the order theoretically. However, a stronger barrier
>>>> +	 * is needed by ARM64. Otherwise, the stale data can be observed
>>>> +	 * by the host (vhost). A stronger barrier should work for other
>>>> +	 * architectures, but performance loss is expected.
>>>> +	 */
>>>> +	virtio_mb(false);
>>>>    	vq->split.avail_idx_shadow++;
>>>>    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>    						vq->split.avail_idx_shadow);
>>>
>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>> here, especially when ordering accesses to coherent memory.
>>>
>>> In practice, either the larger timing different from the DSB or the fact
>>> that you're going from a Store->Store barrier to a full barrier is what
>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>> (e.g. via __smp_mb()).
>>>
>>> We definitely shouldn't take changes like this without a proper
>>> explanation of what is going on.
>>>
>>
>> Thanks for your comments, Will.
>>
>> Yes, DMB should work for us. However, it seems this instruction has issues on
>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>> from hardware level. I agree it's not the solution to replace DMB with DSB
>> before we fully understand the root cause.
>>
>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>
>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>> {
>>      :
>>          /* Put entry in available array (but don't update avail->idx until they
>>           * do sync). */
>>          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>>          /* Descriptors and available array need to be set before we expose the
>>           * new available array entries. */
>>          // Broken: virtio_wmb(vq->weak_barriers);
>>          // Broken: __dma_mb();
>>          // Work:   __mb();
>>          // Work:   __smp_mb();
> 
> It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> confident are you in that result?
> 

Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, and
both __dma_mb() and __smp_mb() work for us. I ran too many tests yesterday
and something may have been mixed up.

Instruction         Times the issue hit in 10 runs
--------------------------------------------------
__smp_wmb()         8
__smp_mb()          0
__dma_wmb()         7
__dma_mb()          0
__mb()              0
__wmb()             0

It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by Michael
in another reply. Will update the result soon.
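
For reference, the variant I plan to test looks roughly like this (a
sketch of Michael's suggestion as I read it, untested so far):

	WRITE_ONCE(vq->split.vring.avail->ring[avail],
		   cpu_to_virtio16(_vq->vdev, head));

	/* Descriptors and available array need to be set before we
	 * expose the new available array entries. */
	virtio_wmb(vq->weak_barriers);

	vq->split.avail_idx_shadow++;
	WRITE_ONCE(vq->split.vring.avail->idx,
		   cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow));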

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-19 23:56         ` Gavin Shan
@ 2024-03-20  0:49           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-20  0:49 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Wed, Mar 20, 2024 at 09:56:58AM +1000, Gavin Shan wrote:
> On 3/20/24 04:22, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > On 3/19/24 02:59, Will Deacon wrote:
> > > > >    drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > >    1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >    	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > -	/* Descriptors and available array need to be set before we expose the
> > > > > -	 * new available array entries. */
> > > > > -	virtio_wmb(vq->weak_barriers);
> > > > > +	/*
> > > > > +	 * Descriptors and available array need to be set before we expose
> > > > > +	 * the new available array entries. virtio_wmb() should be enough
> > > > > +	 * to ensuere the order theoretically. However, a stronger barrier
> > > > > +	 * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +	 * by the host (vhost). A stronger barrier should work for other
> > > > > +	 * architectures, but performance loss is expected.
> > > > > +	 */
> > > > > +	virtio_mb(false);
> > > > >    	vq->split.avail_idx_shadow++;
> > > > >    	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >    						vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing different from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smp_mb()).
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >      :
> > >          /* Put entry in available array (but don't update avail->idx until they
> > >           * do sync). */
> > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > 
> > >          /* Descriptors and available array need to be set before we expose the
> > >           * new available array entries. */
> > >          // Broken: virtio_wmb(vq->weak_barriers);
> > >          // Broken: __dma_mb();
> > >          // Work:   __mb();
> > >          // Work:   __smp_mb();
> > 
> > It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> > confident are you in that result?
> > 
> 
> Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
> that both __dma_mb() and __smp_mb() work for us. I had too many tests yesterday
> and something may have been messed up.
> 
> Instruction         Hitting times in 10 tests
> ---------------------------------------------
> __smp_wmb()         8
> __smp_mb()          0
> __dma_wmb()         7
> __dma_mb()          0
> __mb()              0
> __wmb()             0
> 
> It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
> read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by Michael
> in another reply. Will update the result soon.
> 
> Thanks,
> Gavin


I think you are wasting your time with these tests. Even if it helps, what
does that tell us? Try setting a flag as I suggested elsewhere,
then check it in vhost.
Or here's another idea - possibly easier: copy the high bits from the index
into the ring entry itself. Then vhost can check that the head is
synchronized with the index. A worked example of the encoding follows
the diff.

Warning: completely untested, not even compiled. But it should give you
the idea. If this works, btw, we should consider making it official in
the spec.


diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..79456706d0bd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
 	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
-	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
+	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
 
 	/* Descriptors and available array need to be set before we expose the
 	 * new available array entries. */

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..bd8f7c763caa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 				       __virtio16 *head, int idx)
 {
-	return vhost_get_avail(vq, *head,
+	unsigned i = idx;
+	unsigned flag = i & ~(vq->num - 1);
+	unsigned val = vhost_get_avail(vq, *head,
 			       &vq->avail->ring[idx & (vq->num - 1)]);
+	unsigned valflag = val & ~(vq->num - 1);
+
+	WARN_ON(valflag != flag);
+
+	return val & (vq->num - 1);
 }
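
To make the encoding concrete, a worked example (assuming vq->num == 256,
so the low 8 bits carry the head and the high bits carry the upper bits
of the index):

	avail_idx_shadow = 0x1234, head = 135 (0x87)
	ring entry stored = 0x87 | (0x1234 & ~0xff) = 0x1287

vhost, fetching the entry for idx = 0x1234, expects the high part 0x1200;
a stale entry left over from an earlier lap of the ring would carry a
different high part and trip the WARN_ON().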
 
-- 
MST


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-20  0:49           ` Michael S. Tsirkin
@ 2024-03-20  5:24             ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-20  5:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On 3/20/24 10:49, Michael S. Tsirkin wrote:
> I think you are wasting your time with these tests. Even if it helps, what
> does this tell us? Try setting a flag as I suggested elsewhere.
> Then check it in vhost.
> Or here's another idea - possibly easier. Copy the high bits from the index
> into the ring itself. Then vhost can check that the head is synchronized with
> the index.
> 
> Warning: completely untested, not even compiled. But it should give you
> the idea. If this works, btw, we should consider making this official in
> the spec.
> 
> 
>   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..79456706d0bd 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>   	/* Put entry in available array (but don't update avail->idx until they
>   	 * do sync). */
>   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
>   
>   	/* Descriptors and available array need to be set before we expose the
>   	 * new available array entries. */
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..bd8f7c763caa 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>   				       __virtio16 *head, int idx)
>   {
> -	return vhost_get_avail(vq, *head,
> +	unsigned i = idx;
> +	unsigned flag = i & ~(vq->num - 1);
> +	unsigned val = vhost_get_avail(vq, *head,
>   			       &vq->avail->ring[idx & (vq->num - 1)]);
> +	unsigned valflag = val & ~(vq->num - 1);
> +
> +	WARN_ON(valflag != flag);
> +
> +	return val & (vq->num - 1);
>   }
>   

Thanks, Michael. The code is already self-explanatory. Since vq->num is 256, I just
squeezed the last_avail_idx into the high byte. Unfortunately, I'm unable to hit
the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
byte is still correct and valid?

         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
         vq->split.vring.avail->ring[avail] =
                 cpu_to_virtio16(_vq->vdev, head | (avail << 8));


         head = vhost16_to_cpu(vq, ring_head);
         WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
         head = head & 0xff;

One question: does QEMU have any chance of writing data to the available
queue when vhost is enabled? My previous understanding is no, the queue is
totally owned by vhost instead of QEMU.

Before this patch was posted, I had debugging code to record the last 16
transactions to the available and used queues from the guest and host sides.
It did reveal that the wrong head was fetched from the available queue.

[   11.785745] ================ virtqueue_get_buf_ctx_split ================
[   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
[   11.786655] head to be released: 036 077
[   11.786952]
[   11.786952] avail_idx:
[   11.787234] 000  63985  <--
[   11.787237] 001  63986
[   11.787444] 002  63987
[   11.787632] 003  63988
[   11.787821] 004  63989
[   11.788006] 005  63990
[   11.788194] 006  63991
[   11.788381] 007  63992
[   11.788567] 008  63993
[   11.788772] 009  63994
[   11.788957] 010  63995
[   11.789141] 011  63996
[   11.789327] 012  63997
[   11.789515] 013  63998
[   11.789701] 014  63999
[   11.789886] 015  64000
[   11.790068]
[   11.790068] avail_head:
[   11.790529] 000  075  <--
[   11.790718] 001  036
[   11.790890] 002  077
[   11.791061] 003  129
[   11.791231] 004  072
[   11.791400] 005  130
[   11.791574] 006  015
[   11.791748] 007  074
[   11.791918] 008  130
[   11.792094] 009  130
[   11.792263] 010  074
[   11.792437] 011  015
[   11.792617] 012  072
[   11.792788] 013  129
[   11.792961] 014  077    // The last two heads from guest to host: 077, 036
[   11.793134] 015  036

[root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost

avail_idx
000  63998
001  64000
002  63954  <---
003  63955
004  63956
005  63974
006  63981
007  63984
008  63986
009  63987
010  63988
011  63989
012  63992
013  63993
014  63995
015  63997

avail_head
000  074
001  015
002  072
003  129
004  074            // The last two heads seen by vhost is: 074, 036
005  036
006  075  <---
007  036
008  077
009  129
010  072
011  130
012  015
013  074
014  130
015  130

used_idx
000  64000
001  63882  <---
002  63889
003  63891
004  63898
005  63936
006  63942
007  63946
008  63949
009  63953
010  63957
011  63981
012  63990
013  63992
014  63993
015  63999

used_head
000  072
001  129
002  074          // The last two heads published to guest is: 074, 036
003  036
004  075  <---
005  036
006  077
007  129
008  072
009  130
010  015
011  074
012  130
013  130
014  074
015  015

Thanks,
Gavin





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-20  5:24             ` Gavin Shan
@ 2024-03-20  7:14               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-20  7:14 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > I think you are wasting your time with these tests. Even if it helps, what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from the index
> > into the ring itself. Then vhost can check that the head is synchronized with
> > the index.
> > 
> > Warning: completely untested, not even compiled. But it should give you
> > the idea. If this works, btw, we should consider making this official in
> > the spec.
> > 
> > 
> >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> >   	/* Put entry in available array (but don't update avail->idx until they
> >   	 * do sync). */
> >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> >   	/* Descriptors and available array need to be set before we expose the
> >   	 * new available array entries. */
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> >   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >   				       __virtio16 *head, int idx)
> >   {
> > -	return vhost_get_avail(vq, *head,
> > +	unsigned i = idx;
> > +	unsigned flag = i & ~(vq->num - 1);
> > +	unsigned val = vhost_get_avail(vq, *head,
> >   			       &vq->avail->ring[idx & (vq->num - 1)]);
> > +	unsigned valflag = val & ~(vq->num - 1);
> > +
> > +	WARN_ON(valflag != flag);
> > +
> > +	return val & (vq->num - 1);
> >   }
> 
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx into the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
> byte is still correct and valid?


I would find this very surprising.

>         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>         vq->split.vring.avail->ring[avail] =
>                 cpu_to_virtio16(_vq->vdev, head | (avail << 8));
> 
> 
>         head = vhost16_to_cpu(vq, ring_head);
>         WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
>         head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at the beginning of the ring you always
store 0 as the high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go a full circle around the ring.
Thus you want exactly the *high byte* of the avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.
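
To spell out the difference with concrete numbers, here is a small
stand-alone illustration (mine, not from either patch; it assumes
num = 256 and picks avail idx 63985 from your trace below):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint16_t num = 256, idx = 63985;
	uint16_t slot = idx & (num - 1);	/* 241: same slot on every lap */

	/* slot-derived tag (your patch): identical on every trip around
	 * the ring, so a stale entry looks valid */
	uint16_t slot_tag0 = slot << 8;
	uint16_t slot_tag1 = (uint16_t)(((idx + num) & (num - 1)) << 8);

	/* index-derived tag (my patch): changes each time the ring wraps,
	 * so a stale entry is caught */
	uint16_t idx_tag0 = idx & ~(num - 1);
	uint16_t idx_tag1 = (uint16_t)(idx + num) & ~(num - 1);

	printf("slot tag: %#x vs %#x (equal, stale entry undetectable)\n",
	       slot_tag0, slot_tag1);
	printf("idx tag:  %#x vs %#x (differ, stale entry detectable)\n",
	       idx_tag0, idx_tag1);
	return 0;
}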


The advantage of this debugging patch is that it will detect the issue
immediately, not only after the guest has detected the problem in the used
ring. For example, you can then add code to re-read the value, or dump the
whole ring.

> One question: does QEMU have any chance of writing data to the available
> queue when vhost is enabled? My previous understanding is no, the queue is
> totally owned by vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record the last 16
> transactions to the available and used queues from the guest and host sides.
> It did reveal that the wrong head was fetched from the available queue.

Oh nice, that's a very good hint. And is this still reproducible?

> [   11.785745] ================ virtqueue_get_buf_ctx_split ================
> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [   11.786655] head to be released: 036 077
> [   11.786952]
> [   11.786952] avail_idx:
> [   11.787234] 000  63985  <--
> [   11.787237] 001  63986
> [   11.787444] 002  63987
> [   11.787632] 003  63988
> [   11.787821] 004  63989
> [   11.788006] 005  63990
> [   11.788194] 006  63991
> [   11.788381] 007  63992
> [   11.788567] 008  63993
> [   11.788772] 009  63994
> [   11.788957] 010  63995
> [   11.789141] 011  63996
> [   11.789327] 012  63997
> [   11.789515] 013  63998
> [   11.789701] 014  63999
> [   11.789886] 015  64000
> [   11.790068]
> [   11.790068] avail_head:
> [   11.790529] 000  075  <--
> [   11.790718] 001  036
> [   11.790890] 002  077
> [   11.791061] 003  129
> [   11.791231] 004  072
> [   11.791400] 005  130
> [   11.791574] 006  015
> [   11.791748] 007  074
> [   11.791918] 008  130
> [   11.792094] 009  130
> [   11.792263] 010  074
> [   11.792437] 011  015
> [   11.792617] 012  072
> [   11.792788] 013  129
> [   11.792961] 014  077    // The last two heads from guest to host: 077, 036
> [   11.793134] 015  036

Maybe dump the avail ring from the guest to make sure
it matches the expected contents?

> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
> 
> avail_idx
> 000  63998
> 001  64000
> 002  63954  <---
> 003  63955
> 004  63956
> 005  63974
> 006  63981
> 007  63984
> 008  63986
> 009  63987
> 010  63988
> 011  63989
> 012  63992
> 013  63993
> 014  63995
> 015  63997
> 
> avail_head
> 000  074
> 001  015
> 002  072
> 003  129
> 004  074            // The last two heads seen by vhost is: 074, 036
> 005  036
> 006  075  <---


And is 074 the previous (stale) value in the ring?


> 007  036
> 008  077
> 009  129
> 010  072
> 011  130
> 012  015
> 013  074
> 014  130
> 015  130



> used_idx
> 000  64000
> 001  63882  <---
> 002  63889
> 003  63891
> 004  63898
> 005  63936
> 006  63942
> 007  63946
> 008  63949
> 009  63953
> 010  63957
> 011  63981
> 012  63990
> 013  63992
> 014  63993
> 015  63999
> 
> used_head
> 000  072
> 001  129
> 002  074          // The last two heads published to guest is: 074, 036
> 003  036
> 004  075  <---
> 005  036
> 006  077
> 007  129
> 008  072
> 009  130
> 010  015
> 011  074
> 012  130
> 013  130
> 014  074
> 015  015
> 
> Thanks,
> Gavin

I like this debugging patch; it might make sense to
polish it up and include it in production. We'll want it in
debugfs, naturally, not /proc/vhost.

But all in good time.


> 
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-20  5:24             ` Gavin Shan
@ 2024-03-20 17:15               ` Keir Fraser
  -1 siblings, 0 replies; 78+ messages in thread
From: Keir Fraser @ 2024-03-20 17:15 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Michael S. Tsirkin, Will Deacon, virtualization, linux-kernel,
	jasowang, xuanzhuo, yihyu, shan.gavin, linux-arm-kernel,
	Catalin Marinas, mochs

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> 
> Before this patch was posted, I had debugging code to record the last 16
> transactions to the available and used queues from the guest and host sides.
> It did reveal that the wrong head was fetched from the available queue.
> 
> [   11.785745] ================ virtqueue_get_buf_ctx_split ================
> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [   11.786655] head to be released: 036 077
> [   11.786952]
> [   11.786952] avail_idx:
> [   11.787234] 000  63985  <--
> [   11.787237] 001  63986
> [   11.787444] 002  63987
> [   11.787632] 003  63988
> [   11.787821] 004  63989
> [   11.788006] 005  63990
> [   11.788194] 006  63991
> [   11.788381] 007  63992
> [   11.788567] 008  63993
> [   11.788772] 009  63994
> [   11.788957] 010  63995
> [   11.789141] 011  63996
> [   11.789327] 012  63997
> [   11.789515] 013  63998
> [   11.789701] 014  63999
> [   11.789886] 015  64000

Does the error always occur at such a round idx value?

Here, 64000 == 0xFA00. Maybe a coincidence, but it's improbable enough to be interesting.

This debug code seems rather useful!

 -- Keir



> [   11.790068]
> [   11.790068] avail_head:
> [   11.790529] 000  075  <--
> [   11.790718] 001  036
> [   11.790890] 002  077
> [   11.791061] 003  129
> [   11.791231] 004  072
> [   11.791400] 005  130
> [   11.791574] 006  015
> [   11.791748] 007  074
> [   11.791918] 008  130
> [   11.792094] 009  130
> [   11.792263] 010  074
> [   11.792437] 011  015
> [   11.792617] 012  072
> [   11.792788] 013  129
> [   11.792961] 014  077    // The last two heads from guest to host: 077, 036
> [   11.793134] 015  036
> 
> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
> 
> avail_idx
> 000  63998
> 001  64000
> 002  63954  <---
> 003  63955
> 004  63956
> 005  63974
> 006  63981
> 007  63984
> 008  63986
> 009  63987
> 010  63988
> 011  63989
> 012  63992
> 013  63993
> 014  63995
> 015  63997
> 
> avail_head
> 000  074
> 001  015
> 002  072
> 003  129
> 004  074            // The last two heads seen by vhost is: 074, 036
> 005  036
> 006  075  <---
> 007  036
> 008  077
> 009  129
> 010  072
> 011  130
> 012  015
> 013  074
> 014  130
> 015  130
> 
> used_idx
> 000  64000
> 001  63882  <---
> 002  63889
> 003  63891
> 004  63898
> 005  63936
> 006  63942
> 007  63946
> 008  63949
> 009  63953
> 010  63957
> 011  63981
> 012  63990
> 013  63992
> 014  63993
> 015  63999
> 
> used_head
> 000  072
> 001  129
> 002  074          // The last two heads published to guest is: 074, 036
> 003  036
> 004  075  <---
> 005  036
> 006  077
> 007  129
> 008  072
> 009  130
> 010  015
> 011  074
> 012  130
> 013  130
> 014  074
> 015  015
> 
> Thanks,
> Gavin
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-20 17:15               ` Keir Fraser
@ 2024-03-21 12:06                 ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-21 12:06 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Michael S. Tsirkin, Will Deacon, virtualization, linux-kernel,
	jasowang, xuanzhuo, yihyu, shan.gavin, linux-arm-kernel,
	Catalin Marinas, mochs

On 3/21/24 03:15, Keir Fraser wrote:
> On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
>>
>> Before this patch was posted, I had debugging code to record the last 16
>> transactions to the available and used queues from the guest and host sides.
>> It did reveal that the wrong head was fetched from the available queue.
>>
>> [   11.785745] ================ virtqueue_get_buf_ctx_split ================
>> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
>> [   11.786655] head to be released: 036 077
>> [   11.786952]
>> [   11.786952] avail_idx:
>> [   11.787234] 000  63985  <--
>> [   11.787237] 001  63986
>> [   11.787444] 002  63987
>> [   11.787632] 003  63988
>> [   11.787821] 004  63989
>> [   11.788006] 005  63990
>> [   11.788194] 006  63991
>> [   11.788381] 007  63992
>> [   11.788567] 008  63993
>> [   11.788772] 009  63994
>> [   11.788957] 010  63995
>> [   11.789141] 011  63996
>> [   11.789327] 012  63997
>> [   11.789515] 013  63998
>> [   11.789701] 014  63999
>> [   11.789886] 015  64000
> 
> Does the error always occur at such a round idx value?
> 
> Here, 64000 == 0xFA00. Maybe a coincidence, but it's improbable enough to be interesting.
> 
> This debug code seems rather useful!
> 

Keir, nope, it's just a coincidence. We don't see any such pattern.

Thanks,
Gavin

> 
> 
>> [   11.790068]
>> [   11.790068] avail_head:
>> [   11.790529] 000  075  <--
>> [   11.790718] 001  036
>> [   11.790890] 002  077
>> [   11.791061] 003  129
>> [   11.791231] 004  072
>> [   11.791400] 005  130
>> [   11.791574] 006  015
>> [   11.791748] 007  074
>> [   11.791918] 008  130
>> [   11.792094] 009  130
>> [   11.792263] 010  074
>> [   11.792437] 011  015
>> [   11.792617] 012  072
>> [   11.792788] 013  129
>> [   11.792961] 014  077    // The last two heads from guest to host: 077, 036
>> [   11.793134] 015  036
>>
>> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
>>
>> avail_idx
>> 000  63998
>> 001  64000
>> 002  63954  <---
>> 003  63955
>> 004  63956
>> 005  63974
>> 006  63981
>> 007  63984
>> 008  63986
>> 009  63987
>> 010  63988
>> 011  63989
>> 012  63992
>> 013  63993
>> 014  63995
>> 015  63997
>>
>> avail_head
>> 000  074
>> 001  015
>> 002  072
>> 003  129
>> 004  074            // The last two heads seen by vhost is: 074, 036
>> 005  036
>> 006  075  <---
>> 007  036
>> 008  077
>> 009  129
>> 010  072
>> 011  130
>> 012  015
>> 013  074
>> 014  130
>> 015  130
>>
>> used_idx
>> 000  64000
>> 001  63882  <---
>> 002  63889
>> 003  63891
>> 004  63898
>> 005  63936
>> 006  63942
>> 007  63946
>> 008  63949
>> 009  63953
>> 010  63957
>> 011  63981
>> 012  63990
>> 013  63992
>> 014  63993
>> 015  63999
>>
>> used_head
>> 000  072
>> 001  129
>> 002  074          // The last two heads published to guest is: 074, 036
>> 003  036
>> 004  075  <---
>> 005  036
>> 006  077
>> 007  129
>> 008  072
>> 009  130
>> 010  015
>> 011  074
>> 012  130
>> 013  130
>> 014  074
>> 015  015
>>
>> Thanks,
>> Gavin
>>
>>
>>
>>
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-20  7:14               ` Michael S. Tsirkin
@ 2024-03-25  7:34                 ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-25  7:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs


On 3/20/24 17:14, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
>> On 3/20/24 10:49, Michael S. Tsirkin wrote:
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 6f7e5010a673..79456706d0bd 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>    	/* Put entry in available array (but don't update avail->idx until they
>>>    	 * do sync). */
>>>    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>> -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
>>> +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
>>>    	/* Descriptors and available array need to be set before we expose the
>>>    	 * new available array entries. */
>>>

Ok, Michael. I continued with my debugging code. It still looks like a
hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
involved in the discussion, as you suggested.

First, I bound the vhost process and the vCPU thread to CPU#71 and CPU#70.
Note that I have only one vCPU in my configuration.

Second, the debugging code is enhanced so that the available head for
(last_avail_idx - 1) is read twice and recorded. It means the available
head for one specific available index is read twice. I do see that the
available heads differ between the two consecutive reads. More details
are shared below.
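
The double read is roughly like below (a simplified sketch rather than the
exact debugging patch; it reuses the existing vhost_get_avail() accessor
and ignores its return value for brevity):

	__virtio16 h1, h2;
	unsigned int slot = (vq->last_avail_idx - 1) & (vq->num - 1);

	/* read the same avail ring slot twice in a row */
	vhost_get_avail(vq, h1, &vq->avail->ring[slot]);
	vhost_get_avail(vq, h2, &vq->avail->ring[slot]);

	/* warn when the two consecutive reads disagree */
	if (vhost16_to_cpu(vq, h1) != vhost16_to_cpu(vq, h2))
		pr_warn("vhost_get_vq_desc: Inconsistent head in two read (%u -> %u) for avail_idx %u\n",
			vhost16_to_cpu(vq, h1), vhost16_to_cpu(vq, h2),
			vq->last_avail_idx - 1);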

From the guest side
===================

virtio_net virtio0: output.0:id 86 is not a head!
head to be released: 047 062 112

avail_idx:
000  49665
001  49666  <--
  :
015  49664

avail_head:
000  062
001  047  <--
  :
015  112

From the host side
==================

avail_idx
000  49663
001  49666  <---
  :

avail_head
000  062  (062)
001  047  (047)  <---
  :
015  086  (112)          // head 086 is returned from the first read,
                          // but head 112 is returned from the second read

vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664

Thanks,
Gavin




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-25  7:34                 ` Gavin Shan
@ 2024-03-26  7:49                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-26  7:49 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Will Deacon, virtualization, linux-kernel, jasowang, xuanzhuo,
	yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas, mochs

On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> 
> On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 6f7e5010a673..79456706d0bd 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > >    	/* Put entry in available array (but don't update avail->idx until they
> > > >    	 * do sync). */
> > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > >    	/* Descriptors and available array need to be set before we expose the
> > > >    	 * new available array entries. */
> > > > 
> 
> Ok, Michael. I continued with my debugging code. It still looks like a
> hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> involved in the discussion, as you suggested.

Do you have a support contact at Nvidia to report this?

> First, I bound the vhost process and the vCPU thread to CPU#71 and CPU#70.
> Note that I have only one vCPU in my configuration.

Interesting, but is the guest built with CONFIG_SMP set?

> Second, the debugging code is enhanced so that the available head for
> (last_avail_idx - 1) is read twice and recorded. It means the available
> head for one specific available index is read twice. I do see that the
> available heads differ between the two consecutive reads. More details
> are shared below.
> 
> From the guest side
> ===================
> 
> virtio_net virtio0: output.0:id 86 is not a head!
> head to be released: 047 062 112
> 
> avail_idx:
> 000  49665
> 001  49666  <--
>  :
> 015  49664

What are these #s, 49665 and so on,
and how large is the ring?
I am guessing 49664 is the index, the ring size is 16, and
49664 % 16 == 0.

> avail_head:


Is this the avail ring contents?

> 000  062
> 001  047  <--
>  :
> 015  112


What are these arrows pointing at, btw?


> From the host side
> ==================
> 
> avail_idx
> 000  49663
> 001  49666  <---
>  :
> 
> avail_head
> 000  062  (062)
> 001  047  (047)  <---
>  :
> 015  086  (112)          // head 086 is returned from the first read,
>                          // but head 112 is returned from the second read
> 
> vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
> 
> Thanks,
> Gavin

OK, thanks. So this proves it is actually the avail ring value.

-- 
MST


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26  7:49                   ` Michael S. Tsirkin
@ 2024-03-26  9:38                     ` Keir Fraser
  -1 siblings, 0 replies; 78+ messages in thread
From: Keir Fraser @ 2024-03-26  9:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gavin Shan, Will Deacon, virtualization, linux-kernel, jasowang,
	xuanzhuo, yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas,
	mochs

On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> > 
> > On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > > On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 6f7e5010a673..79456706d0bd 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > >    	/* Put entry in available array (but don't update avail->idx until they
> > > > >    	 * do sync). */
> > > > >    	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > +	u16 headwithflag = head | (q->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > > >    	/* Descriptors and available array need to be set before we expose the
> > > > >    	 * new available array entries. */
> > > > > 
> > 
> > Ok, Michael. I continued with my debugging code. It still looks like a
> > hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> > involved in the discussion, as you suggested.
> 
> Do you have a support contact at Nvidia to report this?
> 
> > Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> > Note that I have only one vCPU in my configuration.
> 
> Interesting, but is the guest built with CONFIG_SMP set?

arm64 is always built with CONFIG_SMP.

> > Secondly, the debugging code is enhanced so that the available head for
> > (last_avail_idx - 1) is read twice and recorded. It means the available
> > head for one specific available index is read twice. I do see that the
> > available heads differ between the two consecutive reads. More details
> > are shared below.
> > 
> > From the guest side
> > ===================
> > 
> > virtio_net virtio0: output.0:id 86 is not a head!
> > head to be released: 047 062 112
> > 
> > avail_idx:
> > 000  49665
> > 001  49666  <--
> >  :
> > 015  49664
> 
> what are these #s, 49665 and so on?
> and how large is the ring?
> I am guessing 49664 is the index, the ring size is 16, and
> 49664 % 16 == 0.

More than that, 49664 % 256 == 0

So again there seems to be an error in the vicinity of roll-over of
the idx low byte, as I observed in the earlier log. Surely this is
more than coincidence?
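
(For concreteness: 49664 = 0xC200 = 194 * 256, so the low byte of the
16-bit index is exactly zero at the failing index.)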

 -- Keir

> > avail_head:
> 
> 
> is this the avail ring contents?
> 
> > 000  062
> > 001  047  <--
> >  :
> > 015  112
> 
> 
> What are these arrows pointing at, btw?
> 
> 
> > From the host side
> > ==================
> > 
> > avail_idx
> > 000  49663
> > 001  49666  <---
> >  :
> > 
> > avail_head
> > 000  062  (062)
> > 001  047  (047)  <---
> >  :
> > 015  086  (112)          // head 086 is returned from the first read,
> >                          // but head 112 is returned from the second read
> > 
> > vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
> > 
> > Thanks,
> > Gavin
> 
> OK thanks so this proves it is actually the avail ring value.
> 
> -- 
> MST
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26  9:38                     ` Keir Fraser
@ 2024-03-26 11:43                       ` Will Deacon
  -1 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2024-03-26 11:43 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Michael S. Tsirkin, Gavin Shan, virtualization, linux-kernel,
	jasowang, xuanzhuo, yihyu, shan.gavin, linux-arm-kernel,
	Catalin Marinas, mochs

On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > Secondly, the debugging code is enhanced so that the available head for
> > > (last_avail_idx - 1) is read twice and recorded. It means the available
> > > head for one specific available index is read twice. I do see that the
> > > available heads differ between the two consecutive reads. More details
> > > are shared below.
> > > 
> > > From the guest side
> > > ===================
> > > 
> > > virtio_net virtio0: output.0:id 86 is not a head!
> > > head to be released: 047 062 112
> > > 
> > > avail_idx:
> > > 000  49665
> > > 001  49666  <--
> > >  :
> > > 015  49664
> > 
> > what are these #s, 49665 and so on?
> > and how large is the ring?
> > I am guessing 49664 is the index, the ring size is 16, and
> > 49664 % 16 == 0.
> 
> More than that, 49664 % 256 == 0
> 
> So again there seems to be an error in the vicinity of roll-over of
> the idx low byte, as I observed in the earlier log. Surely this is
> more than coincidence?

Yeah, I'd still really like to see the disassembly for both sides of the
protocol here. Gavin, is that something you're able to provide? Worst
case, the host and guest vmlinux objects would be a starting point.

Personally, I'd be fairly surprised if this was a hardware issue.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26 11:43                       ` Will Deacon
@ 2024-03-26 15:46                         ` Will Deacon
  -1 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2024-03-26 15:46 UTC (permalink / raw)
  To: Keir Fraser, gshan
  Cc: Michael S. Tsirkin, virtualization, linux-kernel, jasowang,
	xuanzhuo, yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas,
	mochs

On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > Secondly, the debugging code is enhanced so that the available head for
> > > > (last_avail_idx - 1) is read twice and recorded. It means the available
> > > > head for one specific available index is read twice. I do see that the
> > > > available heads differ between the two consecutive reads. More details
> > > > are shared below.
> > > > 
> > > > From the guest side
> > > > ===================
> > > > 
> > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > head to be released: 047 062 112
> > > > 
> > > > avail_idx:
> > > > 000  49665
> > > > 001  49666  <--
> > > >  :
> > > > 015  49664
> > > 
> > > what are these #s, 49665 and so on?
> > > and how large is the ring?
> > > I am guessing 49664 is the index, the ring size is 16, and
> > > 49664 % 16 == 0.
> > 
> > More than that, 49664 % 256 == 0
> > 
> > So again there seems to be an error in the vicinity of roll-over of
> > the idx low byte, as I observed in the earlier log. Surely this is
> > more than coincidence?
> 
> Yeah, I'd still really like to see the disassembly for both sides of the
> protocol here. Gavin, is that something you're able to provide? Worst
> case, the host and guest vmlinux objects would be a starting point.
> 
> Personally, I'd be fairly surprised if this was a hardware issue.

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

Will

--->8

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
                return false;
        vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
+       smp_rmb();
        return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
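
To make the suspected race concrete, a rough sketch of the interleaving
this barrier addresses (an illustration inferred from the discussion,
not code from the thread):

	guest: virtqueue_add_split()      host: vhost worker, polling
	------------------------------    --------------------------------
	ring[idx & mask] = head;          reads avail->idx, sees new value,
	virtio_wmb();                     caches it in vq->avail_idx;
	avail->idx = ++shadow;            reads ring[idx & mask] with no
	                                  smp_rmb() in between -> may still
	                                  see the stale entry.

Nothing on the host side orders the load of the ring entry after the
load of the index, so the host can observe the incremented index yet a
stale ring[] entry, which matches the inconsistent heads in the logs
above.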


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26 15:46                         ` Will Deacon
@ 2024-03-26 23:14                           ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-26 23:14 UTC (permalink / raw)
  To: Will Deacon, Keir Fraser
  Cc: Michael S. Tsirkin, virtualization, linux-kernel, jasowang,
	xuanzhuo, yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas,
	mochs


On 3/27/24 01:46, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
> 
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>                  return false;
>          vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>   
> +       smp_rmb();
>          return vq->avail_idx == vq->last_avail_idx;
>   }
>   EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> 

Thanks, Will. I had already noticed that smp_rmb() was missing in vhost_vq_avail_empty().
The issue still exists after smp_rmb() is added there. However, inspired by your
suggestion, I rechecked the code. It seems another smp_rmb() is missing
in vhost_enable_notify().

With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable
to hit the issue. I will run the test more times to make sure the issue is really
resolved. After that, I will post formal patches for review.

Thanks,
Gavin
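
For reference, a rough sketch of what the analogous ordering in
vhost_enable_notify() would look like, assuming the same pattern as in
vhost_vq_avail_empty() above (the authoritative version is in the
formal patches mentioned here):

	/* After re-reading avail->idx once notifications are enabled: */
	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
	if (vq->avail_idx != vq->last_avail_idx) {
		/* Order the index read before subsequent reads of the
		 * ring entries the guest published before bumping it. */
		smp_rmb();
		return true;
	}

	return false;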


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26 23:14                           ` Gavin Shan
@ 2024-03-27  0:01                             ` Gavin Shan
  -1 siblings, 0 replies; 78+ messages in thread
From: Gavin Shan @ 2024-03-27  0:01 UTC (permalink / raw)
  To: Will Deacon, Keir Fraser
  Cc: Michael S. Tsirkin, virtualization, linux-kernel, jasowang,
	xuanzhuo, yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas,
	mochs


On 3/27/24 09:14, Gavin Shan wrote:
> On 3/27/24 01:46, Will Deacon wrote:
>> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
>>
>> Ok, long shot after eyeballing the vhost code, but does the diff below
>> help at all? It looks like vhost_vq_avail_empty() can advance the value
>> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
>> vhost_get_vq_desc() in polling mode.
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 045f666b4f12..87bff710331a 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>                  return false;
>>          vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>> +       smp_rmb();
>>          return vq->avail_idx == vq->last_avail_idx;
>>   }
>>   EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>>
> 
> Thanks, Will. I had already noticed that smp_rmb() was missing in vhost_vq_avail_empty().
> The issue still exists after smp_rmb() is added there. However, inspired by your
> suggestion, I rechecked the code. It seems another smp_rmb() is missing
> in vhost_enable_notify().
> 
> With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable
> to hit the issue. I will run the test more times to make sure the issue is really
> resolved. After that, I will post formal patches for review.
> 

Thanks again, Will. The formal patches have been sent for review.

https://lkml.org/lkml/2024/3/27/40

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] virtio_ring: Fix the stale index in available ring
  2024-03-26 15:46                         ` Will Deacon
@ 2024-03-27 11:56                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 78+ messages in thread
From: Michael S. Tsirkin @ 2024-03-27 11:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Keir Fraser, gshan, virtualization, linux-kernel, jasowang,
	xuanzhuo, yihyu, shan.gavin, linux-arm-kernel, Catalin Marinas,
	mochs

On Tue, Mar 26, 2024 at 03:46:29PM +0000, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available head for
> > > > > (last_avail_idx - 1) is read twice and recorded. It means the available
> > > > > head for one specific available index is read twice. I do see that the
> > > > > available heads differ between the two consecutive reads. More details
> > > > > are shared below.
> > > > > 
> > > > > From the guest side
> > > > > ===================
> > > > > 
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > > 
> > > > > avail_idx:
> > > > > 000  49665
> > > > > 001  49666  <--
> > > > >  :
> > > > > 015  49664
> > > > 
> > > > what are these #s, 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index, the ring size is 16, and
> > > > 49664 % 16 == 0.
> > > 
> > > More than that, 49664 % 256 == 0
> > > 
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> > 
> > Yeah, I'd still really like to see the disassembly for both sides of the
> > protocol here. Gavin, is that something you're able to provide? Worst
> > case, the host and guest vmlinux objects would be a starting point.
> > 
> > Personally, I'd be fairly surprised if this was a hardware issue.
> 
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
> 
> Will
> 
> --->8
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>                 return false;
>         vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
> +       smp_rmb();
>         return vq->avail_idx == vq->last_avail_idx;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow, you are right.

We have:

        if (vq->avail_idx == vq->last_avail_idx) {
                if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
                        vq_err(vq, "Failed to access avail idx at %p\n",
                                &vq->avail->idx);
                        return -EFAULT;
                }
                vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

                if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
                        vq_err(vq, "Guest moved used index from %u to %u",
                                last_avail_idx, vq->avail_idx);
                        return -EFAULT;
                }

                /* If there's nothing new since last we looked, return
                 * invalid.
                 */
                if (vq->avail_idx == last_avail_idx)
                        return vq->num;

                /* Only get avail ring entries after they have been
                 * exposed by guest.
                 */
                smp_rmb();
        }


and so the rmb only happens on the path where the cached avail_idx had
not already been advanced.

Actually there is a bunch of code duplication where we assign to
avail_idx, too.
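
As a rough sketch, one way the duplicated pattern could be consolidated
(vhost_refresh_avail_idx() is a hypothetical helper name;
vhost_get_avail_idx() is the accessor already used above):

	static int vhost_refresh_avail_idx(struct vhost_virtqueue *vq)
	{
		__virtio16 avail_idx;

		if (unlikely(vhost_get_avail_idx(vq, &avail_idx)))
			return -EFAULT;
		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

		if (vq->avail_idx != vq->last_avail_idx) {
			/* Order ring reads after the index read. */
			smp_rmb();
		}
		return 0;
	}

Callers would then compare vq->avail_idx with vq->last_avail_idx
instead of open-coding the read, the conversion and the barrier in
each place.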

Will, thanks a lot for looking into this! I kept looking at
the virtio side for some reason; the fact that it did not
trigger with QEMU should have been a big hint!


-- 
MST


^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2024-03-27 11:57 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-14  7:49 [PATCH] virtio_ring: Fix the stale index in available ring Gavin Shan
2024-03-14  8:05 ` Michael S. Tsirkin
2024-03-14 10:15   ` Gavin Shan
2024-03-14 11:50     ` Michael S. Tsirkin
2024-03-14 12:50       ` Gavin Shan
2024-03-14 12:59         ` Michael S. Tsirkin
2024-03-15 10:45           ` Gavin Shan
2024-03-15 11:05             ` Michael S. Tsirkin
2024-03-15 11:24               ` Gavin Shan
2024-03-17 16:50                 ` Michael S. Tsirkin
2024-03-17 23:41                   ` Gavin Shan
2024-03-18  7:50                     ` Michael S. Tsirkin
2024-03-18 16:59 ` Will Deacon
2024-03-19  4:59   ` Gavin Shan
2024-03-19  6:09     ` Michael S. Tsirkin
2024-03-19  6:10       ` Michael S. Tsirkin
2024-03-19  6:54         ` Gavin Shan
2024-03-19  7:04           ` Michael S. Tsirkin
2024-03-19  7:41             ` Gavin Shan
2024-03-19  8:28           ` Michael S. Tsirkin
2024-03-19  6:38       ` Gavin Shan
2024-03-19  6:43         ` Michael S. Tsirkin
2024-03-19  6:49           ` Gavin Shan
2024-03-19  7:09             ` Michael S. Tsirkin
2024-03-19  8:08               ` Gavin Shan
2024-03-19  8:49                 ` Michael S. Tsirkin
2024-03-19 18:22     ` Will Deacon
2024-03-19 23:56       ` Gavin Shan
2024-03-20  0:49         ` Michael S. Tsirkin
2024-03-20  5:24           ` Gavin Shan
2024-03-20  7:14             ` Michael S. Tsirkin
2024-03-25  7:34               ` Gavin Shan
2024-03-26  7:49                 ` Michael S. Tsirkin
2024-03-26  9:38                   ` Keir Fraser
2024-03-26 11:43                     ` Will Deacon
2024-03-26 15:46                       ` Will Deacon
2024-03-26 23:14                         ` Gavin Shan
2024-03-27  0:01                           ` Gavin Shan
2024-03-27 11:56                         ` Michael S. Tsirkin
2024-03-20 17:15             ` Keir Fraser
2024-03-21 12:06               ` Gavin Shan
2024-03-19  7:36   ` Michael S. Tsirkin
2024-03-19 18:21     ` Will Deacon
2024-03-19  6:14 ` Michael S. Tsirkin
