* [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
@ 2021-06-04  7:34 Joyce Kong
  2021-06-04 16:12 ` Honnappa Nagarahalli
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Joyce Kong @ 2021-06-04  7:34 UTC (permalink / raw)
  To: beilei.xing, qi.z.zhang, ruifeng.wang, honnappa.nagarahalli; +Cc: dev, nd

Count only the DD bits that are set on contiguous descriptors, so
that the SMP read barrier is no longer needed while reading the
descriptors.

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decec..410a81f30 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint16_t pkt_len;
 	uint64_t qword1;
 	uint32_t rx_status;
-	int32_t s[I40E_LOOK_AHEAD], nb_dd;
+	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
 	int32_t i, j, nb_rx = 0;
 	uint64_t pkt_flags;
 	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
@@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 					I40E_RXD_QW1_STATUS_SHIFT;
 		}
 
-		rte_smp_rmb();
-
 		/* Compute how many status bits were set */
-		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
-			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
+			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+			if (var)
+				nb_dd += 1;
+			else
+				break;
+		}
 
 		nb_rx += nb_dd;
 
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-04  7:34 [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func Joyce Kong
@ 2021-06-04 16:12 ` Honnappa Nagarahalli
  2021-06-06 14:17 ` Zhang, Qi Z
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-04 16:12 UTC (permalink / raw)
  To: Joyce Kong, beilei.xing, qi.z.zhang, Ruifeng Wang
  Cc: dev, nd, Honnappa Nagarahalli, nd

<snip>
> 
> Add the logic to determine how many DD bits have been set for contiguous
> packets, for removing the SMP barrier while reading descs.
Are there any performance numbers with this change?

> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 6c58decec..410a81f30 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  	uint16_t pkt_len;
>  	uint64_t qword1;
>  	uint32_t rx_status;
> -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
>  	int32_t i, j, nb_rx = 0;
>  	uint64_t pkt_flags;
>  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  					I40E_RXD_QW1_STATUS_SHIFT;
>  		}
> 
> -		rte_smp_rmb();
> -
>  		/* Compute how many status bits were set */
> -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +			if (var)
> +				nb_dd += 1;
> +			else
> +				break;
> +		}
> 
>  		nb_rx += nb_dd;
> 
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-04  7:34 [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func Joyce Kong
  2021-06-04 16:12 ` Honnappa Nagarahalli
@ 2021-06-06 14:17 ` Zhang, Qi Z
  2021-06-06 18:33   ` Honnappa Nagarahalli
  2021-06-23  8:43 ` [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
  2021-07-06  6:54 ` [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring Joyce Kong
  3 siblings, 1 reply; 22+ messages in thread
From: Zhang, Qi Z @ 2021-06-06 14:17 UTC (permalink / raw)
  To: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli; +Cc: dev, nd



> -----Original Message-----
> From: Joyce Kong <joyce.kong@arm.com>
> Sent: Friday, June 4, 2021 3:34 PM
> To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com
> Cc: dev@dpdk.org; nd@arm.com
> Subject: [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
> 
> Add the logic to determine how many DD bits have been set for contiguous
> packets, for removing the SMP barrier while reading descs.

I don't understand this.
The current logic already guarantees that the DD bits read out belong to contiguous packets, because it reads the Rx descriptors from the ring in reverse order.
So I don't see what new logic is being added. Could you describe the purpose of this patch more clearly?

> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 6c58decec..410a81f30 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  	uint16_t pkt_len;
>  	uint64_t qword1;
>  	uint32_t rx_status;
> -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
>  	int32_t i, j, nb_rx = 0;
>  	uint64_t pkt_flags;
>  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  					I40E_RXD_QW1_STATUS_SHIFT;
>  		}
> 
> -		rte_smp_rmb();

Is there any performance gain from removing this? Also, it does not need to be combined with the change below, right?
 
> -
>  		/* Compute how many status bits were set */
> -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +			if (var)
> +				nb_dd += 1;
> +			else
> +				break;
> +		}
> 
>  		nb_rx += nb_dd;
> 
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-06 14:17 ` Zhang, Qi Z
@ 2021-06-06 18:33   ` Honnappa Nagarahalli
  2021-06-07 14:55     ` Zhang, Qi Z
  0 siblings, 1 reply; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-06 18:33 UTC (permalink / raw)
  To: Zhang, Qi Z, Joyce Kong, Xing, Beilei, Ruifeng Wang
  Cc: dev, nd, Honnappa Nagarahalli, nd

<snip>

> >
> > Add the logic to determine how many DD bits have been set for
> > contiguous packets, for removing the SMP barrier while reading descs.
> 
> I didn't understand this.
> The current logic already guarantee the read out DD bits are from continue
> packets, as it read Rx descriptor in a reversed order from the ring.
Qi, the comments in the code mention that there is a race condition if the descriptors are not read in reverse order, but they do not say what the race condition is or how it can occur. I would appreciate it if you could explain that.

On x86, the reads are not re-ordered by the CPU (though the compiler can re-order them). On Arm, the reads can be re-ordered, hence the barriers are required. To avoid the barriers, we are trying to process only those descriptors whose DD bits are set contiguously, i.e. if the DD bits are 1011, we process only the first descriptor.
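
As a minimal sketch of the contiguous counting described above (not the driver code itself): the constants LOOK_AHEAD and DD_SHIFT and the function name count_contiguous_dd are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the driver defines its own constants. */
#define LOOK_AHEAD 8
#define DD_SHIFT 0

/*
 * Count the DD bits that are set contiguously, starting at index 0.
 * The loop stops at the first clear bit, so 1 0 1 1 yields 1.
 */
static int count_contiguous_dd(const uint32_t s[], int n)
{
	int nb_dd = 0;

	assert(n >= 0 && n <= LOOK_AHEAD);
	for (int j = 0; j < n; j++) {
		if (!(s[j] & (1u << DD_SHIFT)))
			break;
		nb_dd++;
	}
	return nb_dd;
}
```

With the status words {1, 0, 1, 1} this returns 1, whereas summing every DD bit would return 3.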

> So I didn't see the a new logic be added, would you describe more clear about
> the purpose of this patch?
> 
> >
> > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index 6c58decec..410a81f30 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >  	uint16_t pkt_len;
> >  	uint64_t qword1;
> >  	uint32_t rx_status;
> > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> >  	int32_t i, j, nb_rx = 0;
> >  	uint64_t pkt_flags;
> >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >  					I40E_RXD_QW1_STATUS_SHIFT;
> >  		}
> >
> > -		rte_smp_rmb();
> 
> Any performance gain by removing this? and it is not necessary to be
> combined with below change, right?
> 
> > -
> >  		/* Compute how many status bits were set */
> > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > +			if (var)
> > +				nb_dd += 1;
> > +			else
> > +				break;
> > +		}
> >
> >  		nb_rx += nb_dd;
> >
> > --
> > 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-06 18:33   ` Honnappa Nagarahalli
@ 2021-06-07 14:55     ` Zhang, Qi Z
  2021-06-07 21:36       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 22+ messages in thread
From: Zhang, Qi Z @ 2021-06-07 14:55 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Joyce Kong, Xing, Beilei, Ruifeng Wang; +Cc: dev, nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, June 7, 2021 2:33 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; Joyce Kong <Joyce.Kong@arm.com>;
> Xing, Beilei <beilei.xing@intel.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1] net/i40e: remove the SMP barrier in HW scanning
> func
> 
> <snip>
> 
> > >
> > > Add the logic to determine how many DD bits have been set for
> > > contiguous packets, for removing the SMP barrier while reading descs.
> >
> > I didn't understand this.
> > The current logic already guarantee the read out DD bits are from
> > continue packets, as it read Rx descriptor in a reversed order from the ring.
> Qi, the comments in the code mention that there is a race condition if the
> descriptors are not read in the reverse order. But, they do not mention what
> the race condition is and how it can occur. Appreciate if you could explain
> that.

The race condition happens between the NIC and the CPU. If the DD bits are written and read in the same order, there might be a hole (e.g. 1011). With the reverse read order, we make sure there is no "1" after the first "0".
As the read addresses are declared volatile, the compiler will not re-order them.

> 
> On x86, the reads are not re-ordered (though the compiler can re-order). On
> ARM, the reads can get re-ordered and hence the barriers are required. In
> order to avoid the barriers, we are trying to process only those descriptors
> whose DD bits are set such that they are contiguous. i.e. if the DD bits are
> 1011, we process only the first descriptor.

OK, I see. Thanks for the explanation.
At this moment, I would prefer not to change the behavior on x86, so a compile-time option for Arm can be added. In the future, once we observe no performance impact on x86 either, we can consider removing it. What do you think?
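
A rough sketch of what such a compile-time split could look like, assuming the DPDK build macros RTE_ARCH_ARM64 and RTE_ARCH_ARM select the Arm path. LOOK_AHEAD, DD_MASK and count_dd are placeholder names, and the existing barrier call, which would stay in the non-Arm branch, is left out so that the sketch does not depend on DPDK headers.

```c
#include <assert.h>
#include <stdint.h>

#define LOOK_AHEAD 8   /* assumed group size */
#define DD_MASK 1u     /* assumed position of the DD bit */

static int count_dd(const uint32_t s[LOOK_AHEAD])
{
	int nb_dd = 0;
	int j;

#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_ARM)
	/* Arm: stop at the first clear DD bit (contiguous descriptors only). */
	for (j = 0; j < LOOK_AHEAD; j++) {
		if (!(s[j] & DD_MASK))
			break;
		nb_dd++;
	}
#else
	/* Other architectures: keep summing every DD bit, as before. */
	for (j = 0; j < LOOK_AHEAD; j++)
		nb_dd += s[j] & DD_MASK;
#endif
	assert(nb_dd <= LOOK_AHEAD);
	return nb_dd;
}
```

Both branches return the same count whenever the set DD bits have no hole; they differ only for patterns such as 1 0 1 1.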

> 
> > So I didn't see the a new logic be added, would you describe more
> > clear about the purpose of this patch?
> >
> > >
> > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
> > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > index 6c58decec..410a81f30 100644
> > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > >  	uint16_t pkt_len;
> > >  	uint64_t qword1;
> > >  	uint32_t rx_status;
> > > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> > >  	int32_t i, j, nb_rx = 0;
> > >  	uint64_t pkt_flags;
> > >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > > @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > >  					I40E_RXD_QW1_STATUS_SHIFT;
> > >  		}
> > >
> > > -		rte_smp_rmb();
> >
> > Any performance gain by removing this? and it is not necessary to be
> > combined with below change, right?
> >
> > > -
> > >  		/* Compute how many status bits were set */
> > > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > +			if (var)
> > > +				nb_dd += 1;
> > > +			else
> > > +				break;
> > > +		}
> > >
> > >  		nb_rx += nb_dd;
> > >
> > > --
> > > 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-07 14:55     ` Zhang, Qi Z
@ 2021-06-07 21:36       ` Honnappa Nagarahalli
  2021-06-15  6:30         ` Joyce Kong
  2021-06-16 13:29         ` Zhang, Qi Z
  0 siblings, 2 replies; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-07 21:36 UTC (permalink / raw)
  To: Zhang, Qi Z, Joyce Kong, Xing, Beilei, Ruifeng Wang
  Cc: dev, nd, Honnappa Nagarahalli, nd

<snip>

> >
> > > >
> > > > Add the logic to determine how many DD bits have been set for
> > > > contiguous packets, for removing the SMP barrier while reading descs.
> > >
> > > I didn't understand this.
> > > The current logic already guarantee the read out DD bits are from
> > > continue packets, as it read Rx descriptor in a reversed order from the
> > > > ring.
> > Qi, the comments in the code mention that there is a race condition if
> > the descriptors are not read in the reverse order. But, they do not
> > mention what the race condition is and how it can occur. Appreciate if
> > you could explain that.
> 
> The Race condition happens between the NIC and CPU, if write and read DD
> bit in the same order, there might be a hole (e.g. 1011)  with the reverse read
> order, we make sure no more "1" after the first "0"
> as the read address are declared as volatile, compiler will not re-ordered
> them.
My understanding is that

1) the NIC will write an entire cache line of descriptors to memory "atomically" (i.e. the entire cache line is visible to the CPU at once) if there are enough descriptors ready to fill one cache line.
2) But if there are not enough descriptors ready (for example, because there is not enough traffic), it might write partial cache lines.

Please correct me if I am wrong.

For #1, I do not think it matters if we read the descriptors in reverse order or not as the cache line is written atomically.
For #1, if we read in reverse order, does it make sense to not check the DD bits of descriptors that are earlier in the order once we encounter a descriptor that has its DD bit set? This is because NIC updates the descriptors in order.

> 
> >
> > On x86, the reads are not re-ordered (though the compiler can
> > re-order). On ARM, the reads can get re-ordered and hence the barriers
> > are required. In order to avoid the barriers, we are trying to process
> > only those descriptors whose DD bits are set such that they are
> > contiguous. i.e. if the DD bits are 1011, we process only the first descriptor.
> 
> Ok, I see. thanks for the explanation.
> At this moment, I may prefer not change the behavior of x86, so compile
> option for arm can be added, in future when we observe no performance
> impact for x86 as well, we can consider to remove it, what do you think?
I am ok with this approach.

> 
> >
> > > So I didn't see the a new logic be added, would you describe more
> > > clear about the purpose of this patch?
> > >
> > > >
> > > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
> > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > index 6c58decec..410a81f30 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > >  	uint16_t pkt_len;
> > > >  	uint64_t qword1;
> > > >  	uint32_t rx_status;
> > > > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > > > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> > > >  	int32_t i, j, nb_rx = 0;
> > > >  	uint64_t pkt_flags;
> > > >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > > > @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > >  					I40E_RXD_QW1_STATUS_SHIFT;
> > > >  		}
> > > >
> > > > -		rte_smp_rmb();
> > >
> > > Any performance gain by removing this? and it is not necessary to be
> > > combined with below change, right?
> > >
> > > > -
> > > >  		/* Compute how many status bits were set */
> > > > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > > > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > > > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > +			if (var)
> > > > +				nb_dd += 1;
> > > > +			else
> > > > +				break;
> > > > +		}
> > > >
> > > >  		nb_rx += nb_dd;
> > > >
> > > > --
> > > > 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-07 21:36       ` Honnappa Nagarahalli
@ 2021-06-15  6:30         ` Joyce Kong
  2021-06-16 13:29         ` Zhang, Qi Z
  1 sibling, 0 replies; 22+ messages in thread
From: Joyce Kong @ 2021-06-15  6:30 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Zhang, Qi Z, Xing, Beilei, Ruifeng Wang; +Cc: dev, nd, nd

<snip>
 
> > > > > Add the logic to determine how many DD bits have been set for
> > > > > contiguous packets, for removing the SMP barrier while reading descs.
> > > >
> > > > I didn't understand this.
> > > > The current logic already guarantee the read out DD bits are from
> > > > continue packets, as it read Rx descriptor in a reversed order
> > > > from the ring.
> > > Qi, the comments in the code mention that there is a race condition
> > > if the descriptors are not read in the reverse order. But, they do
> > > not mention what the race condition is and how it can occur.
> > > Appreciate if you could explain that.
> >
> > The Race condition happens between the NIC and CPU, if write and read
> > DD bit in the same order, there might be a hole (e.g. 1011)  with the
> > reverse read order, we make sure no more "1" after the first "0"
> > as the read address are declared as volatile, compiler will not
> > re-ordered them.
> My understanding is that
> 
> 1) the NIC will write an entire cache line of descriptors to memory
> "atomically" (i.e. the entire cache line is visible to the CPU at once) if there
> are enough descriptors ready to fill one cache line.
> 2) But, if there are not enough descriptors ready (because for ex: there is not
> enough traffic), then it might write partial cache lines.
> 
> Please correct me if I am wrong.
> 
> For #1, I do not think it matters if we read the descriptors in reverse order or
> not as the cache line is written atomically.
> For #1, if we read in reverse order, does it make sense to not check the DD
> bits of descriptors that are earlier in the order once we encounter a
> descriptor that has its DD bit set? This is because NIC updates the descriptors
> in order.
> 
> >
> > >
> > > On x86, the reads are not re-ordered (though the compiler can
> > > re-order). On ARM, the reads can get re-ordered and hence the
> > > barriers are required. In order to avoid the barriers, we are trying
> > > to process only those descriptors whose DD bits are set such that
> > > they are contiguous. i.e. if the DD bits are 1011, we process only the first
> > > > descriptor.
> >
> > Ok, I see. thanks for the explanation.
> > At this moment, I may prefer not change the behavior of x86, so
> > compile option for arm can be added, in future when we observe no
> > performance impact for x86 as well, we can consider to remove it, what do
> > > you think?
> I am ok with this approach.
> 

Thanks for your comments, I will modify the patch according to your suggestions.

> >
> > >
> > > > So I didn't see the a new logic be added, would you describe more
> > > > clear about the purpose of this patch?
> > > >
> > > > >
> > > > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > ---
> > > > >  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
> > > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > > index 6c58decec..410a81f30 100644
> > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > >  	uint16_t pkt_len;
> > > > >  	uint64_t qword1;
> > > > >  	uint32_t rx_status;
> > > > > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > > > > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> > > > >  	int32_t i, j, nb_rx = 0;
> > > > >  	uint64_t pkt_flags;
> > > > >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > > > > @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > >  					I40E_RXD_QW1_STATUS_SHIFT;
> > > > >  		}
> > > > >
> > > > > -		rte_smp_rmb();
> > > >
> > > > Any performance gain by removing this? and it is not necessary to
> > > > be combined with below change, right?
> > > >

I have tested the patch on both x86 and Arm platforms, and there seems to be no performance change.
As Honnappa explained, we combined the two changes to avoid the barriers. This way, we only
process those descriptors whose DD bits are set contiguously.

> > > > > -
> > > > >  		/* Compute how many status bits were set */
> > > > > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > > > > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > > > > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > > +			if (var)
> > > > > +				nb_dd += 1;
> > > > > +			else
> > > > > +				break;
> > > > > +		}
> > > > >
> > > > >  		nb_rx += nb_dd;
> > > > >
> > > > > --
> > > > > 2.17.1
> 



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-07 21:36       ` Honnappa Nagarahalli
  2021-06-15  6:30         ` Joyce Kong
@ 2021-06-16 13:29         ` Zhang, Qi Z
  2021-06-16 13:37           ` Bruce Richardson
  1 sibling, 1 reply; 22+ messages in thread
From: Zhang, Qi Z @ 2021-06-16 13:29 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Joyce Kong, Xing, Beilei, Ruifeng Wang; +Cc: dev, nd, nd

Hi

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, June 8, 2021 5:36 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; Joyce Kong <Joyce.Kong@arm.com>;
> Xing, Beilei <beilei.xing@intel.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1] net/i40e: remove the SMP barrier in HW scanning
> func
> 
> <snip>
> 
> > >
> > > > >
> > > > > Add the logic to determine how many DD bits have been set for
> > > > > contiguous packets, for removing the SMP barrier while reading descs.
> > > >
> > > > I didn't understand this.
> > > > The current logic already guarantee the read out DD bits are from
> > > > continue packets, as it read Rx descriptor in a reversed order
> > > > > from the ring.
> > > Qi, the comments in the code mention that there is a race condition
> > > if the descriptors are not read in the reverse order. But, they do
> > > not mention what the race condition is and how it can occur.
> > > Appreciate if you could explain that.
> >
> > The Race condition happens between the NIC and CPU, if write and read
> > DD bit in the same order, there might be a hole (e.g. 1011)  with the
> > reverse read order, we make sure no more "1" after the first "0"
> > as the read address are declared as volatile, compiler will not
> > re-ordered them.
> My understanding is that
> 
> 1) the NIC will write an entire cache line of descriptors to memory "atomically"
> (i.e. the entire cache line is visible to the CPU at once) if there are enough
> descriptors ready to fill one cache line.
> 2) But, if there are not enough descriptors ready (because for ex: there is not
> enough traffic), then it might write partial cache lines.

Yes. For example, a cache line contains four 16-byte descriptors, and it is possible to see 1 1 1 0 for the DD bits at some moment.

> 
> Please correct me if I am wrong.
> 
> For #1, I do not think it matters if we read the descriptors in reverse order or
> not as the cache line is written atomically.

I think the case below may happen if we don't read in reverse order.

1. The CPU gets the first cache line as 1 1 1 0 in a loop.
2. New packets arrive, and the NIC appends the last 1 to the first cache line and writes a new cache line with 1 1 1 1.
3. The CPU continues with the new cache line, 1 1 1 1, in the same loop, but the last 1 of the first cache line is missed, so it finally gets 1 1 1 0 1 1 1 1.
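
The three steps above can be replayed with a toy simulation. This is only a sketch: the ring is a plain array, the "NIC" is a function called halfway through the scan, and all names (toy_ring, nic_complete, scan, has_hole) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING 8	/* two "cache lines" of four descriptors each */

/* Toy ring: dd[i] is 1 once the simulated NIC has written descriptor i. */
struct toy_ring {
	uint8_t dd[RING];
};

/* Simulated NIC: completes descriptors [from, to) in ring order. */
static void nic_complete(struct toy_ring *r, int from, int to)
{
	assert(from >= 0 && to <= RING);
	for (int i = from; i < to; i++)
		r->dd[i] = 1;
}

/*
 * Snapshot the DD bits while the NIC completes the rest of the ring
 * halfway through the scan. 'reverse' selects the read direction.
 */
static void scan(int reverse, uint8_t out[RING])
{
	struct toy_ring r;

	memset(&r, 0, sizeof(r));
	nic_complete(&r, 0, 3);		/* ring starts as 1 1 1 0 0 0 0 0 */

	for (int k = 0; k < RING; k++) {
		int i = reverse ? RING - 1 - k : k;

		if (k == RING / 2)
			nic_complete(&r, 3, RING);	/* NIC races ahead */
		out[i] = r.dd[i];
	}
}

/* Return 1 if a set bit follows a clear bit, i.e. the snapshot has a hole. */
static int has_hole(const uint8_t s[RING])
{
	int seen_zero = 0;

	for (int i = 0; i < RING; i++) {
		if (!s[i])
			seen_zero = 1;
		else if (seen_zero)
			return 1;
	}
	return 0;
}
```

In this model the forward scan records 1 1 1 0 1 1 1 1, which has a hole, while the reverse scan records 1 1 1 1 0 0 0 0, which does not.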


> For #1, if we read in reverse order, does it make sense to not check the DD bits
> of descriptors that are earlier in the order once we encounter a descriptor that
> has its DD bit set? This is because NIC updates the descriptors in order.

I think the answer is yes. When we meet the first DD bit, we should be able to calculate the exact number based on the index, but I am not sure how much performance gain there would be.
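
A small sketch of that index-based calculation, assuming the NIC completes descriptors strictly in ring order. LOOK_AHEAD, DD_MASK and count_dd_from_last are hypothetical names. The sketch covers only the counting; any ordering needed for the later reads of the descriptor fields is a separate matter.

```c
#include <assert.h>
#include <stdint.h>

#define LOOK_AHEAD 8   /* assumed group size */
#define DD_MASK 1u     /* assumed position of the DD bit */

/*
 * Scan from the last status word towards the first. If descriptors are
 * completed in ring order, the highest index with DD set implies that
 * every earlier descriptor is done as well, so the count is index + 1.
 */
static int count_dd_from_last(const uint32_t s[], int n)
{
	assert(n >= 0 && n <= LOOK_AHEAD);
	for (int j = n - 1; j >= 0; j--) {
		if (s[j] & DD_MASK)
			return j + 1;
	}
	return 0;
}
```

For the DD pattern 1 1 1 0 the scan stops at index 2 and returns 3 without looking at indexes 1 and 0.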


> 
> >
> > >
> > > On x86, the reads are not re-ordered (though the compiler can
> > > re-order). On ARM, the reads can get re-ordered and hence the
> > > barriers are required. In order to avoid the barriers, we are trying
> > > to process only those descriptors whose DD bits are set such that
> > > they are contiguous. i.e. if the DD bits are 1011, we process only the first
> > > > descriptor.
> >
> > Ok, I see. thanks for the explanation.
> > At this moment, I may prefer not change the behavior of x86, so
> > compile option for arm can be added, in future when we observe no
> > performance impact for x86 as well, we can consider to remove it, what do
> > > you think?
> I am ok with this approach.
> 
> >
> > >
> > > > So I didn't see the a new logic be added, would you describe more
> > > > clear about the purpose of this patch?
> > > >
> > > > >
> > > > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > ---
> > > > >  drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
> > > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > > index 6c58decec..410a81f30 100644
> > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > >  	uint16_t pkt_len;
> > > > >  	uint64_t qword1;
> > > > >  	uint32_t rx_status;
> > > > > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > > > > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> > > > >  	int32_t i, j, nb_rx = 0;
> > > > >  	uint64_t pkt_flags;
> > > > >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > > > > @@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > >  					I40E_RXD_QW1_STATUS_SHIFT;
> > > > >  		}
> > > > >
> > > > > -		rte_smp_rmb();
> > > >
> > > > Any performance gain by removing this? and it is not necessary to
> > > > be combined with below change, right?
> > > >
> > > > > -
> > > > >  		/* Compute how many status bits were set */
> > > > > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > > > > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > > > > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > > > > +			if (var)
> > > > > +				nb_dd += 1;
> > > > > +			else
> > > > > +				break;
> > > > > +		}
> > > > >
> > > > >  		nb_rx += nb_dd;
> > > > >
> > > > > --
> > > > > 2.17.1



* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-16 13:29         ` Zhang, Qi Z
@ 2021-06-16 13:37           ` Bruce Richardson
  2021-06-16 20:26             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 22+ messages in thread
From: Bruce Richardson @ 2021-06-16 13:37 UTC (permalink / raw)
  To: Zhang, Qi Z
  Cc: Honnappa Nagarahalli, Joyce Kong, Xing, Beilei, Ruifeng Wang, dev, nd

On Wed, Jun 16, 2021 at 01:29:24PM +0000, Zhang, Qi Z wrote:
> Hi
> 
> > -----Original Message-----
> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Sent: Tuesday, June 8, 2021 5:36 AM
> > To: Zhang, Qi Z <qi.z.zhang@intel.com>; Joyce Kong <Joyce.Kong@arm.com>;
> > Xing, Beilei <beilei.xing@intel.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v1] net/i40e: remove the SMP barrier in HW scanning
> > func
> > 
> > <snip>
> > 
> > > >
> > > > > >
> > > > > > Add the logic to determine how many DD bits have been set for
> > > > > > contiguous packets, for removing the SMP barrier while reading descs.
> > > > >
> > > > > I didn't understand this.
> > > > > The current logic already guarantee the read out DD bits are from
> > > > > continue packets, as it read Rx descriptor in a reversed order
> > > > > > from the ring.
> > > > Qi, the comments in the code mention that there is a race condition
> > > > if the descriptors are not read in the reverse order. But, they do
> > > > not mention what the race condition is and how it can occur.
> > > > Appreciate if you could explain that.
> > >
> > > The Race condition happens between the NIC and CPU, if write and read
> > > DD bit in the same order, there might be a hole (e.g. 1011)  with the
> > > reverse read order, we make sure no more "1" after the first "0"
> > > as the read address are declared as volatile, compiler will not
> > > re-ordered them.
> > My understanding is that
> > 
> > 1) the NIC will write an entire cache line of descriptors to memory "atomically"
> > (i.e. the entire cache line is visible to the CPU at once) if there are enough
> > descriptors ready to fill one cache line.
> > 2) But, if there are not enough descriptors ready (because for ex: there is not
> > enough traffic), then it might write partial cache lines.
> 
> Yes, for example a cache line contains 4 x16 bytes descriptors and it is possible we get 1 1 1 0 for DD bit at some moment.
> 
> > 
> > Please correct me if I am wrong.
> > 
> > For #1, I do not think it matters if we read the descriptors in reverse order or
> > not as the cache line is written atomically.
> 
> I think below cases may happens if we don't read in reserve order.
> 
> 1. CPU get first cache line as 1 1 1 0 in a loop
> 2. new packets coming and NIC append last 1 to the first cache and a new cache line with 1 1 1 1.
> 3. CPU continue new cache line with 1 1 1 1 in the same loop, but the last 1 of first cache line is missed, so finally it get 1 1 1 0 1 1 1 1. 
> 

The one-sentence answer here is: when two entities are moving along a line
in the same direction - like two runners in a race - then they can pass
each other multiple times as each goes slower or faster at any point in
time, whereas if they are moving in opposite directions there will only
ever be one cross-over point no matter how the speed of each changes. 

In the case of NIC and software this fact means that there will always be a
clear cross-over point from DD set to not-set.

> 
> > For #1, if we read in reverse order, does it make sense to not check the DD bits
> > of descriptors that are earlier in the order once we encounter a descriptor that
> > has its DD bit set? This is because NIC updates the descriptors in order.
> 
> I think the answer is yes, when we met the first DD bit, we should able to calculated the exact number base on the index, but not sure how much performance gain.
> 
The other factors here are:
1. The driver does not do a straight read of all 32 DD bits in one go,
rather it does 8 at a time and aborts at the end of a set of 8 if not all
are valid.
2. For any that are set, we have to read the descriptor anyway to get the
packet data out of it, so in the shortcut case of the last descriptor being
set, we still have to read the other 7 anyway, and DD comes for free as
part of it.
3. Blindly reading 8 at a time reduces the branching to just a single
decision point at the end of each set of 8, reducing possible branch
mispredicts.
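
A minimal sketch of the scan pattern described in the three points above, assuming a simplified ring that holds only status words. The names LOOK_AHEAD, DD_BIT and scan_group() are hypothetical stand-ins and are not the driver's identifiers; the sketch also assumes the loads execute in program order, which is the x86 case.

```c
#include <stdint.h>

#define LOOK_AHEAD 8          /* hypothetical stand-in for the group size */
#define DD_BIT     (1u << 0)  /* hypothetical position of the DD bit */

/*
 * Scan one group of descriptors. The status words are read last-to-first;
 * because the NIC completes descriptors first-to-last, a slot read later
 * (lower index) cannot be "not done" if a higher-index slot was already
 * seen as done, provided the loads happen in program order.
 */
static int scan_group(const volatile uint32_t *status)
{
	uint32_t s[LOOK_AHEAD];
	int j, nb_dd = 0;

	/* Blind reverse read of the whole group, no branch per descriptor. */
	for (j = LOOK_AHEAD - 1; j >= 0; j--)
		s[j] = status[j];

	/* Count the DD bits found in the snapshot. */
	for (j = 0; j < LOOK_AHEAD; j++)
		nb_dd += (s[j] & DD_BIT) ? 1 : 0;

	/* The caller makes one decision per group: stop if nb_dd != LOOK_AHEAD. */
	return nb_dd;
}
```

With a group where only the first five descriptors are done, scan_group() returns 5 and the caller would stop scanning after this group.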


* Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func
  2021-06-16 13:37           ` Bruce Richardson
@ 2021-06-16 20:26             ` Honnappa Nagarahalli
  0 siblings, 0 replies; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-16 20:26 UTC (permalink / raw)
  To: Bruce Richardson, Zhang, Qi Z
  Cc: Joyce Kong, Xing, Beilei, Ruifeng Wang, dev, nd, nd

<snip>

> > > > >
> > > > > > >
> > > > > > > Add the logic to determine how many DD bits have been set
> > > > > > > for contiguous packets, for removing the SMP barrier while
> > > > > > > reading descs.
> > > > > >
> > > > > > I didn't understand this.
> > > > > > The current logic already guarantee the read out DD bits are
> > > > > > from continue packets, as it read Rx descriptor in a reversed
> > > > > > order from the ring.
> > > > > Qi, the comments in the code mention that there is a race
> > > > > condition if the descriptors are not read in the reverse order.
> > > > > But, they do not mention what the race condition is and how it
> > > > > can occur.
> > > > > Appreciate if you could explain that.
> > > >
> > > > The Race condition happens between the NIC and CPU, if write and
> > > > read DD bit in the same order, there might be a hole (e.g. 1011)
> > > > with the reverse read order, we make sure no more "1" after the first "0"
> > > > as the read address are declared as volatile, compiler will not
> > > > re-ordered them.
> > > My understanding is that
> > >
> > > 1) the NIC will write an entire cache line of descriptors to memory
> > > "atomically" (i.e. the entire cache line is visible to the CPU at
> > > once) if there are enough descriptors ready to fill one cache line.
> > > 2) But, if there are not enough descriptors ready (because for ex:
> > > there is not enough traffic), then it might write partial cache lines.
> >
> > Yes, for example a cache line contains 4 x16 bytes descriptors and it is
> > possible we get 1 1 1 0 for DD bit at some moment.
> >
> > >
> > > Please correct me if I am wrong.
> > >
> > > For #1, I do not think it matters if we read the descriptors in
> > > reverse order or not as the cache line is written atomically.
> >
> > I think below cases may happens if we don't read in reserve order.
> >
> > 1. CPU get first cache line as 1 1 1 0 in a loop
> > 2. new packets coming and NIC append last 1 to the first cache and a new
> > cache line with 1 1 1 1.
> > 3. CPU continue new cache line with 1 1 1 1 in the same loop, but the last 1
> > of first cache line is missed, so finally it get 1 1 1 0 1 1 1 1.
> >
> 
> The one-sentence answer here is: when two entities are moving along a line in
> the same direction - like two runners in a race - then they can pass each other
> multiple times as each goes slower or faster at any point in time, whereas if
> they are moving in opposite directions there will only ever be one cross-over
> point no matter how the speed of each changes.
> 
> In the case of NIC and software this fact means that there will always be a
> clear cross-over point from DD set to not-set.
Thanks Bruce, that is a great analogy to describe the problem assuming that the reads are actually happening in the program order.

On Arm platforms, even though the program reads in reverse order, the reads might be executed in any order. We have 2 solutions here:
1) Enforce the order with barriers, or
2) Only process descriptors with contiguous DD bits set
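
To illustrate option 2, here is a minimal sketch that compares summing every DD bit in a snapshot against counting only the leading run of set bits. LOOK_AHEAD, DD_BIT and both function names are hypothetical and chosen for illustration; they are not taken from the driver.

```c
#include <stdint.h>

#define LOOK_AHEAD 8          /* hypothetical stand-in for the group size */
#define DD_BIT     (1u << 0)  /* hypothetical position of the DD bit */

/* Sum every DD bit in the snapshot, including bits that follow a hole. */
static int count_all_dd(const uint32_t *s)
{
	int j, nb_dd = 0;

	for (j = 0; j < LOOK_AHEAD; j++)
		nb_dd += (s[j] & DD_BIT) ? 1 : 0;
	return nb_dd;
}

/* Count only the leading run of set DD bits; stop at the first clear bit. */
static int count_contiguous_dd(const uint32_t *s)
{
	int nb_dd = 0;

	while (nb_dd < LOOK_AHEAD && (s[nb_dd] & DD_BIT))
		nb_dd++;
	return nb_dd;
}
```

For a snapshot with a hole such as 1 1 0 1 1 0 0 0, count_all_dd() reports 4 while count_contiguous_dd() reports 2, so only the two descriptors known to be complete and in order would be processed.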

> 
> >
> > > For #1, if we read in reverse order, does it make sense to not check
> > > the DD bits of descriptors that are earlier in the order once we
> > > encounter a descriptor that has its DD bit set? This is because NIC
> > > updates the descriptors in order.
> >
> > I think the answer is yes, when we met the first DD bit, we should able to
> > calculated the exact number base on the index, but not sure how much
> > performance gain.
> >
> The other factors here are:
> 1. The driver does not do a straight read of all 32 DD bits in one go, rather it
> does 8 at a time and aborts at the end of a set of 8 if not all are valid.
> 2. For any that are set, we have to read the descriptor anyway to get the
> packet data out of it, so in the shortcut case of the last descriptor being set,
> we still have to read the other 7 anyway, and DD comes for free as part of it.
> 3. Blindly reading 8 at a time reduces the branching to just a single decision
> point at the end of each set of 8, reducing possible branch mispredicts.
Agree.
I think there is another requirement. The other words in the descriptor should be read only after reading the word containing the DD bit.

On x86, the program order takes care of this (although a compiler barrier is required).
On Arm, this needs to be taken care of explicitly using barriers.
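
A minimal sketch of this requirement, assuming a hypothetical two-word descriptor in which the device fills `data` before setting the DD bit in `status`. The struct layout, DD_BIT and read_desc() are illustrative assumptions, not the i40e descriptor format, and the sketch uses the GCC/Clang __atomic builtins rather than the DPDK wrappers.

```c
#include <stdint.h>

#define DD_BIT (1ull << 0)   /* hypothetical position of the DD bit */

/* Hypothetical descriptor: `data` is written first, then DD in `status`. */
struct rx_desc {
	uint64_t data;
	uint64_t status;
};

/* Returns 1 and copies the payload word if the descriptor is done, else 0. */
static int read_desc(struct rx_desc *d, uint64_t *out)
{
	/* Load the word that carries the DD bit first. */
	uint64_t st = __atomic_load_n(&d->status, __ATOMIC_RELAXED);

	if (!(st & DD_BIT))
		return 0;

	/*
	 * Keep the load of the other word from moving ahead of the status
	 * load. On x86 this is expected to be a compiler barrier only; on
	 * Arm it is expected to emit a load barrier.
	 */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);

	*out = __atomic_load_n(&d->data, __ATOMIC_RELAXED);
	return 1;
}
```

Without the fence, a weakly ordered CPU could fetch `data` before `status` and return a stale payload for a descriptor that was completed in between the two loads.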


* [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm
  2021-06-04  7:34 [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func Joyce Kong
  2021-06-04 16:12 ` Honnappa Nagarahalli
  2021-06-06 14:17 ` Zhang, Qi Z
@ 2021-06-23  8:43 ` Joyce Kong
  2021-06-30  1:14   ` Honnappa Nagarahalli
  2021-07-06  6:54 ` [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring Joyce Kong
  3 siblings, 1 reply; 22+ messages in thread
From: Joyce Kong @ 2021-06-23  8:43 UTC (permalink / raw)
  To: beilei.xing, qi.z.zhang, ruifeng.wang, honnappa.nagarahalli,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd

For Arm platforms, reading descs can get re-ordered, then the
status of DD bits will be discontinuous, so add the logic to
only process continuous descs by checking DD bits.

Fixes: 4861cde46116 ("i40e: new poll mode driver")
Cc: stable@dpdk.org

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decec..86e2f083e 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint16_t pkt_len;
 	uint64_t qword1;
 	uint32_t rx_status;
-	int32_t s[I40E_LOOK_AHEAD], nb_dd;
+	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
 	int32_t i, j, nb_rx = 0;
 	uint64_t pkt_flags;
 	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
@@ -482,11 +482,22 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 					I40E_RXD_QW1_STATUS_SHIFT;
 		}
 
-		rte_smp_rmb();
+		/* This barrier is to order loads of different words in the descriptor */
+		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
 
 		/* Compute how many status bits were set */
-		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
-			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
+			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+#ifdef RTE_ARCH_ARM
+			/* For Arm platforms, only compute continuous status bits */
+			if (var)
+				nb_dd += 1;
+			else
+				break;
+#else
+			nb_dd += var;
+#endif
+		}
 
 		nb_rx += nb_dd;
 
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm
  2021-06-23  8:43 ` [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
@ 2021-06-30  1:14   ` Honnappa Nagarahalli
  2021-07-05  3:41     ` Joyce Kong
  0 siblings, 1 reply; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-06-30  1:14 UTC (permalink / raw)
  To: Joyce Kong, beilei.xing, qi.z.zhang, Ruifeng Wang,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd, Honnappa Nagarahalli, nd

<snip>
> 
> For Arm platforms, reading descs can get re-ordered, then the status of DD
> bits will be discontinuous, so add the logic to only process continuous descs
> by checking DD bits.
> 
> Fixes: 4861cde46116 ("i40e: new poll mode driver")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 6c58decec..86e2f083e 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  	uint16_t pkt_len;
>  	uint64_t qword1;
>  	uint32_t rx_status;
> -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
>  	int32_t i, j, nb_rx = 0;
>  	uint64_t pkt_flags;
>  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> @@ -482,11 +482,22 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  					I40E_RXD_QW1_STATUS_SHIFT;
>  		}
> 
> -		rte_smp_rmb();
> +		/* This barrier is to order loads of different words in the descriptor */
> +		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
I think this should go into a separate commit as the following change is unrelated.

> 
>  		/* Compute how many status bits were set */
> -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> +#ifdef RTE_ARCH_ARM
> +			/* For Arm platforms, only compute continuous status bits */
> +			if (var)
> +				nb_dd += 1;
> +			else
> +				break;
> +#else
> +			nb_dd += var;
> +#endif
> +		}
> 
>  		nb_rx += nb_dd;
> 
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm
  2021-06-30  1:14   ` Honnappa Nagarahalli
@ 2021-07-05  3:41     ` Joyce Kong
  0 siblings, 0 replies; 22+ messages in thread
From: Joyce Kong @ 2021-07-05  3:41 UTC (permalink / raw)
  To: Honnappa Nagarahalli, beilei.xing, qi.z.zhang, Ruifeng Wang,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd



> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, June 30, 2021 9:15 AM
> To: Joyce Kong <Joyce.Kong@arm.com>; beilei.xing@intel.com;
> qi.z.zhang@intel.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; helin.zhang@intel.com
> Cc: dev@dpdk.org; stable@dpdk.org; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2] net/i40e: add logic of processing continuous DD bits
> for Arm
> 
> <snip>
> >
> > For Arm platforms, reading descs can get re-ordered, then the status
> > of DD bits will be discontinuous, so add the logic to only process
> > continuous descs by checking DD bits.
> >
> > Fixes: 4861cde46116 ("i40e: new poll mode driver")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index 6c58decec..86e2f083e 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >  	uint16_t pkt_len;
> >  	uint64_t qword1;
> >  	uint32_t rx_status;
> > -	int32_t s[I40E_LOOK_AHEAD], nb_dd;
> > +	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
> >  	int32_t i, j, nb_rx = 0;
> >  	uint64_t pkt_flags;
> >  	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
> > @@ -482,11 +482,22 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >  					I40E_RXD_QW1_STATUS_SHIFT;
> >  		}
> >
> > -		rte_smp_rmb();
> > +		/* This barrier is to order loads of different words in the descriptor */
> > +		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> I think this should go into a separate commit as the following change is
> unrelated.

Will separate the two changes in v3.

> 
> >
> >  		/* Compute how many status bits were set */
> > -		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > -			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > +		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> > +			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
> > +#ifdef RTE_ARCH_ARM
> > +			/* For Arm platforms, only compute continuous status bits */
> > +			if (var)
> > +				nb_dd += 1;
> > +			else
> > +				break;
> > +#else
> > +			nb_dd += var;
> > +#endif
> > +		}
> >
> >  		nb_rx += nb_dd;
> >
> > --
> > 2.17.1



* [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring
  2021-06-04  7:34 [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func Joyce Kong
                   ` (2 preceding siblings ...)
  2021-06-23  8:43 ` [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
@ 2021-07-06  6:54 ` Joyce Kong
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence Joyce Kong
  3 siblings, 2 replies; 22+ messages in thread
From: Joyce Kong @ 2021-07-06  6:54 UTC (permalink / raw)
  To: beilei.xing, qi.z.zhang, ruifeng.wang, honnappa.nagarahalli,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd

This patchset contains two parts for i40e PMD, one is to add
the logic of processing continuous DD bits for Arm platform,
the other is to replace SMP barrier with thread fence.

v3:
 Separate the commit changes into two parts. <Honnappa Nagarahalli>

v2:
 Only add the compile option for Arm and keep X86 intact. <Qi Zhang> 

v1:
 The initial version.

Joyce Kong (2):
  net/i40e: add logic of processing continuous DD bits for Arm
  net/i40e: replace SMP barrier with thread fence

 drivers/net/i40e/i40e_rxtx.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

-- 
2.17.1



* [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm
  2021-07-06  6:54 ` [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring Joyce Kong
@ 2021-07-06  6:54   ` Joyce Kong
  2021-07-09  3:05     ` Zhang, Qi Z
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence Joyce Kong
  1 sibling, 1 reply; 22+ messages in thread
From: Joyce Kong @ 2021-07-06  6:54 UTC (permalink / raw)
  To: beilei.xing, qi.z.zhang, ruifeng.wang, honnappa.nagarahalli,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd

For Arm platforms, reading descs can get re-ordered, then the
status of DD bits will be discontinuous, so add the logic to
only process continuous descs by checking DD bits.

Fixes: 4861cde46116 ("i40e: new poll mode driver")
Cc: stable@dpdk.org

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decec..9aaabfd92 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint16_t pkt_len;
 	uint64_t qword1;
 	uint32_t rx_status;
-	int32_t s[I40E_LOOK_AHEAD], nb_dd;
+	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
 	int32_t i, j, nb_rx = 0;
 	uint64_t pkt_flags;
 	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
@@ -485,8 +485,18 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
-		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
-			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
+			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+#ifdef RTE_ARCH_ARM
+			/* For Arm platforms, only compute continuous status bits */
+			if (var)
+				nb_dd += 1;
+			else
+				break;
+#else
+			nb_dd += var;
+#endif
+		}
 
 		nb_rx += nb_dd;
 
-- 
2.17.1



* [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-06  6:54 ` [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring Joyce Kong
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
@ 2021-07-06  6:54   ` Joyce Kong
  2021-07-08 12:09     ` Zhang, Qi Z
  2021-07-13  0:46     ` Zhang, Qi Z
  1 sibling, 2 replies; 22+ messages in thread
From: Joyce Kong @ 2021-07-06  6:54 UTC (permalink / raw)
  To: beilei.xing, qi.z.zhang, ruifeng.wang, honnappa.nagarahalli,
	bruce.richardson, helin.zhang
  Cc: dev, stable, nd

Simply replace the SMP barrier with atomic thread fence for
i40e hw ring scan, if there is no synchronization point.

Signed-off-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 9aaabfd92..86e2f083e 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -482,7 +482,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 					I40E_RXD_QW1_STATUS_SHIFT;
 		}
 
-		rte_smp_rmb();
+		/* This barrier is to order loads of different words in the descriptor */
+		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
 
 		/* Compute how many status bits were set */
 		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence Joyce Kong
@ 2021-07-08 12:09     ` Zhang, Qi Z
  2021-07-08 13:51       ` Lance Richardson
  2021-07-13  0:46     ` Zhang, Qi Z
  1 sibling, 1 reply; 22+ messages in thread
From: Zhang, Qi Z @ 2021-07-08 12:09 UTC (permalink / raw)
  To: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli,
	Richardson, Bruce, Zhang, Helin
  Cc: dev, stable, nd



> -----Original Message-----
> From: Joyce Kong <joyce.kong@arm.com>
> Sent: Tuesday, July 6, 2021 2:54 PM
> To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Cc: dev@dpdk.org; stable@dpdk.org; nd@arm.com
> Subject: [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
> 
> Simply replace the SMP barrier with atomic thread fence for i40e hw ring sacn,
> if there is no synchronization point.
> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 9aaabfd92..86e2f083e 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -482,7 +482,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  					I40E_RXD_QW1_STATUS_SHIFT;
>  		}
> 
> -		rte_smp_rmb();
> +		/* This barrier is to order loads of different words in the descriptor */
> +		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);

Now for x86, you actually replace a compiler barrier with a memory fence. This may have a performance impact, which needs additional resources to investigate.
So there are 2 options:
1. If you want this patch to be merged into DPDK 21.08, please make this change for Arm only.
2. You can wait for our update for x86, but I guess it will miss 21.08.

What do you think?

Btw, for patch 1/2, I think I can merge it independently, right?


> 
>  		/* Compute how many status bits were set */
>  		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-08 12:09     ` Zhang, Qi Z
@ 2021-07-08 13:51       ` Lance Richardson
  2021-07-08 14:26         ` Zhang, Qi Z
  0 siblings, 1 reply; 22+ messages in thread
From: Lance Richardson @ 2021-07-08 13:51 UTC (permalink / raw)
  To: Zhang, Qi Z
  Cc: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli,
	Richardson, Bruce, Zhang, Helin, dev, stable, nd


On Thu, Jul 8, 2021 at 8:09 AM Zhang, Qi Z <qi.z.zhang@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Joyce Kong <joyce.kong@arm.com>
> > Sent: Tuesday, July 6, 2021 2:54 PM
> > To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> > Cc: dev@dpdk.org; stable@dpdk.org; nd@arm.com
> > Subject: [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
> >
> > Simply replace the SMP barrier with atomic thread fence for i40e hw ring sacn,
> > if there is no synchronization point.
> >
> > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index 9aaabfd92..86e2f083e 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -482,7 +482,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >                                       I40E_RXD_QW1_STATUS_SHIFT;
> >               }
> >
> > -             rte_smp_rmb();
> > +             /* This barrier is to order loads of different words in the descriptor */
> > +             rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
>
> Now for x86, you actually replace a compiler barrier with a memory fence, this may have potential performance impact which need additional resource to investigate

No memory fence instruction is generated for
__ATOMIC_ACQUIRE on x86 for any version of gcc
or clang that I've tried, based on experiments here:

    https://godbolt.org/z/Yxr1vGhKP


* Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-08 13:51       ` Lance Richardson
@ 2021-07-08 14:26         ` Zhang, Qi Z
  2021-07-08 14:44           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 22+ messages in thread
From: Zhang, Qi Z @ 2021-07-08 14:26 UTC (permalink / raw)
  To: Lance Richardson
  Cc: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli,
	Richardson, Bruce, Zhang, Helin, dev, stable, nd



> -----Original Message-----
> From: Lance Richardson <lance.richardson@broadcom.com>
> Sent: Thursday, July 8, 2021 9:51 PM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: Joyce Kong <joyce.kong@arm.com>; Xing, Beilei <beilei.xing@intel.com>;
> ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> dev@dpdk.org; stable@dpdk.org; nd@arm.com
> Subject: Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with
> thread fence
> 
> On Thu, Jul 8, 2021 at 8:09 AM Zhang, Qi Z <qi.z.zhang@intel.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Joyce Kong <joyce.kong@arm.com>
> > > Sent: Tuesday, July 6, 2021 2:54 PM
> > > To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > > ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> > > Cc: dev@dpdk.org; stable@dpdk.org; nd@arm.com
> > > Subject: [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
> > >
> > > Simply replace the SMP barrier with atomic thread fence for i40e hw ring sacn,
> > > if there is no synchronization point.
> > >
> > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > index 9aaabfd92..86e2f083e 100644
> > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > @@ -482,7 +482,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > >                                       I40E_RXD_QW1_STATUS_SHIFT;
> > >               }
> > >
> > > -             rte_smp_rmb();
> > > +             /* This barrier is to order loads of different words in the descriptor */
> > > +             rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> >
> > Now for x86, you actually replace a compiler barrier with a memory fence,
> > this may have potential performance impact which need additional resource to
> > investigate
> 
> No memory fence instruction is generated for
> __ATOMIC_ACQUIRE on x86 for any version of gcc
> or clang that I've tried, based on experiments here:
> 
>     https://godbolt.org/z/Yxr1vGhKP

Nice tool!
I tried writing some dummy code with and without __atomic_thread_fence(__ATOMIC_ACQUIRE),
but I didn't see any difference in the generated assembly code. Does that mean __atomic_thread_fence(__ATOMIC_ACQUIRE) does nothing on x86?



* Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-08 14:26         ` Zhang, Qi Z
@ 2021-07-08 14:44           ` Honnappa Nagarahalli
  0 siblings, 0 replies; 22+ messages in thread
From: Honnappa Nagarahalli @ 2021-07-08 14:44 UTC (permalink / raw)
  To: Zhang, Qi Z, Lance Richardson
  Cc: Joyce Kong, Xing, Beilei, Ruifeng Wang, Richardson, Bruce, Zhang,
	Helin, dev, stable, nd, Honnappa Nagarahalli, nd

<snip>

> > > >
> > > > Simply replace the SMP barrier with atomic thread fence for i40e
> > > > hw ring sacn, if there is no synchronization point.
> > > >
> > > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > index 9aaabfd92..86e2f083e 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > @@ -482,7 +482,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > >                                       I40E_RXD_QW1_STATUS_SHIFT;
> > > >               }
> > > >
> > > > -             rte_smp_rmb();
> > > > +             /* This barrier is to order loads of different words in the descriptor */
> > > > +             rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> > >
> > > Now for x86, you actually replace a compiler barrier with a memory
> > > fence, this may have potential performance impact which need additional
> > > resource to investigate
> >
> > No memory fence instruction is generated for __ATOMIC_ACQUIRE on x86
> > for any version of gcc or clang that I've tried, based on experiments
> > here:
> >
> >     https://godbolt.org/z/Yxr1vGhKP
> 
> Nice tool!
> I try to write some dummy code combined with or without
> __atomic_thread_fence(__ATOMIC_ACQUIRE)
> but I didn't see any difference of the generated assembly code, does that means
> __atomic_thread_fence(__ATOMIC_ACQUIRE) just does nothing on x86?
Yes, no barrier instructions should be generated for x86. At the same time, it still acts as a compiler barrier.
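
For anyone who wants to repeat the comparison, here is a small pair of functions that could be pasted into a compiler explorer. This is only an illustrative sketch, not the code behind the link above; the function names are hypothetical.

```c
#include <stdint.h>

/* Two plain loads with no ordering constraint between them. */
uint32_t load_pair_plain(const uint32_t *a, const uint32_t *b)
{
	uint32_t x = *a;
	uint32_t y = *b;

	return x + y;
}

/* Same loads, but the second may not be moved ahead of the first. */
uint32_t load_pair_acquire(const uint32_t *a, const uint32_t *b)
{
	uint32_t x = *a;

	/* Acquire fence: orders the load above before the load below. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);

	uint32_t y = *b;

	return x + y;
}
```

On x86-64 the two functions are expected to compile to the same instructions, since the fence only constrains the compiler there. On AArch64 the second function is expected to contain a load barrier (commonly `dmb ishld`) between the two loads.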



* Re: [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
@ 2021-07-09  3:05     ` Zhang, Qi Z
  0 siblings, 0 replies; 22+ messages in thread
From: Zhang, Qi Z @ 2021-07-09  3:05 UTC (permalink / raw)
  To: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli,
	Richardson, Bruce, Zhang, Helin
  Cc: dev, stable, nd



> -----Original Message-----
> From: Joyce Kong <joyce.kong@arm.com>
> Sent: Tuesday, July 6, 2021 2:54 PM
> To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Cc: dev@dpdk.org; stable@dpdk.org; nd@arm.com
> Subject: [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for
> Arm
> 
> For Arm platforms, reading descs can get re-ordered, then the status of DD
> bits will be discontinuous, so add the logic to only process continuous descs by
> checking DD bits.
> 
> Fixes: 4861cde46116 ("i40e: new poll mode driver")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Applied to dpdk-next-net-intel.

Thanks
Qi



* Re: [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
  2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence Joyce Kong
  2021-07-08 12:09     ` Zhang, Qi Z
@ 2021-07-13  0:46     ` Zhang, Qi Z
  1 sibling, 0 replies; 22+ messages in thread
From: Zhang, Qi Z @ 2021-07-13  0:46 UTC (permalink / raw)
  To: Joyce Kong, Xing, Beilei, ruifeng.wang, honnappa.nagarahalli,
	Richardson, Bruce, Zhang, Helin
  Cc: dev, stable, nd



> -----Original Message-----
> From: Joyce Kong <joyce.kong@arm.com>
> Sent: Tuesday, July 6, 2021 2:54 PM
> To: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ruifeng.wang@arm.com; honnappa.nagarahalli@arm.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Cc: dev@dpdk.org; stable@dpdk.org; nd@arm.com
> Subject: [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence
> 
> Simply replace the SMP barrier with an atomic thread fence for the i40e hw
> ring scan, if there is no synchronization point.
> 
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Qi Zhang <qi.z.zhang@intel.com>

Applied to dpdk-next-net-intel.

Thanks
Qi
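
As an illustration of the barrier-to-fence change discussed in this patch, the sketch below shows the general pattern in plain C11: read the status words, issue an acquire fence, then read the rest of each descriptor. It is a sketch under stated assumptions, not the applied patch. struct demo_desc and demo_scan() are hypothetical, the DD bit is assumed to be bit 0, and atomic_thread_fence() from <stdatomic.h> stands in for the DPDK wrapper (in DPDK this would presumably be rte_atomic_thread_fence() in place of rte_smp_rmb()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_LOOK_AHEAD 8	/* hypothetical look-ahead window size */

/* Hypothetical descriptor: a status word plus some payload. */
struct demo_desc {
	_Atomic uint64_t qword1;	/* bit 0 is assumed to be the DD bit */
	uint64_t payload;
};

/*
 * Scan up to n descriptors. Copies the payload of each contiguous
 * completed descriptor into out[] and returns how many were copied.
 */
static int
demo_scan(struct demo_desc *ring, int n, uint64_t *out)
{
	uint64_t status[DEMO_LOOK_AHEAD];
	int j, nb_dd = 0;

	assert(n <= DEMO_LOOK_AHEAD);

	/* Read the status words first, with relaxed ordering. */
	for (j = 0; j < n; j++)
		status[j] = atomic_load_explicit(&ring[j].qword1,
						 memory_order_relaxed);

	/*
	 * Acquire fence: the payload reads below cannot be reordered
	 * before the status reads above.
	 */
	atomic_thread_fence(memory_order_acquire);

	/* Count only the contiguous run of completed descriptors. */
	for (j = 0; j < n; j++) {
		if (!(status[j] & 1))
			break;
		nb_dd++;
	}

	/* Only now read the rest of each completed descriptor. */
	for (j = 0; j < nb_dd; j++)
		out[j] = ring[j].payload;

	return nb_dd;
}
```

The fence expresses the ordering requirement in terms of the C11 memory model instead of a platform-specific SMP barrier, which lets the compiler pick the cheapest instruction that satisfies it on each architecture.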


Thread overview: 22+ messages
2021-06-04  7:34 [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func Joyce Kong
2021-06-04 16:12 ` Honnappa Nagarahalli
2021-06-06 14:17 ` Zhang, Qi Z
2021-06-06 18:33   ` Honnappa Nagarahalli
2021-06-07 14:55     ` Zhang, Qi Z
2021-06-07 21:36       ` Honnappa Nagarahalli
2021-06-15  6:30         ` Joyce Kong
2021-06-16 13:29         ` Zhang, Qi Z
2021-06-16 13:37           ` Bruce Richardson
2021-06-16 20:26             ` Honnappa Nagarahalli
2021-06-23  8:43 ` [dpdk-dev] [PATCH v2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
2021-06-30  1:14   ` Honnappa Nagarahalli
2021-07-05  3:41     ` Joyce Kong
2021-07-06  6:54 ` [dpdk-dev] [PATCH v3 0/2] fixes for i40e hw scan ring Joyce Kong
2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 1/2] net/i40e: add logic of processing continuous DD bits for Arm Joyce Kong
2021-07-09  3:05     ` Zhang, Qi Z
2021-07-06  6:54   ` [dpdk-dev] [PATCH v3 2/2] net/i40e: replace SMP barrier with thread fence Joyce Kong
2021-07-08 12:09     ` Zhang, Qi Z
2021-07-08 13:51       ` Lance Richardson
2021-07-08 14:26         ` Zhang, Qi Z
2021-07-08 14:44           ` Honnappa Nagarahalli
2021-07-13  0:46     ` Zhang, Qi Z
