* [PATCH] net/mlx5: poll completion queue once per a call
@ 2017-07-20 15:48 Yongseok Koh
  2017-07-20 16:34 ` Sagi Grimberg
  2017-07-31 16:12 ` Ferruh Yigit
  0 siblings, 2 replies; 8+ messages in thread
From: Yongseok Koh @ 2017-07-20 15:48 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro; +Cc: dev, Yongseok Koh

mlx5_tx_complete() polls the completion queue multiple times until it
encounters an invalid entry. As Tx completions are suppressed by
MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
in a poll, and freeing too many buffers in a single call can cause high jitter.
This patch improves throughput a little.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 534aaeb46..7fd59a4b1 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -480,30 +480,18 @@ mlx5_tx_complete(struct txq *txq)
 	struct rte_mempool *pool = NULL;
 	unsigned int blk_n = 0;
 
-	do {
-		volatile struct mlx5_cqe *tmp;
-
-		tmp = &(*txq->cqes)[cq_ci & cqe_cnt];
-		if (check_cqe(tmp, cqe_n, cq_ci))
-			break;
-		cqe = tmp;
+	cqe = &(*txq->cqes)[cq_ci & cqe_cnt];
+	if (unlikely(check_cqe(cqe, cqe_n, cq_ci)))
+		return;
 #ifndef NDEBUG
-		if (MLX5_CQE_FORMAT(cqe->op_own) == MLX5_COMPRESSED) {
-			if (!check_cqe_seen(cqe))
-				ERROR("unexpected compressed CQE, TX stopped");
-			return;
-		}
-		if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
-		    (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
-			if (!check_cqe_seen(cqe))
-				ERROR("unexpected error CQE, TX stopped");
-			return;
-		}
-#endif /* NDEBUG */
-		++cq_ci;
-	} while (1);
-	if (unlikely(cqe == NULL))
+	if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
+	    (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
+		if (!check_cqe_seen(cqe))
+			ERROR("unexpected error CQE, TX stopped");
 		return;
+	}
+#endif /* NDEBUG */
+	++cq_ci;
 	txq->wqe_pi = ntohs(cqe->wqe_counter);
 	ctrl = (volatile struct mlx5_wqe_ctrl *)
 		tx_mlx5_wqe(txq, txq->wqe_pi);
-- 
2.11.0


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-20 15:48 [PATCH] net/mlx5: poll completion queue once per a call Yongseok Koh
@ 2017-07-20 16:34 ` Sagi Grimberg
  2017-07-21 15:10   ` Yongseok Koh
  2017-07-31 16:12 ` Ferruh Yigit
  1 sibling, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2017-07-20 16:34 UTC (permalink / raw)
  To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro; +Cc: dev


> mlx5_tx_complete() polls the completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> in a poll, and freeing too many buffers in a single call can cause high jitter.
> This patch improves throughput a little.

What if the device generates a burst of completions? Holding these
completions un-reaped can theoretically cause resource stress on
the corresponding mempool(s).

I totally get the need for a stopping condition, but is "loop once"
the best stop condition?

Perhaps an adaptive budget (based on online stats) would perform better?


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-20 16:34 ` Sagi Grimberg
@ 2017-07-21 15:10   ` Yongseok Koh
  2017-07-23  9:49     ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Yongseok Koh @ 2017-07-21 15:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: adrien.mazarguil, nelio.laranjeiro, dev

On Thu, Jul 20, 2017 at 07:34:04PM +0300, Sagi Grimberg wrote:
> 
> > mlx5_tx_complete() polls the completion queue multiple times until it
> > encounters an invalid entry. As Tx completions are suppressed by
> > MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> > in a poll, and freeing too many buffers in a single call can cause high jitter.
> > This patch improves throughput a little.
> 
> What if the device generates a burst of completions?
The mlx5 PMD suppresses completions anyway. It requests a completion for every
MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the completion
queue stays much smaller.
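
For illustration, a minimal sketch of that suppression scheme (hypothetical
names: COMP_THRESH and sketch_txq stand in for MLX5_TX_COMP_THRESH and the real
txq structure; this is not the actual mlx5 datapath code):

#include <stdbool.h>
#include <stdint.h>

#define COMP_THRESH 32

struct sketch_txq {
	uint16_t elts_comp; /* packets posted since the last requested CQE */
};

/* Called once at the end of a Tx burst that posted 'pkts' packets. */
static inline bool
request_completion(struct sketch_txq *txq, uint16_t pkts)
{
	txq->elts_comp += pkts;
	if (txq->elts_comp < COMP_THRESH)
		return false;   /* suppress: no CQE requested for this burst */
	txq->elts_comp = 0;     /* request a CQE and restart the count */
	return true;
}

With at most one CQE requested per COMP_THRESH packets (and at most one per
burst), a single poll in mlx5_tx_complete() normally covers everything that is
pending.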

> Holding these completions un-reaped can theoretically cause resource stress on
> the corresponding mempool(s).
Can you make your point clearer? Do you think the "stress" can impact
performance? I think the stress doesn't matter unless the mempool is depleted.
And the app is responsible for supplying enough mbufs considering the depth of
all queues (the max # of outstanding mbufs).

> I totally get the need for a stopping condition, but is "loop once"
> the best stop condition?
Best for what?

> Perhaps an adaptive budget (based on online stats) would perform better?
Please bring up any suggestion or submit a patch if you have one. Does "budget"
mean the threshold? If so, calculating stats for an adaptive threshold can impact
single-core performance. With multiple cores, adjusting the threshold doesn't make
much difference.

Thanks,
Yongseok


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-21 15:10   ` Yongseok Koh
@ 2017-07-23  9:49     ` Sagi Grimberg
  2017-07-25  7:43       ` Yongseok Koh
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2017-07-23  9:49 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: adrien.mazarguil, nelio.laranjeiro, dev

>>> mlx5_tx_complete() polls the completion queue multiple times until it
>>> encounters an invalid entry. As Tx completions are suppressed by
>>> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
>>> in a poll, and freeing too many buffers in a single call can cause high jitter.
>>> This patch improves throughput a little.
>>
>> What if the device generates a burst of completions?
> The mlx5 PMD suppresses completions anyway. It requests a completion for every
> MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the completion
> queue stays much smaller.

Yes I realize that, but can't the device still complete in a burst (of
unsuppressed completions)? I mean it's not guaranteed that for every
txq_complete a signaled completion is pending, right? What happens if
the device has inconsistent completion pacing? Can't the sw grow a
batch of completions if txq_complete will process a single completion
unconditionally?

>> Holding these completions un-reaped can theoretically cause resource stress on
>> the corresponding mempool(s).
> Can you make your point clearer? Do you think the "stress" can impact
> performance? I think the stress doesn't matter unless the mempool is depleted.
> And the app is responsible for supplying enough mbufs considering the depth of
> all queues (the max # of outstanding mbufs).

I might be missing something, but # of outstanding mbufs should be
relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs, right?
Why should the pool account for the entire TX queue depth (which can
be very large)?

Is there a hard requirement documented somewhere that the application
needs to account for the entire TX queue depths for sizing its mbuf
pool?

My question is: with the proposed change, doesn't this mean that the
application might need to allocate a bigger TX mbuf pool? Because the
pmd can theoretically consume completions slower (as in multiple TX
burst calls)?

>> I totally get the need for a stopping condition, but is "loop once"
>> the best stop condition?
> Best for what?

Best condition to stop consuming TX completions. As I said, I think that
leaving TX completions un-reaped can (at least in theory) slow down the
mbuf reclamation, which impacts the application. (unless I'm not
understanding something fundamental)

>> Perhaps an adaptive budget (based on online stats) would perform better?
> Please bring up any suggestion or submit a patch if you have one.

I was simply providing a review for the patch. I don't have the time
to come up with a better patch, unfortunately, but I still think it's
fair to raise a point.

> Does "budget" mean the
> threshold? If so, calculation of stats for adaptive threshold can impact single
> core performance. With multiple cores, adjusting threshold doesn't affect much.

If you look at the mlx5e driver in the kernel, it maintains online stats on
its RX and TX queues. It maintains these stats mostly for adaptive
interrupt moderation control (but not only).

I was suggesting maintaining per TX queue stats on average completions
consumed for each TX burst call, and adjusting the stopping condition
according to a calculated stat.
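
As a rough sketch of that idea (hypothetical names and structure -- this is not
an existing DPDK or mlx5 API, just an assumption of how it could look): keep a
fixed-point moving average of CQEs consumed per tx_burst call and use it as the
per-call polling budget instead of a hard-coded "one" or "unbounded".

#include <stdint.h>

struct cq_poll_stats {
	uint32_t avg_x8; /* EWMA of CQEs consumed per call, scaled by 8 */
};

/* Budget for the next call: the running average, but at least one CQE. */
static inline uint16_t
cq_poll_budget(const struct cq_poll_stats *st)
{
	uint16_t budget = st->avg_x8 / 8;

	return budget ? budget : 1;
}

/* Fold the CQEs consumed by this call into the average:
 * new_avg = old_avg + (consumed - old_avg) / 8, kept in x8 fixed point. */
static inline void
cq_poll_update(struct cq_poll_stats *st, uint16_t consumed)
{
	st->avg_x8 = st->avg_x8 - st->avg_x8 / 8 + consumed;
}

mlx5_tx_complete() would then poll up to cq_poll_budget() CQEs per call and
feed the number it actually consumed back through cq_poll_update().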


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-23  9:49     ` Sagi Grimberg
@ 2017-07-25  7:43       ` Yongseok Koh
  2017-07-27 11:12         ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Yongseok Koh @ 2017-07-25  7:43 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: adrien.mazarguil, nelio.laranjeiro, dev

On Sun, Jul 23, 2017 at 12:49:36PM +0300, Sagi Grimberg wrote:
> > > > mlx5_tx_complete() polls the completion queue multiple times until it
> > > > encounters an invalid entry. As Tx completions are suppressed by
> > > > MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> > > > in a poll, and freeing too many buffers in a single call can cause high jitter.
> > > > This patch improves throughput a little.
> > > 
> > > What if the device generates a burst of completions?
> > The mlx5 PMD suppresses completions anyway. It requests a completion for every
> > MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the completion
> > queue stays much smaller.
> 
> Yes I realize that, but can't the device still complete in a burst (of
> unsuppressed completions)? I mean it's not guaranteed that for every
> txq_complete a signaled completion is pending, right? What happens if
> the device has inconsistent completion pacing? Can't the sw grow a
> batch of completions if txq_complete will process a single completion
> unconditionally?
Speculation. First of all, the device doesn't delay completion notifications for no
reason. An ASIC is not SW running on top of an OS. If a completion comes up late,
it means the device really can't keep up with the rate of posting descriptors. If so,
tx_burst() should generate back-pressure by returning partial Tx, then the app can
make a decision between drop and retry. Retry on Tx means back-pressuring the Rx
side if the app is forwarding packets.

A more serious problem I expected was the case where the THRESH is smaller than
the burst size. In that case, txq->elts[] will be short of slots all the time. But
fortunately, in the MLX PMD, we request at most one completion per burst, not
one for every THRESH packets.

If there's some SW jitter on Tx processing, the Tx CQ can grow for sure. The
question I asked myself was "when does it shrink?". It shrinks when the Tx burst
is light (burst size is smaller than THRESH) because mlx5_tx_complete() is always
called every time tx_burst() is called. What if it keeps growing? Then dropping is
necessary and natural, like I mentioned above.

It doesn't make sense for SW to absorb every possible SW jitter; the cost is high.
That is usually done by increasing the queue depth. Keeping a steady state is more
important.

Rather, this patch is helpful for reducing jitter. When I run a profiler, the
most cycle-consuming part on Tx is still freeing buffers. If we allow looping on
checking for valid CQEs, many buffers could be freed in a single call of
mlx5_tx_complete() at some point, and that would cause a long delay. This would
aggravate jitter.

> > > Holding these completions un-reaped can theoretically cause resource stress on
> > > the corresponding mempool(s).
> > Can you make your point clearer? Do you think the "stress" can impact
> > performance? I think the stress doesn't matter unless the mempool is depleted.
> > And the app is responsible for supplying enough mbufs considering the depth of
> > all queues (the max # of outstanding mbufs).
> 
> I might be missing something, but # of outstanding mbufs should be
> relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs, right?
> Why should the pool account for the entire TX queue depth (which can
> be very large)?
The reason is simple for the Rx queue. If the number of mbufs in the provisioned
mempool is less than the rxq depth, the PMD can't even initialize the device
successfully. The PMD doesn't keep a private mempool. So it is nonsensical to
provision fewer mbufs than the queue depth even if this isn't documented. It is
obvious.

No mempool is assigned for Tx. In this case, the app isn't forced to prepare
enough mbufs to cover all the Tx queues, but the downside of that is significant
performance degradation. From the PMD's perspective, it just needs to avoid any
deadlock condition due to depletion. Even if freeing mbufs in bulk causes some
resource depletion on the app side, it is a fair trade-off for higher performance
as long as there's no deadlock. And as far as I can tell, most PMDs free mbufs
in bulk, not one by one, which is also good for cache locality.
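
A simplified sketch of that bulk-free pattern (the same idea of batching per
mempool, but not the exact mlx5 code):

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static void
free_completed_mbufs(struct rte_mbuf **elts, uint16_t n)
{
	struct rte_mempool *pool = NULL;
	void *blk[n ? n : 1];
	unsigned int blk_n = 0;
	uint16_t i;

	for (i = 0; i < n; i++) {
		/* Returns the mbuf if it can really be freed, NULL otherwise. */
		struct rte_mbuf *m = rte_pktmbuf_prefree_seg(elts[i]);

		if (m == NULL)
			continue;
		if (pool != m->pool) {
			/* Pool changed: flush the batch collected so far. */
			if (blk_n)
				rte_mempool_put_bulk(pool, blk, blk_n);
			pool = m->pool;
			blk_n = 0;
		}
		blk[blk_n++] = m;
	}
	if (blk_n)
		rte_mempool_put_bulk(pool, blk, blk_n);
}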

Anyway, there are many cases depending on the packet processing mode -
fwd/rxonly/txonly - but I won't explain all of them one by one.

> Is there a hard requirement documented somewhere that the application
> needs to account for the entire TX queue depths for sizing its mbuf
> pool?
If needed, we should document it, and that would be a good way for you to start
contributing to the DPDK community. But think about the definition of Tx queue
depth: doesn't it mean that a queue can hold that many descriptors? Then the app
should prepare more mbufs than the queue depth it configured. In my understanding,
there's no point in having fewer mbufs than the total number of queue entries. If
the resource is scarce, what's the point of having a larger queue depth? It should
use a smaller queue.
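
As a back-of-the-envelope example (the terms and names here are illustrative
assumptions, not a rule taken from mlx5 or DPDK documentation), the pool should
cover every descriptor the rings can hold plus whatever the cores keep in flight
and in their mempool caches:

#include <stdint.h>

static inline uint32_t
min_mbuf_pool_size(uint32_t nb_rxq, uint32_t nb_rxd,
		   uint32_t nb_txq, uint32_t nb_txd,
		   uint32_t nb_lcores, uint32_t burst, uint32_t cache)
{
	return nb_rxq * nb_rxd      /* mbufs parked in Rx rings */
	     + nb_txq * nb_txd      /* mbufs the Tx rings may still hold */
	     + nb_lcores * burst    /* mbufs in flight per core */
	     + nb_lcores * cache;   /* per-lcore mempool cache */
}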

> My question is: with the proposed change, doesn't this mean that the
> application might need to allocate a bigger TX mbuf pool? Because the
> pmd can theoretically consume completions slower (as in multiple TX
> burst calls)?
No. Explained above.

[...]
> > > Perhaps an adaptive budget (based on online stats) would perform better?
> > Please bring up any suggestion or submit a patch if you have one.
> 
> I was simply providing a review for the patch. I don't have the time
> to come up with a better patch, unfortunately, but I still think it's
> fair to raise a point.
Of course. I appreciate your time for the review. And keep in mind that nothing
is impossible in an open source community. I always like to discuss ideas
with anyone. But I was just asking to hear more details about your suggestion if
you wanted me to implement it, rather than being given a one-sentence question :-)

> > Does "budget" mean the
> > threshold? If so, calculation of stats for adaptive threshold can impact single
> > core performance. With multiple cores, adjusting threshold doesn't affect much.
> 
> If you look at the mlx5e driver in the kernel, it maintains online stats on
> its RX and TX queues. It maintains these stats mostly for adaptive
> interrupt moderation control (but not only).
> 
> I was suggesting maintaining per TX queue stats on average completions
> consumed for each TX burst call, and adjusting the stopping condition
> according to a calculated stat.
In the case of interrupt mitigation, it could be beneficial because interrupt
handling is so costly. But the beauty of DPDK is polling, isn't it?


And please remember to ack at the end of this discussion if you are okay with it,
so that this patch can get merged. One data point: single-core performance (fwd)
of the vectorized PMD improves by more than 6% with this patch, and 6% is never small.

Thanks for your review again.

Yongseok


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-25  7:43       ` Yongseok Koh
@ 2017-07-27 11:12         ` Sagi Grimberg
  2017-07-28  0:26           ` Yongseok Koh
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2017-07-27 11:12 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: adrien.mazarguil, nelio.laranjeiro, dev


>> Yes I realize that, but can't the device still complete in a burst (of
>> unsuppressed completions)? I mean it's not guaranteed that for every
>> txq_complete a signaled completion is pending, right? What happens if
>> the device has inconsistent completion pacing? Can't the sw grow a
>> batch of completions if txq_complete will process a single completion
>> unconditionally?
> Speculation. First of all, the device doesn't delay completion notifications for no
> reason. An ASIC is not SW running on top of an OS.

I'm sorry but this statement is not correct. It might be correct in a
lab environment, but in practice, there are lots of things that can
affect the device timing.

> If a completion comes up late,
> it means the device really can't keep up with the rate of posting descriptors. If so,
> tx_burst() should generate back-pressure by returning partial Tx, then the app can
> make a decision between drop and retry. Retry on Tx means back-pressuring the Rx
> side if the app is forwarding packets.

Not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".

> A more serious problem I expected was the case where the THRESH is smaller than
> the burst size. In that case, txq->elts[] will be short of slots all the time. But
> fortunately, in the MLX PMD, we request at most one completion per burst, not
> one for every THRESH packets.
> 
> If there's some SW jitter on Tx processing, the Tx CQ can grow for sure. The
> question I asked myself was "when does it shrink?". It shrinks when the Tx burst
> is light (burst size is smaller than THRESH) because mlx5_tx_complete() is always
> called every time tx_burst() is called. What if it keeps growing? Then dropping is
> necessary and natural, like I mentioned above.
> 
> It doesn't make sense for SW to absorb every possible SW jitter; the cost is high.
> That is usually done by increasing the queue depth. Keeping a steady state is more
> important.

Again, I agree jitters are bad, but with proper heuristics in place mlx5
can still keep a low jitter _and_ consume completions faster than
consecutive tx_burst invocations.

> Rather, this patch is helpful for reducing jitter. When I run a profiler, the
> most cycle-consuming part on Tx is still freeing buffers. If we allow looping on
> checking for valid CQEs, many buffers could be freed in a single call of
> mlx5_tx_complete() at some point, and that would cause a long delay. This would
> aggravate jitter.

I didn't dispute the fact that this patch addresses an issue, but mlx5 is
a driver that is meant to serve applications that can act differently
than your test case.

> Of course. I appreciate your time for the review. And keep in mind that nothing
> is impossible in an open source community. I always like to discuss ideas
> with anyone. But I was just asking to hear more details about your suggestion if
> you wanted me to implement it, rather than being given a one-sentence question :-)

Good to know.

>>> Does "budget" mean the
>>> threshold? If so, calculation of stats for adaptive threshold can impact single
>>> core performance. With multiple cores, adjusting threshold doesn't affect much.
>>
>> If you look at the mlx5e driver in the kernel, it maintains online stats on
>> its RX and TX queues. It maintains these stats mostly for adaptive
>> interrupt moderation control (but not only).
>>
>> I was suggesting maintaining per TX queue stats on average completions
>> consumed for each TX burst call, and adjusting the stopping condition
>> according to a calculated stat.
> In the case of interrupt mitigation, it could be beneficial because interrupt
> handling is so costly. But the beauty of DPDK is polling, isn't it?

If you read my comment again, I didn't suggest applying stats for
interrupt moderation; I just gave an example of a use-case. I was
suggesting maintaining online stats for adjusting a threshold of
how many completions to process in a tx burst call (instead of
processing one unconditionally).

> And please remember to ack at the end of this discussion if you are okay with it,
> so that this patch can get merged. One data point: single-core performance (fwd)
> of the vectorized PMD improves by more than 6% with this patch, and 6% is never small.

Yea, I don't mind merging it in given that I don't have time to come
up with anything better (or worse :))

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-27 11:12         ` Sagi Grimberg
@ 2017-07-28  0:26           ` Yongseok Koh
  0 siblings, 0 replies; 8+ messages in thread
From: Yongseok Koh @ 2017-07-28  0:26 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Adrien Mazarguil, Nélio Laranjeiro, dev


> On Jul 27, 2017, at 4:12 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> Yes I realize that, but can't the device still complete in a burst (of
>>> unsuppressed completions)? I mean it's not guaranteed that for every
>>> txq_complete a signaled completion is pending, right? What happens if
>>> the device has inconsistent completion pacing? Can't the sw grow a
>>> batch of completions if txq_complete will process a single completion
>>> unconditionally?
>> Speculation. First of all, the device doesn't delay completion notifications for no
>> reason. An ASIC is not SW running on top of an OS.
> 
> I'm sorry but this statement is not correct. It might be correct in a
> lab environment, but in practice, there are lots of things that can
> affect the device timing.
Disagree.

[...]
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Thanks for the ack!

Yongseok


* Re: [PATCH] net/mlx5: poll completion queue once per a call
  2017-07-20 15:48 [PATCH] net/mlx5: poll completion queue once per a call Yongseok Koh
  2017-07-20 16:34 ` Sagi Grimberg
@ 2017-07-31 16:12 ` Ferruh Yigit
  1 sibling, 0 replies; 8+ messages in thread
From: Ferruh Yigit @ 2017-07-31 16:12 UTC (permalink / raw)
  To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro; +Cc: dev

On 7/20/2017 4:48 PM, Yongseok Koh wrote:
> mlx5_tx_complete() polls the completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> in a poll, and freeing too many buffers in a single call can cause high jitter.
> This patch improves throughput a little.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Applied to dpdk-next-net/master, thanks.

