All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
@ 2019-07-01 16:39 Gerd Rausch
  2019-07-01 18:27 ` David Miller
  0 siblings, 1 reply; 5+ messages in thread
From: Gerd Rausch @ 2019-07-01 16:39 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev; +Cc: David Miller

In the context of FRMR (ib_frmr.c):

Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool",
after the memory region has been posted for invalidation via
"rds_ib_post_inv".

At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE",
since the only place where "fr_state" transitions to "FRMR_IS_FREE"
is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet.

So in case we notice that "fr_state != FRMR_IS_FREE" (see below),
we wait for "fr_inv_done" to trigger with a maximum of 10msec.
Then we check again, and only put the memory region onto the drop_list
(via "rds_ib_free_frmr") in case the situation remains unchanged.

This avoids the problem of memory-regions bouncing between "clean_list"
and "drop_list" before they even have a chance to be properly invalidated.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_frmr.c | 32 +++++++++++++++++++++++++++++++-
 net/rds/ib_mr.h   |  1 +
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 32ae26ed58a0..9f8aa310c27a 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -75,6 +75,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 		pool->max_items_soft = pool->max_items;
 
 	frmr->fr_state = FRMR_IS_FREE;
+	init_waitqueue_head(&frmr->fr_inv_done);
 	return ibmr;
 
 out_no_cigar:
@@ -285,6 +286,7 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 	if (frmr->fr_inv) {
 		frmr->fr_state = FRMR_IS_FREE;
 		frmr->fr_inv = false;
+		wake_up(&frmr->fr_inv_done);
 	}
 
 	atomic_inc(&ic->i_fastreg_wrs);
@@ -345,8 +347,36 @@ struct rds_ib_mr *rds_ib_reg_frmr(struct rds_ib_device *rds_ibdev,
 	}
 
 	do {
-		if (ibmr)
+		if (ibmr) {
+			/* Memory regions make it onto the "clean_list" via
+			 * "rds_ib_flush_mr_pool", after the memory region has
+			 * been posted for invalidation via "rds_ib_post_inv".
+			 *
+			 * At that point in time, "fr_state" may still be
+			 * in state "FRMR_IS_INUSE", since the only place where
+			 * "fr_state" transitions to "FRMR_IS_FREE" is in
+			 * is in "rds_ib_mr_cqe_handler", which is
+			 * triggered by a tasklet.
+			 *
+			 * So in case we notice that
+			 * "fr_state != FRMR_IS_FREE" (see below), * we wait for
+			 * "fr_inv_done" to trigger with a maximum of 10msec.
+			 * Then we check again, and only put the memory region
+			 * onto the drop_list (via "rds_ib_free_frmr")
+			 * in case the situation remains unchanged.
+			 *
+			 * This avoids the problem of memory-regions bouncing
+			 * between "clean_list" and "drop_list" before they
+			 * even have a chance to be properly invalidated.
+			 */
+			frmr = &ibmr->u.frmr;
+			wait_event_timeout(frmr->fr_inv_done,
+					   frmr->fr_state == FRMR_IS_FREE,
+					   msecs_to_jiffies(10));
+			if (frmr->fr_state == FRMR_IS_FREE)
+				break;
 			rds_ib_free_frmr(ibmr, true);
+		}
 		ibmr = rds_ib_alloc_frmr(rds_ibdev, nents);
 		if (IS_ERR(ibmr))
 			return ibmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 5da12c248431..42daccb7b5eb 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -57,6 +57,7 @@ struct rds_ib_frmr {
 	struct ib_mr		*mr;
 	enum rds_ib_fr_state	fr_state;
 	bool			fr_inv;
+	wait_queue_head_t	fr_inv_done;
 	struct ib_send_wr	fr_wr;
 	unsigned int		dma_npages;
 	unsigned int		sg_byte_len;
-- 
2.18.0



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  2019-07-01 16:39 [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE Gerd Rausch
@ 2019-07-01 18:27 ` David Miller
  2019-07-01 20:50   ` Gerd Rausch
  0 siblings, 1 reply; 5+ messages in thread
From: David Miller @ 2019-07-01 18:27 UTC (permalink / raw)
  To: gerd.rausch; +Cc: santosh.shilimkar, netdev

From: Gerd Rausch <gerd.rausch@oracle.com>
Date: Mon, 1 Jul 2019 09:39:44 -0700

> +			/* Memory regions make it onto the "clean_list" via
> +			 * "rds_ib_flush_mr_pool", after the memory region has
> +			 * been posted for invalidation via "rds_ib_post_inv".
> +			 *
> +			 * At that point in time, "fr_state" may still be
> +			 * in state "FRMR_IS_INUSE", since the only place where
> +			 * "fr_state" transitions to "FRMR_IS_FREE" is in
> +			 * is in "rds_ib_mr_cqe_handler", which is
> +			 * triggered by a tasklet.
> +			 *
> +			 * So in case we notice that
> +			 * "fr_state != FRMR_IS_FREE" (see below), * we wait for
> +			 * "fr_inv_done" to trigger with a maximum of 10msec.
> +			 * Then we check again, and only put the memory region
> +			 * onto the drop_list (via "rds_ib_free_frmr")
> +			 * in case the situation remains unchanged.
> +			 *
> +			 * This avoids the problem of memory-regions bouncing
> +			 * between "clean_list" and "drop_list" before they
> +			 * even have a chance to be properly invalidated.
> +			 */
> +			frmr = &ibmr->u.frmr;
> +			wait_event_timeout(frmr->fr_inv_done,
> +					   frmr->fr_state == FRMR_IS_FREE,
> +					   msecs_to_jiffies(10));
> +			if (frmr->fr_state == FRMR_IS_FREE)
> +				break;

If we see FRMR_IS_FREE after the timeout, what cleans this up?

Also, why 10msec?  Why that specific value and not some other value?  Why
not wait for however long it takes for the tasklet to run and clean it up?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  2019-07-01 18:27 ` David Miller
@ 2019-07-01 20:50   ` Gerd Rausch
  2019-07-01 20:53     ` santosh.shilimkar
  0 siblings, 1 reply; 5+ messages in thread
From: Gerd Rausch @ 2019-07-01 20:50 UTC (permalink / raw)
  To: David Miller; +Cc: santosh.shilimkar, netdev

Hi David,

On 01/07/2019 11.27, David Miller wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
> Date: Mon, 1 Jul 2019 09:39:44 -0700
> 
>> +			/* Memory regions make it onto the "clean_list" via
>> +			 * "rds_ib_flush_mr_pool", after the memory region has
>> +			 * been posted for invalidation via "rds_ib_post_inv".
>> +			 *
>> +			 * At that point in time, "fr_state" may still be
>> +			 * in state "FRMR_IS_INUSE", since the only place where
>> +			 * "fr_state" transitions to "FRMR_IS_FREE" is in
>> +			 * is in "rds_ib_mr_cqe_handler", which is
>> +			 * triggered by a tasklet.
>> +			 *
>> +			 * So in case we notice that
>> +			 * "fr_state != FRMR_IS_FREE" (see below), * we wait for
>> +			 * "fr_inv_done" to trigger with a maximum of 10msec.
>> +			 * Then we check again, and only put the memory region
>> +			 * onto the drop_list (via "rds_ib_free_frmr")
>> +			 * in case the situation remains unchanged.
>> +			 *
>> +			 * This avoids the problem of memory-regions bouncing
>> +			 * between "clean_list" and "drop_list" before they
>> +			 * even have a chance to be properly invalidated.
>> +			 */
>> +			frmr = &ibmr->u.frmr;
>> +			wait_event_timeout(frmr->fr_inv_done,
>> +					   frmr->fr_state == FRMR_IS_FREE,
>> +					   msecs_to_jiffies(10));
>> +			if (frmr->fr_state == FRMR_IS_FREE)
>> +				break;
> 
> If we see FRMR_IS_FREE after the timeout, what cleans this up?
> 

In that case, the memory-region is subjected to the
"rds_ib_free_frmr(ibmr, true)" call that follows:
In essence making it onto the "drop_list".

It's the same as if it wouldn't transition to FRMR_IS_FREE at all.
In both cases, the memory region should get dropped, and the application
would have been penalized by an extra 10msec wait-time (for having tried to invalidate it).

> Also, why 10msec?

It's empirical.
I had added some debugging code (not part of this submission) that traced
the return value of "wait_event_timeout" in order to see the out-lier in terms
of processing the "IB_WR_LOCAL_INV" request.

On my test-systems the majority of requests were done in less than 1msec.
I saw an outlier at almost 2msec once.
So I gave it an extra order-of-magnitude multiplier for extra buffer / paranoia.

> Why that specific value and not some other value?

I looked around to find what Mellanox or any other reference material had
so say about the expected turn--around time of an "IB_WR_LOCAL_INV" ought to be.
I wasn't able to find any.

Please note that even if there was an upper-bound specified, such as minutes:
It wouldn't necessarily be a good idea to penalize an application by wait-times
up to one minute, if the alternative is to just put this memory region on a
drop-list and pick another one (which is suggested here).

> Why not wait for however long it takes for the tasklet to run and clean it up?

Two reasons I can think of:
1) The penalty of long wait-times would go to the application
2) If there were a firmware-bug, the "wait_event" would not terminate

Thanks,

  Gerd


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  2019-07-01 20:50   ` Gerd Rausch
@ 2019-07-01 20:53     ` santosh.shilimkar
  2019-07-01 20:58       ` Gerd Rausch
  0 siblings, 1 reply; 5+ messages in thread
From: santosh.shilimkar @ 2019-07-01 20:53 UTC (permalink / raw)
  To: Gerd Rausch, David Miller; +Cc: netdev

On 7/1/19 1:50 PM, Gerd Rausch wrote:
> Hi David,
> 
> On 01/07/2019 11.27, David Miller wrote:
>> From: Gerd Rausch <gerd.rausch@oracle.com>
>> Date: Mon, 1 Jul 2019 09:39:44 -0700
>>
>>> +			/* Memory regions make it onto the "clean_list" via
>>> +			 * "rds_ib_flush_mr_pool", after the memory region has
>>> +			 * been posted for invalidation via "rds_ib_post_inv".
>>> +			 *
>>> +			 * At that point in time, "fr_state" may still be
>>> +			 * in state "FRMR_IS_INUSE", since the only place where
>>> +			 * "fr_state" transitions to "FRMR_IS_FREE" is in
>>> +			 * is in "rds_ib_mr_cqe_handler", which is
>>> +			 * triggered by a tasklet.
>>> +			 *
>>> +			 * So in case we notice that
>>> +			 * "fr_state != FRMR_IS_FREE" (see below), * we wait for
>>> +			 * "fr_inv_done" to trigger with a maximum of 10msec.
>>> +			 * Then we check again, and only put the memory region
>>> +			 * onto the drop_list (via "rds_ib_free_frmr")
>>> +			 * in case the situation remains unchanged.
>>> +			 *
>>> +			 * This avoids the problem of memory-regions bouncing
>>> +			 * between "clean_list" and "drop_list" before they
>>> +			 * even have a chance to be properly invalidated.
>>> +			 */
>>> +			frmr = &ibmr->u.frmr;
>>> +			wait_event_timeout(frmr->fr_inv_done,
>>> +					   frmr->fr_state == FRMR_IS_FREE,
>>> +					   msecs_to_jiffies(10));
>>> +			if (frmr->fr_state == FRMR_IS_FREE)
>>> +				break;
>>
>> If we see FRMR_IS_FREE after the timeout, what cleans this up?
>>
> 
> In that case, the memory-region is subjected to the
> "rds_ib_free_frmr(ibmr, true)" call that follows:
> In essence making it onto the "drop_list".
> 
> It's the same as if it wouldn't transition to FRMR_IS_FREE at all.
> In both cases, the memory region should get dropped, and the application
> would have been penalized by an extra 10msec wait-time (for having tried to invalidate it).
> 
>> Also, why 10msec?
> 
> It's empirical.
> I had added some debugging code (not part of this submission) that traced
> the return value of "wait_event_timeout" in order to see the out-lier in terms
> of processing the "IB_WR_LOCAL_INV" request.
> 
> On my test-systems the majority of requests were done in less than 1msec.
> I saw an outlier at almost 2msec once.
> So I gave it an extra order-of-magnitude multiplier for extra buffer / paranoia.
> 
>> Why that specific value and not some other value?
> 
> I looked around to find what Mellanox or any other reference material had
> so say about the expected turn--around time of an "IB_WR_LOCAL_INV" ought to be.
> I wasn't able to find any.
> 
LOCAL_INV/REG etc are all end being HCA commands and the command 
timeouts are large. 60 seconds is what CX3 HCA has for example.
Thats the worst case timeout from HCA before marking the command
to be timeout and hence the operation to be failed.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  2019-07-01 20:53     ` santosh.shilimkar
@ 2019-07-01 20:58       ` Gerd Rausch
  0 siblings, 0 replies; 5+ messages in thread
From: Gerd Rausch @ 2019-07-01 20:58 UTC (permalink / raw)
  To: santosh.shilimkar, David Miller; +Cc: netdev

Hi Santosh,

On 01/07/2019 13.53, santosh.shilimkar@oracle.com wrote:
> LOCAL_INV/REG etc are all end being HCA commands and the command timeouts are large. 60 seconds is what CX3 HCA has for example.
> Thats the worst case timeout from HCA before marking the command
> to be timeout and hence the operation to be failed.
> 

It's a tradeoff between waiting for 60 seconds or just putting that memory-region on a drop_list.
IMO, penalizing an application with an up-to 60 seconds wait time is not necessarily the better of the 2 options.

Thanks,

  Gerd

PS: If you've got a pointer to the 60 seconds CX3 HCA timeout, please share.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-07-01 20:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-01 16:39 [PATCH net-next 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE Gerd Rausch
2019-07-01 18:27 ` David Miller
2019-07-01 20:50   ` Gerd Rausch
2019-07-01 20:53     ` santosh.shilimkar
2019-07-01 20:58       ` Gerd Rausch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.