linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
@ 2022-01-18 14:47 Praveen Kumar Kannoju
  2022-01-18 16:48 ` Santosh Shilimkar
  0 siblings, 1 reply; 19+ messages in thread
From: Praveen Kumar Kannoju @ 2022-01-18 14:47 UTC (permalink / raw)
  To: santosh.shilimkar, davem, kuba, netdev, linux-rdma, rds-devel,
	linux-kernel
  Cc: rama.nichanamatlu, rajesh.sivaramasubramaniom, Praveen Kumar Kannoju

This patch aims to reduce the number of asynchronous workers being spawned
to execute the function "rds_ib_flush_mr_pool" during the high I/O
situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
will be executed without being disturbed. By reducing the number of
processes contending to flush the mr pool, the total number of D state
processes waiting to acquire the mutex lock will be greatly reduced, which
otherwise were causing DB instance crash as the corresponding processes
were not progressing while waiting to acquire the mutex lock.

Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
---
 net/rds/ib.h       |  1 +
 net/rds/ib_mr.h    |  2 ++
 net/rds/ib_rdma.c  | 18 ++++++++++++++++--
 net/rds/ib_stats.c |  1 +
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 2ba7110..d881e3f 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -308,6 +308,7 @@ struct rds_ib_statistics {
 	uint64_t	s_ib_rdma_mr_1m_pool_flush;
 	uint64_t	s_ib_rdma_mr_1m_pool_wait;
 	uint64_t	s_ib_rdma_mr_1m_pool_depleted;
+	uint64_t	s_ib_rdma_flush_mr_pool_avoided;
 	uint64_t	s_ib_rdma_mr_8k_reused;
 	uint64_t	s_ib_rdma_mr_1m_reused;
 	uint64_t	s_ib_atomic_cswp;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index ea5e9ae..9cbec6e 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -105,6 +105,8 @@ struct rds_ib_mr_pool {
 	unsigned long		max_items_soft;
 	unsigned long		max_free_pinned;
 	unsigned int		max_pages;
+
+	bool                    flush_ongoing;	/* To avoid redundant flushes */
 };
 
 extern struct workqueue_struct *rds_ib_mr_wq;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 8f070ee..6b640b5 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	 */
 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
+	WRITE_ONCE(pool->flush_ongoing, true);
+	smp_wmb();
 	if (free_all) {
 		unsigned long flags;
 
@@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	atomic_sub(nfreed, &pool->item_count);
 
 out:
+	WRITE_ONCE(pool->flush_ongoing, false);
+	smp_wmb();
 	mutex_unlock(&pool->flush_lock);
 	if (waitqueue_active(&pool->flush_wait))
 		wake_up(&pool->flush_wait);
@@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
 
 	/* If we've pinned too many pages, request a flush */
 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
-	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
-		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
+	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
+		smp_rmb();
+		if (!READ_ONCE(pool->flush_ongoing)) {
+			queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
+		} else {
+			/* This counter indicates the number of redundant
+			 * flush calls avoided, and provides an indication
+			 * of the load pattern imposed on kernel.
+			 */
+			rds_ib_stats_inc(s_ib_rdma_flush_mr_pool_avoided);
+		}
 
 	if (invalidate) {
 		if (likely(!in_interrupt())) {
@@ -670,6 +683,7 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev,
 
 	pool->max_free_pinned = pool->max_items * pool->max_pages / 4;
 	pool->max_items_soft = rds_ibdev->max_mrs * 3 / 4;
+	pool->flush_ongoing = false;
 
 	return pool;
 }
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index ac46d89..29ae5cb 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -75,6 +75,7 @@
 	"ib_rdma_mr_1m_pool_flush",
 	"ib_rdma_mr_1m_pool_wait",
 	"ib_rdma_mr_1m_pool_depleted",
+	"ib_rdma_flush_mr_pool_avoided",
 	"ib_rdma_mr_8k_reused",
 	"ib_rdma_mr_1m_reused",
 	"ib_atomic_cswp",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-18 14:47 [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool Praveen Kumar Kannoju
@ 2022-01-18 16:48 ` Santosh Shilimkar
  2022-01-18 18:00   ` Leon Romanovsky
  2022-01-18 19:17   ` Jason Gunthorpe
  0 siblings, 2 replies; 19+ messages in thread
From: Santosh Shilimkar @ 2022-01-18 16:48 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: David S . Miller, kuba, netdev, linux-rdma, rds-devel,
	linux-kernel, Rama Nichanamatlu, Rajesh Sivaramasubramaniom


> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> 
> This patch aims to reduce the number of asynchronous workers being spawned
> to execute the function "rds_ib_flush_mr_pool" during the high I/O
> situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> will be executed without being disturbed. By reducing the number of
> processes contending to flush the mr pool, the total number of D state
> processes waiting to acquire the mutex lock will be greatly reduced, which
> otherwise were causing DB instance crash as the corresponding processes
> were not progressing while waiting to acquire the mutex lock.
> 
> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
> —
> 
[…]

> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> index 8f070ee..6b640b5 100644
> --- a/net/rds/ib_rdma.c
> +++ b/net/rds/ib_rdma.c
> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> 	 */
> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> +	WRITE_ONCE(pool->flush_ongoing, true);
> +	smp_wmb();
> 	if (free_all) {
> 		unsigned long flags;
> 
> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> 	atomic_sub(nfreed, &pool->item_count);
> 
> out:
> +	WRITE_ONCE(pool->flush_ongoing, false);
> +	smp_wmb();
> 	mutex_unlock(&pool->flush_lock);
> 	if (waitqueue_active(&pool->flush_wait))
> 		wake_up(&pool->flush_wait);
> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> 
> 	/* If we've pinned too many pages, request a flush */
> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> +		smp_rmb();
You won’t need these explicit barriers since above atomic and write once already
issue them.

Other than that, it seems a good idea to me. With that fixed, feel free to add my
ack and repost.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-18 16:48 ` Santosh Shilimkar
@ 2022-01-18 18:00   ` Leon Romanovsky
  2022-01-18 19:17   ` Jason Gunthorpe
  1 sibling, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2022-01-18 18:00 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Praveen Kannoju, David S . Miller, kuba, netdev, linux-rdma,
	rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> 
> > On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> > 
> > This patch aims to reduce the number of asynchronous workers being spawned
> > to execute the function "rds_ib_flush_mr_pool" during the high I/O
> > situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> > will be executed without being disturbed. By reducing the number of
> > processes contending to flush the mr pool, the total number of D state
> > processes waiting to acquire the mutex lock will be greatly reduced, which
> > otherwise were causing DB instance crash as the corresponding processes
> > were not progressing while waiting to acquire the mutex lock.
> > 
> > Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
> > —
> > 
> […]
> 
> > diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> > index 8f070ee..6b640b5 100644
> > --- a/net/rds/ib_rdma.c
> > +++ b/net/rds/ib_rdma.c
> > @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > 	 */
> > 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> > +	WRITE_ONCE(pool->flush_ongoing, true);
> > +	smp_wmb();
> > 	if (free_all) {
> > 		unsigned long flags;
> > 
> > @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > 	atomic_sub(nfreed, &pool->item_count);
> > 
> > out:
> > +	WRITE_ONCE(pool->flush_ongoing, false);
> > +	smp_wmb();
> > 	mutex_unlock(&pool->flush_lock);
> > 	if (waitqueue_active(&pool->flush_wait))
> > 		wake_up(&pool->flush_wait);
> > @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> > 
> > 	/* If we've pinned too many pages, request a flush */
> > 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > +		smp_rmb();
> You won’t need these explicit barriers since above atomic and write once already
> issue them.
> 
> Other than that, it seems a good idea to me. With that fixed, feel free to add my
> ack and repost.

I disagree with the sentence that it is "good idea". This flush_ongoing
bool is racy, nothing prevents from changing it immediately after it was
read. The queue_delayed_work will anyway ensure that only one work is
running, so much more clean solution will be to check inside rds_ib_flush_mr_pool()
if you need to flush or not.

Thanks

> 
> Regards,
> Santosh

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-18 16:48 ` Santosh Shilimkar
  2022-01-18 18:00   ` Leon Romanovsky
@ 2022-01-18 19:17   ` Jason Gunthorpe
  2022-01-18 19:42     ` Santosh Shilimkar
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2022-01-18 19:17 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Praveen Kannoju, David S . Miller, kuba, netdev, linux-rdma,
	rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> 
> > On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> > 
> > This patch aims to reduce the number of asynchronous workers being spawned
> > to execute the function "rds_ib_flush_mr_pool" during the high I/O
> > situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> > will be executed without being disturbed. By reducing the number of
> > processes contending to flush the mr pool, the total number of D state
> > processes waiting to acquire the mutex lock will be greatly reduced, which
> > otherwise were causing DB instance crash as the corresponding processes
> > were not progressing while waiting to acquire the mutex lock.
> > 
> > Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
> > —
> > 
> […]
> 
> > diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> > index 8f070ee..6b640b5 100644
> > +++ b/net/rds/ib_rdma.c
> > @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > 	 */
> > 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> > +	WRITE_ONCE(pool->flush_ongoing, true);
> > +	smp_wmb();
> > 	if (free_all) {
> > 		unsigned long flags;
> > 
> > @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > 	atomic_sub(nfreed, &pool->item_count);
> > 
> > out:
> > +	WRITE_ONCE(pool->flush_ongoing, false);
> > +	smp_wmb();
> > 	mutex_unlock(&pool->flush_lock);
> > 	if (waitqueue_active(&pool->flush_wait))
> > 		wake_up(&pool->flush_wait);
> > @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> > 
> > 	/* If we've pinned too many pages, request a flush */
> > 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > +		smp_rmb();
> You won’t need these explicit barriers since above atomic and write once already
> issue them.

No, they don't. Use smp_store_release() and smp_load_acquire if you
want to do something like this, but I still can't quite figure out if
this usage of unlocked memory accesses makes any sense at all.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-18 19:17   ` Jason Gunthorpe
@ 2022-01-18 19:42     ` Santosh Shilimkar
  2022-01-19  6:59       ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Santosh Shilimkar @ 2022-01-18 19:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Praveen Kannoju, David S . Miller, kuba, netdev, linux-rdma,
	rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
>> 
>>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
>>> 
>>> This patch aims to reduce the number of asynchronous workers being spawned
>>> to execute the function "rds_ib_flush_mr_pool" during the high I/O
>>> situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
>>> will be executed without being disturbed. By reducing the number of
>>> processes contending to flush the mr pool, the total number of D state
>>> processes waiting to acquire the mutex lock will be greatly reduced, which
>>> otherwise were causing DB instance crash as the corresponding processes
>>> were not progressing while waiting to acquire the mutex lock.
>>> 
>>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
>>> —
>>> 
>> […]
>> 
>>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
>>> index 8f070ee..6b640b5 100644
>>> +++ b/net/rds/ib_rdma.c
>>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
>>> 	 */
>>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
>>> 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
>>> +	WRITE_ONCE(pool->flush_ongoing, true);
>>> +	smp_wmb();
>>> 	if (free_all) {
>>> 		unsigned long flags;
>>> 
>>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
>>> 	atomic_sub(nfreed, &pool->item_count);
>>> 
>>> out:
>>> +	WRITE_ONCE(pool->flush_ongoing, false);
>>> +	smp_wmb();
>>> 	mutex_unlock(&pool->flush_lock);
>>> 	if (waitqueue_active(&pool->flush_wait))
>>> 		wake_up(&pool->flush_wait);
>>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
>>> 
>>> 	/* If we've pinned too many pages, request a flush */
>>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
>>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
>>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
>>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
>>> +		smp_rmb();
>> You won’t need these explicit barriers since above atomic and write once already
>> issue them.
> 
> No, they don't. Use smp_store_release() and smp_load_acquire if you
> want to do something like this, but I still can't quite figure out if
> this usage of unlocked memory accesses makes any sense at all.
> 
Indeed, I see that now, thanks. Yeah, these multi variable checks can indeed
be racy but they are under lock at least for this code path. But there are few
hot path places where single variable states are evaluated atomically instead of
heavy lock. 

Regards,
Santosh


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-18 19:42     ` Santosh Shilimkar
@ 2022-01-19  6:59       ` Leon Romanovsky
  2022-01-19 11:46         ` Praveen Kannoju
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2022-01-19  6:59 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Jason Gunthorpe, Praveen Kannoju, David S . Miller, kuba, netdev,
	linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> >> 
> >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> >>> 
> >>> This patch aims to reduce the number of asynchronous workers being spawned
> >>> to execute the function "rds_ib_flush_mr_pool" during the high I/O
> >>> situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> >>> will be executed without being disturbed. By reducing the number of
> >>> processes contending to flush the mr pool, the total number of D state
> >>> processes waiting to acquire the mutex lock will be greatly reduced, which
> >>> otherwise were causing DB instance crash as the corresponding processes
> >>> were not progressing while waiting to acquire the mutex lock.
> >>> 
> >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
> >>> —
> >>> 
> >> […]
> >> 
> >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> >>> index 8f070ee..6b640b5 100644
> >>> +++ b/net/rds/ib_rdma.c
> >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> >>> 	 */
> >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
> >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> >>> +	smp_wmb();
> >>> 	if (free_all) {
> >>> 		unsigned long flags;
> >>> 
> >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> >>> 	atomic_sub(nfreed, &pool->item_count);
> >>> 
> >>> out:
> >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> >>> +	smp_wmb();
> >>> 	mutex_unlock(&pool->flush_lock);
> >>> 	if (waitqueue_active(&pool->flush_wait))
> >>> 		wake_up(&pool->flush_wait);
> >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
> >>> 
> >>> 	/* If we've pinned too many pages, request a flush */
> >>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> >>> +		smp_rmb();
> >> You won’t need these explicit barriers since above atomic and write once already
> >> issue them.
> > 
> > No, they don't. Use smp_store_release() and smp_load_acquire if you
> > want to do something like this, but I still can't quite figure out if
> > this usage of unlocked memory accesses makes any sense at all.
> > 
> Indeed, I see that now, thanks. Yeah, these multi variable checks can indeed
> be racy but they are under lock at least for this code path. But there are few
> hot path places where single variable states are evaluated atomically instead of
> heavy lock. 

At least pool->dirty_count is not locked in rds_ib_free_mr() at all.

Thanks

> 
> Regards,
> Santosh
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19  6:59       ` Leon Romanovsky
@ 2022-01-19 11:46         ` Praveen Kannoju
  2022-01-19 13:04           ` Jason Gunthorpe
  2022-01-19 14:56           ` Leon Romanovsky
  0 siblings, 2 replies; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-19 11:46 UTC (permalink / raw)
  To: Leon Romanovsky, Santosh Shilimkar
  Cc: Jason Gunthorpe, David S . Miller, kuba, netdev, linux-rdma,
	rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

-----Original Message-----
From: Leon Romanovsky [mailto:leon@kernel.org] 
Sent: 19 January 2022 12:29 PM
To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju <praveen.kannoju@oracle.com>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool

On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> >> 
> >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> >>> 
> >>> This patch aims to reduce the number of asynchronous workers being 
> >>> spawned to execute the function "rds_ib_flush_mr_pool" during the 
> >>> high I/O situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> >>> will be executed without being disturbed. By reducing the number 
> >>> of processes contending to flush the mr pool, the total number of 
> >>> D state processes waiting to acquire the mutex lock will be 
> >>> greatly reduced, which otherwise were causing DB instance crash as 
> >>> the corresponding processes were not progressing while waiting to acquire the mutex lock.
> >>> 
> >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com> 
> >>> —
> >>> 
> >> […]
> >> 
> >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 
> >>> 8f070ee..6b640b5 100644
> >>> +++ b/net/rds/ib_rdma.c
> >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> >>> 	 */
> >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list, 
> >>> &unmap_list);
> >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> >>> +	smp_wmb();
> >>> 	if (free_all) {
> >>> 		unsigned long flags;
> >>> 
> >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> >>> 	atomic_sub(nfreed, &pool->item_count);
> >>> 
> >>> out:
> >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> >>> +	smp_wmb();
> >>> 	mutex_unlock(&pool->flush_lock);
> >>> 	if (waitqueue_active(&pool->flush_wait))
> >>> 		wake_up(&pool->flush_wait);
> >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int 
> >>> invalidate)
> >>> 
> >>> 	/* If we've pinned too many pages, request a flush */
> >>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> >>> +		smp_rmb();
> >> You won’t need these explicit barriers since above atomic and write 
> >> once already issue them.
> > 
> > No, they don't. Use smp_store_release() and smp_load_acquire if you 
> > want to do something like this, but I still can't quite figure out 
> > if this usage of unlocked memory accesses makes any sense at all.
> > 
> Indeed, I see that now, thanks. Yeah, these multi variable checks can 
> indeed be racy but they are under lock at least for this code path. 
> But there are few hot path places where single variable states are 
> evaluated atomically instead of heavy lock.

At least pool->dirty_count is not locked in rds_ib_free_mr() at all.

Thanks

> 
> Regards,
> Santosh
> 

Thank you Santosh, Leon and Jason for reviewing the Patch.

1. Leon, the bool variable "flush_ongoing " introduced through the patch has to be accessed only after acquiring the mutex lock. Hence it is well protected.

2. As the commit message already conveys the reason, the check being made in the function "rds_ib_free_mr" is only to avoid the redundant asynchronous workers from being spawned.

3. The synchronous flush path's through the function "rds_free_mr" with either cookie=0 or "invalidate" flag being set, have to be honoured as per the semantics and hence these paths have to execute the function "rds_ib_flush_mr_pool" unconditionally.

4. It complicates the patch to identify, where from the function "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the state of the bool variable to stop the asynchronous workers.

5. We knew that "queue_delayed_work" will ensures only one work is running, but avoiding these async workers during high load situations, made way for the allocation and synchronous callers which would otherwise be waiting long for the flush to happen. Great reduction in the number of calls to the function "rds_ib_flush_mr_pool" has been observed with this patch. 

6. Jason, the only function "rds_ib_free_mr" which accesses the introduced bool variable "flush_ongoing" to spawn a flush worker does not crucially impact the availability of MR's, because the flush happens from allocation path as well when necessary.   Hence the Load-store ordering is not essentially needed here, because of which we chose smp_rmb() and smp_wmb() over smp_load_acquire() and smp_store_release().

Regards,
Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 11:46         ` Praveen Kannoju
@ 2022-01-19 13:04           ` Jason Gunthorpe
  2022-01-19 13:12             ` Praveen Kannoju
  2022-01-19 14:56           ` Leon Romanovsky
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2022-01-19 13:04 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Leon Romanovsky, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
 
> 6. Jason, the only function "rds_ib_free_mr" which accesses the
> introduced bool variable "flush_ongoing" to spawn a flush worker
> does not crucially impact the availability of MR's, because the
> flush happens from allocation path as well when necessary.  Hence
> the Load-store ordering is not essentially needed here, because of
> which we chose smp_rmb() and smp_wmb() over smp_load_acquire() and
> smp_store_release().

That seems like a confusing statement, you added barriers which do the
same things as acquire/release then say you didn't need acquire
release?

I think this is using barriers wrong.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 13:04           ` Jason Gunthorpe
@ 2022-01-19 13:12             ` Praveen Kannoju
  2022-01-19 13:17               ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-19 13:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom



-----Original Message-----
From: Jason Gunthorpe [mailto:jgg@ziepe.ca] 
Sent: 19 January 2022 06:35 PM
To: Praveen Kannoju <praveen.kannoju@oracle.com>
Cc: Leon Romanovsky <leon@kernel.org>; Santosh Shilimkar <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool

On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
 
> 6. Jason, the only function "rds_ib_free_mr" which accesses the 
> introduced bool variable "flush_ongoing" to spawn a flush worker does 
> not crucially impact the availability of MR's, because the flush 
> happens from allocation path as well when necessary.  Hence the 
> Load-store ordering is not essentially needed here, because of which 
> we chose smp_rmb() and smp_wmb() over smp_load_acquire() and 
> smp_store_release().

That seems like a confusing statement, you added barriers which do the same things as acquire/release then say you didn't need acquire release?

I think this is using barriers wrong.

Jason

Jason, 

Yes, we are using the barriers. I was justifying the usage of smp_rmb() and smp_wmb() over smp_load_acquire() and smp_store_release() in the patch. 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 13:12             ` Praveen Kannoju
@ 2022-01-19 13:17               ` Jason Gunthorpe
  2022-01-19 14:08                 ` Praveen Kannoju
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2022-01-19 13:17 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Leon Romanovsky, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Wed, Jan 19, 2022 at 01:12:29PM +0000, Praveen Kannoju wrote:

> Yes, we are using the barriers. I was justifying the usage of
> smp_rmb() and smp_wmb() over smp_load_acquire() and
> smp_store_release() in the patch.

You failed to justify it.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 13:17               ` Jason Gunthorpe
@ 2022-01-19 14:08                 ` Praveen Kannoju
  2022-01-19 14:49                   ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-19 14:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

-----Original Message-----
From: Jason Gunthorpe [mailto:jgg@ziepe.ca] 
Sent: 19 January 2022 06:47 PM
To: Praveen Kannoju <praveen.kannoju@oracle.com>
Cc: Leon Romanovsky <leon@kernel.org>; Santosh Shilimkar <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool

On Wed, Jan 19, 2022 at 01:12:29PM +0000, Praveen Kannoju wrote:

> Yes, we are using the barriers. I was justifying the usage of
> smp_rmb() and smp_wmb() over smp_load_acquire() and
> smp_store_release() in the patch.

You failed to justify it.

Jason

Apologies, if my earlier point is not clear, Jason.
Let me reframe:

1. The introduced bool variable "flush_ongoing", is being accessed only in the function "rds_ib_free_mr" while spawning asynchronous workers.

2. The ordering guaranteed by smp_rmb() and smp_wmb() would be sufficient for such simple usage and hence we did not use smp_load_acquire() and smp_store_release().

3. In case the function "rds_ib_free_mr", misses to spawn the flush function, the same will be requested by the allocation path  "rds_ib_alloc_frmr" which in-turn calls "rds_ib_try_reuse_ibmr", which finally calls the flush function "rds_ib_flush_mr_pool" to obtain mr, during mr allocation requests.

4. If you still insist, we can modify the patch to use  smp_load_acquire()  and smp_store_release().

Regards,
Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 14:08                 ` Praveen Kannoju
@ 2022-01-19 14:49                   ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2022-01-19 14:49 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Leon Romanovsky, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Wed, Jan 19, 2022 at 02:08:48PM +0000, Praveen Kannoju wrote:
> From: Jason Gunthorpe [mailto:jgg@ziepe.ca] 
> Sent: 19 January 2022 06:47 PM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: Leon Romanovsky <leon@kernel.org>; Santosh Shilimkar <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
> 
> On Wed, Jan 19, 2022 at 01:12:29PM +0000, Praveen Kannoju wrote:
> 
> > Yes, we are using the barriers. I was justifying the usage of
> > smp_rmb() and smp_wmb() over smp_load_acquire() and
> > smp_store_release() in the patch.
> 
> You failed to justify it.
> 
> Jason
> 
> Apologies, if my earlier point is not clear, Jason.
> Let me reframe:
> 
> 1. The introduced bool variable "flush_ongoing", is being accessed only in the function "rds_ib_free_mr" while spawning asynchronous workers.
> 
> 2. The ordering guaranteed by smp_rmb() and smp_wmb() would be
> sufficient for such simple usage and hence we did not use
> smp_load_acquire() and smp_store_release().

Again you haven't defined why these barriers are any differnet from
acquire/release or even *what they are doing*

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 11:46         ` Praveen Kannoju
  2022-01-19 13:04           ` Jason Gunthorpe
@ 2022-01-19 14:56           ` Leon Romanovsky
  2022-01-20  8:00             ` Praveen Kannoju
  1 sibling, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2022-01-19 14:56 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Santosh Shilimkar, Jason Gunthorpe, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org] 
> Sent: 19 January 2022 12:29 PM
> To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju <praveen.kannoju@oracle.com>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
> 
> On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > 
> > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > >> 
> > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> > >>> 
> > >>> This patch aims to reduce the number of asynchronous workers being 
> > >>> spawned to execute the function "rds_ib_flush_mr_pool" during the 
> > >>> high I/O situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> > >>> will be executed without being disturbed. By reducing the number 
> > >>> of processes contending to flush the mr pool, the total number of 
> > >>> D state processes waiting to acquire the mutex lock will be 
> > >>> greatly reduced, which otherwise were causing DB instance crash as 
> > >>> the corresponding processes were not progressing while waiting to acquire the mutex lock.
> > >>> 
> > >>> Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com> 
> > >>> —
> > >>> 
> > >> […]
> > >> 
> > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 
> > >>> 8f070ee..6b640b5 100644
> > >>> +++ b/net/rds/ib_rdma.c
> > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > >>> 	 */
> > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list, 
> > >>> &unmap_list);
> > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > >>> +	smp_wmb();
> > >>> 	if (free_all) {
> > >>> 		unsigned long flags;
> > >>> 
> > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > >>> 	atomic_sub(nfreed, &pool->item_count);
> > >>> 
> > >>> out:
> > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > >>> +	smp_wmb();
> > >>> 	mutex_unlock(&pool->flush_lock);
> > >>> 	if (waitqueue_active(&pool->flush_wait))
> > >>> 		wake_up(&pool->flush_wait);
> > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int 
> > >>> invalidate)
> > >>> 
> > >>> 	/* If we've pinned too many pages, request a flush */
> > >>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > >>> +		smp_rmb();
> > >> You won’t need these explicit barriers since above atomic and write 
> > >> once already issue them.
> > > 
> > > No, they don't. Use smp_store_release() and smp_load_acquire if you 
> > > want to do something like this, but I still can't quite figure out 
> > > if this usage of unlocked memory accesses makes any sense at all.
> > > 
> > Indeed, I see that now, thanks. Yeah, these multi variable checks can 
> > indeed be racy but they are under lock at least for this code path. 
> > But there are few hot path places where single variable states are 
> > evaluated atomically instead of heavy lock.
> 
> At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> 
> Thanks
> 
> > 
> > Regards,
> > Santosh
> > 
> 
> Thank you Santosh, Leon and Jason for reviewing the Patch.
> 
> 1. Leon, the bool variable "flush_ongoing " introduced through the patch has to be accessed only after acquiring the mutex lock. Hence it is well protected.

I don't see any lock in rds_ib_free_mr() function where your perform
"if (!READ_ONCE(pool->flush_ongoing)) { ...".

> 
> 2. As the commit message already conveys the reason, the check being made in the function "rds_ib_free_mr" is only to avoid the redundant asynchronous workers from being spawned.
> 
> 3. The synchronous flush path's through the function "rds_free_mr" with either cookie=0 or "invalidate" flag being set, have to be honoured as per the semantics and hence these paths have to execute the function "rds_ib_flush_mr_pool" unconditionally.
> 
> 4. It complicates the patch to identify, where from the function "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the state of the bool variable to stop the asynchronous workers.
> 
> 5. We knew that "queue_delayed_work" will ensures only one work is running, but avoiding these async workers during high load situations, made way for the allocation and synchronous callers which would otherwise be waiting long for the flush to happen. Great reduction in the number of calls to the function "rds_ib_flush_mr_pool" has been observed with this patch. 

So if you understand that there is only one work in progress, why do
you say workerS?

Thanks

> 
> 6. Jason, the only function "rds_ib_free_mr" which accesses the introduced bool variable "flush_ongoing" to spawn a flush worker does not crucially impact the availability of MR's, because the flush happens from allocation path as well when necessary.   Hence the Load-store ordering is not essentially needed here, because of which we chose smp_rmb() and smp_wmb() over smp_load_acquire() and smp_store_release().
> 
> Regards,
> Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-19 14:56           ` Leon Romanovsky
@ 2022-01-20  8:00             ` Praveen Kannoju
  2022-01-20 11:11               ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-20  8:00 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Santosh Shilimkar, David S . Miller, kuba, netdev, linux-rdma,
	rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

-----Original Message-----
From: Leon Romanovsky [mailto:leon@kernel.org] 
Sent: 19 January 2022 08:26 PM
To: Praveen Kannoju <praveen.kannoju@oracle.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>; Jason Gunthorpe <jgg@ziepe.ca>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool

On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: 19 January 2022 12:29 PM
> To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju 
> <praveen.kannoju@oracle.com>; David S . Miller <davem@davemloft.net>; 
> kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; 
> rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama 
> Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh 
> Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the 
> asynchronous workers to flush the mr pool
> 
> On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > 
> > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > >> 
> > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> > >>> 
> > >>> This patch aims to reduce the number of asynchronous workers 
> > >>> being spawned to execute the function "rds_ib_flush_mr_pool" 
> > >>> during the high I/O situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> > >>> will be executed without being disturbed. By reducing the number 
> > >>> of processes contending to flush the mr pool, the total number 
> > >>> of D state processes waiting to acquire the mutex lock will be 
> > >>> greatly reduced, which otherwise were causing DB instance crash 
> > >>> as the corresponding processes were not progressing while waiting to acquire the mutex lock.
> > >>> 
> > >>> Signed-off-by: Praveen Kumar Kannoju 
> > >>> <praveen.kannoju@oracle.com> —
> > >>> 
> > >> […]
> > >> 
> > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index
> > >>> 8f070ee..6b640b5 100644
> > >>> +++ b/net/rds/ib_rdma.c
> > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > >>> 	 */
> > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list,
> > >>> &unmap_list);
> > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > >>> +	smp_wmb();
> > >>> 	if (free_all) {
> > >>> 		unsigned long flags;
> > >>> 
> > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > >>> 	atomic_sub(nfreed, &pool->item_count);
> > >>> 
> > >>> out:
> > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > >>> +	smp_wmb();
> > >>> 	mutex_unlock(&pool->flush_lock);
> > >>> 	if (waitqueue_active(&pool->flush_wait))
> > >>> 		wake_up(&pool->flush_wait);
> > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, 
> > >>> int
> > >>> invalidate)
> > >>> 
> > >>> 	/* If we've pinned too many pages, request a flush */
> > >>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > >>> +		smp_rmb();
> > >> You won’t need these explicit barriers since above atomic and 
> > >> write once already issue them.
> > > 
> > > No, they don't. Use smp_store_release() and smp_load_acquire if 
> > > you want to do something like this, but I still can't quite figure 
> > > out if this usage of unlocked memory accesses makes any sense at all.
> > > 
> > Indeed, I see that now, thanks. Yeah, these multi variable checks 
> > can indeed be racy but they are under lock at least for this code path.
> > But there are few hot path places where single variable states are 
> > evaluated atomically instead of heavy lock.
> 
> At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> 
> Thanks
> 
> > 
> > Regards,
> > Santosh
> > 
> 
> Thank you Santosh, Leon and Jason for reviewing the Patch.
> 
> 1. Leon, the bool variable "flush_ongoing " introduced through the patch has to be accessed only after acquiring the mutex lock. Hence it is well protected.

I don't see any lock in rds_ib_free_mr() function where your perform "if (!READ_ONCE(pool->flush_ongoing)) { ...".

> 
> 2. As the commit message already conveys the reason, the check being made in the function "rds_ib_free_mr" is only to avoid the redundant asynchronous workers from being spawned.
> 
> 3. The synchronous flush path's through the function "rds_free_mr" with either cookie=0 or "invalidate" flag being set, have to be honoured as per the semantics and hence these paths have to execute the function "rds_ib_flush_mr_pool" unconditionally.
> 
> 4. It complicates the patch to identify, where from the function "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the state of the bool variable to stop the asynchronous workers.
> 
> 5. We knew that "queue_delayed_work" will ensures only one work is running, but avoiding these async workers during high load situations, made way for the allocation and synchronous callers which would otherwise be waiting long for the flush to happen. Great reduction in the number of calls to the function "rds_ib_flush_mr_pool" has been observed with this patch. 

So if you understand that there is only one work in progress, why do you say workerS?

Thanks

> 
> 6. Jason, the only function "rds_ib_free_mr" which accesses the introduced bool variable "flush_ongoing" to spawn a flush worker does not crucially impact the availability of MR's, because the flush happens from allocation path as well when necessary.   Hence the Load-store ordering is not essentially needed here, because of which we chose smp_rmb() and smp_wmb() over smp_load_acquire() and smp_store_release().
> 
> Regards,
> Praveen.


Jason,

	The relaxed ordering primitives smp_rmb() and smp_wmb() ensure to provide
	guaranteed atomic memory operations READ_ONCE and WRITE_ONCE, used in the
	functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool" correspondingly.

	Yes, the memory barrier primitives smp_load_acquire()and smp_store_release() 
	are even better. But, because of the simplicity of the use of memory barrier
	in the patch, smp_rmb() and smp_wmb() are chosen. 

	Please let me know if you want me to switch to smp_load_acquire() and
	smp_store_release().

Leon,

	Avoiding the asynchronous worker from being spawned during the high load situations,
	make way for both synchronous and allocation path to flush the mr pool and grab the
	mr without waiting.

	Please let me know if you still have any queries with this respect or any modifications
	are needed.

Regards,
Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-20  8:00             ` Praveen Kannoju
@ 2022-01-20 11:11               ` Leon Romanovsky
  2022-01-20 11:57                 ` Praveen Kannoju
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2022-01-20 11:11 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Jason Gunthorpe, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org] 
> Sent: 19 January 2022 08:26 PM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>; Jason Gunthorpe <jgg@ziepe.ca>; David S . Miller <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
> 
> On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: 19 January 2022 12:29 PM
> > To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju 
> > <praveen.kannoju@oracle.com>; David S . Miller <davem@davemloft.net>; 
> > kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org; 
> > rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama 
> > Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh 
> > Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the 
> > asynchronous workers to flush the mr pool
> > 
> > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > 
> > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > >> 
> > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju <praveen.kannoju@oracle.com> wrote:
> > > >>> 
> > > >>> This patch aims to reduce the number of asynchronous workers 
> > > >>> being spawned to execute the function "rds_ib_flush_mr_pool" 
> > > >>> during the high I/O situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
> > > >>> will be executed without being disturbed. By reducing the number 
> > > >>> of processes contending to flush the mr pool, the total number 
> > > >>> of D state processes waiting to acquire the mutex lock will be 
> > > >>> greatly reduced, which otherwise were causing DB instance crash 
> > > >>> as the corresponding processes were not progressing while waiting to acquire the mutex lock.
> > > >>> 
> > > >>> Signed-off-by: Praveen Kumar Kannoju 
> > > >>> <praveen.kannoju@oracle.com> —
> > > >>> 
> > > >> […]
> > > >> 
> > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index
> > > >>> 8f070ee..6b640b5 100644
> > > >>> +++ b/net/rds/ib_rdma.c
> > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > >>> 	 */
> > > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
> > > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list,
> > > >>> &unmap_list);
> > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > >>> +	smp_wmb();
> > > >>> 	if (free_all) {
> > > >>> 		unsigned long flags;
> > > >>> 
> > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
> > > >>> 	atomic_sub(nfreed, &pool->item_count);
> > > >>> 
> > > >>> out:
> > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > >>> +	smp_wmb();
> > > >>> 	mutex_unlock(&pool->flush_lock);
> > > >>> 	if (waitqueue_active(&pool->flush_wait))
> > > >>> 		wake_up(&pool->flush_wait);
> > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, 
> > > >>> int
> > > >>> invalidate)
> > > >>> 
> > > >>> 	/* If we've pinned too many pages, request a flush */
> > > >>> 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
> > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
> > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
> > > >>> +		smp_rmb();
> > > >> You won’t need these explicit barriers since above atomic and 
> > > >> write once already issue them.
> > > > 
> > > > No, they don't. Use smp_store_release() and smp_load_acquire if 
> > > > you want to do something like this, but I still can't quite figure 
> > > > out if this usage of unlocked memory accesses makes any sense at all.
> > > > 
> > > Indeed, I see that now, thanks. Yeah, these multi variable checks 
> > > can indeed be racy but they are under lock at least for this code path.
> > > But there are few hot path places where single variable states are 
> > > evaluated atomically instead of heavy lock.
> > 
> > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > 
> > Thanks
> > 
> > > 
> > > Regards,
> > > Santosh
> > > 
> > 
> > Thank you Santosh, Leon and Jason for reviewing the Patch.
> > 
> > 1. Leon, the bool variable "flush_ongoing " introduced through the patch has to be accessed only after acquiring the mutex lock. Hence it is well protected.
> 
> I don't see any lock in rds_ib_free_mr() function where your perform "if (!READ_ONCE(pool->flush_ongoing)) { ...".
> 
> > 
> > 2. As the commit message already conveys the reason, the check being made in the function "rds_ib_free_mr" is only to avoid the redundant asynchronous workers from being spawned.
> > 
> > 3. The synchronous flush path's through the function "rds_free_mr" with either cookie=0 or "invalidate" flag being set, have to be honoured as per the semantics and hence these paths have to execute the function "rds_ib_flush_mr_pool" unconditionally.
> > 
> > 4. It complicates the patch to identify, where from the function "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the state of the bool variable to stop the asynchronous workers.
> > 
> > 5. We knew that "queue_delayed_work" will ensures only one work is running, but avoiding these async workers during high load situations, made way for the allocation and synchronous callers which would otherwise be waiting long for the flush to happen. Great reduction in the number of calls to the function "rds_ib_flush_mr_pool" has been observed with this patch. 
> 
> So if you understand that there is only one work in progress, why do you say workerS?
> 
> Thanks
> 
> > 
> > 6. Jason, the only function "rds_ib_free_mr" which accesses the introduced bool variable "flush_ongoing" to spawn a flush worker does not crucially impact the availability of MR's, because the flush happens from allocation path as well when necessary.   Hence the Load-store ordering is not essentially needed here, because of which we chose smp_rmb() and smp_wmb() over smp_load_acquire() and smp_store_release().
> > 
> > Regards,
> > Praveen.
> 
> 
> Jason,
> 
> 	The relaxed ordering primitives smp_rmb() and smp_wmb() ensure to provide
> 	guaranteed atomic memory operations READ_ONCE and WRITE_ONCE, used in the
> 	functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool" correspondingly.
> 
> 	Yes, the memory barrier primitives smp_load_acquire()and smp_store_release() 
> 	are even better. But, because of the simplicity of the use of memory barrier
> 	in the patch, smp_rmb() and smp_wmb() are chosen. 
> 
> 	Please let me know if you want me to switch to smp_load_acquire() and
> 	smp_store_release().
> 
> Leon,
> 
> 	Avoiding the asynchronous worker from being spawned during the high load situations,
> 	make way for both synchronous and allocation path to flush the mr pool and grab the
> 	mr without waiting.
> 
> 	Please let me know if you still have any queries with this respect or any modifications
> 	are needed.

I didn't get any answer on my questions.
So let's me repeat them.
1. Where can I see locks in rds_ib_free_mr() that protects concurrent
change of your new flush_ongoing field?
2. There is only one same work can be executed/scheduled, where do you
see multiple/parallel workers (plural) and how is it possible?

BTW, please fix your email client to reply inline.

> 
> Regards,
> Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-20 11:11               ` Leon Romanovsky
@ 2022-01-20 11:57                 ` Praveen Kannoju
  2022-01-20 12:21                   ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-20 11:57 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: 20 January 2022 04:41 PM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>; Santosh Shilimkar
> <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>;
> kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama
> Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh
> Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> asynchronous workers to flush the mr pool
> 
> On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: 19 January 2022 08:26 PM
> > To: Praveen Kannoju <praveen.kannoju@oracle.com>
> > Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>; Jason Gunthorpe
> > <jgg@ziepe.ca>; David S . Miller <davem@davemloft.net>;
> > kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> > rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama
> > Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh
> > Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> > asynchronous workers to flush the mr pool
> >
> > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: 19 January 2022 12:29 PM
> > > To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> > > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju
> > > <praveen.kannoju@oracle.com>; David S . Miller
> > > <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org;
> > > linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com;
> > > linux-kernel@vger.kernel.org; Rama Nichanamatlu
> > > <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom
> > > <rajesh.sivaramasubramaniom@oracle.com>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > the asynchronous workers to flush the mr pool
> > >
> > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > > >>
> > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju
> <praveen.kannoju@oracle.com> wrote:
> > > > >>>
> > > > >>> This patch aims to reduce the number of asynchronous workers
> > > > >>> being spawned to execute the function "rds_ib_flush_mr_pool"
> > > > >>> during the high I/O situations. Synchronous call path's to this
> function "rds_ib_flush_mr_pool"
> > > > >>> will be executed without being disturbed. By reducing the
> > > > >>> number of processes contending to flush the mr pool, the total
> > > > >>> number of D state processes waiting to acquire the mutex lock
> > > > >>> will be greatly reduced, which otherwise were causing DB
> > > > >>> instance crash as the corresponding processes were not
> progressing while waiting to acquire the mutex lock.
> > > > >>>
> > > > >>> Signed-off-by: Praveen Kumar Kannoju
> > > > >>> <praveen.kannoju@oracle.com> —
> > > > >>>
> > > > >> […]
> > > > >>
> > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index
> > > > >>> 8f070ee..6b640b5 100644
> > > > >>> +++ b/net/rds/ib_rdma.c
> > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct
> rds_ib_mr_pool *pool,
> > > > >>> 	 */
> > > > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list,
> &unmap_list);
> > > > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list,
> > > > >>> &unmap_list);
> > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > >>> +	smp_wmb();
> > > > >>> 	if (free_all) {
> > > > >>> 		unsigned long flags;
> > > > >>>
> > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct
> rds_ib_mr_pool *pool,
> > > > >>> 	atomic_sub(nfreed, &pool->item_count);
> > > > >>>
> > > > >>> out:
> > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > >>> +	smp_wmb();
> > > > >>> 	mutex_unlock(&pool->flush_lock);
> > > > >>> 	if (waitqueue_active(&pool->flush_wait))
> > > > >>> 		wake_up(&pool->flush_wait);
> > > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private,
> > > > >>> int
> > > > >>> invalidate)
> > > > >>>
> > > > >>> 	/* If we've pinned too many pages, request a flush */
> > > > >>> 	if (atomic_read(&pool->free_pinned) >= pool-
> >max_free_pinned ||
> > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool-
> >flush_worker, 10);
> > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> {
> > > > >>> +		smp_rmb();
> > > > >> You won’t need these explicit barriers since above atomic and
> > > > >> write once already issue them.
> > > > >
> > > > > No, they don't. Use smp_store_release() and smp_load_acquire if
> > > > > you want to do something like this, but I still can't quite
> > > > > figure out if this usage of unlocked memory accesses makes any
> sense at all.
> > > > >
> > > > Indeed, I see that now, thanks. Yeah, these multi variable checks
> > > > can indeed be racy but they are under lock at least for this code path.
> > > > But there are few hot path places where single variable states are
> > > > evaluated atomically instead of heavy lock.
> > >
> > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > >
> > > Thanks
> > >
> > > >
> > > > Regards,
> > > > Santosh
> > > >
> > >
> > > Thank you Santosh, Leon and Jason for reviewing the Patch.
> > >
> > > 1. Leon, the bool variable "flush_ongoing " introduced through the patch
> has to be accessed only after acquiring the mutex lock. Hence it is well
> protected.
> >
> > I don't see any lock in rds_ib_free_mr() function where your perform "if
> (!READ_ONCE(pool->flush_ongoing)) { ...".
> >
> > >
> > > 2. As the commit message already conveys the reason, the check being
> made in the function "rds_ib_free_mr" is only to avoid the redundant
> asynchronous workers from being spawned.
> > >
> > > 3. The synchronous flush path's through the function "rds_free_mr" with
> either cookie=0 or "invalidate" flag being set, have to be honoured as per the
> semantics and hence these paths have to execute the function
> "rds_ib_flush_mr_pool" unconditionally.
> > >
> > > 4. It complicates the patch to identify, where from the function
> "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the
> state of the bool variable to stop the asynchronous workers.
> > >
> > > 5. We knew that "queue_delayed_work" will ensures only one work is
> running, but avoiding these async workers during high load situations, made
> way for the allocation and synchronous callers which would otherwise be
> waiting long for the flush to happen. Great reduction in the number of calls
> to the function "rds_ib_flush_mr_pool" has been observed with this patch.
> >
> > So if you understand that there is only one work in progress, why do you
> say workerS?
> >
> > Thanks
> >
> > >
> > > 6. Jason, the only function "rds_ib_free_mr" which accesses the
> introduced bool variable "flush_ongoing" to spawn a flush worker does not
> crucially impact the availability of MR's, because the flush happens from
> allocation path as well when necessary.   Hence the Load-store ordering is
> not essentially needed here, because of which we chose smp_rmb() and
> smp_wmb() over smp_load_acquire() and smp_store_release().
> > >
> > > Regards,
> > > Praveen.
> >
> >
> > Jason,
> >
> > 	The relaxed ordering primitives smp_rmb() and smp_wmb() ensure
> to provide
> > 	guaranteed atomic memory operations READ_ONCE and
> WRITE_ONCE, used in the
> > 	functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool"
> correspondingly.
> >
> > 	Yes, the memory barrier primitives smp_load_acquire()and
> smp_store_release()
> > 	are even better. But, because of the simplicity of the use of memory
> barrier
> > 	in the patch, smp_rmb() and smp_wmb() are chosen.
> >
> > 	Please let me know if you want me to switch to smp_load_acquire()
> and
> > 	smp_store_release().
> >
> > Leon,
> >
> > 	Avoiding the asynchronous worker from being spawned during the
> high load situations,
> > 	make way for both synchronous and allocation path to flush the mr
> pool and grab the
> > 	mr without waiting.
> >
> > 	Please let me know if you still have any queries with this respect or
> any modifications
> > 	are needed.
> 
Thank you for your reply, Leon.

> I didn't get any answer on my questions.
> So let's me repeat them.
> 1. Where can I see locks in rds_ib_free_mr() that protects concurrent change
> of your new flush_ongoing field?

flush_ongoing variable is only modified in the function "rds_ib_flush_mr_pool" under mutex lock. It is only being read atomically in the function "rds_ib_free_mr()", with memory barrier in place.  Hence a lock is not essential in this function "rds_ib_free_mr()". Depending on the value being read, decision is taken weather to spawn the asynchronous worker or not. 

> 2. There is only one same work can be executed/scheduled, where do you
> see multiple/parallel workers (plural) and how is it possible?

In my earlier comment, I have re-iterated the same point. I would take back my word "workerS". By avoiding this asynchronous worker to participate in flushing, the synchronous flush jobs (cookie=0 and invalidate=1) as well as allocation path worker will be acquiring the mutex lock in the function "rds_ib_flush_mr_pool" quickly, thereby fetching the MR.

I hope I am clear. 

> BTW, please fix your email client to reply inline.
Fixed it. Thank you. 
> 
> >
> > Regards,
> > Praveen.


Regards,
Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-20 11:57                 ` Praveen Kannoju
@ 2022-01-20 12:21                   ` Leon Romanovsky
  2022-01-20 12:27                     ` Praveen Kannoju
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2022-01-20 12:21 UTC (permalink / raw)
  To: Praveen Kannoju
  Cc: Jason Gunthorpe, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom

On Thu, Jan 20, 2022 at 11:57:02AM +0000, Praveen Kannoju wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: 20 January 2022 04:41 PM
> > To: Praveen Kannoju <praveen.kannoju@oracle.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Santosh Shilimkar
> > <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>;
> > kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> > rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama
> > Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh
> > Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> > asynchronous workers to flush the mr pool
> > 
> > On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: 19 January 2022 08:26 PM
> > > To: Praveen Kannoju <praveen.kannoju@oracle.com>
> > > Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>; Jason Gunthorpe
> > > <jgg@ziepe.ca>; David S . Miller <davem@davemloft.net>;
> > > kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> > > rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama
> > > Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh
> > > Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> > > asynchronous workers to flush the mr pool
> > >
> > > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > > Sent: 19 January 2022 12:29 PM
> > > > To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> > > > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju
> > > > <praveen.kannoju@oracle.com>; David S . Miller
> > > > <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org;
> > > > linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com;
> > > > linux-kernel@vger.kernel.org; Rama Nichanamatlu
> > > > <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom
> > > > <rajesh.sivaramasubramaniom@oracle.com>
> > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > > the asynchronous workers to flush the mr pool
> > > >
> > > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > >
> > > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar wrote:
> > > > > >>
> > > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju
> > <praveen.kannoju@oracle.com> wrote:
> > > > > >>>
> > > > > >>> This patch aims to reduce the number of asynchronous workers
> > > > > >>> being spawned to execute the function "rds_ib_flush_mr_pool"
> > > > > >>> during the high I/O situations. Synchronous call path's to this
> > function "rds_ib_flush_mr_pool"
> > > > > >>> will be executed without being disturbed. By reducing the
> > > > > >>> number of processes contending to flush the mr pool, the total
> > > > > >>> number of D state processes waiting to acquire the mutex lock
> > > > > >>> will be greatly reduced, which otherwise were causing DB
> > > > > >>> instance crash as the corresponding processes were not
> > progressing while waiting to acquire the mutex lock.
> > > > > >>>
> > > > > >>> Signed-off-by: Praveen Kumar Kannoju
> > > > > >>> <praveen.kannoju@oracle.com> —
> > > > > >>>
> > > > > >> […]
> > > > > >>
> > > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index
> > > > > >>> 8f070ee..6b640b5 100644
> > > > > >>> +++ b/net/rds/ib_rdma.c
> > > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct
> > rds_ib_mr_pool *pool,
> > > > > >>> 	 */
> > > > > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list,
> > &unmap_list);
> > > > > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list,
> > > > > >>> &unmap_list);
> > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > > >>> +	smp_wmb();
> > > > > >>> 	if (free_all) {
> > > > > >>> 		unsigned long flags;
> > > > > >>>
> > > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct
> > rds_ib_mr_pool *pool,
> > > > > >>> 	atomic_sub(nfreed, &pool->item_count);
> > > > > >>>
> > > > > >>> out:
> > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > > >>> +	smp_wmb();
> > > > > >>> 	mutex_unlock(&pool->flush_lock);
> > > > > >>> 	if (waitqueue_active(&pool->flush_wait))
> > > > > >>> 		wake_up(&pool->flush_wait);
> > > > > >>> @@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private,
> > > > > >>> int
> > > > > >>> invalidate)
> > > > > >>>
> > > > > >>> 	/* If we've pinned too many pages, request a flush */
> > > > > >>> 	if (atomic_read(&pool->free_pinned) >= pool-
> > >max_free_pinned ||
> > > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool-
> > >flush_worker, 10);
> > > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > {
> > > > > >>> +		smp_rmb();
> > > > > >> You won’t need these explicit barriers since above atomic and
> > > > > >> write once already issue them.
> > > > > >
> > > > > > No, they don't. Use smp_store_release() and smp_load_acquire if
> > > > > > you want to do something like this, but I still can't quite
> > > > > > figure out if this usage of unlocked memory accesses makes any
> > sense at all.
> > > > > >
> > > > > Indeed, I see that now, thanks. Yeah, these multi variable checks
> > > > > can indeed be racy but they are under lock at least for this code path.
> > > > > But there are few hot path places where single variable states are
> > > > > evaluated atomically instead of heavy lock.
> > > >
> > > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Regards,
> > > > > Santosh
> > > > >
> > > >
> > > > Thank you Santosh, Leon and Jason for reviewing the Patch.
> > > >
> > > > 1. Leon, the bool variable "flush_ongoing " introduced through the patch
> > has to be accessed only after acquiring the mutex lock. Hence it is well
> > protected.
> > >
> > > I don't see any lock in rds_ib_free_mr() function where your perform "if
> > (!READ_ONCE(pool->flush_ongoing)) { ...".
> > >
> > > >
> > > > 2. As the commit message already conveys the reason, the check being
> > made in the function "rds_ib_free_mr" is only to avoid the redundant
> > asynchronous workers from being spawned.
> > > >
> > > > 3. The synchronous flush path's through the function "rds_free_mr" with
> > either cookie=0 or "invalidate" flag being set, have to be honoured as per the
> > semantics and hence these paths have to execute the function
> > "rds_ib_flush_mr_pool" unconditionally.
> > > >
> > > > 4. It complicates the patch to identify, where from the function
> > "rds_ib_flush_mr_pool", has been called. And hence, this patch uses the
> > state of the bool variable to stop the asynchronous workers.
> > > >
> > > > 5. We knew that "queue_delayed_work" will ensures only one work is
> > running, but avoiding these async workers during high load situations, made
> > way for the allocation and synchronous callers which would otherwise be
> > waiting long for the flush to happen. Great reduction in the number of calls
> > to the function "rds_ib_flush_mr_pool" has been observed with this patch.
> > >
> > > So if you understand that there is only one work in progress, why do you
> > say workerS?
> > >
> > > Thanks
> > >
> > > >
> > > > 6. Jason, the only function "rds_ib_free_mr" which accesses the
> > introduced bool variable "flush_ongoing" to spawn a flush worker does not
> > crucially impact the availability of MR's, because the flush happens from
> > allocation path as well when necessary.   Hence the Load-store ordering is
> > not essentially needed here, because of which we chose smp_rmb() and
> > smp_wmb() over smp_load_acquire() and smp_store_release().
> > > >
> > > > Regards,
> > > > Praveen.
> > >
> > >
> > > Jason,
> > >
> > > 	The relaxed ordering primitives smp_rmb() and smp_wmb() ensure
> > to provide
> > > 	guaranteed atomic memory operations READ_ONCE and
> > WRITE_ONCE, used in the
> > > 	functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool"
> > correspondingly.
> > >
> > > 	Yes, the memory barrier primitives smp_load_acquire()and
> > smp_store_release()
> > > 	are even better. But, because of the simplicity of the use of memory
> > barrier
> > > 	in the patch, smp_rmb() and smp_wmb() are chosen.
> > >
> > > 	Please let me know if you want me to switch to smp_load_acquire()
> > and
> > > 	smp_store_release().
> > >
> > > Leon,
> > >
> > > 	Avoiding the asynchronous worker from being spawned during the
> > high load situations,
> > > 	make way for both synchronous and allocation path to flush the mr
> > pool and grab the
> > > 	mr without waiting.
> > >
> > > 	Please let me know if you still have any queries with this respect or
> > any modifications
> > > 	are needed.
> > 
> Thank you for your reply, Leon.
> 
> > I didn't get any answer on my questions.
> > So let's me repeat them.
> > 1. Where can I see locks in rds_ib_free_mr() that protects concurrent change
> > of your new flush_ongoing field?
> 
> flush_ongoing variable is only modified in the function "rds_ib_flush_mr_pool" under mutex lock. It is only being read atomically in the function "rds_ib_free_mr()", with memory barrier in place.  Hence a lock is not essential in this function "rds_ib_free_mr()". Depending on the value being read, decision is taken weather to spawn the asynchronous worker or not. 

This is not how locking works.

 CPU1                                     CPU2      
  rds_ib_flush_mr_pool
  lock
                        context switch -> rds_ib_free_mr
			                  READ old value of flush_ongoing
			<- context switch
  WRITE new value to flush_ongoing
			              

> 
> > 2. There is only one same work can be executed/scheduled, where do you
> > see multiple/parallel workers (plural) and how is it possible?
> 
> In my earlier comment, I have re-iterated the same point. I would take back my word "workerS". By avoiding this asynchronous worker to participate in flushing, the synchronous flush jobs (cookie=0 and invalidate=1) as well as allocation path worker will be acquiring the mutex lock in the function "rds_ib_flush_mr_pool" quickly, thereby fetching the MR.

This is completely different scenario from what was presented before.
You are interested to prioritize synchronous operations over async.
In such case, why don't you simply cancel_delayed_work() in your sync
flows?

> 
> I hope I am clear. 
> 
> > BTW, please fix your email client to reply inline.
> Fixed it. Thank you. 
> > 
> > >
> > > Regards,
> > > Praveen.
> 
> 
> Regards,
> Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
  2022-01-20 12:21                   ` Leon Romanovsky
@ 2022-01-20 12:27                     ` Praveen Kannoju
  0 siblings, 0 replies; 19+ messages in thread
From: Praveen Kannoju @ 2022-01-20 12:27 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Santosh Shilimkar, David S . Miller, kuba,
	netdev, linux-rdma, rds-devel, linux-kernel, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom



> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: 20 January 2022 05:51 PM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>; Santosh Shilimkar
> <santosh.shilimkar@oracle.com>; David S . Miller <davem@davemloft.net>;
> kuba@kernel.org; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> rds-devel@oss.oracle.com; linux-kernel@vger.kernel.org; Rama
> Nichanamatlu <rama.nichanamatlu@oracle.com>; Rajesh
> Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>
> Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by the
> asynchronous workers to flush the mr pool
> 
> On Thu, Jan 20, 2022 at 11:57:02AM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: 20 January 2022 04:41 PM
> > > To: Praveen Kannoju <praveen.kannoju@oracle.com>
> > > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Santosh Shilimkar
> > > <santosh.shilimkar@oracle.com>; David S . Miller
> > > <davem@davemloft.net>; kuba@kernel.org; netdev@vger.kernel.org;
> > > linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com;
> > > linux-kernel@vger.kernel.org; Rama Nichanamatlu
> > > <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom
> > > <rajesh.sivaramasubramaniom@oracle.com>
> > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > the asynchronous workers to flush the mr pool
> > >
> > > On Thu, Jan 20, 2022 at 08:00:54AM +0000, Praveen Kannoju wrote:
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > > Sent: 19 January 2022 08:26 PM
> > > > To: Praveen Kannoju <praveen.kannoju@oracle.com>
> > > > Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>; Jason
> > > > Gunthorpe <jgg@ziepe.ca>; David S . Miller <davem@davemloft.net>;
> > > > kuba@kernel.org; netdev@vger.kernel.org;
> > > > linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com;
> > > > linux-kernel@vger.kernel.org; Rama Nichanamatlu
> > > > <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom
> > > > <rajesh.sivaramasubramaniom@oracle.com>
> > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused by
> > > > the asynchronous workers to flush the mr pool
> > > >
> > > > On Wed, Jan 19, 2022 at 11:46:16AM +0000, Praveen Kannoju wrote:
> > > > > -----Original Message-----
> > > > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > > > Sent: 19 January 2022 12:29 PM
> > > > > To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> > > > > Cc: Jason Gunthorpe <jgg@ziepe.ca>; Praveen Kannoju
> > > > > <praveen.kannoju@oracle.com>; David S . Miller
> > > > > <davem@davemloft.net>; kuba@kernel.org;
> netdev@vger.kernel.org;
> > > > > linux-rdma@vger.kernel.org; rds-devel@oss.oracle.com;
> > > > > linux-kernel@vger.kernel.org; Rama Nichanamatlu
> > > > > <rama.nichanamatlu@oracle.com>; Rajesh Sivaramasubramaniom
> > > > > <rajesh.sivaramasubramaniom@oracle.com>
> > > > > Subject: Re: [PATCH RFC] rds: ib: Reduce the contention caused
> > > > > by the asynchronous workers to flush the mr pool
> > > > >
> > > > > On Tue, Jan 18, 2022 at 07:42:54PM +0000, Santosh Shilimkar wrote:
> > > > > > On Jan 18, 2022, at 11:17 AM, Jason Gunthorpe <jgg@ziepe.ca>
> wrote:
> > > > > > >
> > > > > > > On Tue, Jan 18, 2022 at 04:48:43PM +0000, Santosh Shilimkar
> wrote:
> > > > > > >>
> > > > > > >>> On Jan 18, 2022, at 6:47 AM, Praveen Kannoju
> > > <praveen.kannoju@oracle.com> wrote:
> > > > > > >>>
> > > > > > >>> This patch aims to reduce the number of asynchronous
> > > > > > >>> workers being spawned to execute the function
> "rds_ib_flush_mr_pool"
> > > > > > >>> during the high I/O situations. Synchronous call path's to
> > > > > > >>> this
> > > function "rds_ib_flush_mr_pool"
> > > > > > >>> will be executed without being disturbed. By reducing the
> > > > > > >>> number of processes contending to flush the mr pool, the
> > > > > > >>> total number of D state processes waiting to acquire the
> > > > > > >>> mutex lock will be greatly reduced, which otherwise were
> > > > > > >>> causing DB instance crash as the corresponding processes
> > > > > > >>> were not
> > > progressing while waiting to acquire the mutex lock.
> > > > > > >>>
> > > > > > >>> Signed-off-by: Praveen Kumar Kannoju
> > > > > > >>> <praveen.kannoju@oracle.com> —
> > > > > > >>>
> > > > > > >> […]
> > > > > > >>
> > > > > > >>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index
> > > > > > >>> 8f070ee..6b640b5 100644
> > > > > > >>> +++ b/net/rds/ib_rdma.c
> > > > > > >>> @@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct
> > > rds_ib_mr_pool *pool,
> > > > > > >>> 	 */
> > > > > > >>> 	dirty_to_clean = llist_append_to_list(&pool->drop_list,
> > > &unmap_list);
> > > > > > >>> 	dirty_to_clean += llist_append_to_list(&pool->free_list,
> > > > > > >>> &unmap_list);
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, true);
> > > > > > >>> +	smp_wmb();
> > > > > > >>> 	if (free_all) {
> > > > > > >>> 		unsigned long flags;
> > > > > > >>>
> > > > > > >>> @@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct
> > > rds_ib_mr_pool *pool,
> > > > > > >>> 	atomic_sub(nfreed, &pool->item_count);
> > > > > > >>>
> > > > > > >>> out:
> > > > > > >>> +	WRITE_ONCE(pool->flush_ongoing, false);
> > > > > > >>> +	smp_wmb();
> > > > > > >>> 	mutex_unlock(&pool->flush_lock);
> > > > > > >>> 	if (waitqueue_active(&pool->flush_wait))
> > > > > > >>> 		wake_up(&pool->flush_wait); @@ -507,8 +511,17
> @@ void
> > > > > > >>> rds_ib_free_mr(void *trans_private, int
> > > > > > >>> invalidate)
> > > > > > >>>
> > > > > > >>> 	/* If we've pinned too many pages, request a flush */
> > > > > > >>> 	if (atomic_read(&pool->free_pinned) >= pool-
> > > >max_free_pinned ||
> > > > > > >>> -	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
> > > > > > >>> -		queue_delayed_work(rds_ib_mr_wq, &pool-
> > > >flush_worker, 10);
> > > > > > >>> +	    atomic_read(&pool->dirty_count) >= pool->max_items /
> > > > > > >>> +5)
> > > {
> > > > > > >>> +		smp_rmb();
> > > > > > >> You won’t need these explicit barriers since above atomic
> > > > > > >> and write once already issue them.
> > > > > > >
> > > > > > > No, they don't. Use smp_store_release() and smp_load_acquire
> > > > > > > if you want to do something like this, but I still can't
> > > > > > > quite figure out if this usage of unlocked memory accesses
> > > > > > > makes any
> > > sense at all.
> > > > > > >
> > > > > > Indeed, I see that now, thanks. Yeah, these multi variable
> > > > > > checks can indeed be racy but they are under lock at least for this
> code path.
> > > > > > But there are few hot path places where single variable states
> > > > > > are evaluated atomically instead of heavy lock.
> > > > >
> > > > > At least pool->dirty_count is not locked in rds_ib_free_mr() at all.
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Santosh
> > > > > >
> > > > >
> > > > > Thank you Santosh, Leon and Jason for reviewing the Patch.
> > > > >
> > > > > 1. Leon, the bool variable "flush_ongoing " introduced through
> > > > > the patch
> > > has to be accessed only after acquiring the mutex lock. Hence it is
> > > well protected.
> > > >
> > > > I don't see any lock in rds_ib_free_mr() function where your
> > > > perform "if
> > > (!READ_ONCE(pool->flush_ongoing)) { ...".
> > > >
> > > > >
> > > > > 2. As the commit message already conveys the reason, the check
> > > > > being
> > > made in the function "rds_ib_free_mr" is only to avoid the redundant
> > > asynchronous workers from being spawned.
> > > > >
> > > > > 3. The synchronous flush path's through the function
> > > > > "rds_free_mr" with
> > > either cookie=0 or "invalidate" flag being set, have to be honoured
> > > as per the semantics and hence these paths have to execute the
> > > function "rds_ib_flush_mr_pool" unconditionally.
> > > > >
> > > > > 4. It complicates the patch to identify, where from the function
> > > "rds_ib_flush_mr_pool", has been called. And hence, this patch uses
> > > the state of the bool variable to stop the asynchronous workers.
> > > > >
> > > > > 5. We knew that "queue_delayed_work" will ensures only one work
> > > > > is
> > > running, but avoiding these async workers during high load
> > > situations, made way for the allocation and synchronous callers
> > > which would otherwise be waiting long for the flush to happen. Great
> > > reduction in the number of calls to the function "rds_ib_flush_mr_pool"
> has been observed with this patch.
> > > >
> > > > So if you understand that there is only one work in progress, why
> > > > do you
> > > say workerS?
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > 6. Jason, the only function "rds_ib_free_mr" which accesses the
> > > introduced bool variable "flush_ongoing" to spawn a flush worker
> > > does not crucially impact the availability of MR's, because the flush
> happens from
> > > allocation path as well when necessary.   Hence the Load-store ordering is
> > > not essentially needed here, because of which we chose smp_rmb() and
> > > smp_wmb() over smp_load_acquire() and smp_store_release().
> > > > >
> > > > > Regards,
> > > > > Praveen.
> > > >
> > > >
> > > > Jason,
> > > >
> > > > 	The relaxed ordering primitives smp_rmb() and smp_wmb() ensure
> > > to provide
> > > > 	guaranteed atomic memory operations READ_ONCE and
> > > WRITE_ONCE, used in the
> > > > 	functions "rds_ib_free_mr" and "rds_ib_flush_mr_pool"
> > > correspondingly.
> > > >
> > > > 	Yes, the memory barrier primitives smp_load_acquire()and
> > > smp_store_release()
> > > > 	are even better. But, because of the simplicity of the use of
> > > > memory
> > > barrier
> > > > 	in the patch, smp_rmb() and smp_wmb() are chosen.
> > > >
> > > > 	Please let me know if you want me to switch to smp_load_acquire()
> > > and
> > > > 	smp_store_release().
> > > >
> > > > Leon,
> > > >
> > > > 	Avoiding the asynchronous worker from being spawned during the
> > > high load situations,
> > > > 	make way for both synchronous and allocation path to flush the mr
> > > pool and grab the
> > > > 	mr without waiting.
> > > >
> > > > 	Please let me know if you still have any queries with this
> > > > respect or
> > > any modifications
> > > > 	are needed.
> > >
> > Thank you for your reply, Leon.
> >
> > > I didn't get any answer on my questions.
> > > So let's me repeat them.
> > > 1. Where can I see locks in rds_ib_free_mr() that protects
> > > concurrent change of your new flush_ongoing field?
> >
> > flush_ongoing variable is only modified in the function
> "rds_ib_flush_mr_pool" under mutex lock. It is only being read atomically in
> the function "rds_ib_free_mr()", with memory barrier in place.  Hence a lock
> is not essential in this function "rds_ib_free_mr()". Depending on the value
> being read, decision is taken weather to spawn the asynchronous worker or
> not.
> 
> This is not how locking works.
> 
>  CPU1                                     CPU2
>   rds_ib_flush_mr_pool
>   lock
>                         context switch -> rds_ib_free_mr
> 			                  READ old value of flush_ongoing
> 			<- context switch
>   WRITE new value to flush_ongoing
> 
> 
> >
> > > 2. There is only one same work can be executed/scheduled, where do
> > > you see multiple/parallel workers (plural) and how is it possible?
> >
> > In my earlier comment, I have re-iterated the same point. I would take
> back my word "workerS". By avoiding this asynchronous worker to
> participate in flushing, the synchronous flush jobs (cookie=0 and
> invalidate=1) as well as allocation path worker will be acquiring the mutex
> lock in the function "rds_ib_flush_mr_pool" quickly, thereby fetching the
> MR.
> 
> This is completely different scenario from what was presented before.
> You are interested to prioritize synchronous operations over async.
> In such case, why don't you simply cancel_delayed_work() in your sync
> flows?

Thank you, Leon. 
Will verify this approach and post the updated patch for review. 

> 
> >
> > I hope I am clear.
> >
> > > BTW, please fix your email client to reply inline.
> > Fixed it. Thank you.
> > >
> > > >
> > > > Regards,
> > > > Praveen.
> >
> >
> > Regards,
> > Praveen.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool
@ 2022-01-18 13:10 Praveen Kumar Kannoju
  0 siblings, 0 replies; 19+ messages in thread
From: Praveen Kumar Kannoju @ 2022-01-18 13:10 UTC (permalink / raw)
  To: santosh.shilimkar, davem, kuba, netdev, linux-rdma, rds-devel,
	linux-kernel
  Cc: rama.nichanamatlu, rajesh.sivaramasubramaniom, Praveen Kumar Kannoju

This patch aims to reduce the number of asynchronous workers being spawned
to execute the function "rds_ib_flush_mr_pool" during the high I/O
situations. Synchronous call path's to this function "rds_ib_flush_mr_pool"
will be executed without being disturbed. By reducing the number of
processes contending to flush the mr pool, the total number of D state
processes waiting to acquire the mutex lock will be greatly reduced, which
otherwise were causing DB instance crash as the corresponding processes
were not progressing while waiting to acquire the mutex lock.

Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
---
 net/rds/ib.h       |  1 +
 net/rds/ib_mr.h    |  2 ++
 net/rds/ib_rdma.c  | 18 ++++++++++++++++--
 net/rds/ib_stats.c |  1 +
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 2ba7110..d881e3f 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -308,6 +308,7 @@ struct rds_ib_statistics {
 	uint64_t	s_ib_rdma_mr_1m_pool_flush;
 	uint64_t	s_ib_rdma_mr_1m_pool_wait;
 	uint64_t	s_ib_rdma_mr_1m_pool_depleted;
+	uint64_t	s_ib_rdma_flush_mr_pool_avoided;
 	uint64_t	s_ib_rdma_mr_8k_reused;
 	uint64_t	s_ib_rdma_mr_1m_reused;
 	uint64_t	s_ib_atomic_cswp;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index ea5e9ae..9cbec6e 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -105,6 +105,8 @@ struct rds_ib_mr_pool {
 	unsigned long		max_items_soft;
 	unsigned long		max_free_pinned;
 	unsigned int		max_pages;
+
+	bool                    flush_ongoing;	/* To avoid redundant flushes */
 };
 
 extern struct workqueue_struct *rds_ib_mr_wq;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 8f070ee..6b640b5 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -393,6 +393,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	 */
 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
+	WRITE_ONCE(pool->flush_ongoing, true);
+	smp_wmb();
 	if (free_all) {
 		unsigned long flags;
 
@@ -430,6 +432,8 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	atomic_sub(nfreed, &pool->item_count);
 
 out:
+	WRITE_ONCE(pool->flush_ongoing, false);
+	smp_wmb();
 	mutex_unlock(&pool->flush_lock);
 	if (waitqueue_active(&pool->flush_wait))
 		wake_up(&pool->flush_wait);
@@ -507,8 +511,17 @@ void rds_ib_free_mr(void *trans_private, int invalidate)
 
 	/* If we've pinned too many pages, request a flush */
 	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned ||
-	    atomic_read(&pool->dirty_count) >= pool->max_items / 5)
-		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
+	    atomic_read(&pool->dirty_count) >= pool->max_items / 5) {
+		smp_rmb();
+		if (!READ_ONCE(pool->flush_ongoing)) {
+			queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
+		} else {
+			/* This counter indicates the number of redundant
+			 * flush calls avoided, and provides an indication
+			 * of the load pattern imposed on kernel.
+			 */
+			rds_ib_stats_inc(s_ib_rdma_flush_mr_pool_avoided);
+		}
 
 	if (invalidate) {
 		if (likely(!in_interrupt())) {
@@ -670,6 +683,7 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev,
 
 	pool->max_free_pinned = pool->max_items * pool->max_pages / 4;
 	pool->max_items_soft = rds_ibdev->max_mrs * 3 / 4;
+	pool->flush_ongoing = false;
 
 	return pool;
 }
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index ac46d89..29ae5cb 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -75,6 +75,7 @@
 	"ib_rdma_mr_1m_pool_flush",
 	"ib_rdma_mr_1m_pool_wait",
 	"ib_rdma_mr_1m_pool_depleted",
+	"ib_rdma_flush_mr_pool_avoided",
 	"ib_rdma_mr_8k_reused",
 	"ib_rdma_mr_1m_reused",
 	"ib_atomic_cswp",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-01-20 12:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18 14:47 [PATCH RFC] rds: ib: Reduce the contention caused by the asynchronous workers to flush the mr pool Praveen Kumar Kannoju
2022-01-18 16:48 ` Santosh Shilimkar
2022-01-18 18:00   ` Leon Romanovsky
2022-01-18 19:17   ` Jason Gunthorpe
2022-01-18 19:42     ` Santosh Shilimkar
2022-01-19  6:59       ` Leon Romanovsky
2022-01-19 11:46         ` Praveen Kannoju
2022-01-19 13:04           ` Jason Gunthorpe
2022-01-19 13:12             ` Praveen Kannoju
2022-01-19 13:17               ` Jason Gunthorpe
2022-01-19 14:08                 ` Praveen Kannoju
2022-01-19 14:49                   ` Jason Gunthorpe
2022-01-19 14:56           ` Leon Romanovsky
2022-01-20  8:00             ` Praveen Kannoju
2022-01-20 11:11               ` Leon Romanovsky
2022-01-20 11:57                 ` Praveen Kannoju
2022-01-20 12:21                   ` Leon Romanovsky
2022-01-20 12:27                     ` Praveen Kannoju
  -- strict thread matches above, loose matches on Subject: below --
2022-01-18 13:10 Praveen Kumar Kannoju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).