RCU Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
@ 2021-01-20 16:21 Uladzislau Rezki (Sony)
  2021-01-20 16:21 ` [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu() Uladzislau Rezki (Sony)
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Uladzislau Rezki (Sony) @ 2021-01-20 16:21 UTC (permalink / raw)
  To: LKML, RCU, Paul E . McKenney, Michael Ellerman
  Cc: Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Uladzislau Rezki, Oleksiy Avramchenko

For a single argument we can directly request a page from a caller
context when a "carry page block" is run out of free spots. Instead
of hitting a slow path we can request an extra page by demand and
proceed with a fast path.

A single-argument kvfree_rcu() must be invoked in sleepable contexts,
and that its fallback is the relatively high latency synchronize_rcu().
Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
to allow limited sleeping within the memory allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e04e336bee42..2014fb22644d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 	}
 }
 
+// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
+// state specified by flags.  If can_alloc is true, the caller must
+// be schedulable and not be holding any locks or mutexes that might be
+// acquired by the memory allocator or anything that it might invoke.
+// Returns true if ptr was successfully recorded, else the caller must
+// use a fallback.
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+	unsigned long *flags, void *ptr, bool can_alloc)
 {
 	struct kvfree_rcu_bulk_data *bnode;
 	int idx;
 
-	if (unlikely(!krcp->initialized))
+	*krcp = krc_this_cpu_lock(flags);
+	if (unlikely(!(*krcp)->initialized))
 		return false;
 
-	lockdep_assert_held(&krcp->lock);
 	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
-	if (!krcp->bkvhead[idx] ||
-			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-		bnode = get_cached_bnode(krcp);
-		/* Switch to emergency path. */
+	if (!(*krcp)->bkvhead[idx] ||
+			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+		bnode = get_cached_bnode(*krcp);
+		if (!bnode && can_alloc) {
+			krc_this_cpu_unlock(*krcp, *flags);
+			bnode = (struct kvfree_rcu_bulk_data *)
+				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+			*krcp = krc_this_cpu_lock(flags);
+		}
+
 		if (!bnode)
 			return false;
 
 		/* Initialize the new block. */
 		bnode->nr_records = 0;
-		bnode->next = krcp->bkvhead[idx];
+		bnode->next = (*krcp)->bkvhead[idx];
 
 		/* Attach it to the head. */
-		krcp->bkvhead[idx] = bnode;
+		(*krcp)->bkvhead[idx] = bnode;
 	}
 
 	/* Finally insert. */
-	krcp->bkvhead[idx]->records
-		[krcp->bkvhead[idx]->nr_records++] = ptr;
+	(*krcp)->bkvhead[idx]->records
+		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
 	return true;
 }
@@ -3533,8 +3546,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		ptr = (unsigned long *) func;
 	}
 
-	krcp = krc_this_cpu_lock(&flags);
-
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
@@ -3542,12 +3553,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 			  __func__, head);
 
 		// Mark as success and leave.
-		success = true;
-		goto unlock_return;
+		return;
 	}
 
 	kasan_record_aux_stack(ptr);
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
 	if (!success) {
 		run_page_cache_worker(krcp);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
  2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
@ 2021-01-20 16:21 ` Uladzislau Rezki (Sony)
  2021-01-28 18:06   ` Uladzislau Rezki
  2021-01-20 16:21 ` [PATCH 3/3] kvfree_rcu: use migrate_disable/enable() Uladzislau Rezki (Sony)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki (Sony) @ 2021-01-20 16:21 UTC (permalink / raw)
  To: LKML, RCU, Paul E . McKenney, Michael Ellerman
  Cc: Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Uladzislau Rezki, Oleksiy Avramchenko

From: "Paul E. McKenney" <paulmck@kernel.org>

This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
carried out by the single-argument variant of kvfree_rcu(), thus avoiding
this can-sleep code path from dipping into the emergency reserves.

Acked-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2014fb22644d..454809514c91 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3491,7 +3491,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 		if (!bnode && can_alloc) {
 			krc_this_cpu_unlock(*krcp, *flags);
 			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 			*krcp = krc_this_cpu_lock(flags);
 		}
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
  2021-01-20 16:21 ` [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu() Uladzislau Rezki (Sony)
@ 2021-01-20 16:21 ` Uladzislau Rezki (Sony)
  2021-01-20 19:45   ` Sebastian Andrzej Siewior
  2021-01-23  9:31   ` 回复: " Zhang, Qiang
  2021-01-20 18:40 ` [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Paul E. McKenney
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Uladzislau Rezki (Sony) @ 2021-01-20 16:21 UTC (permalink / raw)
  To: LKML, RCU, Paul E . McKenney, Michael Ellerman
  Cc: Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Uladzislau Rezki, Oleksiy Avramchenko

Since the page is obtained in a fully preemptible context, dropping
the lock can lead to migration onto another CPU. As a result a prev.
bnode of that CPU may be underutilised, because a decision has been
made for a CPU that was run out of free slots to store a pointer.

migrate_disable/enable() are now independent of RT, use it in order
to prevent any migration during a page request for a specific CPU it
is requested for.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 454809514c91..cad36074366d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3489,10 +3489,12 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
 		bnode = get_cached_bnode(*krcp);
 		if (!bnode && can_alloc) {
+			migrate_disable();
 			krc_this_cpu_unlock(*krcp, *flags);
 			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 			*krcp = krc_this_cpu_lock(flags);
+			migrate_enable();
 		}
 
 		if (!bnode)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
  2021-01-20 16:21 ` [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu() Uladzislau Rezki (Sony)
  2021-01-20 16:21 ` [PATCH 3/3] kvfree_rcu: use migrate_disable/enable() Uladzislau Rezki (Sony)
@ 2021-01-20 18:40 ` Paul E. McKenney
  2021-01-20 19:57 ` Sebastian Andrzej Siewior
  2021-01-25 13:22 ` Michal Hocko
  4 siblings, 0 replies; 36+ messages in thread
From: Paul E. McKenney @ 2021-01-20 18:40 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 05:21:46PM +0100, Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from a caller
> context when a "carry page block" is run out of free spots. Instead
> of hitting a slow path we can request an extra page by demand and
> proceed with a fast path.
> 
> A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> and that its fallback is the relatively high latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.
> 
> [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Queued all three for review and testing, thank you!

							Thanx, Paul

> ---
>  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2014fb22644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>  	}
>  }
>  
> +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> +// state specified by flags.  If can_alloc is true, the caller must
> +// be schedulable and not be holding any locks or mutexes that might be
> +// acquired by the memory allocator or anything that it might invoke.
> +// Returns true if ptr was successfully recorded, else the caller must
> +// use a fallback.
>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> +	unsigned long *flags, void *ptr, bool can_alloc)
>  {
>  	struct kvfree_rcu_bulk_data *bnode;
>  	int idx;
>  
> -	if (unlikely(!krcp->initialized))
> +	*krcp = krc_this_cpu_lock(flags);
> +	if (unlikely(!(*krcp)->initialized))
>  		return false;
>  
> -	lockdep_assert_held(&krcp->lock);
>  	idx = !!is_vmalloc_addr(ptr);
>  
>  	/* Check if a new block is required. */
> -	if (!krcp->bkvhead[idx] ||
> -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> -		bnode = get_cached_bnode(krcp);
> -		/* Switch to emergency path. */
> +	if (!(*krcp)->bkvhead[idx] ||
> +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> +		bnode = get_cached_bnode(*krcp);
> +		if (!bnode && can_alloc) {
> +			krc_this_cpu_unlock(*krcp, *flags);
> +			bnode = (struct kvfree_rcu_bulk_data *)
> +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +			*krcp = krc_this_cpu_lock(flags);
> +		}
> +
>  		if (!bnode)
>  			return false;
>  
>  		/* Initialize the new block. */
>  		bnode->nr_records = 0;
> -		bnode->next = krcp->bkvhead[idx];
> +		bnode->next = (*krcp)->bkvhead[idx];
>  
>  		/* Attach it to the head. */
> -		krcp->bkvhead[idx] = bnode;
> +		(*krcp)->bkvhead[idx] = bnode;
>  	}
>  
>  	/* Finally insert. */
> -	krcp->bkvhead[idx]->records
> -		[krcp->bkvhead[idx]->nr_records++] = ptr;
> +	(*krcp)->bkvhead[idx]->records
> +		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
>  
>  	return true;
>  }
> @@ -3533,8 +3546,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		ptr = (unsigned long *) func;
>  	}
>  
> -	krcp = krc_this_cpu_lock(&flags);
> -
>  	// Queue the object but don't yet schedule the batch.
>  	if (debug_rcu_head_queue(ptr)) {
>  		// Probable double kfree_rcu(), just leak.
> @@ -3542,12 +3553,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  			  __func__, head);
>  
>  		// Mark as success and leave.
> -		success = true;
> -		goto unlock_return;
> +		return;
>  	}
>  
>  	kasan_record_aux_stack(ptr);
> -	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> +	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
>  	if (!success) {
>  		run_page_cache_worker(krcp);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-20 16:21 ` [PATCH 3/3] kvfree_rcu: use migrate_disable/enable() Uladzislau Rezki (Sony)
@ 2021-01-20 19:45   ` Sebastian Andrzej Siewior
  2021-01-20 21:42     ` Paul E. McKenney
  2021-01-23  9:31   ` 回复: " Zhang, Qiang
  1 sibling, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-20 19:45 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On 2021-01-20 17:21:48 [+0100], Uladzislau Rezki (Sony) wrote:
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3489,10 +3489,12 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>  			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>  		bnode = get_cached_bnode(*krcp);
>  		if (!bnode && can_alloc) {
> +			migrate_disable();
>  			krc_this_cpu_unlock(*krcp, *flags);

Why is krc_this_cpu_unlock() defined as
| static inline void
| krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
| {
|         raw_spin_unlock(&krcp->lock);
|         local_irq_restore(flags);
| }

instead of raw_spin_unlock_irqrestore()?
Should something with the locked section trigger a scheduling event by
setting TIF_NEED_RESCHED then there will be no scheduling event on
unlock. It will be delayed until a later "random" preempt_enable().

raw_spin_unlock_irqrestore() will reschedule if the flag is set,
local_irq_restore() will not.

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
                   ` (2 preceding siblings ...)
  2021-01-20 18:40 ` [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Paul E. McKenney
@ 2021-01-20 19:57 ` Sebastian Andrzej Siewior
  2021-01-20 21:54   ` Paul E. McKenney
  2021-01-21 12:38   ` Uladzislau Rezki
  2021-01-25 13:22 ` Michal Hocko
  4 siblings, 2 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-20 19:57 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from a caller
> context when a "carry page block" is run out of free spots. Instead
> of hitting a slow path we can request an extra page by demand and
> proceed with a fast path.
> 
> A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> and that its fallback is the relatively high latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.
> 
> [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2014fb22644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>  	}
>  }
>  
> +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> +// state specified by flags.  If can_alloc is true, the caller must
> +// be schedulable and not be holding any locks or mutexes that might be
> +// acquired by the memory allocator or anything that it might invoke.
> +// Returns true if ptr was successfully recorded, else the caller must
> +// use a fallback.

The whole RCU department is getting swamped by the // comments. Can't we
have proper kernel doc and /* */ style comments like the remaining part
of the kernel?

>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> +	unsigned long *flags, void *ptr, bool can_alloc)
>  {
>  	struct kvfree_rcu_bulk_data *bnode;
>  	int idx;
>  
> -	if (unlikely(!krcp->initialized))
> +	*krcp = krc_this_cpu_lock(flags);
> +	if (unlikely(!(*krcp)->initialized))
>  		return false;
>  
> -	lockdep_assert_held(&krcp->lock);
>  	idx = !!is_vmalloc_addr(ptr);
>  
>  	/* Check if a new block is required. */
> -	if (!krcp->bkvhead[idx] ||
> -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> -		bnode = get_cached_bnode(krcp);
> -		/* Switch to emergency path. */
> +	if (!(*krcp)->bkvhead[idx] ||
> +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> +		bnode = get_cached_bnode(*krcp);
> +		if (!bnode && can_alloc) {
> +			krc_this_cpu_unlock(*krcp, *flags);
> +			bnode = (struct kvfree_rcu_bulk_data *)

There is no need for this cast.

> +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +			*krcp = krc_this_cpu_lock(flags);

so if bnode is NULL you could retry get_cached_bnode() since it might
have been filled (given preemption or CPU migration changed something).
Judging from patch #3 you think that a CPU migration is a bad thing. But
why?

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-20 19:45   ` Sebastian Andrzej Siewior
@ 2021-01-20 21:42     ` Paul E. McKenney
  0 siblings, 0 replies; 36+ messages in thread
From: Paul E. McKenney @ 2021-01-20 21:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 08:45:54PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:48 [+0100], Uladzislau Rezki (Sony) wrote:
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3489,10 +3489,12 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> >  			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> >  		bnode = get_cached_bnode(*krcp);
> >  		if (!bnode && can_alloc) {
> > +			migrate_disable();
> >  			krc_this_cpu_unlock(*krcp, *flags);
> 
> Why is krc_this_cpu_unlock() defined as
> | static inline void
> | krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> | {
> |         raw_spin_unlock(&krcp->lock);
> |         local_irq_restore(flags);
> | }
> 
> instead of raw_spin_unlock_irqrestore()?
> Should something with the locked section trigger a scheduling event by
> setting TIF_NEED_RESCHED then there will be no scheduling event on
> unlock. It will be delayed until a later "random" preempt_enable().
> 
> raw_spin_unlock_irqrestore() will reschedule if the flag is set,
> local_irq_restore() will not.

Good catch, thank you!  This one is already in mainline, so I queued
the following patch.

							Thanx, Paul

------------------------------------------------------------------------

commit 6c1d51e012c5b474cda77d4fa644d76e041c1e05
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Jan 20 13:38:08 2021 -0800

    kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()
    
    The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately
    followed by a local_irq_restore().  This commit saves a line of code by
    merging them into a raw_spin_unlock_irqrestore().  This transformation
    also reduces scheduling latency because raw_spin_unlock_irqrestore()
    responds immediately to a reschedule request.  In contrast,
    local_irq_restore() does a scheduling-oblivious enabling of interrupts.
    
    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cad3607..e7a226a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3201,8 +3201,7 @@ krc_this_cpu_lock(unsigned long *flags)
 static inline void
 krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
 {
-	raw_spin_unlock(&krcp->lock);
-	local_irq_restore(flags);
+	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
 static inline struct kvfree_rcu_bulk_data *

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 19:57 ` Sebastian Andrzej Siewior
@ 2021-01-20 21:54   ` Paul E. McKenney
  2021-01-21 13:35     ` Uladzislau Rezki
  2021-01-22 11:17     ` Sebastian Andrzej Siewior
  2021-01-21 12:38   ` Uladzislau Rezki
  1 sibling, 2 replies; 36+ messages in thread
From: Paul E. McKenney @ 2021-01-20 21:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from a caller
> > context when a "carry page block" is run out of free spots. Instead
> > of hitting a slow path we can request an extra page by demand and
> > proceed with a fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and that its fallback is the relatively high latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> > 
> > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> >  1 file changed, 26 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e04e336bee42..2014fb22644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> >  	}
> >  }
> >  
> > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > +// state specified by flags.  If can_alloc is true, the caller must
> > +// be schedulable and not be holding any locks or mutexes that might be
> > +// acquired by the memory allocator or anything that it might invoke.
> > +// Returns true if ptr was successfully recorded, else the caller must
> > +// use a fallback.
> 
> The whole RCU department is getting swamped by the // comments. Can't we
> have proper kernel doc and /* */ style comments like the remaining part
> of the kernel?

Because // comments are easier to type and take up less horizontal space.
Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
kvfree_rcu(), and we don't normally docbook-ify such functions.

> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +	unsigned long *flags, void *ptr, bool can_alloc)
> >  {
> >  	struct kvfree_rcu_bulk_data *bnode;
> >  	int idx;
> >  
> > -	if (unlikely(!krcp->initialized))
> > +	*krcp = krc_this_cpu_lock(flags);
> > +	if (unlikely(!(*krcp)->initialized))
> >  		return false;
> >  
> > -	lockdep_assert_held(&krcp->lock);
> >  	idx = !!is_vmalloc_addr(ptr);
> >  
> >  	/* Check if a new block is required. */
> > -	if (!krcp->bkvhead[idx] ||
> > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > -		bnode = get_cached_bnode(krcp);
> > -		/* Switch to emergency path. */
> > +	if (!(*krcp)->bkvhead[idx] ||
> > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > +		bnode = get_cached_bnode(*krcp);
> > +		if (!bnode && can_alloc) {
> > +			krc_this_cpu_unlock(*krcp, *flags);
> > +			bnode = (struct kvfree_rcu_bulk_data *)
> 
> There is no need for this cast.

Without it, gcc version 7.5.0 says:

	warning: assignment makes pointer from integer without a cast

> > +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +			*krcp = krc_this_cpu_lock(flags);
> 
> so if bnode is NULL you could retry get_cached_bnode() since it might
> have been filled (given preemption or CPU migration changed something).
> Judging from patch #3 you think that a CPU migration is a bad thing. But
> why?

So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
it with the correct CPU.

Though now that you mention it, couldn't the following happen?

o	Task A on CPU 0 notices that allocation is needed, so it
	drops the lock disables migration, and sleeps while
	allocating.

o	Task B on CPU 0 does the same.

o	The two tasks wake up in some order, and the second one
	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
	assignment.

Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 19:57 ` Sebastian Andrzej Siewior
  2021-01-20 21:54   ` Paul E. McKenney
@ 2021-01-21 12:38   ` Uladzislau Rezki
  2021-01-22 11:34     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-21 12:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from a caller
> > context when a "carry page block" is run out of free spots. Instead
> > of hitting a slow path we can request an extra page by demand and
> > proceed with a fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and that its fallback is the relatively high latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> > 
> > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> >  1 file changed, 26 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e04e336bee42..2014fb22644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> >  	}
> >  }
> >  
> > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > +// state specified by flags.  If can_alloc is true, the caller must
> > +// be schedulable and not be holding any locks or mutexes that might be
> > +// acquired by the memory allocator or anything that it might invoke.
> > +// Returns true if ptr was successfully recorded, else the caller must
> > +// use a fallback.
> 
> The whole RCU department is getting swamped by the // comments. Can't we
> have proper kernel doc and /* */ style comments like the remaining part
> of the kernel?
> 
> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +	unsigned long *flags, void *ptr, bool can_alloc)
> >  {
> >  	struct kvfree_rcu_bulk_data *bnode;
> >  	int idx;
> >  
> > -	if (unlikely(!krcp->initialized))
> > +	*krcp = krc_this_cpu_lock(flags);
> > +	if (unlikely(!(*krcp)->initialized))
> >  		return false;
> >  
> > -	lockdep_assert_held(&krcp->lock);
> >  	idx = !!is_vmalloc_addr(ptr);
> >  
> >  	/* Check if a new block is required. */
> > -	if (!krcp->bkvhead[idx] ||
> > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > -		bnode = get_cached_bnode(krcp);
> > -		/* Switch to emergency path. */
> > +	if (!(*krcp)->bkvhead[idx] ||
> > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > +		bnode = get_cached_bnode(*krcp);
> > +		if (!bnode && can_alloc) {
> > +			krc_this_cpu_unlock(*krcp, *flags);
> > +			bnode = (struct kvfree_rcu_bulk_data *)
> 
> There is no need for this cast.
> 
__get_free_page() returns "unsigned long" whereas a bnode is a pointer
to kvfree_rcu_bulk_data struct, without a casting the compiler will
emit a warning.

> > +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +			*krcp = krc_this_cpu_lock(flags);
> 
> so if bnode is NULL you could retry get_cached_bnode() since it might
> have been filled (given preemption or CPU migration changed something).
> Judging from patch #3 you think that a CPU migration is a bad thing. But
> why?
> 
I see your point. Indeed we can retry but honestly i do not see that it
makes a lot of sense. I prefer to keep the logic as simple as it can be.
If we are run out of free pages(low memory condition), there is a fallback
mechanism for such purpose, i.e it implies that a slow path can be hit.

>>
>> You think that a CPU migration is a bad thing. But why?
>>
It is not a bad thing. But if it happens we might queue a new bnode
to a drain list of another CPU where a previous element of a new
bnode may be just underutilized. So that is why i use migrate_disable()/enable()
to prevent it.

If there are some hidden issues with migrate_disable()/enable() or you
think it is a bad idea to use it, it would be appreciated if you could
describe your view in more detail.

Thanks for the comments!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 21:54   ` Paul E. McKenney
@ 2021-01-21 13:35     ` Uladzislau Rezki
  2021-01-21 15:07       ` Paul E. McKenney
  2021-01-22 11:17     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-21 13:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Andrzej Siewior, Uladzislau Rezki (Sony),
	LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > > For a single argument we can directly request a page from a caller
> > > context when a "carry page block" is run out of free spots. Instead
> > > of hitting a slow path we can request an extra page by demand and
> > > proceed with a fast path.
> > > 
> > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > and that its fallback is the relatively high latency synchronize_rcu().
> > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > to allow limited sleeping within the memory allocator.
> > > 
> > > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > ---
> > >  kernel/rcu/tree.c | 42 ++++++++++++++++++++++++++----------------
> > >  1 file changed, 26 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index e04e336bee42..2014fb22644d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > >  	}
> > >  }
> > >  
> > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > +// state specified by flags.  If can_alloc is true, the caller must
> > > +// be schedulable and not be holding any locks or mutexes that might be
> > > +// acquired by the memory allocator or anything that it might invoke.
> > > +// Returns true if ptr was successfully recorded, else the caller must
> > > +// use a fallback.
> > 
> > The whole RCU department is getting swamped by the // comments. Can't we
> > have proper kernel doc and /* */ style comments like the remaining part
> > of the kernel?
> 
> Because // comments are easier to type and take up less horizontal space.
> Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> kvfree_rcu(), and we don't normally docbook-ify such functions.
> 
> > >  static inline bool
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > +	unsigned long *flags, void *ptr, bool can_alloc)
> > >  {
> > >  	struct kvfree_rcu_bulk_data *bnode;
> > >  	int idx;
> > >  
> > > -	if (unlikely(!krcp->initialized))
> > > +	*krcp = krc_this_cpu_lock(flags);
> > > +	if (unlikely(!(*krcp)->initialized))
> > >  		return false;
> > >  
> > > -	lockdep_assert_held(&krcp->lock);
> > >  	idx = !!is_vmalloc_addr(ptr);
> > >  
> > >  	/* Check if a new block is required. */
> > > -	if (!krcp->bkvhead[idx] ||
> > > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > -		bnode = get_cached_bnode(krcp);
> > > -		/* Switch to emergency path. */
> > > +	if (!(*krcp)->bkvhead[idx] ||
> > > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > +		bnode = get_cached_bnode(*krcp);
> > > +		if (!bnode && can_alloc) {
> > > +			krc_this_cpu_unlock(*krcp, *flags);
> > > +			bnode = (struct kvfree_rcu_bulk_data *)
> > 
> > There is no need for this cast.
> 
> Without it, gcc version 7.5.0 says:
> 
> 	warning: assignment makes pointer from integer without a cast
> 
> > > +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > +			*krcp = krc_this_cpu_lock(flags);
> > 
> > so if bnode is NULL you could retry get_cached_bnode() since it might
> > have been filled (given preemption or CPU migration changed something).
> > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > why?
> 
> So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> it with the correct CPU.
> 
> Though now that you mention it, couldn't the following happen?
> 
> o	Task A on CPU 0 notices that allocation is needed, so it
> 	drops the lock disables migration, and sleeps while
> 	allocating.
> 
> o	Task B on CPU 0 does the same.
> 
> o	The two tasks wake up in some order, and the second one
> 	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> 	assignment.
> 
> Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> 
Probably i should have mentioned your sequence you described, that two tasks
can get a page on same CPU, i was thinking about it :) Yep, it can happen
since we drop the lock and a context is fully preemptible, so another one
can trigger kvfree_rcu() ending up at the same place - entering a page
allocator.

I spent some time simulating it, but with no any luck, therefore i did not
reflect this case in the commit message, thus did no pay much attention to
such scenario.

>
> Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
>
Two woken tasks will be serialized, i.e. an assignment is protected by
the our local lock. We do krc_this_cpu_lock(flags); as a first step
right after that we do restore a migration. A migration in that case
can occur only when krc_this_cpu_unlock(*krcp, *flags); is invoked.

The scenario you described can happen, in that case a previous bnode
in the drain list can be either empty or partly utilized. But, again
i was non able to trigger such scenario.

If we should fix it, i think we can go with below "alloc_in_progress"
protection:

<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cad36074366d..95485ec7267e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
        if (!(*krcp)->bkvhead[idx] ||
                        (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
                bnode = get_cached_bnode(*krcp);
-               if (!bnode && can_alloc) {
+               if (!bnode && can_alloc && !(*krcp)->alloc_in_progress)  {
                        migrate_disable();
+
+                       /* Set it before dropping the lock. */
+                       (*krcp)->alloc_in_progress = true;
                        krc_this_cpu_unlock(*krcp, *flags);
+
                        bnode = (struct kvfree_rcu_bulk_data *)
                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
                        *krcp = krc_this_cpu_lock(flags);
+
+                       /* Clear it, the lock was taken back. */
+                       (*krcp)->alloc_in_progress = false;
                        migrate_enable();
                }
 
urezki@pc638:~/data/raid0/coding/linux-rcu.git$
<snip>

in that case a second task will follow a fallback path bypassing a page
request. I can send it as a separate patch if there are no any objections.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-21 13:35     ` Uladzislau Rezki
@ 2021-01-21 15:07       ` Paul E. McKenney
  2021-01-21 19:17         ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Paul E. McKenney @ 2021-01-21 15:07 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Sebastian Andrzej Siewior, LKML, RCU, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o, Oleksiy Avramchenko

On Thu, Jan 21, 2021 at 02:35:10PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:

[ . . . ]

> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
> > 
> > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > it with the correct CPU.
> > 
> > Though now that you mention it, couldn't the following happen?
> > 
> > o	Task A on CPU 0 notices that allocation is needed, so it
> > 	drops the lock disables migration, and sleeps while
> > 	allocating.
> > 
> > o	Task B on CPU 0 does the same.
> > 
> > o	The two tasks wake up in some order, and the second one
> > 	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > 	assignment.
> > 
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > 
> Probably i should have mentioned your sequence you described, that two tasks
> can get a page on same CPU, i was thinking about it :) Yep, it can happen
> since we drop the lock and a context is fully preemptible, so another one
> can trigger kvfree_rcu() ending up at the same place - entering a page
> allocator.
> 
> I spent some time simulating it, but with no any luck, therefore i did not
> reflect this case in the commit message, thus did no pay much attention to
> such scenario.
> 
> >
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> >
> Two woken tasks will be serialized, i.e. an assignment is protected by
> the our local lock. We do krc_this_cpu_lock(flags); as a first step
> right after that we do restore a migration. A migration in that case
> can occur only when krc_this_cpu_unlock(*krcp, *flags); is invoked.
> 
> The scenario you described can happen, in that case a previous bnode
> in the drain list can be either empty or partly utilized. But, again
> i was non able to trigger such scenario.

Ah, we did discuss this previously, and yes, the result for a very
rare race is just underutilization of a page.  With the change below,
the result of this race is instead needless use of the slowpath.

> If we should fix it, i think we can go with below "alloc_in_progress"
> protection:
> 
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index cad36074366d..95485ec7267e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>         if (!(*krcp)->bkvhead[idx] ||
>                         (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>                 bnode = get_cached_bnode(*krcp);
> -               if (!bnode && can_alloc) {
> +               if (!bnode && can_alloc && !(*krcp)->alloc_in_progress)  {
>                         migrate_disable();
> +
> +                       /* Set it before dropping the lock. */
> +                       (*krcp)->alloc_in_progress = true;
>                         krc_this_cpu_unlock(*krcp, *flags);
> +
>                         bnode = (struct kvfree_rcu_bulk_data *)
>                                 __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
>                         *krcp = krc_this_cpu_lock(flags);
> +
> +                       /* Clear it, the lock was taken back. */
> +                       (*krcp)->alloc_in_progress = false;
>                         migrate_enable();
>                 }
>  
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> <snip>
> 
> in that case a second task will follow a fallback path bypassing a page
> request. I can send it as a separate patch if there are no any objections.

I was thinking in terms of something like the following.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

static bool add_ptr_to_bulk_krc_no_space(struct kfree_rcu_cpu *krcp)
{
	return !(krcp)->bkvhead[idx] ||
	       (krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR;
}

static inline bool
add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
	unsigned long *flags, void *ptr, bool can_alloc)
{
	struct kvfree_rcu_bulk_data *bnode;
	int idx;

	*krcp = krc_this_cpu_lock(flags);
	if (unlikely(!(*krcp)->initialized))
		return false;

	idx = !!is_vmalloc_addr(ptr);

	/* Check if a new block is required. */
	if (add_ptr_to_bulk_krc_no_space(*krcp)) {
		bnode = get_cached_bnode(*krcp);
		if (!bnode && can_alloc) {
			migrate_disable();
			krc_this_cpu_unlock(*krcp, *flags);
			bnode = (struct kvfree_rcu_bulk_data *)
				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
			*krcp = krc_this_cpu_lock(flags);
			migrate_enable();
		}

		if (!bnode && add_ptr_to_bulk_krc_no_space(*krcp)) {
			return false;
		} else if (bnode && add_ptr_to_bulk_krc_no_space(*krcp))
			/* Initialize the new block. */
			bnode->nr_records = 0;
			bnode->next = (*krcp)->bkvhead[idx];

			/* Attach it to the head. */
			(*krcp)->bkvhead[idx] = bnode;
		} else if (bnode) {
			// Or attempt to add it to the cache?
			free_page((unsigned long)bnode);
		}
	}

	/* Finally insert. */
	(*krcp)->bkvhead[idx]->records
		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;

	return true;
}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-21 15:07       ` Paul E. McKenney
@ 2021-01-21 19:17         ` Uladzislau Rezki
  0 siblings, 0 replies; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-21 19:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Sebastian Andrzej Siewior, LKML, RCU,
	Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Thu, Jan 21, 2021 at 07:07:40AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 21, 2021 at 02:35:10PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> 
> [ . . . ]
> 
> > > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > > have been filled (given preemption or CPU migration changed something).
> > > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > > why?
> > > 
> > > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > > it with the correct CPU.
> > > 
> > > Though now that you mention it, couldn't the following happen?
> > > 
> > > o	Task A on CPU 0 notices that allocation is needed, so it
> > > 	drops the lock disables migration, and sleeps while
> > > 	allocating.
> > > 
> > > o	Task B on CPU 0 does the same.
> > > 
> > > o	The two tasks wake up in some order, and the second one
> > > 	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > > 	assignment.
> > > 
> > > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > > 
> > Probably i should have mentioned your sequence you described, that two tasks
> > can get a page on same CPU, i was thinking about it :) Yep, it can happen
> > since we drop the lock and a context is fully preemptible, so another one
> > can trigger kvfree_rcu() ending up at the same place - entering a page
> > allocator.
> > 
> > I spent some time simulating it, but with no any luck, therefore i did not
> > reflect this case in the commit message, thus did no pay much attention to
> > such scenario.
> > 
> > >
> > > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > >
> > Two woken tasks will be serialized, i.e. an assignment is protected by
> > the our local lock. We do krc_this_cpu_lock(flags); as a first step
> > right after that we do restore a migration. A migration in that case
> > can occur only when krc_this_cpu_unlock(*krcp, *flags); is invoked.
> > 
> > The scenario you described can happen, in that case a previous bnode
> > in the drain list can be either empty or partly utilized. But, again
> > i was non able to trigger such scenario.
> 
> Ah, we did discuss this previously, and yes, the result for a very
> rare race is just underutilization of a page.  With the change below,
> the result of this race is instead needless use of the slowpath.
> 
> > If we should fix it, i think we can go with below "alloc_in_progress"
> > protection:
> > 
> > <snip>
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index cad36074366d..95485ec7267e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> >         if (!(*krcp)->bkvhead[idx] ||
> >                         (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> >                 bnode = get_cached_bnode(*krcp);
> > -               if (!bnode && can_alloc) {
> > +               if (!bnode && can_alloc && !(*krcp)->alloc_in_progress)  {
> >                         migrate_disable();
> > +
> > +                       /* Set it before dropping the lock. */
> > +                       (*krcp)->alloc_in_progress = true;
> >                         krc_this_cpu_unlock(*krcp, *flags);
> > +
> >                         bnode = (struct kvfree_rcu_bulk_data *)
> >                                 __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >                         *krcp = krc_this_cpu_lock(flags);
> > +
> > +                       /* Clear it, the lock was taken back. */
> > +                       (*krcp)->alloc_in_progress = false;
> >                         migrate_enable();
> >                 }
> >  
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> > <snip>
> > 
> > in that case a second task will follow a fallback path bypassing a page
> > request. I can send it as a separate patch if there are no any objections.
> 
> I was thinking in terms of something like the following.  Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> static bool add_ptr_to_bulk_krc_no_space(struct kfree_rcu_cpu *krcp)
> {
> 	return !(krcp)->bkvhead[idx] ||
> 	       (krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR;
> }
>
Agree we should have such wrapper. So the code becomes more readable and
simpler.

> 
> static inline bool
> add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> 	unsigned long *flags, void *ptr, bool can_alloc)
> {
> 	struct kvfree_rcu_bulk_data *bnode;
> 	int idx;
> 
> 	*krcp = krc_this_cpu_lock(flags);
> 	if (unlikely(!(*krcp)->initialized))
> 		return false;
> 
> 	idx = !!is_vmalloc_addr(ptr);
> 
> 	/* Check if a new block is required. */
> 	if (add_ptr_to_bulk_krc_no_space(*krcp)) {
> 		bnode = get_cached_bnode(*krcp);
> 		if (!bnode && can_alloc) {
> 			migrate_disable();
> 			krc_this_cpu_unlock(*krcp, *flags);
> 			bnode = (struct kvfree_rcu_bulk_data *)
> 				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> 			*krcp = krc_this_cpu_lock(flags);
> 			migrate_enable();
> 		}
> 
> 		if (!bnode && add_ptr_to_bulk_krc_no_space(*krcp)) {
> 			return false;
> 		} else if (bnode && add_ptr_to_bulk_krc_no_space(*krcp))
> 			/* Initialize the new block. */
> 			bnode->nr_records = 0;
> 			bnode->next = (*krcp)->bkvhead[idx];
> 
> 			/* Attach it to the head. */
> 			(*krcp)->bkvhead[idx] = bnode;
> 		} else if (bnode) {
> 			// Or attempt to add it to the cache?
> 			free_page((unsigned long)bnode);
> 		}
> 	}
> 
> 	/* Finally insert. */
> 	(*krcp)->bkvhead[idx]->records
> 		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
> 
> 	return true;
> }
I see your point. But i do not see how it solves double/more entering to page
allocator by two tasks or maybe more :)

Yep, comparing with the flag i proposed, this approach will not likely hit a
slow path in pretty rare case, from the other hand we need to do something with
an extra page. We can not simply free it in a current context. We should at
least drop the lock again and then free.

Adding to the cache will require an extra decay logic. The simplest scenario
is to attach that extra block to the drain list. If we attach or free
the behaviour becomes almost the same as the patch #3 - kvfree_rcu: use migrate_disable/enable()

Thoughts?

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 21:54   ` Paul E. McKenney
  2021-01-21 13:35     ` Uladzislau Rezki
@ 2021-01-22 11:17     ` Sebastian Andrzej Siewior
  2021-01-22 15:28       ` Paul E. McKenney
  1 sibling, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-22 11:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On 2021-01-20 13:54:03 [-0800], Paul E. McKenney wrote:
> > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > +// state specified by flags.  If can_alloc is true, the caller must
> > > +// be schedulable and not be holding any locks or mutexes that might be
> > > +// acquired by the memory allocator or anything that it might invoke.
> > > +// Returns true if ptr was successfully recorded, else the caller must
> > > +// use a fallback.
> > 
> > The whole RCU department is getting swamped by the // comments. Can't we
> > have proper kernel doc and /* */ style comments like the remaining part
> > of the kernel?
> 
> Because // comments are easier to type and take up less horizontal space.

As for the typing I could try to sell you 
  ab // /*

for your .vimrc and then //<enter> would become /* ;) As for the
horizontal space, I don't have currently anything in my shop. I'm sorry.

> Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> kvfree_rcu(), and we don't normally docbook-ify such functions.

I didn't mean to promote using docbook to use every. For instance if you
look at kernel/trace/trace.c, there are no // comments around, just /*
style, even for things like tracing_selftest_running.

Basically I was curious if I could learn where this // is coming and if
I could stop it.

> > >  static inline bool
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > +	unsigned long *flags, void *ptr, bool can_alloc)
> > >  {
> > >  	struct kvfree_rcu_bulk_data *bnode;
> > >  	int idx;
> > >  
> > > -	if (unlikely(!krcp->initialized))
> > > +	*krcp = krc_this_cpu_lock(flags);
> > > +	if (unlikely(!(*krcp)->initialized))
> > >  		return false;
> > >  
> > > -	lockdep_assert_held(&krcp->lock);
> > >  	idx = !!is_vmalloc_addr(ptr);
> > >  
> > >  	/* Check if a new block is required. */
> > > -	if (!krcp->bkvhead[idx] ||
> > > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > -		bnode = get_cached_bnode(krcp);
> > > -		/* Switch to emergency path. */
> > > +	if (!(*krcp)->bkvhead[idx] ||
> > > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > +		bnode = get_cached_bnode(*krcp);
> > > +		if (!bnode && can_alloc) {
> > > +			krc_this_cpu_unlock(*krcp, *flags);
> > > +			bnode = (struct kvfree_rcu_bulk_data *)
> > 
> > There is no need for this cast.
> 
> Without it, gcc version 7.5.0 says:
> 
> 	warning: assignment makes pointer from integer without a cast
> 

I'm sorry. I forgot the part where __get_free_page() does not return
(void *).
But maybe it should given that free_pages() casts that long back to
(void *) and __get_free_pages() -> page_address() returns (void *)
which is then casted long.

> > > +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > +			*krcp = krc_this_cpu_lock(flags);
> > 
> > so if bnode is NULL you could retry get_cached_bnode() since it might
> > have been filled (given preemption or CPU migration changed something).
> > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > why?
> 
> So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> it with the correct CPU.
> 
> Though now that you mention it, couldn't the following happen?
> 
> o	Task A on CPU 0 notices that allocation is needed, so it
> 	drops the lock disables migration, and sleeps while
> 	allocating.
> 
> o	Task B on CPU 0 does the same.
> 
> o	The two tasks wake up in some order, and the second one
> 	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> 	assignment.

Yes it could, good point.
I would really recommend using migrate_disable() at a minimum and only
if it is really needed. It is more expensive than preempt_disable() and
it isn't exactly good in terms of scheduling since the task is run able
but restricted to a specific CPU.
If it is unavoidable it is unavoidable but in this case I wouldn't use
migrate_disable() but re-evaluate the situation after the allocation.

> Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> 
> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-21 12:38   ` Uladzislau Rezki
@ 2021-01-22 11:34     ` Sebastian Andrzej Siewior
  2021-01-22 14:21       ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-22 11:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On 2021-01-21 13:38:34 [+0100], Uladzislau Rezki wrote:
> __get_free_page() returns "unsigned long" whereas a bnode is a pointer
> to kvfree_rcu_bulk_data struct, without a casting the compiler will
> emit a warning.

Yes, learned about it, sorry.

> >> You think that a CPU migration is a bad thing. But why?
> >>
> It is not a bad thing. But if it happens we might queue a new bnode
> to a drain list of another CPU where a previous element of a new
> bnode may be just underutilized. So that is why i use migrate_disable()/enable()
> to prevent it.

If you allocate a node for queueing and then you realize that you
already have one then you either free it or queue it for later.
Given that all this is a slower-path and that this is used on every-CPU,
sooner or later that page will be used avoids a later allocation, right?

> If there are some hidden issues with migrate_disable()/enable() or you
> think it is a bad idea to use it, it would be appreciated if you could
> describe your view in more detail.

Just what I mentioned in my previous email:
- the whole operation isn't cheap but it is more efficient in tree
  compared to what we used to have in RT.
- the task can no be freely placed while it could run. So if the task
  gets preempted, it will have to wait until it can run on the CPU
  again.

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-22 11:34     ` Sebastian Andrzej Siewior
@ 2021-01-22 14:21       ` Uladzislau Rezki
  0 siblings, 0 replies; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-22 14:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Uladzislau Rezki, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o, Oleksiy Avramchenko

> On 2021-01-21 13:38:34 [+0100], Uladzislau Rezki wrote:
> > __get_free_page() returns "unsigned long" whereas a bnode is a pointer
> > to kvfree_rcu_bulk_data struct, without a casting the compiler will
> > emit a warning.
> 
> Yes, learned about it, sorry.
> 
> > >> You think that a CPU migration is a bad thing. But why?
> > >>
> > It is not a bad thing. But if it happens we might queue a new bnode
> > to a drain list of another CPU where a previous element of a new
> > bnode may be just underutilized. So that is why i use migrate_disable()/enable()
> > to prevent it.
> 
> If you allocate a node for queueing and then you realize that you
> already have one then you either free it or queue it for later.
> Given that all this is a slower-path and that this is used on every-CPU,
> sooner or later that page will be used avoids a later allocation, right?
> 
To free it right away is a bit problematic, because we need at least one more
time to drop the lock what would introduce more complexity. Adding to the cache 
for later reuse is possible but would require an extra decay cache logic.

I think, since it is a corner case and i do not consider it as a big issue,
we can just queue it to the drain list so the previous node can be half filled
due to migration.

> > If there are some hidden issues with migrate_disable()/enable() or you
> > think it is a bad idea to use it, it would be appreciated if you could
> > describe your view in more detail.
> 
> Just what I mentioned in my previous email:
> - the whole operation isn't cheap but it is more efficient in tree
>   compared to what we used to have in RT.
> - the task can no be freely placed while it could run. So if the task
>   gets preempted, it will have to wait until it can run on the CPU
>   again.
> 
Yep, that is obvious. From scheduling point of view the task can wait more
time because the migration is prohibited. From the other hand, the page is
obtained in the path that is not considered as a hot one. One page can serve 
~500 pointers.

I do not say that we should keep: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

without it, a migration can occur, what is really rare according to my tests
therefore i do not have a strong opinion about keeping it. If you or someone
else has some concern we can drop it.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-22 11:17     ` Sebastian Andrzej Siewior
@ 2021-01-22 15:28       ` Paul E. McKenney
  0 siblings, 0 replies; 36+ messages in thread
From: Paul E. McKenney @ 2021-01-22 15:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Michael Ellerman, Andrew Morton, Daniel Axtens,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Oleksiy Avramchenko

On Fri, Jan 22, 2021 at 12:17:33PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 13:54:03 [-0800], Paul E. McKenney wrote:
> > > > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > > > +// state specified by flags.  If can_alloc is true, the caller must
> > > > +// be schedulable and not be holding any locks or mutexes that might be
> > > > +// acquired by the memory allocator or anything that it might invoke.
> > > > +// Returns true if ptr was successfully recorded, else the caller must
> > > > +// use a fallback.
> > > 
> > > The whole RCU department is getting swamped by the // comments. Can't we
> > > have proper kernel doc and /* */ style comments like the remaining part
> > > of the kernel?
> > 
> > Because // comments are easier to type and take up less horizontal space.
> 
> As for the typing I could try to sell you 
>   ab // /*
> 
> for your .vimrc and then //<enter> would become /* ;) As for the
> horizontal space, I don't have currently anything in my shop. I'm sorry.

;-)

> > Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> > kvfree_rcu(), and we don't normally docbook-ify such functions.
> 
> I didn't mean to promote using docbook to use every. For instance if you
> look at kernel/trace/trace.c, there are no // comments around, just /*
> style, even for things like tracing_selftest_running.
> 
> Basically I was curious if I could learn where this // is coming and if
> I could stop it.

Because they are now allowed and because they make my life easier as
noted above.  Also in-function comment blocks are either one line or two
lines shorter.  Yeah, they look strange at first, but it is not that hard
to get used to them.  After all, I did manage to get used to the /* */
comment style shortly after first encountering it.  ;-)

> > > >  static inline bool
> > > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > > +	unsigned long *flags, void *ptr, bool can_alloc)
> > > >  {
> > > >  	struct kvfree_rcu_bulk_data *bnode;
> > > >  	int idx;
> > > >  
> > > > -	if (unlikely(!krcp->initialized))
> > > > +	*krcp = krc_this_cpu_lock(flags);
> > > > +	if (unlikely(!(*krcp)->initialized))
> > > >  		return false;
> > > >  
> > > > -	lockdep_assert_held(&krcp->lock);
> > > >  	idx = !!is_vmalloc_addr(ptr);
> > > >  
> > > >  	/* Check if a new block is required. */
> > > > -	if (!krcp->bkvhead[idx] ||
> > > > -			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > > -		bnode = get_cached_bnode(krcp);
> > > > -		/* Switch to emergency path. */
> > > > +	if (!(*krcp)->bkvhead[idx] ||
> > > > +			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> > > > +		bnode = get_cached_bnode(*krcp);
> > > > +		if (!bnode && can_alloc) {
> > > > +			krc_this_cpu_unlock(*krcp, *flags);
> > > > +			bnode = (struct kvfree_rcu_bulk_data *)
> > > 
> > > There is no need for this cast.
> > 
> > Without it, gcc version 7.5.0 says:
> > 
> > 	warning: assignment makes pointer from integer without a cast
> > 
> 
> I'm sorry. I forgot the part where __get_free_page() does not return
> (void *).
> But maybe it should given that free_pages() casts that long back to
> (void *) and __get_free_pages() -> page_address() returns (void *)
> which is then casted long.

No argument here.  Then again, I am not the one in need of convincing.

There are use cases like this from pte_alloc_one_kernel():

	unsigned long page = __get_free_page(GFP_DMA);

But a quick look indicates that they are in the minority.

> > > > +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > > +			*krcp = krc_this_cpu_lock(flags);
> > > 
> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
> > 
> > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > it with the correct CPU.
> > 
> > Though now that you mention it, couldn't the following happen?
> > 
> > o	Task A on CPU 0 notices that allocation is needed, so it
> > 	drops the lock disables migration, and sleeps while
> > 	allocating.
> > 
> > o	Task B on CPU 0 does the same.
> > 
> > o	The two tasks wake up in some order, and the second one
> > 	causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > 	assignment.
> 
> Yes it could, good point.
> I would really recommend using migrate_disable() at a minimum and only
> if it is really needed. It is more expensive than preempt_disable() and
> it isn't exactly good in terms of scheduling since the task is run able
> but restricted to a specific CPU.
> If it is unavoidable it is unavoidable but in this case I wouldn't use
> migrate_disable() but re-evaluate the situation after the allocation.

I could imagine the following alternatives:

o	Acquire the old CPU's lock despite having been migrated.
	If the above race happened, put the extra page in the
	per-CPU cache.  As Uladzislau notes, this would require
	some sort of periodic cleanup that would be good to avoid.

o	As now, which can result in an unfilled page, though only
	in an uncommon race condition.  (Uladzislau convinced me
	that this was a good approach some months ago, and the
	fact that he cannot make it happen easily does add some
	weight to his argument.)

o	Use migrate_disable().

Other ideas?

							Thanx, Paul

> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > 
> > 							Thanx, Paul
> 
> Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-20 16:21 ` [PATCH 3/3] kvfree_rcu: use migrate_disable/enable() Uladzislau Rezki (Sony)
  2021-01-20 19:45   ` Sebastian Andrzej Siewior
@ 2021-01-23  9:31   ` Zhang, Qiang
  2021-01-24 21:57     ` Uladzislau Rezki
  1 sibling, 1 reply; 36+ messages in thread
From: Zhang, Qiang @ 2021-01-23  9:31 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony), LKML, RCU, Paul E . McKenney, Michael Ellerman
  Cc: Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Oleksiy Avramchenko



>________________________________________
>发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
>发送时间: 2021年1月21日 0:21
>收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
>抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
>主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
>
>Since the page is obtained in a fully preemptible context, dropping
>the lock can lead to migration onto another CPU. As a result a prev.
>bnode of that CPU may be underutilised, because a decision has been
>made for a CPU that was run out of free slots to store a pointer.
>
>migrate_disable/enable() are now independent of RT, use it in order
>to prevent any migration during a page request for a specific CPU it
>is requested for.


Hello Rezki

The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.  
There is such a description in preempt.h 


* Notes on the implementation.
 *
 * The implementation is particularly tricky since existing code patterns
 * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
 * This means that it cannot use cpus_read_lock() to serialize against hotplug,
 * nor can it easily migrate itself into a pending affinity mask change on
 * migrate_enable().


How about the following changes:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e7a226abff0d..2aa19537ac7c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3488,12 +3488,10 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
                        (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
                bnode = get_cached_bnode(*krcp);
                if (!bnode && can_alloc) {
-                       migrate_disable();
                        krc_this_cpu_unlock(*krcp, *flags);
                        bnode = (struct kvfree_rcu_bulk_data *)
                                __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
-                       *krcp = krc_this_cpu_lock(flags);
-                       migrate_enable();
+                       raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
                }
 
                if (!bnode)


Thanks
Qiang



>
>Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>---
> kernel/rcu/tree.c | 2 ++
>1 file changed, 2 insertions(+)
>
>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>index 454809514c91..cad36074366d 100644
>--- a/kernel/rcu/tree.c
>+++ b/kernel/rcu/tree.c
>@@ -3489,10 +3489,12 @@ add_ptr_to_bulk_krc_lock(struct >kfree_rcu_cpu **krcp,
>                        (*krcp)->bkvhead[idx]->nr_records == >KVFREE_BULK_MAX_ENTR) {
>                bnode = get_cached_bnode(*krcp);
>                if (!bnode && can_alloc) {
>+                       migrate_disable();
>                        krc_this_cpu_unlock(*krcp, *flags);
>                        bnode = (struct kvfree_rcu_bulk_data *)
>                                __get_free_page(GFP_KERNEL | >__GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
>                       *krcp = krc_this_cpu_lock(flags);
>+                       migrate_enable();
>                }
>
>                if (!bnode)
>--
>2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-23  9:31   ` 回复: " Zhang, Qiang
@ 2021-01-24 21:57     ` Uladzislau Rezki
  2021-01-25  1:50       ` 回复: " Zhang, Qiang
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-24 21:57 UTC (permalink / raw)
  To: Zhang, Qiang
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

Hello, Zhang.

> >________________________________________
> >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >发送时间: 2021年1月21日 0:21
> >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> >
> >Since the page is obtained in a fully preemptible context, dropping
> >the lock can lead to migration onto another CPU. As a result a prev.
> >bnode of that CPU may be underutilised, because a decision has been
> >made for a CPU that was run out of free slots to store a pointer.
> >
> >migrate_disable/enable() are now independent of RT, use it in order
> >to prevent any migration during a page request for a specific CPU it
> >is requested for.
> 
> 
> Hello Rezki
> 
> The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.  
> There is such a description in preempt.h 
> 
> 
> * Notes on the implementation.
>  *
>  * The implementation is particularly tricky since existing code patterns
>  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
>  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
>  * nor can it easily migrate itself into a pending affinity mask change on
>  * migrate_enable().
> 
How i interpret it is migrate_enable()/migrate_disable() are not allowed to
use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
current context as non-migratable.

void migrate_disable(void)
{
 struct task_struct *p = current;

 if (p->migration_disabled) {
  p->migration_disabled++;
  return;
 }

 preempt_disable();
 this_rq()->nr_pinned++;
 p->migration_disabled = 1;
 preempt_enable();
}

It does nothing that prevents you from doing schedule() or even wait for any
event(mutex slow path behaviour), when the process is removed from the run-queue.
I mean after the migrate_disable() is invoked. Or i miss something?

>
> How about the following changes:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e7a226abff0d..2aa19537ac7c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,10 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>                         (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>                 bnode = get_cached_bnode(*krcp);
>                 if (!bnode && can_alloc) {
> -                       migrate_disable();
>                         krc_this_cpu_unlock(*krcp, *flags);
>                         bnode = (struct kvfree_rcu_bulk_data *)
>                                 __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -                       *krcp = krc_this_cpu_lock(flags);
> -                       migrate_enable();
> +                       raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
>
Hm.. Taking the former lock can lead to a pointer leaking, i mean a CPU associated
with "krcp" might go offline during a page request process, so a queuing occurs on
off-lined CPU. Apat of that, acquiring a former lock still does not solve:

- CPU1 in process of page allocation;
- CPU1 gets migrated to CPU2;
- another task running on CPU1 also allocate a page;
- both bnodes are added to krcp associated with CPU1.

I agree that such scenario probably will never happen or i would say, can be
considered as a corner case. We can drop the:

[PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

and live with: an allocated bnode can be queued to another CPU, so its prev.
"bnode" can be underutilized. What is also can be considered as a corner case.
According to my tests, it is hard to achieve:

Running kvfree_rcu() simultaneously in a tight loop, 1 000 000 allocations/freeing:

- 64 CPUs and 64 threads showed 1 migration;
- 64 CPUs and 128 threads showed 0 migrations;
- 64 CPUs and 32 threads showed 0 migration. 

Thoughts?

Thank you for your comments!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-24 21:57     ` Uladzislau Rezki
@ 2021-01-25  1:50       ` Zhang, Qiang
  2021-01-25  2:18         ` Zhang, Qiang
  0 siblings, 1 reply; 36+ messages in thread
From: Zhang, Qiang @ 2021-01-25  1:50 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko



________________________________________
发件人: Uladzislau Rezki <urezki@gmail.com>
发送时间: 2021年1月25日 5:57
收件人: Zhang, Qiang
抄送: Uladzislau Rezki (Sony); LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Oleksiy Avramchenko
主题: Re: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

>Hello, Zhang.

> >________________________________________
> >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >发送时间: 2021年1月21日 0:21
> >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> >
> >Since the page is obtained in a fully preemptible context, dropping
> >the lock can lead to migration onto another CPU. As a result a prev.
> >bnode of that CPU may be underutilised, because a decision has been
> >made for a CPU that was run out of free slots to store a pointer.
> >
> >migrate_disable/enable() are now independent of RT, use it in order
> >to prevent any migration during a page request for a specific CPU it
> >is requested for.
>
>
> Hello Rezki
>
> The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.
> There is such a description in preempt.h
>
>
> * Notes on the implementation.
>  *
>  * The implementation is particularly tricky since existing code patterns
>  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
>  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
>  * nor can it easily migrate itself into a pending affinity mask change on
>  * migrate_enable().
>
>How i interpret it is migrate_enable()/migrate_disable() are not allowed to
>use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
>current context as non-migratable.
>
>void migrate_disable(void)
>{
> struct task_struct *p = current;
>
> if (p->migration_disabled) {
>  p->migration_disabled++;
>  return;
> }

> preempt_disable();
> this_rq()->nr_pinned++;
> p->migration_disabled = 1;
> preempt_enable();
>}
>
>It does nothing that prevents you from doing schedule() or even wait for any
>event(mutex slow path behaviour), when the process is removed from the run-queue.
>I mean after the migrate_disable() is invoked. Or i miss something?

Hello Rezki

Sorry, there's something wrong with the previous description. 
There are the following scenarios
 
Due to migrate_disable will increase rq's nr_pinned, after that
if get_free_page be blocked, and this time, CPU going offline,  
the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
and be blocked.

sched_cpu_wait_empty()
{

       rcuwait_wait_event(&rq->hotplug_wait,
                           rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
                           TASK_UNINTERRUPTIBLE);
}



>
> How about the following changes:
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e7a226abff0d..2aa19537ac7c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,10 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>                         (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>                 bnode = get_cached_bnode(*krcp);
>                 if (!bnode && can_alloc) {
> -                       migrate_disable();
>                         krc_this_cpu_unlock(*krcp, *flags);
>                         bnode = (struct kvfree_rcu_bulk_data *)
>                                 __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -                       *krcp = krc_this_cpu_lock(flags);
> -                       migrate_enable();
> +                       raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
>
Hm.. Taking the former lock can lead to a pointer leaking, i mean a CPU associated
with "krcp" might go offline during a page request process, so a queuing occurs on
off-lined CPU. Apat of that, acquiring a former lock still does not solve:

- CPU1 in process of page allocation;
- CPU1 gets migrated to CPU2;
- another task running on CPU1 also allocate a page;
- both bnodes are added to krcp associated with CPU1.

I agree that such scenario probably will never happen or i would say, can be
considered as a corner case. We can drop the:

[PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

and live with: an allocated bnode can be queued to another CPU, so its prev.
"bnode" can be underutilized. What is also can be considered as a corner case.
According to my tests, it is hard to achieve:

Running kvfree_rcu() simultaneously in a tight loop, 1 000 000 allocations/freeing:

- 64 CPUs and 64 threads showed 1 migration;
- 64 CPUs and 128 threads showed 0 migrations;
- 64 CPUs and 32 threads showed 0 migration.

Thoughts?

Thank you for your comments!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-25  1:50       ` 回复: " Zhang, Qiang
@ 2021-01-25  2:18         ` Zhang, Qiang
  2021-01-25 13:49           ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Zhang, Qiang @ 2021-01-25  2:18 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko


________________________________________
发件人: Uladzislau Rezki <urezki@gmail.com>
发送时间: 2021年1月25日 5:57
收件人: Zhang, Qiang
抄送: Uladzislau Rezki (Sony); LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Oleksiy Avramchenko
主题: Re: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

>Hello, Zhang.

> >________________________________________
> >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >发送时间: 2021年1月21日 0:21
> >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> >
> >Since the page is obtained in a fully preemptible context, dropping
> >the lock can lead to migration onto another CPU. As a result a prev.
> >bnode of that CPU may be underutilised, because a decision has been
> >made for a CPU that was run out of free slots to store a pointer.
> >
> >migrate_disable/enable() are now independent of RT, use it in order
> >to prevent any migration during a page request for a specific CPU it
> >is requested for.
>
>
> Hello Rezki
>
> The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.
> There is such a description in preempt.h
>
>
> * Notes on the implementation.
>  *
>  * The implementation is particularly tricky since existing code patterns
>  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
>  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
>  * nor can it easily migrate itself into a pending affinity mask change on
>  * migrate_enable().
>
>How i interpret it is migrate_enable()/migrate_disable() are not allowed to
>use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
>current context as non-migratable.
>
>void migrate_disable(void)
>{
> struct task_struct *p = current;
>
> if (p->migration_disabled) {
>  p->migration_disabled++;
>  return;
> }

> preempt_disable();
> this_rq()->nr_pinned++;
> p->migration_disabled = 1;
> preempt_enable();
>}
>
>It does nothing that prevents you from doing schedule() or even wait for any
>event(mutex slow path behaviour), when the process is removed from the run-queue.
>I mean after the migrate_disable() is invoked. Or i miss something?

Hello Rezki

Sorry, there's something wrong with the previous description.
There are the following scenarios

Due to migrate_disable will increase  this_rq()->nr_pinned , after that
if get_free_page be blocked, and this time, CPU going offline,
the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
and be blocked.

blocked:
sched_cpu_wait_empty()
{
      struct rq *rq = this_rq();
       rcuwait_wait_event(&rq->hotplug_wait,
                           rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
                           TASK_UNINTERRUPTIBLE);
}
wakeup:
balance_push()
{
        if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
              
                if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
                    rcuwait_active(&rq->hotplug_wait)) {
                        raw_spin_unlock(&rq->lock);
                        rcuwait_wake_up(&rq->hotplug_wait);
                        raw_spin_lock(&rq->lock);
                }
                return;
        }
}

One of the conditions for this function to wake up is "rq->nr_pinned  == 0"
that is to say between migrate_disable/enable, if blocked will defect CPU going
offline longer blocking time.

I'm not sure that's a problem,and I didn't find it in the kernel code  between 
 migrate_disable/enable possible sleep calls.

>
> How about the following changes:
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e7a226abff0d..2aa19537ac7c 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,10 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>                         (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>                 bnode = get_cached_bnode(*krcp);
>                 if (!bnode && can_alloc) {
> -                       migrate_disable();
>                         krc_this_cpu_unlock(*krcp, *flags);
>                         bnode = (struct kvfree_rcu_bulk_data *)
>                                 __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -                       *krcp = krc_this_cpu_lock(flags);
> -                       migrate_enable();
> +                       raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
>
>Hm.. Taking the former lock can lead to a pointer leaking, i mean a CPU associated
>with "krcp" might go offline during a page request process, so a queuing occurs on
>off-lined CPU. Apat of that, acquiring a former lock still does not solve:
 
I agree with you here

>- CPU1 in process of page allocation;
>- CPU1 gets migrated to CPU2;
>- another task running on CPU1 also allocate a page;
>- both bnodes are added to krcp associated with CPU1.
>
>I agree that such scenario probably will never happen or i would say, can be
>considered as a corner case. We can drop the:
>[PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

>and live with: an allocated bnode can be queued to another CPU, so its prev.
>"bnode" can be underutilized. What is also can be considered as a corner case.
>According to my tests, it is hard to achieve:

>Running kvfree_rcu() simultaneously in a tight loop, 1 000 000 allocations/freeing:
>
>- 64 CPUs and 64 threads showed 1 migration;
>- 64 CPUs and 128 threads showed 0 migrations;
>- 64 CPUs and 32 threads showed 0 migration.

>Thoughts?
>
>Thank you for your comments!

Maybe migrate_disable/enable() can be removed

Thanks
Qiang
>--
>Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
                   ` (3 preceding siblings ...)
  2021-01-20 19:57 ` Sebastian Andrzej Siewior
@ 2021-01-25 13:22 ` Michal Hocko
  2021-01-25 14:31   ` Uladzislau Rezki
  4 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2021-01-25 13:22 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from a caller
> context when a "carry page block" is run out of free spots. Instead
> of hitting a slow path we can request an extra page by demand and
> proceed with a fast path.
> 
> A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> and that its fallback is the relatively high latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.

__GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
way to allocate without triggering the OOM killer. Is this really what
you need/want? Is __GFP_NORETRY too weak?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-25  2:18         ` Zhang, Qiang
@ 2021-01-25 13:49           ` Uladzislau Rezki
  2021-01-26  9:33             ` 回复: " Zhang, Qiang
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-25 13:49 UTC (permalink / raw)
  To: Zhang, Qiang
  Cc: Uladzislau Rezki, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Oleksiy Avramchenko

> 
> ________________________________________
> 发件人: Uladzislau Rezki <urezki@gmail.com>
> 发送时间: 2021年1月25日 5:57
> 收件人: Zhang, Qiang
> 抄送: Uladzislau Rezki (Sony); LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Oleksiy Avramchenko
> 主题: Re: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> 
> >Hello, Zhang.
> 
> > >________________________________________
> > >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > >发送时间: 2021年1月21日 0:21
> > >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> > >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> > >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> > >
> > >Since the page is obtained in a fully preemptible context, dropping
> > >the lock can lead to migration onto another CPU. As a result a prev.
> > >bnode of that CPU may be underutilised, because a decision has been
> > >made for a CPU that was run out of free slots to store a pointer.
> > >
> > >migrate_disable/enable() are now independent of RT, use it in order
> > >to prevent any migration during a page request for a specific CPU it
> > >is requested for.
> >
> >
> > Hello Rezki
> >
> > The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.
> > There is such a description in preempt.h
> >
> >
> > * Notes on the implementation.
> >  *
> >  * The implementation is particularly tricky since existing code patterns
> >  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
> >  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
> >  * nor can it easily migrate itself into a pending affinity mask change on
> >  * migrate_enable().
> >
> >How i interpret it is migrate_enable()/migrate_disable() are not allowed to
> >use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
> >current context as non-migratable.
> >
> >void migrate_disable(void)
> >{
> > struct task_struct *p = current;
> >
> > if (p->migration_disabled) {
> >  p->migration_disabled++;
> >  return;
> > }
> 
> > preempt_disable();
> > this_rq()->nr_pinned++;
> > p->migration_disabled = 1;
> > preempt_enable();
> >}
> >
> >It does nothing that prevents you from doing schedule() or even wait for any
> >event(mutex slow path behaviour), when the process is removed from the run-queue.
> >I mean after the migrate_disable() is invoked. Or i miss something?
> 
> Hello Rezki
> 
> Sorry, there's something wrong with the previous description.
> There are the following scenarios
> 
> Due to migrate_disable will increase  this_rq()->nr_pinned , after that
> if get_free_page be blocked, and this time, CPU going offline,
> the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
> and be blocked.
> 
But after the migrate_disable() is invoked a CPU can not be brought down.
If there are pinned tasks a "hotplug path" will be blocked on balance_hotplug_wait()
call.

> blocked:
> sched_cpu_wait_empty()
> {
>       struct rq *rq = this_rq();
>        rcuwait_wait_event(&rq->hotplug_wait,
>                            rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
>                            TASK_UNINTERRUPTIBLE);
> }
>
Exactly.

> wakeup:
> balance_push()
> {
>         if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
>               
>                 if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
>                     rcuwait_active(&rq->hotplug_wait)) {
>                         raw_spin_unlock(&rq->lock);
>                         rcuwait_wake_up(&rq->hotplug_wait);
>                         raw_spin_lock(&rq->lock);
>                 }
>                 return;
>         }
> }
> 
> One of the conditions for this function to wake up is "rq->nr_pinned  == 0"
> that is to say between migrate_disable/enable, if blocked will defect CPU going
> offline longer blocking time.
> 
Indeed, the hotplug time is affected. For example in case of waiting for
a mutex to be released, an owner will wakeup waiters. But this is expectable.

>
> I'm not sure that's a problem,and I didn't find it in the kernel code  between 
>  migrate_disable/enable possible sleep calls.
> 
For example z3fold.c:

/* Add to the appropriate unbuddied list */
static inline void add_to_unbuddied(struct z3fold_pool *pool,
				struct z3fold_header *zhdr)
{
	if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
			zhdr->middle_chunks == 0) {
		struct list_head *unbuddied;
		int freechunks = num_free_chunks(zhdr);

		migrate_disable();
		unbuddied = this_cpu_ptr(pool->unbuddied);
		spin_lock(&pool->lock);
		list_add(&zhdr->buddy, &unbuddied[freechunks]);
		spin_unlock(&pool->lock);
		zhdr->cpu = smp_processor_id();
		migrate_enable();
	}
}

for PREEMPT_RT kernel a spinlock is converted to rt-mutex, thus it can sleep.

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-25 13:22 ` Michal Hocko
@ 2021-01-25 14:31   ` Uladzislau Rezki
  2021-01-25 15:39     ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-25 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki (Sony),
	LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

> On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from a caller
> > context when a "carry page block" is run out of free spots. Instead
> > of hitting a slow path we can request an extra page by demand and
> > proceed with a fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and that its fallback is the relatively high latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> 
> __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> way to allocate without triggering the OOM killer. Is this really what
> you need/want? Is __GFP_NORETRY too weak?
> 
Hm... We agreed to proceed with limited lightwait memory direct reclaim.
Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
point: https://www.spinics.net/lists/rcu/msg02856.html

<snip>
    So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
    further decisions based on instrumentation of the success rates of
    these opportunistic allocations.
<snip>

but for some reason, i can't find a tail or head of it, we introduced
__GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
What we would like to avoid.

I tend to say that it was a typo.

Thank you for pointing to it!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-25 14:31   ` Uladzislau Rezki
@ 2021-01-25 15:39     ` Michal Hocko
  2021-01-25 16:25       ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2021-01-25 15:39 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > For a single argument we can directly request a page from a caller
> > > context when a "carry page block" is run out of free spots. Instead
> > > of hitting a slow path we can request an extra page by demand and
> > > proceed with a fast path.
> > > 
> > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > and that its fallback is the relatively high latency synchronize_rcu().
> > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > to allow limited sleeping within the memory allocator.
> > 
> > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > way to allocate without triggering the OOM killer. Is this really what
> > you need/want? Is __GFP_NORETRY too weak?
> > 
> Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> point: https://www.spinics.net/lists/rcu/msg02856.html
> 
> <snip>
>     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
>     further decisions based on instrumentation of the success rates of
>     these opportunistic allocations.
> <snip>

I completely agree with Johannes here.

> but for some reason, i can't find a tail or head of it, we introduced
> __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> What we would like to avoid.

Not that I object to this use but I think it would be much better to use
it based on actual data. Going along with it right away might become a
future burden to make any changes in this aspect later on due to lack of 
exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
try as hard as it can get without being really disruptive (like OOM
killing something). And your wording didn't really give me that
impression.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-25 15:39     ` Michal Hocko
@ 2021-01-25 16:25       ` Uladzislau Rezki
  2021-01-28 15:11         ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-25 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > For a single argument we can directly request a page from a caller
> > > > context when a "carry page block" is run out of free spots. Instead
> > > > of hitting a slow path we can request an extra page by demand and
> > > > proceed with a fast path.
> > > > 
> > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > to allow limited sleeping within the memory allocator.
> > > 
> > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > way to allocate without triggering the OOM killer. Is this really what
> > > you need/want? Is __GFP_NORETRY too weak?
> > > 
> > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > point: https://www.spinics.net/lists/rcu/msg02856.html
> > 
> > <snip>
> >     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> >     further decisions based on instrumentation of the success rates of
> >     these opportunistic allocations.
> > <snip>
> 
> I completely agree with Johannes here.
> 
> > but for some reason, i can't find a tail or head of it, we introduced
> > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > What we would like to avoid.
> 
> Not that I object to this use but I think it would be much better to use
> it based on actual data. Going along with it right away might become a
> future burden to make any changes in this aspect later on due to lack of 
> exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> try as hard as it can get without being really disruptive (like OOM
> killing something). And your wording didn't really give me that
> impression.
> 
Initially i proposed just to go with GFP_NOWAIT flag. But later on there
was a discussion about a fallback path, that uses synchronize_rcu() can be
slow, thus minimizing its hitting would be great. So, here we go with a
trade off.

Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have some
light-wait requests would be acceptable. That is why __GFP_NORETRY was proposed.

There were simple criterias we discussed which we would like to achieve:

a) minimize a fallback hitting;
b) avoid of OOM involving;
c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* 回复: 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-25 13:49           ` Uladzislau Rezki
@ 2021-01-26  9:33             ` Zhang, Qiang
  2021-01-26 13:43               ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Zhang, Qiang @ 2021-01-26  9:33 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko



________________________________________
发件人: Uladzislau Rezki <urezki@gmail.com>
发送时间: 2021年1月25日 21:49
收件人: Zhang, Qiang
抄送: Uladzislau Rezki; LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Oleksiy Avramchenko
主题: Re: 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

> >Hello, Zhang.
>
> > >________________________________________
> > >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > >发送时间: 2021年1月21日 0:21
> > >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> > >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> > >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> > >
> > >Since the page is obtained in a fully preemptible context, dropping
> > >the lock can lead to migration onto another CPU. As a result a prev.
> > >bnode of that CPU may be underutilised, because a decision has been
> > >made for a CPU that was run out of free slots to store a pointer.
> > >
> > >migrate_disable/enable() are now independent of RT, use it in order
> > >to prevent any migration during a page request for a specific CPU it
> > >is requested for.
> >
> >
> > Hello Rezki
> >
> > The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.
> > There is such a description in preempt.h
> >
> >
> > * Notes on the implementation.
> >  *
> >  * The implementation is particularly tricky since existing code patterns
> >  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
> >  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
> >  * nor can it easily migrate itself into a pending affinity mask change on
> >  * migrate_enable().
> >
> >How i interpret it is migrate_enable()/migrate_disable() are not allowed to
> >use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
> >current context as non-migratable.
> >
> >void migrate_disable(void)
> >{
> > struct task_struct *p = current;
> >
> > if (p->migration_disabled) {
> >  p->migration_disabled++;
> >  return;
> > }
>
> > preempt_disable();
> > this_rq()->nr_pinned++;
> > p->migration_disabled = 1;
> > preempt_enable();
> >}
> >
> >It does nothing that prevents you from doing schedule() or even wait for any
> >event(mutex slow path behaviour), when the process is removed from the run-queue.
> >I mean after the migrate_disable() is invoked. Or i miss something?
>
> Hello Rezki
>
> Sorry, there's something wrong with the previous description.
> There are the following scenarios
>
> Due to migrate_disable will increase  this_rq()->nr_pinned , after that
> if get_free_page be blocked, and this time, CPU going offline,
> the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
> and be blocked.
>
>But after the migrate_disable() is invoked a CPU can not be brought down.
>If there are pinned tasks a "hotplug path" will be blocked on balance_hotplug_wait()
>call.

> blocked:
> sched_cpu_wait_empty()
> {
>       struct rq *rq = this_rq();
>        rcuwait_wait_event(&rq->hotplug_wait,
>                            rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
>                            TASK_UNINTERRUPTIBLE);
> }
>
>Exactly.

> wakeup:
> balance_push()
> {
>         if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
>
>                 if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
>                     rcuwait_active(&rq->hotplug_wait)) {
>                         raw_spin_unlock(&rq->lock);
>                         rcuwait_wake_up(&rq->hotplug_wait);
>                         raw_spin_lock(&rq->lock);
>                 }
>                 return;
>         }
> }
>
> One of the conditions for this function to wake up is "rq->nr_pinned  == 0"
> that is to say between migrate_disable/enable, if blocked will defect CPU going
> offline longer blocking time.
>
>Indeed, the hotplug time is affected. For example in case of waiting for
>a mutex to be released, an owner will wakeup waiters. But this is expectable.

>
> I'm not sure that's a problem,and I didn't find it in the kernel code  between
>  migrate_disable/enable possible sleep calls.
>
>For example z3fold.c:

>/* Add to the appropriate unbuddied list */
>static inline void add_to_unbuddied(struct z3fold_pool *pool,
>                                struct z3fold_header *zhdr)
>{
>       if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
>                        zhdr->middle_chunks == 0) {
>                struct list_head *unbuddied;
>              int freechunks = num_free_chunks(zhdr);
>
>                migrate_disable();
>                unbuddied = this_cpu_ptr(pool->unbuddied);
>                spin_lock(&pool->lock);
>                list_add(&zhdr->buddy, &unbuddied[freechunks]);
>                spin_unlock(&pool->lock);
>                zhdr->cpu = smp_processor_id();
>                migrate_enable();
>        }
>}

>for PREEMPT_RT kernel a spinlock is converted to rt-mutex, thus it can sleep.

 I forgot that. Thank you for your explanation.

 
>--
>Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: 回复: 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
  2021-01-26  9:33             ` 回复: " Zhang, Qiang
@ 2021-01-26 13:43               ` Uladzislau Rezki
  0 siblings, 0 replies; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-26 13:43 UTC (permalink / raw)
  To: Zhang, Qiang
  Cc: Uladzislau Rezki, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Michal Hocko,
	Thomas Gleixner, Theodore Y . Ts'o,
	Sebastian Andrzej Siewior, Oleksiy Avramchenko

On Tue, Jan 26, 2021 at 09:33:40AM +0000, Zhang, Qiang wrote:
> 
> 
> ________________________________________
> 发件人: Uladzislau Rezki <urezki@gmail.com>
> 发送时间: 2021年1月25日 21:49
> 收件人: Zhang, Qiang
> 抄送: Uladzislau Rezki; LKML; RCU; Paul E . McKenney; Michael Ellerman; Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Oleksiy Avramchenko
> 主题: Re: 回复: 回复: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> 
> > >Hello, Zhang.
> >
> > > >________________________________________
> > > >发件人: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > >发送时间: 2021年1月21日 0:21
> > > >收件人: LKML; RCU; Paul E . McKenney; Michael Ellerman
> > > >抄送: Andrew Morton; Daniel Axtens; Frederic Weisbecker; Neeraj >Upadhyay; Joel Fernandes; Peter Zijlstra; Michal Hocko; Thomas >Gleixner; Theodore Y . Ts'o; Sebastian Andrzej Siewior; Uladzislau >Rezki; Oleksiy Avramchenko
> > > >主题: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()
> > > >
> > > >Since the page is obtained in a fully preemptible context, dropping
> > > >the lock can lead to migration onto another CPU. As a result a prev.
> > > >bnode of that CPU may be underutilised, because a decision has been
> > > >made for a CPU that was run out of free slots to store a pointer.
> > > >
> > > >migrate_disable/enable() are now independent of RT, use it in order
> > > >to prevent any migration during a page request for a specific CPU it
> > > >is requested for.
> > >
> > >
> > > Hello Rezki
> > >
> > > The critical migrate_disable/enable() area is not allowed to block, under RT and non RT.
> > > There is such a description in preempt.h
> > >
> > >
> > > * Notes on the implementation.
> > >  *
> > >  * The implementation is particularly tricky since existing code patterns
> > >  * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
> > >  * This means that it cannot use cpus_read_lock() to serialize against hotplug,
> > >  * nor can it easily migrate itself into a pending affinity mask change on
> > >  * migrate_enable().
> > >
> > >How i interpret it is migrate_enable()/migrate_disable() are not allowed to
> > >use any blocking primitives, such as rwsem/mutexes/etc. in order to mark a
> > >current context as non-migratable.
> > >
> > >void migrate_disable(void)
> > >{
> > > struct task_struct *p = current;
> > >
> > > if (p->migration_disabled) {
> > >  p->migration_disabled++;
> > >  return;
> > > }
> >
> > > preempt_disable();
> > > this_rq()->nr_pinned++;
> > > p->migration_disabled = 1;
> > > preempt_enable();
> > >}
> > >
> > >It does nothing that prevents you from doing schedule() or even wait for any
> > >event(mutex slow path behaviour), when the process is removed from the run-queue.
> > >I mean after the migrate_disable() is invoked. Or i miss something?
> >
> > Hello Rezki
> >
> > Sorry, there's something wrong with the previous description.
> > There are the following scenarios
> >
> > Due to migrate_disable will increase  this_rq()->nr_pinned , after that
> > if get_free_page be blocked, and this time, CPU going offline,
> > the sched_cpu_wait_empty() be called in per-cpu "cpuhp/%d" task,
> > and be blocked.
> >
> >But after the migrate_disable() is invoked a CPU can not be brought down.
> >If there are pinned tasks a "hotplug path" will be blocked on balance_hotplug_wait()
> >call.
> 
> > blocked:
> > sched_cpu_wait_empty()
> > {
> >       struct rq *rq = this_rq();
> >        rcuwait_wait_event(&rq->hotplug_wait,
> >                            rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
> >                            TASK_UNINTERRUPTIBLE);
> > }
> >
> >Exactly.
> 
> > wakeup:
> > balance_push()
> > {
> >         if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
> >
> >                 if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
> >                     rcuwait_active(&rq->hotplug_wait)) {
> >                         raw_spin_unlock(&rq->lock);
> >                         rcuwait_wake_up(&rq->hotplug_wait);
> >                         raw_spin_lock(&rq->lock);
> >                 }
> >                 return;
> >         }
> > }
> >
> > One of the conditions for this function to wake up is "rq->nr_pinned  == 0"
> > that is to say between migrate_disable/enable, if blocked will defect CPU going
> > offline longer blocking time.
> >
> >Indeed, the hotplug time is affected. For example in case of waiting for
> >a mutex to be released, an owner will wakeup waiters. But this is expectable.
> 
> >
> > I'm not sure that's a problem,and I didn't find it in the kernel code  between
> >  migrate_disable/enable possible sleep calls.
> >
> >For example z3fold.c:
> 
> >/* Add to the appropriate unbuddied list */
> >static inline void add_to_unbuddied(struct z3fold_pool *pool,
> >                                struct z3fold_header *zhdr)
> >{
> >       if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 ||
> >                        zhdr->middle_chunks == 0) {
> >                struct list_head *unbuddied;
> >              int freechunks = num_free_chunks(zhdr);
> >
> >                migrate_disable();
> >                unbuddied = this_cpu_ptr(pool->unbuddied);
> >                spin_lock(&pool->lock);
> >                list_add(&zhdr->buddy, &unbuddied[freechunks]);
> >                spin_unlock(&pool->lock);
> >                zhdr->cpu = smp_processor_id();
> >                migrate_enable();
> >        }
> >}
> 
> >for PREEMPT_RT kernel a spinlock is converted to rt-mutex, thus it can sleep.
> 
>  I forgot that. Thank you for your explanation.
> 
>  
No problem. I also has recently learned about spinlock and rt-mutexes :)

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-25 16:25       ` Uladzislau Rezki
@ 2021-01-28 15:11         ` Uladzislau Rezki
  2021-01-28 15:17           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-28 15:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Hocko, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > For a single argument we can directly request a page from a caller
> > > > > context when a "carry page block" is run out of free spots. Instead
> > > > > of hitting a slow path we can request an extra page by demand and
> > > > > proceed with a fast path.
> > > > > 
> > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > to allow limited sleeping within the memory allocator.
> > > > 
> > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > > way to allocate without triggering the OOM killer. Is this really what
> > > > you need/want? Is __GFP_NORETRY too weak?
> > > > 
> > > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > 
> > > <snip>
> > >     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > >     further decisions based on instrumentation of the success rates of
> > >     these opportunistic allocations.
> > > <snip>
> > 
> > I completely agree with Johannes here.
> > 
> > > but for some reason, i can't find a tail or head of it, we introduced
> > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > > What we would like to avoid.
> > 
> > Not that I object to this use but I think it would be much better to use
> > it based on actual data. Going along with it right away might become a
> > future burden to make any changes in this aspect later on due to lack of 
> > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > try as hard as it can get without being really disruptive (like OOM
> > killing something). And your wording didn't really give me that
> > impression.
> > 
> Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> was a discussion about a fallback path, that uses synchronize_rcu() can be
> slow, thus minimizing its hitting would be great. So, here we go with a
> trade off.
> 
> Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have some
> light-wait requests would be acceptable. That is why __GFP_NORETRY was proposed.
> 
> There were simple criterias we discussed which we would like to achieve:
> 
> a) minimize a fallback hitting;
> b) avoid of OOM involving;
> c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> 
One question here. Since the code that triggers a page request can be
directly invoked from reclaim context as well as outside of it. We had
a concern about if any recursion is possible, but what i see it is safe.
The context that does it can not enter it twice:

<snip>
    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;
<snip>

What about any deadlocking in regards to below following flags?

GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN

Thanks!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-28 15:11         ` Uladzislau Rezki
@ 2021-01-28 15:17           ` Michal Hocko
  2021-01-28 15:30             ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2021-01-28 15:17 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > For a single argument we can directly request a page from a caller
> > > > > > context when a "carry page block" is run out of free spots. Instead
> > > > > > of hitting a slow path we can request an extra page by demand and
> > > > > > proceed with a fast path.
> > > > > > 
> > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > to allow limited sleeping within the memory allocator.
> > > > > 
> > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > 
> > > > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > 
> > > > <snip>
> > > >     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > >     further decisions based on instrumentation of the success rates of
> > > >     these opportunistic allocations.
> > > > <snip>
> > > 
> > > I completely agree with Johannes here.
> > > 
> > > > but for some reason, i can't find a tail or head of it, we introduced
> > > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > > > What we would like to avoid.
> > > 
> > > Not that I object to this use but I think it would be much better to use
> > > it based on actual data. Going along with it right away might become a
> > > future burden to make any changes in this aspect later on due to lack of 
> > > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > > try as hard as it can get without being really disruptive (like OOM
> > > killing something). And your wording didn't really give me that
> > > impression.
> > > 
> > Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> > was a discussion about a fallback path, that uses synchronize_rcu() can be
> > slow, thus minimizing its hitting would be great. So, here we go with a
> > trade off.
> > 
> > Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have some
> > light-wait requests would be acceptable. That is why __GFP_NORETRY was proposed.
> > 
> > There were simple criterias we discussed which we would like to achieve:
> > 
> > a) minimize a fallback hitting;
> > b) avoid of OOM involving;
> > c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> > 
> One question here. Since the code that triggers a page request can be
> directly invoked from reclaim context as well as outside of it. We had
> a concern about if any recursion is possible, but what i see it is safe.
> The context that does it can not enter it twice:
> 
> <snip>
>     /* Avoid recursion of direct reclaim */
>     if (current->flags & PF_MEMALLOC)
>         goto nopage;
> <snip>

Yes this is a recursion protection.

> What about any deadlocking in regards to below following flags?
> 
> GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN

and __GFP_NOMEMALLOC will make sure that the allocation will not consume
all the memory reserves. The later should be clarified in one of your
patches I have acked IIRC.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-28 15:17           ` Michal Hocko
@ 2021-01-28 15:30             ` Uladzislau Rezki
  2021-01-28 18:02               ` Uladzislau Rezki
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-28 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > For a single argument we can directly request a page from a caller
> > > > > > > context when a "carry page block" is run out of free spots. Instead
> > > > > > > of hitting a slow path we can request an extra page by demand and
> > > > > > > proceed with a fast path.
> > > > > > > 
> > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > 
> > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > 
> > > > > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > 
> > > > > <snip>
> > > > >     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > >     further decisions based on instrumentation of the success rates of
> > > > >     these opportunistic allocations.
> > > > > <snip>
> > > > 
> > > > I completely agree with Johannes here.
> > > > 
> > > > > but for some reason, i can't find a tail or head of it, we introduced
> > > > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > > > > What we would like to avoid.
> > > > 
> > > > Not that I object to this use but I think it would be much better to use
> > > > it based on actual data. Going along with it right away might become a
> > > > future burden to make any changes in this aspect later on due to lack of 
> > > > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > > > try as hard as it can get without being really disruptive (like OOM
> > > > killing something). And your wording didn't really give me that
> > > > impression.
> > > > 
> > > Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> > > was a discussion about a fallback path, that uses synchronize_rcu() can be
> > > slow, thus minimizing its hitting would be great. So, here we go with a
> > > trade off.
> > > 
> > > Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have some
> > > light-wait requests would be acceptable. That is why __GFP_NORETRY was proposed.
> > > 
> > > There were simple criterias we discussed which we would like to achieve:
> > > 
> > > a) minimize a fallback hitting;
> > > b) avoid of OOM involving;
> > > c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> > > 
> > One question here. Since the code that triggers a page request can be
> > directly invoked from reclaim context as well as outside of it. We had
> > a concern about if any recursion is possible, but what i see it is safe.
> > The context that does it can not enter it twice:
> > 
> > <snip>
> >     /* Avoid recursion of direct reclaim */
> >     if (current->flags & PF_MEMALLOC)
> >         goto nopage;
> > <snip>
> 
> Yes this is a recursion protection.
> 
> > What about any deadlocking in regards to below following flags?
> > 
> > GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN
> 
> and __GFP_NOMEMALLOC will make sure that the allocation will not consume
> all the memory reserves. The later should be clarified in one of your
> patches I have acked IIRC.
>
Yep, it is clarified and reflected in another patch you ACKed.

Thanks!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-28 15:30             ` Uladzislau Rezki
@ 2021-01-28 18:02               ` Uladzislau Rezki
       [not found]                 ` <YBPNvbJLg56XU8co@dhcp22.suse.cz>
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-28 18:02 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: Michal Hocko, LKML, RCU, Paul E . McKenney, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Thu, Jan 28, 2021 at 04:30:17PM +0100, Uladzislau Rezki wrote:
> On Thu, Jan 28, 2021 at 04:17:01PM +0100, Michal Hocko wrote:
> > On Thu 28-01-21 16:11:52, Uladzislau Rezki wrote:
> > > On Mon, Jan 25, 2021 at 05:25:59PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Jan 25, 2021 at 04:39:43PM +0100, Michal Hocko wrote:
> > > > > On Mon 25-01-21 15:31:50, Uladzislau Rezki wrote:
> > > > > > > On Wed 20-01-21 17:21:46, Uladzislau Rezki (Sony) wrote:
> > > > > > > > For a single argument we can directly request a page from a caller
> > > > > > > > context when a "carry page block" is run out of free spots. Instead
> > > > > > > > of hitting a slow path we can request an extra page by demand and
> > > > > > > > proceed with a fast path.
> > > > > > > > 
> > > > > > > > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > > > > > > > and that its fallback is the relatively high latency synchronize_rcu().
> > > > > > > > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > > > > > > > to allow limited sleeping within the memory allocator.
> > > > > > > 
> > > > > > > __GFP_RETRY_MAYFAIL can be quite heavy. It is effectively the most heavy
> > > > > > > way to allocate without triggering the OOM killer. Is this really what
> > > > > > > you need/want? Is __GFP_NORETRY too weak?
> > > > > > > 
> > > > > > Hm... We agreed to proceed with limited lightwait memory direct reclaim.
> > > > > > Johannes Weiner proposed to go with __GFP_NORETRY flag as a starting
> > > > > > point: https://www.spinics.net/lists/rcu/msg02856.html
> > > > > > 
> > > > > > <snip>
> > > > > >     So I'm inclined to suggest __GFP_NORETRY as a starting point, and make
> > > > > >     further decisions based on instrumentation of the success rates of
> > > > > >     these opportunistic allocations.
> > > > > > <snip>
> > > > > 
> > > > > I completely agree with Johannes here.
> > > > > 
> > > > > > but for some reason, i can't find a tail or head of it, we introduced
> > > > > > __GFP_RETRY_MAYFAIL what is a heavy one from a time consuming point of view.
> > > > > > What we would like to avoid.
> > > > > 
> > > > > Not that I object to this use but I think it would be much better to use
> > > > > it based on actual data. Going along with it right away might become a
> > > > > future burden to make any changes in this aspect later on due to lack of 
> > > > > exact reasoning. General rule of thumb for __GFP_RETRY_MAYFAIL is really
> > > > > try as hard as it can get without being really disruptive (like OOM
> > > > > killing something). And your wording didn't really give me that
> > > > > impression.
> > > > > 
> > > > Initially i proposed just to go with GFP_NOWAIT flag. But later on there
> > > > was a discussion about a fallback path, that uses synchronize_rcu() can be
> > > > slow, thus minimizing its hitting would be great. So, here we go with a
> > > > trade off.
> > > > 
> > > > Doing it hard as __GFP_RETRY_MAYFAIL can do, is not worth(IMHO), but to have some
> > > > light-wait requests would be acceptable. That is why __GFP_NORETRY was proposed.
> > > > 
> > > > There were simple criterias we discussed which we would like to achieve:
> > > > 
> > > > a) minimize a fallback hitting;
> > > > b) avoid of OOM involving;
> > > > c) avoid of dipping into the emergency reserves. See kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> > > > 
> > > One question here. Since the code that triggers a page request can be
> > > directly invoked from reclaim context as well as outside of it. We had
> > > a concern about if any recursion is possible, but what i see it is safe.
> > > The context that does it can not enter it twice:
> > > 
> > > <snip>
> > >     /* Avoid recursion of direct reclaim */
> > >     if (current->flags & PF_MEMALLOC)
> > >         goto nopage;
> > > <snip>
> > 
> > Yes this is a recursion protection.
> > 
> > > What about any deadlocking in regards to below following flags?
> > > 
> > > GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN
> > 
> > and __GFP_NOMEMALLOC will make sure that the allocation will not consume
> > all the memory reserves. The later should be clarified in one of your
> > patches I have acked IIRC.
> >
> Yep, it is clarified and reflected in another patch you ACKed.
> 
> Thanks!
> 

Please find below the V2.

Changelog V1 -> V2:
    - replace the __GFP_RETRY_MAYFAIL by __GFP_NORETRY;
    - add more comments explaining why and which flags are used.

<snip>
From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Wed, 20 Jan 2021 17:21:46 +0100
Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
 case

Single-argument kvfree_rcu() must be invoked from sleepable contexts,
so we can directly allocate pages.  Furthermmore, the fallback in case
of page-allocation failure is the high-latency synchronize_rcu(), so it
makes sense to do these page allocations from the fastpath, and even to
permit limited sleeping within the allocator.

This commit therefore allocates if needed on the fastpath using
GFP_KERNEL|__GFP_NORETRY.  This also has the beneficial effect
of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant
of kvfree_rcu(), given that the double-argument variant cannot directly
invoke the allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree.c | 51 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e04e336bee42..e450c17a06b2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3465,37 +3465,59 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 	}
 }
 
+// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
+// state specified by flags.  If can_alloc is true, the caller must
+// be schedulable and not be holding any locks or mutexes that might be
+// acquired by the memory allocator or anything that it might invoke.
+// Returns true if ptr was successfully recorded, else the caller must
+// use a fallback.
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+	unsigned long *flags, void *ptr, bool can_alloc)
 {
 	struct kvfree_rcu_bulk_data *bnode;
 	int idx;
 
-	if (unlikely(!krcp->initialized))
+	*krcp = krc_this_cpu_lock(flags);
+	if (unlikely(!(*krcp)->initialized))
 		return false;
 
-	lockdep_assert_held(&krcp->lock);
 	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
-	if (!krcp->bkvhead[idx] ||
-			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-		bnode = get_cached_bnode(krcp);
-		/* Switch to emergency path. */
+	if (!(*krcp)->bkvhead[idx] ||
+			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+		bnode = get_cached_bnode(*krcp);
+		if (!bnode && can_alloc) {
+			krc_this_cpu_unlock(*krcp, *flags);
+
+			// __GFP_NORETRY - allows a light-weight direct reclaim
+			// what is OK from minimizing of fallback hitting point of
+			// view. Apart of that it forbids any OOM invoking what is
+			// also beneficial since we are about to release memory soon.
+			//
+			// __GFP_NOWARN - it is supposed that an allocation can
+			// be failed under low memory or high memory pressure
+			// scenarios.
+			bnode = (struct kvfree_rcu_bulk_data *)
+				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+			*krcp = krc_this_cpu_lock(flags);
+		}
+
 		if (!bnode)
 			return false;
 
 		/* Initialize the new block. */
 		bnode->nr_records = 0;
-		bnode->next = krcp->bkvhead[idx];
+		bnode->next = (*krcp)->bkvhead[idx];
 
 		/* Attach it to the head. */
-		krcp->bkvhead[idx] = bnode;
+		(*krcp)->bkvhead[idx] = bnode;
 	}
 
 	/* Finally insert. */
-	krcp->bkvhead[idx]->records
-		[krcp->bkvhead[idx]->nr_records++] = ptr;
+	(*krcp)->bkvhead[idx]->records
+		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
 	return true;
 }
@@ -3533,8 +3555,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		ptr = (unsigned long *) func;
 	}
 
-	krcp = krc_this_cpu_lock(&flags);
-
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
@@ -3542,12 +3562,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 			  __func__, head);
 
 		// Mark as success and leave.
-		success = true;
-		goto unlock_return;
+		return;
 	}
 
 	kasan_record_aux_stack(ptr);
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
 	if (!success) {
 		run_page_cache_worker(krcp);
 
-- 
2.20.1
<snip>

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
  2021-01-20 16:21 ` [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu() Uladzislau Rezki (Sony)
@ 2021-01-28 18:06   ` Uladzislau Rezki
  0 siblings, 0 replies; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-28 18:06 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: LKML, RCU, Paul E . McKenney, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Michal Hocko, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Wed, Jan 20, 2021 at 05:21:47PM +0100, Uladzislau Rezki (Sony) wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
> carried out by the single-argument variant of kvfree_rcu(), thus avoiding
> this can-sleep code path from dipping into the emergency reserves.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2014fb22644d..454809514c91 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3491,7 +3491,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>  		if (!bnode && can_alloc) {
>  			krc_this_cpu_unlock(*krcp, *flags);
>  			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> +				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
>  			*krcp = krc_this_cpu_lock(flags);
>  		}
>  
> -- 
> 2.20.1
> 
Please see below a V2:

V1 -> V2:
    - rebase on [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
    - add a comment about __GFP_NOMEMALLOC usage.

<snip>
From 1427698cdbdced53d9b5eee60aa5d22bc223056d Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney" <paulmck@kernel.org>
Date: Wed, 20 Jan 2021 17:21:47 +0100
Subject: [PATCH v2 1/1] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument
 kvfree_rcu()

This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
carried out by the single-argument variant of kvfree_rcu(), thus avoiding
this can-sleep code path from dipping into the emergency reserves.

Acked-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e450c17a06b2..e7b705155c92 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3496,11 +3496,14 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 			// view. Apart of that it forbids any OOM invoking what is
 			// also beneficial since we are about to release memory soon.
 			//
+			// __GFP_NOMEMALLOC - prevents from consuming of all the
+			// memory reserves. Please note we have a fallback path.
+			//
 			// __GFP_NOWARN - it is supposed that an allocation can
 			// be failed under low memory or high memory pressure
 			// scenarios.
 			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 			*krcp = krc_this_cpu_lock(flags);
 		}
 
-- 
2.20.1
<snip>

Thanks!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
       [not found]                 ` <YBPNvbJLg56XU8co@dhcp22.suse.cz>
@ 2021-01-29 16:35                   ` Uladzislau Rezki
  2021-02-01 11:47                     ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Uladzislau Rezki @ 2021-01-29 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Paul E . McKenney, LKML, RCU, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> [...]
> > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
> >  case
> > 
> > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > so we can directly allocate pages.  Furthermmore, the fallback in case
> > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > makes sense to do these page allocations from the fastpath, and even to
> > permit limited sleeping within the allocator.
> > 
> > This commit therefore allocates if needed on the fastpath using
> > GFP_KERNEL|__GFP_NORETRY.
> 
> Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> prepared to handle the failure which is likely much less disruptive than
> OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> 
> I cannot give you ack as I am not familiar with the code but this makes
> sense to me.
> 
No problem, i can separate it. We can have a patch on top of what we have so
far. The patch only modifies the gfp_mask passed to __get_free_pages():

From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Fri, 29 Jan 2021 17:16:03 +0100
Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY

__GFP_RETRY_MAYFAIL is a bit heavy from reclaim process of view,
therefore a time consuming. That is not optional and there is
no need in doing it so hard, because we have a fallback path.

__GFP_NORETRY in its turn can perform some light-weight reclaim
and it rather fails under high memory pressure or low memory
condition.

In general there are four simple criterias we we would like to
achieve:
    a) minimize a fallback hitting;
    b) avoid of OOM invoking;
    c) do a light-wait page request;
    d) avoid of dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 70ddc339e0b7..1e862120db9e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 		bnode = get_cached_bnode(*krcp);
 		if (!bnode && can_alloc) {
 			krc_this_cpu_unlock(*krcp, *flags);
+
+			// __GFP_NORETRY - allows a light-weight direct reclaim
+			// what is OK from minimizing of fallback hitting point of
+			// view. Apart of that it forbids any OOM invoking what is
+			// also beneficial since we are about to release memory soon.
+			//
+			// __GFP_NOMEMALLOC - prevents from consuming of all the
+			// memory reserves. Please note we have a fallback path.
+			//
+			// __GFP_NOWARN - it is supposed that an allocation can
+			// be failed under low memory or high memory pressure
+			// scenarios.
 			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 			*krcp = krc_this_cpu_lock(flags);
 		}
 
-- 
2.20.1

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-01-29 16:35                   ` Uladzislau Rezki
@ 2021-02-01 11:47                     ` Michal Hocko
  2021-02-01 14:44                       ` Uladzislau Rezki
  2021-02-03 19:37                       ` Paul E. McKenney
  0 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2021-02-01 11:47 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E . McKenney, LKML, RCU, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > [...]
> > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
> > >  case
> > > 
> > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > so we can directly allocate pages.  Furthermmore, the fallback in case
> > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > makes sense to do these page allocations from the fastpath, and even to
> > > permit limited sleeping within the allocator.
> > > 
> > > This commit therefore allocates if needed on the fastpath using
> > > GFP_KERNEL|__GFP_NORETRY.
> > 
> > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > prepared to handle the failure which is likely much less disruptive than
> > OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> > 
> > I cannot give you ack as I am not familiar with the code but this makes
> > sense to me.
> > 
> No problem, i can separate it. We can have a patch on top of what we have so
> far. The patch only modifies the gfp_mask passed to __get_free_pages():
> 
> >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> Date: Fri, 29 Jan 2021 17:16:03 +0100
> Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> 
> __GFP_RETRY_MAYFAIL is a bit heavy from reclaim process of view,
> therefore a time consuming. That is not optional and there is
> no need in doing it so hard, because we have a fallback path.
> 
> __GFP_NORETRY in its turn can perform some light-weight reclaim
> and it rather fails under high memory pressure or low memory
> condition.
> 
> In general there are four simple criterias we we would like to
> achieve:
>     a) minimize a fallback hitting;
>     b) avoid of OOM invoking;
>     c) do a light-wait page request;
>     d) avoid of dipping into the emergency reserves.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Looks good to me. Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  kernel/rcu/tree.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 70ddc339e0b7..1e862120db9e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>  		bnode = get_cached_bnode(*krcp);
>  		if (!bnode && can_alloc) {
>  			krc_this_cpu_unlock(*krcp, *flags);
> +
> +			// __GFP_NORETRY - allows a light-weight direct reclaim
> +			// what is OK from minimizing of fallback hitting point of
> +			// view. Apart of that it forbids any OOM invoking what is
> +			// also beneficial since we are about to release memory soon.
> +			//
> +			// __GFP_NOMEMALLOC - prevents from consuming of all the
> +			// memory reserves. Please note we have a fallback path.
> +			//
> +			// __GFP_NOWARN - it is supposed that an allocation can
> +			// be failed under low memory or high memory pressure
> +			// scenarios.
>  			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>  			*krcp = krc_this_cpu_lock(flags);
>  		}
>  
> -- 
> 2.20.1
> 
> --
> Vlad Rezki

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-02-01 11:47                     ` Michal Hocko
@ 2021-02-01 14:44                       ` Uladzislau Rezki
  2021-02-03 19:37                       ` Paul E. McKenney
  1 sibling, 0 replies; 36+ messages in thread
From: Uladzislau Rezki @ 2021-02-01 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, Paul E . McKenney, LKML, RCU, Michael Ellerman,
	Andrew Morton, Daniel Axtens, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Mon, Feb 01, 2021 at 12:47:55PM +0100, Michal Hocko wrote:
> On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> > On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > > [...]
> > > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
> > > >  case
> > > > 
> > > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > > so we can directly allocate pages.  Furthermmore, the fallback in case
> > > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > > makes sense to do these page allocations from the fastpath, and even to
> > > > permit limited sleeping within the allocator.
> > > > 
> > > > This commit therefore allocates if needed on the fastpath using
> > > > GFP_KERNEL|__GFP_NORETRY.
> > > 
> > > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > > prepared to handle the failure which is likely much less disruptive than
> > > OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> > > 
> > > I cannot give you ack as I am not familiar with the code but this makes
> > > sense to me.
> > > 
> > No problem, i can separate it. We can have a patch on top of what we have so
> > far. The patch only modifies the gfp_mask passed to __get_free_pages():
> > 
> > >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > Date: Fri, 29 Jan 2021 17:16:03 +0100
> > Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> > 
> > __GFP_RETRY_MAYFAIL is a bit heavy from reclaim process of view,
> > therefore a time consuming. That is not optional and there is
> > no need in doing it so hard, because we have a fallback path.
> > 
> > __GFP_NORETRY in its turn can perform some light-weight reclaim
> > and it rather fails under high memory pressure or low memory
> > condition.
> > 
> > In general there are four simple criterias we we would like to
> > achieve:
> >     a) minimize a fallback hitting;
> >     b) avoid of OOM invoking;
> >     c) do a light-wait page request;
> >     d) avoid of dipping into the emergency reserves.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Looks good to me. Feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
Appreciate it!

--
Vlad Rezki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument
  2021-02-01 11:47                     ` Michal Hocko
  2021-02-01 14:44                       ` Uladzislau Rezki
@ 2021-02-03 19:37                       ` Paul E. McKenney
  1 sibling, 0 replies; 36+ messages in thread
From: Paul E. McKenney @ 2021-02-03 19:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Uladzislau Rezki, LKML, RCU, Michael Ellerman, Andrew Morton,
	Daniel Axtens, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Peter Zijlstra, Thomas Gleixner,
	Theodore Y . Ts'o, Sebastian Andrzej Siewior,
	Oleksiy Avramchenko

On Mon, Feb 01, 2021 at 12:47:55PM +0100, Michal Hocko wrote:
> On Fri 29-01-21 17:35:31, Uladzislau Rezki wrote:
> > On Fri, Jan 29, 2021 at 09:56:29AM +0100, Michal Hocko wrote:
> > > On Thu 28-01-21 19:02:37, Uladzislau Rezki wrote:
> > > [...]
> > > > >From 0bdb8ca1ae62088790e0a452c4acec3821e06989 Mon Sep 17 00:00:00 2001
> > > > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > > > Date: Wed, 20 Jan 2021 17:21:46 +0100
> > > > Subject: [PATCH v2 1/1] kvfree_rcu: Directly allocate page for single-argument
> > > >  case
> > > > 
> > > > Single-argument kvfree_rcu() must be invoked from sleepable contexts,
> > > > so we can directly allocate pages.  Furthermmore, the fallback in case
> > > > of page-allocation failure is the high-latency synchronize_rcu(), so it
> > > > makes sense to do these page allocations from the fastpath, and even to
> > > > permit limited sleeping within the allocator.
> > > > 
> > > > This commit therefore allocates if needed on the fastpath using
> > > > GFP_KERNEL|__GFP_NORETRY.
> > > 
> > > Yes, __GFP_NORETRY as a lightweight allocation mode should be fine. It
> > > is more robust than __GFP_NOWAIT on memory usage spikes.  The caller is
> > > prepared to handle the failure which is likely much less disruptive than
> > > OOM or potentially heavy reclaim __GFP_RETRY_MAYFAIL.
> > > 
> > > I cannot give you ack as I am not familiar with the code but this makes
> > > sense to me.
> > > 
> > No problem, i can separate it. We can have a patch on top of what we have so
> > far. The patch only modifies the gfp_mask passed to __get_free_pages():
> > 
> > >From ec2feaa9b7f55f73b3b17e9ac372151c1aab5ae0 Mon Sep 17 00:00:00 2001
> > From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> > Date: Fri, 29 Jan 2021 17:16:03 +0100
> > Subject: [PATCH 1/1] kvfree_rcu: replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
> > 
> > __GFP_RETRY_MAYFAIL is a bit heavy from reclaim process of view,
> > therefore a time consuming. That is not optional and there is
> > no need in doing it so hard, because we have a fallback path.
> > 
> > __GFP_NORETRY in its turn can perform some light-weight reclaim
> > and it rather fails under high memory pressure or low memory
> > condition.
> > 
> > In general there are four simple criterias we we would like to
> > achieve:
> >     a) minimize a fallback hitting;
> >     b) avoid of OOM invoking;
> >     c) do a light-wait page request;
> >     d) avoid of dipping into the emergency reserves.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Looks good to me. Feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>

Queued, thank you both!

							Thanx, Paul

> > ---
> >  kernel/rcu/tree.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 70ddc339e0b7..1e862120db9e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3489,8 +3489,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> >  		bnode = get_cached_bnode(*krcp);
> >  		if (!bnode && can_alloc) {
> >  			krc_this_cpu_unlock(*krcp, *flags);
> > +
> > +			// __GFP_NORETRY - allows a light-weight direct reclaim
> > +			// what is OK from minimizing of fallback hitting point of
> > +			// view. Apart of that it forbids any OOM invoking what is
> > +			// also beneficial since we are about to release memory soon.
> > +			//
> > +			// __GFP_NOMEMALLOC - prevents from consuming of all the
> > +			// memory reserves. Please note we have a fallback path.
> > +			//
> > +			// __GFP_NOWARN - it is supposed that an allocation can
> > +			// be failed under low memory or high memory pressure
> > +			// scenarios.
> >  			bnode = (struct kvfree_rcu_bulk_data *)
> > -				__get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  			*krcp = krc_this_cpu_lock(flags);
> >  		}
> >  
> > -- 
> > 2.20.1
> > 
> > --
> > Vlad Rezki
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, back to index

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20 16:21 [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Uladzislau Rezki (Sony)
2021-01-20 16:21 ` [PATCH 2/3] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu() Uladzislau Rezki (Sony)
2021-01-28 18:06   ` Uladzislau Rezki
2021-01-20 16:21 ` [PATCH 3/3] kvfree_rcu: use migrate_disable/enable() Uladzislau Rezki (Sony)
2021-01-20 19:45   ` Sebastian Andrzej Siewior
2021-01-20 21:42     ` Paul E. McKenney
2021-01-23  9:31   ` 回复: " Zhang, Qiang
2021-01-24 21:57     ` Uladzislau Rezki
2021-01-25  1:50       ` 回复: " Zhang, Qiang
2021-01-25  2:18         ` Zhang, Qiang
2021-01-25 13:49           ` Uladzislau Rezki
2021-01-26  9:33             ` 回复: " Zhang, Qiang
2021-01-26 13:43               ` Uladzislau Rezki
2021-01-20 18:40 ` [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument Paul E. McKenney
2021-01-20 19:57 ` Sebastian Andrzej Siewior
2021-01-20 21:54   ` Paul E. McKenney
2021-01-21 13:35     ` Uladzislau Rezki
2021-01-21 15:07       ` Paul E. McKenney
2021-01-21 19:17         ` Uladzislau Rezki
2021-01-22 11:17     ` Sebastian Andrzej Siewior
2021-01-22 15:28       ` Paul E. McKenney
2021-01-21 12:38   ` Uladzislau Rezki
2021-01-22 11:34     ` Sebastian Andrzej Siewior
2021-01-22 14:21       ` Uladzislau Rezki
2021-01-25 13:22 ` Michal Hocko
2021-01-25 14:31   ` Uladzislau Rezki
2021-01-25 15:39     ` Michal Hocko
2021-01-25 16:25       ` Uladzislau Rezki
2021-01-28 15:11         ` Uladzislau Rezki
2021-01-28 15:17           ` Michal Hocko
2021-01-28 15:30             ` Uladzislau Rezki
2021-01-28 18:02               ` Uladzislau Rezki
     [not found]                 ` <YBPNvbJLg56XU8co@dhcp22.suse.cz>
2021-01-29 16:35                   ` Uladzislau Rezki
2021-02-01 11:47                     ` Michal Hocko
2021-02-01 14:44                       ` Uladzislau Rezki
2021-02-03 19:37                       ` Paul E. McKenney

RCU Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/rcu/0 rcu/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 rcu rcu/ https://lore.kernel.org/rcu \
		rcu@vger.kernel.org
	public-inbox-index rcu

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.rcu


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git