* [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
@ 2017-11-13 21:37 ` Tetsuo Handa
  0 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-13 21:37 UTC (permalink / raw)
  To: Minchan Kim, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton
  Cc: Shakeel Butt, Greg Thelen, linux-mm, linux-kernel, Tetsuo Handa

When shrinker_rwsem was introduced, it was assumed that
register_shrinker()/unregister_shrinker() are rare paths called only
during initialization and teardown. Nowadays, however,
register_shrinker()/unregister_shrinker() may be called regularly.
This patch prepares for allowing parallel registration/unregistration
of shrinkers.

Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
with a single RCU section. But using atomic_inc()/atomic_dec() around
each do_shrink_slab() call should have little impact.

This patch uses a polling loop with a short sleep in
unregister_shrinker() rather than wait_on_atomic_t(), because it keeps
the reader side cheap (a plain atomic_dec() instead of
atomic_dec_and_test()), do_shrink_slab() of the shrinker being
unregistered is expected to return shortly, and it avoids khungtaskd
warnings when do_shrink_slab() of the shrinker being unregistered
unexpectedly takes too long.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/shrinker.h |  3 ++-
 mm/vmscan.c              | 41 +++++++++++++++++++----------------------
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 388ff29..333a1d0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -62,9 +62,10 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
-	unsigned long flags;
+	unsigned int flags;
 
 	/* These are for internal use */
+	atomic_t nr_active;
 	struct list_head list;
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c1bc95..c8996e8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ struct scan_control {
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_MUTEX(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
-	up_write(&shrinker_rwsem);
+	atomic_set(&shrinker->nr_active, 0);
+	mutex_lock(&shrinker_lock);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
+	mutex_unlock(&shrinker_lock);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -297,9 +298,13 @@ int register_shrinker(struct shrinker *shrinker)
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
-	up_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_lock);
+	list_del_rcu(&shrinker->list);
+	synchronize_rcu();
+	while (atomic_read(&shrinker->nr_active))
+		schedule_timeout_uninterruptible(1);
+	synchronize_rcu();
+	mutex_unlock(&shrinker_lock);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -468,18 +473,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (nr_scanned == 0)
 		nr_scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem)) {
-		/*
-		 * If we would return 0, our callers would understand that we
-		 * have nothing else to shrink and give up trying. By returning
-		 * 1 we keep it going and assume we'll be able to shrink next
-		 * time.
-		 */
-		freed = 1;
-		goto out;
-	}
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -498,11 +493,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 			sc.nid = 0;
 
+		atomic_inc(&shrinker->nr_active);
+		rcu_read_unlock();
 		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+		rcu_read_lock();
+		atomic_dec(&shrinker->nr_active);
 	}
-
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
 	cond_resched();
 	return freed;
 }
-- 
1.8.3.1


* [PATCH 2/2] mm,vmscan: Allow parallel registration/unregistration of shrinkers.
  2017-11-13 21:37 ` Tetsuo Handa
@ 2017-11-13 21:37   ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-13 21:37 UTC (permalink / raw)
  To: Minchan Kim, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton
  Cc: Shakeel Butt, Greg Thelen, linux-mm, linux-kernel, Tetsuo Handa

Shakeel Butt and Greg Thelen noticed that the job loader running in
their production environment can get stuck for tens of seconds during a
mount operation: when the job loader called register_shrinker() (which
it does regularly while mounting), it was blocked because an unrelated
job under memory pressure was calling do_shrink_slab().

Their machines have a lot of shrinkers registered, and jobs under
memory pressure have to traverse all of those memcg-aware shrinkers,
which delays unrelated jobs that want to register/unregister their own
shrinkers.

This patch allows register_shrinker()/unregister_shrinker() to run in
parallel so that each shrinker loaded/unloaded by the job loader is not
blocked waiting for other shrinkers.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
---
 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 26 +++++++++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 333a1d0..05ba330 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -67,6 +67,7 @@ struct shrinker {
 	/* These are for internal use */
 	atomic_t nr_active;
 	struct list_head list;
+	struct list_head gc_list;
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c8996e8..48ff848 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ struct scan_control {
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DEFINE_MUTEX(shrinker_lock);
+static DEFINE_SPINLOCK(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -286,9 +286,9 @@ int register_shrinker(struct shrinker *shrinker)
 		return -ENOMEM;
 
 	atomic_set(&shrinker->nr_active, 0);
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	list_add_tail_rcu(&shrinker->list, &shrinker_list);
-	mutex_unlock(&shrinker_lock);
+	spin_unlock(&shrinker_lock);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -298,13 +298,29 @@ int register_shrinker(struct shrinker *shrinker)
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	mutex_lock(&shrinker_lock);
+	static LIST_HEAD(shrinker_gc_list);
+	struct shrinker *gc;
+
+	spin_lock(&shrinker_lock);
 	list_del_rcu(&shrinker->list);
+	/*
+	 * Update ->list.next of concurrently unregistering shrinkers that
+	 * still point at this shrinker, because this shrinker's
+	 * unregistration might complete (and free it) before theirs does.
+	 */
+	list_for_each_entry(gc, &shrinker_gc_list, gc_list) {
+		if (gc->list.next == &shrinker->list)
+			rcu_assign_pointer(gc->list.next, shrinker->list.next);
+	}
+	list_add_tail(&shrinker->gc_list, &shrinker_gc_list);
+	spin_unlock(&shrinker_lock);
 	synchronize_rcu();
 	while (atomic_read(&shrinker->nr_active))
 		schedule_timeout_uninterruptible(1);
 	synchronize_rcu();
-	mutex_unlock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
+	list_del(&shrinker->gc_list);
+	spin_unlock(&shrinker_lock);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
-- 
1.8.3.1


* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-13 21:37 ` Tetsuo Handa
@ 2017-11-13 22:05   ` Shakeel Butt
  -1 siblings, 0 replies; 74+ messages in thread
From: Shakeel Butt @ 2017-11-13 22:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Minchan Kim, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Greg Thelen,
	Linux MM, LKML

On Mon, Nov 13, 2017 at 1:37 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> When shrinker_rwsem was introduced, it was assumed that
> register_shrinker()/unregister_shrinker() are really unlikely paths
> which are called during initialization and tear down. But nowadays,
> register_shrinker()/unregister_shrinker() might be called regularly.
> This patch prepares for allowing parallel registration/unregistration
> of shrinkers.
>
> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> using one RCU section. But using atomic_inc()/atomic_dec() for each
> do_shrink_slab() call will not impact so much.
>
> This patch uses polling loop with short sleep for unregister_shrinker()
> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> shrinker unexpectedly took so long.
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Reviewed-and-tested-by: Shakeel Butt <shakeelb@google.com>

> ---
>  include/linux/shrinker.h |  3 ++-
>  mm/vmscan.c              | 41 +++++++++++++++++++----------------------
>  2 files changed, 21 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 388ff29..333a1d0 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -62,9 +62,10 @@ struct shrinker {
>
>         int seeks;      /* seeks to recreate an obj */
>         long batch;     /* reclaim batch size, 0 = default */
> -       unsigned long flags;
> +       unsigned int flags;
>
>         /* These are for internal use */
> +       atomic_t nr_active;
>         struct list_head list;
>         /* objs pending delete, per node */
>         atomic_long_t *nr_deferred;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1c1bc95..c8996e8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -157,7 +157,7 @@ struct scan_control {
>  unsigned long vm_total_pages;
>
>  static LIST_HEAD(shrinker_list);
> -static DECLARE_RWSEM(shrinker_rwsem);
> +static DEFINE_MUTEX(shrinker_lock);
>
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
>         if (!shrinker->nr_deferred)
>                 return -ENOMEM;
>
> -       down_write(&shrinker_rwsem);
> -       list_add_tail(&shrinker->list, &shrinker_list);
> -       up_write(&shrinker_rwsem);
> +       atomic_set(&shrinker->nr_active, 0);
> +       mutex_lock(&shrinker_lock);
> +       list_add_tail_rcu(&shrinker->list, &shrinker_list);
> +       mutex_unlock(&shrinker_lock);
>         return 0;
>  }
>  EXPORT_SYMBOL(register_shrinker);
> @@ -297,9 +298,13 @@ int register_shrinker(struct shrinker *shrinker)
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> -       down_write(&shrinker_rwsem);
> -       list_del(&shrinker->list);
> -       up_write(&shrinker_rwsem);
> +       mutex_lock(&shrinker_lock);
> +       list_del_rcu(&shrinker->list);
> +       synchronize_rcu();
> +       while (atomic_read(&shrinker->nr_active))
> +               schedule_timeout_uninterruptible(1);
> +       synchronize_rcu();
> +       mutex_unlock(&shrinker_lock);
>         kfree(shrinker->nr_deferred);
>  }
>  EXPORT_SYMBOL(unregister_shrinker);
> @@ -468,18 +473,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>         if (nr_scanned == 0)
>                 nr_scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem)) {
> -               /*
> -                * If we would return 0, our callers would understand that we
> -                * have nothing else to shrink and give up trying. By returning
> -                * 1 we keep it going and assume we'll be able to shrink next
> -                * time.
> -                */
> -               freed = 1;
> -               goto out;
> -       }
> -
> -       list_for_each_entry(shrinker, &shrinker_list, list) {
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>                 struct shrink_control sc = {
>                         .gfp_mask = gfp_mask,
>                         .nid = nid,
> @@ -498,11 +493,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                 if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>                         sc.nid = 0;
>
> +               atomic_inc(&shrinker->nr_active);
> +               rcu_read_unlock();
>                 freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> +               rcu_read_lock();
> +               atomic_dec(&shrinker->nr_active);
>         }
> -
> -       up_read(&shrinker_rwsem);
> -out:
> +       rcu_read_unlock();
>         cond_resched();
>         return freed;
>  }
> --
> 1.8.3.1
>


* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-13 21:37 ` Tetsuo Handa
@ 2017-11-15  0:56   ` Minchan Kim
  -1 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2017-11-15  0:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Huang Ying, Mel Gorman, Vladimir Davydov, Michal Hocko,
	Johannes Weiner, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> When shrinker_rwsem was introduced, it was assumed that
> register_shrinker()/unregister_shrinker() are really unlikely paths
> which are called during initialization and tear down. But nowadays,
> register_shrinker()/unregister_shrinker() might be called regularly.
> This patch prepares for allowing parallel registration/unregistration
> of shrinkers.
> 
> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> using one RCU section. But using atomic_inc()/atomic_dec() for each
> do_shrink_slab() call will not impact so much.
> 
> This patch uses polling loop with short sleep for unregister_shrinker()
> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> shrinker unexpectedly took so long.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Before reviewing this patch, can't we solve the problem in a simpler
way? Something like this.

Shakeel, what do you think?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711dd8776..cbb624cb9baa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 			sc.nid = 0;
 
 		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+		/*
+		 * Bail out if someone wants to register a new shrinker, to
+		 * prevent a long stall caused by parallel ongoing shrinking.
+		 */
+		if (rwsem_is_contended(&shrinker_rwsem)) {
+			freed = 1;
+			break;
+		}
 	}
 
 	up_read(&shrinker_rwsem);


* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  0:56   ` Minchan Kim
@ 2017-11-15  6:28     ` Shakeel Butt
  -1 siblings, 0 replies; 74+ messages in thread
From: Shakeel Butt @ 2017-11-15  6:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Greg Thelen,
	Linux MM, LKML

On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> When shrinker_rwsem was introduced, it was assumed that
>> register_shrinker()/unregister_shrinker() are really unlikely paths
>> which are called during initialization and tear down. But nowadays,
>> register_shrinker()/unregister_shrinker() might be called regularly.
>> This patch prepares for allowing parallel registration/unregistration
>> of shrinkers.
>>
>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> do_shrink_slab() call will not impact so much.
>>
>> This patch uses polling loop with short sleep for unregister_shrinker()
>> rather than wait_on_atomic_t(), for we can save reader's cost (plain
>> atomic_dec() compared to atomic_dec_and_test()), we can expect that
>> do_shrink_slab() of unregistering shrinker likely returns shortly, and
>> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
>> shrinker unexpectedly took so long.
>>
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>
> Before reviewing this patch, can't we solve the problem with more
> simple way? Like this.
>
> Shakeel, What do you think?
>

Seems simple enough. I will run my test (running a fork bomb in one
memcg and separately timing a mount operation) and update if the
numbers differ significantly.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13d711dd8776..cbb624cb9baa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                         sc.nid = 0;
>
>                 freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> +               /*
> +                * bail out if someone want to register a new shrinker to prevent
> +                * long time stall by parallel ongoing shrinking.
> +                */
> +               if (rwsem_is_contended(&shrinker_rwsem)) {
> +                       freed = 1;

freed = freed ?: 1;

> +                       break;
> +               }
>         }
>
>         up_read(&shrinker_rwsem);

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  0:56   ` Minchan Kim
@ 2017-11-15  8:56     ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-15  8:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Johannes Weiner, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Wed 15-11-17 09:56:02, Minchan Kim wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> > When shrinker_rwsem was introduced, it was assumed that
> > register_shrinker()/unregister_shrinker() are really unlikely paths
> > which are called during initialization and tear down. But nowadays,
> > register_shrinker()/unregister_shrinker() might be called regularly.
> > This patch prepares for allowing parallel registration/unregistration
> > of shrinkers.
> > 
> > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > do_shrink_slab() call will not impact so much.
> > 
> > This patch uses polling loop with short sleep for unregister_shrinker()
> > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > shrinker unexpectedly took so long.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Before reviewing this patch, can't we solve the problem with more
> simple way? Like this.
> 
> Shakeel, What do you think?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13d711dd8776..cbb624cb9baa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  			sc.nid = 0;
>  
>  		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> +		/*
> +		 * bail out if someone want to register a new shrinker to prevent
> +		 * long time stall by parallel ongoing shrinking.
> +		 */
> +		if (rwsem_is_contended(&shrinker_rwsem)) {
> +			freed = 1;
> +			break;
> +		}

So you want to do only partial slab shrinking if we have more contending
direct reclaimers? This would just put more pressure on shrinkers at
the list head than on those at the tail. I do not think this is a good idea.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-13 21:37 ` Tetsuo Handa
@ 2017-11-15  9:02   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-15  9:02 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Minchan Kim, Huang Ying, Mel Gorman, Vladimir Davydov,
	Johannes Weiner, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> When shrinker_rwsem was introduced, it was assumed that
> register_shrinker()/unregister_shrinker() are really unlikely paths
> which are called during initialization and tear down. But nowadays,
> register_shrinker()/unregister_shrinker() might be called regularly.

Please provide some examples. I know your other patch mentions the
usecase but I guess the two patches should be just squashed together.

> This patch prepares for allowing parallel registration/unregistration
> of shrinkers.
> 
> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> using one RCU section. But using atomic_inc()/atomic_dec() for each
> do_shrink_slab() call will not impact so much.
> 
> This patch uses polling loop with short sleep for unregister_shrinker()
> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> shrinker unexpectedly took so long.

I would use wait_event_interruptible in the remove path rather than the
short sleep loop which is just too ugly. The shrinker walk would then
just wake_up the sleeper when the ref. count drops to 0. The two
synchronize_rcu() calls are quite ugly as well, but I was not able to simplify
them. I will keep thinking. It just sucks how we cannot follow the
standard rcu list with dynamically allocated structure pattern here.
 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/shrinker.h |  3 ++-
>  mm/vmscan.c              | 41 +++++++++++++++++++----------------------
>  2 files changed, 21 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 388ff29..333a1d0 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -62,9 +62,10 @@ struct shrinker {
>  
>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
> -	unsigned long flags;
> +	unsigned int flags;

Why?

>  
>  	/* These are for internal use */
> +	atomic_t nr_active;
>  	struct list_head list;
>  	/* objs pending delete, per node */
>  	atomic_long_t *nr_deferred;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1c1bc95..c8996e8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -157,7 +157,7 @@ struct scan_control {
>  unsigned long vm_total_pages;
>  
>  static LIST_HEAD(shrinker_list);
> -static DECLARE_RWSEM(shrinker_rwsem);
> +static DEFINE_MUTEX(shrinker_lock);
>  
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
>  	if (!shrinker->nr_deferred)
>  		return -ENOMEM;
>  
> -	down_write(&shrinker_rwsem);
> -	list_add_tail(&shrinker->list, &shrinker_list);
> -	up_write(&shrinker_rwsem);
> +	atomic_set(&shrinker->nr_active, 0);

I would expect ref counter to be 1 and either remove path dec it down to
0 or the racing walker would. In any case that is when
unregister_shrinker can continue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  8:56     ` Michal Hocko
@ 2017-11-15  9:18       ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-15  9:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Johannes Weiner, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Wed 15-11-17 09:56:25, Michal Hocko wrote:
> On Wed 15-11-17 09:56:02, Minchan Kim wrote:
> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> > > When shrinker_rwsem was introduced, it was assumed that
> > > register_shrinker()/unregister_shrinker() are really unlikely paths
> > > which are called during initialization and tear down. But nowadays,
> > > register_shrinker()/unregister_shrinker() might be called regularly.
> > > This patch prepares for allowing parallel registration/unregistration
> > > of shrinkers.
> > > 
> > > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > > do_shrink_slab() call will not impact so much.
> > > 
> > > This patch uses polling loop with short sleep for unregister_shrinker()
> > > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > > shrinker unexpectedly took so long.
> > > 
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > 
> > Before reviewing this patch, can't we solve the problem with more
> > simple way? Like this.
> > 
> > Shakeel, What do you think?
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 13d711dd8776..cbb624cb9baa 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  			sc.nid = 0;
> >  
> >  		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> > +		/*
> > +		 * bail out if someone want to register a new shrinker to prevent
> > +		 * long time stall by parallel ongoing shrinking.
> > +		 */
> > +		if (rwsem_is_contended(&shrinker_rwsem)) {
> > +			freed = 1;
> > +			break;
> > +		}
> 
> So you want to do only partial slab shrinking if we have more contending
> direct reclaimers? This would just make a larger pressure on those on
> the list head rather than the tail. I do not think this is a good idea.

Scratch that. rwsem_is_contended is true only if there is at least one
writer. So the regular reclaim path will be OK. (Un)Register will
shortcut the reclaim. This is acceptable and actually much simpler
than the complex locking proposed in the patch. So it looks good to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  9:02   ` Michal Hocko
@ 2017-11-15 10:58     ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-15 10:58 UTC (permalink / raw)
  To: mhocko
  Cc: minchan, ying.huang, mgorman, vdavydov.dev, hannes, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> > When shrinker_rwsem was introduced, it was assumed that
> > register_shrinker()/unregister_shrinker() are really unlikely paths
> > which are called during initialization and tear down. But nowadays,
> > register_shrinker()/unregister_shrinker() might be called regularly.
> 
> Please provide some examples. I know your other patch mentions the
> usecase but I guess the two patches should be just squashed together.

They were squashed together in a draft version at
http://lkml.kernel.org/r/2940c150-577a-30a8-fac3-cf59a49b84b4@I-love.SAKURA.ne.jp .
Shakeel suggested that I post the patch for others to review without
parallel register/unregister and SHRINKER_PERMANENT, but since I thought
that parallel register/unregister is still helpful (described below), I
posted them as two patches.

> 
> > This patch prepares for allowing parallel registration/unregistration
> > of shrinkers.
> > 
> > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > do_shrink_slab() call will not impact so much.
> > 
> > This patch uses polling loop with short sleep for unregister_shrinker()
> > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > shrinker unexpectedly took so long.
> 
> I would use wait_event_interruptible in the remove path rather than the
> short sleep loop which is just too ugly. The shrinker walk would then
> just wake_up the sleeper when the ref. count drops to 0. Two
> synchronize_rcu is quite ugly as well, but I was not able to simplify
> them. I will keep thinking. It just sucks how we cannot follow the
> standard rcu list with dynamically allocated structure pattern here.

I think that Minchan's approach depends on how

  In our production, we have observed that the job loader gets stuck for
  10s of seconds while doing mount operation. It turns out that it was
  stuck in register_shrinker() and some unrelated job was under memory
  pressure and spending time in shrink_slab(). Our machines have a lot
  of shrinkers registered and jobs under memory pressure has to traverse
  all of those memcg-aware shrinkers and do affect unrelated jobs which
  want to register their own shrinkers.

is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
less. Or, there might be some specific shrinker where its do_shrink_slab() call
takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.

Since it is possible for a local unprivileged user to lock up the system, at least
due to mutex_trylock(&oom_lock) versus (printk() or schedule_timeout_killable(1)),
I suggest completely eliminating scheduling priority problem (i.e. a very low
scheduling priority thread might take 100 seconds inside some do_shrink_slab()
call) by not relying on an assumption of shortly returning from do_shrink_slab().
My first patch plus my second patch will eliminate relying on such an assumption, and
avoid potential khungtaskd warnings.

>  
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  include/linux/shrinker.h |  3 ++-
> >  mm/vmscan.c              | 41 +++++++++++++++++++----------------------
> >  2 files changed, 21 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 388ff29..333a1d0 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -62,9 +62,10 @@ struct shrinker {
> >  
> >  	int seeks;	/* seeks to recreate an obj */
> >  	long batch;	/* reclaim batch size, 0 = default */
> > -	unsigned long flags;
> > +	unsigned int flags;
> 
> Why?

Shakeel's first version tried to keep the structure size intact on the
x86_64 architecture. In fact, only two flags are currently defined.

> 
> >  
> >  	/* These are for internal use */
> > +	atomic_t nr_active;
> >  	struct list_head list;
> >  	/* objs pending delete, per node */
> >  	atomic_long_t *nr_deferred;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1c1bc95..c8996e8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -157,7 +157,7 @@ struct scan_control {
> >  unsigned long vm_total_pages;
> >  
> >  static LIST_HEAD(shrinker_list);
> > -static DECLARE_RWSEM(shrinker_rwsem);
> > +static DEFINE_MUTEX(shrinker_lock);
> >  
> >  #ifdef CONFIG_MEMCG
> >  static bool global_reclaim(struct scan_control *sc)
> > @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
> >  	if (!shrinker->nr_deferred)
> >  		return -ENOMEM;
> >  
> > -	down_write(&shrinker_rwsem);
> > -	list_add_tail(&shrinker->list, &shrinker_list);
> > -	up_write(&shrinker_rwsem);
> > +	atomic_set(&shrinker->nr_active, 0);
> 
> I would expect ref counter to be 1 and either remove path dec it down to
> 0 or the racing walker would. In any case that is when
> unregister_shrinker can continue.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
@ 2017-11-15 10:58     ` Tetsuo Handa
  0 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-15 10:58 UTC (permalink / raw)
  To: mhocko
  Cc: minchan, ying.huang, mgorman, vdavydov.dev, hannes, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

Michal Hocko wrote:
> On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> > When shrinker_rwsem was introduced, it was assumed that
> > register_shrinker()/unregister_shrinker() are really unlikely paths
> > which are called during initialization and tear down. But nowadays,
> > register_shrinker()/unregister_shrinker() might be called regularly.
> 
> Please provide some examples. I know your other patch mentions the
> usecase but I guess the two patches should be just squashed together.

They were squashed together in a draft version at
http://lkml.kernel.org/r/2940c150-577a-30a8-fac3-cf59a49b84b4@I-love.SAKURA.ne.jp .
Since Shakeel suggested me to post the patch for others to review without
parallel register/unregister and SHRINKER_PERMANENT, but I thought that
parallel register/unregister is still helpful (described below), I posted
as two patches.

> 
> > This patch prepares for allowing parallel registration/unregistration
> > of shrinkers.
> > 
> > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > do_shrink_slab() call will not impact so much.
> > 
> > This patch uses polling loop with short sleep for unregister_shrinker()
> > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > shrinker unexpectedly took so long.
> 
> I would use wait_event_interruptible in the remove path rather than the
> short sleep loop which is just too ugly. The shrinker walk would then
> just wake_up the sleeper when the ref. count drops to 0. Two
> synchronize_rcu is quite ugly as well, but I was not able to simplify
> them. I will keep thinking. It just sucks how we cannot follow the
> standard rcu list with dynamically allocated structure pattern here.

I think that Minchan's approach depends on how

  In our production, we have observed that the job loader gets stuck for
  10s of seconds while doing mount operation. It turns out that it was
  stuck in register_shrinker() and some unrelated job was under memory
  pressure and spending time in shrink_slab(). Our machines have a lot
  of shrinkers registered, and jobs under memory pressure have to traverse
  all of those memcg-aware shrinkers, which affects unrelated jobs that
  want to register their own shrinkers.

is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
less. Or, there might be some specific shrinker where its do_shrink_slab() call
takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.

Since it is possible for a local unprivileged user to lock up the system, at
least due to mutex_trylock(&oom_lock) versus (printk() or
schedule_timeout_killable(1)), I suggest completely eliminating the scheduling
priority problem (i.e. a very low scheduling priority thread might take 100
seconds inside some do_shrink_slab() call) by not relying on the assumption
that do_shrink_slab() returns shortly. My first patch + my second patch
eliminate relying on such an assumption, and avoid potential khungtaskd
warnings.

>  
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  include/linux/shrinker.h |  3 ++-
> >  mm/vmscan.c              | 41 +++++++++++++++++++----------------------
> >  2 files changed, 21 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 388ff29..333a1d0 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -62,9 +62,10 @@ struct shrinker {
> >  
> >  	int seeks;	/* seeks to recreate an obj */
> >  	long batch;	/* reclaim batch size, 0 = default */
> > -	unsigned long flags;
> > +	unsigned int flags;
> 
> Why?

Shakeel's first version tried to keep the structure size intact on the
x86_64 architecture. Besides, only two flags are currently defined.

> 
> >  
> >  	/* These are for internal use */
> > +	atomic_t nr_active;
> >  	struct list_head list;
> >  	/* objs pending delete, per node */
> >  	atomic_long_t *nr_deferred;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1c1bc95..c8996e8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -157,7 +157,7 @@ struct scan_control {
> >  unsigned long vm_total_pages;
> >  
> >  static LIST_HEAD(shrinker_list);
> > -static DECLARE_RWSEM(shrinker_rwsem);
> > +static DEFINE_MUTEX(shrinker_lock);
> >  
> >  #ifdef CONFIG_MEMCG
> >  static bool global_reclaim(struct scan_control *sc)
> > @@ -285,9 +285,10 @@ int register_shrinker(struct shrinker *shrinker)
> >  	if (!shrinker->nr_deferred)
> >  		return -ENOMEM;
> >  
> > -	down_write(&shrinker_rwsem);
> > -	list_add_tail(&shrinker->list, &shrinker_list);
> > -	up_write(&shrinker_rwsem);
> > +	atomic_set(&shrinker->nr_active, 0);
> 
> I would expect the ref counter to be 1, and either the remove path decrements
> it down to 0 or the racing walker does. In any case, that is when
> unregister_shrinker() can continue.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 10:58     ` Tetsuo Handa
@ 2017-11-15 11:51       ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-15 11:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: minchan, ying.huang, mgorman, vdavydov.dev, hannes, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

On Wed 15-11-17 19:58:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> > > When shrinker_rwsem was introduced, it was assumed that
> > > register_shrinker()/unregister_shrinker() are really unlikely paths
> > > which are called during initialization and tear down. But nowadays,
> > > register_shrinker()/unregister_shrinker() might be called regularly.
> > 
> > Please provide some examples. I know your other patch mentions the
> > usecase but I guess the two patches should be just squashed together.
> 
> They were squashed together in a draft version at
> http://lkml.kernel.org/r/2940c150-577a-30a8-fac3-cf59a49b84b4@I-love.SAKURA.ne.jp .
> Shakeel suggested that I post the patch for others to review without
> parallel register/unregister and SHRINKER_PERMANENT, but since I thought
> that parallel register/unregister is still helpful (described below), I
> posted it as two patches.
> 
> > 
> > > This patch prepares for allowing parallel registration/unregistration
> > > of shrinkers.
> > > 
> > > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > > do_shrink_slab() call will not impact so much.
> > > 
> > > This patch uses polling loop with short sleep for unregister_shrinker()
> > > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > > shrinker unexpectedly took so long.
> > 
> > I would use wait_event_interruptible in the remove path rather than the
> > short sleep loop which is just too ugly. The shrinker walk would then
> > just wake_up the sleeper when the ref. count drops to 0. Two
> > synchronize_rcu is quite ugly as well, but I was not able to simplify
> > them. I will keep thinking. It just sucks how we cannot follow the
> > standard rcu list with dynamically allocated structure pattern here.
> 
> I think that Minchan's approach depends on how
> 
>   In our production, we have observed that the job loader gets stuck for
>   10s of seconds while doing mount operation. It turns out that it was
>   stuck in register_shrinker() and some unrelated job was under memory
>   pressure and spending time in shrink_slab(). Our machines have a lot
>   of shrinkers registered, and jobs under memory pressure have to traverse
>   all of those memcg-aware shrinkers, which affects unrelated jobs that
>   want to register their own shrinkers.
> 
> is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
> took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
> help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
> 10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
> less. Or, there might be some specific shrinker where its do_shrink_slab() call
> takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.

I hope we do not have any shrinker that eats that much time. They are not
supposed to... But reality screws our intentions quite often, so I cannot
really tell that nobody is doing crazy stuff. Anyway, I think starting
simpler makes sense here. We will see later.

> Since it is possible for a local unprivileged user to lock up the system, at
> least due to mutex_trylock(&oom_lock) versus (printk() or
> schedule_timeout_killable(1)), I suggest completely eliminating the scheduling
> priority problem (i.e. a very low scheduling priority thread might take 100
> seconds inside some do_shrink_slab() call) by not relying on the assumption
> that do_shrink_slab() returns shortly. My first patch + my second patch
> eliminate relying on such an assumption, and avoid potential khungtaskd
> warnings.

It doesn't, because the priority issues will still be there when anybody
can preempt your shrinker for an extensive amount of time. So no, you are
not fixing the problem. You are merely making it less probable and limiting
it to the removed shrinker. You still do not have any control over
what happens while that shrinker is executed, though.

Anyway, I do not claim your patch is a wrong approach. It is just quite
complex and maybe unnecessarily so for most workloads. Therefore going
with a simpler solution should be preferred until we see it
insufficient.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 10:58     ` Tetsuo Handa
@ 2017-11-15 13:28       ` Johannes Weiner
  -1 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2017-11-15 13:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, minchan, ying.huang, mgorman, vdavydov.dev, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

On Wed, Nov 15, 2017 at 07:58:09PM +0900, Tetsuo Handa wrote:
> I think that Minchan's approach depends on how
> 
>   In our production, we have observed that the job loader gets stuck for
>   10s of seconds while doing mount operation. It turns out that it was
>   stuck in register_shrinker() and some unrelated job was under memory
>   pressure and spending time in shrink_slab(). Our machines have a lot
>   of shrinkers registered, and jobs under memory pressure have to traverse
>   all of those memcg-aware shrinkers, which affects unrelated jobs that
>   want to register their own shrinkers.
> 
> is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
> took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
> help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
> 10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
> less. Or, there might be some specific shrinker where its do_shrink_slab() call
> takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.

In your patch, unregister() waits for shrinker->nr_active to drain instead
of waiting for the lock, and nr_active is decreased in the same location
where Minchan drops the lock. How does that behave differently for
long-running shrinkers?

Anyway, I suspect it's many shrinkers and many concurrent invocations,
so the lockbreak granularity you both chose should be fine.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  9:02   ` Michal Hocko
@ 2017-11-15 14:00     ` Johannes Weiner
  -1 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2017-11-15 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Wed, Nov 15, 2017 at 10:02:51AM +0100, Michal Hocko wrote:
> On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> > This patch uses polling loop with short sleep for unregister_shrinker()
> > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > shrinker unexpectedly took so long.
> 
> I would use wait_event_interruptible in the remove path rather than the
> short sleep loop which is just too ugly. The shrinker walk would then
> just wake_up the sleeper when the ref. count drops to 0. Two
> synchronize_rcu is quite ugly as well, but I was not able to simplify
> them. I will keep thinking. It just sucks how we cannot follow the
> standard rcu list with dynamically allocated structure pattern here.

It's because the refcount is dropped too early. The refcount protects
the object during shrink, but not for the list_next(), and so you need
an additional grace period just for that part.

I think you could drop the reference count in the next iteration. This
way the list_next() works without requiring a second RCU grace period.

ref count protects the object and its list pointers; RCU protects what
the list pointers point to before we acquire the reference:

	rcu_read_lock();
	list_for_each_entry_rcu(pos, list) {
		if (!atomic_inc_not_zero(&pos->ref))
			continue;
		rcu_read_unlock();

		if (prev)
			atomic_dec(&prev->ref);
		prev = pos;

		shrink();

		rcu_read_lock();
	}
	rcu_read_unlock();
	if (prev)
		atomic_dec(&prev->ref);

In any case, Minchan's lock breaking seems way preferable over that
level of headscratching complexity for an unusual case like Shakeel's.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 14:00     ` Johannes Weiner
@ 2017-11-15 14:11       ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-15 14:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Wed 15-11-17 09:00:20, Johannes Weiner wrote:
> On Wed, Nov 15, 2017 at 10:02:51AM +0100, Michal Hocko wrote:
> > On Tue 14-11-17 06:37:42, Tetsuo Handa wrote:
> > > This patch uses polling loop with short sleep for unregister_shrinker()
> > > rather than wait_on_atomic_t(), for we can save reader's cost (plain
> > > atomic_dec() compared to atomic_dec_and_test()), we can expect that
> > > do_shrink_slab() of unregistering shrinker likely returns shortly, and
> > > we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> > > shrinker unexpectedly took so long.
> > 
> > I would use wait_event_interruptible in the remove path rather than the
> > short sleep loop which is just too ugly. The shrinker walk would then
> > just wake_up the sleeper when the ref. count drops to 0. Two
> > synchronize_rcu is quite ugly as well, but I was not able to simplify
> > them. I will keep thinking. It just sucks how we cannot follow the
> > standard rcu list with dynamically allocated structure pattern here.
> 
> It's because the refcount is dropped too early. The refcount protects
> the object during shrink, but not for the list_next(), and so you need
> an additional grace period just for that part.

Exactly

> I think you could drop the reference count in the next iteration. This
> way the list_next() works without requiring a second RCU grace period.

That would work. I was playing with an idea of prefetching the next
element before dropping the reference, but that would require a lock for
the remove operation. Ugly...

> ref count protects the object and its list pointers; RCU protects what
> the list pointers point to before we acquire the reference:
> 
> 	rcu_read_lock();
> 	list_for_each_entry_rcu(pos, list) {
> 		if (!atomic_inc_not_zero(&pos->ref))
> 			continue;
> 		rcu_read_unlock();
> 
> 		if (prev)
> 			atomic_dec(&prev->ref);
> 		prev = pos;
> 
> 		shrink();
> 
> 		rcu_read_lock();
> 	}
> 	rcu_read_unlock();
> 	if (prev)
> 		atomic_dec(&prev->ref);
> 
> In any case, Minchan's lock breaking seems way preferable over that
> level of headscratching complexity for an unusual case like Shakeel's.

agreed! I would go the more complex way only if it turns out that early
break out causes some real problems.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  6:28     ` Shakeel Butt
@ 2017-11-16  0:46       ` Minchan Kim
  -1 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2017-11-16  0:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Greg Thelen,
	Linux MM, LKML

On Tue, Nov 14, 2017 at 10:28:10PM -0800, Shakeel Butt wrote:
> On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> >> When shrinker_rwsem was introduced, it was assumed that
> >> register_shrinker()/unregister_shrinker() are really unlikely paths
> >> which are called during initialization and tear down. But nowadays,
> >> register_shrinker()/unregister_shrinker() might be called regularly.
> >> This patch prepares for allowing parallel registration/unregistration
> >> of shrinkers.
> >>
> >> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> >> using one RCU section. But using atomic_inc()/atomic_dec() for each
> >> do_shrink_slab() call will not impact so much.
> >>
> >> This patch uses polling loop with short sleep for unregister_shrinker()
> >> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> >> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> >> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> >> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> >> shrinker unexpectedly took so long.
> >>
> >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >
> > Before reviewing this patch, can't we solve the problem in a simpler
> > way? Like this.
> >
> > Shakeel, What do you think?
> >
> 
> Seems simple enough. I will run my test (running fork bomb in one
> memcg and separately time a mount operation) and update if numbers
> differ significantly.

Thanks.

> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 13d711dd8776..cbb624cb9baa 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >                         sc.nid = 0;
> >
> >                 freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> > +               /*
> > +                * bail out if someone wants to register a new shrinker to
> > +                * prevent a long stall caused by parallel ongoing shrinking.
> > +                */
> > +               if (rwsem_is_contended(&shrinker_rwsem)) {
> > +                       freed = 1;
> 
> freed = freed ?: 1;

Yub.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 11:51       ` Michal Hocko
@ 2017-11-16  0:56         ` Minchan Kim
  -1 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2017-11-16  0:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, ying.huang, mgorman, vdavydov.dev, hannes, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

On Wed, Nov 15, 2017 at 12:51:43PM +0100, Michal Hocko wrote:
< snip >
> > Since it is possible for a local unprivileged user to lock up the system, at least
> > due to mutex_trylock(&oom_lock) versus (printk() or schedule_timeout_killable(1)),
> > I suggest completely eliminating scheduling priority problem (i.e. a very low
> > scheduling priority thread might take 100 seconds inside some do_shrink_slab()
> > call) by not relying on an assumption of shortly returning from do_shrink_slab().
> > My first patch + my second patch will eliminate relying on such assumption, and
> > avoid potential khungtaskd warnings.
> 
> It doesn't, because the priority issues will still be there when anybody
> can preempt your shrinker for an extensive amount of time. So no, you are
> not fixing the problem. You are merely making it less probable and limiting
> it to the removed shrinker only. You still do not have any control over
> what happens while that shrinker is executing, though.
> 
> Anyway, I do not claim your patch is a wrong approach. It is just quite
> complex and maybe unnecessarily so for most workloads. Therefore going
> with a simpler solution should be preferred until we see it
> insufficient.

That's exactly what I intended.

Try the simple one first. Then wait until the simple one is broken.
If it breaks, we can add the more complicated thing at that point.

With that model, we move toward the complicated stuff with good
justification, without losing the chance to understand/learn the new workload.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-16  0:46       ` Minchan Kim
@ 2017-11-16  1:41         ` Shakeel Butt
  -1 siblings, 0 replies; 74+ messages in thread
From: Shakeel Butt @ 2017-11-16  1:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Greg Thelen,
	Linux MM, LKML

On Wed, Nov 15, 2017 at 4:46 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Nov 14, 2017 at 10:28:10PM -0800, Shakeel Butt wrote:
>> On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> >> When shrinker_rwsem was introduced, it was assumed that
>> >> register_shrinker()/unregister_shrinker() are really unlikely paths
>> >> which are called during initialization and tear down. But nowadays,
>> >> register_shrinker()/unregister_shrinker() might be called regularly.
>> >> This patch prepares for allowing parallel registration/unregistration
>> >> of shrinkers.
>> >>
>> >> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> >> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> >> do_shrink_slab() call will not impact so much.
>> >>
>> >> This patch uses polling loop with short sleep for unregister_shrinker()
>> >> rather than wait_on_atomic_t(), for we can save reader's cost (plain
>> >> atomic_dec() compared to atomic_dec_and_test()), we can expect that
>> >> do_shrink_slab() of unregistering shrinker likely returns shortly, and
>> >> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
>> >> shrinker unexpectedly took so long.
>> >>
>> >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> >
>> > Before reviewing this patch, can't we solve the problem with more
>> > simple way? Like this.
>> >
>> > Shakeel, What do you think?
>> >
>>
>> Seems simple enough. I will run my test (running fork bomb in one
>> memcg and separately time a mount operation) and update if numbers
>> differ significantly.
>
> Thanks.
>
>>
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 13d711dd8776..cbb624cb9baa 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>> >                         sc.nid = 0;
>> >
>> >                 freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
>> > +               /*
>> > +                * bail out if someone want to register a new shrinker to prevent
>> > +                * long time stall by parallel ongoing shrinking.
>> > +                */
>> > +               if (rwsem_is_contended(&shrinker_rwsem)) {
>> > +                       freed = 1;
>>
>> freed = freed ?: 1;
>
> Yub.

Thanks Minchan, you can add

Reviewed-and-tested-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-16  1:41         ` Shakeel Butt
@ 2017-11-16  4:50           ` Minchan Kim
  -1 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2017-11-16  4:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Greg Thelen,
	Linux MM, LKML

On Wed, Nov 15, 2017 at 05:41:41PM -0800, Shakeel Butt wrote:
> On Wed, Nov 15, 2017 at 4:46 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Nov 14, 2017 at 10:28:10PM -0800, Shakeel Butt wrote:
> >> On Tue, Nov 14, 2017 at 4:56 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> >> >> When shrinker_rwsem was introduced, it was assumed that
> >> >> register_shrinker()/unregister_shrinker() are really unlikely paths
> >> >> which are called during initialization and tear down. But nowadays,
> >> >> register_shrinker()/unregister_shrinker() might be called regularly.
> >> >> This patch prepares for allowing parallel registration/unregistration
> >> >> of shrinkers.
> >> >>
> >> >> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> >> >> using one RCU section. But using atomic_inc()/atomic_dec() for each
> >> >> do_shrink_slab() call will not impact so much.
> >> >>
> >> >> This patch uses polling loop with short sleep for unregister_shrinker()
> >> >> rather than wait_on_atomic_t(), for we can save reader's cost (plain
> >> >> atomic_dec() compared to atomic_dec_and_test()), we can expect that
> >> >> do_shrink_slab() of unregistering shrinker likely returns shortly, and
> >> >> we can avoid khungtaskd warnings when do_shrink_slab() of unregistering
> >> >> shrinker unexpectedly took so long.
> >> >>
> >> >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> >
> >> > Before reviewing this patch, can't we solve the problem with more
> >> > simple way? Like this.
> >> >
> >> > Shakeel, What do you think?
> >> >
> >>
> >> Seems simple enough. I will run my test (running fork bomb in one
> >> memcg and separately time a mount operation) and update if numbers
> >> differ significantly.
> >
> > Thanks.
> >
> >>
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 13d711dd8776..cbb624cb9baa 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >> >                         sc.nid = 0;
> >> >
> >> >                 freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> >> > +               /*
> >> > +                * bail out if someone want to register a new shrinker to prevent
> >> > +                * long time stall by parallel ongoing shrinking.
> >> > +                */
> >> > +               if (rwsem_is_contended(&shrinker_rwsem)) {
> >> > +                       freed = 1;
> >>
> >> freed = freed ?: 1;
> >
> > Yub.
> 
> Thanks Minchan, you can add
> 
> Reviewed-and-tested-by: Shakeel Butt <shakeelb@google.com>

Thanks for the testing, Shakeel.

I will send formal patch to Andrew after closing merge window.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 13:28       ` Johannes Weiner
@ 2017-11-16 10:56         ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-16 10:56 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, minchan, ying.huang, mgorman, vdavydov.dev, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

Johannes Weiner wrote:
> On Wed, Nov 15, 2017 at 07:58:09PM +0900, Tetsuo Handa wrote:
> > I think that Minchan's approach depends on how
> > 
> >   In our production, we have observed that the job loader gets stuck for
> >   10s of seconds while doing mount operation. It turns out that it was
> >   stuck in register_shrinker() and some unrelated job was under memory
> >   pressure and spending time in shrink_slab(). Our machines have a lot
> >   of shrinkers registered and jobs under memory pressure has to traverse
> >   all of those memcg-aware shrinkers and do affect unrelated jobs which
> >   want to register their own shrinkers.
> > 
> > is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
> > took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
> > help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
> > 10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
> > less. Or, there might be some specific shrinker where its do_shrink_slab() call
> > takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.
> 
> In your patch, unregister() waits for shrinker->nr_active instead of
> the lock, which is decreased in the same location where Minchan drops
> the lock. How is that different behavior for long-running shrinkers?

My patch waits only for the one shrinker which unregister_shrinker() is
trying to unregister. Minchan's patch waits for the longest-running
in-flight shrinker. The difference is that my patch is not disturbed by
other in-flight shrinkers unless the shrinker being unregistered is itself
the longest-running in-flight one; and it is natural (and required) to wait
for the shrinker which unregister_shrinker() is trying to unregister.
This will make a difference if some shrinker which unregister_shrinker()
is not trying to unregister is doing crazy stuff.

> 
> Anyway, I suspect it's many shrinkers and many concurrent invocations,
> so the lockbreak granularity you both chose should be fine.
> 

So far, Shakeel's environment does not seem to have shrinkers which do
crazy stuff. But if there is one, my approach will reduce the latency.

If we can tolerate a per-"struct task_struct" marker, I think we can
remove the atomic variables.

---
 include/linux/sched.h    |  1 +
 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 67 ++++++++++++++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5dc7c9..f7eed9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1098,6 +1098,7 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+	struct shrinker			*active_shrinker; /* Not for deref. */
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 388ff29..77cfd3f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -66,6 +66,7 @@ struct shrinker {
 
 	/* These are for internal use */
 	struct list_head list;
+	struct list_head gc_list;
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c1bc95..c6b2f5c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ struct scan_control {
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_SPINLOCK(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -285,9 +285,9 @@ int register_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
-	up_write(&shrinker_rwsem);
+	spin_lock(&shrinker_lock);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
+	spin_unlock(&shrinker_lock);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -297,9 +297,40 @@ int register_shrinker(struct shrinker *shrinker)
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
-	up_write(&shrinker_rwsem);
+	struct task_struct *g, *p;
+	static LIST_HEAD(shrinker_gc_list);
+	struct shrinker *gc;
+
+	spin_lock(&shrinker_lock);
+	list_del_rcu(&shrinker->list);
+	/*
+	 * Need to update ->list.next if concurrently unregistering shrinkers
+	 * can find this shrinker, for this shrinker's unregistration might
+	 * complete before their unregistrations complete.
+	 */
+	list_for_each_entry(gc, &shrinker_gc_list, gc_list) {
+		if (gc->list.next == &shrinker->list)
+			rcu_assign_pointer(gc->list.next, shrinker->list.next);
+	}
+	list_add_tail(&shrinker->gc_list, &shrinker_gc_list);
+	spin_unlock(&shrinker_lock);
+	synchronize_rcu();
+ retry:
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (unlikely(p->active_shrinker == shrinker)) {
+			get_task_struct(p);
+			rcu_read_unlock();
+			while (p->active_shrinker == shrinker)
+				schedule_timeout_uninterruptible(1);
+			put_task_struct(p);
+			goto retry;
+		}
+	}
+	rcu_read_unlock();
+	spin_lock(&shrinker_lock);
+	list_del(&shrinker->gc_list);
+	spin_unlock(&shrinker_lock);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -468,18 +499,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (nr_scanned == 0)
 		nr_scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem)) {
-		/*
-		 * If we would return 0, our callers would understand that we
-		 * have nothing else to shrink and give up trying. By returning
-		 * 1 we keep it going and assume we'll be able to shrink next
-		 * time.
-		 */
-		freed = 1;
-		goto out;
-	}
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -498,11 +519,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 			sc.nid = 0;
 
+		current->active_shrinker = shrinker;
+		rcu_read_unlock();
 		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+		rcu_read_lock();
 	}
-
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
+	current->active_shrinker = NULL;
 	cond_resched();
 	return freed;
 }
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
@ 2017-11-16 10:56         ` Tetsuo Handa
  0 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-16 10:56 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, minchan, ying.huang, mgorman, vdavydov.dev, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

Johannes Weiner wrote:
> On Wed, Nov 15, 2017 at 07:58:09PM +0900, Tetsuo Handa wrote:
> > I think that Minchan's approach depends on how
> > 
> >   In our production, we have observed that the job loader gets stuck for
> >   10s of seconds while doing mount operation. It turns out that it was
> >   stuck in register_shrinker() and some unrelated job was under memory
> >   pressure and spending time in shrink_slab(). Our machines have a lot
> >   of shrinkers registered and jobs under memory pressure has to traverse
> >   all of those memcg-aware shrinkers and do affect unrelated jobs which
> >   want to register their own shrinkers.
> > 
> > is interpreted. If there were 100000 shrinkers and each do_shrink_slab() call
> > took 1 millisecond, aborting the iteration as soon as rwsem_is_contended() would
> > help a lot. But if there were 10 shrinkers and each do_shrink_slab() call took
> > 10 seconds, aborting the iteration as soon as rwsem_is_contended() would help
> > less. Or, there might be some specific shrinker where its do_shrink_slab() call
> > takes 100 seconds. In that case, checking rwsem_is_contended() is too lazy.
> 
> In your patch, unregister() waits for shrinker->nr_active instead of
> the lock, which is decreased in the same location where Minchan drops
> the lock. How is that different behavior for long-running shrinkers?

My patch waits for only one shrinker which unregister_shrinker() is trying to
unregister. Minchan's patch waits for the longest-running in-flight shrinkers.
The difference is that my patch is not disturbed by other in-flight shrinkers
unless the shrinker which unregister_shrinker() is trying to unregister is
the longest-running in-flight shrinker, but it is natural (and required thing)
to wait for the shrinker which unregister_shrinker() is trying to unregister.
This will make difference if some shrinker which unregister_shrinker() is not
trying to unregister is doing crazy stuff.

> 
> Anyway, I suspect it's many shrinkers and many concurrent invocations,
> so the lockbreak granularity you both chose should be fine.
> 

So far, Shakeel's environment does not seem to have shrinkers which do
crazy stuff. But if there is, my approach will reduce the latency.

If we can tolerate per "struct task_struct" marker, I think we can remove
atomic variables.

---
 include/linux/sched.h    |  1 +
 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 67 ++++++++++++++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5dc7c9..f7eed9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1098,6 +1098,7 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+	struct shrinker			*active_shrinker; /* Not for deref. */
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 388ff29..77cfd3f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -66,6 +66,7 @@ struct shrinker {
 
 	/* These are for internal use */
 	struct list_head list;
+	struct list_head gc_list;
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c1bc95..c6b2f5c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -157,7 +157,7 @@ struct scan_control {
 unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+static DEFINE_SPINLOCK(shrinker_lock);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -285,9 +285,9 @@ int register_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
-	up_write(&shrinker_rwsem);
+	spin_lock(&shrinker_lock);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
+	spin_unlock(&shrinker_lock);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -297,9 +297,40 @@ int register_shrinker(struct shrinker *shrinker)
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
-	up_write(&shrinker_rwsem);
+	struct task_struct *g, *p;
+	static LIST_HEAD(shrinker_gc_list);
+	struct shrinker *gc;
+
+	spin_lock(&shrinker_lock);
+	list_del_rcu(&shrinker->list);
+	/*
+	 * Need to update ->list.next if concurrently unregistering shrinkers
+	 * can find this shrinker, for this shrinker's unregistration might
+	 * complete before their unregistrations complete.
+	 */
+	list_for_each_entry(gc, &shrinker_gc_list, gc_list) {
+		if (gc->list.next == &shrinker->list)
+			rcu_assign_pointer(gc->list.next, shrinker->list.next);
+	}
+	list_add_tail(&shrinker->gc_list, &shrinker_gc_list);
+	spin_unlock(&shrinker_lock);
+	synchronize_rcu();
+ retry:
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (unlikely(p->active_shrinker == shrinker)) {
+			get_task_struct(p);
+			rcu_read_unlock();
+			while (p->active_shrinker == shrinker)
+				schedule_timeout_uninterruptible(1);
+			put_task_struct(p);
+			goto retry;
+		}
+	}
+	rcu_read_unlock();
+	spin_lock(&shrinker_lock);
+	list_del(&shrinker->gc_list);
+	spin_unlock(&shrinker_lock);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -468,18 +499,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (nr_scanned == 0)
 		nr_scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem)) {
-		/*
-		 * If we would return 0, our callers would understand that we
-		 * have nothing else to shrink and give up trying. By returning
-		 * 1 we keep it going and assume we'll be able to shrink next
-		 * time.
-		 */
-		freed = 1;
-		goto out;
-	}
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -498,11 +519,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 			sc.nid = 0;
 
+		current->active_shrinker = shrinker;
+		rcu_read_unlock();
 		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+		rcu_read_lock();
 	}
-
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
+	current->active_shrinker = NULL;
 	cond_resched();
 	return freed;
 }
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread
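The reader-side dance in the hunk above — publish the shrinker in current->active_shrinker, drop the RCU read lock around the sleeping do_shrink_slab() call, and have unregister_shrinker() wait for the marker to clear — can be sketched as a simplified, single-threaded userspace model. Every name below is invented for illustration; the real RCU primitives and the polling sleep are reduced to comments:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the patch's pattern: shrink_slab() walks the
 * list, publishes the entry it is about to call, and drops the RCU
 * read lock around the (possibly sleeping) callback.
 * unregister_shrinker() unlinks the entry and must then wait until
 * no task still has it published.  All names are hypothetical. */
struct shrinker {
    struct shrinker *next;
    long (*scan)(void);
};

static struct shrinker *shrinker_list;
static struct shrinker *active_shrinker;   /* models current->active_shrinker */

static void register_shrinker(struct shrinker *s)
{
    s->next = shrinker_list;               /* list_add_tail_rcu() in the kernel */
    shrinker_list = s;
}

static long shrink_slab(void)
{
    long freed = 0;
    /* rcu_read_lock() here in the kernel */
    for (struct shrinker *s = shrinker_list; s; s = s->next) {
        active_shrinker = s;               /* publish before sleeping */
        /* rcu_read_unlock(): the callback may reschedule */
        freed += s->scan();
        /* rcu_read_lock() again before touching s->next */
    }
    /* rcu_read_unlock() */
    active_shrinker = NULL;
    return freed;
}

static int unregister_shrinker(struct shrinker *s)
{
    for (struct shrinker **pp = &shrinker_list; *pp; pp = &(*pp)->next) {
        if (*pp == s) {
            *pp = s->next;                 /* list_del_rcu() in the kernel */
            /* kernel: synchronize_rcu(), then poll until no task's
             * ->active_shrinker still points at s */
            return active_shrinker == s ? -1 : 0;
        }
    }
    return -1;                             /* not on the list */
}
```

In the actual patch the wait side does not fail; it sleeps in schedule_timeout_uninterruptible(1) and rechecks every task until the marker is gone.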

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15  0:56   ` Minchan Kim
@ 2017-11-16 17:44     ` Johannes Weiner
  -1 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2017-11-16 17:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Andrew Morton, Shakeel Butt, Greg Thelen, linux-mm,
	linux-kernel

On Wed, Nov 15, 2017 at 09:56:02AM +0900, Minchan Kim wrote:
> @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  			sc.nid = 0;
>  
>  		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> +		/*
> +		 * bail out if someone want to register a new shrinker to prevent
> +		 * long time stall by parallel ongoing shrinking.
> +		 */
> +		if (rwsem_is_contended(&shrinker_rwsem)) {
> +			freed = 1;
> +			break;
> +		}
>  	}

When you send the formal version, please include

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 74+ messages in thread
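The bail-out quoted above can be modeled in plain C. Here rwsem_is_contended() is reduced to a hypothetical contended_after parameter marking when a register/unregister caller shows up; everything beyond the quoted hunk is invented for illustration:

```c
#include <assert.h>

/* Model of the quoted hunk: walk all shrinkers, but if a writer is
 * waiting on shrinker_rwsem, stop early and return freed = 1 so
 * callers assume there is more to shrink and retry next time.
 * contended_after is a stand-in for rwsem_is_contended() becoming
 * true after that many iterations (0 means never contended). */
static long shrink_all(const long *per_shrinker, int n, int contended_after)
{
    long freed = 0;
    for (int i = 0; i < n; i++) {
        freed += per_shrinker[i];          /* do_shrink_slab() */
        /* bail out if someone wants to register a new shrinker, to
         * prevent a long stall behind ongoing parallel shrinking */
        if (contended_after > 0 && i + 1 >= contended_after) {
            freed = 1;                     /* non-zero: "try again later" */
            break;
        }
    }
    return freed;
}
```

The key design point is returning 1 rather than the partial total: a zero return would make callers conclude there is nothing left to shrink.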

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-13 21:37 ` Tetsuo Handa
@ 2017-11-17 17:35   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-11-17 17:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Minchan Kim, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Johannes Weiner, Andrew Morton, Shakeel Butt,
	Greg Thelen, linux-mm, linux-kernel

On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> using one RCU section. But using atomic_inc()/atomic_dec() for each
> do_shrink_slab() call will not impact so much.

But you could use SRCU..

^ permalink raw reply	[flat|nested] 74+ messages in thread
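For contrast with SRCU, the atomic_inc()/atomic_dec() scheme the changelog mentions can be sketched as a userspace model in C11 atomics. All names here are assumptions, and the kernel's short-sleep retry loop on the unregister side is only noted in a comment:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the per-shrinker use count alternative:
 * shrink_slab() pins the shrinker across the sleeping
 * do_shrink_slab() call with a plain inc/dec, and
 * unregister_shrinker() polls until the count drops to zero. */
struct shrinker {
    atomic_int nr_active;
};

static void shrinker_get(struct shrinker *s)
{
    atomic_fetch_add(&s->nr_active, 1);    /* reader may now sleep */
}

static void shrinker_put(struct shrinker *s)
{
    atomic_fetch_sub(&s->nr_active, 1);    /* plain dec, no wakeup needed */
}

/* Unregister side: true once the shrinker can be freed.  The kernel
 * version would short-sleep and recheck rather than return false. */
static bool shrinker_quiesced(struct shrinker *s)
{
    return atomic_load(&s->nr_active) == 0;
}
```

The changelog's argument for polling over wait_on_atomic_t() is visible here: the reader's exit path is a plain decrement with no test-and-wake, at the cost of the unregister side sleeping and rechecking.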

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-17 17:35   ` Christoph Hellwig
@ 2017-11-17 17:41     ` Shakeel Butt
  -1 siblings, 0 replies; 74+ messages in thread
From: Shakeel Butt @ 2017-11-17 17:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Michal Hocko, Johannes Weiner, Andrew Morton,
	Greg Thelen, Linux MM, LKML

On Fri, Nov 17, 2017 at 9:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>> do_shrink_slab() call will not impact so much.
>
> But you could use SRCU..

I looked into that but was advised to not go through that route due to
SRCU behind the CONFIG_SRCU. However now I see the precedence of
"#ifdef CONFIG_SRCU" in drivers/base/core.c and I think if we can take
that route if even after Minchan's patch the issue persists.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-17 17:41     ` Shakeel Butt
@ 2017-11-17 17:53       ` Shakeel Butt
  -1 siblings, 0 replies; 74+ messages in thread
From: Shakeel Butt @ 2017-11-17 17:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Michal Hocko, Johannes Weiner, Andrew Morton,
	Greg Thelen, Linux MM, LKML

On Fri, Nov 17, 2017 at 9:41 AM, Shakeel Butt <shakeelb@google.com> wrote:
> On Fri, Nov 17, 2017 at 9:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
>>> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
>>> using one RCU section. But using atomic_inc()/atomic_dec() for each
>>> do_shrink_slab() call will not impact so much.
>>
>> But you could use SRCU..
>
> I looked into that but was advised to not go through that route due to
> SRCU behind the CONFIG_SRCU. However now I see the precedence of
> "#ifdef CONFIG_SRCU" in drivers/base/core.c and I think if we can take
> that route if even after Minchan's patch the issue persists.

Too many 'ifs' in the last sentence. I just wanted to say we can
consider SRCU if the issue persists even after Minchan's patch.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-17 17:41     ` Shakeel Butt
@ 2017-11-17 18:36       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-11-17 18:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Christoph Hellwig, Tetsuo Handa, Minchan Kim, Huang Ying,
	Mel Gorman, Vladimir Davydov, Michal Hocko, Johannes Weiner,
	Andrew Morton, Greg Thelen, Linux MM, LKML, Paul E. McKenney

On Fri, Nov 17, 2017 at 09:41:46AM -0800, Shakeel Butt wrote:
> On Fri, Nov 17, 2017 at 9:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> >> Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> >> using one RCU section. But using atomic_inc()/atomic_dec() for each
> >> do_shrink_slab() call will not impact so much.
> >
> > But you could use SRCU..
> 
> I looked into that but was advised to not go through that route due to
> SRCU behind the CONFIG_SRCU. However now I see the precedence of
> "#ifdef CONFIG_SRCU" in drivers/base/core.c and I think if we can take
> that route if even after Minchan's patch the issue persists.

To be honest, I'd rather always require RCU then have core kernel
code reinvent it.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-17 17:35   ` Christoph Hellwig
@ 2017-11-20  9:25     ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-20  9:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Johannes Weiner, Andrew Morton, Shakeel Butt,
	Greg Thelen, linux-mm, linux-kernel

On Fri 17-11-17 09:35:21, Christoph Hellwig wrote:
> On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > do_shrink_slab() call will not impact so much.
> 
> But you could use SRCU..

Davidlohr has tried that already http://lkml.kernel.org/r/1434398602.1903.15.camel@stgolabs.net
and failed. Doing SRCU inside core kernel is not seen with a great
support...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-20  9:25     ` Michal Hocko
@ 2017-11-20  9:33       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-11-20  9:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Hellwig, Tetsuo Handa, Minchan Kim, Huang Ying,
	Mel Gorman, Vladimir Davydov, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Greg Thelen, linux-mm, linux-kernel

On Mon, Nov 20, 2017 at 10:25:26AM +0100, Michal Hocko wrote:
> On Fri 17-11-17 09:35:21, Christoph Hellwig wrote:
> > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> > > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > > do_shrink_slab() call will not impact so much.
> > 
> > But you could use SRCU..
> 
> Davidlohr has tried that already http://lkml.kernel.org/r/1434398602.1903.15.camel@stgolabs.net
> and failed. Doing SRCU inside core kernel is not seen with a great
> support...

I can't actually find any objection in that thread.  What's the problem
Davidlohr ran into?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-20  9:33       ` Christoph Hellwig
@ 2017-11-20  9:42         ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-11-20  9:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Johannes Weiner, Andrew Morton, Shakeel Butt,
	Greg Thelen, linux-mm, linux-kernel

On Mon 20-11-17 01:33:09, Christoph Hellwig wrote:
> On Mon, Nov 20, 2017 at 10:25:26AM +0100, Michal Hocko wrote:
> > On Fri 17-11-17 09:35:21, Christoph Hellwig wrote:
> > > On Tue, Nov 14, 2017 at 06:37:42AM +0900, Tetsuo Handa wrote:
> > > > Since do_shrink_slab() can reschedule, we cannot protect shrinker_list
> > > > using one RCU section. But using atomic_inc()/atomic_dec() for each
> > > > do_shrink_slab() call will not impact so much.
> > > 
> > > But you could use SRCU..
> > 
> > Davidlohr has tried that already http://lkml.kernel.org/r/1434398602.1903.15.camel@stgolabs.net
> > and failed. Doing SRCU inside core kernel is not seen with a great
> > support...
> 
> I can't actually find any objection in that thread.  What's the problem
> Davidlohr ran into?

The patch has been dropped because allnoconfig failed to compile back
then http://lkml.kernel.org/r/CAP=VYLr0rPWi1aeuk4w1On9CYRNmnEWwJgGtaX=wEvGaBURtrg@mail.gmail.com
I am having trouble finding the follow-up discussion, though. The main
argument was that SRCU is not generally available and so the core
kernel should not rely on it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-20  9:42         ` Michal Hocko
@ 2017-11-20 10:41           ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-11-20 10:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Hellwig, Tetsuo Handa, Minchan Kim, Huang Ying,
	Mel Gorman, Vladimir Davydov, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Greg Thelen, linux-mm, linux-kernel,
	Paul E. McKenney

On Mon, Nov 20, 2017 at 10:42:37AM +0100, Michal Hocko wrote:
> The patch has been dropped because allnoconfig failed to compile back
> then http://lkml.kernel.org/r/CAP=VYLr0rPWi1aeuk4w1On9CYRNmnEWwJgGtaX=wEvGaBURtrg@mail.gmail.com
> I have problem to find the follow up discussion though. The main
> argument was that SRC is not generally available and so the core
> kernel should rely on it.

Paul,

is there any good reason not to use SRCU in the core kernel and
instead try to reimplement it using atomic counters?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-20 10:41           ` Christoph Hellwig
@ 2017-11-20 10:56             ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2017-11-20 10:56 UTC (permalink / raw)
  To: hch, mhocko
  Cc: minchan, ying.huang, mgorman, vdavydov.dev, hannes, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel, paulmck

Christoph Hellwig wrote:
> On Mon, Nov 20, 2017 at 10:42:37AM +0100, Michal Hocko wrote:
> > The patch has been dropped because allnoconfig failed to compile back
> > then http://lkml.kernel.org/r/CAP=VYLr0rPWi1aeuk4w1On9CYRNmnEWwJgGtaX=wEvGaBURtrg@mail.gmail.com
> > I have problem to find the follow up discussion though. The main
> > argument was that SRC is not generally available and so the core
> > kernel should rely on it.
> 
> Paul,
> 
> isthere any good reason to not use SRCU in the core kernel and
> instead try to reimplement it using atomic counters?

CONFIG_SRCU was added in order to save system size. There are users who run Linux on very
small systems ( https://www.elinux.org/images/5/52/Status-of-embedded-Linux-2017-09-JJ62.pdf ).

Also, atomic counters are not mandatory for shrinker case; e.g.
http://lkml.kernel.org/r/201711161956.EBF57883.QFFMOLOVSOHJFt@I-love.SAKURA.ne.jp .

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-20 10:56             ` Tetsuo Handa
@ 2017-11-20 18:28               ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2017-11-20 18:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hch, mhocko, minchan, ying.huang, mgorman, vdavydov.dev, hannes,
	akpm, shakeelb, gthelen, linux-mm, linux-kernel

On Mon, Nov 20, 2017 at 07:56:28PM +0900, Tetsuo Handa wrote:
> Christoph Hellwig wrote:
> > On Mon, Nov 20, 2017 at 10:42:37AM +0100, Michal Hocko wrote:
> > > The patch has been dropped because allnoconfig failed to compile back
> > > then http://lkml.kernel.org/r/CAP=VYLr0rPWi1aeuk4w1On9CYRNmnEWwJgGtaX=wEvGaBURtrg@mail.gmail.com
> > > I have problem to find the follow up discussion though. The main
> > > argument was that SRC is not generally available and so the core
> > > kernel should rely on it.
> > 
> > Paul,
> > 
> > isthere any good reason to not use SRCU in the core kernel and
> > instead try to reimplement it using atomic counters?
> 
> CONFIG_SRCU was added in order to save system size. There are users who run Linux on very
> small systems ( https://www.elinux.org/images/5/52/Status-of-embedded-Linux-2017-09-JJ62.pdf ).
> 
> Also, atomic counters are not mandatory for shrinker case; e.g.
> http://lkml.kernel.org/r/201711161956.EBF57883.QFFMOLOVSOHJFt@I-love.SAKURA.ne.jp .

CONFIG_SRCU was indeed added in order to shrink single-CPU systems.
But many architectures are now requiring SRCU for one reason or another,
in more and more situations.

So I recently implemented a UP-only Tiny SRCU, which is quite a bit
smaller than its scalable counterpart, Tree SRCU:

   text	   data	    bss	    dec	    hex	filename
    983	     64	      0	   1047	    417	/tmp/c/kernel/rcu/srcutiny.o

   text	   data	    bss	    dec	    hex	filename
   6844	    193	      0	   7037	   1b7d	/tmp/b/kernel/rcu/srcutree.o

So perhaps it is time to unconditionally enable SRCU?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-16 17:44     ` Johannes Weiner
@ 2017-11-23 23:46       ` Minchan Kim
  -1 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2017-11-23 23:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, Huang Ying, Mel Gorman, Vladimir Davydov,
	Michal Hocko, Andrew Morton, Shakeel Butt, Greg Thelen, linux-mm,
	linux-kernel

On Thu, Nov 16, 2017 at 12:44:22PM -0500, Johannes Weiner wrote:
> On Wed, Nov 15, 2017 at 09:56:02AM +0900, Minchan Kim wrote:
> > @@ -498,6 +498,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  			sc.nid = 0;
> >  
> >  		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
> > +		/*
> > +		 * bail out if someone want to register a new shrinker to prevent
> > +		 * long time stall by parallel ongoing shrinking.
> > +		 */
> > +		if (rwsem_is_contended(&shrinker_rwsem)) {
> > +			freed = 1;
> > +			break;
> > +		}
> >  	}
> 
> When you send the formal version, please include
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for the review, Johannes.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2017-11-15 14:11       ` Michal Hocko
@ 2018-01-25  2:04         ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2018-01-25  2:04 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, linux-mm
  Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Wed 15-11-17 09:00:20, Johannes Weiner wrote:
> > In any case, Minchan's lock breaking seems way preferable over that
> > level of headscratching complexity for an unusual case like Shakeel's.
> 
> agreed! I would go the more complex way only if it turns out that early
> break out causes some real problems.
> 

Eric Wheeler wrote (at http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net ):
> Hello all,
> 
> We are getting processes stuck with /proc/pid/stack listing the following:

Yes, I think that this is a silent OOM lockup.

> 
> [<ffffffffac0cd0d2>] io_schedule+0x12/0x40
> [<ffffffffac1b4695>] __lock_page+0x105/0x150
> [<ffffffffac1b4dc1>] pagecache_get_page+0x161/0x210
> [<ffffffffac1d4ab4>] shmem_unused_huge_shrink+0x334/0x3f0
> [<ffffffffac251546>] super_cache_scan+0x176/0x180
> [<ffffffffac1cb6c5>] shrink_slab+0x275/0x460
> [<ffffffffac1d0b8e>] shrink_node+0x10e/0x320
> [<ffffffffac1d0f3d>] node_reclaim+0x19d/0x250
> [<ffffffffac1be0aa>] get_page_from_freelist+0x16a/0xac0
> [<ffffffffac1bed87>] __alloc_pages_nodemask+0x107/0x290
> [<ffffffffac06dbc3>] pte_alloc_one+0x13/0x40
> [<ffffffffac1ef329>] __pte_alloc+0x19/0x100
> [<ffffffffac1f17b8>] alloc_set_pte+0x468/0x4c0
> [<ffffffffac1f184a>] finish_fault+0x3a/0x70
> [<ffffffffac1f369a>] __handle_mm_fault+0x94a/0x1190
> [<ffffffffac1f3fa4>] handle_mm_fault+0xc4/0x1d0
> [<ffffffffac0682a3>] __do_page_fault+0x253/0x4d0
> [<ffffffffac068553>] do_page_fault+0x33/0x120
> [<ffffffffac8019dc>] page_fault+0x4c/0x60
> 
> 
> For some reason io_schedule is not coming back, so shrinker_rwsem never 
> gets an up_read. When this happens, other processes like libvirt get stuck 
> trying to start VMs with the /proc/pid/stack of libvirtd looking like so, 
> while register_shrinker waits for shrinker_rwsem to be released:
> 
> [<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffffac1cb985>] register_shrinker+0x45/0xa0
> [<ffffffffac250f68>] sget_userns+0x468/0x4a0
> [<ffffffffac25106a>] mount_nodev+0x2a/0xa0
> [<ffffffffac251be4>] mount_fs+0x34/0x150
> [<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
> [<ffffffffac272a0e>] do_mount+0x1ee/0xc50
> [<ffffffffac27377e>] SyS_mount+0x7e/0xd0
> [<ffffffffac003831>] do_syscall_64+0x61/0x1a0
> [<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> 

If io_schedule() depends on somebody else's memory allocation request, that
other task will call shrink_slab(), and down_read_trylock(&shrinker_rwsem)
will fail without making progress. That task will then retry forever, as
long as should_continue_reclaim() returns true.

I don't know what is causing should_continue_reclaim() to return true, but
nobody will be able to reclaim memory because down_read_trylock(&shrinker_rwsem)
keeps failing without making progress.

> 
> I seem to be able to reproduce this somewhat reliably, it will likely be 
> stuck by tomorrow morning. Since it does seem to take a day to hang, I was 
> hoping to avoid a bisect and see if anyone has seen this behavior or knows 
> it to be fixed in 4.15-rc.

I think that this problem is not yet fixed in linux-next.git.

> 
> Note that we are using zram as our only swap device, but at the time that 
> it shrink_slab() failed to return, there was plenty of memory available 
> and no swap was in use.
> 
> The machine is generally responsive, but `sync` will hang forever and our 
> only way out is `echo b > /proc/sysrq-trigger`.
> 
> Please suggest any additional information you might need for testing, and 
> I am happy to try patches.
> 
> Thank you for your help!

Pretending we will be able to make progress

	if (!down_read_trylock(&shrinker_rwsem)) {
		/*
		 * If we would return 0, our callers would understand that we
		 * have nothing else to shrink and give up trying. By returning
		 * 1 we keep it going and assume we'll be able to shrink next
		 * time.
		 */
		freed = 1;
		goto out;
	}

can work only if do_shrink_slab() does not depend on somebody else's
memory allocation. I think we should kill the shrinker_rwsem assumption.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-25  2:04         ` Tetsuo Handa
@ 2018-01-25  8:36           ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2018-01-25  8:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Johannes Weiner, linux-mm, Minchan Kim, Huang Ying, Mel Gorman,
	Vladimir Davydov, Andrew Morton, Shakeel Butt, Greg Thelen,
	linux-mm, linux-kernel

On Thu 25-01-18 11:04:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 15-11-17 09:00:20, Johannes Weiner wrote:
> > > In any case, Minchan's lock breaking seems way preferable over that
> > > level of headscratching complexity for an unusual case like Shakeel's.
> > 
> > agreed! I would go the more complex way only if it turns out that early
> > break out causes some real problems.
> > 
> 
> Eric Wheeler wrote (at http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net ):
> > Hello all,
> > 
> > We are getting processes stuck with /proc/pid/stack listing the following:
> 
> Yes, I think that this is a silent OOM lockup.
> 
> > 
> > [<ffffffffac0cd0d2>] io_schedule+0x12/0x40
> > [<ffffffffac1b4695>] __lock_page+0x105/0x150
> > [<ffffffffac1b4dc1>] pagecache_get_page+0x161/0x210
> > [<ffffffffac1d4ab4>] shmem_unused_huge_shrink+0x334/0x3f0
> > [<ffffffffac251546>] super_cache_scan+0x176/0x180
> > [<ffffffffac1cb6c5>] shrink_slab+0x275/0x460
> > [<ffffffffac1d0b8e>] shrink_node+0x10e/0x320
> > [<ffffffffac1d0f3d>] node_reclaim+0x19d/0x250
> > [<ffffffffac1be0aa>] get_page_from_freelist+0x16a/0xac0
> > [<ffffffffac1bed87>] __alloc_pages_nodemask+0x107/0x290
> > [<ffffffffac06dbc3>] pte_alloc_one+0x13/0x40
> > [<ffffffffac1ef329>] __pte_alloc+0x19/0x100
> > [<ffffffffac1f17b8>] alloc_set_pte+0x468/0x4c0
> > [<ffffffffac1f184a>] finish_fault+0x3a/0x70
> > [<ffffffffac1f369a>] __handle_mm_fault+0x94a/0x1190
> > [<ffffffffac1f3fa4>] handle_mm_fault+0xc4/0x1d0
> > [<ffffffffac0682a3>] __do_page_fault+0x253/0x4d0
> > [<ffffffffac068553>] do_page_fault+0x33/0x120
> > [<ffffffffac8019dc>] page_fault+0x4c/0x60
> > 
> > 
> > For some reason io_schedule is not coming back, so shrinker_rwsem never 
> > gets an up_read. When this happens, other processes like libvirt get stuck 
> > trying to start VMs with the /proc/pid/stack of libvirtd looking like so, 
> > while register_shrinker waits for shrinker_rwsem to be released:
> > 
> > [<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
> > [<ffffffffac1cb985>] register_shrinker+0x45/0xa0
> > [<ffffffffac250f68>] sget_userns+0x468/0x4a0
> > [<ffffffffac25106a>] mount_nodev+0x2a/0xa0
> > [<ffffffffac251be4>] mount_fs+0x34/0x150
> > [<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
> > [<ffffffffac272a0e>] do_mount+0x1ee/0xc50
> > [<ffffffffac27377e>] SyS_mount+0x7e/0xd0
> > [<ffffffffac003831>] do_syscall_64+0x61/0x1a0
> > [<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> 
> If io_schedule() depends on somebody else's memory allocation request, that
> somebody else will call shrink_slab() and down_read_trylock(&shrinker_rwsem)
> will fail without making progress. This means that that somebody else will
> forever retry as long as should_continue_reclaim() returns true.

I would rather understand the problem than speculate here. I strongly
suspect somebody simply didn't unlock the page.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-25  8:36           ` Michal Hocko
@ 2018-01-25 10:56             ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2018-01-25 10:56 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, linux-mm, minchan, ying.huang, mgorman, vdavydov.dev,
	akpm, shakeelb, gthelen, linux-mm, linux-kernel

Using a debug patch and a reproducer shown below, we can trivially form
a circular locking dependency shown below.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8219001..240efb1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -950,7 +950,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
-	if (__ratelimit(&oom_rs))
+	if (0 && __ratelimit(&oom_rs))
 		dump_header(oc, p);
 
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1afb2af..9858449 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -410,6 +410,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	return freed;
 }
 
+struct lockdep_map __shrink_slab_map =
+	STATIC_LOCKDEP_MAP_INIT("shrink_slab", &__shrink_slab_map);
+
 /**
  * shrink_slab - shrink slab caches
  * @gfp_mask: allocation context
@@ -453,6 +456,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		goto out;
 	}
 
+	lock_map_acquire(&__shrink_slab_map);
+
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
@@ -491,6 +496,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		}
 	}
 
+	lock_map_release(&__shrink_slab_map);
+
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
----------------------------------------

----------------------------------------
#include <stdlib.h>

int main(int argc, char *argv[])
{
	unsigned long long size;
	char *buf = NULL;
	unsigned long long i;
	for (size = 1048576; size < 512ULL * (1 << 30); size *= 2) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size /= 2;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
----------------------------------------

----------------------------------------
CentOS Linux 7 (Core)
Kernel 4.15.0-rc8-next-20180119+ on an x86_64

localhost login: [   36.954893] cp (2850) used greatest stack depth: 10816 bytes left
[   89.216085] Out of memory: Kill process 6981 (a.out) score 876 or sacrifice child
[   89.225853] Killed process 6981 (a.out) total-vm:4264020kB, anon-rss:3346832kB, file-rss:8kB, shmem-rss:0kB
[   89.313597] oom_reaper: reaped process 6981 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   92.640566] Out of memory: Kill process 6983 (a.out) score 876 or sacrifice child
[   92.642153] 
[   92.643464] Killed process 6983 (a.out) total-vm:4264020kB, anon-rss:3348624kB, file-rss:4kB, shmem-rss:0kB
[   92.644416] ======================================================
[   92.644417] WARNING: possible circular locking dependency detected
[   92.644418] 4.15.0-rc8-next-20180119+ #222 Not tainted
[   92.644419] ------------------------------------------------------
[   92.644419] kworker/u256:29/401 is trying to acquire lock:
[   92.644420]  (shrink_slab){+.+.}, at: [<0000000040040aca>] shrink_slab.part.42+0x73/0x350
[   92.644428] 
[   92.644428] but task is already holding lock:
[   92.665257]  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
[   92.668490] 
[   92.668490] which lock already depends on the new lock.
[   92.668490] 
[   92.672781] 
[   92.672781] the existing dependency chain (in reverse order) is:
[   92.676310] 
[   92.676310] -> #1 (&xfs_nondir_ilock_class){++++}:
[   92.679519]        xfs_free_eofblocks+0x9d/0x210 [xfs]
[   92.681716]        xfs_fs_destroy_inode+0x9e/0x220 [xfs]
[   92.683962]        dispose_list+0x30/0x40
[   92.685822]        prune_icache_sb+0x4d/0x70
[   92.687961]        super_cache_scan+0x136/0x180
[   92.690017]        shrink_slab.part.42+0x205/0x350
[   92.692109]        shrink_node+0x313/0x320
[   92.694177]        kswapd+0x386/0x6d0
[   92.695951]        kthread+0xeb/0x120
[   92.697889]        ret_from_fork+0x3a/0x50
[   92.699800] 
[   92.699800] -> #0 (shrink_slab){+.+.}:
[   92.702676]        shrink_slab.part.42+0x93/0x350
[   92.704756]        shrink_node+0x313/0x320
[   92.706660]        do_try_to_free_pages+0xde/0x350
[   92.708737]        try_to_free_pages+0xc5/0x100
[   92.710734]        __alloc_pages_slowpath+0x41c/0xd60
[   92.712470] oom_reaper: reaped process 6983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   92.712978]        __alloc_pages_nodemask+0x22a/0x270
[   92.713013]        xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
[   92.721378]        xfs_buf_get_map+0xaf/0x140 [xfs]
[   92.723562]        xfs_buf_read_map+0x1f/0xc0 [xfs]
[   92.726105]        xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
[   92.728461]        xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
[   92.731321]        xfs_btree_lookup_get_block+0x82/0x180 [xfs]
[   92.733739]        xfs_btree_lookup+0x118/0x450 [xfs]
[   92.735982]        xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
[   92.738380]        xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
[   92.740646]        xfs_alloc_vextent+0x416/0x480 [xfs]
[   92.743023]        xfs_bmap_btalloc+0x340/0x8b0 [xfs]
[   92.745597]        xfs_bmapi_write+0x6c1/0x1270 [xfs]
[   92.747749]        xfs_iomap_write_allocate+0x16c/0x360 [xfs]
[   92.750317]        xfs_map_blocks+0x175/0x230 [xfs]
[   92.752745]        xfs_do_writepage+0x232/0x6e0 [xfs]
[   92.754843]        write_cache_pages+0x1d1/0x3b0
[   92.756801]        xfs_vm_writepages+0x60/0xa0 [xfs]
[   92.758838]        do_writepages+0x12/0x60
[   92.760822]        __writeback_single_inode+0x2c/0x170
[   92.762895]        writeback_sb_inodes+0x267/0x460
[   92.764851]        __writeback_inodes_wb+0x82/0xb0
[   92.766821]        wb_writeback+0x203/0x210
[   92.768676]        wb_workfn+0x266/0x2e0
[   92.770494]        process_one_work+0x253/0x460
[   92.772378]        worker_thread+0x42/0x3e0
[   92.774153]        kthread+0xeb/0x120
[   92.775775]        ret_from_fork+0x3a/0x50
[   92.777513] 
[   92.777513] other info that might help us debug this:
[   92.777513] 
[   92.781361]  Possible unsafe locking scenario:
[   92.781361] 
[   92.784382]        CPU0                    CPU1
[   92.786276]        ----                    ----
[   92.788130]   lock(&xfs_nondir_ilock_class);
[   92.790048]                                lock(shrink_slab);
[   92.792256]                                lock(&xfs_nondir_ilock_class);
[   92.794756]   lock(shrink_slab);
[   92.796251] 
[   92.796251]  *** DEADLOCK ***
[   92.796251] 
[   92.799521] 6 locks held by kworker/u256:29/401:
[   92.801573]  #0:  ((wq_completion)"writeback"){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
[   92.804947]  #1:  ((work_completion)(&(&wb->dwork)->work)){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
[   92.808596]  #2:  (&type->s_umount_key#31){++++}, at: [<0000000048ea98d7>] trylock_super+0x11/0x50
[   92.811957]  #3:  (sb_internal){.+.+}, at: [<0000000058532c48>] xfs_trans_alloc+0xe4/0x120 [xfs]
[   92.815280]  #4:  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
[   92.819075]  #5:  (shrinker_rwsem){++++}, at: [<0000000039dd500e>] shrink_slab.part.42+0x3c/0x350
[   92.822354] 
[   92.822354] stack backtrace:
[   92.824820] CPU: 1 PID: 401 Comm: kworker/u256:29 Not tainted 4.15.0-rc8-next-20180119+ #222
[   92.827894] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[   92.831609] Workqueue: writeback wb_workfn (flush-8:0)
[   92.833759] Call Trace:
[   92.835144]  dump_stack+0x7d/0xb6
[   92.836761]  print_circular_bug.isra.37+0x1d7/0x1e4
[   92.838908]  __lock_acquire+0x10da/0x15b0
[   92.840744]  ? __lock_acquire+0x390/0x15b0
[   92.842603]  ? lock_acquire+0x51/0x70
[   92.844308]  lock_acquire+0x51/0x70
[   92.845980]  ? shrink_slab.part.42+0x73/0x350
[   92.847896]  shrink_slab.part.42+0x93/0x350
[   92.849799]  ? shrink_slab.part.42+0x73/0x350
[   92.851710]  ? mem_cgroup_iter+0x140/0x530
[   92.853890]  ? mem_cgroup_iter+0x158/0x530
[   92.855897]  shrink_node+0x313/0x320
[   92.857609]  do_try_to_free_pages+0xde/0x350
[   92.859502]  try_to_free_pages+0xc5/0x100
[   92.861328]  __alloc_pages_slowpath+0x41c/0xd60
[   92.863298]  __alloc_pages_nodemask+0x22a/0x270
[   92.865285]  xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
[   92.867621]  xfs_buf_get_map+0xaf/0x140 [xfs]
[   92.869562]  xfs_buf_read_map+0x1f/0xc0 [xfs]
[   92.871494]  xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
[   92.873589]  xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
[   92.876322]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
[   92.878320]  xfs_btree_lookup_get_block+0x82/0x180 [xfs]
[   92.880527]  xfs_btree_lookup+0x118/0x450 [xfs]
[   92.882528]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
[   92.884511]  xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
[   92.886974]  xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
[   92.889088]  xfs_alloc_vextent+0x416/0x480 [xfs]
[   92.891098]  xfs_bmap_btalloc+0x340/0x8b0 [xfs]
[   92.893087]  xfs_bmapi_write+0x6c1/0x1270 [xfs]
[   92.895085]  xfs_iomap_write_allocate+0x16c/0x360 [xfs]
[   92.897277]  xfs_map_blocks+0x175/0x230 [xfs]
[   92.899228]  xfs_do_writepage+0x232/0x6e0 [xfs]
[   92.901218]  write_cache_pages+0x1d1/0x3b0
[   92.903102]  ? xfs_add_to_ioend+0x290/0x290 [xfs]
[   92.905170]  ? xfs_vm_writepages+0x4b/0xa0 [xfs]
[   92.907182]  xfs_vm_writepages+0x60/0xa0 [xfs]
[   92.909114]  do_writepages+0x12/0x60
[   92.910778]  __writeback_single_inode+0x2c/0x170
[   92.912735]  writeback_sb_inodes+0x267/0x460
[   92.914561]  __writeback_inodes_wb+0x82/0xb0
[   92.916413]  wb_writeback+0x203/0x210
[   92.918050]  ? cpumask_next+0x20/0x30
[   92.919790]  ? wb_workfn+0x266/0x2e0
[   92.921384]  wb_workfn+0x266/0x2e0
[   92.922908]  process_one_work+0x253/0x460
[   92.924687]  ? process_one_work+0x1f0/0x460
[   92.926518]  worker_thread+0x42/0x3e0
[   92.928077]  kthread+0xeb/0x120
[   92.929512]  ? process_one_work+0x460/0x460
[   92.931330]  ? kthread_create_worker_on_cpu+0x70/0x70
[   92.933313]  ret_from_fork+0x3a/0x50
----------------------------------------

Normally shrinker_rwsem acts like a shared lock. But as soon as
register_shrinker()/unregister_shrinker() calls down_write(),
shrinker_rwsem starts acting like an exclusive lock.

What is unfortunate is that down_write() is called independently of
memory allocation requests. That is, shrinker_rwsem is essentially
a mutex (hence the debug patch shown above).

----------------------------------------
[<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffac1cb985>] register_shrinker+0x45/0xa0
[<ffffffffac250f68>] sget_userns+0x468/0x4a0
[<ffffffffac25106a>] mount_nodev+0x2a/0xa0
[<ffffffffac251be4>] mount_fs+0x34/0x150
[<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
[<ffffffffac272a0e>] do_mount+0x1ee/0xc50
[<ffffffffac27377e>] SyS_mount+0x7e/0xd0
[<ffffffffac003831>] do_syscall_64+0x61/0x1a0
[<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
----------------------------------------

Therefore, I think that while do_shrink_slab() for GFP_KERNEL is in progress
and down_read_trylock() starts failing because somebody else has started
waiting in down_write(), do_shrink_slab() for GFP_NOFS or GFP_NOIO cannot be
called. Doesn't such a race cause unexpected results?

Michal Hocko wrote:
> I would rather understand the problem than speculate here. I strongly
> suspect somebody simply didn't unlock the page.

Then, can we please have a mechanism which tells us whether somebody else
is stuck in a memory allocation request? It is basically
https://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

^ permalink raw reply	[flat|nested] 74+ messages in thread

[   92.668490] which lock already depends on the new lock.
[   92.668490] 
[   92.672781] 
[   92.672781] the existing dependency chain (in reverse order) is:
[   92.676310] 
[   92.676310] -> #1 (&xfs_nondir_ilock_class){++++}:
[   92.679519]        xfs_free_eofblocks+0x9d/0x210 [xfs]
[   92.681716]        xfs_fs_destroy_inode+0x9e/0x220 [xfs]
[   92.683962]        dispose_list+0x30/0x40
[   92.685822]        prune_icache_sb+0x4d/0x70
[   92.687961]        super_cache_scan+0x136/0x180
[   92.690017]        shrink_slab.part.42+0x205/0x350
[   92.692109]        shrink_node+0x313/0x320
[   92.694177]        kswapd+0x386/0x6d0
[   92.695951]        kthread+0xeb/0x120
[   92.697889]        ret_from_fork+0x3a/0x50
[   92.699800] 
[   92.699800] -> #0 (shrink_slab){+.+.}:
[   92.702676]        shrink_slab.part.42+0x93/0x350
[   92.704756]        shrink_node+0x313/0x320
[   92.706660]        do_try_to_free_pages+0xde/0x350
[   92.708737]        try_to_free_pages+0xc5/0x100
[   92.710734]        __alloc_pages_slowpath+0x41c/0xd60
[   92.712470] oom_reaper: reaped process 6983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   92.712978]        __alloc_pages_nodemask+0x22a/0x270
[   92.713013]        xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
[   92.721378]        xfs_buf_get_map+0xaf/0x140 [xfs]
[   92.723562]        xfs_buf_read_map+0x1f/0xc0 [xfs]
[   92.726105]        xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
[   92.728461]        xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
[   92.731321]        xfs_btree_lookup_get_block+0x82/0x180 [xfs]
[   92.733739]        xfs_btree_lookup+0x118/0x450 [xfs]
[   92.735982]        xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
[   92.738380]        xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
[   92.740646]        xfs_alloc_vextent+0x416/0x480 [xfs]
[   92.743023]        xfs_bmap_btalloc+0x340/0x8b0 [xfs]
[   92.745597]        xfs_bmapi_write+0x6c1/0x1270 [xfs]
[   92.747749]        xfs_iomap_write_allocate+0x16c/0x360 [xfs]
[   92.750317]        xfs_map_blocks+0x175/0x230 [xfs]
[   92.752745]        xfs_do_writepage+0x232/0x6e0 [xfs]
[   92.754843]        write_cache_pages+0x1d1/0x3b0
[   92.756801]        xfs_vm_writepages+0x60/0xa0 [xfs]
[   92.758838]        do_writepages+0x12/0x60
[   92.760822]        __writeback_single_inode+0x2c/0x170
[   92.762895]        writeback_sb_inodes+0x267/0x460
[   92.764851]        __writeback_inodes_wb+0x82/0xb0
[   92.766821]        wb_writeback+0x203/0x210
[   92.768676]        wb_workfn+0x266/0x2e0
[   92.770494]        process_one_work+0x253/0x460
[   92.772378]        worker_thread+0x42/0x3e0
[   92.774153]        kthread+0xeb/0x120
[   92.775775]        ret_from_fork+0x3a/0x50
[   92.777513] 
[   92.777513] other info that might help us debug this:
[   92.777513] 
[   92.781361]  Possible unsafe locking scenario:
[   92.781361] 
[   92.784382]        CPU0                    CPU1
[   92.786276]        ----                    ----
[   92.788130]   lock(&xfs_nondir_ilock_class);
[   92.790048]                                lock(shrink_slab);
[   92.792256]                                lock(&xfs_nondir_ilock_class);
[   92.794756]   lock(shrink_slab);
[   92.796251] 
[   92.796251]  *** DEADLOCK ***
[   92.796251] 
[   92.799521] 6 locks held by kworker/u256:29/401:
[   92.801573]  #0:  ((wq_completion)"writeback"){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
[   92.804947]  #1:  ((work_completion)(&(&wb->dwork)->work)){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
[   92.808596]  #2:  (&type->s_umount_key#31){++++}, at: [<0000000048ea98d7>] trylock_super+0x11/0x50
[   92.811957]  #3:  (sb_internal){.+.+}, at: [<0000000058532c48>] xfs_trans_alloc+0xe4/0x120 [xfs]
[   92.815280]  #4:  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
[   92.819075]  #5:  (shrinker_rwsem){++++}, at: [<0000000039dd500e>] shrink_slab.part.42+0x3c/0x350
[   92.822354] 
[   92.822354] stack backtrace:
[   92.824820] CPU: 1 PID: 401 Comm: kworker/u256:29 Not tainted 4.15.0-rc8-next-20180119+ #222
[   92.827894] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[   92.831609] Workqueue: writeback wb_workfn (flush-8:0)
[   92.833759] Call Trace:
[   92.835144]  dump_stack+0x7d/0xb6
[   92.836761]  print_circular_bug.isra.37+0x1d7/0x1e4
[   92.838908]  __lock_acquire+0x10da/0x15b0
[   92.840744]  ? __lock_acquire+0x390/0x15b0
[   92.842603]  ? lock_acquire+0x51/0x70
[   92.844308]  lock_acquire+0x51/0x70
[   92.845980]  ? shrink_slab.part.42+0x73/0x350
[   92.847896]  shrink_slab.part.42+0x93/0x350
[   92.849799]  ? shrink_slab.part.42+0x73/0x350
[   92.851710]  ? mem_cgroup_iter+0x140/0x530
[   92.853890]  ? mem_cgroup_iter+0x158/0x530
[   92.855897]  shrink_node+0x313/0x320
[   92.857609]  do_try_to_free_pages+0xde/0x350
[   92.859502]  try_to_free_pages+0xc5/0x100
[   92.861328]  __alloc_pages_slowpath+0x41c/0xd60
[   92.863298]  __alloc_pages_nodemask+0x22a/0x270
[   92.865285]  xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
[   92.867621]  xfs_buf_get_map+0xaf/0x140 [xfs]
[   92.869562]  xfs_buf_read_map+0x1f/0xc0 [xfs]
[   92.871494]  xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
[   92.873589]  xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
[   92.876322]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
[   92.878320]  xfs_btree_lookup_get_block+0x82/0x180 [xfs]
[   92.880527]  xfs_btree_lookup+0x118/0x450 [xfs]
[   92.882528]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
[   92.884511]  xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
[   92.886974]  xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
[   92.889088]  xfs_alloc_vextent+0x416/0x480 [xfs]
[   92.891098]  xfs_bmap_btalloc+0x340/0x8b0 [xfs]
[   92.893087]  xfs_bmapi_write+0x6c1/0x1270 [xfs]
[   92.895085]  xfs_iomap_write_allocate+0x16c/0x360 [xfs]
[   92.897277]  xfs_map_blocks+0x175/0x230 [xfs]
[   92.899228]  xfs_do_writepage+0x232/0x6e0 [xfs]
[   92.901218]  write_cache_pages+0x1d1/0x3b0
[   92.903102]  ? xfs_add_to_ioend+0x290/0x290 [xfs]
[   92.905170]  ? xfs_vm_writepages+0x4b/0xa0 [xfs]
[   92.907182]  xfs_vm_writepages+0x60/0xa0 [xfs]
[   92.909114]  do_writepages+0x12/0x60
[   92.910778]  __writeback_single_inode+0x2c/0x170
[   92.912735]  writeback_sb_inodes+0x267/0x460
[   92.914561]  __writeback_inodes_wb+0x82/0xb0
[   92.916413]  wb_writeback+0x203/0x210
[   92.918050]  ? cpumask_next+0x20/0x30
[   92.919790]  ? wb_workfn+0x266/0x2e0
[   92.921384]  wb_workfn+0x266/0x2e0
[   92.922908]  process_one_work+0x253/0x460
[   92.924687]  ? process_one_work+0x1f0/0x460
[   92.926518]  worker_thread+0x42/0x3e0
[   92.928077]  kthread+0xeb/0x120
[   92.929512]  ? process_one_work+0x460/0x460
[   92.931330]  ? kthread_create_worker_on_cpu+0x70/0x70
[   92.933313]  ret_from_fork+0x3a/0x50
----------------------------------------

Normally shrinker_rwsem acts like a shared lock. But when
register_shrinker()/unregister_shrinker() calls down_write(),
shrinker_rwsem suddenly starts acting like an exclusive lock.

What is unfortunate is that down_write() is called independently of
memory allocation requests. That is, shrinker_rwsem is essentially
a mutex (hence the debug patch shown above).

----------------------------------------
[<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffac1cb985>] register_shrinker+0x45/0xa0
[<ffffffffac250f68>] sget_userns+0x468/0x4a0
[<ffffffffac25106a>] mount_nodev+0x2a/0xa0
[<ffffffffac251be4>] mount_fs+0x34/0x150
[<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
[<ffffffffac272a0e>] do_mount+0x1ee/0xc50
[<ffffffffac27377e>] SyS_mount+0x7e/0xd0
[<ffffffffac003831>] do_syscall_64+0x61/0x1a0
[<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
----------------------------------------

Therefore, I think that when do_shrink_slab() for GFP_KERNEL is in progress
and down_read_trylock() starts failing because somebody else started waiting at
down_write(), do_shrink_slab() for GFP_NOFS or GFP_NOIO cannot be called.
Doesn't such a race cause unexpected results?

Michal Hocko wrote:
> I would rather understand the problem than speculate here. I strongly
> suspect somebody simply didn't unlock the page.

Then, can we please have a mechanism that tells us whether somebody
else is stuck performing memory allocation requests? It is basically
https://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-25 10:56             ` Tetsuo Handa
@ 2018-01-25 11:41               ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2018-01-25 11:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, linux-mm, minchan, ying.huang, mgorman, vdavydov.dev,
	akpm, shakeelb, gthelen, linux-mm, linux-kernel

On Thu 25-01-18 19:56:59, Tetsuo Handa wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1afb2af..9858449 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -410,6 +410,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	return freed;
>  }
>  
> +struct lockdep_map __shrink_slab_map =
> +	STATIC_LOCKDEP_MAP_INIT("shrink_slab", &__shrink_slab_map);
> +
>  /**
>   * shrink_slab - shrink slab caches
>   * @gfp_mask: allocation context
> @@ -453,6 +456,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		goto out;
>  	}
>  
> +	lock_map_acquire(&__shrink_slab_map);
> +
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
> @@ -491,6 +496,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		}
>  	}
>  
> +	lock_map_release(&__shrink_slab_map);
> +
>  	up_read(&shrinker_rwsem);
>  out:
>  	cond_resched();

I am not an expert on lockdep annotations, but is this something that
makes sense? Don't you need lock_acquire_shared()? Otherwise lockdep
will simply consider this a deadlock if we succeed on the trylock twice.
But in any case the trylock already records any dependency, since
lockdep is involved when the lock is taken, and we take no action when
it fails. So what is the point?
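For completeness, the shared-acquire variant mentioned here would presumably look like the sketch below. This is untested kernel-side pseudocode, not a compiled change; lock_map_acquire_read() is the read-side counterpart of lock_map_acquire() in include/linux/lockdep.h, and everything else is unchanged from the debug patch:

```c
	/* Sketch only: annotate shrink_slab() as a *shared* acquisition so
	 * that two concurrent shrink_slab() calls are not flagged as
	 * recursive locking.
	 */
	lock_map_acquire_read(&__shrink_slab_map);	/* instead of lock_map_acquire() */

	list_for_each_entry(shrinker, &shrinker_list, list) {
		/* ... unchanged ... */
	}

	lock_map_release(&__shrink_slab_map);
```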

I am not familiar enough with XFS to read the lockdep trace properly.

[...]
 
> Normally shrinker_rwsem acts like a shared lock. But when
> register_shrinker()/unregister_shrinker() calls down_write(),
> shrinker_rwsem suddenly starts acting like an exclusive lock.

How come? We only do a trylock, which means that we won't take it
_after_ a writer claims the lock.

> What is unfortunate is that down_write() is called independently of
> memory allocation requests. That is, shrinker_rwsem is essentially
> a mutex (hence the debug patch shown above).
> 
> ----------------------------------------
> [<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffffac1cb985>] register_shrinker+0x45/0xa0
> [<ffffffffac250f68>] sget_userns+0x468/0x4a0
> [<ffffffffac25106a>] mount_nodev+0x2a/0xa0
> [<ffffffffac251be4>] mount_fs+0x34/0x150
> [<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
> [<ffffffffac272a0e>] do_mount+0x1ee/0xc50
> [<ffffffffac27377e>] SyS_mount+0x7e/0xd0
> [<ffffffffac003831>] do_syscall_64+0x61/0x1a0
> [<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> ----------------------------------------
> 
> Therefore, I think that when do_shrink_slab() for GFP_KERNEL is in progress
> and down_read_trylock() starts failing because somebody else started waiting at
> down_write(), do_shrink_slab() for GFP_NOFS or GFP_NOIO cannot be called.
> Doesn't such a race cause unexpected results?

This is really hard to tell. I would expect that skipped shrinkers
would lead to the OOM killer sooner or later. As soon as the
shrinker-managed memory is the only memory left for reclaim, we should
OOM. And I do not see anything obvious that would prevent that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-25 10:56             ` Tetsuo Handa
@ 2018-01-25 22:19               ` Eric Wheeler
  -1 siblings, 0 replies; 74+ messages in thread
From: Eric Wheeler @ 2018-01-25 22:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, hannes, minchan, ying.huang, mgorman, vdavydov.dev, akpm,
	shakeelb, gthelen, linux-mm, linux-kernel

On Thu, 25 Jan 2018, Tetsuo Handa wrote:

> Using the debug patch and the reproducer shown below, we can trivially
> trigger the circular locking dependency report shown below.
> 
> ----------------------------------------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8219001..240efb1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -950,7 +950,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	task_unlock(p);
>  
> -	if (__ratelimit(&oom_rs))
> +	if (0 && __ratelimit(&oom_rs))
>  		dump_header(oc, p);
>  
>  	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1afb2af..9858449 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -410,6 +410,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	return freed;
>  }
>  
> +struct lockdep_map __shrink_slab_map =
> +	STATIC_LOCKDEP_MAP_INIT("shrink_slab", &__shrink_slab_map);
> +
>  /**
>   * shrink_slab - shrink slab caches
>   * @gfp_mask: allocation context
> @@ -453,6 +456,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		goto out;
>  	}
>  
> +	lock_map_acquire(&__shrink_slab_map);
> +
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
> @@ -491,6 +496,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		}
>  	}
>  
> +	lock_map_release(&__shrink_slab_map);
> +
>  	up_read(&shrinker_rwsem);
>  out:
>  	cond_resched();
> ----------------------------------------
> 
> ----------------------------------------
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
> 	unsigned long long size;
> 	char *buf = NULL;
> 	unsigned long long i;
> 	for (size = 1048576; size < 512ULL * (1 << 30); size *= 2) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size /= 2;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	for (i = 0; i < size; i += 4096)
> 		buf[i] = 0;
> 	return 0;
> }

Hi Tetsuo,

Thank you for looking into this!

I tried running this C program on 4.14.15 but did not get a deadlock, just 
OOM kills. Is the debug patch required to induce the deadlock?

Also, what are you doing to XFS to make it trigger?

--
Eric Wheeler

> ----------------------------------------
> 
> ----------------------------------------
> CentOS Linux 7 (Core)
> Kernel 4.15.0-rc8-next-20180119+ on an x86_64
> 
> localhost login: [   36.954893] cp (2850) used greatest stack depth: 10816 bytes left
> [   89.216085] Out of memory: Kill process 6981 (a.out) score 876 or sacrifice child
> [   89.225853] Killed process 6981 (a.out) total-vm:4264020kB, anon-rss:3346832kB, file-rss:8kB, shmem-rss:0kB
> [   89.313597] oom_reaper: reaped process 6981 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [   92.640566] Out of memory: Kill process 6983 (a.out) score 876 or sacrifice child
> [   92.642153] 
> [   92.643464] Killed process 6983 (a.out) total-vm:4264020kB, anon-rss:3348624kB, file-rss:4kB, shmem-rss:0kB
> [   92.644416] ======================================================
> [   92.644417] WARNING: possible circular locking dependency detected
> [   92.644418] 4.15.0-rc8-next-20180119+ #222 Not tainted
> [   92.644419] ------------------------------------------------------
> [   92.644419] kworker/u256:29/401 is trying to acquire lock:
> [   92.644420]  (shrink_slab){+.+.}, at: [<0000000040040aca>] shrink_slab.part.42+0x73/0x350
> [   92.644428] 
> [   92.644428] but task is already holding lock:
> [   92.665257]  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
> [   92.668490] 
> [   92.668490] which lock already depends on the new lock.
> [   92.668490] 
> [   92.672781] 
> [   92.672781] the existing dependency chain (in reverse order) is:
> [   92.676310] 
> [   92.676310] -> #1 (&xfs_nondir_ilock_class){++++}:
> [   92.679519]        xfs_free_eofblocks+0x9d/0x210 [xfs]
> [   92.681716]        xfs_fs_destroy_inode+0x9e/0x220 [xfs]
> [   92.683962]        dispose_list+0x30/0x40
> [   92.685822]        prune_icache_sb+0x4d/0x70
> [   92.687961]        super_cache_scan+0x136/0x180
> [   92.690017]        shrink_slab.part.42+0x205/0x350
> [   92.692109]        shrink_node+0x313/0x320
> [   92.694177]        kswapd+0x386/0x6d0
> [   92.695951]        kthread+0xeb/0x120
> [   92.697889]        ret_from_fork+0x3a/0x50
> [   92.699800] 
> [   92.699800] -> #0 (shrink_slab){+.+.}:
> [   92.702676]        shrink_slab.part.42+0x93/0x350
> [   92.704756]        shrink_node+0x313/0x320
> [   92.706660]        do_try_to_free_pages+0xde/0x350
> [   92.708737]        try_to_free_pages+0xc5/0x100
> [   92.710734]        __alloc_pages_slowpath+0x41c/0xd60
> [   92.712470] oom_reaper: reaped process 6983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [   92.712978]        __alloc_pages_nodemask+0x22a/0x270
> [   92.713013]        xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
> [   92.721378]        xfs_buf_get_map+0xaf/0x140 [xfs]
> [   92.723562]        xfs_buf_read_map+0x1f/0xc0 [xfs]
> [   92.726105]        xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
> [   92.728461]        xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
> [   92.731321]        xfs_btree_lookup_get_block+0x82/0x180 [xfs]
> [   92.733739]        xfs_btree_lookup+0x118/0x450 [xfs]
> [   92.735982]        xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
> [   92.738380]        xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
> [   92.740646]        xfs_alloc_vextent+0x416/0x480 [xfs]
> [   92.743023]        xfs_bmap_btalloc+0x340/0x8b0 [xfs]
> [   92.745597]        xfs_bmapi_write+0x6c1/0x1270 [xfs]
> [   92.747749]        xfs_iomap_write_allocate+0x16c/0x360 [xfs]
> [   92.750317]        xfs_map_blocks+0x175/0x230 [xfs]
> [   92.752745]        xfs_do_writepage+0x232/0x6e0 [xfs]
> [   92.754843]        write_cache_pages+0x1d1/0x3b0
> [   92.756801]        xfs_vm_writepages+0x60/0xa0 [xfs]
> [   92.758838]        do_writepages+0x12/0x60
> [   92.760822]        __writeback_single_inode+0x2c/0x170
> [   92.762895]        writeback_sb_inodes+0x267/0x460
> [   92.764851]        __writeback_inodes_wb+0x82/0xb0
> [   92.766821]        wb_writeback+0x203/0x210
> [   92.768676]        wb_workfn+0x266/0x2e0
> [   92.770494]        process_one_work+0x253/0x460
> [   92.772378]        worker_thread+0x42/0x3e0
> [   92.774153]        kthread+0xeb/0x120
> [   92.775775]        ret_from_fork+0x3a/0x50
> [   92.777513] 
> [   92.777513] other info that might help us debug this:
> [   92.777513] 
> [   92.781361]  Possible unsafe locking scenario:
> [   92.781361] 
> [   92.784382]        CPU0                    CPU1
> [   92.786276]        ----                    ----
> [   92.788130]   lock(&xfs_nondir_ilock_class);
> [   92.790048]                                lock(shrink_slab);
> [   92.792256]                                lock(&xfs_nondir_ilock_class);
> [   92.794756]   lock(shrink_slab);
> [   92.796251] 
> [   92.796251]  *** DEADLOCK ***
> [   92.796251] 
> [   92.799521] 6 locks held by kworker/u256:29/401:
> [   92.801573]  #0:  ((wq_completion)"writeback"){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
> [   92.804947]  #1:  ((work_completion)(&(&wb->dwork)->work)){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
> [   92.808596]  #2:  (&type->s_umount_key#31){++++}, at: [<0000000048ea98d7>] trylock_super+0x11/0x50
> [   92.811957]  #3:  (sb_internal){.+.+}, at: [<0000000058532c48>] xfs_trans_alloc+0xe4/0x120 [xfs]
> [   92.815280]  #4:  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
> [   92.819075]  #5:  (shrinker_rwsem){++++}, at: [<0000000039dd500e>] shrink_slab.part.42+0x3c/0x350
> [   92.822354] 
> [   92.822354] stack backtrace:
> [   92.824820] CPU: 1 PID: 401 Comm: kworker/u256:29 Not tainted 4.15.0-rc8-next-20180119+ #222
> [   92.827894] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
> [   92.831609] Workqueue: writeback wb_workfn (flush-8:0)
> [   92.833759] Call Trace:
> [   92.835144]  dump_stack+0x7d/0xb6
> [   92.836761]  print_circular_bug.isra.37+0x1d7/0x1e4
> [   92.838908]  __lock_acquire+0x10da/0x15b0
> [   92.840744]  ? __lock_acquire+0x390/0x15b0
> [   92.842603]  ? lock_acquire+0x51/0x70
> [   92.844308]  lock_acquire+0x51/0x70
> [   92.845980]  ? shrink_slab.part.42+0x73/0x350
> [   92.847896]  shrink_slab.part.42+0x93/0x350
> [   92.849799]  ? shrink_slab.part.42+0x73/0x350
> [   92.851710]  ? mem_cgroup_iter+0x140/0x530
> [   92.853890]  ? mem_cgroup_iter+0x158/0x530
> [   92.855897]  shrink_node+0x313/0x320
> [   92.857609]  do_try_to_free_pages+0xde/0x350
> [   92.859502]  try_to_free_pages+0xc5/0x100
> [   92.861328]  __alloc_pages_slowpath+0x41c/0xd60
> [   92.863298]  __alloc_pages_nodemask+0x22a/0x270
> [   92.865285]  xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
> [   92.867621]  xfs_buf_get_map+0xaf/0x140 [xfs]
> [   92.869562]  xfs_buf_read_map+0x1f/0xc0 [xfs]
> [   92.871494]  xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
> [   92.873589]  xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
> [   92.876322]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
> [   92.878320]  xfs_btree_lookup_get_block+0x82/0x180 [xfs]
> [   92.880527]  xfs_btree_lookup+0x118/0x450 [xfs]
> [   92.882528]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
> [   92.884511]  xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
> [   92.886974]  xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
> [   92.889088]  xfs_alloc_vextent+0x416/0x480 [xfs]
> [   92.891098]  xfs_bmap_btalloc+0x340/0x8b0 [xfs]
> [   92.893087]  xfs_bmapi_write+0x6c1/0x1270 [xfs]
> [   92.895085]  xfs_iomap_write_allocate+0x16c/0x360 [xfs]
> [   92.897277]  xfs_map_blocks+0x175/0x230 [xfs]
> [   92.899228]  xfs_do_writepage+0x232/0x6e0 [xfs]
> [   92.901218]  write_cache_pages+0x1d1/0x3b0
> [   92.903102]  ? xfs_add_to_ioend+0x290/0x290 [xfs]
> [   92.905170]  ? xfs_vm_writepages+0x4b/0xa0 [xfs]
> [   92.907182]  xfs_vm_writepages+0x60/0xa0 [xfs]
> [   92.909114]  do_writepages+0x12/0x60
> [   92.910778]  __writeback_single_inode+0x2c/0x170
> [   92.912735]  writeback_sb_inodes+0x267/0x460
> [   92.914561]  __writeback_inodes_wb+0x82/0xb0
> [   92.916413]  wb_writeback+0x203/0x210
> [   92.918050]  ? cpumask_next+0x20/0x30
> [   92.919790]  ? wb_workfn+0x266/0x2e0
> [   92.921384]  wb_workfn+0x266/0x2e0
> [   92.922908]  process_one_work+0x253/0x460
> [   92.924687]  ? process_one_work+0x1f0/0x460
> [   92.926518]  worker_thread+0x42/0x3e0
> [   92.928077]  kthread+0xeb/0x120
> [   92.929512]  ? process_one_work+0x460/0x460
> [   92.931330]  ? kthread_create_worker_on_cpu+0x70/0x70
> [   92.933313]  ret_from_fork+0x3a/0x50
> ----------------------------------------
> 
> Normally shrinker_rwsem acts like a shared lock. But when
> register_shrinker()/unregister_shrinker() calls down_write(),
> shrinker_rwsem suddenly starts acting like an exclusive lock.
> 
> What is unfortunate is that down_write() is called independently of
> memory allocation requests. That is, shrinker_rwsem is essentially
> a mutex (hence the debug patch shown above).
> 
> ----------------------------------------
> [<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffffac1cb985>] register_shrinker+0x45/0xa0
> [<ffffffffac250f68>] sget_userns+0x468/0x4a0
> [<ffffffffac25106a>] mount_nodev+0x2a/0xa0
> [<ffffffffac251be4>] mount_fs+0x34/0x150
> [<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
> [<ffffffffac272a0e>] do_mount+0x1ee/0xc50
> [<ffffffffac27377e>] SyS_mount+0x7e/0xd0
> [<ffffffffac003831>] do_syscall_64+0x61/0x1a0
> [<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> ----------------------------------------
> 
> Therefore, I think that when do_shrink_slab() for GFP_KERNEL is in progress
> and down_read_trylock() starts failing because somebody else started waiting at
> down_write(), do_shrink_slab() for GFP_NOFS or GFP_NOIO cannot be called.
> Doesn't such a race cause unexpected results?
> 
> Michal Hocko wrote:
> > I would rather understand the problem than speculate here. I strongly
> > suspect somebody simply didn't unlock the page.
> 
> Then, can we please have a mechanism that tells us whether somebody
> else is stuck performing memory allocation requests? It is basically
> https://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

> [   92.642153] 
> [   92.643464] Killed process 6983 (a.out) total-vm:4264020kB, anon-rss:3348624kB, file-rss:4kB, shmem-rss:0kB
> [   92.644416] ======================================================
> [   92.644417] WARNING: possible circular locking dependency detected
> [   92.644418] 4.15.0-rc8-next-20180119+ #222 Not tainted
> [   92.644419] ------------------------------------------------------
> [   92.644419] kworker/u256:29/401 is trying to acquire lock:
> [   92.644420]  (shrink_slab){+.+.}, at: [<0000000040040aca>] shrink_slab.part.42+0x73/0x350
> [   92.644428] 
> [   92.644428] but task is already holding lock:
> [   92.665257]  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
> [   92.668490] 
> [   92.668490] which lock already depends on the new lock.
> [   92.668490] 
> [   92.672781] 
> [   92.672781] the existing dependency chain (in reverse order) is:
> [   92.676310] 
> [   92.676310] -> #1 (&xfs_nondir_ilock_class){++++}:
> [   92.679519]        xfs_free_eofblocks+0x9d/0x210 [xfs]
> [   92.681716]        xfs_fs_destroy_inode+0x9e/0x220 [xfs]
> [   92.683962]        dispose_list+0x30/0x40
> [   92.685822]        prune_icache_sb+0x4d/0x70
> [   92.687961]        super_cache_scan+0x136/0x180
> [   92.690017]        shrink_slab.part.42+0x205/0x350
> [   92.692109]        shrink_node+0x313/0x320
> [   92.694177]        kswapd+0x386/0x6d0
> [   92.695951]        kthread+0xeb/0x120
> [   92.697889]        ret_from_fork+0x3a/0x50
> [   92.699800] 
> [   92.699800] -> #0 (shrink_slab){+.+.}:
> [   92.702676]        shrink_slab.part.42+0x93/0x350
> [   92.704756]        shrink_node+0x313/0x320
> [   92.706660]        do_try_to_free_pages+0xde/0x350
> [   92.708737]        try_to_free_pages+0xc5/0x100
> [   92.710734]        __alloc_pages_slowpath+0x41c/0xd60
> [   92.712470] oom_reaper: reaped process 6983 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [   92.712978]        __alloc_pages_nodemask+0x22a/0x270
> [   92.713013]        xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
> [   92.721378]        xfs_buf_get_map+0xaf/0x140 [xfs]
> [   92.723562]        xfs_buf_read_map+0x1f/0xc0 [xfs]
> [   92.726105]        xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
> [   92.728461]        xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
> [   92.731321]        xfs_btree_lookup_get_block+0x82/0x180 [xfs]
> [   92.733739]        xfs_btree_lookup+0x118/0x450 [xfs]
> [   92.735982]        xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
> [   92.738380]        xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
> [   92.740646]        xfs_alloc_vextent+0x416/0x480 [xfs]
> [   92.743023]        xfs_bmap_btalloc+0x340/0x8b0 [xfs]
> [   92.745597]        xfs_bmapi_write+0x6c1/0x1270 [xfs]
> [   92.747749]        xfs_iomap_write_allocate+0x16c/0x360 [xfs]
> [   92.750317]        xfs_map_blocks+0x175/0x230 [xfs]
> [   92.752745]        xfs_do_writepage+0x232/0x6e0 [xfs]
> [   92.754843]        write_cache_pages+0x1d1/0x3b0
> [   92.756801]        xfs_vm_writepages+0x60/0xa0 [xfs]
> [   92.758838]        do_writepages+0x12/0x60
> [   92.760822]        __writeback_single_inode+0x2c/0x170
> [   92.762895]        writeback_sb_inodes+0x267/0x460
> [   92.764851]        __writeback_inodes_wb+0x82/0xb0
> [   92.766821]        wb_writeback+0x203/0x210
> [   92.768676]        wb_workfn+0x266/0x2e0
> [   92.770494]        process_one_work+0x253/0x460
> [   92.772378]        worker_thread+0x42/0x3e0
> [   92.774153]        kthread+0xeb/0x120
> [   92.775775]        ret_from_fork+0x3a/0x50
> [   92.777513] 
> [   92.777513] other info that might help us debug this:
> [   92.777513] 
> [   92.781361]  Possible unsafe locking scenario:
> [   92.781361] 
> [   92.784382]        CPU0                    CPU1
> [   92.786276]        ----                    ----
> [   92.788130]   lock(&xfs_nondir_ilock_class);
> [   92.790048]                                lock(shrink_slab);
> [   92.792256]                                lock(&xfs_nondir_ilock_class);
> [   92.794756]   lock(shrink_slab);
> [   92.796251] 
> [   92.796251]  *** DEADLOCK ***
> [   92.796251] 
> [   92.799521] 6 locks held by kworker/u256:29/401:
> [   92.801573]  #0:  ((wq_completion)"writeback"){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
> [   92.804947]  #1:  ((work_completion)(&(&wb->dwork)->work)){+.+.}, at: [<0000000087382bbf>] process_one_work+0x1f0/0x460
> [   92.808596]  #2:  (&type->s_umount_key#31){++++}, at: [<0000000048ea98d7>] trylock_super+0x11/0x50
> [   92.811957]  #3:  (sb_internal){.+.+}, at: [<0000000058532c48>] xfs_trans_alloc+0xe4/0x120 [xfs]
> [   92.815280]  #4:  (&xfs_nondir_ilock_class){++++}, at: [<00000000ae515ec8>] xfs_ilock+0xa3/0x180 [xfs]
> [   92.819075]  #5:  (shrinker_rwsem){++++}, at: [<0000000039dd500e>] shrink_slab.part.42+0x3c/0x350
> [   92.822354] 
> [   92.822354] stack backtrace:
> [   92.824820] CPU: 1 PID: 401 Comm: kworker/u256:29 Not tainted 4.15.0-rc8-next-20180119+ #222
> [   92.827894] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
> [   92.831609] Workqueue: writeback wb_workfn (flush-8:0)
> [   92.833759] Call Trace:
> [   92.835144]  dump_stack+0x7d/0xb6
> [   92.836761]  print_circular_bug.isra.37+0x1d7/0x1e4
> [   92.838908]  __lock_acquire+0x10da/0x15b0
> [   92.840744]  ? __lock_acquire+0x390/0x15b0
> [   92.842603]  ? lock_acquire+0x51/0x70
> [   92.844308]  lock_acquire+0x51/0x70
> [   92.845980]  ? shrink_slab.part.42+0x73/0x350
> [   92.847896]  shrink_slab.part.42+0x93/0x350
> [   92.849799]  ? shrink_slab.part.42+0x73/0x350
> [   92.851710]  ? mem_cgroup_iter+0x140/0x530
> [   92.853890]  ? mem_cgroup_iter+0x158/0x530
> [   92.855897]  shrink_node+0x313/0x320
> [   92.857609]  do_try_to_free_pages+0xde/0x350
> [   92.859502]  try_to_free_pages+0xc5/0x100
> [   92.861328]  __alloc_pages_slowpath+0x41c/0xd60
> [   92.863298]  __alloc_pages_nodemask+0x22a/0x270
> [   92.865285]  xfs_buf_allocate_memory+0x16b/0x2d0 [xfs]
> [   92.867621]  xfs_buf_get_map+0xaf/0x140 [xfs]
> [   92.869562]  xfs_buf_read_map+0x1f/0xc0 [xfs]
> [   92.871494]  xfs_trans_read_buf_map+0xf5/0x2d0 [xfs]
> [   92.873589]  xfs_btree_read_buf_block.constprop.36+0x69/0xc0 [xfs]
> [   92.876322]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
> [   92.878320]  xfs_btree_lookup_get_block+0x82/0x180 [xfs]
> [   92.880527]  xfs_btree_lookup+0x118/0x450 [xfs]
> [   92.882528]  ? kmem_zone_alloc+0x7e/0x100 [xfs]
> [   92.884511]  xfs_alloc_ag_vextent_near+0xb2/0xb80 [xfs]
> [   92.886974]  xfs_alloc_ag_vextent+0x1cc/0x320 [xfs]
> [   92.889088]  xfs_alloc_vextent+0x416/0x480 [xfs]
> [   92.891098]  xfs_bmap_btalloc+0x340/0x8b0 [xfs]
> [   92.893087]  xfs_bmapi_write+0x6c1/0x1270 [xfs]
> [   92.895085]  xfs_iomap_write_allocate+0x16c/0x360 [xfs]
> [   92.897277]  xfs_map_blocks+0x175/0x230 [xfs]
> [   92.899228]  xfs_do_writepage+0x232/0x6e0 [xfs]
> [   92.901218]  write_cache_pages+0x1d1/0x3b0
> [   92.903102]  ? xfs_add_to_ioend+0x290/0x290 [xfs]
> [   92.905170]  ? xfs_vm_writepages+0x4b/0xa0 [xfs]
> [   92.907182]  xfs_vm_writepages+0x60/0xa0 [xfs]
> [   92.909114]  do_writepages+0x12/0x60
> [   92.910778]  __writeback_single_inode+0x2c/0x170
> [   92.912735]  writeback_sb_inodes+0x267/0x460
> [   92.914561]  __writeback_inodes_wb+0x82/0xb0
> [   92.916413]  wb_writeback+0x203/0x210
> [   92.918050]  ? cpumask_next+0x20/0x30
> [   92.919790]  ? wb_workfn+0x266/0x2e0
> [   92.921384]  wb_workfn+0x266/0x2e0
> [   92.922908]  process_one_work+0x253/0x460
> [   92.924687]  ? process_one_work+0x1f0/0x460
> [   92.926518]  worker_thread+0x42/0x3e0
> [   92.928077]  kthread+0xeb/0x120
> [   92.929512]  ? process_one_work+0x460/0x460
> [   92.931330]  ? kthread_create_worker_on_cpu+0x70/0x70
> [   92.933313]  ret_from_fork+0x3a/0x50
> ----------------------------------------
> 
> Normally shrinker_rwsem acts like a shared lock. But once
> register_shrinker()/unregister_shrinker() calls down_write(),
> shrinker_rwsem suddenly starts acting like an exclusive lock.
> 
> What is unfortunate is that down_write() is called independently of
> memory allocation requests. That is, shrinker_rwsem is essentially
> a mutex (and hence the debug patch shown above).
> 
> ----------------------------------------
> [<ffffffffac7538d3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffffac1cb985>] register_shrinker+0x45/0xa0
> [<ffffffffac250f68>] sget_userns+0x468/0x4a0
> [<ffffffffac25106a>] mount_nodev+0x2a/0xa0
> [<ffffffffac251be4>] mount_fs+0x34/0x150
> [<ffffffffac2701f2>] vfs_kern_mount+0x62/0x120
> [<ffffffffac272a0e>] do_mount+0x1ee/0xc50
> [<ffffffffac27377e>] SyS_mount+0x7e/0xd0
> [<ffffffffac003831>] do_syscall_64+0x61/0x1a0
> [<ffffffffac80012c>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> ----------------------------------------
> 
> Therefore, I think that when do_shrink_slab() for GFP_KERNEL is in progress
> and down_read_trylock() starts failing because somebody else started waiting at
> down_write(), do_shrink_slab() for GFP_NOFS or GFP_NOIO cannot be called.
> Doesn't such a race cause unexpected results?
> 
> Michal Hocko wrote:
> > I would rather understand the problem than speculate here. I strongly
> > suspect somebody simply didn't unlock the page.
> 
> Then, can we please please have a mechanism which tells us whether somebody
> else is stuck performing memory allocation requests? It is basically
> https://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> 
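
For reference, the approach proposed by this patch series replaces the rwsem
with a per-shrinker active count plus a polling unregister loop (see the
patch description: atomic_inc()/atomic_dec() around each do_shrink_slab()
call, and unregister_shrinker() polling with a short sleep rather than
wait_on_atomic_t()). Below is a rough, hypothetical userspace sketch of that
scheme — all names are invented and it is much simplified (the real patch
relies on RCU for lockless list traversal, and on real atomic operations):

```python
import threading
import time

shrinker_list = []             # in the kernel: an RCU-protected list
list_lock = threading.Lock()   # stand-in for the list-update spinlock

class Shrinker:
    def __init__(self, scan):
        self.scan = scan
        self.nr_active = 0     # in the kernel: atomic_t nr_active

def register_shrinker(s):
    # List insertion needs only the short list lock, never a global rwsem.
    with list_lock:
        shrinker_list.append(s)

def shrink_slab():
    freed = 0
    with list_lock:            # the kernel walks the list under RCU instead
        active = list(shrinker_list)
    for s in active:
        s.nr_active += 1       # atomic_inc() before calling the shrinker
        try:
            freed += s.scan()  # may sleep; no global lock is held here
        finally:
            s.nr_active -= 1   # plain atomic_dec() afterwards
    return freed

def unregister_shrinker(s):
    with list_lock:
        shrinker_list.remove(s)
    while s.nr_active:         # poll with a short sleep until all in-flight
        time.sleep(0.001)      # do_shrink_slab() calls on s have finished

s = Shrinker(lambda: 42)
register_shrinker(s)
print(shrink_slab())           # prints 42: the shrinker's scan ran
unregister_shrinker(s)
print(shrink_slab())           # prints 0: the shrinker is gone
```

This mirrors the trade-off stated in the patch changelog: readers pay only a
plain increment/decrement per do_shrink_slab() call, while the (rare)
unregister path absorbs the cost of waiting.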

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-25 22:19               ` Eric Wheeler
@ 2018-01-26  3:12                 ` Tetsuo Handa
  -1 siblings, 0 replies; 74+ messages in thread
From: Tetsuo Handa @ 2018-01-26  3:12 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Tetsuo Handa, mhocko, hannes, minchan, ying.huang, mgorman,
	vdavydov.dev, akpm, shakeelb, gthelen, linux-mm, linux-kernel

Eric Wheeler wrote:
> Hi Tetsuo,
> 
> Thank you for looking into this!
> 
> I tried running this C program in 4.14.15 but did not get a deadlock, just 
> OOM kills. Is the patch required to induce the deadlock?

This reproducer is not expected to trigger an actual deadlock. Running the
reproducer with the debug patch applied merely produces the lockdep warning.
I was only trying to point out the possibility that shrink_slab() suddenly
becoming a no-op might cause unexpected results. We still don't know what is
happening in your case.
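
To illustrate the no-op behavior in userspace (a hypothetical sketch with
invented names, not kernel code): shrink_slab() enters its critical section
with down_read_trylock(&shrinker_rwsem), so while a register_shrinker() /
unregister_shrinker() caller holds the write side, every concurrent
shrink_slab() silently reclaims nothing.

```python
import threading

class RwSem:
    """Minimal writer-preferring rwsem; readers back off via trylock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._readers = 0
        self._writer = False
        self._writers_waiting = 0

    def down_read_trylock(self):
        with self._lock:
            # Fail if a writer holds the lock or is queued, mimicking how
            # down_read_trylock() starts failing once a writer is waiting.
            if self._writer or self._writers_waiting:
                return False
            self._readers += 1
            return True

    def up_read(self):
        with self._lock:
            self._readers -= 1

    def down_write(self):
        with self._lock:
            self._writers_waiting += 1
        while True:                     # simplistic spin; fine for a sketch
            with self._lock:
                if not self._writer and self._readers == 0:
                    self._writers_waiting -= 1
                    self._writer = True
                    return

    def up_write(self):
        with self._lock:
            self._writer = False

shrinker_rwsem = RwSem()

def shrink_slab():
    if not shrinker_rwsem.down_read_trylock():
        return "skipped"                # reclaim silently does nothing
    try:
        return "reclaimed"
    finally:
        shrinker_rwsem.up_read()

print(shrink_slab())                    # prints "reclaimed": no writer
shrinker_rwsem.down_write()             # register_shrinker() in progress
print(shrink_slab())                    # prints "skipped": reclaim is a no-op
shrinker_rwsem.up_write()
```

Whether silently skipping slab reclaim in that window is safe is exactly the
question raised above.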

> 
> Also, what are you doing to XFS to make it trigger?

Nothing.



Would you answer Michal's questions

  Is this a permanent state or does the holder eventually release the lock?

  Do you remember the last good kernel?

and respond to my guess

  Since commit 0bcac06f27d75285 was not backported to the 4.14-stable kernel,
  this is unlikely to be the bug introduced by 0bcac06f27d75285, unless Eric
  explicitly backported it.

?

Can you take SysRq-t (e.g. "echo t > /proc/sysrq-trigger") when processes
get stuck? I think we need to know what the other threads are doing while
__lock_page() is waiting, in order to distinguish "somebody forgot to unlock
the page" from "somebody is still doing something (e.g. waiting for memory
allocation) in order to unlock the page".

If you can take SysRq-t, taking it with
http://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
applied and built with CONFIG_DEBUG_SHOW_MEMALLOC_LINE=y should give us
more clues (e.g. how long threads have been waiting for memory allocation).



* Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
  2018-01-26  3:12                 ` Tetsuo Handa
@ 2018-01-26 10:08                   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2018-01-26 10:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Eric Wheeler, hannes, minchan, ying.huang, mgorman, vdavydov.dev,
	akpm, shakeelb, gthelen, linux-mm, linux-kernel

On Fri 26-01-18 12:12:00, Tetsuo Handa wrote:
> Would you answer to Michal's questions
> 
>   Is this a permanent state or does the holder eventually releases the lock?
> 
>   Do you remember the last good kernel?
> 
> and my guess
> 
>   Since commit 0bcac06f27d75285 was not backported to 4.14-stable kernel,
>   this is unlikely the bug introduced by 0bcac06f27d75285 unless Eric
>   explicitly backported 0bcac06f27d75285.

Can we do that in the original email thread, please? Conflating these two
things while we have no idea about the culprit is just a mess.

-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2018-01-26 10:08 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-13 21:37 [PATCH 1/2] mm,vmscan: Kill global shrinker lock Tetsuo Handa
2017-11-13 21:37 ` Tetsuo Handa
2017-11-13 21:37 ` [PATCH 2/2] mm,vmscan: Allow parallel registration/unregistration of shrinkers Tetsuo Handa
2017-11-13 21:37   ` Tetsuo Handa
2017-11-13 22:05 ` [PATCH 1/2] mm,vmscan: Kill global shrinker lock Shakeel Butt
2017-11-13 22:05   ` Shakeel Butt
2017-11-15  0:56 ` Minchan Kim
2017-11-15  0:56   ` Minchan Kim
2017-11-15  6:28   ` Shakeel Butt
2017-11-15  6:28     ` Shakeel Butt
2017-11-16  0:46     ` Minchan Kim
2017-11-16  0:46       ` Minchan Kim
2017-11-16  1:41       ` Shakeel Butt
2017-11-16  1:41         ` Shakeel Butt
2017-11-16  4:50         ` Minchan Kim
2017-11-16  4:50           ` Minchan Kim
2017-11-15  8:56   ` Michal Hocko
2017-11-15  8:56     ` Michal Hocko
2017-11-15  9:18     ` Michal Hocko
2017-11-15  9:18       ` Michal Hocko
2017-11-16 17:44   ` Johannes Weiner
2017-11-16 17:44     ` Johannes Weiner
2017-11-23 23:46     ` Minchan Kim
2017-11-23 23:46       ` Minchan Kim
2017-11-15  9:02 ` Michal Hocko
2017-11-15  9:02   ` Michal Hocko
2017-11-15 10:58   ` Tetsuo Handa
2017-11-15 10:58     ` Tetsuo Handa
2017-11-15 11:51     ` Michal Hocko
2017-11-15 11:51       ` Michal Hocko
2017-11-16  0:56       ` Minchan Kim
2017-11-16  0:56         ` Minchan Kim
2017-11-15 13:28     ` Johannes Weiner
2017-11-15 13:28       ` Johannes Weiner
2017-11-16 10:56       ` Tetsuo Handa
2017-11-16 10:56         ` Tetsuo Handa
2017-11-15 14:00   ` Johannes Weiner
2017-11-15 14:00     ` Johannes Weiner
2017-11-15 14:11     ` Michal Hocko
2017-11-15 14:11       ` Michal Hocko
2018-01-25  2:04       ` Tetsuo Handa
2018-01-25  2:04         ` Tetsuo Handa
2018-01-25  8:36         ` Michal Hocko
2018-01-25  8:36           ` Michal Hocko
2018-01-25 10:56           ` Tetsuo Handa
2018-01-25 10:56             ` Tetsuo Handa
2018-01-25 11:41             ` Michal Hocko
2018-01-25 11:41               ` Michal Hocko
2018-01-25 22:19             ` Eric Wheeler
2018-01-25 22:19               ` Eric Wheeler
2018-01-26  3:12               ` Tetsuo Handa
2018-01-26  3:12                 ` Tetsuo Handa
2018-01-26 10:08                 ` Michal Hocko
2018-01-26 10:08                   ` Michal Hocko
2017-11-17 17:35 ` Christoph Hellwig
2017-11-17 17:35   ` Christoph Hellwig
2017-11-17 17:41   ` Shakeel Butt
2017-11-17 17:41     ` Shakeel Butt
2017-11-17 17:53     ` Shakeel Butt
2017-11-17 17:53       ` Shakeel Butt
2017-11-17 18:36     ` Christoph Hellwig
2017-11-17 18:36       ` Christoph Hellwig
2017-11-20  9:25   ` Michal Hocko
2017-11-20  9:25     ` Michal Hocko
2017-11-20  9:33     ` Christoph Hellwig
2017-11-20  9:33       ` Christoph Hellwig
2017-11-20  9:42       ` Michal Hocko
2017-11-20  9:42         ` Michal Hocko
2017-11-20 10:41         ` Christoph Hellwig
2017-11-20 10:41           ` Christoph Hellwig
2017-11-20 10:56           ` Tetsuo Handa
2017-11-20 10:56             ` Tetsuo Handa
2017-11-20 18:28             ` Paul E. McKenney
2017-11-20 18:28               ` Paul E. McKenney
