* [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
@ 2021-03-11  2:13 Mike Kravetz
  2021-03-11  5:43 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-03-11  2:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Paul E . McKenney, Shakeel Butt, tglx, john.ogness,
	urezki, ast, Eric Dumazet, Mina Almasry, peterz, Andrew Morton,
	Mike Kravetz

put_page does not correctly handle all calling contexts for hugetlb
pages.  This was recently discussed in the threads [1] and [2].

free_huge_page is the routine called for the final put_page of hugetlb
pages.  Since at least the beginning of git history, free_huge_page has
acquired the hugetlb_lock to move the page to a free list and possibly
perform other processing. When this code was originally written, the
hugetlb_lock should have been made irq safe.

For many years, nobody noticed this situation until lockdep code caught
free_huge_page being called from irq context.  By this time, another
lock (hugetlb subpool) was also taken in the free_huge_page path.  In
addition, hugetlb cgroup code had been added which could hold
hugetlb_lock for a considerable period of time.  Because of this, commit
c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
context") was added to address the issue of free_huge_page being called
from irq context.  That commit hands off free_huge_page processing to a
workqueue if !in_task.

The !in_task check handles the case of being called from irq context.
However, it does not take into account the case when called with irqs
disabled as in [1].

To complicate matters, functionality has been added to hugetlb
such that free_huge_page may block/sleep in certain situations.  The
hugetlb_lock is of course dropped before potentially blocking.

One way to handle all calling contexts is to have free_huge_page always
send pages to the workqueue for processing.  This idea was briefly
discussed here [3], but has some undesirable side effects.

Ideally, the hugetlb_lock should have been irq safe from the beginning
and any code added to the free_huge_page path should have taken this
into account.  However, this has not happened.  The code today does have
the ability to hand off requests to a workqueue.  It does this for calls
from irq context.  Changing the check in the code from !in_task to
in_atomic would handle the situations when called with irqs disabled.
However, it does not handle the case when called with a spinlock
held.  This is needed because the code could block/sleep.

Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
used to detect all atomic contexts where sleeping is not possible.
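
For illustration only (a sketch, not part of the patch): without
CONFIG_PREEMPT_COUNT, spin_lock() does not bump preempt_count, so a
section like the following is invisible to in_atomic():

  spin_lock(&some_lock);          /* some_lock is a made-up example lock */
  WARN_ON(!in_atomic());          /* fires when PREEMPT_COUNT is disabled */
  spin_unlock(&some_lock);

With PREEMPT_COUNT selected, preempt_count() reflects the spinlock
nesting and in_atomic() returns true in such a section.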

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
[3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/Kconfig   |  1 +
 mm/hugetlb.c | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 462253ae483a..403d7a7a619a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -235,6 +235,7 @@ config HUGETLBFS
 
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
+	select PREEMPT_COUNT
 
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 33b0d8778551..5407e77ca803 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1437,9 +1437,9 @@ static void __free_huge_page(struct page *page)
 }
 
 /*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
+ * If free_huge_page() is called from an atomic context, we have to defer
+ * the actual freeing in a workqueue.  This is to prevent possible sleeping
+ * while in atomic and potential hugetlb_lock deadlock.
  *
  * free_hpage_workfn() locklessly retrieves the linked list of pages to
  * be freed and frees them one-by-one. As the page->mapping pointer is
@@ -1467,9 +1467,9 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
 void free_huge_page(struct page *page)
 {
 	/*
-	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+	 * Defer freeing if in atomic context and sleeping is not allowed
 	 */
-	if (!in_task()) {
+	if (in_atomic()) {
 		/*
 		 * Only call schedule_work() if hpage_freelist is previously
 		 * empty. Otherwise, schedule_work() had been called but the
-- 
2.29.2




* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  2:13 [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use Mike Kravetz
@ 2021-03-11  5:43 ` Andrew Morton
  2021-03-11  8:20   ` Michal Hocko
  2021-03-11  8:26 ` Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2021-03-11  5:43 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, peterz

On Wed, 10 Mar 2021 18:13:21 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> put_page does not correctly handle all calling contexts for hugetlb
> pages.  This was recently discussed in the threads [1] and [2].
> 
> free_huge_page is the routine called for the final put_page of hugetlb
> pages.  Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
> 
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context.  By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path.  In
> addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time.  Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context.  That commit hands off free_huge_page processing to a
> workqueue if !in_task.
> 
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
> 
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations.  The
> hugetlb_lock is of course dropped before potentially blocking.
> 
> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing.  This idea was briefly
> discussed here [3], but has some undesirable side effects.
> 
> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account.  However, this has not happened.  The code today does have
> the ability to hand off requests to a workqueue.  It does this for calls
> from irq context.  Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held.  This is needed because the code could block/sleep.
> 
> Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> used to detect all atomic contexts where sleeping is not possible.
> 
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
> 
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -235,6 +235,7 @@ config HUGETLBFS
>  
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
> +	select PREEMPT_COUNT
>  

Well this is unfortunate.  hugetlb is forcing PREEMPT_COUNT because we
screwed things up.

Did we consider changing the networking code to call a new
free_huge_tlb_from_irq()?  So the callee doesn't need to guess.

Or something else?

Is anyone looking into fixing this for real?




* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  5:43 ` Andrew Morton
@ 2021-03-11  8:20   ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2021-03-11  8:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, peterz

On Wed 10-03-21 21:43:16, Andrew Morton wrote:
> On Wed, 10 Mar 2021 18:13:21 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > put_page does not correctly handle all calling contexts for hugetlb
> > pages.  This was recently discussed in the threads [1] and [2].
> > 
> > free_huge_page is the routine called for the final put_page of hugetlb
> > pages.  Since at least the beginning of git history, free_huge_page has
> > acquired the hugetlb_lock to move the page to a free list and possibly
> > perform other processing. When this code was originally written, the
> > hugetlb_lock should have been made irq safe.
> > 
> > For many years, nobody noticed this situation until lockdep code caught
> > free_huge_page being called from irq context.  By this time, another
> > lock (hugetlb subpool) was also taken in the free_huge_page path.  In
> > addition, hugetlb cgroup code had been added which could hold
> > hugetlb_lock for a considerable period of time.  Because of this, commit
> > c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > context") was added to address the issue of free_huge_page being called
> > from irq context.  That commit hands off free_huge_page processing to a
> > workqueue if !in_task.
> > 
> > The !in_task check handles the case of being called from irq context.
> > However, it does not take into account the case when called with irqs
> > disabled as in [1].
> > 
> > To complicate matters, functionality has been added to hugetlb
> > such that free_huge_page may block/sleep in certain situations.  The
> > hugetlb_lock is of course dropped before potentially blocking.
> > 
> > One way to handle all calling contexts is to have free_huge_page always
> > send pages to the workqueue for processing.  This idea was briefly
> > discussed here [3], but has some undesirable side effects.
> > 
> > Ideally, the hugetlb_lock should have been irq safe from the beginning
> > and any code added to the free_huge_page path should have taken this
> > into account.  However, this has not happened.  The code today does have
> > the ability to hand off requests to a workqueue.  It does this for calls
> > from irq context.  Changing the check in the code from !in_task to
> > in_atomic would handle the situations when called with irqs disabled.
> > However, it does not handle the case when called with a spinlock
> > held.  This is needed because the code could block/sleep.
> > 
> > Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> > used to detect all atomic contexts where sleeping is not possible.
> > 
> > [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> > [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> > [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
> > 
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -235,6 +235,7 @@ config HUGETLBFS
> >  
> >  config HUGETLB_PAGE
> >  	def_bool HUGETLBFS
> > +	select PREEMPT_COUNT
> >  
> 
> Well this is unfortunate.  hugetlb is forcing PREEMPT_COUNT because we
> screwed things up.

Yes, this is far from ideal but we have tried to explore other ways, all
looking much more complex. [1] shows that this is a problem already and
needs a reasonable fix to be backported for older kernels.

> Did we consider changing the networking code to call a new
> free_huge_tlb_from_irq()?  So the callee doesn't need to guess.

I do not think we want to pollute networking or any other code that
simply wants to put_page with a hugetlb specific knowledge.

> Or something else?
> 
> Is anyone looking into fixing this for real?

Mike said he would be looking into making hugetlb_lock irq safe, but
there is a non-trivial way to get there and it would not be a great
candidate for backporting.

Btw. RCU already wants to have a reliable in_atomic as well and that
effectively means enabling PREEMPT_COUNT for everybody. The overhead of
the per-cpu preempt counter should be pretty much invisible AFAIK.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  2:13 [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use Mike Kravetz
  2021-03-11  5:43 ` Andrew Morton
@ 2021-03-11  8:26 ` Michal Hocko
  2021-03-11  8:27   ` Michal Hocko
  2021-03-11  8:46 ` Peter Zijlstra
  2021-03-11  9:49 ` Peter Zijlstra
  3 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-03-11  8:26 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Paul E . McKenney, Shakeel Butt, tglx,
	john.ogness, urezki, ast, Eric Dumazet, Mina Almasry, peterz,
	Andrew Morton

On Wed 10-03-21 18:13:21, Mike Kravetz wrote:
> put_page does not correctly handle all calling contexts for hugetlb
> pages.  This was recently discussed in the threads [1] and [2].
> 
> free_huge_page is the routine called for the final put_page of hugetlb
> pages.  Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
> 
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context.  By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path.  In
> addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time.  Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context.  That commit hands off free_huge_page processing to a
> workqueue if !in_task.
> 
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
> 
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations.  The
> hugetlb_lock is of course dropped before potentially blocking.
> 
> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing.  This idea was briefly
> discussed here [3], but has some undesirable side effects.

s@undesirable side effects@undesirable user visible side effects@

> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account.  However, this has not happened.  The code today does have
> the ability to hand off requests to a workqueue.  It does this for calls
> from irq context.  Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held.  This is needed because the code could block/sleep.
> 
> Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> used to detect all atomic contexts where sleeping is not possible.
> 
> [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

While not an ideal solution I believe this is the most straightforward
one wrt backporting to older kernels which are affected. I have a
hope that a preemption model independent in_atomic() is going to become
the norm. RCU is very much interested in the same thing as well. Now we
have two core kernel users requiring this so hopefully this will make
the case stronger.

That being said
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/Kconfig   |  1 +
>  mm/hugetlb.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 462253ae483a..403d7a7a619a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -235,6 +235,7 @@ config HUGETLBFS
>  
>  config HUGETLB_PAGE
>  	def_bool HUGETLBFS
> +	select PREEMPT_COUNT
>  
>  config MEMFD_CREATE
>  	def_bool TMPFS || HUGETLBFS
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 33b0d8778551..5407e77ca803 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1437,9 +1437,9 @@ static void __free_huge_page(struct page *page)
>  }
>  
>  /*
> - * As free_huge_page() can be called from a non-task context, we have
> - * to defer the actual freeing in a workqueue to prevent potential
> - * hugetlb_lock deadlock.
> + * If free_huge_page() is called from an atomic context, we have to defer
> + * the actual freeing in a workqueue.  This is to prevent possible sleeping
> + * while in atomic and potential hugetlb_lock deadlock.
>   *
>   * free_hpage_workfn() locklessly retrieves the linked list of pages to
>   * be freed and frees them one-by-one. As the page->mapping pointer is
> @@ -1467,9 +1467,9 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
>  void free_huge_page(struct page *page)
>  {
>  	/*
> -	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> +	 * Defer freeing if in atomic context and sleeping is not allowed
>  	 */
> -	if (!in_task()) {
> +	if (in_atomic()) {
>  		/*
>  		 * Only call schedule_work() if hpage_freelist is previously
>  		 * empty. Otherwise, schedule_work() had been called but the
> -- 
> 2.29.2
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  8:26 ` Michal Hocko
@ 2021-03-11  8:27   ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2021-03-11  8:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Paul E . McKenney, Shakeel Butt, tglx,
	john.ogness, urezki, ast, Eric Dumazet, Mina Almasry, peterz,
	Andrew Morton

On Thu 11-03-21 09:26:03, Michal Hocko wrote:
> On Wed 10-03-21 18:13:21, Mike Kravetz wrote:
> > put_page does not correctly handle all calling contexts for hugetlb
> > pages.  This was recently discussed in the threads [1] and [2].
> > 
> > free_huge_page is the routine called for the final put_page of hugetlb
> > pages.  Since at least the beginning of git history, free_huge_page has
> > acquired the hugetlb_lock to move the page to a free list and possibly
> > perform other processing. When this code was originally written, the
> > hugetlb_lock should have been made irq safe.
> > 
> > For many years, nobody noticed this situation until lockdep code caught
> > free_huge_page being called from irq context.  By this time, another
> > lock (hugetlb subpool) was also taken in the free_huge_page path.  In
> > addition, hugetlb cgroup code had been added which could hold
> > hugetlb_lock for a considerable period of time.  Because of this, commit
> > c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > context") was added to address the issue of free_huge_page being called
> > from irq context.  That commit hands off free_huge_page processing to a
> > workqueue if !in_task.
> > 
> > The !in_task check handles the case of being called from irq context.
> > However, it does not take into account the case when called with irqs
> > disabled as in [1].
> > 
> > To complicate matters, functionality has been added to hugetlb
> > such that free_huge_page may block/sleep in certain situations.  The
> > hugetlb_lock is of course dropped before potentially blocking.
> > 
> > One way to handle all calling contexts is to have free_huge_page always
> > send pages to the workqueue for processing.  This idea was briefly
> > discussed here [3], but has some undesirable side effects.
> 
> s@undesirable side effects@undesirable user visible side effects@
> 
> > Ideally, the hugetlb_lock should have been irq safe from the beginning
> > and any code added to the free_huge_page path should have taken this
> > into account.  However, this has not happened.  The code today does have
> > the ability to hand off requests to a workqueue.  It does this for calls
> > from irq context.  Changing the check in the code from !in_task to
> > in_atomic would handle the situations when called with irqs disabled.
> > However, it does not handle the case when called with a spinlock
> > held.  This is needed because the code could block/sleep.
> > 
> > Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> > used to detect all atomic contexts where sleeping is not possible.
> > 
> > [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
> > [2] https://lore.kernel.org/linux-mm/YEjji9oAwHuZaZEt@dhcp22.suse.cz/
> > [3] https://lore.kernel.org/linux-mm/YDzaAWK41K4gD35V@dhcp22.suse.cz/
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> While not an ideal solution I believe this is the most straightforward
> one wrt backporting to older kernels which are affected. I have a
> hope that a preemption model independent in_atomic() is going to become
> the norm. RCU is very much interested in the same thing as well. Now we
> have two core kernel users requiring this so hopefully this will make
> the case stronger.
> 
> That being said
> Acked-by: Michal Hocko <mhocko@suse.com>

Btw. we very likely want
Cc: stable

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  2:13 [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use Mike Kravetz
  2021-03-11  5:43 ` Andrew Morton
  2021-03-11  8:26 ` Michal Hocko
@ 2021-03-11  8:46 ` Peter Zijlstra
  2021-03-11  9:01   ` Michal Hocko
  2021-03-11  9:49 ` Peter Zijlstra
  3 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-03-11  8:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> from irq context.  Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.

It does not. local_irq_disable() does not change preempt_count().
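
A minimal sketch of the gap (irqs disabled, yet in_atomic() sees nothing):

  local_irq_disable();
  /* even with CONFIG_PREEMPT_COUNT=y this prints "0 1" */
  pr_info("%d %d\n", in_atomic(), irqs_disabled());
  local_irq_enable();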



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  8:46 ` Peter Zijlstra
@ 2021-03-11  9:01   ` Michal Hocko
  2021-03-11  9:32     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-03-11  9:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu 11-03-21 09:46:30, Peter Zijlstra wrote:
> On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> > from irq context.  Changing the check in the code from !in_task to
> > in_atomic would handle the situations when called with irqs disabled.
> 
> It does not. local_irq_disable() does not change preempt_count().

You are right. Earlier I was suggesting to check for irq_disabled() as
well http://lkml.kernel.org/r/YD4I+VPr3UNt063H@dhcp22.suse.cz

back then it was not really clear to me that in fact we do care about
spin locks more than irq disabled code. I am not even sure whether we
need to care about irq disabled regions without any locks held that
wouldn't be covered by in_atomic. But it would be safer to add
irq_disabled check as well.
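
As a sketch, that combined check wired into the existing deferral
machinery would look like this (names as in current mm/hugetlb.c):

  void free_huge_page(struct page *page)
  {
          if (in_atomic() || irqs_disabled()) {
                  /* hand off to free_hpage_workfn() via the lockless list */
                  if (llist_add((struct llist_node *)&page->mapping,
                                &hpage_freelist))
                          schedule_work(&free_hpage_work);
                  return;
          }

          __free_huge_page(page);
  }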

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  9:01   ` Michal Hocko
@ 2021-03-11  9:32     ` Peter Zijlstra
  2021-03-11  9:44       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-03-11  9:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu, Mar 11, 2021 at 10:01:22AM +0100, Michal Hocko wrote:
> On Thu 11-03-21 09:46:30, Peter Zijlstra wrote:
> > On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> > > from irq context.  Changing the check in the code from !in_task to
> > > in_atomic would handle the situations when called with irqs disabled.
> > 
> > It does not. local_irq_disable() does not change preempt_count().
> 
> You are right. Earlier I was suggesting to check for irq_disabled() as
> well http://lkml.kernel.org/r/YD4I+VPr3UNt063H@dhcp22.suse.cz
> 
> back then it was not really clear to me that in fact we do care about
> spin locks more than irq disabled code. I am not even sure whether we
> need to care about irq disabled regions without any locks held that
> wouldn't be covered by in_atomic. But it would be safer to add
> irq_disabled check as well.

Safer still is always doing it, replace it with if (true).

What's the purpose: doing the minimal 'correct' thing, or the maximal
safe solution?

The whole changelog reads like a trainwreck, but akpm already commented
on that. I picked out a small factual incorrectness, simply because if
you can't get that right, the whole argument loses weight.

That said, I don't think you actually need it, if as you write the lock
should be IRQ-safe, then you're worried about the IRQ recursion
deadlock:


  spin_lock(&A);
  <IRQ>
	spin_lock(&A);


And that obviously cannot happen when IRQs are disabled, even if you
were holding a lock.


Anyway, I think I support Andrew here, this needs proper fixing, not
bandaids.





* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  9:32     ` Peter Zijlstra
@ 2021-03-11  9:44       ` Michal Hocko
  2021-03-11  9:52         ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-03-11  9:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu 11-03-21 10:32:24, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 10:01:22AM +0100, Michal Hocko wrote:
> > On Thu 11-03-21 09:46:30, Peter Zijlstra wrote:
> > > On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> > > > from irq context.  Changing the check in the code from !in_task to
> > > > in_atomic would handle the situations when called with irqs disabled.
> > > 
> > > It does not. local_irq_disable() does not change preempt_count().
> > 
> > You are right. Earlier I was suggesting to check for irq_disabled() as
> > well http://lkml.kernel.org/r/YD4I+VPr3UNt063H@dhcp22.suse.cz
> > 
> > back then it was not really clear to me that in fact we do care about
> > spin locks more than irq disabled code. I am not even sure whether we
> > need to care about irq disabled regions without any locks held that
> > wouldn't be covered by in_atomic. But it would be safer to add
> > irq_disabled check as well.
> 
> Safer still is always doing it, replace it with if (true).
> 
> What's the purpose: doing the minimal 'correct' thing, or the maximal
> safe solution?

If we always defer to a WQ context then an admin wouldn't have any
feedback from the syscall when releasing the pool.

> The whole changelog reads like a trainwreck, but akpm already commented
> on that. I picked out a small factual incorrectness, simply because if
> you can't get that right, the whole argument loses weight.

Is there any reason why in_atomic || irq_disabled wouldn't work
universally?

> That said, I don't think you actually need it, if as you write the lock
> should be IRQ-safe, then you're worried about the IRQ recursion
> deadlock:

making hugetlb_lock irqsafe is a long way off, as explained by Mike
elsewhere. Not only that. The upcoming hugetlb feature to have sparse
vmemmap for hugetlb pages will need to allocate vmemmap when a hugetlb
page is to be freed back to the allocator. That cannot happen in any
atomic context, so there will be a need to tell those contexts apart for
special casing.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  2:13 [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-03-11  8:46 ` Peter Zijlstra
@ 2021-03-11  9:49 ` Peter Zijlstra
  3 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-03-11  9:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Wed, Mar 10, 2021 at 06:13:21PM -0800, Mike Kravetz wrote:
> put_page does not correctly handle all calling contexts for hugetlb
> pages.  This was recently discussed in the threads [1] and [2].
> 
> free_huge_page is the routine called for the final put_page of huegtlb
> pages.  Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
> 
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context.  By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path. 

AFAICT there's no actual problem with making spool->lock IRQ-safe too.
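
For reference, a sketch of what that conversion looks like at the put
path call sites (illustrative only, not a tested patch):

  unsigned long flags;

  spin_lock_irqsave(&spool->lock, flags);
  /* subpool accounting, as done in hugepage_subpool_put_pages() */
  spin_unlock_irqrestore(&spool->lock, flags);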

> In addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time. 

cgroups, always bloody cgroups. The scheduler (and a fair number of
other places) get to deal with cgroups with IRQs disabled, so I'm sure
this can too.

> Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context.  That commit hands off free_huge_page processing to a
> workqueue if !in_task.
> 
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
> 
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations.  The
> hugetlb_lock is of course dropped before potentially blocking.

AFAICT that's because of CMA, right? That's only hstate_is_gigantic() and
free_gigantic_page() that have that particular trainwreck.

So you could move the workqueue there, and leave all the other hugetlb
sizes unaffected. Afaict if you limit the workqueue crud to
cma_clear_bitmap(), you don't get your..

> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing.  This idea was briefly
> discussed here [3], but has some undesirable side effects.

... user visible side effects either.

> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account.  However, this has not happened.  The code today does have
> the ability to hand off requests to a workqueue.  It does this for calls
> from irq context.  Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held.  This is needed because the code could block/sleep.

I'll argue the current workqueue thing is in the wrong place to begin
with.

So how about you make hugetlb_lock and spool->lock IRQ-safe, move the
workqueue thingy into cma_release(), and then worry about optimizing the
cgroup crap?

Correctness first, performance second. Also, if you really care about
performance, not using cgroups is a very good option anyway.



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  9:44       ` Michal Hocko
@ 2021-03-11  9:52         ` Peter Zijlstra
  2021-03-11 11:09           ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-03-11  9:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu, Mar 11, 2021 at 10:44:56AM +0100, Michal Hocko wrote:
> On Thu 11-03-21 10:32:24, Peter Zijlstra wrote:
> > The whole changelog reads like a trainwreck, but akpm already commented
> > on that. I picked out a small factual incorrectness, simply because if
> > you can't get that right, the whole argument loses weight.
> 
> Is there any reason why in_atomic || irq_disabled wouldn't work
> universally?

I just explained to you how you really wanted:

  in_atomic() && !irq_disabled()

> > That said, I don't think you actually need it, if as you write the lock
> > should be IRQ-safe, then you're worried about the IRQ recursion
> > deadlock:
> 
> making hugetlb_lock irqsafe is a long way off, as explained by Mike
> elsewhere. Not only that. The upcoming hugetlb feature to have sparse
> vmemmap for hugetlb pages will need to allocate vmemmap when a hugetlb
> page is to be freed back to the allocator. That cannot happen in any
> atomic context, so there will be a need to tell those contexts apart for
> special casing.

Then scrap the vmemmap code *NOW*. Do not merge more shit before fixing
existing problems. Especially not if that's known to make it harder to
fix the problems.




* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11  9:52         ` Peter Zijlstra
@ 2021-03-11 11:09           ` Michal Hocko
  2021-03-11 11:36             ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-03-11 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu 11-03-21 10:52:50, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 10:44:56AM +0100, Michal Hocko wrote:
> > On Thu 11-03-21 10:32:24, Peter Zijlstra wrote:
> > > The whole changelog reads like a trainwreck, but akpm already commented
> > > on that. I picked out a small factual incorrectness, simply because if
> > > you can't get that right, the whole argument looses weight.
> > 
> > Is there any reason why in_atomic || irq_disabled wouldn't work
> > universally?
> 
> I just explained to you how you really wanted:
> 
>   in_atomic() && !irq_disabled()

Sorry for being dense but I do not follow. You have provided the
following example
  spin_lock(&A);
  <IRQ>
        spin_lock(&A);

if A == hugetlb_lock then we should never reenter with
free_huge_page
	if (in_atomic() || irq_disabled())
		schedule_in_wq();
	else
		free_directly()

because hugetlb_lock is never held in irq context other than from the
put_page (aka the above) path, which will explicitly defer the handling and
thus the lock to a different context. We need to check for irq_disabled
because of the sleeping operations in the freeing path. Or do I miss
something?

From the code simplicity POV (and hugetlb has grown a lot of complexity)
it would be easiest to make sure __free_huge_page is called
from a non-atomic process context. There are a few ways to do that
- defer each call to a WQ - user visible which sucks
- defer from atomic or otherwise non-sleeping contextx - requires
  reliable in_atomic AFAICS
- defer sleeping operations - makes the code flow more complex and it
  would be again user visible in some cases.

So I would say we are in "pick your own poison" kind of situation.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11 11:09           ` Michal Hocko
@ 2021-03-11 11:36             ` Peter Zijlstra
  2021-03-11 12:02               ` Michal Hocko
                                 ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-03-11 11:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:

> Sorry for being dense but I do not follow. You have provided the
> following example
>   spin_lock(&A);
>   <IRQ>
>         spin_lock(&A);
> 
> if A == hugetlb_lock then we should never reenter with
> free_huge_page

What I'm saying is that if irq_disabled(), then that interrupt cannot
happen, so the second spin_lock cannot happen, so the deadlock cannot
happen.

So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
recursion deadlock.
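
Sketched against the existing free_huge_page() deferral, that reads
(illustrative only; defer_to_free_hpage_workfn() is a hypothetical name
for the existing llist/schedule_work handoff to free_hpage_workfn()):

  /* defer only where the IRQ-recursion deadlock is actually possible */
  if (!irqs_disabled() && in_atomic())
          defer_to_free_hpage_workfn(page);
  else
          __free_huge_page(page);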

Also, Linus hates constructs like this:

  https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

> From the code simplicity POV (and hugetlb has grown a lot of complexity)
> it would be easiest to make sure __free_huge_page is called
> from a non-atomic process context. There are a few ways to do that
> - defer each call to a WQ - user visible which sucks
> - defer from atomic or otherwise non-sleeping contexts - requires
>   reliable in_atomic AFAICS
> - defer sleeping operations - makes the code flow more complex and it
>   would be again user visible in some cases.
> 
> So I would say we are in "pick your own poison" kind of situation.

Just to be clear:

NAK on this patch and any and all ductape crap. Fix it properly, make
hugetlb_lock, spool->lock IRQ-safe, move the workqueue into the CMA
thing.

The code really doesn't look _that_ complicated.



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11 11:36             ` Peter Zijlstra
@ 2021-03-11 12:02               ` Michal Hocko
  2021-03-11 17:25                 ` Mike Kravetz
  2021-03-11 12:49               ` Michal Hocko
  2021-03-11 17:50               ` Paul E. McKenney
  2 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2021-03-11 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu 11-03-21 12:36:51, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:
> 
> > Sorry for being dense but I do not follow. You have provided the
> > following example
> >   spin_lock(&A);
> >   <IRQ>
> >         spin_lock(&A);
> > 
> > if A == hugetlb_lock then we should never reenter with
> > free_huge_page
> 
> What I'm saying is that if irq_disabled(), then that interrupt cannot
> happen, so the second spin_lock cannot happen, so the deadlock cannot
> happen.
> 
> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
> recursion deadlock.

OK, then I understand your point now. I thought you were arguing about
an actual deadlock scenario. As I've said, an irq_disabled check would be
needed because of the sleeping operations that we already do.

> Also, Linus hates constructs like this:
> 
>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com
> 
> > From the code simplicity POV (and hugetlb has grown a lot of complexity)
> > it would be easiest to make sure __free_huge_page is called
> > from a non-atomic process context. There are a few ways to do that
> > - defer each call to a WQ - user visible which sucks
> > - defer from atomic or otherwise non-sleeping contexts - requires
> >   reliable in_atomic AFAICS
> > - defer sleeping operations - makes the code flow more complex and it
> >   would be again user visible in some cases.
> > 
> > So I would say we are in "pick your own poison" kind of situation.
> 
> Just to be clear:
> 
> NAK on this patch and any and all ductape crap. Fix it properly, make
> hugetlb_lock, spool->lock IRQ-safe, move the workqueue into the CMA
> thing.
> 
> The code really doesn't look _that_ complicated.

Fair enough. As I've said, I am not a great fan of this patch either,
but it is a quick fix for a likely long-term problem. If reworking the
hugetlb locking is preferable, then so be it.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11 11:36             ` Peter Zijlstra
  2021-03-11 12:02               ` Michal Hocko
@ 2021-03-11 12:49               ` Michal Hocko
  2021-03-11 17:50               ` Paul E. McKenney
  2 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2021-03-11 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Kravetz, linux-mm, linux-kernel, Paul E . McKenney,
	Shakeel Butt, tglx, john.ogness, urezki, ast, Eric Dumazet,
	Mina Almasry, Andrew Morton

On Thu 11-03-21 12:36:51, Peter Zijlstra wrote:
[...]
> Also, Linus hates constructs like this:
> 
>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

Btw. I would really appreciate it if somebody would explain why it is
_fundamentally broken_ to check for an atomic context and choose a
different handling in a code path, like put_page, which has no control
over its calling context. This can be called from a wide variety of
contexts. There is no way to pass context information to the called
function. I do recognize that this is not an act of beauty, but why is it
fundamentally broken?

The put_page context can certainly work towards robustness and operate
on the most restrictive context grounds (I really hope nobody will ever
come up with an idea that put_page can be called from nmi context). This
can make the code more complex and less optimal in the normal case (e.g.
hugetlb is almost never freed from an atomic context - one has to be
really creative to achieve that). So where do we draw the line?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11 12:02               ` Michal Hocko
@ 2021-03-11 17:25                 ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-03-11 17:25 UTC (permalink / raw)
  To: Michal Hocko, Peter Zijlstra
  Cc: linux-mm, linux-kernel, Paul E . McKenney, Shakeel Butt, tglx,
	john.ogness, urezki, ast, Eric Dumazet, Mina Almasry,
	Andrew Morton

On 3/11/21 4:02 AM, Michal Hocko wrote:
> On Thu 11-03-21 12:36:51, Peter Zijlstra wrote:
>> On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:
>>
>>> Sorry for being dense but I do not follow. You have provided the
>>> following example
>>>   spin_lock(&A);
>>>   <IRQ>
>>>         spin_lock(&A);
>>>
>>> if A == hugetlb_lock then we should never reenter with
>>> free_huge_page
>>
>> What I'm saying is that if irq_disabled(), then that interrupt cannot
>> happen, so the second spin_lock cannot happen, so the deadlock cannot
>> happen.
>>
>> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
>> recursion deadlock.
> 
> OK, then I understand your point now. I thought you were arguing about
> an actual deadlock scenario. As I've said, an irq_disabled check would be
> needed because of the sleeping operations that we already do.
> 
>> Also, Linus hates constructs like this:
>>
>>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com
>>
>>> From the code simplicity POV (and hugetlb has grown a lot of complexity)
>>> it would be easiest to make sure __free_huge_page is called
>>> from a non-atomic process context. There are a few ways to do that
>>> - defer each call to a WQ - user visible which sucks
>>> - defer from atomic or otherwise non-sleeping contexts - requires
>>>   reliable in_atomic AFAICS
>>> - defer sleeping operations - makes the code flow more complex and it
>>>   would be again user visible in some cases.
>>>
>>> So I would say we are in "pick your own poison" kind of situation.
>>
>> Just to be clear:
>>
>> NAK on this patch and any and all ductape crap. Fix it properly, make
>> hugetlb_lock, spool->lock IRQ-safe, move the workqueue into the CMA
>> thing.
>>
>> The code really doesn't look _that_ complicated.
> 
> Fair enough. As I've said, I am not a great fan of this patch either,
> but it is a quick fix for a likely long-term problem. If reworking the
> hugetlb locking is preferable, then so be it.

Thank you Michal and Peter.  This patch was mostly about starting a
discussion, as this topic came up in a couple different places.  I
included the 'train wreck' of how we got here just for a bit of history.

I'll start working on a proper fix.
-- 
Mike Kravetz



* Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use
  2021-03-11 11:36             ` Peter Zijlstra
  2021-03-11 12:02               ` Michal Hocko
  2021-03-11 12:49               ` Michal Hocko
@ 2021-03-11 17:50               ` Paul E. McKenney
  2 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2021-03-11 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Mike Kravetz, linux-mm, linux-kernel, Shakeel Butt,
	tglx, john.ogness, urezki, ast, Eric Dumazet, Mina Almasry,
	Andrew Morton

On Thu, Mar 11, 2021 at 12:36:51PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 12:09:15PM +0100, Michal Hocko wrote:
> 
> > Sorry for being dense but I do not follow. You have provided the
> > following example
> >   spin_lock(&A);
> >   <IRQ>
> >         spin_lock(&A);
> > 
> > if A == hugetlb_lock then we should never reenter with
> > free_huge_page
> 
> What I'm saying is that if irq_disabled(), then that interrupt cannot
> happen, so the second spin_lock cannot happen, so the deadlock cannot
> happen.
> 
> So: '!irqs_disabled() && in_atomic()' is sufficient to avoid the IRQ
> recursion deadlock.
> 
> Also, Linus hates constructs like this:
> 
>   https://lkml.kernel.org/r/CAHk-=wht7kAeyR5xEW2ORj7m0hibVxZ3t+2ie8vNHLQfdbN2_g@mail.gmail.com

To be fair, later in that same thread Linus states that his main concern
is not core code, but rather driver code:

https://lore.kernel.org/lkml/CAHk-=wjsMycgMHJrCmeetR3r+K5bpSRtmVWfd8iaoQCYd_VYAg@mail.gmail.com/

Nevertheless, if the job can be done reasonably without checking the
preemption/interrupt state, why not?  And Mike's patch is still useful
for people hitting this bug.

							Thanx, Paul

> > From the code simplicity POV (and hugetlb has grown a lot of complexity)
> > it would be easiest to make sure __free_huge_page is called
> > from a non-atomic process context. There are a few ways to do that
> > - defer each call to a WQ - user visible which sucks
> > - defer from atomic or otherwise non-sleeping contexts - requires
> >   reliable in_atomic AFAICS
> > - defer sleeping operations - makes the code flow more complex and it
> >   would be again user visible in some cases.
> > 
> > So I would say we are in "pick your own poison" kind of situation.
> 
> Just to be clear:
> 
> NAK on this patch and any and all ductape crap. Fix it properly, make
> hugetlb_lock, spool->lock IRQ-safe, move the workqueue into the CMA
> thing.
> 
> The code really doesn't look _that_ complicated.


