linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave()
@ 2023-06-23 17:12 Sebastian Andrzej Siewior
  2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-23 17:12 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Tetsuo Handa, Thomas Gleixner,
	Waiman Long, Will Deacon

Hi,

This started out as a single patch (2/2), but it was then pointed out that the
lockdep annotation in seqlock needs to be adjusted to fully close the
printk window, so that there is no printing after the seqlock has been
acquired and before printk_deferred_enter() takes effect.

I'm sending both patches as one series so that both sides (locking and mm)
are aware of the situation.
I hope that the patches can be applied independently via their respective
subsystem trees (the lockdep splat + deadlock is low risk).

The original thread starts at
	https://lore.kernel.org/20230621104034.HT6QnNkQ@linutronix.de

Sebastian




* [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-23 17:12 [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Sebastian Andrzej Siewior
@ 2023-06-23 17:12 ` Sebastian Andrzej Siewior
  2023-06-24  6:54   ` Tetsuo Handa
  2023-06-26 12:56   ` Mel Gorman
  2023-06-23 17:12 ` [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save() Sebastian Andrzej Siewior
  2023-06-25  2:27 ` [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Tetsuo Handa
  2 siblings, 2 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-23 17:12 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Tetsuo Handa, Thomas Gleixner,
	Waiman Long, Will Deacon, Sebastian Andrzej Siewior,
	Tetsuo Handa

It was brought up by Tetsuo that the following sequence
   write_seqlock_irqsave()
   printk_deferred_enter()

could lead to a deadlock if the lockdep annotation within
write_seqlock_irqsave() triggers. The problem is that the sequence
counter is incremented before the lockdep annotation is performed. The
lockdep splat would then attempt to invoke printk() but the reader side of
the same seqcount could have tty_port::lock acquired while waiting for
the sequence number to become even again.

The other lockdep annotations come before the actual locking because "we
want to see the locking error before it happens". There is no reason why
seqcount should be different here.

Do the lockdep annotation first then perform the locking operation (the
sequence increment).

Fixes: 1ca7d67cf5d5a ("seqcount: Add lockdep functionality to seqcount/seqlock structures")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/20230621130641.-5iueY1I@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/seqlock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 3926e90279477..d778af83c8f36 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -512,8 +512,8 @@ do {									\
 
 static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
 {
-	do_raw_write_seqcount_begin(s);
 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
+	do_raw_write_seqcount_begin(s);
 }
 
 /**
-- 
2.40.1
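
To make the window described in the changelog concrete, here is a rough sketch
of what the write side boils down to, loosely following include/linux/seqlock.h
(simplified, not the verbatim source), with notes on where a lockdep splat
would printk() relative to the sequence increment:

#include <linux/seqlock.h>

/* Sketch only: write_seqlock_irqsave() roughly expands to this. */
static inline unsigned long sketch_write_seqlock_irqsave(seqlock_t *sl)
{
	unsigned long flags;

	spin_lock_irqsave(&sl->lock, flags);	/* IRQs off, spinlock_t held */
	/*
	 * Old order:
	 *	s->sequence++;		readers spin until it is even again
	 *	seqcount_acquire();	a splat here printk()s with the counter
	 *				already odd, possibly against a reader
	 *				that holds tty_port::lock
	 * New order (this patch):
	 *	seqcount_acquire();	a splat here still sees an even counter
	 *	s->sequence++;
	 */
	do_write_seqcount_begin(&sl->seqcount.seqcount);
	return flags;
}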



* [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 17:12 [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Sebastian Andrzej Siewior
  2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
@ 2023-06-23 17:12 ` Sebastian Andrzej Siewior
  2023-06-23 18:17   ` Michal Hocko
  2023-06-25  2:27 ` [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Tetsuo Handa
  2 siblings, 1 reply; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-23 17:12 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Tetsuo Handa, Thomas Gleixner,
	Waiman Long, Will Deacon, Sebastian Andrzej Siewior

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on
PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
lock on PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in
one step.
printk_deferred_enter() has to be invoked in a non-migratable context to
ensure that deferred printing is enabled and disabled on the same CPU.
This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
	local_irq_disable();
	printk_deferred_enter();
	write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the
lockdep annotation happens after the sequence count is incremented.
This is exceptional and subject to change.

It was also pointed out that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts. This isn't the case because PREEMPT_RT's printk
implementation differs from the mainline implementation in two important
aspects:
- Printing happens in a dedicated thread and not during the
  invocation of printk().
- In emergency cases where synchronous printing is used, a different
  driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/page_alloc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b7..99b7e7d09c5c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
 	unsigned long flags;
 
 	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+	 * The zonelist_update_seq must be acquired with irqsave because the
+	 * reader can be invoked from IRQ with GFP_ATOMIC.
 	 */
-	local_irq_save(flags);
+	write_seqlock_irqsave(&zonelist_update_seq, flags);
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
@@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
-	local_irq_restore(flags);
+	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
 }
 
 static noinline void __init
-- 
2.40.1
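
The "same CPU" requirement for printk_deferred_enter()/printk_deferred_exit()
mentioned above comes from the pair operating on a per-CPU counter. A rough
sketch of the mainline mechanism (along the lines of
kernel/printk/printk_safe.c, simplified and not the verbatim source):

#include <linux/percpu.h>

static DEFINE_PER_CPU(int, printk_context);

void __printk_safe_enter(void)		/* printk_deferred_enter() */
{
	this_cpu_inc(printk_context);	/* only valid if we cannot migrate */
}

void __printk_safe_exit(void)		/* printk_deferred_exit() */
{
	this_cpu_dec(printk_context);	/* must run on the same CPU */
}

printk() takes the deferred path (console flushing via irq_work) while the
local counter is non-zero, so the enter/exit pair must not migrate in between.
write_seqlock_irqsave() already pins the task (IRQs disabled on !PREEMPT_RT,
migration disabled by the sleeping spinlock on PREEMPT_RT), which is why
calling printk_deferred_enter() right after it is sufficient.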



* Re: [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 17:12 ` [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save() Sebastian Andrzej Siewior
@ 2023-06-23 18:17   ` Michal Hocko
  2023-06-23 20:15     ` [PATCH v3 " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2023-06-23 18:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Petr Mladek, Tetsuo Handa, Thomas Gleixner, Waiman Long,
	Will Deacon

On Fri 23-06-23 19:12:32, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
> 
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
> 
> There was discussion on the first submission that the order should be:
> 	local_irq_disable();
> 	printk_deferred_enter();
> 	write_seqlock();
> 
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
> 
> It was also pointed that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in a dedicated threads and not at during the
>   invocation of printk().
> - In emergency cases where synchronous printing is used, a different
>   driver is used which does not use tty_port::lock.
> 
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
> 
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Thanks for extending the changelog. This is much clearer IMO.

One nit below which I haven't noticed before. Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..99b7e7d09c5c0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,11 +5808,10 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>  
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
>  	 * Explicitly disable this CPU's synchronous printk() before taking
>  	 * seqlock to prevent any printk() from trying to hold port->lock, for

This is not the case anymore because the locking order has flipped. I
would just extend the comment above by something like:

	 * Also disable synchronous printk() to prevent any printk() from trying
	 * to hold port->lock, for tty_insert_flip_string_and_push_buffer() on
	 * other CPU might be calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with
	 * port->lock held.

> @@ -5820,7 +5819,6 @@ static void __build_all_zonelists(void *data)
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.40.1

-- 
Michal Hocko
SUSE Labs


* [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 18:17   ` Michal Hocko
@ 2023-06-23 20:15     ` Sebastian Andrzej Siewior
  2023-06-26  7:56       ` David Hildenbrand
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-23 20:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Petr Mladek, Tetsuo Handa, Thomas Gleixner, Waiman Long,
	Will Deacon

__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on
PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
lock on PREEMPT_RT and must not be acquired with disabled interrupts.

The API provides write_seqlock_irqsave() which does the right thing in
one step.
printk_deferred_enter() has to be invoked in a non-migratable context to
ensure that deferred printing is enabled and disabled on the same CPU.
This is the case after zonelist_update_seq has been acquired.

There was discussion on the first submission that the order should be:
	local_irq_disable();
	printk_deferred_enter();
	write_seqlock();

to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the
lockdep annotation happens after the sequence count is incremented.
This is exceptional and subject to change.

It was also pointed out that PREEMPT_RT can be affected by the printk
problem since its write_seqlock_irqsave() does not really disable
interrupts. This isn't the case because PREEMPT_RT's printk
implementation differs from the mainline implementation in two important
aspects:
- Printing happens in a dedicated thread and not during the
  invocation of printk().
- In emergency cases where synchronous printing is used, a different
  driver is used which does not use tty_port::lock.

Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.

Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
---
v2…v3
  - Update comment as per Michal's suggestion.

v1…v2:
  - Improve commit description

 mm/page_alloc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b7..440e9af67b48d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
 	unsigned long flags;
 
 	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
+	 * The zonelist_update_seq must be acquired with irqsave because the
+	 * reader can be invoked from IRQ with GFP_ATOMIC.
 	 */
-	local_irq_save(flags);
+	write_seqlock_irqsave(&zonelist_update_seq, flags);
 	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
+	 * Also disable synchronous printk() to prevent any printk() from
+	 * trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
-	local_irq_restore(flags);
+	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
 }
 
 static noinline void __init
-- 
2.40.1
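
For reference, a rough sketch of why the old local_irq_save() + write_seqlock()
pairing is a problem on PREEMPT_RT: spinlock_t maps to an rtmutex-based lock
there and its irqsave variant does not actually disable interrupts (loosely
following include/linux/spinlock_rt.h; simplified, not the verbatim source):

#include <linux/spinlock.h>
#include <linux/typecheck.h>

/* Sketch of what spin_lock_irqsave() on a spinlock_t becomes with PREEMPT_RT. */
#define rt_sketch_spin_lock_irqsave(lock, flags)			\
	do {								\
		typecheck(unsigned long, flags);			\
		flags = 0;		/* IRQs stay enabled */		\
		spin_lock(lock);	/* may sleep, disables migration */ \
	} while (0)

With the old ordering, local_irq_save() disables hard IRQs and write_seqlock()
then tries to acquire this sleeping lock, which is not allowed. With
write_seqlock_irqsave() the RT variant simply skips the IRQ disabling, so the
same call works on both preemption models.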



* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
@ 2023-06-24  6:54   ` Tetsuo Handa
  2023-06-26  8:12     ` Sebastian Andrzej Siewior
  2023-06-26 12:56   ` Mel Gorman
  1 sibling, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-24  6:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mm, linux-kernel
  Cc: Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023/06/24 2:12, Sebastian Andrzej Siewior wrote:
>  include/linux/seqlock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 3926e90279477..d778af83c8f36 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -512,8 +512,8 @@ do {									\
>  
>  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
>  {
> -	do_raw_write_seqcount_begin(s);
>  	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> +	do_raw_write_seqcount_begin(s);
>  }
>  
>  /**

Why not to do the same on the end side?

 static inline void do_write_seqcount_end(seqcount_t *s)
 {
- 	seqcount_release(&s->dep_map, _RET_IP_);
 	do_raw_write_seqcount_end(s);
+	seqcount_release(&s->dep_map, _RET_IP_);
 }



* Re: [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave()
  2023-06-23 17:12 [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Sebastian Andrzej Siewior
  2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
  2023-06-23 17:12 ` [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save() Sebastian Andrzej Siewior
@ 2023-06-25  2:27 ` Tetsuo Handa
  2 siblings, 0 replies; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-25  2:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mm, linux-kernel
  Cc: Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023/06/24 2:12, Sebastian Andrzej Siewior wrote:
> Hi,
> 
> this has been a single patch (2/2) but then it was pointed out that the
> lockdep annotation in seqlock needs to be adjusted to fully close the
> printk window so that there is no printing after the seq-lock has been
> acquired and before printk_deferred_enter() takes effect.
> 
> I'm sending both patches in this series so both sides (locking and mm)
> are aware of the situation. 
> I hope that both patches can be applied independently via their subsystem
> tree (the lockdep splat + deadlock is low risk).
> 
> The original thread starts at
> 	https://lore.kernel.org/20230621104034.HT6QnNkQ@linutronix.de
> 
> Sebastian

The original thread is too long to read. Below is a full summary for locking
maintainers to accept [PATCH 1/2].



When commit 1ca7d67cf5d5 ("seqcount: Add lockdep functionality to
seqcount/seqlock structures") added seqcount_acquire(&s->dep_map) check to
write_seqcount_begin_nested() and seqcount_release(&s->dep_map) check to
write_seqcount_end(), the ordering of updating s->sequence and doing lockdep
annotation was not important.

But since commit 3d36424b3b58 ("mm/page_alloc: fix race condition between
build_all_zonelists and page allocation") started calling
read_seqbegin(&zonelist_update_seq)/read_seqretry(&zonelist_update_seq) from
kmalloc(GFP_ATOMIC) path, commit 1007843a9190 ("mm/page_alloc: fix potential
deadlock on zonelist_update_seq seqlock") tried to close the race window using

 __build_all_zonelists() {
+    local_irq_save(flags);
+    printk_deferred_enter();
     write_seqlock(&zonelist_update_seq);
     (...snipped...)
     write_sequnlock(&zonelist_update_seq);
+    printk_deferred_exit();
+    local_irq_restore(flags);
 }

pattern. The reason behind this ordering was to

  satisfy "printk_deferred_enter() depends on local IRQs being disabled"

and

  make sure that "no synchronous printk() (for whatever reasons, not only
  printk() from build_zonelists() from __build_all_zonelists(), but also
  including printk() from lockdep, soft-lockup, KCSAN etc.) happens between
  write_seqlock() and write_sequnlock()".

However, Sebastian Andrzej Siewior mentioned that this ordering is
problematic if CONFIG_PREEMPT_RT=y, for disabling local IRQs conflicts with
"spin_lock(&zonelist_update_seq.lock) from write_seqlock(&zonelist_update_seq)
needs to be able to sleep", and Sebastian is proposing

 __build_all_zonelists() {
-    local_irq_save(flags);
-    printk_deferred_enter();
-    write_seqlock(&zonelist_update_seq);
+    write_seqlock_irqsave(&zonelist_update_seq, flags);
+    printk_deferred_enter();
     (...snipped...)
+    printk_deferred_exit();
+    write_sequnlock_irqrestore(&zonelist_update_seq, flags);
-    write_sequnlock(&zonelist_update_seq);
-    printk_deferred_exit();
-    local_irq_restore(flags);
 }

change as [PATCH 2/2]. Since write_seqlock_irqsave() becomes write_seqlock()
if CONFIG_PREEMPT_RT=y, this change can solve the conflict.

In order to accept this proposal, we need to make sure that

  no synchronous printk() happens between

    write_seqlock_irqsave(&zonelist_update_seq, flags) made
    zonelist_update_seq.seqcount odd

  and

    printk_deferred_enter() takes effect

and

  no synchronous printk() happens between

    printk_deferred_exit() took effect

  and

    write_sequnlock_irqrestore(&zonelist_update_seq, flags) makes
    zonelist_update_seq.seqcount even.

To that end, Sebastian is proposing

 static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
 {
-    do_raw_write_seqcount_begin(s);
     seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
+    do_raw_write_seqcount_begin(s);
 }

 static inline void do_write_seqcount_end(seqcount_t *s)
 {
-    seqcount_release(&s->dep_map, _RET_IP_);
     do_raw_write_seqcount_end(s);
+    seqcount_release(&s->dep_map, _RET_IP_);
 }

as [PATCH 1/2].

With [PATCH 1/2] and [PATCH 2/2], possibility of synchronous printk() changes like below.

__build_all_zonelists() {
  write_seqlock_irqsave(&zonelist_update_seq, flags) {
    __write_seqlock_irqsave(&zonelist_update_seq) {
      spin_lock_irqsave(&zonelist_update_seq.lock, flags); // local IRQs disabled = synchronous printk() from IRQs is disabled here
      do_write_seqcount_begin(&zonelist_update_seq.seqcount.seqcount) {
        do_write_seqcount_begin_nested(&zonelist_update_seq.seqcount.seqcount, 0) {
          seqcount_acquire(&zonelist_update_seq.seqcount.seqcount.dep_map, 0, 0, _RET_IP_); // synchronous printk() from lockdep might happen here
          do_raw_write_seqcount_begin(&zonelist_update_seq.seqcount.seqcount) {
            // zonelist_update_seq.seqcount.seqcount.sequence is still even = kmalloc(GFP_ATOMIC) with port lock held is safe = synchronous printk() is safe
            kcsan_nestable_atomic_begin(); // KCSAN is disabled = synchronous printk() from KCSAN is disabled here
            zonelist_update_seq.seqcount.seqcount.sequence++; // zonelist_update_seq.seqcount.seqcount.sequence is now odd = kmalloc(GFP_ATOMIC) with port lock held is not safe = synchronous printk() is not safe
          }
        }
      }
    }
  }
  printk_deferred_enter(); // synchronous printk() from whatever reason is disabled here
  (...snipped...)
  printk_deferred_exit(); // synchronous printk() from whatever reason is enabled here
  write_sequnlock_irqrestore(&zonelist_update_seq, flags) {
    do_write_seqcount_end(&zonelist_update_seq.seqcount.seqcount) {
      do_raw_write_seqcount_end(&zonelist_update_seq.seqcount.seqcount) {
        zonelist_update_seq.seqcount.seqcount.sequence++; // zonelist_update_seq.seqcount.seqcount.sequence is now even = kmalloc(GFP_ATOMIC) with port lock held is safe = synchronous printk() is safe
        kcsan_nestable_atomic_end(); // KCSAN is enabled = synchronous printk() from KCSAN is enabled here
      }
      seqcount_release(&zonelist_update_seq.seqcount.seqcount.dep_map, _RET_IP_); // synchronous printk() from lockdep might happen here
    }
    spin_unlock_irqrestore(&zonelist_update_seq.lock, flags); // local IRQs enabled = synchronous printk() from IRQs is enabled here
  }
}



* Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 20:15     ` [PATCH v3 " Sebastian Andrzej Siewior
@ 2023-06-26  7:56       ` David Hildenbrand
  2023-06-26 13:14       ` Mel Gorman
  2023-06-28 13:56       ` Michal Hocko
  2 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2023-06-26  7:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michal Hocko
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Petr Mladek, Tetsuo Handa, Thomas Gleixner, Waiman Long,
	Will Deacon

On 23.06.23 22:15, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
> 
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
> 
> There was discussion on the first submission that the order should be:
> 	local_irq_disable();
> 	printk_deferred_enter();
> 	write_seqlock();
> 
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
> 
> It was also pointed that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in a dedicated threads and not at during the
>    invocation of printk().
> - In emergency cases where synchronous printing is used, a different
>    driver is used which does not use tty_port::lock.
> 
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
> 
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> v2…v3
>    - Update comment as per Michal's suggestion.
> 
> v1…v2:
>    - Improve commit description
> 
>   mm/page_alloc.c | 15 ++++++---------
>   1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..440e9af67b48d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
>   	unsigned long flags;
>   
>   	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>   	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>   	/*
> -	 * Explicitly disable this CPU's synchronous printk() before taking
> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
> +	 * Also disable synchronous printk() to prevent any printk() from
> +	 * trying to hold port->lock, for
>   	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
>   	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>   	 */
>   	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>   
>   #ifdef CONFIG_NUMA
>   	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>   #endif
>   	}
>   
> -	write_sequnlock(&zonelist_update_seq);
>   	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>   }
>   
>   static noinline void __init

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-24  6:54   ` Tetsuo Handa
@ 2023-06-26  8:12     ` Sebastian Andrzej Siewior
  2023-06-26  9:25       ` Tetsuo Handa
  2023-06-26 14:44       ` Petr Mladek
  0 siblings, 2 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-26  8:12 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> Why not to do the same on the end side?
> 
>  static inline void do_write_seqcount_end(seqcount_t *s)
>  {
> - 	seqcount_release(&s->dep_map, _RET_IP_);
>  	do_raw_write_seqcount_end(s);
> +	seqcount_release(&s->dep_map, _RET_IP_);
>  }

I don't have a compelling argument for doing it. It is probably better
to release the lock from lockdep's point of view and then really release
it (so it can't be acquired before it is released).

Looking at other locking primitives (spin_unlock(),
mutex_unlock(),…) that is what they do in the unlock path: lockdep
annotation followed by the actual operation. Therefore I would keep the
current ordering to remain in-sync with the other primitives.
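
For comparison, the unlock path of those primitives looks roughly like this
(sketch along the lines of include/linux/spinlock_api_smp.h; simplified, not
the verbatim source): lockdep release first, real unlock second, which is the
ordering do_write_seqcount_end() already uses.

#include <linux/spinlock.h>

static inline void sketch_raw_spin_unlock(raw_spinlock_t *lock)
{
	spin_release(&lock->dep_map, _RET_IP_);	/* lockdep annotation first */
	do_raw_spin_unlock(lock);		/* then the actual unlock */
	preempt_enable();
}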

Sebastian


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26  8:12     ` Sebastian Andrzej Siewior
@ 2023-06-26  9:25       ` Tetsuo Handa
  2023-06-26 10:48         ` Peter Zijlstra
  2023-06-26 14:44       ` Petr Mladek
  1 sibling, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-26  9:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>> Why not to do the same on the end side?
>>
>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>  {
>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>  	do_raw_write_seqcount_end(s);
>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>  }
> 
> I don't have a compelling argument for doing it. It is probably better
> to release the lock from lockdep's point of view and then really release
> it (so it can't be acquired before it is released).

We must do it because this is a source of possible printk() deadlock.
Otherwise, I will nack on PATCH 2/2.

> 
> Looking at other locking primitives (spin_lock_unlock(),
> mutex_unlock(),…) that is what they do in the unlock path: lockdep
> annotation followed by the actual operation. Therefore I would keep the
> current ordering to remain in-sync with the other primitives.



* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26  9:25       ` Tetsuo Handa
@ 2023-06-26 10:48         ` Peter Zijlstra
  2023-06-26 11:26           ` Tetsuo Handa
  2023-06-26 13:13           ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-06-26 10:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sebastian Andrzej Siewior, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> > On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >> Why not to do the same on the end side?
> >>
> >>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>  {
> >> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>  	do_raw_write_seqcount_end(s);
> >> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>  }
> > 
> > I don't have a compelling argument for doing it. It is probably better
> > to release the lock from lockdep's point of view and then really release
> > it (so it can't be acquired before it is released).
> 
> We must do it because this is a source of possible printk() deadlock.
> Otherwise, I will nack on PATCH 2/2.

Don't be like that... just hate on printk like the rest of us. In fact,
I've been patching out the actual printk code for years because it's
unusable garbage.

Will this actually still be a problem once all the fancy printk stuff
lands? That shouldn't do synchronous prints except to 'atomic' consoles
by default IIRC.


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 10:48         ` Peter Zijlstra
@ 2023-06-26 11:26           ` Tetsuo Handa
  2023-06-26 11:35             ` Michal Hocko
  2023-06-26 13:13           ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-26 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On 2023/06/26 19:48, Peter Zijlstra wrote:
> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>>> Why not to do the same on the end side?
>>>>
>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>>  {
>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>>  	do_raw_write_seqcount_end(s);
>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>>  }
>>>
>>> I don't have a compelling argument for doing it. It is probably better
>>> to release the lock from lockdep's point of view and then really release
>>> it (so it can't be acquired before it is released).
>>
>> We must do it because this is a source of possible printk() deadlock.
>> Otherwise, I will nack on PATCH 2/2.
> 
> Don't be like that... just hate on prink like the rest of us. In fact,
> i've been patching out the actual printk code for years because its
> unusable garbage.
> 
> Will this actually still be a problem once all the fancy printk stuff
> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> by default IIRC.

Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
This means that we can't count on all the fancy printk stuff being available.



* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 11:26           ` Tetsuo Handa
@ 2023-06-26 11:35             ` Michal Hocko
  2023-06-26 12:27               ` Tetsuo Handa
  2023-06-26 12:46               ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2023-06-26 11:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Peter Zijlstra, Sebastian Andrzej Siewior, linux-mm,
	linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
> On 2023/06/26 19:48, Peter Zijlstra wrote:
> > On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> >> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> >>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >>>> Why not to do the same on the end side?
> >>>>
> >>>>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>>>  {
> >>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>  	do_raw_write_seqcount_end(s);
> >>>> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>  }
> >>>
> >>> I don't have a compelling argument for doing it. It is probably better
> >>> to release the lock from lockdep's point of view and then really release
> >>> it (so it can't be acquired before it is released).
> >>
> >> We must do it because this is a source of possible printk() deadlock.
> >> Otherwise, I will nack on PATCH 2/2.
> > 
> > Don't be like that... just hate on prink like the rest of us. In fact,
> > i've been patching out the actual printk code for years because its
> > unusable garbage.
> > 
> > Will this actually still be a problem once all the fancy printk stuff
> > lands? That shouldn't do synchronous prints except to 'atomic' consoles
> > by default IIRC.
> 
> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
> This means that we can't count on all the fancy printk stuff being available.

Is there any reason to backport RT specific fixup to stable trees? I
mean seriously, is there any actual memory hotplug user using
PREEMPT_RT? I would be more than curious to hear the usecase.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 11:35             ` Michal Hocko
@ 2023-06-26 12:27               ` Tetsuo Handa
  2023-06-26 13:16                 ` Michal Hocko
  2023-06-26 12:46               ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-26 12:27 UTC (permalink / raw)
  To: Michal Hocko, Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On 2023/06/26 20:35, Michal Hocko wrote:
> On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
>> On 2023/06/26 19:48, Peter Zijlstra wrote:
>>> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
>>>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
>>>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>>>>> Why not to do the same on the end side?
>>>>>>
>>>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>>>>  {
>>>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>>>>  	do_raw_write_seqcount_end(s);
>>>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>>>>  }
>>>>>
>>>>> I don't have a compelling argument for doing it. It is probably better
>>>>> to release the lock from lockdep's point of view and then really release
>>>>> it (so it can't be acquired before it is released).
>>>>
>>>> We must do it because this is a source of possible printk() deadlock.
>>>> Otherwise, I will nack on PATCH 2/2.
>>>
>>> Don't be like that... just hate on prink like the rest of us. In fact,
>>> i've been patching out the actual printk code for years because its
>>> unusable garbage.
>>>
>>> Will this actually still be a problem once all the fancy printk stuff
>>> lands? That shouldn't do synchronous prints except to 'atomic' consoles
>>> by default IIRC.
>>
>> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
>> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
>> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
>> This means that we can't count on all the fancy printk stuff being available.
> 
> Is there any reason to backport RT specific fixup to stable trees? I
> mean seriously, is there any actual memory hotplug user using
> PREEMPT_RT? I would be more than curious to hear the usecase.

Even if we don't backport RT specific fixup to stable trees, [PATCH 2/2] requires
that [PATCH 1/2] guarantees that synchronous printk() never happens (for whatever
reasons) between write_seqlock_irqsave(&zonelist_update_seq, flags) and
write_sequnlock_irqrestore(&zonelist_update_seq, flags).

If [PATCH 1/2] cannot guarantee it, [PATCH 2/2] will be automatically rejected.

If [PATCH 2/2] cannot be applied, we have several alternatives.

Alternative 1:

  Revert both commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
  and commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
  I don't think this will happen, for nobody will be happy.

Alternative 2:

  Revert commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
  and apply "mm/page_alloc: don't check zonelist_update_seq from atomic allocations" at
  https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
  I think this is reasonable, for this reduces locking dependency. But Michal Hocko did not like it.

Alternative 3:

  Somehow preserve printk_deferred_enter() => write_seqlock(&zonelist_update_seq) and
  write_sequnlock(&zonelist_update_seq) => printk_deferred_exit() pattern. Something like below?

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..ded3ac3856e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5805,6 +5805,7 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
@@ -5813,6 +5814,9 @@ static void __build_all_zonelists(void *data)
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
+#else
+	migrate_disable();
+#endif
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
@@ -5859,7 +5863,11 @@ static void __build_all_zonelists(void *data)
 
 	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
+#ifndef CONFIG_PREEMPT_RT
 	local_irq_restore(flags);
+#else
+	migrate_enable();
+#endif
 }
 
 static noinline void __init
----------------------------------------



* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 11:35             ` Michal Hocko
  2023-06-26 12:27               ` Tetsuo Handa
@ 2023-06-26 12:46               ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-26 12:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Peter Zijlstra, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On 2023-06-26 13:35:39 [+0200], Michal Hocko wrote:
> Is there any reason to backport RT specific fixup to stable trees? I
> mean seriously, is there any actual memory hotplug user using
> PREEMPT_RT? I would be more than curious to hear the usecase.

There is no need for stable backports for RT-only fixes. We have
RT-stable trees for that.
The reason why we fix it in the RT-stable tree is to not have something
broken that was known to work.

Sebastian


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
  2023-06-24  6:54   ` Tetsuo Handa
@ 2023-06-26 12:56   ` Mel Gorman
  1 sibling, 0 replies; 30+ messages in thread
From: Mel Gorman @ 2023-06-26 12:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Michal Hocko,
	Peter Zijlstra, Petr Mladek, Tetsuo Handa, Thomas Gleixner,
	Waiman Long, Will Deacon

On Fri, Jun 23, 2023 at 07:12:31PM +0200, Sebastian Andrzej Siewior wrote:
> It was brought up by Tetsuo that the following sequence
>    write_seqlock_irqsave()
>    printk_deferred_enter()
> 
> could lead to a deadlock if the lockdep annotation within
> write_seqlock_irqsave() triggers. The problem is that the sequence
> counter is incremented before the lockdep annotation is performed. The
> lockdep splat would then attempt to invoke printk() but the reader side,
> of the same seqcount, could have a tty_port::lock acquired waiting for
> the sequence number to become even again.
> 
> The other lockdep annotations come before the actual locking because "we
> want to see the locking error before it happens". There is no reason why
> seqcount should be different here.
> 
> Do the lockdep annotation first then perform the locking operation (the
> sequence increment).
> 
> Fixes: 1ca7d67cf5d5a ("seqcount: Add lockdep functionality to seqcount/seqlock structures")
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Link: https://lore.kernel.org/20230621130641.-5iueY1I@linutronix.de
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 10:48         ` Peter Zijlstra
  2023-06-26 11:26           ` Tetsuo Handa
@ 2023-06-26 13:13           ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-26 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tetsuo Handa, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Michal Hocko, Petr Mladek, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-06-26 12:48:31 [+0200], Peter Zijlstra wrote:
> 
> Don't be like that... just hate on prink like the rest of us. In fact,
> i've been patching out the actual printk code for years because its
> unusable garbage.

:)

> Will this actually still be a problem once all the fancy printk stuff
> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> by default IIRC.

The fancy printk stuff should have synchronous printing in emergency
situations and threaded printing by default. But then I hear John saying
that there might be push back because atomic consoles may not be
available everywhere…

Anyway, the requirements for the deadlock to trigger that Tetsuo Handa
is concerned about here:
- lockdep enabled
- console and tty in use.
- tty_insert_flip_string_and_push_buffer() on one CPU with
  tty_port::lock acquired, followed by a memory allocation blocking on
  read_seqbegin(&zonelist_update_seq) due to the writer
- memory hotplug => __build_all_zonelists() acquiring
  write_seqlock(&zonelist_update_seq) and now lockdep creates a splat.
  This is accounted for if the lockdep annotation comes first (1/2 of
  the series). But the unlock annotation of the seq unlock operation may
  still create a splat so a possibility for a deadlock remains.

The requirement is _high_ and it hardly justifies ugly code.
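
A purely illustrative sketch of that remaining window (the function names and
call sites are made up; zonelist_update_seq is declared as in mm/page_alloc.c):

#include <linux/printk.h>
#include <linux/seqlock.h>
#include <linux/slab.h>
#include <linux/tty.h>

static DEFINE_SEQLOCK(zonelist_update_seq);

/* CPU A: tty flip-buffer path, holds tty_port::lock */
static void cpu_a_tty_path(struct tty_port *port)
{
	unsigned long flags;

	spin_lock_irqsave(&port->lock, flags);
	/*
	 * A GFP_ATOMIC allocation may hit zonelist_iter_begin(), which spins
	 * for as long as the zonelist_update_seq sequence count is odd.
	 */
	(void)kmalloc(64, GFP_ATOMIC | __GFP_NOWARN);
	spin_unlock_irqrestore(&port->lock, flags);
}

/* CPU B: memory hotplug -> __build_all_zonelists() */
static void cpu_b_hotplug_path(void)
{
	unsigned long flags;

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	printk_deferred_enter();	/* splats from here on are deferred */
	/* ... rebuild the zonelists ... */
	printk_deferred_exit();
	/*
	 * The seqcount_release() annotation inside write_sequnlock_irqrestore()
	 * still runs while the sequence count is odd; a splat raised there
	 * would printk() towards the console, which may want tty_port::lock,
	 * held on CPU A while it waits for the count to become even.
	 */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
}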

Sebastian


* Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 20:15     ` [PATCH v3 " Sebastian Andrzej Siewior
  2023-06-26  7:56       ` David Hildenbrand
@ 2023-06-26 13:14       ` Mel Gorman
  2023-06-28 13:56       ` Michal Hocko
  2 siblings, 0 replies; 30+ messages in thread
From: Mel Gorman @ 2023-06-26 13:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Michal Hocko, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness,
	Peter Zijlstra, Petr Mladek, Tetsuo Handa, Thomas Gleixner,
	Waiman Long, Will Deacon

On Fri, Jun 23, 2023 at 10:15:17PM +0200, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
> 
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in non-migrate-able context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
> 
> There was discussion on the first submission that the order should be:
> 	local_irq_disable();
> 	printk_deferred_enter();
> 	write_seqlock();
> 
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
> 
> It was also pointed that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in a dedicated threads and not at during the
>   invocation of printk().
> - In emergency cases where synchronous printing is used, a different
>   driver is used which does not use tty_port::lock.
> 
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
> 
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 12:27               ` Tetsuo Handa
@ 2023-06-26 13:16                 ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2023-06-26 13:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra, linux-mm,
	linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Petr Mladek,
	Thomas Gleixner, Waiman Long, Will Deacon

On Mon 26-06-23 21:27:05, Tetsuo Handa wrote:
> On 2023/06/26 20:35, Michal Hocko wrote:
> > On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
> >> On 2023/06/26 19:48, Peter Zijlstra wrote:
> >>> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> >>>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> >>>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >>>>>> Why not to do the same on the end side?
> >>>>>>
> >>>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>>>>>  {
> >>>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>>>  	do_raw_write_seqcount_end(s);
> >>>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>>>  }
> >>>>>
> >>>>> I don't have a compelling argument for doing it. It is probably better
> >>>>> to release the lock from lockdep's point of view and then really release
> >>>>> it (so it can't be acquired before it is released).
> >>>>
> >>>> We must do it because this is a source of possible printk() deadlock.
> >>>> Otherwise, I will nack on PATCH 2/2.
> >>>
> >>> Don't be like that... just hate on prink like the rest of us. In fact,
> >>> i've been patching out the actual printk code for years because its
> >>> unusable garbage.
> >>>
> >>> Will this actually still be a problem once all the fancy printk stuff
> >>> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> >>> by default IIRC.
> >>
> >> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
> >> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
> >> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
> >> This means that we can't count on all the fancy printk stuff being available.
> > 
> > Is there any reason to backport RT specific fixup to stable trees? I
> > mean seriously, is there any actual memory hotplug user using
> > PREEMPT_RT? I would be more than curious to hear the usecase.
> 
> Even if we don't backport RT specific fixup to stable trees, [PATCH 2/2] requires
> that [PATCH 1/2] guarantees that synchronous printk() never happens (for whatever
> reasons) between write_seqlock_irqsave(&zonelist_update_seq, flags) and
> write_sequnlock_irqrestore(&zonelist_update_seq, flags).

I suspect you are overcomplicating this. I do understand that you want
to have this 100% airtight but I would argue that this is actually not
really necessary. I would be perfectly fine living in the world where
this particular path could trigger an unintended printk. IIUC we are
mostly talking about the lockup detector only, right? AFAIK there is no
such issue _now_ so we are talking about a potential _risk_ only.
 
> If [PATCH 1/2] cannot guarantee it, [PATCH 2/2] will be automatically rejected.
> 
> If [PATCH 2/2] cannot be applied, we have several alternatives.
> 
> Alternative 1:
> 
>   Revert both commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
>   and commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
>   I don't think this will happen, for nobody will be happy.
> 
> Alternative 2:
> 
>   Revert commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
>   and apply "mm/page_alloc: don't check zonelist_update_seq from atomic allocations" at
>   https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
>   I think this is reasonable, for this reduces locking dependency. But Michal Hocko did not like it.
> 
> Alternative 3:
> 
>   Somehow preserve printk_deferred_enter() => write_seqlock(&zonelist_update_seq) and
>   write_sequnlock(&zonelist_update_seq) => printk_deferred_exit() pattern. Something like below?
> 

Alternative 4:
stop chasing shadows and deal with the fact that this code won't be
perfect. Seriously, you are trying to address a non-existent problem and
blocking a working RT solution which doesn't clutter the code with
RT-specific baggage.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26  8:12     ` Sebastian Andrzej Siewior
  2023-06-26  9:25       ` Tetsuo Handa
@ 2023-06-26 14:44       ` Petr Mladek
  2023-06-28 12:14         ` Tetsuo Handa
  1 sibling, 1 reply; 30+ messages in thread
From: Petr Mladek @ 2023-06-26 14:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tetsuo Handa, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Michal Hocko, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On Mon 2023-06-26 10:12:54, Sebastian Andrzej Siewior wrote:
> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> > Why not to do the same on the end side?
> > 
> >  static inline void do_write_seqcount_end(seqcount_t *s)
> >  {
> > - 	seqcount_release(&s->dep_map, _RET_IP_);
> >  	do_raw_write_seqcount_end(s);
> > +	seqcount_release(&s->dep_map, _RET_IP_);
> >  }
> 
> I don't have a compelling argument for doing it. It is probably better
> to release the lock from lockdep's point of view and then really release
> it (so it can't be acquired before it is released).

If this is true then we should not change the ordering on the _begin
side either. I mean that we should call the lockdep code only
after the lock is taken. Anyway, both sides should be symmetric.

That said, lockdep is about chains of locks and not about timing.
We must not call lockdep annotation when the lock is still available
for a nested context. So the ordering is probably important only when
the lock might be taken from both normal and interrupt context.
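
(For reference, a minimal sketch of the two orderings being discussed,
simplified from include/linux/seqlock.h; exact helper names may differ
between kernel versions:)

  /* mainline ordering: take the write side first, then tell lockdep */
  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
  {
          do_raw_write_seqcount_begin(s);         /* sequence becomes odd here */
          seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);   /* a splat would printk() now */
  }

  /* ordering proposed by patch 1/2: annotate first, then take the write side */
  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
  {
          seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);   /* splat fires while readers can still proceed */
          do_raw_write_seqcount_begin(s);
  }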

Anyway, please do not do this change only because of printk().
IMHO, the current ordering is more logical and the printk() problem
should be solved another way.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-26 14:44       ` Petr Mladek
@ 2023-06-28 12:14         ` Tetsuo Handa
  2023-07-27 15:10           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-06-28 12:14 UTC (permalink / raw)
  To: Petr Mladek, Sebastian Andrzej Siewior
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Andrew Morton,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Michal Hocko,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

On 2023/06/26 23:44, Petr Mladek wrote:
> On Mon 2023-06-26 10:12:54, Sebastian Andrzej Siewior wrote:
>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>> Why not to do the same on the end side?
>>>
>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>  {
>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>  	do_raw_write_seqcount_end(s);
>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>  }
>>
>> I don't have a compelling argument for doing it. It is probably better
>> to release the lock from lockdep's point of view and then really release
>> it (so it can't be acquired before it is released).
> 
> If this is true then we should not change the ordering on the _begin
> side either. I mean that we should call the lockdep code only
> after the lock is taken. Anyway, both sides should be symmetric.
> 
> That said, lockdep is about chains of locks and not about timing.
> We must not call lockdep annotation when the lock is still available
> for a nested context. So the ordering is probably important only when
> the lock might be taken from both normal and interrupt context.
> 
> Anyway, please do not do this change only because of printk().
> IMHO, the current ordering is more logical and the printk() problem
> should be solved another way.

Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
rejected.



I found

  /*
   * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
   * a migration causing the wrong PCP to be locked and remote memory being
   * potentially allocated, pin the task to the CPU for the lookup+lock.
   * preempt_disable is used on !RT because it is faster than migrate_disable.
   * migrate_disable is used on RT because otherwise RT spinlock usage is
   * interfered with and a high priority task cannot preempt the allocator.
   */
  #ifndef CONFIG_PREEMPT_RT
  #define pcpu_task_pin()         preempt_disable()
  #define pcpu_task_unpin()       preempt_enable()
  #else
  #define pcpu_task_pin()         migrate_disable()
  #define pcpu_task_unpin()       migrate_enable()
  #endif

in mm/page_alloc.c . Thus, I think that calling migrate_disable() if CONFIG_PREEMPT_RT=y
and calling local_irq_save() if CONFIG_PREEMPT_RT=n (i.e. Alternative 3) will work.
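
(A rough sketch of what Alternative 3 could look like; the wrapper names
below are hypothetical, not existing kernel API, and are only meant to
illustrate the idea:)

  #ifndef CONFIG_PREEMPT_RT
  /* !RT: keep IRQs off so a GFP_ATOMIC reader in IRQ context cannot livelock */
  #define zonelist_writer_pin(flags)      local_irq_save(flags)
  #define zonelist_writer_unpin(flags)    local_irq_restore(flags)
  #else
  /* RT: only pin the task; the seqlock's inner lock is a sleeping lock here */
  #define zonelist_writer_pin(flags)      do { (void)(flags); migrate_disable(); } while (0)
  #define zonelist_writer_unpin(flags)    do { (void)(flags); migrate_enable(); } while (0)
  #endif

  	/* in __build_all_zonelists(), with "unsigned long flags;" declared there */
  	zonelist_writer_pin(flags);
  	printk_deferred_enter();
  	write_seqlock(&zonelist_update_seq);
  	/* ... rebuild zonelists ... */
  	write_sequnlock(&zonelist_update_seq);
  	printk_deferred_exit();
  	zonelist_writer_unpin(flags);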

But thinking again, since CONFIG_PREEMPT_RT=y uses a special printk() approach where messages
are printed from a dedicated kernel thread, do we need to call printk_deferred_enter() at all if
CONFIG_PREEMPT_RT=y? That is, isn't the fix as straightforward as below?

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..a2a3bfa69a12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5805,6 +5805,7 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
@@ -5820,6 +5821,7 @@ static void __build_all_zonelists(void *data)
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
+#endif
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -5858,8 +5860,10 @@ static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+#ifndef CONFIG_PREEMPT_RT
 	printk_deferred_exit();
 	local_irq_restore(flags);
+#endif
 }
 
 static noinline void __init
----------------------------------------


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save().
  2023-06-23 20:15     ` [PATCH v3 " Sebastian Andrzej Siewior
  2023-06-26  7:56       ` David Hildenbrand
  2023-06-26 13:14       ` Mel Gorman
@ 2023-06-28 13:56       ` Michal Hocko
  2 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2023-06-28 13:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Andrew Morton
  Cc: linux-mm, linux-kernel, Luis Claudio R. Goncalves, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Petr Mladek, Tetsuo Handa, Thomas Gleixner, Waiman Long,
	Will Deacon

Andrew, it seems that we have a consensus on the MM side of things that
this is good enough to go. I am not sure about patch 1, that is more for
the lockdep people, but I think that this patch is good enough on its own.
Can we get this patch merged into the mm tree and see whether any of
Tetsuo's concerns pop out?

On Fri 23-06-23 22:15:17, Sebastian Andrzej Siewior wrote:
> __build_all_zonelists() acquires zonelist_update_seq by first disabling
> interrupts via local_irq_save() and then acquiring the seqlock with
> write_seqlock(). This is troublesome and leads to problems on
> PREEMPT_RT. The problem is that the inner spinlock_t becomes a sleeping
> lock on PREEMPT_RT and must not be acquired with disabled interrupts.
> 
> The API provides write_seqlock_irqsave() which does the right thing in
> one step.
> printk_deferred_enter() has to be invoked in a non-migratable context to
> ensure that deferred printing is enabled and disabled on the same CPU.
> This is the case after zonelist_update_seq has been acquired.
> 
> There was discussion on the first submission that the order should be:
> 	local_irq_disable();
> 	printk_deferred_enter();
> 	write_seqlock();
> 
> to avoid pitfalls like having an unaccounted printk() coming from
> write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
> only origin of such a printk() can be a lockdep splat because the
> lockdep annotation happens after the sequence count is incremented.
> This is exceptional and subject to change.
> 
> It was also pointed out that PREEMPT_RT can be affected by the printk
> problem since its write_seqlock_irqsave() does not really disable
> interrupts. This isn't the case because PREEMPT_RT's printk
> implementation differs from the mainline implementation in two important
> aspects:
> - Printing happens in a dedicated thread and not during the
>   invocation of printk().
> - In emergency cases where synchronous printing is used, a different
>   driver is used which does not use tty_port::lock.
> 
> Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
> printk output.
> 
> Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> v2…v3
>   - Update comment as per Michal's suggestion.
> 
> v1…v2:
>   - Improve commit description
> 
>  mm/page_alloc.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b7..440e9af67b48d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5808,19 +5808,17 @@ static void __build_all_zonelists(void *data)
>  	unsigned long flags;
>  
>  	/*
> -	 * Explicitly disable this CPU's interrupts before taking seqlock
> -	 * to prevent any IRQ handler from calling into the page allocator
> -	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
> +	 * The zonelist_update_seq must be acquired with irqsave because the
> +	 * reader can be invoked from IRQ with GFP_ATOMIC.
>  	 */
> -	local_irq_save(flags);
> +	write_seqlock_irqsave(&zonelist_update_seq, flags);
>  	/*
> -	 * Explicitly disable this CPU's synchronous printk() before taking
> -	 * seqlock to prevent any printk() from trying to hold port->lock, for
> +	 * Also disable synchronous printk() to prevent any printk() from
> +	 * trying to hold port->lock, for
>  	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
>  	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
>  	 */
>  	printk_deferred_enter();
> -	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -5857,9 +5855,8 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	write_sequnlock(&zonelist_update_seq);
>  	printk_deferred_exit();
> -	local_irq_restore(flags);
> +	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
>  }
>  
>  static noinline void __init
> -- 
> 2.40.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-06-28 12:14         ` Tetsuo Handa
@ 2023-07-27 15:10           ` Sebastian Andrzej Siewior
  2023-07-29  5:31             ` Tetsuo Handa
  0 siblings, 1 reply; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-07-27 15:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Petr Mladek, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Michal Hocko, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon

On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> > Anyway, please do not do this change only because of printk().
> > IMHO, the current ordering is more logical and the printk() problem
> > should be solved another way.
> 
> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> rejected.

My understanding is that this patch gets applied and your objection will
be noted.

> I found
> 
>   /*
>    * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
>    * a migration causing the wrong PCP to be locked and remote memory being
>    * potentially allocated, pin the task to the CPU for the lookup+lock.
>    * preempt_disable is used on !RT because it is faster than migrate_disable.
>    * migrate_disable is used on RT because otherwise RT spinlock usage is
>    * interfered with and a high priority task cannot preempt the allocator.
>    */
>   #ifndef CONFIG_PREEMPT_RT
>   #define pcpu_task_pin()         preempt_disable()
>   #define pcpu_task_unpin()       preempt_enable()
>   #else
>   #define pcpu_task_pin()         migrate_disable()
>   #define pcpu_task_unpin()       migrate_enable()
>   #endif
> 
> in mm/page_alloc.c . Thus, I think that calling migrate_disable() if CONFIG_PREEMPT_RT=y
> and calling local_irq_save() if CONFIG_PREEMPT_RT=n (i.e. Alternative 3) will work.
> 
> But thinking again, since CONFIG_PREEMPT_RT=y uses special printk() approach where messages
> are printed from a dedicated kernel thread, do we need to call printk_deferred_enter() if
> CONFIG_PREEMPT_RT=y ? That is, isn't the fix as straightforward as below?

That patch will cause a splat with CONFIG_PROVE_RAW_LOCK_NESTING. That
is because seqlock_t::lock is acquired without disabling interrupts.
Additionally it is a bad example because the seqcount API is bypassed
due to printk's limitations, and the problems that are caused on
PREEMPT_RT are "ifdefed away". None of this is documented/explained.

Let me summarize your remaining problem:
- With (and only with) CONFIG_PROVE_LOCKING there can be a printk splat
  caused by a lock validation error noticed by lockdep during
  write_sequnlock_irqrestore().

- This can deadlock if there is printing output on a tty which is
  using the same console as printk while memory hotplug is active at the
  same time.
  That is because the tty layer acquires the same lock as printk's
  console during the tty layer's memory allocation.

Now:
- before this deadlocks (with CONFIG_PROVE_LOCKING) chances are high
  that a splat is seen first.

- printk is being reworked so that the output either happens from a
  dedicated thread or directly via a different console driver which is
  not using uart_port::lock, thus avoiding the deadlock.
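
(To make the above concrete, the lock/wait cycle the thread keeps coming
back to, condensed from the commit 1007843a9190 description and the
comments quoted earlier; call chains simplified:)

  /*
   * CPU A: __build_all_zonelists()
   *          write_seqlock(&zonelist_update_seq)        <- sequence becomes odd
   *          lockdep splat -> printk()                   <- wants the console's port->lock
   *
   * CPU B: tty_insert_flip_string_and_push_buffer()      <- holds port->lock
   *          kmalloc(GFP_ATOMIC | __GFP_NOWARN)
   *            zonelist_iter_begin() -> read_seqbegin()  <- spins while the sequence is odd
   *
   * A waits for B's port->lock, B waits for A to make the sequence even:
   * deadlock. printk_deferred_enter() on CPU A is what breaks this cycle.
   */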

Sebastian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-07-27 15:10           ` Sebastian Andrzej Siewior
@ 2023-07-29  5:31             ` Tetsuo Handa
  2023-07-29 11:05               ` Tetsuo Handa
  0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-07-29  5:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michal Hocko
  Cc: Petr Mladek, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>> Anyway, please do not do this change only because of printk().
>>> IMHO, the current ordering is more logical and the printk() problem
>>> should be solved another way.
>>
>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>> rejected.
> 
> My understanding is that this patch gets applied and your objection will
> be noted.

My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .

Maybe we can defer checking zonelist_update_seq till retry check like below,
for this is really an infrequent event.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..2f7b82af2590 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3642,22 +3642,27 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
  * retries the allocation if zonelist changes. Writer side is protected by the
  * embedded spin_lock.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		unsigned int seq2;
+
+		smp_rmb();
+		seq2 = data_race(READ_ONCE(zonelist_update_seq));
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
 
-	return seq;
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4151,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4177,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5141,12 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+	spin_lock_irqsave(&lock, flags);
+	data_race(zonelist_update_seq++);
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5183,9 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	smp_wmb();
+	data_race(zonelist_update_seq++);
+	spin_unlock_irqrestore(&lock, flags);
 }
 
 static noinline void __init



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-07-29  5:31             ` Tetsuo Handa
@ 2023-07-29 11:05               ` Tetsuo Handa
  2023-07-31 14:25                 ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-07-29 11:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michal Hocko
  Cc: Petr Mladek, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

On 2023/07/29 14:31, Tetsuo Handa wrote:
> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>> Anyway, please do not do this change only because of printk().
>>>> IMHO, the current ordering is more logical and the printk() problem
>>>> should be solved another way.
>>>
>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>> rejected.
>>
>> My understanding is that this patch gets applied and your objection will
>> be noted.
> 
> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> 
> Maybe we can defer checking zonelist_update_seq till retry check like below,
> for this is really an infrequent event.
> 

An updated version with comments added.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..92ecf5f2f76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
 
 /*
  * Zonelists may change due to hotplug during allocation. Detect when zonelists
- * have been rebuilt so allocation retries. Reader side does not lock and
- * retries the allocation if zonelist changes. Writer side is protected by the
- * embedded spin_lock.
+ * have been rebuilt so __GFP_DIRECT_RECLAIM allocation retries. Writer side
+ * makes this sequence odd before rebuilding zonelists and makes this sequence
+ * even after rebuilding zonelists. Since the writer side disables context switching
+ * when rebuilding zonelists, and !__GFP_DIRECT_RECLAIM allocation will not
+ * retry when zonelists changed, reader side does not need to hold a lock (but
+ * needs to use data_race() annotation), making locking dependency simpler.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		/* See comment above. */
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		/* See comment above. */
+		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
 
-	return seq;
+		/*
+		 * "seq != seq2" indicates that __build_all_zonelists() has
+		 * started or has finished rebuilding zonelists, hence retry.
+		 * "seq == seq2 && (seq2 & 1)" indicates that
+		 * __build_all_zonelists() is still rebuilding zonelists
+		 * with context switching disabled, hence retry.
+		 * "seq == seq2 && !(seq2 & 1)" indicates that
+		 * __build_all_zonelists() did not rebuild zonelists, hence
+		 * no retry.
+		 */
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
+
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4164,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4190,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_disable();
+#endif
+	/* Serialize increments of zonelist_update_seq. */
+	spin_lock_irqsave(&lock, flags);
+	zonelist_update_seq++;
+	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5201,13 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	/* Tell check_retry_zonelist() that we finished rebuilding zonelists. */
+	smp_wmb();
+	zonelist_update_seq++;
+	spin_unlock_irqrestore(&lock, flags);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_enable();
+#endif
 }
 
 static noinline void __init



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-07-29 11:05               ` Tetsuo Handa
@ 2023-07-31 14:25                 ` Michal Hocko
  2023-08-03 13:18                   ` Tetsuo Handa
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2023-07-31 14:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sebastian Andrzej Siewior, Petr Mladek, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> On 2023/07/29 14:31, Tetsuo Handa wrote:
> > On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>> Anyway, please do not do this change only because of printk().
> >>>> IMHO, the current ordering is more logical and the printk() problem
> >>>> should be solved another way.
> >>>
> >>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>> rejected.
> >>
> >> My understanding is that this patch gets applied and your objection will
> >> be noted.
> > 
> > My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> > allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> > https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> > https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> > 
> > Maybe we can defer checking zonelist_update_seq till retry check like below,
> > for this is really an infrequent event.
> > 
> 
> An updated version with comments added.

Seriously, don't you see how hairy all this is? And for what? Nitpicking
something that doesn't seem to be a real problem in the first place?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-07-31 14:25                 ` Michal Hocko
@ 2023-08-03 13:18                   ` Tetsuo Handa
  2023-08-03 14:49                     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-08-03 13:18 UTC (permalink / raw)
  To: Michal Hocko, Sebastian Andrzej Siewior
  Cc: Petr Mladek, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Andrew Morton, Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

On 2023/07/31 23:25, Michal Hocko wrote:
> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
>> On 2023/07/29 14:31, Tetsuo Handa wrote:
>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>>>> Anyway, please do not do this change only because of printk().
>>>>>> IMHO, the current ordering is more logical and the printk() problem
>>>>>> should be solved another way.
>>>>>
>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>>>> rejected.
>>>>
>>>> My understanding is that this patch gets applied and your objection will
>>>> be noted.
>>>
>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
>>>
>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
>>> for this is really an infrequent event.
>>>
>>
>> An updated version with comments added.
> 
> Seriously, don't you see how hairy all this is? And for what? Nitpicking
> something that doesn't seem to be a real problem in the first place?

Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?

My initial proposal was
"[PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations"
at https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
Compared to that version, this what-you-call-hairy version has the improvement that

-	return read_seqbegin(&zonelist_update_seq);
+	return data_race(READ_ONCE(zonelist_update_seq));

can eliminate

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

path. There is no need to wait for the rebuilding of zonelists to complete, for
a rebuild being in flight (indicated by zonelist_update_seq being odd)
does not mean that the allocation never succeeds. When the allocation does not fail,
this "while" loop is nothing but a waste of CPU time, and it is very likely
that no rebuild is in flight in the first place.
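
(The "while" loop quoted above is the seqcount read side; glossing over the
seqprop_*() wrappers, read_seqbegin() behaves roughly like this sketch:)

  static inline unsigned sketch_read_seqbegin(const seqcount_t *s)
  {
  	unsigned seq;

  	/* spin until no writer holds the count odd */
  	while ((seq = READ_ONCE(s->sequence)) & 1)
  		cpu_relax();

  	smp_rmb();	/* pairs with the writer side's barriers */
  	return seq;
  }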

We can make zonelist_iter_begin() (which is always called as long as
__alloc_pages_slowpath() is called) faster and simpler, which is an improvement
even without considering printk() and lockdep/KCSAN related problems.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-08-03 13:18                   ` Tetsuo Handa
@ 2023-08-03 14:49                     ` Michal Hocko
  2023-08-04 13:27                       ` Tetsuo Handa
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2023-08-03 14:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sebastian Andrzej Siewior, Petr Mladek, linux-mm, linux-kernel,
	Luis Claudio R. Goncalves, Andrew Morton, Boqun Feng,
	Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
> On 2023/07/31 23:25, Michal Hocko wrote:
> > On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> >> On 2023/07/29 14:31, Tetsuo Handa wrote:
> >>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>>>> Anyway, please do not do this change only because of printk().
> >>>>>> IMHO, the current ordering is more logical and the printk() problem
> >>>>>> should be solved another way.
> >>>>>
> >>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>>>> rejected.
> >>>>
> >>>> My understanding is that this patch gets applied and your objection will
> >>>> be noted.
> >>>
> >>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> >>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> >>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> >>>
> >>> Maybe we can defer checking zonelist_update_seq till retry check like below,
> >>> for this is really an infrequent event.
> >>>
> >>
> >> An updated version with comments added.
> > 
> > Seriously, don't you see how hairy all this is? And for what? Nitpicking
> > something that doesn't seem to be a real problem in the first place?
> 
> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?

I do not think we have concluded that we want to support GFP_LOCKLESS.
This might be trivially straightforward now but it imposes some constraints
on future maintainability. So far we haven't heard about many use cases
where this would be needed, and a single one is not sufficient IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-08-03 14:49                     ` Michal Hocko
@ 2023-08-04 13:27                       ` Tetsuo Handa
  2023-08-07  8:20                         ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Tetsuo Handa @ 2023-08-04 13:27 UTC (permalink / raw)
  To: Michal Hocko, Sebastian Andrzej Siewior, Andrew Morton
  Cc: Petr Mladek, linux-mm, linux-kernel, Luis Claudio R. Goncalves,
	Boqun Feng, Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon

On 2023/08/03 23:49, Michal Hocko wrote:
> On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
>> On 2023/07/31 23:25, Michal Hocko wrote:
>>> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
>>>> On 2023/07/29 14:31, Tetsuo Handa wrote:
>>>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>>>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>>>>>> Anyway, please do not do this change only because of printk().
>>>>>>>> IMHO, the current ordering is more logical and the printk() problem
>>>>>>>> should be solved another way.
>>>>>>>
>>>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>>>>>> rejected.
>>>>>>
>>>>>> My understanding is that this patch gets applied and your objection will
>>>>>> be noted.
>>>>>
>>>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>>>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
>>>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
>>>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
>>>>>
>>>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
>>>>> for this is really an infrequent event.
>>>>>
>>>>
>>>> An updated version with comments added.
>>>
>>> Seriously, don't you see how hairy all this is? And for what? Nitpicking
>>> something that doesn't seem to be a real problem in the first place?
>>
>> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?
> 
> I do not think we have concluded that we want to support GFP_LOCKLESS.
> > This might be trivially straightforward now but it imposes some constraints
> > on future maintainability. So far we haven't heard about many use cases
> > where this would be needed, and a single one is not sufficient IMHO.

When you introduced the word GFP_LOCKLESS in the link above, I was wondering about the meaning
of the "LESS" part. Since we know that it is difficult to achieve "hold 0 locks during memory
allocation", "hold least locks during memory allocation" is the best we can do. Therefore,
GFP_LOCKLESS is as misleading a name as GFP_ATOMIC. GFP_LOCK_LEAST or GFP_LEAST_LOCKS would
represent the real behavior better.

Like I said

  I consider that memory allocations which do not do direct reclaim should be geared
  towards less locking dependency.

in the thread above, I still believe that this what-you-call-hairy version (which
matches the "hold least locks during memory allocation" direction) is better than
"[PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save()."
(which does not match "hold least locks during memory allocation"). My version not
only avoids the possibility of deadlock, but also makes zonelist_iter_begin() faster and
simpler.

Not holding the zonelist_update_seq lock is trivially doable compared to removing
__GFP_KSWAPD_RECLAIM from GFP_ATOMIC. Please give me feedback about which line of my
proposal is technically unsafe, instead of discarding my proposal with negative words
like "hairy".


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
  2023-08-04 13:27                       ` Tetsuo Handa
@ 2023-08-07  8:20                         ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2023-08-07  8:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Sebastian Andrzej Siewior, Andrew Morton, Petr Mladek, linux-mm,
	linux-kernel, Luis Claudio R. Goncalves, Boqun Feng, Ingo Molnar,
	John Ogness, Mel Gorman, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon

On Fri 04-08-23 22:27:22, Tetsuo Handa wrote:
> On 2023/08/03 23:49, Michal Hocko wrote:
> > On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
> >> On 2023/07/31 23:25, Michal Hocko wrote:
> >>> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> >>>> On 2023/07/29 14:31, Tetsuo Handa wrote:
> >>>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >>>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>>>>>> Anyway, please do not do this change only because of printk().
> >>>>>>>> IMHO, the current ordering is more logical and the printk() problem
> >>>>>>>> should be solved another way.
> >>>>>>>
> >>>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>>>>>> rejected.
> >>>>>>
> >>>>>> My understanding is that this patch gets applied and your objection will
> >>>>>> be noted.
> >>>>>
> >>>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >>>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> >>>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> >>>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> >>>>>
> >>>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
> >>>>> for this is really an infrequent event.
> >>>>>
> >>>>
> >>>> An updated version with comments added.
> >>>
> >>> Seriously, don't you see how hairy all this is? And for what? Nitpicking
> >>> something that doesn't seem to be a real problem in the first place?
> >>
> >> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?
> > 
> > I do not think we have concluded that we want to support GFP_LOCKLESS.
> > This might be trivially straightforward now but it imposes some constraints
> > on future maintainability. So far we haven't heard about many use cases
> > where this would be needed, and a single one is not sufficient IMHO.
> 
> When you introduced the word GFP_LOCKLESS in the link above, I was wondering about the meaning
> of the "LESS" part. Since we know that it is difficult to achieve "hold 0 locks during memory
> allocation", "hold least locks during memory allocation" is the best we can do. Therefore,
> GFP_LOCKLESS is as misleading a name as GFP_ATOMIC. GFP_LOCK_LEAST or GFP_LEAST_LOCKS would
> represent the real behavior better.

I am not sure I understand what "least locks" actually means. I guess what
you wanted to say is that there are no locks or other synchronization
means with external visibility/dependencies used. In other words, a mode
which can be called from any locking context. That would certainly be
possible, and whether any internal locks are used or not is an
implementation detail as long as the no-external-visibility/dependencies
rule is held. I do not really want to start the naming discussion as it
is not really clear we want/need to provide such a strong guarantee.
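
(For illustration only, and assuming today's flag names: such a request
would roughly be an allocation that sets neither reclaim bit, e.g.:)

  /* neither __GFP_DIRECT_RECLAIM nor __GFP_KSWAPD_RECLAIM is set: no direct
   * reclaim and no kswapd wakeup; in this view, whatever zone/pcp locks the
   * allocator takes internally would count as an implementation detail */
  struct page *page = alloc_pages(__GFP_HIGH | __GFP_NOWARN, 0);
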
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-08-07  8:20 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-23 17:12 [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Sebastian Andrzej Siewior
2023-06-23 17:12 ` [PATCH v2 1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested() Sebastian Andrzej Siewior
2023-06-24  6:54   ` Tetsuo Handa
2023-06-26  8:12     ` Sebastian Andrzej Siewior
2023-06-26  9:25       ` Tetsuo Handa
2023-06-26 10:48         ` Peter Zijlstra
2023-06-26 11:26           ` Tetsuo Handa
2023-06-26 11:35             ` Michal Hocko
2023-06-26 12:27               ` Tetsuo Handa
2023-06-26 13:16                 ` Michal Hocko
2023-06-26 12:46               ` Sebastian Andrzej Siewior
2023-06-26 13:13           ` Sebastian Andrzej Siewior
2023-06-26 14:44       ` Petr Mladek
2023-06-28 12:14         ` Tetsuo Handa
2023-07-27 15:10           ` Sebastian Andrzej Siewior
2023-07-29  5:31             ` Tetsuo Handa
2023-07-29 11:05               ` Tetsuo Handa
2023-07-31 14:25                 ` Michal Hocko
2023-08-03 13:18                   ` Tetsuo Handa
2023-08-03 14:49                     ` Michal Hocko
2023-08-04 13:27                       ` Tetsuo Handa
2023-08-07  8:20                         ` Michal Hocko
2023-06-26 12:56   ` Mel Gorman
2023-06-23 17:12 ` [PATCH v2 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save() Sebastian Andrzej Siewior
2023-06-23 18:17   ` Michal Hocko
2023-06-23 20:15     ` [PATCH v3 " Sebastian Andrzej Siewior
2023-06-26  7:56       ` David Hildenbrand
2023-06-26 13:14       ` Mel Gorman
2023-06-28 13:56       ` Michal Hocko
2023-06-25  2:27 ` [PATCH v2 0/2] seqlock,mm: lockdep annotation + write_seqlock_irqsave() Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).