linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/1] try to fix tick_sched timer stuck issue
@ 2020-01-10  8:25 Shile Zhang
  2020-01-10  8:25 ` [PATCH 1/1] mm: fix tick_sched timer blocked by pgdat_resize_lock Shile Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-01-10  8:25 UTC (permalink / raw)
  To: Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel, Shile Zhang

Hi Andrew and Pavel,

I found the 'tick_sched timer stuck' issue when enabled deferred pages
initialize feature on my 2c320g VM.
The dmesg log shown that deferred 81,699,533 pages (about 310GB) only
with 1ms!

    [    0.340130] node 0 initialised, 81699533 pages in 1ms

Obviously that is wrong time, and the timestamp in dmesg log. I checked
the sysytemd-analyze, also is wrong time:

    Startup finished in 837ms (kernel) + 1.026s (initrd) + 1.542s (userspace) = 3.407s

In fact, to initialize 320GB memory needs about 2+s on my VM.

I guess it possible caused by the timer is blocked during memory
initialising, so I added debug log based on my roughly anaylsis,
inside 'pgdat_resize_{lock,unlock}', as following:
---8<---
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 92b1047..7c00c56 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -285,13 +285,13 @@ static inline bool movable_node_is_enabled(void)
 void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
 {
        spin_lock_irqsave(&pgdat->node_size_lock, *flags);
-       trace_printk("DBG: pgdat_resize_lock: jiffies=%lu\n", jiffies);
+       trace_printk("  DBG: jiffies=%lu after pgdat_resize_lock\n", jiffies);
 }
 static inline
 void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
 {
        mdelay(100);
-       trace_printk("DBG: pgdat_resize_unlock: jiffies=%lu\n", jiffies);
+       trace_printk("DBG: jiffies=%lu before pgdat_resize_unlock\n", jiffies);
        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
 }
 static inline
--->8---

Note, I add 'mdelay(100)' to check if the jiffies is stuck to update.
The trace shown that the jiffies was stuck inside pgdat_resize_{lock,unlock}:

    pgdatinit0-19    [000] d...     0.339850: pgdat_resize_lock:   DBG: jiffies=4294667301 after pgdat_resize_lock
    pgdatinit0-19    [000] d...     2.929611: pgdat_resize_unlock: DBG: jiffies=4294667301 before pgdat_resize_unlock

I think the root cause is clear now.

I'm not clear about the original 'window issue' mentioned by Pavel,
in commit:
https://lore.kernel.org/patchwork/patch/933504/

I just try to fix this timer issue, please help to review if it is OK to
fix it, or give some advise to fix this issue gracefully, thanks!

One more question is, I found there also other spin_lock_irqsave be used in
the kernel boot path on boot CPU, but I cannot search any issue reported
about if interrupts can be disabled on boot CPU on boot path. How we ensure
the tick_sched timer be fired in time? :r the accuracy of system wall clock?

Thanks!

Shile Zhang (1):
  mm: fix tick_sched timer blocked by pgdat_resize_lock

 include/linux/memory_hotplug.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

-- 
2.24.0.rc2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 1/1] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-10  8:25 [PATCH 0/1] try to fix tick_sched timer stuck issue Shile Zhang
@ 2020-01-10  8:25 ` Shile Zhang
  2020-01-10  9:30   ` [PATCH RESEND] " Shile Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-01-10  8:25 UTC (permalink / raw)
  To: Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel, Shile Zhang

When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
will be called inside 'pgdatinit' kthread to initialise the deferred
pages with local interrupts disabled. Which is introduced by
commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
pages").

But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
default), especially in small system with NRCPUS <= 2. In this case, the
interrupts are disabled on boot CPU during memory initialising, which
caused the tick_sched timer be blocked, leading to wall clock stuck.

Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
deferred pages")

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
---
 include/linux/memory_hotplug.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ba0dca6aac6e..6ffcb5b704bc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -282,12 +282,22 @@ static inline bool movable_node_is_enabled(void)
 static inline
 void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
 {
-	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+	/*
+	 * Disable local interrupts on boot CPU will stop the tick_sched
+	 * timer, which will block jiffies(wall clock) update.
+	 */
+	if (current->cpu != get_boot_cpu_id())
+		spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+	else
+		spin_lock(&pgdat->node_size_lock);
 }
 static inline
 void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
 {
-	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+	if (current->cpu != get_boot_cpu_id())
+		spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+	else
+		spin_unlock(&pgdat->node_size_lock);
 }
 static inline
 void pgdat_resize_init(struct pglist_data *pgdat)
-- 
2.24.0.rc2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-10  8:25 ` [PATCH 1/1] mm: fix tick_sched timer blocked by pgdat_resize_lock Shile Zhang
@ 2020-01-10  9:30   ` Shile Zhang
  2020-01-10 11:42     ` Kirill Tkhai
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-01-10  9:30 UTC (permalink / raw)
  To: Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel, Shile Zhang

When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
will be called inside 'pgdatinit' kthread to initialise the deferred
pages with local interrupts disabled. Which is introduced by
commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
pages").

But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
default), especially in small system with NRCPUS <= 2. In this case, the
interrupts are disabled on boot CPU during memory initialising, which
caused the tick_sched timer be blocked, leading to wall clock stuck.

Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
deferred pages")

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
---
 include/linux/memory_hotplug.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ba0dca6aac6e..be69a6dc4fee 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -6,6 +6,8 @@
 #include <linux/spinlock.h>
 #include <linux/notifier.h>
 #include <linux/bug.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
 
 struct page;
 struct zone;
@@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
 static inline
 void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
 {
-	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+	/*
+	 * Disable local interrupts on boot CPU will stop the tick_sched
+	 * timer, which will block jiffies(wall clock) update.
+	 */
+	if (current->cpu != get_boot_cpu_id())
+		spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+	else
+		spin_lock(&pgdat->node_size_lock);
 }
 static inline
 void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
 {
-	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+	if (current->cpu != get_boot_cpu_id())
+		spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+	else
+		spin_unlock(&pgdat->node_size_lock);
 }
 static inline
 void pgdat_resize_init(struct pglist_data *pgdat)
-- 
2.24.0.rc2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-10  9:30   ` [PATCH RESEND] " Shile Zhang
@ 2020-01-10 11:42     ` Kirill Tkhai
  2020-01-13  0:54       ` Shile Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Kirill Tkhai @ 2020-01-10 11:42 UTC (permalink / raw)
  To: Shile Zhang, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel

On 10.01.2020 12:30, Shile Zhang wrote:
> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
> will be called inside 'pgdatinit' kthread to initialise the deferred
> pages with local interrupts disabled. Which is introduced by
> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
> pages").
> 
> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
> default), especially in small system with NRCPUS <= 2. In this case, the
> interrupts are disabled on boot CPU during memory initialising, which
> caused the tick_sched timer be blocked, leading to wall clock stuck.
> 
> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
> deferred pages")
> 
> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> ---
>  include/linux/memory_hotplug.h | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index ba0dca6aac6e..be69a6dc4fee 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -6,6 +6,8 @@
>  #include <linux/spinlock.h>
>  #include <linux/notifier.h>
>  #include <linux/bug.h>
> +#include <linux/sched.h>
> +#include <linux/smp.h>
>  
>  struct page;
>  struct zone;
> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>  static inline
>  void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>  {
> -	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
> +	/*
> +	 * Disable local interrupts on boot CPU will stop the tick_sched
> +	 * timer, which will block jiffies(wall clock) update.
> +	 */
> +	if (current->cpu != get_boot_cpu_id())
> +		spin_lock_irqsave(&pgdat->node_size_lock, *flags);
> +	else
> +		spin_lock(&pgdat->node_size_lock);
>  }
>  static inline
>  void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>  {
> -	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
> +	if (current->cpu != get_boot_cpu_id())
> +		spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
> +	else
> +		spin_unlock(&pgdat->node_size_lock);
>  }
>  static inline
>  void pgdat_resize_init(struct pglist_data *pgdat)

1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
*kernel* code in the middle of function. When you are executing a code containing
pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
between them.

bool cpu               another cpu
----------------------------------
pgdat_resize_lock()
  spin_lock()
  --> migrate to another cpu
                      pgdat_resize_unlock()
                      spin_unlock_irqrestore(<uninitialized flags>)

(Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).

This looks like a bad helpers, and we should not introduce such the design.

2)I think there is no the problem this patch solves. Do we really this statistics?
Can't we simple remove print message from deferred_init_memmap() and solve this?

Also, you may try to check that sched_clock() gives better results with interrupts
disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
jiffies-based clock in some hardware cases, and they also won't go with interrupts
disabled).

Kirill

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-10 11:42     ` Kirill Tkhai
@ 2020-01-13  0:54       ` Shile Zhang
  2020-01-13  8:11         ` Kirill Tkhai
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-01-13  0:54 UTC (permalink / raw)
  To: Kirill Tkhai, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel



On 2020/1/10 19:42, Kirill Tkhai wrote:
> On 10.01.2020 12:30, Shile Zhang wrote:
>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
>> will be called inside 'pgdatinit' kthread to initialise the deferred
>> pages with local interrupts disabled. Which is introduced by
>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
>> pages").
>>
>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
>> default), especially in small system with NRCPUS <= 2. In this case, the
>> interrupts are disabled on boot CPU during memory initialising, which
>> caused the tick_sched timer be blocked, leading to wall clock stuck.
>>
>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
>> deferred pages")
>>
>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
>> ---
>>   include/linux/memory_hotplug.h | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index ba0dca6aac6e..be69a6dc4fee 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -6,6 +6,8 @@
>>   #include <linux/spinlock.h>
>>   #include <linux/notifier.h>
>>   #include <linux/bug.h>
>> +#include <linux/sched.h>
>> +#include <linux/smp.h>
>>   
>>   struct page;
>>   struct zone;
>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>>   static inline
>>   void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>>   {
>> -	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>> +	/*
>> +	 * Disable local interrupts on boot CPU will stop the tick_sched
>> +	 * timer, which will block jiffies(wall clock) update.
>> +	 */
>> +	if (current->cpu != get_boot_cpu_id())
>> +		spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>> +	else
>> +		spin_lock(&pgdat->node_size_lock);
>>   }
>>   static inline
>>   void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>>   {
>> -	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>> +	if (current->cpu != get_boot_cpu_id())
>> +		spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>> +	else
>> +		spin_unlock(&pgdat->node_size_lock);
>>   }
>>   static inline
>>   void pgdat_resize_init(struct pglist_data *pgdat)
> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
> *kernel* code in the middle of function. When you are executing a code containing
> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
> between them.
>
> bool cpu               another cpu
> ----------------------------------
> pgdat_resize_lock()
>    spin_lock()
>    --> migrate to another cpu
>                        pgdat_resize_unlock()
>                        spin_unlock_irqrestore(<uninitialized flags>)
>
> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
>
> This looks like a bad helpers, and we should not introduce such the design.

Hi Kirill,

Thanks for your comments!
Sorry for I'm not very clear about this lock/unlock, but I encountered 
this issue
with "CONFIG_PREEMPT is not set".
> 2)I think there is no the problem this patch solves. Do we really this statistics?
> Can't we simple remove print message from deferred_init_memmap() and solve this?

Sorry for I've not put this issue very clearly. It's *not* just one 
statistics log
with wrong time calculate, but the wall clock is stuck.
So the 'systemd-analyze' command also give a wrong time as I mentioned 
in the cover
letter. I don't think is OK just remove the log, it cannot solve the 
wall clock latency.
> Also, you may try to check that sched_clock() gives better results with interrupts
> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
> jiffies-based clock in some hardware cases, and they also won't go with interrupts
> disabled).
>
> Kirill


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-13  0:54       ` Shile Zhang
@ 2020-01-13  8:11         ` Kirill Tkhai
  2020-01-14  8:54           ` Shile Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Kirill Tkhai @ 2020-01-13  8:11 UTC (permalink / raw)
  To: Shile Zhang, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel

On 13.01.2020 03:54, Shile Zhang wrote:
> 
> 
> On 2020/1/10 19:42, Kirill Tkhai wrote:
>> On 10.01.2020 12:30, Shile Zhang wrote:
>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
>>> will be called inside 'pgdatinit' kthread to initialise the deferred
>>> pages with local interrupts disabled. Which is introduced by
>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
>>> pages").
>>>
>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
>>> default), especially in small system with NRCPUS <= 2. In this case, the
>>> interrupts are disabled on boot CPU during memory initialising, which
>>> caused the tick_sched timer be blocked, leading to wall clock stuck.
>>>
>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
>>> deferred pages")
>>>
>>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
>>> ---
>>>   include/linux/memory_hotplug.h | 16 ++++++++++++++--
>>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index ba0dca6aac6e..be69a6dc4fee 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -6,6 +6,8 @@
>>>   #include <linux/spinlock.h>
>>>   #include <linux/notifier.h>
>>>   #include <linux/bug.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/smp.h>
>>>     struct page;
>>>   struct zone;
>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>>>   static inline
>>>   void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>>>   {
>>> -    spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>> +    /*
>>> +     * Disable local interrupts on boot CPU will stop the tick_sched
>>> +     * timer, which will block jiffies(wall clock) update.
>>> +     */
>>> +    if (current->cpu != get_boot_cpu_id())
>>> +        spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>> +    else
>>> +        spin_lock(&pgdat->node_size_lock);
>>>   }
>>>   static inline
>>>   void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>>>   {
>>> -    spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>> +    if (current->cpu != get_boot_cpu_id())
>>> +        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>> +    else
>>> +        spin_unlock(&pgdat->node_size_lock);
>>>   }
>>>   static inline
>>>   void pgdat_resize_init(struct pglist_data *pgdat)
>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
>> *kernel* code in the middle of function. When you are executing a code containing
>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
>> between them.
>>
>> bool cpu               another cpu
>> ----------------------------------
>> pgdat_resize_lock()
>>    spin_lock()
>>    --> migrate to another cpu
>>                        pgdat_resize_unlock()
>>                        spin_unlock_irqrestore(<uninitialized flags>)
>>
>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
>>
>> This looks like a bad helpers, and we should not introduce such the design.
> 
> Hi Kirill,
> 
> Thanks for your comments!
> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue
> with "CONFIG_PREEMPT is not set".

The thing is we simply shouldn't introduce such the primitives since the thread
may migrate to another cpu, while you own the lock. This looks like a buggy design.

>> 2)I think there is no the problem this patch solves. Do we really this statistics?
>> Can't we simple remove print message from deferred_init_memmap() and solve this?
> 
> Sorry for I've not put this issue very clearly. It's *not* just one statistics log
> with wrong time calculate, but the wall clock is stuck.
> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover
> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency.

Have you tried temporary enabling interrupts in the middle of cycle after a huge enough
memory block is initialized? Something like:

deferred_init_memmap()
{
	while (spfn < epfn) {
		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
		local_irq_enable();
		local_irq_disable();
	}
}

Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder().

>> Also, you may try to check that sched_clock() gives better results with interrupts
>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
>> jiffies-based clock in some hardware cases, and they also won't go with interrupts
>> disabled).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-13  8:11         ` Kirill Tkhai
@ 2020-01-14  8:54           ` Shile Zhang
  2020-01-15  9:45             ` Kirill Tkhai
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-01-14  8:54 UTC (permalink / raw)
  To: Kirill Tkhai, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel



On 2020/1/13 16:11, Kirill Tkhai wrote:
> On 13.01.2020 03:54, Shile Zhang wrote:
>>
>> On 2020/1/10 19:42, Kirill Tkhai wrote:
>>> On 10.01.2020 12:30, Shile Zhang wrote:
>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
>>>> will be called inside 'pgdatinit' kthread to initialise the deferred
>>>> pages with local interrupts disabled. Which is introduced by
>>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
>>>> pages").
>>>>
>>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
>>>> default), especially in small system with NRCPUS <= 2. In this case, the
>>>> interrupts are disabled on boot CPU during memory initialising, which
>>>> caused the tick_sched timer be blocked, leading to wall clock stuck.
>>>>
>>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
>>>> deferred pages")
>>>>
>>>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
>>>> ---
>>>>    include/linux/memory_hotplug.h | 16 ++++++++++++++--
>>>>    1 file changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>> index ba0dca6aac6e..be69a6dc4fee 100644
>>>> --- a/include/linux/memory_hotplug.h
>>>> +++ b/include/linux/memory_hotplug.h
>>>> @@ -6,6 +6,8 @@
>>>>    #include <linux/spinlock.h>
>>>>    #include <linux/notifier.h>
>>>>    #include <linux/bug.h>
>>>> +#include <linux/sched.h>
>>>> +#include <linux/smp.h>
>>>>      struct page;
>>>>    struct zone;
>>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>>>>    static inline
>>>>    void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>>>>    {
>>>> -    spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>> +    /*
>>>> +     * Disable local interrupts on boot CPU will stop the tick_sched
>>>> +     * timer, which will block jiffies(wall clock) update.
>>>> +     */
>>>> +    if (current->cpu != get_boot_cpu_id())
>>>> +        spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>> +    else
>>>> +        spin_lock(&pgdat->node_size_lock);
>>>>    }
>>>>    static inline
>>>>    void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>>>>    {
>>>> -    spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>> +    if (current->cpu != get_boot_cpu_id())
>>>> +        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>> +    else
>>>> +        spin_unlock(&pgdat->node_size_lock);
>>>>    }
>>>>    static inline
>>>>    void pgdat_resize_init(struct pglist_data *pgdat)
>>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
>>> *kernel* code in the middle of function. When you are executing a code containing
>>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
>>> between them.
>>>
>>> bool cpu               another cpu
>>> ----------------------------------
>>> pgdat_resize_lock()
>>>     spin_lock()
>>>     --> migrate to another cpu
>>>                         pgdat_resize_unlock()
>>>                         spin_unlock_irqrestore(<uninitialized flags>)
>>>
>>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
>>>
>>> This looks like a bad helpers, and we should not introduce such the design.
>> Hi Kirill,
>>
>> Thanks for your comments!
>> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue
>> with "CONFIG_PREEMPT is not set".
> The thing is we simply shouldn't introduce such the primitives since the thread
> may migrate to another cpu, while you own the lock. This looks like a buggy design.
>
>>> 2)I think there is no the problem this patch solves. Do we really this statistics?
>>> Can't we simple remove print message from deferred_init_memmap() and solve this?
>> Sorry for I've not put this issue very clearly. It's *not* just one statistics log
>> with wrong time calculate, but the wall clock is stuck.
>> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover
>> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency.
> Have you tried temporary enabling interrupts in the middle of cycle after a huge enough
> memory block is initialized? Something like:
>
> deferred_init_memmap()
> {
> 	while (spfn < epfn) {
> 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> 		local_irq_enable();
> 		local_irq_disable();
> 	}
> }

Yes, I'd tried this for issue confirm, before I sent this patch. Likes I 
mentioned the debug log in cover letter, I also add mdelay between 
local_irq_enable/disable, this system jiffies is stuck without update.
So I think there must be problem to use spin_lock_irqsave in early boot 
path on boot CPU.
> Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder().
>
>>> Also, you may try to check that sched_clock() gives better results with interrupts
>>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
>>> jiffies-based clock in some hardware cases, and they also won't go with interrupts
>>> disabled).


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-14  8:54           ` Shile Zhang
@ 2020-01-15  9:45             ` Kirill Tkhai
  2020-02-04  7:24               ` Shile Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Kirill Tkhai @ 2020-01-15  9:45 UTC (permalink / raw)
  To: Shile Zhang, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel

On 14.01.2020 11:54, Shile Zhang wrote:
> 
> 
> On 2020/1/13 16:11, Kirill Tkhai wrote:
>> On 13.01.2020 03:54, Shile Zhang wrote:
>>>
>>> On 2020/1/10 19:42, Kirill Tkhai wrote:
>>>> On 10.01.2020 12:30, Shile Zhang wrote:
>>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
>>>>> will be called inside 'pgdatinit' kthread to initialise the deferred
>>>>> pages with local interrupts disabled. Which is introduced by
>>>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
>>>>> pages").
>>>>>
>>>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
>>>>> default), especially in small system with NRCPUS <= 2. In this case, the
>>>>> interrupts are disabled on boot CPU during memory initialising, which
>>>>> caused the tick_sched timer be blocked, leading to wall clock stuck.
>>>>>
>>>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
>>>>> deferred pages")
>>>>>
>>>>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
>>>>> ---
>>>>>    include/linux/memory_hotplug.h | 16 ++++++++++++++--
>>>>>    1 file changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>> index ba0dca6aac6e..be69a6dc4fee 100644
>>>>> --- a/include/linux/memory_hotplug.h
>>>>> +++ b/include/linux/memory_hotplug.h
>>>>> @@ -6,6 +6,8 @@
>>>>>    #include <linux/spinlock.h>
>>>>>    #include <linux/notifier.h>
>>>>>    #include <linux/bug.h>
>>>>> +#include <linux/sched.h>
>>>>> +#include <linux/smp.h>
>>>>>      struct page;
>>>>>    struct zone;
>>>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>>>>>    static inline
>>>>>    void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>>>>>    {
>>>>> -    spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>>> +    /*
>>>>> +     * Disable local interrupts on boot CPU will stop the tick_sched
>>>>> +     * timer, which will block jiffies(wall clock) update.
>>>>> +     */
>>>>> +    if (current->cpu != get_boot_cpu_id())
>>>>> +        spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>>> +    else
>>>>> +        spin_lock(&pgdat->node_size_lock);
>>>>>    }
>>>>>    static inline
>>>>>    void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>>>>>    {
>>>>> -    spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>>> +    if (current->cpu != get_boot_cpu_id())
>>>>> +        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>>> +    else
>>>>> +        spin_unlock(&pgdat->node_size_lock);
>>>>>    }
>>>>>    static inline
>>>>>    void pgdat_resize_init(struct pglist_data *pgdat)
>>>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
>>>> *kernel* code in the middle of function. When you are executing a code containing
>>>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
>>>> between them.
>>>>
>>>> bool cpu               another cpu
>>>> ----------------------------------
>>>> pgdat_resize_lock()
>>>>     spin_lock()
>>>>     --> migrate to another cpu
>>>>                         pgdat_resize_unlock()
>>>>                         spin_unlock_irqrestore(<uninitialized flags>)
>>>>
>>>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
>>>>
>>>> This looks like a bad helpers, and we should not introduce such the design.
>>> Hi Kirill,
>>>
>>> Thanks for your comments!
>>> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue
>>> with "CONFIG_PREEMPT is not set".
>> The thing is we simply shouldn't introduce such the primitives since the thread
>> may migrate to another cpu, while you own the lock. This looks like a buggy design.
>>
>>>> 2)I think there is no the problem this patch solves. Do we really this statistics?
>>>> Can't we simple remove print message from deferred_init_memmap() and solve this?
>>> Sorry for I've not put this issue very clearly. It's *not* just one statistics log
>>> with wrong time calculate, but the wall clock is stuck.
>>> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover
>>> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency.
>> Have you tried temporary enabling interrupts in the middle of cycle after a huge enough
>> memory block is initialized? Something like:
>>
>> deferred_init_memmap()
>> {
>>     while (spfn < epfn) {
>>         nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>         local_irq_enable();
>>         local_irq_disable();
>>     }
>> }
> 
> Yes, I'd tried this for issue confirm, before I sent this patch. Likes I mentioned the debug log in cover letter, I also add mdelay between local_irq_enable/disable, this system jiffies is stuck without update.
> So I think there must be problem to use spin_lock_irqsave in early boot path on boot CPU.

This time SMP is enabled. You have many threads are running. Interrupts are enabled
and they occur. So, it's OK to disable interrupts for some time.

My opinion is we should try to enable interrupts in the cycle after some fixed
amount of memory is initialized. Say, every 1GB. This should resolve two problems:
handling timer interrupt with update jiffies at time, and keeping the fix for the issue,
that Pavel fixed in 3a2d7fa8a3d5.

>> Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder().
>>
>>>> Also, you may try to check that sched_clock() gives better results with interrupts
>>>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
>>>> jiffies-based clock in some hardware cases, and they also won't go with interrupts
>>>> disabled).
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-01-15  9:45             ` Kirill Tkhai
@ 2020-02-04  7:24               ` Shile Zhang
  2020-02-10  5:45                 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Shile Zhang @ 2020-02-04  7:24 UTC (permalink / raw)
  To: Kirill Tkhai, Andrew Morton, Pavel Tatashin; +Cc: linux-mm, linux-kernel

Hi, Andrew,

Sorry for ping, could you please also help to check this issue?
Any comments from you is welcome!

Thanks!


On 2020/1/15 17:45, Kirill Tkhai wrote:
> On 14.01.2020 11:54, Shile Zhang wrote:
>>
>> On 2020/1/13 16:11, Kirill Tkhai wrote:
>>> On 13.01.2020 03:54, Shile Zhang wrote:
>>>> On 2020/1/10 19:42, Kirill Tkhai wrote:
>>>>> On 10.01.2020 12:30, Shile Zhang wrote:
>>>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock'
>>>>>> will be called inside 'pgdatinit' kthread to initialise the deferred
>>>>>> pages with local interrupts disabled. Which is introduced by
>>>>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred
>>>>>> pages").
>>>>>>
>>>>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by
>>>>>> default), especially in small system with NRCPUS <= 2. In this case, the
>>>>>> interrupts are disabled on boot CPU during memory initialising, which
>>>>>> caused the tick_sched timer be blocked, leading to wall clock stuck.
>>>>>>
>>>>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing
>>>>>> deferred pages")
>>>>>>
>>>>>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
>>>>>> ---
>>>>>>     include/linux/memory_hotplug.h | 16 ++++++++++++++--
>>>>>>     1 file changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>>>>> index ba0dca6aac6e..be69a6dc4fee 100644
>>>>>> --- a/include/linux/memory_hotplug.h
>>>>>> +++ b/include/linux/memory_hotplug.h
>>>>>> @@ -6,6 +6,8 @@
>>>>>>     #include <linux/spinlock.h>
>>>>>>     #include <linux/notifier.h>
>>>>>>     #include <linux/bug.h>
>>>>>> +#include <linux/sched.h>
>>>>>> +#include <linux/smp.h>
>>>>>>       struct page;
>>>>>>     struct zone;
>>>>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void)
>>>>>>     static inline
>>>>>>     void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
>>>>>>     {
>>>>>> -    spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>>>> +    /*
>>>>>> +     * Disable local interrupts on boot CPU will stop the tick_sched
>>>>>> +     * timer, which will block jiffies(wall clock) update.
>>>>>> +     */
>>>>>> +    if (current->cpu != get_boot_cpu_id())
>>>>>> +        spin_lock_irqsave(&pgdat->node_size_lock, *flags);
>>>>>> +    else
>>>>>> +        spin_lock(&pgdat->node_size_lock);
>>>>>>     }
>>>>>>     static inline
>>>>>>     void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
>>>>>>     {
>>>>>> -    spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>>>> +    if (current->cpu != get_boot_cpu_id())
>>>>>> +        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
>>>>>> +    else
>>>>>> +        spin_unlock(&pgdat->node_size_lock);
>>>>>>     }
>>>>>>     static inline
>>>>>>     void pgdat_resize_init(struct pglist_data *pgdat)
>>>>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt
>>>>> *kernel* code in the middle of function. When you are executing a code containing
>>>>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu
>>>>> between them.
>>>>>
>>>>> bool cpu               another cpu
>>>>> ----------------------------------
>>>>> pgdat_resize_lock()
>>>>>      spin_lock()
>>>>>      --> migrate to another cpu
>>>>>                          pgdat_resize_unlock()
>>>>>                          spin_unlock_irqrestore(<uninitialized flags>)
>>>>>
>>>>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call).
>>>>>
>>>>> This looks like a bad helpers, and we should not introduce such the design.
>>>> Hi Kirill,
>>>>
>>>> Thanks for your comments!
>>>> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue
>>>> with "CONFIG_PREEMPT is not set".
>>> The thing is we simply shouldn't introduce such the primitives since the thread
>>> may migrate to another cpu, while you own the lock. This looks like a buggy design.
>>>
>>>>> 2)I think there is no the problem this patch solves. Do we really this statistics?
>>>>> Can't we simple remove print message from deferred_init_memmap() and solve this?
>>>> Sorry for I've not put this issue very clearly. It's *not* just one statistics log
>>>> with wrong time calculate, but the wall clock is stuck.
>>>> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover
>>>> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency.
>>> Have you tried temporary enabling interrupts in the middle of cycle after a huge enough
>>> memory block is initialized? Something like:
>>>
>>> deferred_init_memmap()
>>> {
>>>      while (spfn < epfn) {
>>>          nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>>>          local_irq_enable();
>>>          local_irq_disable();
>>>      }
>>> }
>> Yes, I'd tried this for issue confirm, before I sent this patch. Likes I mentioned the debug log in cover letter, I also add mdelay between local_irq_enable/disable, this system jiffies is stuck without update.
>> So I think there must be problem to use spin_lock_irqsave in early boot path on boot CPU.
> This time SMP is enabled. You have many threads are running. Interrupts are enabled
> and they occur. So, it's OK to disable interrupts for some time.
>
> My opinion is we should try to enable interrupts in the cycle after some fixed
> amount of memory is initialized. Say, every 1GB. This should resolve two problems:
> handling timer interrupt with update jiffies at time, and keeping the fix for the issue,
> that Pavel fixed in 3a2d7fa8a3d5.
>
>>> Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder().
>>>
>>>>> Also, you may try to check that sched_clock() gives better results with interrupts
>>>>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to
>>>>> jiffies-based clock in some hardware cases, and they also won't go with interrupts
>>>>> disabled).


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock
  2020-02-04  7:24               ` Shile Zhang
@ 2020-02-10  5:45                 ` Andrew Morton
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2020-02-10  5:45 UTC (permalink / raw)
  To: Shile Zhang; +Cc: Kirill Tkhai, Pavel Tatashin, linux-mm, linux-kernel

On Tue, 4 Feb 2020 15:24:38 +0800 Shile Zhang <shile.zhang@linux.alibaba.com> wrote:

> Hi, Andrew,
> 
> Sorry for ping, could you please also help to check this issue?
> Any comments from you is welcome!
> 

Kirill's suggestion isn't pretty but it appears that it will work OK -
redo deferred_init_memmap() so it releases 1024(?) pages at a time,
releasing and retaking pgdat_resize_lock/unlock each time around?


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-02-10  5:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-10  8:25 [PATCH 0/1] try to fix tick_sched timer stuck issue Shile Zhang
2020-01-10  8:25 ` [PATCH 1/1] mm: fix tick_sched timer blocked by pgdat_resize_lock Shile Zhang
2020-01-10  9:30   ` [PATCH RESEND] " Shile Zhang
2020-01-10 11:42     ` Kirill Tkhai
2020-01-13  0:54       ` Shile Zhang
2020-01-13  8:11         ` Kirill Tkhai
2020-01-14  8:54           ` Shile Zhang
2020-01-15  9:45             ` Kirill Tkhai
2020-02-04  7:24               ` Shile Zhang
2020-02-10  5:45                 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).