From: Shile Zhang <shile.zhang@linux.alibaba.com> To: Kirill Tkhai <ktkhai@virtuozzo.com>, Andrew Morton <akpm@linux-foundation.org>, Pavel Tatashin <pasha.tatashin@soleen.com> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock Date: Tue, 14 Jan 2020 16:54:22 +0800 [thread overview] Message-ID: <ba242ee6-22be-3047-5a88-e6b39e1509ef@linux.alibaba.com> (raw) In-Reply-To: <9420eab3-5e5e-150f-53c9-6cd40bacf859@virtuozzo.com> On 2020/1/13 16:11, Kirill Tkhai wrote: > On 13.01.2020 03:54, Shile Zhang wrote: >> >> On 2020/1/10 19:42, Kirill Tkhai wrote: >>> On 10.01.2020 12:30, Shile Zhang wrote: >>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock' >>>> will be called inside 'pgdatinit' kthread to initialise the deferred >>>> pages with local interrupts disabled. Which is introduced by >>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred >>>> pages"). >>>> >>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by >>>> default), especially in small system with NRCPUS <= 2. In this case, the >>>> interrupts are disabled on boot CPU during memory initialising, which >>>> caused the tick_sched timer be blocked, leading to wall clock stuck. >>>> >>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing >>>> deferred pages") >>>> >>>> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> >>>> --- >>>> include/linux/memory_hotplug.h | 16 ++++++++++++++-- >>>> 1 file changed, 14 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h >>>> index ba0dca6aac6e..be69a6dc4fee 100644 >>>> --- a/include/linux/memory_hotplug.h >>>> +++ b/include/linux/memory_hotplug.h >>>> @@ -6,6 +6,8 @@ >>>> #include <linux/spinlock.h> >>>> #include <linux/notifier.h> >>>> #include <linux/bug.h> >>>> +#include <linux/sched.h> >>>> +#include <linux/smp.h> >>>> struct page; >>>> struct zone; >>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void) >>>> static inline >>>> void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags) >>>> { >>>> - spin_lock_irqsave(&pgdat->node_size_lock, *flags); >>>> + /* >>>> + * Disable local interrupts on boot CPU will stop the tick_sched >>>> + * timer, which will block jiffies(wall clock) update. >>>> + */ >>>> + if (current->cpu != get_boot_cpu_id()) >>>> + spin_lock_irqsave(&pgdat->node_size_lock, *flags); >>>> + else >>>> + spin_lock(&pgdat->node_size_lock); >>>> } >>>> static inline >>>> void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags) >>>> { >>>> - spin_unlock_irqrestore(&pgdat->node_size_lock, *flags); >>>> + if (current->cpu != get_boot_cpu_id()) >>>> + spin_unlock_irqrestore(&pgdat->node_size_lock, *flags); >>>> + else >>>> + spin_unlock(&pgdat->node_size_lock); >>>> } >>>> static inline >>>> void pgdat_resize_init(struct pglist_data *pgdat) >>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt >>> *kernel* code in the middle of function. When you are executing a code containing >>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu >>> between them. >>> >>> bool cpu another cpu >>> ---------------------------------- >>> pgdat_resize_lock() >>> spin_lock() >>> --> migrate to another cpu >>> pgdat_resize_unlock() >>> spin_unlock_irqrestore(<uninitialized flags>) >>> >>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call). >>> >>> This looks like a bad helpers, and we should not introduce such the design. >> Hi Kirill, >> >> Thanks for your comments! >> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue >> with "CONFIG_PREEMPT is not set". > The thing is we simply shouldn't introduce such the primitives since the thread > may migrate to another cpu, while you own the lock. This looks like a buggy design. > >>> 2)I think there is no the problem this patch solves. Do we really this statistics? >>> Can't we simple remove print message from deferred_init_memmap() and solve this? >> Sorry for I've not put this issue very clearly. It's *not* just one statistics log >> with wrong time calculate, but the wall clock is stuck. >> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover >> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency. > Have you tried temporary enabling interrupts in the middle of cycle after a huge enough > memory block is initialized? Something like: > > deferred_init_memmap() > { > while (spfn < epfn) { > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn); > local_irq_enable(); > local_irq_disable(); > } > } Yes, I'd tried this for issue confirm, before I sent this patch. Likes I mentioned the debug log in cover letter, I also add mdelay between local_irq_enable/disable, this system jiffies is stuck without update. So I think there must be problem to use spin_lock_irqsave in early boot path on boot CPU. > Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder(). > >>> Also, you may try to check that sched_clock() gives better results with interrupts >>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to >>> jiffies-based clock in some hardware cases, and they also won't go with interrupts >>> disabled).
next prev parent reply other threads:[~2020-01-14 8:54 UTC|newest] Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-01-10 8:25 [PATCH 0/1] try to fix tick_sched timer stuck issue Shile Zhang 2020-01-10 8:25 ` [PATCH 1/1] mm: fix tick_sched timer blocked by pgdat_resize_lock Shile Zhang 2020-01-10 9:30 ` [PATCH RESEND] " Shile Zhang 2020-01-10 11:42 ` Kirill Tkhai 2020-01-13 0:54 ` Shile Zhang 2020-01-13 8:11 ` Kirill Tkhai 2020-01-14 8:54 ` Shile Zhang [this message] 2020-01-15 9:45 ` Kirill Tkhai 2020-02-04 7:24 ` Shile Zhang 2020-02-10 5:45 ` Andrew Morton
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=ba242ee6-22be-3047-5a88-e6b39e1509ef@linux.alibaba.com \ --to=shile.zhang@linux.alibaba.com \ --cc=akpm@linux-foundation.org \ --cc=ktkhai@virtuozzo.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=pasha.tatashin@soleen.com \ --subject='Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).