* [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
@ 2018-03-14 11:51 Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
0 siblings, 2 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-14 11:51 UTC (permalink / raw)
To: akpm, tj, cl, linux-mm, linux-kernel
In case of memory deficit and low percpu memory pages,
pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
time (as it makes memory allocations itself and waits
for memory reclaim). If tasks doing pcpu_alloc() are
chosen by the OOM killer, they can't exit, because they
are waiting for the mutex.

This patch makes pcpu_alloc() care about pending fatal
signals and use mutex_lock_killable() when the GFP flags
allow it. This guarantees that a task does not miss
SIGKILL from the OOM killer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
mm/percpu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 50e7fdf84055..212b4988926c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1369,8 +1369,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
-	if (!is_atomic)
-		mutex_lock(&pcpu_alloc_mutex);
+	if (!is_atomic) {
+		if (gfp & __GFP_NOFAIL)
+			mutex_lock(&pcpu_alloc_mutex);
+		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+			return NULL;
+	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
@ 2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-14 13:55 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: akpm, cl, linux-mm, linux-kernel
On Wed, Mar 14, 2018 at 02:51:48PM +0300, Kirill Tkhai wrote:
> In case of memory deficit and low percpu memory pages,
> pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> This patch makes pcpu_alloc() care about pending fatal
> signals and use mutex_lock_killable() when the GFP flags
> allow it. This guarantees that a task does not miss
> SIGKILL from the OOM killer.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied to percpu/for-4.16-fixes.
Thanks, Kirill.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
@ 2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
` (2 more replies)
1 sibling, 3 replies; 20+ messages in thread
From: Andrew Morton @ 2018-03-14 20:56 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: tj, cl, linux-mm, linux-kernel
On Wed, 14 Mar 2018 14:51:48 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> In case of memory deficit and low percpu memory pages,
> pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> This patch makes pcpu_alloc() care about pending fatal
> signals and use mutex_lock_killable() when the GFP flags
> allow it. This guarantees that a task does not miss
> SIGKILL from the OOM killer.
>
> ...
>
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1369,8 +1369,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		return NULL;
>  	}
> 
> -	if (!is_atomic)
> -		mutex_lock(&pcpu_alloc_mutex);
> +	if (!is_atomic) {
> +		if (gfp & __GFP_NOFAIL)
> +			mutex_lock(&pcpu_alloc_mutex);
> +		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> +			return NULL;
> +	}
It would benefit from a comment explaining why we're doing this (it's
for the oom-killer).
My memory is weak and our documentation is awful. What does
mutex_lock_killable() actually do and how does it differ from
mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
wonder if there's any way in which a userspace-delivered signal can
disrupt another userspace task's memory allocation attempt?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 20:56 ` Andrew Morton
@ 2018-03-14 22:09 ` Tejun Heo
2018-03-14 22:22 ` Andrew Morton
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2018-03-14 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
Hello, Andrew.
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> It would benefit from a comment explaining why we're doing this (it's
> for the oom-killer).
Will add.
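Something like this, probably (a sketch; the exact comment wording that
lands in the tree may differ):

	if (!is_atomic) {
		/*
		 * pcpu_balance_workfn() allocates memory under this mutex,
		 * and it may wait for memory reclaim, so an OOM victim
		 * sleeping here could never exit.  Allow the waiter to be
		 * killed unless the caller forbids failure.
		 */
		if (gfp & __GFP_NOFAIL)
			mutex_lock(&pcpu_alloc_mutex);
		else if (mutex_lock_killable(&pcpu_alloc_mutex))
			return NULL;
	}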
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
IIRC, killable listens only to SIGKILL.
> wonder if there's any way in which a userspace-delivered signal can
> disrupt another userspace task's memory allocation attempt?
Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
failure paths might be broken, so there are some risks. Given that
the cases where userspace tasks end up allocating percpu memory are
pretty limited and/or privileged (like mount, bpf), I don't think the
risks are high tho.
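For illustration, the caller-side contract this relies on, sketched with
a made-up percpu user: every non-atomic pcpu_alloc() caller must already
tolerate a NULL return, and those rarely exercised failure paths are
exactly the risk:

	struct foo_stats {			/* hypothetical user */
		u64 packets;
	};

	static struct foo_stats __percpu *foo_stats_alloc(void)
	{
		/*
		 * With the patch, this can now fail not only when memory
		 * runs out but also when the caller receives a fatal
		 * signal while waiting for pcpu_alloc_mutex.
		 */
		return alloc_percpu(struct foo_stats);
	}

Callers that already check for NULL keep working unchanged.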
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:09 ` Tejun Heo
@ 2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
2018-03-19 15:13 ` Tejun Heo
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2018-03-14 22:22 UTC (permalink / raw)
To: Tejun Heo; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
> Hello, Andrew.
>
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > It would benefit from a comment explaining why we're doing this (it's
> > for the oom-killer).
>
> Will add.
>
> > My memory is weak and our documentation is awful. What does
> > mutex_lock_killable() actually do and how does it differ from
> > mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>
> IIRC, killable listens only to SIGKILL.
>
> > wonder if there's any way in which a userspace-delivered signal can
> > disrupt another userspace task's memory allocation attempt?
>
> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
> failure paths might be broken, so there are some risks. Given that
> the cases where userspace tasks end up allocating percpu memory are
> pretty limited and/or privileged (like mount, bpf), I don't think the
> risks are high tho.
hm. spose so. Maybe. Are there other ways? I assume the time is
being spent in pcpu_create_chunk()? We could drop the mutex while
running that stuff and take the appropriate did-we-race-with-someone
testing after retaking it. Or similar.
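Roughly this shape, I mean (hypothetical sketch; the did-we-race helper
is invented):

	mutex_unlock(&pcpu_alloc_mutex);
	chunk = pcpu_create_chunk();		/* may sleep in reclaim */
	mutex_lock(&pcpu_alloc_mutex);
	if (!chunk)
		goto fail;
	/* re-check: someone may have created usable space meanwhile */
	if (pcpu_enough_free_space(size, align))	/* invented helper */
		pcpu_destroy_chunk(chunk);	/* we lost the race */
	else
		pcpu_chunk_relocate(chunk, -1);	/* publish our chunk */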
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:22 ` Andrew Morton
@ 2018-03-15 8:58 ` Kirill Tkhai
2018-03-15 10:48 ` Tetsuo Handa
2018-03-19 15:13 ` Tejun Heo
1 sibling, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 8:58 UTC (permalink / raw)
To: Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 01:22, Andrew Morton wrote:
> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>
>> Hello, Andrew.
>>
>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>> It would benefit from a comment explaining why we're doing this (it's
>>> for the oom-killer).
>>
>> Will add.
>>
>>> My memory is weak and our documentation is awful. What does
>>> mutex_lock_killable() actually do and how does it differ from
>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>
>> IIRC, killable listens only to SIGKILL.
>>
>>> wonder if there's any way in which a userspace-delivered signal can
>>> disrupt another userspace task's memory allocation attempt?
>>
>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>> failure paths might be broken, so there are some risks. Given that
>> the cases where userspace tasks end up allocating percpu memory are
>> pretty limited and/or privileged (like mount, bpf), I don't think the
>> risks are high tho.
>
> hm. spose so. Maybe. Are there other ways? I assume the time is
> being spent in pcpu_create_chunk()? We could drop the mutex while
> running that stuff and take the appropriate did-we-race-with-someone
> testing after retaking it. Or similar.
The balance work spends its time in pcpu_populate_chunk(). Here are
two stack traces showing the problem:
[ 106.313267] kworker/2:2 D13832 936 2 0x80000000
[ 106.313740] Workqueue: events pcpu_balance_workfn
[ 106.314109] Call Trace:
[ 106.314293] ? __schedule+0x267/0x750
[ 106.314570] schedule+0x2d/0x90
[ 106.314803] schedule_timeout+0x17f/0x390
[ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
[ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
[ 106.315792] __alloc_pages_nodemask+0x16a/0x210
[ 106.316148] pcpu_populate_chunk+0xce/0x300
[ 106.316479] pcpu_balance_workfn+0x3f3/0x580
[ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
[ 106.317227] ? finish_task_switch+0x8d/0x250
[ 106.317632] process_one_work+0x1b7/0x410
[ 106.317970] worker_thread+0x26/0x3d0
[ 106.318304] ? process_one_work+0x410/0x410
[ 106.318649] kthread+0x10e/0x130
[ 106.318916] ? __kthread_create_worker+0x120/0x120
[ 106.319360] ret_from_fork+0x35/0x40
[ 106.453375] a.out D13400 3670 1 0x00100004
[ 106.453880] Call Trace:
[ 106.454114] ? __schedule+0x267/0x750
[ 106.454427] schedule+0x2d/0x90
[ 106.454829] schedule_preempt_disabled+0xf/0x20
[ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
[ 106.455988] ? pcpu_alloc+0x3c4/0x670
[ 106.456465] pcpu_alloc+0x3c4/0x670
[ 106.456973] ? preempt_count_add+0x63/0x90
[ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
[ 106.457882] ipv6_add_dev+0x121/0x490
[ 106.458330] addrconf_notify+0x27b/0x9a0
[ 106.458823] ? inetdev_init+0xd7/0x150
[ 106.459270] ? inetdev_event+0x339/0x4b0
[ 106.459738] ? preempt_count_add+0x63/0x90
[ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
[ 106.460747] ? notifier_call_chain+0x42/0x60
[ 106.461271] notifier_call_chain+0x42/0x60
[ 106.461819] register_netdevice+0x415/0x530
[ 106.462364] register_netdev+0x11/0x20
[ 106.462849] loopback_net_init+0x43/0x90
[ 106.463216] ops_init+0x3b/0x100
[ 106.463516] setup_net+0x7d/0x150
[ 106.463831] copy_net_ns+0x14b/0x180
[ 106.464134] create_new_namespaces+0x117/0x1b0
[ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
[ 106.464864] SyS_unshare+0x1b0/0x300
[ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 8:58 ` Kirill Tkhai
@ 2018-03-15 10:48 ` Tetsuo Handa
2018-03-15 12:09 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2018-03-15 10:48 UTC (permalink / raw)
To: Kirill Tkhai, Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 2018/03/15 17:58, Kirill Tkhai wrote:
> On 15.03.2018 01:22, Andrew Morton wrote:
>> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>>
>>> Hello, Andrew.
>>>
>>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>>> It would benefit from a comment explaining why we're doing this (it's
>>>> for the oom-killer).
>>>
>>> Will add.
>>>
>>>> My memory is weak and our documentation is awful. What does
>>>> mutex_lock_killable() actually do and how does it differ from
>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>
>>> IIRC, killable listens only to SIGKILL.
I think that killable listens to any signal which results in termination of
that process. For example, if a process is configured to terminate upon SIGINT,
fatal_signal_pending() becomes true upon SIGINT.
>>>
>>>> wonder if there's any way in which a userspace-delivered signal can
>>>> disrupt another userspace task's memory allocation attempt?
>>>
>>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>>> failure paths might be broken, so there are some risks. Given that
>>> the cases where userspace tasks end up allocating percpu memory are
>>> pretty limited and/or privileged (like mount, bpf), I don't think the
>>> risks are high tho.
>>
>> hm. spose so. Maybe. Are there other ways? I assume the time is
>> being spent in pcpu_create_chunk()? We could drop the mutex while
>> running that stuff and take the appropriate did-we-race-with-someone
>> testing after retaking it. Or similar.
>
> The balance work spends its time in pcpu_populate_chunk(). Here are
> two stack traces showing the problem:
Will you show me more context? Except on CONFIG_MMU=n kernels, the OOM reaper
reclaims memory from the OOM victim. Therefore, "if tasks doing pcpu_alloc()
are chosen by the OOM killer, they can't exit, because they are waiting for the
mutex" should not cause problems. Of course, giving up upon SIGKILL is nice
regardless.
>
> [ 106.313267] kworker/2:2 D13832 936 2 0x80000000
> [ 106.313740] Workqueue: events pcpu_balance_workfn
> [ 106.314109] Call Trace:
> [ 106.314293] ? __schedule+0x267/0x750
> [ 106.314570] schedule+0x2d/0x90
> [ 106.314803] schedule_timeout+0x17f/0x390
> [ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
> [ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
> [ 106.315792] __alloc_pages_nodemask+0x16a/0x210
> [ 106.316148] pcpu_populate_chunk+0xce/0x300
> [ 106.316479] pcpu_balance_workfn+0x3f3/0x580
> [ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
> [ 106.317227] ? finish_task_switch+0x8d/0x250
> [ 106.317632] process_one_work+0x1b7/0x410
> [ 106.317970] worker_thread+0x26/0x3d0
> [ 106.318304] ? process_one_work+0x410/0x410
> [ 106.318649] kthread+0x10e/0x130
> [ 106.318916] ? __kthread_create_worker+0x120/0x120
> [ 106.319360] ret_from_fork+0x35/0x40
>
> [ 106.453375] a.out D13400 3670 1 0x00100004
> [ 106.453880] Call Trace:
> [ 106.454114] ? __schedule+0x267/0x750
> [ 106.454427] schedule+0x2d/0x90
> [ 106.454829] schedule_preempt_disabled+0xf/0x20
> [ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
> [ 106.455988] ? pcpu_alloc+0x3c4/0x670
> [ 106.456465] pcpu_alloc+0x3c4/0x670
> [ 106.456973] ? preempt_count_add+0x63/0x90
> [ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
> [ 106.457882] ipv6_add_dev+0x121/0x490
> [ 106.458330] addrconf_notify+0x27b/0x9a0
> [ 106.458823] ? inetdev_init+0xd7/0x150
> [ 106.459270] ? inetdev_event+0x339/0x4b0
> [ 106.459738] ? preempt_count_add+0x63/0x90
> [ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
> [ 106.460747] ? notifier_call_chain+0x42/0x60
> [ 106.461271] notifier_call_chain+0x42/0x60
> [ 106.461819] register_netdevice+0x415/0x530
> [ 106.462364] register_netdev+0x11/0x20
> [ 106.462849] loopback_net_init+0x43/0x90
> [ 106.463216] ops_init+0x3b/0x100
> [ 106.463516] setup_net+0x7d/0x150
> [ 106.463831] copy_net_ns+0x14b/0x180
> [ 106.464134] create_new_namespaces+0x117/0x1b0
> [ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
> [ 106.464864] SyS_unshare+0x1b0/0x300
>
> [ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
Neither of these stack traces is blocked at mutex_lock().

Why were all OOM-killable threads killed? Were there only a few?
Does pcpu_alloc() allocate enough to deplete the memory reserves?
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH] Improve mutex documentation
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
@ 2018-03-15 11:58 ` Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
` (2 more replies)
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2 siblings, 3 replies; 20+ messages in thread
From: Matthew Wilcox @ 2018-03-15 11:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill Tkhai, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()?
From: Matthew Wilcox <mawilcox@microsoft.com>
Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
kernel-doc for mutex_lock_interruptible().
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 858a07590e39..2048359f33d2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -1082,15 +1082,16 @@ static noinline int __sched
 __mutex_lock_interruptible_slowpath(struct mutex *lock);
 
 /**
- * mutex_lock_interruptible - acquire the mutex, interruptible
- * @lock: the mutex to be acquired
+ * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
+ * @lock: The mutex to be acquired.
  *
- * Lock the mutex like mutex_lock(), and return 0 if the mutex has
- * been acquired or sleep until the mutex becomes available. If a
- * signal arrives while waiting for the lock then this function
- * returns -EINTR.
+ * Lock the mutex like mutex_lock(). If a signal is delivered while the
+ * process is sleeping, this function will return without acquiring the
+ * mutex.
  *
- * This function is similar to (but not equivalent to) down_interruptible().
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * signal arrived.
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
@@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
 
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
+/**
+ * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). If a signal which will be fatal to
+ * the current process is delivered while the process is sleeping, this
+ * function will return without acquiring the mutex.
+ *
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * fatal signal arrived.
+ */
 int __sched mutex_lock_killable(struct mutex *lock)
 {
 	might_sleep();
@@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
+/**
+ * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). While the task is waiting for this
+ * mutex, it will be accounted as being in the IO wait state by the
+ * scheduler.
+ *
+ * Context: Process context.
+ */
 void __sched mutex_lock_io(struct mutex *lock)
 {
 	int token;
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 10:48 ` Tetsuo Handa
@ 2018-03-15 12:09 ` Kirill Tkhai
2018-03-15 14:09 ` Tetsuo Handa
0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 12:09 UTC (permalink / raw)
To: Tetsuo Handa, Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 13:48, Tetsuo Handa wrote:
> On 2018/03/15 17:58, Kirill Tkhai wrote:
>> On 15.03.2018 01:22, Andrew Morton wrote:
>>> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>>>
>>>> Hello, Andrew.
>>>>
>>>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>>>> It would benefit from a comment explaining why we're doing this (it's
>>>>> for the oom-killer).
>>>>
>>>> Will add.
>>>>
>>>>> My memory is weak and our documentation is awful. What does
>>>>> mutex_lock_killable() actually do and how does it differ from
>>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>>
>>>> IIRC, killable listens only to SIGKILL.
>
> I think that killable listens to any signal which results in termination of
> that process. For example, if a process is configured to terminate upon SIGINT,
> fatal_signal_pending() becomes true upon SIGINT.
It shouldn't act on SIGINT:
static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
>>>>
>>>>> wonder if there's any way in which a userspace-delivered signal can
>>>>> disrupt another userspace task's memory allocation attempt?
>>>>
>>>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>>>> failure paths might be broken, so there are some risks. Given that
>>>> the cases where userspace tasks end up allocating percpu memory are
>>>> pretty limited and/or privileged (like mount, bpf), I don't think the
>>>> risks are high tho.
>>>
>>> hm. spose so. Maybe. Are there other ways? I assume the time is
>>> being spent in pcpu_create_chunk()? We could drop the mutex while
>>> running that stuff and take the appropriate did-we-race-with-someone
>>> testing after retaking it. Or similar.
>>
>> The balance work spends its time in pcpu_populate_chunk(). Here are
>> two stack traces showing the problem:
>
> Will you show me more context? Except on CONFIG_MMU=n kernels, the OOM reaper
> reclaims memory from the OOM victim. Therefore, "if tasks doing pcpu_alloc()
> are chosen by the OOM killer, they can't exit, because they are waiting for the
> mutex" should not cause problems. Of course, giving up upon SIGKILL is nice
> regardless.
There is a test case which drives my 4-CPU VM to OOM:
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
	int i;

	/* fork 2^8 = 256 workers, each creating net namespaces in a loop */
	for (i = 0; i < 8; i++)
		fork();
	daemon(1, 1);
	while (1)
		unshare(CLONE_NEWNET);
}
The problem is that net namespace init/exit methods are not designed to be executed in parallel,
and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
what I've done in net-next.git, if you are interested.

The pcpu_alloc()-related OOM happens on stable kernels, and it's easy to trigger with the test.
pcpu is not the only problem there, but it's one of them: there is a logically possible
OOM deadlock in the pcpu code, as described in the patch description, and the patch fixes it.

Stepping back from this specific problem to the general one, I think all allocating/registering
actions in the kernel should use killable primitives if they are allowed to fail, and the generic
policy should be mutex_lock_killable() instead of mutex_lock(). Otherwise, OOM victims can't die
while they are waiting for a mutex held by a process doing reclaim. That creates circular
dependencies and makes OOM badness accounting useless, when it must not be so.
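As a sketch of that policy, applied to an invented registration path
that is allowed to fail:

	static DEFINE_MUTEX(thing_mutex);	/* hypothetical subsystem lock */
	static LIST_HEAD(thing_list);

	static int register_thing(struct thing *t)
	{
		/*
		 * Was: mutex_lock(&thing_mutex); -- an OOM victim could
		 * sleep here forever while the lock holder reclaims.
		 */
		if (mutex_lock_killable(&thing_mutex))
			return -EINTR;		/* the victim can exit now */
		list_add(&t->node, &thing_list);
		mutex_unlock(&thing_mutex);
		return 0;
	}

Of course, this conversion is only valid where every caller already
handles failure.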
>>
>> [ 106.313267] kworker/2:2 D13832 936 2 0x80000000
>> [ 106.313740] Workqueue: events pcpu_balance_workfn
>> [ 106.314109] Call Trace:
>> [ 106.314293] ? __schedule+0x267/0x750
>> [ 106.314570] schedule+0x2d/0x90
>> [ 106.314803] schedule_timeout+0x17f/0x390
>> [ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
>> [ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
>> [ 106.315792] __alloc_pages_nodemask+0x16a/0x210
>> [ 106.316148] pcpu_populate_chunk+0xce/0x300
>> [ 106.316479] pcpu_balance_workfn+0x3f3/0x580
>> [ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
>> [ 106.317227] ? finish_task_switch+0x8d/0x250
>> [ 106.317632] process_one_work+0x1b7/0x410
>> [ 106.317970] worker_thread+0x26/0x3d0
>> [ 106.318304] ? process_one_work+0x410/0x410
>> [ 106.318649] kthread+0x10e/0x130
>> [ 106.318916] ? __kthread_create_worker+0x120/0x120
>> [ 106.319360] ret_from_fork+0x35/0x40
>>
>> [ 106.453375] a.out D13400 3670 1 0x00100004
>> [ 106.453880] Call Trace:
>> [ 106.454114] ? __schedule+0x267/0x750
>> [ 106.454427] schedule+0x2d/0x90
>> [ 106.454829] schedule_preempt_disabled+0xf/0x20
>> [ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
>> [ 106.455988] ? pcpu_alloc+0x3c4/0x670
>> [ 106.456465] pcpu_alloc+0x3c4/0x670
>> [ 106.456973] ? preempt_count_add+0x63/0x90
>> [ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
>> [ 106.457882] ipv6_add_dev+0x121/0x490
>> [ 106.458330] addrconf_notify+0x27b/0x9a0
>> [ 106.458823] ? inetdev_init+0xd7/0x150
>> [ 106.459270] ? inetdev_event+0x339/0x4b0
>> [ 106.459738] ? preempt_count_add+0x63/0x90
>> [ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
>> [ 106.460747] ? notifier_call_chain+0x42/0x60
>> [ 106.461271] notifier_call_chain+0x42/0x60
>> [ 106.461819] register_netdevice+0x415/0x530
>> [ 106.462364] register_netdev+0x11/0x20
>> [ 106.462849] loopback_net_init+0x43/0x90
>> [ 106.463216] ops_init+0x3b/0x100
>> [ 106.463516] setup_net+0x7d/0x150
>> [ 106.463831] copy_net_ns+0x14b/0x180
>> [ 106.464134] create_new_namespaces+0x117/0x1b0
>> [ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
>> [ 106.464864] SyS_unshare+0x1b0/0x300
>>
>> [ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
>
> Neither of these stack traces is blocked at mutex_lock().
>
> Why were all OOM-killable threads killed? Were there only a few?
> Does pcpu_alloc() allocate enough to deplete the memory reserves?
The test eats all kmem, so OOM kills everything; that is due to slow net namespace
destruction. But this patch is about the "half"-deadlock between pcpu_alloc() and the worker,
which slows down OOM reaping. The possibility is real, and it's good to fix it.
I've seen a crash while waiting on the mutex, but I did not save it. It seems the test
may reproduce it after some time. With the patch applied I don't see pcpu-related crashes
in pcpu_alloc() at all.
Kirill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
@ 2018-03-15 12:12 ` Kirill Tkhai
2018-03-15 13:18 ` Matthew Wilcox
2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 12:12 UTC (permalink / raw)
To: Matthew Wilcox, Andrew Morton
Cc: tj, cl, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
Mauro Carvalho Chehab, Peter Zijlstra, Ingo Molnar
Hi, Matthew,
On 15.03.2018 14:58, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>> My memory is weak and our documentation is awful. What does
>> mutex_lock_killable() actually do and how does it differ from
>> mutex_lock_interruptible()?
>
> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
> kernel-doc for mutex_lock_interruptible().
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 858a07590e39..2048359f33d2 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -1082,15 +1082,16 @@ static noinline int __sched
> __mutex_lock_interruptible_slowpath(struct mutex *lock);
>
> /**
> - * mutex_lock_interruptible - acquire the mutex, interruptible
> - * @lock: the mutex to be acquired
> + * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
> + * @lock: The mutex to be acquired.
> *
> - * Lock the mutex like mutex_lock(), and return 0 if the mutex has
> - * been acquired or sleep until the mutex becomes available. If a
> - * signal arrives while waiting for the lock then this function
> - * returns -EINTR.
> + * Lock the mutex like mutex_lock(). If a signal is delivered while the
> + * process is sleeping, this function will return without acquiring the
> + * mutex.
> *
> - * This function is similar to (but not equivalent to) down_interruptible().
> + * Context: Process context.
> + * Return: 0 if the lock was successfully acquired or %-EINTR if a
> + * signal arrived.
> */
> int __sched mutex_lock_interruptible(struct mutex *lock)
> {
> @@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
>
> EXPORT_SYMBOL(mutex_lock_interruptible);
>
> +/**
> + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
Shouldn't we clarify that fatal signals are SIGKILL only?
> + * @lock: The mutex to be acquired.
> + *
> + * Lock the mutex like mutex_lock(). If a signal which will be fatal to
> + * the current process is delivered while the process is sleeping, this
> + * function will return without acquiring the mutex.
> + *
> + * Context: Process context.
> + * Return: 0 if the lock was successfully acquired or %-EINTR if a
> + * fatal signal arrived.
> + */
> int __sched mutex_lock_killable(struct mutex *lock)
> {
> might_sleep();
> @@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
> }
> EXPORT_SYMBOL(mutex_lock_killable);
>
> +/**
> + * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
> + * @lock: The mutex to be acquired.
> + *
> + * Lock the mutex like mutex_lock(). While the task is waiting for this
> + * mutex, it will be accounted as being in the IO wait state by the
> + * scheduler.
> + *
> + * Context: Process context.
> + */
> void __sched mutex_lock_io(struct mutex *lock)
> {
> int token;
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 12:12 ` Kirill Tkhai
@ 2018-03-15 13:18 ` Matthew Wilcox
2018-03-15 13:23 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-03-15 13:18 UTC (permalink / raw)
To: Kirill Tkhai
Cc: Andrew Morton, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On Thu, Mar 15, 2018 at 03:12:30PM +0300, Kirill Tkhai wrote:
> > +/**
> > + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
>
> Shouldn't we clarify that fatal signals are SIGKILL only?
It's more complicated than it might seem (... welcome to signal handling!)
If you send SIGINT to a task that's waiting in mutex_lock_killable(), it will
still die. I *think* that's due to the code in complete_signal():
	if (sig_fatal(p, sig) &&
	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
	    !sigismember(&t->real_blocked, sig) &&
	    (sig == SIGKILL || !p->ptrace)) {
		...
		sigaddset(&t->pending.signal, SIGKILL);
You're correct that this code only checks for SIGKILL, but any fatal
signal will result in the signal group receiving SIGKILL.
Unless I've misunderstood, and it wouldn't be the first time I've
misunderstood signal handling.
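For reference, the sig_fatal() test that branch depends on lives in
include/linux/signal.h; quoting it from memory (so double-check the
tree), it is roughly:

	#define sig_fatal(t, signr) \
		(!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \
		 (t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL)

i.e. the signal's default action is termination and the task has not
installed a handler for it.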
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 13:18 ` Matthew Wilcox
@ 2018-03-15 13:23 ` Kirill Tkhai
0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 13:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On 15.03.2018 16:18, Matthew Wilcox wrote:
> On Thu, Mar 15, 2018 at 03:12:30PM +0300, Kirill Tkhai wrote:
>>> +/**
>>> + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
>>
>> Shouldn't we clarify that fatal signals are SIGKILL only?
>
> It's more complicated than it might seem (... welcome to signal handling!)
> If you send SIGINT to a task that's waiting on a mutex_killable(), it will
> still die. I *think* that's due to the code in complete_signal():
>
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> (sig == SIGKILL || !p->ptrace)) {
> ...
> sigaddset(&t->pending.signal, SIGKILL);
>
> You're correct that this code only checks for SIGKILL, but any fatal
> signal will result in the signal group receiving SIGKILL.
>
> Unless I've misunderstood, and it wouldn't be the first time I've
> misunderstood signal handling.
Sure, thanks for the explanation.
Kirill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 12:09 ` Kirill Tkhai
@ 2018-03-15 14:09 ` Tetsuo Handa
2018-03-15 14:42 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2018-03-15 14:09 UTC (permalink / raw)
To: ktkhai, akpm, tj; +Cc: cl, linux-mm, linux-kernel
Kirill Tkhai wrote:
> >>>>> My memory is weak and our documentation is awful. What does
> >>>>> mutex_lock_killable() actually do and how does it differ from
> >>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
> >>>>
> >>>> IIRC, killable listens only to SIGKILL.
> >
> > I think that killable listens to any signal which results in termination of
> > that process. For example, if a process is configured to terminate upon SIGINT,
> > fatal_signal_pending() becomes true upon SIGINT.
>
> It shouldn't act on SIGINT:
>
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }
>
> static inline int fatal_signal_pending(struct task_struct *p)
> {
> return signal_pending(p) && __fatal_signal_pending(p);
> }
>
Really? Compile the module below and try to load it with the insmod command.
----------------------------------------
#include <linux/module.h>
#include <linux/sched/signal.h>
static int __init test_init(void)
{
	static DEFINE_MUTEX(lock);

	mutex_lock(&lock);
	printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n",
	       signal_pending(current), fatal_signal_pending(current));
	/* The lock is already held, so this blocks until a signal that
	 * would terminate the process arrives. */
	if (mutex_lock_killable(&lock)) {
		printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n",
		       signal_pending(current), fatal_signal_pending(current));
		mutex_unlock(&lock);
		return -EINTR;
	}
	mutex_unlock(&lock);
	mutex_unlock(&lock);
	return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------------------------------------
What you will see (apart from a lockdep warning) upon SIGINT or SIGHUP is
signal_pending()=0 fatal_signal_pending()=0
signal_pending()=1 fatal_signal_pending()=1
which means that fatal_signal_pending() becomes true without SIGKILL.
If insmod is executed via a nohup wrapper, insmod does not terminate upon SIGHUP.
> The problem is that net namespace init/exit methods are not designed to be executed in parallel,
> and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
> what I've done in net-next.git, if you are interested.
I see. Despite your patch, torture tests using your test case still allow an OOM panic.
----------------------------------------
[ 860.420677] Out of memory: Kill process 12727 (a.out) score 0 or sacrifice child
[ 860.423228] Killed process 12727 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.428125] oom_reaper: reaped process 12727 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.438257] Out of memory: Kill process 12728 (a.out) score 0 or sacrifice child
[ 860.440709] Killed process 12728 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.445840] oom_reaper: reaped process 12728 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.456815] Out of memory: Kill process 12729 (a.out) score 0 or sacrifice child
[ 860.459618] Killed process 12729 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.464686] oom_reaper: reaped process 12729 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.489807] Out of memory: Kill process 12730 (a.out) score 0 or sacrifice child
[ 860.492495] Killed process 12730 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.501268] oom_reaper: reaped process 12730 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.536786] Out of memory: Kill process 12731 (a.out) score 0 or sacrifice child
[ 860.539392] Killed process 12731 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.544130] oom_reaper: reaped process 12731 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.553587] Out of memory: Kill process 12732 (a.out) score 0 or sacrifice child
[ 860.556359] Killed process 12732 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.559639] oom_reaper: reaped process 12732 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.564972] Out of memory: Kill process 12733 (a.out) score 0 or sacrifice child
[ 860.567603] Killed process 12733 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.573416] oom_reaper: reaped process 12733 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.579675] Out of memory: Kill process 762 (dbus-daemon) score 0 or sacrifice child
[ 860.582334] Killed process 762 (dbus-daemon) total-vm:24560kB, anon-rss:480kB, file-rss:0kB, shmem-rss:0kB
[ 860.590607] systemd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 860.594065] systemd cpuset=/ mems_allowed=0
[ 860.596172] CPU: 1 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 4.16.0-rc5-next-20180315 #695
[ 860.599401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 860.602676] Call Trace:
[ 860.604118] dump_stack+0x5f/0x8b
[ 860.605741] dump_header+0x69/0x431
[ 860.607380] ? rcu_read_unlock_special+0x2cc/0x2f0
[ 860.609342] out_of_memory+0x4d8/0x720
[ 860.611044] __alloc_pages_nodemask+0x12c5/0x1410
[ 860.613041] filemap_fault+0x479/0x640
[ 860.614725] __xfs_filemap_fault.constprop.0+0x5f/0x1f0
[ 860.616717] __do_fault+0x15/0xa0
[ 860.618294] __handle_mm_fault+0xcb2/0x1140
[ 860.620031] handle_mm_fault+0x186/0x350
[ 860.621720] __do_page_fault+0x2a7/0x510
[ 860.623402] do_page_fault+0x2c/0x2a0
[ 860.624999] ? page_fault+0x2f/0x50
[ 860.626548] page_fault+0x45/0x50
[ 860.628045] RIP: 61fa2380:0x55c6609099a0
[ 860.629805] RSP: 608a920b:00007ffc83a98620 EFLAGS: 7fdbd590e740
[ 860.630698] Mem-Info:
[ 860.634428] active_anon:3783 inactive_anon:3987 isolated_anon:0
[ 860.634428] active_file:3 inactive_file:0 isolated_file:0
[ 860.634428] unevictable:0 dirty:0 writeback:0 unstable:0
[ 860.634428] slab_reclaimable:124666 slab_unreclaimable:694094
[ 860.634428] mapped:37 shmem:6270 pagetables:2087 bounce:0
[ 860.634428] free:21037 free_pcp:299 free_cma:0
[ 860.646361] Node 0 active_anon:15132kB inactive_anon:15948kB active_file:12kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:148kB dirty:0kB writeback:0kB shmem:25080kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 860.653706] Node 0 DMA free:14828kB min:284kB low:352kB high:420kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 860.660693] lowmem_reserve[]: 0 2684 3642 3642
[ 860.662661] Node 0 DMA32 free:53532kB min:49596kB low:61992kB high:74388kB active_anon:3420kB inactive_anon:5048kB active_file:192kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2771556kB mlocked:0kB kernel_stack:5184kB pagetables:7680kB bounce:0kB free_pcp:444kB local_pcp:76kB free_cma:0kB
[ 860.671523] lowmem_reserve[]: 0 0 958 958
[ 860.673464] Node 0 Normal free:15616kB min:17696kB low:22120kB high:26544kB active_anon:11948kB inactive_anon:10900kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:981136kB mlocked:0kB kernel_stack:2640kB pagetables:636kB bounce:0kB free_pcp:892kB local_pcp:648kB free_cma:0kB
[ 860.682664] lowmem_reserve[]: 0 0 0 0
[ 860.684412] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (E) 1*256kB (E) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 2*4096kB (M) = 14828kB
[ 860.688963] Node 0 DMA32: 565*4kB (UM) 568*8kB (UM) 1037*16kB (UM) 42*32kB (ME) 26*64kB (UME) 15*128kB (ME) 15*256kB (UM) 8*512kB (ME) 7*1024kB (UME) 3*2048kB (M) 1*4096kB (M) = 53668kB
[ 860.694161] Node 0 Normal: 67*4kB (ME) 1218*8kB (UME) 348*16kB (UME) 1*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15612kB
[ 860.698195] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 860.701105] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 860.703682] 6267 total pagecache pages
[ 860.705388] 0 pages in swap cache
[ 860.707457] Swap cache stats: add 0, delete 0, find 0/0
[ 860.709624] Free swap = 0kB
[ 860.711105] Total swap = 0kB
[ 860.712841] 1048445 pages RAM
[ 860.714320] 0 pages HighMem/MovableOnly
[ 860.715970] 106296 pages reserved
[ 860.717625] 0 pages hwpoisoned
[ 860.719086] Unreclaimable slab info:
[ 860.720677] Name Used Total
[ 860.722685] scsi_sense_cache 44KB 44KB
[ 860.724517] RAWv6 237460KB 237460KB
[ 860.726353] TCPv6 118KB 118KB
[ 860.728183] sgpool-128 192KB 192KB
[ 860.730434] sgpool-16 64KB 64KB
[ 860.732303] mqueue_inode_cache 31KB 31KB
[ 860.734302] xfs_buf 584KB 640KB
[ 860.736188] xfs_ili 134KB 134KB
[ 860.738154] xfs_efd_item 110KB 110KB
[ 860.740163] xfs_trans 31KB 31KB
[ 860.742707] xfs_ifork 108KB 108KB
[ 860.744781] xfs_da_state 63KB 63KB
[ 860.747019] xfs_btree_cur 31KB 31KB
[ 860.748845] bio-2 47KB 47KB
[ 860.750647] UNIX 273KB 273KB
[ 860.752534] RAW 277018KB 277018KB
[ 860.754339] UDP 19861KB 19861KB
[ 860.756221] tw_sock_TCP 7KB 7KB
[ 860.758115] request_sock_TCP 7KB 7KB
[ 860.759851] TCP 120KB 120KB
[ 860.761508] hugetlbfs_inode_cache 63KB 63KB
[ 860.763700] eventpoll_pwq 15KB 15KB
[ 860.765423] inotify_inode_mark 52KB 52KB
[ 860.767707] request_queue 94KB 94KB
[ 860.769379] blkdev_ioc 39KB 39KB
[ 860.771060] biovec-(1<<(21-12)) 784KB 912KB
[ 860.772759] biovec-128 192KB 192KB
[ 860.775225] biovec-64 128KB 128KB
[ 860.777187] uid_cache 15KB 15KB
[ 860.779341] dmaengine-unmap-2 16KB 16KB
[ 860.781330] skbuff_head_cache 184KB 216KB
[ 860.783001] file_lock_cache 31KB 31KB
[ 860.784593] file_lock_ctx 15KB 15KB
[ 860.786172] net_namespace 55270KB 55270KB
[ 860.787706] shmem_inode_cache 980KB 980KB
[ 860.789349] task_delay_info 138KB 179KB
[ 860.791090] taskstats 23KB 23KB
[ 860.792623] proc_dir_entry 142317KB 142317KB
[ 860.794287] pde_opener 15KB 15KB
[ 860.796069] seq_file 31KB 31KB
[ 860.797641] sigqueue 19KB 19KB
[ 860.799060] kernfs_node_cache 173844KB 173844KB
[ 860.800572] mnt_cache 141KB 141KB
[ 860.802101] filp 656KB 656KB
[ 860.803624] names_cache 256KB 256KB
[ 860.805107] key_jar 31KB 31KB
[ 860.806655] vm_area_struct 1293KB 1726KB
[ 860.808074] mm_struct 789KB 1012KB
[ 860.809482] files_cache 1330KB 1330KB
[ 860.811777] signal_cache 998KB 1606KB
[ 860.813835] sighand_cache 1610KB 1990KB
[ 860.815698] task_struct 3795KB 4766KB
[ 860.817247] cred_jar 368KB 460KB
[ 860.818819] anon_vma 1457KB 1747KB
[ 860.820406] pid 1534KB 2196KB
[ 860.822081] Acpi-Operand 480KB 480KB
[ 860.823955] Acpi-State 27KB 27KB
[ 860.825518] Acpi-Namespace 179KB 179KB
[ 860.827060] numa_policy 15KB 15KB
[ 860.828772] trace_event_file 90KB 90KB
[ 860.830696] ftrace_event_field 95KB 95KB
[ 860.832241] pool_workqueue 144KB 144KB
[ 860.833746] task_group 599KB 630KB
[ 860.835272] page->ptl 362KB 411KB
[ 860.836755] dma-kmalloc-512 16KB 16KB
[ 860.838274] kmalloc-8192 211320KB 211320KB
[ 860.839748] kmalloc-4096 631748KB 631748KB
[ 860.841248] kmalloc-2048 529296KB 529296KB
[ 860.843094] kmalloc-1024 79300KB 79300KB
[ 860.844575] kmalloc-512 192744KB 192744KB
[ 860.847059] kmalloc-256 24292KB 24292KB
[ 860.848747] kmalloc-192 56959KB 56959KB
[ 860.850504] kmalloc-128 77732KB 77732KB
[ 860.852197] kmalloc-96 1426KB 1519KB
[ 860.853936] kmalloc-64 19192KB 19192KB
[ 860.855454] kmalloc-32 1368KB 1368KB
[ 860.856988] kmalloc-16 336KB 336KB
[ 860.858478] kmalloc-8 864KB 864KB
[ 860.859990] kmem_cache_node 20KB 20KB
[ 860.861448] kmem_cache 78KB 78KB
[ 860.863006] Kernel panic - not syncing: Out of memory and no killable processes...
[ 860.863006]
[ 860.866069] CPU: 3 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 4.16.0-rc5-next-20180315 #695
[ 860.868829] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 860.871926] Call Trace:
[ 860.872850] dump_stack+0x5f/0x8b
[ 860.873965] panic+0xde/0x231
[ 860.875005] out_of_memory+0x4e4/0x720
[ 860.876218] __alloc_pages_nodemask+0x12c5/0x1410
[ 860.877645] filemap_fault+0x479/0x640
[ 860.878842] __xfs_filemap_fault.constprop.0+0x5f/0x1f0
[ 860.880542] __do_fault+0x15/0xa0
[ 860.881674] __handle_mm_fault+0xcb2/0x1140
[ 860.883004] handle_mm_fault+0x186/0x350
[ 860.884283] __do_page_fault+0x2a7/0x510
[ 860.885576] do_page_fault+0x2c/0x2a0
[ 860.886805] ? page_fault+0x2f/0x50
[ 860.888003] page_fault+0x45/0x50
[ 860.889169] RIP: 61fa2380:0x55c6609099a0
----------------------------------------
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 14:09 ` Tetsuo Handa
@ 2018-03-15 14:42 ` Kirill Tkhai
0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 14:42 UTC (permalink / raw)
To: Tetsuo Handa, akpm, tj, willy; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 17:09, Tetsuo Handa wrote:
> Kirill Tkhai wrote:
>>>>>>> My memory is weak and our documentation is awful. What does
>>>>>>> mutex_lock_killable() actually do and how does it differ from
>>>>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>>>>
>>>>>> IIRC, killable listens only to SIGKILL.
>>>
>>> I think that killable listens to any signal which results in termination of
>>> that process. For example, if a process is configured to terminate upon SIGINT,
>>> fatal_signal_pending() becomes true upon SIGINT.
>>
>> It shouldn't act on SIGINT:
>>
>> static inline int __fatal_signal_pending(struct task_struct *p)
>> {
>> return unlikely(sigismember(&p->pending.signal, SIGKILL));
>> }
>>
>> static inline int fatal_signal_pending(struct task_struct *p)
>> {
>> return signal_pending(p) && __fatal_signal_pending(p);
>> }
>>
>
> Really? Compile below module and try to load using insmod command.
>
> ----------------------------------------
> #include <linux/module.h>
> #include <linux/sched/signal.h>
>
> static int __init test_init(void)
> {
> static DEFINE_MUTEX(lock);
>
> mutex_lock(&lock);
> printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n", signal_pending(current), fatal_signal_pending(current));
> if (mutex_lock_killable(&lock)) {
> printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n", signal_pending(current), fatal_signal_pending(current));
> mutex_unlock(&lock);
> return -EINTR;
> }
> mutex_unlock(&lock);
> mutex_unlock(&lock);
> return -EINVAL;
> }
>
> module_init(test_init);
> MODULE_LICENSE("GPL");
> ----------------------------------------
>
> What you will see (apart from lockdep warning) upon SIGINT or SIGHUP is
>
> signal_pending()=0 fatal_signal_pending()=0
> signal_pending()=1 fatal_signal_pending()=1
>
> which means that fatal_signal_pending() becomes true without SIGKILL.
> If insmod is executed via nohup wrapper, insmod does not terminate upon SIGHUP.
Matthew already pointed that out. Thanks for the explanation again :)
>> The problem is that net namespace init/exit methods are not designed to be executed in parallel,
>> and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
>> what I've done in net-next.git, if you are interested.
>
> I see. Despite your patch, torture tests using your test case still allow an OOM panic.
I know. There are several problems. But a fresh net-next.git with this patch and these two patchsets:

https://patchwork.ozlabs.org/project/netdev/list/?series=33829
https://patchwork.ozlabs.org/project/netdev/list/?series=33949

does not run into OOM during the test.

Despite that, there is still a lot of work to be done.
Kirill
> [full OOM report quoted from the previous message snipped]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
@ 2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2018-03-16 13:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Kirill Tkhai, tj, cl, linux-mm, linux-kernel,
linux-doc, Jonathan Corbet, Mauro Carvalho Chehab, Ingo Molnar
On Thu, Mar 15, 2018 at 04:58:12AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > My memory is weak and our documentation is awful. What does
> > mutex_lock_killable() actually do and how does it differ from
> > mutex_lock_interruptible()?
>
> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
> kernel-doc for mutex_lock_interruptible().
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Thanks Matthew!
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
@ 2018-03-19 15:13 ` Tejun Heo
1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 15:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
Hello, Andrew.
On Wed, Mar 14, 2018 at 03:22:03PM -0700, Andrew Morton wrote:
> hm. spose so. Maybe. Are there other ways? I assume the time is
> being spent in pcpu_create_chunk()? We could drop the mutex while
> running that stuff and take the appropriate did-we-race-with-someone
> testing after retaking it. Or similar.
I'm not sure that'd change much. Ultimately, isn't the choice between
being able to return NULL and waiting for more memory? If we decide
to return NULL, it doesn't make a difference where we do that from,
right?
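For reference, the drop-and-recheck shape Andrew describes would look
roughly like this. A sketch only: pcpu_need_more_space() is a
hypothetical re-check, not an existing mm/percpu.c helper, and the
pcpu_lock handling around the chunk lists is elided:

	mutex_unlock(&pcpu_alloc_mutex);
	chunk = pcpu_create_chunk();	/* slow path: may sleep in reclaim */
	mutex_lock(&pcpu_alloc_mutex);
	if (chunk) {
		if (pcpu_need_more_space())	/* did we race with someone? */
			pcpu_chunk_relocate(chunk, -1);
		else
			pcpu_destroy_chunk(chunk);	/* lost the race, drop ours */
	}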
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
@ 2018-03-19 15:14 ` Tejun Heo
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
2 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 15:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > + if (!is_atomic) {
> > + if (gfp & __GFP_NOFAIL)
> > + mutex_lock(&pcpu_alloc_mutex);
> > + else if (mutex_lock_killable(&pcpu_alloc_mutex))
> > + return NULL;
> > + }
>
> It would benefit from a comment explaining why we're doing this (it's
> for the oom-killer).
And, yeah, this would be great. Kirill, can you please send a patch
to add a comment there?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2] mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
@ 2018-03-19 15:32 ` Kirill Tkhai
2018-03-19 16:39 ` Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-19 15:32 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton; +Cc: cl, linux-mm, linux-kernel
From: Kirill Tkhai <ktkhai@virtuozzo.com>
In case of a memory deficit and few free percpu pages,
pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
time (as it makes memory allocations itself and waits
for memory reclaim). If tasks doing pcpu_alloc() are
chosen by the OOM killer, they can't exit, because they
are waiting for the mutex.
The patch makes pcpu_alloc() watch for a fatal signal
and use mutex_lock_killable() when the GFP flags allow
it. This guarantees that a task does not miss SIGKILL
from the OOM killer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
v2: Added an explanatory comment
mm/percpu.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 50e7fdf84055..605e3228baa6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1369,8 +1369,17 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
return NULL;
}
- if (!is_atomic)
- mutex_lock(&pcpu_alloc_mutex);
+ if (!is_atomic) {
+ /*
+ * pcpu_balance_workfn() allocates memory under this mutex,
+ * and it may wait for memory reclaim. Allow the current task
+ * to become an OOM victim under memory pressure.
+ */
+ if (gfp & __GFP_NOFAIL)
+ mutex_lock(&pcpu_alloc_mutex);
+ else if (mutex_lock_killable(&pcpu_alloc_mutex))
+ return NULL;
+ }
spin_lock_irqsave(&pcpu_lock, flags);
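For illustration, and not part of the patch: a killable GFP_KERNEL
caller now sees a fatal signal as an ordinary allocation failure, so
existing NULL checks keep working unchanged. A minimal sketch with a
hypothetical caller:

	int __percpu *ctr;

	/* may return NULL either on real allocation failure or because
	 * the task received SIGKILL while waiting for pcpu_alloc_mutex */
	ctr = alloc_percpu_gfp(int, GFP_KERNEL);
	if (!ctr)
		return -ENOMEM;
	free_percpu(ctr);
	return 0;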
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2] mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
@ 2018-03-19 16:39 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 16:39 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: Andrew Morton, cl, linux-mm, linux-kernel
On Mon, Mar 19, 2018 at 06:32:10PM +0300, Kirill Tkhai wrote:
> From: Kirill Tkhai <ktkhai@virtuozzo.com>
>
> In case of a memory deficit and few free percpu pages,
> pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> The patch makes pcpu_alloc() watch for a fatal signal
> and use mutex_lock_killable() when the GFP flags allow
> it. This guarantees that a task does not miss SIGKILL
> from the OOM killer.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied to percpu/for-4.16-fixes.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* [tip:locking/urgent] locking/mutex: Improve documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
2018-03-16 13:57 ` Peter Zijlstra
@ 2018-03-20 11:07 ` tip-bot for Matthew Wilcox
2 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Matthew Wilcox @ 2018-03-20 11:07 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, peterz, hpa, mchehab, corbet, linux-kernel, ktkhai,
mawilcox, mingo, paulmck, akpm, torvalds
Commit-ID: 45dbac0e288350f9a4226a5b4b651ed434dd9f85
Gitweb: https://git.kernel.org/tip/45dbac0e288350f9a4226a5b4b651ed434dd9f85
Author: Matthew Wilcox <mawilcox@microsoft.com>
AuthorDate: Thu, 15 Mar 2018 04:58:12 -0700
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 20 Mar 2018 08:07:41 +0100
locking/mutex: Improve documentation
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()?
Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
kernel-doc for mutex_lock_interruptible().
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cl@linux.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20180315115812.GA9949@bombadil.infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/locking/mutex.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 858a07590e39..2048359f33d2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -1082,15 +1082,16 @@ static noinline int __sched
__mutex_lock_interruptible_slowpath(struct mutex *lock);
/**
- * mutex_lock_interruptible - acquire the mutex, interruptible
- * @lock: the mutex to be acquired
+ * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
+ * @lock: The mutex to be acquired.
*
- * Lock the mutex like mutex_lock(), and return 0 if the mutex has
- * been acquired or sleep until the mutex becomes available. If a
- * signal arrives while waiting for the lock then this function
- * returns -EINTR.
+ * Lock the mutex like mutex_lock(). If a signal is delivered while the
+ * process is sleeping, this function will return without acquiring the
+ * mutex.
*
- * This function is similar to (but not equivalent to) down_interruptible().
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * signal arrived.
*/
int __sched mutex_lock_interruptible(struct mutex *lock)
{
@@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock_interruptible);
+/**
+ * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). If a signal which will be fatal to
+ * the current process is delivered while the process is sleeping, this
+ * function will return without acquiring the mutex.
+ *
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * fatal signal arrived.
+ */
int __sched mutex_lock_killable(struct mutex *lock)
{
might_sleep();
@@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
}
EXPORT_SYMBOL(mutex_lock_killable);
+/**
+ * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). While the task is waiting for this
+ * mutex, it will be accounted as being in the IO wait state by the
+ * scheduler.
+ *
+ * Context: Process context.
+ */
void __sched mutex_lock_io(struct mutex *lock)
{
int token;
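A short usage sketch of the variants documented above (illustrative
only; my_mutex is a placeholder, and all calls assume process context):

	static DEFINE_MUTEX(my_mutex);

	/* interruptible: any signal aborts the wait */
	if (mutex_lock_interruptible(&my_mutex))
		return -EINTR;
	mutex_unlock(&my_mutex);

	/* killable: only a fatal signal, e.g. SIGKILL from the OOM
	 * killer as in the pcpu_alloc() patch in this thread, aborts
	 * the wait */
	if (mutex_lock_killable(&my_mutex))
		return -EINTR;
	mutex_unlock(&my_mutex);

	/* io: task is accounted as in iowait while blocked */
	mutex_lock_io(&my_mutex);
	mutex_unlock(&my_mutex);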
^ permalink raw reply related [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
2018-03-15 10:48 ` Tetsuo Handa
2018-03-15 12:09 ` Kirill Tkhai
2018-03-15 14:09 ` Tetsuo Handa
2018-03-15 14:42 ` Kirill Tkhai
2018-03-19 15:13 ` Tejun Heo
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
2018-03-15 13:18 ` Matthew Wilcox
2018-03-15 13:23 ` Kirill Tkhai
2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
2018-03-19 16:39 ` Tejun Heo