* [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
@ 2018-03-14 11:51 Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
0 siblings, 2 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-14 11:51 UTC (permalink / raw)
To: akpm, tj, cl, linux-mm, linux-kernel
In case of memory deficit and low percpu memory pages,
pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
time (as it makes memory allocations itself and waits
for memory reclaim). If tasks doing pcpu_alloc() are
chosen by the OOM killer, they can't exit, because they
are waiting for the mutex.

This patch makes pcpu_alloc() care about pending fatal
signals and use mutex_lock_killable() when the GFP flags
allow it. This guarantees that a task does not miss
SIGKILL from the OOM killer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
mm/percpu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 50e7fdf84055..212b4988926c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1369,8 +1369,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		return NULL;
 	}
 
-	if (!is_atomic)
-		mutex_lock(&pcpu_alloc_mutex);
+	if (!is_atomic) {
+		if (gfp & __GFP_NOFAIL)
+			mutex_lock(&pcpu_alloc_mutex);
+		else if (mutex_lock_killable(&pcpu_alloc_mutex))
+			return NULL;
+	}
 
 	spin_lock_irqsave(&pcpu_lock, flags);
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
@ 2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-14 13:55 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: akpm, cl, linux-mm, linux-kernel
On Wed, Mar 14, 2018 at 02:51:48PM +0300, Kirill Tkhai wrote:
> In case of memory deficit and low percpu memory pages,
> pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> This patch makes pcpu_alloc() care about pending fatal
> signals and use mutex_lock_killable() when the GFP flags
> allow it. This guarantees that a task does not miss
> SIGKILL from the OOM killer.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied to percpu/for-4.16-fixes.
Thanks, Kirill.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
@ 2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
` (2 more replies)
1 sibling, 3 replies; 20+ messages in thread
From: Andrew Morton @ 2018-03-14 20:56 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: tj, cl, linux-mm, linux-kernel
On Wed, 14 Mar 2018 14:51:48 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> In case of memory deficit and low percpu memory pages,
> pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> This patch makes pcpu_alloc() care about pending fatal
> signals and use mutex_lock_killable() when the GFP flags
> allow it. This guarantees that a task does not miss
> SIGKILL from the OOM killer.
>
> ...
>
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1369,8 +1369,12 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  		return NULL;
>  	}
> 
> -	if (!is_atomic)
> -		mutex_lock(&pcpu_alloc_mutex);
> +	if (!is_atomic) {
> +		if (gfp & __GFP_NOFAIL)
> +			mutex_lock(&pcpu_alloc_mutex);
> +		else if (mutex_lock_killable(&pcpu_alloc_mutex))
> +			return NULL;
> +	}
It would benefit from a comment explaining why we're doing this (it's
for the oom-killer).
My memory is weak and our documentation is awful. What does
mutex_lock_killable() actually do and how does it differ from
mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
wonder if there's any way in which a userspace-delivered signal can
disrupt another userspace task's memory allocation attempt?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 20:56 ` Andrew Morton
@ 2018-03-14 22:09 ` Tejun Heo
2018-03-14 22:22 ` Andrew Morton
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2018-03-14 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
Hello, Andrew.
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> It would benefit from a comment explaining why we're doing this (it's
> for the oom-killer).
Will add.
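Something like this, probably (a sketch; the exact comment wording that
lands in the tree may differ):

	if (!is_atomic) {
		/*
		 * pcpu_balance_workfn() allocates memory under this mutex,
		 * and it may wait for memory reclaim, so an OOM victim
		 * sleeping here could never exit.  Allow the waiter to be
		 * killed unless the caller forbids failure.
		 */
		if (gfp & __GFP_NOFAIL)
			mutex_lock(&pcpu_alloc_mutex);
		else if (mutex_lock_killable(&pcpu_alloc_mutex))
			return NULL;
	}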
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
IIRC, killable listens only to SIGKILL.
> wonder if there's any way in which a userspace-delivered signal can
> disrupt another userspace task's memory allocation attempt?
Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
failure paths might be broken, so there are some risks. Given that
the cases where userspace tasks end up allocating percpu memory are
pretty limited and/or privileged (like mount, bpf), I don't think the
risks are high tho.
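For illustration, the caller-side contract this relies on, sketched with
a made-up percpu user: every non-atomic pcpu_alloc() caller must already
tolerate a NULL return, and those rarely exercised failure paths are
exactly the risk:

	struct foo_stats {			/* hypothetical user */
		u64 packets;
	};

	static struct foo_stats __percpu *foo_stats_alloc(void)
	{
		/*
		 * With the patch, this can now fail not only when memory
		 * runs out but also when the caller receives a fatal
		 * signal while waiting for pcpu_alloc_mutex.
		 */
		return alloc_percpu(struct foo_stats);
	}

Callers that already check for NULL keep working unchanged.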
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:09 ` Tejun Heo
@ 2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
2018-03-19 15:13 ` Tejun Heo
0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2018-03-14 22:22 UTC (permalink / raw)
To: Tejun Heo; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
> Hello, Andrew.
>
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > It would benefit from a comment explaining why we're doing this (it's
> > for the oom-killer).
>
> Will add.
>
> > My memory is weak and our documentation is awful. What does
> > mutex_lock_killable() actually do and how does it differ from
> > mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>
> IIRC, killable listens only to SIGKILL.
>
> > wonder if there's any way in which a userspace-delivered signal can
> > disrupt another userspace task's memory allocation attempt?
>
> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
> failure paths might be broken, so there are some risks. Given that
> the cases where userspace tasks end up allocating percpu memory are
> pretty limited and/or privileged (like mount, bpf), I don't think the
> risks are high tho.
hm. spose so. Maybe. Are there other ways? I assume the time is
being spent in pcpu_create_chunk()? We could drop the mutex while
running that stuff and take the appropriate did-we-race-with-someone
testing after retaking it. Or similar.
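Roughly this shape, I mean (hypothetical sketch; the did-we-race helper
is invented):

	mutex_unlock(&pcpu_alloc_mutex);
	chunk = pcpu_create_chunk();		/* may sleep in reclaim */
	mutex_lock(&pcpu_alloc_mutex);
	if (!chunk)
		goto fail;
	/* re-check: someone may have created usable space meanwhile */
	if (pcpu_enough_free_space(size, align))	/* invented helper */
		pcpu_destroy_chunk(chunk);	/* we lost the race */
	else
		pcpu_chunk_relocate(chunk, -1);	/* publish our chunk */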
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:22 ` Andrew Morton
@ 2018-03-15 8:58 ` Kirill Tkhai
2018-03-15 10:48 ` Tetsuo Handa
2018-03-19 15:13 ` Tejun Heo
1 sibling, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 8:58 UTC (permalink / raw)
To: Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 01:22, Andrew Morton wrote:
> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>
>> Hello, Andrew.
>>
>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>> It would benefit from a comment explaining why we're doing this (it's
>>> for the oom-killer).
>>
>> Will add.
>>
>>> My memory is weak and our documentation is awful. What does
>>> mutex_lock_killable() actually do and how does it differ from
>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>
>> IIRC, killable listens only to SIGKILL.
>>
>>> wonder if there's any way in which a userspace-delivered signal can
>>> disrupt another userspace task's memory allocation attempt?
>>
>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>> failure paths might be broken, so there are some risks. Given that
>> the cases where userspace tasks end up allocating percpu memory are
>> pretty limited and/or privileged (like mount, bpf), I don't think the
>> risks are high tho.
>
> hm. spose so. Maybe. Are there other ways? I assume the time is
> being spent in pcpu_create_chunk()? We could drop the mutex while
> running that stuff and take the appropriate did-we-race-with-someone
> testing after retaking it. Or similar.
The balance work spends its time in pcpu_populate_chunk(). Here are
two stack traces showing the problem:
[ 106.313267] kworker/2:2 D13832 936 2 0x80000000
[ 106.313740] Workqueue: events pcpu_balance_workfn
[ 106.314109] Call Trace:
[ 106.314293] ? __schedule+0x267/0x750
[ 106.314570] schedule+0x2d/0x90
[ 106.314803] schedule_timeout+0x17f/0x390
[ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
[ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
[ 106.315792] __alloc_pages_nodemask+0x16a/0x210
[ 106.316148] pcpu_populate_chunk+0xce/0x300
[ 106.316479] pcpu_balance_workfn+0x3f3/0x580
[ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
[ 106.317227] ? finish_task_switch+0x8d/0x250
[ 106.317632] process_one_work+0x1b7/0x410
[ 106.317970] worker_thread+0x26/0x3d0
[ 106.318304] ? process_one_work+0x410/0x410
[ 106.318649] kthread+0x10e/0x130
[ 106.318916] ? __kthread_create_worker+0x120/0x120
[ 106.319360] ret_from_fork+0x35/0x40
[ 106.453375] a.out D13400 3670 1 0x00100004
[ 106.453880] Call Trace:
[ 106.454114] ? __schedule+0x267/0x750
[ 106.454427] schedule+0x2d/0x90
[ 106.454829] schedule_preempt_disabled+0xf/0x20
[ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
[ 106.455988] ? pcpu_alloc+0x3c4/0x670
[ 106.456465] pcpu_alloc+0x3c4/0x670
[ 106.456973] ? preempt_count_add+0x63/0x90
[ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
[ 106.457882] ipv6_add_dev+0x121/0x490
[ 106.458330] addrconf_notify+0x27b/0x9a0
[ 106.458823] ? inetdev_init+0xd7/0x150
[ 106.459270] ? inetdev_event+0x339/0x4b0
[ 106.459738] ? preempt_count_add+0x63/0x90
[ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
[ 106.460747] ? notifier_call_chain+0x42/0x60
[ 106.461271] notifier_call_chain+0x42/0x60
[ 106.461819] register_netdevice+0x415/0x530
[ 106.462364] register_netdev+0x11/0x20
[ 106.462849] loopback_net_init+0x43/0x90
[ 106.463216] ops_init+0x3b/0x100
[ 106.463516] setup_net+0x7d/0x150
[ 106.463831] copy_net_ns+0x14b/0x180
[ 106.464134] create_new_namespaces+0x117/0x1b0
[ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
[ 106.464864] SyS_unshare+0x1b0/0x300
[ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 8:58 ` Kirill Tkhai
@ 2018-03-15 10:48 ` Tetsuo Handa
2018-03-15 12:09 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2018-03-15 10:48 UTC (permalink / raw)
To: Kirill Tkhai, Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 2018/03/15 17:58, Kirill Tkhai wrote:
> On 15.03.2018 01:22, Andrew Morton wrote:
>> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>>
>>> Hello, Andrew.
>>>
>>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>>> It would benefit from a comment explaining why we're doing this (it's
>>>> for the oom-killer).
>>>
>>> Will add.
>>>
>>>> My memory is weak and our documentation is awful. What does
>>>> mutex_lock_killable() actually do and how does it differ from
>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>
>>> IIRC, killable listens only to SIGKILL.
I think that killable listens to any signal which results in termination of
that process. For example, if a process is configured to terminate upon SIGINT,
fatal_signal_pending() becomes true upon SIGINT.
>>>
>>>> wonder if there's any way in which a userspace-delivered signal can
>>>> disrupt another userspace task's memory allocation attempt?
>>>
>>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>>> failure paths might be broken, so there are some risks. Given that
>>> the cases where userspace tasks end up allocating percpu memory are
>>> pretty limited and/or privileged (like mount, bpf), I don't think the
>>> risks are high tho.
>>
>> hm. spose so. Maybe. Are there other ways? I assume the time is
>> being spent in pcpu_create_chunk()? We could drop the mutex while
>> running that stuff and take the appropriate did-we-race-with-someone
>> testing after retaking it. Or similar.
>
> The balance work spends its time in pcpu_populate_chunk(). Here are
> two stack traces showing the problem:
Will you show me more context? Except on CONFIG_MMU=n kernels, the OOM reaper
reclaims memory from the OOM victim. Therefore, "if tasks doing pcpu_alloc()
are chosen by the OOM killer, they can't exit, because they are waiting for the
mutex" should not cause problems. Of course, giving up upon SIGKILL is nice
regardless.
>
> [ 106.313267] kworker/2:2 D13832 936 2 0x80000000
> [ 106.313740] Workqueue: events pcpu_balance_workfn
> [ 106.314109] Call Trace:
> [ 106.314293] ? __schedule+0x267/0x750
> [ 106.314570] schedule+0x2d/0x90
> [ 106.314803] schedule_timeout+0x17f/0x390
> [ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
> [ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
> [ 106.315792] __alloc_pages_nodemask+0x16a/0x210
> [ 106.316148] pcpu_populate_chunk+0xce/0x300
> [ 106.316479] pcpu_balance_workfn+0x3f3/0x580
> [ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
> [ 106.317227] ? finish_task_switch+0x8d/0x250
> [ 106.317632] process_one_work+0x1b7/0x410
> [ 106.317970] worker_thread+0x26/0x3d0
> [ 106.318304] ? process_one_work+0x410/0x410
> [ 106.318649] kthread+0x10e/0x130
> [ 106.318916] ? __kthread_create_worker+0x120/0x120
> [ 106.319360] ret_from_fork+0x35/0x40
>
> [ 106.453375] a.out D13400 3670 1 0x00100004
> [ 106.453880] Call Trace:
> [ 106.454114] ? __schedule+0x267/0x750
> [ 106.454427] schedule+0x2d/0x90
> [ 106.454829] schedule_preempt_disabled+0xf/0x20
> [ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
> [ 106.455988] ? pcpu_alloc+0x3c4/0x670
> [ 106.456465] pcpu_alloc+0x3c4/0x670
> [ 106.456973] ? preempt_count_add+0x63/0x90
> [ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
> [ 106.457882] ipv6_add_dev+0x121/0x490
> [ 106.458330] addrconf_notify+0x27b/0x9a0
> [ 106.458823] ? inetdev_init+0xd7/0x150
> [ 106.459270] ? inetdev_event+0x339/0x4b0
> [ 106.459738] ? preempt_count_add+0x63/0x90
> [ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
> [ 106.460747] ? notifier_call_chain+0x42/0x60
> [ 106.461271] notifier_call_chain+0x42/0x60
> [ 106.461819] register_netdevice+0x415/0x530
> [ 106.462364] register_netdev+0x11/0x20
> [ 106.462849] loopback_net_init+0x43/0x90
> [ 106.463216] ops_init+0x3b/0x100
> [ 106.463516] setup_net+0x7d/0x150
> [ 106.463831] copy_net_ns+0x14b/0x180
> [ 106.464134] create_new_namespaces+0x117/0x1b0
> [ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
> [ 106.464864] SyS_unshare+0x1b0/0x300
>
> [ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
Neither of these stack traces is blocked at mutex_lock().

Why were all OOM-killable threads killed? Were there only a few?
Does pcpu_alloc() allocate enough to deplete the memory reserves?
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH] Improve mutex documentation
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
@ 2018-03-15 11:58 ` Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
` (2 more replies)
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2 siblings, 3 replies; 20+ messages in thread
From: Matthew Wilcox @ 2018-03-15 11:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Kirill Tkhai, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()?
From: Matthew Wilcox <mawilcox@microsoft.com>
Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
kernel-doc for mutex_lock_interruptible().
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 858a07590e39..2048359f33d2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -1082,15 +1082,16 @@ static noinline int __sched
 __mutex_lock_interruptible_slowpath(struct mutex *lock);
 
 /**
- * mutex_lock_interruptible - acquire the mutex, interruptible
- * @lock: the mutex to be acquired
+ * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
+ * @lock: The mutex to be acquired.
  *
- * Lock the mutex like mutex_lock(), and return 0 if the mutex has
- * been acquired or sleep until the mutex becomes available. If a
- * signal arrives while waiting for the lock then this function
- * returns -EINTR.
+ * Lock the mutex like mutex_lock(). If a signal is delivered while the
+ * process is sleeping, this function will return without acquiring the
+ * mutex.
  *
- * This function is similar to (but not equivalent to) down_interruptible().
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * signal arrived.
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
@@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
 
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
+/**
+ * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). If a signal which will be fatal to
+ * the current process is delivered while the process is sleeping, this
+ * function will return without acquiring the mutex.
+ *
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * fatal signal arrived.
+ */
 int __sched mutex_lock_killable(struct mutex *lock)
 {
 	might_sleep();
@@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
+/**
+ * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). While the task is waiting for this
+ * mutex, it will be accounted as being in the IO wait state by the
+ * scheduler.
+ *
+ * Context: Process context.
+ */
 void __sched mutex_lock_io(struct mutex *lock)
 {
 	int token;
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 10:48 ` Tetsuo Handa
@ 2018-03-15 12:09 ` Kirill Tkhai
2018-03-15 14:09 ` Tetsuo Handa
0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 12:09 UTC (permalink / raw)
To: Tetsuo Handa, Andrew Morton, Tejun Heo; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 13:48, Tetsuo Handa wrote:
> On 2018/03/15 17:58, Kirill Tkhai wrote:
>> On 15.03.2018 01:22, Andrew Morton wrote:
>>> On Wed, 14 Mar 2018 15:09:09 -0700 Tejun Heo <tj@kernel.org> wrote:
>>>
>>>> Hello, Andrew.
>>>>
>>>> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>>>>> It would benefit from a comment explaining why we're doing this (it's
>>>>> for the oom-killer).
>>>>
>>>> Will add.
>>>>
>>>>> My memory is weak and our documentation is awful. What does
>>>>> mutex_lock_killable() actually do and how does it differ from
>>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>>
>>>> IIRC, killable listens only to SIGKILL.
>
> I think that killable listens to any signal which results in termination of
> that process. For example, if a process is configured to terminate upon SIGINT,
> fatal_signal_pending() becomes true upon SIGINT.
It shouldn't act on SIGINT:
static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
>>>>
>>>>> wonder if there's any way in which a userspace-delivered signal can
>>>>> disrupt another userspace task's memory allocation attempt?
>>>>
>>>> Hmm... maybe. Just honoring SIGKILL *should* be fine but the alloc
>>>> failure paths might be broken, so there are some risks. Given that
>>>> the cases where userspace tasks end up allocating percpu memory are
>>>> pretty limited and/or privileged (like mount, bpf), I don't think the
>>>> risks are high tho.
>>>
>>> hm. spose so. Maybe. Are there other ways? I assume the time is
>>> being spent in pcpu_create_chunk()? We could drop the mutex while
>>> running that stuff and take the appropriate did-we-race-with-someone
>>> testing after retaking it. Or similar.
>>
>> The balance work spends its time in pcpu_populate_chunk(). Here are
>> two stack traces showing the problem:
>
> Will you show me more context? Except on CONFIG_MMU=n kernels, the OOM reaper
> reclaims memory from the OOM victim. Therefore, "if tasks doing pcpu_alloc()
> are chosen by the OOM killer, they can't exit, because they are waiting for the
> mutex" should not cause problems. Of course, giving up upon SIGKILL is nice
> regardless.
There is a test case which drives my 4-CPU VM to OOM:
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
	int i;

	/* fork 2^8 = 256 workers, each creating net namespaces in a loop */
	for (i = 0; i < 8; i++)
		fork();
	daemon(1, 1);
	while (1)
		unshare(CLONE_NEWNET);
}
The problem is that net namespace init/exit methods are not designed to be executed in parallel,
and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
what I've done in net-next.git, if you are interested.

The pcpu_alloc()-related OOM happens on stable kernels, and it's easy to trigger with the test.
pcpu is not the only problem there, but it's one of them: there is a logically possible
OOM deadlock in the pcpu code, as described in the patch description, and the patch fixes it.

Stepping back from this specific problem to the general one, I think all allocating/registering
actions in the kernel should use killable primitives if they are allowed to fail, and the generic
policy should be mutex_lock_killable() instead of mutex_lock(). Otherwise, OOM victims can't die
while they are waiting for a mutex held by a process doing reclaim. That creates circular
dependencies and makes OOM badness accounting useless, when it must not be so.
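As a sketch of that policy, applied to an invented registration path
that is allowed to fail:

	static DEFINE_MUTEX(thing_mutex);	/* hypothetical subsystem lock */
	static LIST_HEAD(thing_list);

	static int register_thing(struct thing *t)
	{
		/*
		 * Was: mutex_lock(&thing_mutex); -- an OOM victim could
		 * sleep here forever while the lock holder reclaims.
		 */
		if (mutex_lock_killable(&thing_mutex))
			return -EINTR;		/* the victim can exit now */
		list_add(&t->node, &thing_list);
		mutex_unlock(&thing_mutex);
		return 0;
	}

Of course, this conversion is only valid where every caller already
handles failure.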
>>
>> [ 106.313267] kworker/2:2 D13832 936 2 0x80000000
>> [ 106.313740] Workqueue: events pcpu_balance_workfn
>> [ 106.314109] Call Trace:
>> [ 106.314293] ? __schedule+0x267/0x750
>> [ 106.314570] schedule+0x2d/0x90
>> [ 106.314803] schedule_timeout+0x17f/0x390
>> [ 106.315106] ? __next_timer_interrupt+0xc0/0xc0
>> [ 106.315429] __alloc_pages_slowpath+0xb73/0xd90
>> [ 106.315792] __alloc_pages_nodemask+0x16a/0x210
>> [ 106.316148] pcpu_populate_chunk+0xce/0x300
>> [ 106.316479] pcpu_balance_workfn+0x3f3/0x580
>> [ 106.316853] ? _raw_spin_unlock_irq+0xe/0x30
>> [ 106.317227] ? finish_task_switch+0x8d/0x250
>> [ 106.317632] process_one_work+0x1b7/0x410
>> [ 106.317970] worker_thread+0x26/0x3d0
>> [ 106.318304] ? process_one_work+0x410/0x410
>> [ 106.318649] kthread+0x10e/0x130
>> [ 106.318916] ? __kthread_create_worker+0x120/0x120
>> [ 106.319360] ret_from_fork+0x35/0x40
>>
>> [ 106.453375] a.out D13400 3670 1 0x00100004
>> [ 106.453880] Call Trace:
>> [ 106.454114] ? __schedule+0x267/0x750
>> [ 106.454427] schedule+0x2d/0x90
>> [ 106.454829] schedule_preempt_disabled+0xf/0x20
>> [ 106.455422] __mutex_lock.isra.2+0x181/0x4d0
>> [ 106.455988] ? pcpu_alloc+0x3c4/0x670
>> [ 106.456465] pcpu_alloc+0x3c4/0x670
>> [ 106.456973] ? preempt_count_add+0x63/0x90
>> [ 106.457401] ? __local_bh_enable_ip+0x2e/0x60
>> [ 106.457882] ipv6_add_dev+0x121/0x490
>> [ 106.458330] addrconf_notify+0x27b/0x9a0
>> [ 106.458823] ? inetdev_init+0xd7/0x150
>> [ 106.459270] ? inetdev_event+0x339/0x4b0
>> [ 106.459738] ? preempt_count_add+0x63/0x90
>> [ 106.460243] ? _raw_spin_lock_irq+0xf/0x30
>> [ 106.460747] ? notifier_call_chain+0x42/0x60
>> [ 106.461271] notifier_call_chain+0x42/0x60
>> [ 106.461819] register_netdevice+0x415/0x530
>> [ 106.462364] register_netdev+0x11/0x20
>> [ 106.462849] loopback_net_init+0x43/0x90
>> [ 106.463216] ops_init+0x3b/0x100
>> [ 106.463516] setup_net+0x7d/0x150
>> [ 106.463831] copy_net_ns+0x14b/0x180
>> [ 106.464134] create_new_namespaces+0x117/0x1b0
>> [ 106.464481] unshare_nsproxy_namespaces+0x5b/0x90
>> [ 106.464864] SyS_unshare+0x1b0/0x300
>>
>> [ 106.536845] Kernel panic - not syncing: Out of memory and no killable processes...
>
> Neither of these stack traces is blocked at mutex_lock().
>
> Why were all OOM-killable threads killed? Were there only a few?
> Does pcpu_alloc() allocate enough to deplete the memory reserves?
The test eats all kmem, so OOM kills everything; that is due to slow net namespace
destruction. But this patch is about the "half"-deadlock between pcpu_alloc() and the worker,
which slows down OOM reaping. The possibility is real, and it's good to fix it.
I've seen a crash while waiting on the mutex, but I did not save it. It seems the test
may reproduce it after some time. With the patch applied I don't see pcpu-related crashes
in pcpu_alloc() at all.
Kirill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
@ 2018-03-15 12:12 ` Kirill Tkhai
2018-03-15 13:18 ` Matthew Wilcox
2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 12:12 UTC (permalink / raw)
To: Matthew Wilcox, Andrew Morton
Cc: tj, cl, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
Mauro Carvalho Chehab, Peter Zijlstra, Ingo Molnar
Hi, Matthew,
On 15.03.2018 14:58, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
>> My memory is weak and our documentation is awful. What does
>> mutex_lock_killable() actually do and how does it differ from
>> mutex_lock_interruptible()?
>
> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
> kernel-doc for mutex_lock_interruptible().
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 858a07590e39..2048359f33d2 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -1082,15 +1082,16 @@ static noinline int __sched
> __mutex_lock_interruptible_slowpath(struct mutex *lock);
>
> /**
> - * mutex_lock_interruptible - acquire the mutex, interruptible
> - * @lock: the mutex to be acquired
> + * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
> + * @lock: The mutex to be acquired.
> *
> - * Lock the mutex like mutex_lock(), and return 0 if the mutex has
> - * been acquired or sleep until the mutex becomes available. If a
> - * signal arrives while waiting for the lock then this function
> - * returns -EINTR.
> + * Lock the mutex like mutex_lock(). If a signal is delivered while the
> + * process is sleeping, this function will return without acquiring the
> + * mutex.
> *
> - * This function is similar to (but not equivalent to) down_interruptible().
> + * Context: Process context.
> + * Return: 0 if the lock was successfully acquired or %-EINTR if a
> + * signal arrived.
> */
> int __sched mutex_lock_interruptible(struct mutex *lock)
> {
> @@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
>
> EXPORT_SYMBOL(mutex_lock_interruptible);
>
> +/**
> + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
Shouldn't we clarify that fatal signals are SIGKILL only?
> + * @lock: The mutex to be acquired.
> + *
> + * Lock the mutex like mutex_lock(). If a signal which will be fatal to
> + * the current process is delivered while the process is sleeping, this
> + * function will return without acquiring the mutex.
> + *
> + * Context: Process context.
> + * Return: 0 if the lock was successfully acquired or %-EINTR if a
> + * fatal signal arrived.
> + */
> int __sched mutex_lock_killable(struct mutex *lock)
> {
> might_sleep();
> @@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
> }
> EXPORT_SYMBOL(mutex_lock_killable);
>
> +/**
> + * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
> + * @lock: The mutex to be acquired.
> + *
> + * Lock the mutex like mutex_lock(). While the task is waiting for this
> + * mutex, it will be accounted as being in the IO wait state by the
> + * scheduler.
> + *
> + * Context: Process context.
> + */
> void __sched mutex_lock_io(struct mutex *lock)
> {
> int token;
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 12:12 ` Kirill Tkhai
@ 2018-03-15 13:18 ` Matthew Wilcox
2018-03-15 13:23 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2018-03-15 13:18 UTC (permalink / raw)
To: Kirill Tkhai
Cc: Andrew Morton, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On Thu, Mar 15, 2018 at 03:12:30PM +0300, Kirill Tkhai wrote:
> > +/**
> > + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
>
> Shouldn't we clarify that fatal signals are SIGKILL only?
It's more complicated than it might seem (... welcome to signal handling!)
If you send SIGINT to a task that's waiting in mutex_lock_killable(), it will
still die. I *think* that's due to the code in complete_signal():
	if (sig_fatal(p, sig) &&
	    !(signal->flags & SIGNAL_GROUP_EXIT) &&
	    !sigismember(&t->real_blocked, sig) &&
	    (sig == SIGKILL || !p->ptrace)) {
		...
		sigaddset(&t->pending.signal, SIGKILL);
You're correct that this code only checks for SIGKILL, but any fatal
signal will result in the signal group receiving SIGKILL.
Unless I've misunderstood, and it wouldn't be the first time I've
misunderstood signal handling.
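For reference, the sig_fatal() test that branch depends on lives in
include/linux/signal.h; quoting it from memory (so double-check the
tree), it is roughly:

	#define sig_fatal(t, signr) \
		(!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \
		 (t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL)

i.e. the signal's default action is termination and the task has not
installed a handler for it.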
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 13:18 ` Matthew Wilcox
@ 2018-03-15 13:23 ` Kirill Tkhai
0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 13:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, tj, cl, linux-mm, linux-kernel, linux-doc,
Jonathan Corbet, Mauro Carvalho Chehab, Peter Zijlstra,
Ingo Molnar
On 15.03.2018 16:18, Matthew Wilcox wrote:
> On Thu, Mar 15, 2018 at 03:12:30PM +0300, Kirill Tkhai wrote:
>>> +/**
>>> + * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
>>
>> Shouldn't we clarify that fatal signals are SIGKILL only?
>
> It's more complicated than it might seem (... welcome to signal handling!)
> If you send SIGINT to a task that's waiting on a mutex_killable(), it will
> still die. I *think* that's due to the code in complete_signal():
>
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> (sig == SIGKILL || !p->ptrace)) {
> ...
> sigaddset(&t->pending.signal, SIGKILL);
>
> You're correct that this code only checks for SIGKILL, but any fatal
> signal will result in the signal group receiving SIGKILL.
>
> Unless I've misunderstood, and it wouldn't be the first time I've
> misunderstood signal handling.
Sure, thanks for the explanation.
Kirill
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 12:09 ` Kirill Tkhai
@ 2018-03-15 14:09 ` Tetsuo Handa
2018-03-15 14:42 ` Kirill Tkhai
0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2018-03-15 14:09 UTC (permalink / raw)
To: ktkhai, akpm, tj; +Cc: cl, linux-mm, linux-kernel
Kirill Tkhai wrote:
> >>>>> My memory is weak and our documentation is awful. What does
> >>>>> mutex_lock_killable() actually do and how does it differ from
> >>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
> >>>>
> >>>> IIRC, killable listens only to SIGKILL.
> >
> > I think that killable listens to any signal which results in termination of
> > that process. For example, if a process is configured to terminate upon SIGINT,
> > fatal_signal_pending() becomes true upon SIGINT.
>
> It shouldn't act on SIGINT:
>
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }
>
> static inline int fatal_signal_pending(struct task_struct *p)
> {
> return signal_pending(p) && __fatal_signal_pending(p);
> }
>
Really? Compile the module below and try to load it with the insmod command.
----------------------------------------
#include <linux/module.h>
#include <linux/sched/signal.h>
static int __init test_init(void)
{
	static DEFINE_MUTEX(lock);

	mutex_lock(&lock);
	printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n",
	       signal_pending(current), fatal_signal_pending(current));
	/* The lock is already held, so this blocks until a signal that
	 * would terminate the process arrives. */
	if (mutex_lock_killable(&lock)) {
		printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n",
		       signal_pending(current), fatal_signal_pending(current));
		mutex_unlock(&lock);
		return -EINTR;
	}
	mutex_unlock(&lock);
	mutex_unlock(&lock);
	return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------------------------------------
What you will see (apart from a lockdep warning) upon SIGINT or SIGHUP is
signal_pending()=0 fatal_signal_pending()=0
signal_pending()=1 fatal_signal_pending()=1
which means that fatal_signal_pending() becomes true without SIGKILL.
If insmod is executed via a nohup wrapper, insmod does not terminate upon SIGHUP.
> The problem is that net namespace init/exit methods are not designed to be executed in parallel,
> and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
> what I've done in net-next.git, if you are interested.
I see. Despite your patch, torture tests using your test case still allow an OOM panic.
----------------------------------------
[ 860.420677] Out of memory: Kill process 12727 (a.out) score 0 or sacrifice child
[ 860.423228] Killed process 12727 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.428125] oom_reaper: reaped process 12727 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.438257] Out of memory: Kill process 12728 (a.out) score 0 or sacrifice child
[ 860.440709] Killed process 12728 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.445840] oom_reaper: reaped process 12728 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.456815] Out of memory: Kill process 12729 (a.out) score 0 or sacrifice child
[ 860.459618] Killed process 12729 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.464686] oom_reaper: reaped process 12729 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.489807] Out of memory: Kill process 12730 (a.out) score 0 or sacrifice child
[ 860.492495] Killed process 12730 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.501268] oom_reaper: reaped process 12730 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.536786] Out of memory: Kill process 12731 (a.out) score 0 or sacrifice child
[ 860.539392] Killed process 12731 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.544130] oom_reaper: reaped process 12731 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.553587] Out of memory: Kill process 12732 (a.out) score 0 or sacrifice child
[ 860.556359] Killed process 12732 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.559639] oom_reaper: reaped process 12732 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.564972] Out of memory: Kill process 12733 (a.out) score 0 or sacrifice child
[ 860.567603] Killed process 12733 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 860.573416] oom_reaper: reaped process 12733 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 860.579675] Out of memory: Kill process 762 (dbus-daemon) score 0 or sacrifice child
[ 860.582334] Killed process 762 (dbus-daemon) total-vm:24560kB, anon-rss:480kB, file-rss:0kB, shmem-rss:0kB
[ 860.590607] systemd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 860.594065] systemd cpuset=/ mems_allowed=0
[ 860.596172] CPU: 1 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 4.16.0-rc5-next-20180315 #695
[ 860.599401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 860.602676] Call Trace:
[ 860.604118] dump_stack+0x5f/0x8b
[ 860.605741] dump_header+0x69/0x431
[ 860.607380] ? rcu_read_unlock_special+0x2cc/0x2f0
[ 860.609342] out_of_memory+0x4d8/0x720
[ 860.611044] __alloc_pages_nodemask+0x12c5/0x1410
[ 860.613041] filemap_fault+0x479/0x640
[ 860.614725] __xfs_filemap_fault.constprop.0+0x5f/0x1f0
[ 860.616717] __do_fault+0x15/0xa0
[ 860.618294] __handle_mm_fault+0xcb2/0x1140
[ 860.620031] handle_mm_fault+0x186/0x350
[ 860.621720] __do_page_fault+0x2a7/0x510
[ 860.623402] do_page_fault+0x2c/0x2a0
[ 860.624999] ? page_fault+0x2f/0x50
[ 860.626548] page_fault+0x45/0x50
[ 860.628045] RIP: 61fa2380:0x55c6609099a0
[ 860.629805] RSP: 608a920b:00007ffc83a98620 EFLAGS: 7fdbd590e740
[ 860.630698] Mem-Info:
[ 860.634428] active_anon:3783 inactive_anon:3987 isolated_anon:0
[ 860.634428] active_file:3 inactive_file:0 isolated_file:0
[ 860.634428] unevictable:0 dirty:0 writeback:0 unstable:0
[ 860.634428] slab_reclaimable:124666 slab_unreclaimable:694094
[ 860.634428] mapped:37 shmem:6270 pagetables:2087 bounce:0
[ 860.634428] free:21037 free_pcp:299 free_cma:0
[ 860.646361] Node 0 active_anon:15132kB inactive_anon:15948kB active_file:12kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:148kB dirty:0kB writeback:0kB shmem:25080kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 860.653706] Node 0 DMA free:14828kB min:284kB low:352kB high:420kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 860.660693] lowmem_reserve[]: 0 2684 3642 3642
[ 860.662661] Node 0 DMA32 free:53532kB min:49596kB low:61992kB high:74388kB active_anon:3420kB inactive_anon:5048kB active_file:192kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2771556kB mlocked:0kB kernel_stack:5184kB pagetables:7680kB bounce:0kB free_pcp:444kB local_pcp:76kB free_cma:0kB
[ 860.671523] lowmem_reserve[]: 0 0 958 958
[ 860.673464] Node 0 Normal free:15616kB min:17696kB low:22120kB high:26544kB active_anon:11948kB inactive_anon:10900kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:981136kB mlocked:0kB kernel_stack:2640kB pagetables:636kB bounce:0kB free_pcp:892kB local_pcp:648kB free_cma:0kB
[ 860.682664] lowmem_reserve[]: 0 0 0 0
[ 860.684412] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (E) 1*256kB (E) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 2*4096kB (M) = 14828kB
[ 860.688963] Node 0 DMA32: 565*4kB (UM) 568*8kB (UM) 1037*16kB (UM) 42*32kB (ME) 26*64kB (UME) 15*128kB (ME) 15*256kB (UM) 8*512kB (ME) 7*1024kB (UME) 3*2048kB (M) 1*4096kB (M) = 53668kB
[ 860.694161] Node 0 Normal: 67*4kB (ME) 1218*8kB (UME) 348*16kB (UME) 1*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15612kB
[ 860.698195] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 860.701105] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 860.703682] 6267 total pagecache pages
[ 860.705388] 0 pages in swap cache
[ 860.707457] Swap cache stats: add 0, delete 0, find 0/0
[ 860.709624] Free swap = 0kB
[ 860.711105] Total swap = 0kB
[ 860.712841] 1048445 pages RAM
[ 860.714320] 0 pages HighMem/MovableOnly
[ 860.715970] 106296 pages reserved
[ 860.717625] 0 pages hwpoisoned
[ 860.719086] Unreclaimable slab info:
[ 860.720677] Name Used Total
[ 860.722685] scsi_sense_cache 44KB 44KB
[ 860.724517] RAWv6 237460KB 237460KB
[ 860.726353] TCPv6 118KB 118KB
[ 860.728183] sgpool-128 192KB 192KB
[ 860.730434] sgpool-16 64KB 64KB
[ 860.732303] mqueue_inode_cache 31KB 31KB
[ 860.734302] xfs_buf 584KB 640KB
[ 860.736188] xfs_ili 134KB 134KB
[ 860.738154] xfs_efd_item 110KB 110KB
[ 860.740163] xfs_trans 31KB 31KB
[ 860.742707] xfs_ifork 108KB 108KB
[ 860.744781] xfs_da_state 63KB 63KB
[ 860.747019] xfs_btree_cur 31KB 31KB
[ 860.748845] bio-2 47KB 47KB
[ 860.750647] UNIX 273KB 273KB
[ 860.752534] RAW 277018KB 277018KB
[ 860.754339] UDP 19861KB 19861KB
[ 860.756221] tw_sock_TCP 7KB 7KB
[ 860.758115] request_sock_TCP 7KB 7KB
[ 860.759851] TCP 120KB 120KB
[ 860.761508] hugetlbfs_inode_cache 63KB 63KB
[ 860.763700] eventpoll_pwq 15KB 15KB
[ 860.765423] inotify_inode_mark 52KB 52KB
[ 860.767707] request_queue 94KB 94KB
[ 860.769379] blkdev_ioc 39KB 39KB
[ 860.771060] biovec-(1<<(21-12)) 784KB 912KB
[ 860.772759] biovec-128 192KB 192KB
[ 860.775225] biovec-64 128KB 128KB
[ 860.777187] uid_cache 15KB 15KB
[ 860.779341] dmaengine-unmap-2 16KB 16KB
[ 860.781330] skbuff_head_cache 184KB 216KB
[ 860.783001] file_lock_cache 31KB 31KB
[ 860.784593] file_lock_ctx 15KB 15KB
[ 860.786172] net_namespace 55270KB 55270KB
[ 860.787706] shmem_inode_cache 980KB 980KB
[ 860.789349] task_delay_info 138KB 179KB
[ 860.791090] taskstats 23KB 23KB
[ 860.792623] proc_dir_entry 142317KB 142317KB
[ 860.794287] pde_opener 15KB 15KB
[ 860.796069] seq_file 31KB 31KB
[ 860.797641] sigqueue 19KB 19KB
[ 860.799060] kernfs_node_cache 173844KB 173844KB
[ 860.800572] mnt_cache 141KB 141KB
[ 860.802101] filp 656KB 656KB
[ 860.803624] names_cache 256KB 256KB
[ 860.805107] key_jar 31KB 31KB
[ 860.806655] vm_area_struct 1293KB 1726KB
[ 860.808074] mm_struct 789KB 1012KB
[ 860.809482] files_cache 1330KB 1330KB
[ 860.811777] signal_cache 998KB 1606KB
[ 860.813835] sighand_cache 1610KB 1990KB
[ 860.815698] task_struct 3795KB 4766KB
[ 860.817247] cred_jar 368KB 460KB
[ 860.818819] anon_vma 1457KB 1747KB
[ 860.820406] pid 1534KB 2196KB
[ 860.822081] Acpi-Operand 480KB 480KB
[ 860.823955] Acpi-State 27KB 27KB
[ 860.825518] Acpi-Namespace 179KB 179KB
[ 860.827060] numa_policy 15KB 15KB
[ 860.828772] trace_event_file 90KB 90KB
[ 860.830696] ftrace_event_field 95KB 95KB
[ 860.832241] pool_workqueue 144KB 144KB
[ 860.833746] task_group 599KB 630KB
[ 860.835272] page->ptl 362KB 411KB
[ 860.836755] dma-kmalloc-512 16KB 16KB
[ 860.838274] kmalloc-8192 211320KB 211320KB
[ 860.839748] kmalloc-4096 631748KB 631748KB
[ 860.841248] kmalloc-2048 529296KB 529296KB
[ 860.843094] kmalloc-1024 79300KB 79300KB
[ 860.844575] kmalloc-512 192744KB 192744KB
[ 860.847059] kmalloc-256 24292KB 24292KB
[ 860.848747] kmalloc-192 56959KB 56959KB
[ 860.850504] kmalloc-128 77732KB 77732KB
[ 860.852197] kmalloc-96 1426KB 1519KB
[ 860.853936] kmalloc-64 19192KB 19192KB
[ 860.855454] kmalloc-32 1368KB 1368KB
[ 860.856988] kmalloc-16 336KB 336KB
[ 860.858478] kmalloc-8 864KB 864KB
[ 860.859990] kmem_cache_node 20KB 20KB
[ 860.861448] kmem_cache 78KB 78KB
[ 860.863006] Kernel panic - not syncing: Out of memory and no killable processes...
[ 860.863006]
[ 860.866069] CPU: 3 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 4.16.0-rc5-next-20180315 #695
[ 860.868829] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 860.871926] Call Trace:
[ 860.872850] dump_stack+0x5f/0x8b
[ 860.873965] panic+0xde/0x231
[ 860.875005] out_of_memory+0x4e4/0x720
[ 860.876218] __alloc_pages_nodemask+0x12c5/0x1410
[ 860.877645] filemap_fault+0x479/0x640
[ 860.878842] __xfs_filemap_fault.constprop.0+0x5f/0x1f0
[ 860.880542] __do_fault+0x15/0xa0
[ 860.881674] __handle_mm_fault+0xcb2/0x1140
[ 860.883004] handle_mm_fault+0x186/0x350
[ 860.884283] __do_page_fault+0x2a7/0x510
[ 860.885576] do_page_fault+0x2c/0x2a0
[ 860.886805] ? page_fault+0x2f/0x50
[ 860.888003] page_fault+0x45/0x50
[ 860.889169] RIP: 61fa2380:0x55c6609099a0
----------------------------------------
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-15 14:09 ` Tetsuo Handa
@ 2018-03-15 14:42 ` Kirill Tkhai
0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-15 14:42 UTC (permalink / raw)
To: Tetsuo Handa, akpm, tj, willy; +Cc: cl, linux-mm, linux-kernel
On 15.03.2018 17:09, Tetsuo Handa wrote:
> Kirill Tkhai wrote:
>>>>>>> My memory is weak and our documentation is awful. What does
>>>>>>> mutex_lock_killable() actually do and how does it differ from
>>>>>>> mutex_lock_interruptible()? Userspace tasks can run pcpu_alloc() and I
>>>>>>
>>>>>> IIRC, killable listens only to SIGKILL.
>>>
>>> I think that killable listens to any signal which results in termination of
>>> that process. For example, if a process is configured to terminate upon SIGINT,
>>> fatal_signal_pending() becomes true upon SIGINT.
>>
>> It shouldn't act on SIGINT:
>>
>> static inline int __fatal_signal_pending(struct task_struct *p)
>> {
>> return unlikely(sigismember(&p->pending.signal, SIGKILL));
>> }
>>
>> static inline int fatal_signal_pending(struct task_struct *p)
>> {
>> return signal_pending(p) && __fatal_signal_pending(p);
>> }
>>
>
> Really? Compile below module and try to load using insmod command.
>
> ----------------------------------------
> #include <linux/module.h>
> #include <linux/sched/signal.h>
>
> static int __init test_init(void)
> {
> static DEFINE_MUTEX(lock);
>
> mutex_lock(&lock);
> printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n", signal_pending(current), fatal_signal_pending(current));
> if (mutex_lock_killable(&lock)) {
> printk(KERN_INFO "signal_pending()=%d fatal_signal_pending()=%d\n", signal_pending(current), fatal_signal_pending(current));
> mutex_unlock(&lock);
> return -EINTR;
> }
> mutex_unlock(&lock);
> mutex_unlock(&lock);
> return -EINVAL;
> }
>
> module_init(test_init);
> MODULE_LICENSE("GPL");
> ----------------------------------------
>
> What you will see (apart from lockdep warning) upon SIGINT or SIGHUP is
>
> signal_pending()=0 fatal_signal_pending()=0
> signal_pending()=1 fatal_signal_pending()=1
>
> which means that fatal_signal_pending() becomes true without SIGKILL.
> If insmod is executed via nohup wrapper, insmod does not terminate upon SIGHUP.
Matthew already pointed that out. Thanks for the explanation again :)
>> The problem is that net namespace init/exit methods are not designed to be executed in parallel,
>> and an exclusive mutex is used there. I'm working on a solution at the moment; you may find
>> what I've done in net-next.git, if you are interested.
>
> I see. Despite your patch, torture tests using your test case still allow an OOM panic.
I know. There are several problems. But a fresh net-next.git with this patch and these two patchsets:

https://patchwork.ozlabs.org/project/netdev/list/?series=33829
https://patchwork.ozlabs.org/project/netdev/list/?series=33949

does not run into OOM during the test.

Despite that, there is still a lot of work to be done.
Kirill
> [full OOM report quoted from the previous message snipped]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Improve mutex documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
@ 2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2018-03-16 13:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Kirill Tkhai, tj, cl, linux-mm, linux-kernel,
linux-doc, Jonathan Corbet, Mauro Carvalho Chehab, Ingo Molnar
On Thu, Mar 15, 2018 at 04:58:12AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > My memory is weak and our documentation is awful. What does
> > mutex_lock_killable() actually do and how does it differ from
> > mutex_lock_interruptible()?
>
> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
> kernel-doc for mutex_lock_interruptible().
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Thanks Matthew!
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
@ 2018-03-19 15:13 ` Tejun Heo
1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 15:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
Hello, Andrew.
On Wed, Mar 14, 2018 at 03:22:03PM -0700, Andrew Morton wrote:
> hm. spose so. Maybe. Are there other ways? I assume the time is
> being spent in pcpu_create_chunk()? We could drop the mutex while
> running that stuff and take the appropriate did-we-race-with-someone
> testing after retaking it. Or similar.
I'm not sure that'd change much. Ultimately, isn't the choice between
being able to return NULL and waiting for more memory? If we decide
to return NULL, it doesn't make a difference where we do that from,
right?
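For reference, the drop-and-recheck shape Andrew describes would look
roughly like this. A sketch only: pcpu_need_more_space() is a
hypothetical re-check, not an existing mm/percpu.c helper, and the
pcpu_lock handling around the chunk lists is elided:

	mutex_unlock(&pcpu_alloc_mutex);
	chunk = pcpu_create_chunk();	/* slow path: may sleep in reclaim */
	mutex_lock(&pcpu_alloc_mutex);
	if (chunk) {
		if (pcpu_need_more_space())	/* did we race with someone? */
			pcpu_chunk_relocate(chunk, -1);
		else
			pcpu_destroy_chunk(chunk);	/* lost the race, drop ours */
	}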
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
@ 2018-03-19 15:14 ` Tejun Heo
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
2 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 15:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kirill Tkhai, cl, linux-mm, linux-kernel
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> > + if (!is_atomic) {
> > + if (gfp & __GFP_NOFAIL)
> > + mutex_lock(&pcpu_alloc_mutex);
> > + else if (mutex_lock_killable(&pcpu_alloc_mutex))
> > + return NULL;
> > + }
>
> It would benefit from a comment explaining why we're doing this (it's
> for the oom-killer).
And, yeah, this would be great. Kirill, can you please send a patch
to add a comment there?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2] mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
@ 2018-03-19 15:32 ` Kirill Tkhai
2018-03-19 16:39 ` Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2018-03-19 15:32 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton; +Cc: cl, linux-mm, linux-kernel
From: Kirill Tkhai <ktkhai@virtuozzo.com>
In case of a memory deficit and few free percpu pages,
pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
time (as it makes memory allocations itself and waits
for memory reclaim). If tasks doing pcpu_alloc() are
chosen by the OOM killer, they can't exit, because they
are waiting for the mutex.
The patch makes pcpu_alloc() watch for a fatal signal
and use mutex_lock_killable() when the GFP flags allow
it. This guarantees that a task does not miss SIGKILL
from the OOM killer.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
v2: Added an explanatory comment
mm/percpu.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 50e7fdf84055..605e3228baa6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1369,8 +1369,17 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
return NULL;
}
- if (!is_atomic)
- mutex_lock(&pcpu_alloc_mutex);
+ if (!is_atomic) {
+ /*
+ * pcpu_balance_workfn() allocates memory under this mutex,
+ * and it may wait for memory reclaim. Allow the current task
+ * to become an OOM victim under memory pressure.
+ */
+ if (gfp & __GFP_NOFAIL)
+ mutex_lock(&pcpu_alloc_mutex);
+ else if (mutex_lock_killable(&pcpu_alloc_mutex))
+ return NULL;
+ }
spin_lock_irqsave(&pcpu_lock, flags);
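For illustration, and not part of the patch: a killable GFP_KERNEL
caller now sees a fatal signal as an ordinary allocation failure, so
existing NULL checks keep working unchanged. A minimal sketch with a
hypothetical caller:

	int __percpu *ctr;

	/* may return NULL either on real allocation failure or because
	 * the task received SIGKILL while waiting for pcpu_alloc_mutex */
	ctr = alloc_percpu_gfp(int, GFP_KERNEL);
	if (!ctr)
		return -ENOMEM;
	free_percpu(ctr);
	return 0;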
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2] mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
@ 2018-03-19 16:39 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2018-03-19 16:39 UTC (permalink / raw)
To: Kirill Tkhai; +Cc: Andrew Morton, cl, linux-mm, linux-kernel
On Mon, Mar 19, 2018 at 06:32:10PM +0300, Kirill Tkhai wrote:
> From: Kirill Tkhai <ktkhai@virtuozzo.com>
>
> In case of a memory deficit and few free percpu pages,
> pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
> time (as it makes memory allocations itself and waits
> for memory reclaim). If tasks doing pcpu_alloc() are
> chosen by the OOM killer, they can't exit, because they
> are waiting for the mutex.
>
> The patch makes pcpu_alloc() watch for a fatal signal
> and use mutex_lock_killable() when the GFP flags allow
> it. This guarantees that a task does not miss SIGKILL
> from the OOM killer.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied to percpu/for-4.16-fixes.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* [tip:locking/urgent] locking/mutex: Improve documentation
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
2018-03-16 13:57 ` Peter Zijlstra
@ 2018-03-20 11:07 ` tip-bot for Matthew Wilcox
2 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Matthew Wilcox @ 2018-03-20 11:07 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, peterz, hpa, mchehab, corbet, linux-kernel, ktkhai,
mawilcox, mingo, paulmck, akpm, torvalds
Commit-ID: 45dbac0e288350f9a4226a5b4b651ed434dd9f85
Gitweb: https://git.kernel.org/tip/45dbac0e288350f9a4226a5b4b651ed434dd9f85
Author: Matthew Wilcox <mawilcox@microsoft.com>
AuthorDate: Thu, 15 Mar 2018 04:58:12 -0700
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 20 Mar 2018 08:07:41 +0100
locking/mutex: Improve documentation
On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
> My memory is weak and our documentation is awful. What does
> mutex_lock_killable() actually do and how does it differ from
> mutex_lock_interruptible()?
Add kernel-doc for mutex_lock_killable() and mutex_lock_io(). Reword the
kernel-doc for mutex_lock_interruptible().
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cl@linux.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20180315115812.GA9949@bombadil.infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/locking/mutex.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 858a07590e39..2048359f33d2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -1082,15 +1082,16 @@ static noinline int __sched
__mutex_lock_interruptible_slowpath(struct mutex *lock);
/**
- * mutex_lock_interruptible - acquire the mutex, interruptible
- * @lock: the mutex to be acquired
+ * mutex_lock_interruptible() - Acquire the mutex, interruptible by signals.
+ * @lock: The mutex to be acquired.
*
- * Lock the mutex like mutex_lock(), and return 0 if the mutex has
- * been acquired or sleep until the mutex becomes available. If a
- * signal arrives while waiting for the lock then this function
- * returns -EINTR.
+ * Lock the mutex like mutex_lock(). If a signal is delivered while the
+ * process is sleeping, this function will return without acquiring the
+ * mutex.
*
- * This function is similar to (but not equivalent to) down_interruptible().
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * signal arrived.
*/
int __sched mutex_lock_interruptible(struct mutex *lock)
{
@@ -1104,6 +1105,18 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
EXPORT_SYMBOL(mutex_lock_interruptible);
+/**
+ * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). If a signal which will be fatal to
+ * the current process is delivered while the process is sleeping, this
+ * function will return without acquiring the mutex.
+ *
+ * Context: Process context.
+ * Return: 0 if the lock was successfully acquired or %-EINTR if a
+ * fatal signal arrived.
+ */
int __sched mutex_lock_killable(struct mutex *lock)
{
might_sleep();
@@ -1115,6 +1128,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
}
EXPORT_SYMBOL(mutex_lock_killable);
+/**
+ * mutex_lock_io() - Acquire the mutex and mark the process as waiting for I/O
+ * @lock: The mutex to be acquired.
+ *
+ * Lock the mutex like mutex_lock(). While the task is waiting for this
+ * mutex, it will be accounted as being in the IO wait state by the
+ * scheduler.
+ *
+ * Context: Process context.
+ */
void __sched mutex_lock_io(struct mutex *lock)
{
int token;
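A short usage sketch of the variants documented above (illustrative
only; my_mutex is a placeholder, and all calls assume process context):

	static DEFINE_MUTEX(my_mutex);

	/* interruptible: any signal aborts the wait */
	if (mutex_lock_interruptible(&my_mutex))
		return -EINTR;
	mutex_unlock(&my_mutex);

	/* killable: only a fatal signal, e.g. SIGKILL from the OOM
	 * killer as in the pcpu_alloc() patch in this thread, aborts
	 * the wait */
	if (mutex_lock_killable(&my_mutex))
		return -EINTR;
	mutex_unlock(&my_mutex);

	/* io: task is accounted as in iowait while blocked */
	mutex_lock_io(&my_mutex);
	mutex_unlock(&my_mutex);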
^ permalink raw reply related [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
2018-03-14 11:51 [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Kirill Tkhai
2018-03-14 13:55 ` Tejun Heo
2018-03-14 20:56 ` Andrew Morton
2018-03-14 22:09 ` Tejun Heo
2018-03-14 22:22 ` Andrew Morton
2018-03-15 8:58 ` Kirill Tkhai
2018-03-15 10:48 ` Tetsuo Handa
2018-03-15 12:09 ` Kirill Tkhai
2018-03-15 14:09 ` Tetsuo Handa
2018-03-15 14:42 ` Kirill Tkhai
2018-03-19 15:13 ` Tejun Heo
2018-03-15 11:58 ` [PATCH] Improve mutex documentation Matthew Wilcox
2018-03-15 12:12 ` Kirill Tkhai
2018-03-15 13:18 ` Matthew Wilcox
2018-03-15 13:23 ` Kirill Tkhai
2018-03-16 13:57 ` Peter Zijlstra
2018-03-20 11:07 ` [tip:locking/urgent] locking/mutex: Improve documentation tip-bot for Matthew Wilcox
2018-03-19 15:14 ` [PATCH] percpu: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() Tejun Heo
2018-03-19 15:32 ` [PATCH v2] mm: " Kirill Tkhai
2018-03-19 16:39 ` Tejun Heo