From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Wenwei Tao <wenwei.tww@alibaba-inc.com>,
	Oleg Nesterov <oleg@redhat.com>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
Date: Fri, 4 Aug 2017 09:42:12 +0200
Message-ID: <20170804074212.GA26029@dhcp22.suse.cz>
In-Reply-To: <201708040646.v746kkhC024636@www262.sakura.ne.jp>

On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> >                          So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
> 
> Oh, simple generic_perform_write() ?
> 
> > 
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
> 
> OK, I confirmed that this problem is easily reproducible using below reproducer.

Yeah, I can imagine this could be triggered artificially. I am somewhat
more skeptical that real-life OOM scenarios would trigger this, though.
Anyway, thanks for your test case!
 
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger something lock related problem.
> Is your patch really safe?

> ----------
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] ------------[ cut here ]------------
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY, in which case mmap_sem has already been dropped. My patch
overwrites this error code, so the page fault path doesn't know that the
lock is no longer held and releases it unconditionally. This is a
preexisting problem introduced by 3f70dc38cec2 ("mm: make sure that
kthreads will not refault oom reaped memory"). I should have considered
this option.
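
The imbalance described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not kernel code:
the `*_model` names and the depth counter are hypothetical stand-ins, the
flag values are illustrative, and `mm_unstable` stands for the whole
PF_KTHREAD/MMF_UNSTABLE condition. It also assumes that VM_FAULT_RETRY
always means the handler dropped the lock.

```c
/* Illustrative stand-ins for the kernel's fault flags. */
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_RETRY  0x0400

/* Toy counter standing in for the read side of mmap_sem. */
static int mmap_sem_depth;

static void down_read_model(void) { mmap_sem_depth++; }
static void up_read_model(void)   { mmap_sem_depth--; }

/*
 * Models the fault handler: when it wants a retry it drops the lock
 * itself and reports that with VM_FAULT_RETRY. If mm_unstable is set,
 * the result is rewritten to VM_FAULT_SIGBUS, which hides the fact
 * that the lock is already gone.
 */
static int handle_mm_fault_model(int wants_retry, int mm_unstable)
{
	int ret = 0;

	if (wants_retry) {
		up_read_model();
		ret = VM_FAULT_RETRY;
	}
	if (mm_unstable)
		ret = VM_FAULT_SIGBUS;
	return ret;
}

/*
 * Models the arch fault path: take the lock, call the handler, and
 * release the lock unless VM_FAULT_RETRY says the handler already did.
 * Returns the final lock depth: 0 is balanced, a negative value means
 * the lock was released twice.
 */
static int page_fault_model(int wants_retry, int mm_unstable)
{
	int ret;

	mmap_sem_depth = 0;
	down_read_model();
	ret = handle_mm_fault_model(wants_retry, mm_unstable);
	if (!(ret & VM_FAULT_RETRY))
		up_read_model();
	return mmap_sem_depth;
}
```

In this model only the combination of a retry and the SIGBUS rewrite
leaves the depth negative, which corresponds to the
DEBUG_LOCKS_WARN_ON(depth <= 0) warning in the report.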

I believe the easiest way around this is the following patch:
---
From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
 SIGBUS

Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat:
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. The MMF_UNSTABLE check, however,
rewrites the error code to VM_FAULT_SIGBUS, and callers always expect
mmap_sem to be held in that path. Fix this by retaking mmap_sem in the
MMF_UNSTABLE path when VM_FAULT_RETRY was returned. We cannot simply add
VM_FAULT_SIGBUS to the existing error code because all arch-specific
page fault handlers and g-u-p would have to learn a new error code
combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs
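
For readers following along, the effect of the patch can be checked
against a simple caller contract in user space. This is a hedged sketch,
not the kernel implementation: `struct fault_result`, `contract_ok()`
and `enforce_sigbus()` are hypothetical names, the flag values are
illustrative, and the model assumes that VM_FAULT_RETRY always means the
lock was dropped (any case where a retry is reported with the lock still
held is ignored).

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's fault flags. */
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_RETRY  0x0400

struct fault_result {
	int ret;        /* fault flags handed back to the caller */
	bool lock_held; /* whether the modeled mmap_sem is still held */
};

/*
 * Contract assumed by the callers in this model: the lock is held on
 * return exactly when VM_FAULT_RETRY is not set.
 */
static bool contract_ok(struct fault_result r)
{
	return r.lock_held == !(r.ret & VM_FAULT_RETRY);
}

/*
 * Models the MMF_UNSTABLE branch. With retake set, the lock is taken
 * again before the result is rewritten, mirroring the down_read() added
 * by the patch; without it, the result is rewritten blindly.
 */
static struct fault_result enforce_sigbus(struct fault_result r, bool retake)
{
	if (retake && (r.ret & VM_FAULT_RETRY))
		r.lock_held = true; /* models down_read(&mm->mmap_sem) */
	r.ret = VM_FAULT_SIGBUS;
	return r;
}
```

Rewriting a retry result without retaking the lock breaks the contract,
while retaking it first keeps the contract intact, so callers can keep
unlocking unconditionally on VM_FAULT_SIGBUS.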
