All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-03 13:59 ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-03 13:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Wenwei Tao, Oleg Nesterov, Tetsuo Handa,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies is not
entirely true. __task_will_free_mem consider a task dying when
SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads
_after_ the flag is set. So there is a race window when some threads
won't have fatal_signal_pending while the oom_reaper could start
unmapping the address space. generic_perform_write could then write
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to happen but it
would be better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfault on any page fault after the oom reaper started its work.
This means that nobody will ever observe a potentially corrupted
content. Formerly we cared only about use_mm users because those can
outlive the oom victim quite easily but having the process itself
protected sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report so this is merely a
fix of a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
Wenwei has contacted me off list and this is a result of the dicussion.
I do not think this would be serious enough to warrant a stable backport
even though the description might sound scary. The race is highly unlikely.

 mm/memory.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..3d8bfeaca38a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,13 +3874,9 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
 		ret = VM_FAULT_SIGBUS;
 
-- 
2.13.2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-03 13:59 ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-03 13:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Wenwei Tao, Oleg Nesterov, Tetsuo Handa,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies is not
entirely true. __task_will_free_mem consider a task dying when
SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads
_after_ the flag is set. So there is a race window when some threads
won't have fatal_signal_pending while the oom_reaper could start
unmapping the address space. generic_perform_write could then write
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to happen but it
would be better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfault on any page fault after the oom reaper started its work.
This means that nobody will ever observe a potentially corrupted
content. Formerly we cared only about use_mm users because those can
outlive the oom victim quite easily but having the process itself
protected sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report so this is merely a
fix of a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
Wenwei has contacted me off list and this is a result of the dicussion.
I do not think this would be serious enough to warrant a stable backport
even though the description might sound scary. The race is highly unlikely.

 mm/memory.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..3d8bfeaca38a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,13 +3874,9 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
 		ret = VM_FAULT_SIGBUS;
 
-- 
2.13.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-03 13:59 ` Michal Hocko
  (?)
@ 2017-08-04  6:46 ` Tetsuo Handa
  2017-08-04  7:42     ` Michal Hocko
  -1 siblings, 1 reply; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04  6:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, Tetsuo Handa,
	David Rientjes, LKML, Michal Hocko

Michal Hocko wrote:
>                          So there is a race window when some threads
> won't have fatal_signal_pending while the oom_reaper could start
> unmapping the address space. generic_perform_write could then write
> zero page to the page cache and corrupt data.

Oh, simple generic_perform_write() ?

> 
> The race window is rather small and close to impossible to happen but it
> would be better to have it covered.

OK, I confirmed that this problem is easily reproducible using below reproducer.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>

#define NUMTHREADS 512
#define STACKSIZE 8192

static int pipe_fd[2] = { EOF, EOF };
static int file_writer(void *i)
{
	char buffer[4096] = { };
	int fd;
	snprintf(buffer, sizeof(buffer), "/tmp/file.%lu", (unsigned long) i);
	fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	memset(buffer, 0xFF, sizeof(buffer));
	read(pipe_fd[0], buffer, 1);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

int main(int argc, char *argv[])
{
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	char *stack;
	if (pipe(pipe_fd))
		return 1;
	stack = malloc(STACKSIZE * NUMTHREADS);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < NUMTHREADS; i++)
                if (clone(file_writer, stack + (i + 1) * STACKSIZE,
			  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_FS |
			  CLONE_FILES, (void *) i) == -1)
                        break;
	close(pipe_fd[1]);
	/* Will cause OOM due to overcommit; if not use SysRq-f */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	kill(-1, SIGKILL);
	return 0;
}
----------
$ cat /tmp/file.* | od -b | head
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
307730000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
307740000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
316600000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
316610000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
----------

Applying your patch seems to avoid this problem, but as far as I tested
your patch seems to trivially trigger something lock related problem.
Is your patch really safe?

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170804.txt.xz 
and config is at http://I-love.SAKURA.ne.jp/tmp/config-20170804 .

----------
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30
[   58.671629] RIP: 0033:0x40092f
[   58.674221] RSP: 002b:00007fb931f99ff0 EFLAGS: 00010217
[   58.677556] RAX: 0000000000001000 RBX: 00007fb931f99ff0 RCX: 00007fb93224ec90
[   58.681489] RDX: 0000000000001000 RSI: 00007fb931f99ff0 RDI: 0000000000000117
[   58.685297] RBP: 0000000000000117 R08: 00007fb9321ae938 R09: 000000000000000d
[   58.689123] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000100000000
[   58.692879] R13: 00007fb731f59010 R14: 0000000000000000 R15: 00007fb731f59010
[   58.696588] Code: 5e 41 5f 5d c3 e8 df a7 26 00 85 c0 74 1f 8b 35 2d 2f df 01 85 f6 75 15 48 c7 c6 66 7c a0 a3 48 c7 c7 5b 41 a0 a3 e8 6a 14 01 00 <0f> ff 4c 89 fa 4c 89 ee 48 89 df e8 fe c8 ff ff eb 88 48 c7 c7 
[   58.705635] ---[ end trace 91ff0f99e79ee485 ]---
[   58.831028] oom_reaper: reaped process 1056 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------

----------
[  187.202689] Out of memory: Kill process 2113 (a.out) score 734 or sacrifice child
[  187.208024] Killed process 2113 (a.out) total-vm:4268108kB, anon-rss:2735276kB, file-rss:0kB, shmem-rss:0kB
[  187.463902] oom_reaper: reaped process 2113 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  188.249973] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  188.249983] ------------[ cut here ]------------
[  188.257247] WARNING: CPU: 7 PID: 2313 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[  188.263282] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp pcspkr vmw_balloon ppdev i2c_piix4 vmw_vmci shpchp sg parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi ahci mptscsih ata_piix libahci e1000 mptbase libata
[  188.295888] CPU: 7 PID: 2313 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[  188.300975] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  188.307049] task: ffff8c4433840040 task.stack: ffff9459c660c000
[  188.311510] RIP: 0010:lock_release+0x172/0x1e0
[  188.315530] RSP: 0000:ffff9459c660fe58 EFLAGS: 00010082
[  188.319895] RAX: 000000000000001f RBX: ffff8c4433840040 RCX: 0000000000000000
[  188.324908] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff810d4ba4
[  188.329894] RBP: ffff9459c660fe98 R08: 0000000000000000 R09: 0000000000000001
[  188.334724] R10: 0000000000000000 R11: 000000000000001f R12: ffff9459c660ff58
[  188.339707] R13: ffff8c4434644c90 R14: 0000000000000000 R15: ffffffff8105cb6e
[  188.344553] FS:  00007f0e2aca8740(0000) GS:ffff8c4439fc0000(0000) knlGS:0000000000000000
[  188.349616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  188.353835] CR2: 00007f0e2a665ff0 CR3: 00000001346a5005 CR4: 00000000000606e0
[  188.358604] Call Trace:
[  188.361539]  up_read+0x1a/0x40
[  188.364661]  __do_page_fault+0x28e/0x4c0
[  188.368059]  do_page_fault+0x30/0x80
[  188.371338]  page_fault+0x28/0x30
[  188.374426] RIP: 0033:0x7f0e2a7c6c90
[  188.377543] RSP: 002b:00007f0e2a46bfe8 EFLAGS: 00010246
[  188.381238] RAX: 0000000000001000 RBX: 00007f0e2a46bff0 RCX: 00007f0e2a7c6c90
[  188.385576] RDX: 0000000000001000 RSI: 00007f0e2a46bff0 RDI: 00000000000000fd
[  188.389886] RBP: 00000000000000fd R08: 00007f0e2a726938 R09: 000000000000000d
[  188.394148] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000100000000
[  188.398339] R13: 00007f0c2a4d1010 R14: 0000000000000000 R15: 00007f0c2a4d1010
[  188.402508] Code: 5e 41 5f 5d c3 e8 df a7 26 00 85 c0 74 1f 8b 35 2d 2f df 01 85 f6 75 15 48 c7 c6 66 7c a0 81 48 c7 c7 5b 41 a0 81 e8 6a 14 01 00 <0f> ff 4c 89 fa 4c 89 ee 48 89 df e8 fe c8 ff ff eb 88 48 c7 c7 
[  188.412704] ---[ end trace d42863c48bb12d0a ]---
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  6:46 ` Tetsuo Handa
@ 2017-08-04  7:42     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  7:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> >                          So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
> 
> Oh, simple generic_perform_write() ?
> 
> > 
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
> 
> OK, I confirmed that this problem is easily reproducible using below reproducer.

Yeah, I can imagine this could be triggered artificially. I am somehow
more skeptical about real life oom scenarios to trigger this though.
Anyway, thanks for your test case!
 
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger something lock related problem.
> Is your patch really safe?

> ----------
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] ------------[ cut here ]------------
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY when the caller drops mmap_sem. My patch overwrites the
this error code so the page fault path doesn't know that the lock is no
longer held and releases is unconditionally. This is a preexisting
problem introduced by 3f70dc38cec2 ("mm: make sure that kthreads will
not refault oom reaped memory"). I should have considered this option.

I believe the easiest way around this is the following patch
---
>From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
 SIGBUS

Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS to the
existing error code because all arch specific page fault handlers and
g-u-p would have to learn a new error code combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04  7:42     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  7:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> >                          So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
> 
> Oh, simple generic_perform_write() ?
> 
> > 
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
> 
> OK, I confirmed that this problem is easily reproducible using below reproducer.

Yeah, I can imagine this could be triggered artificially. I am somehow
more skeptical about real life oom scenarios to trigger this though.
Anyway, thanks for your test case!
 
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger something lock related problem.
> Is your patch really safe?

> ----------
> [   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   58.557480] ------------[ cut here ]------------
> [   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [   58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [   58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [   58.657280] Call Trace:
> [   58.659989]  up_read+0x1a/0x40
> [   58.662825]  __do_page_fault+0x28e/0x4c0
> [   58.665946]  do_page_fault+0x30/0x80
> [   58.668911]  page_fault+0x28/0x30

OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY when the caller drops mmap_sem. My patch overwrites the
this error code so the page fault path doesn't know that the lock is no
longer held and releases is unconditionally. This is a preexisting
problem introduced by 3f70dc38cec2 ("mm: make sure that kthreads will
not refault oom reaped memory"). I should have considered this option.

I believe the easiest way around this is the following patch
---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  7:42     ` Michal Hocko
  (?)
@ 2017-08-04  8:25     ` Tetsuo Handa
  2017-08-04  8:32         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  -1 siblings, 2 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04  8:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

Well, while lockdep warning is gone, this problem is remaining.

diff --git a/mm/memory.c b/mm/memory.c
index edabf6f..1e06c29 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
        /*
         * This mm has been already reaped by the oom reaper and so the
         * refault cannot be trusted in general. Anonymous refaults would
-        * lose data and give a zero page instead e.g. This is especially
-        * problem for use_mm() because regular tasks will just die and
-        * the corrupted data will not be visible anywhere while kthread
-        * will outlive the oom victim and potentially propagate the date
-        * further.
+        * lose data and give a zero page instead e.g.
         */
-       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+       if (unlikely(!(ret & VM_FAULT_ERROR)
+                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+               if (ret & VM_FAULT_RETRY)
+                       down_read(&vma->vm_mm->mmap_sem);
                ret = VM_FAULT_SIGBUS;
+       }

        return ret;
 }

$ cat /tmp/file.* | od -b | head
0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
420330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
420340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*
457330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
*
457340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:25     ` Tetsuo Handa
@ 2017-08-04  8:32         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  1 sibling, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.

Ohh, I should have been more specific. Both patches have to be applied.
I have based this one first because it should go to stable. The later
one needs a trivial conflict resolution. I will send both of them as a
reply to this email!

Thanks for retesting. It matches my testing results.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04  8:32         ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.

Ohh, I should have been more specific. Both patches have to be applied.
I have based this one first because it should go to stable. The later
one needs a trivial conflict resolution. I will send both of them as a
reply to this email!

Thanks for retesting. It matches my testing results.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
  2017-08-04  8:32         ` Michal Hocko
@ 2017-08-04  8:33           ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS to the
existing error code because all arch specific page fault handlers and
g-u-p would have to learn a new error code combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
@ 2017-08-04  8:33           ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[   58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[   58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[   58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[   58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   58.557480] ------------[ cut here ]------------
[   58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[   58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[   58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[   58.613944] RIP: 0010:lock_release+0x172/0x1e0
[   58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[   58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[   58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[   58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[   58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[   58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[   58.644200] FS:  00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[   58.648989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[   58.657280] Call Trace:
[   58.659989]  up_read+0x1a/0x40
[   58.662825]  __do_page_fault+0x28e/0x4c0
[   58.665946]  do_page_fault+0x30/0x80
[   58.668911]  page_fault+0x28/0x30

The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
the MMF_UNSTABLE path. We cannot simply add VM_FAULT_SIGBUS to the
existing error code because all arch specific page fault handlers and
g-u-p would have to learn a new error code combination.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
-- 
2.13.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:33           ` Michal Hocko
@ 2017-08-04  8:33             ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies is not
entirely true. __task_will_free_mem consider a task dying when
SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads
_after_ the flag is set. So there is a race window when some threads
won't have fatal_signal_pending while the oom_reaper could start
unmapping the address space. generic_perform_write could then write
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to happen but it
would be better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfault on any page fault after the oom reaper started its work.
This means that nobody will ever observe a potentially corrupted
content. Formerly we cared only about use_mm users because those can
outlive the oom victim quite easily but having the process itself
protected sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report so this is merely a
fix of a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..e7308e633b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,15 +3874,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
 		/*
 		 * We are going to enforce SIGBUS but the PF path might have
 		 * dropped the mmap_sem already so take it again so that
-- 
2.13.2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04  8:33             ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  8:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Tetsuo Handa, Wenwei Tao, Oleg Nesterov,
	David Rientjes, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies is not
entirely true. __task_will_free_mem consider a task dying when
SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads
_after_ the flag is set. So there is a race window when some threads
won't have fatal_signal_pending while the oom_reaper could start
unmapping the address space. generic_perform_write could then write
zero page to the page cache and corrupt data.

The race window is rather small and close to impossible to happen but it
would be better to have it covered.

Fix this by extending the existing MMF_UNSTABLE check in handle_mm_fault
and segfault on any page fault after the oom reaper started its work.
This means that nobody will ever observe a potentially corrupted
content. Formerly we cared only about use_mm users because those can
outlive the oom victim quite easily but having the process itself
protected sounds like a reasonable thing to do as well.

There doesn't seem to be any real life bug report so this is merely a
fix of a theoretical bug.

Noticed-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4fe5b6254688..e7308e633b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3874,15 +3874,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/*
 	 * This mm has been already reaped by the oom reaper and so the
 	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g. This is especially
-	 * problem for use_mm() because regular tasks will just die and
-	 * the corrupted data will not be visible anywhere while kthread
-	 * will outlive the oom victim and potentially propagate the date
-	 * further.
+	 * lose data and give a zero page instead e.g.
 	 */
-	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
+	if (unlikely(!(ret & VM_FAULT_ERROR)
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-
 		/*
 		 * We are going to enforce SIGBUS but the PF path might have
 		 * dropped the mmap_sem already so take it again so that
-- 
2.13.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  8:25     ` Tetsuo Handa
@ 2017-08-04  9:16         ` Michal Hocko
  2017-08-04  9:16         ` Michal Hocko
  1 sibling, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  9:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index edabf6f..1e06c29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         /*
>          * This mm has been already reaped by the oom reaper and so the
>          * refault cannot be trusted in general. Anonymous refaults would
> -        * lose data and give a zero page instead e.g. This is especially
> -        * problem for use_mm() because regular tasks will just die and
> -        * the corrupted data will not be visible anywhere while kthread
> -        * will outlive the oom victim and potentially propagate the date
> -        * further.
> +        * lose data and give a zero page instead e.g.
>          */
> -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> +       if (unlikely(!(ret & VM_FAULT_ERROR)
> +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> +               if (ret & VM_FAULT_RETRY)
> +                       down_read(&vma->vm_mm->mmap_sem);
>                 ret = VM_FAULT_SIGBUS;
> +       }
> 
>         return ret;
>  }

I have re-read your email again and I guess I misread previously. Are
you saying that the data corruption happens with the both patches
applied?

> 
> $ cat /tmp/file.* | od -b | head
> 0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 420330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 420340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 457330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 457340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04  9:16         ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04  9:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Andrew Morton, Wenwei Tao, Oleg Nesterov, David Rientjes, LKML

On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> Well, while lockdep warning is gone, this problem is remaining.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index edabf6f..1e06c29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         /*
>          * This mm has been already reaped by the oom reaper and so the
>          * refault cannot be trusted in general. Anonymous refaults would
> -        * lose data and give a zero page instead e.g. This is especially
> -        * problem for use_mm() because regular tasks will just die and
> -        * the corrupted data will not be visible anywhere while kthread
> -        * will outlive the oom victim and potentially propagate the date
> -        * further.
> +        * lose data and give a zero page instead e.g.
>          */
> -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> +       if (unlikely(!(ret & VM_FAULT_ERROR)
> +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> +               if (ret & VM_FAULT_RETRY)
> +                       down_read(&vma->vm_mm->mmap_sem);
>                 ret = VM_FAULT_SIGBUS;
> +       }
> 
>         return ret;
>  }

I have re-read your email again and I guess I misread previously. Are
you saying that the data corruption happens with the both patches
applied?

> 
> $ cat /tmp/file.* | od -b | head
> 0000000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 420330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 420340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 457330000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
> *
> 457340000 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377 377
> *
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04  9:16         ` Michal Hocko
@ 2017-08-04 10:41           ` Tetsuo Handa
  -1 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04 10:41 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >         /*
> >          * This mm has been already reaped by the oom reaper and so the
> >          * refault cannot be trusted in general. Anonymous refaults would
> > -        * lose data and give a zero page instead e.g. This is especially
> > -        * problem for use_mm() because regular tasks will just die and
> > -        * the corrupted data will not be visible anywhere while kthread
> > -        * will outlive the oom victim and potentially propagate the date
> > -        * further.
> > +        * lose data and give a zero page instead e.g.
> >          */
> > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +               if (ret & VM_FAULT_RETRY)
> > +                       down_read(&vma->vm_mm->mmap_sem);
> >                 ret = VM_FAULT_SIGBUS;
> > +       }
> > 
> >         return ret;
> >  }
> 
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with the both patches
> applied?

Yes. Data corruption still happens.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04 10:41           ` Tetsuo Handa
  0 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04 10:41 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > Well, while lockdep warning is gone, this problem is remaining.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index edabf6f..1e06c29 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >         /*
> >          * This mm has been already reaped by the oom reaper and so the
> >          * refault cannot be trusted in general. Anonymous refaults would
> > -        * lose data and give a zero page instead e.g. This is especially
> > -        * problem for use_mm() because regular tasks will just die and
> > -        * the corrupted data will not be visible anywhere while kthread
> > -        * will outlive the oom victim and potentially propagate the date
> > -        * further.
> > +        * lose data and give a zero page instead e.g.
> >          */
> > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > +               if (ret & VM_FAULT_RETRY)
> > +                       down_read(&vma->vm_mm->mmap_sem);
> >                 ret = VM_FAULT_SIGBUS;
> > +       }
> > 
> >         return ret;
> >  }
> 
> I have re-read your email again and I guess I misread previously. Are
> you saying that the data corruption happens with the both patches
> applied?

Yes. Data corruption still happens.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 10:41           ` Tetsuo Handa
@ 2017-08-04 11:00             ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04 11:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > > Well, while lockdep warning is gone, this problem is remaining.
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index edabf6f..1e06c29 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >         /*
> > >          * This mm has been already reaped by the oom reaper and so the
> > >          * refault cannot be trusted in general. Anonymous refaults would
> > > -        * lose data and give a zero page instead e.g. This is especially
> > > -        * problem for use_mm() because regular tasks will just die and
> > > -        * the corrupted data will not be visible anywhere while kthread
> > > -        * will outlive the oom victim and potentially propagate the date
> > > -        * further.
> > > +        * lose data and give a zero page instead e.g.
> > >          */
> > > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > > +               if (ret & VM_FAULT_RETRY)
> > > +                       down_read(&vma->vm_mm->mmap_sem);
> > >                 ret = VM_FAULT_SIGBUS;
> > > +       }
> > > 
> > >         return ret;
> > >  }
> > 
> > I have re-read your email again and I guess I misread previously. Are
> > you saying that the data corruption happens with the both patches
> > applied?
> 
> Yes. Data corruption still happens.

I guess I managed to reproduce finally. Will investigate further.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04 11:00             ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04 11:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 04-08-17 17:25:46, Tetsuo Handa wrote:
> > > Well, while lockdep warning is gone, this problem is remaining.
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index edabf6f..1e06c29 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3931,15 +3931,14 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >         /*
> > >          * This mm has been already reaped by the oom reaper and so the
> > >          * refault cannot be trusted in general. Anonymous refaults would
> > > -        * lose data and give a zero page instead e.g. This is especially
> > > -        * problem for use_mm() because regular tasks will just die and
> > > -        * the corrupted data will not be visible anywhere while kthread
> > > -        * will outlive the oom victim and potentially propagate the date
> > > -        * further.
> > > +        * lose data and give a zero page instead e.g.
> > >          */
> > > -       if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > > -                               && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > > +       if (unlikely(!(ret & VM_FAULT_ERROR)
> > > +                    && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
> > > +               if (ret & VM_FAULT_RETRY)
> > > +                       down_read(&vma->vm_mm->mmap_sem);
> > >                 ret = VM_FAULT_SIGBUS;
> > > +       }
> > > 
> > >         return ret;
> > >  }
> > 
> > I have re-read your email again and I guess I misread previously. Are
> > you saying that the data corruption happens with the both patches
> > applied?
> 
> Yes. Data corruption still happens.

I guess I managed to reproduce finally. Will investigate further.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 11:00             ` Michal Hocko
@ 2017-08-04 14:56               ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04 14:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
[...]
> > Yes. Data corruption still happens.
> 
> I guess I managed to reproduce finally. Will investigate further.

One limitation of the current MMF_UNSTABLE implementation is that it
still keeps the new page mapped and only sends EFAULT/kill to the
consumer. If somebody tries to re-read the same content nothing will
really happen. I went this way because it was much simpler and memory
consumers usually do not retry on EFAULT. Maybe this is not the case
here.

I've been staring into iov_iter_copy_from_user_atomic which I
believe should be the common write path which reads the user buffer
where the corruption caused by the oom_reaper would come from.
iov_iter_fault_in_readable should be called before this function. If
this happened after MMF_UNSTABLE was set then we should get EFAULT and
bail out early. Let's assume this wasn't the case. Then we should get
down to iov_iter_copy_from_user_atomic and that one shouldn't copy any
data because __copy_from_user_inatomic says

 * If copying succeeds, the return value must be 0.  If some data cannot be
 * fetched, it is permitted to copy less than had been fetched; the only
 * hard requirement is that not storing anything at all (i.e. returning size)
 * should happen only when nothing could be copied.  In other words, you don't
 * have to squeeze as much as possible - it is allowed, but not necessary.

which should be our case.

I was testing with xfs (but generic_perform_write seem to be doing the
same thing) and that one does
		if (unlikely(copied == 0)) {
			/*
			 * If we were unable to copy any data at all, we must
			 * fall back to a single segment length write.
			 *
			 * If we didn't fallback here, we could livelock
			 * because not all segments in the iov can be copied at
			 * once without a pagefault.
			 */
			bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_single_seg_count(i));
			goto again;
		}

and that again will go through iov_iter_fault_in_readable again and that
will succeed now.

And that's why we still see the corruption. That, however, means that
the MMF_UNSTABLE implementation has to be more complex and we have to
hook into all anonymous memory fault paths which I hoped I could avoid
previously.

This is a rough first draft that passes the test case from Tetsuo on my
system. It will need much more eyes on it and I will return to it with a
fresh brain next week. I would appreciate as much testing as possible.

Note that this is on top of the previous attempt for the fix but I will
squash the result into one patch because the previous one is not
sufficient.
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..1fbc78d423d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	} else {
 		pmd_t entry;
 
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			spin_unlock(vmf->ptl);
+			ret = VM_FAULT_SIGBUS;
+			goto release;
+		}
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +619,13 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 
 	return 0;
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if (userfaultfd_missing(vma)) {
+			/*
+			 * range could have been already torn down by
+			 * the oom reaper
+			 */
+			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+				spin_unlock(vmf->ptl);
+				ret = VM_FAULT_SIGBUS;
+			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
diff --git a/mm/memory.c b/mm/memory.c
index e7308e633b52..7de9508e38e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	int ret = 0;
 	pte_t entry;
 
 	/* File mapping without ->vm_ops ? */
@@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2930,6 +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
+	/*
+	 * range could have been already torn down by
+	 * the oom reaper
+	 */
+	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+		ret = VM_FAULT_SIGBUS;
+		goto release;
+	}
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2949,7 +2967,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
@@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
 		page = vmf->cow_page;
 	else
 		page = vmf->page;
-	ret = alloc_set_pte(vmf, vmf->memcg, page);
+	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
+		ret = alloc_set_pte(vmf, vmf->memcg, page);
+	else
+		ret = VM_FAULT_SIGBUS;
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
@@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
-	/*
-	 * This mm has been already reaped by the oom reaper and so the
-	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g.
-	 */
-	if (unlikely(!(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-		/*
-		 * We are going to enforce SIGBUS but the PF path might have
-		 * dropped the mmap_sem already so take it again so that
-		 * we do not break expectations of all arch specific PF paths
-		 * and g-u-p
-		 */
-		if (ret & VM_FAULT_RETRY)
-			down_read(&vma->vm_mm->mmap_sem);
-		ret = VM_FAULT_SIGBUS;
-	}
-
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04 14:56               ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-08-04 14:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
[...]
> > Yes. Data corruption still happens.
> 
> I guess I managed to reproduce finally. Will investigate further.

One limitation of the current MMF_UNSTABLE implementation is that it
still keeps the new page mapped and only sends EFAULT/kill to the
consumer. If somebody tries to re-read the same content nothing will
really happen. I went this way because it was much simpler and memory
consumers usually do not retry on EFAULT. Maybe this is not the case
here.

I've been staring into iov_iter_copy_from_user_atomic which I
believe should be the common write path which reads the user buffer
where the corruption caused by the oom_reaper would come from.
iov_iter_fault_in_readable should be called before this function. If
this happened after MMF_UNSTABLE was set then we should get EFAULT and
bail out early. Let's assume this wasn't the case. Then we should get
down to iov_iter_copy_from_user_atomic and that one shouldn't copy any
data because __copy_from_user_inatomic says

 * If copying succeeds, the return value must be 0.  If some data cannot be
 * fetched, it is permitted to copy less than had been fetched; the only
 * hard requirement is that not storing anything at all (i.e. returning size)
 * should happen only when nothing could be copied.  In other words, you don't
 * have to squeeze as much as possible - it is allowed, but not necessary.

which should be our case.

I was testing with xfs (but generic_perform_write seem to be doing the
same thing) and that one does
		if (unlikely(copied == 0)) {
			/*
			 * If we were unable to copy any data at all, we must
			 * fall back to a single segment length write.
			 *
			 * If we didn't fallback here, we could livelock
			 * because not all segments in the iov can be copied at
			 * once without a pagefault.
			 */
			bytes = min_t(unsigned long, PAGE_SIZE - offset,
						iov_iter_single_seg_count(i));
			goto again;
		}

and that again will go through iov_iter_fault_in_readable again and that
will succeed now.

And that's why we still see the corruption. That, however, means that
the MMF_UNSTABLE implementation has to be more complex and we have to
hook into all anonymous memory fault paths which I hoped I could avoid
previously.

This is a rough first draft that passes the test case from Tetsuo on my
system. It will need much more eyes on it and I will return to it with a
fresh brain next week. I would appreciate as much testing as possible.

Note that this is on top of the previous attempt for the fix but I will
squash the result into one patch because the previous one is not
sufficient.
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..1fbc78d423d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg, true);
-		put_page(page);
-		return VM_FAULT_OOM;
+		ret = VM_FAULT_OOM;
+		goto release;
 	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
@@ -583,6 +583,15 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	} else {
 		pmd_t entry;
 
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			spin_unlock(vmf->ptl);
+			ret = VM_FAULT_SIGBUS;
+			goto release;
+		}
 		/* Deliver the page fault to userland */
 		if (userfaultfd_missing(vma)) {
 			int ret;
@@ -610,6 +619,13 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 	}
 
 	return 0;
+release:
+	if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
 }
 
 /*
@@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		ret = 0;
 		set = false;
 		if (pmd_none(*vmf->pmd)) {
-			if (userfaultfd_missing(vma)) {
+			/*
+			 * range could have been already torn down by
+			 * the oom reaper
+			 */
+			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+				spin_unlock(vmf->ptl);
+				ret = VM_FAULT_SIGBUS;
+			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
diff --git a/mm/memory.c b/mm/memory.c
index e7308e633b52..7de9508e38e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	int ret = 0;
 	pte_t entry;
 
 	/* File mapping without ->vm_ops ? */
@@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte))
 			goto unlock;
+		/*
+		 * range could have been already torn down by
+		 * the oom reaper
+		 */
+		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2930,6 +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (!pte_none(*vmf->pte))
 		goto release;
 
+	/*
+	 * range could have been already torn down by
+	 * the oom reaper
+	 */
+	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
+		ret = VM_FAULT_SIGBUS;
+		goto release;
+	}
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2949,7 +2967,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	return ret;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
@@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
 		page = vmf->cow_page;
 	else
 		page = vmf->page;
-	ret = alloc_set_pte(vmf, vmf->memcg, page);
+	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
+		ret = alloc_set_pte(vmf, vmf->memcg, page);
+	else
+		ret = VM_FAULT_SIGBUS;
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
@@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
-	/*
-	 * This mm has been already reaped by the oom reaper and so the
-	 * refault cannot be trusted in general. Anonymous refaults would
-	 * lose data and give a zero page instead e.g.
-	 */
-	if (unlikely(!(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
-		/*
-		 * We are going to enforce SIGBUS but the PF path might have
-		 * dropped the mmap_sem already so take it again so that
-		 * we do not break expectations of all arch specific PF paths
-		 * and g-u-p
-		 */
-		if (ret & VM_FAULT_RETRY)
-			down_read(&vma->vm_mm->mmap_sem);
-		ret = VM_FAULT_SIGBUS;
-	}
-
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 14:56               ` Michal Hocko
@ 2017-08-04 16:49                 ` Tetsuo Handa
  -1 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04 16:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc. , but I guess that
the direction is correct. Since the OOM reaper basically does

  Set MMF_UNSTABLE flag on mm_struct.
  For each reapable page in mm_struct {
    Take ptl lock.
    Remove pte.
    Release ptl lock.
  }

the page fault handler will need to check MMF_UNSTABLE with lock held.

  For each faulted page in mm_struct {
    Take ptl lock.
    Add pte only if MMF_UNSTABLE flag is not set.
    Release ptl lock.
  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-04 16:49                 ` Tetsuo Handa
  0 siblings, 0 replies; 24+ messages in thread
From: Tetsuo Handa @ 2017-08-04 16:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, akpm, wenwei.tww, oleg, rientjes, linux-kernel

Michal Hocko wrote:
> And that's why we still see the corruption. That, however, means that
> the MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.

I don't understand mm internals including pte/ptl etc. , but I guess that
the direction is correct. Since the OOM reaper basically does

  Set MMF_UNSTABLE flag on mm_struct.
  For each reapable page in mm_struct {
    Take ptl lock.
    Remove pte.
    Release ptl lock.
  }

the page fault handler will need to check MMF_UNSTABLE with lock held.

  For each faulted page in mm_struct {
    Take ptl lock.
    Add pte only if MMF_UNSTABLE flag is not set.
    Release ptl lock.
  }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
  2017-08-04 14:56               ` Michal Hocko
@ 2017-08-05  1:46                 ` 陶文苇
  -1 siblings, 0 replies; 24+ messages in thread
From: 陶文苇 @ 2017-08-05  1:46 UTC (permalink / raw)
  To: 'Michal Hocko', 'Tetsuo Handa'
  Cc: linux-mm, akpm, oleg, rientjes, linux-kernel



> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Friday, August 04, 2017 10:57 PM
> To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: linux-mm@kvack.org; akpm@linux-foundation.org; 陶文苇
> <wenwei.tww@alibaba-inc.com>; oleg@redhat.com; rientjes@google.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm, oom: fix potential data corruption when
> oom_reaper races with writer
> 
> On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> > On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> [...]
> > > Yes. Data corruption still happens.
> >
> > I guess I managed to reproduce finally. Will investigate further.
> 
> One limitation of the current MMF_UNSTABLE implementation is that it still
> keeps the new page mapped and only sends EFAULT/kill to the consumer. If
> somebody tries to re-read the same content nothing will really happen. I
> went this way because it was much simpler and memory consumers usually
> do not retry on EFAULT. Maybe this is not the case here.
> 
> I've been staring into iov_iter_copy_from_user_atomic which I believe
> should be the common write path which reads the user buffer where the
> corruption caused by the oom_reaper would come from.
> iov_iter_fault_in_readable should be called before this function. If this
> happened after MMF_UNSTABLE was set then we should get EFAULT and bail
> out early. Let's assume this wasn't the case. Then we should get down to
> iov_iter_copy_from_user_atomic and that one shouldn't copy any data
> because __copy_from_user_inatomic says
> 
>  * If copying succeeds, the return value must be 0.  If some data cannot
be
>  * fetched, it is permitted to copy less than had been fetched; the only
>  * hard requirement is that not storing anything at all (i.e. returning
size)
>  * should happen only when nothing could be copied.  In other words, you
> don't
>  * have to squeeze as much as possible - it is allowed, but not necessary.
> 
> which should be our case.
> 
> I was testing with xfs (but generic_perform_write seem to be doing the
same
> thing) and that one does
> 		if (unlikely(copied == 0)) {
> 			/*
> 			 * If we were unable to copy any data at all, we
must
> 			 * fall back to a single segment length write.
> 			 *
> 			 * If we didn't fallback here, we could livelock
> 			 * because not all segments in the iov can be copied
at
> 			 * once without a pagefault.
> 			 */
> 			bytes = min_t(unsigned long, PAGE_SIZE - offset,
>
iov_iter_single_seg_count(i));
> 			goto again;
> 		}
> 
> and that again will go through iov_iter_fault_in_readable again and that
will
> succeed now.
> 
Agree, didn't notice this case before.

> And that's why we still see the corruption. That, however, means that the
> MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.
> 
> This is a rough first draft that passes the test case from Tetsuo on my
system.
> It will need much more eyes on it and I will return to it with a fresh
brain next
> week. I would appreciate as much testing as possible.
> 
> Note that this is on top of the previous attempt for the fix but I will
squash
> the result into one patch because the previous one is not sufficient.
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> 86975dec0ba1..1fbc78d423d7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
>  	struct mem_cgroup *memcg;
>  	pgtable_t pgtable;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	int ret;
> 
>  	VM_BUG_ON_PAGE(!PageCompound(page), page);
> 
> @@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
> 
>  	pgtable = pte_alloc_one(vma->vm_mm, haddr);
>  	if (unlikely(!pgtable)) {
> -		mem_cgroup_cancel_charge(page, memcg, true);
> -		put_page(page);
> -		return VM_FAULT_OOM;
> +		ret = VM_FAULT_OOM;
> +		goto release;
>  	}
> 
>  	clear_huge_page(page, haddr, HPAGE_PMD_NR); @@ -583,6 +583,15
> @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> struct page *page,
>  	} else {
>  		pmd_t entry;
> 
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			spin_unlock(vmf->ptl);
> +			ret = VM_FAULT_SIGBUS;
> +			goto release;
> +		}
>  		/* Deliver the page fault to userland */
>  		if (userfaultfd_missing(vma)) {
>  			int ret;
> @@ -610,6 +619,13 @@ static int
> __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page
> *page,
>  	}
> 
>  	return 0;
> +release:
> +	if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
> +	mem_cgroup_cancel_charge(page, memcg, true);
> +	put_page(page);
> +	return ret;
> +
>  }
> 
>  /*
> @@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct
> vm_fault *vmf)
>  		ret = 0;
>  		set = false;
>  		if (pmd_none(*vmf->pmd)) {
> -			if (userfaultfd_missing(vma)) {
> +			/*
> +			 * range could have been already torn down by
> +			 * the oom reaper
> +			 */
> +			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +				spin_unlock(vmf->ptl);
> +				ret = VM_FAULT_SIGBUS;
> +			} else if (userfaultfd_missing(vma)) {
>  				spin_unlock(vmf->ptl);
>  				ret = handle_userfault(vmf,
VM_UFFD_MISSING);
>  				VM_BUG_ON(ret & VM_FAULT_FALLBACK); diff
--git
> a/mm/memory.c b/mm/memory.c index e7308e633b52..7de9508e38e4
> 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mem_cgroup *memcg;
>  	struct page *page;
> +	int ret = 0;
>  	pte_t entry;
> 
>  	/* File mapping without ->vm_ops ? */
> @@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  				vmf->address, &vmf->ptl);
>  		if (!pte_none(*vmf->pte))
>  			goto unlock;
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			ret = VM_FAULT_SIGBUS;
> +			goto unlock;
> +		}
>  		/* Deliver the page fault to userland, check inside PT lock
*/
>  		if (userfaultfd_missing(vma)) {
>  			pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2930,6
> +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	if (!pte_none(*vmf->pte))
>  		goto release;
> 
> +	/*
> +	 * range could have been already torn down by
> +	 * the oom reaper
> +	 */
> +	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto release;
> +	}
> +
>  	/* Deliver the page fault to userland, check inside PT lock */
>  	if (userfaultfd_missing(vma)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2949,7 +2967,7 @@
> static int do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> -	return 0;
> +	return ret;
>  release:
>  	mem_cgroup_cancel_charge(page, memcg, false);
>  	put_page(page);
> @@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
>  		page = vmf->cow_page;
>  	else
>  		page = vmf->page;
> -	ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
> +		ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	else
> +		ret = VM_FAULT_SIGBUS;
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;
> @@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct
> *vma, unsigned long address,
>  			mem_cgroup_oom_synchronize(false);
>  	}
> 
> -	/*
> -	 * This mm has been already reaped by the oom reaper and so the
> -	 * refault cannot be trusted in general. Anonymous refaults would
> -	 * lose data and give a zero page instead e.g.
> -	 */
> -	if (unlikely(!(ret & VM_FAULT_ERROR)
> -				&& test_bit(MMF_UNSTABLE,
&vma->vm_mm->flags))) {
> -		/*
> -		 * We are going to enforce SIGBUS but the PF path might have
> -		 * dropped the mmap_sem already so take it again so that
> -		 * we do not break expectations of all arch specific PF
paths
> -		 * and g-u-p
> -		 */
> -		if (ret & VM_FAULT_RETRY)
> -			down_read(&vma->vm_mm->mmap_sem);
> -		ret = VM_FAULT_SIGBUS;
> -	}
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer
@ 2017-08-05  1:46                 ` 陶文苇
  0 siblings, 0 replies; 24+ messages in thread
From: 陶文苇 @ 2017-08-05  1:46 UTC (permalink / raw)
  To: 'Michal Hocko', 'Tetsuo Handa'
  Cc: linux-mm, akpm, oleg, rientjes, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="UTF-8", Size: 8703 bytes --]



> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: Friday, August 04, 2017 10:57 PM
> To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: linux-mm@kvack.org; akpm@linux-foundation.org; 陶文苇
> <wenwei.tww@alibaba-inc.com>; oleg@redhat.com; rientjes@google.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm, oom: fix potential data corruption when
> oom_reaper races with writer
> 
> On Fri 04-08-17 13:00:47, Michal Hocko wrote:
> > On Fri 04-08-17 19:41:42, Tetsuo Handa wrote:
> [...]
> > > Yes. Data corruption still happens.
> >
> > I guess I managed to reproduce finally. Will investigate further.
> 
> One limitation of the current MMF_UNSTABLE implementation is that it still
> keeps the new page mapped and only sends EFAULT/kill to the consumer. If
> somebody tries to re-read the same content nothing will really happen. I
> went this way because it was much simpler and memory consumers usually
> do not retry on EFAULT. Maybe this is not the case here.
> 
> I've been staring into iov_iter_copy_from_user_atomic which I believe
> should be the common write path which reads the user buffer where the
> corruption caused by the oom_reaper would come from.
> iov_iter_fault_in_readable should be called before this function. If this
> happened after MMF_UNSTABLE was set then we should get EFAULT and bail
> out early. Let's assume this wasn't the case. Then we should get down to
> iov_iter_copy_from_user_atomic and that one shouldn't copy any data
> because __copy_from_user_inatomic says
> 
>  * If copying succeeds, the return value must be 0.  If some data cannot
be
>  * fetched, it is permitted to copy less than had been fetched; the only
>  * hard requirement is that not storing anything at all (i.e. returning
size)
>  * should happen only when nothing could be copied.  In other words, you
> don't
>  * have to squeeze as much as possible - it is allowed, but not necessary.
> 
> which should be our case.
> 
> I was testing with xfs (but generic_perform_write seem to be doing the
same
> thing) and that one does
> 		if (unlikely(copied == 0)) {
> 			/*
> 			 * If we were unable to copy any data at all, we
must
> 			 * fall back to a single segment length write.
> 			 *
> 			 * If we didn't fallback here, we could livelock
> 			 * because not all segments in the iov can be copied
at
> 			 * once without a pagefault.
> 			 */
> 			bytes = min_t(unsigned long, PAGE_SIZE - offset,
>
iov_iter_single_seg_count(i));
> 			goto again;
> 		}
> 
> and that again will go through iov_iter_fault_in_readable again and that
will
> succeed now.
> 
Agree, didn't notice this case before.

> And that's why we still see the corruption. That, however, means that the
> MMF_UNSTABLE implementation has to be more complex and we have to
> hook into all anonymous memory fault paths which I hoped I could avoid
> previously.
> 
> This is a rough first draft that passes the test case from Tetsuo on my
system.
> It will need much more eyes on it and I will return to it with a fresh
brain next
> week. I would appreciate as much testing as possible.
> 
> Note that this is on top of the previous attempt for the fix but I will
squash
> the result into one patch because the previous one is not sufficient.
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> 86975dec0ba1..1fbc78d423d7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -550,6 +550,7 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
>  	struct mem_cgroup *memcg;
>  	pgtable_t pgtable;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +	int ret;
> 
>  	VM_BUG_ON_PAGE(!PageCompound(page), page);
> 
> @@ -561,9 +562,8 @@ static int __do_huge_pmd_anonymous_page(struct
> vm_fault *vmf, struct page *page,
> 
>  	pgtable = pte_alloc_one(vma->vm_mm, haddr);
>  	if (unlikely(!pgtable)) {
> -		mem_cgroup_cancel_charge(page, memcg, true);
> -		put_page(page);
> -		return VM_FAULT_OOM;
> +		ret = VM_FAULT_OOM;
> +		goto release;
>  	}
> 
>  	clear_huge_page(page, haddr, HPAGE_PMD_NR); @@ -583,6 +583,15
> @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> struct page *page,
>  	} else {
>  		pmd_t entry;
> 
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			spin_unlock(vmf->ptl);
> +			ret = VM_FAULT_SIGBUS;
> +			goto release;
> +		}
>  		/* Deliver the page fault to userland */
>  		if (userfaultfd_missing(vma)) {
>  			int ret;
> @@ -610,6 +619,13 @@ static int
> __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page
> *page,
>  	}
> 
>  	return 0;
> +release:
> +	if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
> +	mem_cgroup_cancel_charge(page, memcg, true);
> +	put_page(page);
> +	return ret;
> +
>  }
> 
>  /*
> @@ -688,7 +704,14 @@ int do_huge_pmd_anonymous_page(struct
> vm_fault *vmf)
>  		ret = 0;
>  		set = false;
>  		if (pmd_none(*vmf->pmd)) {
> -			if (userfaultfd_missing(vma)) {
> +			/*
> +			 * range could have been already torn down by
> +			 * the oom reaper
> +			 */
> +			if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +				spin_unlock(vmf->ptl);
> +				ret = VM_FAULT_SIGBUS;
> +			} else if (userfaultfd_missing(vma)) {
>  				spin_unlock(vmf->ptl);
>  				ret = handle_userfault(vmf,
VM_UFFD_MISSING);
>  				VM_BUG_ON(ret & VM_FAULT_FALLBACK); diff
--git
> a/mm/memory.c b/mm/memory.c index e7308e633b52..7de9508e38e4
> 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2864,6 +2864,7 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mem_cgroup *memcg;
>  	struct page *page;
> +	int ret = 0;
>  	pte_t entry;
> 
>  	/* File mapping without ->vm_ops ? */
> @@ -2896,6 +2897,14 @@ static int do_anonymous_page(struct vm_fault
> *vmf)
>  				vmf->address, &vmf->ptl);
>  		if (!pte_none(*vmf->pte))
>  			goto unlock;
> +		/*
> +		 * range could have been already torn down by
> +		 * the oom reaper
> +		 */
> +		if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +			ret = VM_FAULT_SIGBUS;
> +			goto unlock;
> +		}
>  		/* Deliver the page fault to userland, check inside PT lock
*/
>  		if (userfaultfd_missing(vma)) {
>  			pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2930,6
> +2939,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	if (!pte_none(*vmf->pte))
>  		goto release;
> 
> +	/*
> +	 * range could have been already torn down by
> +	 * the oom reaper
> +	 */
> +	if (test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto release;
> +	}
> +
>  	/* Deliver the page fault to userland, check inside PT lock */
>  	if (userfaultfd_missing(vma)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl); @@ -2949,7 +2967,7 @@
> static int do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> -	return 0;
> +	return ret;
>  release:
>  	mem_cgroup_cancel_charge(page, memcg, false);
>  	put_page(page);
> @@ -3231,7 +3249,10 @@ int finish_fault(struct vm_fault *vmf)
>  		page = vmf->cow_page;
>  	else
>  		page = vmf->page;
> -	ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	if (!test_bit(MMF_UNSTABLE, &vmf->vma->vm_mm->flags))
> +		ret = alloc_set_pte(vmf, vmf->memcg, page);
> +	else
> +		ret = VM_FAULT_SIGBUS;
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;
> @@ -3871,24 +3892,6 @@ int handle_mm_fault(struct vm_area_struct
> *vma, unsigned long address,
>  			mem_cgroup_oom_synchronize(false);
>  	}
> 
> -	/*
> -	 * This mm has been already reaped by the oom reaper and so the
> -	 * refault cannot be trusted in general. Anonymous refaults would
> -	 * lose data and give a zero page instead e.g.
> -	 */
> -	if (unlikely(!(ret & VM_FAULT_ERROR)
> -				&& test_bit(MMF_UNSTABLE,
&vma->vm_mm->flags))) {
> -		/*
> -		 * We are going to enforce SIGBUS but the PF path might have
> -		 * dropped the mmap_sem already so take it again so that
> -		 * we do not break expectations of all arch specific PF
paths
> -		 * and g-u-p
> -		 */
> -		if (ret & VM_FAULT_RETRY)
> -			down_read(&vma->vm_mm->mmap_sem);
> -		ret = VM_FAULT_SIGBUS;
> -	}
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
> --
> Michal Hocko
> SUSE LabsN‹§²æìr¸›zǧu©ž²Æ {\b­†éì¹»\x1c®&Þ–)îÆi¢žØ^n‡r¶‰šŽŠÝ¢j$½§$¢¸\x05¢¹¨­è§~Š'.)îÄÃ,yèm¶ŸÿÃ\f%Š{±šj+ƒðèž×¦j)Z†·Ÿ

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-08-05  1:46 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-03 13:59 [PATCH] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-03 13:59 ` Michal Hocko
2017-08-04  6:46 ` Tetsuo Handa
2017-08-04  7:42   ` Michal Hocko
2017-08-04  7:42     ` Michal Hocko
2017-08-04  8:25     ` Tetsuo Handa
2017-08-04  8:32       ` Michal Hocko
2017-08-04  8:32         ` Michal Hocko
2017-08-04  8:33         ` [PATCH 1/2] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS Michal Hocko
2017-08-04  8:33           ` Michal Hocko
2017-08-04  8:33           ` [PATCH 2/2] mm, oom: fix potential data corruption when oom_reaper races with writer Michal Hocko
2017-08-04  8:33             ` Michal Hocko
2017-08-04  9:16       ` Re: [PATCH] " Michal Hocko
2017-08-04  9:16         ` Michal Hocko
2017-08-04 10:41         ` Tetsuo Handa
2017-08-04 10:41           ` Tetsuo Handa
2017-08-04 11:00           ` Michal Hocko
2017-08-04 11:00             ` Michal Hocko
2017-08-04 14:56             ` Michal Hocko
2017-08-04 14:56               ` Michal Hocko
2017-08-04 16:49               ` Tetsuo Handa
2017-08-04 16:49                 ` Tetsuo Handa
2017-08-05  1:46               ` 陶文苇
2017-08-05  1:46                 ` 陶文苇

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.