All of lore.kernel.org
 help / color / mirror / Atom feed
* mm: GPF in find_get_pages_tag
@ 2016-07-05 11:39 ` Dmitry Vyukov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-05 11:39 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin
  Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin

Hello,

The following program triggers GPF in find_get_pages_tag if run in
parallel loop for minutes:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
RIP: 0010:[<ffffffff816951a4>]
  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
RSP: 0018:ffff880067357840  EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
Stack:
 ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
 ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
 0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
Call Trace:
 [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
 [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
fs/ext4/inode.c:2516
 [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
 [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
 [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
 [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
 [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
 [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
 [<     inline     >] vfs_fsync fs/sync.c:209
 [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
 [<     inline     >] SYSC_fdatasync fs/sync.c:232
 [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
 [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
 RSP <ffff880067357840>
---[ end trace 33a0cc4dd9a49a67 ]---



// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int fd;
char buf[8192];
char filename[256];

void* thr(void* arg)
{
  switch ((long)arg) {
  case 0:
    write(fd, buf, 0x1001ul);
    break;
  case 1:
    fdatasync(fd);
    break;
  case 2:
    ftruncate(fd, 2);
    break;
  case 3:
    write(fd, buf, 0x20ul);
    break;
  case 5:
    fd = open(filename, 0x50042ul, 0x41ul);
    break;
  }
  return 0;
}

int main()
{
  long i;
  pthread_t th[10];

  srand(getpid());
  sprintf(filename, "./file%d", getpid());
  fd = open(filename, 0x50042ul, 0x41ul);
  for (i = 0; i < 10; i++) {
    pthread_create(&th[i], 0, thr, (void*)(i % 5));
    usleep(rand() % 10);
  }
  for (i = 0; i < 10; i++)
    pthread_join(th[i], 0);
  unlink(filename);
  return 0;
}

The faulting instruction is:
ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
So this is KASAN shadow check for NULL address.


The previous taint is not relevant, it is:

[   74.786477] ------------[ cut here ]------------
[   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
depot_save_stack+0x34f/0x5b0
[   74.787196] Stack depot reached limit capacity


On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mm: GPF in find_get_pages_tag
@ 2016-07-05 11:39 ` Dmitry Vyukov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-05 11:39 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin
  Cc: syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin

Hello,

The following program triggers GPF in find_get_pages_tag if run in
parallel loop for minutes:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
RIP: 0010:[<ffffffff816951a4>]
  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
RSP: 0018:ffff880067357840  EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
Stack:
 ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
 ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
 0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
Call Trace:
 [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
 [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
fs/ext4/inode.c:2516
 [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
 [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
 [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
 [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
 [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
 [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
 [<     inline     >] vfs_fsync fs/sync.c:209
 [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
 [<     inline     >] SYSC_fdatasync fs/sync.c:232
 [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
 [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
 RSP <ffff880067357840>
---[ end trace 33a0cc4dd9a49a67 ]---



// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int fd;
char buf[8192];
char filename[256];

void* thr(void* arg)
{
  switch ((long)arg) {
  case 0:
    write(fd, buf, 0x1001ul);
    break;
  case 1:
    fdatasync(fd);
    break;
  case 2:
    ftruncate(fd, 2);
    break;
  case 3:
    write(fd, buf, 0x20ul);
    break;
  case 5:
    fd = open(filename, 0x50042ul, 0x41ul);
    break;
  }
  return 0;
}

int main()
{
  long i;
  pthread_t th[10];

  srand(getpid());
  sprintf(filename, "./file%d", getpid());
  fd = open(filename, 0x50042ul, 0x41ul);
  for (i = 0; i < 10; i++) {
    pthread_create(&th[i], 0, thr, (void*)(i % 5));
    usleep(rand() % 10);
  }
  for (i = 0; i < 10; i++)
    pthread_join(th[i], 0);
  unlink(filename);
  return 0;
}

The faulting instruction is:
ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
So this is KASAN shadow check for NULL address.


The previous taint is not relevant, it is:

[   74.786477] ------------[ cut here ]------------
[   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
depot_save_stack+0x34f/0x5b0
[   74.787196] Stack depot reached limit capacity


On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-05 11:39 ` Dmitry Vyukov
  (?)
@ 2016-07-14 11:19   ` Andrey Ryabinin
  -1 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-14 11:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, ross.zwisler, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel, Andrey Ryabinin,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable

radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:

RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
....
Call Trace:
 [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
 [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
 [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
 [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
 [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
 [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
 [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
 [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
 [<     inline     >] vfs_fsync fs/sync.c:209
 [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
 [<     inline     >] SYSC_fdatasync fs/sync.c:232
 [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
 [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

We must reset iterator's tags to bail out from radix_tree_next_slot() and
go to the slow-path in radix_tree_next_chunk().

Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..eca6f62 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -407,6 +407,7 @@ static inline __must_check
 void **radix_tree_iter_retry(struct radix_tree_iter *iter)
 {
 	iter->next_index = iter->index;
+	iter->tags = 0;
 	return NULL;
 }
 
-- 
2.7.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-14 11:19   ` Andrey Ryabinin
  0 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-14 11:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, ross.zwisler, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel, Andrey Ryabinin,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable

radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:

RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
....
Call Trace:
 [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
 [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
 [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
 [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
 [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
 [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
 [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
 [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
 [<     inline     >] vfs_fsync fs/sync.c:209
 [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
 [<     inline     >] SYSC_fdatasync fs/sync.c:232
 [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
 [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

We must reset iterator's tags to bail out from radix_tree_next_slot() and
go to the slow-path in radix_tree_next_chunk().

Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..eca6f62 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -407,6 +407,7 @@ static inline __must_check
 void **radix_tree_iter_retry(struct radix_tree_iter *iter)
 {
 	iter->next_index = iter->index;
+	iter->tags = 0;
 	return NULL;
 }
 
-- 
2.7.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-14 11:19   ` Andrey Ryabinin
  0 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-14 11:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, ross.zwisler, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel, Andrey Ryabinin,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable

radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:

RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
....
Call Trace:
 [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
 [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
 [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
 [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
 [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
 [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
 [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
 [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
 [<     inline     >] vfs_fsync fs/sync.c:209
 [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
 [<     inline     >] SYSC_fdatasync fs/sync.c:232
 [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
 [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

We must reset iterator's tags to bail out from radix_tree_next_slot() and
go to the slow-path in radix_tree_next_chunk().

Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..eca6f62 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -407,6 +407,7 @@ static inline __must_check
 void **radix_tree_iter_retry(struct radix_tree_iter *iter)
 {
 	iter->next_index = iter->index;
+	iter->tags = 0;
 	return NULL;
 }
 
-- 
2.7.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-14 11:19   ` Andrey Ryabinin
@ 2016-07-14 12:21     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-14 12:21 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

ACK

Originally retry could happen only at index 0 when first indirect node
installed:
in this case tags holds only 1 bit. Seems like now this happends at any index.

On Thu, Jul 14, 2016 at 2:19 PM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> leading to crash:
>
> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> ....
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>
> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> go to the slow-path in radix_tree_next_chunk().
>
> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/radix-tree.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..eca6f62 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -407,6 +407,7 @@ static inline __must_check
>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>  {
>         iter->next_index = iter->index;
> +       iter->tags = 0;
>         return NULL;
>  }
>
> --
> 2.7.3
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-14 12:21     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-14 12:21 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

ACK

Originally retry could happen only at index 0 when first indirect node
installed:
in this case tags holds only 1 bit. Seems like now this happends at any index.

On Thu, Jul 14, 2016 at 2:19 PM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> leading to crash:
>
> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> ....
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>
> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> go to the slow-path in radix_tree_next_chunk().
>
> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/radix-tree.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..eca6f62 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -407,6 +407,7 @@ static inline __must_check
>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>  {
>         iter->next_index = iter->index;
> +       iter->tags = 0;
>         return NULL;
>  }
>
> --
> 2.7.3
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-14 11:19   ` Andrey Ryabinin
@ 2016-07-14 22:25     ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-14 22:25 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> leading to crash:
> 
> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> ....
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
> 
> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> go to the slow-path in radix_tree_next_chunk().

This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
does a 'continue'.  This will hop to the next iteration of the
radix_tree_for_each_tagged() loop, which will very check the exit condition of
the for() loop:

#define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
	for (slot = radix_tree_iter_init(iter, start) ;			\
	     slot || (slot = radix_tree_next_chunk(root, iter,		\
			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
	     slot = radix_tree_next_slot(slot, iter,			\
				RADIX_TREE_ITER_TAGGED))

So, we'll run the 
	     slot || (slot = radix_tree_next_chunk(root, iter,		\
			      RADIX_TREE_ITER_TAGGED | tag)) ;		\

bit first.  'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
this point radix_tree_next_slot() hasn't been called.

radix_tree_next_chunk() will set up the iter->index, iter->next_index and
iter->tags before it returns.  The next iteration of the loop in
find_get_pages_tag() will use the non-NULL slot provided by
radix_tree_next_chunk(), and only after that iteration will we call
radix_tree_next_slot() again.  By then iter->tags should be up to date.

Do you have a test setup that reliably fails without this code but passes when
you zero out iter->tags?

I've been looking at this as well, but haven't been able to get a reliable
reproducer in my test setup.

> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/radix-tree.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..eca6f62 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -407,6 +407,7 @@ static inline __must_check
>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>  {
>  	iter->next_index = iter->index;
> +	iter->tags = 0;
>  	return NULL;
>  }
>  
> -- 
> 2.7.3
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-14 22:25     ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-14 22:25 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> leading to crash:
> 
> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> ....
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
> 
> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> go to the slow-path in radix_tree_next_chunk().

This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
does a 'continue'.  This will hop to the next iteration of the
radix_tree_for_each_tagged() loop, which will very check the exit condition of
the for() loop:

#define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
	for (slot = radix_tree_iter_init(iter, start) ;			\
	     slot || (slot = radix_tree_next_chunk(root, iter,		\
			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
	     slot = radix_tree_next_slot(slot, iter,			\
				RADIX_TREE_ITER_TAGGED))

So, we'll run the 
	     slot || (slot = radix_tree_next_chunk(root, iter,		\
			      RADIX_TREE_ITER_TAGGED | tag)) ;		\

bit first.  'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
this point radix_tree_next_slot() hasn't been called.

radix_tree_next_chunk() will set up the iter->index, iter->next_index and
iter->tags before it returns.  The next iteration of the loop in
find_get_pages_tag() will use the non-NULL slot provided by
radix_tree_next_chunk(), and only after that iteration will we call
radix_tree_next_slot() again.  By then iter->tags should be up to date.

Do you have a test setup that reliably fails without this code but passes when
you zero out iter->tags?

I've been looking at this as well, but haven't been able to get a reliable
reproducer in my test setup.

> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/radix-tree.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..eca6f62 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -407,6 +407,7 @@ static inline __must_check
>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>  {
>  	iter->next_index = iter->index;
> +	iter->tags = 0;
>  	return NULL;
>  }
>  
> -- 
> 2.7.3
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-14 22:25     ` Ross Zwisler
  (?)
@ 2016-07-15  8:52       ` Andrey Ryabinin
  -1 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-15  8:52 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Jan Kara, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable



On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>> leading to crash:
>>
>> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> ....
>> Call Trace:
>>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>  [<     inline     >] vfs_fsync fs/sync.c:209
>>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>>
>> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>> go to the slow-path in radix_tree_next_chunk().
> 
> This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> does a 'continue'.  This will hop to the next iteration of the
> radix_tree_for_each_tagged() loop, which will very check the exit condition of
> the for() loop:
> 
> #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> 	for (slot = radix_tree_iter_init(iter, start) ;			\
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 	     slot = radix_tree_next_slot(slot, iter,			\
> 				RADIX_TREE_ITER_TAGGED))
> 
> So, we'll run the 
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 
> bit first.  

This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
and only after that goes the condition statement.


> 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> this point radix_tree_next_slot() hasn't been called.
> 
> radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> iter->tags before it returns.  The next iteration of the loop in
> find_get_pages_tag() will use the non-NULL slot provided by
> radix_tree_next_chunk(), and only after that iteration will we call
> radix_tree_next_slot() again.  By then iter->tags should be up to date.
> 
> Do you have a test setup that reliably fails without this code but passes when
> you zero out iter->tags?
> 


Yup, I run Dmitry's reproducer in a parallel loop:
	$ while true; do ./a.out & done

Usually it takes just couple minutes maximum.

> I've been looking at this as well, but haven't been able to get a reliable
> reproducer in my test setup.
> 
>> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> Reported-by: Dmitry Vyukov <dvyukov@google.com>
>> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
>> Cc: Matthew Wilcox <willy@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  include/linux/radix-tree.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index cb4b7e8..eca6f62 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -407,6 +407,7 @@ static inline __must_check
>>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>>  {
>>  	iter->next_index = iter->index;
>> +	iter->tags = 0;
>>  	return NULL;
>>  }
>>  
>> -- 
>> 2.7.3
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-15  8:52       ` Andrey Ryabinin
  0 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-15  8:52 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Jan Kara, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable



On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>> leading to crash:
>>
>> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> ....
>> Call Trace:
>>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>  [<     inline     >] vfs_fsync fs/sync.c:209
>>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>>
>> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>> go to the slow-path in radix_tree_next_chunk().
> 
> This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> does a 'continue'.  This will hop to the next iteration of the
> radix_tree_for_each_tagged() loop, which will very check the exit condition of
> the for() loop:
> 
> #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> 	for (slot = radix_tree_iter_init(iter, start) ;			\
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 	     slot = radix_tree_next_slot(slot, iter,			\
> 				RADIX_TREE_ITER_TAGGED))
> 
> So, we'll run the 
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 
> bit first.  

This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
and only after that goes the condition statement.


> 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> this point radix_tree_next_slot() hasn't been called.
> 
> radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> iter->tags before it returns.  The next iteration of the loop in
> find_get_pages_tag() will use the non-NULL slot provided by
> radix_tree_next_chunk(), and only after that iteration will we call
> radix_tree_next_slot() again.  By then iter->tags should be up to date.
> 
> Do you have a test setup that reliably fails without this code but passes when
> you zero out iter->tags?
> 


Yup, I run Dmitry's reproducer in a parallel loop:
	$ while true; do ./a.out & done

Usually it takes just couple minutes maximum.

> I've been looking at this as well, but haven't been able to get a reliable
> reproducer in my test setup.
> 
>> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> Reported-by: Dmitry Vyukov <dvyukov@google.com>
>> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
>> Cc: Matthew Wilcox <willy@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  include/linux/radix-tree.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index cb4b7e8..eca6f62 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -407,6 +407,7 @@ static inline __must_check
>>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>>  {
>>  	iter->next_index = iter->index;
>> +	iter->tags = 0;
>>  	return NULL;
>>  }
>>  
>> -- 
>> 2.7.3
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-15  8:52       ` Andrey Ryabinin
  0 siblings, 0 replies; 35+ messages in thread
From: Andrey Ryabinin @ 2016-07-15  8:52 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Jan Kara, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable



On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>> leading to crash:
>>
>> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> ....
>> Call Trace:
>>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>  [<     inline     >] vfs_fsync fs/sync.c:209
>>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>>
>> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>> go to the slow-path in radix_tree_next_chunk().
> 
> This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> does a 'continue'.  This will hop to the next iteration of the
> radix_tree_for_each_tagged() loop, which will very check the exit condition of
> the for() loop:
> 
> #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> 	for (slot = radix_tree_iter_init(iter, start) ;			\
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 	     slot = radix_tree_next_slot(slot, iter,			\
> 				RADIX_TREE_ITER_TAGGED))
> 
> So, we'll run the 
> 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> 
> bit first.  

This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
and only after that goes the condition statement.


> 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> this point radix_tree_next_slot() hasn't been called.
> 
> radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> iter->tags before it returns.  The next iteration of the loop in
> find_get_pages_tag() will use the non-NULL slot provided by
> radix_tree_next_chunk(), and only after that iteration will we call
> radix_tree_next_slot() again.  By then iter->tags should be up to date.
> 
> Do you have a test setup that reliably fails without this code but passes when
> you zero out iter->tags?
> 


Yup, I run Dmitry's reproducer in a parallel loop:
	$ while true; do ./a.out & done

Usually it takes just couple minutes maximum.

> I've been looking at this as well, but haven't been able to get a reliable
> reproducer in my test setup.
> 
>> Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> Reported-by: Dmitry Vyukov <dvyukov@google.com>
>> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
>> Cc: Matthew Wilcox <willy@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  include/linux/radix-tree.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index cb4b7e8..eca6f62 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -407,6 +407,7 @@ static inline __must_check
>>  void **radix_tree_iter_retry(struct radix_tree_iter *iter)
>>  {
>>  	iter->next_index = iter->index;
>> +	iter->tags = 0;
>>  	return NULL;
>>  }
>>  
>> -- 
>> 2.7.3
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-15  8:52       ` Andrey Ryabinin
  (?)
@ 2016-07-15 19:00         ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 19:00 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Ross Zwisler, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> >> leading to crash:
> >>
> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> >> ....
> >> Call Trace:
> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
> >>  [<     inline     >] vfs_fsync fs/sync.c:209
> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
> >>
> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> >> go to the slow-path in radix_tree_next_chunk().
> > 
> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> > does a 'continue'.  This will hop to the next iteration of the
> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
> > the for() loop:
> > 
> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> > 	for (slot = radix_tree_iter_init(iter, start) ;			\
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 	     slot = radix_tree_next_slot(slot, iter,			\
> > 				RADIX_TREE_ITER_TAGGED))
> > 
> > So, we'll run the 
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 
> > bit first.  
> 
> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
> and only after that goes the condition statement.

Right...*sigh*...  Thanks for the sanity check. :)

> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> > this point radix_tree_next_slot() hasn't been called.
> > 
> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> > iter->tags before it returns.  The next iteration of the loop in
> > find_get_pages_tag() will use the non-NULL slot provided by
> > radix_tree_next_chunk(), and only after that iteration will we call
> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
> > 
> > Do you have a test setup that reliably fails without this code but passes when
> > you zero out iter->tags?
> > 
> 
> 
> Yup, I run Dmitry's reproducer in a parallel loop:
> 	$ while true; do ./a.out & done
> 
> Usually it takes just couple minutes maximum.

Cool - I was able to get this to work on my system as well by upping the
thread count.

In looking at this more, I agree that your patch fixes this particular bug,
but I think that ultimately we might want something more general.

IIUC, the real issue is that we shouldn't be running through
radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
'slot'.

I've run this patch in my test setup, and it fixes the reproducer provided by
Dmitry.  I've also run xfstests against it with out any failures.

--- 8< ---
>From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Date: Fri, 15 Jul 2016 12:46:38 -0600
Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()

There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot() (there might be more):

1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets

	iter->next_index = iter->index;

which means that in in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.

2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged().  With the current code this case is
unhandled and we have seen it result in a kernel crash when we dereference
'slot'.

3) radix_tree_iter_next() via via a non-tagged iteration like
radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().

I think that this case is currently unhandled.  Unlike with
radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
up executing code in the while() loop in radix_tree_next_slot() that assumes
'slot' is valid.

I haven't actually seen this crash on a test setup, but I don't think the
current code is safe.

4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().

radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:

	if (flags & RADIX_TREE_ITER_TAGGED) {
		void *canon = slot;

		iter->tags >>= 1;
		if (unlikely(!iter->tags))
			return NULL;

Really we want to guarantee that we just bail out  of
radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
way of handling all the 4 above cases.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/radix-tree.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..840308d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
 static __always_inline void **
 radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 {
+	if (unlikely(!slot))
+		return NULL;
+
 	if (flags & RADIX_TREE_ITER_TAGGED) {
 		void *canon = slot;
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-15 19:00         ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 19:00 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Ross Zwisler, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> >> leading to crash:
> >>
> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> >> ....
> >> Call Trace:
> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
> >>  [<     inline     >] vfs_fsync fs/sync.c:209
> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
> >>
> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> >> go to the slow-path in radix_tree_next_chunk().
> > 
> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> > does a 'continue'.  This will hop to the next iteration of the
> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
> > the for() loop:
> > 
> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> > 	for (slot = radix_tree_iter_init(iter, start) ;			\
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 	     slot = radix_tree_next_slot(slot, iter,			\
> > 				RADIX_TREE_ITER_TAGGED))
> > 
> > So, we'll run the 
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 
> > bit first.  
> 
> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
> and only after that goes the condition statement.

Right...*sigh*...  Thanks for the sanity check. :)

> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> > this point radix_tree_next_slot() hasn't been called.
> > 
> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> > iter->tags before it returns.  The next iteration of the loop in
> > find_get_pages_tag() will use the non-NULL slot provided by
> > radix_tree_next_chunk(), and only after that iteration will we call
> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
> > 
> > Do you have a test setup that reliably fails without this code but passes when
> > you zero out iter->tags?
> > 
> 
> 
> Yup, I run Dmitry's reproducer in a parallel loop:
> 	$ while true; do ./a.out & done
> 
> Usually it takes just couple minutes maximum.

Cool - I was able to get this to work on my system as well by upping the
thread count.

In looking at this more, I agree that your patch fixes this particular bug,
but I think that ultimately we might want something more general.

IIUC, the real issue is that we shouldn't be running through
radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
'slot'.

I've run this patch in my test setup, and it fixes the reproducer provided by
Dmitry.  I've also run xfstests against it with out any failures.

--- 8< ---
>From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Date: Fri, 15 Jul 2016 12:46:38 -0600
Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()

There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot() (there might be more):

1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets

	iter->next_index = iter->index;

which means that in in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.

2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged().  With the current code this case is
unhandled and we have seen it result in a kernel crash when we dereference
'slot'.

3) radix_tree_iter_next() via via a non-tagged iteration like
radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().

I think that this case is currently unhandled.  Unlike with
radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
up executing code in the while() loop in radix_tree_next_slot() that assumes
'slot' is valid.

I haven't actually seen this crash on a test setup, but I don't think the
current code is safe.

4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().

radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:

	if (flags & RADIX_TREE_ITER_TAGGED) {
		void *canon = slot;

		iter->tags >>= 1;
		if (unlikely(!iter->tags))
			return NULL;

Really we want to guarantee that we just bail out  of
radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
way of handling all the 4 above cases.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/radix-tree.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index cb4b7e8..840308d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
 static __always_inline void **
 radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 {
+	if (unlikely(!slot))
+		return NULL;
+
 	if (flags & RADIX_TREE_ITER_TAGGED) {
 		void *canon = slot;
 
-- 
2.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-15 19:00         ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 19:00 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Ross Zwisler, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
> >> leading to crash:
> >>
> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> >> ....
> >> Call Trace:
> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
> >>  [<     inline     >] vfs_fsync fs/sync.c:209
> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
> >>
> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
> >> go to the slow-path in radix_tree_next_chunk().
> > 
> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
> > does a 'continue'.  This will hop to the next iteration of the
> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
> > the for() loop:
> > 
> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)	\
> > 	for (slot = radix_tree_iter_init(iter, start) ;			\
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 	     slot = radix_tree_next_slot(slot, iter,			\
> > 				RADIX_TREE_ITER_TAGGED))
> > 
> > So, we'll run the 
> > 	     slot || (slot = radix_tree_next_chunk(root, iter,		\
> > 			      RADIX_TREE_ITER_TAGGED | tag)) ;		\
> > 
> > bit first.  
> 
> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
> and only after that goes the condition statement.

Right...*sigh*...  Thanks for the sanity check. :)

> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
> > this point radix_tree_next_slot() hasn't been called.
> > 
> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
> > iter->tags before it returns.  The next iteration of the loop in
> > find_get_pages_tag() will use the non-NULL slot provided by
> > radix_tree_next_chunk(), and only after that iteration will we call
> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
> > 
> > Do you have a test setup that reliably fails without this code but passes when
> > you zero out iter->tags?
> > 
> 
> 
> Yup, I run Dmitry's reproducer in a parallel loop:
> 	$ while true; do ./a.out & done
> 
> Usually it takes just couple minutes maximum.

Cool - I was able to get this to work on my system as well by upping the
thread count.

In looking at this more, I agree that your patch fixes this particular bug,
but I think that ultimately we might want something more general.

IIUC, the real issue is that we shouldn't be running through
radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
'slot'.

I've run this patch in my test setup, and it fixes the reproducer provided by
Dmitry.  I've also run xfstests against it with out any failures.

--- 8< ---

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
  2016-07-05 11:39 ` Dmitry Vyukov
@ 2016-07-15 19:03   ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 19:03 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin

On Tue, Jul 05, 2016 at 01:39:23PM +0200, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers GPF in find_get_pages_tag if run in
> parallel loop for minutes:
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> Modules linked in:
> CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
> RIP: 0010:[<ffffffff816951a4>]
>   [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> RSP: 0018:ffff880067357840  EFLAGS: 00010202
> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
> RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
> RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
> R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
> R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
> FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
> Stack:
>  ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
>  ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
>  0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
> fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
> arch/x86/entry/entry_64.S:207
> Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
> ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
> 3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
> RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
> RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>  RSP <ffff880067357840>
> ---[ end trace 33a0cc4dd9a49a67 ]---
> 
> 
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include <pthread.h>
> #include <stdint.h>
> #include <string.h>
> #include <stdio.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> int fd;
> char buf[8192];
> char filename[256];
> 
> void* thr(void* arg)
> {
>   switch ((long)arg) {
>   case 0:
>     write(fd, buf, 0x1001ul);
>     break;
>   case 1:
>     fdatasync(fd);
>     break;
>   case 2:
>     ftruncate(fd, 2);
>     break;
>   case 3:
>     write(fd, buf, 0x20ul);
>     break;
>   case 5:
>     fd = open(filename, 0x50042ul, 0x41ul);
>     break;

This open() code is unreachable because the thread argument will only be 0-4,
right?  Should this be "case 4"?

>   }
>   return 0;
> }
> 
> int main()
> {
>   long i;
>   pthread_t th[10];
> 
>   srand(getpid());
>   sprintf(filename, "./file%d", getpid());
>   fd = open(filename, 0x50042ul, 0x41ul);
>   for (i = 0; i < 10; i++) {
>     pthread_create(&th[i], 0, thr, (void*)(i % 5));
>     usleep(rand() % 10);
>   }
>   for (i = 0; i < 10; i++)
>     pthread_join(th[i], 0);
>   unlink(filename);
>   return 0;
> }
> 
> The faulting instruction is:
> ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
> So this is KASAN shadow check for NULL address.
> 
> 
> The previous taint is not relevant, it is:
> 
> [   74.786477] ------------[ cut here ]------------
> [   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
> depot_save_stack+0x34f/0x5b0
> [   74.787196] Stack depot reached limit capacity
> 
> 
> On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
@ 2016-07-15 19:03   ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 19:03 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin

On Tue, Jul 05, 2016 at 01:39:23PM +0200, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers GPF in find_get_pages_tag if run in
> parallel loop for minutes:
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> Modules linked in:
> CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
> RIP: 0010:[<ffffffff816951a4>]
>   [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
> RSP: 0018:ffff880067357840  EFLAGS: 00010202
> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
> RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
> RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
> R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
> R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
> FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
> Stack:
>  ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
>  ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
>  0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
> Call Trace:
>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
> fs/ext4/inode.c:2516
>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>  [<     inline     >] vfs_fsync fs/sync.c:209
>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
> arch/x86/entry/entry_64.S:207
> Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
> ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
> 3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
> RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
> RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>  RSP <ffff880067357840>
> ---[ end trace 33a0cc4dd9a49a67 ]---
> 
> 
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include <pthread.h>
> #include <stdint.h>
> #include <string.h>
> #include <stdio.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> int fd;
> char buf[8192];
> char filename[256];
> 
> void* thr(void* arg)
> {
>   switch ((long)arg) {
>   case 0:
>     write(fd, buf, 0x1001ul);
>     break;
>   case 1:
>     fdatasync(fd);
>     break;
>   case 2:
>     ftruncate(fd, 2);
>     break;
>   case 3:
>     write(fd, buf, 0x20ul);
>     break;
>   case 5:
>     fd = open(filename, 0x50042ul, 0x41ul);
>     break;

This open() code is unreachable because the thread argument will only be 0-4,
right?  Should this be "case 4"?

>   }
>   return 0;
> }
> 
> int main()
> {
>   long i;
>   pthread_t th[10];
> 
>   srand(getpid());
>   sprintf(filename, "./file%d", getpid());
>   fd = open(filename, 0x50042ul, 0x41ul);
>   for (i = 0; i < 10; i++) {
>     pthread_create(&th[i], 0, thr, (void*)(i % 5));
>     usleep(rand() % 10);
>   }
>   for (i = 0; i < 10; i++)
>     pthread_join(th[i], 0);
>   unlink(filename);
>   return 0;
> }
> 
> The faulting instruction is:
> ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
> So this is KASAN shadow check for NULL address.
> 
> 
> The previous taint is not relevant, it is:
> 
> [   74.786477] ------------[ cut here ]------------
> [   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
> depot_save_stack+0x34f/0x5b0
> [   74.787196] Stack depot reached limit capacity
> 
> 
> On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
  2016-07-15 19:03   ` Ross Zwisler
@ 2016-07-15 19:07     ` Dmitry Vyukov
  -1 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-15 19:07 UTC (permalink / raw)
  To: syzkaller
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko,
	Sasha Levin

On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jul 05, 2016 at 01:39:23PM +0200, Dmitry Vyukov wrote:
>> Hello,
>>
>> The following program triggers GPF in find_get_pages_tag if run in
>> parallel loop for minutes:
>>
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> Modules linked in:
>> CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
>> RIP: 0010:[<ffffffff816951a4>]
>>   [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> RSP: 0018:ffff880067357840  EFLAGS: 00010202
>> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
>> RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
>> RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
>> R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
>> R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
>> FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
>> Stack:
>>  ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
>>  ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
>>  0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
>> Call Trace:
>>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
>> fs/ext4/inode.c:2516
>>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>  [<     inline     >] vfs_fsync fs/sync.c:209
>>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
>> arch/x86/entry/entry_64.S:207
>> Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
>> ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
>> 3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
>> RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>> RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>>  RSP <ffff880067357840>
>> ---[ end trace 33a0cc4dd9a49a67 ]---
>>
>>
>>
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include <pthread.h>
>> #include <stdint.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <sys/syscall.h>
>> #include <unistd.h>
>>
>> int fd;
>> char buf[8192];
>> char filename[256];
>>
>> void* thr(void* arg)
>> {
>>   switch ((long)arg) {
>>   case 0:
>>     write(fd, buf, 0x1001ul);
>>     break;
>>   case 1:
>>     fdatasync(fd);
>>     break;
>>   case 2:
>>     ftruncate(fd, 2);
>>     break;
>>   case 3:
>>     write(fd, buf, 0x20ul);
>>     break;
>>   case 5:
>>     fd = open(filename, 0x50042ul, 0x41ul);
>>     break;
>
> This open() code is unreachable because the thread argument will only be 0-4,
> right?  Should this be "case 4"?


I am not sure. I think it I just copy-pasted the program that
triggered the crash for me. Andrey should have a valid reproducer, in
the other thread he said that he can reproduce it. Andrey, did you
change 5 to 4?



>>   }
>>   return 0;
>> }
>>
>> int main()
>> {
>>   long i;
>>   pthread_t th[10];
>>
>>   srand(getpid());
>>   sprintf(filename, "./file%d", getpid());
>>   fd = open(filename, 0x50042ul, 0x41ul);
>>   for (i = 0; i < 10; i++) {
>>     pthread_create(&th[i], 0, thr, (void*)(i % 5));
>>     usleep(rand() % 10);
>>   }
>>   for (i = 0; i < 10; i++)
>>     pthread_join(th[i], 0);
>>   unlink(filename);
>>   return 0;
>> }
>>
>> The faulting instruction is:
>> ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
>> So this is KASAN shadow check for NULL address.
>>
>>
>> The previous taint is not relevant, it is:
>>
>> [   74.786477] ------------[ cut here ]------------
>> [   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
>> depot_save_stack+0x34f/0x5b0
>> [   74.787196] Stack depot reached limit capacity
>>
>>
>> On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
@ 2016-07-15 19:07     ` Dmitry Vyukov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-15 19:07 UTC (permalink / raw)
  To: syzkaller
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko,
	Sasha Levin

On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jul 05, 2016 at 01:39:23PM +0200, Dmitry Vyukov wrote:
>> Hello,
>>
>> The following program triggers GPF in find_get_pages_tag if run in
>> parallel loop for minutes:
>>
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> Modules linked in:
>> CPU: 2 PID: 301 Comm: a.out Tainted: G        W       4.7.0-rc5+ #28
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: ffff880063d12440 ti: ffff880067350000 task.ti: ffff880067350000
>> RIP: 0010:[<ffffffff816951a4>]
>>   [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> RSP: 0018:ffff880067357840  EFLAGS: 00010202
>> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff880063d12c80
>> RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000008
>> RBP: ffff880067357910 R08: 0000000000000002 R09: 0000000000000000
>> R10: 0000000000000000 R11: ffffffff89f06360 R12: 0000000000000001
>> R13: 0000000000000000 R14: 0000000000000000 R15: ffffed0007058ee5
>> FS:  00007f56e017c700(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f56df97ae78 CR3: 0000000063d9e000 CR4: 00000000000006e0
>> Stack:
>>  ffffffff81694efc ffff8800673578a8 0000010267357860 ffff880067357a50
>>  ffff880065986aa0 1ffff1000ce6af11 0000000e00000000 ffff880067357a00
>>  0000000000000003 0000000041b58ab3 ffffffff87e2a722 ffffffff81694e70
>> Call Trace:
>>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90
>> fs/ext4/inode.c:2516
>>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>  [<     inline     >] vfs_fsync fs/sync.c:209
>>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1
>> arch/x86/entry/entry_64.S:207
>> Code: 85 70 ff ff ff 49 d1 ec 4d 85 e4 4c 89 65 a8 74 65 e8 51 06 f0
>> ff 49 8d 7e 08 48 be 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80>
>> 3c 30 00 0f 85 9c 05 00 00 4d 8b 6e 08 4c 89 eb 83 e3 03 48
>> RIP  [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>> RIP  [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>>  RSP <ffff880067357840>
>> ---[ end trace 33a0cc4dd9a49a67 ]---
>>
>>
>>
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include <pthread.h>
>> #include <stdint.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <sys/syscall.h>
>> #include <unistd.h>
>>
>> int fd;
>> char buf[8192];
>> char filename[256];
>>
>> void* thr(void* arg)
>> {
>>   switch ((long)arg) {
>>   case 0:
>>     write(fd, buf, 0x1001ul);
>>     break;
>>   case 1:
>>     fdatasync(fd);
>>     break;
>>   case 2:
>>     ftruncate(fd, 2);
>>     break;
>>   case 3:
>>     write(fd, buf, 0x20ul);
>>     break;
>>   case 5:
>>     fd = open(filename, 0x50042ul, 0x41ul);
>>     break;
>
> This open() code is unreachable because the thread argument will only be 0-4,
> right?  Should this be "case 4"?


I am not sure. I think it I just copy-pasted the program that
triggered the crash for me. Andrey should have a valid reproducer, in
the other thread he said that he can reproduce it. Andrey, did you
change 5 to 4?



>>   }
>>   return 0;
>> }
>>
>> int main()
>> {
>>   long i;
>>   pthread_t th[10];
>>
>>   srand(getpid());
>>   sprintf(filename, "./file%d", getpid());
>>   fd = open(filename, 0x50042ul, 0x41ul);
>>   for (i = 0; i < 10; i++) {
>>     pthread_create(&th[i], 0, thr, (void*)(i % 5));
>>     usleep(rand() % 10);
>>   }
>>   for (i = 0; i < 10; i++)
>>     pthread_join(th[i], 0);
>>   unlink(filename);
>>   return 0;
>> }
>>
>> The faulting instruction is:
>> ffffffff816951a4:       80 3c 30 00             cmpb   $0x0,(%rax,%rsi,1)
>> So this is KASAN shadow check for NULL address.
>>
>>
>> The previous taint is not relevant, it is:
>>
>> [   74.786477] ------------[ cut here ]------------
>> [   74.786885] WARNING: CPU: 2 PID: 717 at lib/stackdepot.c:119
>> depot_save_stack+0x34f/0x5b0
>> [   74.787196] Stack depot reached limit capacity
>>
>>
>> On commit 1a0a02d1efa066001fd315c1b4df583d939fa2c4 (Jun 30).
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
  2016-07-15 19:07     ` Dmitry Vyukov
@ 2016-07-15 20:21       ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 20:21 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Andrew Morton, Jan Kara, ross.zwisler,
	Kirill A. Shutemov, linux-mm, LKML, Hugh Dickins, Greg Thelen,
	Suleiman Souhlal, Andrey Ryabinin, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin

On Fri, Jul 15, 2016 at 09:07:48PM +0200, Dmitry Vyukov wrote:
> On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> >> #include <pthread.h>
> >> #include <stdint.h>
> >> #include <string.h>
> >> #include <stdio.h>
> >> #include <sys/syscall.h>
> >> #include <unistd.h>
> >>
> >> int fd;
> >> char buf[8192];
> >> char filename[256];
> >>
> >> void* thr(void* arg)
> >> {
> >>   switch ((long)arg) {
> >>   case 0:
> >>     write(fd, buf, 0x1001ul);
> >>     break;
> >>   case 1:
> >>     fdatasync(fd);
> >>     break;
> >>   case 2:
> >>     ftruncate(fd, 2);
> >>     break;
> >>   case 3:
> >>     write(fd, buf, 0x20ul);
> >>     break;
> >>   case 5:
> >>     fd = open(filename, 0x50042ul, 0x41ul);
> >>     break;
> >
> > This open() code is unreachable because the thread argument will only be 0-4,
> > right?  Should this be "case 4"?
> 
> I am not sure. I think it I just copy-pasted the program that
> triggered the crash for me. Andrey should have a valid reproducer, in
> the other thread he said that he can reproduce it. Andrey, did you
> change 5 to 4?

Ah, sorry if I wasn't clear.  I don't think you need the open() call to have a
valid reproducer.  In mine, in fact, I only use the first three - the error
happens with a combination of write(), fdatasync() and ftruncate().

I just wanted to note that the test program (which was autogenerated?) had an
unreachable case in the switch() statement. :)

Thanks for this testing, by the way!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
@ 2016-07-15 20:21       ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-15 20:21 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Andrew Morton, Jan Kara, ross.zwisler,
	Kirill A. Shutemov, linux-mm, LKML, Hugh Dickins, Greg Thelen,
	Suleiman Souhlal, Andrey Ryabinin, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin

On Fri, Jul 15, 2016 at 09:07:48PM +0200, Dmitry Vyukov wrote:
> On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> >> #include <pthread.h>
> >> #include <stdint.h>
> >> #include <string.h>
> >> #include <stdio.h>
> >> #include <sys/syscall.h>
> >> #include <unistd.h>
> >>
> >> int fd;
> >> char buf[8192];
> >> char filename[256];
> >>
> >> void* thr(void* arg)
> >> {
> >>   switch ((long)arg) {
> >>   case 0:
> >>     write(fd, buf, 0x1001ul);
> >>     break;
> >>   case 1:
> >>     fdatasync(fd);
> >>     break;
> >>   case 2:
> >>     ftruncate(fd, 2);
> >>     break;
> >>   case 3:
> >>     write(fd, buf, 0x20ul);
> >>     break;
> >>   case 5:
> >>     fd = open(filename, 0x50042ul, 0x41ul);
> >>     break;
> >
> > This open() code is unreachable because the thread argument will only be 0-4,
> > right?  Should this be "case 4"?
> 
> I am not sure. I think it I just copy-pasted the program that
> triggered the crash for me. Andrey should have a valid reproducer, in
> the other thread he said that he can reproduce it. Andrey, did you
> change 5 to 4?

Ah, sorry if I wasn't clear.  I don't think you need the open() call to have a
valid reproducer.  In mine, in fact, I only use the first three - the error
happens with a combination of write(), fdatasync() and ftruncate().

I just wanted to note that the test program (which was autogenerated?) had an
unreachable case in the switch() statement. :)

Thanks for this testing, by the way!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
  2016-07-15 20:21       ` Ross Zwisler
@ 2016-07-15 20:25         ` Dmitry Vyukov
  -1 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-15 20:25 UTC (permalink / raw)
  To: syzkaller
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko,
	Sasha Levin

On Fri, Jul 15, 2016 at 10:21 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jul 15, 2016 at 09:07:48PM +0200, Dmitry Vyukov wrote:
>> On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> >> #include <pthread.h>
>> >> #include <stdint.h>
>> >> #include <string.h>
>> >> #include <stdio.h>
>> >> #include <sys/syscall.h>
>> >> #include <unistd.h>
>> >>
>> >> int fd;
>> >> char buf[8192];
>> >> char filename[256];
>> >>
>> >> void* thr(void* arg)
>> >> {
>> >>   switch ((long)arg) {
>> >>   case 0:
>> >>     write(fd, buf, 0x1001ul);
>> >>     break;
>> >>   case 1:
>> >>     fdatasync(fd);
>> >>     break;
>> >>   case 2:
>> >>     ftruncate(fd, 2);
>> >>     break;
>> >>   case 3:
>> >>     write(fd, buf, 0x20ul);
>> >>     break;
>> >>   case 5:
>> >>     fd = open(filename, 0x50042ul, 0x41ul);
>> >>     break;
>> >
>> > This open() code is unreachable because the thread argument will only be 0-4,
>> > right?  Should this be "case 4"?
>>
>> I am not sure. I think it I just copy-pasted the program that
>> triggered the crash for me. Andrey should have a valid reproducer, in
>> the other thread he said that he can reproduce it. Andrey, did you
>> change 5 to 4?
>
> Ah, sorry if I wasn't clear.  I don't think you need the open() call to have a
> valid reproducer.  In mine, in fact, I only use the first three - the error
> happens with a combination of write(), fdatasync() and ftruncate().
>
> I just wanted to note that the test program (which was autogenerated?) had an
> unreachable case in the switch() statement. :)
>
> Thanks for this testing, by the way!


Ah, OK. I modified the program by hand to make it trigger the bug more
frequently. So I think the bug was introduced by me. Generator should
not generate dead cases.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: GPF in find_get_pages_tag
@ 2016-07-15 20:25         ` Dmitry Vyukov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Vyukov @ 2016-07-15 20:25 UTC (permalink / raw)
  To: syzkaller
  Cc: Andrew Morton, Jan Kara, ross.zwisler, Kirill A. Shutemov,
	linux-mm, LKML, Hugh Dickins, Greg Thelen, Suleiman Souhlal,
	Andrey Ryabinin, Kostya Serebryany, Alexander Potapenko,
	Sasha Levin

On Fri, Jul 15, 2016 at 10:21 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jul 15, 2016 at 09:07:48PM +0200, Dmitry Vyukov wrote:
>> On Fri, Jul 15, 2016 at 9:03 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> >> #include <pthread.h>
>> >> #include <stdint.h>
>> >> #include <string.h>
>> >> #include <stdio.h>
>> >> #include <sys/syscall.h>
>> >> #include <unistd.h>
>> >>
>> >> int fd;
>> >> char buf[8192];
>> >> char filename[256];
>> >>
>> >> void* thr(void* arg)
>> >> {
>> >>   switch ((long)arg) {
>> >>   case 0:
>> >>     write(fd, buf, 0x1001ul);
>> >>     break;
>> >>   case 1:
>> >>     fdatasync(fd);
>> >>     break;
>> >>   case 2:
>> >>     ftruncate(fd, 2);
>> >>     break;
>> >>   case 3:
>> >>     write(fd, buf, 0x20ul);
>> >>     break;
>> >>   case 5:
>> >>     fd = open(filename, 0x50042ul, 0x41ul);
>> >>     break;
>> >
>> > This open() code is unreachable because the thread argument will only be 0-4,
>> > right?  Should this be "case 4"?
>>
>> I am not sure. I think it I just copy-pasted the program that
>> triggered the crash for me. Andrey should have a valid reproducer, in
>> the other thread he said that he can reproduce it. Andrey, did you
>> change 5 to 4?
>
> Ah, sorry if I wasn't clear.  I don't think you need the open() call to have a
> valid reproducer.  In mine, in fact, I only use the first three - the error
> happens with a combination of write(), fdatasync() and ftruncate().
>
> I just wanted to note that the test program (which was autogenerated?) had an
> unreachable case in the switch() statement. :)
>
> Thanks for this testing, by the way!


Ah, OK. I modified the program by hand to make it trigger the bug more
frequently. So I think the bug was introduced by me. Generator should
not generate dead cases.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-15 19:00         ` Ross Zwisler
@ 2016-07-15 20:57           ` Andrew Morton
  -1 siblings, 0 replies; 35+ messages in thread
From: Andrew Morton @ 2016-07-15 20:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Jan Kara, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable

On Fri, 15 Jul 2016 13:00:40 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:

> 
> ...
>
> In looking at this more, I agree that your patch fixes this particular bug,
> but I think that ultimately we might want something more general.
> 
> ...
>
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>  static __always_inline void **
>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>  {
> +	if (unlikely(!slot))
> +		return NULL;
> +
>  	if (flags & RADIX_TREE_ITER_TAGGED) {
>  		void *canon = slot;

I'll hang onto Andrey's
radix-tree-fix-radix_tree_iter_retry-for-tagged-iterators.patch for
now, plan to send it in to Linus Wednesdayish.  If we can get the above
settled down prior to that then I shall swap over.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-15 20:57           ` Andrew Morton
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Morton @ 2016-07-15 20:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Jan Kara, Kirill A. Shutemov, linux-mm,
	Greg Thelen, Suleiman Souhlal, syzkaller, Kostya Serebryany,
	Alexander Potapenko, Sasha Levin, linux-kernel,
	Konstantin Khlebnikov, Matthew Wilcox, Hugh Dickins, stable

On Fri, 15 Jul 2016 13:00:40 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:

> 
> ...
>
> In looking at this more, I agree that your patch fixes this particular bug,
> but I think that ultimately we might want something more general.
> 
> ...
>
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>  static __always_inline void **
>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>  {
> +	if (unlikely(!slot))
> +		return NULL;
> +
>  	if (flags & RADIX_TREE_ITER_TAGGED) {
>  		void *canon = slot;

I'll hang onto Andrey's
radix-tree-fix-radix_tree_iter_retry-for-tagged-iterators.patch for
now, plan to send it in to Linus Wednesdayish.  If we can get the above
settled down prior to that then I shall swap over.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-15 19:00         ` Ross Zwisler
@ 2016-07-16 13:45           ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-16 13:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
>> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
>> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>> >> leading to crash:
>> >>
>> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> >> ....
>> >> Call Trace:
>> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>> >>  [<     inline     >] vfs_fsync fs/sync.c:209
>> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>> >>
>> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>> >> go to the slow-path in radix_tree_next_chunk().
>> >
>> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
>> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
>> > does a 'continue'.  This will hop to the next iteration of the
>> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
>> > the for() loop:
>> >
>> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)    \
>> >     for (slot = radix_tree_iter_init(iter, start) ;                 \
>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>> >          slot = radix_tree_next_slot(slot, iter,                    \
>> >                             RADIX_TREE_ITER_TAGGED))
>> >
>> > So, we'll run the
>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>> >
>> > bit first.
>>
>> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
>> and only after that goes the condition statement.
>
> Right...*sigh*...  Thanks for the sanity check. :)
>
>> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
>> > this point radix_tree_next_slot() hasn't been called.
>> >
>> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
>> > iter->tags before it returns.  The next iteration of the loop in
>> > find_get_pages_tag() will use the non-NULL slot provided by
>> > radix_tree_next_chunk(), and only after that iteration will we call
>> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
>> >
>> > Do you have a test setup that reliably fails without this code but passes when
>> > you zero out iter->tags?
>> >
>>
>>
>> Yup, I run Dmitry's reproducer in a parallel loop:
>>       $ while true; do ./a.out & done
>>
>> Usually it takes just couple minutes maximum.
>
> Cool - I was able to get this to work on my system as well by upping the
> thread count.
>
> In looking at this more, I agree that your patch fixes this particular bug,
> but I think that ultimately we might want something more general.
>
> IIUC, the real issue is that we shouldn't be running through
> radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
> fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
> guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
> 'slot'.
>
> I've run this patch in my test setup, and it fixes the reproducer provided by
> Dmitry.  I've also run xfstests against it with out any failures.
>
> --- 8< ---
> From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> Date: Fri, 15 Jul 2016 12:46:38 -0600
> Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()
>
> There are four cases I can see where we could end up with a NULL 'slot' in
> radix_tree_next_slot() (there might be more):
>
> 1) radix_tree_iter_retry() via a non-tagged iteration like
> radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
> because radix_tree_iter_retry() sets
>
>         iter->next_index = iter->index;
>
> which means that in in the else case in radix_tree_next_slot(), 'count' is
> zero, so we skip over the while() loop and effectively just return NULL
> without ever dereferencing 'slot'.
>
> 2) radix_tree_iter_retry() via tagged iteration like
> radix_tree_for_each_tagged().  With the current code this case is
> unhandled and we have seen it result in a kernel crash when we dereference
> 'slot'.
>
> 3) radix_tree_iter_next() via via a non-tagged iteration like
> radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
> and shmem_partial_swap_usage().
>
> I think that this case is currently unhandled.  Unlike with
> radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
> case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
> up executing code in the while() loop in radix_tree_next_slot() that assumes
> 'slot' is valid.
>
> I haven't actually seen this crash on a test setup, but I don't think the
> current code is safe.

This is becase distance between ->index and ->next_index now could be
more that one?

We could fix that by adding "iter->index = iter->next_index - 1;" into
radix_tree_iter_next()
right after updating next_index and tweak multi-order itreration logic
if it depends on that.

I'd like to keep radix_tree_next_slot() as small as possible because
this is supposed to be a fast-path.

>
> 4) radix_tree_iter_next() via tagged iteration like
> radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().
>
> radix_tree_iter_next() zeros out iter->tags, so we end up exiting
> radix_tree_next_slot() here:
>
>         if (flags & RADIX_TREE_ITER_TAGGED) {
>                 void *canon = slot;
>
>                 iter->tags >>= 1;
>                 if (unlikely(!iter->tags))
>                         return NULL;
>
> Really we want to guarantee that we just bail out  of
> radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
> way of handling all the 4 above cases.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/radix-tree.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..840308d 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>  static __always_inline void **
>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>  {
> +       if (unlikely(!slot))
> +               return NULL;
> +
>         if (flags & RADIX_TREE_ITER_TAGGED) {
>                 void *canon = slot;
>
> --
> 2.9.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-16 13:45           ` Konstantin Khlebnikov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-16 13:45 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
>> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
>> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>> >> leading to crash:
>> >>
>> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>> >> ....
>> >> Call Trace:
>> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>> >>  [<     inline     >] vfs_fsync fs/sync.c:209
>> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>> >>
>> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>> >> go to the slow-path in radix_tree_next_chunk().
>> >
>> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
>> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
>> > does a 'continue'.  This will hop to the next iteration of the
>> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
>> > the for() loop:
>> >
>> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)    \
>> >     for (slot = radix_tree_iter_init(iter, start) ;                 \
>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>> >          slot = radix_tree_next_slot(slot, iter,                    \
>> >                             RADIX_TREE_ITER_TAGGED))
>> >
>> > So, we'll run the
>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>> >
>> > bit first.
>>
>> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
>> and only after that goes the condition statement.
>
> Right...*sigh*...  Thanks for the sanity check. :)
>
>> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
>> > this point radix_tree_next_slot() hasn't been called.
>> >
>> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
>> > iter->tags before it returns.  The next iteration of the loop in
>> > find_get_pages_tag() will use the non-NULL slot provided by
>> > radix_tree_next_chunk(), and only after that iteration will we call
>> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
>> >
>> > Do you have a test setup that reliably fails without this code but passes when
>> > you zero out iter->tags?
>> >
>>
>>
>> Yup, I run Dmitry's reproducer in a parallel loop:
>>       $ while true; do ./a.out & done
>>
>> Usually it takes just couple minutes maximum.
>
> Cool - I was able to get this to work on my system as well by upping the
> thread count.
>
> In looking at this more, I agree that your patch fixes this particular bug,
> but I think that ultimately we might want something more general.
>
> IIUC, the real issue is that we shouldn't be running through
> radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
> fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
> guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
> 'slot'.
>
> I've run this patch in my test setup, and it fixes the reproducer provided by
> Dmitry.  I've also run xfstests against it with out any failures.
>
> --- 8< ---
> From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> Date: Fri, 15 Jul 2016 12:46:38 -0600
> Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()
>
> There are four cases I can see where we could end up with a NULL 'slot' in
> radix_tree_next_slot() (there might be more):
>
> 1) radix_tree_iter_retry() via a non-tagged iteration like
> radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
> because radix_tree_iter_retry() sets
>
>         iter->next_index = iter->index;
>
> which means that in in the else case in radix_tree_next_slot(), 'count' is
> zero, so we skip over the while() loop and effectively just return NULL
> without ever dereferencing 'slot'.
>
> 2) radix_tree_iter_retry() via tagged iteration like
> radix_tree_for_each_tagged().  With the current code this case is
> unhandled and we have seen it result in a kernel crash when we dereference
> 'slot'.
>
> 3) radix_tree_iter_next() via via a non-tagged iteration like
> radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
> and shmem_partial_swap_usage().
>
> I think that this case is currently unhandled.  Unlike with
> radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
> case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
> up executing code in the while() loop in radix_tree_next_slot() that assumes
> 'slot' is valid.
>
> I haven't actually seen this crash on a test setup, but I don't think the
> current code is safe.

This is becase distance between ->index and ->next_index now could be
more that one?

We could fix that by adding "iter->index = iter->next_index - 1;" into
radix_tree_iter_next()
right after updating next_index and tweak multi-order itreration logic
if it depends on that.

I'd like to keep radix_tree_next_slot() as small as possible because
this is supposed to be a fast-path.

>
> 4) radix_tree_iter_next() via tagged iteration like
> radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().
>
> radix_tree_iter_next() zeros out iter->tags, so we end up exiting
> radix_tree_next_slot() here:
>
>         if (flags & RADIX_TREE_ITER_TAGGED) {
>                 void *canon = slot;
>
>                 iter->tags >>= 1;
>                 if (unlikely(!iter->tags))
>                         return NULL;
>
> Really we want to guarantee that we just bail out  of
> radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
> way of handling all the 4 above cases.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/radix-tree.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index cb4b7e8..840308d 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>  static __always_inline void **
>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>  {
> +       if (unlikely(!slot))
> +               return NULL;
> +
>         if (flags & RADIX_TREE_ITER_TAGGED) {
>                 void *canon = slot;
>
> --
> 2.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-16 13:45           ` Konstantin Khlebnikov
@ 2016-07-17 17:57             ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-17 17:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Sat, Jul 16, 2016 at 4:45 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
>> On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
>>> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
>>> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>>> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>>> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>>> >> leading to crash:
>>> >>
>>> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>>> >> ....
>>> >> Call Trace:
>>> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>>> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>> >>  [<     inline     >] vfs_fsync fs/sync.c:209
>>> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>>> >>
>>> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>>> >> go to the slow-path in radix_tree_next_chunk().
>>> >
>>> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
>>> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
>>> > does a 'continue'.  This will hop to the next iteration of the
>>> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
>>> > the for() loop:
>>> >
>>> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)    \
>>> >     for (slot = radix_tree_iter_init(iter, start) ;                 \
>>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>>> >          slot = radix_tree_next_slot(slot, iter,                    \
>>> >                             RADIX_TREE_ITER_TAGGED))
>>> >
>>> > So, we'll run the
>>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>>> >
>>> > bit first.
>>>
>>> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
>>> and only after that goes the condition statement.
>>
>> Right...*sigh*...  Thanks for the sanity check. :)
>>
>>> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
>>> > this point radix_tree_next_slot() hasn't been called.
>>> >
>>> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
>>> > iter->tags before it returns.  The next iteration of the loop in
>>> > find_get_pages_tag() will use the non-NULL slot provided by
>>> > radix_tree_next_chunk(), and only after that iteration will we call
>>> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
>>> >
>>> > Do you have a test setup that reliably fails without this code but passes when
>>> > you zero out iter->tags?
>>> >
>>>
>>>
>>> Yup, I run Dmitry's reproducer in a parallel loop:
>>>       $ while true; do ./a.out & done
>>>
>>> Usually it takes just couple minutes maximum.
>>
>> Cool - I was able to get this to work on my system as well by upping the
>> thread count.
>>
>> In looking at this more, I agree that your patch fixes this particular bug,
>> but I think that ultimately we might want something more general.
>>
>> IIUC, the real issue is that we shouldn't be running through
>> radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
>> fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
>> guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
>> 'slot'.
>>
>> I've run this patch in my test setup, and it fixes the reproducer provided by
>> Dmitry.  I've also run xfstests against it with out any failures.
>>
>> --- 8< ---
>> From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Date: Fri, 15 Jul 2016 12:46:38 -0600
>> Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()
>>
>> There are four cases I can see where we could end up with a NULL 'slot' in
>> radix_tree_next_slot() (there might be more):
>>
>> 1) radix_tree_iter_retry() via a non-tagged iteration like
>> radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
>> because radix_tree_iter_retry() sets
>>
>>         iter->next_index = iter->index;
>>
>> which means that in in the else case in radix_tree_next_slot(), 'count' is
>> zero, so we skip over the while() loop and effectively just return NULL
>> without ever dereferencing 'slot'.
>>
>> 2) radix_tree_iter_retry() via tagged iteration like
>> radix_tree_for_each_tagged().  With the current code this case is
>> unhandled and we have seen it result in a kernel crash when we dereference
>> 'slot'.
>>
>> 3) radix_tree_iter_next() via via a non-tagged iteration like
>> radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
>> and shmem_partial_swap_usage().
>>
>> I think that this case is currently unhandled.  Unlike with
>> radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
>> case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
>> up executing code in the while() loop in radix_tree_next_slot() that assumes
>> 'slot' is valid.
>>
>> I haven't actually seen this crash on a test setup, but I don't think the
>> current code is safe.
>
> This is becase distance between ->index and ->next_index now could be
> more that one?
>
> We could fix that by adding "iter->index = iter->next_index - 1;" into
> radix_tree_iter_next()
> right after updating next_index and tweak multi-order itreration logic
> if it depends on that.
>
> I'd like to keep radix_tree_next_slot() as small as possible because
> this is supposed to be a fast-path.

Support of multi-order entries in iterator is ridiculously over-engineered.
If radix_tree_next_chunk() finds multi-order entry it must return chunk
with size 1, radix_tree_next_slot() should know nothing about that.
I'll try to fix that.

>
>>
>> 4) radix_tree_iter_next() via tagged iteration like
>> radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().
>>
>> radix_tree_iter_next() zeros out iter->tags, so we end up exiting
>> radix_tree_next_slot() here:
>>
>>         if (flags & RADIX_TREE_ITER_TAGGED) {
>>                 void *canon = slot;
>>
>>                 iter->tags >>= 1;
>>                 if (unlikely(!iter->tags))
>>                         return NULL;
>>
>> Really we want to guarantee that we just bail out  of
>> radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
>> way of handling all the 4 above cases.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> ---
>>  include/linux/radix-tree.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index cb4b7e8..840308d 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>>  static __always_inline void **
>>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>>  {
>> +       if (unlikely(!slot))
>> +               return NULL;
>> +
>>         if (flags & RADIX_TREE_ITER_TAGGED) {
>>                 void *canon = slot;
>>
>> --
>> 2.9.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-17 17:57             ` Konstantin Khlebnikov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-17 17:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Sat, Jul 16, 2016 at 4:45 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
>> On Fri, Jul 15, 2016 at 11:52:58AM +0300, Andrey Ryabinin wrote:
>>> On 07/15/2016 01:25 AM, Ross Zwisler wrote:
>>> > On Thu, Jul 14, 2016 at 02:19:56PM +0300, Andrey Ryabinin wrote:
>>> >> radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
>>> >> Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
>>> >> leading to crash:
>>> >>
>>> >> RIP: [<     inline     >] radix_tree_next_slot include/linux/radix-tree.h:473
>>> >>   [<ffffffff816951a4>] find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
>>> >> ....
>>> >> Call Trace:
>>> >>  [<ffffffff816cd91a>] pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
>>> >>  [<ffffffff81ab4231>] mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
>>> >>  [<ffffffff81ac883e>] ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
>>> >>  [<ffffffff816c99c7>] do_writepages+0x97/0x100 mm/page-writeback.c:2364
>>> >>  [<ffffffff8169bee8>] __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
>>> >>  [<ffffffff8169c371>] filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
>>> >>  [<ffffffff81aa584d>] ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
>>> >>  [<ffffffff818b667a>] vfs_fsync_range+0x10a/0x250 fs/sync.c:195
>>> >>  [<     inline     >] vfs_fsync fs/sync.c:209
>>> >>  [<ffffffff818b6832>] do_fsync+0x42/0x70 fs/sync.c:219
>>> >>  [<     inline     >] SYSC_fdatasync fs/sync.c:232
>>> >>  [<ffffffff818b6f89>] SyS_fdatasync+0x19/0x20 fs/sync.c:230
>>> >>  [<ffffffff86a94e00>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
>>> >>
>>> >> We must reset iterator's tags to bail out from radix_tree_next_slot() and
>>> >> go to the slow-path in radix_tree_next_chunk().
>>> >
>>> > This analysis doesn't make sense to me.  In find_get_pages_tag(), when we call
>>> > radix_tree_iter_retry(), this sets the local 'slot' variable to NULL, then
>>> > does a 'continue'.  This will hop to the next iteration of the
>>> > radix_tree_for_each_tagged() loop, which will very check the exit condition of
>>> > the for() loop:
>>> >
>>> > #define radix_tree_for_each_tagged(slot, root, iter, start, tag)    \
>>> >     for (slot = radix_tree_iter_init(iter, start) ;                 \
>>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>>> >          slot = radix_tree_next_slot(slot, iter,                    \
>>> >                             RADIX_TREE_ITER_TAGGED))
>>> >
>>> > So, we'll run the
>>> >          slot || (slot = radix_tree_next_chunk(root, iter,          \
>>> >                           RADIX_TREE_ITER_TAGGED | tag)) ;          \
>>> >
>>> > bit first.
>>>
>>> This is not the way how the for() loop works. slot = radix_tree_next_slot() executed first
>>> and only after that goes the condition statement.
>>
>> Right...*sigh*...  Thanks for the sanity check. :)
>>
>>> > 'slot' is NULL, so we'll set it via radix_tree_next_chunk().  At
>>> > this point radix_tree_next_slot() hasn't been called.
>>> >
>>> > radix_tree_next_chunk() will set up the iter->index, iter->next_index and
>>> > iter->tags before it returns.  The next iteration of the loop in
>>> > find_get_pages_tag() will use the non-NULL slot provided by
>>> > radix_tree_next_chunk(), and only after that iteration will we call
>>> > radix_tree_next_slot() again.  By then iter->tags should be up to date.
>>> >
>>> > Do you have a test setup that reliably fails without this code but passes when
>>> > you zero out iter->tags?
>>> >
>>>
>>>
>>> Yup, I run Dmitry's reproducer in a parallel loop:
>>>       $ while true; do ./a.out & done
>>>
>>> Usually it takes just couple minutes maximum.
>>
>> Cool - I was able to get this to work on my system as well by upping the
>> thread count.
>>
>> In looking at this more, I agree that your patch fixes this particular bug,
>> but I think that ultimately we might want something more general.
>>
>> IIUC, the real issue is that we shouldn't be running through
>> radix_tree_next_slot() with a NULL 'slot' parameter.  In the end I think it's
>> fine to zero out iter->tags in radix_tree_iter_retry(), but really we want to
>> guarantee that we just bail out of radix_tree_next_slot() if we have a NULL
>> 'slot'.
>>
>> I've run this patch in my test setup, and it fixes the reproducer provided by
>> Dmitry.  I've also run xfstests against it with out any failures.
>>
>> --- 8< ---
>> From 533beefac12f61f467aeb72e2d2c46685247b9bc Mon Sep 17 00:00:00 2001
>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Date: Fri, 15 Jul 2016 12:46:38 -0600
>> Subject: [PATCH] radix-tree: 'slot' can be NULL in radix_tree_next_slot()
>>
>> There are four cases I can see where we could end up with a NULL 'slot' in
>> radix_tree_next_slot() (there might be more):
>>
>> 1) radix_tree_iter_retry() via a non-tagged iteration like
>> radix_tree_for_each_slot().  In this case we currently aren't seeing a bug
>> because radix_tree_iter_retry() sets
>>
>>         iter->next_index = iter->index;
>>
>> which means that in in the else case in radix_tree_next_slot(), 'count' is
>> zero, so we skip over the while() loop and effectively just return NULL
>> without ever dereferencing 'slot'.
>>
>> 2) radix_tree_iter_retry() via tagged iteration like
>> radix_tree_for_each_tagged().  With the current code this case is
>> unhandled and we have seen it result in a kernel crash when we dereference
>> 'slot'.
>>
>> 3) radix_tree_iter_next() via via a non-tagged iteration like
>> radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
>> and shmem_partial_swap_usage().
>>
>> I think that this case is currently unhandled.  Unlike with
>> radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
>> case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
>> up executing code in the while() loop in radix_tree_next_slot() that assumes
>> 'slot' is valid.
>>
>> I haven't actually seen this crash on a test setup, but I don't think the
>> current code is safe.
>
> This is becase distance between ->index and ->next_index now could be
> more that one?
>
> We could fix that by adding "iter->index = iter->next_index - 1;" into
> radix_tree_iter_next()
> right after updating next_index and tweak multi-order itreration logic
> if it depends on that.
>
> I'd like to keep radix_tree_next_slot() as small as possible because
> this is supposed to be a fast-path.

Support of multi-order entries in iterator is ridiculously over-engineered.
If radix_tree_next_chunk() finds multi-order entry it must return chunk
with size 1, radix_tree_next_slot() should know nothing about that.
I'll try to fix that.

>
>>
>> 4) radix_tree_iter_next() via tagged iteration like
>> radix_tree_for_each_tagged().  This happens in shmem_wait_for_pins().
>>
>> radix_tree_iter_next() zeros out iter->tags, so we end up exiting
>> radix_tree_next_slot() here:
>>
>>         if (flags & RADIX_TREE_ITER_TAGGED) {
>>                 void *canon = slot;
>>
>>                 iter->tags >>= 1;
>>                 if (unlikely(!iter->tags))
>>                         return NULL;
>>
>> Really we want to guarantee that we just bail out  of
>> radix_tree_next_slot() if we have a NULL 'slot'.  This is a more explicit
>> way of handling all the 4 above cases.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> ---
>>  include/linux/radix-tree.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index cb4b7e8..840308d 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
>>  static __always_inline void **
>>  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
>>  {
>> +       if (unlikely(!slot))
>> +               return NULL;
>> +
>>         if (flags & RADIX_TREE_ITER_TAGGED) {
>>                 void *canon = slot;
>>
>> --
>> 2.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-15 20:57           ` Andrew Morton
@ 2016-07-18 23:09             ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-18 23:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Andrey Ryabinin, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Fri, Jul 15, 2016 at 01:57:33PM -0700, Andrew Morton wrote:
> On Fri, 15 Jul 2016 13:00:40 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> 
> > 
> > ...
> >
> > In looking at this more, I agree that your patch fixes this particular bug,
> > but I think that ultimately we might want something more general.
> > 
> > ...
> >
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
> >  static __always_inline void **
> >  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
> >  {
> > +	if (unlikely(!slot))
> > +		return NULL;
> > +
> >  	if (flags & RADIX_TREE_ITER_TAGGED) {
> >  		void *canon = slot;
> 
> I'll hang onto Andrey's
> radix-tree-fix-radix_tree_iter_retry-for-tagged-iterators.patch for
> now, plan to send it in to Linus Wednesdayish.  If we can get the above
> settled down prior to that then I shall swap over.

Sure, that works for me.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-18 23:09             ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-18 23:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Andrey Ryabinin, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	linux-kernel, Konstantin Khlebnikov, Matthew Wilcox,
	Hugh Dickins, stable

On Fri, Jul 15, 2016 at 01:57:33PM -0700, Andrew Morton wrote:
> On Fri, 15 Jul 2016 13:00:40 -0600 Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> 
> > 
> > ...
> >
> > In looking at this more, I agree that your patch fixes this particular bug,
> > but I think that ultimately we might want something more general.
> > 
> > ...
> >
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -463,6 +463,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
> >  static __always_inline void **
> >  radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
> >  {
> > +	if (unlikely(!slot))
> > +		return NULL;
> > +
> >  	if (flags & RADIX_TREE_ITER_TAGGED) {
> >  		void *canon = slot;
> 
> I'll hang onto Andrey's
> radix-tree-fix-radix_tree_iter_retry-for-tagged-iterators.patch for
> now, plan to send it in to Linus Wednesdayish.  If we can get the above
> settled down prior to that then I shall swap over.

Sure, that works for me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-16 13:45           ` Konstantin Khlebnikov
@ 2016-07-19  4:11             ` Ross Zwisler
  -1 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-19  4:11 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara,
	Kirill A. Shutemov, linux-mm, Greg Thelen, Suleiman Souhlal,
	syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Sat, Jul 16, 2016 at 04:45:31PM +0300, Konstantin Khlebnikov wrote:
> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
<>
> > 3) radix_tree_iter_next() via via a non-tagged iteration like
> > radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
> > and shmem_partial_swap_usage().
> >
> > I think that this case is currently unhandled.  Unlike with
> > radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
> > case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
> > up executing code in the while() loop in radix_tree_next_slot() that assumes
> > 'slot' is valid.
> >
> > I haven't actually seen this crash on a test setup, but I don't think the
> > current code is safe.
> 
> This is becase distance between ->index and ->next_index now could be
> more that one?
>
> We could fix that by adding "iter->index = iter->next_index - 1;" into
> radix_tree_iter_next()
> right after updating next_index and tweak multi-order itreration logic
> if it depends on that.
> 
> I'd like to keep radix_tree_next_slot() as small as possible because
> this is supposed to be a fast-path.

I think it'll be exactly one?

	iter->next_index = __radix_tree_iter_add(iter, 1);

So iter->index will be X, iter->next_index will be X+1, accounting for the
iterator's shift.  So, basically, whatever your height is, you'll be set up to
process one more entry, I think.

This means that radix_tree_chunk_size() will return 1.  I guess with the
current logic this is safe:

static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
...
	} else {
		long count = radix_tree_chunk_size(iter);
		void *canon = slot;

		while (--count > 0) {
			/* code that assumes 'slot' is non-NULL */

So 'count' will be 1, the prefix decrement will make it 0, so we won't execute
the code in the while() loop.  So maybe all the cases are covered after all.

It seems like we need some unit tests in tools/testing/radix-tree around this
- I'll try and find time to add them this week.

I just feel like this isn't very organized.  We have a lot of code in
radix_tree_next_slot() that assumes that 'slot' is non-NULL, but we don't
check it anywhere.  We know it *can* be NULL, but we just happen to have
things set up so that none of the code that uses 'slot' is executed.

I personally feel like a quick check for slot==NULL at the beginning of the
function is the simplest way to keep ourselves safe, and it doesn't seem like
we'd be adding that much overhead.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-19  4:11             ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2016-07-19  4:11 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Ross Zwisler, Andrey Ryabinin, Andrew Morton, Jan Kara,
	Kirill A. Shutemov, linux-mm, Greg Thelen, Suleiman Souhlal,
	syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Sat, Jul 16, 2016 at 04:45:31PM +0300, Konstantin Khlebnikov wrote:
> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
<>
> > 3) radix_tree_iter_next() via via a non-tagged iteration like
> > radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
> > and shmem_partial_swap_usage().
> >
> > I think that this case is currently unhandled.  Unlike with
> > radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
> > case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
> > up executing code in the while() loop in radix_tree_next_slot() that assumes
> > 'slot' is valid.
> >
> > I haven't actually seen this crash on a test setup, but I don't think the
> > current code is safe.
> 
> This is becase distance between ->index and ->next_index now could be
> more that one?
>
> We could fix that by adding "iter->index = iter->next_index - 1;" into
> radix_tree_iter_next()
> right after updating next_index and tweak multi-order itreration logic
> if it depends on that.
> 
> I'd like to keep radix_tree_next_slot() as small as possible because
> this is supposed to be a fast-path.

I think it'll be exactly one?

	iter->next_index = __radix_tree_iter_add(iter, 1);

So iter->index will be X, iter->next_index will be X+1, accounting for the
iterator's shift.  So, basically, whatever your height is, you'll be set up to
process one more entry, I think.

This means that radix_tree_chunk_size() will return 1.  I guess with the
current logic this is safe:

static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
...
	} else {
		long count = radix_tree_chunk_size(iter);
		void *canon = slot;

		while (--count > 0) {
			/* code that assumes 'slot' is non-NULL */

So 'count' will be 1, the prefix decrement will make it 0, so we won't execute
the code in the while() loop.  So maybe all the cases are covered after all.

It seems like we need some unit tests in tools/testing/radix-tree around this
- I'll try and find time to add them this week.

I just feel like this isn't very organized.  We have a lot of code in
radix_tree_next_slot() that assumes that 'slot' is non-NULL, but we don't
check it anywhere.  We know it *can* be NULL, but we just happen to have
things set up so that none of the code that uses 'slot' is executed.

I personally feel like a quick check for slot==NULL at the beginning of the
function is the simplest way to keep ourselves safe, and it doesn't seem like
we'd be adding that much overhead.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
  2016-07-19  4:11             ` Ross Zwisler
@ 2016-07-19  9:07               ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-19  9:07 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Tue, Jul 19, 2016 at 7:11 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sat, Jul 16, 2016 at 04:45:31PM +0300, Konstantin Khlebnikov wrote:
>> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
> <>
>> > 3) radix_tree_iter_next() via via a non-tagged iteration like
>> > radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
>> > and shmem_partial_swap_usage().
>> >
>> > I think that this case is currently unhandled.  Unlike with
>> > radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
>> > case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
>> > up executing code in the while() loop in radix_tree_next_slot() that assumes
>> > 'slot' is valid.
>> >
>> > I haven't actually seen this crash on a test setup, but I don't think the
>> > current code is safe.
>>
>> This is becase distance between ->index and ->next_index now could be
>> more that one?
>>
>> We could fix that by adding "iter->index = iter->next_index - 1;" into
>> radix_tree_iter_next()
>> right after updating next_index and tweak multi-order itreration logic
>> if it depends on that.
>>
>> I'd like to keep radix_tree_next_slot() as small as possible because
>> this is supposed to be a fast-path.
>
> I think it'll be exactly one?
>
>         iter->next_index = __radix_tree_iter_add(iter, 1);
>
> So iter->index will be X, iter->next_index will be X+1, accounting for the
> iterator's shift.  So, basically, whatever your height is, you'll be set up to
> process one more entry, I think.
>
> This means that radix_tree_chunk_size() will return 1.  I guess with the
> current logic this is safe:
>
> static __always_inline void **
> radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
> {
> ...
>         } else {
>                 long count = radix_tree_chunk_size(iter);
>                 void *canon = slot;
>
>                 while (--count > 0) {
>                         /* code that assumes 'slot' is non-NULL */
>
> So 'count' will be 1, the prefix decrement will make it 0, so we won't execute
> the code in the while() loop.  So maybe all the cases are covered after all.
>
> It seems like we need some unit tests in tools/testing/radix-tree around this
> - I'll try and find time to add them this week.
>
> I just feel like this isn't very organized.  We have a lot of code in
> radix_tree_next_slot() that assumes that 'slot' is non-NULL, but we don't
> check it anywhere.  We know it *can* be NULL, but we just happen to have
> things set up so that none of the code that uses 'slot' is executed.
>
> I personally feel like a quick check for slot==NULL at the beginning of the
> function is the simplest way to keep ourselves safe, and it doesn't seem like
> we'd be adding that much overhead.

Either fix is fine now. I working on better design for multiorder iterator which
moves all that logic from radix_tree_next_slot() into radix_tree_next_chunk().

Most likely I'll change tree structure a little. For example I think sibling
entries chould hold offset to head entry and order rather than a pointer to it.
Or maybe size: support of non-power-of-2 entries is interesting feature too.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators.
@ 2016-07-19  9:07               ` Konstantin Khlebnikov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Khlebnikov @ 2016-07-19  9:07 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Andrew Morton, Jan Kara, Kirill A. Shutemov,
	linux-mm, Greg Thelen, Suleiman Souhlal, syzkaller,
	Kostya Serebryany, Alexander Potapenko, Sasha Levin,
	Linux Kernel Mailing List, Matthew Wilcox, Hugh Dickins, Stable

On Tue, Jul 19, 2016 at 7:11 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sat, Jul 16, 2016 at 04:45:31PM +0300, Konstantin Khlebnikov wrote:
>> On Fri, Jul 15, 2016 at 10:00 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
> <>
>> > 3) radix_tree_iter_next() via via a non-tagged iteration like
>> > radix_tree_for_each_slot().  This currently happens in shmem_tag_pins()
>> > and shmem_partial_swap_usage().
>> >
>> > I think that this case is currently unhandled.  Unlike with
>> > radix_tree_iter_retry() case (#1 above) we can't rely on 'count' in the else
>> > case of radix_tree_next_slot() to be zero, so I think it's possible we'll end
>> > up executing code in the while() loop in radix_tree_next_slot() that assumes
>> > 'slot' is valid.
>> >
>> > I haven't actually seen this crash on a test setup, but I don't think the
>> > current code is safe.
>>
>> This is becase distance between ->index and ->next_index now could be
>> more that one?
>>
>> We could fix that by adding "iter->index = iter->next_index - 1;" into
>> radix_tree_iter_next()
>> right after updating next_index and tweak multi-order itreration logic
>> if it depends on that.
>>
>> I'd like to keep radix_tree_next_slot() as small as possible because
>> this is supposed to be a fast-path.
>
> I think it'll be exactly one?
>
>         iter->next_index = __radix_tree_iter_add(iter, 1);
>
> So iter->index will be X, iter->next_index will be X+1, accounting for the
> iterator's shift.  So, basically, whatever your height is, you'll be set up to
> process one more entry, I think.
>
> This means that radix_tree_chunk_size() will return 1.  I guess with the
> current logic this is safe:
>
> static __always_inline void **
> radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
> {
> ...
>         } else {
>                 long count = radix_tree_chunk_size(iter);
>                 void *canon = slot;
>
>                 while (--count > 0) {
>                         /* code that assumes 'slot' is non-NULL */
>
> So 'count' will be 1, the prefix decrement will make it 0, so we won't execute
> the code in the while() loop.  So maybe all the cases are covered after all.
>
> It seems like we need some unit tests in tools/testing/radix-tree around this
> - I'll try and find time to add them this week.
>
> I just feel like this isn't very organized.  We have a lot of code in
> radix_tree_next_slot() that assumes that 'slot' is non-NULL, but we don't
> check it anywhere.  We know it *can* be NULL, but we just happen to have
> things set up so that none of the code that uses 'slot' is executed.
>
> I personally feel like a quick check for slot==NULL at the beginning of the
> function is the simplest way to keep ourselves safe, and it doesn't seem like
> we'd be adding that much overhead.

Either fix is fine now. I working on better design for multiorder iterator which
moves all that logic from radix_tree_next_slot() into radix_tree_next_chunk().

Most likely I'll change tree structure a little. For example I think sibling
entries chould hold offset to head entry and order rather than a pointer to it.
Or maybe size: support of non-power-of-2 entries is interesting feature too.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-07-19  9:07 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-05 11:39 mm: GPF in find_get_pages_tag Dmitry Vyukov
2016-07-05 11:39 ` Dmitry Vyukov
2016-07-14 11:19 ` [PATCH] radix-tree: fix radix_tree_iter_retry() for tagged iterators Andrey Ryabinin
2016-07-14 11:19   ` Andrey Ryabinin
2016-07-14 11:19   ` Andrey Ryabinin
2016-07-14 12:21   ` Konstantin Khlebnikov
2016-07-14 12:21     ` Konstantin Khlebnikov
2016-07-14 22:25   ` Ross Zwisler
2016-07-14 22:25     ` Ross Zwisler
2016-07-15  8:52     ` Andrey Ryabinin
2016-07-15  8:52       ` Andrey Ryabinin
2016-07-15  8:52       ` Andrey Ryabinin
2016-07-15 19:00       ` Ross Zwisler
2016-07-15 19:00         ` Ross Zwisler
2016-07-15 19:00         ` Ross Zwisler
2016-07-15 20:57         ` Andrew Morton
2016-07-15 20:57           ` Andrew Morton
2016-07-18 23:09           ` Ross Zwisler
2016-07-18 23:09             ` Ross Zwisler
2016-07-16 13:45         ` Konstantin Khlebnikov
2016-07-16 13:45           ` Konstantin Khlebnikov
2016-07-17 17:57           ` Konstantin Khlebnikov
2016-07-17 17:57             ` Konstantin Khlebnikov
2016-07-19  4:11           ` Ross Zwisler
2016-07-19  4:11             ` Ross Zwisler
2016-07-19  9:07             ` Konstantin Khlebnikov
2016-07-19  9:07               ` Konstantin Khlebnikov
2016-07-15 19:03 ` mm: GPF in find_get_pages_tag Ross Zwisler
2016-07-15 19:03   ` Ross Zwisler
2016-07-15 19:07   ` Dmitry Vyukov
2016-07-15 19:07     ` Dmitry Vyukov
2016-07-15 20:21     ` Ross Zwisler
2016-07-15 20:21       ` Ross Zwisler
2016-07-15 20:25       ` Dmitry Vyukov
2016-07-15 20:25         ` Dmitry Vyukov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.