linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
@ 2022-02-16 16:31 Lee Jones
  2022-02-18  1:06 ` John Hubbard
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Lee Jones @ 2022-02-16 16:31 UTC (permalink / raw)
  To: linux-ext4
  Cc: Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Theodore Ts'o, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

Good afternoon,

After recently receiving a bug report from Syzbot [0] which was raised
specifically against the Android v5.10 kernel, I spent some time
trying to get to the crux.  Firstly I reproduced the issue on the
reported kernel, then did the same using the latest release kernel
v5.16.

The full kernel trace can be seen below at [1].

I managed to seemingly bisect the issue down to commit:

  60263d5889e6d ("iomap: fall back to buffered writes for invalidation failures")

Although it appears to be the belief of the Filesystem community that
this is likely not the cause of the issue and should therefore not be
reverted.

It took quite some time, but I managed to strip down the reported
kernel config (which was very large) down to x86_defconfig plus only a
few additional config symbols.  Most of which are platform (qemu in
this case) config options the other is KASAN, required to successfully
reproduce this:

  CONFIG_HYPERVISOR_GUEST=y  
  CONFIG_PARAVIRT=y          
  CONFIG_PARAVIRT_DEBUG=y    
  CONFIG_PARAVIRT_SPINLOCKS=y
  CONFIG_KASAN=y

The (most likely non-optimised) qemu command currently being used is:

qemu-system-x86_64 -smp 8 -m 16G -enable-kvm -cpu max,migratable=off -no-reboot \
    -kernel ${BUILDDIR}/arch/x86/boot/bzImage -nographic                        \
    -hda ${IMAGEDIR}/wheezy-auto-repro.img                                      \                       
    -chardev stdio,id=char0,mux=on,logfile=serial.out,signal=off                \
    -serial chardev:char0 -mon chardev=char0                                    \
    -append "root=/dev/sda rw console=ttyS0"                                     

Darrick seems to suggest that:

  "The BUG report came from page_buffers failing to find any buffer heads
   attached to the page."

If the reproducer, also massively stripped down from the original
report, would be of any use to you, it can be found further down at
[2].

I don't how true this is, but it is my current belief that user-space
should not be able to force the kernel to BUG.  This seems to be a
temporary DoS issue.  So although not a critically serious security
problem involved memory leakage or data corruption, it could
potentially cause a nuisance if not rectified.

Any well meaning help with this would be gratefully received.

Kind regards,
Lee

[0] https://syzkaller.appspot.com/bug?extid=41c966bf0729a530bd8d

[1]
[   15.200920] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   15.215877] File: /syzkaller.IsS3Yc/0/bus PID: 1497 Comm: repro
[   16.718970] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   16.734250] File: /syzkaller.IsS3Yc/5/bus PID: 1512 Comm: repro
[   17.013871] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   17.028193] File: /syzkaller.IsS3Yc/6/bus PID: 1515 Comm: repro
[   17.320498] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   17.336115] File: /syzkaller.IsS3Yc/7/bus PID: 1518 Comm: repro
[   17.617921] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   17.633063] File: /syzkaller.IsS3Yc/8/bus PID: 1521 Comm: repro
[   18.527260] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.544236] File: /syzkaller.IsS3Yc/11/bus PID: 1530 Comm: repro
[   18.810347] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.824721] File: /syzkaller.IsS3Yc/12/bus PID: 1533 Comm: repro
[   19.099315] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   19.114151] File: /syzkaller.IsS3Yc/13/bus PID: 1536 Comm: repro
[   19.403882] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   19.418467] File: /syzkaller.IsS3Yc/14/bus PID: 1539 Comm: repro
[   19.703934] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   19.718400] File: /syzkaller.IsS3Yc/15/bus PID: 1542 Comm: repro
[   26.533129] ------------[ cut here ]------------
[   26.540473] WARNING: CPU: 1 PID: 1612 at fs/ext4/inode.c:3576 ext4_set_page_dirty+0xaf/0xc0
[   26.553171] Modules linked in:
[   26.557354] CPU: 1 PID: 1612 Comm: repro Not tainted 5.16.0+ #169
[   26.565238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   26.576182] RIP: 0010:ext4_set_page_dirty+0xaf/0xc0
[   26.583077] Code: 4c 89 ff e8 e3 86 e7 ff 49 f7 07 00 20 00 00 74 19 4c 89 ff 5b 41 5e 41 5f e9 8d 05 f0 ff 48 83 c0 ff 48 89 c3 e9 76 ff ff ff <0f> 0b eb e3 48 83 c0 ff 48 89 c3 eb 9e 0f 0b eb b8 55 48 89 e5 41
[   26.607402] RSP: 0018:ffff88810f4ffa10 EFLAGS: 00010246
[   26.614646] RAX: ffffea00043bc687 RBX: ffffea00043bc680 RCX: ffffffff9913f86d
[   26.625115] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043bc680
[   26.635137] RBP: 0000000000000400 R08: dffffc0000000000 R09: fffff940008778d1
[   26.644923] R10: fffff940008778d1 R11: 0000000000000000 R12: ffff88810e14c000
[   26.654807] R13: ffffea00043bc680 R14: ffffea00043bc688 R15: ffffea00043bc680
[   26.664812] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
[   26.676238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.684212] CR2: 000000000049b3a8 CR3: 000000010f7a6005 CR4: 0000000000370ee0
[   26.693896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   26.703778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   26.714238] Call Trace:
[   26.717987]  <TASK>
[   26.721105]  folio_mark_dirty+0x72/0xa0
[   26.726455]  set_page_dirty_lock+0x4a/0x70
[   26.732426]  unpin_user_pages_dirty_lock+0x101/0x1d0
[   26.739369]  process_vm_rw_single_vec+0x2f4/0x3c0
[   26.745707]  ? process_vm_rw+0x4d0/0x4d0
[   26.751454]  ? mm_access+0xe1/0x120
[   26.756495]  process_vm_rw+0x2fd/0x4d0
[   26.762431]  ? __ia32_sys_process_vm_writev+0x80/0x80
[   26.769780]  ? preempt_count_sub+0xf/0xc0
[   26.775021]  ? folio_add_lru+0xea/0x110
[   26.780260]  ? preempt_count_sub+0xf/0xc0
[   26.786062]  ? _raw_spin_unlock+0x2e/0x50
[   26.791676]  ? __handle_mm_fault+0x14a7/0x1970
[   26.797550]  ? handle_mm_fault+0x1d0/0x1d0
[   26.802981]  ? up_read+0x6f/0x180
[   26.807430]  ? down_read_trylock+0x13f/0x190
[   26.813252]  ? down_write_trylock+0x130/0x130
[   26.818935]  ? handle_mm_fault+0x160/0x1d0
[   26.824454]  ? do_kern_addr_fault+0x130/0x130
[   26.830695]  __x64_sys_process_vm_writev+0x71/0x80
[   26.837270]  do_syscall_64+0x43/0x90
[   26.842134]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   26.848994] RIP: 0033:0x44c849
[   26.853311] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[   26.877911] RSP: 002b:00007f27c16d6168 EFLAGS: 00000216 ORIG_RAX: 0000000000000137
[   26.887953] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
[   26.897571] RDX: 0000000000000001 RSI: 0000000020c22000 RDI: 0000000000000076
[   26.907068] RBP: 00007f27c16d61a0 R08: 0000000000000001 R09: 0000000000000000
[   26.916433] R10: 0000000020c22fa0 R11: 0000000000000216 R12: 00007ffd3062431e
[   26.927113] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
[   26.936755]  </TASK>
[   26.939785] ---[ end trace 42b5bb79157828eb ]---
[   27.160243] ------------[ cut here ]------------
[   27.166572] kernel BUG at fs/ext4/inode.c:2620!
[   27.173362] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[   27.180459] CPU: 1 PID: 1616 Comm: repro Tainted: G        W         5.16.0+ #169
[   27.190304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   27.201112] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
[   27.208692] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
[   27.232612] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
[   27.239348] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
[   27.248720] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
[   27.257899] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
[   27.266846] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
[   27.276050] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
[   27.285119] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
[   27.295466] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   27.302935] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
[   27.312837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   27.322697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   27.332451] Call Trace:
[   27.336060]  <TASK>
[   27.339164]  ? ext4_iomap_swap_activate+0x10/0x10
[   27.346171]  ? preempt_count_sub+0xf/0xc0
[   27.352456]  ? page_writeback_cpu_online+0x1f0/0x1f0
[   27.359598]  ? ext4_init_io_end+0x18/0x90
[   27.365427]  ? kmem_cache_alloc+0xf2/0x200
[   27.371638]  ext4_writepages+0x823/0x1c50
[   27.377047]  ? kernel_text_address+0xa8/0xc0
[   27.382587]  ? unwind_get_return_address+0x25/0x40
[   27.388649]  ? __rcu_read_unlock+0x8d/0x320
[   27.394134]  ? __rcu_read_lock+0x20/0x20
[   27.399094]  ? preempt_count_sub+0xf/0xc0
[   27.404220]  ? ext4_readpage+0x110/0x110
[   27.409494]  ? stack_trace_save+0x120/0x120
[   27.414838]  ? __is_insn_slot_addr+0x58/0x60
[   27.420186]  ? kernel_text_address+0xa8/0xc0
[   27.425531]  ? __kernel_text_address+0x9/0x40
[   27.431355]  ? unwind_get_return_address+0x25/0x40
[   27.437415]  ? stack_trace_save+0xdb/0x120
[   27.442722]  ? stack_trace_snprint+0xc0/0xc0
[   27.448310]  do_writepages+0x20b/0x3a0
[   27.453245]  ? __kasan_slab_alloc+0x43/0xb0
[   27.458532]  ? filter_irq_stacks+0x3d/0x80
[   27.463792]  ? __writepage+0xb0/0xb0
[   27.468366]  ? __iomap_dio_rw+0x1c2/0xec0
[   27.473579]  ? iomap_dio_rw+0x5/0x30
[   27.478152]  ? ext4_file_write_iter+0x8a8/0xde0
[   27.484091]  ? do_iter_readv_writev+0x2ce/0x360
[   27.490002]  ? do_iter_write+0x109/0x370
[   27.495400]  ? iter_file_splice_write+0x4b6/0x770
[   27.501658]  ? direct_splice_actor+0x7b/0x90
[   27.507225]  ? splice_direct_to_actor+0x309/0x570
[   27.513487]  ? do_splice_direct+0x172/0x230
[   27.519352]  ? do_sendfile+0x567/0x960
[   27.524605]  ? __x64_sys_sendfile64+0x104/0x150
[   27.531035]  ? do_syscall_64+0x43/0x90
[   27.536259]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.543606]  ? do_syscall_64+0x43/0x90
[   27.548942]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.556281]  ? _raw_spin_lock+0x120/0x120
[   27.561984]  ? __ext4_handle_dirty_metadata+0x22d/0x510
[   27.569186]  filemap_write_and_wait_range+0x200/0x230
[   27.576185]  ? filemap_range_needs_writeback+0x400/0x400
[   27.583632]  ? ext4_mark_iloc_dirty+0x66c/0x6b0
[   27.589986]  ? kmem_cache_alloc_trace+0xe7/0x230
[   27.596373]  ? __iomap_dio_rw+0x1c2/0xec0
[   27.601967]  __iomap_dio_rw+0x525/0xec0
[   27.609616]  ? jbd2_journal_stop+0x481/0x5b0
[   27.615606]  ? iomap_dio_complete+0x2a0/0x2a0
[   27.622054]  ? generic_update_time+0xde/0x130
[   27.628225]  ? __mnt_drop_write_file+0xd/0x60
[   27.634306]  ? file_update_time+0x1cd/0x210
[   27.640161]  ? kernel_text_address+0xa8/0xc0
[   27.646114]  ? file_remove_privs+0x2b0/0x2b0
[   27.652278]  iomap_dio_rw+0x5/0x30
[   27.657149]  ext4_file_write_iter+0x8a8/0xde0
[   27.663323]  ? ext4_file_read_iter+0x1e0/0x1e0
[   27.669800]  ? ____kasan_kmalloc+0xd1/0xf0
[   27.675676]  ? direct_splice_actor+0x7b/0x90
[   27.681778]  ? splice_direct_to_actor+0x309/0x570
[   27.688357]  ? do_splice_direct+0x172/0x230
[   27.694256]  ? do_sendfile+0x567/0x960
[   27.699605]  ? __x64_sys_sendfile64+0x104/0x150
[   27.706129]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.713462]  do_iter_readv_writev+0x2ce/0x360
[   27.719894]  ? generic_file_rw_checks+0xd0/0xd0
[   27.726446]  ? memcpy+0x3c/0x60
[   27.731234]  ? security_file_permission+0x47/0x270
[   27.738026]  do_iter_write+0x109/0x370
[   27.743417]  iter_file_splice_write+0x4b6/0x770
[   27.749824]  ? splice_from_pipe+0x170/0x170
[   27.755630]  ? generic_file_splice_read+0x2d0/0x380
[   27.765041]  ? splice_shrink_spd+0x40/0x40
[   27.770668]  ? is_mmconf_reserved+0x240/0x240
[   27.776487]  ? kcalloc+0x1b/0x20
[   27.780737]  ? splice_from_pipe+0x170/0x170
[   27.786278]  direct_splice_actor+0x7b/0x90
[   27.791885]  splice_direct_to_actor+0x309/0x570
[   27.797680]  ? do_splice_direct+0x230/0x230
[   27.803067]  ? pipe_to_sendpage+0x1b0/0x1b0
[   27.808395]  ? security_file_permission+0x47/0x270
[   27.814761]  do_splice_direct+0x172/0x230
[   27.819842]  ? splice_direct_to_actor+0x570/0x570
[   27.825936]  ? security_file_permission+0x47/0x270
[   27.832354]  do_sendfile+0x567/0x960
[   27.837426]  ? do_pwritev+0x3d0/0x3d0
[   27.842258]  ? __se_sys_futex+0x1b1/0x2c0
[   27.847619]  ? restore_fpregs_from_fpstate+0xc4/0x190
[   27.854277]  __x64_sys_sendfile64+0x104/0x150
[   27.859919]  ? __ia32_sys_sendfile+0x170/0x170
[   27.865622]  ? switch_fpu_return+0x97/0x120
[   27.871204]  do_syscall_64+0x43/0x90
[   27.875661]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.882673] RIP: 0033:0x44c849
[   27.887077] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[   27.911882] RSP: 002b:00007f27c16d6178 EFLAGS: 00000203 ORIG_RAX: 0000000000000028
[   27.921306] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
[   27.930587] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000003
[   27.939877] RBP: 00007f27c16d61a0 R08: 0000000000000000 R09: 0000000000000000
[   27.949975] R10: 0000000080000005 R11: 0000000000000203 R12: 00007ffd3062431e
[   27.959298] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
[   27.968616]  </TASK>
[   27.971668] Modules linked in:
[   27.975922] ---[ end trace 42b5bb79157828ec ]---
[   27.982710] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
[   27.990558] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
[   28.015009] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
[   28.021763] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
[   28.031325] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
[   28.040622] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
[   28.050140] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
[   28.059313] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
[   28.068502] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
[   28.079454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.087279] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
[   28.096763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.105974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[2]
// https://syzkaller.appspot.com/bug?id=906354c4596539d9561ee6cb6d8c54cda38fc3c2
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <arpa/inet.h>
#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#include <linux/capability.h>
#include <linux/futex.h>
#include <linux/genetlink.h>
#include <linux/if_addr.h>
#include <linux/if_ether.h>
#include <linux/if_link.h>
#include <linux/if_tun.h>
#include <linux/in6.h>
#include <linux/ip.h>
#include <linux/neighbour.h>
#include <linux/net.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/tcp.h>
#include <linux/veth.h>

static void sleep_ms(uint64_t ms)
{
  usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts))
    exit(1);
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void use_temporary_dir(void)
{
  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir)
    exit(1);
  if (chmod(tmpdir, 0777))
    exit(1);
  if (chdir(tmpdir))
    exit(1);
}

static void thread_start(void* (*fn)(void*), void* arg)
{
  pthread_t th;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 128 << 10);
  int i = 0;
  for (; i < 100; i++) {
    if (pthread_create(&th, &attr, fn, arg) == 0) {
      pthread_attr_destroy(&attr);
      return;
    }
    if (errno == EAGAIN) {
      usleep(50);
      continue;
    }
    break;
  }
  exit(1);
}

typedef struct {
  int state;
} event_t;

static void event_init(event_t* ev)
{
  ev->state = 0;
}

static void event_reset(event_t* ev)
{
  ev->state = 0;
}

static void event_set(event_t* ev)
{
  if (ev->state)
    exit(1);
  __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
  syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
}

static void event_wait(event_t* ev)
{
  while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
}

static int event_isset(event_t* ev)
{
  return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
}

static int event_timedwait(event_t* ev, uint64_t timeout)
{
  uint64_t start = current_time_ms();
  uint64_t now = start;
  for (;;) {
    uint64_t remain = timeout - (now - start);
    struct timespec ts;
    ts.tv_sec = remain / 1000;
    ts.tv_nsec = (remain % 1000) * 1000 * 1000;
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
    if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
      return 1;
    now = current_time_ms();
    if (now - start > timeout)
      return 0;
  }
}

#define IFLA_IPVLAN_FLAGS 2
#define IPVLAN_MODE_L3S 2
#undef IPVLAN_F_VEPA
#define IPVLAN_F_VEPA 2

#define TUN_IFACE "syz_tun"
#define LOCAL_MAC 0xaaaaaaaaaaaa
#define REMOTE_MAC 0xaaaaaaaaaabb
#define LOCAL_IPV4 "172.20.20.170"
#define REMOTE_IPV4 "172.20.20.187"
#define LOCAL_IPV6 "fe80::aa"
#define REMOTE_IPV6 "fe80::bb"

#define IFF_NAPI 0x0010

#define DEVLINK_FAMILY_NAME "devlink"

#define DEVLINK_CMD_PORT_GET 5
#define DEVLINK_ATTR_BUS_NAME 1
#define DEVLINK_ATTR_DEV_NAME 2
#define DEVLINK_ATTR_NETDEV_NAME 7

#define DEV_IPV4 "172.20.20.%d"
#define DEV_IPV6 "fe80::%02x"
#define DEV_MAC 0x00aaaaaaaaaa

#define WG_GENL_NAME "wireguard"
enum wg_cmd {
  WG_CMD_GET_DEVICE,
  WG_CMD_SET_DEVICE,
};
enum wgdevice_attribute {
  WGDEVICE_A_UNSPEC,
  WGDEVICE_A_IFINDEX,
  WGDEVICE_A_IFNAME,
  WGDEVICE_A_PRIVATE_KEY,
  WGDEVICE_A_PUBLIC_KEY,
  WGDEVICE_A_FLAGS,
  WGDEVICE_A_LISTEN_PORT,
  WGDEVICE_A_FWMARK,
  WGDEVICE_A_PEERS,
};
enum wgpeer_attribute {
  WGPEER_A_UNSPEC,
  WGPEER_A_PUBLIC_KEY,
  WGPEER_A_PRESHARED_KEY,
  WGPEER_A_FLAGS,
  WGPEER_A_ENDPOINT,
  WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL,
  WGPEER_A_LAST_HANDSHAKE_TIME,
  WGPEER_A_RX_BYTES,
  WGPEER_A_TX_BYTES,
  WGPEER_A_ALLOWEDIPS,
  WGPEER_A_PROTOCOL_VERSION,
};
enum wgallowedip_attribute {
  WGALLOWEDIP_A_UNSPEC,
  WGALLOWEDIP_A_FAMILY,
  WGALLOWEDIP_A_IPADDR,
  WGALLOWEDIP_A_CIDR_MASK,
};

#define MAX_FDS 30

#define XT_TABLE_SIZE 1536
#define XT_MAX_ENTRIES 10

struct xt_counters {
  uint64_t pcnt, bcnt;
};

struct ipt_getinfo {
  char name[32];
  unsigned int valid_hooks;
  unsigned int hook_entry[5];
  unsigned int underflow[5];
  unsigned int num_entries;
  unsigned int size;
};

struct ipt_get_entries {
  char name[32];
  unsigned int size;
  uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
};

struct ipt_replace {
  char name[32];
  unsigned int valid_hooks;
  unsigned int num_entries;
  unsigned int size;
  unsigned int hook_entry[5];
  unsigned int underflow[5];
  unsigned int num_counters;
  struct xt_counters* counters;
  uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
};

struct ipt_table_desc {
  const char* name;
  struct ipt_getinfo info;
  struct ipt_replace replace;
};

#define IPT_BASE_CTL 64
#define IPT_SO_SET_REPLACE (IPT_BASE_CTL)
#define IPT_SO_GET_INFO (IPT_BASE_CTL)
#define IPT_SO_GET_ENTRIES (IPT_BASE_CTL + 1)

struct arpt_getinfo {
  char name[32];
  unsigned int valid_hooks;
  unsigned int hook_entry[3];
  unsigned int underflow[3];
  unsigned int num_entries;
  unsigned int size;
};

struct arpt_get_entries {
  char name[32];
  unsigned int size;
  uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
};

struct arpt_replace {
  char name[32];
  unsigned int valid_hooks;
  unsigned int num_entries;
  unsigned int size;
  unsigned int hook_entry[3];
  unsigned int underflow[3];
  unsigned int num_counters;
  struct xt_counters* counters;
  uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
};

#define ARPT_BASE_CTL 96
#define ARPT_SO_SET_REPLACE (ARPT_BASE_CTL)
#define ARPT_SO_GET_INFO (ARPT_BASE_CTL)
#define ARPT_SO_GET_ENTRIES (ARPT_BASE_CTL + 1)

static void loop();

static int wait_for_loop(int pid)
{
  if (pid < 0)
    exit(1);
  int status = 0;
  while (waitpid(-1, &status, __WALL) != pid) {
  }
  return WEXITSTATUS(status);
}

static int do_sandbox_none(void)
{
  if (unshare(CLONE_NEWPID)) {
  }
  int pid = fork();
  if (pid != 0)
    return wait_for_loop(pid);

  if (unshare(CLONE_NEWNET)) {
  }
  loop();
  exit(1);
}

#define FS_IOC_SETFLAGS _IOW('f', 2, long)
static void remove_dir(const char* dir)
{
  int iter = 0;
  DIR* dp = 0;
retry:
  while (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
  }
  dp = opendir(dir);
  if (dp == NULL) {
    if (errno == EMFILE) {
      exit(1);
    }
    exit(1);
  }
  struct dirent* ep = 0;
  while ((ep = readdir(dp))) {
    if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0)
      continue;
    char filename[FILENAME_MAX];
    snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
    while (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
    }
    struct stat st;
    if (lstat(filename, &st))
      exit(1);
    if (S_ISDIR(st.st_mode)) {
      remove_dir(filename);
      continue;
    }
    int i;
    for (i = 0;; i++) {
      if (unlink(filename) == 0)
        break;
      if (errno == EPERM) {
        int fd = open(filename, O_RDONLY);
        if (fd != -1) {
          long flags = 0;
          if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
          }
          close(fd);
          continue;
        }
      }
      if (errno == EROFS) {
        break;
      }
      if (errno != EBUSY || i > 100)
        exit(1);
      if (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW))
        exit(1);
    }
  }
  closedir(dp);
  for (int i = 0;; i++) {
    if (rmdir(dir) == 0)
      break;
    if (i < 100) {
      if (errno == EPERM) {
        int fd = open(dir, O_RDONLY);
        if (fd != -1) {
          long flags = 0;
          if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
          }
          close(fd);
          continue;
        }
      }
      if (errno == EROFS) {
        break;
      }
      if (errno == EBUSY) {
        if (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW))
          exit(1);
        continue;
      }
      if (errno == ENOTEMPTY) {
        if (iter < 100) {
          iter++;
          goto retry;
        }
      }
    }
    exit(1);
  }
}

static void kill_and_wait(int pid, int* status)
{
  kill(-pid, SIGKILL);
  kill(pid, SIGKILL);
  for (int i = 0; i < 100; i++) {
    if (waitpid(-1, status, WNOHANG | __WALL) == pid)
      return;
    usleep(1000);
  }
  DIR* dir = opendir("/sys/fs/fuse/connections");
  if (dir) {
    for (;;) {
      struct dirent* ent = readdir(dir);
      if (!ent)
        break;
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      char abort[300];
      snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
               ent->d_name);
      int fd = open(abort, O_WRONLY);
      if (fd == -1) {
        continue;
      }
      if (write(fd, abort, 1) < 0) {
      }
      close(fd);
    }
    closedir(dir);
  } else {
  }
  while (waitpid(-1, status, __WALL) != pid) {
  }
}

static void close_fds()
{
  for (int fd = 3; fd < MAX_FDS; fd++)
    close(fd);
}

struct thread_t {
  int created, call;
  event_t ready, done;
};

static struct thread_t threads[16];
static void execute_call(int call);
static int running;

static void* thr(void* arg)
{
  struct thread_t* th = (struct thread_t*)arg;
  for (;;) {
    event_wait(&th->ready);
    event_reset(&th->ready);
    execute_call(th->call);
    __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
    event_set(&th->done);
  }
  return 0;
}

static void execute_one(void)
{
  int i, call, thread;
  for (call = 0; call < 10; call++) {
    for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
         thread++) {
      struct thread_t* th = &threads[thread];
      if (!th->created) {
        th->created = 1;
        event_init(&th->ready);
        event_init(&th->done);
        event_set(&th->done);
        thread_start(thr, th);
      }
      if (!event_isset(&th->done))
        continue;
      event_reset(&th->done);
      th->call = call;
      __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
      event_set(&th->ready);
      event_timedwait(&th->done, 50);
      break;
    }
  }
  for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
    sleep_ms(1);
  close_fds();
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
  int iter = 0;
  for (;; iter++) {
    char cwdbuf[32];
    sprintf(cwdbuf, "./%d", iter);
    if (mkdir(cwdbuf, 0777))
      exit(1);
    int pid = fork();
    if (pid < 0)
	    exit(1);
    if (pid == 0) {
      if (chdir(cwdbuf))
        exit(1);
      execute_one();
      exit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
       if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid) {
	 break;
       }
      sleep_ms(1);
      if (current_time_ms() - start < 5000)
        continue;
      kill_and_wait(pid, &status);
      break;
    }
    remove_dir(cwdbuf);
  }
}

uint64_t r[5] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff,
                 0xffffffffffffffff, 0x0};

void execute_call(int call)
{
  intptr_t res = 0;
  switch (call) {
  case 0:
    memcpy((void*)0x20000080, "./bus\000", 6);
    res = syscall(__NR_open, 0x20000080ul, 0x14d842ul, 0ul);
    if (res != -1)
      r[0] = res;
    break;
  case 1:
    memcpy((void*)0x20000000, "/proc/self/exe\000", 15);
    res = syscall(__NR_openat, 0xffffff9c, 0x20000000ul, 0ul, 0ul);
    if (res != -1)
      r[1] = res;
    break;
  case 2:
    memcpy((void*)0x20002000, "./bus\000", 6);
    res = syscall(__NR_open, 0x20002000ul, 0x143042ul, 0ul);
    if (res != -1)
      r[2] = res;
    break;
  case 3:
    syscall(__NR_ftruncate, r[2], 0x2008002ul);
    break;
  case 4:
    memcpy((void*)0x20000400, "./bus\000", 6);
    res = syscall(__NR_open, 0x20000400ul, 0x14103eul, 0ul);
    if (res != -1)
      r[3] = res;
    break;
  case 5:
    syscall(__NR_mmap, 0x20000000ul, 0x600000ul, 0x7ffffeul, 0x11ul, r[3], 0ul);
    break;
  case 6:
    syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
    break;
  case 7:
    res = syscall(__NR_gettid);
    if (res != -1)
      r[4] = res;
    break;
  case 8:
    *(uint64_t*)0x20c22000 = 0x2034afa4;
    *(uint64_t*)0x20c22008 = 0x1f80;
    *(uint64_t*)0x20c22fa0 = 0x20000080;
    *(uint64_t*)0x20c22fa8 = 0x2034afa5;
    syscall(__NR_process_vm_writev, r[4], 0x20c22000ul, 1ul, 0x20c22fa0ul, 1ul,
            0ul);
    break;
  case 9:
    syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
    break;
  }
}
int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);

  use_temporary_dir();
  do_sandbox_none();

  return 0;
}

-- 
Lee Jones [李琼斯]
Principal Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
@ 2022-02-18  1:06 ` John Hubbard
  2022-02-18  4:08   ` Theodore Ts'o
  2022-02-18  2:54 ` Theodore Ts'o
  2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
  2 siblings, 1 reply; 29+ messages in thread
From: John Hubbard @ 2022-02-18  1:06 UTC (permalink / raw)
  To: Lee Jones, linux-ext4
  Cc: Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Theodore Ts'o, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/16/22 08:31, Lee Jones wrote:
> Good afternoon,
...
> I managed to seemingly bisect the issue down to commit:
> 
>    60263d5889e6d ("iomap: fall back to buffered writes for invalidation failures")
> 
> Although it appears to be the belief of the Filesystem community that
> this is likely not the cause of the issue and should therefore not be
> reverted.

They are correct in this matter, imho. :)

...
> Darrick seems to suggest that:
> 
>    "The BUG report came from page_buffers failing to find any buffer heads
>     attached to the page."

Yes. And looking at the pair of backtraces below, this looks very much
like another aspect of the "get_user_pages problem" [1], originally
described in Jan Kara's 2018 email [2].

I'm getting close to posting an RFC for the direct IO conversion to
FOLL_PIN, but even after that, various parts of the kernel (reclaim,
filesystems/block layer) still need to be changed so as to use
page_maybe_dma_pinned() to help avoid this problem. There's a bit
more than that, actually.


[1] https://lwn.net/Articles/753027/

[2] https://www.spinics.net/lists/linux-mm/msg142700.html


thanks,
-- 
John Hubbard
NVIDIA

> 
> If the reproducer, also massively stripped down from the original
> report, would be of any use to you, it can be found further down at
> [2].
> 
> I don't how true this is, but it is my current belief that user-space
> should not be able to force the kernel to BUG.  This seems to be a
> temporary DoS issue.  So although not a critically serious security
> problem involved memory leakage or data corruption, it could
> potentially cause a nuisance if not rectified.
> 
> Any well meaning help with this would be gratefully received.
> 
> Kind regards,
> Lee
> 
> [0] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsyzkaller.appspot.com%2Fbug%3Fextid%3D41c966bf0729a530bd8d&amp;data=04%7C01%7Cjhubbard%40nvidia.com%7C107e857de4b940fbe7e708d9f169d4a8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637806259852035011%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=5oD2%2BB9iOOSqSh3xwLLzR1vFxqyMtYNivJVQmepj2ww%3D&amp;reserved=0
> 
> [1]
> [   15.200920] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   15.215877] File: /syzkaller.IsS3Yc/0/bus PID: 1497 Comm: repro
> [   16.718970] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   16.734250] File: /syzkaller.IsS3Yc/5/bus PID: 1512 Comm: repro
> [   17.013871] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   17.028193] File: /syzkaller.IsS3Yc/6/bus PID: 1515 Comm: repro
> [   17.320498] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   17.336115] File: /syzkaller.IsS3Yc/7/bus PID: 1518 Comm: repro
> [   17.617921] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   17.633063] File: /syzkaller.IsS3Yc/8/bus PID: 1521 Comm: repro
> [   18.527260] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   18.544236] File: /syzkaller.IsS3Yc/11/bus PID: 1530 Comm: repro
> [   18.810347] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   18.824721] File: /syzkaller.IsS3Yc/12/bus PID: 1533 Comm: repro
> [   19.099315] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   19.114151] File: /syzkaller.IsS3Yc/13/bus PID: 1536 Comm: repro
> [   19.403882] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   19.418467] File: /syzkaller.IsS3Yc/14/bus PID: 1539 Comm: repro
> [   19.703934] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   19.718400] File: /syzkaller.IsS3Yc/15/bus PID: 1542 Comm: repro
> [   26.533129] ------------[ cut here ]------------
> [   26.540473] WARNING: CPU: 1 PID: 1612 at fs/ext4/inode.c:3576 ext4_set_page_dirty+0xaf/0xc0
> [   26.553171] Modules linked in:
> [   26.557354] CPU: 1 PID: 1612 Comm: repro Not tainted 5.16.0+ #169
> [   26.565238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [   26.576182] RIP: 0010:ext4_set_page_dirty+0xaf/0xc0
> [   26.583077] Code: 4c 89 ff e8 e3 86 e7 ff 49 f7 07 00 20 00 00 74 19 4c 89 ff 5b 41 5e 41 5f e9 8d 05 f0 ff 48 83 c0 ff 48 89 c3 e9 76 ff ff ff <0f> 0b eb e3 48 83 c0 ff 48 89 c3 eb 9e 0f 0b eb b8 55 48 89 e5 41
> [   26.607402] RSP: 0018:ffff88810f4ffa10 EFLAGS: 00010246
> [   26.614646] RAX: ffffea00043bc687 RBX: ffffea00043bc680 RCX: ffffffff9913f86d
> [   26.625115] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043bc680
> [   26.635137] RBP: 0000000000000400 R08: dffffc0000000000 R09: fffff940008778d1
> [   26.644923] R10: fffff940008778d1 R11: 0000000000000000 R12: ffff88810e14c000
> [   26.654807] R13: ffffea00043bc680 R14: ffffea00043bc688 R15: ffffea00043bc680
> [   26.664812] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [   26.676238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   26.684212] CR2: 000000000049b3a8 CR3: 000000010f7a6005 CR4: 0000000000370ee0
> [   26.693896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   26.703778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   26.714238] Call Trace:
> [   26.717987]  <TASK>
> [   26.721105]  folio_mark_dirty+0x72/0xa0
> [   26.726455]  set_page_dirty_lock+0x4a/0x70
> [   26.732426]  unpin_user_pages_dirty_lock+0x101/0x1d0
> [   26.739369]  process_vm_rw_single_vec+0x2f4/0x3c0
> [   26.745707]  ? process_vm_rw+0x4d0/0x4d0
> [   26.751454]  ? mm_access+0xe1/0x120
> [   26.756495]  process_vm_rw+0x2fd/0x4d0
> [   26.762431]  ? __ia32_sys_process_vm_writev+0x80/0x80
> [   26.769780]  ? preempt_count_sub+0xf/0xc0
> [   26.775021]  ? folio_add_lru+0xea/0x110
> [   26.780260]  ? preempt_count_sub+0xf/0xc0
> [   26.786062]  ? _raw_spin_unlock+0x2e/0x50
> [   26.791676]  ? __handle_mm_fault+0x14a7/0x1970
> [   26.797550]  ? handle_mm_fault+0x1d0/0x1d0
> [   26.802981]  ? up_read+0x6f/0x180
> [   26.807430]  ? down_read_trylock+0x13f/0x190
> [   26.813252]  ? down_write_trylock+0x130/0x130
> [   26.818935]  ? handle_mm_fault+0x160/0x1d0
> [   26.824454]  ? do_kern_addr_fault+0x130/0x130
> [   26.830695]  __x64_sys_process_vm_writev+0x71/0x80
> [   26.837270]  do_syscall_64+0x43/0x90
> [   26.842134]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   26.848994] RIP: 0033:0x44c849
> [   26.853311] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [   26.877911] RSP: 002b:00007f27c16d6168 EFLAGS: 00000216 ORIG_RAX: 0000000000000137
> [   26.887953] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
> [   26.897571] RDX: 0000000000000001 RSI: 0000000020c22000 RDI: 0000000000000076
> [   26.907068] RBP: 00007f27c16d61a0 R08: 0000000000000001 R09: 0000000000000000
> [   26.916433] R10: 0000000020c22fa0 R11: 0000000000000216 R12: 00007ffd3062431e
> [   26.927113] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
> [   26.936755]  </TASK>
> [   26.939785] ---[ end trace 42b5bb79157828eb ]---
> [   27.160243] ------------[ cut here ]------------
> [   27.166572] kernel BUG at fs/ext4/inode.c:2620!
> [   27.173362] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
> [   27.180459] CPU: 1 PID: 1616 Comm: repro Tainted: G        W         5.16.0+ #169
> [   27.190304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [   27.201112] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
> [   27.208692] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
> [   27.232612] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
> [   27.239348] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
> [   27.248720] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
> [   27.257899] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
> [   27.266846] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
> [   27.276050] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
> [   27.285119] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [   27.295466] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   27.302935] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
> [   27.312837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   27.322697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   27.332451] Call Trace:
> [   27.336060]  <TASK>
> [   27.339164]  ? ext4_iomap_swap_activate+0x10/0x10
> [   27.346171]  ? preempt_count_sub+0xf/0xc0
> [   27.352456]  ? page_writeback_cpu_online+0x1f0/0x1f0
> [   27.359598]  ? ext4_init_io_end+0x18/0x90
> [   27.365427]  ? kmem_cache_alloc+0xf2/0x200
> [   27.371638]  ext4_writepages+0x823/0x1c50
> [   27.377047]  ? kernel_text_address+0xa8/0xc0
> [   27.382587]  ? unwind_get_return_address+0x25/0x40
> [   27.388649]  ? __rcu_read_unlock+0x8d/0x320
> [   27.394134]  ? __rcu_read_lock+0x20/0x20
> [   27.399094]  ? preempt_count_sub+0xf/0xc0
> [   27.404220]  ? ext4_readpage+0x110/0x110
> [   27.409494]  ? stack_trace_save+0x120/0x120
> [   27.414838]  ? __is_insn_slot_addr+0x58/0x60
> [   27.420186]  ? kernel_text_address+0xa8/0xc0
> [   27.425531]  ? __kernel_text_address+0x9/0x40
> [   27.431355]  ? unwind_get_return_address+0x25/0x40
> [   27.437415]  ? stack_trace_save+0xdb/0x120
> [   27.442722]  ? stack_trace_snprint+0xc0/0xc0
> [   27.448310]  do_writepages+0x20b/0x3a0
> [   27.453245]  ? __kasan_slab_alloc+0x43/0xb0
> [   27.458532]  ? filter_irq_stacks+0x3d/0x80
> [   27.463792]  ? __writepage+0xb0/0xb0
> [   27.468366]  ? __iomap_dio_rw+0x1c2/0xec0
> [   27.473579]  ? iomap_dio_rw+0x5/0x30
> [   27.478152]  ? ext4_file_write_iter+0x8a8/0xde0
> [   27.484091]  ? do_iter_readv_writev+0x2ce/0x360
> [   27.490002]  ? do_iter_write+0x109/0x370
> [   27.495400]  ? iter_file_splice_write+0x4b6/0x770
> [   27.501658]  ? direct_splice_actor+0x7b/0x90
> [   27.507225]  ? splice_direct_to_actor+0x309/0x570
> [   27.513487]  ? do_splice_direct+0x172/0x230
> [   27.519352]  ? do_sendfile+0x567/0x960
> [   27.524605]  ? __x64_sys_sendfile64+0x104/0x150
> [   27.531035]  ? do_syscall_64+0x43/0x90
> [   27.536259]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   27.543606]  ? do_syscall_64+0x43/0x90
> [   27.548942]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   27.556281]  ? _raw_spin_lock+0x120/0x120
> [   27.561984]  ? __ext4_handle_dirty_metadata+0x22d/0x510
> [   27.569186]  filemap_write_and_wait_range+0x200/0x230
> [   27.576185]  ? filemap_range_needs_writeback+0x400/0x400
> [   27.583632]  ? ext4_mark_iloc_dirty+0x66c/0x6b0
> [   27.589986]  ? kmem_cache_alloc_trace+0xe7/0x230
> [   27.596373]  ? __iomap_dio_rw+0x1c2/0xec0
> [   27.601967]  __iomap_dio_rw+0x525/0xec0
> [   27.609616]  ? jbd2_journal_stop+0x481/0x5b0
> [   27.615606]  ? iomap_dio_complete+0x2a0/0x2a0
> [   27.622054]  ? generic_update_time+0xde/0x130
> [   27.628225]  ? __mnt_drop_write_file+0xd/0x60
> [   27.634306]  ? file_update_time+0x1cd/0x210
> [   27.640161]  ? kernel_text_address+0xa8/0xc0
> [   27.646114]  ? file_remove_privs+0x2b0/0x2b0
> [   27.652278]  iomap_dio_rw+0x5/0x30
> [   27.657149]  ext4_file_write_iter+0x8a8/0xde0
> [   27.663323]  ? ext4_file_read_iter+0x1e0/0x1e0
> [   27.669800]  ? ____kasan_kmalloc+0xd1/0xf0
> [   27.675676]  ? direct_splice_actor+0x7b/0x90
> [   27.681778]  ? splice_direct_to_actor+0x309/0x570
> [   27.688357]  ? do_splice_direct+0x172/0x230
> [   27.694256]  ? do_sendfile+0x567/0x960
> [   27.699605]  ? __x64_sys_sendfile64+0x104/0x150
> [   27.706129]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   27.713462]  do_iter_readv_writev+0x2ce/0x360
> [   27.719894]  ? generic_file_rw_checks+0xd0/0xd0
> [   27.726446]  ? memcpy+0x3c/0x60
> [   27.731234]  ? security_file_permission+0x47/0x270
> [   27.738026]  do_iter_write+0x109/0x370
> [   27.743417]  iter_file_splice_write+0x4b6/0x770
> [   27.749824]  ? splice_from_pipe+0x170/0x170
> [   27.755630]  ? generic_file_splice_read+0x2d0/0x380
> [   27.765041]  ? splice_shrink_spd+0x40/0x40
> [   27.770668]  ? is_mmconf_reserved+0x240/0x240
> [   27.776487]  ? kcalloc+0x1b/0x20
> [   27.780737]  ? splice_from_pipe+0x170/0x170
> [   27.786278]  direct_splice_actor+0x7b/0x90
> [   27.791885]  splice_direct_to_actor+0x309/0x570
> [   27.797680]  ? do_splice_direct+0x230/0x230
> [   27.803067]  ? pipe_to_sendpage+0x1b0/0x1b0
> [   27.808395]  ? security_file_permission+0x47/0x270
> [   27.814761]  do_splice_direct+0x172/0x230
> [   27.819842]  ? splice_direct_to_actor+0x570/0x570
> [   27.825936]  ? security_file_permission+0x47/0x270
> [   27.832354]  do_sendfile+0x567/0x960
> [   27.837426]  ? do_pwritev+0x3d0/0x3d0
> [   27.842258]  ? __se_sys_futex+0x1b1/0x2c0
> [   27.847619]  ? restore_fpregs_from_fpstate+0xc4/0x190
> [   27.854277]  __x64_sys_sendfile64+0x104/0x150
> [   27.859919]  ? __ia32_sys_sendfile+0x170/0x170
> [   27.865622]  ? switch_fpu_return+0x97/0x120
> [   27.871204]  do_syscall_64+0x43/0x90
> [   27.875661]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   27.882673] RIP: 0033:0x44c849
> [   27.887077] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [   27.911882] RSP: 002b:00007f27c16d6178 EFLAGS: 00000203 ORIG_RAX: 0000000000000028
> [   27.921306] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000044c849
> [   27.930587] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000003
> [   27.939877] RBP: 00007f27c16d61a0 R08: 0000000000000000 R09: 0000000000000000
> [   27.949975] R10: 0000000080000005 R11: 0000000000000203 R12: 00007ffd3062431e
> [   27.959298] R13: 00007ffd3062431f R14: 0000000000000000 R15: 00007f27c16d6640
> [   27.968616]  </TASK>
> [   27.971668] Modules linked in:
> [   27.975922] ---[ end trace 42b5bb79157828ec ]---
> [   27.982710] RIP: 0010:mpage_prepare_extent_to_map+0x573/0x580
> [   27.990558] Code: 08 14 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 3b 84 24 40 01 00 00 75 15 89 d8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b e8 04 39 15 01 0f 1f 40 00 55 48 89 e5 41 57 41 56 41
> [   28.015009] RSP: 0018:ffff88810f7e6b60 EFLAGS: 00010246
> [   28.021763] RAX: ffffea00043b4fc7 RBX: 0000000000000067 RCX: ffffffff9913ea61
> [   28.031325] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffea00043b4fc0
> [   28.040622] RBP: ffff88810f7e6cf0 R08: dffffc0000000000 R09: fffff940008769f9
> [   28.050140] R10: fffff940008769f9 R11: 0000000000000000 R12: 0000000000000000
> [   28.059313] R13: ffff88810f7e6be0 R14: ffffea00043b4fc0 R15: ffff88810f7e6f58
> [   28.068502] FS:  00007f27c16d6640(0000) GS:ffff8883ef440000(0000) knlGS:0000000000000000
> [   28.079454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   28.087279] CR2: 0000000020002002 CR3: 000000010cb70006 CR4: 0000000000370ee0
> [   28.096763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   28.105974] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 
> [2]
> // https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsyzkaller.appspot.com%2Fbug%3Fid%3D906354c4596539d9561ee6cb6d8c54cda38fc3c2&amp;data=04%7C01%7Cjhubbard%40nvidia.com%7C107e857de4b940fbe7e708d9f169d4a8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637806259852035011%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=WhaMEtC2yWJva1OO%2Fp0BIYRwADAlClWQPee7Trf9DpQ%3D&amp;reserved=0
> // autogenerated by syzkaller (https://github.com/google/syzkaller)
> 
> #define _GNU_SOURCE
> 
> #include <arpa/inet.h>
> #include <dirent.h>
> #include <endian.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <net/if.h>
> #include <net/if_arp.h>
> #include <netinet/in.h>
> #include <pthread.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdarg.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
> #include <sys/prctl.h>
> #include <sys/resource.h>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/time.h>
> #include <sys/types.h>
> #include <sys/uio.h>
> #include <sys/wait.h>
> #include <time.h>
> #include <unistd.h>
> 
> #include <linux/capability.h>
> #include <linux/futex.h>
> #include <linux/genetlink.h>
> #include <linux/if_addr.h>
> #include <linux/if_ether.h>
> #include <linux/if_link.h>
> #include <linux/if_tun.h>
> #include <linux/in6.h>
> #include <linux/ip.h>
> #include <linux/neighbour.h>
> #include <linux/net.h>
> #include <linux/netlink.h>
> #include <linux/rtnetlink.h>
> #include <linux/tcp.h>
> #include <linux/veth.h>
> 
> static void sleep_ms(uint64_t ms)
> {
>    usleep(ms * 1000);
> }
> 
> static uint64_t current_time_ms(void)
> {
>    struct timespec ts;
>    if (clock_gettime(CLOCK_MONOTONIC, &ts))
>      exit(1);
>    return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
> }
> 
> static void use_temporary_dir(void)
> {
>    char tmpdir_template[] = "./syzkaller.XXXXXX";
>    char* tmpdir = mkdtemp(tmpdir_template);
>    if (!tmpdir)
>      exit(1);
>    if (chmod(tmpdir, 0777))
>      exit(1);
>    if (chdir(tmpdir))
>      exit(1);
> }
> 
> static void thread_start(void* (*fn)(void*), void* arg)
> {
>    pthread_t th;
>    pthread_attr_t attr;
>    pthread_attr_init(&attr);
>    pthread_attr_setstacksize(&attr, 128 << 10);
>    int i = 0;
>    for (; i < 100; i++) {
>      if (pthread_create(&th, &attr, fn, arg) == 0) {
>        pthread_attr_destroy(&attr);
>        return;
>      }
>      if (errno == EAGAIN) {
>        usleep(50);
>        continue;
>      }
>      break;
>    }
>    exit(1);
> }
> 
> typedef struct {
>    int state;
> } event_t;
> 
> static void event_init(event_t* ev)
> {
>    ev->state = 0;
> }
> 
> static void event_reset(event_t* ev)
> {
>    ev->state = 0;
> }
> 
> static void event_set(event_t* ev)
> {
>    if (ev->state)
>      exit(1);
>    __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
>    syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
> }
> 
> static void event_wait(event_t* ev)
> {
>    while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
>      syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
> }
> 
> static int event_isset(event_t* ev)
> {
>    return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
> }
> 
> static int event_timedwait(event_t* ev, uint64_t timeout)
> {
>    uint64_t start = current_time_ms();
>    uint64_t now = start;
>    for (;;) {
>      uint64_t remain = timeout - (now - start);
>      struct timespec ts;
>      ts.tv_sec = remain / 1000;
>      ts.tv_nsec = (remain % 1000) * 1000 * 1000;
>      syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
>      if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
>        return 1;
>      now = current_time_ms();
>      if (now - start > timeout)
>        return 0;
>    }
> }
> 
> #define IFLA_IPVLAN_FLAGS 2
> #define IPVLAN_MODE_L3S 2
> #undef IPVLAN_F_VEPA
> #define IPVLAN_F_VEPA 2
> 
> #define TUN_IFACE "syz_tun"
> #define LOCAL_MAC 0xaaaaaaaaaaaa
> #define REMOTE_MAC 0xaaaaaaaaaabb
> #define LOCAL_IPV4 "172.20.20.170"
> #define REMOTE_IPV4 "172.20.20.187"
> #define LOCAL_IPV6 "fe80::aa"
> #define REMOTE_IPV6 "fe80::bb"
> 
> #define IFF_NAPI 0x0010
> 
> #define DEVLINK_FAMILY_NAME "devlink"
> 
> #define DEVLINK_CMD_PORT_GET 5
> #define DEVLINK_ATTR_BUS_NAME 1
> #define DEVLINK_ATTR_DEV_NAME 2
> #define DEVLINK_ATTR_NETDEV_NAME 7
> 
> #define DEV_IPV4 "172.20.20.%d"
> #define DEV_IPV6 "fe80::%02x"
> #define DEV_MAC 0x00aaaaaaaaaa
> 
> #define WG_GENL_NAME "wireguard"
> enum wg_cmd {
>    WG_CMD_GET_DEVICE,
>    WG_CMD_SET_DEVICE,
> };
> enum wgdevice_attribute {
>    WGDEVICE_A_UNSPEC,
>    WGDEVICE_A_IFINDEX,
>    WGDEVICE_A_IFNAME,
>    WGDEVICE_A_PRIVATE_KEY,
>    WGDEVICE_A_PUBLIC_KEY,
>    WGDEVICE_A_FLAGS,
>    WGDEVICE_A_LISTEN_PORT,
>    WGDEVICE_A_FWMARK,
>    WGDEVICE_A_PEERS,
> };
> enum wgpeer_attribute {
>    WGPEER_A_UNSPEC,
>    WGPEER_A_PUBLIC_KEY,
>    WGPEER_A_PRESHARED_KEY,
>    WGPEER_A_FLAGS,
>    WGPEER_A_ENDPOINT,
>    WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL,
>    WGPEER_A_LAST_HANDSHAKE_TIME,
>    WGPEER_A_RX_BYTES,
>    WGPEER_A_TX_BYTES,
>    WGPEER_A_ALLOWEDIPS,
>    WGPEER_A_PROTOCOL_VERSION,
> };
> enum wgallowedip_attribute {
>    WGALLOWEDIP_A_UNSPEC,
>    WGALLOWEDIP_A_FAMILY,
>    WGALLOWEDIP_A_IPADDR,
>    WGALLOWEDIP_A_CIDR_MASK,
> };
> 
> #define MAX_FDS 30
> 
> #define XT_TABLE_SIZE 1536
> #define XT_MAX_ENTRIES 10
> 
> struct xt_counters {
>    uint64_t pcnt, bcnt;
> };
> 
> struct ipt_getinfo {
>    char name[32];
>    unsigned int valid_hooks;
>    unsigned int hook_entry[5];
>    unsigned int underflow[5];
>    unsigned int num_entries;
>    unsigned int size;
> };
> 
> struct ipt_get_entries {
>    char name[32];
>    unsigned int size;
>    uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
> 
> struct ipt_replace {
>    char name[32];
>    unsigned int valid_hooks;
>    unsigned int num_entries;
>    unsigned int size;
>    unsigned int hook_entry[5];
>    unsigned int underflow[5];
>    unsigned int num_counters;
>    struct xt_counters* counters;
>    uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
> 
> struct ipt_table_desc {
>    const char* name;
>    struct ipt_getinfo info;
>    struct ipt_replace replace;
> };
> 
> #define IPT_BASE_CTL 64
> #define IPT_SO_SET_REPLACE (IPT_BASE_CTL)
> #define IPT_SO_GET_INFO (IPT_BASE_CTL)
> #define IPT_SO_GET_ENTRIES (IPT_BASE_CTL + 1)
> 
> struct arpt_getinfo {
>    char name[32];
>    unsigned int valid_hooks;
>    unsigned int hook_entry[3];
>    unsigned int underflow[3];
>    unsigned int num_entries;
>    unsigned int size;
> };
> 
> struct arpt_get_entries {
>    char name[32];
>    unsigned int size;
>    uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
> 
> struct arpt_replace {
>    char name[32];
>    unsigned int valid_hooks;
>    unsigned int num_entries;
>    unsigned int size;
>    unsigned int hook_entry[3];
>    unsigned int underflow[3];
>    unsigned int num_counters;
>    struct xt_counters* counters;
>    uint64_t entrytable[XT_TABLE_SIZE / sizeof(uint64_t)];
> };
> 
> #define ARPT_BASE_CTL 96
> #define ARPT_SO_SET_REPLACE (ARPT_BASE_CTL)
> #define ARPT_SO_GET_INFO (ARPT_BASE_CTL)
> #define ARPT_SO_GET_ENTRIES (ARPT_BASE_CTL + 1)
> 
> static void loop();
> 
> static int wait_for_loop(int pid)
> {
>    if (pid < 0)
>      exit(1);
>    int status = 0;
>    while (waitpid(-1, &status, __WALL) != pid) {
>    }
>    return WEXITSTATUS(status);
> }
> 
> static int do_sandbox_none(void)
> {
>    if (unshare(CLONE_NEWPID)) {
>    }
>    int pid = fork();
>    if (pid != 0)
>      return wait_for_loop(pid);
> 
>    if (unshare(CLONE_NEWNET)) {
>    }
>    loop();
>    exit(1);
> }
> 
> #define FS_IOC_SETFLAGS _IOW('f', 2, long)
> static void remove_dir(const char* dir)
> {
>    int iter = 0;
>    DIR* dp = 0;
> retry:
>    while (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
>    }
>    dp = opendir(dir);
>    if (dp == NULL) {
>      if (errno == EMFILE) {
>        exit(1);
>      }
>      exit(1);
>    }
>    struct dirent* ep = 0;
>    while ((ep = readdir(dp))) {
>      if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0)
>        continue;
>      char filename[FILENAME_MAX];
>      snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
>      while (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW) == 0) {
>      }
>      struct stat st;
>      if (lstat(filename, &st))
>        exit(1);
>      if (S_ISDIR(st.st_mode)) {
>        remove_dir(filename);
>        continue;
>      }
>      int i;
>      for (i = 0;; i++) {
>        if (unlink(filename) == 0)
>          break;
>        if (errno == EPERM) {
>          int fd = open(filename, O_RDONLY);
>          if (fd != -1) {
>            long flags = 0;
>            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
>            }
>            close(fd);
>            continue;
>          }
>        }
>        if (errno == EROFS) {
>          break;
>        }
>        if (errno != EBUSY || i > 100)
>          exit(1);
>        if (umount2(filename, MNT_DETACH | UMOUNT_NOFOLLOW))
>          exit(1);
>      }
>    }
>    closedir(dp);
>    for (int i = 0;; i++) {
>      if (rmdir(dir) == 0)
>        break;
>      if (i < 100) {
>        if (errno == EPERM) {
>          int fd = open(dir, O_RDONLY);
>          if (fd != -1) {
>            long flags = 0;
>            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0) {
>            }
>            close(fd);
>            continue;
>          }
>        }
>        if (errno == EROFS) {
>          break;
>        }
>        if (errno == EBUSY) {
>          if (umount2(dir, MNT_DETACH | UMOUNT_NOFOLLOW))
>            exit(1);
>          continue;
>        }
>        if (errno == ENOTEMPTY) {
>          if (iter < 100) {
>            iter++;
>            goto retry;
>          }
>        }
>      }
>      exit(1);
>    }
> }
> 
> static void kill_and_wait(int pid, int* status)
> {
>    kill(-pid, SIGKILL);
>    kill(pid, SIGKILL);
>    for (int i = 0; i < 100; i++) {
>      if (waitpid(-1, status, WNOHANG | __WALL) == pid)
>        return;
>      usleep(1000);
>    }
>    DIR* dir = opendir("/sys/fs/fuse/connections");
>    if (dir) {
>      for (;;) {
>        struct dirent* ent = readdir(dir);
>        if (!ent)
>          break;
>        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
>          continue;
>        char abort[300];
>        snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
>                 ent->d_name);
>        int fd = open(abort, O_WRONLY);
>        if (fd == -1) {
>          continue;
>        }
>        if (write(fd, abort, 1) < 0) {
>        }
>        close(fd);
>      }
>      closedir(dir);
>    } else {
>    }
>    while (waitpid(-1, status, __WALL) != pid) {
>    }
> }
> 
> static void close_fds()
> {
>    for (int fd = 3; fd < MAX_FDS; fd++)
>      close(fd);
> }
> 
> struct thread_t {
>    int created, call;
>    event_t ready, done;
> };
> 
> static struct thread_t threads[16];
> static void execute_call(int call);
> static int running;
> 
> static void* thr(void* arg)
> {
>    struct thread_t* th = (struct thread_t*)arg;
>    for (;;) {
>      event_wait(&th->ready);
>      event_reset(&th->ready);
>      execute_call(th->call);
>      __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
>      event_set(&th->done);
>    }
>    return 0;
> }
> 
> static void execute_one(void)
> {
>    int i, call, thread;
>    for (call = 0; call < 10; call++) {
>      for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
>           thread++) {
>        struct thread_t* th = &threads[thread];
>        if (!th->created) {
>          th->created = 1;
>          event_init(&th->ready);
>          event_init(&th->done);
>          event_set(&th->done);
>          thread_start(thr, th);
>        }
>        if (!event_isset(&th->done))
>          continue;
>        event_reset(&th->done);
>        th->call = call;
>        __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
>        event_set(&th->ready);
>        event_timedwait(&th->done, 50);
>        break;
>      }
>    }
>    for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
>      sleep_ms(1);
>    close_fds();
> }
> 
> static void execute_one(void);
> 
> #define WAIT_FLAGS __WALL
> 
> static void loop(void)
> {
>    int iter = 0;
>    for (;; iter++) {
>      char cwdbuf[32];
>      sprintf(cwdbuf, "./%d", iter);
>      if (mkdir(cwdbuf, 0777))
>        exit(1);
>      int pid = fork();
>      if (pid < 0)
> 	    exit(1);
>      if (pid == 0) {
>        if (chdir(cwdbuf))
>          exit(1);
>        execute_one();
>        exit(0);
>      }
>      int status = 0;
>      uint64_t start = current_time_ms();
>      for (;;) {
>         if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid) {
> 	 break;
>         }
>        sleep_ms(1);
>        if (current_time_ms() - start < 5000)
>          continue;
>        kill_and_wait(pid, &status);
>        break;
>      }
>      remove_dir(cwdbuf);
>    }
> }
> 
> uint64_t r[5] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff,
>                   0xffffffffffffffff, 0x0};
> 
> void execute_call(int call)
> {
>    intptr_t res = 0;
>    switch (call) {
>    case 0:
>      memcpy((void*)0x20000080, "./bus\000", 6);
>      res = syscall(__NR_open, 0x20000080ul, 0x14d842ul, 0ul);
>      if (res != -1)
>        r[0] = res;
>      break;
>    case 1:
>      memcpy((void*)0x20000000, "/proc/self/exe\000", 15);
>      res = syscall(__NR_openat, 0xffffff9c, 0x20000000ul, 0ul, 0ul);
>      if (res != -1)
>        r[1] = res;
>      break;
>    case 2:
>      memcpy((void*)0x20002000, "./bus\000", 6);
>      res = syscall(__NR_open, 0x20002000ul, 0x143042ul, 0ul);
>      if (res != -1)
>        r[2] = res;
>      break;
>    case 3:
>      syscall(__NR_ftruncate, r[2], 0x2008002ul);
>      break;
>    case 4:
>      memcpy((void*)0x20000400, "./bus\000", 6);
>      res = syscall(__NR_open, 0x20000400ul, 0x14103eul, 0ul);
>      if (res != -1)
>        r[3] = res;
>      break;
>    case 5:
>      syscall(__NR_mmap, 0x20000000ul, 0x600000ul, 0x7ffffeul, 0x11ul, r[3], 0ul);
>      break;
>    case 6:
>      syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
>      break;
>    case 7:
>      res = syscall(__NR_gettid);
>      if (res != -1)
>        r[4] = res;
>      break;
>    case 8:
>      *(uint64_t*)0x20c22000 = 0x2034afa4;
>      *(uint64_t*)0x20c22008 = 0x1f80;
>      *(uint64_t*)0x20c22fa0 = 0x20000080;
>      *(uint64_t*)0x20c22fa8 = 0x2034afa5;
>      syscall(__NR_process_vm_writev, r[4], 0x20c22000ul, 1ul, 0x20c22fa0ul, 1ul,
>              0ul);
>      break;
>    case 9:
>      syscall(__NR_sendfile, r[0], r[1], 0ul, 0x80000005ul);
>      break;
>    }
> }
> int main(void)
> {
>    syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
>    syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
>    syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
> 
>    use_temporary_dir();
>    do_sandbox_none();
> 
>    return 0;
> }
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
  2022-02-18  1:06 ` John Hubbard
@ 2022-02-18  2:54 ` Theodore Ts'o
  2022-02-18  4:24   ` Matthew Wilcox
  2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
  2 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-18  2:54 UTC (permalink / raw)
  To: Lee Jones
  Cc: linux-ext4, Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Andreas Gruenbacher, Ritesh Harjani, Greg Kroah-Hartman,
	Johannes Thumshirn, linux-xfs, linux-fsdevel, cluster-devel,
	linux-kernel

On Wed, Feb 16, 2022 at 04:31:36PM +0000, Lee Jones wrote:
> 
> After recently receiving a bug report from Syzbot [0] which was raised
> specifically against the Android v5.10 kernel, I spent some time
> trying to get to the crux.  Firstly I reproduced the issue on the
> reported kernel, then did the same using the latest release kernel
> v5.16.
> 
> The full kernel trace can be seen below at [1].
> 

Lee, thanks for your work in trimming down the syzkaller reproducer.
The moral equivalent of what it is doing (except each system call is
done in a separate thread, with synchronization so each gets executed
in order, but perhaps on a different CPU) is:

        int dest_fd, src_fd, truncate_fd, mmap_fd;
        pid_t tid;
        struct iovec local_iov, remote_iov;

        dest_fd = open("./bus", O_RDWR|O_CREAT|O_NONBLOCK|O_SYNC|
                       O_DIRECT|O_LARGEFILE|O_NOATIME, 0);
        src_fd = openat(AT_FDCWD, "/bin/bash", O_RDONLY);
        truncate_fd = syscall(__NR_open, "./bus", O_RDWR|O_CREAT|O_SYNC|O_NOATIME|O_ASYNC);
        ftruncate(truncate_fd, 0x2008002ul);
        mmap((void *) 0x20000000ul /* addr */, 0x600000ul /* length */,
             PROT_WRITE|PROT_EXEC|PROT_SEM||0x7ffff0, MAP_FIXED|MAP_SHARED, mmap_fd, 0);
        sendfile(dest_fd, src_fd, 0 /* offset */, 0x80000005ul /* count */);
        local_iov.iov_base = (void *) 0x2034afa4;
        local_iov.iov_len = 0x1f80;
        remote_iov.iov_base = (void *) 0x20000080;
        remote_iov.iov_len = 0x2034afa5;
        process_vm_writev(gettid(), &local_iov, 1, &remote_iov, 1, 0);
        sendfile(dest_fd, src_fd, 0 /* offset */, 0x1f000005ul /* count */);

Which is then executed repeataedly over and over again.  (With the
file descriptors closed at each loop, so the file is reopened each time.)

So basically, we have a scratch file which is initialized using an
sendfile using O_DIRECT.  The scratch file is also mmap'ed into the
process's address space, and we then *modify* that an mmap'ed reagion
using process_vm_writev() --- and this is where the problem starts.

process_vm_writev() uses [un]pin_user_pages_remote() which is the same
interface uses for RDMA.  But it's not clear this is ever supposed to
work for memory which is mmap'ed region backed by a file.
pin_user_pages_remote() appears to assume that it is an anonymous
region, since the get_user_pages functions in mm/gup.c don't call
read_page() to read data into any pages that might not be mmaped in.

They also don't follow the mm / file system protocol of calling the
file system's write_begin() or page_mkwrite() before modifying a page,
and so when when process_vm_writev() calls unpin_user_pages_remote(),
this tries to dirty the page, but since page_mkwrite() or
write_begin() hasn't been called, buffers haven't been attached to the
page, and so that triggers the following ext4 WARN_ON:

[ 1451.095939] WARNING: CPU: 1 PID: 449 at fs/ext4/inode.c:3565 ext4_set_page_dirty+0x48/0x50
  ...
[ 1451.139877] Call Trace:
[ 1451.140833]  <TASK>
[ 1451.141889]  folio_mark_dirty+0x2f/0x60
[ 1451.143408]  set_page_dirty_lock+0x3e/0x60
[ 1451.145130]  unpin_user_pages_dirty_lock+0x108/0x130
[ 1451.147405]  process_vm_rw_single_vec.constprop.0+0x1b9/0x260
[ 1451.150135]  process_vm_rw_core.constprop.0+0x156/0x280
[ 1451.159576]  process_vm_rw+0xc4/0x110


Then when ext4_writepages() gets called, we trigger the BUG because
buffers haven't been attached to the page.  We can certainly avoid the
BUG by checking to see if buffers are attached first, and if not, skip
writing the page trigger a WARN_ON, and then forcibly clear the page
dirty flag so don't permanently leak memory and allow the file system
to be unmounted.  (Note: we can't fix the problem by attaching the
buffers in set_page_dirty(), since is occasionally called under
spinlocks and without the page being locked, so we can't do any kind
of allocation, so ix-nay on calling create_empty_buffers() which calls
alloc_buffer_head().)

BUT, that is really papering over the problem, since it's not clear
it's valid to try to use get_user_pages() and friends (including
[un]pin_user_pages_remote() on file-backed memory.

And if it is supposed to be valid, then mm/gup.c needs to first call
readpage() if the page is not present, so that if process_vm_writev()
is only modifying a few bytes in the mmap'ed region, we need to fault
in the page first, and then mm/gup.c needs to inform the file system
to make sure that if pinned memory region is not yet allocated, than
whatever needs to happen to allocate space, via page_mkwrite() has
taken place.  (And by the way, that means that pin_user_pages_remote()
may need to return ENOSPC if there is not free space in the file
system, and hence ENOSPC may need to reflected all the way back to
process_vm_writev().

Alternatively, if we don't expect process_vm_writev() to work on
file-backed memory, perhaps it and pin_user_pages_remote() should
return some kind of error?

	    	      	     	    	      - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  1:06 ` John Hubbard
@ 2022-02-18  4:08   ` Theodore Ts'o
  2022-02-18  6:33     ` John Hubbard
  2022-02-18  7:51     ` Greg Kroah-Hartman
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-18  4:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
> Yes. And looking at the pair of backtraces below, this looks very much
> like another aspect of the "get_user_pages problem" [1], originally
> described in Jan Kara's 2018 email [2].

Hmm... I just posted my analysis, which tracks with yours; but I had
forgotten about Jan's 2018 e-mail on the matter.

> I'm getting close to posting an RFC for the direct IO conversion to
> FOLL_PIN, but even after that, various parts of the kernel (reclaim,
> filesystems/block layer) still need to be changed so as to use
> page_maybe_dma_pinned() to help avoid this problem. There's a bit
> more than that, actually.

The challenge is that fixing this "the right away" is probably not
something we can backport into an LTS kernel, whether it's 5.15 or
5.10... or 4.19.

The only thing which can probably survive getting backported is
something like this.  It won't make the right thing happen if someone
is trying to RDMA or call process_vm_writev() into a file-backed
memory region --- but I'm not sure I care.  Certainly if the goal is
to make Android kernels, I'm pretty sure they are't either using RDMA,
and I suspect they are probably masking out the process_vm_writev(2)
system call (at least, for Android O and newer).  So if the goal is to
just to close some Syzbot bugs, what do folks think about this?

     	      	   	  	     	- Ted

commit 7711b1fda6f7f04274fa1cba6f092410262b0296
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Feb 17 22:54:03 2022 -0500

    ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()
    
    [un]pin_user_pages_remote is dirtying pages without properly warning
    the file system in advance (or faulting in the file data if the page
    is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
    and more recently has resulted in bug reports by Syzbot in various
    Android kernels[2].
    
    Fixing this for real is non-trivial, and will never be backportable
    into stable kernels.  So this is a simple workaround that stops the
    kernel from BUG()'ing.  The changed pages will not be properly written
    back, but given that the current gup code is missing the "read" in
    "read-modify-write", the dirty page in the page cache might contain
    corrupted data anyway.
    
    [1] https://www.spinics.net/lists/linux-mm/msg142700.html
    [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
    
    Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
    Reported-by: Lee Jones <lee.jones@linaro.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..3b2f336a90d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_SIZE;
 
+	/* Should never happen but for buggy gup code */
+	if (!page_has_buffers(page)) {
+		ext4_warning_inode(inode,
+		   "page %lu does not have buffers attached", page->index);
+		ClearPageDirty(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	page_bufs = page_buffers(page);
 	/*
 	 * We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
+			/* Should never happen but for buggy gup code */
+			if (!page_has_buffers(page)) {
+				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+				ClearPageDirty(page);
+				unlock_page(page);
+				continue;
+			}
+
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  2:54 ` Theodore Ts'o
@ 2022-02-18  4:24   ` Matthew Wilcox
  2022-02-18  6:03     ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2022-02-18  4:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Thu, Feb 17, 2022 at 09:54:30PM -0500, Theodore Ts'o wrote:
> process_vm_writev() uses [un]pin_user_pages_remote() which is the same
> interface uses for RDMA.  But it's not clear this is ever supposed to
> work for memory which is mmap'ed region backed by a file.
> pin_user_pages_remote() appears to assume that it is an anonymous
> region, since the get_user_pages functions in mm/gup.c don't call
> read_page() to read data into any pages that might not be mmaped in.

... it doesn't end up calling handle_mm_fault() in faultin_page()?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  4:24   ` Matthew Wilcox
@ 2022-02-18  6:03     ` Theodore Ts'o
  0 siblings, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-18  6:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 18, 2022 at 04:24:20AM +0000, Matthew Wilcox wrote:
> On Thu, Feb 17, 2022 at 09:54:30PM -0500, Theodore Ts'o wrote:
> > process_vm_writev() uses [un]pin_user_pages_remote() which is the same
> > interface uses for RDMA.  But it's not clear this is ever supposed to
> > work for memory which is mmap'ed region backed by a file.
> > pin_user_pages_remote() appears to assume that it is an anonymous
> > region, since the get_user_pages functions in mm/gup.c don't call
> > read_page() to read data into any pages that might not be mmaped in.
> 
> ... it doesn't end up calling handle_mm_fault() in faultin_page()?

Ah yes, sorry, I missed that.  This is what happens when a syzbot bug
is thrown to a file system developer, who then has to wade theough
mm code for which he is not understand....

	   	    	     	    - Ted
			    

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  4:08   ` Theodore Ts'o
@ 2022-02-18  6:33     ` John Hubbard
  2022-02-23 23:31       ` Theodore Ts'o
  2022-02-18  7:51     ` Greg Kroah-Hartman
  1 sibling, 1 reply; 29+ messages in thread
From: John Hubbard @ 2022-02-18  6:33 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/17/22 20:08, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
>> Yes. And looking at the pair of backtraces below, this looks very much
>> like another aspect of the "get_user_pages problem" [1], originally
>> described in Jan Kara's 2018 email [2].
> 
> Hmm... I just posted my analysis, which tracks with yours; but I had
> forgotten about Jan's 2018 e-mail on the matter.
> 
>> I'm getting close to posting an RFC for the direct IO conversion to
>> FOLL_PIN, but even after that, various parts of the kernel (reclaim,
>> filesystems/block layer) still need to be changed so as to use
>> page_maybe_dma_pinned() to help avoid this problem. There's a bit
>> more than that, actually.
> 
> The challenge is that fixing this "the right away" is probably not
> something we can backport into an LTS kernel, whether it's 5.15 or
> 5.10... or 4.19.
> 
> The only thing which can probably survive getting backported is
> something like this.  It won't make the right thing happen if someone
> is trying to RDMA or call process_vm_writev() into a file-backed
> memory region --- but I'm not sure I care.  Certainly if the goal is
> to make Android kernels, I'm pretty sure they are't either using RDMA,
> and I suspect they are probably masking out the process_vm_writev(2)
> system call (at least, for Android O and newer).  So if the goal is to
> just to close some Syzbot bugs, what do folks think about this?
> 
>       	      	   	  	     	- Ted

Hi Ted!

This seems reasonable...-ish. Although this could turn into a pretty
grand game of whack-a-mole, one filesystem at a time. :)


> 
> commit 7711b1fda6f7f04274fa1cba6f092410262b0296
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Feb 17 22:54:03 2022 -0500
> 
>      ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()
>      
>      [un]pin_user_pages_remote is dirtying pages without properly warning
>      the file system in advance (or faulting in the file data if the page

Just a small thing I'll say once, to get it out of my system. No action
required here, I just want it understood:

Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
pin_user_pages_remote()"), you would have written that like this:

"process_vm_writev() is dirtying pages without properly warning the file
system in advance..."

Because, there were many callers that were doing this:

     get_user_pages*()
     ...use the pages...

     for_each_page() {
             set_page_dirty*()
             put_page()
     }

anyway, moving on smartly...

>      is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
>      and more recently has resulted in bug reports by Syzbot in various
>      Android kernels[2].
>      
>      Fixing this for real is non-trivial, and will never be backportable
>      into stable kernels.  So this is a simple workaround that stops the
>      kernel from BUG()'ing.  The changed pages will not be properly written
>      back, but given that the current gup code is missing the "read" in
>      "read-modify-write", the dirty page in the page cache might contain
>      corrupted data anyway.
>      
>      [1] https://www.spinics.net/lists/linux-mm/msg142700.html

(Sorry my earlier response mangled this link. I've manually fixed it
here, and am working with our IT to get the root cause addressed.)

>      [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
>      
>      Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
>      Reported-by: Lee Jones <lee.jones@linaro.org>
>      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 01c9e4f743ba..3b2f336a90d1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
>   	else
>   		len = PAGE_SIZE;
>   
> +	/* Should never happen but for buggy gup code */
> +	if (!page_has_buffers(page)) {
> +		ext4_warning_inode(inode,
> +		   "page %lu does not have buffers attached", page->index);

I see that ext4_warning_inode() has rate limiting, but it doesn't look
like it's really intended for a per-page rate. It looks like it is
per-superblock (yes?), so I suspect one instance of this problem, with
many pages involved, could hit the limit.

Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
either, but it might be better here. OTOH, I have minimal experience
with ext4_warning_inode() and maybe it really is just fine with per-page
failure rates.

thanks,
-- 
John Hubbard
NVIDIA

> +		ClearPageDirty(page);
> +		unlock_page(page);
> +		return 0;
> +	}
> +
>   	page_bufs = page_buffers(page);
>   	/*
>   	 * We cannot do block allocation or other extent handling in this
> @@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>   			wait_on_page_writeback(page);
>   			BUG_ON(PageWriteback(page));
>   
> +			/* Should never happen but for buggy gup code */
> +			if (!page_has_buffers(page)) {
> +				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
> +				ClearPageDirty(page);
> +				unlock_page(page);
> +				continue;
> +			}
> +
>   			if (mpd->map.m_len == 0)
>   				mpd->first_page = page->index;
>   			mpd->next_page = page->index + 1;


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  4:08   ` Theodore Ts'o
  2022-02-18  6:33     ` John Hubbard
@ 2022-02-18  7:51     ` Greg Kroah-Hartman
  2022-02-23 23:35       ` Theodore Ts'o
  1 sibling, 1 reply; 29+ messages in thread
From: Greg Kroah-Hartman @ 2022-02-18  7:51 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: John Hubbard, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Johannes Thumshirn, linux-xfs, linux-fsdevel, cluster-devel,
	linux-kernel

On Thu, Feb 17, 2022 at 11:08:38PM -0500, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
> > Yes. And looking at the pair of backtraces below, this looks very much
> > like another aspect of the "get_user_pages problem" [1], originally
> > described in Jan Kara's 2018 email [2].
> 
> Hmm... I just posted my analysis, which tracks with yours; but I had
> forgotten about Jan's 2018 e-mail on the matter.
> 
> > I'm getting close to posting an RFC for the direct IO conversion to
> > FOLL_PIN, but even after that, various parts of the kernel (reclaim,
> > filesystems/block layer) still need to be changed so as to use
> > page_maybe_dma_pinned() to help avoid this problem. There's a bit
> > more than that, actually.
> 
> The challenge is that fixing this "the right away" is probably not
> something we can backport into an LTS kernel, whether it's 5.15 or
> 5.10... or 4.19.

Don't worry about stable backports to start with.  Do it the "right way"
first and then we can consider if it needs to be backported or not.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  6:33     ` John Hubbard
@ 2022-02-23 23:31       ` Theodore Ts'o
  2022-02-24  0:44         ` John Hubbard
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-23 23:31 UTC (permalink / raw)
  To: John Hubbard
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
> 
> Just a small thing I'll say once, to get it out of my system. No action
> required here, I just want it understood:
> 
> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
> pin_user_pages_remote()"), you would have written that like this:
> 
> "process_vm_writev() is dirtying pages without properly warning the file
> system in advance..."
> 
> Because, there were many callers that were doing this:
> 
>     get_user_pages*()
>     ...use the pages...
> 
>     for_each_page() {
>             set_page_dirty*()
>             put_page()
>     }

Sure, but that's not sufficient when modifying file-backed pages.
Previously, there was only two ways of modifying file-backed pages in
the page cache --- either using the write(2) system call, or when a
mmaped page is modified by the userspace.

In the case of write(2), the file system gets notified before the page
cache is modified by a call to the address operation's write_begin()
call, and after the page cache is modified, the address operation's
write_end() call lets the file system know that the modification is
done.  After the write is done, the 30 second writeback timer is
triggered, and in the case of ext4's data=journalling mode, we close
the ext4 micro-transation (and therefore the time between write_begin
and write_end calls needs to be minimal); otherwise this can block
ext4 transactions.

In the case of a user page fault, the address operation's
page_mkwrite() is called, and at that point we will allocate any
blocks needed to back memory if necessary (in the case of delayed
allocation, file system space has to get reserved).  The problem here
for remote access is that page_mkwrite() can block, and it can
potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
can't just add the page buffers and do the file system allocation in
set_page_dirty(), since set_page_dirty() isn't allowed to block.

One approach might be to make all of the pages writeable when
pin_user_pages_remote() is called.  That that would mean that in the
case of remote access via process_vm_writev or RDMA, all of the blocks
will be allocated early.  But that's probably better since at that
point the userspace code is in a position to receive the error when
setting up the RDMA memory, and we don't want to be asking the file
system to do block allocation when incoming data is coming in from
Infiniband or iWARP.

> I see that ext4_warning_inode() has rate limiting, but it doesn't look
> like it's really intended for a per-page rate. It looks like it is
> per-superblock (yes?), so I suspect one instance of this problem, with
> many pages involved, could hit the limit.
> 
> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
> either, but it might be better here. OTOH, I have minimal experience
> with ext4_warning_inode() and maybe it really is just fine with per-page
> failure rates.

With the syzbot reproducer, we're not actually triggering the rate
limiter, since the ext4 warning is only getting hit a bit more than
once every 4 seconds.  And I'm guessing that in the real world, people
aren't actually trying to do remote direct access to file-backed
memory, at least not using ext4, since that's an invitation to a
kernel crash, and we would have gotten user complaints.  If some user
actually tries to use process_vm_writev for realsies, as opposed to a
random fuzzer or from a malicious program , we do want to warn them
about the potential data loss, so I'd prefer to warn once for each
inode --- but I'm not convinced that it's worth the effort.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-18  7:51     ` Greg Kroah-Hartman
@ 2022-02-23 23:35       ` Theodore Ts'o
  2022-02-24  1:48         ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-23 23:35 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: John Hubbard, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Johannes Thumshirn, linux-xfs, linux-fsdevel, cluster-devel,
	linux-kernel

On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote:
> > The challenge is that fixing this "the right away" is probably not
> > something we can backport into an LTS kernel, whether it's 5.15 or
> > 5.10... or 4.19.
> 
> Don't worry about stable backports to start with.  Do it the "right way"
> first and then we can consider if it needs to be backported or not.

Fair enough; on the other hand, we could also view this as making ext4
more robust against buggy code in other subsystems, and while other
file systems may be losing user data if they are actually trying to do
remote memory access to file-backed memory, apparently other file
systems aren't noticing and so they're not crashing.  Issuing a
warning and then not crashing is arguably a better way for ext4 to
react, especially if there are other parts of the kernel that are
randomly calling set_page_dirty() on file-backed memory without
properly first informing the file system in a context where it can
block and potentially do I/O to do things like allocate blocks.

      	  	      	     	      	 - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-23 23:31       ` Theodore Ts'o
@ 2022-02-24  0:44         ` John Hubbard
  2022-02-24  4:04           ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: John Hubbard @ 2022-02-24  0:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because, there were many callers that were doing this:
>>
>>     get_user_pages*()
>>     ...use the pages...
>>
>>     for_each_page() {
>>             set_page_dirty*()
>>             put_page()
>>     }
> 
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly) 
that the problem is ages old, and after doing all this work on the 
prerequisites to solving it (pin_user_pages() is a prerequisite), it kind 
of bothers me to see a commit with "work around bugs in mm/gup.c" in the 
title. That's all. :)


> Previously, there was only two ways of modifying file-backed pages in
> the page cache --- either using the write(2) system call, or when a
> mmaped page is modified by the userspace.
> 
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done.  After the write is done, the 30 second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transation (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
> 
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back memory if necessary (in the case of delayed
> allocation, file system space has to get reserved).  The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer.
Each time there is something new, such as the micro-transaction detail.

> 
> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called.  That that would mean that in the
> case of remote access via process_vm_writev or RDMA, all of the blocks
> will be allocated early.  But that's probably better since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is coming in from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that
is still viable. I still think it's a good LSF/MM topic, to work through.

> 
>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
> 
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds.  And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints.  If some user

Actually...I can confirm that real customers really are doing *exactly* 
that! Despite the kernel crashes--because the crashes don't always 
happen unless you have a large (supercomputer-sized) installation. And 
even then it is not always root-caused properly.

I guess that goes in the "weird but true" category. :)


thanks,
-- 
John Hubbard
NVIDIA

> actually tries to use process_vm_writev for realsies, as opposed to a
> random fuzzer or from a malicious program , we do want to warn them
> about the potential data loss, so I'd prefer to warn once for each
> inode --- but I'm not convinced that it's worth the effort.
> 
> Cheers,
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-23 23:35       ` Theodore Ts'o
@ 2022-02-24  1:48         ` Dave Chinner
  2022-02-24  3:50           ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2022-02-24  1:48 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Greg Kroah-Hartman, John Hubbard, Lee Jones, linux-ext4,
	Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Andreas Gruenbacher, Ritesh Harjani, Johannes Thumshirn,
	linux-xfs, linux-fsdevel, cluster-devel, linux-kernel

On Wed, Feb 23, 2022 at 06:35:54PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote:
> > > The challenge is that fixing this "the right away" is probably not
> > > something we can backport into an LTS kernel, whether it's 5.15 or
> > > 5.10... or 4.19.
> > 
> > Don't worry about stable backports to start with.  Do it the "right way"
> > first and then we can consider if it needs to be backported or not.
> 
> Fair enough; on the other hand, we could also view this as making ext4
> more robust against buggy code in other subsystems, and while other
> file systems may be losing user data if they are actually trying to do
> remote memory access to file-backed memory, apparently other file
> systems aren't noticing and so they're not crashing.

Oh, we've noticed them, no question about that.  We've got bug
reports going back years for systems being crashed, triggering BUGs
and/or corrupting data on both XFS and ext4 filesystems due to users
trying to run RDMA applications with file backed pages.

Most of the people doing this now know that we won't support such
applications until the RDMA stack/hardware can trigger on-demand
write page faults the same way CPUs do when they first write to a
clean page. They don't have this, so mostly these people don't
bother reporting these class of problems to us anymore.  The
gup/RDMA infrastructure to make this all work is slowly moving
forwards, but it's not here yet.

> Issuing a
> warning and then not crashing is arguably a better way for ext4 to
> react, especially if there are other parts of the kernel that are
> randomly calling set_page_dirty() on file-backed memory without
> properly first informing the file system in a context where it can
> block and potentially do I/O to do things like allocate blocks.

I'm not sure that replacing the BUG() with a warning is good enough
- it's still indicative of an application doing something dangerous
that could result in silent data corruption and/or other problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-24  1:48         ` Dave Chinner
@ 2022-02-24  3:50           ` Theodore Ts'o
  2022-02-24 10:29             ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-24  3:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Greg Kroah-Hartman, John Hubbard, Lee Jones, linux-ext4,
	Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Andreas Gruenbacher, Ritesh Harjani, Johannes Thumshirn,
	linux-xfs, linux-fsdevel, cluster-devel, linux-kernel

On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote:
> > Fair enough; on the other hand, we could also view this as making ext4
> > more robust against buggy code in other subsystems, and while other
> > file systems may be losing user data if they are actually trying to do
> > remote memory access to file-backed memory, apparently other file
> > systems aren't noticing and so they're not crashing.
> 
> Oh, we've noticed them, no question about that.  We've got bug
> reports going back years for systems being crashed, triggering BUGs
> and/or corrupting data on both XFS and ext4 filesystems due to users
> trying to run RDMA applications with file backed pages.

Is this issue causing XFS to crash?  I didn't know that.

I tried the Syzbot reproducer with XFS mounted, and it didn't trigger
any crashes.  I'm sure data was getting corrupted, but I figured I
should bring ext4 to the XFS level of "at least we're not reliably
killing the kernel".

On ext4, an unprivileged process can use process_vm_writev(2) to crash
the system.  I don't know how quickly we can get a fix into mm/gup.c,
but if some other kernel path tries calling set_page_dirty() on a
file-backed page without first asking permission from the file system,
it seems to be nice if the file system doesn't BUG() --- as near as I
can tell, xfs isn't crashing in this case, but ext4 is.

					- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-24  0:44         ` John Hubbard
@ 2022-02-24  4:04           ` Theodore Ts'o
  0 siblings, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-24  4:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Wed, Feb 23, 2022 at 04:44:07PM -0800, John Hubbard wrote:
> 
> Actually...I can confirm that real customers really are doing *exactly* 
> that! Despite the kernel crashes--because the crashes don't always 
> happen unless you have a large (supercomputer-sized) installation. And 
> even then it is not always root-caused properly.

Interesting.  The syzbot reproducer triggers *reliably* on ext4 using
a 2 CPU qemu kernel running on a laptop, and it doesn't require root,
so it's reasonable that Lee is pushing for a fix --- even if for the
Android O or newer, Seccomp can probably prohibit trap
process_vm_writev(2), but it seems unfortunate if say, someone running
a Docker container could take down the entire host OS.

  	 	   	      	       	      - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
  2022-02-24  3:50           ` Theodore Ts'o
@ 2022-02-24 10:29             ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2022-02-24 10:29 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Greg Kroah-Hartman, John Hubbard, Lee Jones, linux-ext4,
	Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Andreas Gruenbacher, Ritesh Harjani, Johannes Thumshirn,
	linux-xfs, linux-fsdevel, cluster-devel, linux-kernel

On Wed, Feb 23, 2022 at 10:50:09PM -0500, Theodore Ts'o wrote:
> On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote:
> > > Fair enough; on the other hand, we could also view this as making ext4
> > > more robust against buggy code in other subsystems, and while other
> > > file systems may be losing user data if they are actually trying to do
> > > remote memory access to file-backed memory, apparently other file
> > > systems aren't noticing and so they're not crashing.
> > 
> > Oh, we've noticed them, no question about that.  We've got bug
> > reports going back years for systems being crashed, triggering BUGs
> > and/or corrupting data on both XFS and ext4 filesystems due to users
> > trying to run RDMA applications with file backed pages.
> 
> Is this issue causing XFS to crash?  I didn't know that.

I have no idea if crashes nowdays -  go back a few years before and
search for XFS BUGging out in ->invalidate_page (or was it
->release_page?) because of unexpected dirty pages. I think it could
also trigger BUGs in writeback when ->writepages tripped over a
dirty page without a delayed allocation mapping over the hole...

We were pretty aggressive about telling people reporting such issues
that they get to keep all the borken bits to themselves and to stop
wasting our time with unsolvable problems caused by their
broken-by-design RDMA applications. Hence people have largely
stopped bothering us with random filesystem crashes on systems using
RDMA on file-backed pages...

> I tried the Syzbot reproducer with XFS mounted, and it didn't trigger
> any crashes.  I'm sure data was getting corrupted, but I figured I
> should bring ext4 to the XFS level of "at least we're not reliably
> killing the kernel".

Oh, well, good to know XFS didn't die a horrible death immediately.
Thanks for checking, Ted.

> On ext4, an unprivileged process can use process_vm_writev(2) to crash
> the system.  I don't know how quickly we can get a fix into mm/gup.c,
> but if some other kernel path tries calling set_page_dirty() on a
> file-backed page without first asking permission from the file system,
> it seems to be nice if the file system doesn't BUG() --- as near as I
> can tell, xfs isn't crashing in this case, but ext4 is.

iomap is probably refusing to map holes for writepage - we've
cleaned up most of the weird edge cases to return errors, so I'm
guessing iomap is just ignoring such pages these days.

Yeah, see iomap_writepage_map():

                error = wpc->ops->map_blocks(wpc, inode, pos);
                if (error)
                        break;
                if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
                        continue;
                if (wpc->iomap.type == IOMAP_HOLE)
                        continue;

Yeah, so if writeback maps a hole rather than converts a delalloc
region to IOMAP_MAPPED, it'll just skip over the block/page.  IIRC,
they essentially become uncleanable pages, and I think eventually
inode reclaim will just toss them out of memory.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
  2022-02-18  1:06 ` John Hubbard
  2022-02-18  2:54 ` Theodore Ts'o
@ 2022-02-25 19:24 ` Theodore Ts'o
  2022-02-25 20:51   ` Eric Biggers
  2 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-25 19:24 UTC (permalink / raw)
  To: Lee Jones
  Cc: linux-ext4, Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Darrick J . Wong, Bob Peterson, Damien Le Moal,
	Andreas Gruenbacher, Ritesh Harjani, Greg Kroah-Hartman,
	Johannes Thumshirn, linux-xfs, linux-fsdevel, cluster-devel,
	linux-kernel

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance (or faulting in the file data if the page
is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
and more recently has resulted in bug reports by Syzbot in various
Android kernels[2].

This is technically a bug in the mm/gup.c codepath, but arguably ext4
is fragile in that a buggy get_user_pages() implementation causes ext4
to crash, where as other file systems are not crashing (although in
some cases the user data will be lost since gup code is not properly
informing the file system to potentially allocate blocks or reserve
space when writing into a sparse portion of file).  I suspect in real
life it is rare that people are using RDMA into file-backed memory,
which is why no one has complained to ext4 developers except fuzzing
programs.

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://www.spinics.net/lists/linux-mm/msg142700.html
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/inode.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..f8fefbf67306 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_SIZE;
 
+	/* Should never happen but for buggy gup code */
+	if (!page_has_buffers(page)) {
+		ext4_warning_inode(inode,
+		   "page %lu does not have buffers attached", page->index);
+		ClearPageDirty(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	page_bufs = page_buffers(page);
 	/*
 	 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
 			    unlikely(page->mapping != mapping)) {
 				unlock_page(page);
-				continue;
+				goto out;
 			}
 
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
+			/*
+			 * Should never happen but for buggy code in
+			 * other subsystemsa that call
+			 * set_page_dirty() without properly warning
+			 * the file system first.  See [1] for more
+			 * information.
+			 *
+			 * [1] https://www.spinics.net/lists/linux-mm/msg142700.html
+			 */
+			if (!page_has_buffers(page)) {
+				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+				ClearPageDirty(page);
+				unlock_page(page);
+				continue;
+			}
+
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
@ 2022-02-25 20:51   ` Eric Biggers
  2022-02-25 21:08     ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Biggers @ 2022-02-25 20:51 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 25, 2022 at 02:24:35PM -0500, Theodore Ts'o wrote:
> [un]pin_user_pages_remote is dirtying pages without properly warning
> the file system in advance (or faulting in the file data if the page
> is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
> and more recently has resulted in bug reports by Syzbot in various
> Android kernels[2].
> 
> This is technically a bug in the mm/gup.c codepath, but arguably ext4
> is fragile in that a buggy get_user_pages() implementation causes ext4
> to crash, where as other file systems are not crashing (although in
> some cases the user data will be lost since gup code is not properly
> informing the file system to potentially allocate blocks or reserve
> space when writing into a sparse portion of file).  I suspect in real
> life it is rare that people are using RDMA into file-backed memory,
> which is why no one has complained to ext4 developers except fuzzing
> programs.
> 
> So instead of crashing with a BUG, issue a warning (since there may be
> potential data loss) and just mark the page as clean to avoid
> unprivileged denial of service attacks until the problem can be
> properly fixed.  More discussion and background can be found in the
> thread starting at [2].
> 
> [1] https://www.spinics.net/lists/linux-mm/msg142700.html

Can you use a lore link
(https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u)?

> +			/*
> +			 * Should never happen but for buggy code in
> +			 * other subsystemsa that call

subsystemsa => subsystems

> +			 * set_page_dirty() without properly warning
> +			 * the file system first.  See [1] for more
> +			 * information.
> +			 *
> +			 * [1] https://www.spinics.net/lists/linux-mm/msg142700.html

Likewise, lore link here.

- Eric

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 20:51   ` Eric Biggers
@ 2022-02-25 21:08     ` Theodore Ts'o
  2022-02-25 21:23       ` [PATCH -v3] " Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-25 21:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 25, 2022 at 12:51:28PM -0800, Eric Biggers wrote:
> > 
> > [1] https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> Can you use a lore link
> (https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u)?

Sure, thanks for finding the lore link!  I did try searching for it
last night, but for some reason I wasn't able to surface it.

> > +			/*
> > +			 * Should never happen but for buggy code in
> > +			 * other subsystemsa that call
> 
> subsystemsa => subsystems

Already fixed in my local version of the patch.

> > +			 * [1] https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> Likewise, lore link here.

Ack.

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 21:08     ` Theodore Ts'o
@ 2022-02-25 21:23       ` Theodore Ts'o
  2022-02-25 21:33         ` John Hubbard
  2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-25 21:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  This was noted by Jan Kara in 2018[1] and
more recently has resulted in bug reports by Syzbot in various Android
kernels[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that a buggy kernel subsystem which dirty pages without properly
notifying the file system causes ext4 to BUG, while other file systems
are not (although user data likely will be lost).  I suspect in real
life it is rare that people are using RDMA into file-backed memory,
which is why no one has complained to ext4 developers except fuzzing
programs.

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/inode.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..008fe8750109 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_SIZE;
 
+	/* Should never happen but for bugs in other kernel subsystems */
+	if (!page_has_buffers(page)) {
+		ext4_warning_inode(inode,
+		   "page %lu does not have buffers attached", page->index);
+		ClearPageDirty(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	page_bufs = page_buffers(page);
 	/*
 	 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
 			    unlikely(page->mapping != mapping)) {
 				unlock_page(page);
-				continue;
+				goto out;
 			}
 
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
+			/*
+			 * Should never happen but for buggy code in
+			 * other subsystems that call
+			 * set_page_dirty() without properly warning
+			 * the file system first.  See [1] for more
+			 * information.
+			 *
+			 * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
+			 */
+			if (!page_has_buffers(page)) {
+				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+				ClearPageDirty(page);
+				unlock_page(page);
+				continue;
+			}
+
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 21:23       ` [PATCH -v3] " Theodore Ts'o
@ 2022-02-25 21:33         ` John Hubbard
  2022-02-25 23:21           ` Theodore Ts'o
  2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
  1 sibling, 1 reply; 29+ messages in thread
From: John Hubbard @ 2022-02-25 21:33 UTC (permalink / raw)
  To: Theodore Ts'o, Eric Biggers
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/25/22 13:23, Theodore Ts'o wrote:
> [un]pin_user_pages_remote is dirtying pages without properly warning
> the file system in advance.  This was noted by Jan Kara in 2018[1] and

In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
was actually that dio_bio_complete() was calling set_page_dirty_lock()
on pages that were not (any longer) set up for that.

> more recently has resulted in bug reports by Syzbot in various Android
> kernels[2].
> 
> This is technically a bug in mm/gup.c, but arguably ext4 is fragile in

Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
call into mm/gup.c, but that merely refactored things. The callers are
all over the kernel, and those callers are what need changing in order
to fix this.


thanks,
-- 
John Hubbard
NVIDIA

> that a buggy kernel subsystem which dirty pages without properly
> notifying the file system causes ext4 to BUG, while other file systems
> are not (although user data likely will be lost).  I suspect in real
> life it is rare that people are using RDMA into file-backed memory,
> which is why no one has complained to ext4 developers except fuzzing
> programs.
> 
> So instead of crashing with a BUG, issue a warning (since there may be
> potential data loss) and just mark the page as clean to avoid
> unprivileged denial of service attacks until the problem can be
> properly fixed.  More discussion and background can be found in the
> thread starting at [2].
> 
> [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
> 
> Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
> Reported-by: Lee Jones <lee.jones@linaro.org>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>   fs/ext4/inode.c | 27 ++++++++++++++++++++++++++-
>   1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 01c9e4f743ba..008fe8750109 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
>   	else
>   		len = PAGE_SIZE;
>   
> +	/* Should never happen but for bugs in other kernel subsystems */
> +	if (!page_has_buffers(page)) {
> +		ext4_warning_inode(inode,
> +		   "page %lu does not have buffers attached", page->index);
> +		ClearPageDirty(page);
> +		unlock_page(page);
> +		return 0;
> +	}
> +
>   	page_bufs = page_buffers(page);
>   	/*
>   	 * We cannot do block allocation or other extent handling in this
> @@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>   			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
>   			    unlikely(page->mapping != mapping)) {
>   				unlock_page(page);
> -				continue;
> +				goto out;
>   			}
>   
>   			wait_on_page_writeback(page);
>   			BUG_ON(PageWriteback(page));
>   
> +			/*
> +			 * Should never happen but for buggy code in
> +			 * other subsystems that call
> +			 * set_page_dirty() without properly warning
> +			 * the file system first.  See [1] for more
> +			 * information.
> +			 *
> +			 * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> +			 */
> +			if (!page_has_buffers(page)) {
> +				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
> +				ClearPageDirty(page);
> +				unlock_page(page);
> +				continue;
> +			}
> +
>   			if (mpd->map.m_len == 0)
>   				mpd->first_page = page->index;
>   			mpd->next_page = page->index + 1;



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 21:33         ` John Hubbard
@ 2022-02-25 23:21           ` Theodore Ts'o
  2022-02-26  0:41             ` John Hubbard
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-25 23:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance.  This was noted by Jan Kara in 2018[1] and
> 
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.

Fair enough, there are two problems that are getting conflated here,
and that's my bad.  The problem which Jan pointed out is one where the
Direct I/O read path triggered a page fault, so page_mkwrite() was
actually called.  So in this case, the file system was actually
notified, and the page was marked dirty after the file system was
notified.  But then the DIO read was racing with the page cleaner,
which would call writepage(), and then clear the page, and then remove
the buffer_heads.  Then dio_bio_complete() would call set_page_dirty()
a second time, and that's what would trigger the BUG.

But in the syzbot reproducer, it's a different problem.  In this case,
process_vm_writev() calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called.  So there is no need to race
with the page cleaner, and so the BUG triggers much more reliably.

> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> > 
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> 
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.

From my perspective, the bug is calling set_page_dirty() without first
calling the file system's page_mkwrite().  This is necessary since the
file system needs to allocate file system data blocks in preparation
for a future writeback.

Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it.  In actual practice, with a
Direct I/O read request racing with writeback, this is race was quite
hard to hit, because the that would imply that the background
writepage() call would have to complete ahead of the synchronous read
request, and the block layer generally prioritizes synchronous reads
ahead of background write requests.  So in practice, this race was
***very*** hard to hit.  Jan may have reported it in 2018, but I don't
think I've ever seen it happen myself.

For process_vm_writev() this is a case where user pages are pinned and
then released in short order, so I suspect that race with the page
cleaner would also be very hard to hit.  But we could completely
remove the potential for the race, and also make things kinder for
f2fs and btrfs's compressed file write support, by making things work
much like the write(2) system call.  Imagine if we had a
"pin_user_pages_local()" which calls write_begin(), and a
"unpin_user_pages_local()" which calls write_end(), and the
presumption with the "[un]pin_user_pages_local" API is that you don't
hold the pinned pages for very long --- say, not across a system call
boundary, and then it would work the same way the write(2) system call
works does except that in the case of process_vm_writev(2) the pages
are identified by another process's address space where they happen to
be mapped.

This obviously doesn't work when pinning pages for remote DMA, because
in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time, so that means we
can't use using write_begin/write_end; we'd need to call page_mkwrite()
when the pages are first pinned and then somehow prevent the page
cleaner from touching a dirty page which is pinned for use by the
remote DMA.

Does that make sense?

							- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 23:21           ` Theodore Ts'o
@ 2022-02-26  0:41             ` John Hubbard
  2022-02-26  1:40               ` Theodore Ts'o
  0 siblings, 1 reply; 29+ messages in thread
From: John Hubbard @ 2022-02-26  0:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/25/22 15:21, Theodore Ts'o wrote:
...
> For process_vm_writev() this is a case where user pages are pinned and
> then released in short order, so I suspect that race with the page
> cleaner would also be very hard to hit.  But we could completely
> remove the potential for the race, and also make things kinder for

Completely removing the race would be wonderful. Because large
supercomputer installations are good at hitting "rare" cases.


> f2fs and btrfs's compressed file write support, by making things work
> much like the write(2) system call.  Imagine if we had a
> "pin_user_pages_local()" which calls write_begin(), and a
> "unpin_user_pages_local()" which calls write_end(), and the

Right, that would supply the missing connection to the filesystems.

In fact, maybe these names about right:

     pin_user_file_pages()
     unpin_user_file_pages()

...and then put them in a filesystem header file, because these are now
tightly coupled to filesystems, what with the need to call
.write_begin() and .write_end().

OK...

> presumption with the "[un]pin_user_pages_local" API is that you don't
> hold the pinned pages for very long --- say, not across a system call
> boundary, and then it would work the same way the write(2) system call
> works does except that in the case of process_vm_writev(2) the pages
> are identified by another process's address space where they happen to
> be mapped.
> 
> This obviously doesn't work when pinning pages for remote DMA, because
> in that case the time between pin_user_pages_remote() and
> unpin_user_pages_remote() could be a long, long time, so that means we
> can't use using write_begin/write_end; we'd need to call page_mkwrite()
> when the pages are first pinned and then somehow prevent the page
> cleaner from touching a dirty page which is pinned for use by the
> remote DMA.
> 
> Does that make sense?
> 
> 							- Ted

Yes, I really like this suggestion. It would neatly solve most short
term pinning cases, without interfering with any future solutions for
the long term pinning cases. Very nice.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-26  0:41             ` John Hubbard
@ 2022-02-26  1:40               ` Theodore Ts'o
  2022-02-26  2:00                 ` Theodore Ts'o
  2022-02-26  2:55                 ` John Hubbard
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-26  1:40 UTC (permalink / raw)
  To: John Hubbard
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 25, 2022 at 04:41:14PM -0800, John Hubbard wrote:
> 
> > f2fs and btrfs's compressed file write support, by making things work
> > much like the write(2) system call.  Imagine if we had a
> > "pin_user_pages_local()" which calls write_begin(), and a
> > "unpin_user_pages_local()" which calls write_end(), and the
> 
> Right, that would supply the missing connection to the filesystems.
> 
> In fact, maybe these names about right:
> 
>     pin_user_file_pages()
>     unpin_user_file_pages()
> 
> ...and then put them in a filesystem header file, because these are now
> tightly coupled to filesystems, what with the need to call
> .write_begin() and .write_end().

Well, that makes it process_vm_writev()'s is that it needs to know
when to call pin_user_file_pages().  I suspect that for many use cases
--- for example, if this is being used by a debugger to modify a
variable on a stack, or an anonymous page in the program's data
segment, process_vm_writev() *isn't* actually pinning a file.  So they
want some kind of interface that automatically DTRT regardless of
whether the user pages being edited are file-backed or not
file-backed.

So some kind of [un]pin_user_pages_local() which will call
write_{begin,end}() if necessary would be the most convenient for
users such as process_vm_writev().   

And perhaps would it make sense for pin_user_pages to optionally (or
by default?) check for file-backed pages, and if it finds any, return
an error or stop pinning pages at that point, so the system call can
return EOPNOSUPP to the user, instead of silently causing user data to
be lost or corrupted as is currently the case with xfs and btrfs (and
ext4 once I patch it so it doesn't BUG).

I'll note that at least one caller of pin_user_pages, in fs/io_uring.c
takes it upon itself to check for file-backed pages, and returns
EOPNOTSUPP if there are any found.  Many that should be lifted to
pin_user_pages()?

For that matter, maybe pin_user_pages() and friends should take some
new FOLL_ flags to indicate whether file-backed pages should be
rejected, or perhaps they can promise they will only be holding the
pin for a very short amount of time (FOLL_SHORTERM?), and then
pin_user_pages() and unpin_user_pages() can automagically call
write_begin() and write_end() if necessary?  I dunno....

	      	  	      	 	       - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-26  1:40               ` Theodore Ts'o
@ 2022-02-26  2:00                 ` Theodore Ts'o
  2022-02-26  2:55                 ` John Hubbard
  1 sibling, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-02-26  2:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Fri, Feb 25, 2022 at 08:40:36PM -0500, Theodore Ts'o wrote:
> Well, that makes it process_vm_writev()'s is that it needs to know
> when to call pin_user_file_pages().

Sorry, typed too fast.  What I was trying to say is this make it
process_vm_writev()'s problem to figure out when it should call
pin_user_file_pages() versus some other pin_user_pages function.

> I suspect that for many use cases
> --- for example, if this is being used by a debugger to modify a
> variable on a stack, or an anonymous page in the program's data
> segment, process_vm_writev() *isn't* actually pinning a file.  So they
> want some kind of interface that automatically DTRT regardless of
> whether the user pages being edited are file-backed or not
> file-backed.

					- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-26  1:40               ` Theodore Ts'o
  2022-02-26  2:00                 ` Theodore Ts'o
@ 2022-02-26  2:55                 ` John Hubbard
  1 sibling, 0 replies; 29+ messages in thread
From: John Hubbard @ 2022-02-26  2:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On 2/25/22 17:40, Theodore Ts'o wrote:
...
>> ...and then put them in a filesystem header file, because these are now
>> tightly coupled to filesystems, what with the need to call
>> .write_begin() and .write_end().
> 
> Well, that makes it process_vm_writev()'s is that it needs to know
> when to call pin_user_file_pages().  I suspect that for many use cases
> --- for example, if this is being used by a debugger to modify a
> variable on a stack, or an anonymous page in the program's data
> segment, process_vm_writev() *isn't* actually pinning a file.  So they
> want some kind of interface that automatically DTRT regardless of
> whether the user pages being edited are file-backed or not
> file-backed.
> 
> So some kind of [un]pin_user_pages_local() which will call
> write_{begin,end}() if necessary would be the most convenient for
> users such as process_vm_writev().
> 

OK, yes.

> And perhaps would it make sense for pin_user_pages to optionally (or
> by default?) check for file-backed pages, and if it finds any, return
> an error or stop pinning pages at that point, so the system call can
> return EOPNOSUPP to the user, instead of silently causing user data to
> be lost or corrupted as is currently the case with xfs and btrfs (and
> ext4 once I patch it so it doesn't BUG).

Yes, also a good move. It is definitely time for this.

> 
> I'll note that at least one caller of pin_user_pages, in fs/io_uring.c
> takes it upon itself to check for file-backed pages, and returns

Well, not *exactly*: fs/io_uring.c calls is_file_hugepages(), which is a
check for hugetlbfs, rather than general check for file-backed pages. :)

But your point is still valid, and taken. The overall approach of,
"check for page type, then pin pages" is being done there.

> EOPNOTSUPP if there are any found.  Many that should be lifted to
> pin_user_pages()?
> 
> For that matter, maybe pin_user_pages() and friends should take some
> new FOLL_ flags to indicate whether file-backed pages should be
> rejected, or perhaps they can promise they will only be holding the
> pin for a very short amount of time (FOLL_SHORTERM?), and then

Naming: there is already a FOLL_LONGTERM, so anyone not using that is
already...non-FOLL_SHORTERM, so that would be too difficult to
understand.

Instead, maybe: FOLL_FILE, to indicate basically the inverse of your
FOLL_SHORTERM suggestion. And sweep through and augment the call sites
to pass in FOLL_FILE *at first*, so that the first patch leaves behavior
as-is. Then a patch per call site (bisection friendly), to start
actually changing behavior and dealing with the fallout.


> pin_user_pages() and unpin_user_pages() can automagically call
> write_begin() and write_end() if necessary?  I dunno....
> 
> 	      	  	      	 	       - Ted

This all sounds good to me. Thanks for thinking about this. I think this
is actually pretty easy to implement, too.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-02-25 21:23       ` [PATCH -v3] " Theodore Ts'o
  2022-02-25 21:33         ` John Hubbard
@ 2022-03-03  4:26         ` Theodore Ts'o
  2022-03-03  8:21           ` Christoph Hellwig
                             ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-03-03  4:26 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
---
v4 - only changes to the commit description to eliminate some inaccuracies
     and clarify the text.

 fs/ext4/inode.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..008fe8750109 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_SIZE;
 
+	/* Should never happen but for bugs in other kernel subsystems */
+	if (!page_has_buffers(page)) {
+		ext4_warning_inode(inode,
+		   "page %lu does not have buffers attached", page->index);
+		ClearPageDirty(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	page_bufs = page_buffers(page);
 	/*
 	 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
 			    unlikely(page->mapping != mapping)) {
 				unlock_page(page);
-				continue;
+				goto out;
 			}
 
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
+			/*
+			 * Should never happen but for buggy code in
+			 * other subsystems that call
+			 * set_page_dirty() without properly warning
+			 * the file system first.  See [1] for more
+			 * information.
+			 *
+			 * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
+			 */
+			if (!page_has_buffers(page)) {
+				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+				ClearPageDirty(page);
+				unlock_page(page);
+				continue;
+			}
+
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
@ 2022-03-03  8:21           ` Christoph Hellwig
  2022-03-03  9:21           ` Lee Jones
  2022-03-03 14:38           ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o
  2 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2022-03-03  8:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Biggers, Lee Jones, linux-ext4, Christoph Hellwig,
	Dave Chinner, Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

Looks good to me as a short-term bandaid:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
  2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
  2022-03-03  8:21           ` Christoph Hellwig
@ 2022-03-03  9:21           ` Lee Jones
  2022-03-03 14:38           ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o
  2 siblings, 0 replies; 29+ messages in thread
From: Lee Jones @ 2022-03-03  9:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Eric Biggers, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

On Wed, 02 Mar 2022, Theodore Ts'o wrote:

> [un]pin_user_pages_remote is dirtying pages without properly warning
> the file system in advance.  A related race was noted by Jan Kara in
> 2018[1]; however, more recently instead of it being a very hard-to-hit
> race, it could be reliably triggered by process_vm_writev(2) which was
> discovered by Syzbot[2].
> 
> This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> that if some other kernel subsystem dirty pages without properly
> notifying the file system using page_mkwrite(), ext4 will BUG, while
> other file systems will not BUG (although data will still be lost).
> 
> So instead of crashing with a BUG, issue a warning (since there may be
> potential data loss) and just mark the page as clean to avoid
> unprivileged denial of service attacks until the problem can be
> properly fixed.  More discussion and background can be found in the
> thread starting at [2].
> 
> [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
> [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
> 
> Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
> Reported-by: Lee Jones <lee.jones@linaro.org>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> Cc: stable@kernel.org
> ---
> v4 - only changes to the commit description to eliminate some inaccuracies
>      and clarify the text.
> 
>  fs/ext4/inode.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)

Thanks a bunch for sticking with this Ted.

I've been following along with great interest.

Sadly I am not in a position to provide a review.

Just wanted to pop by and say thank you.

-- 
Lee Jones [李琼斯]
Principal Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -v5] ext4: don't BUG if someone dirty pages without asking ext4 first
  2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
  2022-03-03  8:21           ` Christoph Hellwig
  2022-03-03  9:21           ` Lee Jones
@ 2022-03-03 14:38           ` Theodore Ts'o
  2 siblings, 0 replies; 29+ messages in thread
From: Theodore Ts'o @ 2022-03-03 14:38 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lee Jones, linux-ext4, Christoph Hellwig, Dave Chinner,
	Goldwyn Rodrigues, Darrick J . Wong, Bob Peterson,
	Damien Le Moal, Andreas Gruenbacher, Ritesh Harjani,
	Greg Kroah-Hartman, Johannes Thumshirn, linux-xfs, linux-fsdevel,
	cluster-devel, linux-kernel

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---

 -v5 Argh, sent the wrong version of this patch for V4.  (Which, if you tried
     testing it, would cause generic/013 on ext4/4k to die horribly. :-)  This
     is the correct final version.....

 fs/ext4/inode.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..531a94f48637 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_SIZE;
 
+	/* Should never happen but for bugs in other kernel subsystems */
+	if (!page_has_buffers(page)) {
+		ext4_warning_inode(inode,
+		   "page %lu does not have buffers attached", page->index);
+		ClearPageDirty(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	page_bufs = page_buffers(page);
 	/*
 	 * We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,22 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			wait_on_page_writeback(page);
 			BUG_ON(PageWriteback(page));
 
+			/*
+			 * Should never happen but for buggy code in
+			 * other subsystems that call
+			 * set_page_dirty() without properly warning
+			 * the file system first.  See [1] for more
+			 * information.
+			 *
+			 * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
+			 */
+			if (!page_has_buffers(page)) {
+				ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+				ClearPageDirty(page);
+				unlock_page(page);
+				continue;
+			}
+
 			if (mpd->map.m_len == 0)
 				mpd->first_page = page->index;
 			mpd->next_page = page->index + 1;
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-03-03 14:39 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
2022-02-18  1:06 ` John Hubbard
2022-02-18  4:08   ` Theodore Ts'o
2022-02-18  6:33     ` John Hubbard
2022-02-23 23:31       ` Theodore Ts'o
2022-02-24  0:44         ` John Hubbard
2022-02-24  4:04           ` Theodore Ts'o
2022-02-18  7:51     ` Greg Kroah-Hartman
2022-02-23 23:35       ` Theodore Ts'o
2022-02-24  1:48         ` Dave Chinner
2022-02-24  3:50           ` Theodore Ts'o
2022-02-24 10:29             ` Dave Chinner
2022-02-18  2:54 ` Theodore Ts'o
2022-02-18  4:24   ` Matthew Wilcox
2022-02-18  6:03     ` Theodore Ts'o
2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
2022-02-25 20:51   ` Eric Biggers
2022-02-25 21:08     ` Theodore Ts'o
2022-02-25 21:23       ` [PATCH -v3] " Theodore Ts'o
2022-02-25 21:33         ` John Hubbard
2022-02-25 23:21           ` Theodore Ts'o
2022-02-26  0:41             ` John Hubbard
2022-02-26  1:40               ` Theodore Ts'o
2022-02-26  2:00                 ` Theodore Ts'o
2022-02-26  2:55                 ` John Hubbard
2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
2022-03-03  8:21           ` Christoph Hellwig
2022-03-03  9:21           ` Lee Jones
2022-03-03 14:38           ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).