* KASAN: use-after-free Read in unix_dgram_poll
@ 2019-03-03 10:22 syzbot
2019-03-03 13:55 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: syzbot @ 2019-03-03 10:22 UTC (permalink / raw)
To: davem, jbaron, kgraul, ktkhai, kyeongdon.kim, linux-kernel,
netdev, pabeni, syzkaller-bugs, viro, xiyou.wangcong
Hello,
syzbot found the following crash on:
HEAD commit: 7d762d69145a afs: Fix manually set volume location server ..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=131d832ac00000
kernel config: https://syzkaller.appspot.com/x/.config?x=b76ec970784287c
dashboard link: https://syzkaller.appspot.com/bug?extid=503d4cc169fcec1cb18c
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11934262c00000
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
==================================================================
BUG: KASAN: use-after-free in unix_dgram_poll+0x5e1/0x690
net/unix/af_unix.c:2695
Read of size 4 at addr ffff88809292aae0 by task syz-executor.1/18946
CPU: 0 PID: 18946 Comm: syz-executor.1 Not tainted 5.0.0-rc8+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
__asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
unix_dgram_poll+0x5e1/0x690 net/unix/af_unix.c:2695
sock_poll+0x291/0x340 net/socket.c:1127
vfs_poll include/linux/poll.h:86 [inline]
aio_poll fs/aio.c:1766 [inline]
__io_submit_one fs/aio.c:1876 [inline]
io_submit_one+0xe3e/0x1cf0 fs/aio.c:1909
__do_sys_io_submit fs/aio.c:1954 [inline]
__se_sys_io_submit fs/aio.c:1924 [inline]
__x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1924
? 0xffffffff81000000
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd43ca93c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e29
RDX: 0000000020000600 RSI: 1ffffffffffffd70 RDI: 00007fd43ca73000
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd43ca946d4
R13: 00000000004bf02f R14: 00000000004d09b0 R15: 00000000ffffffff
Allocated by task 18946:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_kmalloc mm/kasan/common.c:495 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:468
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:503
slab_post_alloc_hook mm/slab.h:440 [inline]
slab_alloc mm/slab.c:3388 [inline]
kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3548
sk_prot_alloc+0x67/0x2e0 net/core/sock.c:1471
sk_alloc+0x39/0xf70 net/core/sock.c:1531
unix_create1+0xc3/0x530 net/unix/af_unix.c:764
unix_create+0x103/0x1e0 net/unix/af_unix.c:825
__sock_create+0x3e6/0x750 net/socket.c:1275
sock_create net/socket.c:1315 [inline]
__sys_socketpair+0x272/0x5e0 net/socket.c:1407
__do_sys_socketpair net/socket.c:1456 [inline]
__se_sys_socketpair net/socket.c:1453 [inline]
__x64_sys_socketpair+0x97/0xf0 net/socket.c:1453
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 18944:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:457
kasan_slab_free+0xe/0x10 mm/kasan/common.c:465
__cache_free mm/slab.c:3494 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3754
sk_prot_free net/core/sock.c:1512 [inline]
__sk_destruct+0x4b6/0x6d0 net/core/sock.c:1596
sk_destruct+0x7b/0x90 net/core/sock.c:1604
__sk_free+0xce/0x300 net/core/sock.c:1615
sk_free+0x42/0x50 net/core/sock.c:1626
sock_put include/net/sock.h:1707 [inline]
unix_release_sock+0x921/0xbb0 net/unix/af_unix.c:573
unix_release+0x44/0x90 net/unix/af_unix.c:835
__sock_release+0xd3/0x250 net/socket.c:579
sock_close+0x1b/0x30 net/socket.c:1139
__fput+0x2df/0x8d0 fs/file_table.c:278
____fput+0x16/0x20 fs/file_table.c:309
task_work_run+0x14a/0x1c0 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:166
prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
do_syscall_64+0x52d/0x610 arch/x86/entry/common.c:293
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff88809292a740
which belongs to the cache UNIX(49:syz1) of size 1728
The buggy address is located 928 bytes inside of
1728-byte region [ffff88809292a740, ffff88809292ae00)
The buggy address belongs to the page:
page:ffffea00024a4a80 count:1 mapcount:0 mapping:ffff8880920c0800 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00028223c8 ffffea0002581248 ffff8880920c0800
raw: 0000000000000000 ffff88809292a000 0000000100000002 ffff8880a9718ec0
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8880a9718ec0
Memory state around the buggy address:
ffff88809292a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88809292aa00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88809292aa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88809292ab00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88809292ab80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: KASAN: use-after-free Read in unix_dgram_poll 2019-03-03 10:22 KASAN: use-after-free Read in unix_dgram_poll syzbot @ 2019-03-03 13:55 ` Al Viro 2019-03-03 15:18 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-03 13:55 UTC (permalink / raw) To: syzbot Cc: davem, jbaron, kgraul, ktkhai, kyeongdon.kim, linux-kernel, netdev, pabeni, syzkaller-bugs, xiyou.wangcong, hch On Sun, Mar 03, 2019 at 02:22:04AM -0800, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit: 7d762d69145a afs: Fix manually set volume location server .. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=131d832ac00000 > kernel config: https://syzkaller.appspot.com/x/.config?x=b76ec970784287c > dashboard link: https://syzkaller.appspot.com/bug?extid=503d4cc169fcec1cb18c > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11934262c00000 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com > > ================================================================== > BUG: KASAN: use-after-free in unix_dgram_poll+0x5e1/0x690 > net/unix/af_unix.c:2695 > Read of size 4 at addr ffff88809292aae0 by task syz-executor.1/18946 > > CPU: 0 PID: 18946 Comm: syz-executor.1 Not tainted 5.0.0-rc8+ #88 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x172/0x1f0 lib/dump_stack.c:113 > print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 > kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 > __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134 > unix_dgram_poll+0x5e1/0x690 net/unix/af_unix.c:2695 > sock_poll+0x291/0x340 net/socket.c:1127 > vfs_poll include/linux/poll.h:86 [inline] > aio_poll fs/aio.c:1766 [inline] > __io_submit_one fs/aio.c:1876 [inline] > io_submit_one+0xe3e/0x1cf0 fs/aio.c:1909 > __do_sys_io_submit fs/aio.c:1954 [inline] > __se_sys_io_submit fs/aio.c:1924 [inline] > __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1924 > ? 0xffffffff81000000 > do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe Apparently, a call of ->poll() overlapping with call of ->release() (if not outright happening after it). Which should never happen... Maybe unrelated to this bug, but... What's to prevent a wakeup that happens just after we'd been added to a waitqueue by ->poll() triggering aio_poll_wake(), which gets to aio_poll_complete() with its fput() *before* we'd reached the end of ->poll() instance? I wonder if adding get_file(req->file); // make sure that early completion is safe right after req->file = fget(iocb->aio_fildes); if (unlikely(!req->file)) return -EBADF; paired with fput(req->file); right after out: in aio_poll() is needed... Am I missing something obvious here? Christoph? ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 13:55 ` Al Viro @ 2019-03-03 15:18 ` Al Viro 2019-03-03 18:37 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Al Viro @ 2019-03-03 15:18 UTC (permalink / raw) To: Linus Torvalds Cc: davem, jbaron, kgraul, ktkhai, kyeongdon.kim, linux-kernel, netdev, pabeni, syzkaller-bugs, xiyou.wangcong, hch On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote: > Maybe unrelated to this bug, but... What's to prevent a wakeup > that happens just after we'd been added to a waitqueue by ->poll() > triggering aio_poll_wake(), which gets to aio_poll_complete() > with its fput() *before* we'd reached the end of ->poll() instance? > > I wonder if adding > get_file(req->file); // make sure that early completion is safe > right after > req->file = fget(iocb->aio_fildes); > if (unlikely(!req->file)) > return -EBADF; > paired with > fput(req->file); > right after out: in aio_poll() is needed... Am I missing something > obvious here? Christoph? In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method. What we need is to hold on to the file reference the same way we do with aio_kiocb. No need to store the reference to what we'd got in a separate variable - req->file is never reassigned and we'd already made sure that req won't be freed under us, so dropping the extra reference with fput(req->file) is fine in all cases. Fixes: bfe4037e722ec Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/aio.c b/fs/aio.c index 3083180a54c8..7e88bfabdac2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) /* one for removal from waitqueue, one for this function */ refcount_set(&aiocb->ki_refcnt, 2); + get_file(req->file); mask = vfs_poll(req->file, &apt.pt) & req->events; if (unlikely(!req->head)) { @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) spin_unlock_irq(&ctx->ctx_lock); out: + fput(req->file); if (unlikely(apt.error)) { fput(req->file); return apt.error; ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 15:18 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Al Viro @ 2019-03-03 18:37 ` Eric Dumazet 2019-03-03 19:44 ` Linus Torvalds 2019-03-04 7:53 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Dmitry Vyukov 2 siblings, 0 replies; 53+ messages in thread From: Eric Dumazet @ 2019-03-03 18:37 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: davem, jbaron, kgraul, ktkhai, kyeongdon.kim, linux-kernel, netdev, pabeni, syzkaller-bugs, xiyou.wangcong, hch On 03/03/2019 07:18 AM, Al Viro wrote: > Fixes: bfe4037e722ec > Cc: stable@vger.kernel.org > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > diff --git a/fs/aio.c b/fs/aio.c > index 3083180a54c8..7e88bfabdac2 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > > /* one for removal from waitqueue, one for this function */ > refcount_set(&aiocb->ki_refcnt, 2); > + get_file(req->file); > > mask = vfs_poll(req->file, &apt.pt) & req->events; > if (unlikely(!req->head)) { > @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > spin_unlock_irq(&ctx->ctx_lock); > > out: > + fput(req->file); > if (unlikely(apt.error)) { > fput(req->file); > return apt.error; > Very nice changelog Al, thanks for fixing this. Reviewed-by: Eric Dumazet <edumazet@google.com> ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 15:18 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Al Viro 2019-03-03 18:37 ` Eric Dumazet @ 2019-03-03 19:44 ` Linus Torvalds 2019-03-03 20:13 ` Linus Torvalds 2019-03-03 20:30 ` Al Viro 2019-03-04 7:53 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Dmitry Vyukov 2 siblings, 2 replies; 53+ messages in thread From: Linus Torvalds @ 2019-03-03 19:44 UTC (permalink / raw) To: Al Viro, Eric Dumazet Cc: David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1912 bytes --] On Sun, Mar 3, 2019 at 7:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > Maybe unrelated to this bug, but... What's to prevent a wakeup > > that happens just after we'd been added to a waitqueue by ->poll() > > triggering aio_poll_wake(), which gets to aio_poll_complete() > > with its fput() *before* we'd reached the end of ->poll() instance? I'm assuming you're talking about the second vfs_poll() in aio_poll_complete_work()? The one we call before we check for "rew->cancelled" properly under the spinlock? > 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() > 2) aio_poll() resolves the descriptor to struct file by > req->file = fget(iocb->aio_fildes) [...] So honestly, the whole filp handling in aio looks overly complicated to me. All the different operations do that silly fget/fput() dance, although aio_read/write at least tried to make a common helper function for handling it. But as far as I can tell, they *all* could do: - look up file in aio_submit() when allocating and creating the aio_kiocb - drop the filp in 'iocb_put()' (which happens whether things complete successfully or not). and we'd have avoided a lot of complexity, and we'd have avoided this bug. Your patch fixes the poll() case, but it does so by just letting the existing complexity remain, and adding a second fget/fput pair in the poll logic. It would seem like it would be much better to rip all the complexity out entirely, and replace it with sane, simple and obviously correct code. Hmm? In other words, why wouldn't something like the attached work instead? TOTALLY UNTESTED! It builds, and it looks sane, but maybe I'm overlooking some obvious problem with it. But doesn't it look nice to see 2 files changed, 41 insertions(+), 50 deletions(-) with actual code reduction, and a fundamental simplification in handling of the file pointer? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 7944 bytes --] fs/aio.c | 83 ++++++++++++++++++++++-------------------------------- include/linux/fs.h | 8 +++++- 2 files changed, 41 insertions(+), 50 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index aaaaf4d12c73..9eccd2ea6ca9 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -167,14 +167,18 @@ struct kioctx { unsigned id; }; +/* + * First field must be 'ki_filp' in all the + * iocb unions! + */ struct fsync_iocb { + struct file *ki_filp; struct work_struct work; - struct file *file; bool datasync; }; struct poll_iocb { - struct file *file; + struct file *ki_filp; struct wait_queue_head *head; __poll_t events; bool woken; @@ -183,8 +187,15 @@ struct poll_iocb { struct work_struct work; }; +/* + * NOTE! Each of the iocb union members has "ki_filp" as + * the first entry in their struct definition. So you can + * access the file pointer either directly through this + * anonymous union, or through any of the sub-structs. + */ struct aio_kiocb { union { + struct file *ki_filp; struct kiocb rw; struct fsync_iocb fsync; struct poll_iocb poll; @@ -1060,6 +1071,7 @@ static inline void iocb_put(struct aio_kiocb *iocb) { if (refcount_read(&iocb->ki_refcnt) == 0 || refcount_dec_and_test(&iocb->ki_refcnt)) { + fput(iocb->ki_filp); percpu_ref_put(&iocb->ki_ctx->reqs); kmem_cache_free(kiocb_cachep, iocb); } @@ -1413,7 +1425,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2) aio_remove_iocb(iocb); if (kiocb->ki_flags & IOCB_WRITE) { - struct inode *inode = file_inode(kiocb->ki_filp); + struct inode *inode = file_inode(iocb->ki_filp); /* * Tell lockdep we inherited freeze protection from submission @@ -1421,10 +1433,9 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2) */ if (S_ISREG(inode->i_mode)) __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); - file_end_write(kiocb->ki_filp); + file_end_write(iocb->ki_filp); } - fput(kiocb->ki_filp); aio_complete(iocb, res, res2); } @@ -1432,9 +1443,6 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) { int ret; - req->ki_filp = fget(iocb->aio_fildes); - if (unlikely(!req->ki_filp)) - return -EBADF; req->ki_complete = aio_complete_rw; req->private = NULL; req->ki_pos = iocb->aio_offset; @@ -1451,7 +1459,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) ret = ioprio_check_cap(iocb->aio_reqprio); if (ret) { pr_debug("aio ioprio check cap error: %d\n", ret); - goto out_fput; + return ret; } req->ki_ioprio = iocb->aio_reqprio; @@ -1460,14 +1468,10 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); if (unlikely(ret)) - goto out_fput; + return ret; req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */ return 0; - -out_fput: - fput(req->ki_filp); - return ret; } static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec, @@ -1521,24 +1525,19 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, if (ret) return ret; file = req->ki_filp; - - ret = -EBADF; if (unlikely(!(file->f_mode & FMODE_READ))) - goto out_fput; + return -EBADF; ret = -EINVAL; if (unlikely(!file->f_op->read_iter)) - goto out_fput; + return -EINVAL; ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter); if (ret) - goto out_fput; + return ret; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) aio_rw_done(req, call_read_iter(file, req, &iter)); kfree(iovec); -out_fput: - if (unlikely(ret)) - fput(file); return ret; } @@ -1555,16 +1554,14 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, return ret; file = req->ki_filp; - ret = -EBADF; if (unlikely(!(file->f_mode & FMODE_WRITE))) - goto out_fput; - ret = -EINVAL; + return -EBADF; if (unlikely(!file->f_op->write_iter)) - goto out_fput; + return -EINVAL; ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter); if (ret) - goto out_fput; + return ret; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) { /* @@ -1582,9 +1579,6 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, aio_rw_done(req, call_write_iter(file, req, &iter)); } kfree(iovec); -out_fput: - if (unlikely(ret)) - fput(file); return ret; } @@ -1593,8 +1587,7 @@ static void aio_fsync_work(struct work_struct *work) struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); int ret; - ret = vfs_fsync(req->file, req->datasync); - fput(req->file); + ret = vfs_fsync(req->ki_filp, req->datasync); aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0); } @@ -1605,13 +1598,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, iocb->aio_rw_flags)) return -EINVAL; - req->file = fget(iocb->aio_fildes); - if (unlikely(!req->file)) - return -EBADF; - if (unlikely(!req->file->f_op->fsync)) { - fput(req->file); + if (unlikely(!req->ki_filp->f_op->fsync)) return -EINVAL; - } req->datasync = datasync; INIT_WORK(&req->work, aio_fsync_work); @@ -1621,10 +1609,7 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask) { - struct file *file = iocb->poll.file; - aio_complete(iocb, mangle_poll(mask), 0); - fput(file); } static void aio_poll_complete_work(struct work_struct *work) @@ -1636,7 +1621,7 @@ static void aio_poll_complete_work(struct work_struct *work) __poll_t mask = 0; if (!READ_ONCE(req->cancelled)) - mask = vfs_poll(req->file, &pt) & req->events; + mask = vfs_poll(iocb->ki_filp, &pt) & req->events; /* * Note that ->ki_cancel callers also delete iocb from active_reqs after @@ -1743,9 +1728,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) INIT_WORK(&req->work, aio_poll_complete_work); req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP; - req->file = fget(iocb->aio_fildes); - if (unlikely(!req->file)) - return -EBADF; req->head = NULL; req->woken = false; @@ -1763,7 +1745,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) /* one for removal from waitqueue, one for this function */ refcount_set(&aiocb->ki_refcnt, 2); - mask = vfs_poll(req->file, &apt.pt) & req->events; + mask = vfs_poll(aiocb->ki_filp, &apt.pt) & req->events; if (unlikely(!req->head)) { /* we did not manage to set up a waitqueue, done */ goto out; @@ -1788,10 +1770,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) spin_unlock_irq(&ctx->ctx_lock); out: - if (unlikely(apt.error)) { - fput(req->file); + if (unlikely(apt.error)) return apt.error; - } if (mask) aio_poll_complete(aiocb, mask); @@ -1829,6 +1809,11 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, if (unlikely(!req)) goto out_put_reqs_available; + req->ki_filp = fget(iocb->aio_fildes); + ret = -EBADF; + if (unlikely(!req->ki_filp)) + goto out_put_req; + if (iocb->aio_flags & IOCB_FLAG_RESFD) { /* * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an diff --git a/include/linux/fs.h b/include/linux/fs.h index 29d8e2cfed0e..fd423fec8d83 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -304,13 +304,19 @@ enum rw_hint { struct kiocb { struct file *ki_filp; + + /* The 'ki_filp' pointer is shared in a union for aio */ + randomized_struct_fields_start + loff_t ki_pos; void (*ki_complete)(struct kiocb *iocb, long ret, long ret2); void *private; int ki_flags; u16 ki_hint; u16 ki_ioprio; /* See linux/ioprio.h */ -} __randomize_layout; + + randomized_struct_fields_end +}; static inline bool is_sync_kiocb(struct kiocb *kiocb) { ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 19:44 ` Linus Torvalds @ 2019-03-03 20:13 ` Linus Torvalds 2019-03-03 20:30 ` Al Viro 1 sibling, 0 replies; 53+ messages in thread From: Linus Torvalds @ 2019-03-03 20:13 UTC (permalink / raw) To: Al Viro, Eric Dumazet Cc: David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig On Sun, Mar 3, 2019 at 11:44 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But doesn't it look nice to see > > 2 files changed, 41 insertions(+), 50 deletions(-) > > with actual code reduction, and a fundamental simplification in > handling of the file pointer? A coupl,e of the changes are "useless", and do the same thing as not having them at all: - struct inode *inode = file_inode(kiocb->ki_filp); + struct inode *inode = file_inode(iocb->ki_filp); - file_end_write(kiocb->ki_filp); + file_end_write(iocb->ki_filp); because the "ki_filp" ends up existing in both kiocb and iocb. At one point of editing that file I decided to try to just remove it from the sub-structs entirely and only keep it in the top-level structure, but it needs to be inside the 'struct kiocb' anyway for all the other users outside of fs/aio.c. Anyway, I don't think the patch is wrong (although I haven't actually _tested_ it) but I wanted to point out that those two one-liner changes are just "noise" that doesn't matter for the working of the patch. In the above, we have 'kiocb' being the embedded 'struct kiocb', and 'iocb' is the 'struct aio_kiocb' that contains it. 'ki_filp' is the exact same field in both cases. Linus Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 19:44 ` Linus Torvalds 2019-03-03 20:13 ` Linus Torvalds @ 2019-03-03 20:30 ` Al Viro 2019-03-03 22:23 ` Linus Torvalds 1 sibling, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-03 20:30 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote: > On Sun, Mar 3, 2019 at 7:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > Maybe unrelated to this bug, but... What's to prevent a wakeup > > > that happens just after we'd been added to a waitqueue by ->poll() > > > triggering aio_poll_wake(), which gets to aio_poll_complete() > > > with its fput() *before* we'd reached the end of ->poll() instance? > > I'm assuming you're talking about the second vfs_poll() in > aio_poll_complete_work()? The one we call before we check for > "rew->cancelled" properly under the spinlock? No. The first one, right in aio_poll(). > But as far as I can tell, they *all* could do: > > - look up file in aio_submit() when allocating and creating the aio_kiocb > > - drop the filp in 'iocb_put()' (which happens whether things > complete successfully or not). > > and we'd have avoided a lot of complexity, and we'd have avoided this bug. > > Your patch fixes the poll() case, but it does so by just letting the > existing complexity remain, and adding a second fget/fput pair in the > poll logic. > > It would seem like it would be much better to rip all the complexity > out entirely, and replace it with sane, simple and obviously correct > code. > > Hmm? > > In other words, why wouldn't something like the attached work instead? I'll need to dig out the mail archive from last year, but IIRC this had been considered and there'd been non-trivial problems. Give me an hour or so and I'll dig that out (there'd been many rounds of review, with long threads involved, some private, some on fsdevel). > @@ -1060,6 +1071,7 @@ static inline void iocb_put(struct aio_kiocb *iocb) > { > if (refcount_read(&iocb->ki_refcnt) == 0 || > refcount_dec_and_test(&iocb->ki_refcnt)) { > + fput(iocb->ki_filp); Trivial oops here - it might be NULL. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 20:30 ` Al Viro @ 2019-03-03 22:23 ` Linus Torvalds 2019-03-04 2:36 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2019-03-03 22:23 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1664 bytes --] On Sun, Mar 3, 2019 at 12:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote: > > > > I'm assuming you're talking about the second vfs_poll() in > > aio_poll_complete_work()? The one we call before we check for > > "rew->cancelled" properly under the spinlock? > > No. The first one, right in aio_poll(). Ok, they are both confusing. The lifetime of that filp is just not clear, with the whole "it could have been completed asynchronously" issue. > I'll need to dig out the mail archive from last year, but IIRC this > had been considered and there'd been non-trivial problems. Give me > an hour or so and I'll dig that out (there'd been many rounds of > review, with long threads involved, some private, some on fsdevel). Looking more at the patch, it still looks eminently sane to me.I fixed the silly "ki_filp" thing you noticed (I thought we made fput() take NULL, but you're right we don't), and simplified the patch to not do the changes that aren't necessary. I fixed the silly "filp can be NULL" issue you pointed out (I lazily thought we allowed fput(NULL), but you're right, we definitely don't), and simplified the patch to not do the unnecessary changes where we can just access the file pointer multiple different ways. And the resulting kernel boots fine, but I doubt I have anything that actually uses io_submit(), so that doesn't mean much. Slightly updated patch attached anyway, even smaller than before: 2 files changed, 36 insertions(+), 44 deletions(-) with several of the new lines being comments about that file pointer placement in the union. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 6561 bytes --] fs/aio.c | 72 ++++++++++++++++++++++-------------------------------- include/linux/fs.h | 8 +++++- 2 files changed, 36 insertions(+), 44 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index aaaaf4d12c73..82c08422b0f4 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -167,9 +167,13 @@ struct kioctx { unsigned id; }; +/* + * First field must be the file pointer in all the + * iocb unions! See also 'struct kiocb' in <linux/fs.h> + */ struct fsync_iocb { - struct work_struct work; struct file *file; + struct work_struct work; bool datasync; }; @@ -183,8 +187,15 @@ struct poll_iocb { struct work_struct work; }; +/* + * NOTE! Each of the iocb union members has the file pointer + * as the first entry in their struct definition. So you can + * access the file pointer through any of the sub-structs, + * or directly as just 'ki_filp' in this struct. + */ struct aio_kiocb { union { + struct file *ki_filp; struct kiocb rw; struct fsync_iocb fsync; struct poll_iocb poll; @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) { if (refcount_read(&iocb->ki_refcnt) == 0 || refcount_dec_and_test(&iocb->ki_refcnt)) { + if (iocb->ki_filp) + fput(iocb->ki_filp); percpu_ref_put(&iocb->ki_ctx->reqs); kmem_cache_free(kiocb_cachep, iocb); } @@ -1424,7 +1437,6 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2) file_end_write(kiocb->ki_filp); } - fput(kiocb->ki_filp); aio_complete(iocb, res, res2); } @@ -1432,9 +1444,6 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) { int ret; - req->ki_filp = fget(iocb->aio_fildes); - if (unlikely(!req->ki_filp)) - return -EBADF; req->ki_complete = aio_complete_rw; req->private = NULL; req->ki_pos = iocb->aio_offset; @@ -1451,7 +1460,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) ret = ioprio_check_cap(iocb->aio_reqprio); if (ret) { pr_debug("aio ioprio check cap error: %d\n", ret); - goto out_fput; + return ret; } req->ki_ioprio = iocb->aio_reqprio; @@ -1460,14 +1469,10 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); if (unlikely(ret)) - goto out_fput; + return ret; req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */ return 0; - -out_fput: - fput(req->ki_filp); - return ret; } static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec, @@ -1521,24 +1526,19 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, if (ret) return ret; file = req->ki_filp; - - ret = -EBADF; if (unlikely(!(file->f_mode & FMODE_READ))) - goto out_fput; + return -EBADF; ret = -EINVAL; if (unlikely(!file->f_op->read_iter)) - goto out_fput; + return -EINVAL; ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter); if (ret) - goto out_fput; + return ret; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) aio_rw_done(req, call_read_iter(file, req, &iter)); kfree(iovec); -out_fput: - if (unlikely(ret)) - fput(file); return ret; } @@ -1555,16 +1555,14 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, return ret; file = req->ki_filp; - ret = -EBADF; if (unlikely(!(file->f_mode & FMODE_WRITE))) - goto out_fput; - ret = -EINVAL; + return -EBADF; if (unlikely(!file->f_op->write_iter)) - goto out_fput; + return -EINVAL; ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter); if (ret) - goto out_fput; + return ret; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) { /* @@ -1582,9 +1580,6 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, aio_rw_done(req, call_write_iter(file, req, &iter)); } kfree(iovec); -out_fput: - if (unlikely(ret)) - fput(file); return ret; } @@ -1594,7 +1589,6 @@ static void aio_fsync_work(struct work_struct *work) int ret; ret = vfs_fsync(req->file, req->datasync); - fput(req->file); aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0); } @@ -1605,13 +1599,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, iocb->aio_rw_flags)) return -EINVAL; - req->file = fget(iocb->aio_fildes); - if (unlikely(!req->file)) - return -EBADF; - if (unlikely(!req->file->f_op->fsync)) { - fput(req->file); + if (unlikely(!req->file->f_op->fsync)) return -EINVAL; - } req->datasync = datasync; INIT_WORK(&req->work, aio_fsync_work); @@ -1621,10 +1610,7 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask) { - struct file *file = iocb->poll.file; - aio_complete(iocb, mangle_poll(mask), 0); - fput(file); } static void aio_poll_complete_work(struct work_struct *work) @@ -1743,9 +1729,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) INIT_WORK(&req->work, aio_poll_complete_work); req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP; - req->file = fget(iocb->aio_fildes); - if (unlikely(!req->file)) - return -EBADF; req->head = NULL; req->woken = false; @@ -1788,10 +1771,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) spin_unlock_irq(&ctx->ctx_lock); out: - if (unlikely(apt.error)) { - fput(req->file); + if (unlikely(apt.error)) return apt.error; - } if (mask) aio_poll_complete(aiocb, mask); @@ -1829,6 +1810,11 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, if (unlikely(!req)) goto out_put_reqs_available; + req->ki_filp = fget(iocb->aio_fildes); + ret = -EBADF; + if (unlikely(!req->ki_filp)) + goto out_put_req; + if (iocb->aio_flags & IOCB_FLAG_RESFD) { /* * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an diff --git a/include/linux/fs.h b/include/linux/fs.h index 29d8e2cfed0e..fd423fec8d83 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -304,13 +304,19 @@ enum rw_hint { struct kiocb { struct file *ki_filp; + + /* The 'ki_filp' pointer is shared in a union for aio */ + randomized_struct_fields_start + loff_t ki_pos; void (*ki_complete)(struct kiocb *iocb, long ret, long ret2); void *private; int ki_flags; u16 ki_hint; u16 ki_ioprio; /* See linux/ioprio.h */ -} __randomize_layout; + + randomized_struct_fields_end +}; static inline bool is_sync_kiocb(struct kiocb *kiocb) { ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 22:23 ` Linus Torvalds @ 2019-03-04 2:36 ` Al Viro 2019-03-04 21:22 ` Linus Torvalds 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-04 2:36 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig On Sun, Mar 03, 2019 at 02:23:33PM -0800, Linus Torvalds wrote: OK, having dug through the archives, the reasons were not strong. So that part is OK... > @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) > { > if (refcount_read(&iocb->ki_refcnt) == 0 || > refcount_dec_and_test(&iocb->ki_refcnt)) { > + if (iocb->ki_filp) > + fput(iocb->ki_filp); > percpu_ref_put(&iocb->ki_ctx->reqs); > kmem_cache_free(kiocb_cachep, iocb); > } That reminds me - refcount_t here is rather ridiculous; what we have is * everything other than aio_poll: ki_refcnt is 0 all along * aio_poll: originally 0, then set to 2, then two iocb_put() are done (either both synchronous to aio_poll(), or one sync and one async). That's not a refcount at all. It's a flag, set only for aio_poll ones. And that test in iocb_put() is "if flag is set, clear it and bugger off". What's worse, AFAICS the callers in aio_poll() are buggered (see below) > static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask) > { > - struct file *file = iocb->poll.file; > - > aio_complete(iocb, mangle_poll(mask), 0); > - fput(file); > } No reasons to keep that function at all now... > - if (unlikely(apt.error)) { > - fput(req->file); > + if (unlikely(apt.error)) > return apt.error; > - } > > if (mask) > aio_poll_complete(aiocb, mask); Looking at that thing... How does it manage to avoid leaks when we try to use it on e.g. /dev/tty, which has poll_wait(file, &tty->read_wait, wait); poll_wait(file, &tty->write_wait, wait); in n_tty_poll()? AFAICS, we'll succeed adding it to the first queue, then have aio_poll_queue_proc() fail and set apt.error to -EINVAL. Suppose we are looking for EPOLLIN and there's nothing ready to read. We'll go mask = vfs_poll(req->file, &apt.pt) & req->events; mask is 0. if (unlikely(!req->head)) { nope - it's &tty->read_wait, not NULL /* we did not manage to set up a waitqueue, done */ goto out; } spin_lock_irq(&ctx->ctx_lock); spin_lock(&req->head->lock); if (req->woken) { nope - no wakeups so far /* wake_up context handles the rest */ mask = 0; apt.error = 0; } else if (mask || apt.error) { apt.error is non-zero /* if we get an error or a mask we are done */ WARN_ON_ONCE(list_empty(&req->wait.entry)); list_del_init(&req->wait.entry); OK, removed from queue } else { /* actually waiting for an event */ list_add_tail(&aiocb->ki_list, &ctx->active_reqs); aiocb->ki_cancel = aio_poll_cancel; } spin_unlock(&req->head->lock); spin_unlock_irq(&ctx->ctx_lock); out: if (unlikely(apt.error)) { fput(req->file); return apt.error; ... and we return -EINVAL to __io_submit_one(), where we hit /* * If ret is 0, we'd either done aio_complete() ourselves or have * arranged for that to be done asynchronously. Anything non-zero * means that we need to destroy req ourselves. */ if (ret) goto out_put_req; return 0; out_put_req: if (req->ki_eventfd) eventfd_ctx_put(req->ki_eventfd); iocb_put(req); out_put_reqs_available: put_reqs_available(ctx, 1); return ret; and all knowledge of req is lost. But we'd done only *one* iocb_put() in that case, and ->ki_refcnt had been set to 2 by aio_poll(). How could it avoid a leak? The same goes for "->poll() returns something without bothering to call poll_wait()" case, actually... IOW, I would rather have aio_poll() (along with your fput-a-bit-later change) do this - out: if (mask && !apt.error) aio_complete(aiocb, mangle_poll(mask), 0); iocb_put(aiocb); return apt.error; Comments? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-04 2:36 ` Al Viro @ 2019-03-04 21:22 ` Linus Torvalds 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2019-03-04 21:22 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig On Sun, Mar 3, 2019 at 6:36 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > OK, having dug through the archives, the reasons were not strong. > So that part is OK... I've committed the patch. However, I didn't actually do the separate and independent cleanups: > > @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) > > { > > if (refcount_read(&iocb->ki_refcnt) == 0 || > > refcount_dec_and_test(&iocb->ki_refcnt)) { > > + if (iocb->ki_filp) > > + fput(iocb->ki_filp); > > percpu_ref_put(&iocb->ki_ctx->reqs); > > kmem_cache_free(kiocb_cachep, iocb); > > } > > That reminds me - refcount_t here is rather ridiculous; what we > have is > * everything other than aio_poll: ki_refcnt is 0 all along > * aio_poll: originally 0, then set to 2, then two iocb_put() > are done (either both synchronous to aio_poll(), or one sync and one > async). > > That's not a refcount at all. It's a flag, set only for aio_poll ones. > And that test in iocb_put() is "if flag is set, clear it and bugger off". > > What's worse, AFAICS the callers in aio_poll() are buggered (see below) Ok. Suggestions? > > - if (unlikely(apt.error)) { > > - fput(req->file); > > + if (unlikely(apt.error)) > > return apt.error; > > - } > > > > if (mask) > > aio_poll_complete(aiocb, mask); > > Looking at that thing... How does it manage to avoid leaks > when we try to use it on e.g. /dev/tty, which has > poll_wait(file, &tty->read_wait, wait); > poll_wait(file, &tty->write_wait, wait); > in n_tty_poll()? I don't think that's he only case that uses multiple poll entries. It's now the poll() machinery is designed to be used, after all. Although maybe everybody ends up using just a single wait-queue.. > AFAICS, we'll succeed adding it to the first queue, then have > aio_poll_queue_proc() fail and set apt.error to -EINVAL. Yeah, that's bogus, but documented. I guess nobody really expects to use aio_poll on anything but a socket or something? But your refcount leak looks valid. Hmm. So yes, that whole ki_refcnt looks a bit odd and pointless. And apparently buggy too. > Comments? Can we just (a) remove ki_refcnt entirely (b) remove the "iocb_put(aiocb);" from aio_poll()? Every actual successful io submission should end in aio_complete(), or we free the aio iocb in the "out_put_req" error case. There's no point in doing the refcount as you pointed out, and it seems to be actively buggy. Anyway, all of this (and your suggestion to just remove "aio_poll_complete()" and just use "aio_complete()") is independent of the file refcounting part, I feel. Which is why I just committed that patch as-is (84c4e1f89fef "aio: simplify - and fix - fget/fput for io_submit()"). Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 1/8] aio: make sure file is pinned 2019-03-04 21:22 ` Linus Torvalds @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:03 ` [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup Al Viro ` (7 more replies) 0 siblings, 8 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> "aio: remove the extra get_file/fput pair in io_submit_one" was too optimistic - not dereferencing file pointer after e.g. ->write_iter() returns is not enough; that reference might've been the only thing that kept alive objects that are referenced *before* the method returns. Such as inode, for example... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/aio.c b/fs/aio.c index 3d9669d011b9..ea30b78187ed 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1790,6 +1790,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, struct iocb __user *user_iocb, bool compat) { struct aio_kiocb *req; + struct file *file; ssize_t ret; /* enforce forwards compatibility on users */ @@ -1844,6 +1845,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, req->ki_user_iocb = user_iocb; req->ki_user_data = iocb->aio_data; + file = get_file(req->ki_filp); /* req can die too early */ switch (iocb->aio_lio_opcode) { case IOCB_CMD_PREAD: @@ -1872,6 +1874,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, ret = -EINVAL; break; } + fput(file); /* * If ret is 0, we'd either done aio_complete() ourselves or have -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 2:18 ` Al Viro 2019-03-07 0:03 ` [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error Al Viro ` (6 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> In case of early wakeups, aio_poll() assumes that aio_poll_complete() has either already happened or is imminent. In that case we do not want to put iocb on the list of cancellables. However, ignored wakeups need to be treated as if wakeup has not happened at all. Trivially fixed by having aio_poll_wake() set ->woken only after it's committed to taking iocb out of the waitqueue. Spotted-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index ea30b78187ed..3a8b894378e0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1668,13 +1668,13 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, __poll_t mask = key_to_poll(key); unsigned long flags; + /* for instances that support it check for an event match first: */ + if (mask && !(mask & req->events)) + return 0; + req->woken = true; - /* for instances that support it check for an event match first: */ if (mask) { - if (!(mask & req->events)) - return 0; - /* * Try to complete the iocb inline if we can. Use * irqsave/irqrestore because not all filesystems (e.g. fuse) -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup 2019-03-07 0:03 ` [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup Al Viro @ 2019-03-07 2:18 ` Al Viro 2019-03-08 11:16 ` zhengbin (A) 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 2:18 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Thu, Mar 07, 2019 at 12:03:10AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > In case of early wakeups, aio_poll() assumes that aio_poll_complete() > has either already happened or is imminent. In that case we do not > want to put iocb on the list of cancellables. However, ignored > wakeups need to be treated as if wakeup has not happened at all. > Trivially fixed by having aio_poll_wake() set ->woken only after > it's committed to taking iocb out of the waitqueue. > > Spotted-by: zhengbin <zhengbin13@huawei.com> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> ... and unfortunately it's worse than just that - what both of us have missed is that one could have non-specific wakep + schedule_work + aio_poll_complete_work() rechecking ->poll(), seeing nothing of interest and reinserting into queue. All before vfs_poll() manages to return into aio_poll(). The window is harder to hit, but it's still there, with exact same "failed to add to cancel list" kind of bug if we do hit it ;-/ ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup 2019-03-07 2:18 ` Al Viro @ 2019-03-08 11:16 ` zhengbin (A) 0 siblings, 0 replies; 53+ messages in thread From: zhengbin (A) @ 2019-03-08 11:16 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang So, what should we do? On 2019/3/7 10:18, Al Viro wrote: > On Thu, Mar 07, 2019 at 12:03:10AM +0000, Al Viro wrote: >> From: Al Viro <viro@zeniv.linux.org.uk> >> >> In case of early wakeups, aio_poll() assumes that aio_poll_complete() >> has either already happened or is imminent. In that case we do not >> want to put iocb on the list of cancellables. However, ignored >> wakeups need to be treated as if wakeup has not happened at all. >> Trivially fixed by having aio_poll_wake() set ->woken only after >> it's committed to taking iocb out of the waitqueue. >> >> Spotted-by: zhengbin <zhengbin13@huawei.com> >> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > ... and unfortunately it's worse than just that - what both of us > have missed is that one could have non-specific wakep + schedule_work + > aio_poll_complete_work() rechecking ->poll(), seeing nothing of > interest and reinserting into queue. All before vfs_poll() manages > to return into aio_poll(). The window is harder to hit, but it's > still there, with exact same "failed to add to cancel list" kind of bug > if we do hit it ;-/ > > . > ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro 2019-03-07 0:03 ` [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 2:11 ` zhengbin (A) 2019-03-07 0:03 ` [PATCH 4/8] aio_poll(): get rid of weird refcounting Al Viro ` (5 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> We want iocb_put() happening on errors, to balance the extra reference we'd taken. As it is, we end up with a leak. The rules should be * error: iocb_put() to deal with the extra ref, return error, let the caller do another iocb_put(). * async: iocb_put() to deal with the extra ref, return 0. * no error, event present immediately: aio_poll_complete() to report it, iocb_put() to deal with the extra ref, return 0. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 3a8b894378e0..22b288997441 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1724,6 +1724,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) struct kioctx *ctx = aiocb->ki_ctx; struct poll_iocb *req = &aiocb->poll; struct aio_poll_table apt; + bool async = false; __poll_t mask; /* reject any unknown events outside the normal event mask. */ @@ -1760,30 +1761,26 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) spin_lock_irq(&ctx->ctx_lock); spin_lock(&req->head->lock); - if (req->woken) { - /* wake_up context handles the rest */ - mask = 0; + if (req->woken) { /* already taken up by aio_poll_wake() */ + async = true; apt.error = 0; - } else if (mask || apt.error) { - /* if we get an error or a mask we are done */ - WARN_ON_ONCE(list_empty(&req->wait.entry)); - list_del_init(&req->wait.entry); - } else { - /* actually waiting for an event */ + } else if (!mask && !apt.error) { /* actually waiting for an event */ list_add_tail(&aiocb->ki_list, &ctx->active_reqs); aiocb->ki_cancel = aio_poll_cancel; + async = true; + } else { /* if we get an error or a mask we are done */ + WARN_ON_ONCE(list_empty(&req->wait.entry)); + list_del_init(&req->wait.entry); + /* no wakeup in the future either; aiocb is ours to dispose of */ } spin_unlock(&req->head->lock); spin_unlock_irq(&ctx->ctx_lock); out: - if (unlikely(apt.error)) - return apt.error; - - if (mask) + if (async && !apt.error) aio_poll_complete(aiocb, mask); iocb_put(aiocb); - return 0; + return apt.error; } static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error 2019-03-07 0:03 ` [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error Al Viro @ 2019-03-07 2:11 ` zhengbin (A) 0 siblings, 0 replies; 53+ messages in thread From: zhengbin (A) @ 2019-03-07 2:11 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang + if (async && !apt.error) --->may be this should be if (!async && !apt.error) ? On 2019/3/7 8:03, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > We want iocb_put() happening on errors, to balance the extra reference > we'd taken. As it is, we end up with a leak. The rules should be > * error: iocb_put() to deal with the extra ref, return error, > let the caller do another iocb_put(). > * async: iocb_put() to deal with the extra ref, return 0. > * no error, event present immediately: aio_poll_complete() to > report it, iocb_put() to deal with the extra ref, return 0. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/aio.c | 25 +++++++++++-------------- > 1 file changed, 11 insertions(+), 14 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 3a8b894378e0..22b288997441 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1724,6 +1724,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > struct kioctx *ctx = aiocb->ki_ctx; > struct poll_iocb *req = &aiocb->poll; > struct aio_poll_table apt; > + bool async = false; > __poll_t mask; > > /* reject any unknown events outside the normal event mask. */ > @@ -1760,30 +1761,26 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > > spin_lock_irq(&ctx->ctx_lock); > spin_lock(&req->head->lock); > - if (req->woken) { > - /* wake_up context handles the rest */ > - mask = 0; > + if (req->woken) { /* already taken up by aio_poll_wake() */ > + async = true; > apt.error = 0; > - } else if (mask || apt.error) { > - /* if we get an error or a mask we are done */ > - WARN_ON_ONCE(list_empty(&req->wait.entry)); > - list_del_init(&req->wait.entry); > - } else { > - /* actually waiting for an event */ > + } else if (!mask && !apt.error) { /* actually waiting for an event */ > list_add_tail(&aiocb->ki_list, &ctx->active_reqs); > aiocb->ki_cancel = aio_poll_cancel; > + async = true; > + } else { /* if we get an error or a mask we are done */ > + WARN_ON_ONCE(list_empty(&req->wait.entry)); > + list_del_init(&req->wait.entry); > + /* no wakeup in the future either; aiocb is ours to dispose of */ > } > spin_unlock(&req->head->lock); > spin_unlock_irq(&ctx->ctx_lock); > > out: > - if (unlikely(apt.error)) > - return apt.error; > - > - if (mask) > + if (async && !apt.error) > aio_poll_complete(aiocb, mask); > iocb_put(aiocb); > - return 0; > + return apt.error; > } > > static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, > ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 4/8] aio_poll(): get rid of weird refcounting 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro 2019-03-07 0:03 ` [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup Al Viro 2019-03-07 0:03 ` [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:03 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro ` (4 subsequent siblings) 7 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> The only reason for taking the extra ref to iocb is that we want to access it after vfs_poll() and an early wakeup could have it already completed by the time vfs_poll() returns. That's very easy to avoid, though - we need to know which lock to grab and, having grabbed it, we need to check if an early wakeup has already happened. So let's just copy the reference to waitqueue into aio_poll_table and instead of having the "woken" flag in iocb, move it to aio_poll() stack frame and put its address into iocb (and make sure it's cleared, so aio_poll_wake() won't step onto it after aio_poll() exits). That's enough to get rid of the refcount. In async case freeing is done by aio_poll_complete() (and aio_poll() will only touch the iocb past the vfs_poll() if it's guaranteed that aio_poll_complete() has not happened yet), in all other cases we make sure that wakeups hadn't and won't happen and deal with disposal of iocb ourselves. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 55 +++++++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 22b288997441..ee062253e303 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -180,8 +180,8 @@ struct fsync_iocb { struct poll_iocb { struct file *file; struct wait_queue_head *head; + bool *taken; __poll_t events; - bool woken; bool cancelled; struct wait_queue_entry wait; struct work_struct work; @@ -209,8 +209,6 @@ struct aio_kiocb { struct list_head ki_list; /* the aio core uses this * for cancellation */ - refcount_t ki_refcnt; - /* * If the aio_resfd field of the userspace iocb is not zero, * this is the underlying eventfd context to deliver events to. @@ -1034,7 +1032,6 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) percpu_ref_get(&ctx->reqs); req->ki_ctx = ctx; INIT_LIST_HEAD(&req->ki_list); - refcount_set(&req->ki_refcnt, 0); req->ki_eventfd = NULL; return req; } @@ -1069,13 +1066,10 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) static inline void iocb_put(struct aio_kiocb *iocb) { - if (refcount_read(&iocb->ki_refcnt) == 0 || - refcount_dec_and_test(&iocb->ki_refcnt)) { - if (iocb->ki_filp) - fput(iocb->ki_filp); - percpu_ref_put(&iocb->ki_ctx->reqs); - kmem_cache_free(kiocb_cachep, iocb); - } + if (iocb->ki_filp) + fput(iocb->ki_filp); + percpu_ref_put(&iocb->ki_ctx->reqs); + kmem_cache_free(kiocb_cachep, iocb); } static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, @@ -1672,8 +1666,10 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, if (mask && !(mask & req->events)) return 0; - req->woken = true; - + if (unlikely(req->taken)) { + *req->taken = true; + req->taken = NULL; + } if (mask) { /* * Try to complete the iocb inline if we can. Use @@ -1698,6 +1694,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, struct aio_poll_table { struct poll_table_struct pt; + struct wait_queue_head *head; struct aio_kiocb *iocb; int error; }; @@ -1715,7 +1712,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head, } pt->error = 0; - pt->iocb->poll.head = head; + pt->head = pt->iocb->poll.head = head; add_wait_queue(head, &pt->iocb->poll.wait); } @@ -1738,7 +1735,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP; req->head = NULL; - req->woken = false; + req->taken = &async; req->cancelled = false; apt.pt._qproc = aio_poll_queue_proc; @@ -1750,36 +1747,38 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) INIT_LIST_HEAD(&req->wait.entry); init_waitqueue_func_entry(&req->wait, aio_poll_wake); - /* one for removal from waitqueue, one for this function */ - refcount_set(&aiocb->ki_refcnt, 2); - - mask = vfs_poll(req->file, &apt.pt) & req->events; - if (unlikely(!req->head)) { + mask = req->events; + mask &= vfs_poll(req->file, &apt.pt); + /* + * Careful: we might've been put into waitqueue *and* already + * woken up before vfs_poll() returns. The caller is holding + * a reference to file, so it couldn't have been killed under + * us, no matter what. However, in case of early wakeup, @req + * might be already gone by now. + */ + if (unlikely(!apt.head)) { /* we did not manage to set up a waitqueue, done */ goto out; } - spin_lock_irq(&ctx->ctx_lock); - spin_lock(&req->head->lock); - if (req->woken) { /* already taken up by aio_poll_wake() */ - async = true; + spin_lock(&apt.head->lock); + if (async) { /* already taken up by aio_poll_wake() */ apt.error = 0; } else if (!mask && !apt.error) { /* actually waiting for an event */ list_add_tail(&aiocb->ki_list, &ctx->active_reqs); aiocb->ki_cancel = aio_poll_cancel; + req->taken = NULL; async = true; } else { /* if we get an error or a mask we are done */ WARN_ON_ONCE(list_empty(&req->wait.entry)); list_del_init(&req->wait.entry); /* no wakeup in the future either; aiocb is ours to dispose of */ } - spin_unlock(&req->head->lock); + spin_unlock(&apt.head->lock); spin_unlock_irq(&ctx->ctx_lock); - out: - if (async && !apt.error) + if (!async && !apt.error) aio_poll_complete(aiocb, mask); - iocb_put(aiocb); return apt.error; } -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 5/8] make aio_read()/aio_write() return int 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro ` (2 preceding siblings ...) 2019-03-07 0:03 ` [PATCH 4/8] aio_poll(): get rid of weird refcounting Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:03 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_put() Al Viro ` (3 subsequent siblings) 7 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> that ssize_t is a rudiment of earlier calling conventions; it's been used only to pass 0 and -E... since last autumn. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index ee062253e303..5dd5f35d054c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1508,13 +1508,13 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret) } } -static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, +static int aio_read(struct kiocb *req, const struct iocb *iocb, bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; struct file *file; - ssize_t ret; + int ret; ret = aio_prep_rw(req, iocb); if (ret) @@ -1536,13 +1536,13 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, return ret; } -static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, +static int aio_write(struct kiocb *req, const struct iocb *iocb, bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; struct file *file; - ssize_t ret; + int ret; ret = aio_prep_rw(req, iocb); if (ret) @@ -1716,7 +1716,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head, add_wait_queue(head, &pt->iocb->poll.wait); } -static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) +static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) { struct kioctx *ctx = aiocb->ki_ctx; struct poll_iocb *req = &aiocb->poll; @@ -1787,7 +1787,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, { struct aio_kiocb *req; struct file *file; - ssize_t ret; + int ret; /* enforce forwards compatibility on users */ if (unlikely(iocb->aio_reserved2)) { -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 6/8] move dropping ->ki_eventfd into iocb_put() 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro ` (3 preceding siblings ...) 2019-03-07 0:03 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:03 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro ` (2 subsequent siblings) 7 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 5dd5f35d054c..5837a29e63fe 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1066,6 +1066,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) static inline void iocb_put(struct aio_kiocb *iocb) { + if (iocb->ki_eventfd) + eventfd_ctx_put(iocb->ki_eventfd); if (iocb->ki_filp) fput(iocb->ki_filp); percpu_ref_put(&iocb->ki_ctx->reqs); @@ -1142,10 +1144,8 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) * eventfd. The eventfd_signal() function is safe to be called * from IRQ context. */ - if (iocb->ki_eventfd) { + if (iocb->ki_eventfd) eventfd_signal(iocb->ki_eventfd, 1); - eventfd_ctx_put(iocb->ki_eventfd); - } /* * We have to order our ring_info tail store above and test @@ -1819,18 +1819,19 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, goto out_put_req; if (iocb->aio_flags & IOCB_FLAG_RESFD) { + struct eventfd_ctx *eventfd; /* * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an * instance of the file* now. The file descriptor must be * an eventfd() fd, and will be signaled for each completed * event using the eventfd_signal() function. */ - req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); - if (IS_ERR(req->ki_eventfd)) { - ret = PTR_ERR(req->ki_eventfd); - req->ki_eventfd = NULL; + eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); + if (IS_ERR(eventfd)) { + ret = PTR_ERR(eventfd); goto out_put_req; } + req->ki_eventfd = eventfd; } ret = put_user(KIOCB_KEY, &user_iocb->aio_key); @@ -1881,8 +1882,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, goto out_put_req; return 0; out_put_req: - if (req->ki_eventfd) - eventfd_ctx_put(req->ki_eventfd); iocb_put(req); out_put_reqs_available: put_reqs_available(ctx, 1); -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro ` (4 preceding siblings ...) 2019-03-07 0:03 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_put() Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:03 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro 2019-03-07 0:23 ` [PATCH 1/8] aio: make sure file is pinned Linus Torvalds 7 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> simplifies the caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 5837a29e63fe..af51b1360305 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1029,6 +1029,11 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) if (unlikely(!req)) return NULL; + if (unlikely(!get_reqs_available(ctx))) { + kfree(req); + return NULL; + } + percpu_ref_get(&ctx->reqs); req->ki_ctx = ctx; INIT_LIST_HEAD(&req->ki_list); @@ -1805,13 +1810,9 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return -EINVAL; } - if (!get_reqs_available(ctx)) - return -EAGAIN; - - ret = -EAGAIN; req = aio_get_req(ctx); if (unlikely(!req)) - goto out_put_reqs_available; + return -EAGAIN; req->ki_filp = fget(iocb->aio_fildes); ret = -EBADF; @@ -1883,7 +1884,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return 0; out_put_req: iocb_put(req); -out_put_reqs_available: put_reqs_available(ctx, 1); return ret; } -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro ` (5 preceding siblings ...) 2019-03-07 0:03 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro @ 2019-03-07 0:03 ` Al Viro 2019-03-07 0:23 ` [PATCH 1/8] aio: make sure file is pinned Linus Torvalds 7 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-07 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> makes for somewhat cleaner control flow in __io_submit_one() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 86 +++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 42 insertions(+), 44 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index af51b1360305..6993581b77b2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1788,36 +1788,15 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) } static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, - struct iocb __user *user_iocb, bool compat) + struct iocb __user *user_iocb, struct aio_kiocb *req, + bool compat) { - struct aio_kiocb *req; struct file *file; int ret; - /* enforce forwards compatibility on users */ - if (unlikely(iocb->aio_reserved2)) { - pr_debug("EINVAL: reserve field set\n"); - return -EINVAL; - } - - /* prevent overflows */ - if (unlikely( - (iocb->aio_buf != (unsigned long)iocb->aio_buf) || - (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) || - ((ssize_t)iocb->aio_nbytes < 0) - )) { - pr_debug("EINVAL: overflow check\n"); - return -EINVAL; - } - - req = aio_get_req(ctx); - if (unlikely(!req)) - return -EAGAIN; - req->ki_filp = fget(iocb->aio_fildes); - ret = -EBADF; if (unlikely(!req->ki_filp)) - goto out_put_req; + return -EBADF; if (iocb->aio_flags & IOCB_FLAG_RESFD) { struct eventfd_ctx *eventfd; @@ -1828,17 +1807,15 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, * event using the eventfd_signal() function. */ eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); - if (IS_ERR(eventfd)) { - ret = PTR_ERR(eventfd); - goto out_put_req; - } + if (IS_ERR(eventfd)) + return PTR_ERR(req->ki_eventfd); + req->ki_eventfd = eventfd; } - ret = put_user(KIOCB_KEY, &user_iocb->aio_key); - if (unlikely(ret)) { + if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) { pr_debug("EFAULT: aio_key\n"); - goto out_put_req; + return -EFAULT; } req->ki_user_iocb = user_iocb; @@ -1873,30 +1850,51 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, break; } fput(file); - - /* - * If ret is 0, we'd either done aio_complete() ourselves or have - * arranged for that to be done asynchronously. Anything non-zero - * means that we need to destroy req ourselves. - */ - if (ret) - goto out_put_req; - return 0; -out_put_req: - iocb_put(req); - put_reqs_available(ctx, 1); return ret; } static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, bool compat) { + struct aio_kiocb *req; struct iocb iocb; + int err; if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb)))) return -EFAULT; - return __io_submit_one(ctx, &iocb, user_iocb, compat); + /* enforce forwards compatibility on users */ + if (unlikely(iocb.aio_reserved2)) { + pr_debug("EINVAL: reserve field set\n"); + return -EINVAL; + } + + /* prevent overflows */ + if (unlikely( + (iocb.aio_buf != (unsigned long)iocb.aio_buf) || + (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) || + ((ssize_t)iocb.aio_nbytes < 0) + )) { + pr_debug("EINVAL: overflow check\n"); + return -EINVAL; + } + + req = aio_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + err = __io_submit_one(ctx, &iocb, user_iocb, req, compat); + + /* + * If err is 0, we'd either done aio_complete() ourselves or have + * arranged for that to be done asynchronously. Anything non-zero + * means that we need to destroy req ourselves. + */ + if (unlikely(err)) { + iocb_put(req); + put_reqs_available(ctx, 1); + } + return err; } /* sys_io_submit: -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro ` (6 preceding siblings ...) 2019-03-07 0:03 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro @ 2019-03-07 0:23 ` Linus Torvalds 2019-03-07 0:41 ` Al Viro 7 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2019-03-07 0:23 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Wed, Mar 6, 2019 at 4:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > From: Al Viro <viro@zeniv.linux.org.uk> > > "aio: remove the extra get_file/fput pair in io_submit_one" was > too optimistic - not dereferencing file pointer after e.g. > ->write_iter() returns is not enough; that reference might've been > the only thing that kept alive objects that are referenced > *before* the method returns. Such as inode, for example... I still; think that this is actually _worse_ than just having the refcount on the req instead. As it is, we have that completely insane "ref can go away from under us", because nothing keeps that around, which then causes all those other crazy issues with "woken" etc garbage. I think we should be able to get rid of those entirely. Make the poll() case just return zero if it has added the entry successfully to poll queue. No need for "woken", no need for all that odd "oh, but now the req might no longer exist". The refcount wasn't the problem. Everything *else* was the problem, including only using the refcount for the poll case etc. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 0:23 ` [PATCH 1/8] aio: make sure file is pinned Linus Torvalds @ 2019-03-07 0:41 ` Al Viro 2019-03-07 0:48 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 0:41 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote: > On Wed, Mar 6, 2019 at 4:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > "aio: remove the extra get_file/fput pair in io_submit_one" was > > too optimistic - not dereferencing file pointer after e.g. > > ->write_iter() returns is not enough; that reference might've been > > the only thing that kept alive objects that are referenced > > *before* the method returns. Such as inode, for example... > > I still; think that this is actually _worse_ than just having the > refcount on the req instead. > > As it is, we have that completely insane "ref can go away from under > us", because nothing keeps that around, which then causes all those > other crazy issues with "woken" etc garbage. > > I think we should be able to get rid of those entirely. Make the > poll() case just return zero if it has added the entry successfully to > poll queue. No need for "woken", no need for all that odd "oh, but > now the req might no longer exist". Not really. Sure, you can get rid of "might no longer exist" considerations, but you still need to decide which way do we want to handle it. There are 3 cases: * it's already taken up; don't put on the list for possible cancel, don't call aio_complete(). * will eventually be woken up; put on the list for possible cancle, don't call aio_complete(). * wanted to be on several queues, fortunately not woken up yet. Make sure it's gone from queue, return an error. * none of the above, and ->poll() has reported what we wanted from the very beginning. Remove from queue, call aio_complete(). You'll need some logics to handle that. I can buy the "if we know the req is still alive, we can check if it's still queued instead of separate woken flag", but but it won't win you much ;-/ ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 0:41 ` Al Viro @ 2019-03-07 0:48 ` Al Viro 2019-03-07 1:20 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 0:48 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Thu, Mar 07, 2019 at 12:41:59AM +0000, Al Viro wrote: > On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote: > > On Wed, Mar 6, 2019 at 4:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > > > "aio: remove the extra get_file/fput pair in io_submit_one" was > > > too optimistic - not dereferencing file pointer after e.g. > > > ->write_iter() returns is not enough; that reference might've been > > > the only thing that kept alive objects that are referenced > > > *before* the method returns. Such as inode, for example... > > > > I still; think that this is actually _worse_ than just having the > > refcount on the req instead. > > > > As it is, we have that completely insane "ref can go away from under > > us", because nothing keeps that around, which then causes all those > > other crazy issues with "woken" etc garbage. > > > > I think we should be able to get rid of those entirely. Make the > > poll() case just return zero if it has added the entry successfully to > > poll queue. No need for "woken", no need for all that odd "oh, but > > now the req might no longer exist". > > Not really. Sure, you can get rid of "might no longer exist" > considerations, but you still need to decide which way do we want to > handle it. There are 3 cases: > * it's already taken up; don't put on the list for possible > cancel, don't call aio_complete(). > * will eventually be woken up; put on the list for possible > cancle, don't call aio_complete(). > * wanted to be on several queues, fortunately not woken up > yet. Make sure it's gone from queue, return an error. > * none of the above, and ->poll() has reported what we wanted > from the very beginning. Remove from queue, call aio_complete(). > > You'll need some logics to handle that. I can buy the "if we know > the req is still alive, we can check if it's still queued instead of > separate woken flag", but but it won't win you much ;-/ If anything, the one good reason for refcount would be the risk that some ->read_iter() or ->write_iter() will try to dereference iocb after having decided to return -EIOCBQUEUED and submitted all bios. I think that doesn't happen, but making sure it doesn't would be a good argument in favour of that refcount. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 0:48 ` Al Viro @ 2019-03-07 1:20 ` Al Viro 2019-03-07 1:30 ` Linus Torvalds 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-07 1:20 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Thu, Mar 07, 2019 at 12:48:28AM +0000, Al Viro wrote: > On Thu, Mar 07, 2019 at 12:41:59AM +0000, Al Viro wrote: > > On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote: > > > On Wed, Mar 6, 2019 at 4:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > > > > > "aio: remove the extra get_file/fput pair in io_submit_one" was > > > > too optimistic - not dereferencing file pointer after e.g. > > > > ->write_iter() returns is not enough; that reference might've been > > > > the only thing that kept alive objects that are referenced > > > > *before* the method returns. Such as inode, for example... > > > > > > I still; think that this is actually _worse_ than just having the > > > refcount on the req instead. > > > > > > As it is, we have that completely insane "ref can go away from under > > > us", because nothing keeps that around, which then causes all those > > > other crazy issues with "woken" etc garbage. > > > > > > I think we should be able to get rid of those entirely. Make the > > > poll() case just return zero if it has added the entry successfully to > > > poll queue. No need for "woken", no need for all that odd "oh, but > > > now the req might no longer exist". > > > > Not really. Sure, you can get rid of "might no longer exist" > > considerations, but you still need to decide which way do we want to > > handle it. There are 3 cases: > > * it's already taken up; don't put on the list for possible > > cancel, don't call aio_complete(). > > * will eventually be woken up; put on the list for possible > > cancle, don't call aio_complete(). > > * wanted to be on several queues, fortunately not woken up > > yet. Make sure it's gone from queue, return an error. > > * none of the above, and ->poll() has reported what we wanted > > from the very beginning. Remove from queue, call aio_complete(). > > > > You'll need some logics to handle that. I can buy the "if we know > > the req is still alive, we can check if it's still queued instead of > > separate woken flag", but but it won't win you much ;-/ > > If anything, the one good reason for refcount would be the risk that > some ->read_iter() or ->write_iter() will try to dereference iocb > after having decided to return -EIOCBQUEUED and submitted all bios. > I think that doesn't happen, but making sure it doesn't would be > a good argument in favour of that refcount. *grumble* It is a good argument, unfortunately ;-/ Proof that instances do not step into that is rather subtle and won't take much to break. OK... I'll try to massage that series on top of your patch; I still hate the post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour and I'll have something to post. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 1:20 ` Al Viro @ 2019-03-07 1:30 ` Linus Torvalds 2019-03-08 3:36 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2019-03-07 1:30 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Wed, Mar 6, 2019 at 5:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > I'll try to massage that series on top of your patch; I still hate the > post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour > and I'll have something to post. No inherent hurry, I sent the ping just to make sure it hadn't gotten lost. And yeah, I think the post-vfs_poll() logic cannot possibly be necessary. My gut feel is that *if* we have the refcounting right, then we should be able to just let the wakeup come in at any later point, and ordering shouldn't matter all that much, and we shouldn't even need any locking. I'd like to think that it can be done with something like "just 'or' in the mask atomically" (so that we don't care about ordering between the synchronous vfs_poll() and the async poll wakeup), together with "when refcount goes to zero, finish the thing off and complete it" (so that we don't care who finishes first). No "woken" logic, no "who fired first" logic, no BS. Just make the operations work regardless of ordering. And maybe it can't be done. But the current model seems just so hacky that it can't be the right model. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-07 1:30 ` Linus Torvalds @ 2019-03-08 3:36 ` Al Viro 2019-03-08 15:50 ` Christoph Hellwig 2019-03-10 7:06 ` Al Viro 0 siblings, 2 replies; 53+ messages in thread From: Al Viro @ 2019-03-08 3:36 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Wed, Mar 06, 2019 at 05:30:21PM -0800, Linus Torvalds wrote: > On Wed, Mar 6, 2019 at 5:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I'll try to massage that series on top of your patch; I still hate the > > post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour > > and I'll have something to post. > > No inherent hurry, I sent the ping just to make sure it hadn't gotten lost. > > And yeah, I think the post-vfs_poll() logic cannot possibly be > necessary. My gut feel is that *if* we have the refcounting right, > then we should be able to just let the wakeup come in at any later > point, and ordering shouldn't matter all that much, and we shouldn't > even need any locking. > > I'd like to think that it can be done with something like "just 'or' > in the mask atomically" (so that we don't care about ordering between > the synchronous vfs_poll() and the async poll wakeup), together with > "when refcount goes to zero, finish the thing off and complete it" (so > that we don't care who finishes first). > > No "woken" logic, no "who fired first" logic, no BS. Just make the > operations work regardless of ordering. > > And maybe it can't be done. But the current model seems just so hacky > that it can't be the right model. Umm... It is kinda-sorta doable; we do need something vaguely similar to ->woken ("should we add it to the list of cancellables, or is the async reference already gone?"), but other than that it seems to be feasible. See vfs.git#work.aio; the crucial bits are in these commits: keep io_event in aio_kiocb get rid of aio_complete() res/res2 arguments move aio_complete() to final iocb_put(), try to fix aio_poll() logics The first two are preparations, the last is where the fixes (hopefully) happen. The logics in aio_poll() after vfs_poll(): * we might want to steal the async reference (e.g. due to event returned from the very beginning, or due to attempt to put on more than one waitqueue, which makes results unreliable). That's _NOT_ possible if the thing had been put on a waitqueue, but currently isn't there. It might be either due to early wakeup having done everything or the same having scheduled aio_poll_complete_work(). In either case, the best we can do is to ignore the return value of vfs_poll() and, in case of error, mark the sucker cancelled. We *can't* return an error in that case. * if we want and can steal the async reference, rip it from waitqueue; otherwise, put it on the "cancellable" list, unless it's already gone or unless we are simulating the cancel ourselves. * if vfs_poll() has reported something we want and we have successufully stolen the iocb, put it there, have the reference we'd taken over dropped and return 0 Comments? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-08 3:36 ` Al Viro @ 2019-03-08 15:50 ` Christoph Hellwig 2019-03-10 7:06 ` Al Viro 1 sibling, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-08 15:50 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang, avi On Fri, Mar 08, 2019 at 03:36:50AM +0000, Al Viro wrote: > See vfs.git#work.aio; the crucial bits are in these commits: > keep io_event in aio_kiocb > get rid of aio_complete() res/res2 arguments > move aio_complete() to final iocb_put(), try to fix aio_poll() logics > The first two are preparations, the last is where the fixes (hopefully) > happen. Looks sensible. I'll try to run the tests over it, and I've added Avi so that maybe he can make sure that scylladb is also happy with it, that was usually the best way to find aio poll bugs.. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-08 3:36 ` Al Viro 2019-03-08 15:50 ` Christoph Hellwig @ 2019-03-10 7:06 ` Al Viro 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro 2019-03-11 19:41 ` [PATCH 1/8] aio: make sure file is pinned Christoph Hellwig 1 sibling, 2 replies; 53+ messages in thread From: Al Viro @ 2019-03-10 7:06 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Fri, Mar 08, 2019 at 03:36:50AM +0000, Al Viro wrote: > See vfs.git#work.aio; the crucial bits are in these commits: > keep io_event in aio_kiocb > get rid of aio_complete() res/res2 arguments > move aio_complete() to final iocb_put(), try to fix aio_poll() logics > The first two are preparations, the last is where the fixes (hopefully) > happen. OK, refactored, cleaned up and force-pushed. Current state: Al Viro (7): keep io_event in aio_kiocb aio: store event at final iocb_put() Fix aio_poll() races make aio_read()/aio_write() return int move dropping ->ki_eventfd into iocb_destroy() deal with get_reqs_available() in aio_get_req() itself aio: move sanity checks and request allocation to io_submit_one() Linus Torvalds (1): pin iocb through aio. fs/aio.c | 327 ++++++++++++++++++++++++++++----------------------------------- 1 file changed, 146 insertions(+), 181 deletions(-) ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 1/8] pin iocb through aio. 2019-03-10 7:06 ` Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-10 7:08 ` [PATCH 2/8] keep io_event in aio_kiocb Al Viro ` (7 more replies) 2019-03-11 19:41 ` [PATCH 1/8] aio: make sure file is pinned Christoph Hellwig 1 sibling, 8 replies; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Linus Torvalds <torvalds@linux-foundation.org> aio_poll() is not the only case that needs file pinned; worse, while aio_read()/aio_write() can live without pinning iocb itself, the proof is rather brittle and can easily break on later changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 3d9669d011b9..363d7d7c8bff 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1022,6 +1022,9 @@ static bool get_reqs_available(struct kioctx *ctx) /* aio_get_req * Allocate a slot for an aio request. * Returns NULL if no requests are free. + * + * The refcount is initialized to 2 - one for the async op completion, + * one for the synchronous code that does this. */ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) { @@ -1034,7 +1037,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) percpu_ref_get(&ctx->reqs); req->ki_ctx = ctx; INIT_LIST_HEAD(&req->ki_list); - refcount_set(&req->ki_refcnt, 0); + refcount_set(&req->ki_refcnt, 2); req->ki_eventfd = NULL; return req; } @@ -1067,15 +1070,18 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) return ret; } +static inline void iocb_destroy(struct aio_kiocb *iocb) +{ + if (iocb->ki_filp) + fput(iocb->ki_filp); + percpu_ref_put(&iocb->ki_ctx->reqs); + kmem_cache_free(kiocb_cachep, iocb); +} + static inline void iocb_put(struct aio_kiocb *iocb) { - if (refcount_read(&iocb->ki_refcnt) == 0 || - refcount_dec_and_test(&iocb->ki_refcnt)) { - if (iocb->ki_filp) - fput(iocb->ki_filp); - percpu_ref_put(&iocb->ki_ctx->reqs); - kmem_cache_free(kiocb_cachep, iocb); - } + if (refcount_dec_and_test(&iocb->ki_refcnt)) + iocb_destroy(iocb); } static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, @@ -1749,9 +1755,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) INIT_LIST_HEAD(&req->wait.entry); init_waitqueue_func_entry(&req->wait, aio_poll_wake); - /* one for removal from waitqueue, one for this function */ - refcount_set(&aiocb->ki_refcnt, 2); - mask = vfs_poll(req->file, &apt.pt) & req->events; if (unlikely(!req->head)) { /* we did not manage to set up a waitqueue, done */ @@ -1782,7 +1785,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) if (mask) aio_poll_complete(aiocb, mask); - iocb_put(aiocb); return 0; } @@ -1873,18 +1875,21 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, break; } + /* Done with the synchronous reference */ + iocb_put(req); + /* * If ret is 0, we'd either done aio_complete() ourselves or have * arranged for that to be done asynchronously. Anything non-zero * means that we need to destroy req ourselves. */ - if (ret) - goto out_put_req; - return 0; + if (!ret) + return 0; + out_put_req: if (req->ki_eventfd) eventfd_ctx_put(req->ki_eventfd); - iocb_put(req); + iocb_destroy(req); out_put_reqs_available: put_reqs_available(ctx, 1); return ret; -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 2/8] keep io_event in aio_kiocb 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:43 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 3/8] aio: store event at final iocb_put() Al Viro ` (6 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 56 +++++++++++++++++++++----------------------------------- 1 file changed, 21 insertions(+), 35 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 363d7d7c8bff..2249a7a1d6b3 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -204,8 +204,7 @@ struct aio_kiocb { struct kioctx *ki_ctx; kiocb_cancel_fn *ki_cancel; - struct iocb __user *ki_user_iocb; /* user's aiocb */ - __u64 ki_user_data; /* user's data for completion */ + struct io_event ki_res; struct list_head ki_list; /* the aio core uses this * for cancellation */ @@ -1087,10 +1086,9 @@ static inline void iocb_put(struct aio_kiocb *iocb) static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, long res, long res2) { - ev->obj = (u64)(unsigned long)iocb->ki_user_iocb; - ev->data = iocb->ki_user_data; - ev->res = res; - ev->res2 = res2; + iocb->ki_res.res = res; + iocb->ki_res.res2 = res2; + *ev = iocb->ki_res; } /* aio_complete @@ -1126,7 +1124,7 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n", - ctx, tail, iocb, iocb->ki_user_iocb, iocb->ki_user_data, + ctx, tail, iocb, (void __user *)iocb->ki_res.obj, iocb->ki_res.data, res, res2); /* after flagging the request as done, we @@ -1674,13 +1672,13 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, __poll_t mask = key_to_poll(key); unsigned long flags; + /* for instances that support it check for an event match first: */ + if (mask && !(mask & req->events)) + return 0; + req->woken = true; - /* for instances that support it check for an event match first: */ if (mask) { - if (!(mask & req->events)) - return 0; - /* * Try to complete the iocb inline if we can. Use * irqsave/irqrestore because not all filesystems (e.g. fuse) @@ -1844,8 +1842,10 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, goto out_put_req; } - req->ki_user_iocb = user_iocb; - req->ki_user_data = iocb->aio_data; + req->ki_res.obj = (u64)(unsigned long)user_iocb; + req->ki_res.data = iocb->aio_data; + req->ki_res.res = 0; + req->ki_res.res2 = 0; switch (iocb->aio_lio_opcode) { case IOCB_CMD_PREAD: @@ -2002,24 +2002,6 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, } #endif -/* lookup_kiocb - * Finds a given iocb for cancellation. - */ -static struct aio_kiocb * -lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb) -{ - struct aio_kiocb *kiocb; - - assert_spin_locked(&ctx->ctx_lock); - - /* TODO: use a hash or array, this sucks. */ - list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) { - if (kiocb->ki_user_iocb == iocb) - return kiocb; - } - return NULL; -} - /* sys_io_cancel: * Attempts to cancel an iocb previously passed to io_submit. If * the operation is successfully cancelled, the resulting event is @@ -2037,6 +2019,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, struct aio_kiocb *kiocb; int ret = -EINVAL; u32 key; + u64 obj = (u64)(unsigned long)iocb; if (unlikely(get_user(key, &iocb->aio_key))) return -EFAULT; @@ -2048,10 +2031,13 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, return -EINVAL; spin_lock_irq(&ctx->ctx_lock); - kiocb = lookup_kiocb(ctx, iocb); - if (kiocb) { - ret = kiocb->ki_cancel(&kiocb->rw); - list_del_init(&kiocb->ki_list); + /* TODO: use a hash or array, this sucks. */ + list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) { + if (kiocb->ki_res.obj == obj) { + ret = kiocb->ki_cancel(&kiocb->rw); + list_del_init(&kiocb->ki_list); + break; + } } spin_unlock_irq(&ctx->ctx_lock); -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 2/8] keep io_event in aio_kiocb 2019-03-10 7:08 ` [PATCH 2/8] keep io_event in aio_kiocb Al Viro @ 2019-03-11 19:43 ` Christoph Hellwig 2019-03-11 21:17 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:43 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:08:16AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> This could use a little changelog at least that explains the why. It also removes lookup_kiocb and folds that into the only caller, which is at least worth mentioning (and probably should be a separate patch). ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 2/8] keep io_event in aio_kiocb 2019-03-11 19:43 ` Christoph Hellwig @ 2019-03-11 21:17 ` Al Viro 0 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-11 21:17 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 08:43:06PM +0100, Christoph Hellwig wrote: > On Sun, Mar 10, 2019 at 07:08:16AM +0000, Al Viro wrote: > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > This could use a little changelog at least that explains the why. > > It also removes lookup_kiocb and folds that into the only caller, > which is at least worth mentioning (and probably should be a separate > patch). More confusingly, it contains a pointless rudiment in aio_poll_wake(); that chunk doesn't solve the problems with ->woken, and the whole thing gets killed off shortly... ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 3/8] aio: store event at final iocb_put() 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro 2019-03-10 7:08 ` [PATCH 2/8] keep io_event in aio_kiocb Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:44 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 4/8] Fix aio_poll() races Al Viro ` (5 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> Instead of having aio_complete() set ->ki_res.{res,res2}, do that explicitly in its callers, drop the reference (as aio_complete() used to do) and delay the rest until the final iocb_put(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 45 ++++++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 25 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 2249a7a1d6b3..b9c4c1894020 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) kmem_cache_free(kiocb_cachep, iocb); } -static inline void iocb_put(struct aio_kiocb *iocb) -{ - if (refcount_dec_and_test(&iocb->ki_refcnt)) - iocb_destroy(iocb); -} - -static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, - long res, long res2) -{ - iocb->ki_res.res = res; - iocb->ki_res.res2 = res2; - *ev = iocb->ki_res; -} - /* aio_complete * Called when the io request on the given iocb is complete. */ -static void aio_complete(struct aio_kiocb *iocb, long res, long res2) +static void aio_complete(struct aio_kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx; struct aio_ring *ring; @@ -1118,14 +1104,14 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); event = ev_page + pos % AIO_EVENTS_PER_PAGE; - aio_fill_event(event, iocb, res, res2); + *event = iocb->ki_res; kunmap_atomic(ev_page); flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); - pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n", + pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)iocb->ki_res.obj, iocb->ki_res.data, - res, res2); + iocb->ki_res.res, iocb->ki_res.res2); /* after flagging the request as done, we * must never even look at it again @@ -1167,7 +1153,14 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); - iocb_put(iocb); +} + +static inline void iocb_put(struct aio_kiocb *iocb) +{ + if (refcount_dec_and_test(&iocb->ki_refcnt)) { + aio_complete(iocb); + iocb_destroy(iocb); + } } /* aio_read_events_ring @@ -1441,7 +1434,9 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2) file_end_write(kiocb->ki_filp); } - aio_complete(iocb, res, res2); + iocb->ki_res.res = res; + iocb->ki_res.res2 = res2; + iocb_put(iocb); } static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) @@ -1589,11 +1584,10 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, static void aio_fsync_work(struct work_struct *work) { - struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); - int ret; + struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, fsync.work); - ret = vfs_fsync(req->file, req->datasync); - aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0); + iocb->ki_res.res = vfs_fsync(iocb->fsync.file, iocb->fsync.datasync); + iocb_put(iocb); } static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, @@ -1614,7 +1608,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask) { - aio_complete(iocb, mangle_poll(mask), 0); + iocb->ki_res.res = mangle_poll(mask); + iocb_put(iocb); } static void aio_poll_complete_work(struct work_struct *work) -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 3/8] aio: store event at final iocb_put() 2019-03-10 7:08 ` [PATCH 3/8] aio: store event at final iocb_put() Al Viro @ 2019-03-11 19:44 ` Christoph Hellwig 2019-03-11 21:13 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:44 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > Instead of having aio_complete() set ->ki_res.{res,res2}, do that > explicitly in its callers, drop the reference (as aio_complete() > used to do) and delay the rest until the final iocb_put(). > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/aio.c | 45 ++++++++++++++++++++------------------------- > 1 file changed, 20 insertions(+), 25 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 2249a7a1d6b3..b9c4c1894020 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) > kmem_cache_free(kiocb_cachep, iocb); > } > > -static inline void iocb_put(struct aio_kiocb *iocb) > -{ > - if (refcount_dec_and_test(&iocb->ki_refcnt)) > - iocb_destroy(iocb); > -} Maybe iocb_put should just have been added in the place you move it to in patch 1? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/8] aio: store event at final iocb_put() 2019-03-11 19:44 ` Christoph Hellwig @ 2019-03-11 21:13 ` Al Viro 2019-03-11 22:52 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-11 21:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 08:44:31PM +0100, Christoph Hellwig wrote: > On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote: > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > Instead of having aio_complete() set ->ki_res.{res,res2}, do that > > explicitly in its callers, drop the reference (as aio_complete() > > used to do) and delay the rest until the final iocb_put(). > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > --- > > fs/aio.c | 45 ++++++++++++++++++++------------------------- > > 1 file changed, 20 insertions(+), 25 deletions(-) > > > > diff --git a/fs/aio.c b/fs/aio.c > > index 2249a7a1d6b3..b9c4c1894020 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) > > kmem_cache_free(kiocb_cachep, iocb); > > } > > > > -static inline void iocb_put(struct aio_kiocb *iocb) > > -{ > > - if (refcount_dec_and_test(&iocb->ki_refcnt)) > > - iocb_destroy(iocb); > > -} > > Maybe iocb_put should just have been added in the place you move > it to in patch 1? Might as well... ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/8] aio: store event at final iocb_put() 2019-03-11 21:13 ` Al Viro @ 2019-03-11 22:52 ` Al Viro 0 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-11 22:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 09:13:28PM +0000, Al Viro wrote: > On Mon, Mar 11, 2019 at 08:44:31PM +0100, Christoph Hellwig wrote: > > On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote: > > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > > > Instead of having aio_complete() set ->ki_res.{res,res2}, do that > > > explicitly in its callers, drop the reference (as aio_complete() > > > used to do) and delay the rest until the final iocb_put(). > > > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > > --- > > > fs/aio.c | 45 ++++++++++++++++++++------------------------- > > > 1 file changed, 20 insertions(+), 25 deletions(-) > > > > > > diff --git a/fs/aio.c b/fs/aio.c > > > index 2249a7a1d6b3..b9c4c1894020 100644 > > > --- a/fs/aio.c > > > +++ b/fs/aio.c > > > @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) > > > kmem_cache_free(kiocb_cachep, iocb); > > > } > > > > > > -static inline void iocb_put(struct aio_kiocb *iocb) > > > -{ > > > - if (refcount_dec_and_test(&iocb->ki_refcnt)) > > > - iocb_destroy(iocb); > > > -} > > > > Maybe iocb_put should just have been added in the place you move > > it to in patch 1? > > Might as well... Actually, that wouldn't be any better - it would need at least a declaration before aio_complete(), since in the original patch aio_complete() calls it. So that wouldn't be less noisy... ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 4/8] Fix aio_poll() races 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro 2019-03-10 7:08 ` [PATCH 2/8] keep io_event in aio_kiocb Al Viro 2019-03-10 7:08 ` [PATCH 3/8] aio: store event at final iocb_put() Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:58 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro ` (4 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> aio_poll() has to cope with several unpleasant problems: * requests that might stay around indefinitely need to be made visible for io_cancel(2); that must not be done to a request already completed, though. * in cases when ->poll() has placed us on a waitqueue, wakeup might have happened (and request completed) before ->poll() returns. * worse, in some early wakeup cases request might end up re-added into the queue later - we can't treat "woken up and currently not in the queue" as "it's not going to stick around indefinitely" * ... moreover, ->poll() might have decided not to put it on any queues to start with, and that needs to be distinguished from the previous case * ->poll() might have tried to put us on more than one queue. Only the first will succeed for aio poll, so we might end up missing wakeups. OTOH, we might very well notice that only after the wakeup hits and request gets completed (all before ->poll() gets around to the second poll_wait()). In that case it's too late to decide that we have an error. req->woken was an attempt to deal with that. Unfortunately, it was broken. What we need to keep track of is not that wakeup has happened - the thing might come back after that. It's that async reference is already gone and won't come back, so we can't (and needn't) put the request on the list of cancellables. The easiest case is "request hadn't been put on any waitqueues"; we can tell by seeing NULL apt.head, and in that case there won't be anything async. We should either complete the request ourselves (if vfs_poll() reports anything of interest) or return an error. In all other cases we get exclusion with wakeups by grabbing the queue lock. If request is currently on queue and we have something interesting from vfs_poll(), we can steal it and complete the request ourselves. If it's on queue and vfs_poll() has not reported anything interesting, we either put it on the cancellable list, or, if we know that it hadn't been put on all queues ->poll() wanted it on, we steal it and return an error. If it's _not_ on queue, it's either been already dealt with (in which case we do nothing), or there's aio_poll_complete_work() about to be executed. In that case we either put it on the cancellable list, or, if we know it hadn't been put on all queues ->poll() wanted it on, simulate what cancel would've done. It's a lot more convoluted than I'd like it to be. Single-consumer APIs suck, and unfortunately aio is not an exception... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 71 +++++++++++++++++++++++++++++----------------------------------- 1 file changed, 32 insertions(+), 39 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index b9c4c1894020..f47a29f7f201 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -181,7 +181,7 @@ struct poll_iocb { struct file *file; struct wait_queue_head *head; __poll_t events; - bool woken; + bool done; bool cancelled; struct wait_queue_entry wait; struct work_struct work; @@ -1606,12 +1606,6 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, return 0; } -static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask) -{ - iocb->ki_res.res = mangle_poll(mask); - iocb_put(iocb); -} - static void aio_poll_complete_work(struct work_struct *work) { struct poll_iocb *req = container_of(work, struct poll_iocb, work); @@ -1637,9 +1631,11 @@ static void aio_poll_complete_work(struct work_struct *work) return; } list_del_init(&iocb->ki_list); + iocb->ki_res.res = mangle_poll(mask); + req->done = true; spin_unlock_irq(&ctx->ctx_lock); - aio_poll_complete(iocb, mask); + iocb_put(iocb); } /* assumes we are called with irqs disabled */ @@ -1671,7 +1667,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, if (mask && !(mask & req->events)) return 0; - req->woken = true; + list_del_init(&req->wait.entry); if (mask) { /* @@ -1682,15 +1678,14 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, */ if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { list_del(&iocb->ki_list); + iocb->ki_res.res = mangle_poll(mask); + req->done = true; spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags); - - list_del_init(&req->wait.entry); - aio_poll_complete(iocb, mask); + iocb_put(iocb); return 1; } } - list_del_init(&req->wait.entry); schedule_work(&req->work); return 1; } @@ -1723,6 +1718,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) struct kioctx *ctx = aiocb->ki_ctx; struct poll_iocb *req = &aiocb->poll; struct aio_poll_table apt; + bool cancel = false; __poll_t mask; /* reject any unknown events outside the normal event mask. */ @@ -1736,7 +1732,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP; req->head = NULL; - req->woken = false; + req->done = false; req->cancelled = false; apt.pt._qproc = aio_poll_queue_proc; @@ -1749,36 +1745,33 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) init_waitqueue_func_entry(&req->wait, aio_poll_wake); mask = vfs_poll(req->file, &apt.pt) & req->events; - if (unlikely(!req->head)) { - /* we did not manage to set up a waitqueue, done */ - goto out; - } - spin_lock_irq(&ctx->ctx_lock); - spin_lock(&req->head->lock); - if (req->woken) { - /* wake_up context handles the rest */ - mask = 0; + if (likely(req->head)) { + spin_lock(&req->head->lock); + if (unlikely(list_empty(&req->wait.entry))) { + if (apt.error) + cancel = true; + apt.error = 0; + mask = 0; + } + if (mask || apt.error) { + list_del_init(&req->wait.entry); + } else if (cancel) { + WRITE_ONCE(req->cancelled, true); + } else if (!req->done) { /* actually waiting for an event */ + list_add_tail(&aiocb->ki_list, &ctx->active_reqs); + aiocb->ki_cancel = aio_poll_cancel; + } + spin_unlock(&req->head->lock); + } + if (mask) { /* no async, we'd stolen it */ + aiocb->ki_res.res = mangle_poll(mask); apt.error = 0; - } else if (mask || apt.error) { - /* if we get an error or a mask we are done */ - WARN_ON_ONCE(list_empty(&req->wait.entry)); - list_del_init(&req->wait.entry); - } else { - /* actually waiting for an event */ - list_add_tail(&aiocb->ki_list, &ctx->active_reqs); - aiocb->ki_cancel = aio_poll_cancel; } - spin_unlock(&req->head->lock); spin_unlock_irq(&ctx->ctx_lock); - -out: - if (unlikely(apt.error)) - return apt.error; - if (mask) - aio_poll_complete(aiocb, mask); - return 0; + iocb_put(aiocb); + return apt.error; } static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 4/8] Fix aio_poll() races 2019-03-10 7:08 ` [PATCH 4/8] Fix aio_poll() races Al Viro @ 2019-03-11 19:58 ` Christoph Hellwig 2019-03-11 21:06 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:58 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang Where do we put the second iocb reference in case we return from vfs_poll without ever being woken? Also it seems like the complete code would still benefit from a little helper, something like: diff --git a/fs/aio.c b/fs/aio.c index b2a5c7b3a1fe..8415e5e484ce 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1611,6 +1611,13 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, return 0; } +static void aio_poll_finish(struct aio_kiocb *iocb, __poll_t mask) +{ + list_del_init(&iocb->ki_list); + iocb->ki_res.res = mangle_poll(mask); + iocb->poll.done = true; +} + static void aio_poll_complete_work(struct work_struct *work) { struct poll_iocb *req = container_of(work, struct poll_iocb, work); @@ -1635,9 +1642,7 @@ static void aio_poll_complete_work(struct work_struct *work) spin_unlock_irq(&ctx->ctx_lock); return; } - list_del_init(&iocb->ki_list); - iocb->ki_res.res = mangle_poll(mask); - req->done = true; + aio_poll_finish(iocb, mask); spin_unlock_irq(&ctx->ctx_lock); iocb_put(iocb); @@ -1674,24 +1679,20 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, list_del_init(&req->wait.entry); - if (mask) { + if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { /* * Try to complete the iocb inline if we can. Use * irqsave/irqrestore because not all filesystems (e.g. fuse) * call this function with IRQs disabled and because IRQs * have to be disabled before ctx_lock is obtained. */ - if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { - list_del(&iocb->ki_list); - iocb->ki_res.res = mangle_poll(mask); - req->done = true; - spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags); - iocb_put(iocb); - return 1; - } + aio_poll_finish(iocb, mask); + spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags); + iocb_put(iocb); + } else { + schedule_work(&req->work); } - schedule_work(&req->work); return 1; } ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 4/8] Fix aio_poll() races 2019-03-11 19:58 ` Christoph Hellwig @ 2019-03-11 21:06 ` Al Viro 2019-03-12 19:18 ` Christoph Hellwig 0 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-11 21:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 08:58:31PM +0100, Christoph Hellwig wrote: > Where do we put the second iocb reference in case we return from > vfs_poll without ever being woken? Depends. If mask is non-zero (i.e. vfs_poll() has returned something we care about) and it has never been woken, we steal it and drop the reference ourselves. If it is zero and we see that ->poll() has tried to put it on two queues, we steal it (again, assuming it's not on waitqueue and _can_ be stolen) and return -EINVAL. In that case __io_submit_one() (or, by the end of the series, io_submit_one()) will call iocb_destroy(). And in the normal waiting case (nothing interesting reported and no errors) it will end up on the list of cancellables. Then it either will get completed by later wakeup, which will drop the reference, or it will get eventually cancelled, which will hit the same aio_poll_complete_work() and drop the reference... > Also it seems like the complete code would still benefit from a little > helper, something like: Umm... Not sure I like the name (something like aio_poll_done() seems to be better), but other than that - no problem. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 4/8] Fix aio_poll() races 2019-03-11 21:06 ` Al Viro @ 2019-03-12 19:18 ` Christoph Hellwig 0 siblings, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-12 19:18 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 09:06:18PM +0000, Al Viro wrote: > On Mon, Mar 11, 2019 at 08:58:31PM +0100, Christoph Hellwig wrote: > > Where do we put the second iocb reference in case we return from > > vfs_poll without ever being woken? > > Depends. If mask is non-zero (i.e. vfs_poll() has returned something > we care about) and it has never been woken, we steal it and drop the > reference ourselves. If it is zero and we see that ->poll() has tried > to put it on two queues, we steal it (again, assuming it's not on > waitqueue and _can_ be stolen) and return -EINVAL. In that case > __io_submit_one() (or, by the end of the series, io_submit_one()) > will call iocb_destroy(). And in the normal waiting case (nothing > interesting reported and no errors) it will end up on the list of > cancellables. Then it either will get completed by later wakeup, which > will drop the reference, or it will get eventually cancelled, which will > hit the same aio_poll_complete_work() and drop the reference... Ok, seems like the logic is sane. I was missing how the actual mask logic worked in aio_poll(). > > Also it seems like the complete code would still benefit from a little > > helper, something like: > > Umm... Not sure I like the name (something like aio_poll_done() seems > to be better), but other than that - no problem. I don't care about the name. Feel free to change it to whatever suits you. ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 5/8] make aio_read()/aio_write() return int 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro ` (2 preceding siblings ...) 2019-03-10 7:08 ` [PATCH 4/8] Fix aio_poll() races Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:44 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() Al Viro ` (3 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> that ssize_t is a rudiment of earlier calling conventions; it's been used only to pass 0 and -E... since last autumn. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index f47a29f7f201..f9e8f1edfe36 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1513,13 +1513,13 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret) } } -static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, +static int aio_read(struct kiocb *req, const struct iocb *iocb, bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; struct file *file; - ssize_t ret; + int ret; ret = aio_prep_rw(req, iocb); if (ret) @@ -1541,13 +1541,13 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, return ret; } -static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, +static int aio_write(struct kiocb *req, const struct iocb *iocb, bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; struct file *file; - ssize_t ret; + int ret; ret = aio_prep_rw(req, iocb); if (ret) @@ -1713,7 +1713,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head, add_wait_queue(head, &pt->iocb->poll.wait); } -static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) +static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) { struct kioctx *ctx = aiocb->ki_ctx; struct poll_iocb *req = &aiocb->poll; @@ -1778,7 +1778,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, struct iocb __user *user_iocb, bool compat) { struct aio_kiocb *req; - ssize_t ret; + int ret; /* enforce forwards compatibility on users */ if (unlikely(iocb->aio_reserved2)) { -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 5/8] make aio_read()/aio_write() return int 2019-03-10 7:08 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro @ 2019-03-11 19:44 ` Christoph Hellwig 0 siblings, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:44 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro ` (3 preceding siblings ...) 2019-03-10 7:08 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:46 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro ` (2 subsequent siblings) 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index f9e8f1edfe36..595c19965a8b 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1071,6 +1071,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) static inline void iocb_destroy(struct aio_kiocb *iocb) { + if (iocb->ki_eventfd) + eventfd_ctx_put(iocb->ki_eventfd); if (iocb->ki_filp) fput(iocb->ki_filp); percpu_ref_put(&iocb->ki_ctx->reqs); @@ -1138,10 +1140,8 @@ static void aio_complete(struct aio_kiocb *iocb) * eventfd. The eventfd_signal() function is safe to be called * from IRQ context. */ - if (iocb->ki_eventfd) { + if (iocb->ki_eventfd) eventfd_signal(iocb->ki_eventfd, 1); - eventfd_ctx_put(iocb->ki_eventfd); - } /* * We have to order our ring_info tail store above and test @@ -1810,18 +1810,19 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, goto out_put_req; if (iocb->aio_flags & IOCB_FLAG_RESFD) { + struct eventfd_ctx *eventfd; /* * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an * instance of the file* now. The file descriptor must be * an eventfd() fd, and will be signaled for each completed * event using the eventfd_signal() function. */ - req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); - if (IS_ERR(req->ki_eventfd)) { - ret = PTR_ERR(req->ki_eventfd); - req->ki_eventfd = NULL; + eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); + if (IS_ERR(eventfd)) { + ret = PTR_ERR(eventfd); goto out_put_req; } + req->ki_eventfd = eventfd; } ret = put_user(KIOCB_KEY, &user_iocb->aio_key); @@ -1875,8 +1876,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return 0; out_put_req: - if (req->ki_eventfd) - eventfd_ctx_put(req->ki_eventfd); iocb_destroy(req); out_put_reqs_available: put_reqs_available(ctx, 1); -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() 2019-03-10 7:08 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() Al Viro @ 2019-03-11 19:46 ` Christoph Hellwig 0 siblings, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:46 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:08:20AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Looks sensible, but a changelog would be nice. > if (iocb->aio_flags & IOCB_FLAG_RESFD) { > + struct eventfd_ctx *eventfd; > /* > * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an > * instance of the file* now. The file descriptor must be > * an eventfd() fd, and will be signaled for each completed > * event using the eventfd_signal() function. > */ > + eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); I don't think there is any point in the cast in either the old or new code. ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro ` (4 preceding siblings ...) 2019-03-10 7:08 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:46 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro 2019-03-11 19:41 ` [PATCH 1/8] pin iocb through aio Christoph Hellwig 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> simplifies the caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 595c19965a8b..66ca52c57cca 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1033,6 +1033,11 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) if (unlikely(!req)) return NULL; + if (unlikely(!get_reqs_available(ctx))) { + kfree(req); + return NULL; + } + percpu_ref_get(&ctx->reqs); req->ki_ctx = ctx; INIT_LIST_HEAD(&req->ki_list); @@ -1796,13 +1801,9 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return -EINVAL; } - if (!get_reqs_available(ctx)) - return -EAGAIN; - - ret = -EAGAIN; req = aio_get_req(ctx); if (unlikely(!req)) - goto out_put_reqs_available; + return -EAGAIN; req->ki_filp = fget(iocb->aio_fildes); ret = -EBADF; @@ -1877,7 +1878,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, out_put_req: iocb_destroy(req); -out_put_reqs_available: put_reqs_available(ctx, 1); return ret; } -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself 2019-03-10 7:08 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro @ 2019-03-11 19:46 ` Christoph Hellwig 0 siblings, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:46 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:08:21AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > simplifies the caller > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro ` (5 preceding siblings ...) 2019-03-10 7:08 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro @ 2019-03-10 7:08 ` Al Viro 2019-03-11 19:48 ` Christoph Hellwig 2019-03-11 19:41 ` [PATCH 1/8] pin iocb through aio Christoph Hellwig 7 siblings, 1 reply; 53+ messages in thread From: Al Viro @ 2019-03-10 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang From: Al Viro <viro@zeniv.linux.org.uk> makes for somewhat cleaner control flow in __io_submit_one() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/aio.c | 119 ++++++++++++++++++++++++++++----------------------------------- 1 file changed, 53 insertions(+), 66 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 66ca52c57cca..2d1db52b1268 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1780,35 +1780,12 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) } static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, - struct iocb __user *user_iocb, bool compat) + struct iocb __user *user_iocb, struct aio_kiocb *req, + bool compat) { - struct aio_kiocb *req; - int ret; - - /* enforce forwards compatibility on users */ - if (unlikely(iocb->aio_reserved2)) { - pr_debug("EINVAL: reserve field set\n"); - return -EINVAL; - } - - /* prevent overflows */ - if (unlikely( - (iocb->aio_buf != (unsigned long)iocb->aio_buf) || - (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) || - ((ssize_t)iocb->aio_nbytes < 0) - )) { - pr_debug("EINVAL: overflow check\n"); - return -EINVAL; - } - - req = aio_get_req(ctx); - if (unlikely(!req)) - return -EAGAIN; - req->ki_filp = fget(iocb->aio_fildes); - ret = -EBADF; if (unlikely(!req->ki_filp)) - goto out_put_req; + return -EBADF; if (iocb->aio_flags & IOCB_FLAG_RESFD) { struct eventfd_ctx *eventfd; @@ -1819,17 +1796,15 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, * event using the eventfd_signal() function. */ eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); - if (IS_ERR(eventfd)) { - ret = PTR_ERR(eventfd); - goto out_put_req; - } + if (IS_ERR(eventfd)) + return PTR_ERR(req->ki_eventfd); + req->ki_eventfd = eventfd; } - ret = put_user(KIOCB_KEY, &user_iocb->aio_key); - if (unlikely(ret)) { + if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) { pr_debug("EFAULT: aio_key\n"); - goto out_put_req; + return -EFAULT; } req->ki_res.obj = (u64)(unsigned long)user_iocb; @@ -1839,58 +1814,70 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, switch (iocb->aio_lio_opcode) { case IOCB_CMD_PREAD: - ret = aio_read(&req->rw, iocb, false, compat); - break; + return aio_read(&req->rw, iocb, false, compat); case IOCB_CMD_PWRITE: - ret = aio_write(&req->rw, iocb, false, compat); - break; + return aio_write(&req->rw, iocb, false, compat); case IOCB_CMD_PREADV: - ret = aio_read(&req->rw, iocb, true, compat); - break; + return aio_read(&req->rw, iocb, true, compat); case IOCB_CMD_PWRITEV: - ret = aio_write(&req->rw, iocb, true, compat); - break; + return aio_write(&req->rw, iocb, true, compat); case IOCB_CMD_FSYNC: - ret = aio_fsync(&req->fsync, iocb, false); - break; + return aio_fsync(&req->fsync, iocb, false); case IOCB_CMD_FDSYNC: - ret = aio_fsync(&req->fsync, iocb, true); - break; + return aio_fsync(&req->fsync, iocb, true); case IOCB_CMD_POLL: - ret = aio_poll(req, iocb); - break; + return aio_poll(req, iocb); default: pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode); - ret = -EINVAL; - break; + return -EINVAL; } - - /* Done with the synchronous reference */ - iocb_put(req); - - /* - * If ret is 0, we'd either done aio_complete() ourselves or have - * arranged for that to be done asynchronously. Anything non-zero - * means that we need to destroy req ourselves. - */ - if (!ret) - return 0; - -out_put_req: - iocb_destroy(req); - put_reqs_available(ctx, 1); - return ret; } static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, bool compat) { + struct aio_kiocb *req; struct iocb iocb; + int err; if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb)))) return -EFAULT; - return __io_submit_one(ctx, &iocb, user_iocb, compat); + /* enforce forwards compatibility on users */ + if (unlikely(iocb.aio_reserved2)) { + pr_debug("EINVAL: reserve field set\n"); + return -EINVAL; + } + + /* prevent overflows */ + if (unlikely( + (iocb.aio_buf != (unsigned long)iocb.aio_buf) || + (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) || + ((ssize_t)iocb.aio_nbytes < 0) + )) { + pr_debug("EINVAL: overflow check\n"); + return -EINVAL; + } + + req = aio_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + err = __io_submit_one(ctx, &iocb, user_iocb, req, compat); + + /* Done with the synchronous reference */ + iocb_put(req); + + /* + * If err is 0, we'd either done aio_complete() ourselves or have + * arranged for that to be done asynchronously. Anything non-zero + * means that we need to destroy req ourselves. + */ + if (unlikely(err)) { + iocb_destroy(req); + put_reqs_available(ctx, 1); + } + return err; } /* sys_io_submit: -- 2.11.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() 2019-03-10 7:08 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro @ 2019-03-11 19:48 ` Christoph Hellwig 2019-03-11 21:12 ` Al Viro 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:48 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:08:22AM +0000, Al Viro wrote: > From: Al Viro <viro@zeniv.linux.org.uk> > > makes for somewhat cleaner control flow in __io_submit_one() > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> I wonder if we should even bother keeping __io_submit_one. Splitting that out was prep work from Jens for something that eventually turned into io_uring. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() 2019-03-11 19:48 ` Christoph Hellwig @ 2019-03-11 21:12 ` Al Viro 0 siblings, 0 replies; 53+ messages in thread From: Al Viro @ 2019-03-11 21:12 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Mon, Mar 11, 2019 at 08:48:30PM +0100, Christoph Hellwig wrote: > On Sun, Mar 10, 2019 at 07:08:22AM +0000, Al Viro wrote: > > From: Al Viro <viro@zeniv.linux.org.uk> > > > > makes for somewhat cleaner control flow in __io_submit_one() > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > I wonder if we should even bother keeping __io_submit_one. Splitting > that out was prep work from Jens for something that eventually turned > into io_uring. *shrug* I think it's a bit easier to keep track of control flow - failure exits are simpler that way. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] pin iocb through aio. 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro ` (6 preceding siblings ...) 2019-03-10 7:08 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro @ 2019-03-11 19:41 ` Christoph Hellwig 7 siblings, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:41 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon.kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, xiyou.wangcong, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 1/8] aio: make sure file is pinned 2019-03-10 7:06 ` Al Viro 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro @ 2019-03-11 19:41 ` Christoph Hellwig 1 sibling, 0 replies; 53+ messages in thread From: Christoph Hellwig @ 2019-03-11 19:41 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric Dumazet, David Miller, Jason Baron, kgraul, ktkhai, kyeongdon kim, Linux List Kernel Mailing, Netdev, pabeni, syzkaller-bugs, Cong Wang, Christoph Hellwig, zhengbin, bcrl, linux-fsdevel, linux-aio, houtao1, yi.zhang On Sun, Mar 10, 2019 at 07:06:18AM +0000, Al Viro wrote: > OK, refactored, cleaned up and force-pushed. Current state: This survives the libaio test suite at least. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) 2019-03-03 15:18 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Al Viro 2019-03-03 18:37 ` Eric Dumazet 2019-03-03 19:44 ` Linus Torvalds @ 2019-03-04 7:53 ` Dmitry Vyukov 2 siblings, 0 replies; 53+ messages in thread From: Dmitry Vyukov @ 2019-03-04 7:53 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, David Miller, Jason Baron, kgraul, Kirill Tkhai, kyeongdon kim, LKML, netdev, Paolo Abeni, syzkaller-bugs, Cong Wang, Christoph Hellwig On Sun, Mar 3, 2019 at 4:19 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote: > > > Maybe unrelated to this bug, but... What's to prevent a wakeup > > that happens just after we'd been added to a waitqueue by ->poll() > > triggering aio_poll_wake(), which gets to aio_poll_complete() > > with its fput() *before* we'd reached the end of ->poll() instance? > > > > I wonder if adding > > get_file(req->file); // make sure that early completion is safe > > right after > > req->file = fget(iocb->aio_fildes); > > if (unlikely(!req->file)) > > return -EBADF; > > paired with > > fput(req->file); > > right after out: in aio_poll() is needed... Am I missing something > > obvious here? Christoph? > > In more details - normally IOCB_CMD_POLL handling looks so: > > 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() > 2) aio_poll() resolves the descriptor to struct file by > req->file = fget(iocb->aio_fildes) > 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that > aio_kiocb to 2 (bumps by 1, that is). > 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() > had been called and only once") it locks the queue. That's what the extra > reference to iocb had been for - we know we can safely access it. > 5) With queue locked, we check if ->woken has already been set to true > (by aio_poll_wake()) and, if it had been, we unlock the queue, drop > a reference to aio_kiocb and bugger off - at that point it's > a responsibility to aio_poll_wake() and the stuff called/scheduled by > it. That code will drop the reference to file in req->file, along > with the other reference to our aio_kiocb. > 6) otherwise, we see whether we need to wait. If we do, we unlock > the queue, drop one reference to aio_kiocb and go away - eventual > wakeup (or cancel) will deal with the reference to file and with the > other reference to aio_kiocb > 7) otherwise we remove ourselves from waitqueue (still under the queue > lock), so that wakeup won't get us. No async activity will be happening, > so we can safely drop req->file and iocb ourselves. > > If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb > won't get freed under us, so we can do all the checks and locking safely. > And we don't touch ->file if we detect that case. > > However, vfs_poll() most certainly *does* touch the file it had been > given. So wakeup coming while we are still in ->poll() might end up > doing fput() on that file. That case is not too rare, and usually > we are saved by the still present reference from descriptor table - > that fput() is not the final one. But if another thread closes that > descriptor right after our fget() and wakeup does happen before > ->poll() returns, we are in trouble - final fput() done while we > are in the middle of a method. > > What we need is to hold on to the file reference the same way we do with > aio_kiocb. No need to store the reference to what we'd got in a separate > variable - req->file is never reassigned and we'd already made sure that > req won't be freed under us, so dropping the extra reference with > fput(req->file) is fine in all cases. > > Fixes: bfe4037e722ec > Cc: stable@vger.kernel.org > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Please add the Reported-by tag from the report. If you don't add it, then either: 1. somebody (you) will need to remember to later go and associate fix the the report with "#syz fix" command 2. or the bug will stay open and syzbot will never report use-after-frees in unix_dgram_poll again (it's still the same already reported bug) 3. or worse somebody will spend time re-debugging this bug just to find that you already fixed it but did not include the tag Thanks > --- > diff --git a/fs/aio.c b/fs/aio.c > index 3083180a54c8..7e88bfabdac2 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > > /* one for removal from waitqueue, one for this function */ > refcount_set(&aiocb->ki_refcnt, 2); > + get_file(req->file); > > mask = vfs_poll(req->file, &apt.pt) & req->events; > if (unlikely(!req->head)) { > @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) > spin_unlock_irq(&ctx->ctx_lock); > > out: > + fput(req->file); > if (unlikely(apt.error)) { > fput(req->file); > return apt.error; > > -- > You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com. > To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20190303151846.GQ2217%40ZenIV.linux.org.uk. > For more options, visit https://groups.google.com/d/optout. ^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2019-03-12 19:18 UTC | newest] Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-03-03 10:22 KASAN: use-after-free Read in unix_dgram_poll syzbot 2019-03-03 13:55 ` Al Viro 2019-03-03 15:18 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Al Viro 2019-03-03 18:37 ` Eric Dumazet 2019-03-03 19:44 ` Linus Torvalds 2019-03-03 20:13 ` Linus Torvalds 2019-03-03 20:30 ` Al Viro 2019-03-03 22:23 ` Linus Torvalds 2019-03-04 2:36 ` Al Viro 2019-03-04 21:22 ` Linus Torvalds 2019-03-07 0:03 ` [PATCH 1/8] aio: make sure file is pinned Al Viro 2019-03-07 0:03 ` [PATCH 2/8] aio_poll_wake(): don't set ->woken if we ignore the wakeup Al Viro 2019-03-07 2:18 ` Al Viro 2019-03-08 11:16 ` zhengbin (A) 2019-03-07 0:03 ` [PATCH 3/8] aio_poll(): sanitize the logics after vfs_poll(), get rid of leak on error Al Viro 2019-03-07 2:11 ` zhengbin (A) 2019-03-07 0:03 ` [PATCH 4/8] aio_poll(): get rid of weird refcounting Al Viro 2019-03-07 0:03 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro 2019-03-07 0:03 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_put() Al Viro 2019-03-07 0:03 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro 2019-03-07 0:03 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro 2019-03-07 0:23 ` [PATCH 1/8] aio: make sure file is pinned Linus Torvalds 2019-03-07 0:41 ` Al Viro 2019-03-07 0:48 ` Al Viro 2019-03-07 1:20 ` Al Viro 2019-03-07 1:30 ` Linus Torvalds 2019-03-08 3:36 ` Al Viro 2019-03-08 15:50 ` Christoph Hellwig 2019-03-10 7:06 ` Al Viro 2019-03-10 7:08 ` [PATCH 1/8] pin iocb through aio Al Viro 2019-03-10 7:08 ` [PATCH 2/8] keep io_event in aio_kiocb Al Viro 2019-03-11 19:43 ` Christoph Hellwig 2019-03-11 21:17 ` Al Viro 2019-03-10 7:08 ` [PATCH 3/8] aio: store event at final iocb_put() Al Viro 2019-03-11 19:44 ` Christoph Hellwig 2019-03-11 21:13 ` Al Viro 2019-03-11 22:52 ` Al Viro 2019-03-10 7:08 ` [PATCH 4/8] Fix aio_poll() races Al Viro 2019-03-11 19:58 ` Christoph Hellwig 2019-03-11 21:06 ` Al Viro 2019-03-12 19:18 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 5/8] make aio_read()/aio_write() return int Al Viro 2019-03-11 19:44 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 6/8] move dropping ->ki_eventfd into iocb_destroy() Al Viro 2019-03-11 19:46 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 7/8] deal with get_reqs_available() in aio_get_req() itself Al Viro 2019-03-11 19:46 ` Christoph Hellwig 2019-03-10 7:08 ` [PATCH 8/8] aio: move sanity checks and request allocation to io_submit_one() Al Viro 2019-03-11 19:48 ` Christoph Hellwig 2019-03-11 21:12 ` Al Viro 2019-03-11 19:41 ` [PATCH 1/8] pin iocb through aio Christoph Hellwig 2019-03-11 19:41 ` [PATCH 1/8] aio: make sure file is pinned Christoph Hellwig 2019-03-04 7:53 ` [PATCH] aio: prevent the final fput() in the middle of vfs_poll() (Re: KASAN: use-after-free Read in unix_dgram_poll) Dmitry Vyukov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).