* usercopy whitelist woe in scsi_sense_cache @ 2018-04-04 19:07 Oleksandr Natalenko 2018-04-04 20:21 ` Kees Cook 2018-04-04 20:32 ` Kees Cook 0 siblings, 2 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-04 19:07 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, linux-kernel Hi, Kees, David et al. With v4.16 I get the following dump while using smartctl: === [ 261.260617] ------------[ cut here ]------------ [ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)! [ 261.267672] WARNING: CPU: 2 PID: 27041 at mm/usercopy.c:81 usercopy_warn +0x7e/0xa0 [ 261.273624] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm iTCO_wdt ppdev irqbypass bochs_drm ttm iTCO_vendor_support drm_kms_helper drm psmouse input_leds led_class pcspkr joydev intel_agp parport_pc mousedev evdev syscopyarea intel_gtt i2c_i801 sysfillrect parport rtc_cmos sysimgblt qemu_fw_cfg mac_hid agpgart fb_sys_fops lpc_ich ip_tables x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt algif_skcipher af_alg dm_mod raid10 md_mod hid_generic usbhid hid sr_mod cdrom sd_mod crct10dif_pclmul uhci_hcd crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc serio_raw ahci atkbd aesni_intel xhci_pci aes_x86_64 ehci_pci libahci crypto_simd libps2 glue_helper xhci_hcd ehci_hcd libata cryptd usbcore usb_common i8042 serio virtio_scsi scsi_mod [ 261.300752] virtio_blk virtio_net virtio_pci virtio_ring virtio [ 261.305534] CPU: 2 PID: 27041 Comm: smartctl Not tainted 4.16.0-1-ARCH #1 [ 261.309936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 261.313668] RIP: 0010:usercopy_warn+0x7e/0xa0 [ 261.315653] RSP: 0018:ffffab5aca6cfc40 EFLAGS: 00010286 [ 261.320038] RAX: 0000000000000000 RBX: ffff8e8cd893605e RCX: 0000000000000001 [ 261.322215] RDX: 0000000080000001 RSI: 
ffffffff83eb4672 RDI: 00000000ffffffff [ 261.325680] RBP: 0000000000000016 R08: 0000000000000000 R09: 0000000000000282 [ 261.328462] R10: ffffffff83e896b1 R11: 0000000000000001 R12: 0000000000000001 [ 261.330584] R13: ffff8e8cd8936074 R14: ffff8e8cd893605e R15: 0000000000000016 [ 261.332748] FS: 00007f5a81bdf040(0000) GS:ffff8e8cdf700000(0000) knlGS: 0000000000000000 [ 261.337929] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 261.343128] CR2: 00007fff3a6790a8 CR3: 0000000018228006 CR4: 0000000000160ee0 [ 261.345976] Call Trace: [ 261.350620] __check_object_size+0x130/0x1a0 [ 261.355775] sg_io+0x269/0x3f0 [ 261.360729] ? path_lookupat+0xaa/0x1f0 [ 261.364027] ? current_time+0x18/0x70 [ 261.366684] scsi_cmd_ioctl+0x257/0x410 [ 261.369871] ? xfs_bmapi_read+0x1c3/0x340 [xfs] [ 261.372231] sd_ioctl+0xbf/0x1a0 [sd_mod] [ 261.375456] blkdev_ioctl+0x8ca/0x990 [ 261.381156] ? read_null+0x10/0x10 [ 261.384984] block_ioctl+0x39/0x40 [ 261.388739] do_vfs_ioctl+0xa4/0x630 [ 261.392624] ? vfs_write+0x164/0x1a0 [ 261.396658] SyS_ioctl+0x74/0x80 [ 261.399563] do_syscall_64+0x74/0x190 [ 261.402685] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 261.414154] RIP: 0033:0x7f5a8115ed87 [ 261.417184] RSP: 002b:00007fff3a65a458 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 261.427362] RAX: ffffffffffffffda RBX: 00007fff3a65a700 RCX: 00007f5a8115ed87 [ 261.432075] RDX: 00007fff3a65a470 RSI: 0000000000002285 RDI: 0000000000000003 [ 261.436200] RBP: 00007fff3a65a750 R08: 0000000000000010 R09: 0000000000000000 [ 261.446689] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b5481d9ce0 [ 261.450059] R13: 0000000000000000 R14: 000055b5481d3550 R15: 00000000000000da [ 261.455103] Code: 48 c7 c0 f1 af e5 83 48 0f 44 c2 41 50 51 41 51 48 89 f9 49 89 f1 4d 89 d8 4c 89 d2 48 89 c6 48 c7 c7 48 b0 e5 83 e8 32 a7 e3 ff <0f> 0b 48 83 c4 18 c3 48 c7 c6 44 0d e5 83 49 89 f1 49 89 f3 eb [ 261.467988] ---[ end trace 75034b3832c364e4 ]--- === I can easily reproduce it with a qemu VM and 2 virtual SCSI 
disks by calling smartctl in a loop and doing some usual background I/O. The warning is triggered within 3 minutes or so (not instantly). Initially, it was produced on my server after a kernel update (because disks are monitored with smartctl via Zabbix). Looks like the thing was introduced with 0afe76e88c57d91ef5697720aed380a339e3df70. Any idea how to deal with this please? If needed, I can provide any additional info, and also I'm happy/ready to test any proposed patches. Thanks. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 19:07 usercopy whitelist woe in scsi_sense_cache Oleksandr Natalenko @ 2018-04-04 20:21 ` Kees Cook 2018-04-04 20:44 ` Douglas Gilbert 2018-04-04 20:49 ` Oleksandr Natalenko 2018-04-04 20:32 ` Kees Cook 1 sibling, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-04 20:21 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > With v4.16 I get the following dump while using smartctl: > [...] > [ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure > attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)! > [...] > [ 261.345976] Call Trace: > [ 261.350620] __check_object_size+0x130/0x1a0 > [ 261.355775] sg_io+0x269/0x3f0 > [ 261.360729] ? path_lookupat+0xaa/0x1f0 > [ 261.364027] ? current_time+0x18/0x70 > [ 261.366684] scsi_cmd_ioctl+0x257/0x410 > [ 261.369871] ? xfs_bmapi_read+0x1c3/0x340 [xfs] > [ 261.372231] sd_ioctl+0xbf/0x1a0 [sd_mod] > [ 261.375456] blkdev_ioctl+0x8ca/0x990 > [ 261.381156] ? read_null+0x10/0x10 > [ 261.384984] block_ioctl+0x39/0x40 > [ 261.388739] do_vfs_ioctl+0xa4/0x630 > [ 261.392624] ? vfs_write+0x164/0x1a0 > [ 261.396658] SyS_ioctl+0x74/0x80 > [ 261.399563] do_syscall_64+0x74/0x190 > [ 261.402685] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 This is: sg_io+0x269/0x3f0: blk_complete_sghdr_rq at block/scsi_ioctl.c:280 (inlined by) sg_io at block/scsi_ioctl.c:376 which is: if (req->sense_len && hdr->sbp) { int len = min((unsigned int) hdr->mx_sb_len, req->sense_len); if (!copy_to_user(hdr->sbp, req->sense, len)) hdr->sb_len_wr = len; else ret = -EFAULT; } > [...] > I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling > smartctl in a loop and doing some usual background I/O. The warning is > triggered within 3 minutes or so (not instantly). 
> > Initially, it was produced on my server after a kernel update (because disks > are monitored with smartctl via Zabbix). > > Looks like the thing was introduced with > 0afe76e88c57d91ef5697720aed380a339e3df70. > > Any idea how to deal with this please? If needed, I can provide any additional > info, and also I'm happy/ready to test any proposed patches. Interesting, and a little confusing. So, what's strange here is that the scsi_sense_cache already has a full whitelist: kmem_cache_create_usercopy("scsi_sense_cache", SCSI_SENSE_BUFFERSIZE, 0, SLAB_HWCACHE_ALIGN, 0, SCSI_SENSE_BUFFERSIZE, NULL); Arg 2 is the buffer size, arg 5 is the whitelist offset (0), and the whitelist size (same as arg2). In other words, the entire buffer should be whitelisted. include/scsi/scsi_cmnd.h says: #define SCSI_SENSE_BUFFERSIZE 96 That means scsi_sense_cache should be 96 bytes in size? But a 22 byte read starting at offset 94 happened? That seems like a 20 byte read beyond the end of the SLUB object? Though if it were reading past the actual end of the object, I'd expect the hardened usercopy BUG (rather than the WARN) to kick in. Ah, it looks like /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes of actual allocation, so the 20 bytes doesn't strictly overlap another object (hence no BUG): /sys/kernel/slab/scsi_sense_cache# grep . object_size usersize slab_size object_size:96 usersize:96 slab_size:128 Ah, right, due to SLAB_HWCACHE_ALIGN, the allocation is rounded up to the next cache line size, so there's 32 bytes of padding to reach 128. James or Martin, is this over-read "expected" behavior? i.e. does the sense cache buffer usage ever pull the ugly trick of silently expanding its allocation into the space the slab allocator has given it? If not, this looks like a real bug. What I don't see is how req->sense is _not_ at offset 0 in the scsi_sense_cache object... -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 20:21 ` Kees Cook @ 2018-04-04 20:44 ` Douglas Gilbert 2018-04-04 20:49 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Douglas Gilbert @ 2018-04-04 20:44 UTC (permalink / raw) To: Kees Cook, Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On 2018-04-04 04:21 PM, Kees Cook wrote: > On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko > <oleksandr@natalenko.name> wrote: >> With v4.16 I get the following dump while using smartctl: >> [...] >> [ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure >> attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)! >> [...] >> [ 261.345976] Call Trace: >> [ 261.350620] __check_object_size+0x130/0x1a0 >> [ 261.355775] sg_io+0x269/0x3f0 >> [ 261.360729] ? path_lookupat+0xaa/0x1f0 >> [ 261.364027] ? current_time+0x18/0x70 >> [ 261.366684] scsi_cmd_ioctl+0x257/0x410 >> [ 261.369871] ? xfs_bmapi_read+0x1c3/0x340 [xfs] >> [ 261.372231] sd_ioctl+0xbf/0x1a0 [sd_mod] >> [ 261.375456] blkdev_ioctl+0x8ca/0x990 >> [ 261.381156] ? read_null+0x10/0x10 >> [ 261.384984] block_ioctl+0x39/0x40 >> [ 261.388739] do_vfs_ioctl+0xa4/0x630 >> [ 261.392624] ? vfs_write+0x164/0x1a0 >> [ 261.396658] SyS_ioctl+0x74/0x80 >> [ 261.399563] do_syscall_64+0x74/0x190 >> [ 261.402685] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > This is: > > sg_io+0x269/0x3f0: > blk_complete_sghdr_rq at block/scsi_ioctl.c:280 > (inlined by) sg_io at block/scsi_ioctl.c:376 > > which is: > > if (req->sense_len && hdr->sbp) { > int len = min((unsigned int) hdr->mx_sb_len, req->sense_len); > > if (!copy_to_user(hdr->sbp, req->sense, len)) > hdr->sb_len_wr = len; > else > ret = -EFAULT; > } > >> [...] >> I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling >> smartctl in a loop and doing some usual background I/O. The warning is >> triggered within 3 minutes or so (not instantly). 
>> >> Initially, it was produced on my server after a kernel update (because disks >> are monitored with smartctl via Zabbix). >> >> Looks like the thing was introduced with >> 0afe76e88c57d91ef5697720aed380a339e3df70. >> >> Any idea how to deal with this please? If needed, I can provide any additional >> info, and also I'm happy/ready to test any proposed patches. > > Interesting, and a little confusing. So, what's strange here is that > the scsi_sense_cache already has a full whitelist: > > kmem_cache_create_usercopy("scsi_sense_cache", > SCSI_SENSE_BUFFERSIZE, 0, SLAB_HWCACHE_ALIGN, > 0, SCSI_SENSE_BUFFERSIZE, NULL); > > Arg 2 is the buffer size, arg 5 is the whitelist offset (0), and the > whitelist size (same as arg2). In other words, the entire buffer > should be whitelisted. > > include/scsi/scsi_cmnd.h says: > > #define SCSI_SENSE_BUFFERSIZE 96 > > That means scsi_sense_cache should be 96 bytes in size? But a 22 byte > read starting at offset 94 happened? That seems like a 20 byte read > beyond the end of the SLUB object? Though if it were reading past the > actual end of the object, I'd expect the hardened usercopy BUG (rather > than the WARN) to kick in. Ah, it looks like > /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes > of actual allocation, so the 20 bytes doesn't strictly overlap another > object (hence no BUG): > > /sys/kernel/slab/scsi_sense_cache# grep . object_size usersize slab_size > object_size:96 > usersize:96 > slab_size:128 > > Ah, right, due to SLAB_HWCACHE_ALIGN, the allocation is rounded up to > the next cache line size, so there's 32 bytes of padding to reach 128. > > James or Martin, is this over-read "expected" behavior? i.e. does the > sense cache buffer usage ever pull the ugly trick of silently > expanding its allocation into the space the slab allocator has given > it? If not, this looks like a real bug. > > What I don't see is how req->sense is _not_ at offset 0 in the > scsi_sense_cache object... 
Looking at the smartctl SCSI code it pulls 32 byte sense buffers. Can't see 22 anywhere relevant in its code. There are two types of sense: fixed and descriptor: with fixed you seldom need more than 18 bytes (but it can only represent 32 bit LBAs). The other type has a header and 0 or more variable length descriptors. If decoding of descriptor sense went wrong you might end up at offset 94. But not with smartctl .... Doug Gilbert ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 20:21 ` Kees Cook 2018-04-04 20:44 ` Douglas Gilbert @ 2018-04-04 20:49 ` Oleksandr Natalenko 2018-04-04 21:25 ` Kees Cook 1 sibling, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-04 20:49 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML Hi. On středa 4. dubna 2018 22:21:53 CEST Kees Cook wrote: > ... > That means scsi_sense_cache should be 96 bytes in size? But a 22 byte > read starting at offset 94 happened? That seems like a 20 byte read > beyond the end of the SLUB object? Though if it were reading past the > actual end of the object, I'd expect the hardened usercopy BUG (rather > than the WARN) to kick in. Ah, it looks like > /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes > of actual allocation, so the 20 bytes doesn't strictly overlap another > object (hence no BUG): > ... Actually, I can trigger a BUG too: [ 129.259213] usercopy: Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 119, size 22)! [ 129.265167] ------------[ cut here ]------------ [ 129.267579] kernel BUG at mm/usercopy.c:100! And also offset can be different, as you may see: [ 55.993224] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)! [ 55.998678] WARNING: CPU: 0 PID: 1305 at mm/usercopy.c:81 usercopy_warn +0x7e/0xa0 It looks like only the size stays the same. > Can you send me your .config? What SCSI drivers are you using in the > VM and on the real server? This is an Arch kernel with a config available here [1]. For both server and VM "lspci -vv" shows "ahci" in use. Is this what you are asking for? > Are you able to see what ioctl()s smartctl is issuing? I'll try to > reproduce this on my end... As per [2], strace shows "SG_IO" requests. Is this detailed enough? Thanks for looking into it. 
Regards, Oleksandr [1] https://git.archlinux.org/svntogit/packages.git/plain/trunk/config?h=packages/linux&id=d7625be23f83416491d202d5cea96e5a871fb216 [2] https://gist.github.com/6f58f8891468aeba1ab2cc9f45668735 ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 20:49 ` Oleksandr Natalenko @ 2018-04-04 21:25 ` Kees Cook 2018-04-04 21:34 ` Oleksandr Natalenko 2018-04-05 9:56 ` Oleksandr Natalenko 0 siblings, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-04 21:25 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On Wed, Apr 4, 2018 at 1:49 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > Hi. > > On středa 4. dubna 2018 22:21:53 CEST Kees Cook wrote: >> >> ... >> That means scsi_sense_cache should be 96 bytes in size? But a 22 byte >> read starting at offset 94 happened? That seems like a 20 byte read >> beyond the end of the SLUB object? Though if it were reading past the >> actual end of the object, I'd expect the hardened usercopy BUG (rather >> than the WARN) to kick in. Ah, it looks like >> /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes >> of actual allocation, so the 20 bytes doesn't strictly overlap another >> object (hence no BUG): >> ... > > > Actually, I can trigger a BUG too: > > [ 129.259213] usercopy: Kernel memory exposure attempt detected from SLUB > object 'scsi_sense_cache' (offset 119, size 22)! Wow, yeah, that's totally outside the slub object_size. How did you trigger this? Just luck or something specific? > [ 129.265167] ------------[ cut here ]------------ > [ 129.267579] kernel BUG at mm/usercopy.c:100! > > And also offset can be different, as you may see: > > [ 55.993224] Bad or missing usercopy whitelist? Kernel memory exposure > attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)! > [ 55.998678] WARNING: CPU: 0 PID: 1305 at mm/usercopy.c:81 usercopy_warn > +0x7e/0xa0 > > It looks like only the size stays the same. I'd really like to understand how the buffer position can be changing... I'd expect that to break all kinds of things (i.e. free()ing the slab later would break too...) 
>> Can you send me your .config? What SCSI drivers are you using in the >> VM and on the real server? > > This is an Arch kernel with a config available here [1]. > > For both server and VM "lspci -vv" shows "ahci" in use. Is this what you are > asking for? I think so, yeah. >> Are you able to see what ioctl()s smartctl is issuing? I'll try to >> reproduce this on my end... > > As per [2], strace shows "SG_IO" requests. Is this detailed enough? That's useful, yeah. I'll try Douglas's suggestion of "smartctl -r scsiioctl,3 ..." too. > Thanks for looking into it. Thanks for the report! I hope someone more familiar with sg_io() can help explain the changing buffer offset... :P -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 21:25 ` Kees Cook @ 2018-04-04 21:34 ` Oleksandr Natalenko 2018-04-05 9:56 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-04 21:34 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, keescook Hi. 04.04.2018 23:25, Kees Cook wrote: >> Actually, I can trigger a BUG too: >> >> [ 129.259213] usercopy: Kernel memory exposure attempt detected from >> SLUB >> object 'scsi_sense_cache' (offset 119, size 22)! > > Wow, yeah, that's totally outside the slub object_size. How did you > trigger this? Just luck or something specific? Just luck, I suppose. It usually comes after the first warning if you wait long enough (maybe, a couple of extra minutes). To give you an idea regarding variety of offsets, I've summarised kernel log from the server: $ sudo journalctl -kb | grep "Kernel memory exposure attempt detected" | grep -oE 'offset [0-9]+, size [0-9]+' | sort | uniq -c 9 offset 107, size 22 6 offset 108, size 22 8 offset 109, size 22 7 offset 110, size 22 5 offset 111, size 22 5 offset 112, size 22 2 offset 113, size 22 2 offset 114, size 22 1 offset 115, size 22 1 offset 116, size 22 1 offset 119, size 22 1 offset 85, size 22 > I'd really like to understand how the buffer position can be > changing... I'd expect that to break all kinds of things (i.e. > free()ing the slab later would break too...) I haven't checked the code yet, but the first thing that comes to my mind is some uninitialised variable. Just guessing here, though. > Thanks for the report! I hope someone more familiar with sg_io() can > help explain the changing buffer offset... :P Hopefully, SCSI people are Cc'ed here properly… Thanks! Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 21:25 ` Kees Cook 2018-04-04 21:34 ` Oleksandr Natalenko @ 2018-04-05 9:56 ` Oleksandr Natalenko 2018-04-05 14:21 ` Kees Cook 1 sibling, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-05 9:56 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, keescook Hi. 04.04.2018 23:25, Kees Cook wrote: > Thanks for the report! I hope someone more familiar with sg_io() can > help explain the changing buffer offset... :P Also, FYI, I kept the server running with smartctl periodically invoked, and it was still triggering BUGs, however, I consider them to be more or less harmless until the server got stuck with high I/O wait this morning after next smartctl invocation. So, it isn't harmless, it seems… It could be unrelated, of course, since the journal didn't give me any hint (or a stack trace) on what happened, thus I'll monitor how things behave without smartctl too. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-05 9:56 ` Oleksandr Natalenko @ 2018-04-05 14:21 ` Kees Cook 2018-04-05 14:32 ` Oleksandr Natalenko 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-05 14:21 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On Thu, Apr 5, 2018 at 2:56 AM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > Hi. > > 04.04.2018 23:25, Kees Cook wrote: >> >> Thanks for the report! I hope someone more familiar with sg_io() can >> help explain the changing buffer offset... :P > > > Also, FYI, I kept the server running with smartctl periodically invoked, and > it was still triggering BUGs, however, I consider them to be more or less > harmless until the server got stuck with high I/O wait this morning after > next smartctl invocation. So, it isn't harmless, it seems… > > It could be unrelated, of course, since the journal didn't give me any hint > (or a stack trace) on what happened, thus I'll monitor how things behave > without smartctl too. I had a VM running over night with: [1] Running while :; do smartctl -a /dev/sda > /dev/null; done & [2]- Running while :; do ls --color=auto -lR / > /dev/null 2> /dev/null; done & [3]+ Running while :; do sleep $(( $RANDOM % 100 )); sync; echo 3 > /proc/sys/vm/drop_caches; done & and I haven't seen the issue. :( FWIW, I'm using the ahci qemu driver: -drive file=disk-image.raw,if=none,id=drive0,format=raw \ -device ahci,id=bus0 \ -device ide-drive,bus=bus0.0,drive=drive0 Does this match your qemu instance? -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-05 14:21 ` Kees Cook @ 2018-04-05 14:32 ` Oleksandr Natalenko 2018-04-05 14:33 ` Oleksandr Natalenko 0 siblings, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-05 14:32 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, keescook Hi. 05.04.2018 16:21, Kees Cook wrote: > I had a VM running over night with: > > [1] Running while :; do > smartctl -a /dev/sda > /dev/null; > done & > [2]- Running while :; do > ls --color=auto -lR / > /dev/null 2> /dev/null; > done & > [3]+ Running while :; do > sleep $(( $RANDOM % 100 )); sync; echo 3 > > /proc/sys/vm/drop_caches; > done & > > and I haven't seen the issue. :( > > FWIW, I'm using the ahci qemu driver: > > -drive file=disk-image.raw,if=none,id=drive0,format=raw \ > -device ahci,id=bus0 \ > -device ide-drive,bus=bus0.0,drive=drive0 > > Does this match your qemu instance? Well, not really. I just pass 2 raw disks as "-hda sda.img -hdb sda.img" (it is a playground VM for me with RAID10, LVM and LUKS inside, but I hope this doesn't matter). Does passing "-hda" differ from your approach? Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-05 14:32 ` Oleksandr Natalenko @ 2018-04-05 14:33 ` Oleksandr Natalenko [not found] ` <CAGXu5jL8oLV2xvjBVYv_SNXr74LdgpXEmU7K+cLYpD7jh2chgw@mail.gmail.com> 0 siblings, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-05 14:33 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, keescook 05.04.2018 16:32, Oleksandr Natalenko wrote: > "-hda sda.img -hdb sda.img" "-hda sda.img -hdb sdb.img", of course, I don't pass the same disk twice ☺ ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache [not found] ` <CAGXu5jL8oLV2xvjBVYv_SNXr74LdgpXEmU7K+cLYpD7jh2chgw@mail.gmail.com> @ 2018-04-05 18:52 ` Kees Cook 2018-04-06 6:21 ` Oleksandr Natalenko 2018-04-08 19:07 ` Oleksandr Natalenko 0 siblings, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-05 18:52 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML [forcing non-HTML and resending...] On Thu, Apr 5, 2018 at 7:33 AM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > > 05.04.2018 16:32, Oleksandr Natalenko wrote: >> >> "-hda sda.img -hdb sda.img" > > > "-hda sda.img -hdb sdb.img", of course, I don't pass the same disk twice Okay. My qemu gets mad about that and wants the format=raw argument, so I'm using: -drive file=sda.img,format=raw \ -drive file=sdb.img,format=raw \ How are you running your smartctl? I'm doing this now: [1] Running while :; do ( smartctl -a /dev/sda; smartctl -a /dev/sdb ) > /dev/null; done & I assume I'm missing something from your .config, but since I don't boot with an initramfs, I had to tweak it a bit. I'll try again... -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-05 18:52 ` Kees Cook @ 2018-04-06 6:21 ` Oleksandr Natalenko 2018-04-08 19:07 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-06 6:21 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, keescook Hi. 05.04.2018 20:52, Kees Cook wrote: > Okay. My qemu gets mad about that and wants the format=raw argument, > so I'm using: > > -drive file=sda.img,format=raw \ > -drive file=sdb.img,format=raw \ > > How are you running your smartctl? I'm doing this now: > > [1] Running while :; do > ( smartctl -a /dev/sda; smartctl -a /dev/sdb ) > /dev/null; > done & Yes, so do I. > I assume I'm missing something from your .config, but since I don't > boot with an initramfs, I had to tweak it a bit. I'll try again... Let me, maybe, describe, what both the VM and the server have in common: 1. have 4 CPUs 2. are EFI-based 3. use blk-mq with BFQ scheduler (it is set up via udev rule during boot) 4. have zswap enabled 5. have 2 SATA disks with RAID10 on top of it (layout f2) 6. have LUKS on top of the RAID, and LVM on top of the LUKS VM has machine type "q35", BTW. Do you think something of what's mentioned above is relevant for the code path in question? Thanks. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-05 18:52 ` Kees Cook 2018-04-06 6:21 ` Oleksandr Natalenko @ 2018-04-08 19:07 ` Oleksandr Natalenko 2018-04-09 9:35 ` Christoph Hellwig 2018-04-09 18:32 ` Kees Cook 1 sibling, 2 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-08 19:07 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente Hi. Cc'ing linux-block people (mainly, Christoph) too because of 17cb960f29c2. Also, duplicating the initial statement for them. With v4.16 (and now with v4.16.1) it is possible to trigger usercopy whitelist warning and/or bug while doing smartctl on a SATA disk having blk-mq and BFQ enabled. The warning looks like this: === [ 574.997022] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)! [ 575.017332] WARNING: CPU: 0 PID: 32436 at mm/usercopy.c:81 usercopy_warn +0x7d/0xa0 [ 575.025262] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm bochs_drm iTCO_wdt ttm irqbypass iTCO_vendor_support ppdev drm_kms_helper psmouse parport_pc i2c_i801 joydev pcspkr drm parport rtc_cmos mousedev input_leds led_class intel_agp evdev syscopyarea qemu_fw_cfg intel_gtt sysfillrect mac_hid lpc_ich sysimgblt agpgart fb_sys_fops ip_tables x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt algif_skcipher af_alg hid_generic usbhid hid dm_mod raid10 md_mod sr_mod sd_mod cdrom uhci_hcd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc serio_raw xhci_pci ahci atkbd libps2 ehci_pci xhci_hcd aesni_intel libahci aes_x86_64 ehci_hcd crypto_simd glue_helper cryptd libata usbcore usb_common i8042 serio virtio_scsi scsi_mod [ 575.068775] virtio_blk virtio_net virtio_pci virtio_ring virtio [ 575.073935] CPU: 0 PID: 32436 
Comm: smartctl Not tainted 4.16.0-pf2 #1 [ 575.078078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 575.082451] RIP: 0010:usercopy_warn+0x7d/0xa0 [ 575.086223] RSP: 0018:ffff9ca84aee7c40 EFLAGS: 00010286 [ 575.097637] RAX: 0000000000000000 RBX: ffff95199d68304c RCX: 0000000000000001 [ 575.101471] RDX: 0000000000000001 RSI: ffffffffaeeb050a RDI: 00000000ffffffff [ 575.105939] RBP: 0000000000000016 R08: 0000000000000000 R09: 000000000000028b [ 575.110370] R10: ffffffffaee854e9 R11: 0000000000000001 R12: 0000000000000001 [ 575.113269] R13: ffff95199d683062 R14: ffff95199d68304c R15: 0000000000000016 [ 575.116132] FS: 00007f993d405040(0000) GS:ffff95199f600000(0000) knlGS: 0000000000000000 [ 575.119285] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 575.129619] CR2: 00007ffe2390f0a8 CR3: 000000001d774004 CR4: 0000000000160ef0 [ 575.133976] Call Trace: [ 575.136311] __check_object_size+0x12f/0x1a0 [ 575.139576] sg_io+0x269/0x3f0 [ 575.142000] ? path_lookupat+0xaa/0x1f0 [ 575.144521] ? current_time+0x18/0x70 [ 575.147006] scsi_cmd_ioctl+0x257/0x410 [ 575.149782] ? xfs_bmapi_read+0x1c3/0x340 [xfs] [ 575.161441] sd_ioctl+0xbf/0x1a0 [sd_mod] [ 575.165036] blkdev_ioctl+0x8ca/0x990 [ 575.168291] ? read_null+0x10/0x10 [ 575.171638] block_ioctl+0x39/0x40 [ 575.174998] do_vfs_ioctl+0xa4/0x630 [ 575.178261] ? 
vfs_write+0x164/0x1a0 [ 575.181410] SyS_ioctl+0x74/0x80 [ 575.190904] do_syscall_64+0x74/0x190 [ 575.195200] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 575.199267] RIP: 0033:0x7f993c984d87 [ 575.201350] RSP: 002b:00007ffe238aeed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 575.204386] RAX: ffffffffffffffda RBX: 00007ffe238af180 RCX: 00007f993c984d87 [ 575.208349] RDX: 00007ffe238aeef0 RSI: 0000000000002285 RDI: 0000000000000003 [ 575.211254] RBP: 00007ffe238af1d0 R08: 0000000000000010 R09: 0000000000000000 [ 575.220511] R10: 0000000000000000 R11: 0000000000000246 R12: 00005637ec8e9ce0 [ 575.225238] R13: 0000000000000000 R14: 00005637ec8e3550 R15: 00000000000000da [ 575.230056] Code: 6c e4 ae 41 51 48 c7 c0 19 6e e5 ae 49 89 f1 48 0f 44 c2 48 89 f9 4d 89 d8 4c 89 d2 48 c7 c7 70 6e e5 ae 48 89 c6 e8 c3 5c e5 ff <0f> 0b 48 83 c4 18 c3 48 c7 c6 04 cb e4 ae 49 89 f1 49 89 f3 eb [ 575.239027] ---[ end trace 6e3293933bdd4761 ]--- === Usually, the warning is triggered first, and all the subsequent printouts are bugs because offset gets too big so that it doesn't fit into a real SLAB object size: [ 1687.609889] usercopy: Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 107, size 22)! [ 1687.614197] ------------[ cut here ]------------ [ 1687.615993] kernel BUG at mm/usercopy.c:100! To give you an idea regarding variety of offsets, I've summarised the kernel log from my server: $ sudo journalctl -kb | grep "Kernel memory exposure attempt detected" | grep -oE 'offset [0-9]+, size [0-9]+' | sort | uniq -c 9 offset 107, size 22 6 offset 108, size 22 8 offset 109, size 22 7 offset 110, size 22 5 offset 111, size 22 5 offset 112, size 22 2 offset 113, size 22 2 offset 114, size 22 1 offset 115, size 22 1 offset 116, size 22 1 offset 119, size 22 1 offset 85, size 22 So far, I wasn't able to trigger this with mq-deadline (or without blk-mq). Maybe, this has something to do with blk-mq+BFQ re-queuing, or it's just me not being persistent enough. 
It looks like this code path was completely rewritten in 17cb960f29c2, but that was merged for the upcoming v4.17 only, and thus I haven't tried it yet. Kees took a brief look at it already: [1]. This is what smartctl does [2] (just a usual strace capture taken when the bug is not triggered). Christoph, do you have any idea why this can happen? Thanks. Regards, Oleksandr [1] https://marc.info/?l=linux-scsi&m=152287333013845&w=2 [2] https://gist.github.com/pfactum/6f58f8891468aeba1ab2cc9f45668735 ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-08 19:07 ` Oleksandr Natalenko @ 2018-04-09 9:35 ` Christoph Hellwig 2018-04-09 15:54 ` Oleksandr Natalenko 2018-04-09 18:32 ` Kees Cook 1 sibling, 1 reply; 67+ messages in thread From: Christoph Hellwig @ 2018-04-09 9:35 UTC (permalink / raw) To: Oleksandr Natalenko Cc: Kees Cook, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente I really can't make sense of that report. And I'm also curious why you think 17cb960f29c2 should change anything for that code path. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-09 9:35 ` Christoph Hellwig @ 2018-04-09 15:54 ` Oleksandr Natalenko 0 siblings, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-09 15:54 UTC (permalink / raw) To: Christoph Hellwig Cc: Kees Cook, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente Hi. 09.04.2018 11:35, Christoph Hellwig wrote: > I really can't make sense of that report. Sorry, I have nothing to add there so far. I just see the symptom of something going wrong in the ioctl code path invoked by smartctl, but I have no idea what the minimal environment to reproduce it is. I'll try to collect more info. > And I'm also curious why > you think 17cb960f29c2 should change anything for that code path. Maybe I've just misread that code. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-08 19:07 ` Oleksandr Natalenko 2018-04-09 9:35 ` Christoph Hellwig @ 2018-04-09 18:32 ` Kees Cook 2018-04-09 19:02 ` Oleksandr Natalenko 1 sibling, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-09 18:32 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Sun, Apr 8, 2018 at 12:07 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > So far, I wasn't able to trigger this with mq-deadline (or without blk-mq). > Maybe, this has something to do with blk-mq+BFQ re-queuing, or it's just me > not being persistent enough. Ah, this detail I didn't have. I've changed my environment to build with: CONFIG_BLK_MQ_PCI=y CONFIG_BLK_MQ_VIRTIO=y CONFIG_IOSCHED_BFQ=y boot with scsi_mod.use_blk_mq=1 and select BFQ in the scheduler: # cat /sys/block/sd?/queue/scheduler mq-deadline kyber [bfq] none mq-deadline kyber [bfq] none Even with this, I'm not seeing anything yet... > It looks like this code path was re-written completely with 17cb960f29c2, but > it went merged for the upcoming v4.17 only, and thus I haven't tried it yet. > > Kees took a brief look at it already: [1]. This is what smartctl does [2] > (just a usual strace capture when the bug is not triggered). > > Christoph, do you have some idea on why this can happen? > > Thanks. > > Regards, > Oleksandr > > [1] https://marc.info/?l=linux-scsi&m=152287333013845&w=2 > [2] https://gist.github.com/pfactum/6f58f8891468aeba1ab2cc9f45668735 The thing I can't figure out is how req->sense is slipping forward in (and even beyond!) the allocation. -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-09 18:32 ` Kees Cook @ 2018-04-09 19:02 ` Oleksandr Natalenko 2018-04-09 20:30 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-09 19:02 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente, keescook Hi. (fancy details for linux-block and BFQ people go below) 09.04.2018 20:32, Kees Cook wrote: > Ah, this detail I didn't have. I've changed my environment to > > build with: > > CONFIG_BLK_MQ_PCI=y > CONFIG_BLK_MQ_VIRTIO=y > CONFIG_IOSCHED_BFQ=y > > boot with scsi_mod.use_blk_mq=1 > > and select BFQ in the scheduler: > > # cat /sys/block/sd?/queue/scheduler > mq-deadline kyber [bfq] none > mq-deadline kyber [bfq] none > > Even with this, I'm not seeing anything yet... Thanks for looking into it anyway. I was experimenting today a little bit, and for me it looks like setting queue_depth and nr_requests to minimal values speeds up the reproducing. Could you please try it too? Something like: echo 1 | tee /sys/block/sd*/queue/nr_requests echo 1 | tee /sys/block/sd*/device/queue_depth Also, now I have a more solid proof that this is related to I/O scheduling. I was hammering my VM, and after a couple of usercopy warnings/bugs I can reliably trigger I/O hang. I was able to obtain the stack traces of tasks in D state. Listing them here below. dmcrypt_write: === [ 1615.409622] dmcrypt_write D 0 236 2 0x80000000 [ 1615.413422] Call Trace: [ 1615.415428] ? __schedule+0x336/0xf40 [ 1615.417824] ? blk_mq_sched_dispatch_requests+0x117/0x190 [ 1615.421423] ? __sbitmap_get_word+0x2a/0x80 [ 1615.424202] schedule+0x32/0xc0 [ 1615.426521] io_schedule+0x12/0x40 [ 1615.432414] blk_mq_get_tag+0x181/0x2a0 [ 1615.434881] ? elv_merge+0x79/0xe0 [ 1615.437447] ? 
wait_woken+0x80/0x80 [ 1615.439553] blk_mq_get_request+0xf9/0x400 [ 1615.444653] blk_mq_make_request+0x10b/0x640 [ 1615.448025] generic_make_request+0x124/0x2d0 [ 1615.450716] ? raid10_unplug+0xfb/0x180 [raid10] [ 1615.454069] raid10_unplug+0xfb/0x180 [raid10] [ 1615.456729] blk_flush_plug_list+0xc1/0x250 [ 1615.460276] blk_finish_plug+0x27/0x40 [ 1615.464103] dmcrypt_write+0x233/0x240 [dm_crypt] [ 1615.467443] ? wake_up_process+0x20/0x20 [ 1615.470845] ? crypt_iv_essiv_dtr+0x60/0x60 [dm_crypt] [ 1615.475272] ? kthread+0x113/0x130 [ 1615.477652] kthread+0x113/0x130 [ 1615.480567] ? kthread_create_on_node+0x70/0x70 [ 1615.483268] ret_from_fork+0x35/0x40 === One of XFS threads, too: === [ 1616.133298] xfsaild/dm-7 D 0 316 2 0x80000000 [ 1616.136679] Call Trace: [ 1616.138845] ? __schedule+0x336/0xf40 [ 1616.141581] ? preempt_count_add+0x68/0xa0 [ 1616.147214] ? _raw_spin_unlock+0x16/0x30 [ 1616.149813] ? _raw_spin_unlock_irqrestore+0x20/0x40 [ 1616.152478] ? try_to_del_timer_sync+0x4d/0x80 [ 1616.154734] schedule+0x32/0xc0 [ 1616.156579] _xfs_log_force+0x146/0x290 [xfs] [ 1616.159322] ? wake_up_process+0x20/0x20 [ 1616.162175] xfsaild+0x1a9/0x820 [xfs] [ 1616.164695] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] [ 1616.167567] ? kthread+0x113/0x130 [ 1616.169722] kthread+0x113/0x130 [ 1616.171908] ? kthread_create_on_node+0x70/0x70 [ 1616.174073] ? do_syscall_64+0x74/0x190 [ 1616.179008] ? SyS_exit_group+0x10/0x10 [ 1616.182306] ret_from_fork+0x35/0x40 === journald is another victim: === [ 1616.184343] systemd-journal D 0 354 1 0x00000104 [ 1616.187282] Call Trace: [ 1616.189464] ? __schedule+0x336/0xf40 [ 1616.191781] ? call_function_single_interrupt+0xa/0x20 [ 1616.194788] ? _raw_spin_lock_irqsave+0x25/0x50 [ 1616.197592] schedule+0x32/0xc0 [ 1616.200171] schedule_timeout+0x202/0x470 [ 1616.202851] ? preempt_count_add+0x49/0xa0 [ 1616.206227] wait_for_common+0xbb/0x180 [ 1616.209877] ? wake_up_process+0x20/0x20 [ 1616.212511] do_coredump+0x335/0xea0 [ 1616.214861] ? 
schedule+0x3c/0xc0 [ 1616.216775] ? futex_wait_queue_me+0xbb/0x110 [ 1616.218894] ? _raw_spin_unlock+0x16/0x30 [ 1616.220868] ? futex_wait+0x143/0x240 [ 1616.223450] get_signal+0x250/0x5c0 [ 1616.225965] do_signal+0x36/0x610 [ 1616.228246] ? __seccomp_filter+0x3b/0x260 [ 1616.231000] ? __check_object_size+0x9f/0x1a0 [ 1616.233716] exit_to_usermode_loop+0x85/0xa0 [ 1616.238413] do_syscall_64+0x18b/0x190 [ 1616.240798] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 1616.244401] RIP: 0033:0x7f78fc53e45d [ 1616.246690] RSP: 002b:00007ffd40330d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca [ 1616.251199] RAX: fffffffffffffe00 RBX: 00007f78f7806700 RCX: 00007f78fc53e45d [ 1616.254817] RDX: 00000000000004cd RSI: 0000000000000000 RDI: 00007f78f78069d0 [ 1616.258410] RBP: 00007ffd40330d20 R08: 00000000000000ca R09: 00007f78f78069d0 [ 1616.261813] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 1616.265065] R13: 0000000000000000 R14: 00007f78fc95e8c0 R15: 00007f78f7806d28 [ 1616.272861] journal-offline D 0 1229 1 0x00000104 [ 1616.275856] Call Trace: [ 1616.277396] ? __schedule+0x336/0xf40 [ 1616.279258] ? release_pages+0x192/0x3d0 [ 1616.282871] schedule+0x32/0xc0 [ 1616.285218] io_schedule+0x12/0x40 [ 1616.287267] wait_on_page_bit+0xea/0x130 [ 1616.291084] ? add_to_page_cache_lru+0xe0/0xe0 [ 1616.293898] __filemap_fdatawait_range+0xbb/0x110 [ 1616.297391] ? xen_swiotlb_init+0x85/0x4d0 [ 1616.303285] file_write_and_wait_range+0x76/0xa0 [ 1616.306313] xfs_file_fsync+0x5d/0x240 [xfs] [ 1616.309196] ? 
syscall_trace_enter+0xd8/0x2d0 [ 1616.312193] do_fsync+0x38/0x60 [ 1616.314654] SyS_fsync+0xc/0x10 [ 1616.318686] do_syscall_64+0x74/0x190 [ 1616.321519] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 1616.324045] RIP: 0033:0x7f78fc5470ec [ 1616.326257] RSP: 002b:00007f78f7805d30 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 1616.333535] RAX: ffffffffffffffda RBX: 00005597340357b0 RCX: 00007f78fc5470ec [ 1616.337057] RDX: 0000000000000000 RSI: 00007f78fbea3811 RDI: 0000000000000013 [ 1616.341176] RBP: 00007f78fbea4c60 R08: 00007f78f7806700 R09: 00007f78f7806700 [ 1616.345029] R10: 0000000000000811 R11: 0000000000000293 R12: 0000000000000002 [ 1616.349279] R13: 00007ffd40330e8f R14: 00007ffd40330e90 R15: 0000000000000016 [ 1876.568670] systemd-journal D 0 14892 1 0x00000100 [ 1876.573395] Call Trace: [ 1876.575559] ? __schedule+0x336/0xf40 [ 1876.578005] ? memcg_kmem_charge+0xad/0xf0 [ 1876.580797] ? lock_page_memcg+0x11/0x80 [ 1876.583674] schedule+0x32/0xc0 [ 1876.585516] io_schedule+0x12/0x40 [ 1876.590223] wait_on_page_bit+0xea/0x130 [ 1876.591840] ? add_to_page_cache_lru+0xe0/0xe0 [ 1876.593554] __filemap_fdatawait_range+0xbb/0x110 [ 1876.595340] ? store_online+0x34/0xb0 [ 1876.596913] ? __filemap_fdatawrite_range+0xc1/0x100 [ 1876.599338] ? __filemap_fdatawrite_range+0xcf/0x100 [ 1876.601447] file_write_and_wait_range+0x76/0xa0 [ 1876.604356] xfs_file_fsync+0x5d/0x240 [xfs] [ 1876.606599] ? 
syscall_trace_enter+0xd8/0x2d0 [ 1876.608753] do_fsync+0x38/0x60 [ 1876.610952] SyS_fsync+0xc/0x10 [ 1876.612442] do_syscall_64+0x74/0x190 [ 1876.615383] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 1876.622030] RIP: 0033:0x7fe056a840c4 [ 1876.624698] RSP: 002b:00007ffe7908ce88 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 1876.628290] RAX: ffffffffffffffda RBX: 000055ad65d1f9c0 RCX: 00007fe056a840c4 [ 1876.630922] RDX: 00007fe0563e1c7c RSI: 0000000000000003 RDI: 0000000000000017 [ 1876.634244] RBP: 00007fe0563e1c60 R08: 0000000000000005 R09: 0000000000000002 [ 1876.637153] R10: 000055ad65d1a010 R11: 0000000000000246 R12: 0000000000000002 [ 1876.640099] R13: 00000000fffffff0 R14: 0000000000000002 R15: 00007ffe7908d2e0 === And what's more important, I can make smartctl hang too! === [ 1876.466385] smartctl D 0 14891 495 0x00000000 [ 1876.469100] Call Trace: [ 1876.471267] ? __schedule+0x336/0xf40 [ 1876.473541] ? blk_mq_sched_dispatch_requests+0x117/0x190 [ 1876.476331] ? __sbitmap_get_word+0x2a/0x80 [ 1876.478784] ? sbitmap_get_shallow+0x5c/0xa0 [ 1876.481353] schedule+0x32/0xc0 [ 1876.484412] io_schedule+0x12/0x40 [ 1876.486892] blk_mq_get_tag+0x181/0x2a0 [ 1876.489448] ? wait_woken+0x80/0x80 [ 1876.492018] blk_mq_get_request+0xf9/0x400 [ 1876.497734] ? lookup_fast+0xc8/0x2d0 [ 1876.501050] blk_mq_alloc_request+0x7e/0xe0 [ 1876.503705] blk_get_request_flags+0x40/0x160 [ 1876.506306] sg_io+0x9d/0x3f0 [ 1876.507920] ? path_lookupat+0xaa/0x1f0 [ 1876.509750] ? current_time+0x18/0x70 [ 1876.511496] scsi_cmd_ioctl+0x257/0x410 [ 1876.513151] ? xfs_bmapi_read+0x1c3/0x340 [xfs] [ 1876.515355] sd_ioctl+0xbf/0x1a0 [sd_mod] [ 1876.517638] blkdev_ioctl+0x8ca/0x990 [ 1876.519219] ? read_null+0x10/0x10 [ 1876.520849] block_ioctl+0x39/0x40 [ 1876.522368] do_vfs_ioctl+0xa4/0x630 [ 1876.524490] ? 
vfs_write+0x164/0x1a0 [ 1876.526841] SyS_ioctl+0x74/0x80 [ 1876.528328] do_syscall_64+0x74/0x190 [ 1876.529886] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 1876.537471] RIP: 0033:0x7f4f232c8d87 [ 1876.540388] RSP: 002b:00007ffd0fe530a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1876.545738] RAX: ffffffffffffffda RBX: 00007ffd0fe53350 RCX: 00007f4f232c8d87 [ 1876.548950] RDX: 00007ffd0fe530c0 RSI: 0000000000002285 RDI: 0000000000000003 [ 1876.552912] RBP: 00007ffd0fe533a0 R08: 0000000000000010 R09: 00007ffd0fe53980 [ 1876.559750] R10: 0000000000000200 R11: 0000000000000246 R12: 000055e98902dce0 [ 1876.563829] R13: 0000000000000000 R14: 000055e989027550 R15: 00000000000000d0 === Workqueues, also from alt+sysrq+t dump: === [ 1876.676508] Showing busy workqueues and worker pools: [ 1876.681916] workqueue events_power_efficient: flags=0x82 [ 1876.685388] pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256 [ 1876.688962] pending: fb_flashcursor, neigh_periodic_work [ 1876.694414] workqueue events_freezable_power_: flags=0x86 [ 1876.699262] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256 [ 1876.703777] pending: disk_events_workfn [ 1876.707279] workqueue dm-thin: flags=0xa000a [ 1876.709053] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/1 [ 1876.715598] pending: do_waker [dm_thin_pool] === So far, I can reproduce it only with blk-mq *and* BFQ enabled. I've switched another machine (my server) to Kyber, and I do not observe the issue (so far, at least). Paolo, Jens? I don't know whom to address, actually :(. Can someone give me an idea on this please? > The thing I can't figure out is how req->sense is slipping forward in > (and even beyond!) the allocation. *cough* race condition *cough* ☺ Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-09 19:02 ` Oleksandr Natalenko @ 2018-04-09 20:30 ` Kees Cook 2018-04-09 23:03 ` Kees Cook 2018-04-10 6:35 ` Oleksandr Natalenko 0 siblings, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-09 20:30 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Mon, Apr 9, 2018 at 12:02 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > > Hi. > > (fancy details for linux-block and BFQ people go below) > > 09.04.2018 20:32, Kees Cook wrote: >> >> Ah, this detail I didn't have. I've changed my environment to >> >> build with: >> >> CONFIG_BLK_MQ_PCI=y >> CONFIG_BLK_MQ_VIRTIO=y >> CONFIG_IOSCHED_BFQ=y >> >> boot with scsi_mod.use_blk_mq=1 >> >> and select BFQ in the scheduler: >> >> # cat /sys/block/sd?/queue/scheduler >> mq-deadline kyber [bfq] none >> mq-deadline kyber [bfq] none >> >> Even with this, I'm not seeing anything yet... > > > Thanks for looking into it anyway. I was experimenting today a little bit, and for me it looks like setting queue_depth and nr_requests to minimal values speeds up the reproducing. Could you please try it too? Something like: > > echo 1 | tee /sys/block/sd*/queue/nr_requests I can't get this below "4". > echo 1 | tee /sys/block/sd*/device/queue_depth I've got this now too. > Also, now I have a more solid proof that this is related to I/O scheduling. > > I was hammering my VM, and after a couple of usercopy warnings/bugs I can reliably trigger I/O hang. I was able to obtain the stack traces of tasks in D state. Listing them here below. dmcrypt_write: Ah! dm-crypt too. I'll see if I can get that added easily to my tests. > === > [ 1615.409622] dmcrypt_write D 0 236 2 0x80000000 > [ 1615.413422] Call Trace: > [ 1615.415428] ? __schedule+0x336/0xf40 > [ 1615.417824] ? 
blk_mq_sched_dispatch_requests+0x117/0x190 > [ 1615.421423] ? __sbitmap_get_word+0x2a/0x80 > [ 1615.424202] schedule+0x32/0xc0 > [ 1615.426521] io_schedule+0x12/0x40 > [ 1615.432414] blk_mq_get_tag+0x181/0x2a0 > [ 1615.434881] ? elv_merge+0x79/0xe0 > [ 1615.437447] ? wait_woken+0x80/0x80 > [ 1615.439553] blk_mq_get_request+0xf9/0x400 > [ 1615.444653] blk_mq_make_request+0x10b/0x640 > [ 1615.448025] generic_make_request+0x124/0x2d0 > [ 1615.450716] ? raid10_unplug+0xfb/0x180 [raid10] > [ 1615.454069] raid10_unplug+0xfb/0x180 [raid10] > [ 1615.456729] blk_flush_plug_list+0xc1/0x250 > [ 1615.460276] blk_finish_plug+0x27/0x40 > [ 1615.464103] dmcrypt_write+0x233/0x240 [dm_crypt] > [ 1615.467443] ? wake_up_process+0x20/0x20 > [ 1615.470845] ? crypt_iv_essiv_dtr+0x60/0x60 [dm_crypt] > [ 1615.475272] ? kthread+0x113/0x130 > [ 1615.477652] kthread+0x113/0x130 > [ 1615.480567] ? kthread_create_on_node+0x70/0x70 > [ 1615.483268] ret_from_fork+0x35/0x40 > === > > One of XFS threads, too: And XFS! You love your corner cases. ;) > > === > [ 1616.133298] xfsaild/dm-7 D 0 316 2 0x80000000 > [ 1616.136679] Call Trace: > [ 1616.138845] ? __schedule+0x336/0xf40 > [ 1616.141581] ? preempt_count_add+0x68/0xa0 > [ 1616.147214] ? _raw_spin_unlock+0x16/0x30 > [ 1616.149813] ? _raw_spin_unlock_irqrestore+0x20/0x40 > [ 1616.152478] ? try_to_del_timer_sync+0x4d/0x80 > [ 1616.154734] schedule+0x32/0xc0 > [ 1616.156579] _xfs_log_force+0x146/0x290 [xfs] > [ 1616.159322] ? wake_up_process+0x20/0x20 > [ 1616.162175] xfsaild+0x1a9/0x820 [xfs] > [ 1616.164695] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] > [ 1616.167567] ? kthread+0x113/0x130 > [ 1616.169722] kthread+0x113/0x130 > [ 1616.171908] ? kthread_create_on_node+0x70/0x70 > [ 1616.174073] ? do_syscall_64+0x74/0x190 > [ 1616.179008] ? 
SyS_exit_group+0x10/0x10 > [ 1616.182306] ret_from_fork+0x35/0x40 > === > > journald is another victim: > > === > [ 1616.184343] systemd-journal D 0 354 1 0x00000104 > [ 1616.187282] Call Trace: > [ 1616.189464] ? __schedule+0x336/0xf40 > [ 1616.191781] ? call_function_single_interrupt+0xa/0x20 > [ 1616.194788] ? _raw_spin_lock_irqsave+0x25/0x50 > [ 1616.197592] schedule+0x32/0xc0 > [ 1616.200171] schedule_timeout+0x202/0x470 > [ 1616.202851] ? preempt_count_add+0x49/0xa0 > [ 1616.206227] wait_for_common+0xbb/0x180 > [ 1616.209877] ? wake_up_process+0x20/0x20 > [ 1616.212511] do_coredump+0x335/0xea0 > [ 1616.214861] ? schedule+0x3c/0xc0 > [ 1616.216775] ? futex_wait_queue_me+0xbb/0x110 > [ 1616.218894] ? _raw_spin_unlock+0x16/0x30 > [ 1616.220868] ? futex_wait+0x143/0x240 > [ 1616.223450] get_signal+0x250/0x5c0 > [ 1616.225965] do_signal+0x36/0x610 > [ 1616.228246] ? __seccomp_filter+0x3b/0x260 > [ 1616.231000] ? __check_object_size+0x9f/0x1a0 > [ 1616.233716] exit_to_usermode_loop+0x85/0xa0 > [ 1616.238413] do_syscall_64+0x18b/0x190 > [ 1616.240798] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 1616.244401] RIP: 0033:0x7f78fc53e45d > [ 1616.246690] RSP: 002b:00007ffd40330d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca > [ 1616.251199] RAX: fffffffffffffe00 RBX: 00007f78f7806700 RCX: 00007f78fc53e45d > [ 1616.254817] RDX: 00000000000004cd RSI: 0000000000000000 RDI: 00007f78f78069d0 > [ 1616.258410] RBP: 00007ffd40330d20 R08: 00000000000000ca R09: 00007f78f78069d0 > [ 1616.261813] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > [ 1616.265065] R13: 0000000000000000 R14: 00007f78fc95e8c0 R15: 00007f78f7806d28 > > [ 1616.272861] journal-offline D 0 1229 1 0x00000104 > [ 1616.275856] Call Trace: > [ 1616.277396] ? __schedule+0x336/0xf40 > [ 1616.279258] ? release_pages+0x192/0x3d0 > [ 1616.282871] schedule+0x32/0xc0 > [ 1616.285218] io_schedule+0x12/0x40 > [ 1616.287267] wait_on_page_bit+0xea/0x130 > [ 1616.291084] ? 
add_to_page_cache_lru+0xe0/0xe0 > [ 1616.293898] __filemap_fdatawait_range+0xbb/0x110 > [ 1616.297391] ? xen_swiotlb_init+0x85/0x4d0 > [ 1616.303285] file_write_and_wait_range+0x76/0xa0 > [ 1616.306313] xfs_file_fsync+0x5d/0x240 [xfs] > [ 1616.309196] ? syscall_trace_enter+0xd8/0x2d0 > [ 1616.312193] do_fsync+0x38/0x60 > [ 1616.314654] SyS_fsync+0xc/0x10 > [ 1616.318686] do_syscall_64+0x74/0x190 > [ 1616.321519] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 1616.324045] RIP: 0033:0x7f78fc5470ec > [ 1616.326257] RSP: 002b:00007f78f7805d30 EFLAGS: 00000293 ORIG_RAX: 000000000000004a > [ 1616.333535] RAX: ffffffffffffffda RBX: 00005597340357b0 RCX: 00007f78fc5470ec > [ 1616.337057] RDX: 0000000000000000 RSI: 00007f78fbea3811 RDI: 0000000000000013 > [ 1616.341176] RBP: 00007f78fbea4c60 R08: 00007f78f7806700 R09: 00007f78f7806700 > [ 1616.345029] R10: 0000000000000811 R11: 0000000000000293 R12: 0000000000000002 > [ 1616.349279] R13: 00007ffd40330e8f R14: 00007ffd40330e90 R15: 0000000000000016 > > [ 1876.568670] systemd-journal D 0 14892 1 0x00000100 > [ 1876.573395] Call Trace: > [ 1876.575559] ? __schedule+0x336/0xf40 > [ 1876.578005] ? memcg_kmem_charge+0xad/0xf0 > [ 1876.580797] ? lock_page_memcg+0x11/0x80 > [ 1876.583674] schedule+0x32/0xc0 > [ 1876.585516] io_schedule+0x12/0x40 > [ 1876.590223] wait_on_page_bit+0xea/0x130 > [ 1876.591840] ? add_to_page_cache_lru+0xe0/0xe0 > [ 1876.593554] __filemap_fdatawait_range+0xbb/0x110 > [ 1876.595340] ? store_online+0x34/0xb0 > [ 1876.596913] ? __filemap_fdatawrite_range+0xc1/0x100 > [ 1876.599338] ? __filemap_fdatawrite_range+0xcf/0x100 > [ 1876.601447] file_write_and_wait_range+0x76/0xa0 > [ 1876.604356] xfs_file_fsync+0x5d/0x240 [xfs] > [ 1876.606599] ? 
syscall_trace_enter+0xd8/0x2d0 > [ 1876.608753] do_fsync+0x38/0x60 > [ 1876.610952] SyS_fsync+0xc/0x10 > [ 1876.612442] do_syscall_64+0x74/0x190 > [ 1876.615383] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 1876.622030] RIP: 0033:0x7fe056a840c4 > [ 1876.624698] RSP: 002b:00007ffe7908ce88 EFLAGS: 00000246 ORIG_RAX: 000000000000004a > [ 1876.628290] RAX: ffffffffffffffda RBX: 000055ad65d1f9c0 RCX: 00007fe056a840c4 > [ 1876.630922] RDX: 00007fe0563e1c7c RSI: 0000000000000003 RDI: 0000000000000017 > [ 1876.634244] RBP: 00007fe0563e1c60 R08: 0000000000000005 R09: 0000000000000002 > [ 1876.637153] R10: 000055ad65d1a010 R11: 0000000000000246 R12: 0000000000000002 > [ 1876.640099] R13: 00000000fffffff0 R14: 0000000000000002 R15: 00007ffe7908d2e0 > === > > And what's more important, I can make smartctl hang too! > > === > [ 1876.466385] smartctl D 0 14891 495 0x00000000 > [ 1876.469100] Call Trace: > [ 1876.471267] ? __schedule+0x336/0xf40 > [ 1876.473541] ? blk_mq_sched_dispatch_requests+0x117/0x190 > [ 1876.476331] ? __sbitmap_get_word+0x2a/0x80 > [ 1876.478784] ? sbitmap_get_shallow+0x5c/0xa0 > [ 1876.481353] schedule+0x32/0xc0 > [ 1876.484412] io_schedule+0x12/0x40 > [ 1876.486892] blk_mq_get_tag+0x181/0x2a0 > [ 1876.489448] ? wait_woken+0x80/0x80 > [ 1876.492018] blk_mq_get_request+0xf9/0x400 > [ 1876.497734] ? lookup_fast+0xc8/0x2d0 > [ 1876.501050] blk_mq_alloc_request+0x7e/0xe0 > [ 1876.503705] blk_get_request_flags+0x40/0x160 > [ 1876.506306] sg_io+0x9d/0x3f0 > [ 1876.507920] ? path_lookupat+0xaa/0x1f0 > [ 1876.509750] ? current_time+0x18/0x70 > [ 1876.511496] scsi_cmd_ioctl+0x257/0x410 > [ 1876.513151] ? xfs_bmapi_read+0x1c3/0x340 [xfs] > [ 1876.515355] sd_ioctl+0xbf/0x1a0 [sd_mod] > [ 1876.517638] blkdev_ioctl+0x8ca/0x990 > [ 1876.519219] ? read_null+0x10/0x10 > [ 1876.520849] block_ioctl+0x39/0x40 > [ 1876.522368] do_vfs_ioctl+0xa4/0x630 > [ 1876.524490] ? 
vfs_write+0x164/0x1a0 > [ 1876.526841] SyS_ioctl+0x74/0x80 > [ 1876.528328] do_syscall_64+0x74/0x190 > [ 1876.529886] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > [ 1876.537471] RIP: 0033:0x7f4f232c8d87 > [ 1876.540388] RSP: 002b:00007ffd0fe530a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > [ 1876.545738] RAX: ffffffffffffffda RBX: 00007ffd0fe53350 RCX: 00007f4f232c8d87 > [ 1876.548950] RDX: 00007ffd0fe530c0 RSI: 0000000000002285 RDI: 0000000000000003 > [ 1876.552912] RBP: 00007ffd0fe533a0 R08: 0000000000000010 R09: 00007ffd0fe53980 > [ 1876.559750] R10: 0000000000000200 R11: 0000000000000246 R12: 000055e98902dce0 > [ 1876.563829] R13: 0000000000000000 R14: 000055e989027550 R15: 00000000000000d0 > === > > Workqueues, also from alt+sysrq+t dump: > > === > [ 1876.676508] Showing busy workqueues and worker pools: > [ 1876.681916] workqueue events_power_efficient: flags=0x82 > [ 1876.685388] pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256 > [ 1876.688962] pending: fb_flashcursor, neigh_periodic_work > [ 1876.694414] workqueue events_freezable_power_: flags=0x86 > [ 1876.699262] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256 > [ 1876.703777] pending: disk_events_workfn > [ 1876.707279] workqueue dm-thin: flags=0xa000a > [ 1876.709053] pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/1 > [ 1876.715598] pending: do_waker [dm_thin_pool] > === > > So far, I can reproduce it only with blk-mq *and* BFQ enabled. I've switched another machine (my server) to Kyber, and I do not observe the issue (so far, at least). Two other questions, since you can reproduce this easily: - does it reproduce _without_ hardened usercopy? (I would assume yes, but you'd just not get any warning until the hangs started.) If it does reproduce without hardened usercopy, then a new bisect run could narrow the search even more. - does it reproduce with Linus's current tree? > Paolo, Jens? I don't know whom to address, actually :(. Can someone give me an idea on this please? 
> >> The thing I can't figure out is how req->sense is slipping forward in >> (and even beyond!) the allocation. > > *cough* race condition *cough* That would imply missing locking, yes? Yikes. But I'd expect use-after-free or something, or bad data, not having the pointer slip forward? -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-09 20:30 ` Kees Cook @ 2018-04-09 23:03 ` Kees Cook 2018-04-10 6:35 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Kees Cook @ 2018-04-09 23:03 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Mon, Apr 9, 2018 at 1:30 PM, Kees Cook <keescook@chromium.org> wrote: > Ah! dm-crypt too. I'll see if I can get that added easily to my tests. Quick update: I added dm-crypt (with XFS on top) and it hung my system almost immediately. I got no warnings at all, though. -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-09 20:30 ` Kees Cook 2018-04-09 23:03 ` Kees Cook @ 2018-04-10 6:35 ` Oleksandr Natalenko 2018-04-10 6:53 ` Kees Cook 2018-04-10 13:47 ` Oleksandr Natalenko 1 sibling, 2 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-10 6:35 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente, keescook Hi. 09.04.2018 22:30, Kees Cook wrote: >> echo 1 | tee /sys/block/sd*/queue/nr_requests > > I can't get this below "4". Oops, yeah. It cannot be less than BLKDEV_MIN_RQ (which is 4), so it is enforced explicitly in queue_requests_store(). It is the same for me. >> echo 1 | tee /sys/block/sd*/device/queue_depth > > I've got this now too. > Ah! dm-crypt too. I'll see if I can get that added easily to my tests. > And XFS! You love your corner cases. ;) Yeah, so far this wonderful configuration has allowed me to uncover a bunch of bugs, and see, we are not done yet ;). > Two other questions, since you can reproduce this easily: > - does it reproduce _without_ hardened usercopy? (I would assume yes, > but you'd just not get any warning until the hangs started.) If it > does reproduce without hardened usercopy, then a new bisect run could > narrow the search even more. Looks like it cannot be disabled via kernel cmdline, so I have to re-compile the kernel, right? I can certainly do that anyway. > - does it reproduce with Linus's current tree? Will try this too. > That would imply missing locking, yes? Yikes. But I'd expect > use-after-free or something, or bad data, not having the pointer slip > forward? I still think this has something to do with blk-mq re-queuing capability and how BFQ implements it, because there are no signs of the issue popping up with Kyber so far. 
> Quick update: I added dm-crypt (with XFS on top) and it hung my system > almost immediately. I got no warnings at all, though. Did your system hang on smartctl hammering too? Have you got some stack traces to compare with mine ones? Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-10 6:35 ` Oleksandr Natalenko @ 2018-04-10 6:53 ` Kees Cook 2018-04-10 17:16 ` Oleksandr Natalenko 2018-04-10 13:47 ` Oleksandr Natalenko 1 sibling, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-10 6:53 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Mon, Apr 9, 2018 at 11:35 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > Did your system hang on smartctl hammering too? Have you got some stack > traces to compare with mine ones? Unfortunately I only had a single hang with no dumps. I haven't been able to reproduce it since. :( -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-10 6:53 ` Kees Cook @ 2018-04-10 17:16 ` Oleksandr Natalenko 2018-04-11 3:13 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-10 17:16 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente, keescook Hi, Kees, Paolo et al. 10.04.2018 08:53, Kees Cook wrote: > Unfortunately I only had a single hang with no dumps. I haven't been > able to reproduce it since. :( For your convenience I've prepared a VM that contains a reproducer. It consists of 3 disk images (sda.img is for the system, it is Arch-based, sdb/sdc.img are for RAID). They are available (in a compressed form) to download here [1]. RAID is built as RAID10 with far2 layout, on top of it there is a LUKS container (can be opened either with keyfile under the /root or using "qwerty" password). There's one LVM PV, one VG and one volume on top of LUKS containing XFS. RAID is automatically assembled during the boot, so you don't have to worry about it. I run the VM like this: $ qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu host,+vmx -enable-kvm -netdev user,id=user.0 -device virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device usb-tablet,bus=xhci.0 -serial stdio -smp 4 -m 512 -hda sda.img -hdb sdb.img -hdc sdc.img The system is accessible via both VGA and serial console. The user is "root", the password is "qwerty". Under the /root folder there is a reproducer script (reproducer.sh). It does trivial things like enabling sysrq, opening LUKS device, mounting a volume, running a background I/O (this is an important part, actually, since I wasn't able to trigger the issue without the background I/O) and, finally, running the smartctl in a loop. 
If you are lucky, within a minute or two you'll get the first warning followed shortly by subsequent bugs and I/O stall (htop is pre-installed for your convenience too). Notable changes in this VM comparing to generic defaults: 1) blk-mq is enabled via kernel cmdline (scsi_mod.use_blk_mq=1 is in /etc/default/grub) 2) BFQ is set via udev (check /etc/udev/rules.d/10-io-scheduler.rules file) Again, I wasn't able to reproduce the usercopy warning/bug and I/O hang without all these components being involved. Hope you enjoy it. P.S. I haven't tested Linus' master branch yet. For now, this VM runs v4.16 kernel. Regards, Oleksandr [1] https://natalenko.name/myfiles/usercopy_bfq_woe/ ^ permalink raw reply [flat|nested] 67+ messages in thread
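[Editorial note: the udev rule referenced above (/etc/udev/rules.d/10-io-scheduler.rules) is typically a one-liner of the following shape. The match pattern and exact content here are an assumption for illustration; check the file in the VM for the real rule.]

```
# /etc/udev/rules.d/10-io-scheduler.rules (assumed content)
# Select BFQ for all SCSI disks when blk-mq is enabled
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
```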
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-10 17:16 ` Oleksandr Natalenko @ 2018-04-11 3:13 ` Kees Cook 2018-04-11 22:47 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-11 3:13 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Tue, Apr 10, 2018 at 10:16 AM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > Hi, Kees, Paolo et al. > > 10.04.2018 08:53, Kees Cook wrote: >> >> Unfortunately I only had a single hang with no dumps. I haven't been >> able to reproduce it since. :( > > > For your convenience I've prepared a VM that contains a reproducer. Awesome. :) > Under the /root folder there is a reproducer script (reproducer.sh). It does > trivial things like enabling sysrq, opening LUKS device, mounting a volume, > running a background I/O (this is an important part, actually, since I > wasn't able to trigger the issue without the background I/O) and, finally, > running the smartctl in a loop. If you are lucky, within a minute or two > you'll get the first warning followed shortly by subsequent bugs and I/O > stall (htop is pre-installed for your convenience too). Yup! [ 27.729498] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)! I'll see about booting with my own kernels, etc, and try to narrow this down. :) -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-11 3:13 ` Kees Cook @ 2018-04-11 22:47 ` Kees Cook 2018-04-12 0:03 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-11 22:47 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keescook@chromium.org> wrote: > I'll see about booting with my own kernels, etc, and try to narrow this down. :) If I boot kernels I've built, I no longer hit the bug in this VM (though I'll keep trying). What compiler are you using? -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-11 22:47 ` Kees Cook @ 2018-04-12 0:03 ` Kees Cook 2018-04-12 18:44 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-12 0:03 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote: > On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keescook@chromium.org> wrote: >> I'll see about booting with my own kernels, etc, and try to narrow this down. :) > > If I boot kernels I've built, I no longer hit the bug in this VM > (though I'll keep trying). What compiler are you using? Ignore that: I've reproduced it with my kernels now. I think I messed up the initramfs initially. But with an exact copy of your .config, booting under Arch grub with initramfs, I see it. I'll start removing variables now... :P -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-12 0:03 ` Kees Cook @ 2018-04-12 18:44 ` Kees Cook 2018-04-12 19:04 ` Oleksandr Natalenko 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-12 18:44 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Wed, Apr 11, 2018 at 5:03 PM, Kees Cook <keescook@chromium.org> wrote: > On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote: >> On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook <keescook@chromium.org> wrote: >>> I'll see about booting with my own kernels, etc, and try to narrow this down. :) >> >> If I boot kernels I've built, I no longer hit the bug in this VM >> (though I'll keep trying). What compiler are you using? > > Ignore that: I've reproduced it with my kernels now. I think I messed > up the initramfs initially. But with an exact copy of your .config, > booting under Arch grub with initramfs, I see it. I'll start removing > variables now... :P My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire up timeouts"), which seems insane given that null_blk isn't even built in the .config. I managed to get the testing automated now for a "git bisect run ...", so I'm restarting again to hopefully get a better answer. Normally the Oops happens in the first minute of running. I've set the timeout to 10 minutes for a "good" run... -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
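[Editorial note: the "timeout means good" convention described above maps naturally onto the exit-status contract of `git bisect run` (0 = good, 1-124 = bad, 125 = skip). A minimal sketch of that mapping follows; the boot-and-watch helper named in the comment is hypothetical, only the exit-code classification is shown.]

```shell
#!/bin/sh
# Sketch of a bisect test script for this bug:
#   `timeout` exits 124 when the 10-minute budget expires with no oops -> good
#   the watcher exits 0 when it sees the usercopy oops                 -> bad
#   anything else (build/boot failure)                                 -> skip
classify() {
	case "$1" in
		124) return 0   ;;  # no oops within the window: good commit
		0)   return 1   ;;  # oops reproduced: bad commit
		*)   return 125 ;;  # untestable: tell git bisect to skip
	esac
}

# Hypothetical usage from `git bisect run ./this-script.sh`:
#   timeout 600 ./boot-and-watch-for-oops.sh
#   classify $?
```

The 125 case matters during this bisect in particular, since several commits in the middle of the 4.16 cycle did not build.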
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-12 18:44 ` Kees Cook @ 2018-04-12 19:04 ` Oleksandr Natalenko 0 siblings, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-12 19:04 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente Hi. On Thursday, 12 April 2018, 20:44:37 CEST Kees Cook wrote: > My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire > up timeouts"), which seems insane given that null_blk isn't even built > in the .config. I managed to get the testing automated now for a "git > bisect run ...", so I'm restarting again to hopefully get a better > answer. Normally the Oops happens in the first minute of running. I've > set the timeout to 10 minutes for a "good" run... Apparently, you do this on Linus' tree, right? If so, I won't compile it here then. Thanks for taking care of this. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-12 19:04 ` Oleksandr Natalenko @ 2018-04-12 22:01 ` Kees Cook 0 siblings, 0 replies; 67+ messages in thread From: Kees Cook @ 2018-04-12 22:01 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > Hi. > > On Thursday, 12 April 2018, 20:44:37 CEST Kees Cook wrote: >> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire >> up timeouts"), which seems insane given that null_blk isn't even built >> in the .config. I managed to get the testing automated now for a "git >> bisect run ...", so I'm restarting again to hopefully get a better >> answer. Normally the Oops happens in the first minute of running. I've >> set the timeout to 10 minutes for a "good" run... > > Apparently, you do this on Linus' tree, right? If so, I won't compile it here > then. Actually, I didn't test Linus's tree, but I can do that after the most recent bisect finishes... I'm just trying to find where it went wrong in 4.16. > Thanks for taking care of this. Thanks for building the reproducer! :) -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-12 22:01 ` Kees Cook @ 2018-04-12 22:47 ` Kees Cook 0 siblings, 0 replies; 67+ messages in thread From: Kees Cook @ 2018-04-12 22:47 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Thu, Apr 12, 2018 at 3:01 PM, Kees Cook <keescook@chromium.org> wrote: > On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko > <oleksandr@natalenko.name> wrote: >> Hi. >> >> On Thursday, 12 April 2018, 20:44:37 CEST Kees Cook wrote: >>> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire >>> up timeouts"), which seems insane given that null_blk isn't even built >>> in the .config. I managed to get the testing automated now for a "git >>> bisect run ...", so I'm restarting again to hopefully get a better >>> answer. Normally the Oops happens in the first minute of running. I've >>> set the timeout to 10 minutes for a "good" run... After fixing up some build issues in the middle of the 4.16 cycle, I get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch 'for-4.16/block'"). Instead of letting the test run longer, I'm going to switch to doing several shorter test boots per kernel and see if that helps. One more bisect coming... >> Apparently, you do this on Linus' tree, right? If so, I won't compile it here >> then. > > Actually, I didn't test Linus's tree, but I can do that after the most > recent bisect finishes... I'm just trying to find where it went wrong > in 4.16. FWIW, I see an Oops under Linus's latest tree. -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-12 22:47 ` Kees Cook (?) @ 2018-04-13 3:02 ` Kees Cook 2018-04-16 20:44 ` Kees Cook -1 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-13 3:02 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote: > After fixing up some build issues in the middle of the 4.16 cycle, I > get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch > 'for-4.16/block'"). Instead of letting the test run longer, I'm going > to switch to doing several shorter test boots per kernel and see if > that helps. One more bisect coming... Okay, so I can confirm the bisect points at the _merge_ itself, not a specific patch. I'm not sure how to proceed here. It looks like some kind of interaction between separate trees? Jens, do you have suggestions on how to track this down? -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-13 3:02 ` Kees Cook @ 2018-04-16 20:44 ` Kees Cook 2018-04-17 3:12 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-16 20:44 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook <keescook@chromium.org> wrote: > On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote: >> After fixing up some build issues in the middle of the 4.16 cycle, I >> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch >> 'for-4.16/block'"). Instead of letting the test run longer, I'm going >> to switch to doing several shorter test boots per kernel and see if >> that helps. One more bisect coming... > > Okay, so I can confirm the bisect points at the _merge_ itself, not a > specific patch. I'm not sure how to proceed here. It looks like some > kind of interaction between separate trees? Jens, do you have > suggestions on how to track this down? Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report: [ 38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60 [ 38.274841] Read of size 22 at addr ffff8800122b8c4b by task smartctl/1064 [ 38.275630] [ 38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266 [ 38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 38.277690] Call Trace: [ 38.277988] dump_stack+0x71/0xab [ 38.278397] ? _copy_to_user+0x42/0x60 [ 38.278833] print_address_description+0x6a/0x270 [ 38.279368] ? _copy_to_user+0x42/0x60 [ 38.279800] kasan_report+0x243/0x360 [ 38.280221] _copy_to_user+0x42/0x60 [ 38.280635] sg_io+0x459/0x660 ... 
Though we get slightly more details (some we already knew): [ 38.301330] Allocated by task 329: [ 38.301734] kmem_cache_alloc_node+0xca/0x220 [ 38.302239] scsi_mq_init_request+0x64/0x130 [scsi_mod] [ 38.302821] blk_mq_alloc_rqs+0x2cf/0x370 [ 38.303265] blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0 [ 38.303820] blk_mq_init_sched+0xf0/0x220 [ 38.304268] elevator_switch+0x17a/0x2c0 [ 38.304705] elv_iosched_store+0x173/0x220 [ 38.305171] queue_attr_store+0x72/0xb0 [ 38.305602] kernfs_fop_write+0x188/0x220 [ 38.306049] __vfs_write+0xb6/0x330 [ 38.306436] vfs_write+0xe9/0x240 [ 38.306804] ksys_write+0x98/0x110 [ 38.307181] do_syscall_64+0x6d/0x1d0 [ 38.307590] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 38.308142] [ 38.308316] Freed by task 0: [ 38.308652] (stack is not available) [ 38.309060] [ 38.309243] The buggy address belongs to the object at ffff8800122b8c00 [ 38.309243] which belongs to the cache scsi_sense_cache of size 96 [ 38.310625] The buggy address is located 75 bytes inside of [ 38.310625] 96-byte region [ffff8800122b8c00, ffff8800122b8c60) -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-16 20:44 ` Kees Cook @ 2018-04-17 3:12 ` Kees Cook 2018-04-17 9:19 ` Oleksandr Natalenko ` (3 more replies) 0 siblings, 4 replies; 67+ messages in thread From: Kees Cook @ 2018-04-17 3:12 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Mon, Apr 16, 2018 at 1:44 PM, Kees Cook <keescook@chromium.org> wrote: > On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook <keescook@chromium.org> wrote: >> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote: >>> After fixing up some build issues in the middle of the 4.16 cycle, I >>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch >>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going >>> to switch to doing several shorter test boots per kernel and see if >>> that helps. One more bisect coming... >> >> Okay, so I can confirm the bisect points at the _merge_ itself, not a >> specific patch. I'm not sure how to proceed here. It looks like some >> kind of interaction between separate trees? Jens, do you have >> suggestions on how to track this down? > > Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report: > > [ 38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60 > [ 38.274841] Read of size 22 at addr ffff8800122b8c4b by task smartctl/1064 > [ 38.275630] > [ 38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266 > [ 38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), > BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 > [ 38.277690] Call Trace: > [ 38.277988] dump_stack+0x71/0xab > [ 38.278397] ? _copy_to_user+0x42/0x60 > [ 38.278833] print_address_description+0x6a/0x270 > [ 38.279368] ? 
_copy_to_user+0x42/0x60 > [ 38.279800] kasan_report+0x243/0x360 > [ 38.280221] _copy_to_user+0x42/0x60 > [ 38.280635] sg_io+0x459/0x660 > ... > > Though we get slightly more details (some we already knew): > > [ 38.301330] Allocated by task 329: > [ 38.301734] kmem_cache_alloc_node+0xca/0x220 > [ 38.302239] scsi_mq_init_request+0x64/0x130 [scsi_mod] > [ 38.302821] blk_mq_alloc_rqs+0x2cf/0x370 > [ 38.303265] blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0 > [ 38.303820] blk_mq_init_sched+0xf0/0x220 > [ 38.304268] elevator_switch+0x17a/0x2c0 > [ 38.304705] elv_iosched_store+0x173/0x220 > [ 38.305171] queue_attr_store+0x72/0xb0 > [ 38.305602] kernfs_fop_write+0x188/0x220 > [ 38.306049] __vfs_write+0xb6/0x330 > [ 38.306436] vfs_write+0xe9/0x240 > [ 38.306804] ksys_write+0x98/0x110 > [ 38.307181] do_syscall_64+0x6d/0x1d0 > [ 38.307590] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 38.308142] > [ 38.308316] Freed by task 0: > [ 38.308652] (stack is not available) > [ 38.309060] > [ 38.309243] The buggy address belongs to the object at ffff8800122b8c00 > [ 38.309243] which belongs to the cache scsi_sense_cache of size 96 > [ 38.310625] The buggy address is located 75 bytes inside of > [ 38.310625] 96-byte region [ffff8800122b8c00, ffff8800122b8c60) With a hardware watchpoint, I've isolated the corruption to here: bfq_dispatch_request+0x2be/0x1610: __bfq_dispatch_request at block/bfq-iosched.c:3902 3900 if (rq) { 3901 inc_in_driver_start_rq: 3902 bfqd->rq_in_driver++; 3903 start_rq: 3904 rq->rq_flags |= RQF_STARTED; 3905 } Through some race condition(?), rq_in_driver is also sense_buffer, and it can get incremented. 
Specifically, I am doing: diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 25c14c58385c..f50d5053d732 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -6,6 +6,8 @@ #include <linux/kernel.h> #include <linux/module.h> #include <linux/blk-mq.h> +#include <scsi/scsi_request.h> +#include <linux/hw_breakpoint.h> #include <trace/events/block.h> @@ -428,6 +430,18 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx) } } +static void sample_hbp_handler(struct perf_event *bp, + struct perf_sample_data *data, + struct pt_regs *regs) +{ + printk(KERN_INFO "sense_buffer value is changed\n"); + dump_stack(); + printk(KERN_INFO "Dump stack from sample_hbp_handler\n"); +} + +struct perf_event * __percpu *sample_hbp; +int ok_to_go; + void blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue, bool async) { @@ -435,6 +449,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head, struct elevator_queue *e = q->elevator; struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); + struct perf_event_attr attr; + struct scsi_request *req = scsi_req(rq); + + if (ok_to_go) { + hw_breakpoint_init(&attr); + attr.bp_addr = (uintptr_t)&(req->sense); + attr.bp_len = HW_BREAKPOINT_LEN_8; + attr.bp_type = HW_BREAKPOINT_W; + sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL); + } /* flush rq in flush machinery need to be dispatched directly */ if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) { @@ -461,6 +485,9 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head, run: if (run_queue) blk_mq_run_hw_queue(hctx, async); + + if (ok_to_go) + unregister_wide_hw_breakpoint(sample_hbp); } void blk_mq_sched_insert_requests(struct request_queue *q, Where "ok_to_go" is wired up to a sysctl so I don't trip other things (?) at boot-time. I still haven't figured this out, though... any have a moment to look at this? 
-Kees -- Kees Cook Pixel Security ^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 3:12 ` Kees Cook @ 2018-04-17 9:19 ` Oleksandr Natalenko 2018-04-17 16:25 ` Kees Cook 2018-04-17 10:02 ` James Bottomley ` (2 subsequent siblings) 3 siblings, 1 reply; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-17 9:19 UTC (permalink / raw) To: Kees Cook Cc: Jens Axboe, Bart Van Assche, Paolo Valente, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, keescook Hi. 17.04.2018 05:12, Kees Cook wrote: >> Turning off HARDENED_USERCOPY and turning on KASAN, I see the same >> report: >> >> [ 38.274106] BUG: KASAN: slab-out-of-bounds in >> _copy_to_user+0x42/0x60 >> [ 38.274841] Read of size 22 at addr ffff8800122b8c4b by task >> smartctl/1064 >> [ 38.275630] >> [ 38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted >> 4.17.0-rc1-ARCH+ #266 >> [ 38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), >> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 >> [ 38.277690] Call Trace: >> [ 38.277988] dump_stack+0x71/0xab >> [ 38.278397] ? _copy_to_user+0x42/0x60 >> [ 38.278833] print_address_description+0x6a/0x270 >> [ 38.279368] ? _copy_to_user+0x42/0x60 >> [ 38.279800] kasan_report+0x243/0x360 >> [ 38.280221] _copy_to_user+0x42/0x60 >> [ 38.280635] sg_io+0x459/0x660 >> ... 
>> >> Though we get slightly more details (some we already knew): >> >> [ 38.301330] Allocated by task 329: >> [ 38.301734] kmem_cache_alloc_node+0xca/0x220 >> [ 38.302239] scsi_mq_init_request+0x64/0x130 [scsi_mod] >> [ 38.302821] blk_mq_alloc_rqs+0x2cf/0x370 >> [ 38.303265] blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0 >> [ 38.303820] blk_mq_init_sched+0xf0/0x220 >> [ 38.304268] elevator_switch+0x17a/0x2c0 >> [ 38.304705] elv_iosched_store+0x173/0x220 >> [ 38.305171] queue_attr_store+0x72/0xb0 >> [ 38.305602] kernfs_fop_write+0x188/0x220 >> [ 38.306049] __vfs_write+0xb6/0x330 >> [ 38.306436] vfs_write+0xe9/0x240 >> [ 38.306804] ksys_write+0x98/0x110 >> [ 38.307181] do_syscall_64+0x6d/0x1d0 >> [ 38.307590] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> [ 38.308142] >> [ 38.308316] Freed by task 0: >> [ 38.308652] (stack is not available) >> [ 38.309060] >> [ 38.309243] The buggy address belongs to the object at >> ffff8800122b8c00 >> [ 38.309243] which belongs to the cache scsi_sense_cache of size 96 >> [ 38.310625] The buggy address is located 75 bytes inside of >> [ 38.310625] 96-byte region [ffff8800122b8c00, ffff8800122b8c60) > > With a hardware watchpoint, I've isolated the corruption to here: > > bfq_dispatch_request+0x2be/0x1610: > __bfq_dispatch_request at block/bfq-iosched.c:3902 > 3900 if (rq) { > 3901 inc_in_driver_start_rq: > 3902 bfqd->rq_in_driver++; > 3903 start_rq: > 3904 rq->rq_flags |= RQF_STARTED; > 3905 } > > Through some race condition(?), rq_in_driver is also sense_buffer, and > it can get incremented. > … > I still haven't figured this out, though... any have a moment to look > at this? By any chance, have you tried to simplify the reproducer environment, or it still needs my complex layout to trigger things even with KASAN? Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 9:19 ` Oleksandr Natalenko @ 2018-04-17 16:25 ` Kees Cook 0 siblings, 0 replies; 67+ messages in thread From: Kees Cook @ 2018-04-17 16:25 UTC (permalink / raw) To: Oleksandr Natalenko Cc: Jens Axboe, Bart Van Assche, Paolo Valente, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 2:19 AM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > By any chance, have you tried to simplify the reproducer environment, or it > still needs my complex layout to trigger things even with KASAN? I haven't tried minimizing the reproducer yet, no. Now that I have a specific place to watch in the kernel for the corruption, though, that might help. If I get stuck again today, I'll try it. -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 3:12 ` Kees Cook 2018-04-17 9:19 ` Oleksandr Natalenko @ 2018-04-17 10:02 ` James Bottomley 2018-04-17 16:30 ` Kees Cook 2018-04-17 16:42 ` Kees Cook 2018-04-17 20:03 ` Kees Cook 3 siblings, 1 reply; 67+ messages in thread From: James Bottomley @ 2018-04-17 10:02 UTC (permalink / raw) To: Kees Cook, Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente Cc: David Windsor, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote: > I still haven't figured this out, though... any have a moment to look > at this? Just to let you know you're not alone ... but I can't make any sense of this either. The bfqd is the elevator_data, which is initialised when the scheduler is attached, so it shouldn't change. Is it possible to set a data breakpoint on elevator_data after it's initialised and see if it got changed by something? James ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 10:02 ` James Bottomley @ 2018-04-17 16:30 ` Kees Cook 0 siblings, 0 replies; 67+ messages in thread From: Kees Cook @ 2018-04-17 16:30 UTC (permalink / raw) To: James Bottomley Cc: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente, David Windsor, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 3:02 AM, James Bottomley <jejb@linux.vnet.ibm.com> wrote: > On Mon, 2018-04-16 at 20:12 -0700, Kees Cook wrote: >> I still haven't figured this out, though... any have a moment to look >> at this? > > Just to let you know you're not alone ... but I can't make any sense of > this either. The bfdq is the elevator_data, which is initialised when > the scheduler is attached, so it shouldn't change. Is it possible to > set a data break point on elevator_data after it's initialised and see > if it got changed by something? Yeah, it seems like some pointer chain is getting overwritten outside of a lock or rcu or ?. I don't know this code well enough to guess at where to check, though. What I find so strange is that the structure offsets are different between bfpd's rq_in_driver field and scsi_request's sense field, so even THAT doesn't look to be clear-cut either: struct bfq_data { struct request_queue * queue; /* 0 8 */ struct list_head dispatch; /* 8 16 */ struct bfq_group * root_group; /* 24 8 */ struct rb_root queue_weights_tree; /* 32 8 */ struct rb_root group_weights_tree; /* 40 8 */ int busy_queues; /* 48 4 */ int wr_busy_queues; /* 52 4 */ int queued; /* 56 4 */ int rq_in_driver; /* 60 4 */ ... 
struct scsi_request { unsigned char __cmd[16]; /* 0 16 */ unsigned char * cmd; /* 16 8 */ short unsigned int cmd_len; /* 24 2 */ /* XXX 2 bytes hole, try to pack */ int result; /* 28 4 */ unsigned int sense_len; /* 32 4 */ unsigned int resid_len; /* 36 4 */ int retries; /* 40 4 */ /* XXX 4 bytes hole, try to pack */ void * sense; /* 48 8 */ ... This is _so_ weird. :P -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 3:12 ` Kees Cook 2018-04-17 9:19 ` Oleksandr Natalenko 2018-04-17 10:02 ` James Bottomley @ 2018-04-17 16:42 ` Kees Cook 2018-04-17 16:46 ` Jens Axboe 2018-04-17 20:03 ` Kees Cook 3 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-17 16:42 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote: > With a hardware watchpoint, I've isolated the corruption to here: > > bfq_dispatch_request+0x2be/0x1610: > __bfq_dispatch_request at block/bfq-iosched.c:3902 > 3900 if (rq) { > 3901 inc_in_driver_start_rq: > 3902 bfqd->rq_in_driver++; > 3903 start_rq: > 3904 rq->rq_flags |= RQF_STARTED; > 3905 } FWIW, the stacktrace here (removing the ? lines) is: [ 34.311980] RIP: 0010:bfq_dispatch_request+0x2be/0x1610 [ 34.452491] blk_mq_do_dispatch_sched+0x1d9/0x260 [ 34.454561] blk_mq_sched_dispatch_requests+0x3da/0x4b0 [ 34.458789] __blk_mq_run_hw_queue+0xae/0x130 [ 34.460001] __blk_mq_delay_run_hw_queue+0x192/0x280 [ 34.460823] blk_mq_run_hw_queue+0x10b/0x1b0 [ 34.463240] blk_mq_sched_insert_request+0x3bd/0x4d0 [ 34.467342] blk_execute_rq+0xcf/0x140 [ 34.468483] sg_io+0x2f7/0x730 Can anyone tell me more about the memory allocation layout of the various variables here? It looks like struct request is a header in front of struct scsi_request? How do struct elevator_queue, struct blk_mq_ctx, and struct blk_mq_hw_ctx overlap these? Regardless, I'll check for elevator data changing too... -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache
  2018-04-17 16:42 ` Kees Cook
@ 2018-04-17 16:46 ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2018-04-17 16:46 UTC (permalink / raw)
To: Kees Cook, Oleksandr Natalenko, Bart Van Assche, Paolo Valente
Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen,
    linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke,
    Johannes Thumshirn, linux-block

On 4/17/18 10:42 AM, Kees Cook wrote:
> On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote:
>> With a hardware watchpoint, I've isolated the corruption to here:
>>
>> bfq_dispatch_request+0x2be/0x1610:
>> __bfq_dispatch_request at block/bfq-iosched.c:3902
>> 3900            if (rq) {
>> 3901 inc_in_driver_start_rq:
>> 3902                    bfqd->rq_in_driver++;
>> 3903 start_rq:
>> 3904                    rq->rq_flags |= RQF_STARTED;
>> 3905            }
>
> FWIW, the stacktrace here (removing the ? lines) is:
>
> [   34.311980] RIP: 0010:bfq_dispatch_request+0x2be/0x1610
> [   34.452491]  blk_mq_do_dispatch_sched+0x1d9/0x260
> [   34.454561]  blk_mq_sched_dispatch_requests+0x3da/0x4b0
> [   34.458789]  __blk_mq_run_hw_queue+0xae/0x130
> [   34.460001]  __blk_mq_delay_run_hw_queue+0x192/0x280
> [   34.460823]  blk_mq_run_hw_queue+0x10b/0x1b0
> [   34.463240]  blk_mq_sched_insert_request+0x3bd/0x4d0
> [   34.467342]  blk_execute_rq+0xcf/0x140
> [   34.468483]  sg_io+0x2f7/0x730
>
> Can anyone tell me more about the memory allocation layout of the
> various variables here? It looks like struct request is a header in
> front of struct scsi_request? How do struct elevator_queue, struct
> blk_mq_ctx, and struct blk_mq_hw_ctx overlap these?

The scsi_request is a payload item for the block request, it's located
right after the request in memory. These are persistent allocations,
we don't allocate/free them per IO.

blk_mq_ctx are the blk-mq software queues, they are percpu and
allocated when the queue is set up.

blk_mq_hw_ctx is the hardware queue. You probably have just one, it's
allocated when the queue is set up.

struct elevator_queue is allocated when the scheduler is attached to
the queue. This can get freed and allocated if you switch the scheduler
on a queue, otherwise it persists until the queue is torn down (and the
scheduler data is freed).

> Regardless, I'll check for elevator data changing too...

It should not change unless you switch IO schedulers. If you're using
BFQ and not switching, then it won't change.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache
  2018-04-17  3:12 ` Kees Cook
  ` (2 preceding siblings ...)
  2018-04-17 16:42 ` Kees Cook
@ 2018-04-17 20:03 ` Kees Cook
  2018-04-17 20:20   ` Kees Cook
  3 siblings, 1 reply; 67+ messages in thread
From: Kees Cook @ 2018-04-17 20:03 UTC (permalink / raw)
To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente
Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen,
    linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke,
    Johannes Thumshirn, linux-block

On Mon, Apr 16, 2018 at 8:12 PM, Kees Cook <keescook@chromium.org> wrote:
> With a hardware watchpoint, I've isolated the corruption to here:
>
> bfq_dispatch_request+0x2be/0x1610:
> __bfq_dispatch_request at block/bfq-iosched.c:3902
> 3900            if (rq) {
> 3901 inc_in_driver_start_rq:
> 3902                    bfqd->rq_in_driver++;
> 3903 start_rq:
> 3904                    rq->rq_flags |= RQF_STARTED;
> 3905            }

This state continues to appear to be "impossible".

[   68.845979] watchpoint triggered
[   68.846462] sense before:ffff8b8f9f6aae00 after:ffff8b8f9f6aae01
[   68.847196] elevator before:ffff8b8f9a2c2000 after:ffff8b8f9a2c2000
[   68.847905] elevator_data before:ffff8b8f9a2c0400 after:ffff8b8f9a2c0400
...
[   68.856925] RIP: 0010:bfq_dispatch_request+0x99/0xad0
[   68.857553] RSP: 0018:ffff900280c63a58 EFLAGS: 00000082
[   68.858253] RAX: ffff8b8f9aefbe80 RBX: ffff8b8f9a2c0400 RCX: ffff8b8f9a2c0408
[   68.859201] RDX: ffff8b8f9a2c0408 RSI: ffff900280c63b34 RDI: 0000000000000001
[   68.860147] RBP: 0000000000000000 R08: 0000000f00000204 R09: 0000000000000000
[   68.861122] R10: ffff900280c63af0 R11: 0000000000000040 R12: ffff8b8f9aefbe00
[   68.862089] R13: ffff8b8f9a221950 R14: 0000000000000000 R15: ffff8b8f9a2c0770

Here we can see that the sense buffer has, as we've seen before, been
incremented. However, the "before" values for elevator and
elevator_data still match their expected values.
As such, this should be "impossible", since:

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	...
	if (rq) {
inc_in_driver_start_rq:
		bfqd->rq_in_driver++;
start_rq:
		rq->rq_flags |= RQF_STARTED;
	}
exit:
	return rq;
}

For rq_in_driver++ to touch sense, bfqd would have to be equal to the
scsi_request address minus 12 bytes (rq_in_driver is at offset 60 in
struct bfq_data, and sense is at offset 48 in struct scsi_request).

The above bfq_dispatch_request+0x99/0xad0 is still
__bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN
removed. 0x99 is 153 decimal:

(gdb) disass bfq_dispatch_request
Dump of assembler code for function bfq_dispatch_request:
...
   0xffffffff8134b2ad <+141>:	test   %rax,%rax
   0xffffffff8134b2b0 <+144>:	je     0xffffffff8134b2bd <bfq_dispatch_request+157>
   0xffffffff8134b2b2 <+146>:	addl   $0x1,0x100(%rax)
   0xffffffff8134b2b9 <+153>:	addl   $0x1,0x3c(%rbx)
   0xffffffff8134b2bd <+157>:	orl    $0x2,0x18(%r12)
   0xffffffff8134b2c3 <+163>:	test   %ebp,%ebp
   0xffffffff8134b2c5 <+165>:	je     0xffffffff8134b2ce <bfq_dispatch_request+174>
   0xffffffff8134b2c7 <+167>:	mov    0x108(%r14),%rax
   0xffffffff8134b2ce <+174>:	mov    %r15,%rdi
   0xffffffff8134b2d1 <+177>:	callq  0xffffffff81706f90 <_raw_spin_unlock_irq>

Just as a sanity check: at +157, %r12 should be rq, rq_flags is at
offset 0x18 from it, and $0x2 is RQF_STARTED, so that maps to
"rq->rq_flags |= RQF_STARTED", the next C statement. I don't know what
+146 is, though? An increment of something at a 256-byte offset?
There's a lot of inline fun and reordering happening here, so I'm
ignoring that for the moment.

So at +153, %rbx should be bfqd (i.e. elevator_data), but this is the
instruction that tripped. The watchpoint dump shows RBX as
ffff8b8f9a2c0400 ... which matches elevator_data.

So, what can this be? Some sort of cache issue? By the time the
watchpoint handler captures the register information, has it already
masked the problem? I'm stumped again.
:(

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 20:03 ` Kees Cook @ 2018-04-17 20:20 ` Kees Cook 2018-04-17 20:25 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-17 20:20 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote: > The above bfq_dispatch_request+0x99/0xad0 is still > __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN > removed. 0x99 is 153 decimal: > > (gdb) disass bfq_dispatch_request > Dump of assembler code for function bfq_dispatch_request: > ... > 0xffffffff8134b2ad <+141>: test %rax,%rax > 0xffffffff8134b2b0 <+144>: je 0xffffffff8134b2bd > <bfq_dispatch_request+157> > 0xffffffff8134b2b2 <+146>: addl $0x1,0x100(%rax) > 0xffffffff8134b2b9 <+153>: addl $0x1,0x3c(%rbx) > 0xffffffff8134b2bd <+157>: orl $0x2,0x18(%r12) > 0xffffffff8134b2c3 <+163>: test %ebp,%ebp > 0xffffffff8134b2c5 <+165>: je 0xffffffff8134b2ce > <bfq_dispatch_request+174> > 0xffffffff8134b2c7 <+167>: mov 0x108(%r14),%rax > 0xffffffff8134b2ce <+174>: mov %r15,%rdi > 0xffffffff8134b2d1 <+177>: callq 0xffffffff81706f90 <_raw_spin_unlock_irq> > > Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18 > offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |= > RQF_STARTED", the next C statement. I don't know what +146 is, though? > An increment of something 256 bytes offset? There's a lot of inline > fun and reordering happening here, so I'm ignoring that for the > moment. No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so +146 is the offender. 
[   29.284746] watchpoint @ ffff95d41a0fe580 triggered
[   29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a0fe580)
[   29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00
[   29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00
...
[   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
[   29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002
[   29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08

RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580,
exactly 0x100 away.

WTF is this addl?

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 20:20 ` Kees Cook @ 2018-04-17 20:25 ` Kees Cook 2018-04-17 20:28 ` Jens Axboe 0 siblings, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-17 20:25 UTC (permalink / raw) To: Oleksandr Natalenko, Jens Axboe, Bart Van Assche, Paolo Valente Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 1:20 PM, Kees Cook <keescook@chromium.org> wrote: > On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote: >> The above bfq_dispatch_request+0x99/0xad0 is still >> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN >> removed. 0x99 is 153 decimal: >> >> (gdb) disass bfq_dispatch_request >> Dump of assembler code for function bfq_dispatch_request: >> ... >> 0xffffffff8134b2ad <+141>: test %rax,%rax >> 0xffffffff8134b2b0 <+144>: je 0xffffffff8134b2bd >> <bfq_dispatch_request+157> >> 0xffffffff8134b2b2 <+146>: addl $0x1,0x100(%rax) >> 0xffffffff8134b2b9 <+153>: addl $0x1,0x3c(%rbx) >> 0xffffffff8134b2bd <+157>: orl $0x2,0x18(%r12) >> 0xffffffff8134b2c3 <+163>: test %ebp,%ebp >> 0xffffffff8134b2c5 <+165>: je 0xffffffff8134b2ce >> <bfq_dispatch_request+174> >> 0xffffffff8134b2c7 <+167>: mov 0x108(%r14),%rax >> 0xffffffff8134b2ce <+174>: mov %r15,%rdi >> 0xffffffff8134b2d1 <+177>: callq 0xffffffff81706f90 <_raw_spin_unlock_irq> >> >> Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18 >> offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |= >> RQF_STARTED", the next C statement. I don't know what +146 is, though? >> An increment of something 256 bytes offset? There's a lot of inline >> fun and reordering happening here, so I'm ignoring that for the >> moment. > > No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so > +146 is the offender. 
>
> [   29.284746] watchpoint @ ffff95d41a0fe580 triggered
> [   29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a0fe580)
> [   29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00
> [   29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00
> ...
> [   29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0
> [   29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002
> [   29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08
>
> RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580,
> exactly 0x100 away.
>
> WTF is this addl?

What are the chances? :P Two ++ statements in a row separated by a
collapsed goto. FML. :)

	...
	bfqq->dispatched++;
	goto inc_in_driver_start_rq;
	...
inc_in_driver_start_rq:
	bfqd->rq_in_driver++;
	...

And there's the 0x100 (256):

struct bfq_queue {
	...
	int                        dispatched;           /*   256     4 */

So bfqq is corrupted somewhere... I'll keep digging. I hope you're all
enjoying my live debugging transcript. ;)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 20:25 ` Kees Cook @ 2018-04-17 20:28 ` Jens Axboe 0 siblings, 0 replies; 67+ messages in thread From: Jens Axboe @ 2018-04-17 20:28 UTC (permalink / raw) To: Kees Cook, Oleksandr Natalenko, Bart Van Assche, Paolo Valente Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, Paolo Valente On 4/17/18 2:25 PM, Kees Cook wrote: > On Tue, Apr 17, 2018 at 1:20 PM, Kees Cook <keescook@chromium.org> wrote: >> On Tue, Apr 17, 2018 at 1:03 PM, Kees Cook <keescook@chromium.org> wrote: >>> The above bfq_dispatch_request+0x99/0xad0 is still >>> __bfq_dispatch_request at block/bfq-iosched.c:3902, just with KASAN >>> removed. 0x99 is 153 decimal: >>> >>> (gdb) disass bfq_dispatch_request >>> Dump of assembler code for function bfq_dispatch_request: >>> ... >>> 0xffffffff8134b2ad <+141>: test %rax,%rax >>> 0xffffffff8134b2b0 <+144>: je 0xffffffff8134b2bd >>> <bfq_dispatch_request+157> >>> 0xffffffff8134b2b2 <+146>: addl $0x1,0x100(%rax) >>> 0xffffffff8134b2b9 <+153>: addl $0x1,0x3c(%rbx) >>> 0xffffffff8134b2bd <+157>: orl $0x2,0x18(%r12) >>> 0xffffffff8134b2c3 <+163>: test %ebp,%ebp >>> 0xffffffff8134b2c5 <+165>: je 0xffffffff8134b2ce >>> <bfq_dispatch_request+174> >>> 0xffffffff8134b2c7 <+167>: mov 0x108(%r14),%rax >>> 0xffffffff8134b2ce <+174>: mov %r15,%rdi >>> 0xffffffff8134b2d1 <+177>: callq 0xffffffff81706f90 <_raw_spin_unlock_irq> >>> >>> Just as a sanity-check, at +157 %r12 should be rq, rq_flags is 0x18 >>> offset from, $0x2 is RQF_STARTED, so that maps to "rq->rq_flags |= >>> RQF_STARTED", the next C statement. I don't know what +146 is, though? >>> An increment of something 256 bytes offset? There's a lot of inline >>> fun and reordering happening here, so I'm ignoring that for the >>> moment. >> >> No -- I'm reading this wrong. The RIP is the IP _after_ the trap, so >> +146 is the offender. 
>> >> [ 29.284746] watchpoint @ ffff95d41a0fe580 triggered >> [ 29.285349] sense before:ffff95d41f45f700 after:ffff95d41f45f701 (@ffff95d41a >> 0fe580) >> [ 29.286176] elevator before:ffff95d419419c00 after:ffff95d419419c00 >> [ 29.286847] elevator_data before:ffff95d419418c00 after:ffff95d419418c00 >> ... >> [ 29.295069] RIP: 0010:bfq_dispatch_request+0x99/0xbb0 >> [ 29.295622] RSP: 0018:ffffb26e01707a40 EFLAGS: 00000002 >> [ 29.296181] RAX: ffff95d41a0fe480 RBX: ffff95d419418c00 RCX: ffff95d419418c08 >> >> RAX is ffff95d41a0fe480 and sense is stored at ffff95d41a0fe580, >> exactly 0x100 away. >> >> WTF is this addl? > > What are the chances? :P Two ++ statements in a row separate by a > collapsed goto. FML. :) > > ... > bfqq->dispatched++; > goto inc_in_driver_start_rq; > ... > inc_in_driver_start_rq: > bfqd->rq_in_driver++; > ... > > And there's the 0x100 (256): > > struct bfq_queue { > ... > int dispatched; /* 256 4 */ > > So bfqq is corrupted somewhere... I'll keep digging. I hope you're all > enjoying my live debugging transcript. ;) It has to be the latter bfqq->dispatched increment, as those are transient (and bfqd is not). Adding Paolo. -- Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache
  2018-04-17 20:28 ` Jens Axboe (?)
@ 2018-04-17 20:46 ` Kees Cook
  2018-04-17 21:25   ` Kees Cook
  -1 siblings, 1 reply; 67+ messages in thread
From: Kees Cook @ 2018-04-17 20:46 UTC (permalink / raw)
To: Jens Axboe, Paolo Valente
Cc: Oleksandr Natalenko, Bart Van Assche, David Windsor,
    James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML,
    Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block

On Tue, Apr 17, 2018 at 1:28 PM, Jens Axboe <axboe@kernel.dk> wrote:
> It has to be the latter bfqq->dispatched increment, as those are
> transient (and bfqd is not).

Yeah, and I see a lot of comments around the lifetime of rq and bfqq,
so I assume something is not being locked correctly.

#define RQ_BFQQ(rq)		((rq)->elv.priv[1])

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	struct request *rq = NULL;
	struct bfq_queue *bfqq = NULL;

	if (!list_empty(&bfqd->dispatch)) {
		rq = list_first_entry(&bfqd->dispatch, struct request,
				      queuelist);
		list_del_init(&rq->queuelist);

		bfqq = RQ_BFQQ(rq);

		if (bfqq) {
			/*
			 * Increment counters here, because this
			 * dispatch does not follow the standard
			 * dispatch flow (where counters are
			 * incremented)
			 */
			bfqq->dispatched++;
	...

I see elv.priv[1] assignments made in a few places -- is it possible
there is some kind of uninitialized-but-not-NULL state that can leak
in there?

bfq_prepare_request() assigns elv.priv[1], and bfq_insert_request()
only checks that it's non-NULL (if at all) in one case. Can
bfq_insert_request() get called without bfq_prepare_request() being
called first?

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache
  2018-04-17 20:46 ` Kees Cook
@ 2018-04-17 21:25 ` Kees Cook
  2018-04-17 21:39   ` Jens Axboe
  0 siblings, 1 reply; 67+ messages in thread
From: Kees Cook @ 2018-04-17 21:25 UTC (permalink / raw)
To: Jens Axboe, Paolo Valente
Cc: Oleksandr Natalenko, Bart Van Assche, David Windsor,
    James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML,
    Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block

On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote:
> I see elv.priv[1] assignments made in a few places -- is it possible
> there is some kind of uninitialized-but-not-NULL state that can leak
> in there?

Got it. This fixes it for me:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..859df3160303 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
 	rq = blk_mq_rq_ctx_init(data, tag, op);
 	if (!op_is_flush(op)) {
-		rq->elv.icq = NULL;
+		memset(&rq->elv, 0, sizeof(rq->elv));
 		if (e && e->type->ops.mq.prepare_request) {
 			if (e->type->icq_cache && rq_ioc(bio))
 				blk_mq_sched_assign_ioc(rq, bio);
@@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq)
 		e->type->ops.mq.finish_request(rq);
 	if (rq->elv.icq) {
 		put_io_context(rq->elv.icq->ioc);
-		rq->elv.icq = NULL;
+		memset(&rq->elv, 0, sizeof(rq->elv));
 	}
 }

-- 
Kees Cook
Pixel Security

^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 21:25 ` Kees Cook @ 2018-04-17 21:39 ` Jens Axboe 2018-04-17 21:47 ` Kees Cook 0 siblings, 1 reply; 67+ messages in thread From: Jens Axboe @ 2018-04-17 21:39 UTC (permalink / raw) To: Kees Cook, Paolo Valente Cc: Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On 4/17/18 3:25 PM, Kees Cook wrote: > On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >> I see elv.priv[1] assignments made in a few places -- is it possible >> there is some kind of uninitialized-but-not-NULL state that can leak >> in there? > > Got it. This fixes it for me: > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 0dc9e341c2a7..859df3160303 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct > request_queue *q, > > rq = blk_mq_rq_ctx_init(data, tag, op); > if (!op_is_flush(op)) { > - rq->elv.icq = NULL; > + memset(&rq->elv, 0, sizeof(rq->elv)); > if (e && e->type->ops.mq.prepare_request) { > if (e->type->icq_cache && rq_ioc(bio)) > blk_mq_sched_assign_ioc(rq, bio); > @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) > e->type->ops.mq.finish_request(rq); > if (rq->elv.icq) { > put_io_context(rq->elv.icq->ioc); > - rq->elv.icq = NULL; > + memset(&rq->elv, 0, sizeof(rq->elv)); > } > } This looks like a BFQ problem, this should not be necessary. Paolo, you're calling your own prepare request handler from the insert as well, and your prepare request does nothing if rq->elv.icq == NULL. -- Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 21:39 ` Jens Axboe @ 2018-04-17 21:47 ` Kees Cook 2018-04-17 21:48 ` Jens Axboe 2018-04-17 21:55 ` Oleksandr Natalenko 0 siblings, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-17 21:47 UTC (permalink / raw) To: Jens Axboe Cc: Paolo Valente, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: > On 4/17/18 3:25 PM, Kees Cook wrote: >> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>> I see elv.priv[1] assignments made in a few places -- is it possible >>> there is some kind of uninitialized-but-not-NULL state that can leak >>> in there? >> >> Got it. This fixes it for me: >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 0dc9e341c2a7..859df3160303 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >> request_queue *q, >> >> rq = blk_mq_rq_ctx_init(data, tag, op); >> if (!op_is_flush(op)) { >> - rq->elv.icq = NULL; >> + memset(&rq->elv, 0, sizeof(rq->elv)); >> if (e && e->type->ops.mq.prepare_request) { >> if (e->type->icq_cache && rq_ioc(bio)) >> blk_mq_sched_assign_ioc(rq, bio); >> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >> e->type->ops.mq.finish_request(rq); >> if (rq->elv.icq) { >> put_io_context(rq->elv.icq->ioc); >> - rq->elv.icq = NULL; >> + memset(&rq->elv, 0, sizeof(rq->elv)); >> } >> } > > This looks like a BFQ problem, this should not be necessary. Paolo, > you're calling your own prepare request handler from the insert > as well, and your prepare request does nothing if rq->elv.icq == NULL. I sent the patch anyway, since it's kind of a robustness improvement, I'd hope. 
If you fix BFQ also, please add: Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Root-caused-by: Kees Cook <keescook@chromium.org> :) I gotta task-switch to other things! Thanks for the pointers, and thank you Oleksandr for providing the reproducer! -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 21:47 ` Kees Cook @ 2018-04-17 21:48 ` Jens Axboe 2018-04-17 22:57 ` Jens Axboe 2018-04-17 21:55 ` Oleksandr Natalenko 1 sibling, 1 reply; 67+ messages in thread From: Jens Axboe @ 2018-04-17 21:48 UTC (permalink / raw) To: Kees Cook Cc: Paolo Valente, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On 4/17/18 3:47 PM, Kees Cook wrote: > On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >> On 4/17/18 3:25 PM, Kees Cook wrote: >>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>> in there? >>> >>> Got it. This fixes it for me: >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 0dc9e341c2a7..859df3160303 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>> request_queue *q, >>> >>> rq = blk_mq_rq_ctx_init(data, tag, op); >>> if (!op_is_flush(op)) { >>> - rq->elv.icq = NULL; >>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>> if (e && e->type->ops.mq.prepare_request) { >>> if (e->type->icq_cache && rq_ioc(bio)) >>> blk_mq_sched_assign_ioc(rq, bio); >>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>> e->type->ops.mq.finish_request(rq); >>> if (rq->elv.icq) { >>> put_io_context(rq->elv.icq->ioc); >>> - rq->elv.icq = NULL; >>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>> } >>> } >> >> This looks like a BFQ problem, this should not be necessary. Paolo, >> you're calling your own prepare request handler from the insert >> as well, and your prepare request does nothing if rq->elv.icq == NULL. > > I sent the patch anyway, since it's kind of a robustness improvement, > I'd hope. 
If you fix BFQ also, please add: It's also a memset() in the hot path, would prefer to avoid that... The issue here is really the convoluted bfq usage of insert/prepare, I'm sure Paolo can take it from here. -- Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 21:48 ` Jens Axboe @ 2018-04-17 22:57 ` Jens Axboe 2018-04-17 23:06 ` Kees Cook 2018-04-18 9:08 ` Paolo Valente 0 siblings, 2 replies; 67+ messages in thread From: Jens Axboe @ 2018-04-17 22:57 UTC (permalink / raw) To: Kees Cook Cc: Paolo Valente, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On 4/17/18 3:48 PM, Jens Axboe wrote: > On 4/17/18 3:47 PM, Kees Cook wrote: >> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>> in there? >>>> >>>> Got it. This fixes it for me: >>>> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>> index 0dc9e341c2a7..859df3160303 100644 >>>> --- a/block/blk-mq.c >>>> +++ b/block/blk-mq.c >>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>> request_queue *q, >>>> >>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>> if (!op_is_flush(op)) { >>>> - rq->elv.icq = NULL; >>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>> if (e && e->type->ops.mq.prepare_request) { >>>> if (e->type->icq_cache && rq_ioc(bio)) >>>> blk_mq_sched_assign_ioc(rq, bio); >>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>> e->type->ops.mq.finish_request(rq); >>>> if (rq->elv.icq) { >>>> put_io_context(rq->elv.icq->ioc); >>>> - rq->elv.icq = NULL; >>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>> } >>>> } >>> >>> This looks like a BFQ problem, this should not be necessary. Paolo, >>> you're calling your own prepare request handler from the insert >>> as well, and your prepare request does nothing if rq->elv.icq == NULL. 
>>
>> I sent the patch anyway, since it's kind of a robustness improvement,
>> I'd hope. If you fix BFQ also, please add:
>
> It's also a memset() in the hot path, would prefer to avoid that...
> The issue here is really the convoluted bfq usage of insert/prepare,
> I'm sure Paolo can take it from here.

Does this fix it?

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f0ecd98509d8..d883469a1582 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio)
 	bool new_queue = false;
 	bool bfqq_already_existing = false, split = false;
 
-	if (!rq->elv.icq)
+	if (!rq->elv.icq) {
+		rq->elv.priv[0] = rq->elv.priv[1] = NULL;
 		return;
+	}
+
 	bic = icq_to_bic(rq->elv.icq);
 
 	spin_lock_irq(&bfqd->lock);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 22:57 ` Jens Axboe @ 2018-04-17 23:06 ` Kees Cook 2018-04-17 23:12 ` Jens Axboe 2018-04-18 9:08 ` Paolo Valente 1 sibling, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-17 23:06 UTC (permalink / raw) To: Jens Axboe Cc: Paolo Valente, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On Tue, Apr 17, 2018 at 3:57 PM, Jens Axboe <axboe@kernel.dk> wrote: > On 4/17/18 3:48 PM, Jens Axboe wrote: >> On 4/17/18 3:47 PM, Kees Cook wrote: >>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>>> in there? >>>>> >>>>> Got it. This fixes it for me: >>>>> >>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>> index 0dc9e341c2a7..859df3160303 100644 >>>>> --- a/block/blk-mq.c >>>>> +++ b/block/blk-mq.c >>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>>> request_queue *q, >>>>> >>>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>>> if (!op_is_flush(op)) { >>>>> - rq->elv.icq = NULL; >>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>> if (e && e->type->ops.mq.prepare_request) { >>>>> if (e->type->icq_cache && rq_ioc(bio)) >>>>> blk_mq_sched_assign_ioc(rq, bio); >>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>>> e->type->ops.mq.finish_request(rq); >>>>> if (rq->elv.icq) { >>>>> put_io_context(rq->elv.icq->ioc); >>>>> - rq->elv.icq = NULL; >>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>> } >>>>> } >>>> >>>> This looks like a BFQ problem, this should not be necessary. 
Paolo, >>>> you're calling your own prepare request handler from the insert >>>> as well, and your prepare request does nothing if rq->elv.icq == NULL. >>> >>> I sent the patch anyway, since it's kind of a robustness improvement, >>> I'd hope. If you fix BFQ also, please add: >> >> It's also a memset() in the hot path, would prefer to avoid that... >> The issue here is really the convoluted bfq usage of insert/prepare, >> I'm sure Paolo can take it from here. > > Does this fix it? > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index f0ecd98509d8..d883469a1582 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) > bool new_queue = false; > bool bfqq_already_existing = false, split = false; > > - if (!rq->elv.icq) > + if (!rq->elv.icq) { > + rq->elv.priv[0] = rq->elv.priv[1] = NULL; > return; > + } > + > bic = icq_to_bic(rq->elv.icq); > > spin_lock_irq(&bfqd->lock); It does! Excellent. :) -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 23:06 ` Kees Cook @ 2018-04-17 23:12 ` Jens Axboe 0 siblings, 0 replies; 67+ messages in thread From: Jens Axboe @ 2018-04-17 23:12 UTC (permalink / raw) To: Kees Cook Cc: Paolo Valente, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On 4/17/18 5:06 PM, Kees Cook wrote: > On Tue, Apr 17, 2018 at 3:57 PM, Jens Axboe <axboe@kernel.dk> wrote: >> On 4/17/18 3:48 PM, Jens Axboe wrote: >>> On 4/17/18 3:47 PM, Kees Cook wrote: >>>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>>>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>>>> in there? >>>>>> >>>>>> Got it. This fixes it for me: >>>>>> >>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>>> index 0dc9e341c2a7..859df3160303 100644 >>>>>> --- a/block/blk-mq.c >>>>>> +++ b/block/blk-mq.c >>>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>>>> request_queue *q, >>>>>> >>>>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>>>> if (!op_is_flush(op)) { >>>>>> - rq->elv.icq = NULL; >>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>> if (e && e->type->ops.mq.prepare_request) { >>>>>> if (e->type->icq_cache && rq_ioc(bio)) >>>>>> blk_mq_sched_assign_ioc(rq, bio); >>>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>>>> e->type->ops.mq.finish_request(rq); >>>>>> if (rq->elv.icq) { >>>>>> put_io_context(rq->elv.icq->ioc); >>>>>> - rq->elv.icq = NULL; >>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>> } >>>>>> } >>>>> >>>>> This looks like a BFQ problem, this should not be necessary. 
Paolo, >>>>> you're calling your own prepare request handler from the insert >>>>> as well, and your prepare request does nothing if rq->elv.icq == NULL. >>>> >>>> I sent the patch anyway, since it's kind of a robustness improvement, >>>> I'd hope. If you fix BFQ also, please add: >>> >>> It's also a memset() in the hot path, would prefer to avoid that... >>> The issue here is really the convoluted bfq usage of insert/prepare, >>> I'm sure Paolo can take it from here. >> >> Does this fix it? >> >> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c >> index f0ecd98509d8..d883469a1582 100644 >> --- a/block/bfq-iosched.c >> +++ b/block/bfq-iosched.c >> @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) >> bool new_queue = false; >> bool bfqq_already_existing = false, split = false; >> >> - if (!rq->elv.icq) >> + if (!rq->elv.icq) { >> + rq->elv.priv[0] = rq->elv.priv[1] = NULL; >> return; >> + } >> + >> bic = icq_to_bic(rq->elv.icq); >> >> spin_lock_irq(&bfqd->lock); > > It does! Excellent. :) Sweet! I'll add a comment and queue it up for 4.17 and mark for stable, with your annotations too. -- Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache @ 2018-04-18 9:08 ` Paolo Valente 0 siblings, 0 replies; 67+ messages in thread From: Paolo Valente @ 2018-04-18 9:08 UTC (permalink / raw) To: Jens Axboe Cc: Kees Cook, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block > Il giorno 18 apr 2018, alle ore 00:57, Jens Axboe <axboe@kernel.dk> ha scritto: > > On 4/17/18 3:48 PM, Jens Axboe wrote: >> On 4/17/18 3:47 PM, Kees Cook wrote: >>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>>> in there? >>>>> >>>>> Got it. This fixes it for me: >>>>> >>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>> index 0dc9e341c2a7..859df3160303 100644 >>>>> --- a/block/blk-mq.c >>>>> +++ b/block/blk-mq.c >>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>>> request_queue *q, >>>>> >>>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>>> if (!op_is_flush(op)) { >>>>> - rq->elv.icq = NULL; >>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>> if (e && e->type->ops.mq.prepare_request) { >>>>> if (e->type->icq_cache && rq_ioc(bio)) >>>>> blk_mq_sched_assign_ioc(rq, bio); >>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>>> e->type->ops.mq.finish_request(rq); >>>>> if (rq->elv.icq) { >>>>> put_io_context(rq->elv.icq->ioc); >>>>> - rq->elv.icq = NULL; >>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>> } >>>>> } >>>> >>>> This looks like a BFQ problem, this should not be necessary. 
Paolo, >>>> you're calling your own prepare request handler from the insert >>>> as well, and your prepare request does nothing if rq->elv.icq == NULL. >>> >>> I sent the patch anyway, since it's kind of a robustness improvement, >>> I'd hope. If you fix BFQ also, please add: >> >> It's also a memset() in the hot path, would prefer to avoid that... >> The issue here is really the convoluted bfq usage of insert/prepare, >> I'm sure Paolo can take it from here. > Hi, I'm very sorry for tuning in very late, but, at the same time, very glad to find the problem probably already solved ;) (in this respect, I swear, my delay was not intentional) > Does this fix it? > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index f0ecd98509d8..d883469a1582 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) > bool new_queue = false; > bool bfqq_already_existing = false, split = false; > > - if (!rq->elv.icq) > + if (!rq->elv.icq) { > + rq->elv.priv[0] = rq->elv.priv[1] = NULL; > return; > + } > + This does solve the problem at hand. But it also arouses a question, related to a possible subtle bug. For BFQ, !rq->elv.icq basically means "this request is not for me, as I am an icq-based scheduler". But, IIUC the main points in this thread, then this assumption is false. If it is actually false, then I hope that all requests with !rq->elv.icq that are sent to BFQ do verify the condition (at_head || blk_rq_is_passthrough(rq)). In fact, requests that do not verify that condition are those that BFQ must put in a bfq_queue. So, even if this patch makes the crash disappear, we drive BFQ completely crazy (and we may expect other strange failures) if we send BFQ a request with !((at_head || blk_rq_is_passthrough(rq)) and !rq->elv.icq. BFQ has to put that rq into a bfq_queue, but simply cannot. 
Jens, or any other, could you please shed a light on this, and explain how things are exactly? Thanks, Paolo > bic = icq_to_bic(rq->elv.icq); > > spin_lock_irq(&bfqd->lock); > > -- > Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-18 9:08 ` Paolo Valente (?) @ 2018-04-18 14:30 ` Jens Axboe 2018-04-19 9:32 ` Paolo Valente -1 siblings, 1 reply; 67+ messages in thread From: Jens Axboe @ 2018-04-18 14:30 UTC (permalink / raw) To: Paolo Valente Cc: Kees Cook, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block On 4/18/18 3:08 AM, Paolo Valente wrote: > > >> Il giorno 18 apr 2018, alle ore 00:57, Jens Axboe <axboe@kernel.dk> ha scritto: >> >> On 4/17/18 3:48 PM, Jens Axboe wrote: >>> On 4/17/18 3:47 PM, Kees Cook wrote: >>>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>>>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>>>> in there? >>>>>> >>>>>> Got it. 
This fixes it for me: >>>>>> >>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>>> index 0dc9e341c2a7..859df3160303 100644 >>>>>> --- a/block/blk-mq.c >>>>>> +++ b/block/blk-mq.c >>>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>>>> request_queue *q, >>>>>> >>>>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>>>> if (!op_is_flush(op)) { >>>>>> - rq->elv.icq = NULL; >>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>> if (e && e->type->ops.mq.prepare_request) { >>>>>> if (e->type->icq_cache && rq_ioc(bio)) >>>>>> blk_mq_sched_assign_ioc(rq, bio); >>>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>>>> e->type->ops.mq.finish_request(rq); >>>>>> if (rq->elv.icq) { >>>>>> put_io_context(rq->elv.icq->ioc); >>>>>> - rq->elv.icq = NULL; >>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>> } >>>>>> } >>>>> >>>>> This looks like a BFQ problem, this should not be necessary. Paolo, >>>>> you're calling your own prepare request handler from the insert >>>>> as well, and your prepare request does nothing if rq->elv.icq == NULL. >>>> >>>> I sent the patch anyway, since it's kind of a robustness improvement, >>>> I'd hope. If you fix BFQ also, please add: >>> >>> It's also a memset() in the hot path, would prefer to avoid that... >>> The issue here is really the convoluted bfq usage of insert/prepare, >>> I'm sure Paolo can take it from here. >> > > Hi, > I'm very sorry for tuning in very late, but, at the same time, very > glad to find the problem probably already solved ;) (in this respect, I swear, > my delay was not intentional) > >> Does this fix it? 
>> >> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c >> index f0ecd98509d8..d883469a1582 100644 >> --- a/block/bfq-iosched.c >> +++ b/block/bfq-iosched.c >> @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) >> bool new_queue = false; >> bool bfqq_already_existing = false, split = false; >> >> - if (!rq->elv.icq) >> + if (!rq->elv.icq) { >> + rq->elv.priv[0] = rq->elv.priv[1] = NULL; >> return; >> + } >> + > > This does solve the problem at hand. But it also arouses a question, > related to a possible subtle bug. > > For BFQ, !rq->elv.icq basically means "this request is not for me, as > I am an icq-based scheduler". But, IIUC the main points in this > thread, then this assumption is false. If it is actually false, then > I hope that all requests with !rq->elv.icq that are sent to BFQ do > verify the condition (at_head || blk_rq_is_passthrough(rq)). In fact, > requests that do not verify that condition are those that BFQ must put > in a bfq_queue. So, even if this patch makes the crash disappear, we > drive BFQ completely crazy (and we may expect other strange failures) > if we send BFQ a request with !((at_head || blk_rq_is_passthrough(rq)) > and !rq->elv.icq. BFQ has to put that rq into a bfq_queue, but simply > cannot. > > Jens, or any other, could you please shed a light on this, and explain > how things are exactly? Your assumption is correct, however you set ->priv[0] and ->priv[1] for requests, but only for ->elv.icq != NULL. So let's assume you get a request and assign those two, request completes. Later on, you get the same request, bypass insert it. BFQ doesn't clear the bic/bfqq pointers in the request, since ->elv.icq == NULL. It gets inserted into the dispatch list. Then when __bfq_dispatch_request() is called, you do: bfqq = RQ_BFQQ(rq); if (bfqq) bfqq->dispatched++; [...] which is wrong, since you don't know if you assigned a bfqq for this request. 
The memory that bfqq points to could be long gone, if that queue is freed. So you could either guard any bfqq/bic retrieval with ->elv.icq != NULL, or you could just clear the pointers for the case where the values aren't valid. -- Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache @ 2018-04-19 9:32 ` Paolo Valente 0 siblings, 0 replies; 67+ messages in thread From: Paolo Valente @ 2018-04-19 9:32 UTC (permalink / raw) To: Jens Axboe Cc: Kees Cook, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, Ulf Hansson, Mark Brown, Linus Walleij > Il giorno 18 apr 2018, alle ore 16:30, Jens Axboe <axboe@kernel.dk> ha scritto: > > On 4/18/18 3:08 AM, Paolo Valente wrote: >> >> >>> Il giorno 18 apr 2018, alle ore 00:57, Jens Axboe <axboe@kernel.dk> ha scritto: >>> >>> On 4/17/18 3:48 PM, Jens Axboe wrote: >>>> On 4/17/18 3:47 PM, Kees Cook wrote: >>>>> On Tue, Apr 17, 2018 at 2:39 PM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>> On 4/17/18 3:25 PM, Kees Cook wrote: >>>>>>> On Tue, Apr 17, 2018 at 1:46 PM, Kees Cook <keescook@chromium.org> wrote: >>>>>>>> I see elv.priv[1] assignments made in a few places -- is it possible >>>>>>>> there is some kind of uninitialized-but-not-NULL state that can leak >>>>>>>> in there? >>>>>>> >>>>>>> Got it. 
This fixes it for me: >>>>>>> >>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>>>> index 0dc9e341c2a7..859df3160303 100644 >>>>>>> --- a/block/blk-mq.c >>>>>>> +++ b/block/blk-mq.c >>>>>>> @@ -363,7 +363,7 @@ static struct request *blk_mq_get_request(struct >>>>>>> request_queue *q, >>>>>>> >>>>>>> rq = blk_mq_rq_ctx_init(data, tag, op); >>>>>>> if (!op_is_flush(op)) { >>>>>>> - rq->elv.icq = NULL; >>>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>>> if (e && e->type->ops.mq.prepare_request) { >>>>>>> if (e->type->icq_cache && rq_ioc(bio)) >>>>>>> blk_mq_sched_assign_ioc(rq, bio); >>>>>>> @@ -461,7 +461,7 @@ void blk_mq_free_request(struct request *rq) >>>>>>> e->type->ops.mq.finish_request(rq); >>>>>>> if (rq->elv.icq) { >>>>>>> put_io_context(rq->elv.icq->ioc); >>>>>>> - rq->elv.icq = NULL; >>>>>>> + memset(&rq->elv, 0, sizeof(rq->elv)); >>>>>>> } >>>>>>> } >>>>>> >>>>>> This looks like a BFQ problem, this should not be necessary. Paolo, >>>>>> you're calling your own prepare request handler from the insert >>>>>> as well, and your prepare request does nothing if rq->elv.icq == NULL. >>>>> >>>>> I sent the patch anyway, since it's kind of a robustness improvement, >>>>> I'd hope. If you fix BFQ also, please add: >>>> >>>> It's also a memset() in the hot path, would prefer to avoid that... >>>> The issue here is really the convoluted bfq usage of insert/prepare, >>>> I'm sure Paolo can take it from here. >>> >> >> Hi, >> I'm very sorry for tuning in very late, but, at the same time, very >> glad to find the problem probably already solved ;) (in this respect, I swear, >> my delay was not intentional) >> >>> Does this fix it? 
>>> >>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c >>> index f0ecd98509d8..d883469a1582 100644 >>> --- a/block/bfq-iosched.c >>> +++ b/block/bfq-iosched.c >>> @@ -4934,8 +4934,11 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) >>> bool new_queue = false; >>> bool bfqq_already_existing = false, split = false; >>> >>> - if (!rq->elv.icq) >>> + if (!rq->elv.icq) { >>> + rq->elv.priv[0] = rq->elv.priv[1] = NULL; >>> return; >>> + } >>> + >> >> This does solve the problem at hand. But it also arouses a question, >> related to a possible subtle bug. >> >> For BFQ, !rq->elv.icq basically means "this request is not for me, as >> I am an icq-based scheduler". But, IIUC the main points in this >> thread, then this assumption is false. If it is actually false, then >> I hope that all requests with !rq->elv.icq that are sent to BFQ do >> verify the condition (at_head || blk_rq_is_passthrough(rq)). In fact, >> requests that do not verify that condition are those that BFQ must put >> in a bfq_queue. So, even if this patch makes the crash disappear, we >> drive BFQ completely crazy (and we may expect other strange failures) >> if we send BFQ a request with !((at_head || blk_rq_is_passthrough(rq)) >> and !rq->elv.icq. BFQ has to put that rq into a bfq_queue, but simply >> cannot. >> >> Jens, or any other, could you please shed a light on this, and explain >> how things are exactly? > First, thanks for summing up the problem. > Your assumption is correct, however you set ->priv[0] and ->priv[1] for > requests, but only for ->elv.icq != NULL. So let's assume you get a > request and assign those two, request completes. Later on, you get > the same request, bypass insert it. BFQ doesn't clear the bic/bfqq > pointers in the request, since ->elv.icq == NULL. I'm missing something here. 
When the request gets completed in the first place, the hook bfq_finish_requeue_request gets called, and that hook clears both ->elv.priv elements (as the request has a non-null elv.icq). So, when bfq gets the same request again, those elements must be NULL. What am I getting wrong? I have some more concern on this point, but I'll stick to this for the moment, to not create more confusion. Thanks, Paolo > It gets inserted > into the dispatch list. > > Then when __bfq_dispatch_request() is called, you do: > > bfqq = RQ_BFQQ(rq); > if (bfqq) > bfqq->dispatched++; > [...] > > which is wrong, since you don't know if you assigned a bfqq for this > request. The memory that bfqq points to could be long gone, if that > queue is freed. So you could either guard any bfqq/bic retrieval > with ->elv.icq != NULL, or you could just clear the pointers for > the case where the values aren't valid. > > -- > Jens Axboe ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-19 9:32 ` Paolo Valente (?) @ 2018-04-20 20:23 ` Kees Cook 2018-04-20 20:41 ` Oleksandr Natalenko 2018-04-21 8:43 ` Paolo Valente -1 siblings, 2 replies; 67+ messages in thread From: Kees Cook @ 2018-04-20 20:23 UTC (permalink / raw) To: Paolo Valente Cc: Jens Axboe, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, Ulf Hansson, Mark Brown, Linus Walleij On Thu, Apr 19, 2018 at 2:32 AM, Paolo Valente <paolo.valente@linaro.org> wrote: > I'm missing something here. When the request gets completed in the > first place, the hook bfq_finish_requeue_request gets called, and that > hook clears both ->elv.priv elements (as the request has a non-null > elv.icq). So, when bfq gets the same request again, those elements > must be NULL. What am I getting wrong? > > I have some more concern on this point, but I'll stick to this for the > moment, to not create more confusion. I don't know the "how", I only found the "what". :) If you want, grab the reproducer VM linked to earlier in this thread; it'll hit the problem within about 30 seconds of running the reproducer. -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-20 20:23 ` Kees Cook @ 2018-04-20 20:41 ` Oleksandr Natalenko 2018-04-21 8:43 ` Paolo Valente 1 sibling, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-20 20:41 UTC (permalink / raw) To: Kees Cook Cc: Paolo Valente, Jens Axboe, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, Ulf Hansson, Mark Brown, Linus Walleij, keescook Hi. On 20.04.2018 22:23, Kees Cook wrote: > I don't know the "how", I only found the "what". :) If you want, grab > the reproducer VM linked to earlier in this thread; it'll hit the > problem within about 30 seconds of running the reproducer. Just to avoid a possible confusion I should note that I've removed the reproducer from my server, but I can re-upload it if needed. -- Oleksandr Natalenko (post-factum) ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-20 20:23 ` Kees Cook @ 2018-04-21 8:43 ` Paolo Valente 2018-04-21 8:43 ` Paolo Valente 1 sibling, 0 replies; 67+ messages in thread From: Paolo Valente @ 2018-04-21 8:43 UTC (permalink / raw) To: Kees Cook Cc: Jens Axboe, Oleksandr Natalenko, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, Ulf Hansson, Mark Brown, Linus Walleij > Il giorno 20 apr 2018, alle ore 22:23, Kees Cook <keescook@chromium.org> ha scritto: > > On Thu, Apr 19, 2018 at 2:32 AM, Paolo Valente <paolo.valente@linaro.org> wrote: >> I'm missing something here. When the request gets completed in the >> first place, the hook bfq_finish_requeue_request gets called, and that >> hook clears both ->elv.priv elements (as the request has a non-null >> elv.icq). So, when bfq gets the same request again, those elements >> must be NULL. What am I getting wrong? >> >> I have some more concern on this point, but I'll stick to this for the >> moment, to not create more confusion. > > I don't know the "how", I only found the "what". :) Got it, although I think you did much more than that ;) Anyway, my reply was exactly to a (Jens') detailed description of the how. And my concern is that there seems to be an inconsistency in that description. In addition, Jens is proposing a patch basing on that description. But, if this inconsistency is not solved, that patch may eliminate the symptom at hand, but it may not fix the real cause, or may even contribute to bury it deeper. > If you want, grab > the reproducer VM linked to earlier in this thread; it'll hit the > problem within about 30 seconds of running the reproducer. > Yep. Actually, I've been investigating this kind of failure, in different incarnations, for months now. In this respect, other examples are the srp-test failures reported by Bart, e.g., here [1].
According to my analysis, the cause of the problem is somewhere in blk-mq, outside bfq. Unfortunately, I didn't make it to find where it exactly is, mainly because of my limited expertise on blk-mq internals. So I have asked for any kind of help and suggestions to Jens, Mike and any other knowledgeable guy. Probably those help requests got somehow lost on those threads, but your results, Kees, and the analysis that followed from Jens seems now to be carrying us to the solution of the not-so-recent issue. Time will tell. Thanks, Paolo [1] https://www.spinics.net/lists/linux-block/msg22760.html > -Kees > > -- > Kees Cook > Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-17 21:47 ` Kees Cook 2018-04-17 21:48 ` Jens Axboe @ 2018-04-17 21:55 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-17 21:55 UTC (permalink / raw) To: Kees Cook Cc: Jens Axboe, Paolo Valente, Bart Van Assche, David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn, linux-block, keescook Hi. On 17.04.2018 23:47, Kees Cook wrote: > I sent the patch anyway, since it's kind of a robustness improvement, > I'd hope. If you fix BFQ also, please add: > > Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> > Root-caused-by: Kees Cook <keescook@chromium.org> > > :) I gotta task-switch to other things! > > Thanks for the pointers, and thank you Oleksandr for providing the > reproducer! That was great fun to read. Thank you for digging into it. Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-10 6:35 ` Oleksandr Natalenko 2018-04-10 6:53 ` Kees Cook @ 2018-04-10 13:47 ` Oleksandr Natalenko 1 sibling, 0 replies; 67+ messages in thread From: Oleksandr Natalenko @ 2018-04-10 13:47 UTC (permalink / raw) To: Kees Cook Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML, Christoph Hellwig, Jens Axboe, Hannes Reinecke, Johannes Thumshirn, linux-block, paolo.valente, keescook Hi. 10.04.2018 08:35, Oleksandr Natalenko wrote: >> - does it reproduce _without_ hardened usercopy? (I would assume yes, >> but you'd just not get any warning until the hangs started.) If it >> does reproduce without hardened usercopy, then a new bisect run could >> narrow the search even more. > > Looks like it cannot be disabled via kernel cmdline, so I have to > re-compile the kernel, right? I can certainly do that anyway. Okay, I've recompiled the kernel without hardened usercopy: [root@archlinux ~]# zgrep USERCOPY /proc/config.gz CONFIG_X86_INTEL_USERCOPY=y CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y # CONFIG_HARDENED_USERCOPY is not set and I cannot reproduce the issue anymore. I/O doesn't hang regardless of how long I hammer it. Eeeh? Maybe, this is a matter of some cleanup code path once the warn/bug condition is hit with hardening enabled? I'm just guessing here again. Will work towards checking Linus' master branch now… Regards, Oleksandr ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 19:07 usercopy whitelist woe in scsi_sense_cache Oleksandr Natalenko 2018-04-04 20:21 ` Kees Cook @ 2018-04-04 20:32 ` Kees Cook 2018-04-04 20:47 ` Douglas Gilbert 1 sibling, 1 reply; 67+ messages in thread From: Kees Cook @ 2018-04-04 20:32 UTC (permalink / raw) To: Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko <oleksandr@natalenko.name> wrote: > [ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure > attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)! > I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling > smartctl in a loop and doing some usual background I/O. The warning is > triggered within 3 minutes or so (not instantly). Also: Can you send me your .config? What SCSI drivers are you using in the VM and on the real server? Are you able to see what ioctl()s smartctl is issuing? I'll try to reproduce this on my end... -Kees -- Kees Cook Pixel Security ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: usercopy whitelist woe in scsi_sense_cache 2018-04-04 20:32 ` Kees Cook @ 2018-04-04 20:47 ` Douglas Gilbert 0 siblings, 0 replies; 67+ messages in thread From: Douglas Gilbert @ 2018-04-04 20:47 UTC (permalink / raw) To: Kees Cook, Oleksandr Natalenko Cc: David Windsor, James E.J. Bottomley, Martin K. Petersen, linux-scsi, LKML On 2018-04-04 04:32 PM, Kees Cook wrote: > On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko > <oleksandr@natalenko.name> wrote: >> [ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure >> attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)! >> I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling >> smartctl in a loop and doing some usual background I/O. The warning is >> triggered within 3 minutes or so (not instantly). > > Also: > > Can you send me your .config? What SCSI drivers are you using in the > VM and on the real server? > > Are you able to see what ioctl()s smartctl is issuing? I'll try to > reproduce this on my end... smartctl -r scsiioctl,3 .... ^ permalink raw reply [flat|nested] 67+ messages in thread
end of thread, other threads:[~2018-04-21 8:44 UTC | newest] Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-04 19:07 usercopy whitelist woe in scsi_sense_cache Oleksandr Natalenko 2018-04-04 20:21 ` Kees Cook 2018-04-04 20:44 ` Douglas Gilbert 2018-04-04 20:49 ` Oleksandr Natalenko 2018-04-04 21:25 ` Kees Cook 2018-04-04 21:34 ` Oleksandr Natalenko 2018-04-05 9:56 ` Oleksandr Natalenko 2018-04-05 14:21 ` Kees Cook 2018-04-05 14:32 ` Oleksandr Natalenko 2018-04-05 14:33 ` Oleksandr Natalenko [not found] ` <CAGXu5jL8oLV2xvjBVYv_SNXr74LdgpXEmU7K+cLYpD7jh2chgw@mail.gmail.com> 2018-04-05 18:52 ` Kees Cook 2018-04-06 6:21 ` Oleksandr Natalenko 2018-04-08 19:07 ` Oleksandr Natalenko 2018-04-09 9:35 ` Christoph Hellwig 2018-04-09 15:54 ` Oleksandr Natalenko 2018-04-09 18:32 ` Kees Cook 2018-04-09 19:02 ` Oleksandr Natalenko 2018-04-09 20:30 ` Kees Cook 2018-04-09 23:03 ` Kees Cook 2018-04-10 6:35 ` Oleksandr Natalenko 2018-04-10 6:53 ` Kees Cook 2018-04-10 17:16 ` Oleksandr Natalenko 2018-04-11 3:13 ` Kees Cook 2018-04-11 22:47 ` Kees Cook 2018-04-12 0:03 ` Kees Cook 2018-04-12 18:44 ` Kees Cook 2018-04-12 19:04 ` Oleksandr Natalenko 2018-04-12 19:04 ` Oleksandr Natalenko 2018-04-12 22:01 ` Kees Cook 2018-04-12 22:01 ` Kees Cook 2018-04-12 22:47 ` Kees Cook 2018-04-12 22:47 ` Kees Cook 2018-04-13 3:02 ` Kees Cook 2018-04-16 20:44 ` Kees Cook 2018-04-17 3:12 ` Kees Cook 2018-04-17 9:19 ` Oleksandr Natalenko 2018-04-17 16:25 ` Kees Cook 2018-04-17 10:02 ` James Bottomley 2018-04-17 16:30 ` Kees Cook 2018-04-17 16:42 ` Kees Cook 2018-04-17 16:46 ` Jens Axboe 2018-04-17 20:03 ` Kees Cook 2018-04-17 20:20 ` Kees Cook 2018-04-17 20:25 ` Kees Cook 2018-04-17 20:28 ` Jens Axboe 2018-04-17 20:28 ` Jens Axboe 2018-04-17 20:46 ` Kees Cook 2018-04-17 21:25 ` Kees Cook 2018-04-17 21:39 ` Jens Axboe 2018-04-17 21:47 ` Kees Cook 2018-04-17 21:48 ` Jens Axboe 2018-04-17 22:57 ` Jens Axboe 2018-04-17 23:06 ` 
Kees Cook 2018-04-17 23:12 ` Jens Axboe 2018-04-18 9:08 ` Paolo Valente 2018-04-18 9:08 ` Paolo Valente 2018-04-18 14:30 ` Jens Axboe 2018-04-19 9:32 ` Paolo Valente 2018-04-19 9:32 ` Paolo Valente 2018-04-20 20:23 ` Kees Cook 2018-04-20 20:41 ` Oleksandr Natalenko 2018-04-21 8:43 ` Paolo Valente 2018-04-21 8:43 ` Paolo Valente 2018-04-17 21:55 ` Oleksandr Natalenko 2018-04-10 13:47 ` Oleksandr Natalenko 2018-04-04 20:32 ` Kees Cook 2018-04-04 20:47 ` Douglas Gilbert