On Mon, Dec 13, 2021 at 10:24AM -0800, Linus Torvalds wrote: > On Mon, Dec 13, 2021 at 10:19 AM Marco Elver wrote: > > > > I'm still genuinely worried about this: > > > > > 2. Yet another potentially larger issue is if some code > > > kmalloc()s some structs containing refcount_t, and relies on > > > GFP_ZERO (kzalloc()) to initialize their data assuming that a > > > freshly initialized refcount_t contains 0. > > > > Even with everything properly wrapped up in atomic_ref_t, it's not going > > to prevent mis-initialization via kzalloc() and friends. > > I agree that it's an issue, but it's not a new issue. We've had the > exact same thing with a lot of other core data structures. > > And a ref-count of zero isn't valid _anyway_. When you allocate a > structure, a zero ref-count by definition is wrong. You need to set > the ref-count to the user that allocated it. > > So I don't actually think the "implicit zero" is an issue in practice, > because it would be wrong in the first place. Code that relies on > kzmalloc() to initialize a refcount cannot work right. > > (And by "cannot" I obviously mean "can, if you do wrong things" - it's > not like it's *impossible* to do an "atomic_inc_ref()" to change a 0 > refcount to a 1, but it's both wrong *AND* actively stupid, since an > allocation does not need to set the refcount atomically). I tried to throw syzkaller at this, but just booting results in this warning: | WARNING: CPU: 2 PID: 1 at block/blk-mq.c:3080 blk_mq_clear_flush_rq_mapping block/blk-mq.c:3080 [inline] | WARNING: CPU: 2 PID: 1 at block/blk-mq.c:3080 blk_mq_exit_hctx+0x20e/0x220 block/blk-mq.c:3105 | Modules linked in: | CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.16.0-rc5+ #3 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 | RIP: 0010:blk_mq_clear_flush_rq_mapping block/blk-mq.c:3080 [inline] | RIP: 0010:blk_mq_exit_hctx+0x20e/0x220 block/blk-mq.c:3105 | Code: 00 31 d2 bf 9c 00 00 00 e8 7f 20 a2 ff e9 52 ff ff ff e8 65 22 b1 ff 4c 89 e7 e8 6d 77 00 00 e9 58 fe ff ff e8 53 22 b1 ff 90 <0f> 0b 90 e9 7e fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 55 41 54 | RSP: 0000:ffff9c10800139f0 EFLAGS: 00010293 | RAX: 0000000000000000 RBX: ffff970c428990b8 RCX: ffffffff9c662ecd | RDX: ffff970c40880040 RSI: 0000000000000000 RDI: ffff970c4421e200 | RBP: ffff970c43ff9e80 R08: ffff970c43f14fe0 R09: ffff9c1080013a18 | R10: 0000000000000000 R11: 000000000001403d R12: ffff970c4421e200 | R13: ffff970c44237000 R14: 0000000000000100 R15: ffff970c4421e200 | FS: 0000000000000000(0000) GS:ffff970f6fc80000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | CR2: 0000000000000000 CR3: 00000002c2c0c001 CR4: 0000000000770ee0 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | PKRU: 55555554 | Call Trace: | | blk_mq_exit_hw_queues block/blk-mq.c:3130 [inline] | blk_mq_exit_queue+0x70/0x1a0 block/blk-mq.c:3780 | blk_cleanup_queue+0xba/0x120 block/blk-core.c:368 | __scsi_remove_device+0x77/0x190 drivers/scsi/scsi_sysfs.c:1458 | scsi_probe_and_add_lun+0xb5c/0xfa0 drivers/scsi/scsi_scan.c:1221 | __scsi_scan_target+0x121/0x660 drivers/scsi/scsi_scan.c:1604 | scsi_scan_channel drivers/scsi/scsi_scan.c:1692 [inline] | scsi_scan_channel+0x90/0xd0 drivers/scsi/scsi_scan.c:1668 | scsi_scan_host_selected+0x117/0x170 drivers/scsi/scsi_scan.c:1721 | do_scsi_scan_host+0xba/0xd0 drivers/scsi/scsi_scan.c:1860 | scsi_scan_host drivers/scsi/scsi_scan.c:1890 [inline] | scsi_scan_host+0x1cf/0x1f0 drivers/scsi/scsi_scan.c:1878 | virtscsi_probe+0x3a0/0x3f0 drivers/scsi/virtio_scsi.c:915 | virtio_dev_probe+0x1e4/0x2f0 drivers/virtio/virtio.c:273 | call_driver_probe drivers/base/dd.c:517 [inline] | really_probe.part.0+0xd3/0x320 drivers/base/dd.c:596 | really_probe drivers/base/dd.c:558 [inline] | __driver_probe_device+0xbf/0x180 drivers/base/dd.c:751 | driver_probe_device+0x27/0xd0 drivers/base/dd.c:781 | __driver_attach drivers/base/dd.c:1140 [inline] | __driver_attach+0xb4/0x1a0 drivers/base/dd.c:1092 | bus_for_each_dev+0xab/0x100 drivers/base/bus.c:301 | bus_add_driver+0x1c0/0x220 drivers/base/bus.c:618 | driver_register+0xc4/0x140 drivers/base/driver.c:171 | init+0xa0/0xea drivers/char/virtio_console.c:2257 | do_one_initcall+0x58/0x270 init/main.c:1297 | do_initcall_level init/main.c:1370 [inline] | do_initcalls init/main.c:1386 [inline] | do_basic_setup init/main.c:1405 [inline] | kernel_init_freeable+0x1f4/0x259 init/main.c:1610 | kernel_init+0x19/0x180 init/main.c:1499 | ret_from_fork+0x22/0x30 arch/x86/entry/entry_64.S:295 | which is WARN_ON_ONCE(refcount_read(&flush_rq->ref) != 0); .config is attached. Reverting this patch resolves the warning. I haven't been able to figure out which bit of that code assumes that "freed" is represented by 0 refcount internally. There is an alloc_pages() with GFP_ZERO in that code, which seems to allocate a 'struct request' pool that contains ->ref? This just confirms my suspicion that this is a can of worms. We can try to debug one issue after another until they are no more, but the fundamental issue of saying that 0 is not 0 is a bug-magnet (in the absence of real "constructors", which C doesn't have). Thanks, -- Marco