* regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Alex Xu (Hello71) @ 2021-05-08 17:54 UTC
To: linux-kernel, linux-ext4, dm-crypt, linux-nvme, linux-block

Hi all,

Using torvalds master, I recently encountered data corruption on my ext4
volume on LUKS on NVMe. Specifically, during heavy writes, the system
partially hangs; SysRq-W shows that processes are blocked in the kernel
on I/O. After forcibly rebooting, chunks of files are replaced with
other, unrelated data. I'm not sure exactly what the data is; some of it
is unknown binary data, but in at least one case, a list of file paths
was inserted into a file, indicating that the data is misdirected after
encryption.

This issue appears to affect files receiving writes in the temporal
vicinity of the hang, but affects both new and old data: for example, my
shell history file was corrupted up to many months before.

The drive reports no SMART issues.

I believe this is a regression in the kernel related to something merged
in the last few days, as it consistently occurs with my most recent
kernel versions, but disappears when reverting to an older kernel.

I haven't investigated further, such as by bisecting. I hope this is
sufficient information to give someone a lead on the issue, and if it is
a bug, nail it down before anybody else loses data.

Regards,
Alex.
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Alex Xu (Hello71) @ 2021-05-09  2:29 UTC
To: linux-kernel, linux-ext4, dm-crypt, linux-nvme, linux-block, Jens Axboe,
    Changheun Lee, bvanassche, yi.zhang, ming.lei, bgoncalv, hch, jaegeuk

Excerpts from Alex Xu (Hello71)'s message of May 8, 2021 1:54 pm:
> Hi all,
>
> Using torvalds master, I recently encountered data corruption on my ext4
> volume on LUKS on NVMe. Specifically, during heavy writes, the system
> partially hangs; SysRq-W shows that processes are blocked in the kernel
> on I/O. After forcibly rebooting, chunks of files are replaced with
> other, unrelated data. I'm not sure exactly what the data is; some of it
> is unknown binary data, but in at least one case, a list of file paths
> was inserted into a file, indicating that the data is misdirected after
> encryption.
>
> This issue appears to affect files receiving writes in the temporal
> vicinity of the hang, but affects both new and old data: for example, my
> shell history file was corrupted up to many months before.
>
> The drive reports no SMART issues.
>
> I believe this is a regression in the kernel related to something merged
> in the last few days, as it consistently occurs with my most recent
> kernel versions, but disappears when reverting to an older kernel.
>
> I haven't investigated further, such as by bisecting. I hope this is
> sufficient information to give someone a lead on the issue, and if it is
> a bug, nail it down before anybody else loses data.
>
> Regards,
> Alex.
>

I found the following test to reproduce a hang, which I guess may be the
cause:

host$ cd /tmp
host$ truncate -s 10G drive
host$ qemu-system-x86_64 -drive format=raw,file=drive,if=none,id=drive -device nvme,drive=drive,serial=1 [... more VM setup options]
guest$ cryptsetup luksFormat /dev/nvme0n1
[accept warning, use any password]
guest$ cryptsetup open /dev/nvme0n1 test
[enter password]
guest$ mkfs.ext4 /dev/mapper/test
[normal output...]
Creating journal (16384 blocks): [hangs forever]

I bisected this issue to:

cd2c7545ae1beac3b6aae033c7f31193b3255946 is the first bad commit
commit cd2c7545ae1beac3b6aae033c7f31193b3255946
Author: Changheun Lee <nanich.lee@samsung.com>
Date:   Mon May 3 18:52:03 2021 +0900

    bio: limit bio max size

I didn't try reverting this commit or further reducing the test case.
Let me know if you need my kernel config or other information.

Regards,
Alex.
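For reference, the "first bad commit" report quoted above is the final output
of a git bisect session. Run from the host kernel tree, such a session would
look roughly like the sketch below; the good endpoint (v5.12) is an
assumption, since the last known-good version is not stated in the report.

  $ git bisect start
  $ git bisect bad                # current torvalds master, which hangs
  $ git bisect good v5.12         # assumed last known-good release
  # build each candidate, boot it in the QEMU guest above, rerun mkfs.ext4,
  # then mark the result until git prints the first bad commit:
  $ git bisect good               # if mkfs.ext4 completes
  $ git bisect bad                # if it hangs at "Creating journal"
  $ git bisect reset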
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Jens Axboe @ 2021-05-09  3:51 UTC
To: Alex Xu (Hello71), linux-kernel, linux-ext4, dm-crypt, linux-nvme,
    linux-block, Changheun Lee, bvanassche, yi.zhang, ming.lei, bgoncalv,
    hch, jaegeuk

On 5/8/21 8:29 PM, Alex Xu (Hello71) wrote:
> Excerpts from Alex Xu (Hello71)'s message of May 8, 2021 1:54 pm:
>> Hi all,
>>
>> Using torvalds master, I recently encountered data corruption on my ext4
>> volume on LUKS on NVMe. Specifically, during heavy writes, the system
>> partially hangs; SysRq-W shows that processes are blocked in the kernel
>> on I/O. After forcibly rebooting, chunks of files are replaced with
>> other, unrelated data. I'm not sure exactly what the data is; some of it
>> is unknown binary data, but in at least one case, a list of file paths
>> was inserted into a file, indicating that the data is misdirected after
>> encryption.
>>
>> This issue appears to affect files receiving writes in the temporal
>> vicinity of the hang, but affects both new and old data: for example, my
>> shell history file was corrupted up to many months before.
>>
>> The drive reports no SMART issues.
>>
>> I believe this is a regression in the kernel related to something merged
>> in the last few days, as it consistently occurs with my most recent
>> kernel versions, but disappears when reverting to an older kernel.
>>
>> I haven't investigated further, such as by bisecting. I hope this is
>> sufficient information to give someone a lead on the issue, and if it is
>> a bug, nail it down before anybody else loses data.
>>
>> Regards,
>> Alex.
>>
>
> I found the following test to reproduce a hang, which I guess may be the
> cause:
>
> host$ cd /tmp
> host$ truncate -s 10G drive
> host$ qemu-system-x86_64 -drive format=raw,file=drive,if=none,id=drive -device nvme,drive=drive,serial=1 [... more VM setup options]
> guest$ cryptsetup luksFormat /dev/nvme0n1
> [accept warning, use any password]
> guest$ cryptsetup open /dev/nvme0n1 test
> [enter password]
> guest$ mkfs.ext4 /dev/mapper/test
> [normal output...]
> Creating journal (16384 blocks): [hangs forever]
>
> I bisected this issue to:
>
> cd2c7545ae1beac3b6aae033c7f31193b3255946 is the first bad commit
> commit cd2c7545ae1beac3b6aae033c7f31193b3255946
> Author: Changheun Lee <nanich.lee@samsung.com>
> Date:   Mon May 3 18:52:03 2021 +0900
>
>     bio: limit bio max size
>
> I didn't try reverting this commit or further reducing the test case.
> Let me know if you need my kernel config or other information.

If you have time, please do test with that reverted. I'd be anxious to
get this revert queued up for 5.13-rc1.

-- 
Jens Axboe
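A minimal way to run the revert test Jens asks for, assuming the same tree
and kernel config as in the reproducer above (the job count and boot step are
placeholders for whatever setup is actually in use):

  $ git revert cd2c7545ae1beac3b6aae033c7f31193b3255946
  $ make olddefconfig && make -j"$(nproc)"
  # boot the resulting kernel in the QEMU guest, then repeat:
  #   cryptsetup open /dev/nvme0n1 test && mkfs.ext4 /dev/mapper/test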
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Alex Xu (Hello71) @ 2021-05-09 14:47 UTC
To: Jens Axboe, bgoncalv, bvanassche, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, Changheun Lee, yi.zhang

Excerpts from Jens Axboe's message of May 8, 2021 11:51 pm:
> On 5/8/21 8:29 PM, Alex Xu (Hello71) wrote:
>> Excerpts from Alex Xu (Hello71)'s message of May 8, 2021 1:54 pm:
>>> Hi all,
>>>
>>> Using torvalds master, I recently encountered data corruption on my ext4
>>> volume on LUKS on NVMe. Specifically, during heavy writes, the system
>>> partially hangs; SysRq-W shows that processes are blocked in the kernel
>>> on I/O. After forcibly rebooting, chunks of files are replaced with
>>> other, unrelated data. I'm not sure exactly what the data is; some of it
>>> is unknown binary data, but in at least one case, a list of file paths
>>> was inserted into a file, indicating that the data is misdirected after
>>> encryption.
>>>
>>> This issue appears to affect files receiving writes in the temporal
>>> vicinity of the hang, but affects both new and old data: for example, my
>>> shell history file was corrupted up to many months before.
>>>
>>> The drive reports no SMART issues.
>>>
>>> I believe this is a regression in the kernel related to something merged
>>> in the last few days, as it consistently occurs with my most recent
>>> kernel versions, but disappears when reverting to an older kernel.
>>>
>>> I haven't investigated further, such as by bisecting. I hope this is
>>> sufficient information to give someone a lead on the issue, and if it is
>>> a bug, nail it down before anybody else loses data.
>>>
>>> Regards,
>>> Alex.
>>>
>>
>> I found the following test to reproduce a hang, which I guess may be the
>> cause:
>>
>> host$ cd /tmp
>> host$ truncate -s 10G drive
>> host$ qemu-system-x86_64 -drive format=raw,file=drive,if=none,id=drive -device nvme,drive=drive,serial=1 [... more VM setup options]
>> guest$ cryptsetup luksFormat /dev/nvme0n1
>> [accept warning, use any password]
>> guest$ cryptsetup open /dev/nvme0n1 test
>> [enter password]
>> guest$ mkfs.ext4 /dev/mapper/test
>> [normal output...]
>> Creating journal (16384 blocks): [hangs forever]
>>
>> I bisected this issue to:
>>
>> cd2c7545ae1beac3b6aae033c7f31193b3255946 is the first bad commit
>> commit cd2c7545ae1beac3b6aae033c7f31193b3255946
>> Author: Changheun Lee <nanich.lee@samsung.com>
>> Date:   Mon May 3 18:52:03 2021 +0900
>>
>>     bio: limit bio max size
>>
>> I didn't try reverting this commit or further reducing the test case.
>> Let me know if you need my kernel config or other information.
>
> If you have time, please do test with that reverted. I'd be anxious to
> get this revert queued up for 5.13-rc1.
>
> -- 
> Jens Axboe
>

I tested reverting it on top of b741596468b010af2846b75f5e75a842ce344a6e
("Merge tag 'riscv-for-linus-5.13-mw1' of
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux"), causing it
to no longer hang. I didn't check if this fixes the data corruption, but
I assume so.

I also tested a 1 GB image (works either way), and a virtio-blk
interface (works either way).

The Show Blocked State from the VM (without revert):

sysrq: Show Blocked State
task:kworker/u2:0    state:D stack:    0 pid:    7 ppid:     2 flags:0x00004000
Workqueue: kcryptd/252:0 kcryptd_crypt
Call Trace:
 __schedule+0x1a2/0x4f0
 schedule+0x63/0xe0
 schedule_timeout+0x6a/0xd0
 ? lock_timer_base+0x80/0x80
 io_schedule_timeout+0x4c/0x70
 mempool_alloc+0xfc/0x130
 ? __wake_up_common_lock+0x90/0x90
 kcryptd_crypt+0x291/0x4e0
 process_one_work+0x1b1/0x300
 worker_thread+0x48/0x3d0
 ? process_one_work+0x300/0x300
 kthread+0x129/0x150
 ? __kthread_create_worker+0x100/0x100
 ret_from_fork+0x22/0x30
task:mkfs.ext4       state:D stack:    0 pid:  979 ppid:   964 flags:0x00004000
Call Trace:
 __schedule+0x1a2/0x4f0
 ? __schedule+0x1aa/0x4f0
 schedule+0x63/0xe0
 schedule_timeout+0x99/0xd0
 io_schedule_timeout+0x4c/0x70
 wait_for_completion_io+0x74/0xc0
 submit_bio_wait+0x46/0x60
 blkdev_issue_zeroout+0x118/0x1f0
 blkdev_fallocate+0x125/0x180
 vfs_fallocate+0x126/0x2e0
 __x64_sys_fallocate+0x37/0x60
 do_syscall_64+0x61/0x80
 ? do_syscall_64+0x6e/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Regards,
Alex.
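The "Show Blocked State" dump above is SysRq-W output. On a hung guest it can
also be captured without a keyboard, from another shell or console, for
example:

  # echo w > /proc/sysrq-trigger    # requires sysrq to be enabled (kernel.sysrq)
  # dmesg | tail -n 80              # the blocked-task traces appear in the kernel log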
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Changheun Lee @ 2021-05-13  9:42 UTC
To: alex_y_xu
Cc: axboe, bgoncalv, bvanassche, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang

> Excerpts from Jens Axboe's message of May 8, 2021 11:51 pm:
> > On 5/8/21 8:29 PM, Alex Xu (Hello71) wrote:
> >> Excerpts from Alex Xu (Hello71)'s message of May 8, 2021 1:54 pm:
> >>> Hi all,
> >>>
> >>> Using torvalds master, I recently encountered data corruption on my ext4
> >>> volume on LUKS on NVMe. Specifically, during heavy writes, the system
> >>> partially hangs; SysRq-W shows that processes are blocked in the kernel
> >>> on I/O. After forcibly rebooting, chunks of files are replaced with
> >>> other, unrelated data. I'm not sure exactly what the data is; some of it
> >>> is unknown binary data, but in at least one case, a list of file paths
> >>> was inserted into a file, indicating that the data is misdirected after
> >>> encryption.
> >>>
> >>> This issue appears to affect files receiving writes in the temporal
> >>> vicinity of the hang, but affects both new and old data: for example, my
> >>> shell history file was corrupted up to many months before.
> >>>
> >>> The drive reports no SMART issues.
> >>>
> >>> I believe this is a regression in the kernel related to something merged
> >>> in the last few days, as it consistently occurs with my most recent
> >>> kernel versions, but disappears when reverting to an older kernel.
> >>>
> >>> I haven't investigated further, such as by bisecting. I hope this is
> >>> sufficient information to give someone a lead on the issue, and if it is
> >>> a bug, nail it down before anybody else loses data.
> >>>
> >>> Regards,
> >>> Alex.
> >>>
> >>
> >> I found the following test to reproduce a hang, which I guess may be the
> >> cause:
> >>
> >> host$ cd /tmp
> >> host$ truncate -s 10G drive
> >> host$ qemu-system-x86_64 -drive format=raw,file=drive,if=none,id=drive -device nvme,drive=drive,serial=1 [... more VM setup options]
> >> guest$ cryptsetup luksFormat /dev/nvme0n1
> >> [accept warning, use any password]
> >> guest$ cryptsetup open /dev/nvme0n1 test
> >> [enter password]
> >> guest$ mkfs.ext4 /dev/mapper/test
> >> [normal output...]
> >> Creating journal (16384 blocks): [hangs forever]
> >>
> >> I bisected this issue to:
> >>
> >> cd2c7545ae1beac3b6aae033c7f31193b3255946 is the first bad commit
> >> commit cd2c7545ae1beac3b6aae033c7f31193b3255946
> >> Author: Changheun Lee <nanich.lee@samsung.com>
> >> Date:   Mon May 3 18:52:03 2021 +0900
> >>
> >>     bio: limit bio max size
> >>
> >> I didn't try reverting this commit or further reducing the test case.
> >> Let me know if you need my kernel config or other information.
> >
> > If you have time, please do test with that reverted. I'd be anxious to
> > get this revert queued up for 5.13-rc1.
> >
> > -- 
> > Jens Axboe
> >
>
> I tested reverting it on top of b741596468b010af2846b75f5e75a842ce344a6e
> ("Merge tag 'riscv-for-linus-5.13-mw1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux"), causing it
> to no longer hang. I didn't check if this fixes the data corruption, but
> I assume so.
>
> I also tested a 1 GB image (works either way), and a virtio-blk
> interface (works either way).
>
> The Show Blocked State from the VM (without revert):
>
> sysrq: Show Blocked State
> task:kworker/u2:0    state:D stack:    0 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: kcryptd/252:0 kcryptd_crypt
> Call Trace:
>  __schedule+0x1a2/0x4f0
>  schedule+0x63/0xe0
>  schedule_timeout+0x6a/0xd0
>  ? lock_timer_base+0x80/0x80
>  io_schedule_timeout+0x4c/0x70
>  mempool_alloc+0xfc/0x130
>  ? __wake_up_common_lock+0x90/0x90
>  kcryptd_crypt+0x291/0x4e0
>  process_one_work+0x1b1/0x300
>  worker_thread+0x48/0x3d0
>  ? process_one_work+0x300/0x300
>  kthread+0x129/0x150
>  ? __kthread_create_worker+0x100/0x100
>  ret_from_fork+0x22/0x30
> task:mkfs.ext4       state:D stack:    0 pid:  979 ppid:   964 flags:0x00004000
> Call Trace:
>  __schedule+0x1a2/0x4f0
>  ? __schedule+0x1aa/0x4f0
>  schedule+0x63/0xe0
>  schedule_timeout+0x99/0xd0
>  io_schedule_timeout+0x4c/0x70
>  wait_for_completion_io+0x74/0xc0
>  submit_bio_wait+0x46/0x60
>  blkdev_issue_zeroout+0x118/0x1f0
>  blkdev_fallocate+0x125/0x180
>  vfs_fallocate+0x126/0x2e0
>  __x64_sys_fallocate+0x37/0x60
>  do_syscall_64+0x61/0x80
>  ? do_syscall_64+0x6e/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Regards,
> Alex.
>

First of all, thank you very much for reporting the bug. And sorry about
your data loss.

The problem might be caused by memory exhaustion, and the memory
exhaustion would be caused by setting a small bio_max_size. Actually it
was not reproduced in my VM environment at first, but I reproduced the
same problem when bio_max_size was forced to 8KB. Too many bio
allocations would occur with an 8KB bio_max_size.

So I have prepared a v10 patch to fix this bug. It will prevent
bio_max_size from being set too small: bio_max_size will have a minimum
of 1MB, which is the same as the legacy bio size limit before "multipage
bvec" was applied. It would be very helpful to me if you could test with
the v10 patch. :)

Thanks,
Changheun Lee.
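Since the bisected commit derives the bio size cap from the request queue's
limits, it can be worth checking what the devices involved actually
advertise. A small sysfs check along these lines may help; the device names
below come from the QEMU reproducer and are only assumptions for a real
system:

  $ cat /sys/block/nvme0n1/queue/max_sectors_kb
  $ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
  $ cat /sys/block/dm-0/queue/max_sectors_kb    # the dm-crypt device on top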
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Theodore Ts'o @ 2021-05-13 14:15 UTC
To: Changheun Lee
Cc: alex_y_xu, axboe, bgoncalv, bvanassche, dm-crypt, hch, jaegeuk,
    linux-block, linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang

On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
>
> The problem might be caused by memory exhaustion, and the memory
> exhaustion would be caused by setting a small bio_max_size. Actually it
> was not reproduced in my VM environment at first, but I reproduced the
> same problem when bio_max_size was forced to 8KB. Too many bio
> allocations would occur with an 8KB bio_max_size.

Hmm... I'm not sure how to align your diagnosis with the symptoms in
the bug report. If we were limited by memory, that should slow down
the I/O, but we should still be making forward progress, no? And a
forced reboot should not result in data corruption, unless maybe there
was a missing check for a failed memory allocation, causing data to be
written to the wrong location, a missing error check leading to the
block or file system layer not noticing that a write had failed
(although again, memory exhaustion should not lead to failed writes;
it might slow us down, sure, but if writes are being failed, something
is Badly Going Wrong --- things like writes to the swap device or
writes by the page cleaner must succeed, or else Things Would Go Bad
In A Hurry).

					- Ted
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Bart Van Assche @ 2021-05-13 15:59 UTC
To: Theodore Ts'o, Changheun Lee
Cc: alex_y_xu, axboe, bgoncalv, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang

On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
>>
>> The problem might be caused by memory exhaustion, and the memory
>> exhaustion would be caused by setting a small bio_max_size. Actually it
>> was not reproduced in my VM environment at first, but I reproduced the
>> same problem when bio_max_size was forced to 8KB. Too many bio
>> allocations would occur with an 8KB bio_max_size.
>
> Hmm... I'm not sure how to align your diagnosis with the symptoms in
> the bug report. If we were limited by memory, that should slow down
> the I/O, but we should still be making forward progress, no? And a
> forced reboot should not result in data corruption, unless maybe there
> was a missing check for a failed memory allocation, causing data to be
> written to the wrong location, a missing error check leading to the
> block or file system layer not noticing that a write had failed
> (although again, memory exhaustion should not lead to failed writes;
> it might slow us down, sure, but if writes are being failed, something
> is Badly Going Wrong --- things like writes to the swap device or
> writes by the page cleaner must succeed, or else Things Would Go Bad
> In A Hurry).

After the LUKS data corruption issue was reported I decided to take a
look at the dm-crypt code. In that code I found the following:

static void clone_init(struct dm_crypt_io *io, struct bio *clone)
{
        struct crypt_config *cc = io->cc;

        clone->bi_private = io;
        clone->bi_end_io = crypt_endio;
        bio_set_dev(clone, cc->dev->bdev);
        clone->bi_opf = io->base_bio->bi_opf;
}
[ ... ]
static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
{
[ ... ]
        clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, &cc->bs);
[ ... ]
        clone_init(io, clone);
[ ... ]
        for (i = 0; i < nr_iovecs; i++) {
[ ... ]
                bio_add_page(clone, page, len, 0);

                remaining_size -= len;
        }
[ ... ]
}

My interpretation is that crypt_alloc_buffer() allocates a bio,
associates it with the underlying device and clones a bio. The input bio
may have a size up to UINT_MAX while the new limit for the size of the
cloned bio is max_sectors * 512. That causes bio_add_page() to fail if
the input bio is larger than max_sectors * 512, hence the data
corruption. Please note that this is a guess only and that I'm not
familiar with the dm-crypt code.

Bart.
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Mikulas Patocka @ 2021-05-13 19:22 UTC
To: Milan Broz, Bart Van Assche, Theodore Ts'o, Changheun Lee
Cc: alex_y_xu, axboe, bgoncalv, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang, dm-devel

> On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> >>
> >> The problem might be caused by memory exhaustion, and the memory
> >> exhaustion would be caused by setting a small bio_max_size. Actually it
> >> was not reproduced in my VM environment at first, but I reproduced the
> >> same problem when bio_max_size was forced to 8KB. Too many bio
> >> allocations would occur with an 8KB bio_max_size.
> >
> > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > the bug report. If we were limited by memory, that should slow down
> > the I/O, but we should still be making forward progress, no? And a
> > forced reboot should not result in data corruption, unless maybe there
> > was a missing check for a failed memory allocation, causing data to be
> > written to the wrong location, a missing error check leading to the
> > block or file system layer not noticing that a write had failed
> > (although again, memory exhaustion should not lead to failed writes;
> > it might slow us down, sure, but if writes are being failed, something
> > is Badly Going Wrong --- things like writes to the swap device or
> > writes by the page cleaner must succeed, or else Things Would Go Bad
> > In A Hurry).
>
> After the LUKS data corruption issue was reported I decided to take a
> look at the dm-crypt code. In that code I found the following:
>
> static void clone_init(struct dm_crypt_io *io, struct bio *clone)
> {
>         struct crypt_config *cc = io->cc;
>
>         clone->bi_private = io;
>         clone->bi_end_io = crypt_endio;
>         bio_set_dev(clone, cc->dev->bdev);
>         clone->bi_opf = io->base_bio->bi_opf;
> }
> [ ... ]
> static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
> {
> [ ... ]
>         clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, &cc->bs);
> [ ... ]
>         clone_init(io, clone);
> [ ... ]
>         for (i = 0; i < nr_iovecs; i++) {
> [ ... ]
>                 bio_add_page(clone, page, len, 0);
>
>                 remaining_size -= len;
>         }
> [ ... ]
> }
>
> My interpretation is that crypt_alloc_buffer() allocates a bio,
> associates it with the underlying device and clones a bio. The input bio
> may have a size up to UINT_MAX while the new limit for the size of the
> cloned bio is max_sectors * 512. That causes bio_add_page() to fail if
> the input bio is larger than max_sectors * 512, hence the data
> corruption. Please note that this is a guess only and that I'm not
> familiar with the dm-crypt code.
>
> Bart.

We already had problems with too large bios in dm-crypt and we fixed it by
adding this piece of code:

        /*
         * Check if bio is too large, split as needed.
         */
        if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
            (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
                dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));

It will ask the device mapper to split the bio if it is too large. So,
crypt_alloc_buffer can't receive a bio that is larger than
BIO_MAX_VECS << PAGE_SHIFT.

Mikulas
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Bart Van Assche @ 2021-05-13 21:18 UTC
To: Mikulas Patocka, Milan Broz, Theodore Ts'o, Changheun Lee
Cc: alex_y_xu, axboe, bgoncalv, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang, dm-devel

On 5/13/21 12:22 PM, Mikulas Patocka wrote:
> We already had problems with too large bios in dm-crypt and we fixed it by
> adding this piece of code:
>
>         /*
>          * Check if bio is too large, split as needed.
>          */
>         if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
>             (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
>                 dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));
>
> It will ask the device mapper to split the bio if it is too large. So,
> crypt_alloc_buffer can't receive a bio that is larger than
> BIO_MAX_VECS << PAGE_SHIFT.

Hi Mikulas,

Are you perhaps referring to commit 4e870e948fba ("dm crypt: fix error
with too large bios")? Did that commit go upstream before multi-page
bvec support? Can larger bios be supported in case of two or more
contiguous pages now that multi-page bvec support is upstream?

Thanks,

Bart.
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Mikulas Patocka @ 2021-05-14  9:43 UTC
To: Bart Van Assche
Cc: Milan Broz, Theodore Ts'o, Changheun Lee, alex_y_xu, axboe, bgoncalv,
    dm-crypt, hch, jaegeuk, linux-block, linux-ext4, linux-kernel,
    linux-nvme, ming.lei, yi.zhang, dm-devel

On Thu, 13 May 2021, Bart Van Assche wrote:

> On 5/13/21 12:22 PM, Mikulas Patocka wrote:
> > We already had problems with too large bios in dm-crypt and we fixed it by
> > adding this piece of code:
> >
> >         /*
> >          * Check if bio is too large, split as needed.
> >          */
> >         if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
> >             (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
> >                 dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));
> >
> > It will ask the device mapper to split the bio if it is too large. So,
> > crypt_alloc_buffer can't receive a bio that is larger than
> > BIO_MAX_VECS << PAGE_SHIFT.
>
> Hi Mikulas,
>
> Are you perhaps referring to commit 4e870e948fba ("dm crypt: fix error
> with too large bios")? Did that commit go upstream before multi-page
> bvec support?

Yes. It's from 2016.

> Can larger bios be supported in case of two or more
> contiguous pages now that multi-page bvec support is upstream?

No - we need to allocate a buffer for the written data. The buffer size
is limited to PAGE_SIZE * BIO_MAX_VECS.

> Thanks,
>
> Bart.

Mikulas
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Mikulas Patocka @ 2021-05-14  9:50 UTC
To: Milan Broz, Bart Van Assche, Theodore Ts'o, Changheun Lee
Cc: alex_y_xu, axboe, bgoncalv, dm-crypt, hch, jaegeuk, linux-block,
    linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang, dm-devel

On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
>>
>> The problem might be caused by memory exhaustion, and the memory
>> exhaustion would be caused by setting a small bio_max_size. Actually it
>> was not reproduced in my VM environment at first, but I reproduced the
>> same problem when bio_max_size was forced to 8KB. Too many bio
>> allocations would occur with an 8KB bio_max_size.
>
> Hmm... I'm not sure how to align your diagnosis with the symptoms in
> the bug report. If we were limited by memory, that should slow down
> the I/O, but we should still be making forward progress, no? And a
> forced reboot should not result in data corruption, unless maybe there

If you use data=writeback, data writes and journal writes are not
synchronized. So, it may be possible that a journal write made it
through, a data write didn't - the end result would be a file containing
random contents that was on the disk.

Changheun - do you use data=writeback? Did the corruption happen only in
newly created files? Or did it corrupt existing files?

> was a missing check for a failed memory allocation, causing data to be
> written to the wrong location, a missing error check leading to the
> block or file system layer not noticing that a write had failed
> (although again, memory exhaustion should not lead to failed writes;
> it might slow us down, sure, but if writes are being failed, something
> is Badly Going Wrong --- things like writes to the swap device or
> writes by the page cleaner must succeed, or else Things Would Go Bad
> In A Hurry).

Mikulas
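Whether an ext4 mount is actually using data=writeback can be checked without
remounting; a quick sketch, where the device and mount point are placeholders
for the reporter's actual setup:

  $ grep ' /mnt ' /proc/mounts          # data=writeback is listed if it was requested
  $ tune2fs -l /dev/mapper/test | grep -i 'mount options'   # superblock defaults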
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Changheun Lee @ 2021-05-14 10:26 UTC
To: alex_y_xu
Cc: gmazyland, bvanassche, tytso, axboe, bgoncalv, dm-crypt, hch, jaegeuk,
    linux-block, linux-ext4, linux-kernel, linux-nvme, ming.lei, yi.zhang,
    dm-devel

> On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> >>
> >> The problem might be caused by memory exhaustion, and the memory
> >> exhaustion would be caused by setting a small bio_max_size. Actually it
> >> was not reproduced in my VM environment at first, but I reproduced the
> >> same problem when bio_max_size was forced to 8KB. Too many bio
> >> allocations would occur with an 8KB bio_max_size.
> >
> > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > the bug report. If we were limited by memory, that should slow down
> > the I/O, but we should still be making forward progress, no? And a
> > forced reboot should not result in data corruption, unless maybe there
>
> If you use data=writeback, data writes and journal writes are not
> synchronized. So, it may be possible that a journal write made it
> through, a data write didn't - the end result would be a file containing
> random contents that was on the disk.
>
> Changheun - do you use data=writeback? Did the corruption happen only in
> newly created files? Or did it corrupt existing files?

Actually I didn't reproduce the data corruption; I only reproduced the
hang while making the ext4 filesystem. Alex, could you check it?

> > was a missing check for a failed memory allocation, causing data to be
> > written to the wrong location, a missing error check leading to the
> > block or file system layer not noticing that a write had failed
> > (although again, memory exhaustion should not lead to failed writes;
> > it might slow us down, sure, but if writes are being failed, something
> > is Badly Going Wrong --- things like writes to the swap device or
> > writes by the page cleaner must succeed, or else Things Would Go Bad
> > In A Hurry).
>
> Mikulas
* Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master
From: Samuel Mendoza-Jonas @ 2021-07-09 20:45 UTC
To: Changheun Lee
Cc: alex_y_xu, gmazyland, bvanassche, tytso, axboe, bgoncalv, dm-crypt, hch,
    jaegeuk, linux-block, linux-ext4, linux-kernel, linux-nvme, ming.lei,
    yi.zhang, dm-devel

On Fri, May 14, 2021 at 07:26:14PM +0900, Changheun Lee wrote:
> > On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> > >>
> > >> The problem might be caused by memory exhaustion, and the memory
> > >> exhaustion would be caused by setting a small bio_max_size. Actually it
> > >> was not reproduced in my VM environment at first, but I reproduced the
> > >> same problem when bio_max_size was forced to 8KB. Too many bio
> > >> allocations would occur with an 8KB bio_max_size.
> > >
> > > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > > the bug report. If we were limited by memory, that should slow down
> > > the I/O, but we should still be making forward progress, no? And a
> > > forced reboot should not result in data corruption, unless maybe there
> >
> > If you use data=writeback, data writes and journal writes are not
> > synchronized. So, it may be possible that a journal write made it
> > through, a data write didn't - the end result would be a file containing
> > random contents that was on the disk.
> >
> > Changheun - do you use data=writeback? Did the corruption happen only in
> > newly created files? Or did it corrupt existing files?
>
> Actually I didn't reproduce the data corruption; I only reproduced the
> hang while making the ext4 filesystem. Alex, could you check it?
>
> > > was a missing check for a failed memory allocation, causing data to be
> > > written to the wrong location, a missing error check leading to the
> > > block or file system layer not noticing that a write had failed
> > > (although again, memory exhaustion should not lead to failed writes;
> > > it might slow us down, sure, but if writes are being failed, something
> > > is Badly Going Wrong --- things like writes to the swap device or
> > > writes by the page cleaner must succeed, or else Things Would Go Bad
> > > In A Hurry).
> >
> > Mikulas

I've recently been debugging an issue that isn't this exact issue (it
occurs in 5.10), but looks somewhat similar. On a host that

- Is running a kernel 5.4 <= x <= 5.10.47 at least
- Using an EXT4 + LUKS partition
- Running Elasticsearch stress tests

We see that the index files used by the Elasticsearch process become
corrupt after some time, and in each case I've seen so far the content
of the file looks like the EXT4 extent header.

#define EXT4_EXT_MAGIC          cpu_to_le16(0xf30a)

For example:

$ hexdump -C /hdd1/nodes/0/indices/c6eSGDlCRjaWeIBwdeo9DQ/0/index/_23c.si
00000000  0a f3 04 00 54 01 00 00  00 00 00 00 00 00 00 00  |....T...........|
00000010  00 38 00 00 00 60 46 05  00 38 00 00 00 88 00 00  |.8...`F..8......|
00000020  00 98 46 05 00 40 00 00  00 88 00 00 00 a0 46 05  |..F..@........F.|
00000030  00 48 00 00 00 88 00 00  00 a8 46 05 00 48 00 00  |.H........F..H..|
00000040  00 88 00 00 00 a8 46 05  00 48 00 00 00 88 00 00  |......F..H......|
00000050  00 a8 46 05 00 48 00 00  00 88 00 00 00 a8 46 05  |..F..H........F.|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001a0  00 00                                             |..|
000001a2

I'm working on tracing exactly when this happens, but I'd be interested
to hear if that sounds familiar or might have a similar underlying cause
beyond the commit that was reverted above.

Cheers,
Sam Mendoza-Jonas
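The leading bytes "0a f3" in the dump are the little-endian encoding of
0xf30a, the EXT4_EXT_MAGIC value quoted above. A quick way to check other
suspect files for the same signature (the file name is a placeholder):

  $ xxd -p -l 2 suspect_file    # prints "0af3" if the file starts with the
                                # on-disk ext4 extent header magic 0xf30a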