* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
       [not found] ` <0e7b0b6e-e78c-f22d-af8d-d7bdcb597bea@gmail.com>
@ 2021-05-13 19:22   ` Mikulas Patocka
  2021-05-13 21:18     ` Bart Van Assche
  2021-05-14  9:50     ` Mikulas Patocka
  0 siblings, 2 replies; 6+ messages in thread
From: Mikulas Patocka @ 2021-05-13 19:22 UTC (permalink / raw)
  To: Milan Broz, Bart Van Assche, Theodore Ts'o, Changheun Lee
  Cc: axboe, yi.zhang, bgoncalv, dm-crypt, linux-kernel, alex_y_xu,
	ming.lei, linux-block, dm-devel, linux-nvme, jaegeuk, linux-ext4,
	hch



> On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> >>
> >> The problem might be caused by memory exhaustion, and the memory
> >> exhaustion would be caused by setting a small bio_max_size. Actually,
> >> it was not reproduced in my VM environment at first, but I reproduced
> >> the same problem when bio_max_size was forced to 8KB. Setting an 8KB
> >> bio_max_size would cause too many bio allocations.
> > 
> > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > the bug report.  If we were limited by memory, that should slow down
> > the I/O, but we should still be making forward progress, no?  And a
> > forced reboot should not result in data corruption, unless maybe there
> > was a missing check for a failed memory allocation, causing data to be
> > written to the wrong location, a missing error check leading to the
> > block or file system layer not noticing that a write had failed
> > (although again, memory exhaustion should not lead to failed writes;
> > it might slow us down, sure, but if writes are being failed, something
> > is Badly Going Wrong --- things like writes to the swap device or
> > writes by the page cleaner must succeed, or else Things Would Go Bad
> > In A Hurry).
> 
> After the LUKS data corruption issue was reported I decided to take a
> look at the dm-crypt code. In that code I found the following:
> 
> static void clone_init(struct dm_crypt_io *io, struct bio *clone)
> {
> 	struct crypt_config *cc = io->cc;
> 
> 	clone->bi_private = io;
> 	clone->bi_end_io  = crypt_endio;
> 	bio_set_dev(clone, cc->dev->bdev);
> 	clone->bi_opf	  = io->base_bio->bi_opf;
> }
> [ ... ]
> static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
> {
> 	[ ... ]
> 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, &cc->bs);
> 	[ ... ]
> 	clone_init(io, clone);
> 	[ ... ]
> 	for (i = 0; i < nr_iovecs; i++) {
> 		[ ... ]
> 		bio_add_page(clone, page, len, 0);
> 
> 		remaining_size -= len;
> 	}
> 	[ ... ]
> }
> 
> My interpretation is that crypt_alloc_buffer() allocates a clone bio and
> associates it with the underlying device. The input bio may have a size
> of up to UINT_MAX, while the new limit on the size of the cloned bio is
> max_sectors * 512. That causes bio_add_page() to fail if the input bio
> is larger than max_sectors * 512, hence the data corruption. Please note
> that this is a guess only and that I'm not familiar with the dm-crypt
> code.
> 
> Bart.
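
To make this failure mode concrete: bio_add_page() returns the number of
bytes it actually added, so a caller that ignores its return value can
silently end up with a bio shorter than intended. Below is a minimal
sketch, in the style of the loop quoted above, of the kind of check that
would catch a failed page add; the page-allocation call and the error
path are illustrative rather than dm-crypt's exact code:

	for (i = 0; i < nr_iovecs; i++) {
		page = mempool_alloc(&cc->page_pool, gfp_mask);
		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;

		if (bio_add_page(clone, page, len, 0) != len) {
			/* clone is full: undo and fail loudly rather
			 * than submit a short write */
			mempool_free(page, &cc->page_pool);
			bio_put(clone);
			return NULL;
		}

		remaining_size -= len;
	}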

We already had problems with too large bios in dm-crypt and we fixed it by 
adding this piece of code:

        /*
         * Check if bio is too large, split as needed.
         */
        if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
            (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
                dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));

It will ask the device mapper to split the bio if it is too large. So, 
crypt_alloc_buffer can't receive a bio that is larger than BIO_MAX_VECS << 
PAGE_SHIFT.
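
With the common values BIO_MAX_VECS = 256, PAGE_SHIFT = 12 and
SECTOR_SHIFT = 9 (all three are kernel- and architecture-dependent, so
treat them as assumptions), that cap works out to 1 MiB, or 2048
512-byte sectors. A small userspace program to check the arithmetic:

	#include <stdio.h>

	#define BIO_MAX_VECS	256	/* assumed */
	#define PAGE_SHIFT	12	/* assumed: 4 KiB pages */
	#define SECTOR_SHIFT	9	/* 512-byte sectors */

	int main(void)
	{
		unsigned long max_bytes = (unsigned long)BIO_MAX_VECS << PAGE_SHIFT;

		/* prints: split threshold: 1048576 bytes (2048 sectors) */
		printf("split threshold: %lu bytes (%lu sectors)\n",
		       max_bytes, max_bytes >> SECTOR_SHIFT);
		return 0;
	}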

Mikulas


* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
  2021-05-13 19:22   ` [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master Mikulas Patocka
@ 2021-05-13 21:18     ` Bart Van Assche
  2021-05-14  9:43       ` Mikulas Patocka
  2021-05-14  9:50     ` Mikulas Patocka
  1 sibling, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2021-05-13 21:18 UTC (permalink / raw)
  To: Mikulas Patocka, Milan Broz, Theodore Ts'o, Changheun Lee
  Cc: axboe, yi.zhang, bgoncalv, dm-crypt, linux-kernel, alex_y_xu,
	ming.lei, linux-block, dm-devel, linux-nvme, jaegeuk, linux-ext4,
	hch

On 5/13/21 12:22 PM, Mikulas Patocka wrote:
> We already had problems with too large bios in dm-crypt and we fixed it by 
> adding this piece of code:
> 
>         /*
>          * Check if bio is too large, split as needed.
>          */
>         if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
>             (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
>                 dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));
> 
> It will ask the device mapper to split the bio if it is too large. So, 
> crypt_alloc_buffer can't receive a bio that is larger than BIO_MAX_VECS << 
> PAGE_SHIFT.

Hi Mikulas,

Are you perhaps referring to commit 4e870e948fba ("dm crypt: fix error
with too large bios")? Did that commit go upstream before multi-page
bvec support? Can larger bios be supported in case of two or more
contiguous pages now that multi-page bvec support is upstream?

Thanks,

Bart.


* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
  2021-05-13 21:18     ` Bart Van Assche
@ 2021-05-14  9:43       ` Mikulas Patocka
  0 siblings, 0 replies; 6+ messages in thread
From: Mikulas Patocka @ 2021-05-14  9:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, ming.lei, Theodore Ts'o, bgoncalv, dm-crypt, yi.zhang,
	linux-kernel, alex_y_xu, hch, linux-block, dm-devel, linux-nvme,
	Changheun Lee, jaegeuk, linux-ext4, Milan Broz



On Thu, 13 May 2021, Bart Van Assche wrote:

> On 5/13/21 12:22 PM, Mikulas Patocka wrote:
> > We already had problems with too large bios in dm-crypt and we fixed it by 
> > adding this piece of code:
> > 
> >         /*
> >          * Check if bio is too large, split as needed.
> >          */
> >         if (unlikely(bio->bi_iter.bi_size > (BIO_MAX_VECS << PAGE_SHIFT)) &&
> >             (bio_data_dir(bio) == WRITE || cc->on_disk_tag_size))
> >                 dm_accept_partial_bio(bio, ((BIO_MAX_VECS << PAGE_SHIFT) >> SECTOR_SHIFT));
> > 
> > It will ask the device mapper to split the bio if it is too large. So, 
> > crypt_alloc_buffer can't receive a bio that is larger than BIO_MAX_VECS << 
> > PAGE_SHIFT.
> 
> Hi Mikulas,
> 
> Are you perhaps referring to commit 4e870e948fba ("dm crypt: fix error
> with too large bios")? Did that commit go upstream before multi-page
> bvec support?

Yes. It's from 2016.

> Can larger bios be supported in case of two or more
> contiguous pages now that multi-page bvec support is upstream?

No - we need to allocate a buffer for the written data. The buffer size is 
limited to PAGE_SIZE * BIO_MAX_VECS.

> Thanks,
> 
> Bart.

Mikulas


* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
  2021-05-13 19:22   ` [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master Mikulas Patocka
  2021-05-13 21:18     ` Bart Van Assche
@ 2021-05-14  9:50     ` Mikulas Patocka
       [not found]       ` <CGME20210514104426epcas1p3ee2f22f8e18c961118795c356e6a14ae@epcas1p3.samsung.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Mikulas Patocka @ 2021-05-14  9:50 UTC (permalink / raw)
  To: Milan Broz, Bart Van Assche, Theodore Ts'o, Changheun Lee
  Cc: axboe, yi.zhang, bgoncalv, dm-crypt, linux-kernel, alex_y_xu,
	ming.lei, linux-block, dm-devel, linux-nvme, jaegeuk, linux-ext4,
	hch



On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
>>
>> The problem might be caused by memory exhaustion, and the memory
>> exhaustion would be caused by setting a small bio_max_size. Actually,
>> it was not reproduced in my VM environment at first, but I reproduced
>> the same problem when bio_max_size was forced to 8KB. Setting an 8KB
>> bio_max_size would cause too many bio allocations.
> 
> Hmm... I'm not sure how to align your diagnosis with the symptoms in
> the bug report.  If we were limited by memory, that should slow down
> the I/O, but we should still be making forward progress, no?  And a
> forced reboot should not result in data corruption, unless maybe there

If you use data=writeback, data writes and journal writes are not
synchronized. So it is possible that a journal write made it through
while the corresponding data write didn't - the end result would be a
file containing whatever stale data happened to be on disk at that
location.

Changheun - do you use data=writeback? Did the corruption happen only in 
newly created files? Or did it corrupt existing files?

> was a missing check for a failed memory allocation, causing data to be
> written to the wrong location, a missing error check leading to the
> block or file system layer not noticing that a write had failed
> (although again, memory exhaustion should not lead to failed writes;
> it might slow us down, sure, but if writes are being failed, something
> is Badly Going Wrong --- things like writes to the swap device or
> writes by the page cleaner must succeed, or else Things Would Go Bad
> In A Hurry).

Mikulas


* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
       [not found]       ` <CGME20210514104426epcas1p3ee2f22f8e18c961118795c356e6a14ae@epcas1p3.samsung.com>
@ 2021-05-14 10:26         ` Changheun Lee
  2021-07-09 20:45           ` Samuel Mendoza-Jonas
  0 siblings, 1 reply; 6+ messages in thread
From: Changheun Lee @ 2021-05-14 10:26 UTC (permalink / raw)
  To: alex_y_xu
  Cc: axboe, ming.lei, tytso, bvanassche, bgoncalv, dm-crypt,
	linux-kernel, linux-nvme, hch, linux-block, dm-devel, yi.zhang,
	jaegeuk, linux-ext4, gmazyland

> On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> >>
> >> The problem might be caused by memory exhaustion, and the memory
> >> exhaustion would be caused by setting a small bio_max_size. Actually,
> >> it was not reproduced in my VM environment at first, but I reproduced
> >> the same problem when bio_max_size was forced to 8KB. Setting an 8KB
> >> bio_max_size would cause too many bio allocations.
> > 
> > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > the bug report.  If we were limited by memory, that should slow down
> > the I/O, but we should still be making forward progress, no?  And a
> > forced reboot should not result in data corruption, unless maybe there
> 
> If you use data=writeback, data writes and journal writes are not
> synchronized. So it is possible that a journal write made it through
> while the corresponding data write didn't - the end result would be a
> file containing whatever stale data happened to be on disk at that
> location.
> 
> Changheun - do you use data=writeback? Did the corruption happen only in 
> newly created files? Or did it corrupt existing files?

Actually, I didn't reproduce the data corruption. I only reproduced a
hang while creating an ext4 filesystem. Alex, could you check it?

> 
> > was a missing check for a failed memory allocation, causing data to be
> > written to the wrong location, a missing error check leading to the
> > block or file system layer not noticing that a write had failed
> > (although again, memory exhaustion should not lead to failed writes;
> > it might slow us down, sure, but if writes are being failed, something
> > is Badly Going Wrong --- things like writes to the swap device or
> > writes by the page cleaner must succeed, or else Things Would Go Bad
> > In A Hurry).
> 
> Mikulas



* Re: [dm-devel] regression: data corruption with ext4 on LUKS on nvme with torvalds master
  2021-05-14 10:26         ` Changheun Lee
@ 2021-07-09 20:45           ` Samuel Mendoza-Jonas
  0 siblings, 0 replies; 6+ messages in thread
From: Samuel Mendoza-Jonas @ 2021-07-09 20:45 UTC (permalink / raw)
  To: Changheun Lee
  Cc: axboe, ming.lei, tytso, bvanassche, bgoncalv, dm-crypt,
	linux-kernel, alex_y_xu, hch, linux-block, dm-devel, linux-nvme,
	yi.zhang, jaegeuk, linux-ext4, gmazyland

On Fri, May 14, 2021 at 07:26:14PM +0900, Changheun Lee wrote:
> > On 5/13/21 7:15 AM, Theodore Ts'o wrote:
> > > On Thu, May 13, 2021 at 06:42:22PM +0900, Changheun Lee wrote:
> > >>
> > >> The problem might be caused by memory exhaustion, and the memory
> > >> exhaustion would be caused by setting a small bio_max_size. Actually,
> > >> it was not reproduced in my VM environment at first, but I reproduced
> > >> the same problem when bio_max_size was forced to 8KB. Setting an 8KB
> > >> bio_max_size would cause too many bio allocations.
> > > 
> > > Hmm... I'm not sure how to align your diagnosis with the symptoms in
> > > the bug report.  If we were limited by memory, that should slow down
> > > the I/O, but we should still be making forward progress, no?  And a
> > > forced reboot should not result in data corruption, unless maybe there
> > 
> > If you use data=writeback, data writes and journal writes are not
> > synchronized. So it is possible that a journal write made it through
> > while the corresponding data write didn't - the end result would be a
> > file containing whatever stale data happened to be on disk at that
> > location.
> > 
> > Changheun - do you use data=writeback? Did the corruption happen only in 
> > newly created files? Or did it corrupt existing files?
> 
> Actually, I didn't reproduce the data corruption. I only reproduced a
> hang while creating an ext4 filesystem. Alex, could you check it?
> 
> > 
> > > was a missing check for a failed memory allocation, causing data to be
> > > written to the wrong location, a missing error check leading to the
> > > block or file system layer not noticing that a write had failed
> > > (although again, memory exhaustion should not lead to failed writes;
> > > it might slow us down, sure, but if writes are being failed, something
> > > is Badly Going Wrong --- things like writes to the swap device or
> > > writes by the page cleaner must succeed, or else Things Would Go Bad
> > > In A Hurry).
> > 
> > Mikulas

I've recently been debugging an issue that isn't this exact issue
(it occurs in 5.10), but looks somewhat similar.
On a host that
- is running a kernel version 5.4 <= x <= 5.10.47 (at least)
- is using an EXT4 + LUKS partition
- is running Elasticsearch stress tests

We see that the index files used by the Elasticsearch process become
corrupt after some time, and in each case I've seen so far the content
of the file looks like the EXT4 extent header. 
	#define EXT4_EXT_MAGIC          cpu_to_le16(0xf30a)

For example:
$ hexdump -C /hdd1/nodes/0/indices/c6eSGDlCRjaWeIBwdeo9DQ/0/index/_23c.si
00000000  0a f3 04 00 54 01 00 00  00 00 00 00 00 00 00 00  |....T...........|
00000010  00 38 00 00 00 60 46 05  00 38 00 00 00 88 00 00  |.8...`F..8......|
00000020  00 98 46 05 00 40 00 00  00 88 00 00 00 a0 46 05  |..F..@........F.|
00000030  00 48 00 00 00 88 00 00  00 a8 46 05 00 48 00 00  |.H........F..H..|
00000040  00 88 00 00 00 a8 46 05  00 48 00 00 00 88 00 00  |......F..H......|
00000050  00 a8 46 05 00 48 00 00  00 88 00 00 00 a8 46 05  |..F..H........F.|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001a0  00 00                                             |..|
000001a2
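
The leading bytes 0a f3 are EXT4_EXT_MAGIC (0xf30a) stored
little-endian, and the fields that follow decode as entries=4, max=340,
depth=0; max=340 matches the capacity of a 4 KiB extent-tree block
((4096 - 12) / 12), which is consistent with a block of extent metadata
having ended up in the file. A minimal sketch for spotting files that
begin with this signature - the struct mirrors ext4's on-disk extent
header, and the program itself is only an illustration:

	#include <stdio.h>
	#include <stdint.h>

	/* on-disk layout of ext4's extent header; all fields are
	 * little-endian on disk */
	struct ext4_extent_header {
		uint16_t eh_magic;	/* 0xf30a */
		uint16_t eh_entries;	/* valid entries in this node */
		uint16_t eh_max;	/* capacity of this node */
		uint16_t eh_depth;	/* 0 for a leaf node */
		uint32_t eh_generation;
	};

	int main(int argc, char **argv)
	{
		struct ext4_extent_header eh;
		FILE *f;

		if (argc != 2)
			return 1;
		f = fopen(argv[1], "rb");
		if (!f || fread(&eh, sizeof(eh), 1, f) != 1)
			return 1;
		/* assumes a little-endian host, matching the on-disk
		 * byte order */
		if (eh.eh_magic == 0xf30a)
			printf("%s: ext4 extent magic (entries=%u max=%u depth=%u)\n",
			       argv[1], eh.eh_entries, eh.eh_max, eh.eh_depth);
		fclose(f);
		return 0;
	}

Run against the file above, this would report entries=4, max=340,
depth=0.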


I'm working on tracing exactly when this happens, but I'd be interested
to hear if that sounds familiar or might have a similar underlying cause
beyond the commit that was reverted above.

Cheers,
Sam Mendoza-Jonas

