[PATCH RFC 0/2] Block and buffer invalidation under a filesystem

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/2] Block and buffer invalidation under a filesystem
@ 2020-08-25 12:05 Jan Kara
  2020-08-25 12:05 ` [PATCH RFC 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
  2020-08-25 12:05 ` [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
  0 siblings, 2 replies; 16+ messages in thread
From: Jan Kara @ 2020-08-25 12:05 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: yebin, Christoph Hellwig, linux-block, Jens Axboe, Jan Kara

Hello,

Recently Ye Bin has reported an ext4 crash which he tracked tracked down to a
problem that several places (block_write_full_page(), fallocate(2) on blkdev,
etc.) can invalidate buffers under a live filesystem - block_invalidatepage()
will clear (among other) BH_Mapped flag and following lookup of the buffer_head
will reinitialize it (init_page_buffers()) which among other things clears
bh->b_private fields which then makes jbd2 crash.

I was thinking how to best fix this. block_write_full_page() is easy to deal
with as the invalidation there is just a relict from the past and we don't need
to invalidate pages there at all (patch 1/2). Other cases are more
questionable. In patch 2/2, I have made fallocate(2) on the block device and
discard ioctls bail with EBUSY if there's filesystem mounted because it seems
very weird and problematic to mess with a block device like that under a
filesystem. What do people think? Is anyone aware of a user that would be
broken by this?

There are also other possibilities of fixing this like making
block_invalidatepage() (or rather new ->invalidatepage callback for the
block device) less aggressive so that it does not discard that much state from
buffer_heads. But details of that are not yet clear to me.

Or other possibilities people see to fix this?

								Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC 1/2] fs: Don't invalidate page buffers in block_write_full_page()
  2020-08-25 12:05 [PATCH RFC 0/2] Block and buffer invalidation under a filesystem Jan Kara
@ 2020-08-25 12:05 ` Jan Kara
  2020-08-25 12:05 ` [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-08-25 12:05 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: yebin, Christoph Hellwig, linux-block, Jens Axboe, Jan Kara, stable

If block_write_full_page() is called for a page that is beyond current
inode size, it will truncate page buffers for the page and return 0.
This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3
BUG due to race with truncate") in history.git tree to fix a problem
with ext3 in data=ordered mode. This particular problem doesn't exist
anymore because ext3 is long gone and ext4 handles ordered data
differently. Also normally buffers are invalidated by truncate code and
there's no need to specially handle this in ->writepage() code.

This invalidation of page buffers in block_write_full_page() is causing
issues to filesystems (e.g. ext4 or ocfs2) when block device is shrunk
under filesystem's hands and metadata buffers get discarded while being
tracked by the journalling layer. Although it is obviously "not
supported" it can cause kernel crashes like:

[ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at
+0000000000000008
[ 7986.697197] PGD 0 P4D 0
[ 7986.699724] Oops: 0002 [#1] SMP PTI
[ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G
+O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
[ 7986.810150] Call Trace:
[ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
[ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
[ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]

which is not great. The crash happens because bh->b_private is suddently
NULL although BH_JBD flag is still set (this is because
block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup
found buffer without BH_Mapped set, called init_page_buffers() which has
rewritten bh->b_private). So just remove the invalidation in
block_write_full_page().

Note that the buffer cache invalidation when block device changes size
is already careful to avoid similar problems by using
invalidate_mapping_pages() which skips busy buffers so it was only this
odd block_write_full_page() behavior that could tear down bdev buffers
under filesystem's hands.

Reported-by: Ye Bin <yebin10@huawei.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 061dd202979d..163c2c0b9aa3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2771,16 +2771,6 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
 		return 0; /* don't care */
 	}
@@ -2975,12 +2965,6 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0, PAGE_SIZE);
 		unlock_page(page);
 		return 0; /* don't care */
 	}
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 12:05 [PATCH RFC 0/2] Block and buffer invalidation under a filesystem Jan Kara
  2020-08-25 12:05 ` [PATCH RFC 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
@ 2020-08-25 12:05 ` Jan Kara
  2020-08-25 12:16   ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-08-25 12:05 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: yebin, Christoph Hellwig, linux-block, Jens Axboe, Jan Kara

Discarding blocks and buffers under a mounted filesystem is hardly
anything admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including b_private field) can
even cause crashes like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
 __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
 jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
 kjournald2+0xbd/0x270 [jbd2]

So refuse fallocate(2) and BLKZEROOUT, BLKDISCARD, BLKSECDISCARD ioctls
for a block device having filesystem mounted.

Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/ioctl.c  | 19 ++++++++++++++++++-
 fs/block_dev.c |  9 +++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb253d9..0e3a46b0ffc8 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -113,7 +113,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	uint64_t start, len;
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
-
+	struct super_block *sb;
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -134,6 +134,14 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 
 	if (start + len > i_size_read(bdev->bd_inode))
 		return -EINVAL;
+	/*
+	 * Don't mess with device with mounted filesystem.
+	 */
+	sb = get_super(bdev);
+	if (sb) {
+		drop_super(sb);
+		return -EBUSY;
+	}
 	truncate_inode_pages_range(mapping, start, start + len - 1);
 	return blkdev_issue_discard(bdev, start >> 9, len >> 9,
 				    GFP_KERNEL, flags);
@@ -145,6 +153,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 	uint64_t range[2];
 	struct address_space *mapping;
 	uint64_t start, end, len;
+	struct super_block *sb;
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -165,6 +174,14 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 	if (end < start)
 		return -EINVAL;
 
+	/*
+	 * Don't mess with device with mounted filesystem.
+	 */
+	sb = get_super(bdev);
+	if (sb) {
+		drop_super(sb);
+		return -EBUSY;
+	}
 	/* Invalidate the page cache, including dirty pages */
 	mapping = bdev->bd_inode->i_mapping;
 	truncate_inode_pages_range(mapping, start, end);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8ae833e00443..5b398eb7c34c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1973,6 +1973,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	loff_t end = start + len - 1;
 	loff_t isize;
 	int error;
+	struct super_block *sb;
 
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
@@ -1996,6 +1997,14 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
+	/*
+	 * Don't mess with device with mounted filesystem.
+	 */
+	sb = get_super(bdev);
+	if (sb) {
+		drop_super(sb);
+		return -EBUSY;
+	}
 	/* Invalidate the page cache, including dirty pages. */
 	mapping = bdev->bd_inode->i_mapping;
 	truncate_inode_pages_range(mapping, start, end);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 12:05 ` [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
@ 2020-08-25 12:16   ` Christoph Hellwig
  2020-08-25 14:10     ` Theodore Y. Ts'o
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Christoph Hellwig @ 2020-08-25 12:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, yebin, Christoph Hellwig, linux-block, Jens Axboe

On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> Discarding blocks and buffers under a mounted filesystem is hardly
> anything admin wants to do. Usually it will confuse the filesystem and
> sometimes the loss of buffer_head state (including b_private field) can
> even cause crashes like:

Doesn't work if the file system uses multiple devices.  I think we
just really need to split the fs buffer_head address space from the
block device one.  Everything else is just going to cause a huge mess.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 12:16   ` Christoph Hellwig
@ 2020-08-25 14:10     ` Theodore Y. Ts'o
  2020-08-25 15:12       ` Jan Kara
  2020-08-25 14:50     ` Jan Kara
  2020-08-28  8:21     ` Andreas Dilger
  2 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-08-25 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, yebin, linux-block, Jens Axboe,
	Mark Fasheh, Joel Becker, Joseph Qi

(Adding the OCFS2 maintainers, since my possibly insane idea proposed
below would definitely impact them!)

On Tue, Aug 25, 2020 at 01:16:16PM +0100, Christoph Hellwig wrote:
> On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > Discarding blocks and buffers under a mounted filesystem is hardly
> > anything admin wants to do. Usually it will confuse the filesystem and
> > sometimes the loss of buffer_head state (including b_private field) can
> > even cause crashes like:
> 
> Doesn't work if the file system uses multiple devices.  I think we
> just really need to split the fs buffer_head address space from the
> block device one.  Everything else is just going to cause a huge mess.

I wonder if we should go a step further, and stop using struct
buffer_head altogether in jbd2 and ext4 (as well as ocfs2).

This would involve moving whatever structure elements from the
buffer_head struct into journal_head, and manage writeback and reads
requests directly in jbd2.  This would allow us to get detailed write
errors back, which is currently not possible from the buffer_head
infrastructure.

The downside is this would be a pretty massive change in terms of LOC,
since we use struct buffer_head in a *huge* number of places.  If
we're careful, most of it could be handled by a Coccinelle script to
rename "struct buffer_head" to "struct journal_head".  Fortunately, we
don't actually use that much of the fs/buffer_head functions in
fs/{ext4,ocfs2}/*.c.

One potentially tricky bit is that ocfs2 hasn't been converted to
using iomap, so it's still using __blockdev_direct_IO.  So it's data
blocks for DIO would still have to use struct buffer_head (which means
the Coccinelle script won't really work for fs/ocfs2, without a lot of
manual rework) --- or ocfs2 would have to switched to use iomap at
least for DIO support.

What do folks think?

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 12:16   ` Christoph Hellwig
  2020-08-25 14:10     ` Theodore Y. Ts'o
@ 2020-08-25 14:50     ` Jan Kara
  2020-08-27  7:16       ` Christoph Hellwig
  2020-08-28  8:21     ` Andreas Dilger
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-08-25 14:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, yebin, linux-block, Jens Axboe

On Tue 25-08-20 13:16:16, Christoph Hellwig wrote:
> On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > Discarding blocks and buffers under a mounted filesystem is hardly
> > anything admin wants to do. Usually it will confuse the filesystem and
> > sometimes the loss of buffer_head state (including b_private field) can
> > even cause crashes like:
> 
> Doesn't work if the file system uses multiple devices.

Hum, right.

> I think we just really need to split the fs buffer_head address space
> from the block device one.  Everything else is just going to cause a huge
> mess.

Do you mean that address_space filesystem uses to access its metadata on
/dev/sda will be different from the address_space you will see when reading
say /dev/sda?  Thus these will be completely separate (and incoherent)
caches? Although this would be simple it will break userspace I'm afraid.
There are situations where tools read e.g. superblock of a mounted
filesystem from the block device and rely on the data to be reasonably
recent. Even worse e.g. tune2fs or e2fsck can *modify* superblock of a
mounted filesystem through the block device (e.g. to set 'fsck after X
mounts' fields and similar).

So we would need to somehow maintain at least vague coherence between these
caches which would be ugly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 14:10     ` Theodore Y. Ts'o
@ 2020-08-25 15:12       ` Jan Kara
  2020-08-25 18:41         ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-08-25 15:12 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, yebin, linux-block,
	Jens Axboe, Mark Fasheh, Joel Becker, Joseph Qi

On Tue 25-08-20 10:10:20, Theodore Y. Ts'o wrote:
> (Adding the OCFS2 maintainers, since my possibly insane idea proposed
> below would definitely impact them!)
> 
> On Tue, Aug 25, 2020 at 01:16:16PM +0100, Christoph Hellwig wrote:
> > On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > > Discarding blocks and buffers under a mounted filesystem is hardly
> > > anything admin wants to do. Usually it will confuse the filesystem and
> > > sometimes the loss of buffer_head state (including b_private field) can
> > > even cause crashes like:
> > 
> > Doesn't work if the file system uses multiple devices.  I think we
> > just really need to split the fs buffer_head address space from the
> > block device one.  Everything else is just going to cause a huge mess.
> 
> I wonder if we should go a step further, and stop using struct
> buffer_head altogether in jbd2 and ext4 (as well as ocfs2).

What about the cache coherency issues I've pointed out in my reply to
Christoph?

> This would involve moving whatever structure elements from the
> buffer_head struct into journal_head, and manage writeback and reads
> requests directly in jbd2.  This would allow us to get detailed write
> errors back, which is currently not possible from the buffer_head
> infrastructure.
> 
> The downside is this would be a pretty massive change in terms of LOC,
> since we use struct buffer_head in a *huge* number of places.  If
> we're careful, most of it could be handled by a Coccinelle script to
> rename "struct buffer_head" to "struct journal_head".  Fortunately, we
> don't actually use that much of the fs/buffer_head functions in
> fs/{ext4,ocfs2}/*.c.
> 
> One potentially tricky bit is that ocfs2 hasn't been converted to
> using iomap, so it's still using __blockdev_direct_IO.  So it's data
> blocks for DIO would still have to use struct buffer_head (which means
> the Coccinelle script won't really work for fs/ocfs2, without a lot of
> manual rework) --- or ocfs2 would have to switched to use iomap at
> least for DIO support.
> 
> What do folks think?

Otherwise yes, this would be doable although pretty invasive as you
mention.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 15:12       ` Jan Kara
@ 2020-08-25 18:41         ` Matthew Wilcox
  2020-08-25 18:49           ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2020-08-25 18:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Y. Ts'o, Christoph Hellwig, linux-fsdevel, yebin,
	linux-block, Jens Axboe, Mark Fasheh, Joel Becker, Joseph Qi

On Tue, Aug 25, 2020 at 05:12:56PM +0200, Jan Kara wrote:
> On Tue 25-08-20 10:10:20, Theodore Y. Ts'o wrote:
> > (Adding the OCFS2 maintainers, since my possibly insane idea proposed
> > below would definitely impact them!)
> > 
> > On Tue, Aug 25, 2020 at 01:16:16PM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > > > Discarding blocks and buffers under a mounted filesystem is hardly
> > > > anything admin wants to do. Usually it will confuse the filesystem and
> > > > sometimes the loss of buffer_head state (including b_private field) can
> > > > even cause crashes like:
> > > 
> > > Doesn't work if the file system uses multiple devices.  I think we
> > > just really need to split the fs buffer_head address space from the
> > > block device one.  Everything else is just going to cause a huge mess.
> > 
> > I wonder if we should go a step further, and stop using struct
> > buffer_head altogether in jbd2 and ext4 (as well as ocfs2).
> 
> What about the cache coherency issues I've pointed out in my reply to
> Christoph?

If journal_heads pointed into the page cache as well, then you'd get
coherency.  These new journal heads would have to be able to cope with
the page cache being modified underneath them, of course.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 18:41         ` Matthew Wilcox
@ 2020-08-25 18:49           ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-08-25 18:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Theodore Y. Ts'o, Christoph Hellwig, linux-fsdevel,
	yebin, linux-block, Jens Axboe, Mark Fasheh, Joel Becker,
	Joseph Qi

On Tue 25-08-20 19:41:07, Matthew Wilcox wrote:
> On Tue, Aug 25, 2020 at 05:12:56PM +0200, Jan Kara wrote:
> > On Tue 25-08-20 10:10:20, Theodore Y. Ts'o wrote:
> > > (Adding the OCFS2 maintainers, since my possibly insane idea proposed
> > > below would definitely impact them!)
> > > 
> > > On Tue, Aug 25, 2020 at 01:16:16PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > > > > Discarding blocks and buffers under a mounted filesystem is hardly
> > > > > anything admin wants to do. Usually it will confuse the filesystem and
> > > > > sometimes the loss of buffer_head state (including b_private field) can
> > > > > even cause crashes like:
> > > > 
> > > > Doesn't work if the file system uses multiple devices.  I think we
> > > > just really need to split the fs buffer_head address space from the
> > > > block device one.  Everything else is just going to cause a huge mess.
> > > 
> > > I wonder if we should go a step further, and stop using struct
> > > buffer_head altogether in jbd2 and ext4 (as well as ocfs2).
> > 
> > What about the cache coherency issues I've pointed out in my reply to
> > Christoph?
> 
> If journal_heads pointed into the page cache as well, then you'd get
> coherency.  These new journal heads would have to be able to cope with
> the page cache being modified underneath them, of course.

I was thinking about this as well but IMHO it would be quite complex - e.g.
we could then have both buffer_heads (for block dev) and journal heads (for
fs) to have different opinions on block state (uptodate, dirty, ...) and
who now determines what the final page state is (e.g. for reclaim
purposes)? Sure we can all sort this out with various callbacks etc. but at
that point we are not really simplifying things anymore...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 14:50     ` Jan Kara
@ 2020-08-27  7:16       ` Christoph Hellwig
  2020-08-27 21:39         ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2020-08-27  7:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, yebin, linux-block, Jens Axboe

On Tue, Aug 25, 2020 at 04:50:56PM +0200, Jan Kara wrote:
> Do you mean that address_space filesystem uses to access its metadata on
> /dev/sda will be different from the address_space you will see when reading
> say /dev/sda?  Thus these will be completely separate (and incoherent)
> caches?

Yes.

> Although this would be simple it will break userspace I'm afraid.
> There are situations where tools read e.g. superblock of a mounted
> filesystem from the block device and rely on the data to be reasonably
> recent. Even worse e.g. tune2fs or e2fsck can *modify* superblock of a
> mounted filesystem through the block device (e.g. to set 'fsck after X
> mounts' fields and similar).

We've not had any problems when XFS stopped using the block device
address space 9.5 years ago.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-27  7:16       ` Christoph Hellwig
@ 2020-08-27 21:39         ` Al Viro
  2020-08-28  0:07           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2020-08-27 21:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, yebin, linux-block, Jens Axboe

On Thu, Aug 27, 2020 at 08:16:03AM +0100, Christoph Hellwig wrote:
> On Tue, Aug 25, 2020 at 04:50:56PM +0200, Jan Kara wrote:
> > Do you mean that address_space filesystem uses to access its metadata on
> > /dev/sda will be different from the address_space you will see when reading
> > say /dev/sda?  Thus these will be completely separate (and incoherent)
> > caches?
> 
> Yes.
> 
> > Although this would be simple it will break userspace I'm afraid.
> > There are situations where tools read e.g. superblock of a mounted
> > filesystem from the block device and rely on the data to be reasonably
> > recent. Even worse e.g. tune2fs or e2fsck can *modify* superblock of a
> > mounted filesystem through the block device (e.g. to set 'fsck after X
> > mounts' fields and similar).
> 
> We've not had any problems when XFS stopped using the block device
> address space 9.5 years ago.

How much writes from fsck use does xfs see, again?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-27 21:39         ` Al Viro
@ 2020-08-28  0:07           ` Dave Chinner
  2020-08-28  8:10             ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2020-08-28  0:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, yebin, linux-block,
	Jens Axboe

On Thu, Aug 27, 2020 at 10:39:00PM +0100, Al Viro wrote:
> On Thu, Aug 27, 2020 at 08:16:03AM +0100, Christoph Hellwig wrote:
> > On Tue, Aug 25, 2020 at 04:50:56PM +0200, Jan Kara wrote:
> > > Do you mean that address_space filesystem uses to access its metadata on
> > > /dev/sda will be different from the address_space you will see when reading
> > > say /dev/sda?  Thus these will be completely separate (and incoherent)
> > > caches?
> > 
> > Yes.
> > 
> > > Although this would be simple it will break userspace I'm afraid.
> > > There are situations where tools read e.g. superblock of a mounted
> > > filesystem from the block device and rely on the data to be reasonably
> > > recent. Even worse e.g. tune2fs or e2fsck can *modify* superblock of a
> > > mounted filesystem through the block device (e.g. to set 'fsck after X
> > > mounts' fields and similar).
> > 
> > We've not had any problems when XFS stopped using the block device
> > address space 9.5 years ago.
> 
> How much writes from fsck use does xfs see, again?

All of them, because xfs_repair uses direct IO and caches what it
needs in userspace.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-28  0:07           ` Dave Chinner
@ 2020-08-28  8:10             ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-08-28  8:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Christoph Hellwig, Jan Kara, linux-fsdevel, yebin,
	linux-block, Jens Axboe

On Fri 28-08-20 10:07:05, Dave Chinner wrote:
> On Thu, Aug 27, 2020 at 10:39:00PM +0100, Al Viro wrote:
> > On Thu, Aug 27, 2020 at 08:16:03AM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 25, 2020 at 04:50:56PM +0200, Jan Kara wrote:
> > > > Do you mean that address_space filesystem uses to access its metadata on
> > > > /dev/sda will be different from the address_space you will see when reading
> > > > say /dev/sda?  Thus these will be completely separate (and incoherent)
> > > > caches?
> > > 
> > > Yes.
> > > 
> > > > Although this would be simple it will break userspace I'm afraid.
> > > > There are situations where tools read e.g. superblock of a mounted
> > > > filesystem from the block device and rely on the data to be reasonably
> > > > recent. Even worse e.g. tune2fs or e2fsck can *modify* superblock of a
> > > > mounted filesystem through the block device (e.g. to set 'fsck after X
> > > > mounts' fields and similar).
> > > 
> > > We've not had any problems when XFS stopped using the block device
> > > address space 9.5 years ago.
> > 
> > How much writes from fsck use does xfs see, again?
> 
> All of them, because xfs_repair uses direct IO and caches what it
> needs in userspace.

But that's the difference. e2fsprogs (which means e2fsck, tune2fs, or
generally libext2fs which is linked against quite a few external programs
as well) use buffered IO so they rely on cache coherency between what the
kernel filesystem driver sees and what userspace sees when opening the
block device. That's why I'm concerned that loosing this coherency is going
to break some ext4 user... I'll talk to Ted what he thinks about this but
so far I don't see how to separate kernel's view of the bdev from userspace
view without breakage.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-25 12:16   ` Christoph Hellwig
  2020-08-25 14:10     ` Theodore Y. Ts'o
  2020-08-25 14:50     ` Jan Kara
@ 2020-08-28  8:21     ` Andreas Dilger
  2020-08-29  6:40       ` Christoph Hellwig
  2 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2020-08-28  8:21 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara; +Cc: linux-fsdevel, yebin, linux-block, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

On Aug 25, 2020, at 6:16 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
>> Discarding blocks and buffers under a mounted filesystem is hardly
>> anything admin wants to do. Usually it will confuse the filesystem and
>> sometimes the loss of buffer_head state (including b_private field) can
>> even cause crashes like:
> 
> Doesn't work if the file system uses multiple devices.

It's not _worse_ than the current situation of allowing the complete
destruction of the mounted filesystem.  It doesn't fix the problem
for XFS with realtime devices, or ext4 with a separate journal device,
but it fixes the problem for a majority of users with a single device
filesystem.

While BLKFLSBUF causing a crash is annoying, BLKDISCARD/BLKSECDISCARD
under a mounted filesystem is definitely dangerous and wrong.

What about checking for O_EXCL on the device, indicating that it is
currently in use by some higher level?

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-28  8:21     ` Andreas Dilger
@ 2020-08-29  6:40       ` Christoph Hellwig
  2020-08-31  7:48         ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2020-08-29  6:40 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, yebin, linux-block,
	Jens Axboe

On Fri, Aug 28, 2020 at 02:21:29AM -0600, Andreas Dilger wrote:
> On Aug 25, 2020, at 6:16 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> >> Discarding blocks and buffers under a mounted filesystem is hardly
> >> anything admin wants to do. Usually it will confuse the filesystem and
> >> sometimes the loss of buffer_head state (including b_private field) can
> >> even cause crashes like:
> > 
> > Doesn't work if the file system uses multiple devices.
> 
> It's not _worse_ than the current situation of allowing the complete
> destruction of the mounted filesystem.  It doesn't fix the problem
> for XFS with realtime devices, or ext4 with a separate journal device,
> but it fixes the problem for a majority of users with a single device
> filesystem.
> 
> While BLKFLSBUF causing a crash is annoying, BLKDISCARD/BLKSECDISCARD
> under a mounted filesystem is definitely dangerous and wrong.
> 
> What about checking for O_EXCL on the device, indicating that it is
> currently in use by some higher level?

That actually seems like a much better idea.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem
  2020-08-29  6:40       ` Christoph Hellwig
@ 2020-08-31  7:48         ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-08-31  7:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Jan Kara, linux-fsdevel, yebin, linux-block, Jens Axboe

On Sat 29-08-20 07:40:41, Christoph Hellwig wrote:
> On Fri, Aug 28, 2020 at 02:21:29AM -0600, Andreas Dilger wrote:
> > On Aug 25, 2020, at 6:16 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > On Tue, Aug 25, 2020 at 02:05:54PM +0200, Jan Kara wrote:
> > >> Discarding blocks and buffers under a mounted filesystem is hardly
> > >> anything admin wants to do. Usually it will confuse the filesystem and
> > >> sometimes the loss of buffer_head state (including b_private field) can
> > >> even cause crashes like:
> > > 
> > > Doesn't work if the file system uses multiple devices.
> > 
> > It's not _worse_ than the current situation of allowing the complete
> > destruction of the mounted filesystem.  It doesn't fix the problem
> > for XFS with realtime devices, or ext4 with a separate journal device,
> > but it fixes the problem for a majority of users with a single device
> > filesystem.
> > 
> > While BLKFLSBUF causing a crash is annoying, BLKDISCARD/BLKSECDISCARD
> > under a mounted filesystem is definitely dangerous and wrong.

Actually BLKFLSBUF won't cause a crash. That's using invalidate_bdev() -
i.e., page-reclaim style of eviction and that's fine with the filesystems.

> > What about checking for O_EXCL on the device, indicating that it is
> > currently in use by some higher level?
> 
> That actually seems like a much better idea.

OK, I'll rework the patch.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-08-31  7:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-25 12:05 [PATCH RFC 0/2] Block and buffer invalidation under a filesystem Jan Kara
2020-08-25 12:05 ` [PATCH RFC 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
2020-08-25 12:05 ` [PATCH RFC 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
2020-08-25 12:16   ` Christoph Hellwig
2020-08-25 14:10     ` Theodore Y. Ts'o
2020-08-25 15:12       ` Jan Kara
2020-08-25 18:41         ` Matthew Wilcox
2020-08-25 18:49           ` Jan Kara
2020-08-25 14:50     ` Jan Kara
2020-08-27  7:16       ` Christoph Hellwig
2020-08-27 21:39         ` Al Viro
2020-08-28  0:07           ` Dave Chinner
2020-08-28  8:10             ` Jan Kara
2020-08-28  8:21     ` Andreas Dilger
2020-08-29  6:40       ` Christoph Hellwig
2020-08-31  7:48         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).