* [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
@ 2020-08-22  8:22 Ye Bin
  2020-08-22  8:22 ` [PATCH 1/2] ext4: Add comment to BUFFER_FLAGS_DISCARD for search code Ye Bin
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ye Bin @ 2020-08-22  8:22 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, yebin10

Ye Bin (2):
  ext4: Add comment to BUFFER_FLAGS_DISCARD for search code
  jbd2: Fix race between do_invalidatepage and init_page_buffers

 fs/buffer.c                 | 12 +++++++++++-
 fs/jbd2/journal.c           |  7 +++++++
 include/linux/buffer_head.h |  2 ++
 3 files changed, 20 insertions(+), 1 deletion(-)

-- 
2.25.4



* [PATCH 1/2] ext4: Add comment to BUFFER_FLAGS_DISCARD for search code
  2020-08-22  8:22 [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers Ye Bin
@ 2020-08-22  8:22 ` Ye Bin
  2020-08-22  8:22 ` [PATCH 2/2] jbd2: Fix race between do_invalidatepage and init_page_buffers Ye Bin
  2020-08-24 15:51 ` [PATCH 0/2] " Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Ye Bin @ 2020-08-22  8:22 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, yebin10

While analyzing the problem, we found that discard_buffer() implicitly
clears several bits, which confused us for a while. List the cleared
bits in the comment so that they are not missed when searching the code.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/buffer.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c1501a3c5ebe..d05b94cc48c0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1496,7 +1496,13 @@ EXPORT_SYMBOL(set_bh_page);
  * Called when truncating a buffer on a page completely.
  */
 
-/* Bits that are cleared during an invalidate */
+/* Bits that are cleared during an invalidate
+ * clear_buffer_mapped
+ * clear_buffer_req
+ * clear_buffer_new
+ * clear_buffer_delay
+ * clear_buffer_unwritten
+*/
 #define BUFFER_FLAGS_DISCARD \
 	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
 	 1 << BH_Delay | 1 << BH_Unwritten)
-- 
2.25.4



* [PATCH 2/2] jbd2: Fix race between do_invalidatepage and init_page_buffers
  2020-08-22  8:22 [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers Ye Bin
  2020-08-22  8:22 ` [PATCH 1/2] ext4: Add comment to BUFFER_FLAGS_DISCARD for search code Ye Bin
@ 2020-08-22  8:22 ` Ye Bin
  2020-08-24 15:51 ` [PATCH 0/2] " Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Ye Bin @ 2020-08-22  8:22 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, yebin10

We got the following exception when testing lvreduce:
[ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 7986.697197] PGD 0 P4D 0
[ 7986.699724] Oops: 0002 [#1] SMP PTI
[ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
[ 7986.729876] Code: e8 83 75 ac da e9 5e ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 f0 48 0f ba 2f 18 72 18 48 8b 07 a9 00 00 02 00 74 1c 48 8b 47 40 <83> 40 08 01 f0 80 67 03 fe c3 f3 90 48 8b 07 a9 00 00 00 01 75 f4
[ 7986.748557] RSP: 0018:ffffaa8ca198fcd0 EFLAGS: 00010206
[ 7986.753761] RAX: 0000000000000000 RBX: ffff96f4ebde2960 RCX: dead000000000200
[ 7986.760864] RDX: ffff96f4f3338870 RSI: ffff96f5311c6f00 RDI: ffff96f4f0e6ee38
[ 7986.767967] RBP: ffff96f5311c6f00 R08: ffff97247bb01d68 R09: ffff96f4e92cb210
[ 7986.775069] R10: 0000000000000000 R11: 0000000000000228 R12: ffff96f4f0e6ee38
[ 7986.782171] R13: ffff96f4ebde2960 R14: ffff9724b8cce3a8 R15: ffff96f5311c6f00
[ 7986.789274] FS:  0000000000000000(0000) GS:ffff96f53f700000(0000) knlGS:0000000000000000
[ 7986.797328] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7986.803049] CR2: 0000000000000008 CR3: 0000001c9260a005 CR4: 00000000001606e0
[ 7986.810150] Call Trace:
[ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
[ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
[ 7986.824480]  ? __switch_to_asm+0x41/0x70
[ 7986.828386]  ? __switch_to_asm+0x35/0x70
[ 7986.832295]  ? kjournald2+0xbd/0x270 [jbd2]
[ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]
[ 7986.840462]  ? finish_wait+0x80/0x80
[ 7986.844027]  ? commit_timeout+0x10/0x10 [jbd2]
[ 7986.848452]  kthread+0x10d/0x130
[ 7986.851671]  ? kthread_flush_work_fn+0x10/0x10
[ 7986.855973] md/raid:mdX: device dm-188 operational as raid disk 0
[ 7986.856100]  ret_from_fork+0x35/0x40
[ 7986.862169] md/raid:mdX: device dm-215 operational as raid disk 1
[ 7986.865732] Modules linked in:
[ 7986.871802] md/raid:mdX: device dm-128 operational as raid disk 2
[ 7986.871804]  dm_snapshot
[ 7986.875270] md/raid:mdX: raid level 5 active with 3 out of 3 devices, algorithm 2

Another exception:
[ 4167.542166] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 4167.549967] PGD 8000002fa3a7d067 P4D 8000002fa3a7d067 PUD 2fb4a03067 PMD 0
[ 4167.549971] Oops: 0002 [#1] SMP PTI
[ 4167.549973] CPU: 40 PID: 109973 Comm: fsstress Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 4167.549976] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 4167.591371] RIP: 0010:jbd2_journal_add_journal_head+0xbf/0x120 [jbd2]
[ 4167.597784] Code: c2 00 00 00 01 75 f3 e9 6a ff ff ff 48 8b 53 10 48 85 d2 74 0b 48 83 7a 18 00 0f 85 74 ff ff ff 0f 0b 48 8b 4b 40 48 8d 53 03 <83> 41 08 01 f0 80 22 fe 48 85 c0 74 0f 48 8b 3d 7d bd 00 00 48 89
[ 4167.616464] RSP: 0018:ffff9a716674fc60 EFLAGS: 00010206
[ 4167.621666] RAX: 0000000000000000 RBX: ffff8d4f5e2568f0 RCX: 0000000000000000
[ 4167.628768] RDX: ffff8d4f5e2568f3 RSI: ffff8d4f5e2568f0 RDI: ffff8d4f5e2568f0
[ 4167.635869] RBP: ffff8d4f5e2568f0 R08: ffffbd14be3f14b4 R09: ffffbd14be3f1480
[ 4167.642973] R10: 0000000000000000 R11: ffff9a716674fb50 R12: ffff8d20364ec038
[ 4167.650075] R13: 0000000000001733 R14: ffff8d4fb6b48800 R15: ffff9a716674fdc8
[ 4167.657179] FS:  00007f35d68e6540(0000) GS:ffff8d203fb80000(0000) knlGS:0000000000000000
[ 4167.665232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4167.670953] CR2: 0000000000000008 CR3: 0000002f6b0ee002 CR4: 00000000001606e0
[ 4167.678057] Call Trace:
[ 4167.680506]  jbd2_journal_get_write_access+0x51/0x80 [jbd2]
[ 4167.686081]  __ext4_journal_get_write_access+0x41/0x80 [ext4]
[ 4167.691818]  ext4_reserve_inode_write+0x8d/0xb0 [ext4]
[ 4167.696948]  ? add_dirent_to_buf+0x10c/0x1c0 [ext4]
[ 4167.701813]  ext4_mark_inode_dirty+0x51/0x1d0 [ext4]
[ 4167.706764]  ? current_time+0x4d/0x90
[ 4167.710420]  add_dirent_to_buf+0x10c/0x1c0 [ext4]
[ 4167.715112]  ext4_add_entry+0x10d/0x330 [ext4]
[ 4167.719546]  ? ext4_mark_iloc_dirty+0x5e/0x80 [ext4]
[ 4167.724498]  ? ext4_orphan_del+0x148/0x270 [ext4]
[ 4167.729190]  ext4_add_nondir+0x2b/0xb0 [ext4]
[ 4167.733537]  ext4_symlink+0x207/0x460 [ext4]
[ 4167.737795]  vfs_symlink+0xe6/0x170
[ 4167.741271]  do_symlinkat+0xdd/0xf0
[ 4167.744753]  do_syscall_64+0x5b/0x1b0
[ 4167.748409]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 4167.753443] RIP: 0033:0x7f35d68183a7

We ran fsstress while doing lvreduce and lvextend. The lvreduce operation
causes do_invalidatepage() to be called during write-back, which clears the
bh's BH_Mapped flag. When this bh is looked up again, init_page_buffers() is
called and resets bh->b_private to NULL. If the bh is still in use by jbd2,
committing the transaction then leads to an oops.

write-back                      	  kjournal
		touch file1(1) --->make page dirty
				start jbd2_journal_commit_transaction N
				          ...
				end jbd2_journal_commit_transaction N
		touch file2(2)
				start jbd2_journal_commit_transaction N + 1 (3)
block_write_full_page
  do_invalidatepage
    block_invalidatepage
      discard_buffer --->clear BH_Mapped(4)
      		touch file3 (5)
		   init_page_buffers --->set bh->b_private = NULL
		   		  jbd2_journal_get_write_access
				    jbd2_journal_add_journal_head(6)
				    --->jh is NULL and trigger oops
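
The sequence above can be replayed in a small userspace model. This is only
a sketch: the toy_* names and flag values are invented for illustration and
are not kernel code, and each helper keeps just the one side effect of its
kernel counterpart that matters for the race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy flags; names mirror BH_Mapped and BH_JBD, values are arbitrary. */
enum {
	TOY_BH_MAPPED = 1u << 0,
	TOY_BH_JBD    = 1u << 1,
};

struct toy_jh {
	int b_jcount;
};

struct toy_bh {
	unsigned int b_state;
	void *b_private;	/* points to a toy_jh while TOY_BH_JBD is set */
};

/* (1)-(3): jbd2 attaches a journal_head to the buffer. */
static void toy_add_journal_head(struct toy_bh *bh, struct toy_jh *jh)
{
	assert(jh != NULL);
	bh->b_state |= TOY_BH_JBD;
	bh->b_private = jh;
}

/* (4): discard_buffer() clears BH_Mapped but leaves BH_JBD untouched. */
static void toy_discard_buffer(struct toy_bh *bh)
{
	bh->b_state &= ~TOY_BH_MAPPED;
}

/* (5): init_page_buffers() reinitialises every unmapped buffer. */
static void toy_init_page_buffers(struct toy_bh *bh)
{
	if (!(bh->b_state & TOY_BH_MAPPED)) {
		bh->b_private = NULL;
		bh->b_state |= TOY_BH_MAPPED;
	}
}

/* (6): true when jbd2 would find BH_JBD set but no journal_head. */
static bool toy_jh_missing(const struct toy_bh *bh)
{
	return (bh->b_state & TOY_BH_JBD) && bh->b_private == NULL;
}

/* Replay the timeline, with or without the invalidation in step (4). */
static bool toy_replay_race(bool invalidate)
{
	struct toy_jh jh = { 0 };
	struct toy_bh bh = { .b_state = TOY_BH_MAPPED, .b_private = NULL };

	toy_add_journal_head(&bh, &jh);
	if (invalidate)
		toy_discard_buffer(&bh);
	toy_init_page_buffers(&bh);
	return toy_jh_missing(&bh);
}
```

In this model toy_replay_race(true) ends with the flag set and b_private
NULL, which is the state that oopses in the traces above, while
toy_replay_race(false) leaves the journal_head attached.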

How to reproduce:
First, add delays and debug output to the kernel:
int block_write_full_page(struct page *page, get_block_t *get_block,
                        struct writeback_control *wbc)
 {
        struct inode * const inode = page->mapping->host;

+       if (page->index == 196609) {
+               printk("start %s\n", __func__);
+               msleep(10000);
+               printk("end %s\n", __func__);
+       }
 ...
}

int block_write_full_page(struct page *page, get_block_t *get_block,
		 struct writeback_control *wbc)
{
	if (page->index >= end_index+1 || !offset) {
+		printk("do_invalidatepage\n");
                do_invalidatepage(page, 0, PAGE_SIZE);
		....
}

void jbd2_journal_commit_transaction(journal_t *journal)
{
...
+	printk("start %s\n", __func__);
+	msleep(30000);
+	printk("end %s\n", __func__);
 	while (commit_transaction->t_buffers) {
		...
	}
...
}
step 1:
Create a large number of empty files to consume all inodes, then delete some
of the files with the highest inode numbers. This makes sure that the next
inodes are allocated in the last block group.
step 2:
touch file1
step 3:
Wait for "end jbd2_journal_commit_transaction" to be printed, then touch file2.
step 4:
Wait for "start jbd2_journal_commit_transaction" to be printed, then run
lvreduce -f -L-128M /dev/vgxx/lvxx
step 5:
Wait for "do_invalidatepage" to be printed, then touch file3.
step 6:
Wait a moment for the oops to trigger.

This kind of operation may not be recommended, but the kernel must not crash
either. Add a b_discard callback to buffer_head that discard_buffer() calls
to decide whether the bh may be discarded. If jbd2 is using the buffer_head,
it is not discarded.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/buffer.c                 | 4 ++++
 fs/jbd2/journal.c           | 7 +++++++
 include/linux/buffer_head.h | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index d05b94cc48c0..1395f7db016e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -928,6 +928,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
 		if (!buffer_mapped(bh)) {
 			bh->b_end_io = NULL;
 			bh->b_private = NULL;
+			bh->b_discard = NULL;
 			bh->b_bdev = bdev;
 			bh->b_blocknr = block;
 			if (uptodate)
@@ -1511,6 +1512,9 @@ static void discard_buffer(struct buffer_head * bh)
 {
 	unsigned long b_state, b_state_old;
 
+	if (bh->b_discard && !bh->b_discard(bh))
+		return;
+
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
 	bh->b_bdev = NULL;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..7e04c8afcac0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2450,6 +2450,11 @@ static void journal_free_journal_head(struct journal_head *jh)
 	kmem_cache_free(jbd2_journal_head_cache, jh);
 }
 
+static bool journal_discard_buffer(struct buffer_head *bh)
+{
+	return !buffer_jbd(bh);
+}
+
 /*
  * A journal_head is attached to a buffer_head whenever JBD has an
  * interest in the buffer.
@@ -2517,6 +2522,7 @@ struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh)
 		new_jh = NULL;		/* We consumed it */
 		set_buffer_jbd(bh);
 		bh->b_private = jh;
+		bh->b_discard = journal_discard_buffer;
 		jh->b_bh = bh;
 		get_bh(bh);
 		BUFFER_TRACE(bh, "added journal_head");
@@ -2559,6 +2565,7 @@ static void __journal_remove_journal_head(struct buffer_head *bh)
 
 	/* Unlink before dropping the lock */
 	bh->b_private = NULL;
+	bh->b_discard = NULL;
 	jh->b_bh = NULL;	/* debug, really */
 	clear_buffer_jbd(bh);
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94378c5..a8dfb84f0a42 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -47,6 +47,7 @@ struct page;
 struct buffer_head;
 struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
+typedef bool (bh_discard_t)(struct buffer_head *bh);
 
 /*
  * Historically, a buffer_head was used to map a single block
@@ -76,6 +77,7 @@ struct buffer_head {
 	spinlock_t b_uptodate_lock;	/* Used by the first bh in a page, to
 					 * serialise IO completion of other
 					 * buffers in the page */
+	bh_discard_t *b_discard;          /* judge buffer could be discarded */
 };
 
 /*
-- 
2.25.4



* Re: [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
  2020-08-22  8:22 [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers Ye Bin
  2020-08-22  8:22 ` [PATCH 1/2] ext4: Add comment to BUFFER_FLAGS_DISCARD for search code Ye Bin
  2020-08-22  8:22 ` [PATCH 2/2] jbd2: Fix race between do_invalidatepage and init_page_buffers Ye Bin
@ 2020-08-24 15:51 ` Jan Kara
  2020-08-25  2:11   ` yebin
  2 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2020-08-24 15:51 UTC (permalink / raw)
  To: Ye Bin; +Cc: jack, tytso, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

Hello,

On Sat 22-08-20 16:22:16, Ye Bin wrote:
> Ye Bin (2):
>   ext4: Add comment to BUFFER_FLAGS_DISCARD for search code
>   jbd2: Fix race between do_invalidatepage and init_page_buffers
> 
>  fs/buffer.c                 | 12 +++++++++++-
>  fs/jbd2/journal.c           |  7 +++++++
>  include/linux/buffer_head.h |  2 ++
>  3 files changed, 20 insertions(+), 1 deletion(-)

Thanks for the good description of the problem and the analysis. I could
now easily understand what was really happening on your system. I think the
problem should be fixed differently though - it is a problem of
block_write_full_page() that it invalidates buffers while JBD2 is working
with them. Attached patch should also fix the problem. Can you please test
whether it fixes your testcase as well? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-fs-Don-t-invalidate-page-buffers-in-block_write_full.patch --]
[-- Type: text/x-patch, Size: 3891 bytes --]

From 3b568c008a995d3d24ea9e5ed4315a96deb0e598 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 24 Aug 2020 17:07:40 +0200
Subject: [PATCH] fs: Don't invalidate page buffers in block_write_full_page()

If block_write_full_page() is called for a page that is beyond current
inode size, it will truncate page buffers for the page and return 0.
This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3
BUG due to race with truncate") in history.git tree to fix a problem
with ext3 in data=ordered mode. This particular problem doesn't exist
anymore because ext3 is long gone and ext4 handles ordered data
differently. Also normally buffers are invalidated by truncate code and
there's no need to specially handle this in ->writepage() code.

This invalidation of page buffers in block_write_full_page() is causing
issues to filesystems (e.g. ext4 or ocfs2) when block device is shrunk
under filesystem's hands and metadata buffers get discarded while being
tracked by the journalling layer. Although it is obviously "not
supported" it can cause kernel crashes like:

[ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 7986.697197] PGD 0 P4D 0
[ 7986.699724] Oops: 0002 [#1] SMP PTI
[ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
[ 7986.810150] Call Trace:
[ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
[ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
[ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]

which is not great. The crash happens because bh->b_private is suddenly
NULL although BH_JBD flag is still set (this is because
block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup
found buffer without BH_Mapped set, called init_page_buffers() which has
rewritten bh->b_private). So just remove the invalidation in
block_write_full_page().

Note that the buffer cache invalidation when block device changes size
is already careful to avoid similar problems by using
invalidate_mapping_pages() which skips busy buffers so it was only this
odd block_write_full_page() behavior that could tear down bdev buffers
under filesystem's hands.

Reported-by: Ye Bin <yebin10@huawei.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 061dd202979d..163c2c0b9aa3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2771,16 +2771,6 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
 		return 0; /* don't care */
 	}
@@ -2975,12 +2965,6 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0, PAGE_SIZE);
 		unlock_page(page);
 		return 0; /* don't care */
 	}
-- 
2.16.4



* Re: [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
  2020-08-24 15:51 ` [PATCH 0/2] " Jan Kara
@ 2020-08-25  2:11   ` yebin
  2020-08-25  8:41     ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: yebin @ 2020-08-25  2:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: jack, tytso, linux-ext4

Your patch certainly can fix the problem with my testcases, but I don't
think it's a good way. There are other paths that can call
do_invalidatepage, for instance block ioctl to discard and zero_range.

On 2020/8/24 23:51, Jan Kara wrote:
> Hello,
>
> On Sat 22-08-20 16:22:16, Ye Bin wrote:
>> Ye Bin (2):
>>    ext4: Add comment to BUFFER_FLAGS_DISCARD for search code
>>    jbd2: Fix race between do_invalidatepage and init_page_buffers
>>
>>   fs/buffer.c                 | 12 +++++++++++-
>>   fs/jbd2/journal.c           |  7 +++++++
>>   include/linux/buffer_head.h |  2 ++
>>   3 files changed, 20 insertions(+), 1 deletion(-)
> Thanks for the good description of the problem and the analysis. I could
> now easily understand what was really happening on your system. I think the
> problem should be fixed differently through - it is a problem of
> block_write_full_page() that it invalidates buffers while JBD2 is working
> with them. Attached patch should also fix the problem. Can you please test
> whether it fixes your testcase as well? Thanks!
>
> 								Honza




* Re: [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
  2020-08-25  2:11   ` yebin
@ 2020-08-25  8:41     ` Jan Kara
  2020-11-20  3:36       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2020-08-25  8:41 UTC (permalink / raw)
  To: yebin; +Cc: Jan Kara, jack, tytso, linux-ext4

On Tue 25-08-20 10:11:29, yebin wrote:
> Your patch certainly can fix the problem with my testcases, but I don't
> think it's a good way. There are other paths that can call
> do_invalidatepage , for instance block ioctl to discard and zero_range.

OK, good point! So my patch is a cleanup that stands on its own and we
should do it regardless. But I agree we need more to completely fix this.
I don't quite like the callback you've added just for this special case
(furthermore it grows size of every buffer_head and there can be lots of
those). But I agree with the general idea that we shouldn't discard buffers
that the filesystem is working with.

In fact I believe that fallocate(2) and zeroout/discard ioctls should
return EBUSY if they are run against a mounted device because with 99%
probability something went wrong and you're accidentally discarding the
wrong device. But maybe I'm wrong. I'll run this idea through other fs
developers.
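
From userspace, such a refusal would show up as an error code on the discard
ioctl. Below is a minimal sketch of a caller that singles out that case.
try_discard() and discard_refused_busy() are hypothetical helpers, and EBUSY
for a mounted device is only the behaviour proposed above, which this sketch
does not verify. BLKDISCARD is the existing block ioctl; it takes a
{start, length} pair in bytes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKDISCARD */

/*
 * Hypothetical helper: issue BLKDISCARD on an already opened block device.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_discard(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };

	if (ioctl(fd, BLKDISCARD, range) == 0)
		return 0;
	return -errno;
}

/* True when the result means "device is in use", as proposed above. */
static bool discard_refused_busy(int rc)
{
	return rc == -EBUSY;
}
```

A caller could treat discard_refused_busy(rc) as "this is probably the wrong
device" and stop instead of retrying.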

								Honza

> On 2020/8/24 23:51, Jan Kara wrote:
> > On Sat 22-08-20 16:22:16, Ye Bin wrote:
> > > Ye Bin (2):
> > >    ext4: Add comment to BUFFER_FLAGS_DISCARD for search code
> > >    jbd2: Fix race between do_invalidatepage and init_page_buffers
> > > 
> > >   fs/buffer.c                 | 12 +++++++++++-
> > >   fs/jbd2/journal.c           |  7 +++++++
> > >   include/linux/buffer_head.h |  2 ++
> > >   3 files changed, 20 insertions(+), 1 deletion(-)
> > Thanks for the good description of the problem and the analysis. I could
> > now easily understand what was really happening on your system. I think the
> > problem should be fixed differently through - it is a problem of
> > block_write_full_page() that it invalidates buffers while JBD2 is working
> > with them. Attached patch should also fix the problem. Can you please test
> > whether it fixes your testcase as well? Thanks!
> > 
> > 								Honza
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
  2020-08-25  8:41     ` Jan Kara
@ 2020-11-20  3:36       ` Theodore Y. Ts'o
  2020-11-23 16:54         ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Y. Ts'o @ 2020-11-20  3:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: yebin, jack, linux-ext4

On Tue, Aug 25, 2020 at 10:41:37AM +0200, Jan Kara wrote:
> On Tue 25-08-20 10:11:29, yebin wrote:
> > Your patch certainly can fix the problem with my testcases, but I don't
> > think it's a good way. There are other paths that can call
> > do_invalidatepage , for instance block ioctl to discard and zero_range.
> 
> OK, good point! So my patch is a cleanup that stands on its own and we
> should do it regardless. But I agree we need more to completely fix this.
> I don't quite like the callback you've added just for this special case
> (furthermore it grows size of every buffer_head and there can be lots of
> those). But I agree with the general idea that we shouldn't discard buffers
> that the filesystem is working with.
> 
> In fact I believe that fallocate(2) and zeroout/discard ioctls should
> return EBUSY if they are run against a mounted device because with 99%
> probability something went wrong and you're accidentally discarding the
> wrong device. But maybe I'm wrong. I'll run this idea through other fs
> developers.

I'm going through old patches, and I'm trying to figure out where did
we end up on this issue?   Did we come to a conclusion on this?

One other thing which I noticed when looking at the original patch was
shouldn't lvreduce not be allowed to run on a LV which has a mounted
file system on its block device?

					- Ted


* Re: [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers
  2020-11-20  3:36       ` Theodore Y. Ts'o
@ 2020-11-23 16:54         ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2020-11-23 16:54 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, yebin, jack, linux-ext4

On Thu 19-11-20 22:36:00, Theodore Y. Ts'o wrote:
> On Tue, Aug 25, 2020 at 10:41:37AM +0200, Jan Kara wrote:
> > On Tue 25-08-20 10:11:29, yebin wrote:
> > > Your patch certainly can fix the problem with my testcases, but I don't
> > > think it's a good way. There are other paths that can call
> > > do_invalidatepage , for instance block ioctl to discard and zero_range.
> > 
> > OK, good point! So my patch is a cleanup that stands on its own and we
> > should do it regardless. But I agree we need more to completely fix this.
> > I don't quite like the callback you've added just for this special case
> > (furthermore it grows size of every buffer_head and there can be lots of
> > those). But I agree with the general idea that we shouldn't discard buffers
> > that the filesystem is working with.
> > 
> > In fact I believe that fallocate(2) and zeroout/discard ioctls should
> > return EBUSY if they are run against a mounted device because with 99%
> > probability something went wrong and you're accidentally discarding the
> > wrong device. But maybe I'm wrong. I'll run this idea through other fs
> > developers.
> 
> I'm going through old patches, and I'm trying to figure out where did
> we end up on this issue?   Did we come to a conclusion on this?

Yes, it is fixed by 384d87ef2c95 ("block: Do not discard buffers under a
mounted filesystem"). Also the block_write_full_page() got fixed up by
6dbf7bb555981 ("fs: Don't invalidate page buffers in
block_write_full_page()"). So we should be all set.

> One other thing which I noticed when looking at the original patch was
> shouldn't lvreduce not be allowed to run on a LV which has a mounted
> file system on its block device?

No, that is IMO working by design. The expectation is you can online-shrink
the fs and then lvreduce the device...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2020-11-23 16:54 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-22  8:22 [PATCH 0/2] Fix race between do_invalidatepage and init_page_buffers Ye Bin
2020-08-22  8:22 ` [PATCH 1/2] ext4: Add comment to BUFFER_FLAGS_DISCARD for search code Ye Bin
2020-08-22  8:22 ` [PATCH 2/2] jbd2: Fix race between do_invalidatepage and init_page_buffers Ye Bin
2020-08-24 15:51 ` [PATCH 0/2] " Jan Kara
2020-08-25  2:11   ` yebin
2020-08-25  8:41     ` Jan Kara
2020-11-20  3:36       ` Theodore Y. Ts'o
2020-11-23 16:54         ` Jan Kara
