All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] block: handle BLK_OPEN_RESTRICT_WRITES correctly
@ 2024-03-23 14:54 Christian Brauner
  2024-03-23 15:59 ` Christian Brauner
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-23 14:54 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Jens Axboe
  Cc: Christian Brauner, Matthew Wilcox, linux-block

Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.

But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.

A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

So fix the detection logic. When we open a block device with
BLK_OPEN_RESTRICT_WRITES we know that a holder must've been specified as
we forbid BLK_OPEN_RESTRICT_WRITES without a holder in
bdev_permission(). The holder will be stashed in
bdev_file->private_data. So we check whether bdev_file->private_data is
set and whether writes are blocked. If so, we unblock writes. Otherwise,
if the bdev_file was opened writable we just decrement the write count.

Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
directly be raised by userspace. It is implicitly raised during
mounting. So any concurrent writable request from userspace will fail
and so recognition based on bdev_file->private_data is safe.

Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.

Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reported-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
@Jan, I have one question for you. Currently your original changes in
v6.8 do allow for a block device to be reopened with
BLK_OPEN_RESTRICT_WRITES provided the same holder is used as per
bdev_may_open(). I think that may have a bug.

The first opener @f1 of that block device will set bdev->bd_writers to
-1. The second opener @f2 using the same holder will pass the check in
bdev_may_open() that bdev->bd_writers must not be greater than zero.

The first opener @f1 now closes the block device and in bdev_release()
will end up calling bdev_yield_write_access() which calls
bdev_writes_blocked() and sets bdev->bd_writers to 0 again.

Now @f2 holds a file to that block device which was opened with
exclusive write access but bdev->bd_writers has been reset to 0.

So now @f3 comes along and succeeds in opening the block device with
BLK_OPEN_WRITE betraying @f2's request to have exclusive write access.

This isn't a practical issue yet because afaict there's no codepath
inside the kernel that reopenes the same block device with
BLK_OPEN_RESTRICT_WRITES but it will be if there is.

If that's right then either this needs to be fixed by counting the
number of BLK_OPEN_RESTRICT_WRITES a la [1] or we simply enforce that
BLK_OPEN_RESTRICT_WRITES means that there's only one opener a la [2].

[1]: count BLK_OPEN_RESTRICT_WRITES openers

     diff --git a/block/bdev.c b/block/bdev.c
     index 4b28a39fafc4..ea3f219310f5 100644
     --- a/block/bdev.c
     +++ b/block/bdev.c
     @@ -776,17 +776,17 @@ void blkdev_put_no_open(struct block_device *bdev)

      static bool bdev_writes_blocked(struct block_device *bdev)
      {
     -       return bdev->bd_writers == -1;
     +       return bdev->bd_writers < 0;
      }

      static void bdev_block_writes(struct block_device *bdev)
      {
     -       bdev->bd_writers = -1;
     +       bdev->bd_writers--;
      }

      static void bdev_unblock_writes(struct block_device *bdev)
      {
     -       bdev->bd_writers = 0;
     +       bdev->bd_writers++;
      }

 static bool bdev_may_open(struct block_device *bdev, blk_mode_t mode)

[2]: enforce that BLK_OPEN_RESTRICT_WRITES is unique

     diff --git a/block/bdev.c b/block/bdev.c
     index e9f1b12bd75c..cefb94a75530 100644
     --- a/block/bdev.c
     +++ b/block/bdev.c
     @@ -758,7 +758,7 @@ static bool bdev_may_open(struct block_device *bdev, blk_mode_t mode)
             /* Writes blocked? */
             if (mode & BLK_OPEN_WRITE && bdev_writes_blocked(bdev))
                     return false;
     -       if (mode & BLK_OPEN_RESTRICT_WRITES && bdev->bd_writers > 0)
     +       if (mode & BLK_OPEN_RESTRICT_WRITES && bdev->bd_writers != 0)
                     return false;
             return true;
      }

But maybe I'm just not reading this right. Let me know.

Christian
---
 block/bdev.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 7a5f611c3d2e..4b28a39fafc4 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,13 +821,14 @@ static void bdev_yield_write_access(struct file *bdev_file)
 		return;
 
 	bdev = file_bdev(bdev_file);
-	/* Yield exclusive or shared write access. */
-	if (bdev_file->f_mode & FMODE_WRITE) {
-		if (bdev_writes_blocked(bdev))
-			bdev_unblock_writes(bdev);
-		else
-			bdev->bd_writers--;
-	}
+	/*
+	 * If this was an exclusive open and writes are blocked
+	 * we know that we're the ones who blocked them.
+	 */
+	if (bdev_file->private_data && bdev_writes_blocked(bdev))
+		bdev_unblock_writes(bdev);
+	else if (bdev_file->f_mode & FMODE_WRITE)
+		bdev->bd_writers--;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 14:54 [PATCH] block: handle BLK_OPEN_RESTRICT_WRITES correctly Christian Brauner
@ 2024-03-23 15:59 ` Christian Brauner
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
  1 sibling, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-23 15:59 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Jens Axboe; +Cc: Matthew Wilcox, linux-block

> +	/*
> +	 * If this was an exclusive open and writes are blocked
> +	 * we know that we're the ones who blocked them.
> +	 */
> +	if (bdev_file->private_data && bdev_writes_blocked(bdev))

This doesn't work because this will unblock BLK_OPEN_RESTRICT_WRITES
when the block device has been opened read-only but exclusively, i.e.,
when e.g., userspace requested read-only access and we use the file as
the holder. I have an alternative fix.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 14:54 [PATCH] block: handle BLK_OPEN_RESTRICT_WRITES correctly Christian Brauner
  2024-03-23 15:59 ` Christian Brauner
@ 2024-03-23 16:11 ` Christian Brauner
  2024-03-23 16:11   ` [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers Christian Brauner
                     ` (4 more replies)
  1 sibling, 5 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-23 16:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Jens Axboe
  Cc: Christian Brauner, Matthew Wilcox, linux-block

Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.

But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.

A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

So fix the detection logic. Use O_EXCL as an indicator that
BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
for pidfds where O_EXCL means that this is a pidfd that refers to a
thread. For userspace open paths O_EXCL will never be retained but for
internal opens where we open files that are never installed into a file
descriptor table this is fine.

Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
directly be raised by userspace. It is implicitly raised during
mounting.

Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.

Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reported-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 7a5f611c3d2e..f819f3086905 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
 		return;
 
 	bdev = file_bdev(bdev_file);
-	/* Yield exclusive or shared write access. */
-	if (bdev_file->f_mode & FMODE_WRITE) {
-		if (bdev_writes_blocked(bdev))
-			bdev_unblock_writes(bdev);
-		else
-			bdev->bd_writers--;
-	}
+
+	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
+	if (bdev_file->f_flags & O_EXCL)
+		bdev_unblock_writes(bdev);
+	else if (bdev_file->f_mode & FMODE_WRITE)
+		bdev->bd_writers--;
 }
 
 /**
@@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
 	else
 		WARN_ON_ONCE(true);
 
+	/*
+	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
+	 * O_EXCL is stripped from userspace.
+	 */
+	if (mode & BLK_OPEN_RESTRICT_WRITES)
+		flags |= O_EXCL;
+
 	if (mode & BLK_OPEN_NDELAY)
 		flags |= O_NDELAY;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
@ 2024-03-23 16:11   ` Christian Brauner
  2024-03-26 13:24     ` Jan Kara
  2024-03-25 11:51   ` [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly Yu Kuai
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-03-23 16:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, Jens Axboe
  Cc: Christian Brauner, Matthew Wilcox, linux-block

The original changes in v6.8 do allow for a block device to be reopened
with BLK_OPEN_RESTRICT_WRITES provided the same holder is used as per
bdev_may_open(). I think that may have a bug.

The first opener @f1 of that block device will set bdev->bd_writers to
-1. The second opener @f2 using the same holder will pass the check in
bdev_may_open() that bdev->bd_writers must not be greater than zero.

The first opener @f1 now closes the block device and in bdev_release()
will end up calling bdev_yield_write_access() which calls
bdev_writes_blocked() and sets bdev->bd_writers to 0 again.

Now @f2 holds a file to that block device which was opened with
exclusive write access but bdev->bd_writers has been reset to 0.

So now @f3 comes along and succeeds in opening the block device with
BLK_OPEN_WRITE betraying @f2's request to have exclusive write access.

This isn't a practical issue yet because afaict there's no codepath
inside the kernel that reopenes the same block device with
BLK_OPEN_RESTRICT_WRITES but it will be if there is.

If that's right then fix this by counting the number of
BLK_OPEN_RESTRICT_WRITES openers. So we only allow writes again once all
BLK_OPEN_RESTRICT_WRITES openers are done.

Fixes: ed5cc702d311 ("block: Add config option to not allow writing to mounted devices")
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index f819f3086905..42f84692404c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -776,17 +776,17 @@ void blkdev_put_no_open(struct block_device *bdev)
 
 static bool bdev_writes_blocked(struct block_device *bdev)
 {
-	return bdev->bd_writers == -1;
+	return bdev->bd_writers < 0;
 }
 
 static void bdev_block_writes(struct block_device *bdev)
 {
-	bdev->bd_writers = -1;
+	bdev->bd_writers--;
 }
 
 static void bdev_unblock_writes(struct block_device *bdev)
 {
-	bdev->bd_writers = 0;
+	bdev->bd_writers++;
 }
 
 static bool bdev_may_open(struct block_device *bdev, blk_mode_t mode)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
  2024-03-23 16:11   ` [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers Christian Brauner
@ 2024-03-25 11:51   ` Yu Kuai
  2024-03-25 12:04     ` Christian Brauner
  2024-03-25 13:54     ` Christian Brauner
  2024-03-26 12:57   ` Jan Kara
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Yu Kuai @ 2024-03-25 11:51 UTC (permalink / raw)
  To: Christian Brauner, Christoph Hellwig, Jan Kara, Jens Axboe
  Cc: Matthew Wilcox, linux-block, yukuai (C)

Hi,

在 2024/03/24 0:11, Christian Brauner 写道:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

I don't get it here, looks like there are no such use case. All users
passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.

Is the following root cause here?

1) t1 open with BLK_OPEN_WRITE
2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
we don't wait for t1 to close;
3) t1 close, after the commit, bdev_unblock_writes() is called
unexpected.

Following openers will succeed although t2 doesn't close;
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.

 From the path blkdev_open(), the file is from devtmpfs, and user can
pass in O_EXCL for that file, and that file will be used later in
blkdev_release() -> bdev_release() -> bdev_yield_write_access().

Thanks,
Kuai

> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>   block/bdev.c | 20 +++++++++++++-------
>   1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 7a5f611c3d2e..f819f3086905 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
>   		return;
>   
>   	bdev = file_bdev(bdev_file);
> -	/* Yield exclusive or shared write access. */
> -	if (bdev_file->f_mode & FMODE_WRITE) {
> -		if (bdev_writes_blocked(bdev))
> -			bdev_unblock_writes(bdev);
> -		else
> -			bdev->bd_writers--;
> -	}
> +
> +	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
> +	if (bdev_file->f_flags & O_EXCL)
> +		bdev_unblock_writes(bdev);
> +	else if (bdev_file->f_mode & FMODE_WRITE)
> +		bdev->bd_writers--;
>   }
>   
>   /**
> @@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
>   	else
>   		WARN_ON_ONCE(true);
>   
> +	/*
> +	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
> +	 * O_EXCL is stripped from userspace.
> +	 */
> +	if (mode & BLK_OPEN_RESTRICT_WRITES)
> +		flags |= O_EXCL;
> +
>   	if (mode & BLK_OPEN_NDELAY)
>   		flags |= O_NDELAY;
>   
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-25 11:51   ` [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly Yu Kuai
@ 2024-03-25 12:04     ` Christian Brauner
  2024-03-25 13:52       ` Yu Kuai
  2024-03-25 13:54     ` Christian Brauner
  1 sibling, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-03-25 12:04 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox,
	linux-block, yukuai (C)

On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/03/24 0:11, Christian Brauner 写道:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> I don't get it here, looks like there are no such use case. All users
> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.

sb_open_mode() does

#define sb_open_mode(flags) \
        (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
         (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-25 12:04     ` Christian Brauner
@ 2024-03-25 13:52       ` Yu Kuai
  0 siblings, 0 replies; 24+ messages in thread
From: Yu Kuai @ 2024-03-25 13:52 UTC (permalink / raw)
  To: Christian Brauner, Yu Kuai
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox,
	linux-block, yukuai (C)

Hi,

在 2024/03/25 20:04, Christian Brauner 写道:
> On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/03/24 0:11, Christian Brauner 写道:
>>> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
>>> default this option is set. When it is set the long-standing behavior
>>> of being able to write to mounted block devices is enabled.
>>>
>>> But in order to guard against unintended corruption by writing to the
>>> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
>>> off. In that case it isn't possible to write to mounted block devices
>>> anymore.
>>>
>>> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
>>> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
>>> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
>>> the mode was passed around. Since we managed to get rid of the bdev
>>> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
>>> on whether the file was opened writable and writes to that block device
>>> are blocked. That logic doesn't work because we do allow
>>> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
>>
>> I don't get it here, looks like there are no such use case. All users
>> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
> 
> sb_open_mode() does
> 
> #define sb_open_mode(flags) \
>          (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
>           (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))

Yes, you're right, thanks for the notice. And the problem that I
described should also exist.

BTW, do you agree that using O_EXCL is not correct here?

Thanks,
Kuai
> .
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-25 11:51   ` [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly Yu Kuai
  2024-03-25 12:04     ` Christian Brauner
@ 2024-03-25 13:54     ` Christian Brauner
  2024-03-26  1:32       ` Yu Kuai
  1 sibling, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-03-25 13:54 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox,
	linux-block, yukuai (C)

On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/03/24 0:11, Christian Brauner 写道:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> I don't get it here, looks like there are no such use case. All users
> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
> 
> Is the following root cause here?
> 
> 1) t1 open with BLK_OPEN_WRITE
> 2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
> we don't wait for t1 to close;
> 3) t1 close, after the commit, bdev_unblock_writes() is called
> unexpected.
> 
> Following openers will succeed although t2 doesn't close;
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> 
> From the path blkdev_open(), the file is from devtmpfs, and user can
> pass in O_EXCL for that file, and that file will be used later in
> blkdev_release() -> bdev_release() -> bdev_yield_write_access().

It can't because the VFS strips O_EXCL after the file has been opened.
Only internal opens can retain this flag. See do_dentry_open(). Or do
you mean something else?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-25 13:54     ` Christian Brauner
@ 2024-03-26  1:32       ` Yu Kuai
  0 siblings, 0 replies; 24+ messages in thread
From: Yu Kuai @ 2024-03-26  1:32 UTC (permalink / raw)
  To: Christian Brauner, Yu Kuai
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox,
	linux-block, yukuai (C)

Hi,

在 2024/03/25 21:54, Christian Brauner 写道:
> On Mon, Mar 25, 2024 at 07:51:27PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/03/24 0:11, Christian Brauner 写道:
>>> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
>>> default this option is set. When it is set the long-standing behavior
>>> of being able to write to mounted block devices is enabled.
>>>
>>> But in order to guard against unintended corruption by writing to the
>>> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
>>> off. In that case it isn't possible to write to mounted block devices
>>> anymore.
>>>
>>> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
>>> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
>>> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
>>> the mode was passed around. Since we managed to get rid of the bdev
>>> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
>>> on whether the file was opened writable and writes to that block device
>>> are blocked. That logic doesn't work because we do allow
>>> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
>>
>> I don't get it here, looks like there are no such use case. All users
>> passed in BLK_OPEN_RESTRICT_WRITES together with BLK_OPEN_WRITE.
>>
>> Is the following root cause here?
>>
>> 1) t1 open with BLK_OPEN_WRITE
>> 2) t2 open with BLK_OPEN_RESTRICT_WRITES, with bdev_block_writes(), yes
>> we don't wait for t1 to close;
>> 3) t1 close, after the commit, bdev_unblock_writes() is called
>> unexpected.
>>
>> Following openers will succeed although t2 doesn't close;
>>>
>>> So fix the detection logic. Use O_EXCL as an indicator that
>>> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
>>> for pidfds where O_EXCL means that this is a pidfd that refers to a
>>> thread. For userspace open paths O_EXCL will never be retained but for
>>> internal opens where we open files that are never installed into a file
>>> descriptor table this is fine.
>>
>>  From the path blkdev_open(), the file is from devtmpfs, and user can
>> pass in O_EXCL for that file, and that file will be used later in
>> blkdev_release() -> bdev_release() -> bdev_yield_write_access().
> 
> It can't because the VFS strips O_EXCL after the file has been opened.
> Only internal opens can retain this flag. See do_dentry_open(). Or do
> you mean something else?

Yes, I see that now, thanks for the explanation and forgive me that I'm
not that familiar with vfs code. :(

Now I think the patch can actually fix the problem, blkdev_open() and
blkdev_release() is not affected, and O_EXCL is not used from
bdev_file_open_by_dev() before. This is not straightforward, however I
can't find a better solution myself, so feel free to add:

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
  2024-03-23 16:11   ` [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers Christian Brauner
  2024-03-25 11:51   ` [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly Yu Kuai
@ 2024-03-26 12:57   ` Jan Kara
  2024-03-26 13:17     ` Christian Brauner
  2024-03-27 12:01   ` Christian Brauner
  2024-03-29  4:56   ` Matthew Wilcox
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2024-03-26 12:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox, linux-block

On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.
> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>

The fix looks correct but admittedly it looks a bit hacky. I'd prefer
storing the needed information in some other flag, preferably one that does
not already have a special meaning with block devices. But FMODE_ space is
exhausted and don't see another easy solution. So I guess:

Reviewed-by: Jan Kara <jack@suse.cz>

Thanks for looking into this!

								Honza

> ---
>  block/bdev.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 7a5f611c3d2e..f819f3086905 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,13 +821,12 @@ static void bdev_yield_write_access(struct file *bdev_file)
>  		return;
>  
>  	bdev = file_bdev(bdev_file);
> -	/* Yield exclusive or shared write access. */
> -	if (bdev_file->f_mode & FMODE_WRITE) {
> -		if (bdev_writes_blocked(bdev))
> -			bdev_unblock_writes(bdev);
> -		else
> -			bdev->bd_writers--;
> -	}
> +
> +	/* O_EXCL is only set for internal BLK_OPEN_RESTRICT_WRITES. */
> +	if (bdev_file->f_flags & O_EXCL)
> +		bdev_unblock_writes(bdev);
> +	else if (bdev_file->f_mode & FMODE_WRITE)
> +		bdev->bd_writers--;
>  }
>  
>  /**
> @@ -946,6 +945,13 @@ static unsigned blk_to_file_flags(blk_mode_t mode)
>  	else
>  		WARN_ON_ONCE(true);
>  
> +	/*
> +	 * BLK_OPEN_RESTRICT_WRITES is never set from userspace and
> +	 * O_EXCL is stripped from userspace.
> +	 */
> +	if (mode & BLK_OPEN_RESTRICT_WRITES)
> +		flags |= O_EXCL;
> +
>  	if (mode & BLK_OPEN_NDELAY)
>  		flags |= O_NDELAY;
>  
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 12:57   ` Jan Kara
@ 2024-03-26 13:17     ` Christian Brauner
  2024-03-26 13:31       ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-03-26 13:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-block

On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> > 
> > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > directly be raised by userspace. It is implicitly raised during
> > mounting.
> > 
> > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > unset.
> > 
> > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > Reported-by: Matthew Wilcox <willy@infradead.org>
> > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> storing the needed information in some other flag, preferably one that does
> not already have a special meaning with block devices. But FMODE_ space is
> exhausted and don't see another easy solution. So I guess:

Admittedly, it's not pretty but we're abusing O_EXCL already for block
devices. We do have FMODE_* available. We could add
FMODE_WRITE_RESTRICTED because we have two bits left.

One other solution apart from FMODE_* I had come up with was even
nastier which would've been using the struct fd approach of using the
two available bits in the pointer. But that doesn't work because we have
stuff like dm that passes in strings which are byte aligned. And it's
arguably uglier.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers
  2024-03-23 16:11   ` [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers Christian Brauner
@ 2024-03-26 13:24     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2024-03-26 13:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox, linux-block

On Sat 23-03-24 17:11:20, Christian Brauner wrote:
> The original changes in v6.8 do allow for a block device to be reopened
> with BLK_OPEN_RESTRICT_WRITES provided the same holder is used as per
> bdev_may_open(). I think that may have a bug.
		   		^^^^^^^^ is :)
Also AFAICT this can happen only the the reopen is O_RDONLY which would be
probably good to mention here.

> The first opener @f1 of that block device will set bdev->bd_writers to
> -1. The second opener @f2 using the same holder will pass the check in
> bdev_may_open() that bdev->bd_writers must not be greater than zero.
> 
> The first opener @f1 now closes the block device and in bdev_release()
> will end up calling bdev_yield_write_access() which calls
> bdev_writes_blocked() and sets bdev->bd_writers to 0 again.
> 
> Now @f2 holds a file to that block device which was opened with
> exclusive write access but bdev->bd_writers has been reset to 0.
> 
> So now @f3 comes along and succeeds in opening the block device with
> BLK_OPEN_WRITE betraying @f2's request to have exclusive write access.
> 
> This isn't a practical issue yet because afaict there's no codepath
> inside the kernel that reopenes the same block device with
> BLK_OPEN_RESTRICT_WRITES but it will be if there is.
> 
> If that's right then fix this by counting the number of
> BLK_OPEN_RESTRICT_WRITES openers. So we only allow writes again once all
> BLK_OPEN_RESTRICT_WRITES openers are done.
> 
> Fixes: ed5cc702d311 ("block: Add config option to not allow writing to mounted devices")
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Otherwise looks good so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/bdev.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index f819f3086905..42f84692404c 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -776,17 +776,17 @@ void blkdev_put_no_open(struct block_device *bdev)
>  
>  static bool bdev_writes_blocked(struct block_device *bdev)
>  {
> -	return bdev->bd_writers == -1;
> +	return bdev->bd_writers < 0;
>  }
>  
>  static void bdev_block_writes(struct block_device *bdev)
>  {
> -	bdev->bd_writers = -1;
> +	bdev->bd_writers--;
>  }
>  
>  static void bdev_unblock_writes(struct block_device *bdev)
>  {
> -	bdev->bd_writers = 0;
> +	bdev->bd_writers++;
>  }
>  
>  static bool bdev_may_open(struct block_device *bdev, blk_mode_t mode)
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 13:17     ` Christian Brauner
@ 2024-03-26 13:31       ` Jan Kara
  2024-03-26 15:46         ` [PATCH v2] " Christian Brauner
  2024-03-26 15:47         ` [PATCH 1/2] " Christian Brauner
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Kara @ 2024-03-26 13:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-block

On Tue 26-03-24 14:17:20, Christian Brauner wrote:
> On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> > On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> > storing the needed information in some other flag, preferably one that does
> > not already have a special meaning with block devices. But FMODE_ space is
> > exhausted and don't see another easy solution. So I guess:
> 
> Admittedly, it's not pretty but we're abusing O_EXCL already for block
> devices. We do have FMODE_* available. We could add
> FMODE_WRITE_RESTRICTED because we have two bits left.

Yeah, I'm mostly afraid that a few years down the road someone comes and
adds code thinking that file->f_flags & O_EXCL means this is exclusive bdev
open. Which will be kind of natural assumption but subtly wrong... So to
make the code more robust, I'd prefer burning a fmode flag for this rather
than abusing O_EXCL.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 13:31       ` Jan Kara
@ 2024-03-26 15:46         ` Christian Brauner
  2024-03-26 17:25           ` Christoph Hellwig
  2024-03-26 22:42           ` Jan Kara
  2024-03-26 15:47         ` [PATCH 1/2] " Christian Brauner
  1 sibling, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-26 15:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, Christoph Hellwig, Matthew Wilcox,
	linux-block, linux-fsdevel

Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.

But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.

A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

Fix the detection logic and use one of the FMODE_* bits we freed up a
while ago. We could've also abused O_EXCL as an indicator that
BLK_OPEN_RESTRICT_WRITES has been requested. For userspace open paths
O_EXCL will never be retained but for internal opens where we open files
that are never installed into a file descriptor table this is fine. But
it would be a gamble that this doesn't cause bugs. Note that
BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot directly
be raised by userspace. It is implicitly raised during mounting.

Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.

Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
Link: https://lore.kernel.org/r/20240323-zielbereich-mittragen-6fdf14876c3e@brauner
Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c       | 14 +++++++-------
 include/linux/fs.h |  2 ++
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 070890667563..6955693e4bcd 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -814,13 +814,11 @@ static void bdev_yield_write_access(struct file *bdev_file)
 		return;
 
 	bdev = file_bdev(bdev_file);
-	/* Yield exclusive or shared write access. */
-	if (bdev_file->f_mode & FMODE_WRITE) {
-		if (bdev_writes_blocked(bdev))
-			bdev_unblock_writes(bdev);
-		else
-			bdev->bd_writers--;
-	}
+
+	if (bdev_file->f_mode & FMODE_WRITE_RESTRICTED)
+		bdev_unblock_writes(bdev);
+	else if (bdev_file->f_mode & FMODE_WRITE)
+		bdev->bd_writers--;
 }
 
 /**
@@ -900,6 +898,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 	bdev_file->f_mode |= FMODE_BUF_RASYNC | FMODE_CAN_ODIRECT;
 	if (bdev_nowait(bdev))
 		bdev_file->f_mode |= FMODE_NOWAIT;
+	if (mode & BLK_OPEN_RESTRICT_WRITES)
+		bdev_file->f_mode |= FMODE_WRITE_RESTRICTED;
 	bdev_file->f_mapping = bdev->bd_inode->i_mapping;
 	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
 	bdev_file->private_data = holder;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00fc429b0af0..8dfd53b52744 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -121,6 +121,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_PWRITE		((__force fmode_t)0x10)
 /* File is opened for execution with sys_execve / sys_uselib */
 #define FMODE_EXEC		((__force fmode_t)0x20)
+/* File writes are restricted (block device specific) */
+#define FMODE_WRITE_RESTRICTED  ((__force fmode_t)0x40)
 /* 32bit hashes as llseek() offset (for directories) */
 #define FMODE_32BITHASH         ((__force fmode_t)0x200)
 /* 64bit hashes as llseek() offset (for directories) */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 13:31       ` Jan Kara
  2024-03-26 15:46         ` [PATCH v2] " Christian Brauner
@ 2024-03-26 15:47         ` Christian Brauner
  1 sibling, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-26 15:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox, linux-block

On Tue, Mar 26, 2024 at 02:31:07PM +0100, Jan Kara wrote:
> On Tue 26-03-24 14:17:20, Christian Brauner wrote:
> > On Tue, Mar 26, 2024 at 01:57:15PM +0100, Jan Kara wrote:
> > > On Sat 23-03-24 17:11:19, Christian Brauner wrote:
> > > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > > default this option is set. When it is set the long-standing behavior
> > > > of being able to write to mounted block devices is enabled.
> > > > 
> > > > But in order to guard against unintended corruption by writing to the
> > > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > > off. In that case it isn't possible to write to mounted block devices
> > > > anymore.
> > > > 
> > > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > > the mode was passed around. Since we managed to get rid of the bdev
> > > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > > on whether the file was opened writable and writes to that block device
> > > > are blocked. That logic doesn't work because we do allow
> > > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > > 
> > > > So fix the detection logic. Use O_EXCL as an indicator that
> > > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > > thread. For userspace open paths O_EXCL will never be retained but for
> > > > internal opens where we open files that are never installed into a file
> > > > descriptor table this is fine.
> > > > 
> > > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > > directly be raised by userspace. It is implicitly raised during
> > > > mounting.
> > > > 
> > > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > > unset.
> > > > 
> > > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > > 
> > > The fix looks correct but admittedly it looks a bit hacky. I'd prefer
> > > storing the needed information in some other flag, preferably one that does
> > > not already have a special meaning with block devices. But FMODE_ space is
> > > exhausted and don't see another easy solution. So I guess:
> > 
> > Admittedly, it's not pretty but we're abusing O_EXCL already for block
> > devices. We do have FMODE_* available. We could add
> > FMODE_WRITE_RESTRICTED because we have two bits left.
> 
> Yeah, I'm mostly afraid that a few years down the road someone comes and
> adds code thinking that file->f_flags & O_EXCL means this is exclusive bdev
> open. Which will be kind of natural assumption but subtly wrong... So to
> make the code more robust, I'd prefer burning a fmode flag for this rather
> than abusing O_EXCL.

Ok, done and resend. Thanks for the feedback! I was rather focussed on
not using a bit but I agree it's likely the right thing to do!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 15:46         ` [PATCH v2] " Christian Brauner
@ 2024-03-26 17:25           ` Christoph Hellwig
  2024-03-26 22:42           ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-03-26 17:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, linux-block, linux-fsdevel

Not a fan of the new bit, but I see no good way around it.

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-26 15:46         ` [PATCH v2] " Christian Brauner
  2024-03-26 17:25           ` Christoph Hellwig
@ 2024-03-26 22:42           ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2024-03-26 22:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, linux-block, linux-fsdevel

On Tue 26-03-24 16:46:19, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> Fix the detection logic and use one of the FMODE_* bits we freed up a
> while ago. We could've also abused O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. For userspace open paths
> O_EXCL will never be retained but for internal opens where we open files
> that are never installed into a file descriptor table this is fine. But
> it would be a gamble that this doesn't cause bugs. Note that
> BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot directly
> be raised by userspace. It is implicitly raised during mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Link: https://lore.kernel.org/r/20240323-zielbereich-mittragen-6fdf14876c3e@brauner
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/bdev.c       | 14 +++++++-------
>  include/linux/fs.h |  2 ++
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 070890667563..6955693e4bcd 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -814,13 +814,11 @@ static void bdev_yield_write_access(struct file *bdev_file)
>  		return;
>  
>  	bdev = file_bdev(bdev_file);
> -	/* Yield exclusive or shared write access. */
> -	if (bdev_file->f_mode & FMODE_WRITE) {
> -		if (bdev_writes_blocked(bdev))
> -			bdev_unblock_writes(bdev);
> -		else
> -			bdev->bd_writers--;
> -	}
> +
> +	if (bdev_file->f_mode & FMODE_WRITE_RESTRICTED)
> +		bdev_unblock_writes(bdev);
> +	else if (bdev_file->f_mode & FMODE_WRITE)
> +		bdev->bd_writers--;
>  }
>  
>  /**
> @@ -900,6 +898,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
>  	bdev_file->f_mode |= FMODE_BUF_RASYNC | FMODE_CAN_ODIRECT;
>  	if (bdev_nowait(bdev))
>  		bdev_file->f_mode |= FMODE_NOWAIT;
> +	if (mode & BLK_OPEN_RESTRICT_WRITES)
> +		bdev_file->f_mode |= FMODE_WRITE_RESTRICTED;
>  	bdev_file->f_mapping = bdev->bd_inode->i_mapping;
>  	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
>  	bdev_file->private_data = holder;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 00fc429b0af0..8dfd53b52744 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -121,6 +121,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  #define FMODE_PWRITE		((__force fmode_t)0x10)
>  /* File is opened for execution with sys_execve / sys_uselib */
>  #define FMODE_EXEC		((__force fmode_t)0x20)
> +/* File writes are restricted (block device specific) */
> +#define FMODE_WRITE_RESTRICTED  ((__force fmode_t)0x40)
>  /* 32bit hashes as llseek() offset (for directories) */
>  #define FMODE_32BITHASH         ((__force fmode_t)0x200)
>  /* 64bit hashes as llseek() offset (for directories) */
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
                     ` (2 preceding siblings ...)
  2024-03-26 12:57   ` Jan Kara
@ 2024-03-27 12:01   ` Christian Brauner
  2024-03-29  4:56   ` Matthew Wilcox
  4 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-27 12:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Matthew Wilcox, linux-block, Christoph Hellwig, Jan Kara, Jens Axboe

On Sat, 23 Mar 2024 17:11:19 +0100, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> [...]

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
      https://git.kernel.org/vfs/vfs/c/ddd65e19c601
[2/2] block: count BLK_OPEN_RESTRICT_WRITES openers
      https://git.kernel.org/vfs/vfs/c/3ff56e285de5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
                     ` (3 preceding siblings ...)
  2024-03-27 12:01   ` Christian Brauner
@ 2024-03-29  4:56   ` Matthew Wilcox
  2024-03-29 12:10     ` Christian Brauner
  4 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-03-29  4:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> default this option is set. When it is set the long-standing behavior
> of being able to write to mounted block devices is enabled.
> 
> But in order to guard against unintended corruption by writing to the
> block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> off. In that case it isn't possible to write to mounted block devices
> anymore.
> 
> A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> the mode was passed around. Since we managed to get rid of the bdev
> handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> on whether the file was opened writable and writes to that block device
> are blocked. That logic doesn't work because we do allow
> BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> 
> So fix the detection logic. Use O_EXCL as an indicator that
> BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> for pidfds where O_EXCL means that this is a pidfd that refers to a
> thread. For userspace open paths O_EXCL will never be retained but for
> internal opens where we open files that are never installed into a file
> descriptor table this is fine.
> 
> Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> directly be raised by userspace. It is implicitly raised during
> mounting.
> 
> Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> unset.
> 
> Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> Reported-by: Matthew Wilcox <willy@infradead.org>
> Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>

So v1 of this patch works fine.  I just got round to testing v2, and it
does not.  Indeed, applying 2/2 causes root to fail to mount:

/dev/root: Can't open blockdev
List of all bdev filesystems:
 ext3
 ext2
 ext4
 xfs

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)

Applying only 1/2 boots but fails to fix the bug.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-29  4:56   ` Matthew Wilcox
@ 2024-03-29 12:10     ` Christian Brauner
  2024-03-29 15:11       ` Christian Brauner
  2024-04-03  6:04       ` Christian Brauner
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-29 12:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > default this option is set. When it is set the long-standing behavior
> > of being able to write to mounted block devices is enabled.
> > 
> > But in order to guard against unintended corruption by writing to the
> > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > off. In that case it isn't possible to write to mounted block devices
> > anymore.
> > 
> > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > the mode was passed around. Since we managed to get rid of the bdev
> > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > on whether the file was opened writable and writes to that block device
> > are blocked. That logic doesn't work because we do allow
> > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > 
> > So fix the detection logic. Use O_EXCL as an indicator that
> > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > thread. For userspace open paths O_EXCL will never be retained but for
> > internal opens where we open files that are never installed into a file
> > descriptor table this is fine.
> > 
> > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > directly be raised by userspace. It is implicitly raised during
> > mounting.
> > 
> > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > unset.
> > 
> > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > Reported-by: Matthew Wilcox <willy@infradead.org>
> > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> So v1 of this patch works fine.  I just got round to testing v2, and it
> does not.  Indeed, applying 2/2 causes root to fail to mount:
> 
> /dev/root: Can't open blockdev
> List of all bdev filesystems:
>  ext3
>  ext2
>  ext4
>  xfs
> 
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
> 
> Applying only 1/2 boots but fails to fix the bug.

Thanks for testing this. This is odd because I tested with the setup you
provided.

I used the kernel config you sent to me in [2] with an xfs root device
with direct kernel boot and the following xfstests config in [3]. I'm
booting the vm with:

qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
        -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
        -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
        -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
        -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"

Note that the config you gave me in [2] didn't include
CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
missing config option and everything worked fine for me.

Can you please test what's in the vfs.fixes branch on
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
sure that we're testing the same thing?

The failures that I see are:

Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
xfs/558 xfs/559
Failed 16 of 930 tests

* generic/645 fails because it requires an unrelated fix to fstests
  because we changed idmapped mounts to not not allow empty idmappings.
* generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
* xfs/558 and xfs/559 complain about missing logging
  about iomap validation and are unrelated
* All other failures are caused by loop devices which is expected unil
  a util-linux is released that contains Jan's fix in [1] so that
  mount(8) doesn't hold a writable fd to the loop device anymore and
  instead simply uses a read-only one.

[1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
[2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
[3]:
#! /bin/bash

set -x

cd ~/src/git/xfstests-dev/
FIRST_DEV=/dev/vda3
SECOND_DEV=/dev/vda4
THIRD_DEV=/dev/vda5

echo "Testing xfs"
cat <<EOF >local.config
FSTYP=xfs
export TEST_DEV=${FIRST_DEV}
export SCRATCH_DEV=${SECOND_DEV}
export LOGWRITE_DEV=${THIRD_DEV}
export TEST_DIR=/mnt/test
export SCRATCH_MNT=/mnt/scratch
EOF

sudo mkfs.xfs -f ${FIRST_DEV}
sudo mkfs.xfs -f ${SECOND_DEV}
sudo mkfs.xfs -f ${THIRD_DEV}
sudo ./check -g quick

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-29 12:10     ` Christian Brauner
@ 2024-03-29 15:11       ` Christian Brauner
  2024-03-29 15:24         ` Christian Brauner
  2024-04-03  6:04       ` Christian Brauner
  1 sibling, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-03-29 15:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

On Fri, Mar 29, 2024 at 01:10:57PM +0100, Christian Brauner wrote:
> On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > So v1 of this patch works fine.  I just got round to testing v2, and it
> > does not.  Indeed, applying 2/2 causes root to fail to mount:
> > 
> > /dev/root: Can't open blockdev
> > List of all bdev filesystems:
> >  ext3
> >  ext2
> >  ext4
> >  xfs
> > 
> > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)

One other observation. This line right here is suspicious.
What I expect to see here is something like this:

Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/vda2" or unknown-block(254,2) ]---

IOW, the name of the device that failed to mount should be printed.
Since that doesn't show up this indicates that the kernel booted here
didn't have commit 84d2b696236c ("init/mount: print pretty name of root
device when panics") which I've merged for v6.7.

So that output looks like it's from a pre v6.7 kernel. But Jan's write
restriction changes only made it into v6.8. So even if the patches
somehow were to apply to this kernel they'd be entirely ineffective
since CONFIG_BLK_DEV_WRITE_MOUNTED is not available in this kernel.

> > 
> > Applying only 1/2 boots but fails to fix the bug.
> 
> Thanks for testing this. This is odd because I tested with the setup you
> provided.
> 
> I used the kernel config you sent to me in [2] with an xfs root device
> with direct kernel boot and the following xfstests config in [3]. I'm
> booting the vm with:
> 
> qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
>         -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
>         -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
>         -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
>         -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"
> 
> Note that the config you gave me in [2] didn't include
> CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
> missing config option and everything worked fine for me.
> 
> Can you please test what's in the vfs.fixes branch on
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
> sure that we're testing the same thing?
> 
> The failures that I see are:
> 
> Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
> xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
> xfs/558 xfs/559
> Failed 16 of 930 tests
> 
> * generic/645 fails because it requires an unrelated fix to fstests
>   because we changed idmapped mounts to not not allow empty idmappings.
> * generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
> * xfs/558 and xfs/559 complain about missing logging
>   about iomap validation and are unrelated
> * All other failures are caused by loop devices which is expected unil
>   a util-linux is released that contains Jan's fix in [1] so that
>   mount(8) doesn't hold a writable fd to the loop device anymore and
>   instead simply uses a read-only one.
> 
> [1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
> [2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
> [3]:
> #! /bin/bash
> 
> set -x
> 
> cd ~/src/git/xfstests-dev/
> FIRST_DEV=/dev/vda3
> SECOND_DEV=/dev/vda4
> THIRD_DEV=/dev/vda5
> 
> echo "Testing xfs"
> cat <<EOF >local.config
> FSTYP=xfs
> export TEST_DEV=${FIRST_DEV}
> export SCRATCH_DEV=${SECOND_DEV}
> export LOGWRITE_DEV=${THIRD_DEV}
> export TEST_DIR=/mnt/test
> export SCRATCH_MNT=/mnt/scratch
> EOF
> 
> sudo mkfs.xfs -f ${FIRST_DEV}
> sudo mkfs.xfs -f ${SECOND_DEV}
> sudo mkfs.xfs -f ${THIRD_DEV}
> sudo ./check -g quick

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-29 15:11       ` Christian Brauner
@ 2024-03-29 15:24         ` Christian Brauner
  0 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-03-29 15:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

> One other observation. This line right here is suspicious.
> What I expect to see here is something like this:
> 
> Kernel panic - not syncing: VFS: Unable to mount root fs on "/dev/vda2" or unknown-block(254,2) ]---

Sorry, I looked at the wrong codepath here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-03-29 12:10     ` Christian Brauner
  2024-03-29 15:11       ` Christian Brauner
@ 2024-04-03  6:04       ` Christian Brauner
  2024-04-03 19:22         ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2024-04-03  6:04 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

On Fri, Mar 29, 2024 at 01:10:57PM +0100, Christian Brauner wrote:
> On Fri, Mar 29, 2024 at 04:56:07AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 23, 2024 at 05:11:19PM +0100, Christian Brauner wrote:
> > > Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
> > > default this option is set. When it is set the long-standing behavior
> > > of being able to write to mounted block devices is enabled.
> > > 
> > > But in order to guard against unintended corruption by writing to the
> > > block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
> > > off. In that case it isn't possible to write to mounted block devices
> > > anymore.
> > > 
> > > A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
> > > which disallows concurrent BLK_OPEN_WRITE access. When we still had the
> > > bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
> > > the mode was passed around. Since we managed to get rid of the bdev
> > > handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
> > > on whether the file was opened writable and writes to that block device
> > > are blocked. That logic doesn't work because we do allow
> > > BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.
> > > 
> > > So fix the detection logic. Use O_EXCL as an indicator that
> > > BLK_OPEN_RESTRICT_WRITES has been requested. We do the exact same thing
> > > for pidfds where O_EXCL means that this is a pidfd that refers to a
> > > thread. For userspace open paths O_EXCL will never be retained but for
> > > internal opens where we open files that are never installed into a file
> > > descriptor table this is fine.
> > > 
> > > Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot
> > > directly be raised by userspace. It is implicitly raised during
> > > mounting.
> > > 
> > > Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
> > > unset.
> > > 
> > > Fixes: 321de651fa56 ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
> > > Reported-by: Matthew Wilcox <willy@infradead.org>
> > > Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > 
> > So v1 of this patch works fine.  I just got round to testing v2, and it
> > does not.  Indeed, applying 2/2 causes root to fail to mount:
> > 
> > /dev/root: Can't open blockdev
> > List of all bdev filesystems:
> >  ext3
> >  ext2
> >  ext4
> >  xfs
> > 
> > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
> > 
> > Applying only 1/2 boots but fails to fix the bug.
> 
> Thanks for testing this. This is odd because I tested with the setup you
> provided.
> 
> I used the kernel config you sent to me in [2] with an xfs root device
> with direct kernel boot and the following xfstests config in [3]. I'm
> booting the vm with:
> 
> qemu-system-x86_64 -machine type=q35 -smp 1 -m 4G -accel kvm -cpu max -nographic -nodefaults \
>         -chardev stdio,mux=on,id=console,signal=off -serial chardev:console -mon console \
>         -kernel /home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.vmlinuz \
>         -drive file=/home/ubuntu/data/mkosi-kernel2/mkosi.output.debian/image.raw,format=raw,if=virtio \
>         -append "console=ttyS0 root=/dev/vda2 module_blacklist=vmw_vmci systemd.tty.term.ttyS0=screen-256color systemd.tty.columns.ttyS0=96 systemd.tty.rows.ttyS0=46 debug loglevel=4 SYSTEMD_"
> 
> Note that the config you gave me in [2] didn't include
> CONFIG_SCSI_VIRTIO=y which means I got the splat you did. I added this
> missing config option and everything worked fine for me.
> 
> Can you please test what's in the vfs.fixes branch on
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git so we're
> sure that we're testing the same thing?

Willy, can you still reproduce this? I've been delaying the pull request
to give you time to verify this but I would really like to send it
before Friday. So it'd be really great if you could get back to me on
this.

> 
> The failures that I see are:
> 
> Failures: generic/042 generic/645 generic/682 generic/689 xfs/014
> xfs/017 xfs/049 xfs/129 xfs/176 xfs/206 xfs/216 xfs/234 xfs/250 xfs/289
> xfs/558 xfs/559
> Failed 16 of 930 tests
> 
> * generic/645 fails because it requires an unrelated fix to fstests
>   because we changed idmapped mounts to not not allow empty idmappings.
> * generic/689 fails because the providec config doesn't compile tmpfs with POSIX ACL support
> * xfs/558 and xfs/559 complain about missing logging
>   about iomap validation and are unrelated
> * All other failures are caused by loop devices which is expected unil
>   a util-linux is released that contains Jan's fix in [1] so that
>   mount(8) doesn't hold a writable fd to the loop device anymore and
>   instead simply uses a read-only one.
> 
> [1]: https://github.com/util-linux/util-linux/commit/1cde32f323e0970f6c7f35940dcc0aea97b821e5
> [2]: https://lore.kernel.org/r/Zf18I2UOGQxeN-Z1@casper.infradead.org
> [3]:
> #! /bin/bash
> 
> set -x
> 
> cd ~/src/git/xfstests-dev/
> FIRST_DEV=/dev/vda3
> SECOND_DEV=/dev/vda4
> THIRD_DEV=/dev/vda5
> 
> echo "Testing xfs"
> cat <<EOF >local.config
> FSTYP=xfs
> export TEST_DEV=${FIRST_DEV}
> export SCRATCH_DEV=${SECOND_DEV}
> export LOGWRITE_DEV=${THIRD_DEV}
> export TEST_DIR=/mnt/test
> export SCRATCH_MNT=/mnt/scratch
> EOF
> 
> sudo mkfs.xfs -f ${FIRST_DEV}
> sudo mkfs.xfs -f ${SECOND_DEV}
> sudo mkfs.xfs -f ${THIRD_DEV}
> sudo ./check -g quick

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly
  2024-04-03  6:04       ` Christian Brauner
@ 2024-04-03 19:22         ` Matthew Wilcox
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-04-03 19:22 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Christoph Hellwig, Jan Kara, Jens Axboe, linux-block

On Wed, Apr 03, 2024 at 08:04:16AM +0200, Christian Brauner wrote:
> Willy, can you still reproduce this? I've been delaying the pull request
> to give you time to verify this but I would really like to send it
> before Friday. So it'd be really great if you could get back to me on
> this.

Looks like the latest version fixed it.  Just built & booted next-20240403
with no patches and it works fine.  Thanks for fixing.  (I took a few
days off over Easter, so didn't look at this before now)

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-04-03 19:22 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-23 14:54 [PATCH] block: handle BLK_OPEN_RESTRICT_WRITES correctly Christian Brauner
2024-03-23 15:59 ` Christian Brauner
2024-03-23 16:11 ` [PATCH 1/2] " Christian Brauner
2024-03-23 16:11   ` [PATCH 2/2] [RFC]: block: count BLK_OPEN_RESTRICT_WRITES openers Christian Brauner
2024-03-26 13:24     ` Jan Kara
2024-03-25 11:51   ` [PATCH 1/2] block: handle BLK_OPEN_RESTRICT_WRITES correctly Yu Kuai
2024-03-25 12:04     ` Christian Brauner
2024-03-25 13:52       ` Yu Kuai
2024-03-25 13:54     ` Christian Brauner
2024-03-26  1:32       ` Yu Kuai
2024-03-26 12:57   ` Jan Kara
2024-03-26 13:17     ` Christian Brauner
2024-03-26 13:31       ` Jan Kara
2024-03-26 15:46         ` [PATCH v2] " Christian Brauner
2024-03-26 17:25           ` Christoph Hellwig
2024-03-26 22:42           ` Jan Kara
2024-03-26 15:47         ` [PATCH 1/2] " Christian Brauner
2024-03-27 12:01   ` Christian Brauner
2024-03-29  4:56   ` Matthew Wilcox
2024-03-29 12:10     ` Christian Brauner
2024-03-29 15:11       ` Christian Brauner
2024-03-29 15:24         ` Christian Brauner
2024-04-03  6:04       ` Christian Brauner
2024-04-03 19:22         ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.