All of lore.kernel.org
 help / color / mirror / Atom feed
* Use of PageError in superblock writes
@ 2022-05-14 17:58 Matthew Wilcox
  2022-05-24 16:53 ` David Sterba
  0 siblings, 1 reply; 2+ messages in thread
From: Matthew Wilcox @ 2022-05-14 17:58 UTC (permalink / raw)
  To: linux-btrfs

There's a need to reclaim page flag bits, and one of the easier
ones to remove is PG_error (aka {Set,Clear,}PageError(), aka
folio_{test,set,clear}_error().

That caused me to look at wait_dev_supers(), and it's a little unusual.
Instead of following the usual pagecache writeback procedure, it
locks the page, memcpy() into it, submits its own BIO, and then
unlocks the page in the bio endio routine.  wait_dev_supers()
waits for the page to be unlocked (ie the write completed) and then
checks PageError() to see if it worked.

I don't think this is buggy.  It's just confusing to see a write being
waited for on the locked bit instead of on the writeback bit.  But I
get why you're doing it; this is such a special case that there's no
need to do the usual pagecache dance with the dirty & writeback flags.

The handling of the uptodate flag is a little strange though.
You're leaving it clear if the write fails, which means that it'll be
re-read from storage if someone tries to read that block through the
pagecache from the underlying device file.  What's a bit weirder (and
maybe buggy?) is that if you're on a machine with a 64KiB page size,
you fetch a 64KiB page from the cache, write 4KiB into it, and then
(assuming the write is successful), mark the entire 64KiB as being
uptodate in the page cache for the underlying disk mapping, which means
that anyone reading the other 60KiB in the page cache is going to be
reading uninitialised memory.  You'd have to be root to do that, so I
doubt it's any kind of security concern, but it's not great.

I wonder if the page cache is really helping you here.  As far as I can
tell, you'd be better off allocating a single page, storing a pointer to
it in struct btrfs_device, and maintaining your own metadata in struct
page (how many writes are still outstanding, how many had errors).
That gives you five words (minus one bit) to play with without worrying
about what the rest of the system is doing with page flags.

Am I wrong about that?  Is there an advantage to having the superblock
pages stored in the bdev's i_mapping?  Or can we just use your O_DIRECT
path to read/write kernel memory directly?

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Use of PageError in superblock writes
  2022-05-14 17:58 Use of PageError in superblock writes Matthew Wilcox
@ 2022-05-24 16:53 ` David Sterba
  0 siblings, 0 replies; 2+ messages in thread
From: David Sterba @ 2022-05-24 16:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-btrfs

On Sat, May 14, 2022 at 06:58:29PM +0100, Matthew Wilcox wrote:
> That caused me to look at wait_dev_supers(), and it's a little unusual.
> Instead of following the usual pagecache writeback procedure, it
> locks the page, memcpy() into it, submits its own BIO, and then
> unlocks the page in the bio endio routine.  wait_dev_supers()
> waits for the page to be unlocked (ie the write completed) and then
> checks PageError() to see if it worked.
> 
> I don't think this is buggy.  It's just confusing to see a write being
> waited for on the locked bit instead of on the writeback bit.  But I
> get why you're doing it; this is such a special case that there's no
> need to do the usual pagecache dance with the dirty & writeback flags.

I think that this is the reason, the page lock is used for waiting
because "it works" and follows the direct change from buffer_head in
commit 314b6dd0eebfa9962. Switching to another way can be done and I
think somebody suggested to use page cache, but this was not
implemented, the buffer head -> bio was basically API transformation.

> The handling of the uptodate flag is a little strange though.
> You're leaving it clear if the write fails, which means that it'll be
> re-read from storage if someone tries to read that block through the
> pagecache from the underlying device file.  What's a bit weirder (and
> maybe buggy?) is that if you're on a machine with a 64KiB page size,
> you fetch a 64KiB page from the cache, write 4KiB into it, and then
> (assuming the write is successful), mark the entire 64KiB as being
> uptodate in the page cache for the underlying disk mapping, which means
> that anyone reading the other 60KiB in the page cache is going to be
> reading uninitialised memory.  You'd have to be root to do that, so I
> doubt it's any kind of security concern, but it's not great.

This does not sound like a serious problem but I agree that it's weird
and should be fixed.

> I wonder if the page cache is really helping you here.  As far as I can
> tell, you'd be better off allocating a single page, storing a pointer to
> it in struct btrfs_device, and maintaining your own metadata in struct
> page (how many writes are still outstanding, how many had errors).
> That gives you five words (minus one bit) to play with without worrying
> about what the rest of the system is doing with page flags.
> 
> Am I wrong about that?  Is there an advantage to having the superblock
> pages stored in the bdev's i_mapping?  Or can we just use your O_DIRECT
> path to read/write kernel memory directly?

No, I think you have a point, a separate page can work fine. I'm not
sure about using direct io here, the single page should make it obvious
what's going on and super block handling can be special.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-05-24 16:57 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-14 17:58 Use of PageError in superblock writes Matthew Wilcox
2022-05-24 16:53 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.