Use of PageError in superblock writes

* Use of PageError in superblock writes
@ 2022-05-14 17:58 Matthew Wilcox
  2022-05-24 16:53 ` David Sterba
  0 siblings, 1 reply; 2+ messages in thread
From: Matthew Wilcox @ 2022-05-14 17:58 UTC (permalink / raw)
  To: linux-btrfs

There's a need to reclaim page flag bits, and one of the easier
ones to remove is PG_error (aka {Set,Clear,}PageError(), aka
folio_{test,set,clear}_error().

That caused me to look at wait_dev_supers(), and it's a little unusual.
Instead of following the usual pagecache writeback procedure, it
locks the page, memcpy() into it, submits its own BIO, and then
unlocks the page in the bio endio routine.  wait_dev_supers()
waits for the page to be unlocked (ie the write completed) and then
checks PageError() to see if it worked.

I don't think this is buggy.  It's just confusing to see a write being
waited for on the locked bit instead of on the writeback bit.  But I
get why you're doing it; this is such a special case that there's no
need to do the usual pagecache dance with the dirty & writeback flags.

The handling of the uptodate flag is a little strange though.
You're leaving it clear if the write fails, which means that it'll be
re-read from storage if someone tries to read that block through the
pagecache from the underlying device file.  What's a bit weirder (and
maybe buggy?) is that if you're on a machine with a 64KiB page size,
you fetch a 64KiB page from the cache, write 4KiB into it, and then
(assuming the write is successful), mark the entire 64KiB as being
uptodate in the page cache for the underlying disk mapping, which means
that anyone reading the other 60KiB in the page cache is going to be
reading uninitialised memory.  You'd have to be root to do that, so I
doubt it's any kind of security concern, but it's not great.

I wonder if the page cache is really helping you here.  As far as I can
tell, you'd be better off allocating a single page, storing a pointer to
it in struct btrfs_device, and maintaining your own metadata in struct
page (how many writes are still outstanding, how many had errors).
That gives you five words (minus one bit) to play with without worrying
about what the rest of the system is doing with page flags.

Am I wrong about that?  Is there an advantage to having the superblock
pages stored in the bdev's i_mapping?  Or can we just use your O_DIRECT
path to read/write kernel memory directly?

^ permalink raw reply	[flat|nested] 2+ messages in thread