* error writing primary super block on zoned btrfs
@ 2022-07-18  5:49 Christoph Hellwig
  2022-07-18 12:28 ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2022-07-18  5:49 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: Matthew Wilcox, linux-btrfs

Hi Naohiro, (and willy for insights on the pagecache, see below),

when running plain fsx on zoned btrfs on a null_blk device set up as below:

dev="/sys/kernel/config/nullb/nullb1"
size=12800 # MB

mkdir ${dev}
echo 2 > "${dev}"/submit_queues
echo 2 > "${dev}"/queue_mode
echo 2 > "${dev}"/irqmode
echo "${size}" > "${dev}"/size
echo 1 > "${dev}"/zoned
echo 0 > "${dev}"/zone_nr_conv
echo 128 > "${dev}"/zone_size
echo 96 > "${dev}"/zone_capacity
echo 14 > "${dev}"/zone_max_active
echo 1 > "${dev}"/memory_backed
echo 1000000 > "${dev}"/completion_nsec
echo 1 > "${dev}"/power
mkfs.btrfs -m single /dev/nullb1
mount /dev/nullb1 /mnt/test/
~/xfstests-dev/ltp/fsx /mnt/test/foobar

fsx will eventually fail after ~10 minutes, with the following left
in dmesg:

[ 1185.480200] BTRFS error (device nullb1): error writing primary super block to device 1
[ 1185.480988] BTRFS: error (device nullb1) in write_all_supers:4488: errno=-5 IO failure (1 errors while writing supers)
[ 1185.481971] BTRFS info (device nullb1: state E): forced readonly
[ 1185.482521] BTRFS: error (device nullb1: state EA) in btrfs_sync_log:3341: errno=-5 IO failure

I tracked this down to the find_get_page call in wait_dev_supers
returning NULL, and digging further it seems to come from
xa_is_value() in __filemap_get_folio returning true.  I'm not sure
why we'd see a value entry here in the block device mapping, or why
that only happens in zoned mode (the same config on a regular device
ran for 10 hours last night).
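
Roughly, the relevant part of the lookup looks like this (a simplified
sketch, not the actual mm/filemap.c code; get_folio_sketch is a made-up
name used only for illustration):

#include <linux/pagemap.h>	/* FGP_ENTRY, struct address_space */
#include <linux/xarray.h>	/* xa_load(), xa_is_value() */

/*
 * Simplified sketch of why find_get_page() can return NULL here: the
 * xarray slot for the index holds a shadow ("value") entry instead of
 * a real folio, and without FGP_ENTRY that is reported as "not cached".
 */
static struct folio *get_folio_sketch(struct address_space *mapping,
				      pgoff_t index, unsigned int fgp_flags)
{
	struct folio *folio = xa_load(&mapping->i_pages, index);

	if (xa_is_value(folio)) {		/* shadow entry left by reclaim */
		if (fgp_flags & FGP_ENTRY)	/* caller asked for raw entries */
			return folio;
		folio = NULL;			/* find_get_page() sees NULL */
	}
	return folio;
}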



* Re: error writing primary super block on zoned btrfs
  2022-07-18  5:49 error writing primary super block on zoned btrfs Christoph Hellwig
@ 2022-07-18 12:28 ` Matthew Wilcox
  2022-07-18 12:33   ` Christoph Hellwig
  2022-07-19  7:53   ` Johannes Thumshirn
  0 siblings, 2 replies; 7+ messages in thread
From: Matthew Wilcox @ 2022-07-18 12:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Naohiro Aota, linux-btrfs

On Mon, Jul 18, 2022 at 07:49:44AM +0200, Christoph Hellwig wrote:
> Hi Naohiro, (and willy for insights on the pagecache, see below),
> 
> when running plain fsx on zoned btrfs on a null_blk device set up as below:
> 
> dev="/sys/kernel/config/nullb/nullb1"
> size=12800 # MB
> 
> mkdir ${dev}
> echo 2 > "${dev}"/submit_queues
> echo 2 > "${dev}"/queue_mode
> echo 2 > "${dev}"/irqmode
> echo "${size}" > "${dev}"/size
> echo 1 > "${dev}"/zoned
> echo 0 > "${dev}"/zone_nr_conv
> echo 128 > "${dev}"/zone_size
> echo 96 > "${dev}"/zone_capacity
> echo 14 > "${dev}"/zone_max_active
> echo 1 > "${dev}"/memory_backed
> echo 1000000 > "${dev}"/completion_nsec
> echo 1 > "${dev}"/power
> mkfs.btrfs -m single /dev/nullb1
> mount /dev/nullb1 /mnt/test/
> ~/xfstests-dev/ltp/fsx /mnt/test/foobar
> 
> fsx will eventually fail after ~10 minutes, with the following left
> in dmesg:
> 
> [ 1185.480200] BTRFS error (device nullb1): error writing primary super block to device 1
> [ 1185.480988] BTRFS: error (device nullb1) in write_all_supers:4488: errno=-5 IO failure (1 errors while writing supers)
> [ 1185.481971] BTRFS info (device nullb1: state E): forced readonly
> [ 1185.482521] BTRFS: error (device nullb1: state EA) in btrfs_sync_log:3341: errno=-5 IO failure
> 
> I tracked this down to the find_get_page call in wait_dev_supers
> returning NULL, and digging further it seems to come from
> xa_is_value() in __filemap_get_folio returning true.  I'm not sure
> why we'd see a value entry here in the block device mapping, or why
> that only happens in zoned mode (the same config on a regular device
> ran for 10 hours last night).

A "value" entry in the block device's i_pages will be a shadow entry --
that is, the page has reached the end of the LRU list and been discarded,
so we made a note that we would have liked to keep it in the LRU list,
but we didn't have enough memory in the system to do so.  That helps
us put it back in the right position in the LRU list when it gets
brought back in from disc.
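
For reference, a value entry is just a tagged pointer; the following is
a paraphrase of the helpers in include/linux/xarray.h (the _sketch
suffix marks these as illustrative copies, not the real names):

#include <linux/types.h>

/*
 * Sketch of the xarray value-entry encoding: a value entry has the low
 * bit set, so it can never be confused with a pointer to a folio.  The
 * workingset code packs the eviction information into exactly such an
 * entry when a page is reclaimed; that is what a shadow entry is.
 */
static inline void *xa_mk_value_sketch(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static inline bool xa_is_value_sketch(const void *entry)
{
	return (unsigned long)entry & 1;
}

static inline unsigned long xa_to_value_sketch(const void *entry)
{
	return (unsigned long)entry >> 1;
}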

I'd suggest something else has gone wrong; maybe the refcount should
have been kept elevated to prevent the superblock from being paged out.
I find it hard to believe that we can be so low on memory that we need
to page out a superblock to make room for some other memory allocation.

(Although maybe if you have millions of unused filesystems mounted ...?)


* Re: error writing primary super block on zoned btrfs
  2022-07-18 12:28 ` Matthew Wilcox
@ 2022-07-18 12:33   ` Christoph Hellwig
  2022-07-19  7:53   ` Johannes Thumshirn
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2022-07-18 12:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, Naohiro Aota, linux-btrfs

On Mon, Jul 18, 2022 at 01:28:52PM +0100, Matthew Wilcox wrote:
> A "value" entry in the block device's i_pages will be a shadow entry --
> that is, the page has reached the end of the LRU list and been discarded,
> so we made a note that we would have liked to keep it in the LRU list,
> but we didn't have enough memory in the system to do so.  That helps
> us put it back in the right position in the LRU list when it gets
> brought back in from disc.
> 
> I'd suggest something else has gone wrong; maybe the refcount should
> have been kept elevated to prevent the superblock from being paged out.
> I find it hard to believe that we can be so low on memory that we need
> to page out a superblock to make room for some other memory allocation.
> 
> (Although maybe if you have millions of unused filesystems mounted ...?)

This is a freshly booted VM running the test on the only non-root
disk file system.  So yeah, there must be a logic error somewhere
in the use of the page cache.


* Re: error writing primary super block on zoned btrfs
  2022-07-18 12:28 ` Matthew Wilcox
  2022-07-18 12:33   ` Christoph Hellwig
@ 2022-07-19  7:53   ` Johannes Thumshirn
  2022-07-19 15:13     ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Johannes Thumshirn @ 2022-07-19  7:53 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig; +Cc: Naohiro Aota, linux-btrfs

On 18.07.22 14:29, Matthew Wilcox wrote:
> On Mon, Jul 18, 2022 at 07:49:44AM +0200, Christoph Hellwig wrote:
>> Hi Naohiro, (and willy for insights on the pagecache, see below),
>>
>> when running plain fsx on zoned btrfs on a null_blk device set up as below:
>>
>> dev="/sys/kernel/config/nullb/nullb1"
>> size=12800 # MB
>>
>> mkdir ${dev}
>> echo 2 > "${dev}"/submit_queues
>> echo 2 > "${dev}"/queue_mode
>> echo 2 > "${dev}"/irqmode
>> echo "${size}" > "${dev}"/size
>> echo 1 > "${dev}"/zoned
>> echo 0 > "${dev}"/zone_nr_conv
>> echo 128 > "${dev}"/zone_size
>> echo 96 > "${dev}"/zone_capacity
>> echo 14 > "${dev}"/zone_max_active
>> echo 1 > "${dev}"/memory_backed
>> echo 1000000 > "${dev}"/completion_nsec
>> echo 1 > "${dev}"/power
>> mkfs.btrfs -m single /dev/nullb1
>> mount /dev/nullb1 /mnt/test/
>> ~/xfstests-dev/ltp/fsx /mnt/test/foobar
>>
>> fsx will eventually fail after ~10 minutes, with the following left
>> in dmesg:
>>
>> [ 1185.480200] BTRFS error (device nullb1): error writing primary super block to device 1
>> [ 1185.480988] BTRFS: error (device nullb1) in write_all_supers:4488: errno=-5 IO failure (1 errors while writing supers)
>> [ 1185.481971] BTRFS info (device nullb1: state E): forced readonly
>> [ 1185.482521] BTRFS: error (device nullb1: state EA) in btrfs_sync_log:3341: errno=-5 IO failure
>>
>> I tracked this down to the find_get_page call in wait_dev_supers
>> returning NULL, and digging further it seems to come from
>> xa_is_value() in __filemap_get_folio returning true.  I'm not sure
>> why we'd see a value entry here in the block device mapping, or why
>> that only happens in zoned mode (the same config on a regular device
>> ran for 10 hours last night).
> 
> A "value" entry in the block device's i_pages will be a shadow entry --
> that is, the page has reached the end of the LRU list and been discarded,
> so we made a note that we would have liked to keep it in the LRU list,
> but we didn't have enough memory in the system to do so.  That helps
> us put it back in the right position in the LRU list when it gets
> brought back in from disc.
> 
> I'd suggest something else has gone wrong; maybe the refcount should
> have been kept elevated to prevent the superblock from being paged out.
> I find it hard to believe that we can be so low on memory that we need
> to page out a superblock to make room for some other memory allocation.
> 
> (Although maybe if you have millions of unused filesystems mounted ...?)
> 

Ha, but zoned btrfs uses two zones as a ring buffer for its super block.
Could it be that we're accumulating too many page references somewhere,
and it then behaves like having millions of filesystems mounted?


* Re: error writing primary super block on zoned btrfs
  2022-07-19  7:53   ` Johannes Thumshirn
@ 2022-07-19 15:13     ` Christoph Hellwig
  2022-07-19 21:32       ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2022-07-19 15:13 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Matthew Wilcox, Christoph Hellwig, Naohiro Aota, linux-btrfs

On Tue, Jul 19, 2022 at 07:53:45AM +0000, Johannes Thumshirn wrote:
> Ha, but zoned btrfs uses two zones as a ring buffer for its super block.
> Could it be that we're accumulating too many page references somewhere,
> and it then behaves like having millions of filesystems mounted?

The fact that the superblock moves for zoned devices probably has
something to do with it.  But the whole code leaves me really puzzled.

Why does wait_dev_supers even do a find_get_page vs just stashing
three page pointers away in the btrfs_device structure?

Why does this abuse wait_on_page_locked vs using a completion?

Why does the code count errors while only an error on the primary
superblock has any consequences?

What is the point of the secondary superblocks if they aren't written
on fsync?

How does just setting the whole page uptodate work on file systems
with a block size smaller than the page size, where we don't know
what is in the rest of the page?


* Re: error writing primary super block on zoned btrfs
  2022-07-19 15:13     ` Christoph Hellwig
@ 2022-07-19 21:32       ` David Sterba
  2022-07-25 17:36         ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2022-07-19 21:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Matthew Wilcox, Naohiro Aota, linux-btrfs

On Tue, Jul 19, 2022 at 05:13:45PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 19, 2022 at 07:53:45AM +0000, Johannes Thumshirn wrote:
> > Ha, but zoned btrfs uses two zones as a ring buffer for its super block.
> > Could it be that we're accumulating too many page references somewhere,
> > and it then behaves like having millions of filesystems mounted?
> 
> The fact that the superblock moves for zoned devices probably has
> something to do with it.  But the whole code leaves me really puzzled.
> 
> Why does wait_dev_supers even do a find_get_page vs just stashing
> three page pointers away in the btrfs_device structure?

The superblock used to be written using buffer heads; the current code
is a direct transformation of the buffer head API to bios, so it's
still using the page cache.

I've sent a patchset to write it with separate pages, but this breaks
userspace, as its reads go through the page cache.  This should be done
by direct io.  I'll also need more time to test it properly; the
kernel/userspace interactions were missed initially.
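
For illustration only, a self-contained sketch of how userspace
typically reads the primary copy today, i.e. a plain buffered pread of
the device that goes through the page cache (not actual btrfs-progs
code; the offsets and the magic location follow the on-disk format):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BTRFS_SUPER_INFO_OFFSET	(64 * 1024)	/* primary superblock */
#define BTRFS_SUPER_INFO_SIZE	4096

int main(int argc, char **argv)
{
	char buf[BTRFS_SUPER_INFO_SIZE];
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);		/* buffered, no O_DIRECT */
	if (fd < 0 || pread(fd, buf, sizeof(buf),
			    BTRFS_SUPER_INFO_OFFSET) != (ssize_t)sizeof(buf)) {
		perror("read super");
		return 1;
	}
	/* the magic "_BHRfS_M" sits at offset 64 in the superblock */
	printf("magic: %.8s\n", buf + 64);
	close(fd);
	return 0;
}

If the kernel wrote the superblock with its own pages and bios, a
buffered read like this could see stale data unless it switched to
direct io or the kernel invalidated the cached range, which is the
breakage mentioned above.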

> Why does this abuse wait_on_page_locked vs using a completion?

This is, I think, still from the buffer head times: the page lock
waiting was available and hasn't been converted to a completion.
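
The conversion would presumably follow the usual pattern, roughly like
the sketch below (a generic completion-based wait, not the actual WIP
patch; struct super_write and the function names are made up):

#include <linux/bio.h>		/* struct bio, submit_bio() */
#include <linux/completion.h>	/* completions */

struct super_write {
	struct completion done;
	int error;
};

static void super_write_end_io(struct bio *bio)
{
	struct super_write *sw = bio->bi_private;

	sw->error = blk_status_to_errno(bio->bi_status);
	complete(&sw->done);
	bio_put(bio);
}

/* submit the already built superblock bio and wait for it to finish */
static int write_and_wait_super(struct bio *bio)
{
	struct super_write sw = { .error = 0 };

	init_completion(&sw.done);
	bio->bi_private = &sw;
	bio->bi_end_io = super_write_end_io;
	submit_bio(bio);

	wait_for_completion_io(&sw.done);	/* no page lock involved */
	return sw.error;
}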

> Why does the code count errors while only an error on the primary
> superblock has any consequences?

Because the primary superblock is the important one; an error on a
secondary should not bring down the filesystem if the primary can be
written.

> What is the point of the secondary superblocks if they aren't written
> on fsync?

Writing the superblock is an IO hit, which used to be noticeable on
rotational devices, and fsync is meant to be fast, so this is, I
believe, a performance optimization.  The secondary superblocks are
rarely used, so they don't get the same treatment as the primary.

> How does just setting the whole page uptodate work on file systems
> with a block size smaller than the page size, where we don't know
> what is in the rest of the page?

This was pointed out by Matthew some time ago; the part of the page
after the superblock will be uninitialized.
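
Something along these lines would be needed before the page can
legitimately be marked uptodate (only a sketch of the missing step,
assuming an in-page offset of zero and the generic memcpy_to_page()/
memzero_page() helpers; this is not what the current code does):

#include <linux/highmem.h>	/* memcpy_to_page(), memzero_page() */
#include <linux/page-flags.h>	/* SetPageUptodate() */
/* struct btrfs_super_block and BTRFS_SUPER_INFO_SIZE from the btrfs headers */

/*
 * Sketch only, not a real fix: fill the superblock region and zero the
 * remainder so that SetPageUptodate() does not expose uninitialized
 * memory when PAGE_SIZE > BTRFS_SUPER_INFO_SIZE.  Note that on the
 * block device mapping the rest of the page covers other on-disk
 * blocks, so a proper fix would read that part from disk (or avoid
 * marking the whole page uptodate at all).
 */
static void fill_super_page(struct page *page,
			    const struct btrfs_super_block *sb)
{
	memcpy_to_page(page, 0, (const char *)sb, BTRFS_SUPER_INFO_SIZE);
	memzero_page(page, BTRFS_SUPER_INFO_SIZE,
		     PAGE_SIZE - BTRFS_SUPER_INFO_SIZE);
	SetPageUptodate(page);
}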


* Re: error writing primary super block on zoned btrfs
  2022-07-19 21:32       ` David Sterba
@ 2022-07-25 17:36         ` David Sterba
  0 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2022-07-25 17:36 UTC (permalink / raw)
  To: David Sterba
  Cc: Christoph Hellwig, Johannes Thumshirn, Matthew Wilcox,
	Naohiro Aota, linux-btrfs

On Tue, Jul 19, 2022 at 11:32:41PM +0200, David Sterba wrote:
> On Tue, Jul 19, 2022 at 05:13:45PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 19, 2022 at 07:53:45AM +0000, Johannes Thumshirn wrote:
> > > Ha, but zoned btrfs uses two zones as a ring buffer for its super block.
> > > Could it be that we're accumulating too many page references somewhere,
> > > and it then behaves like having millions of filesystems mounted?
> > 
> > The fact that the superblock moves for zoned devices probably has
> > something to do with it.  But the whole code leaves me really puzzled.
> > 
> > Why does wait_dev_supers even do a find_get_page vs just stashing
> > three page pointers away in the btrfs_device structure?
> 
> The superblock used to be written using buffer heads; the current code
> is a direct transformation of the buffer head API to bios, so it's
> still using the page cache.
> 
> I've sent a patchset to write it with separate pages, but this breaks
> userspace, as its reads go through the page cache.  This should be done
> by direct io.  I'll also need more time to test it properly; the
> kernel/userspace interactions were missed initially.

I have a WIP (tests pass) that does its own page write, waiting using a
completion and its own bio, i.e. avoiding the page cache.  The
superblock read side, however, uses the page cache in several places.
When a device is not part of a mounted filesystem there's no difference,
as there are no concurrent writers and readers, but for zoned mode it is
a problem in one case: when both zones are full and the older one needs
to be determined by reading the superblock.
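
Roughly, that decision boils down to a generation comparison between
the newest superblock found in each of the two zones, along the lines
of the sketch below (not the actual fs/btrfs/zoned.c code, and
pick_older_sb_zone is a made-up name):

/*
 * Sketch only: given the newest superblock copy read from each of the
 * two superblock zones, the zone whose copy has the smaller generation
 * holds the older data.  Uses the btrfs_super_generation() accessor
 * from the btrfs headers.
 */
static int pick_older_sb_zone(const struct btrfs_super_block *sb_zone0,
			      const struct btrfs_super_block *sb_zone1)
{
	return btrfs_super_generation(sb_zone0) <
	       btrfs_super_generation(sb_zone1) ? 0 : 1;
}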

This happens on a mounted filesystem and involves both reads and writes,
so the read side needs to be converted to its own page reads too, and I
can't merge the write part as-is.  Maybe there's a middle ground, but
otherwise the page cache based read requires restructuring, as it's done
across several functions.

