linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] btrfs: Share the same anonymous block device for the whole filesystem
@ 2020-06-12  6:42 Qu Wenruo
  2020-06-12 13:33 ` David Sterba
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-06-12  6:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Greed Rong

[BUG]
There is a bug report about transaction abort due to -EMFILE error.

  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -24)
  WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
  RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
  Call Trace:
   create_pending_snapshots+0x82/0xa0 [btrfs]
   btrfs_commit_transaction+0x275/0x8c0 [btrfs]
   btrfs_mksubvol+0x4b9/0x500 [btrfs]
   btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
   btrfs_ioctl+0x11a4/0x2da0 [btrfs]
   do_vfs_ioctl+0xa9/0x640
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x5a/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace 33f2f83f3d5250e9 ]---
  BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
  BTRFS info (device sda1): forced readonly
  BTRFS warning (device sda1): Skipping commit of aborted transaction.
  BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

The workload involves creating and deleting a lot of snapshots in a
short period.

[CAUSE]
The error is returned from get_anon_bdev(), which will return -EMFILE
if there is no anonymous block device left in the range.

For anonymous block device, we have at most 1 << 20 devices to allocate,
which looks quite a lot, but if we have a workload which created 1
snapshots per second, we only need 12 days to exhaust the whole pool.

So the anonymous block device pool is not an almost-unlimited resource,
not to mention there are still other users in the kernel relying on
anonyous block device.

When the resouce is exhausted, create_pending_snapshot() will fail due to
error returned from btrfs_get_fs_root(), and abort the transaction.

This problem would also affect other filesystems like ceph or overlayfs.

[FIX]
Currently btrfs uses unique anonymous block device for each root, which
sometimes can be overkilled, especially considering how lightweight
btrfs snapshot is.

This patch will use just one anonymous block device for each btrfs, and
since we only allocate it at mount time, it should not exhause anonymous
block device pool so easily.

Although this brings a big user visible interface change, before this
patch each subvolume has its own bdev for stat(), now all subvolumes
shares the same bdev.
I'm not sure how many users will be affected.

Furthermore, since overlayfs and ceph are already using different
anonymous block device for their snapshots/overlays, I'm not sure if we
should differ from the common behavior.

Reported-by: Greed Rong <greedrong@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:
The patch will introduce a user visible behavior change, and a behavior
change from ceph/overlayfs.
I'm not sure whether we should go this direction, or go David's way to
pre-allocate bdev at ioctl time.
---
 fs/btrfs/ctree.h   |  8 +++-----
 fs/btrfs/disk-io.c | 13 ++++++-------
 fs/btrfs/inode.c   |  2 +-
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 161533040978..585b951915b0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -928,6 +928,9 @@ struct btrfs_fs_info {
 	u32 sectorsize;
 	u32 stripesize;
 
+	/* For the shared devid of the fs */
+	dev_t anon_dev;
+
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
 	struct rb_root swapfile_pins;
@@ -1097,11 +1100,6 @@ struct btrfs_root {
 	 * protected by inode_lock
 	 */
 	struct radix_tree_root delayed_nodes_tree;
-	/*
-	 * right now this just gets used so that a root has its own devid
-	 * for stat.  It may be used for more later
-	 */
-	dev_t anon_dev;
 
 	spinlock_t root_item_lock;
 	refcount_t refs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f8ec2d8606fd..0bda60943d83 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1148,7 +1148,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	else
 		root->defrag_trans_start = 0;
 	root->root_key.objectid = objectid;
-	root->anon_dev = 0;
 
 	spin_lock_init(&root->root_item_lock);
 	btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
@@ -1430,10 +1429,6 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
 	spin_lock_init(&root->ino_cache_lock);
 	init_waitqueue_head(&root->ino_cache_wait);
 
-	ret = get_anon_bdev(&root->anon_dev);
-	if (ret)
-		goto fail;
-
 	mutex_lock(&root->objectid_mutex);
 	ret = btrfs_find_highest_objectid(root,
 					&root->highest_objectid);
@@ -1509,6 +1504,8 @@ void btrfs_check_leaked_roots(struct btrfs_fs_info *fs_info)
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	if (fs_info->anon_dev)
+		free_anon_bdev(fs_info->anon_dev);
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->dio_bytes);
@@ -2000,8 +1997,6 @@ void btrfs_put_root(struct btrfs_root *root)
 	if (refcount_dec_and_test(&root->refs)) {
 		WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
 		WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
-		if (root->anon_dev)
-			free_anon_bdev(root->anon_dev);
 		btrfs_drew_lock_destroy(&root->snapshot_lock);
 		free_extent_buffer(root->node);
 		free_extent_buffer(root->commit_root);
@@ -2769,6 +2764,10 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
 	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
 
+	ret = get_anon_bdev(&fs_info->anon_dev);
+	if (ret < 0)
+		return ret;
+
 	ret = percpu_counter_init(&fs_info->dio_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1242d0aa108d..f70a2e2b2d9c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8612,7 +8612,7 @@ static int btrfs_getattr(const struct path *path, struct kstat *stat,
 				  STATX_ATTR_NODUMP);
 
 	generic_fillattr(inode, stat);
-	stat->dev = BTRFS_I(inode)->root->anon_dev;
+	stat->dev = BTRFS_I(inode)->root->fs_info->anon_dev;
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: Share the same anonymous block device for the whole filesystem
  2020-06-12  6:42 [PATCH] btrfs: Share the same anonymous block device for the whole filesystem Qu Wenruo
@ 2020-06-12 13:33 ` David Sterba
  2020-06-12 16:48 ` David Sterba
  2020-06-12 17:24 ` David Sterba
  2 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2020-06-12 13:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Greed Rong

On Fri, Jun 12, 2020 at 02:42:37PM +0800, Qu Wenruo wrote:
> Currently btrfs uses unique anonymous block device for each root, which
> sometimes can be overkilled, especially considering how lightweight
> btrfs snapshot is.

In my understanding the subvolumes must have a unique anon bdev
associated, to appear as different inode number namespaces. This goes
back to the beginning and we can't change that. See 3394e1607eaf870eb.

> Although this brings a big user visible interface change, before this
> patch each subvolume has its own bdev for stat(), now all subvolumes
> shares the same bdev.
> I'm not sure how many users will be affected.

A 'big user visible interface change' should be a strong sign to do the
research first before writing the code.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: Share the same anonymous block device for the whole filesystem
  2020-06-12  6:42 [PATCH] btrfs: Share the same anonymous block device for the whole filesystem Qu Wenruo
  2020-06-12 13:33 ` David Sterba
@ 2020-06-12 16:48 ` David Sterba
  2020-06-13  0:38   ` Qu Wenruo
  2020-06-12 17:24 ` David Sterba
  2 siblings, 1 reply; 5+ messages in thread
From: David Sterba @ 2020-06-12 16:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Greed Rong

On Fri, Jun 12, 2020 at 02:42:37PM +0800, Qu Wenruo wrote:
> For anonymous block device, we have at most 1 << 20 devices to allocate,
> which looks quite a lot, but if we have a workload which created 1
> snapshots per second, we only need 12 days to exhaust the whole pool.

1<<20 is 1M and that would mean that there that many snapshots active at
the same time in order to allocate all the anonymous block devices. Once
a snapshot is not part of any path the device number is released and can
be reused. So simply multiplying the numbers does not reflect the
reality.

A plausible explanation is leak of the anon bdev by something else than
btrfs on the system.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: Share the same anonymous block device for the whole filesystem
  2020-06-12  6:42 [PATCH] btrfs: Share the same anonymous block device for the whole filesystem Qu Wenruo
  2020-06-12 13:33 ` David Sterba
  2020-06-12 16:48 ` David Sterba
@ 2020-06-12 17:24 ` David Sterba
  2 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2020-06-12 17:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Greed Rong

On Fri, Jun 12, 2020 at 02:42:37PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a bug report about transaction abort due to -EMFILE error.
> 
>   ------------[ cut here ]------------
>   BTRFS: Transaction aborted (error -24)
>   WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
>   RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
>   Call Trace:
>    create_pending_snapshots+0x82/0xa0 [btrfs]
>    btrfs_commit_transaction+0x275/0x8c0 [btrfs]
>    btrfs_mksubvol+0x4b9/0x500 [btrfs]
>    btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
>    btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
>    btrfs_ioctl+0x11a4/0x2da0 [btrfs]
>    do_vfs_ioctl+0xa9/0x640
>    ksys_ioctl+0x67/0x90
>    __x64_sys_ioctl+0x1a/0x20
>    do_syscall_64+0x5a/0x110
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   ---[ end trace 33f2f83f3d5250e9 ]---
>   BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
>   BTRFS info (device sda1): forced readonly
>   BTRFS warning (device sda1): Skipping commit of aborted transaction.
>   BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
> 
> The workload involves creating and deleting a lot of snapshots in a
> short period.

The ids get returned to the IDA range once the last reference to the
root object is reached, but there's no distinction between a regular
subvolume and a deleted one.

I think we could call free_anon_bdev once the subvolume is deleted in
btrfs_delete_subvolume and not wait until it gets processed by
btrfs_clean_one_deleted_snapshot . This should be save, as once the
subvolume disappears from the file hierarchy, the bdev cannot be queried
by users.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] btrfs: Share the same anonymous block device for the whole filesystem
  2020-06-12 16:48 ` David Sterba
@ 2020-06-13  0:38   ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-06-13  0:38 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Greed Rong


[-- Attachment #1.1: Type: text/plain, Size: 1863 bytes --]



On 2020/6/13 上午12:48, David Sterba wrote:
> On Fri, Jun 12, 2020 at 02:42:37PM +0800, Qu Wenruo wrote:
>> For anonymous block device, we have at most 1 << 20 devices to allocate,
>> which looks quite a lot, but if we have a workload which created 1
>> snapshots per second, we only need 12 days to exhaust the whole pool.
> 
> 1<<20 is 1M and that would mean that there that many snapshots active at
> the same time in order to allocate all the anonymous block devices. Once
> a snapshot is not part of any path the device number is released and can
> be reused. So simply multiplying the numbers does not reflect the
> reality.

Yes, but for that number we can still exhaust the pool if the subvolumes
are not cleaned up.

> 
> A plausible explanation is leak of the anon bdev by something else than
> btrfs on the system.
> 

I'm not sure if it's a leak. As you can see the free_anon_bdev() call in
btrfs_put_root().


Although I understand that we use bdev as a namespace seperator, but I'm
still not confident about whether it's that important.

One point is, I didn't see much users of bdev member to distinguish any
subvolume, no to mention the "sub" nature of subvolume.
It's not a full volume, since it already shares the chunk, extent, dev,
root trees, if it shares the same bdev, it won't cause any new users any
problem.

That's why I'm pushing the RFC patch. Are there that many users relying
on bdev to distinguish different subvolumes? Any example would be very
helpful.

Another problem is, for such lightweight btrfs subvolume, there are only
that many things we can do to reduce the frequency to hit such problem.
We can never eliminate the problem.

If the bdev problem is really affecting a lot of existing users, I can
make the behavior toggleable, using mount option maybe?

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-13  0:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-12  6:42 [PATCH] btrfs: Share the same anonymous block device for the whole filesystem Qu Wenruo
2020-06-12 13:33 ` David Sterba
2020-06-12 16:48 ` David Sterba
2020-06-13  0:38   ` Qu Wenruo
2020-06-12 17:24 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).