linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v4 0/6] Btrfs: implement swap file support
@ 2018-05-24 21:41 Omar Sandoval
  2018-05-24 21:41 ` [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS Omar Sandoval
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

Hi,

This patch series implements swap file support for Btrfs. If you don't
remember versions 1-3, that's because they were almost 4 years ago [1]
:)

This attempt takes a very different approach from my original versions
back then. As a refresher, the original idea was to go through
->read_iter() and ->write_iter() in O_DIRECT mode. It turns out that
this is really hard to get right, because it effectively makes the
GFP_NOFS flag meaningless, since now swapping out can go through the
filesystem and grab locks and such. We could try to make the read/write
path lockless for a swap file, but this would be too easy to get wrong.

So, instead, this approach was inspired by the iomap swap file support
[2], where we directly tell the swap code where it can find the swap
extents while doing our own sanity checks. This has a bunch of
restrictions, documented in patches 4 and 6. I have a bunch of xfstests
for these cases at https://github.com/osandov/xfstests/tree/btrfs-swap.

This series can also be found at https://github.com/osandov/linux/tree/btrfs-swap.

All comments welcome. Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg40129.html
2: https://patchwork.kernel.org/patch/10390417/

Omar Sandoval (6):
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  vfs: update swap_{,de}activate documentation
  Btrfs: push EXCL_OP set into btrfs_rm_device()
  Btrfs: prevent ioctls from interfering with a swap file
  Btrfs: rename get_chunk_map() and make it non-static
  Btrfs: support swap files

 Documentation/filesystems/Locking |  17 +--
 Documentation/filesystems/vfs.txt |  12 +-
 fs/btrfs/ctree.h                  |   6 +
 fs/btrfs/disk-io.c                |   3 +
 fs/btrfs/inode.c                  | 220 ++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.c                  |  64 ++++++---
 fs/btrfs/volumes.c                |  34 +++--
 fs/btrfs/volumes.h                |   2 +
 include/linux/swap.h              |  13 +-
 mm/page_io.c                      |   6 +-
 mm/swapfile.c                     |  14 +-
 11 files changed, 335 insertions(+), 56 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25  9:11   ` Nikolay Borisov
  2018-05-24 21:41 ` [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation Omar Sandoval
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
go through the filesystem, and to make swapoff() call
->swap_deactivate(). For Btrfs, we want the latter but not the former,
so split this flag into two. This makes us always call
->swap_deactivate() if ->swap_activate() succeeded, not just if it
didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/swap.h | 13 +++++++------
 mm/page_io.c         |  6 +++---
 mm/swapfile.c        | 13 ++++++++-----
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..29dfd436435c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -167,13 +167,14 @@ enum {
 	SWP_SOLIDSTATE	= (1 << 4),	/* blkdev seeks are cheap */
 	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
 	SWP_BLKDEV	= (1 << 6),	/* its a block device */
-	SWP_FILE	= (1 << 7),	/* set after swap_activate success */
-	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
-	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
-	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
-	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
+	SWP_ACTIVATED	= (1 << 7),	/* set after swap_activate success */
+	SWP_FS		= (1 << 8),	/* swap file goes through fs */
+	SWP_AREA_DISCARD = (1 << 9),	/* single-time swap area discards */
+	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
+	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
+	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index b41cf9644585..f2d06c1d0cc1 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (sis->flags & SWP_FILE) {
+	if (sis->flags & SWP_FS) {
 		struct kiocb kiocb;
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
@@ -364,7 +364,7 @@ int swap_readpage(struct page *page, bool synchronous)
 		goto out;
 	}
 
-	if (sis->flags & SWP_FILE) {
+	if (sis->flags & SWP_FS) {
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
@@ -422,7 +422,7 @@ int swap_set_page_dirty(struct page *page)
 {
 	struct swap_info_struct *sis = page_swap_info(page);
 
-	if (sis->flags & SWP_FILE) {
+	if (sis->flags & SWP_FS) {
 		struct address_space *mapping = sis->swap_file->f_mapping;
 
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cc2cf04d9018..886c9d89b144 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -973,7 +973,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
 			goto nextsi;
 		}
 		if (cluster) {
-			if (!(si->flags & SWP_FILE))
+			if (!(si->flags & SWP_FS))
 				n_ret = swap_alloc_cluster(si, swp_entries);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
@@ -2327,12 +2327,13 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
 		kfree(se);
 	}
 
-	if (sis->flags & SWP_FILE) {
+	if (sis->flags & SWP_ACTIVATED) {
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
-		sis->flags &= ~SWP_FILE;
-		mapping->a_ops->swap_deactivate(swap_file);
+		sis->flags &= ~SWP_ACTIVATED;
+		if (mapping->a_ops->swap_deactivate)
+			mapping->a_ops->swap_deactivate(swap_file);
 	}
 }
 
@@ -2428,8 +2429,10 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 
 	if (mapping->a_ops->swap_activate) {
 		ret = mapping->a_ops->swap_activate(sis, swap_file, span);
+		if (ret >= 0)
+			sis->flags |= SWP_ACTIVATED;
 		if (!ret) {
-			sis->flags |= SWP_FILE;
+			sis->flags |= SWP_FS;
 			ret = add_swap_extent(sis, 0, sis->max, 0);
 			*span = sis->pages;
 		}
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
  2018-05-24 21:41 ` [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25  9:15   ` Nikolay Borisov
  2018-05-24 21:41 ` [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device() Omar Sandoval
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

The documentation for these functions is wrong in several ways:

- swap_activate() is called with the inode locked
- swap_activate() takes a swap_info_struct * and a sector_t *
- swap_activate() can also return a positive number of extents it added
  itself
- swap_deactivate() does not return anything

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Documentation/filesystems/Locking | 17 +++++++----------
 Documentation/filesystems/vfs.txt | 12 ++++++++----
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 75d2d57e2c44..7f009e98fa3c 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -211,8 +211,9 @@ prototypes:
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
-	int (*swap_activate)(struct file *);
-	int (*swap_deactivate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *, struct file *,
+			     sector_t *);
+	void (*swap_deactivate)(struct file *);
 
 locking rules:
 	All except set_page_dirty and freepage may block
@@ -236,8 +237,8 @@ putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
-swap_activate:		no
-swap_deactivate:	no
+swap_activate:					yes
+swap_deactivate:				no
 
 	->write_begin(), ->write_end() and ->readpage() may be called from
 the request handler (/dev/loop).
@@ -334,14 +335,10 @@ cleaned, or an error value if not. Note that in order to prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-	->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+	->swap_activate is called from sys_swapon() with the inode locked.
 
 	->swap_deactivate() will be called in the sys_swapoff()
-path after ->swap_activate() returned success.
+path after ->swap_activate() returned success. The inode is not locked.
 
 ----------------------- file_lock_operations ------------------------------
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5fd325df59e2..0149109d94d1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -650,8 +650,9 @@ struct address_space_operations {
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
 	int (*error_remove_page) (struct mapping *mapping, struct page *page);
-	int (*swap_activate)(struct file *);
-	int (*swap_deactivate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *, struct file *,
+			     sector_t *);
+	void (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -828,8 +829,11 @@ struct address_space_operations {
 
   swap_activate: Called when swapon is used on a file to allocate
 	space if necessary and pin the block lookup information in
-	memory. A return value of zero indicates success,
-	in which case this file can be used to back swapspace.
+	memory. If this returns zero, the swap system will call the address
+	space operations ->readpage() and ->direct_IO(). Alternatively, this
+	may call add_swap_extent() and return the number of extents added, in
+	which case the swap system will use the provided blocks directly
+	instead of going through the filesystem.
 
   swap_deactivate: Called during swapoff on files where swap_activate
 	was successful.
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device()
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
  2018-05-24 21:41 ` [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS Omar Sandoval
  2018-05-24 21:41 ` [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25  9:19   ` Nikolay Borisov
  2018-05-28 13:29   ` David Sterba
  2018-05-24 21:41 ` [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

btrfs_ioctl_rm_dev() and btrfs_ioctl_rm_dev_v2() both manipulate this
bit. Let's move it into the common btrfs_rm_device(), which also makes
the following change to deal with swap files easier.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ioctl.c   | 13 -------------
 fs/btrfs/volumes.c |  4 ++++
 2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9af3be96099f..82be4a94334b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3091,18 +3091,12 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 		goto out;
 	}
 
-	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
-		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
-		goto out;
-	}
-
 	if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) {
 		ret = btrfs_rm_device(fs_info, NULL, vol_args->devid);
 	} else {
 		vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
 		ret = btrfs_rm_device(fs_info, vol_args->name, 0);
 	}
-	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 
 	if (!ret) {
 		if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
@@ -3133,11 +3127,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 	if (ret)
 		return ret;
 
-	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
-		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
-		goto out_drop_write;
-	}
-
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args)) {
 		ret = PTR_ERR(vol_args);
@@ -3151,8 +3140,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 		btrfs_info(fs_info, "disk deleted %s", vol_args->name);
 	kfree(vol_args);
 out:
-	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
-out_drop_write:
 	mnt_drop_write_file(file);
 
 	return ret;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6757b53c297..9cfac177214f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1951,6 +1951,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 	u64 num_devices;
 	int ret = 0;
 
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
+		return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
+
 	mutex_lock(&uuid_mutex);
 
 	num_devices = fs_devices->num_devices;
@@ -2069,6 +2072,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 
 out:
 	mutex_unlock(&uuid_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 	return ret;
 
 error_undo:
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
                   ` (2 preceding siblings ...)
  2018-05-24 21:41 ` [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device() Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25 14:50   ` David Sterba
  2018-05-24 21:41 ` [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static Omar Sandoval
  2018-05-24 21:41 ` [RFC PATCH v4 6/6] Btrfs: support swap files Omar Sandoval
  5 siblings, 1 reply; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
- balance
- device remove/replace/resize

Don't allow those to happen on an active swap file. Balance and device
remove/replace/resize in particular are disallowed entirely; in the
future, we can relax this so that relocation skips/errors out only on
chunks containing an active swap file.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h   |  6 ++++++
 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/ioctl.c   | 51 ++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.c |  8 ++++++++
 4 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6deb39d8415d..280d7f5e2fe4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1117,6 +1117,9 @@ struct btrfs_fs_info {
 	u32 sectorsize;
 	u32 stripesize;
 
+	/* Number of active swapfiles */
+	atomic_t nr_swapfiles;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
@@ -1282,6 +1285,9 @@ struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
+
+	/* Number of active swapfiles */
+	atomic_t nr_swapfiles;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 22171bbe86e3..b42fd6b41b20 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1215,6 +1215,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	atomic_set(&root->log_batch, 0);
 	refcount_set(&root->refs, 1);
 	atomic_set(&root->will_be_snapshotted, 0);
+	atomic_set(&root->nr_swapfiles, 0);
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
 	root->last_log_commit = 0;
@@ -2825,6 +2826,8 @@ int open_ctree(struct super_block *sb,
 	fs_info->sectorsize = 4096;
 	fs_info->stripesize = 4096;
 
+	atomic_set(&fs_info->nr_swapfiles, 0);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 82be4a94334b..5f068aabe612 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -295,6 +295,11 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	} else if (fsflags & FS_COMPR_FL) {
 		const char *comp;
 
+		if (IS_SWAPFILE(inode)) {
+			ret = -ETXTBSY;
+			goto out_unlock;
+		}
+
 		binode->flags |= BTRFS_INODE_COMPRESS;
 		binode->flags &= ~BTRFS_INODE_NOCOMPRESS;
 
@@ -767,6 +772,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return -EINVAL;
 
+	if (atomic_read(&root->nr_swapfiles)) {
+		btrfs_info(fs_info,
+			   "cannot create snapshot with active swapfile");
+		return -ETXTBSY;
+	}
+
 	pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL);
 	if (!pending_snapshot)
 		return -ENOMEM;
@@ -1504,9 +1515,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		inode_lock(inode);
-		if (do_compress)
-			BTRFS_I(inode)->defrag_compress = compress_type;
-		ret = cluster_pages_for_defrag(inode, pages, i, cluster);
+		if (IS_SWAPFILE(inode)) {
+			ret = -ETXTBSY;
+		} else {
+			if (do_compress)
+				BTRFS_I(inode)->defrag_compress = compress_type;
+			ret = cluster_pages_for_defrag(inode, pages, i, cluster);
+		}
 		if (ret < 0) {
 			inode_unlock(inode);
 			goto out_ra;
@@ -1602,6 +1617,12 @@ static noinline int btrfs_ioctl_resize(struct file *file,
 		return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
 	}
 
+	if (atomic_read(&fs_info->nr_swapfiles)) {
+		btrfs_info(fs_info, "cannot resize with active swapfile");
+		ret = -ETXTBSY;
+		goto out;
+	}
+
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args)) {
 		ret = PTR_ERR(vol_args);
@@ -3614,6 +3635,11 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 		goto out;
 	}
 
+	if (IS_SWAPFILE(src) || IS_SWAPFILE(dst)) {
+		ret = -ETXTBSY;
+		goto out;
+	}
+
 	for (i = 0; i < chunk_count; i++) {
 		ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
 					      dst, dst_loff, &cmp);
@@ -4286,6 +4312,11 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 		goto out_unlock;
 	}
 
+	if (IS_SWAPFILE(src) || IS_SWAPFILE(inode)) {
+		ret = -ETXTBSY;
+		goto out_unlock;
+	}
+
 	/* determine range to clone */
 	ret = -EINVAL;
 	if (off + len > src->i_size || off + len < off)
@@ -4764,7 +4795,13 @@ static long btrfs_ioctl_dev_replace(struct btrfs_fs_info *fs_info,
 		if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
 			ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
 		} else {
-			ret = btrfs_dev_replace_by_ioctl(fs_info, p);
+			if (atomic_read(&fs_info->nr_swapfiles)) {
+				btrfs_info(fs_info,
+					   "cannot replace device with active swapfile");
+				ret = -ETXTBSY;
+			} else {
+				ret = btrfs_dev_replace_by_ioctl(fs_info, p);
+			}
 			clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 		}
 		break;
@@ -5024,6 +5061,12 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
 locked:
 	BUG_ON(!test_bit(BTRFS_FS_EXCL_OP, &fs_info->flags));
 
+	if (atomic_read(&fs_info->nr_swapfiles)) {
+		btrfs_info(fs_info, "cannot balance with active swapfile");
+		ret = -ETXTBSY;
+		goto out_unlock;
+	}
+
 	if (arg) {
 		bargs = memdup_user(arg, sizeof(*bargs));
 		if (IS_ERR(bargs)) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfac177214f..33b3d329ebb9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1954,6 +1954,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
 		return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
 
+	if (atomic_read(&fs_info->nr_swapfiles)) {
+		btrfs_info(fs_info,
+			   "cannot remove device with active swapfile");
+		ret = -ETXTBSY;
+		goto out_clear;
+	}
+
 	mutex_lock(&uuid_mutex);
 
 	num_devices = fs_devices->num_devices;
@@ -2072,6 +2079,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 
 out:
 	mutex_unlock(&uuid_mutex);
+out_clear:
 	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 	return ret;
 
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
                   ` (3 preceding siblings ...)
  2018-05-24 21:41 ` [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25  9:21   ` Nikolay Borisov
  2018-05-24 21:41 ` [RFC PATCH v4 6/6] Btrfs: support swap files Omar Sandoval
  5 siblings, 1 reply; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/volumes.c | 22 +++++++++++-----------
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 33b3d329ebb9..6e1a89c6b362 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2789,8 +2789,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	return ret;
 }
 
-static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
-					u64 logical, u64 length)
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+				       u64 logical, u64 length)
 {
 	struct extent_map_tree *em_tree;
 	struct extent_map *em;
@@ -2827,7 +2827,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	int i, ret = 0;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 
-	em = get_chunk_map(fs_info, chunk_offset, 1);
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
 	if (IS_ERR(em)) {
 		/*
 		 * This is a logic error, but we don't want to just rely on the
@@ -4962,7 +4962,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 	int i = 0;
 	int ret = 0;
 
-	em = get_chunk_map(fs_info, chunk_offset, chunk_size);
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 
@@ -5105,7 +5105,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	int miss_ndevs = 0;
 	int i;
 
-	em = get_chunk_map(fs_info, chunk_offset, 1);
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
 	if (IS_ERR(em))
 		return 1;
 
@@ -5165,7 +5165,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	struct map_lookup *map;
 	int ret;
 
-	em = get_chunk_map(fs_info, logical, len);
+	em = btrfs_get_chunk_map(fs_info, logical, len);
 	if (IS_ERR(em))
 		/*
 		 * We could return errors for these cases, but that could get
@@ -5211,7 +5211,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 	struct map_lookup *map;
 	unsigned long len = fs_info->sectorsize;
 
-	em = get_chunk_map(fs_info, logical, len);
+	em = btrfs_get_chunk_map(fs_info, logical, len);
 
 	if (!WARN_ON(IS_ERR(em))) {
 		map = em->map_lookup;
@@ -5228,7 +5228,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	struct map_lookup *map;
 	int ret = 0;
 
-	em = get_chunk_map(fs_info, logical, len);
+	em = btrfs_get_chunk_map(fs_info, logical, len);
 
 	if(!WARN_ON(IS_ERR(em))) {
 		map = em->map_lookup;
@@ -5387,7 +5387,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 	/* discard always return a bbio */
 	ASSERT(bbio_ret);
 
-	em = get_chunk_map(fs_info, logical, length);
+	em = btrfs_get_chunk_map(fs_info, logical, length);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 
@@ -5713,7 +5713,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		return __btrfs_map_block_for_discard(fs_info, logical,
 						     *length, bbio_ret);
 
-	em = get_chunk_map(fs_info, logical, *length);
+	em = btrfs_get_chunk_map(fs_info, logical, *length);
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 
@@ -6012,7 +6012,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 	u64 rmap_len;
 	int i, j, nr = 0;
 
-	em = get_chunk_map(fs_info, chunk_start, 1);
+	em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
 	if (IS_ERR(em))
 		return -EIO;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5139ec8daf4c..d3dedfd1324b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -467,6 +467,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 				struct btrfs_fs_info *fs_info,
 				u64 chunk_offset, u64 chunk_size);
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+				       u64 logical, u64 length);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		       struct btrfs_fs_info *fs_info, u64 chunk_offset);
 
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v4 6/6] Btrfs: support swap files
  2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
                   ` (4 preceding siblings ...)
  2018-05-24 21:41 ` [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static Omar Sandoval
@ 2018-05-24 21:41 ` Omar Sandoval
  2018-05-25 10:07   ` Nikolay Borisov
  2018-05-28 13:46   ` David Sterba
  5 siblings, 2 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-05-24 21:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo

From: Omar Sandoval <osandov@fb.com>

Implement the swap file a_ops on Btrfs. Activation needs to make sure
that the file can be used as a swap file, which currently means it must
be fully allocated as nocow with no compression on one device. It also
sets up the swap extents directly with add_swap_extent(), so export it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 220 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c    |   1 +
 2 files changed, 221 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9228cb866115..6cca8529e307 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, u64 start, u64 end)
 	}
 }
 
+struct btrfs_swap_info {
+	u64 start;
+	u64 block_start;
+	u64 block_len;
+	u64 lowest_ppage;
+	u64 highest_ppage;
+	unsigned long nr_pages;
+	int nr_extents;
+};
+
+static int btrfs_add_swap_extent(struct swap_info_struct *sis,
+				 struct btrfs_swap_info *bsi)
+{
+	unsigned long nr_pages;
+	u64 first_ppage, first_ppage_reported, next_ppage;
+	int ret;
+
+	first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
+	next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
+				PAGE_SIZE) >> PAGE_SHIFT;
+
+	if (first_ppage >= next_ppage)
+		return 0;
+	nr_pages = next_ppage - first_ppage;
+
+	first_ppage_reported = first_ppage;
+	if (bsi->start == 0)
+		first_ppage_reported++;
+	if (bsi->lowest_ppage > first_ppage_reported)
+		bsi->lowest_ppage = first_ppage_reported;
+	if (bsi->highest_ppage < (next_ppage - 1))
+		bsi->highest_ppage = next_ppage - 1;
+
+	ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage);
+	if (ret < 0)
+		return ret;
+	bsi->nr_extents += ret;
+	bsi->nr_pages += nr_pages;
+	return 0;
+}
+
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+			       sector_t *span)
+{
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em = NULL;
+	struct btrfs_device *device = NULL;
+	struct btrfs_swap_info bsi = {
+		.lowest_ppage = (sector_t)-1ULL,
+	};
+	int ret = 0;
+	u64 isize = inode->i_size;
+	u64 start;
+
+	/*
+	 * The inode is locked, so these flags won't change after we check them.
+	 */
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
+		btrfs_err(fs_info, "swapfile is compressed");
+		return -EINVAL;
+	}
+	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
+		btrfs_err(fs_info, "swapfile is copy-on-write");
+		return -EINVAL;
+	}
+
+	/*
+	 * Balance or device remove/replace/resize can move stuff around from
+	 * under us. The EXCL_OP flag makes sure they aren't running/won't run
+	 * concurrently while we are mapping the swap extents, and the fs_info
+	 * nr_swapfiles counter prevents them from running while the swap file
+	 * is active and moving the extents. Note that this also prevents a
+	 * concurrent device add which isn't actually necessary, but it's not
+	 * really worth the trouble to allow it.
+	 */
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
+		return -EBUSY;
+	atomic_inc(&fs_info->nr_swapfiles);
+	/*
+	 * Snapshots can create extents which require COW even if NODATACOW is
+	 * set. We use this counter to prevent snapshots. We must increment it
+	 * before walking the extents because we don't want a concurrent
+	 * snapshot to run after we've already checked the extents.
+	 */
+	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
+
+	lock_extent_bits(io_tree, 0, isize - 1, &cached_state);
+	start = 0;
+	while (start < isize) {
+		u64 end, logical_block_start, physical_block_start;
+		u64 len = isize - start;
+
+		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+		end = extent_map_end(em);
+
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			btrfs_err(fs_info, "swapfile has holes");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (em->block_start == EXTENT_MAP_INLINE) {
+			/*
+			 * It's unlikely we'll ever actually find ourselves
+			 * here, as a file small enough to fit inline won't be
+			 * big enough to store more than the swap header, but in
+			 * case something changes in the future, let's catch it
+			 * here rather than later.
+			 */
+			btrfs_err(fs_info, "swapfile is inline");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+			btrfs_err(fs_info, "swapfile is compressed");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		logical_block_start = em->block_start + (start - em->start);
+		len = min(len, em->len - (start - em->start));
+		free_extent_map(em);
+		em = NULL;
+
+		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
+		if (ret < 0) {
+			goto out;
+		} else if (ret) {
+			ret = 0;
+		} else {
+			btrfs_err(fs_info, "swapfile is copy-on-write");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		em = btrfs_get_chunk_map(fs_info, logical_block_start, len);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+
+		if (em->map_lookup->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+			btrfs_err(fs_info, "swapfile data profile is not single");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (device == NULL) {
+			device = em->map_lookup->stripes[0].dev;
+		} else if (device != em->map_lookup->stripes[0].dev) {
+			btrfs_err(fs_info, "swapfile is on multiple devices");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		physical_block_start = (em->map_lookup->stripes[0].physical +
+					(logical_block_start - em->start));
+		len = min(len, em->len - (logical_block_start - em->start));
+		free_extent_map(em);
+		em = NULL;
+
+		if (bsi.block_len &&
+		    bsi.block_start + bsi.block_len == physical_block_start) {
+			bsi.block_len += len;
+		} else {
+			if (bsi.block_len) {
+				ret = btrfs_add_swap_extent(sis, &bsi);
+				if (ret)
+					goto out;
+			}
+			bsi.start = start;
+			bsi.block_start = physical_block_start;
+			bsi.block_len = len;
+		}
+
+		start = end;
+	}
+
+	if (bsi.block_len)
+		ret = btrfs_add_swap_extent(sis, &bsi);
+
+out:
+	if (!IS_ERR_OR_NULL(em))
+		free_extent_map(em);
+
+	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state);
+
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	if (ret) {
+		atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
+		atomic_dec(&fs_info->nr_swapfiles);
+		return ret;
+	}
+
+	if (device)
+		sis->bdev = device->bdev;
+	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
+	sis->max = bsi.nr_pages;
+	sis->pages = bsi.nr_pages - 1;
+	sis->highest_bit = bsi.nr_pages - 1;
+	return bsi.nr_extents;
+}
+
+static void btrfs_swap_deactivate(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
+	atomic_dec(&BTRFS_I(inode)->root->fs_info->nr_swapfiles);
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.lookup		= btrfs_lookup,
@@ -10606,6 +10824,8 @@ static const struct address_space_operations btrfs_aops = {
 	.releasepage	= btrfs_releasepage,
 	.set_page_dirty	= btrfs_set_page_dirty,
 	.error_remove_page = generic_error_remove_page,
+	.swap_activate	= btrfs_swap_activate,
+	.swap_deactivate = btrfs_swap_deactivate,
 };
 
 static const struct address_space_operations btrfs_symlink_aops = {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 886c9d89b144..0d934b5be05c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2382,6 +2382,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
 	list_add_tail(&new_se->list, &sis->first_swap_extent.list);
 	return 1;
 }
+EXPORT_SYMBOL_GPL(add_swap_extent);
 
 /*
  * A `swap extent' is a simple thing which maps a contiguous range of pages
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  2018-05-24 21:41 ` [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS Omar Sandoval
@ 2018-05-25  9:11   ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2018-05-25  9:11 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo



On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
> go through the filesystem, and to make swapoff() call
> ->swap_deactivate(). For Btrfs, we want the latter but not the former,
> so split this flag into two. This makes us always call
> ->swap_deactivate() if ->swap_activate() succeeded, not just if it
> didn't add any swap extents itself.
> 
> This also resolves the issue of the very misleading name of SWP_FILE,
> which is only used for swap files over NFS.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Generally looks good:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

just one cleanup suggestion, see below.
> ---
>  include/linux/swap.h | 13 +++++++------
>  mm/page_io.c         |  6 +++---
>  mm/swapfile.c        | 13 ++++++++-----
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..29dfd436435c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -167,13 +167,14 @@ enum {
>  	SWP_SOLIDSTATE	= (1 << 4),	/* blkdev seeks are cheap */
>  	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
>  	SWP_BLKDEV	= (1 << 6),	/* its a block device */
> -	SWP_FILE	= (1 << 7),	/* set after swap_activate success */
> -	SWP_AREA_DISCARD = (1 << 8),	/* single-time swap area discards */
> -	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
> -	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
> -	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
> +	SWP_ACTIVATED	= (1 << 7),	/* set after swap_activate success */
> +	SWP_FS		= (1 << 8),	/* swap file goes through fs */
> +	SWP_AREA_DISCARD = (1 << 9),	/* single-time swap area discards */
> +	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
> +	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
> +	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
>  					/* add others here before... */
> -	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/page_io.c b/mm/page_io.c
> index b41cf9644585..f2d06c1d0cc1 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  	struct swap_info_struct *sis = page_swap_info(page);
>  
>  	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> -	if (sis->flags & SWP_FILE) {
> +	if (sis->flags & SWP_FS) {

Not necessarily for this series but something to keep in mind:

nit: The way the fs case is tucked onto the __swap_writepage is a bit
ugly. How about factoring out the filesystem/blockdev casae into two
functions :

__swap_writepage_fs/__swap_writepage_bdev

then in swap_writepage we just have the if (sis->flags & SWP_FS) check
and dispatch to either write_page_fs or writepage_bdev.


>  		struct kiocb kiocb;
>  		struct file *swap_file = sis->swap_file;
>  		struct address_space *mapping = swap_file->f_mapping;
> @@ -364,7 +364,7 @@ int swap_readpage(struct page *page, bool synchronous)
>  		goto out;
>  	}
>  
> -	if (sis->flags & SWP_FILE) {
> +	if (sis->flags & SWP_FS) {
>  		struct file *swap_file = sis->swap_file;
>  		struct address_space *mapping = swap_file->f_mapping;
>  
> @@ -422,7 +422,7 @@ int swap_set_page_dirty(struct page *page)
>  {
>  	struct swap_info_struct *sis = page_swap_info(page);
>  
> -	if (sis->flags & SWP_FILE) {
> +	if (sis->flags & SWP_FS) {
>  		struct address_space *mapping = sis->swap_file->f_mapping;
>  
>  		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cc2cf04d9018..886c9d89b144 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -973,7 +973,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
>  			goto nextsi;
>  		}
>  		if (cluster) {
> -			if (!(si->flags & SWP_FILE))
> +			if (!(si->flags & SWP_FS))
>  				n_ret = swap_alloc_cluster(si, swp_entries);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> @@ -2327,12 +2327,13 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
>  		kfree(se);
>  	}
>  
> -	if (sis->flags & SWP_FILE) {
> +	if (sis->flags & SWP_ACTIVATED) {
>  		struct file *swap_file = sis->swap_file;
>  		struct address_space *mapping = swap_file->f_mapping;
>  
> -		sis->flags &= ~SWP_FILE;
> -		mapping->a_ops->swap_deactivate(swap_file);
> +		sis->flags &= ~SWP_ACTIVATED;
> +		if (mapping->a_ops->swap_deactivate)
> +			mapping->a_ops->swap_deactivate(swap_file);
>  	}
>  }
>  
> @@ -2428,8 +2429,10 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>  
>  	if (mapping->a_ops->swap_activate) {
>  		ret = mapping->a_ops->swap_activate(sis, swap_file, span);
> +		if (ret >= 0)
> +			sis->flags |= SWP_ACTIVATED;
>  		if (!ret) {
> -			sis->flags |= SWP_FILE;
> +			sis->flags |= SWP_FS;
>  			ret = add_swap_extent(sis, 0, sis->max, 0);
>  			*span = sis->pages;
>  		}
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation
  2018-05-24 21:41 ` [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation Omar Sandoval
@ 2018-05-25  9:15   ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2018-05-25  9:15 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo



On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The documentation for these functions is wrong in several ways:
> 
> - swap_activate() is called with the inode locked
> - swap_activate() takes a swap_info_struct * and a sector_t *
> - swap_activate() can also return a positive number of extents it added
>   itself
> - swap_deactivate() does not return anything
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  Documentation/filesystems/Locking | 17 +++++++----------
>  Documentation/filesystems/vfs.txt | 12 ++++++++----
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 75d2d57e2c44..7f009e98fa3c 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -211,8 +211,9 @@ prototypes:
>  	int (*launder_page)(struct page *);
>  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
>  	int (*error_remove_page)(struct address_space *, struct page *);
> -	int (*swap_activate)(struct file *);
> -	int (*swap_deactivate)(struct file *);
> +	int (*swap_activate)(struct swap_info_struct *, struct file *,
> +			     sector_t *);
> +	void (*swap_deactivate)(struct file *);
>  
>  locking rules:
>  	All except set_page_dirty and freepage may block
> @@ -236,8 +237,8 @@ putback_page:		yes
>  launder_page:		yes
>  is_partially_uptodate:	yes
>  error_remove_page:	yes
> -swap_activate:		no
> -swap_deactivate:	no
> +swap_activate:					yes
> +swap_deactivate:				no
>  
>  	->write_begin(), ->write_end() and ->readpage() may be called from
>  the request handler (/dev/loop).
> @@ -334,14 +335,10 @@ cleaned, or an error value if not. Note that in order to prevent the page
>  getting mapped back in and redirtied, it needs to be kept locked
>  across the entire operation.
>  
> -	->swap_activate will be called with a non-zero argument on
> -files backing (non block device backed) swapfiles. A return value
> -of zero indicates success, in which case this file can be used for
> -backing swapspace. The swapspace operations will be proxied to the
> -address space operations.
> +	->swap_activate is called from sys_swapon() with the inode locked.
>  
>  	->swap_deactivate() will be called in the sys_swapoff()
> -path after ->swap_activate() returned success.
> +path after ->swap_activate() returned success. The inode is not locked.
>  
>  ----------------------- file_lock_operations ------------------------------
>  prototypes:
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 5fd325df59e2..0149109d94d1 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -650,8 +650,9 @@ struct address_space_operations {
>  					unsigned long);
>  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
>  	int (*error_remove_page) (struct mapping *mapping, struct page *page);
> -	int (*swap_activate)(struct file *);
> -	int (*swap_deactivate)(struct file *);
> +	int (*swap_activate)(struct swap_info_struct *, struct file *,
> +			     sector_t *);
> +	void (*swap_deactivate)(struct file *);
>  };
>  
>    writepage: called by the VM to write a dirty page to backing store.
> @@ -828,8 +829,11 @@ struct address_space_operations {
>  
>    swap_activate: Called when swapon is used on a file to allocate
>  	space if necessary and pin the block lookup information in
> -	memory. A return value of zero indicates success,
> -	in which case this file can be used to back swapspace.
> +	memory. If this returns zero, the swap system will call the address
> +	space operations ->readpage() and ->direct_IO(). Alternatively, this
> +	may call add_swap_extent() and return the number of extents added, in
> +	which case the swap system will use the provided blocks directly
> +	instead of going through the filesystem.
>  
>    swap_deactivate: Called during swapoff on files where swap_activate
>  	was successful.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device()
  2018-05-24 21:41 ` [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device() Omar Sandoval
@ 2018-05-25  9:19   ` Nikolay Borisov
  2018-05-28 13:29   ` David Sterba
  1 sibling, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2018-05-25  9:19 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo



On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> btrfs_ioctl_rm_dev() and btrfs_ioctl_rm_dev_v2() both manipulate this
> bit. Let's move it into the common btrfs_rm_device(), which also makes
> the following change to deal with swap files easier.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/ioctl.c   | 13 -------------
>  fs/btrfs/volumes.c |  4 ++++
>  2 files changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9af3be96099f..82be4a94334b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3091,18 +3091,12 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
>  		goto out;
>  	}
>  
> -	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
> -		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> -		goto out;
> -	}
> -
>  	if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) {
>  		ret = btrfs_rm_device(fs_info, NULL, vol_args->devid);
>  	} else {
>  		vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
>  		ret = btrfs_rm_device(fs_info, vol_args->name, 0);
>  	}
> -	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
>  
>  	if (!ret) {
>  		if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
> @@ -3133,11 +3127,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
>  	if (ret)
>  		return ret;
>  
> -	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
> -		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> -		goto out_drop_write;
> -	}
> -
>  	vol_args = memdup_user(arg, sizeof(*vol_args));
>  	if (IS_ERR(vol_args)) {
>  		ret = PTR_ERR(vol_args);
> @@ -3151,8 +3140,6 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
>  		btrfs_info(fs_info, "disk deleted %s", vol_args->name);
>  	kfree(vol_args);
>  out:
> -	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
> -out_drop_write:
>  	mnt_drop_write_file(file);
>  
>  	return ret;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index b6757b53c297..9cfac177214f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1951,6 +1951,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
>  	u64 num_devices;
>  	int ret = 0;
>  
> +	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> +		return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> +
>  	mutex_lock(&uuid_mutex);
>  
>  	num_devices = fs_devices->num_devices;
> @@ -2069,6 +2072,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
>  
>  out:
>  	mutex_unlock(&uuid_mutex);
> +	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
>  	return ret;
>  
>  error_undo:
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static
  2018-05-24 21:41 ` [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static Omar Sandoval
@ 2018-05-25  9:21   ` Nikolay Borisov
  2018-05-25 16:02     ` Omar Sandoval
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2018-05-25  9:21 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo



On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
> make it non-static.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

nit: How about introducing proper kernel doc for this function, now that
it becomes public just as good practice so that eventually we will have
proper kernel doc for all public interfaces. You could also mention that
it needs a paired free_extent_map
> ---
>  fs/btrfs/volumes.c | 22 +++++++++++-----------
>  fs/btrfs/volumes.h |  2 ++
>  2 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 33b3d329ebb9..6e1a89c6b362 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2789,8 +2789,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> -					u64 logical, u64 length)
> +struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
> +				       u64 logical, u64 length)
>  {
>  	struct extent_map_tree *em_tree;
>  	struct extent_map *em;
> @@ -2827,7 +2827,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	int i, ret = 0;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>  
> -	em = get_chunk_map(fs_info, chunk_offset, 1);
> +	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>  	if (IS_ERR(em)) {
>  		/*
>  		 * This is a logic error, but we don't want to just rely on the
> @@ -4962,7 +4962,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>  	int i = 0;
>  	int ret = 0;
>  
> -	em = get_chunk_map(fs_info, chunk_offset, chunk_size);
> +	em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
>  	if (IS_ERR(em))
>  		return PTR_ERR(em);
>  
> @@ -5105,7 +5105,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	int miss_ndevs = 0;
>  	int i;
>  
> -	em = get_chunk_map(fs_info, chunk_offset, 1);
> +	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>  	if (IS_ERR(em))
>  		return 1;
>  
> @@ -5165,7 +5165,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	struct map_lookup *map;
>  	int ret;
>  
> -	em = get_chunk_map(fs_info, logical, len);
> +	em = btrfs_get_chunk_map(fs_info, logical, len);
>  	if (IS_ERR(em))
>  		/*
>  		 * We could return errors for these cases, but that could get
> @@ -5211,7 +5211,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
>  	struct map_lookup *map;
>  	unsigned long len = fs_info->sectorsize;
>  
> -	em = get_chunk_map(fs_info, logical, len);
> +	em = btrfs_get_chunk_map(fs_info, logical, len);
>  
>  	if (!WARN_ON(IS_ERR(em))) {
>  		map = em->map_lookup;
> @@ -5228,7 +5228,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  	struct map_lookup *map;
>  	int ret = 0;
>  
> -	em = get_chunk_map(fs_info, logical, len);
> +	em = btrfs_get_chunk_map(fs_info, logical, len);
>  
>  	if(!WARN_ON(IS_ERR(em))) {
>  		map = em->map_lookup;
> @@ -5387,7 +5387,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>  	/* discard always return a bbio */
>  	ASSERT(bbio_ret);
>  
> -	em = get_chunk_map(fs_info, logical, length);
> +	em = btrfs_get_chunk_map(fs_info, logical, length);
>  	if (IS_ERR(em))
>  		return PTR_ERR(em);
>  
> @@ -5713,7 +5713,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>  		return __btrfs_map_block_for_discard(fs_info, logical,
>  						     *length, bbio_ret);
>  
> -	em = get_chunk_map(fs_info, logical, *length);
> +	em = btrfs_get_chunk_map(fs_info, logical, *length);
>  	if (IS_ERR(em))
>  		return PTR_ERR(em);
>  
> @@ -6012,7 +6012,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
>  	u64 rmap_len;
>  	int i, j, nr = 0;
>  
> -	em = get_chunk_map(fs_info, chunk_start, 1);
> +	em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
>  	if (IS_ERR(em))
>  		return -EIO;
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 5139ec8daf4c..d3dedfd1324b 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -467,6 +467,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
>  int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>  				struct btrfs_fs_info *fs_info,
>  				u64 chunk_offset, u64 chunk_size);
> +struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
> +				       u64 logical, u64 length);
>  int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  		       struct btrfs_fs_info *fs_info, u64 chunk_offset);
>  
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 6/6] Btrfs: support swap files
  2018-05-24 21:41 ` [RFC PATCH v4 6/6] Btrfs: support swap files Omar Sandoval
@ 2018-05-25 10:07   ` Nikolay Borisov
  2018-05-25 16:16     ` Omar Sandoval
  2018-05-28 13:46   ` David Sterba
  1 sibling, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2018-05-25 10:07 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel, Tejun Heo



On 25.05.2018 00:41, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Implement the swap file a_ops on Btrfs. Activation needs to make sure
> that the file can be used as a swap file, which currently means it must
> be fully allocated as nocow with no compression on one device. It also
> sets up the swap extents directly with add_swap_extent(), so export it.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

What testing (apart form the xfstest patches you sent) have this code
seen? Have you run it with lockdep enabled (I'm asking because when I
picked up your v3 there was quite a bunch of deadlock warnings). Also
see some inline questions below.

> ---
>  fs/btrfs/inode.c | 220 +++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c    |   1 +
>  2 files changed, 221 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9228cb866115..6cca8529e307 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, u64 start, u64 end)
>  	}
>  }
>  
<snip>
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> +			       sector_t *span)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em = NULL;
> +	struct btrfs_device *device = NULL;
> +	struct btrfs_swap_info bsi = {
> +		.lowest_ppage = (sector_t)-1ULL,
> +	};
> +	int ret = 0;
> +	u64 isize = inode->i_size;
> +	u64 start;
> +
> +	/*
> +	 * The inode is locked, so these flags won't change after we check them.
> +	 */
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> +		btrfs_err(fs_info, "swapfile is compressed");
> +		return -EINVAL;
> +	}
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> +		btrfs_err(fs_info, "swapfile is copy-on-write");
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Balance or device remove/replace/resize can move stuff around from
> +	 * under us. The EXCL_OP flag makes sure they aren't running/won't run
> +	 * concurrently while we are mapping the swap extents, and the fs_info
> +	 * nr_swapfiles counter prevents them from running while the swap file
> +	 * is active and moving the extents. Note that this also prevents a
> +	 * concurrent device add which isn't actually necessary, but it's not
> +	 * really worth the trouble to allow it.
> +	 */
> +	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> +		return -EBUSY;
> +	atomic_inc(&fs_info->nr_swapfiles);
> +	/*
> +	 * Snapshots can create extents which require COW even if NODATACOW is
> +	 * set. We use this counter to prevent snapshots. We must increment it
> +	 * before walking the extents because we don't want a concurrent
> +	 * snapshot to run after we've already checked the extents.
> +	 */
> +	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> +
> +	lock_extent_bits(io_tree, 0, isize - 1, &cached_state);
> +	start = 0;
> +	while (start < isize) {
> +		u64 end, logical_block_start, physical_block_start;
> +		u64 len = isize - start;
> +
> +		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +		end = extent_map_end(em);
> +
> +		if (em->block_start == EXTENT_MAP_HOLE) {
> +			btrfs_err(fs_info, "swapfile has holes");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (em->block_start == EXTENT_MAP_INLINE) {
> +			/*
> +			 * It's unlikely we'll ever actually find ourselves
> +			 * here, as a file small enough to fit inline won't be
> +			 * big enough to store more than the swap header, but in
> +			 * case something changes in the future, let's catch it
> +			 * here rather than later.
> +			 */
> +			btrfs_err(fs_info, "swapfile is inline");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +			btrfs_err(fs_info, "swapfile is compressed");
> +			ret = -EINVAL;
> +			goto out;
> +		}

Isn't this implied by the earlier BTRFS_I(inode)->flags &
BTRFS_INODE_COMPRESS check ?
> +
> +		logical_block_start = em->block_start + (start - em->start);

So which offset are you calculating by doing the start - em->start
calculation? start is really the ending logical address of the previous
extent but isn't this always < em->start of the next extent ?

> +		len = min(len, em->len - (start - em->start));
> +		free_extent_map(em);
> +		em = NULL;
> +
> +		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> +		if (ret < 0) {
> +			goto out;
> +		} else if (ret) {
> +			ret = 0;
> +		} else {
> +			btrfs_err(fs_info, "swapfile is copy-on-write");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		em = btrfs_get_chunk_map(fs_info, logical_block_start, len);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +
> +		if (em->map_lookup->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +			btrfs_err(fs_info, "swapfile data profile is not single");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (device == NULL) {
> +			device = em->map_lookup->stripes[0].dev;
> +		} else if (device != em->map_lookup->stripes[0].dev) {
> +			btrfs_err(fs_info, "swapfile is on multiple devices");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
<snip>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file
  2018-05-24 21:41 ` [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
@ 2018-05-25 14:50   ` David Sterba
  2018-05-25 16:00     ` Omar Sandoval
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-05-25 14:50 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> When a swap file is active, we must make sure that the extents of the
> file are not moved and that they don't become shared. That means that
> the following are not safe:
> 
> - chattr +c (enable compression)
> - reflink
> - dedupe
> - snapshot
> - defrag
> - balance
> - device remove/replace/resize
> 
> Don't allow those to happen on an active swap file. Balance and device
> remove/replace/resize in particular are disallowed entirely; in the
> future, we can relax this so that relocation skips/errors out only on
> chunks containing an active swap file.

Hm, disabling the entire balance is too intrusive. It's clear that the
swapfile causes a lot of trouble when it goes against the dynamic
capabilities of btrfs (relocation and the functionality that builds on
it).

Skipping the swapfile extents should be implemented at minimum. We can
add some heuristics that will group the swap extents to a small number
of chunks so the impact of unmovable chunks is limited.

I haven't looked at the implementation, but it might be possible to
internally find a different location for the swap extent once it's not
used for the actual paged data.

In an ideal case, the swap extents could span entire chunks (1G) and not
mix with regular data/metadata.

> Note that we don't have to worry about chattr -C (disable nocow), which
> we ignore for non-empty files, because an active swapfile must be
> non-empty and can't be truncated. We also don't have to worry about
> autodefrag because it's only done on COW files. Truncate and fallocate
> are already taken care of by the generic code. Device add doesn't do
> relocation so it's not an issue, either.

Ok, fine the remaining easy cases are covered.

I don't know if you mentioned that elsewhere (as design questions are
in this patch), the allocation profile is single, or is it also possible
to have striped or duplicated swap extents?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file
  2018-05-25 14:50   ` David Sterba
@ 2018-05-25 16:00     ` Omar Sandoval
  2018-05-25 16:10       ` David Sterba
  0 siblings, 1 reply; 20+ messages in thread
From: Omar Sandoval @ 2018-05-25 16:00 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Fri, May 25, 2018 at 04:50:55PM +0200, David Sterba wrote:
> On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > When a swap file is active, we must make sure that the extents of the
> > file are not moved and that they don't become shared. That means that
> > the following are not safe:
> > 
> > - chattr +c (enable compression)
> > - reflink
> > - dedupe
> > - snapshot
> > - defrag
> > - balance
> > - device remove/replace/resize
> > 
> > Don't allow those to happen on an active swap file. Balance and device
> > remove/replace/resize in particular are disallowed entirely; in the
> > future, we can relax this so that relocation skips/errors out only on
> > chunks containing an active swap file.
> 
> Hm, disabling the entire balance is too intrusive. It's clear that the
> swapfile causes a lot of trouble when it goes against the dynamic
> capabilities of btrfs (relocation and the functionality that builds on
> it).
> 
> Skipping the swapfile extents should be implemented at minimum.

Sure thing, this should definitely be possible. For balance, we can skip
them; for resize or delete, it of course has to fail if it encounters
swap extents. I'll take a stab at it.

> We can
> add some heuristics that will group the swap extents to a small number
> of chunks so the impact of unmovable chunks is limited.
> 
> I haven't looked at the implementation, but it might be possible to
> internally find a different location for the swap extent once it's not
> used for the actual paged data.
> 
> In an ideal case, the swap extents could span entire chunks (1G) and not
> mix with regular data/metadata.
> 
> > Note that we don't have to worry about chattr -C (disable nocow), which
> > we ignore for non-empty files, because an active swapfile must be
> > non-empty and can't be truncated. We also don't have to worry about
> > autodefrag because it's only done on COW files. Truncate and fallocate
> > are already taken care of by the generic code. Device add doesn't do
> > relocation so it's not an issue, either.
> 
> Ok, fine the remaining easy cases are covered.
> 
> I don't know if you mentioned that elsewhere (as design questions are
> in this patch), the allocation profile is single, or is it also possible
> to have striped or duplicated swap extents?

That's briefly mentioned in the last patch, only single data is
supported, although I think I can easily relax that to also allow RAID0.
Anything else is much harder to support, but we need to start somewhere.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static
  2018-05-25  9:21   ` Nikolay Borisov
@ 2018-05-25 16:02     ` Omar Sandoval
  0 siblings, 0 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-05-25 16:02 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Fri, May 25, 2018 at 12:21:41PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.05.2018 00:41, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
> > make it non-static.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> 
> nit: How about introducing proper kernel doc for this function, now that
> it becomes public just as good practice so that eventually we will have
> proper kernel doc for all public interfaces. You could also mention that
> it needs a paired free_extent_map

Will do, thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file
  2018-05-25 16:00     ` Omar Sandoval
@ 2018-05-25 16:10       ` David Sterba
  2018-08-21  8:46         ` Omar Sandoval
  0 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2018-05-25 16:10 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: dsterba, linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Fri, May 25, 2018 at 09:00:58AM -0700, Omar Sandoval wrote:
> On Fri, May 25, 2018 at 04:50:55PM +0200, David Sterba wrote:
> > On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > When a swap file is active, we must make sure that the extents of the
> > > file are not moved and that they don't become shared. That means that
> > > the following are not safe:
> > > 
> > > - chattr +c (enable compression)
> > > - reflink
> > > - dedupe
> > > - snapshot
> > > - defrag
> > > - balance
> > > - device remove/replace/resize
> > > 
> > > Don't allow those to happen on an active swap file. Balance and device
> > > remove/replace/resize in particular are disallowed entirely; in the
> > > future, we can relax this so that relocation skips/errors out only on
> > > chunks containing an active swap file.
> > 
> > Hm, disabling the entire balance is too intrusive. It's clear that the
> > swapfile causes a lot of trouble when it goes against the dynamic
> > capabilities of btrfs (relocation and the functionality that builds on
> > it).
> > 
> > Skipping the swapfile extents should be implemented at minimum.
> 
> Sure thing, this should definitely be possible. For balance, we can skip
> them; for resize or delete, it of course has to fail if it encounters
> swap extents. I'll take a stab at it.

We can detect if there's an active swap file on the filesystem before
shrink, delete or replace is started so the user is not surprised if it
fails in the end, or not start the operations at all and give some hints
what to do.

> > We can
> > add some heuristics that will group the swap extents to a small number
> > of chunks so the impact of unmovable chunks is limited.
> > 
> > I haven't looked at the implementation, but it might be possible to
> > internally find a different location for the swap extent once it's not
> > used for the actual paged data.
> > 
> > In an ideal case, the swap extents could span entire chunks (1G) and not
> > mix with regular data/metadata.
> > 
> > > Note that we don't have to worry about chattr -C (disable nocow), which
> > > we ignore for non-empty files, because an active swapfile must be
> > > non-empty and can't be truncated. We also don't have to worry about
> > > autodefrag because it's only done on COW files. Truncate and fallocate
> > > are already taken care of by the generic code. Device add doesn't do
> > > relocation so it's not an issue, either.
> > 
> > Ok, fine the remaining easy cases are covered.
> > 
> > I don't know if you mentioned that elsewhere (as design questions are
> > in this patch), the allocation profile is single, or is it also possible
> > to have striped or duplicated swap extents?
> 
> That's briefly mentioned in the last patch, only single data is
> supported, although I think I can easily relax that to also allow RAID0.
> Anything else is much harder to support, but we need to start somewhere.

Of course, support for single is absolutelly fine for the first
implementation.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 6/6] Btrfs: support swap files
  2018-05-25 10:07   ` Nikolay Borisov
@ 2018-05-25 16:16     ` Omar Sandoval
  0 siblings, 0 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-05-25 16:16 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Fri, May 25, 2018 at 01:07:00PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.05.2018 00:41, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Implement the swap file a_ops on Btrfs. Activation needs to make sure
> > that the file can be used as a swap file, which currently means it must
> > be fully allocated as nocow with no compression on one device. It also
> > sets up the swap extents directly with add_swap_extent(), so export it.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> What testing (apart form the xfstest patches you sent) have this code
> seen?

Light testing with my swapme script [1] and btrfsck to make sure I
didn't swap to the wrong place. I was meaning to put this through
something more intensive like a kernel build, thanks for the reminder.
As opposed to the previous approach, the swapin/swapout paths are
simple, core code, so the edge cases I'm worried about are really in
activate/deactivate and other ioctls breaking things.

1: https://github.com/osandov/osandov-linux/blob/master/scripts/swapme.c

> Have you run it with lockdep enabled (I'm asking because when I
> picked up your v3 there was quite a bunch of deadlock warnings). Also
> see some inline questions below.

Yup, I've been running it with lockdep, no warnings. The nice part of
this approach is that there's no new locking involved, just whatever the
swap code does itself.

> > ---
> >  fs/btrfs/inode.c | 220 +++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c    |   1 +
> >  2 files changed, 221 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 9228cb866115..6cca8529e307 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, u64 start, u64 end)
> >  	}
> >  }
> >  
> <snip>
> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> > +			       sector_t *span)
> > +{
> > +	struct inode *inode = file_inode(file);
> > +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em = NULL;
> > +	struct btrfs_device *device = NULL;
> > +	struct btrfs_swap_info bsi = {
> > +		.lowest_ppage = (sector_t)-1ULL,
> > +	};
> > +	int ret = 0;
> > +	u64 isize = inode->i_size;
> > +	u64 start;
> > +
> > +	/*
> > +	 * The inode is locked, so these flags won't change after we check them.
> > +	 */
> > +	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +		btrfs_err(fs_info, "swapfile is compressed");
> > +		return -EINVAL;
> > +	}
> > +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +		btrfs_err(fs_info, "swapfile is copy-on-write");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * Balance or device remove/replace/resize can move stuff around from
> > +	 * under us. The EXCL_OP flag makes sure they aren't running/won't run
> > +	 * concurrently while we are mapping the swap extents, and the fs_info
> > +	 * nr_swapfiles counter prevents them from running while the swap file
> > +	 * is active and moving the extents. Note that this also prevents a
> > +	 * concurrent device add which isn't actually necessary, but it's not
> > +	 * really worth the trouble to allow it.
> > +	 */
> > +	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> > +		return -EBUSY;
> > +	atomic_inc(&fs_info->nr_swapfiles);
> > +	/*
> > +	 * Snapshots can create extents which require COW even if NODATACOW is
> > +	 * set. We use this counter to prevent snapshots. We must increment it
> > +	 * before walking the extents because we don't want a concurrent
> > +	 * snapshot to run after we've already checked the extents.
> > +	 */
> > +	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> > +
> > +	lock_extent_bits(io_tree, 0, isize - 1, &cached_state);
> > +	start = 0;
> > +	while (start < isize) {
> > +		u64 end, logical_block_start, physical_block_start;
> > +		u64 len = isize - start;
> > +
> > +		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> > +		if (IS_ERR(em)) {
> > +			ret = PTR_ERR(em);
> > +			goto out;
> > +		}
> > +		end = extent_map_end(em);
> > +
> > +		if (em->block_start == EXTENT_MAP_HOLE) {
> > +			btrfs_err(fs_info, "swapfile has holes");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		if (em->block_start == EXTENT_MAP_INLINE) {
> > +			/*
> > +			 * It's unlikely we'll ever actually find ourselves
> > +			 * here, as a file small enough to fit inline won't be
> > +			 * big enough to store more than the swap header, but in
> > +			 * case something changes in the future, let's catch it
> > +			 * here rather than later.
> > +			 */
> > +			btrfs_err(fs_info, "swapfile is inline");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +			btrfs_err(fs_info, "swapfile is compressed");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> 
> Isn't this implied by the earlier BTRFS_I(inode)->flags &
> BTRFS_INODE_COMPRESS check ?

Nope, if you look at btrfs_ioctl_setflags(), you can clear the inode
flag even if there are compressed extents.

> > +
> > +		logical_block_start = em->block_start + (start - em->start);
> 
> So which offset are you calculating by doing the start - em->start
> calculation? start is really the ending logical address of the previous
> extent but isn't this always < em->start of the next extent ?

[em->start, em->start + em->len) will always contain start.
start - em->start is the offset of start from the beginning of the
extent we just found. It should always be zero since we're walking
sequentially and we have the extents locked, but I wanted to be
defensive here.

> > +		len = min(len, em->len - (start - em->start));
> > +		free_extent_map(em);
> > +		em = NULL;
> > +
> > +		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> > +		if (ret < 0) {
> > +			goto out;
> > +		} else if (ret) {
> > +			ret = 0;
> > +		} else {
> > +			btrfs_err(fs_info, "swapfile is copy-on-write");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		em = btrfs_get_chunk_map(fs_info, logical_block_start, len);
> > +		if (IS_ERR(em)) {
> > +			ret = PTR_ERR(em);
> > +			goto out;
> > +		}
> > +
> > +		if (em->map_lookup->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> > +			btrfs_err(fs_info, "swapfile data profile is not single");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		if (device == NULL) {
> > +			device = em->map_lookup->stripes[0].dev;
> > +		} else if (device != em->map_lookup->stripes[0].dev) {
> > +			btrfs_err(fs_info, "swapfile is on multiple devices");
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> <snip>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device()
  2018-05-24 21:41 ` [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device() Omar Sandoval
  2018-05-25  9:19   ` Nikolay Borisov
@ 2018-05-28 13:29   ` David Sterba
  1 sibling, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-05-28 13:29 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Thu, May 24, 2018 at 02:41:27PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> btrfs_ioctl_rm_dev() and btrfs_ioctl_rm_dev_v2() both manipulate this
> bit. Let's move it into the common btrfs_rm_device(), which also makes
> the following change to deal with swap files easier.

I'd rather keep it in both places as it's clear where the EXCL_OP
context starts in the ioctl handlers, not in some helper function.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 6/6] Btrfs: support swap files
  2018-05-24 21:41 ` [RFC PATCH v4 6/6] Btrfs: support swap files Omar Sandoval
  2018-05-25 10:07   ` Nikolay Borisov
@ 2018-05-28 13:46   ` David Sterba
  1 sibling, 0 replies; 20+ messages in thread
From: David Sterba @ 2018-05-28 13:46 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Thu, May 24, 2018 at 02:41:30PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Implement the swap file a_ops on Btrfs. Activation needs to make sure
> that the file can be used as a swap file, which currently means it must
> be fully allocated as nocow with no compression on one device. It also
> sets up the swap extents directly with add_swap_extent(), so export it.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/inode.c | 220 +++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c    |   1 +
>  2 files changed, 221 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9228cb866115..6cca8529e307 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, u64 start, u64 end)
>  	}
>  }
>  
> +struct btrfs_swap_info {
> +	u64 start;
> +	u64 block_start;
> +	u64 block_len;
> +	u64 lowest_ppage;
> +	u64 highest_ppage;
> +	unsigned long nr_pages;
> +	int nr_extents;
> +};
> +
> +static int btrfs_add_swap_extent(struct swap_info_struct *sis,
> +				 struct btrfs_swap_info *bsi)
> +{
> +	unsigned long nr_pages;
> +	u64 first_ppage, first_ppage_reported, next_ppage;
> +	int ret;
> +
> +	first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
> +	next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
> +				PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	if (first_ppage >= next_ppage)
> +		return 0;
> +	nr_pages = next_ppage - first_ppage;
> +
> +	first_ppage_reported = first_ppage;
> +	if (bsi->start == 0)
> +		first_ppage_reported++;
> +	if (bsi->lowest_ppage > first_ppage_reported)
> +		bsi->lowest_ppage = first_ppage_reported;
> +	if (bsi->highest_ppage < (next_ppage - 1))
> +		bsi->highest_ppage = next_ppage - 1;
> +
> +	ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage);
> +	if (ret < 0)
> +		return ret;
> +	bsi->nr_extents += ret;
> +	bsi->nr_pages += nr_pages;
> +	return 0;
> +}
> +
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> +			       sector_t *span)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em = NULL;
> +	struct btrfs_device *device = NULL;
> +	struct btrfs_swap_info bsi = {
> +		.lowest_ppage = (sector_t)-1ULL,
> +	};
> +	int ret = 0;
> +	u64 isize = inode->i_size;
> +	u64 start;
> +
> +	/*
> +	 * The inode is locked, so these flags won't change after we check them.
> +	 */
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> +		btrfs_err(fs_info, "swapfile is compressed");

I'm not sure this is the right way to phrase the error message. I'd
expect something like "swapfile cannot be compressed", but I also think
I've seen the positive and negative elsewhere so there's no universal
style to follow.

> +		return -EINVAL;
> +	}
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> +		btrfs_err(fs_info, "swapfile is copy-on-write");
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Balance or device remove/replace/resize can move stuff around from
> +	 * under us. The EXCL_OP flag makes sure they aren't running/won't run
> +	 * concurrently while we are mapping the swap extents, and the fs_info
> +	 * nr_swapfiles counter prevents them from running while the swap file
> +	 * is active and moving the extents. Note that this also prevents a
> +	 * concurrent device add which isn't actually necessary, but it's not
> +	 * really worth the trouble to allow it.
> +	 */
> +	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> +		return -EBUSY;
> +	atomic_inc(&fs_info->nr_swapfiles);
> +	/*
> +	 * Snapshots can create extents which require COW even if NODATACOW is
> +	 * set. We use this counter to prevent snapshots. We must increment it
> +	 * before walking the extents because we don't want a concurrent
> +	 * snapshot to run after we've already checked the extents.
> +	 */
> +	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> +
> +	lock_extent_bits(io_tree, 0, isize - 1, &cached_state);
> +	start = 0;
> +	while (start < isize) {
> +		u64 end, logical_block_start, physical_block_start;
> +		u64 len = isize - start;
> +
> +		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +		end = extent_map_end(em);
> +
> +		if (em->block_start == EXTENT_MAP_HOLE) {
> +			btrfs_err(fs_info, "swapfile has holes");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (em->block_start == EXTENT_MAP_INLINE) {
> +			/*
> +			 * It's unlikely we'll ever actually find ourselves
> +			 * here, as a file small enough to fit inline won't be
> +			 * big enough to store more than the swap header, but in
> +			 * case something changes in the future, let's catch it
> +			 * here rather than later.
> +			 */
> +			btrfs_err(fs_info, "swapfile is inline");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +			btrfs_err(fs_info, "swapfile is compressed");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		logical_block_start = em->block_start + (start - em->start);
> +		len = min(len, em->len - (start - em->start));
> +		free_extent_map(em);
> +		em = NULL;
> +
> +		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> +		if (ret < 0) {
> +			goto out;
> +		} else if (ret) {
> +			ret = 0;
> +		} else {
> +			btrfs_err(fs_info, "swapfile is copy-on-write");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		em = btrfs_get_chunk_map(fs_info, logical_block_start, len);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +
> +		if (em->map_lookup->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +			btrfs_err(fs_info, "swapfile data profile is not single");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (device == NULL) {
> +			device = em->map_lookup->stripes[0].dev;
> +		} else if (device != em->map_lookup->stripes[0].dev) {
> +			btrfs_err(fs_info, "swapfile is on multiple devices");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		physical_block_start = (em->map_lookup->stripes[0].physical +
> +					(logical_block_start - em->start));
> +		len = min(len, em->len - (logical_block_start - em->start));
> +		free_extent_map(em);
> +		em = NULL;
> +
> +		if (bsi.block_len &&
> +		    bsi.block_start + bsi.block_len == physical_block_start) {
> +			bsi.block_len += len;
> +		} else {
> +			if (bsi.block_len) {
> +				ret = btrfs_add_swap_extent(sis, &bsi);
> +				if (ret)
> +					goto out;
> +			}
> +			bsi.start = start;
> +			bsi.block_start = physical_block_start;
> +			bsi.block_len = len;
> +		}
> +
> +		start = end;
> +	}
> +
> +	if (bsi.block_len)
> +		ret = btrfs_add_swap_extent(sis, &bsi);
> +
> +out:
> +	if (!IS_ERR_OR_NULL(em))
> +		free_extent_map(em);
> +
> +	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state);
> +
> +	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
> +
> +	if (ret) {
> +		atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
> +		atomic_dec(&fs_info->nr_swapfiles);
> +		return ret;
> +	}

Clearing the EXCL_OP should be done here, so the order of release is
reversed so the contexts are properly nested.

> +
> +	if (device)
> +		sis->bdev = device->bdev;
> +	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
> +	sis->max = bsi.nr_pages;
> +	sis->pages = bsi.nr_pages - 1;
> +	sis->highest_bit = bsi.nr_pages - 1;
> +	return bsi.nr_extents;
> +}
> +
> +static void btrfs_swap_deactivate(struct file *file)
> +{
> +	struct inode *inode = file_inode(file);
> +
> +	atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
> +	atomic_dec(&BTRFS_I(inode)->root->fs_info->nr_swapfiles);
> +}
> +
>  static const struct inode_operations btrfs_dir_inode_operations = {
>  	.getattr	= btrfs_getattr,
>  	.lookup		= btrfs_lookup,
> @@ -10606,6 +10824,8 @@ static const struct address_space_operations btrfs_aops = {
>  	.releasepage	= btrfs_releasepage,
>  	.set_page_dirty	= btrfs_set_page_dirty,
>  	.error_remove_page = generic_error_remove_page,
> +	.swap_activate	= btrfs_swap_activate,
> +	.swap_deactivate = btrfs_swap_deactivate,
>  };
>  
>  static const struct address_space_operations btrfs_symlink_aops = {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 886c9d89b144..0d934b5be05c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2382,6 +2382,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
>  	list_add_tail(&new_se->list, &sis->first_swap_extent.list);
>  	return 1;
>  }
> +EXPORT_SYMBOL_GPL(add_swap_extent);

This maybe needs to go as a separate patch as it's an export in a
different subsystem, unless ou get an ack to do it in the same patch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file
  2018-05-25 16:10       ` David Sterba
@ 2018-08-21  8:46         ` Omar Sandoval
  0 siblings, 0 replies; 20+ messages in thread
From: Omar Sandoval @ 2018-08-21  8:46 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team, linux-fsdevel, Tejun Heo

On Fri, May 25, 2018 at 06:10:43PM +0200, David Sterba wrote:
> On Fri, May 25, 2018 at 09:00:58AM -0700, Omar Sandoval wrote:
> > On Fri, May 25, 2018 at 04:50:55PM +0200, David Sterba wrote:
> > > On Thu, May 24, 2018 at 02:41:28PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > When a swap file is active, we must make sure that the extents of the
> > > > file are not moved and that they don't become shared. That means that
> > > > the following are not safe:
> > > > 
> > > > - chattr +c (enable compression)
> > > > - reflink
> > > > - dedupe
> > > > - snapshot
> > > > - defrag
> > > > - balance
> > > > - device remove/replace/resize
> > > > 
> > > > Don't allow those to happen on an active swap file. Balance and device
> > > > remove/replace/resize in particular are disallowed entirely; in the
> > > > future, we can relax this so that relocation skips/errors out only on
> > > > chunks containing an active swap file.
> > > 
> > > Hm, disabling the entire balance is too intrusive. It's clear that the
> > > swapfile causes a lot of trouble when it goes against the dynamic
> > > capabilities of btrfs (relocation and the functionality that builds on
> > > it).
> > > 
> > > Skipping the swapfile extents should be implemented at minimum.
> > 
> > Sure thing, this should definitely be possible. For balance, we can skip
> > them; for resize or delete, it of course has to fail if it encounters
> > swap extents. I'll take a stab at it.
> 
> We can detect if there's an active swap file on the filesystem before
> shrink, delete or replace is started so the user is not surprised if it
> fails in the end, or not start the operations at all and give some hints
> what to do.

I looked at this some more, it's not pretty... Basically, we need to

- Add a counter of active swap extents to struct btrfs_block_group_cache
- At activate time, map the file extent to a block group and increment the counter
- At relocation time, check the counter, and either error out or skip as
  appropriate
- At deactivate time, decrement all of the block group counters. Easier
  said than done because the file could in theory have new extents
  allocated beyond EOF

The last point is the tricky one. The straightforward way to implement
deactivate would be to walk all of the extents of the file in the same
way we did for activate, but the extents may have changed. So, we have
to remember which extents were activated (or get that information from
the generic swap code somehow). This seems fragile.

Does anyone see a better approach? Is it worth the trouble?

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-08-21 12:05 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-24 21:41 [RFC PATCH v4 0/6] Btrfs: implement swap file support Omar Sandoval
2018-05-24 21:41 ` [RFC PATCH v4 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS Omar Sandoval
2018-05-25  9:11   ` Nikolay Borisov
2018-05-24 21:41 ` [RFC PATCH v4 2/6] vfs: update swap_{,de}activate documentation Omar Sandoval
2018-05-25  9:15   ` Nikolay Borisov
2018-05-24 21:41 ` [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device() Omar Sandoval
2018-05-25  9:19   ` Nikolay Borisov
2018-05-28 13:29   ` David Sterba
2018-05-24 21:41 ` [RFC PATCH v4 4/6] Btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
2018-05-25 14:50   ` David Sterba
2018-05-25 16:00     ` Omar Sandoval
2018-05-25 16:10       ` David Sterba
2018-08-21  8:46         ` Omar Sandoval
2018-05-24 21:41 ` [RFC PATCH v4 5/6] Btrfs: rename get_chunk_map() and make it non-static Omar Sandoval
2018-05-25  9:21   ` Nikolay Borisov
2018-05-25 16:02     ` Omar Sandoval
2018-05-24 21:41 ` [RFC PATCH v4 6/6] Btrfs: support swap files Omar Sandoval
2018-05-25 10:07   ` Nikolay Borisov
2018-05-25 16:16     ` Omar Sandoval
2018-05-28 13:46   ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).