linux-btrfs.vger.kernel.org archive mirror
* [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data
@ 2021-11-17 20:19 Omar Sandoval
From: Omar Sandoval @ 2021-11-17 20:19 UTC
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This series has three parts: new Btrfs ioctls for reading/writing
compressed data, support for sending compressed data via Btrfs send, and
btrfs-progs support for sending/receiving compressed data and writing it
with the new ioctl.

Patches 1 and 2 are VFS changes exporting a couple of helpers for checks
needed by reads and writes. Patches 3-8 are preparatory Btrfs changes
for compressed reads and writes. Patch 6 is a cleanup. Patch 9 adds the
compressed read ioctl and patch 10 adds the compressed write ioctl.

The main use-case for this interface is Btrfs send/receive. Currently,
when sending data from one compressed filesystem to another, the sending
side decompresses the data and the receiving side recompresses it before
writing it out. This is wasteful and can be avoided if we can just send
and write compressed extents.

Patches 11 and 12 are cleanups for Btrfs send. Patches 13-17 add the
Btrfs send support. See the previous posting for more details and
benchmarks [1]. Patches 13-15 prepare some protocol changes for send
stream v2. Patch 16 implements compressed send. Patch 17 enables send
stream v2 and compressed send in the send ioctl when requested.

These patches are based on Dave Sterba's Btrfs misc-next branch [2],
which is in turn currently based on v5.16-rc1. Test cases are here [3].

Changes since v11 [4]:

- Rebased.
- Reworked send stream protocol versioning based on Dave's approach.
- Split up cow_file_range_inline() change into refactoring change and
  functional change.
- Added enforcement that BTRFS_SEND_A_DATA is always sent last.
- Fixed uninitialized variable reported by kernel test robot.
- Cleaned up some more style nits.
- Added and clarified some comments.
- Changed "page" nomenclature to "sector" for LZO.

1: https://lore.kernel.org/linux-btrfs/cover.1615922753.git.osandov@fb.com/
2: https://github.com/kdave/btrfs-devel/tree/misc-next
3: https://github.com/osandov/xfstests/tree/btrfs-encoded-io
4: https://lore.kernel.org/linux-btrfs/cover.1630514529.git.osandov@fb.com/

Omar Sandoval (17):
  fs: export rw_verify_area()
  fs: export variant of generic_write_checks without iov_iter
  btrfs: don't advance offset for compressed bios in
    btrfs_csum_one_bio()
  btrfs: add ram_bytes and offset to btrfs_ordered_extent
  btrfs: support different disk extent size for delalloc
  btrfs: clean up cow_file_range_inline()
  btrfs: optionally extend i_size in cow_file_range_inline()
  btrfs: add definitions + documentation for encoded I/O ioctls
  btrfs: add BTRFS_IOC_ENCODED_READ
  btrfs: add BTRFS_IOC_ENCODED_WRITE
  btrfs: send: remove unused send_ctx::{total,cmd}_send_size
  btrfs: send: fix maximum command numbering
  btrfs: add send stream v2 definitions
  btrfs: send: write larger chunks when using stream v2
  btrfs: send: allocate send buffer with alloc_page() and vmap() for v2
  btrfs: send: send compressed extents with encoded writes
  btrfs: send: enable support for stream v2 and compressed writes

 fs/btrfs/compression.c     |  10 +-
 fs/btrfs/compression.h     |   6 +-
 fs/btrfs/ctree.h           |  17 +-
 fs/btrfs/delalloc-space.c  |  18 +-
 fs/btrfs/file-item.c       |  32 +-
 fs/btrfs/file.c            |  68 ++-
 fs/btrfs/inode.c           | 927 +++++++++++++++++++++++++++++++++----
 fs/btrfs/ioctl.c           | 208 +++++++++
 fs/btrfs/ordered-data.c    | 131 ++----
 fs/btrfs/ordered-data.h    |  25 +-
 fs/btrfs/relocation.c      |   2 +-
 fs/btrfs/send.c            | 324 +++++++++++--
 fs/btrfs/send.h            |  39 +-
 fs/internal.h              |   5 -
 fs/read_write.c            |  34 +-
 include/linux/fs.h         |   2 +
 include/uapi/linux/btrfs.h | 142 +++++-
 17 files changed, 1704 insertions(+), 286 deletions(-)

The btrfs-progs patches were written by Boris Burkov with some updates
from me. Patches 1-4 are preparation. Patch 5 implements encoded writes.
Patch 6 implements the fallback to decompressing. Patches 7 and 8
implement the other commands. Patch 9 adds the new `btrfs send` options.
Patch 10 adds a test case.

Boris Burkov (10):
  btrfs-progs: receive: support v2 send stream larger tlv_len
  btrfs-progs: receive: dynamically allocate sctx->read_buf
  btrfs-progs: receive: support v2 send stream DATA tlv format
  btrfs-progs: receive: add send stream v2 cmds and attrs to send.h
  btrfs-progs: receive: process encoded_write commands
  btrfs-progs: receive: encoded_write fallback to explicit decode and
    write
  btrfs-progs: receive: process fallocate commands
  btrfs-progs: receive: process setflags ioctl commands
  btrfs-progs: send: stream v2 ioctl flags
  btrfs-progs: receive: add tests for basic encoded_write send/receive

 Documentation/btrfs-receive.asciidoc          |   4 +
 Documentation/btrfs-send.asciidoc             |  18 +-
 cmds/receive-dump.c                           |  31 +-
 cmds/receive.c                                | 347 +++++++++++++++++-
 cmds/send.c                                   |  92 ++++-
 common/send-stream.c                          | 165 +++++++--
 common/send-stream.h                          |   7 +
 ioctl.h                                       | 151 +++++++-
 kernel-shared/send.h                          |  39 +-
 libbtrfs/send-stream.c                        |   2 +-
 .../052-receive-write-encoded/test.sh         | 114 ++++++
 11 files changed, 924 insertions(+), 46 deletions(-)
 create mode 100755 tests/misc-tests/052-receive-write-encoded/test.sh

-- 
2.34.0



* [PATCH v12 01/17] fs: export rw_verify_area()
From: Omar Sandoval @ 2021-11-17 20:19 UTC
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

I'm adding Btrfs ioctls to read and write compressed data, and rather
than duplicating the checks in rw_verify_area(), let's just export it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/internal.h      | 5 -----
 fs/read_write.c    | 1 +
 include/linux/fs.h | 1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 7979ff8d168c..8e400574401e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -157,11 +157,6 @@ extern char *simple_dname(struct dentry *, char *, int);
 extern void dput_to_list(struct dentry *, struct list_head *);
 extern void shrink_dentry_list(struct list_head *);
 
-/*
- * read_write.c
- */
-extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
-
 /*
  * pipe.c
  */
diff --git a/fs/read_write.c b/fs/read_write.c
index 0074afa7ecb3..4d60146243df 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -385,6 +385,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	return security_file_permission(file,
 				read_write == READ ? MAY_READ : MAY_WRITE);
 }
+EXPORT_SYMBOL(rw_verify_area);
 
 static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1cb616fc1105..364940c6a299 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3244,6 +3244,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 		int whence, loff_t size);
 extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
 extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
+extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 extern int stream_open(struct inode * inode, struct file * filp);
-- 
2.34.0



* [PATCH v12 02/17] fs: export variant of generic_write_checks without iov_iter
From: Omar Sandoval @ 2021-11-17 20:19 UTC
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Encoded I/O in Btrfs needs to check a write with a given logical size
without an iov_iter that matches that size (because the iov_iter we have
is for the compressed data). So, factor out the parts of
generic_write_checks() that don't need an iov_iter into a new
generic_write_checks_count() function and export that.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
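Not part of the patch: a minimal user-space sketch of the refactoring
pattern described in the commit message, where the size-only checks move
into a count-based helper and the iterator-based function becomes a thin
wrapper around it. Every name here (sketch_iter, sketch_write_checks_count,
sketch_write_checks, the max_size limit) is a hypothetical stand-in, not a
kernel API, and the single limit check only approximates what
generic_write_check_limits() does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iov_iter: only tracks remaining bytes. */
struct sketch_iter {
	size_t count;
};

/*
 * Count-based checks: needs only the write position and length.
 * Clamps *count so the write ends at or before max_size (an assumed limit).
 * Returns 0 on success or -EFBIG if nothing can be written at pos.
 */
int sketch_write_checks_count(long long pos, long long max_size,
			      long long *count)
{
	if (*count == 0)
		return 0;
	if (pos >= max_size)
		return -EFBIG;
	if (*count > max_size - pos)
		*count = max_size - pos;
	return 0;
}

/*
 * Iterator-based wrapper: derives the count from the iterator, reuses the
 * count-based checks, then truncates the iterator to the permitted length.
 * Returns the number of bytes allowed or a negative errno.
 */
long long sketch_write_checks(long long pos, long long max_size,
			      struct sketch_iter *from)
{
	long long count = (long long)from->count;
	int ret;

	ret = sketch_write_checks_count(pos, max_size, &count);
	if (ret)
		return ret;
	from->count = (size_t)count;
	return count;
}
```

A caller in the position of an encoded write would use the count-based
helper directly and pass the logical (decompressed) length, because its
iterator describes the compressed bytes instead.
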
 fs/read_write.c    | 33 ++++++++++++++++++++-------------
 include/linux/fs.h |  1 +
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4d60146243df..dc5000173b80 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1618,24 +1618,16 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 	return 0;
 }
 
-/*
- * Performs necessary checks before doing a write
- *
- * Can adjust writing position or amount of bytes to write.
- * Returns appropriate error code that caller should return or
- * zero in case that write should be allowed.
- */
-ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+/* Like generic_write_checks(), but takes size of write instead of iter. */
+int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	loff_t count;
-	int ret;
 
 	if (IS_SWAPFILE(inode))
 		return -ETXTBSY;
 
-	if (!iov_iter_count(from))
+	if (!*count)
 		return 0;
 
 	/* FIXME: this is for backwards compatibility with 2.4 */
@@ -1645,8 +1637,23 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	count = iov_iter_count(from);
-	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
+}
+EXPORT_SYMBOL(generic_write_checks_count);
+
+/*
+ * Performs necessary checks before doing a write
+ *
+ * Can adjust writing position or amount of bytes to write.
+ * Returns appropriate error code that caller should return or
+ * zero in case that write should be allowed.
+ */
+ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	loff_t count = iov_iter_count(from);
+	int ret;
+
+	ret = generic_write_checks_count(iocb, &count);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 364940c6a299..368d537409ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3201,6 +3201,7 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+extern int generic_write_checks_count(struct kiocb *iocb, loff_t *count);
 extern int generic_write_check_limits(struct file *file, loff_t pos,
 		loff_t *count);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
-- 
2.34.0



* [PATCH v12 03/17] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
From: Omar Sandoval @ 2021-11-17 20:19 UTC
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because 1) the compressed bio always covers one ordered extent and 2)
the size of the bio is always less than the size of the ordered extent.
However, the second point will not always be true for encoded writes.

Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
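Not part of the patch: a small user-space sketch of the two decisions the
commit message describes, namely the (u64)-1 sentinel that selects
page-derived offsets and the one_ordered flag that suppresses the
per-block ordered extent range check. The struct and function names are
hypothetical reductions for illustration only; the real logic lives in
btrfs_csum_one_bio().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel meaning "derive file offsets from the pages in the bio". */
#define SKETCH_USE_PAGE_OFFSETS ((uint64_t)-1)

/* Hypothetical, reduced stand-in for struct btrfs_ordered_extent. */
struct sketch_ordered {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/* True if b lies in [first, first + len), written to avoid overflow. */
static bool sketch_in_range(uint64_t b, uint64_t first, uint64_t len)
{
	return b >= first && b - first < len;
}

/* Contiguity is implied by the offset argument, not a separate flag. */
bool sketch_uses_page_offsets(uint64_t offset)
{
	return offset == SKETCH_USE_PAGE_OFFSETS;
}

/*
 * Decide whether the checksum loop must switch to another ordered extent
 * for the block at offset. When the caller promises that the bio covers
 * exactly one ordered extent, the range check is skipped entirely.
 */
bool sketch_needs_new_ordered(const struct sketch_ordered *ordered,
			      uint64_t offset, bool one_ordered)
{
	return !one_ordered &&
	       !sketch_in_range(offset, ordered->file_offset,
				ordered->num_bytes);
}
```

With one_ordered set, a cursor that has run past the logical end of the
ordered extent no longer triggers a lookup, which is the case the commit
message expects encoded writes to hit.
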
 fs/btrfs/compression.c |  3 ++-
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/file-item.c   | 32 ++++++++++++++------------------
 fs/btrfs/inode.c       |  8 ++++----
 4 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 32da97c3c19d..73350f524fb8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -590,7 +590,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 
 		if (submit) {
 			if (!skip_sum) {
-				ret = btrfs_csum_one_bio(inode, bio, start, 1);
+				ret = btrfs_csum_one_bio(inode, bio, start,
+							 true);
 				if (ret)
 					goto finish_cb;
 			}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f1dd2486dcb3..9fd677f2ce15 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3169,7 +3169,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-				u64 file_start, int contig);
+				u64 offset, bool one_ordered);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			     struct list_head *list, int search_commit);
 void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 0f2e2ab34828..6203c85f5a1b 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -613,28 +613,28 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
  * btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio
  * @inode:	 Owner of the data inside the bio
  * @bio:	 Contains the data to be checksummed
- * @file_start:  offset in file this bio begins to describe
- * @contig:	 Boolean. If true/1 means all bio vecs in this bio are
- *		 contiguous and they begin at @file_start in the file. False/0
- *		 means this bio can contain potentially discontiguous bio vecs
- *		 so the logical offset of each should be calculated separately.
+ * @offset:      If (u64)-1, @bio may contain discontiguous bio vecs, so the
+ *               file offsets are determined from the page offsets in the bio.
+ *               Otherwise, this is the starting file offset of the bio vecs in
+ *               @bio, which must be contiguous.
+ * @one_ordered: If true, @bio only refers to one ordered extent.
  */
 blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
-		       u64 file_start, int contig)
+				u64 offset, bool one_ordered)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	struct btrfs_ordered_sum *sums;
 	struct btrfs_ordered_extent *ordered = NULL;
+	const bool use_page_offsets = (offset == (u64)-1);
 	char *data;
 	struct bvec_iter iter;
 	struct bio_vec bvec;
 	int index;
-	int nr_sectors;
+	int blockcount;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
 	int i;
-	u64 offset;
 	unsigned nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
@@ -648,18 +648,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
 
-	if (contig)
-		offset = file_start;
-	else
-		offset = 0; /* shut up gcc */
-
 	sums->bytenr = bio->bi_iter.bi_sector << 9;
 	index = 0;
 
 	shash->tfm = fs_info->csum_shash;
 
 	bio_for_each_segment(bvec, bio, iter) {
-		if (!contig)
+		if (use_page_offsets)
 			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
 
 		if (!ordered) {
@@ -678,13 +673,14 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
 			}
 		}
 
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info,
+		blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
 						 bvec.bv_len + fs_info->sectorsize
 						 - 1);
 
-		for (i = 0; i < nr_sectors; i++) {
-			if (offset >= ordered->file_offset + ordered->num_bytes ||
-			    offset < ordered->file_offset) {
+		for (i = 0; i < blockcount; i++) {
+			if (!one_ordered &&
+			    !in_range(offset, ordered->file_offset,
+				      ordered->num_bytes)) {
 				unsigned long bytes_left;
 
 				sums->len = this_sum_bytes;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 91f7ed27e421..0bd992835cf5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2308,7 +2308,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 					   u64 dio_file_offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 }
 
 /*
@@ -2560,7 +2560,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 					  0, btrfs_submit_bio_start);
 		goto out;
 	} else if (!skip_sum) {
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false);
 		if (ret)
 			goto out;
 	}
@@ -8162,7 +8162,7 @@ static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode,
 						     struct bio *bio,
 						     u64 dio_file_offset)
 {
-	return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, 1);
+	return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, false);
 }
 
 static void btrfs_end_dio_bio(struct bio *bio)
@@ -8219,7 +8219,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		 * If we aren't doing async submit, calculate the csum of the
 		 * bio now.
 		 */
-		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, 1);
+		ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false);
 		if (ret)
 			goto err;
 	} else {
-- 
2.34.0



* [PATCH v12 04/17] btrfs: add ram_bytes and offset to btrfs_ordered_extent
From: Omar Sandoval @ 2021-11-17 20:19 UTC
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
extents which only refer to a subset of the full unencoded extent, so we
need to plumb these fields through the ordered extent infrastructure and
pass them down to insert_reserved_file_extent().

Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
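Not part of the patch: a user-space sketch of the calling convention this
change introduces, where callers pass a bitmask built from 1 << type
instead of a type value plus a separate dio argument. The SKETCH_* names
and bit numbers are illustrative and are not claimed to match the kernel's
enum. Which bits belong to the type mask is an assumption inferred from
the call sites in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit numbers; not claimed to match the kernel's enum. */
enum {
	SKETCH_ORDERED_REGULAR,
	SKETCH_ORDERED_NOCOW,
	SKETCH_ORDERED_PREALLOC,
	SKETCH_ORDERED_COMPRESSED,
	SKETCH_ORDERED_DIRECT,
	SKETCH_ORDERED_IO_DONE,	/* a state bit, not a type bit */
};

/* Assumed set of bits a caller may pass when creating an extent. */
#define SKETCH_TYPE_FLAGS ((1UL << SKETCH_ORDERED_REGULAR) |	\
			   (1UL << SKETCH_ORDERED_NOCOW) |	\
			   (1UL << SKETCH_ORDERED_PREALLOC) |	\
			   (1UL << SKETCH_ORDERED_COMPRESSED) |	\
			   (1UL << SKETCH_ORDERED_DIRECT))

/* Build the mask a direct I/O caller passes: its type bit plus DIRECT. */
unsigned long sketch_dio_flags(int type)
{
	return (1UL << type) | (1UL << SKETCH_ORDERED_DIRECT);
}

/* Same shape as the ASSERT in the patch: no bits outside the type mask. */
bool sketch_flags_valid(unsigned long flags)
{
	return (flags & ~SKETCH_TYPE_FLAGS) == 0;
}

/* Same shape as the qgroup check: true for NOCOW or PREALLOC extents. */
bool sketch_is_nocow_like(unsigned long flags)
{
	return (flags & ((1UL << SKETCH_ORDERED_NOCOW) |
			 (1UL << SKETCH_ORDERED_PREALLOC))) != 0;
}
```

Passing a mask lets one function cover the regular, direct I/O, and
compressed cases that previously needed three wrappers, and lets the
callee store the flags with a single assignment.
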
 fs/btrfs/inode.c        |  58 ++++++++++++--------
 fs/btrfs/ordered-data.c | 119 ++++++++++++----------------------------
 fs/btrfs/ordered-data.h |  22 +++++---
 3 files changed, 82 insertions(+), 117 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0bd992835cf5..1afadc7afff3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -980,11 +980,14 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
 	}
 	free_extent_map(em);
 
-	ret = btrfs_add_ordered_extent_compress(inode, start,	/* file_offset */
-					ins.objectid,		/* disk_bytenr */
-					async_extent->ram_size, /* num_bytes */
-					ins.offset,		/* disk_num_bytes */
-					async_extent->compress_type);
+	ret = btrfs_add_ordered_extent(inode, start,		/* file_offset */
+				       async_extent->ram_size,	/* num_bytes */
+				       async_extent->ram_size,	/* ram_bytes */
+				       ins.objectid,		/* disk_bytenr */
+				       ins.offset,		/* disk_num_bytes */
+				       0,			/* offset */
+				       1 << BTRFS_ORDERED_COMPRESSED,
+				       async_extent->compress_type);
 	if (ret) {
 		btrfs_drop_extent_cache(inode, start, end, 0);
 		goto out_free_reserve;
@@ -1233,9 +1236,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		}
 		free_extent_map(em);
 
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size,
-					       BTRFS_ORDERED_REGULAR);
+		ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size,
+					       ins.objectid, cur_alloc_size, 0,
+					       1 << BTRFS_ORDERED_REGULAR,
+					       BTRFS_COMPRESS_NONE);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1893,10 +1897,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				goto error;
 			}
 			free_extent_map(em);
-			ret = btrfs_add_ordered_extent(inode, cur_offset,
-						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_PREALLOC);
+			ret = btrfs_add_ordered_extent(inode,
+					cur_offset, num_bytes, num_bytes,
+					disk_bytenr, num_bytes, 0,
+					1 << BTRFS_ORDERED_PREALLOC,
+					BTRFS_COMPRESS_NONE);
 			if (ret) {
 				btrfs_drop_extent_cache(inode, cur_offset,
 							cur_offset + num_bytes - 1,
@@ -1905,9 +1910,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			}
 		} else {
 			ret = btrfs_add_ordered_extent(inode, cur_offset,
+						       num_bytes, num_bytes,
 						       disk_bytenr, num_bytes,
-						       num_bytes,
-						       BTRFS_ORDERED_NOCOW);
+						       0,
+						       1 << BTRFS_ORDERED_NOCOW,
+						       BTRFS_COMPRESS_NONE);
 			if (ret)
 				goto error;
 		}
@@ -2864,6 +2871,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_key ins;
 	u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi);
 	u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi);
+	u64 offset = btrfs_stack_file_extent_offset(stack_fi);
 	u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi);
 	u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi);
 	struct btrfs_drop_extents_args drop_args = { 0 };
@@ -2938,7 +2946,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		goto out;
 
 	ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(inode),
-					       file_pos, qgroup_reserved, &ins);
+					       file_pos - offset,
+					       qgroup_reserved, &ins);
 out:
 	btrfs_free_path(path);
 
@@ -2964,20 +2973,20 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 					     struct btrfs_ordered_extent *oe)
 {
 	struct btrfs_file_extent_item stack_fi;
-	u64 logical_len;
 	bool update_inode_bytes;
+	u64 num_bytes = oe->num_bytes;
+	u64 ram_bytes = oe->ram_bytes;
 
 	memset(&stack_fi, 0, sizeof(stack_fi));
 	btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG);
 	btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr);
 	btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi,
 						   oe->disk_num_bytes);
+	btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset);
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags))
-		logical_len = oe->truncated_len;
-	else
-		logical_len = oe->num_bytes;
-	btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len);
-	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len);
+		num_bytes = ram_bytes = oe->truncated_len;
+	btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes);
+	btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes);
 	btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type);
 	/* Encryption and other encoding is reserved and all 0 */
 
@@ -7399,8 +7408,11 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent_dio(inode, start, block_start, len,
-					   block_len, type);
+	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
+				       block_len, 0,
+				       (1 << type) |
+				       (1 << BTRFS_ORDERED_DIRECT),
+				       BTRFS_COMPRESS_NONE);
 	if (ret) {
 		if (em) {
 			free_extent_map(em);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6b51fd2ec5ac..5e4c59b00b01 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -143,16 +143,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
 	return ret;
 }
 
-/*
- * Allocate and add a new ordered_extent into the per-inode tree.
+/**
+ * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree.
+ * @inode: inode that this extent is for.
+ * @file_offset: Logical offset in file where the extent starts.
+ * @num_bytes: Logical length of extent in file.
+ * @ram_bytes: Full length of unencoded data.
+ * @disk_bytenr: Offset of extent on disk.
+ * @disk_num_bytes: Size of extent on disk.
+ * @offset: Offset into unencoded data where file data starts.
+ * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*).
+ * @compress_type: Compression algorithm used for data.
  *
- * The tree is given a single reference on the ordered extent that was
- * inserted.
+ * Most of these parameters correspond to &struct btrfs_file_extent_item. The
+ * tree is given a single reference on the ordered extent that was inserted.
+ *
+ * Return: 0 or -ENOMEM.
  */
-static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int type, int dio,
-				      int compress_type)
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -161,7 +172,8 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	struct btrfs_ordered_extent *entry;
 	int ret;
 
-	if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
+	if (flags &
+	    ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) {
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
@@ -181,9 +193,11 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 		return -ENOMEM;
 
 	entry->file_offset = file_offset;
-	entry->disk_bytenr = disk_bytenr;
 	entry->num_bytes = num_bytes;
+	entry->ram_bytes = ram_bytes;
+	entry->disk_bytenr = disk_bytenr;
 	entry->disk_num_bytes = disk_num_bytes;
+	entry->offset = offset;
 	entry->bytes_left = num_bytes;
 	entry->inode = igrab(&inode->vfs_inode);
 	entry->compress_type = compress_type;
@@ -191,18 +205,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	entry->qgroup_rsv = ret;
 	entry->physical = (u64)-1;
 
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC ||
-	       type == BTRFS_ORDERED_COMPRESSED);
-	set_bit(type, &entry->flags);
+	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
+	entry->flags = flags;
 
 	percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes,
 				 fs_info->delalloc_batch);
 
-	if (dio)
-		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
-
 	/* one ref for the tree */
 	refcount_set(&entry->refs, 1);
 	init_waitqueue_head(&entry->wait);
@@ -247,41 +255,6 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	return 0;
 }
 
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type)
-{
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 0,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type)
-{
-	ASSERT(type == BTRFS_ORDERED_REGULAR ||
-	       type == BTRFS_ORDERED_NOCOW ||
-	       type == BTRFS_ORDERED_PREALLOC);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes, type, 1,
-					  BTRFS_COMPRESS_NONE);
-}
-
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int compress_type)
-{
-	ASSERT(compress_type != BTRFS_COMPRESS_NONE);
-	return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr,
-					  num_bytes, disk_num_bytes,
-					  BTRFS_ORDERED_COMPRESSED, 0,
-					  compress_type);
-}
-
 /*
  * Add a struct btrfs_ordered_sum into the list of checksums to be inserted
  * when an ordered extent is finished.  If the list covers more than one
@@ -1052,42 +1025,18 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	u64 file_offset = ordered->file_offset + pos;
 	u64 disk_bytenr = ordered->disk_bytenr + pos;
-	u64 num_bytes = len;
-	u64 disk_num_bytes = len;
-	int type;
-	unsigned long flags_masked = ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT);
-	int compress_type = ordered->compress_type;
-	unsigned long weight;
-	int ret;
-
-	weight = hweight_long(flags_masked);
-	WARN_ON_ONCE(weight > 1);
-	if (!weight)
-		type = 0;
-	else
-		type = __ffs(flags_masked);
+	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
 
 	/*
-	 * The splitting extent is already counted and will be added again
-	 * in btrfs_add_ordered_extent_*(). Subtract num_bytes to avoid
-	 * double counting.
+	 * The splitting extent is already counted and will be added again in
+	 * btrfs_add_ordered_extent(). Subtract len to avoid double counting.
 	 */
-	percpu_counter_add_batch(&fs_info->ordered_bytes, -num_bytes,
+	percpu_counter_add_batch(&fs_info->ordered_bytes, -len,
 				 fs_info->delalloc_batch);
-	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) {
-		WARN_ON_ONCE(1);
-		ret = btrfs_add_ordered_extent_compress(BTRFS_I(inode),
-				file_offset, disk_bytenr, num_bytes,
-				disk_num_bytes, compress_type);
-	} else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-		ret = btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset,
-				disk_bytenr, num_bytes, disk_num_bytes, type);
-	} else {
-		ret = btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
-				disk_bytenr, num_bytes, disk_num_bytes, type);
-	}
-
-	return ret;
+	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
+	return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
+					disk_bytenr, len, 0, flags,
+					ordered->compress_type);
 }
 
 int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4194e960ff61..0feb0c29839e 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -76,6 +76,13 @@ enum {
 	BTRFS_ORDERED_PENDING,
 };
 
+/* BTRFS_ORDERED_* flags that specify the type of the extent. */
+#define BTRFS_ORDERED_TYPE_FLAGS ((1UL << BTRFS_ORDERED_REGULAR) |	\
+				  (1UL << BTRFS_ORDERED_NOCOW) |	\
+				  (1UL << BTRFS_ORDERED_PREALLOC) |	\
+				  (1UL << BTRFS_ORDERED_COMPRESSED) |	\
+				  (1UL << BTRFS_ORDERED_DIRECT))
+
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
 	u64 file_offset;
@@ -84,9 +91,11 @@ struct btrfs_ordered_extent {
 	 * These fields directly correspond to the same fields in
 	 * btrfs_file_extent_item.
 	 */
-	u64 disk_bytenr;
 	u64 num_bytes;
+	u64 ram_bytes;
+	u64 disk_bytenr;
 	u64 disk_num_bytes;
+	u64 offset;
 
 	/* number of bytes that still need writing */
 	u64 bytes_left;
@@ -179,14 +188,9 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes,
-			     int type);
-int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset,
-				 u64 disk_bytenr, u64 num_bytes,
-				 u64 disk_num_bytes, int type);
-int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset,
-				      u64 disk_bytenr, u64 num_bytes,
-				      u64 disk_num_bytes, int compress_type);
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, int flags,
+			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
-- 
2.34.0



* [PATCH v12 05/17] btrfs: support different disk extent size for delalloc
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (3 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 04/17] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 06/17] btrfs: clean up cow_file_range_inline() Omar Sandoval
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
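As a rough illustration of why the csum reservation should follow the
on-disk size rather than the size in the file, here is a minimal userspace
sketch. It assumes one fixed-size checksum per sector of on-disk data;
align_up() and csum_bytes_needed() are hypothetical names, and the 4 KiB
sector and 4-byte checksum used in the note below are example values, not
taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a non-zero power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper (not a kernel function): bytes of checksum needed for
 * an extent, assuming one csum_size-byte checksum per sector stored on disk.
 */
static uint64_t csum_bytes_needed(uint64_t disk_num_bytes, uint32_t sectorsize,
				  uint32_t csum_size)
{
	return align_up(disk_num_bytes, sectorsize) / sectorsize * csum_size;
}
```

Under these assumptions, a 128 KiB file range stored as a 20 KiB compressed
extent needs csum_bytes_needed(20 * 1024, 4096, 4) bytes of checksums, far
fewer than the same call with the 128 KiB file size would reserve.
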
 fs/btrfs/ctree.h          |  3 ++-
 fs/btrfs/delalloc-space.c | 18 ++++++++++--------
 fs/btrfs/file.c           |  3 ++-
 fs/btrfs/inode.c          |  4 ++--
 fs/btrfs/relocation.c     |  2 +-
 5 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9fd677f2ce15..2e7f74060a14 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2823,7 +2823,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 				      struct btrfs_block_rsv *rsv);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
 				   u64 start, u64 end);
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index bca438c7c972..ec96f1b342e0 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -267,11 +267,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 }
 
 static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
-				    u64 num_bytes, u64 *meta_reserve,
-				    u64 *qgroup_reserve)
+				    u64 num_bytes, u64 disk_num_bytes,
+				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
 	u64 nr_extents = count_max_extents(num_bytes);
-	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
+	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
 	*meta_reserve = btrfs_calc_insert_metadata_size(fs_info,
@@ -285,7 +285,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+				    u64 disk_num_bytes)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -315,6 +316,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	}
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
+	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
 
 	/*
 	 * We always want to do it this way, every other way is wrong and ends
@@ -326,8 +328,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	 * everything out and try again, which is bad.  This way we just
 	 * over-reserve slightly, and clean up the mess when we are done.
 	 */
-	calc_inode_reservations(fs_info, num_bytes, &meta_reserve,
-				&qgroup_reserve);
+	calc_inode_reservations(fs_info, num_bytes, disk_num_bytes,
+				&meta_reserve, &qgroup_reserve);
 	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
 	if (ret)
 		return ret;
@@ -346,7 +348,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
-	inode->csum_bytes += num_bytes;
+	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
@@ -451,7 +453,7 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
 	ret = btrfs_check_data_free_space(inode, reserved, start, len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_delalloc_reserve_metadata(inode, len);
+	ret = btrfs_delalloc_reserve_metadata(inode, len, len);
 	if (ret < 0)
 		btrfs_free_reserved_data_space(inode, *reserved, start, len);
 	return ret;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11204dbbe053..5fbf0a2aba2e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1749,7 +1749,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 					 fs_info->sectorsize);
 		WARN_ON(reserve_bytes == 0);
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				reserve_bytes);
+						      reserve_bytes,
+						      reserve_bytes);
 		if (ret) {
 			if (!only_release_metadata)
 				btrfs_free_reserved_data_space(BTRFS_I(inode),
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1afadc7afff3..0c5b9832f975 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5050,7 +5050,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 			goto out;
 		}
 	}
-	ret = btrfs_delalloc_reserve_metadata(inode, blocksize);
+	ret = btrfs_delalloc_reserve_metadata(inode, blocksize, blocksize);
 	if (ret < 0) {
 		if (!only_release_metadata)
 			btrfs_free_reserved_data_space(inode, data_reserved,
@@ -7812,7 +7812,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		struct extent_map *em2;
 
 		/* We can NOCOW, so only need to reserve metadata space. */
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len);
+		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len);
 		if (ret < 0) {
 			/* Our caller expects us to free the input extent map. */
 			free_extent_map(em);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index a455a1ead0d6..fa8dcc3375b5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2996,7 +2996,7 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 
 		/* Reserve metadata for this range */
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-						      clamped_len);
+						      clamped_len, clamped_len);
 		if (ret)
 			goto release_page;
 
-- 
2.34.0



* [PATCH v12 06/17] btrfs: clean up cow_file_range_inline()
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (4 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 05/17] btrfs: support different disk extent size for delalloc Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 07/17] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

The start parameter to cow_file_range_inline() (and
insert_inline_extent()) is always 0, so get rid of it and simplify the
logic in those two functions. Pass btrfs_inode to insert_inline_extent()
and remove the redundant root parameter. Also document the requirements
for creating an inline extent. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
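The requirements for creating an inline extent, as documented by this
patch, can be sketched in userspace roughly as follows. This is only a
model of the check: struct inline_limits and can_create_inline_extent()
are hypothetical names, and max_leaf_inline stands in for whatever
BTRFS_MAX_INLINE_DATA_SIZE() evaluates to on a given filesystem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the limits the kernel reads from fs_info. */
struct inline_limits {
	uint64_t sectorsize;      /* filesystem sector size */
	uint64_t max_leaf_inline; /* stands in for BTRFS_MAX_INLINE_DATA_SIZE() */
	uint64_t max_inline;      /* configured maximum inline size */
};

/*
 * Returns true if an inline extent of 'size' decompressed bytes could be
 * created. compressed_size == 0 means the data is not compressed.
 */
static bool can_create_inline_extent(const struct inline_limits *lim,
				     uint64_t i_size, uint64_t size,
				     uint64_t compressed_size)
{
	uint64_t data_len = compressed_size ? compressed_size : size;

	if (size < i_size)                   /* must end at or beyond i_size */
		return false;
	if (size > lim->sectorsize)          /* at most one sector, decompressed */
		return false;
	if (data_len > lim->max_leaf_inline) /* stored data must fit in a leaf */
		return false;
	if (data_len > lim->max_inline)      /* and within the configured limit */
		return false;
	return true;
}
```

Note that the size limits on the stored data apply to data_len, so a
3000-byte range that compresses to 1000 bytes can pass a 2048-byte
max_inline limit even though the uncompressed data would not.
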
 fs/btrfs/inode.c | 88 +++++++++++++++++++++---------------------------
 1 file changed, 38 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c5b9832f975..a5cae0c6d992 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -233,12 +233,13 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
  * no overlapping inline items exist in the btree
  */
 static int insert_inline_extent(struct btrfs_trans_handle *trans,
-				struct btrfs_path *path, bool extent_inserted,
-				struct btrfs_root *root, struct inode *inode,
-				u64 start, size_t size, size_t compressed_size,
+				struct btrfs_path *path,
+				struct btrfs_inode *inode, bool extent_inserted,
+				size_t size, size_t compressed_size,
 				int compress_type,
 				struct page **compressed_pages)
 {
+	struct btrfs_root *root = inode->root;
 	struct extent_buffer *leaf;
 	struct page *page = NULL;
 	char *kaddr;
@@ -246,7 +247,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *ei;
 	int ret;
 	size_t cur_size = size;
-	unsigned long offset;
 
 	ASSERT((compressed_size > 0 && compressed_pages) ||
 	       (compressed_size == 0 && !compressed_pages));
@@ -258,8 +258,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		struct btrfs_key key;
 		size_t datasize;
 
-		key.objectid = btrfs_ino(BTRFS_I(inode));
-		key.offset = start;
+		key.objectid = btrfs_ino(inode);
+		key.offset = 0;
 		key.type = BTRFS_EXTENT_DATA_KEY;
 
 		datasize = btrfs_file_extent_calc_inline_size(cur_size);
@@ -297,12 +297,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		btrfs_set_file_extent_compression(leaf, ei,
 						  compress_type);
 	} else {
-		page = find_get_page(inode->i_mapping,
-				     start >> PAGE_SHIFT);
+		page = find_get_page(inode->vfs_inode.i_mapping, 0);
 		btrfs_set_file_extent_compression(leaf, ei, 0);
 		kaddr = kmap_atomic(page);
-		offset = offset_in_page(start);
-		write_extent_buffer(leaf, kaddr + offset, ptr, size);
+		write_extent_buffer(leaf, kaddr, ptr, size);
 		kunmap_atomic(kaddr);
 		put_page(page);
 	}
@@ -313,8 +311,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * We align size to sectorsize for inline extents just for simplicity
 	 * sake.
 	 */
-	size = ALIGN(size, root->fs_info->sectorsize);
-	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), start, size);
+	ret = btrfs_inode_set_file_extent_range(inode, 0,
+					ALIGN(size, root->fs_info->sectorsize));
 	if (ret)
 		goto fail;
 
@@ -327,7 +325,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * before we unlock the pages.  Otherwise we
 	 * could end up racing with unlink.
 	 */
-	BTRFS_I(inode)->disk_i_size = inode->i_size;
+	inode->disk_i_size = i_size_read(&inode->vfs_inode);
+
 fail:
 	return ret;
 }
@@ -338,8 +337,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
  * does the checks required to make sure the data is small enough
  * to fit as an inline extent.
  */
-static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
-					  u64 end, size_t compressed_size,
+static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
+					  size_t compressed_size,
 					  int compress_type,
 					  struct page **compressed_pages)
 {
@@ -347,26 +346,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
-	u64 isize = i_size_read(&inode->vfs_inode);
-	u64 actual_end = min(end + 1, isize);
-	u64 inline_len = actual_end - start;
-	u64 aligned_end = ALIGN(end, fs_info->sectorsize);
-	u64 data_len = inline_len;
+	u64 data_len = compressed_size ? compressed_size : size;
 	int ret;
 	struct btrfs_path *path;
 
-	if (compressed_size)
-		data_len = compressed_size;
-
-	if (start > 0 ||
-	    actual_end > fs_info->sectorsize ||
+	/*
+	 * We can create an inline extent if it ends at or beyond the current
+	 * i_size, is no larger than a sector (decompressed), and the (possibly
+	 * compressed) data fits in a leaf and the configured maximum inline
+	 * size.
+	 */
+	if (size < i_size_read(&inode->vfs_inode) ||
+	    size > fs_info->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) ||
-	    (!compressed_size &&
-	    (actual_end & (fs_info->sectorsize - 1)) == 0) ||
-	    end + 1 < isize ||
-	    data_len > fs_info->max_inline) {
+	    data_len > fs_info->max_inline)
 		return 1;
-	}
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -380,30 +374,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 	trans->block_rsv = &inode->block_rsv;
 
 	drop_args.path = path;
-	drop_args.start = start;
-	drop_args.end = aligned_end;
+	drop_args.start = 0;
+	drop_args.end = fs_info->sectorsize;
 	drop_args.drop_cache = true;
 	drop_args.replace_extent = true;
-
-	if (compressed_size && compressed_pages)
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		   compressed_size);
-	else
-		drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(
-		    inline_len);
-
+	drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(data_len);
 	ret = btrfs_drop_extents(trans, root, inode, &drop_args);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
 	}
 
-	if (isize > actual_end)
-		inline_len = min_t(u64, isize, actual_end);
-	ret = insert_inline_extent(trans, path, drop_args.extent_inserted,
-				   root, &inode->vfs_inode, start,
-				   inline_len, compressed_size,
-				   compress_type, compressed_pages);
+	ret = insert_inline_extent(trans, path, inode,
+				   drop_args.extent_inserted, size,
+				   compressed_size, compress_type,
+				   compressed_pages);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
@@ -412,7 +397,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 		goto out;
 	}
 
-	btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found);
+	btrfs_update_inode_bytes(inode, size, drop_args.bytes_found);
 	ret = btrfs_update_inode(trans, root, inode);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
@@ -734,12 +719,12 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 			/* we didn't compress the entire range, try
 			 * to make an uncompressed inline extent.
 			 */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    0, BTRFS_COMPRESS_NONE,
 						    NULL);
 		} else {
 			/* try making a compressed inline extent */
-			ret = cow_file_range_inline(BTRFS_I(inode), start, end,
+			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    total_compressed,
 						    compress_type, pages);
 		}
@@ -1154,8 +1139,11 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * So here we skip inline extent creation completely.
 	 */
 	if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
+		u64 actual_end = min_t(u64, i_size_read(&inode->vfs_inode),
+				       end + 1);
+
 		/* lets try to make an inline extent */
-		ret = cow_file_range_inline(inode, start, end, 0,
+		ret = cow_file_range_inline(inode, actual_end, 0,
 					    BTRFS_COMPRESS_NONE, NULL);
 		if (ret == 0) {
 			/*
-- 
2.34.0



* [PATCH v12 07/17] btrfs: optionally extend i_size in cow_file_range_inline()
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (5 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 06/17] btrfs: clean up cow_file_range_inline() Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 08/17] btrfs: add definitions + documentation for encoded I/O ioctls Omar Sandoval
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we have successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
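The effect of the new update_i_size parameter can be modeled in userspace
as below. This is a sketch only: struct inode_sizes and
finish_inline_extent() are hypothetical names, and plain field accesses
stand in for i_size_read() and i_size_write().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two sizes tracked for an inode. */
struct inode_sizes {
	uint64_t i_size;      /* in-memory file size */
	uint64_t disk_i_size; /* size recorded for the on-disk inode */
};

/*
 * Model of the tail of insert_inline_extent(): optionally extend i_size to
 * cover the new inline extent, then record the result as disk_i_size.
 */
static void finish_inline_extent(struct inode_sizes *inode, uint64_t size,
				 bool update_i_size)
{
	uint64_t i_size = inode->i_size;

	if (update_i_size && size > i_size) {
		inode->i_size = size;
		i_size = size;
	}
	inode->disk_i_size = i_size;
}
```

With update_i_size false, this reduces to copying i_size into disk_i_size,
which is what the callers converted in this patch continue to get.
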
 fs/btrfs/inode.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a5cae0c6d992..c2efea101f61 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -237,7 +237,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 				struct btrfs_inode *inode, bool extent_inserted,
 				size_t size, size_t compressed_size,
 				int compress_type,
-				struct page **compressed_pages)
+				struct page **compressed_pages,
+				bool update_i_size)
 {
 	struct btrfs_root *root = inode->root;
 	struct extent_buffer *leaf;
@@ -247,6 +248,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *ei;
 	int ret;
 	size_t cur_size = size;
+	u64 i_size;
 
 	ASSERT((compressed_size > 0 && compressed_pages) ||
 	       (compressed_size == 0 && !compressed_pages));
@@ -325,7 +327,12 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 	 * before we unlock the pages.  Otherwise we
 	 * could end up racing with unlink.
 	 */
-	inode->disk_i_size = i_size_read(&inode->vfs_inode);
+	i_size = i_size_read(&inode->vfs_inode);
+	if (update_i_size && size > i_size) {
+		i_size_write(&inode->vfs_inode, size);
+		i_size = size;
+	}
+	inode->disk_i_size = i_size;
 
 fail:
 	return ret;
@@ -340,7 +347,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
 					  size_t compressed_size,
 					  int compress_type,
-					  struct page **compressed_pages)
+					  struct page **compressed_pages,
+					  bool update_i_size)
 {
 	struct btrfs_drop_extents_args drop_args = { 0 };
 	struct btrfs_root *root = inode->root;
@@ -388,7 +396,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
 	ret = insert_inline_extent(trans, path, inode,
 				   drop_args.extent_inserted, size,
 				   compressed_size, compress_type,
-				   compressed_pages);
+				   compressed_pages, update_i_size);
 	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, ret);
 		goto out;
@@ -721,12 +729,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 			 */
 			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    0, BTRFS_COMPRESS_NONE,
-						    NULL);
+						    NULL, false);
 		} else {
 			/* try making a compressed inline extent */
 			ret = cow_file_range_inline(BTRFS_I(inode), actual_end,
 						    total_compressed,
-						    compress_type, pages);
+						    compress_type, pages,
+						    false);
 		}
 		if (ret <= 0) {
 			unsigned long clear_flags = EXTENT_DELALLOC |
@@ -1144,7 +1153,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(inode, actual_end, 0,
-					    BTRFS_COMPRESS_NONE, NULL);
+					    BTRFS_COMPRESS_NONE, NULL, false);
 		if (ret == 0) {
 			/*
 			 * We use DO_ACCOUNTING here because we need the
-- 
2.34.0



* [PATCH v12 08/17] btrfs: add definitions + documentation for encoded I/O ioctls
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (6 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 07/17] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

In order to allow sending and receiving compressed data without
decompressing it, we need an interface to write pre-compressed data
directly to the filesystem and the matching interface to read compressed
data without decompressing it. This adds the definitions for ioctls to
do that and detailed explanations of how to use them.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
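To make the documented constraints on writes easier to follow, here is a
userspace sketch that checks them before an ioctl would be issued. It is
derived only from the comments added in this patch, not from the kernel's
own validation. struct encoded_write_sketch, the SKETCH_* constants, and
encoded_write_args_ok() are hypothetical names; encoded_len stands for the
sum of iov[n].iov_len, and rejecting a negative offset is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LEN (128 * 1024ULL) /* 128 KiB limit quoted in the comments */
#define SKETCH_COMPRESSION_NONE 0 /* mirrors BTRFS_ENCODED_IO_COMPRESSION_NONE */
#define SKETCH_ENCRYPTION_NONE 0  /* mirrors BTRFS_ENCODED_IO_ENCRYPTION_NONE */

/* Hypothetical mirror of the fields that matter for a write. */
struct encoded_write_sketch {
	uint64_t encoded_len; /* sum of iov[n].iov_len */
	int64_t offset;
	uint64_t flags;
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
};

static bool encoded_write_args_ok(const struct encoded_write_sketch *a,
				  uint64_t sectorsize, uint64_t i_size)
{
	if (a->flags != 0 || a->encryption != SKETCH_ENCRYPTION_NONE)
		return false;
	if (a->compression == SKETCH_COMPRESSION_NONE)
		return false;
	if (a->unencoded_len > SKETCH_MAX_LEN)
		return false;
	/* "less than 128 KiB" and "less than or equal to unencoded_len" */
	if (a->encoded_len >= SKETCH_MAX_LEN || a->encoded_len > a->unencoded_len)
		return false;
	if (a->unencoded_offset >= a->unencoded_len)
		return false;
	if (a->len > a->unencoded_len - a->unencoded_offset)
		return false;
	/* Assumption: a negative offset is never valid. */
	if (a->offset < 0 || (uint64_t)a->offset % sectorsize != 0)
		return false;
	/* len may be unaligned only if the data ends at or beyond EOF. */
	if (a->len % sectorsize != 0 && (uint64_t)a->offset + a->len < i_size)
		return false;
	return true;
}
```

The sketch does not model the reserved field, which the comments say must
be zeroed for writes, and it does not validate the encoded data itself,
which the interface leaves unchecked at write time.
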
 include/uapi/linux/btrfs.h | 132 +++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 738619994e26..7505acfa18d7 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -868,6 +868,134 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 		__u8 align[7];
 };
 
+/*
+ * Data and metadata for an encoded read or write.
+ *
+ * Encoded I/O bypasses any encoding automatically done by the filesystem (e.g.,
+ * compression). This can be used to read the compressed contents of a file or
+ * write pre-compressed data directly to a file.
+ *
+ * BTRFS_IOC_ENCODED_READ and BTRFS_IOC_ENCODED_WRITE are essentially
+ * preadv/pwritev with additional metadata about how the data is encoded and the
+ * size of the unencoded data.
+ *
+ * BTRFS_IOC_ENCODED_READ fills the given iovecs with the encoded data, fills
+ * the metadata fields, and returns the size of the encoded data. It reads one
+ * extent per call. It can also read data which is not encoded.
+ *
+ * BTRFS_IOC_ENCODED_WRITE uses the metadata fields, writes the encoded data
+ * from the iovecs, and returns the size of the encoded data. Note that the
+ * encoded data is not validated when it is written; if it is not valid (e.g.,
+ * it cannot be decompressed), then a subsequent read may return an error.
+ *
+ * Since the filesystem page cache contains decoded data, encoded I/O bypasses
+ * the page cache. Encoded I/O requires CAP_SYS_ADMIN.
+ */
+struct btrfs_ioctl_encoded_io_args {
+	/* Input parameters for both reads and writes. */
+
+	/*
+	 * iovecs containing encoded data.
+	 *
+	 * For reads, if the size of the encoded data is larger than the sum of
+	 * iov[n].iov_len for 0 <= n < iovcnt, then the ioctl fails with
+	 * ENOBUFS.
+	 *
+	 * For writes, the size of the encoded data is the sum of iov[n].iov_len
+	 * for 0 <= n < iovcnt. This must be less than 128 KiB (this limit may
+	 * increase in the future). This must also be less than or equal to
+	 * unencoded_len.
+	 */
+	const struct iovec __user *iov;
+	/* Number of iovecs. */
+	unsigned long iovcnt;
+	/*
+	 * Offset in file.
+	 *
+	 * For writes, must be aligned to the sector size of the filesystem.
+	 */
+	__s64 offset;
+	/* Currently must be zero. */
+	__u64 flags;
+
+	/*
+	 * For reads, the following members are output parameters that will
+	 * contain the returned metadata for the encoded data.
+	 * For writes, the following members must be set to the metadata for the
+	 * encoded data.
+	 */
+
+	/*
+	 * Length of the data in the file.
+	 *
+	 * Must be less than or equal to unencoded_len - unencoded_offset. For
+	 * writes, must be aligned to the sector size of the filesystem unless
+	 * the data ends at or beyond the current end of the file.
+	 */
+	__u64 len;
+	/*
+	 * Length of the unencoded (i.e., decrypted and decompressed) data.
+	 *
+	 * For writes, must be no more than 128 KiB (this limit may increase in
+	 * the future). If the unencoded data is actually longer than
+	 * unencoded_len, then it is truncated; if it is shorter, then it is
+	 * extended with zeroes.
+	 */
+	__u64 unencoded_len;
+	/*
+	 * Offset from the first byte of the unencoded data to the first byte of
+	 * logical data in the file.
+	 *
+	 * Must be less than unencoded_len.
+	 */
+	__u64 unencoded_offset;
+	/*
+	 * BTRFS_ENCODED_IO_COMPRESSION_* type.
+	 *
+	 * For writes, must not be BTRFS_ENCODED_IO_COMPRESSION_NONE.
+	 */
+	__u32 compression;
+	/* Currently always BTRFS_ENCODED_IO_ENCRYPTION_NONE. */
+	__u32 encryption;
+	/*
+	 * Reserved for future expansion.
+	 *
+	 * For reads, always returned as zero. Users should check for non-zero
+	 * bytes. If there are any, then the kernel has a newer version of this
+	 * structure with additional information that the user definition is
+	 * missing.
+	 *
+	 * For writes, must be zeroed.
+	 */
+	__u8 reserved[32];
+};
+
+/* Data is not compressed. */
+#define BTRFS_ENCODED_IO_COMPRESSION_NONE 0
+/* Data is compressed as a single zlib stream. */
+#define BTRFS_ENCODED_IO_COMPRESSION_ZLIB 1
+/*
+ * Data is compressed as a single zstd frame with the windowLog compression
+ * parameter set to no more than 17.
+ */
+#define BTRFS_ENCODED_IO_COMPRESSION_ZSTD 2
+/*
+ * Data is compressed sector by sector (using the sector size indicated by the
+ * name of the constant) with LZO1X and wrapped in the format documented in
+ * fs/btrfs/lzo.c. For writes, the compression sector size must match the
+ * filesystem sector size.
+ */
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_4K 3
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_8K 4
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_16K 5
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_32K 6
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_64K 7
+#define BTRFS_ENCODED_IO_COMPRESSION_TYPES 8
+
+/* Data is not encrypted. */
+#define BTRFS_ENCODED_IO_ENCRYPTION_NONE 0
+#define BTRFS_ENCODED_IO_ENCRYPTION_TYPES 1
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
@@ -996,5 +1124,9 @@ enum btrfs_err_code {
 				struct btrfs_ioctl_ino_lookup_user_args)
 #define BTRFS_IOC_SNAP_DESTROY_V2 _IOW(BTRFS_IOCTL_MAGIC, 63, \
 				struct btrfs_ioctl_vol_args_v2)
+#define BTRFS_IOC_ENCODED_READ _IOR(BTRFS_IOCTL_MAGIC, 64, \
+				    struct btrfs_ioctl_encoded_io_args)
+#define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
+				     struct btrfs_ioctl_encoded_io_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
2.34.0



* [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (7 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 08/17] btrfs: add definitions + documentation for encoded I/O ioctls Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-18 14:55   ` David Sterba
                     ` (2 more replies)
  2021-11-17 20:19 ` [PATCH v12 10/17] btrfs: add BTRFS_IOC_ENCODED_WRITE Omar Sandoval
                   ` (17 subsequent siblings)
  26 siblings, 3 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
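The four cases in the commit message can be pictured as a simple
classification. The sketch below is a userspace model, not the code in this
patch: enum encoded_read_case, struct extent_sketch, and classify_extent()
are hypothetical names, and the order of the checks is an assumption (it
only matters for inline extents, which may also be compressed).

```c
#include <stdbool.h>

/* Hypothetical labels for the four cases listed in the commit message. */
enum encoded_read_case {
	READ_INLINE,     /* 1. copy data out of the extent buffer */
	READ_ZEROES,     /* 2. hole or preallocated extent: fill in zeroes */
	READ_REGULAR,    /* 3. read only the sectors needed from disk */
	READ_COMPRESSED, /* 4. read the whole compressed extent from disk */
};

/* Hypothetical summary of the properties of the extent being read. */
struct extent_sketch {
	bool is_inline;
	bool is_hole;
	bool is_prealloc;
	bool is_compressed;
};

static enum encoded_read_case classify_extent(const struct extent_sketch *e)
{
	if (e->is_inline)
		return READ_INLINE;
	if (e->is_hole || e->is_prealloc)
		return READ_ZEROES;
	if (e->is_compressed)
		return READ_COMPRESSED;
	return READ_REGULAR;
}
```

In the compressed case the caller would additionally report which subset of
the decompressed extent is in the file, which this sketch leaves out.
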
 fs/btrfs/ctree.h |   4 +
 fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.c | 106 ++++++++++
 3 files changed, 606 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2e7f74060a14..70034e33abe6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, bool uptodate);
+struct btrfs_ioctl_encoded_io_args;
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
+			   struct btrfs_ioctl_encoded_io_args *encoded);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
 extern const struct iomap_dio_ops btrfs_dio_ops;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c2efea101f61..d29e968fd18b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10525,6 +10525,502 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 	}
 }
 
+static int btrfs_encoded_io_compression_from_extent(int compress_type)
+{
+	switch (compress_type) {
+	case BTRFS_COMPRESS_NONE:
+		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
+	case BTRFS_COMPRESS_ZLIB:
+		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
+	case BTRFS_COMPRESS_LZO:
+		/*
+		 * The LZO format depends on the page size. 64k is the maximum
+		 * sectorsize (and thus page size) that we support.
+		 */
+		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
+			return -EINVAL;
+		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
+	case BTRFS_COMPRESS_ZSTD:
+		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
+	default:
+		return -EUCLEAN;
+	}
+}
+
+static ssize_t btrfs_encoded_read_inline(
+				struct kiocb *iocb,
+				struct iov_iter *iter, u64 start,
+				u64 lockend,
+				struct extent_state **cached_state,
+				u64 extent_start, size_t count,
+				struct btrfs_ioctl_encoded_io_args *encoded,
+				bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *item;
+	u64 ram_bytes;
+	unsigned long ptr;
+	void *tmp;
+	ssize_t ret;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
+				       btrfs_ino(BTRFS_I(inode)), extent_start,
+				       0);
+	if (ret) {
+		if (ret > 0) {
+			/* The extent item disappeared? */
+			ret = -EIO;
+		}
+		goto out;
+	}
+	leaf = path->nodes[0];
+	item = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_file_extent_item);
+
+	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
+	ptr = btrfs_file_extent_inline_start(item);
+
+	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
+			iocb->ki_pos);
+	ret = btrfs_encoded_io_compression_from_extent(
+				 btrfs_file_extent_compression(leaf, item));
+	if (ret < 0)
+		goto out;
+	encoded->compression = ret;
+	if (encoded->compression) {
+		size_t inline_size;
+
+		inline_size = btrfs_file_extent_inline_item_len(leaf,
+								path->slots[0]);
+		if (inline_size > count) {
+			ret = -ENOBUFS;
+			goto out;
+		}
+		count = inline_size;
+		encoded->unencoded_len = ram_bytes;
+		encoded->unencoded_offset = iocb->ki_pos - extent_start;
+	} else {
+		encoded->len = encoded->unencoded_len = count =
+			min_t(u64, count, encoded->len);
+		ptr += iocb->ki_pos - extent_start;
+	}
+
+	tmp = kmalloc(count, GFP_NOFS);
+	if (!tmp) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	read_extent_buffer(leaf, tmp, ptr, count);
+	btrfs_release_path(path);
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	ret = copy_to_iter(tmp, count, iter);
+	if (ret != count)
+		ret = -EFAULT;
+	kfree(tmp);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+struct btrfs_encoded_read_private {
+	struct inode *inode;
+	u64 file_offset;
+	wait_queue_head_t wait;
+	atomic_t pending;
+	blk_status_t status;
+	bool skip_csum;
+};
+
+static blk_status_t submit_encoded_read_bio(struct inode *inode,
+					    struct bio *bio, int mirror_num,
+					    unsigned long bio_flags)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	blk_status_t ret;
+
+	if (!priv->skip_csum) {
+		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
+		if (ret)
+			return ret;
+	}
+
+	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
+	if (ret) {
+		btrfs_bio_free_csum(bbio);
+		return ret;
+	}
+
+	atomic_inc(&priv->pending);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
+	if (ret) {
+		atomic_dec(&priv->pending);
+		btrfs_bio_free_csum(bbio);
+	}
+	return ret;
+}
+
+static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
+{
+	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
+	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
+	struct inode *inode = priv->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 sectorsize = fs_info->sectorsize;
+	struct bio_vec *bvec;
+	struct bvec_iter_all iter_all;
+	u64 start = priv->file_offset;
+	u32 bio_offset = 0;
+
+	if (priv->skip_csum || !uptodate)
+		return bbio->bio.bi_status;
+
+	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
+		unsigned int i, nr_sectors, pgoff;
+
+		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
+		pgoff = bvec->bv_offset;
+		for (i = 0; i < nr_sectors; i++) {
+			ASSERT(pgoff < PAGE_SIZE);
+			if (check_data_csum(inode, bbio, bio_offset,
+					    bvec->bv_page, pgoff, start))
+				return BLK_STS_IOERR;
+			start += sectorsize;
+			bio_offset += sectorsize;
+			pgoff += sectorsize;
+		}
+	}
+	return BLK_STS_OK;
+}
+
+static void btrfs_encoded_read_endio(struct bio *bio)
+{
+	struct btrfs_encoded_read_private *priv = bio->bi_private;
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+	blk_status_t status;
+
+	status = btrfs_encoded_read_verify_csum(bbio);
+	if (status) {
+		/*
+		 * The memory barrier implied by the atomic_dec_return() here
+		 * pairs with the memory barrier implied by the
+		 * atomic_dec_return() or io_wait_event() in
+		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
+		 * write is observed before the load of status in
+		 * btrfs_encoded_read_regular_fill_pages().
+		 */
+		WRITE_ONCE(priv->status, status);
+	}
+	if (!atomic_dec_return(&priv->pending))
+		wake_up(&priv->wait);
+	btrfs_bio_free_csum(bbio);
+	bio_put(bio);
+}
+
+static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
+						 u64 file_offset,
+						 u64 disk_bytenr,
+						 u64 disk_io_size,
+						 struct page **pages)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_encoded_read_private priv = {
+		.inode = inode,
+		.file_offset = file_offset,
+		.pending = ATOMIC_INIT(1),
+		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
+	};
+	unsigned long i = 0;
+	u64 cur = 0;
+	int ret;
+
+	init_waitqueue_head(&priv.wait);
+	/*
+	 * Submit bios for the extent, splitting due to bio or stripe limits as
+	 * necessary.
+	 */
+	while (cur < disk_io_size) {
+		struct extent_map *em;
+		struct btrfs_io_geometry geom;
+		struct bio *bio = NULL;
+		u64 remaining;
+
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
+					 disk_io_size - cur);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+		} else {
+			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
+						    disk_bytenr + cur, &geom);
+			free_extent_map(em);
+		}
+		if (ret) {
+			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
+			break;
+		}
+		remaining = min(geom.len, disk_io_size - cur);
+		while (bio || remaining) {
+			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
+
+			if (!bio) {
+				bio = btrfs_bio_alloc(BIO_MAX_VECS);
+				bio->bi_iter.bi_sector =
+					(disk_bytenr + cur) >> SECTOR_SHIFT;
+				bio->bi_end_io = btrfs_encoded_read_endio;
+				bio->bi_private = &priv;
+				bio->bi_opf = REQ_OP_READ;
+			}
+
+			if (!bytes ||
+			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
+				blk_status_t status;
+
+				status = submit_encoded_read_bio(inode, bio, 0,
+								 0);
+				if (status) {
+					WRITE_ONCE(priv.status, status);
+					bio_put(bio);
+					goto out;
+				}
+				bio = NULL;
+				continue;
+			}
+
+			i++;
+			cur += bytes;
+			remaining -= bytes;
+		}
+	}
+
+out:
+	if (atomic_dec_return(&priv.pending))
+		io_wait_event(priv.wait, !atomic_read(&priv.pending));
+	/* See btrfs_encoded_read_endio() for ordering. */
+	return blk_status_to_errno(READ_ONCE(priv.status));
+}
+
+static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
+					  struct iov_iter *iter,
+					  u64 start, u64 lockend,
+					  struct extent_state **cached_state,
+					  u64 disk_bytenr, u64 disk_io_size,
+					  size_t count, bool compressed,
+					  bool *unlocked)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct page **pages;
+	unsigned long nr_pages, i;
+	u64 cur;
+	size_t page_offset;
+	ssize_t ret;
+
+	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
+						    disk_io_size, pages);
+	if (ret)
+		goto out;
+
+	unlock_extent_cached(io_tree, start, lockend, cached_state);
+	inode_unlock_shared(inode);
+	*unlocked = true;
+
+	if (compressed) {
+		i = 0;
+		page_offset = 0;
+	} else {
+		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
+		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
+	}
+	cur = 0;
+	while (cur < count) {
+		size_t bytes = min_t(size_t, count - cur,
+				     PAGE_SIZE - page_offset);
+
+		if (copy_page_to_iter(pages[i], page_offset, bytes,
+				      iter) != bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+		i++;
+		cur += bytes;
+		page_offset = 0;
+	}
+	ret = count;
+out:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kfree(pages);
+	return ret;
+}
+
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
+			   struct btrfs_ioctl_encoded_io_args *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	ssize_t ret;
+	size_t count = iov_iter_count(iter);
+	u64 start, lockend, disk_bytenr, disk_io_size;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em;
+	bool unlocked = false;
+
+	file_accessed(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+
+	if (iocb->ki_pos >= inode->i_size) {
+		inode_unlock_shared(inode);
+		return 0;
+	}
+	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
+	/*
+	 * We don't know how long the extent containing iocb->ki_pos is, but if
+	 * it's compressed we know that it won't be longer than this.
+	 */
+	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start,
+					       lockend - start + 1);
+		if (ret)
+			goto out_unlock_inode;
+		lock_extent_bits(io_tree, start, lockend, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     lockend - start + 1);
+		if (!ordered)
+			break;
+		btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		cond_resched();
+	}
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+			      lockend - start + 1);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_unlock_extent;
+	}
+
+	if (em->block_start == EXTENT_MAP_INLINE) {
+		u64 extent_start = em->start;
+
+		/*
+		 * For inline extents we get everything we need out of the
+		 * extent item.
+		 */
+		free_extent_map(em);
+		em = NULL;
+		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
+						&cached_state, extent_start,
+						count, encoded, &unlocked);
+		goto out;
+	}
+
+	/*
+	 * We only want to return up to EOF even if the extent extends beyond
+	 * that.
+	 */
+	encoded->len = (min_t(u64, extent_map_end(em), inode->i_size) -
+			iocb->ki_pos);
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		disk_bytenr = EXTENT_MAP_HOLE;
+		encoded->len = encoded->unencoded_len = count =
+			min_t(u64, count, encoded->len);
+	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+		disk_bytenr = em->block_start;
+		/*
+		 * Bail if the buffer isn't large enough to return the whole
+		 * compressed extent.
+		 */
+		if (em->block_len > count) {
+			ret = -ENOBUFS;
+			goto out_em;
+		}
+		disk_io_size = count = em->block_len;
+		encoded->unencoded_len = em->ram_bytes;
+		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
+		ret = btrfs_encoded_io_compression_from_extent(
+							     em->compress_type);
+		if (ret < 0)
+			goto out_em;
+		encoded->compression = ret;
+	} else {
+		disk_bytenr = em->block_start + (start - em->start);
+		if (encoded->len > count)
+			encoded->len = count;
+		/*
+		 * Don't read beyond what we locked. This also limits the page
+		 * allocations that we'll do.
+		 */
+		disk_io_size = min(lockend + 1,
+				   iocb->ki_pos + encoded->len) - start;
+		encoded->len = encoded->unencoded_len = count =
+			start + disk_io_size - iocb->ki_pos;
+		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
+	}
+	free_extent_map(em);
+	em = NULL;
+
+	if (disk_bytenr == EXTENT_MAP_HOLE) {
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+		inode_unlock_shared(inode);
+		unlocked = true;
+		ret = iov_iter_zero(count, iter);
+		if (ret != count)
+			ret = -EFAULT;
+	} else {
+		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
+						 &cached_state, disk_bytenr,
+						 disk_io_size, count,
+						 encoded->compression,
+						 &unlocked);
+	}
+
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded->len;
+out_em:
+	free_extent_map(em);
+out_unlock_extent:
+	if (!unlocked)
+		unlock_extent_cached(io_tree, start, lockend, &cached_state);
+out_unlock_inode:
+	if (!unlocked)
+		inode_unlock_shared(inode);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 05c77a1979a9..f0c575223d88 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -28,6 +28,7 @@
 #include <linux/iversion.h>
 #include <linux/fileattr.h>
 #include <linux/fsverity.h>
+#include <linux/sched/xacct.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "export.h"
@@ -88,6 +89,22 @@ struct btrfs_ioctl_send_args_32 {
 
 #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
 			       struct btrfs_ioctl_send_args_32)
+
+struct btrfs_ioctl_encoded_io_args_32 {
+	compat_uptr_t iov;
+	compat_ulong_t iovcnt;
+	__s64 offset;
+	__u64 flags;
+	__u64 len;
+	__u64 unencoded_len;
+	__u64 unencoded_offset;
+	__u32 compression;
+	__u32 encryption;
+	__u32 reserved[8];
+};
+
+#define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
+				       struct btrfs_ioctl_encoded_io_args_32)
 #endif
 
 /* Mask out flags that are inappropriate for the given type of inode. */
@@ -4861,6 +4878,89 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 	return ret;
 }
 
+static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
+				    bool compat)
+{
+	struct btrfs_ioctl_encoded_io_args args = {};
+	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
+					     flags);
+	size_t copy_end;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	loff_t pos;
+	struct kiocb kiocb;
+	ssize_t ret;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_acct;
+	}
+
+	if (compat) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_encoded_io_args_32 args32;
+
+		copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
+				       flags);
+		if (copy_from_user(&args32, argp, copy_end)) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+		args.iov = compat_ptr(args32.iov);
+		args.iovcnt = args32.iovcnt;
+		args.offset = args32.offset;
+		args.flags = args32.flags;
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		copy_end = copy_end_kernel;
+		if (copy_from_user(&args, argp, copy_end)) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+	}
+	if (args.flags != 0) {
+		ret = -EINVAL;
+		goto out_acct;
+	}
+
+	ret = import_iovec(READ, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
+			   &iov, &iter);
+	if (ret < 0)
+		goto out_acct;
+
+	if (iov_iter_count(&iter) == 0) {
+		ret = 0;
+		goto out_iov;
+	}
+	pos = args.offset;
+	ret = rw_verify_area(READ, file, &pos, args.len);
+	if (ret < 0)
+		goto out_iov;
+
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = pos;
+
+	ret = btrfs_encoded_read(&kiocb, &iter, &args);
+	if (ret >= 0) {
+		fsnotify_access(file);
+		if (copy_to_user(argp + copy_end,
+				 (char *)&args + copy_end_kernel,
+				 sizeof(args) - copy_end_kernel))
+			ret = -EFAULT;
+	}
+
+out_iov:
+	kfree(iov);
+out_acct:
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5005,6 +5105,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return fsverity_ioctl_enable(file, (const void __user *)argp);
 	case FS_IOC_MEASURE_VERITY:
 		return fsverity_ioctl_measure(file, argp);
+	case BTRFS_IOC_ENCODED_READ:
+		return btrfs_ioctl_encoded_read(file, argp, false);
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+	case BTRFS_IOC_ENCODED_READ_32:
+		return btrfs_ioctl_encoded_read(file, argp, true);
+#endif
 	}
 
 	return -ENOTTY;
-- 
2.34.0
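
For illustration only, a userspace caller of the read ioctl might look
roughly like the sketch below. Everything named *_sketch is
hypothetical. The struct mirrors the field order of
btrfs_ioctl_encoded_io_args_32 from this patch with native pointer and
long types, which is an assumption; the authoritative definition is the
UAPI header from the earlier patch in this series and should be used
instead of a local copy. The magic value 0x94 is BTRFS_IOCTL_MAGIC.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/uio.h>

/* Local stand-in for struct btrfs_ioctl_encoded_io_args (assumed layout). */
struct encoded_io_args_sketch {
	const struct iovec *iov;
	unsigned long iovcnt;
	int64_t offset;
	uint64_t flags;
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
	uint32_t reserved[8];
};

#define IOC_ENCODED_READ_SKETCH \
	_IOR(0x94, 64, struct encoded_io_args_sketch)

/*
 * Read one extent's worth of encoded data at offset into buf.
 * Returns the number of bytes placed in buf, or a negative errno:
 * -EPERM without CAP_SYS_ADMIN, -ENOBUFS if buf cannot hold the whole
 * compressed extent (both per the patch above).
 * On success, out describes how to interpret the data.
 */
static long encoded_read_sketch(int fd, int64_t offset, void *buf,
				size_t buflen,
				struct encoded_io_args_sketch *out)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };
	long ret;

	memset(out, 0, sizeof(*out));
	out->iov = &iov;
	out->iovcnt = 1;
	out->offset = offset;
	/* flags stays 0; the patch rejects any other value with -EINVAL. */

	ret = ioctl(fd, IOC_ENCODED_READ_SKETCH, out);
	if (ret < 0)
		ret = -errno;
	out->iov = NULL; /* iov lives on this stack frame; don't leak it */
	return ret;
}
```

A caller seeing -ENOBUFS would retry with a larger buffer. With this
patch a compressed extent is at most BTRFS_MAX_COMPRESSED bytes on disk,
so a buffer of that size should always be sufficient.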



* [PATCH v12 10/17] btrfs: add BTRFS_IOC_ENCODED_WRITE
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (8 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size Omar Sandoval
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
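Note on argument validation: the checks at the top of
btrfs_do_encoded_write() in the diff below can be modeled in userspace
roughly as follows. This is a sketch under assumptions, not the kernel
code: the function and macro names are hypothetical, the 128 KiB limits
are assumed values for BTRFS_MAX_UNCOMPRESSED and BTRFS_MAX_COMPRESSED,
and sectorsize is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_UNCOMPRESSED (128 * 1024ULL) /* assumed limit */
#define SKETCH_MAX_COMPRESSED   (128 * 1024ULL) /* assumed limit */

/* True if x is a multiple of a, where a is a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * pos:              file offset of the write
 * len:              number of file bytes the extent covers
 * unencoded_len:    size of the data once decompressed
 * unencoded_offset: offset of the file data inside the decompressed data
 * data_size:        number of compressed bytes supplied by the caller
 * i_size:           current file size
 */
static bool encoded_write_args_ok(uint64_t pos, uint64_t len,
				  uint64_t unencoded_len,
				  uint64_t unencoded_offset,
				  uint64_t data_size, uint64_t i_size,
				  uint64_t sectorsize)
{
	/* The extent size must be sane. */
	if (unencoded_len > SKETCH_MAX_UNCOMPRESSED ||
	    data_size > SKETCH_MAX_COMPRESSED || data_size == 0)
		return false;
	/* Compressed data must be at least one byte smaller. */
	if (data_size >= unencoded_len)
		return false;
	/* The extent must start on a sector boundary. */
	if (!is_aligned(pos, sectorsize))
		return false;
	/* An unaligned end is only allowed at or beyond i_size. */
	if (pos + len < i_size && !is_aligned(pos + len, sectorsize))
		return false;
	/* The offset into the unencoded data must be sector-aligned. */
	return is_aligned(unencoded_offset, sectorsize);
}
```

A sender could run a check like this before issuing the ioctl and fall
back to decompressing and writing normally when it fails.
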
 fs/btrfs/compression.c  |   7 +-
 fs/btrfs/compression.h  |   6 +-
 fs/btrfs/ctree.h        |   4 +
 fs/btrfs/file.c         |  65 ++++++++--
 fs/btrfs/inode.c        | 256 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c        | 102 ++++++++++++++++
 fs/btrfs/ordered-data.c |  12 +-
 fs/btrfs/ordered-data.h |   5 +-
 8 files changed, 437 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 73350f524fb8..ac5656392c2c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -382,7 +382,8 @@ static void finish_compressed_bio_write(struct compressed_bio *cb)
 			cb->start, cb->start + cb->len - 1,
 			!cb->errors);
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* Note, our inode could be gone now */
 
 	/*
@@ -505,7 +506,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				 struct page **compressed_pages,
 				 unsigned int nr_pages,
 				 unsigned int write_flags,
-				 struct cgroup_subsys_state *blkcg_css)
+				 struct cgroup_subsys_state *blkcg_css,
+				 bool writeback)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio = NULL;
@@ -530,6 +532,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 56eef0821e3e..a50dd19e764e 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -52,6 +52,9 @@ struct compressed_bio {
 	/* The compression algorithm for this bio */
 	u8 compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* IO errors */
 	u8 errors;
 	int mirror_num;
@@ -95,7 +98,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 				  struct page **compressed_pages,
 				  unsigned int nr_pages,
 				  unsigned int write_flags,
-				  struct cgroup_subsys_state *blkcg_css);
+				  struct cgroup_subsys_state *blkcg_css,
+				  bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 70034e33abe6..98defa9cce0e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3278,6 +3278,8 @@ void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 struct btrfs_ioctl_encoded_io_args;
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			   struct btrfs_ioctl_encoded_io_args *encoded);
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			     const struct btrfs_ioctl_encoded_io_args *encoded);
 
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
@@ -3338,6 +3340,8 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 			   struct btrfs_trans_handle **trans_out);
 int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 			      struct btrfs_inode *inode, u64 start, u64 end);
+ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			    const struct btrfs_ioctl_encoded_io_args *encoded);
 int btrfs_release_file(struct inode *inode, struct file *file);
 int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5fbf0a2aba2e..d740d559717b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2067,12 +2067,43 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	return err < 0 ? err : written;
 }
 
-static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			const struct btrfs_ioctl_encoded_io_args *encoded)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	loff_t count;
+	ssize_t ret;
+
+	btrfs_inode_lock(inode, 0);
+	count = encoded->len;
+	ret = generic_write_checks_count(iocb, &count);
+	if (ret == 0 && count != encoded->len) {
+		/*
+		 * The write got truncated by generic_write_checks_count(). We
+		 * can't do a partial encoded write.
+		 */
+		ret = -EFBIG;
+	}
+	if (ret || encoded->len == 0)
+		goto out;
+
+	ret = btrfs_write_check(iocb, from, encoded->len);
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_do_encoded_write(iocb, from, encoded);
+out:
+	btrfs_inode_unlock(inode, 0);
+	return ret;
+}
+
+ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			    const struct btrfs_ioctl_encoded_io_args *encoded)
 {
 	struct file *file = iocb->ki_filp;
 	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
-	ssize_t num_written = 0;
+	ssize_t num_written, num_sync;
 	const bool sync = iocb->ki_flags & IOCB_DSYNC;
 
 	/*
@@ -2083,22 +2114,28 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (BTRFS_FS_ERROR(inode->root->fs_info))
 		return -EROFS;
 
-	if (!(iocb->ki_flags & IOCB_DIRECT) &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EOPNOTSUPP;
 
 	if (sync)
 		atomic_inc(&inode->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT)
-		num_written = btrfs_direct_write(iocb, from);
-	else
-		num_written = btrfs_buffered_write(iocb, from);
+	if (encoded) {
+		num_written = btrfs_encoded_write(iocb, from, encoded);
+		num_sync = encoded->len;
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
+		num_written = num_sync = btrfs_direct_write(iocb, from);
+	} else {
+		num_written = num_sync = btrfs_buffered_write(iocb, from);
+	}
 
 	btrfs_set_inode_last_sub_trans(inode);
 
-	if (num_written > 0)
-		num_written = generic_write_sync(iocb, num_written);
+	if (num_sync > 0) {
+		num_sync = generic_write_sync(iocb, num_sync);
+		if (num_sync < 0)
+			num_written = num_sync;
+	}
 
 	if (sync)
 		atomic_dec(&inode->sync_writers);
@@ -2107,6 +2144,12 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	return num_written;
 }
 
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
+				    struct iov_iter *from)
+{
+	return btrfs_do_write_iter(iocb, from, NULL);
+}
+
 int btrfs_release_file(struct inode *inode, struct file *filp)
 {
 	struct btrfs_file_private *private = filp->private_data;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d29e968fd18b..6919dc170c93 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -999,7 +999,7 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
 			    async_extent->pages,	/* compressed_pages */
 			    async_extent->nr_pages,
 			    async_chunk->write_flags,
-			    async_chunk->blkcg_css)) {
+			    async_chunk->blkcg_css, true)) {
 		const u64 start = async_extent->start;
 		const u64 end = start + async_extent->ram_size - 1;
 
@@ -2994,6 +2994,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
 	 * except if the ordered extent was truncated.
 	 */
 	update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) ||
+			     test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) ||
 			     test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags);
 
 	return insert_reserved_file_extent(trans, BTRFS_I(oe->inode),
@@ -3028,7 +3029,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
-	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags))
+	    !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) &&
+	    !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags))
 		clear_bits |= EXTENT_DELALLOC_NEW;
 
 	freespace_inode = btrfs_is_free_space_inode(inode);
@@ -11021,6 +11023,256 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	return ret;
 }
 
+ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
+			      const struct btrfs_ioctl_encoded_io_args *encoded)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	int compression;
+	size_t orig_count;
+	u64 start, end;
+	u64 num_bytes, ram_bytes, disk_num_bytes;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	struct btrfs_key ins;
+	bool extent_reserved = false;
+	struct extent_map *em;
+	ssize_t ret;
+
+	switch (encoded->compression) {
+	case BTRFS_ENCODED_IO_COMPRESSION_ZLIB:
+		compression = BTRFS_COMPRESS_ZLIB;
+		break;
+	case BTRFS_ENCODED_IO_COMPRESSION_ZSTD:
+		compression = BTRFS_COMPRESS_ZSTD;
+		break;
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_4K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_8K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_16K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_32K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_64K:
+		/* The page size must match for LZO. */
+		if (encoded->compression -
+		    BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + 12 != PAGE_SHIFT)
+			return -EINVAL;
+		compression = BTRFS_COMPRESS_LZO;
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (encoded->encryption != BTRFS_ENCODED_IO_ENCRYPTION_NONE)
+		return -EINVAL;
+
+	orig_count = iov_iter_count(from);
+
+	/* The extent size must be sane. */
+	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
+	    orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data must be smaller than the decompressed data.
+	 *
+	 * It's of course possible for data to compress to larger or the same
+	 * size, but the buffered I/O path falls back to no compression for such
+	 * data, and we don't want to break any assumptions by creating these
+	 * extents.
+	 *
+	 * Note that this is less strict than the current check we have that the
+	 * compressed data must be at least one sector smaller than the
+	 * decompressed data. We only want to enforce the weaker requirement
+	 * from old kernels that it is at least one byte smaller.
+	 */
+	if (orig_count >= encoded->unencoded_len)
+		return -EINVAL;
+
+	/* The extent must start on a sector boundary. */
+	start = iocb->ki_pos;
+	if (!IS_ALIGNED(start, fs_info->sectorsize))
+		return -EINVAL;
+
+	/*
+	 * The extent must end on a sector boundary. However, we allow a write
+	 * which ends at or extends i_size to have an unaligned length; we round
+	 * up the extent size and set i_size to the unaligned end.
+	 */
+	if (start + encoded->len < inode->i_size &&
+	    !IS_ALIGNED(start + encoded->len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* Finally, the offset in the unencoded data must be sector-aligned. */
+	if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize))
+		return -EINVAL;
+
+	num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize);
+	end = start + num_bytes - 1;
+
+	/*
+	 * If the extent cannot be inline, the compressed data on disk must be
+	 * sector-aligned. For convenience, we extend it with zeroes if it
+	 * isn't.
+	 */
+	disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize);
+	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     num_bytes);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	/*
+	 * We don't use the higher-level delalloc space functions because our
+	 * num_bytes and disk_num_bytes are different.
+	 */
+	ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes);
+	if (ret)
+		goto out_unlock;
+	ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, start,
+					num_bytes);
+	if (ret)
+		goto out_free_data_space;
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes,
+					      disk_num_bytes);
+	if (ret)
+		goto out_qgroup_free_data;
+
+	/* Try an inline extent first. */
+	if (start == 0 && encoded->unencoded_len == encoded->len &&
+	    encoded->unencoded_offset == 0) {
+		ret = cow_file_range_inline(BTRFS_I(inode), encoded->len,
+					    orig_count, compression, pages,
+					    true);
+		if (ret <= 0) {
+			if (ret == 0)
+				ret = orig_count;
+			goto out_delalloc_release;
+		}
+	}
+
+	ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+	extent_reserved = true;
+
+	em = create_io_em(BTRFS_I(inode), start, num_bytes,
+			  start - encoded->unencoded_offset, ins.objectid,
+			  ins.offset, ins.offset, ram_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserved;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent(BTRFS_I(inode), start, num_bytes,
+				       ram_bytes, ins.objectid, ins.offset,
+				       encoded->unencoded_offset,
+				       (1 << BTRFS_ORDERED_ENCODED) |
+				       (1 << BTRFS_ORDERED_COMPRESSED),
+				       compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserved;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+
+	if (btrfs_submit_compressed_write(BTRFS_I(inode), start, num_bytes,
+					  ins.objectid, ins.offset, pages,
+					  nr_pages, 0, NULL, false)) {
+		btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), pages[0],
+						     start, end, 0);
+		ret = -EIO;
+		goto out_pages;
+	}
+	ret = orig_count;
+	goto out;
+
+out_free_reserved:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes,
+					ret < 0);
+out_qgroup_free_data:
+	if (ret < 0) {
+		btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, start,
+				       num_bytes);
+	}
+out_free_data_space:
+	/*
+	 * If btrfs_reserve_extent() succeeded, then we already decremented
+	 * bytes_may_use.
+	 */
+	if (!extent_reserved)
+		btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kvfree(pages);
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded->len;
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f0c575223d88..ea78aa2e8ff3 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -105,6 +105,8 @@ struct btrfs_ioctl_encoded_io_args_32 {
 
 #define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
 				       struct btrfs_ioctl_encoded_io_args_32)
+#define BTRFS_IOC_ENCODED_WRITE_32 _IOW(BTRFS_IOCTL_MAGIC, 64, \
+					struct btrfs_ioctl_encoded_io_args_32)
 #endif
 
 /* Mask out flags that are inappropriate for the given type of inode. */
@@ -4961,6 +4963,102 @@ static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
 	return ret;
 }
 
+static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp,
+				     bool compat)
+{
+	struct btrfs_ioctl_encoded_io_args args;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	loff_t pos;
+	struct kiocb kiocb;
+	ssize_t ret;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_acct;
+	}
+
+	if (!(file->f_mode & FMODE_WRITE)) {
+		ret = -EBADF;
+		goto out_acct;
+	}
+
+	if (compat) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_encoded_io_args_32 args32;
+
+		if (copy_from_user(&args32, argp, sizeof(args32))) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+		args.iov = compat_ptr(args32.iov);
+		args.iovcnt = args32.iovcnt;
+		memcpy(&args.offset, &args32.offset,
+		       sizeof(args) -
+		       offsetof(struct btrfs_ioctl_encoded_io_args, offset));
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		if (copy_from_user(&args, argp, sizeof(args))) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+	}
+
+	ret = -EINVAL;
+	if (args.flags != 0)
+		goto out_acct;
+	if (memchr_inv(args.reserved, 0, sizeof(args.reserved)))
+		goto out_acct;
+	if (args.compression == BTRFS_ENCODED_IO_COMPRESSION_NONE &&
+	    args.encryption == BTRFS_ENCODED_IO_ENCRYPTION_NONE)
+		goto out_acct;
+	if (args.compression >= BTRFS_ENCODED_IO_COMPRESSION_TYPES ||
+	    args.encryption >= BTRFS_ENCODED_IO_ENCRYPTION_TYPES)
+		goto out_acct;
+	if (args.unencoded_offset > args.unencoded_len)
+		goto out_acct;
+	if (args.len > args.unencoded_len - args.unencoded_offset)
+		goto out_acct;
+
+	ret = import_iovec(WRITE, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
+			   &iov, &iter);
+	if (ret < 0)
+		goto out_acct;
+
+	file_start_write(file);
+
+	if (iov_iter_count(&iter) == 0) {
+		ret = 0;
+		goto out_end_write;
+	}
+	pos = args.offset;
+	ret = rw_verify_area(WRITE, file, &pos, args.len);
+	if (ret < 0)
+		goto out_end_write;
+
+	init_sync_kiocb(&kiocb, file);
+	ret = kiocb_set_rw_flags(&kiocb, 0);
+	if (ret)
+		goto out_end_write;
+	kiocb.ki_pos = pos;
+
+	ret = btrfs_do_write_iter(&kiocb, &iter, &args);
+	if (ret > 0)
+		fsnotify_modify(file);
+
+out_end_write:
+	file_end_write(file);
+	kfree(iov);
+out_acct:
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5107,9 +5205,13 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return fsverity_ioctl_measure(file, argp);
 	case BTRFS_IOC_ENCODED_READ:
 		return btrfs_ioctl_encoded_read(file, argp, false);
+	case BTRFS_IOC_ENCODED_WRITE:
+		return btrfs_ioctl_encoded_write(file, argp, false);
 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
 	case BTRFS_IOC_ENCODED_READ_32:
 		return btrfs_ioctl_encoded_read(file, argp, true);
+	case BTRFS_IOC_ENCODED_WRITE_32:
+		return btrfs_ioctl_encoded_write(file, argp, true);
 #endif
 	}
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 5e4c59b00b01..7837336ab4b9 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -521,9 +521,15 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	spin_lock(&btrfs_inode->lock);
 	btrfs_mod_outstanding_extents(btrfs_inode, -1);
 	spin_unlock(&btrfs_inode->lock);
-	if (root != fs_info->tree_root)
-		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
-						false);
+	if (root != fs_info->tree_root) {
+		u64 release;
+
+		if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+			release = entry->disk_num_bytes;
+		else
+			release = entry->num_bytes;
+		btrfs_delalloc_release_metadata(btrfs_inode, release, false);
+	}
 
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
 				 fs_info->delalloc_batch);
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 0feb0c29839e..aeb17714ba5a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -74,6 +74,8 @@ enum {
 	BTRFS_ORDERED_LOGGED_CSUM,
 	/* We wait for this extent to complete in the current transaction */
 	BTRFS_ORDERED_PENDING,
+	/* BTRFS_IOC_ENCODED_WRITE */
+	BTRFS_ORDERED_ENCODED,
 };
 
 /* BTRFS_ORDERED_* flags that specify the type of the extent. */
@@ -81,7 +83,8 @@ enum {
 				  (1UL << BTRFS_ORDERED_NOCOW) |	\
 				  (1UL << BTRFS_ORDERED_PREALLOC) |	\
 				  (1UL << BTRFS_ORDERED_COMPRESSED) |	\
-				  (1UL << BTRFS_ORDERED_DIRECT))
+				  (1UL << BTRFS_ORDERED_DIRECT) |	\
+				  (1UL << BTRFS_ORDERED_ENCODED))
 
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
-- 
2.34.0



* [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (9 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 10/17] btrfs: add BTRFS_IOC_ENCODED_WRITE Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-18 14:11   ` David Sterba
  2021-11-17 20:19 ` [PATCH v12 12/17] btrfs: send: fix maximum command numbering Omar Sandoval
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

We collect these statistics but have never used them for anything.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 6bdcb9d481d5..500b866ede43 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -81,8 +81,6 @@ struct send_ctx {
 	char *send_buf;
 	u32 send_size;
 	u32 send_max_size;
-	u64 total_send_size;
-	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1];
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -722,8 +720,6 @@ static int send_cmd(struct send_ctx *sctx)
 	ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size,
 					&sctx->send_off);
 
-	sctx->total_send_size += sctx->send_size;
-	sctx->cmd_send_size[get_unaligned_le16(&hdr->cmd)] += sctx->send_size;
 	sctx->send_size = 0;
 
 	return ret;
-- 
2.34.0



* [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (10 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-18 14:23   ` David Sterba
  2021-11-17 20:19 ` [PATCH v12 13/17] btrfs: add send stream v2 definitions Omar Sandoval
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
__BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
version plus 1, but as written this creates gaps in the number space.
The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
23 and 24 are valid commands.

Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
number. This requires repeating the command name, but it has a clearer
meaning and avoids gaps. It also doesn't require updating
__BTRFS_SEND_C_MAX for every new version.
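To make the gap concrete, here is a minimal userspace sketch of the two
numbering schemes. It is an illustration only: the enums are shortened to
three commands, and every OLD_*/NEW_* name and both *_cmd_ok() helpers are
hypothetical stand-ins, not the kernel definitions.

```c
#include <stdbool.h>

/* Old scheme (hypothetical, shortened): each "max plus 1" marker is an
 * ordinary enumerator, so it uses up a number of its own. */
enum old_cmd {
	OLD_C_UNSPEC,		/* 0 */
	OLD_C_END,		/* 1 */
	OLD_C_UPDATE_EXTENT,	/* 2: last real v1 command */
	OLD_C_SENTINEL_V1,	/* 3 */
	/* Version 2: no commands defined yet */
	OLD_C_SENTINEL_V2,	/* 4 */
	OLD_C_SENTINEL,		/* 5 */
};

/* New scheme (hypothetical, shortened): the per-version maximum is an alias
 * of the last real command, so it does not advance the numbering. */
enum new_cmd {
	NEW_C_UNSPEC,				/* 0 */
	NEW_C_END,				/* 1 */
	NEW_C_UPDATE_EXTENT,			/* 2 */
	NEW_C_MAX_V1 = NEW_C_UPDATE_EXTENT,	/* still 2 */
	/* Version 2: no commands defined yet */
	NEW_C_MAX_V2 = NEW_C_MAX_V1,		/* still 2 */
	NEW_C_SENTINEL,				/* 3: next free number */
};

static bool old_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1:  return cmd < OLD_C_SENTINEL_V1;
	case 2:  return cmd < OLD_C_SENTINEL_V2;
	default: return false;
	}
}

static bool new_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1:  return cmd <= NEW_C_MAX_V1;
	case 2:  return cmd <= NEW_C_MAX_V2;
	default: return false;
	}
}
```

In this sketch, old_cmd_ok(2, 3) accepts number 3 even though no command
has that number, while new_cmd_ok(2, 3) rejects it, and the next command
added to the new enum would simply take number 3.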

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c | 4 ++--
 fs/btrfs/send.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 500b866ede43..450c873684e8 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -316,8 +316,8 @@ __maybe_unused
 static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
 {
 	switch (sctx->proto) {
-	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
-	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
+	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
+	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
 	default: return false;
 	}
 }
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
index 23bcefc84e49..59a4be3b09cd 100644
--- a/fs/btrfs/send.h
+++ b/fs/btrfs/send.h
@@ -77,10 +77,10 @@ enum btrfs_send_cmd {
 
 	BTRFS_SEND_C_END,
 	BTRFS_SEND_C_UPDATE_EXTENT,
-	__BTRFS_SEND_C_MAX_V1,
+	BTRFS_SEND_C_MAX_V1 = BTRFS_SEND_C_UPDATE_EXTENT,
 
 	/* Version 2 */
-	__BTRFS_SEND_C_MAX_V2,
+	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_MAX_V1,
 
 	/* End */
 	__BTRFS_SEND_C_MAX,
-- 
2.34.0



* [PATCH v12 13/17] btrfs: add send stream v2 definitions
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (11 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 12/17] btrfs: send: fix maximum command numbering Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-18 14:18   ` David Sterba
  2021-11-18 14:20   ` David Sterba
  2021-11-17 20:19 ` [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2 Omar Sandoval
                   ` (13 subsequent siblings)
  26 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This adds the definitions of the new commands for send stream version 2
and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
chattr), and encoded writes. It also documents two changes to the send
stream format in v2: the receiver shouldn't assume a maximum command
size, and the DATA attribute is encoded differently to allow for writes
larger than 64k. These will be implemented in subsequent changes, and
then the ioctl will accept the new version and flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c            |  2 +-
 fs/btrfs/send.h            | 35 +++++++++++++++++++++++++++++++++--
 include/uapi/linux/btrfs.h |  7 +++++++
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 450c873684e8..53b3cc2276ea 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7292,7 +7292,7 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
-	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
+	sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
 	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
index 59a4be3b09cd..50c2284f08af 100644
--- a/fs/btrfs/send.h
+++ b/fs/btrfs/send.h
@@ -12,7 +12,11 @@
 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
 #define BTRFS_SEND_STREAM_VERSION 1
 
-#define BTRFS_SEND_BUF_SIZE SZ_64K
+/*
+ * In send stream v1, no command is larger than 64k. In send stream v2, no limit
+ * should be assumed.
+ */
+#define BTRFS_SEND_BUF_SIZE_V1 SZ_64K
 
 enum btrfs_tlv_type {
 	BTRFS_TLV_U8,
@@ -80,7 +84,10 @@ enum btrfs_send_cmd {
 	BTRFS_SEND_C_MAX_V1 = BTRFS_SEND_C_UPDATE_EXTENT,
 
 	/* Version 2 */
-	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_MAX_V1,
+	BTRFS_SEND_C_FALLOCATE,
+	BTRFS_SEND_C_SETFLAGS,
+	BTRFS_SEND_C_ENCODED_WRITE,
+	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_ENCODED_WRITE,
 
 	/* End */
 	__BTRFS_SEND_C_MAX,
@@ -91,6 +98,7 @@ enum btrfs_send_cmd {
 enum {
 	BTRFS_SEND_A_UNSPEC,
 
+	/* Version 1 */
 	BTRFS_SEND_A_UUID,
 	BTRFS_SEND_A_CTRANSID,
 
@@ -113,6 +121,11 @@ enum {
 	BTRFS_SEND_A_PATH_LINK,
 
 	BTRFS_SEND_A_FILE_OFFSET,
+	/*
+	 * In send stream v2, this attribute is special: it must be the last
+	 * attribute in a command, its header contains only the type, and its
+	 * length is implicitly the remaining length of the command.
+	 */
 	BTRFS_SEND_A_DATA,
 
 	BTRFS_SEND_A_CLONE_UUID,
@@ -120,7 +133,25 @@ enum {
 	BTRFS_SEND_A_CLONE_PATH,
 	BTRFS_SEND_A_CLONE_OFFSET,
 	BTRFS_SEND_A_CLONE_LEN,
+	BTRFS_SEND_A_MAX_V1 = BTRFS_SEND_A_CLONE_LEN,
 
+	/* Version 2 */
+	BTRFS_SEND_A_FALLOCATE_MODE,
+
+	BTRFS_SEND_A_SETFLAGS_FLAGS,
+
+	BTRFS_SEND_A_UNENCODED_FILE_LEN,
+	BTRFS_SEND_A_UNENCODED_LEN,
+	BTRFS_SEND_A_UNENCODED_OFFSET,
+	/*
+	 * COMPRESSION and ENCRYPTION default to NONE (0) if omitted from
+	 * BTRFS_SEND_C_ENCODED_WRITE.
+	 */
+	BTRFS_SEND_A_COMPRESSION,
+	BTRFS_SEND_A_ENCRYPTION,
+	BTRFS_SEND_A_MAX_V2 = BTRFS_SEND_A_ENCRYPTION,
+
+	/* End */
 	__BTRFS_SEND_A_MAX,
 };
 #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 7505acfa18d7..9d5fbe8c36c4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -776,6 +776,13 @@ struct btrfs_ioctl_received_subvol_args {
  */
 #define BTRFS_SEND_FLAG_VERSION			0x8
 
+/*
+ * Send compressed data using the ENCODED_WRITE command instead of decompressing
+ * the data and sending it with the WRITE command. This requires protocol
+ * version >= 2.
+ */
+#define BTRFS_SEND_FLAG_COMPRESSED		0x10
+
 #define BTRFS_SEND_FLAG_MASK \
 	(BTRFS_SEND_FLAG_NO_FILE_DATA | \
 	 BTRFS_SEND_FLAG_OMIT_STREAM_HEADER | \
-- 
2.34.0



* [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (12 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 13/17] btrfs: add send stream v2 definitions Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-18 15:50   ` David Sterba
  2021-11-17 20:19 ` [PATCH v12 15/17] btrfs: send: allocate send buffer with alloc_page() and vmap() for v2 Omar Sandoval
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

The length field of the send stream TLV header is 16 bits. This means
that the maximum amount of data that can be sent for one write is 64k
minus one. However, encoded writes must be able to send the maximum
compressed extent (128k) in one command. To support this, send stream
version 2 encodes the DATA attribute differently: it has no length
field, and its length implicitly extends to the end of the containing
command (which has a 32-bit length field). Although this is necessary
for encoded writes, normal writes can benefit from it, too.
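As a rough userspace illustration of the two encodings (a sketch, not the
kernel code), the helpers below write a DATA attribute into a byte buffer.
The names put_le16(), data_attr_encoded_size() and put_data_attr() are
hypothetical, and ATTR_DATA is a placeholder for the attribute type value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_DATA 19	/* placeholder type value for this sketch */

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Encoded size of a DATA attribute carrying len bytes, or 0 if it cannot be
 * represented. v1 uses a 2-byte type plus a 2-byte length, so the payload is
 * capped at 65535 bytes. v2 writes only the 2-byte type.
 */
static size_t data_attr_encoded_size(int proto, size_t len)
{
	if (proto >= 2)
		return 2 + len;
	if (len > UINT16_MAX)
		return 0;
	return 4 + len;
}

/* Write the attribute into buf. Returns bytes written, or 0 on failure. */
static size_t put_data_attr(int proto, uint8_t *buf, size_t cap,
			    const uint8_t *data, size_t len)
{
	size_t total = data_attr_encoded_size(proto, len);
	size_t hdr_len;

	if (total == 0 || total > cap)
		return 0;
	hdr_len = total - len;
	put_le16(buf, ATTR_DATA);
	if (proto < 2)
		put_le16(buf + 2, (uint16_t)len);
	memcpy(buf + hdr_len, data, len);
	return total;
}
```

With these helpers a 128k payload is rejected for proto 1 and takes 128k
plus 2 bytes for proto 2. A reader of the v2 form has to derive the payload
length from the enclosing command's length, which is why the attribute must
come last.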

Also add a check to enforce that the DATA attribute is last. It is only
strictly necessary for v2, but we might as well make v1 consistent with
it.

For v2, let's bump up the send buffer to the maximum compressed extent
size plus 16k for the other metadata (144k total). Since this will most
likely be vmalloc'd (and always will be after the next commit), we round
it up to a multiple of the page size, so that the rest of the last page
is used on systems with >16k pages.
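The size arithmetic can be checked with a small userspace sketch. The
helper names align_up() and v2_send_buf_size() are hypothetical, and the
128k value for the maximum compressed extent is taken from the text above
rather than from a kernel header.

```c
#include <stdint.h>

#define SKETCH_SZ_16K		(16u * 1024)
#define SKETCH_MAX_COMPRESSED	(128u * 1024)	/* 128k, per the text above */

/* Round x up to a multiple of a, where a is a power of two. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* v2 send buffer size for a given page size (a power of two). */
static uint32_t v2_send_buf_size(uint32_t page_size)
{
	return align_up(SKETCH_SZ_16K + SKETCH_MAX_COMPRESSED, page_size);
}
```

Under these assumptions, 4k and 16k pages both give exactly 144k (147456
bytes), because 144k is already a multiple of 16k. Only larger pages round
up: 64k pages give 192k (196608 bytes).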

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c | 42 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 53b3cc2276ea..12844cb20584 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -81,6 +81,7 @@ struct send_ctx {
 	char *send_buf;
 	u32 send_size;
 	u32 send_max_size;
+	bool put_data;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -584,6 +585,9 @@ static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
 	int total_len = sizeof(*hdr) + len;
 	int left = sctx->send_max_size - sctx->send_size;
 
+	if (WARN_ON_ONCE(sctx->put_data))
+		return -EINVAL;
+
 	if (unlikely(left < total_len))
 		return -EOVERFLOW;
 
@@ -721,6 +725,7 @@ static int send_cmd(struct send_ctx *sctx)
 					&sctx->send_off);
 
 	sctx->send_size = 0;
+	sctx->put_data = false;
 
 	return ret;
 }
@@ -4902,14 +4907,30 @@ static inline u64 max_send_read_size(const struct send_ctx *sctx)
 
 static int put_data_header(struct send_ctx *sctx, u32 len)
 {
-	struct btrfs_tlv_header *hdr;
+	if (WARN_ON_ONCE(sctx->put_data))
+		return -EINVAL;
+	sctx->put_data = true;
+	if (sctx->proto >= 2) {
+		/*
+		 * In v2, the data attribute header doesn't include a length; it
+		 * is implicitly to the end of the command.
+		 */
+		if (sctx->send_max_size - sctx->send_size < 2 + len)
+			return -EOVERFLOW;
+		put_unaligned_le16(BTRFS_SEND_A_DATA,
+				   sctx->send_buf + sctx->send_size);
+		sctx->send_size += 2;
+	} else {
+		struct btrfs_tlv_header *hdr;
 
-	if (sctx->send_max_size - sctx->send_size < sizeof(*hdr) + len)
-		return -EOVERFLOW;
-	hdr = (struct btrfs_tlv_header *)(sctx->send_buf + sctx->send_size);
-	put_unaligned_le16(BTRFS_SEND_A_DATA, &hdr->tlv_type);
-	put_unaligned_le16(len, &hdr->tlv_len);
-	sctx->send_size += sizeof(*hdr);
+		if (sctx->send_max_size - sctx->send_size < sizeof(*hdr) + len)
+			return -EOVERFLOW;
+		hdr = (struct btrfs_tlv_header *)(sctx->send_buf +
+						  sctx->send_size);
+		put_unaligned_le16(BTRFS_SEND_A_DATA, &hdr->tlv_type);
+		put_unaligned_le16(len, &hdr->tlv_len);
+		sctx->send_size += sizeof(*hdr);
+	}
 	return 0;
 }
 
@@ -7292,7 +7313,12 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
-	sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+	if (sctx->proto >= 2) {
+		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
+					    PAGE_SIZE);
+	} else {
+		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+	}
 	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
-- 
2.34.0



* [PATCH v12 15/17] btrfs: send: allocate send buffer with alloc_page() and vmap() for v2
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (13 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2 Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 16/17] btrfs: send: send compressed extents with encoded writes Omar Sandoval
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

For encoded writes, we need the raw pages for reading compressed data
directly via a bio. So, replace kvmalloc() with alloc_page() and vmap()
so we have access to the raw pages. 144k is large enough that it usually
gets allocated with vmalloc() anyway.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 12844cb20584..8493335ef47a 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -82,6 +82,7 @@ struct send_ctx {
 	u32 send_size;
 	u32 send_max_size;
 	bool put_data;
+	struct page **send_buf_pages;
 	u64 flags;	/* 'flags' member of btrfs_ioctl_send_args is u64 */
 	/* Protocol version compatibility requested */
 	u32 proto;
@@ -7225,6 +7226,7 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	struct btrfs_root *clone_root;
 	struct send_ctx *sctx = NULL;
 	u32 i;
+	u32 send_buf_num_pages = 0;
 	u64 *clone_sources_tmp = NULL;
 	int clone_sources_to_rollback = 0;
 	size_t alloc_size;
@@ -7316,10 +7318,28 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	if (sctx->proto >= 2) {
 		sctx->send_max_size = ALIGN(SZ_16K + BTRFS_MAX_COMPRESSED,
 					    PAGE_SIZE);
+		send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT;
+		sctx->send_buf_pages = kcalloc(send_buf_num_pages,
+					       sizeof(*sctx->send_buf_pages),
+					       GFP_KERNEL);
+		if (!sctx->send_buf_pages) {
+			send_buf_num_pages = 0;
+			ret = -ENOMEM;
+			goto out;
+		}
+		for (i = 0; i < send_buf_num_pages; i++) {
+			sctx->send_buf_pages[i] = alloc_page(GFP_KERNEL);
+			if (!sctx->send_buf_pages[i]) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+		sctx->send_buf = vmap(sctx->send_buf_pages, send_buf_num_pages,
+				      VM_MAP, PAGE_KERNEL);
 	} else {
 		sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1;
+		sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	}
-	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
 		ret = -ENOMEM;
 		goto out;
@@ -7526,7 +7546,16 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 			fput(sctx->send_filp);
 
 		kvfree(sctx->clone_roots);
-		kvfree(sctx->send_buf);
+		if (sctx->proto >= 2) {
+			vunmap(sctx->send_buf);
+			for (i = 0; i < send_buf_num_pages; i++) {
+				if (sctx->send_buf_pages[i])
+					__free_page(sctx->send_buf_pages[i]);
+			}
+			kfree(sctx->send_buf_pages);
+		} else {
+			kvfree(sctx->send_buf);
+		}
 
 		name_cache_free(sctx);
 
-- 
2.34.0



* [PATCH v12 16/17] btrfs: send: send compressed extents with encoded writes
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (14 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 15/17] btrfs: send: allocate send buffer with alloc_page() and vmap() for v2 Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 17/17] btrfs: send: enable support for stream v2 and compressed writes Omar Sandoval
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Now that all of the pieces are in place, we can use the ENCODED_WRITE
command to send compressed extents when appropriate.
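For regular (non-inline) extents, send_encoded_extent() in this patch
reads the compressed data straight into the send buffer at the next page
boundary after the attributes, so there can be a gap between the
attributes and the file data. The gap is neither written to the stream nor
counted in the command length. A userspace sketch of that layout
calculation follows. The struct and function names are hypothetical, and
the command header size is passed in as a parameter rather than assumed.

```c
#include <stdbool.h>
#include <stdint.h>

struct encoded_layout {
	uint32_t data_offset;	/* where the file data lands in the buffer */
	uint32_t cmd_len;	/* value for the command header's len field */
};

/*
 * send_size:      bytes already in the buffer (header plus attributes);
 *                 assumed to be at least cmd_hdr_size
 * send_max_size:  total size of the send buffer
 * disk_num_bytes: size of the compressed data to read
 * page_size:      must be a power of two
 * Returns false if the data does not fit after the page-aligned offset.
 */
static bool plan_encoded_write(uint32_t send_size, uint32_t send_max_size,
			       uint64_t disk_num_bytes, uint32_t page_size,
			       uint32_t cmd_hdr_size,
			       struct encoded_layout *out)
{
	uint64_t data_offset = ((uint64_t)send_size + page_size - 1) &
			       ~((uint64_t)page_size - 1);

	if (data_offset > send_max_size ||
	    send_max_size - data_offset < disk_num_bytes)
		return false;

	out->data_offset = (uint32_t)data_offset;
	/* The gap up to data_offset is not part of the command. */
	out->cmd_len = (uint32_t)(send_size + disk_num_bytes - cmd_hdr_size);
	return true;
}
```

For example, with 100 bytes of header and attributes, 4k pages and a 144k
buffer, a 128k extent is placed at offset 4096 and fits, while an extent
as large as the whole buffer is rejected.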

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h |   4 +
 fs/btrfs/inode.c |  10 +-
 fs/btrfs/send.c  | 234 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 225 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 98defa9cce0e..af1bb06f9ca0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, bool uptodate);
+int btrfs_encoded_io_compression_from_extent(int compress_type);
+int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 file_offset,
+					  u64 disk_bytenr, u64 disk_io_size,
+					  struct page **pages);
 struct btrfs_ioctl_encoded_io_args;
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			   struct btrfs_ioctl_encoded_io_args *encoded);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6919dc170c93..92e1dea7af9a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10527,7 +10527,7 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 	}
 }
 
-static int btrfs_encoded_io_compression_from_extent(int compress_type)
+int btrfs_encoded_io_compression_from_extent(int compress_type)
 {
 	switch (compress_type) {
 	case BTRFS_COMPRESS_NONE:
@@ -10731,11 +10731,9 @@ static void btrfs_encoded_read_endio(struct bio *bio)
 	bio_put(bio);
 }
 
-static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
-						 u64 file_offset,
-						 u64 disk_bytenr,
-						 u64 disk_io_size,
-						 struct page **pages)
+int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 file_offset,
+					  u64 disk_bytenr, u64 disk_io_size,
+					  struct page **pages)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_encoded_read_private priv = {
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 8493335ef47a..7a6a23a63950 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -609,6 +609,7 @@ static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
 		return tlv_put(sctx, attr, &__tmp, sizeof(__tmp));	\
 	}
 
+TLV_PUT_DEFINE_INT(32)
 TLV_PUT_DEFINE_INT(64)
 
 static int tlv_put_string(struct send_ctx *sctx, u16 attr,
@@ -5205,16 +5206,215 @@ static int send_hole(struct send_ctx *sctx, u64 end)
 	return ret;
 }
 
-static int send_extent_data(struct send_ctx *sctx,
-			    const u64 offset,
-			    const u64 len)
+static int send_encoded_inline_extent(struct send_ctx *sctx,
+				      struct btrfs_path *path, u64 offset,
+				      u64 len)
 {
+	struct btrfs_root *root = sctx->send_root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct inode *inode;
+	struct fs_path *p;
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *ei;
+	u64 ram_bytes;
+	size_t inline_size;
+	int ret;
+
+	inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	p = fs_path_alloc();
+	if (!p) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ei = btrfs_item_ptr(leaf, path->slots[0],
+			    struct btrfs_file_extent_item);
+	ram_bytes = btrfs_file_extent_ram_bytes(leaf, ei);
+	inline_size = btrfs_file_extent_inline_item_len(leaf, path->slots[0]);
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN,
+		    min(key.offset + ram_bytes - offset, len));
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN, ram_bytes);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET, offset - key.offset);
+	ret = btrfs_encoded_io_compression_from_extent(
+				btrfs_file_extent_compression(leaf, ei));
+	if (ret < 0)
+		goto out;
+	TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret);
+
+	ret = put_data_header(sctx, inline_size);
+	if (ret < 0)
+		goto out;
+	read_extent_buffer(leaf, sctx->send_buf + sctx->send_size,
+			   btrfs_file_extent_inline_start(ei), inline_size);
+	sctx->send_size += inline_size;
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(p);
+	iput(inode);
+	return ret;
+}
+
+static int send_encoded_extent(struct send_ctx *sctx, struct btrfs_path *path,
+			       u64 offset, u64 len)
+{
+	struct btrfs_root *root = sctx->send_root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct inode *inode;
+	struct fs_path *p;
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *ei;
+	u64 disk_bytenr, disk_num_bytes;
+	u32 data_offset;
+	struct btrfs_cmd_header *hdr;
+	u32 crc;
+	int ret;
+
+	inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	p = fs_path_alloc();
+	if (!p) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ei = btrfs_item_ptr(leaf, path->slots[0],
+			    struct btrfs_file_extent_item);
+	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, ei);
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN,
+		    min(key.offset + btrfs_file_extent_num_bytes(leaf, ei) - offset,
+			len));
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN,
+		    btrfs_file_extent_ram_bytes(leaf, ei));
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET,
+		    offset - key.offset + btrfs_file_extent_offset(leaf, ei));
+	ret = btrfs_encoded_io_compression_from_extent(
+				btrfs_file_extent_compression(leaf, ei));
+	if (ret < 0)
+		goto out;
+	TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret);
+	TLV_PUT_U32(sctx, BTRFS_SEND_A_ENCRYPTION, 0);
+
+	ret = put_data_header(sctx, disk_num_bytes);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * We want to do I/O directly into the send buffer, so get the next page
+	 * boundary in the send buffer. This means that there may be a gap
+	 * between the beginning of the command and the file data.
+	 */
+	data_offset = ALIGN(sctx->send_size, PAGE_SIZE);
+	if (data_offset > sctx->send_max_size ||
+	    sctx->send_max_size - data_offset < disk_num_bytes) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+
+	/*
+	 * Note that send_buf is a mapping of send_buf_pages, so this is really
+	 * reading into send_buf.
+	 */
+	ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_bytenr,
+						    disk_num_bytes,
+						    sctx->send_buf_pages +
+						    (data_offset >> PAGE_SHIFT));
+	if (ret)
+		goto out;
+
+	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
+	hdr->len = cpu_to_le32(sctx->send_size + disk_num_bytes - sizeof(*hdr));
+	hdr->crc = 0;
+	crc = btrfs_crc32c(0, sctx->send_buf, sctx->send_size);
+	crc = btrfs_crc32c(crc, sctx->send_buf + data_offset, disk_num_bytes);
+	hdr->crc = cpu_to_le32(crc);
+
+	ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size,
+			&sctx->send_off);
+	if (!ret) {
+		ret = write_buf(sctx->send_filp, sctx->send_buf + data_offset,
+				disk_num_bytes, &sctx->send_off);
+	}
+	sctx->send_size = 0;
+	sctx->put_data = false;
+
+tlv_put_failure:
+out:
+	fs_path_free(p);
+	iput(inode);
+	return ret;
+}
+
+static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
+			    const u64 offset, const u64 len)
+{
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_file_extent_item *ei;
 	u64 read_size = max_send_read_size(sctx);
 	u64 sent = 0;
 
 	if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA)
 		return send_update_extent(sctx, offset, len);
 
+	ei = btrfs_item_ptr(leaf, path->slots[0],
+			    struct btrfs_file_extent_item);
+	if ((sctx->flags & BTRFS_SEND_FLAG_COMPRESSED) &&
+	    btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
+		bool is_inline = (btrfs_file_extent_type(leaf, ei) ==
+				  BTRFS_FILE_EXTENT_INLINE);
+
+		/*
+		 * Send the compressed extent unless the compressed data is
+		 * larger than the decompressed data. This can happen if we're
+		 * not sending the entire extent, either because it has been
+		 * partially overwritten/truncated or because this is a part of
+		 * the extent that we couldn't clone in clone_range().
+		 */
+		if (is_inline &&
+		    btrfs_file_extent_inline_item_len(leaf,
+						      path->slots[0]) <= len) {
+			return send_encoded_inline_extent(sctx, path, offset,
+							  len);
+		} else if (!is_inline &&
+			   btrfs_file_extent_disk_num_bytes(leaf, ei) <= len) {
+			return send_encoded_extent(sctx, path, offset, len);
+		}
+	}
+
 	while (sent < len) {
 		u64 size = min(len - sent, read_size);
 		int ret;
@@ -5285,12 +5485,9 @@ static int send_capabilities(struct send_ctx *sctx)
 	return ret;
 }
 
-static int clone_range(struct send_ctx *sctx,
-		       struct clone_root *clone_root,
-		       const u64 disk_byte,
-		       u64 data_offset,
-		       u64 offset,
-		       u64 len)
+static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
+		       struct clone_root *clone_root, const u64 disk_byte,
+		       u64 data_offset, u64 offset, u64 len)
 {
 	struct btrfs_path *path;
 	struct btrfs_key key;
@@ -5314,7 +5511,7 @@ static int clone_range(struct send_ctx *sctx,
 	 */
 	if (clone_root->offset == 0 &&
 	    len == sctx->send_root->fs_info->sectorsize)
-		return send_extent_data(sctx, offset, len);
+		return send_extent_data(sctx, dst_path, offset, len);
 
 	path = alloc_path_for_send();
 	if (!path)
@@ -5411,7 +5608,8 @@ static int clone_range(struct send_ctx *sctx,
 
 			if (hole_len > len)
 				hole_len = len;
-			ret = send_extent_data(sctx, offset, hole_len);
+			ret = send_extent_data(sctx, dst_path, offset,
+					       hole_len);
 			if (ret < 0)
 				goto out;
 
@@ -5484,14 +5682,16 @@ static int clone_range(struct send_ctx *sctx,
 					if (ret < 0)
 						goto out;
 				}
-				ret = send_extent_data(sctx, offset + slen,
+				ret = send_extent_data(sctx, dst_path,
+						       offset + slen,
 						       clone_len - slen);
 			} else {
 				ret = send_clone(sctx, offset, clone_len,
 						 clone_root);
 			}
 		} else {
-			ret = send_extent_data(sctx, offset, clone_len);
+			ret = send_extent_data(sctx, dst_path, offset,
+					       clone_len);
 		}
 
 		if (ret < 0)
@@ -5523,7 +5723,7 @@ static int clone_range(struct send_ctx *sctx,
 	}
 
 	if (len > 0)
-		ret = send_extent_data(sctx, offset, len);
+		ret = send_extent_data(sctx, dst_path, offset, len);
 	else
 		ret = 0;
 out:
@@ -5554,10 +5754,10 @@ static int send_write_or_clone(struct send_ctx *sctx,
 				    struct btrfs_file_extent_item);
 		disk_byte = btrfs_file_extent_disk_bytenr(path->nodes[0], ei);
 		data_offset = btrfs_file_extent_offset(path->nodes[0], ei);
-		ret = clone_range(sctx, clone_root, disk_byte, data_offset,
-				  offset, end - offset);
+		ret = clone_range(sctx, path, clone_root, disk_byte,
+				  data_offset, offset, end - offset);
 	} else {
-		ret = send_extent_data(sctx, offset, end - offset);
+		ret = send_extent_data(sctx, path, offset, end - offset);
 	}
 	sctx->cur_inode_next_write_offset = end;
 	return ret;
-- 
2.34.0



* [PATCH v12 17/17] btrfs: send: enable support for stream v2 and compressed writes
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (15 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 16/17] btrfs: send: send compressed extents with encoded writes Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 01/10] btrfs-progs: receive: support v2 send stream larger tlv_len Omar Sandoval
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Now that the new support is implemented, allow the send ioctl to accept
stream version 2 and the compressed flag, and update the version
reported in sysfs.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/send.c            | 7 +++++--
 fs/btrfs/send.h            | 2 +-
 include/uapi/linux/btrfs.h | 3 ++-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7a6a23a63950..e199b85e9c2c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -685,8 +685,7 @@ static int send_header(struct send_ctx *sctx)
 	struct btrfs_stream_header hdr;
 
 	strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC);
-	hdr.version = cpu_to_le32(BTRFS_SEND_STREAM_VERSION);
-
+	hdr.version = cpu_to_le32(sctx->proto);
 	return write_buf(sctx->send_filp, &hdr, sizeof(hdr),
 					&sctx->send_off);
 }
@@ -7496,6 +7495,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	} else {
 		sctx->proto = 1;
 	}
+	if ((arg->flags & BTRFS_SEND_FLAG_COMPRESSED) && sctx->proto < 2) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	sctx->send_filp = fget(arg->send_fd);
 	if (!sctx->send_filp) {
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
index 50c2284f08af..cf15d4078ac6 100644
--- a/fs/btrfs/send.h
+++ b/fs/btrfs/send.h
@@ -10,7 +10,7 @@
 #include "ctree.h"
 
 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
-#define BTRFS_SEND_STREAM_VERSION 1
+#define BTRFS_SEND_STREAM_VERSION 2
 
 /*
  * In send stream v1, no command is larger than 64k. In send stream v2, no limit
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 9d5fbe8c36c4..2533da0a0de1 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -787,7 +787,8 @@ struct btrfs_ioctl_received_subvol_args {
 	(BTRFS_SEND_FLAG_NO_FILE_DATA | \
 	 BTRFS_SEND_FLAG_OMIT_STREAM_HEADER | \
 	 BTRFS_SEND_FLAG_OMIT_END_CMD | \
-	 BTRFS_SEND_FLAG_VERSION)
+	 BTRFS_SEND_FLAG_VERSION | \
+	 BTRFS_SEND_FLAG_COMPRESSED)
 
 struct btrfs_ioctl_send_args {
 	__s64 send_fd;			/* in */
-- 
2.34.0



* [PATCH v12 01/10] btrfs-progs: receive: support v2 send stream larger tlv_len
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (16 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 17/17] btrfs: send: enable support for stream v2 and compressed writes Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 02/10] btrfs-progs: receive: dynamically allocate sctx->read_buf Omar Sandoval
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <borisb@fb.com>

An encoded extent can be up to 128K in length, which exceeds the largest
value expressible by the current send stream format's 16-bit tlv_len
field. Since encoded writes cannot be split into multiple writes by
btrfs send, the send stream format must change to accommodate encoded
writes.
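
As a rough illustration of the size problem (a sketch only, not part of
this patch; the ex_* helper names are hypothetical), the following shows
that a 128K length cannot be stored in a 16-bit field, while the 48K
read size used for v1 write commands can:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does `len` fit in the v1 16-bit tlv_len field? */
static bool ex_fits_v1_tlv_len(uint32_t len)
{
	return len <= UINT16_MAX;
}

/*
 * Hypothetical helper: narrow a length to the v1 on-wire width.
 * Callers must check ex_fits_v1_tlv_len() first; otherwise the value
 * would silently wrap (128K modulo 64K is 0).
 */
static uint16_t ex_narrow_to_v1(uint32_t len)
{
	assert(ex_fits_v1_tlv_len(len));
	return (uint16_t)len;
}
```

A 32-bit length field, as in struct btrfs_send_attribute, avoids the
wraparound for any extent size the stream can currently carry.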

Supporting this changed format requires retooling how we store the
commands we have processed. We currently store pointers to the struct
btrfs_tlv_headers in the command buffer. This is not sufficient to
represent the new BTRFS_SEND_A_DATA format. Instead, parse the attribute
headers and store them in a new struct btrfs_send_attribute which has a
32-bit length field. This is transparent to users of the various TLV_GET
macros.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 common/send-stream.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/common/send-stream.c b/common/send-stream.c
index e9be922b..7d182238 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -24,13 +24,23 @@
 #include "crypto/crc32c.h"
 #include "common/utils.h"
 
+struct btrfs_send_attribute {
+	u16 tlv_type;
+	/*
+	 * Note: in btrfs_tlv_header, this is __le16, but we need 32 bits for
+	 * attributes with file data as of version 2 of the send stream format
+	 */
+	u32 tlv_len;
+	char *data;
+};
+
 struct btrfs_send_stream {
 	char read_buf[BTRFS_SEND_BUF_SIZE];
 	int fd;
 
 	int cmd;
 	struct btrfs_cmd_header *cmd_hdr;
-	struct btrfs_tlv_header *cmd_attrs[BTRFS_SEND_A_MAX + 1];
+	struct btrfs_send_attribute cmd_attrs[BTRFS_SEND_A_MAX + 1];
 	u32 version;
 
 	/*
@@ -152,6 +162,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 		struct btrfs_tlv_header *tlv_hdr;
 		u16 tlv_type;
 		u16 tlv_len;
+		struct btrfs_send_attribute *send_attr;
 
 		tlv_hdr = (struct btrfs_tlv_header *)data;
 		tlv_type = le16_to_cpu(tlv_hdr->tlv_type);
@@ -164,10 +175,15 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 			goto out;
 		}
 
-		sctx->cmd_attrs[tlv_type] = tlv_hdr;
+		send_attr = &sctx->cmd_attrs[tlv_type];
+		send_attr->tlv_type = tlv_type;
+		send_attr->tlv_len = tlv_len;
+		pos += sizeof(*tlv_hdr);
+		data += sizeof(*tlv_hdr);
 
-		data += sizeof(*tlv_hdr) + tlv_len;
-		pos += sizeof(*tlv_hdr) + tlv_len;
+		send_attr->data = data;
+		pos += send_attr->tlv_len;
+		data += send_attr->tlv_len;
 	}
 
 	sctx->cmd = cmd;
@@ -180,7 +196,7 @@ out:
 static int tlv_get(struct btrfs_send_stream *sctx, int attr, void **data, int *len)
 {
 	int ret;
-	struct btrfs_tlv_header *hdr;
+	struct btrfs_send_attribute *send_attr;
 
 	if (attr <= 0 || attr > BTRFS_SEND_A_MAX) {
 		error("invalid attribute requested, attr = %d", attr);
@@ -188,15 +204,15 @@ static int tlv_get(struct btrfs_send_stream *sctx, int attr, void **data, int *l
 		goto out;
 	}
 
-	hdr = sctx->cmd_attrs[attr];
-	if (!hdr) {
+	send_attr = &sctx->cmd_attrs[attr];
+	if (!send_attr->data) {
 		error("attribute %d requested but not present", attr);
 		ret = -ENOENT;
 		goto out;
 	}
 
-	*len = le16_to_cpu(hdr->tlv_len);
-	*data = hdr + 1;
+	*len = send_attr->tlv_len;
+	*data = send_attr->data;
 
 	ret = 0;
 
-- 
2.34.0



* [PATCH v12 02/10] btrfs-progs: receive: dynamically allocate sctx->read_buf
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (17 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 01/10] btrfs-progs: receive: support v2 send stream larger tlv_len Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 03/10] btrfs-progs: receive: support v2 send stream DATA tlv format Omar Sandoval
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

In send stream v2, write commands can be of arbitrary size, so read_cmd
can no longer rely on a fixed-size array in sctx. Instead, read_cmd
dynamically allocates sctx->read_buf. To avoid needless reallocations,
we reuse read_buf between read_cmd calls and track the size of the
allocated buffer in sctx->read_buf_sz.

The first allocation uses the old default size and happens at the start
of processing the stream; we only reallocate if we encounter a command
that needs a larger buffer.
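
The grow-only reuse described above can be sketched in isolation as
follows. This is a simplified illustration, not the code from the diff:
struct ex_read_buffer and ex_ensure_capacity() are hypothetical names
standing in for sctx->read_buf, sctx->read_buf_sz, and the realloc
logic in read_cmd.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the read_buf/read_buf_sz pair in sctx. */
struct ex_read_buffer {
	char *buf;
	size_t size;
};

/*
 * Make sure the buffer can hold `needed` bytes. The buffer only ever
 * grows; a smaller request reuses the existing allocation.
 * Returns 0 on success or -ENOMEM, leaving the old buffer intact.
 */
static int ex_ensure_capacity(struct ex_read_buffer *rb, size_t needed)
{
	void *new_buf;

	assert(rb != NULL);
	if (rb->size >= needed)
		return 0;

	new_buf = realloc(rb->buf, needed);
	if (!new_buf)
		return -ENOMEM;

	/* realloc may move the block, so cached pointers must be reset. */
	rb->buf = new_buf;
	rb->size = needed;
	return 0;
}
```

Because realloc() may move the allocation, any pointer derived from the
old buffer (such as cmd_hdr in read_cmd) has to be recomputed after a
successful grow.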

Signed-off-by: Boris Burkov <boris@bur.io>
---
 common/send-stream.c   | 56 ++++++++++++++++++++++++++++--------------
 kernel-shared/send.h   |  2 +-
 libbtrfs/send-stream.c |  2 +-
 3 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/common/send-stream.c b/common/send-stream.c
index 7d182238..421cd1bb 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -35,11 +35,11 @@ struct btrfs_send_attribute {
 };
 
 struct btrfs_send_stream {
-	char read_buf[BTRFS_SEND_BUF_SIZE];
+	char *read_buf;
+	size_t read_buf_sz;
 	int fd;
 
 	int cmd;
-	struct btrfs_cmd_header *cmd_hdr;
 	struct btrfs_send_attribute cmd_attrs[BTRFS_SEND_A_MAX + 1];
 	u32 version;
 
@@ -111,11 +111,12 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 	u32 pos;
 	u32 crc;
 	u32 crc2;
+	struct btrfs_cmd_header *cmd_hdr;
+	size_t buf_len;
 
 	memset(sctx->cmd_attrs, 0, sizeof(sctx->cmd_attrs));
 
-	ASSERT(sizeof(*sctx->cmd_hdr) <= sizeof(sctx->read_buf));
-	ret = read_buf(sctx, sctx->read_buf, sizeof(*sctx->cmd_hdr));
+	ret = read_buf(sctx, sctx->read_buf, sizeof(*cmd_hdr));
 	if (ret < 0)
 		goto out;
 	if (ret) {
@@ -124,18 +125,25 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 		goto out;
 	}
 
-	sctx->cmd_hdr = (struct btrfs_cmd_header *)sctx->read_buf;
-	cmd = le16_to_cpu(sctx->cmd_hdr->cmd);
-	cmd_len = le32_to_cpu(sctx->cmd_hdr->len);
+	cmd_hdr = (struct btrfs_cmd_header *)sctx->read_buf;
+	cmd_len = le32_to_cpu(cmd_hdr->len);
+	cmd = le16_to_cpu(cmd_hdr->cmd);
+	buf_len = sizeof(*cmd_hdr) + cmd_len;
+	if (sctx->read_buf_sz < buf_len) {
+		void *new_read_buf;
 
-	if (cmd_len + sizeof(*sctx->cmd_hdr) >= sizeof(sctx->read_buf)) {
-		ret = -EINVAL;
-		error("command length %u too big for buffer %zu",
-				cmd_len, sizeof(sctx->read_buf));
-		goto out;
+		new_read_buf = realloc(sctx->read_buf, buf_len);
+		if (!new_read_buf) {
+			ret = -ENOMEM;
+			error("failed to reallocate read buffer for cmd");
+			goto out;
+		}
+		sctx->read_buf = new_read_buf;
+		sctx->read_buf_sz = buf_len;
+		/* We need to reset cmd_hdr after realloc of sctx->read_buf */
+		cmd_hdr = (struct btrfs_cmd_header *)sctx->read_buf;
 	}
-
-	data = sctx->read_buf + sizeof(*sctx->cmd_hdr);
+	data = sctx->read_buf + sizeof(*cmd_hdr);
 	ret = read_buf(sctx, data, cmd_len);
 	if (ret < 0)
 		goto out;
@@ -145,11 +153,12 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 		goto out;
 	}
 
-	crc = le32_to_cpu(sctx->cmd_hdr->crc);
-	sctx->cmd_hdr->crc = 0;
+	crc = le32_to_cpu(cmd_hdr->crc);
+	/* in send, crc is computed with header crc = 0, replicate that */
+	cmd_hdr->crc = 0;
 
 	crc2 = crc32c(0, (unsigned char*)sctx->read_buf,
-			sizeof(*sctx->cmd_hdr) + cmd_len);
+			sizeof(*cmd_hdr) + cmd_len);
 
 	if (crc != crc2) {
 		ret = -EINVAL;
@@ -537,19 +546,28 @@ int btrfs_read_and_process_send_stream(int fd,
 		goto out;
 	}
 
+	sctx.read_buf = malloc(BTRFS_SEND_BUF_SIZE_V1);
+	if (!sctx.read_buf) {
+		ret = -ENOMEM;
+		error("unable to allocate send stream read buffer");
+		goto out;
+	}
+	sctx.read_buf_sz = BTRFS_SEND_BUF_SIZE_V1;
+
 	while (1) {
 		ret = read_and_process_cmd(&sctx);
 		if (ret < 0) {
 			last_err = ret;
 			errors++;
 			if (max_errors > 0 && errors >= max_errors)
-				goto out;
+				break;
 		} else if (ret > 0) {
 			if (!honor_end_cmd)
 				ret = 0;
-			goto out;
+			break;
 		}
 	}
+	free(sctx.read_buf);
 
 out:
 	if (last_err && !ret)
diff --git a/kernel-shared/send.h b/kernel-shared/send.h
index e73f09df..c8003aa5 100644
--- a/kernel-shared/send.h
+++ b/kernel-shared/send.h
@@ -33,7 +33,7 @@ extern "C" {
 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
 #define BTRFS_SEND_STREAM_VERSION 1
 
-#define BTRFS_SEND_BUF_SIZE  (64 * 1024)
+#define BTRFS_SEND_BUF_SIZE_V1 (64 * 1024)
 #define BTRFS_SEND_READ_SIZE (1024 * 48)
 
 enum btrfs_tlv_type {
diff --git a/libbtrfs/send-stream.c b/libbtrfs/send-stream.c
index 2b21d846..39cbb3ed 100644
--- a/libbtrfs/send-stream.c
+++ b/libbtrfs/send-stream.c
@@ -22,7 +22,7 @@
 #include "crypto/crc32c.h"
 
 struct btrfs_send_stream {
-	char read_buf[BTRFS_SEND_BUF_SIZE];
+	char read_buf[BTRFS_SEND_BUF_SIZE_V1];
 	int fd;
 
 	int cmd;
-- 
2.34.0



* [PATCH v12 03/10] btrfs-progs: receive: support v2 send stream DATA tlv format
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (18 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 02/10] btrfs-progs: receive: dynamically allocate sctx->read_buf Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 04/10] btrfs-progs: receive: add send stream v2 cmds and attrs to send.h Omar Sandoval
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <borisb@fb.com>

The new format privileges the BTRFS_SEND_A_DATA attribute by
guaranteeing that it is always the last attribute in any command that
needs it, and by implicitly encoding the data length as the difference
between the total command length in the command header and the sizes of
the preceding attributes (plus the tlv_type identifying the DATA
attribute itself). To parse the new stream, we read the tlv_type first.
If it is not DATA, we proceed as before; if it is DATA, we do not parse
a tlv_len and instead compute the length.

In addition, we add some bounds checking when parsing each chunk of
data, as well as for the tlv_len itself.
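
The parsing rule above can be sketched as a standalone function. This
is an illustration under stated assumptions, not the code from the
diff: the ex_* names are hypothetical, and EX_ATTR_DATA is an assumed
numeric value used only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ATTR_DATA 19	/* assumed value, for illustration only */

struct ex_attr {
	uint16_t type;
	uint32_t len;
	const uint8_t *data;
};

/* Read a little-endian 16-bit value without alignment assumptions. */
static uint16_t ex_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Parse one attribute starting at buf[*pos] and advance *pos past it.
 * cmd_len is the total length of the command payload.
 * Returns 0 on success or -EINVAL if the payload is truncated.
 */
static int ex_parse_attr(const uint8_t *buf, uint32_t cmd_len,
			 uint32_t *pos, int version, struct ex_attr *out)
{
	assert(*pos <= cmd_len);

	if (cmd_len - *pos < 2)
		return -EINVAL;
	out->type = ex_get_le16(buf + *pos);
	*pos += 2;

	if (version == 2 && out->type == EX_ATTR_DATA) {
		/* v2 DATA: no tlv_len, the rest of the command is data. */
		out->len = cmd_len - *pos;
	} else {
		if (cmd_len - *pos < 2)
			return -EINVAL;
		out->len = ex_get_le16(buf + *pos);
		*pos += 2;
	}

	if (cmd_len - *pos < out->len)
		return -EINVAL;
	out->data = buf + *pos;
	*pos += out->len;
	return 0;
}
```

Parsing the same DATA attribute as version 1 would treat its first two
data bytes as a tlv_len, which is why the parser needs the stream
version before it can decode a command.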

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 common/send-stream.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/common/send-stream.c b/common/send-stream.c
index 421cd1bb..85b9998d 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -168,28 +168,44 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 
 	pos = 0;
 	while (pos < cmd_len) {
-		struct btrfs_tlv_header *tlv_hdr;
 		u16 tlv_type;
-		u16 tlv_len;
 		struct btrfs_send_attribute *send_attr;
 
-		tlv_hdr = (struct btrfs_tlv_header *)data;
-		tlv_type = le16_to_cpu(tlv_hdr->tlv_type);
-		tlv_len = le16_to_cpu(tlv_hdr->tlv_len);
+		if (cmd_len - pos < sizeof(__le16)) {
+			error("send stream is truncated");
+			ret = -EINVAL;
+			goto out;
+		}
+		tlv_type = le16_to_cpu(*(__le16 *)data);
 
 		if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX) {
-			error("invalid tlv in cmd tlv_type = %hu, tlv_len = %hu",
-					tlv_type, tlv_len);
+			error("invalid tlv in cmd tlv_type = %hu", tlv_type);
 			ret = -EINVAL;
 			goto out;
 		}
 
 		send_attr = &sctx->cmd_attrs[tlv_type];
 		send_attr->tlv_type = tlv_type;
-		send_attr->tlv_len = tlv_len;
-		pos += sizeof(*tlv_hdr);
-		data += sizeof(*tlv_hdr);
 
+		pos += sizeof(tlv_type);
+		data += sizeof(tlv_type);
+		if (sctx->version == 2 && tlv_type == BTRFS_SEND_A_DATA) {
+			send_attr->tlv_len = cmd_len - pos;
+		} else {
+			if (cmd_len - pos < sizeof(__le16)) {
+				error("send stream is truncated");
+				ret = -EINVAL;
+				goto out;
+			}
+			send_attr->tlv_len = le16_to_cpu(*(__le16 *)data);
+			pos += sizeof(__le16);
+			data += sizeof(__le16);
+		}
+		if (cmd_len - pos < send_attr->tlv_len) {
+			error("send stream is truncated");
+			ret = -EINVAL;
+			goto out;
+		}
 		send_attr->data = data;
 		pos += send_attr->tlv_len;
 		data += send_attr->tlv_len;
-- 
2.34.0



* [PATCH v12 04/10] btrfs-progs: receive: add send stream v2 cmds and attrs to send.h
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (19 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 03/10] btrfs-progs: receive: support v2 send stream DATA tlv format Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 05/10] btrfs-progs: receive: process encoded_write commands Omar Sandoval
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

Send stream v2 adds three commands and several attributes associated
with those commands. Before we implement processing them, add all the
commands and attributes. This avoids leaving the enums in an
intermediate state that doesn't correspond to any version of send
stream.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 kernel-shared/send.h | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel-shared/send.h b/kernel-shared/send.h
index c8003aa5..1458dd29 100644
--- a/kernel-shared/send.h
+++ b/kernel-shared/send.h
@@ -34,7 +34,6 @@ extern "C" {
 #define BTRFS_SEND_STREAM_VERSION 1
 
 #define BTRFS_SEND_BUF_SIZE_V1 (64 * 1024)
-#define BTRFS_SEND_READ_SIZE (1024 * 48)
 
 enum btrfs_tlv_type {
 	BTRFS_TLV_U8,
@@ -70,6 +69,7 @@ struct btrfs_tlv_header {
 enum btrfs_send_cmd {
 	BTRFS_SEND_C_UNSPEC,
 
+	/* Version 1 */
 	BTRFS_SEND_C_SUBVOL,
 	BTRFS_SEND_C_SNAPSHOT,
 
@@ -98,6 +98,15 @@ enum btrfs_send_cmd {
 
 	BTRFS_SEND_C_END,
 	BTRFS_SEND_C_UPDATE_EXTENT,
+	BTRFS_SEND_C_MAX_V1 = BTRFS_SEND_C_UPDATE_EXTENT,
+
+	/* Version 2 */
+	BTRFS_SEND_C_FALLOCATE,
+	BTRFS_SEND_C_SETFLAGS,
+	BTRFS_SEND_C_ENCODED_WRITE,
+	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_ENCODED_WRITE,
+
+	/* End */
 	__BTRFS_SEND_C_MAX,
 };
 #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
@@ -106,6 +115,7 @@ enum btrfs_send_cmd {
 enum {
 	BTRFS_SEND_A_UNSPEC,
 
+	/* Version 1 */
 	BTRFS_SEND_A_UUID,
 	BTRFS_SEND_A_CTRANSID,
 
@@ -128,6 +138,11 @@ enum {
 	BTRFS_SEND_A_PATH_LINK,
 
 	BTRFS_SEND_A_FILE_OFFSET,
+	/*
+	 * In send stream v2, this attribute is special: it must be the last
+	 * attribute in a command, its header contains only the type, and its
+	 * length is implicitly the remaining length of the command.
+	 */
 	BTRFS_SEND_A_DATA,
 
 	BTRFS_SEND_A_CLONE_UUID,
@@ -135,7 +150,25 @@ enum {
 	BTRFS_SEND_A_CLONE_PATH,
 	BTRFS_SEND_A_CLONE_OFFSET,
 	BTRFS_SEND_A_CLONE_LEN,
+	BTRFS_SEND_A_MAX_V1 = BTRFS_SEND_A_CLONE_LEN,
 
+	/* Version 2 */
+	BTRFS_SEND_A_FALLOCATE_MODE,
+
+	BTRFS_SEND_A_SETFLAGS_FLAGS,
+
+	BTRFS_SEND_A_UNENCODED_FILE_LEN,
+	BTRFS_SEND_A_UNENCODED_LEN,
+	BTRFS_SEND_A_UNENCODED_OFFSET,
+	/*
+	 * COMPRESSION and ENCRYPTION default to NONE (0) if omitted from
+	 * BTRFS_SEND_C_ENCODED_WRITE.
+	 */
+	BTRFS_SEND_A_COMPRESSION,
+	BTRFS_SEND_A_ENCRYPTION,
+	BTRFS_SEND_A_MAX_V2 = BTRFS_SEND_A_ENCRYPTION,
+
+	/* End */
 	__BTRFS_SEND_A_MAX,
 };
 #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
-- 
2.34.0



* [PATCH v12 05/10] btrfs-progs: receive: process encoded_write commands
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (20 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 04/10] btrfs-progs: receive: add send stream v2 cmds and attrs to send.h Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 06/10] btrfs-progs: receive: encoded_write fallback to explicit decode and write Omar Sandoval
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <borisb@fb.com>

Add a new encoded_write callback to struct btrfs_send_ops, with support
for both dumping and proper receive processing, which performs the
actual encoded writes.

Encoded writes are only allowed on a file descriptor opened with an
extra flag that allows encoded writes, so we also add support for this
flag when opening or reusing a file for writing.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 cmds/receive-dump.c  |  16 +++++-
 cmds/receive.c       |  48 ++++++++++++++++
 common/send-stream.c |  29 ++++++++++
 common/send-stream.h |   4 ++
 ioctl.h              | 132 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 228 insertions(+), 1 deletion(-)

diff --git a/cmds/receive-dump.c b/cmds/receive-dump.c
index 47a0a30e..09bb947b 100644
--- a/cmds/receive-dump.c
+++ b/cmds/receive-dump.c
@@ -319,6 +319,19 @@ static int print_update_extent(const char *path, u64 offset, u64 len,
 			  offset, len);
 }
 
+static int print_encoded_write(const char *path, const void *data, u64 offset,
+			       u64 len, u64 unencoded_file_len,
+			       u64 unencoded_len, u64 unencoded_offset,
+			       u32 compression, u32 encryption, void *user)
+{
+	return PRINT_DUMP(user, path, "encoded_write",
+			  "offset=%llu len=%llu, unencoded_file_len=%llu, "
+			  "unencoded_len=%llu, unencoded_offset=%llu, "
+			  "compression=%u, encryption=%u",
+			  offset, len, unencoded_file_len, unencoded_len,
+			  unencoded_offset, compression, encryption);
+}
+
 struct btrfs_send_ops btrfs_print_send_ops = {
 	.subvol = print_subvol,
 	.snapshot = print_snapshot,
@@ -340,5 +353,6 @@ struct btrfs_send_ops btrfs_print_send_ops = {
 	.chmod = print_chmod,
 	.chown = print_chown,
 	.utimes = print_utimes,
-	.update_extent = print_update_extent
+	.update_extent = print_update_extent,
+	.encoded_write = print_encoded_write,
 };
diff --git a/cmds/receive.c b/cmds/receive.c
index 4d123a1f..38579efc 100644
--- a/cmds/receive.c
+++ b/cmds/receive.c
@@ -29,12 +29,14 @@
 #include <assert.h>
 #include <getopt.h>
 #include <limits.h>
+#include <errno.h>
 
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#include <sys/uio.h>
 #include <sys/xattr.h>
 #include <uuid/uuid.h>
 
@@ -49,6 +51,7 @@
 #include "cmds/receive-dump.h"
 #include "common/help.h"
 #include "common/path-utils.h"
+#include "stubs.h"
 
 struct btrfs_receive
 {
@@ -982,6 +985,50 @@ static int process_update_extent(const char *path, u64 offset, u64 len,
 	return 0;
 }
 
+static int process_encoded_write(const char *path, const void *data, u64 offset,
+				 u64 len, u64 unencoded_file_len,
+				 u64 unencoded_len, u64 unencoded_offset,
+				 u32 compression, u32 encryption, void *user)
+{
+	int ret;
+	struct btrfs_receive *rctx = user;
+	char full_path[PATH_MAX];
+	struct iovec iov = { (char *)data, len };
+	struct btrfs_ioctl_encoded_io_args encoded = {
+		.iov = &iov,
+		.iovcnt = 1,
+		.offset = offset,
+		.len = unencoded_file_len,
+		.unencoded_len = unencoded_len,
+		.unencoded_offset = unencoded_offset,
+		.compression = compression,
+		.encryption = encryption,
+	};
+
+	if (encryption) {
+		error("encoded_write: encryption not supported");
+		return -EOPNOTSUPP;
+	}
+
+	ret = path_cat_out(full_path, rctx->full_subvol_path, path);
+	if (ret < 0) {
+		error("encoded_write: path invalid: %s", path);
+		return ret;
+	}
+
+	ret = open_inode_for_write(rctx, full_path);
+	if (ret < 0)
+		return ret;
+
+	ret = ioctl(rctx->write_fd, BTRFS_IOC_ENCODED_WRITE, &encoded);
+	if (ret < 0) {
+		ret = -errno;
+		error("encoded_write: writing to %s failed: %m", path);
+		return ret;
+	}
+	return 0;
+}
+
 static struct btrfs_send_ops send_ops = {
 	.subvol = process_subvol,
 	.snapshot = process_snapshot,
@@ -1004,6 +1051,7 @@ static struct btrfs_send_ops send_ops = {
 	.chown = process_chown,
 	.utimes = process_utimes,
 	.update_extent = process_update_extent,
+	.encoded_write = process_encoded_write,
 };
 
 static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
diff --git a/common/send-stream.c b/common/send-stream.c
index 85b9998d..33227d3f 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -357,6 +357,8 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	struct timespec mt;
 	u8 uuid[BTRFS_UUID_SIZE];
 	u8 clone_uuid[BTRFS_UUID_SIZE];
+	u32 compression;
+	u32 encryption;
 	u64 tmp;
 	u64 tmp2;
 	u64 ctransid;
@@ -366,6 +368,9 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	u64 clone_offset;
 	u64 offset;
 	u64 ino;
+	u64 unencoded_file_len;
+	u64 unencoded_len;
+	u64 unencoded_offset;
 	int len;
 	int xattr_len;
 
@@ -452,6 +457,30 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 		TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
 		ret = sctx->ops->write(path, data, offset, len, sctx->user);
 		break;
+	case BTRFS_SEND_C_ENCODED_WRITE:
+		TLV_GET_STRING(sctx, BTRFS_SEND_A_PATH, &path);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, &offset);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN,
+			    &unencoded_file_len);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN, &unencoded_len);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET,
+			    &unencoded_offset);
+		/* Compression and encryption default to none if omitted. */
+		if (sctx->cmd_attrs[BTRFS_SEND_A_COMPRESSION].data)
+			TLV_GET_U32(sctx, BTRFS_SEND_A_COMPRESSION, &compression);
+		else
+			compression = BTRFS_ENCODED_IO_COMPRESSION_NONE;
+		if (sctx->cmd_attrs[BTRFS_SEND_A_ENCRYPTION].data)
+			TLV_GET_U32(sctx, BTRFS_SEND_A_ENCRYPTION, &encryption);
+		else
+			encryption = BTRFS_ENCODED_IO_ENCRYPTION_NONE;
+		TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
+		ret = sctx->ops->encoded_write(path, data, offset, len,
+					       unencoded_file_len,
+					       unencoded_len, unencoded_offset,
+					       compression, encryption,
+					       sctx->user);
+		break;
 	case BTRFS_SEND_C_CLONE:
 		TLV_GET_STRING(sctx, BTRFS_SEND_A_PATH, &path);
 		TLV_GET_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, &offset);
diff --git a/common/send-stream.h b/common/send-stream.h
index 2de51eac..44abbc9d 100644
--- a/common/send-stream.h
+++ b/common/send-stream.h
@@ -53,6 +53,10 @@ struct btrfs_send_ops {
 		      struct timespec *mt, struct timespec *ct,
 		      void *user);
 	int (*update_extent)(const char *path, u64 offset, u64 len, void *user);
+	int (*encoded_write)(const char *path, const void *data, u64 offset,
+			     u64 len, u64 unencoded_file_len, u64 unencoded_len,
+			     u64 unencoded_offset, u32 compression,
+			     u32 encryption, void *user);
 };
 
 int btrfs_read_and_process_send_stream(int fd,
diff --git a/ioctl.h b/ioctl.h
index 368a87b2..dcfe0b6c 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -777,6 +777,134 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 };
 BUILD_ASSERT(sizeof(struct btrfs_ioctl_get_subvol_rootref_args) == 4096);
 
+/*
+ * Data and metadata for an encoded read or write.
+ *
+ * Encoded I/O bypasses any encoding automatically done by the filesystem (e.g.,
+ * compression). This can be used to read the compressed contents of a file or
+ * write pre-compressed data directly to a file.
+ *
+ * BTRFS_IOC_ENCODED_READ and BTRFS_IOC_ENCODED_WRITE are essentially
+ * preadv/pwritev with additional metadata about how the data is encoded and the
+ * size of the unencoded data.
+ *
+ * BTRFS_IOC_ENCODED_READ fills the given iovecs with the encoded data, fills
+ * the metadata fields, and returns the size of the encoded data. It reads one
+ * extent per call. It can also read data which is not encoded.
+ *
+ * BTRFS_IOC_ENCODED_WRITE uses the metadata fields, writes the encoded data
+ * from the iovecs, and returns the size of the encoded data. Note that the
+ * encoded data is not validated when it is written; if it is not valid (e.g.,
+ * it cannot be decompressed), then a subsequent read may return an error.
+ *
+ * Since the filesystem page cache contains decoded data, encoded I/O bypasses
+ * the page cache. Encoded I/O requires CAP_SYS_ADMIN.
+ */
+struct btrfs_ioctl_encoded_io_args {
+	/* Input parameters for both reads and writes. */
+
+	/*
+	 * iovecs containing encoded data.
+	 *
+	 * For reads, if the size of the encoded data is larger than the sum of
+	 * iov[n].iov_len for 0 <= n < iovcnt, then the ioctl fails with
+	 * ENOBUFS.
+	 *
+	 * For writes, the size of the encoded data is the sum of iov[n].iov_len
+	 * for 0 <= n < iovcnt. This must be less than 128 KiB (this limit may
+	 * increase in the future). This must also be less than or equal to
+	 * unencoded_len.
+	 */
+	const struct iovec __user *iov;
+	/* Number of iovecs. */
+	unsigned long iovcnt;
+	/*
+	 * Offset in file.
+	 *
+	 * For writes, must be aligned to the sector size of the filesystem.
+	 */
+	__s64 offset;
+	/* Currently must be zero. */
+	__u64 flags;
+
+	/*
+	 * For reads, the following members are output parameters that will
+	 * contain the returned metadata for the encoded data.
+	 * For writes, the following members must be set to the metadata for the
+	 * encoded data.
+	 */
+
+	/*
+	 * Length of the data in the file.
+	 *
+	 * Must be less than or equal to unencoded_len - unencoded_offset. For
+	 * writes, must be aligned to the sector size of the filesystem unless
+	 * the data ends at or beyond the current end of the file.
+	 */
+	__u64 len;
+	/*
+	 * Length of the unencoded (i.e., decrypted and decompressed) data.
+	 *
+	 * For writes, must be no more than 128 KiB (this limit may increase in
+	 * the future). If the unencoded data is actually longer than
+	 * unencoded_len, then it is truncated; if it is shorter, then it is
+	 * extended with zeroes.
+	 */
+	__u64 unencoded_len;
+	/*
+	 * Offset from the first byte of the unencoded data to the first byte of
+	 * logical data in the file.
+	 *
+	 * Must be less than unencoded_len.
+	 */
+	__u64 unencoded_offset;
+	/*
+	 * BTRFS_ENCODED_IO_COMPRESSION_* type.
+	 *
+	 * For writes, must not be BTRFS_ENCODED_IO_COMPRESSION_NONE.
+	 */
+	__u32 compression;
+	/* Currently always BTRFS_ENCODED_IO_ENCRYPTION_NONE. */
+	__u32 encryption;
+	/*
+	 * Reserved for future expansion.
+	 *
+	 * For reads, always returned as zero. Users should check for non-zero
+	 * bytes. If there are any, then the kernel has a newer version of this
+	 * structure with additional information that the user definition is
+	 * missing.
+	 *
+	 * For writes, must be zeroed.
+	 */
+	__u8 reserved[32];
+};
+
+/* Data is not compressed. */
+#define BTRFS_ENCODED_IO_COMPRESSION_NONE 0
+/* Data is compressed as a single zlib stream. */
+#define BTRFS_ENCODED_IO_COMPRESSION_ZLIB 1
+/*
+ * Data is compressed as a single zstd frame with the windowLog compression
+ * parameter set to no more than 17.
+ */
+#define BTRFS_ENCODED_IO_COMPRESSION_ZSTD 2
+/*
+ * Data is compressed sector by sector (using the sector size indicated by the
+ * name of the constant) with LZO1X and wrapped in the format documented in
+ * fs/btrfs/lzo.c. For writes, the compression sector size must match the
+ * filesystem sector size.
+ */
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_4K 3
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_8K 4
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_16K 5
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_32K 6
+#define BTRFS_ENCODED_IO_COMPRESSION_LZO_64K 7
+#define BTRFS_ENCODED_IO_COMPRESSION_TYPES 8
+
+/* Data is not encrypted. */
+#define BTRFS_ENCODED_IO_ENCRYPTION_NONE 0
+#define BTRFS_ENCODED_IO_ENCRYPTION_TYPES 1
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	notused,
@@ -951,6 +1079,10 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				struct btrfs_ioctl_ino_lookup_user_args)
 #define BTRFS_IOC_SNAP_DESTROY_V2 _IOW(BTRFS_IOCTL_MAGIC, 63, \
 				   struct btrfs_ioctl_vol_args_v2)
+#define BTRFS_IOC_ENCODED_READ _IOR(BTRFS_IOCTL_MAGIC, 64, \
+				    struct btrfs_ioctl_encoded_io_args)
+#define BTRFS_IOC_ENCODED_WRITE _IOW(BTRFS_IOCTL_MAGIC, 64, \
+				     struct btrfs_ioctl_encoded_io_args)
 
 #ifdef __cplusplus
 }
-- 
2.34.0



* [PATCH v12 06/10] btrfs-progs: receive: encoded_write fallback to explicit decode and write
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (21 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 05/10] btrfs-progs: receive: process encoded_write commands Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 07/10] btrfs-progs: receive: process fallocate commands Omar Sandoval
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

An encoded_write can fail if the file system it is being applied to does
not support encoded writes or if it can't find enough contiguous space
to accommodate the encoded extent. In those cases, we can likely still
process an encoded_write by explicitly decoding the data and doing a
normal write.

Add the necessary fallback path for decoding data compressed with zlib,
lzo, or zstd. zlib and zstd have reusable decoding context data
structures which we cache in the receive context so that we don't have
to recreate them on every encoded_write.

Finally, add a --force-decompress command line flag, which causes
receive to always use the fallback path rather than first attempting the
encoded write.

Signed-off-by: Boris Burkov <boris@bur.io>
---
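Note for readers: the fallback decision in process_encoded_write() comes
down to a small errno classification. The sketch below restates that
policy as a standalone predicate. The helper name
encoded_write_should_fall_back() is hypothetical and is not part of
btrfs-progs, and the cause given next to each errno is an interpretation
of the commit message, not something the patch guarantees.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of btrfs-progs): returns true if a failed
 * BTRFS_IOC_ENCODED_WRITE should be retried as decompress + plain write.
 * The errno list mirrors the check in process_encoded_write().
 */
static bool encoded_write_should_fall_back(int err)
{
	switch (err) {
	case ENOSPC:	/* e.g. no room for the encoded extent */
	case ENOTTY:	/* e.g. ioctl not implemented by this kernel */
	case EINVAL:	/* e.g. arguments rejected for an encoded write */
		return true;
	default:	/* anything else is reported as a hard failure */
		return false;
	}
}
```

Keeping the list short means unexpected errors such as EIO still stop
the receive instead of being retried through the decompression path.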
 Documentation/btrfs-receive.asciidoc |   4 +
 cmds/receive.c                       | 261 ++++++++++++++++++++++++++-
 2 files changed, 258 insertions(+), 7 deletions(-)
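
Note for readers: the least obvious part of the LZO fallback is the
"-in_pos % sector_size" expression, which yields the number of bytes
left in the current sector so that a 4-byte segment header never
straddles a sector boundary. A minimal sketch of that arithmetic
follows. Both helper names are hypothetical, the sector size is assumed
to be a power of two, and the end-of-data check that decompress_lzo()
performs before skipping padding is left out.

```c
#include <stddef.h>

/*
 * Bytes left in the current sector at in_pos. For a power-of-two
 * sector_size (assumed here) this equals the unsigned expression
 * "-in_pos % sector_size"; it is 0 exactly on a sector boundary.
 */
static size_t lzo_sector_remaining(size_t in_pos, size_t sector_size)
{
	return (sector_size - in_pos % sector_size) % sector_size;
}

/*
 * Hypothetical helper: position of the next 4-byte segment header. If
 * fewer than 4 bytes remain in the sector, the remainder is padding and
 * the header starts at the next sector boundary.
 */
static size_t lzo_next_header_pos(size_t in_pos, size_t sector_size)
{
	size_t remaining = lzo_sector_remaining(in_pos, sector_size);

	return remaining < 4 ? in_pos + remaining : in_pos;
}
```

For example, with 4 KiB sectors a header that would start at byte 4094
is moved to byte 4096, while one at byte 4 stays where it is.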

diff --git a/Documentation/btrfs-receive.asciidoc b/Documentation/btrfs-receive.asciidoc
index e4c4d2c0..354a71dc 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -60,6 +60,10 @@ By default the mountpoint is searched in '/proc/self/mounts'.
 If '/proc' is not accessible, eg. in a chroot environment, use this option to
 tell us where this filesystem is mounted.
 
+--force-decompress::
+if the stream contains compressed data (see '--compressed-data' in
+`btrfs-send`(8)), always decompress it instead of writing it with encoded I/O.
+
 --dump::
 dump the stream metadata, one line per operation
 +
diff --git a/cmds/receive.c b/cmds/receive.c
index 38579efc..60f9a3fe 100644
--- a/cmds/receive.c
+++ b/cmds/receive.c
@@ -40,6 +40,10 @@
 #include <sys/xattr.h>
 #include <uuid/uuid.h>
 
+#include <lzo/lzo1x.h>
+#include <zlib.h>
+#include <zstd.h>
+
 #include "kernel-shared/ctree.h"
 #include "ioctl.h"
 #include "cmds/commands.h"
@@ -75,6 +79,12 @@ struct btrfs_receive
 	char cur_subvol_path[PATH_MAX];
 
 	int honor_end_cmd;
+
+	bool force_decompress;
+
+	/* Reuse stream objects for encoded_write decompression fallback */
+	ZSTD_DStream *zstd_dstream;
+	z_stream *zlib_stream;
 };
 
 static int finish_subvol(struct btrfs_receive *rctx)
@@ -985,6 +995,219 @@ static int process_update_extent(const char *path, u64 offset, u64 len,
 	return 0;
 }
 
+static int decompress_zlib(struct btrfs_receive *rctx, const char *encoded_data,
+			   u64 encoded_len, char *unencoded_data,
+			   u64 unencoded_len)
+{
+	bool init = false;
+	int ret;
+
+	if (!rctx->zlib_stream) {
+		init = true;
+		rctx->zlib_stream = malloc(sizeof(z_stream));
+		if (!rctx->zlib_stream) {
+			error("failed to allocate zlib stream %m");
+			return -ENOMEM;
+		}
+	}
+	rctx->zlib_stream->next_in = (void *)encoded_data;
+	rctx->zlib_stream->avail_in = encoded_len;
+	rctx->zlib_stream->next_out = (void *)unencoded_data;
+	rctx->zlib_stream->avail_out = unencoded_len;
+
+	if (init) {
+		rctx->zlib_stream->zalloc = Z_NULL;
+		rctx->zlib_stream->zfree = Z_NULL;
+		rctx->zlib_stream->opaque = Z_NULL;
+		ret = inflateInit(rctx->zlib_stream);
+	} else {
+		ret = inflateReset(rctx->zlib_stream);
+	}
+	if (ret != Z_OK) {
+		error("zlib inflate init failed: %d", ret);
+		return -EIO;
+	}
+
+	while (rctx->zlib_stream->avail_in > 0 &&
+	       rctx->zlib_stream->avail_out > 0) {
+		ret = inflate(rctx->zlib_stream, Z_FINISH);
+		if (ret == Z_STREAM_END) {
+			break;
+		} else if (ret != Z_OK) {
+			error("zlib inflate failed: %d", ret);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+static int decompress_zstd(struct btrfs_receive *rctx, const char *encoded_buf,
+			   u64 encoded_len, char *unencoded_buf,
+			   u64 unencoded_len)
+{
+	ZSTD_inBuffer in_buf = {
+		.src = encoded_buf,
+		.size = encoded_len
+	};
+	ZSTD_outBuffer out_buf = {
+		.dst = unencoded_buf,
+		.size = unencoded_len
+	};
+	size_t ret;
+
+	if (!rctx->zstd_dstream) {
+		rctx->zstd_dstream = ZSTD_createDStream();
+		if (!rctx->zstd_dstream) {
+			error("failed to create zstd dstream");
+			return -ENOMEM;
+		}
+	}
+	ret = ZSTD_initDStream(rctx->zstd_dstream);
+	if (ZSTD_isError(ret)) {
+		error("failed to init zstd stream: %s", ZSTD_getErrorName(ret));
+		return -EIO;
+	}
+	while (in_buf.pos < in_buf.size && out_buf.pos < out_buf.size) {
+		ret = ZSTD_decompressStream(rctx->zstd_dstream, &out_buf, &in_buf);
+		if (ret == 0) {
+			break;
+		} else if (ZSTD_isError(ret)) {
+			error("failed to decompress zstd stream: %s",
+			      ZSTD_getErrorName(ret));
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+static int decompress_lzo(const char *encoded_data, u64 encoded_len,
+			  char *unencoded_data, u64 unencoded_len,
+			  unsigned int sector_size)
+{
+	uint32_t total_len;
+	size_t in_pos, out_pos;
+
+	if (encoded_len < 4) {
+		error("lzo header is truncated");
+		return -EIO;
+	}
+	memcpy(&total_len, encoded_data, 4);
+	total_len = le32toh(total_len);
+	if (total_len > encoded_len) {
+		error("lzo header is invalid");
+		return -EIO;
+	}
+
+	in_pos = 4;
+	out_pos = 0;
+	while (in_pos < total_len && out_pos < unencoded_len) {
+		size_t sector_remaining;
+		uint32_t src_len;
+		lzo_uint dst_len;
+		int ret;
+
+		sector_remaining = -in_pos % sector_size;
+		if (sector_remaining < 4) {
+			if (total_len - in_pos <= sector_remaining)
+				break;
+			in_pos += sector_remaining;
+		}
+
+		if (total_len - in_pos < 4) {
+			error("lzo segment header is truncated");
+			return -EIO;
+		}
+
+		memcpy(&src_len, encoded_data + in_pos, 4);
+		src_len = le32toh(src_len);
+		in_pos += 4;
+		if (src_len > total_len - in_pos) {
+			error("lzo segment header is invalid");
+			return -EIO;
+		}
+
+		dst_len = sector_size;
+		ret = lzo1x_decompress_safe((void *)(encoded_data + in_pos),
+					    src_len,
+					    (void *)(unencoded_data + out_pos),
+					    &dst_len, NULL);
+		if (ret != LZO_E_OK) {
+			error("lzo1x_decompress_safe failed: %d", ret);
+			return -EIO;
+		}
+
+		in_pos += src_len;
+		out_pos += dst_len;
+	}
+	return 0;
+}
+
+static int decompress_and_write(struct btrfs_receive *rctx,
+				const char *encoded_data, u64 offset,
+				u64 encoded_len, u64 unencoded_file_len,
+				u64 unencoded_len, u64 unencoded_offset,
+				u32 compression)
+{
+	int ret = 0;
+	size_t pos;
+	ssize_t w;
+	char *unencoded_data;
+	int sector_shift;
+
+	unencoded_data = calloc(unencoded_len, 1);
+	if (!unencoded_data) {
+		error("allocating space for unencoded data failed: %m");
+		return -errno;
+	}
+
+	switch (compression) {
+	case BTRFS_ENCODED_IO_COMPRESSION_ZLIB:
+		ret = decompress_zlib(rctx, encoded_data, encoded_len,
+				      unencoded_data, unencoded_len);
+		if (ret)
+			goto out;
+		break;
+	case BTRFS_ENCODED_IO_COMPRESSION_ZSTD:
+		ret = decompress_zstd(rctx, encoded_data, encoded_len,
+				      unencoded_data, unencoded_len);
+		if (ret)
+			goto out;
+		break;
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_4K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_8K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_16K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_32K:
+	case BTRFS_ENCODED_IO_COMPRESSION_LZO_64K:
+		sector_shift =
+			compression - BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + 12;
+		ret = decompress_lzo(encoded_data, encoded_len, unencoded_data,
+				     unencoded_len, 1U << sector_shift);
+		if (ret)
+			goto out;
+		break;
+	default:
+		error("unknown compression: %d", compression);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	pos = unencoded_offset;
+	while (pos < unencoded_file_len) {
+		w = pwrite(rctx->write_fd, unencoded_data + pos,
+			   unencoded_file_len - pos, offset);
+		if (w < 0) {
+			ret = -errno;
+			error("writing unencoded data failed: %m");
+			goto out;
+		}
+		pos += w;
+		offset += w;
+	}
+out:
+	free(unencoded_data);
+	return ret;
+}
+
 static int process_encoded_write(const char *path, const void *data, u64 offset,
 				 u64 len, u64 unencoded_file_len,
 				 u64 unencoded_len, u64 unencoded_offset,
@@ -1020,13 +1243,21 @@ static int process_encoded_write(const char *path, const void *data, u64 offset,
 	if (ret < 0)
 		return ret;
 
-	ret = ioctl(rctx->write_fd, BTRFS_IOC_ENCODED_WRITE, &encoded);
-	if (ret < 0) {
-		ret = -errno;
-		error("encoded_write: writing to %s failed: %m", path);
-		return ret;
+	if (!rctx->force_decompress) {
+		ret = ioctl(rctx->write_fd, BTRFS_IOC_ENCODED_WRITE, &encoded);
+		if (ret >= 0)
+			return 0;
+		/* Fall back for these errors, fail hard for anything else. */
+		if (errno != ENOSPC && errno != ENOTTY && errno != EINVAL) {
+			ret = -errno;
+			error("encoded_write: writing to %s failed: %m", path);
+			return ret;
+		}
 	}
-	return 0;
+
+	return decompress_and_write(rctx, data, offset, len, unencoded_file_len,
+				    unencoded_len, unencoded_offset,
+				    compression);
 }
 
 static struct btrfs_send_ops send_ops = {
@@ -1204,6 +1435,12 @@ out:
 		close(rctx->dest_dir_fd);
 		rctx->dest_dir_fd = -1;
 	}
+	if (rctx->zstd_dstream)
+		ZSTD_freeDStream(rctx->zstd_dstream);
+	if (rctx->zlib_stream) {
+		inflateEnd(rctx->zlib_stream);
+		free(rctx->zlib_stream);
+	}
 
 	return ret;
 }
@@ -1234,6 +1471,9 @@ static const char * const cmd_receive_usage[] = {
 	"-m ROOTMOUNT     the root mount point of the destination filesystem.",
 	"                 If /proc is not accessible, use this to tell us where",
 	"                 this file system is mounted.",
+	"--force-decompress",
+	"                 if the stream contains compressed data, always",
+	"                 decompress it instead of writing it with encoded I/O",
 	"--dump           dump stream metadata, one line per operation,",
 	"                 does not require the MOUNT parameter",
 	"-v               deprecated, alias for global -v option",
@@ -1277,12 +1517,16 @@ static int cmd_receive(const struct cmd_struct *cmd, int argc, char **argv)
 	optind = 0;
 	while (1) {
 		int c;
-		enum { GETOPT_VAL_DUMP = 257 };
+		enum {
+			GETOPT_VAL_DUMP = 257,
+			GETOPT_VAL_FORCE_DECOMPRESS,
+		};
 		static const struct option long_opts[] = {
 			{ "max-errors", required_argument, NULL, 'E' },
 			{ "chroot", no_argument, NULL, 'C' },
 			{ "dump", no_argument, NULL, GETOPT_VAL_DUMP },
 			{ "quiet", no_argument, NULL, 'q' },
+			{ "force-decompress", no_argument, NULL, GETOPT_VAL_FORCE_DECOMPRESS },
 			{ NULL, 0, NULL, 0 }
 		};
 
@@ -1325,6 +1569,9 @@ static int cmd_receive(const struct cmd_struct *cmd, int argc, char **argv)
 		case GETOPT_VAL_DUMP:
 			dump = 1;
 			break;
+		case GETOPT_VAL_FORCE_DECOMPRESS:
+			rctx.force_decompress = true;
+			break;
 		default:
 			usage_unknown_option(cmd, argv);
 		}
-- 
2.34.0



* [PATCH v12 07/10] btrfs-progs: receive: process fallocate commands
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (22 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 06/10] btrfs-progs: receive: encoded_write fallback to explicit decode and write Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 08/10] btrfs-progs: receive: process setflags ioctl commands Omar Sandoval
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

Send stream v2 can emit fallocate commands, so receive must support them
as well. The implementation simply passes along the arguments to the
syscall. Note that mode is encoded as a u32 in the send stream but
fallocate takes an int, so there is an unsigned->signed conversion there.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
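Note for readers: the unsigned->signed conversion mentioned above is
implicit in the patch. If an explicit range check were wanted, it could
look like the sketch below. This is an illustration only: the helper
name stream_mode_to_int() is hypothetical, and the patch itself performs
no such check.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert the u32 mode attribute from the stream to
 * the int expected by fallocate(2). Returns 0 on success, or -1 (leaving
 * *mode untouched) if the value does not fit in a non-negative int.
 */
static int stream_mode_to_int(uint32_t stream_mode, int *mode)
{
	if (stream_mode > (uint32_t)INT_MAX)
		return -1;
	*mode = (int)stream_mode;
	return 0;
}
```

Small flag combinations convert without change; only values with the
top bit set would be rejected by this check.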
 cmds/receive-dump.c  |  9 +++++++++
 cmds/receive.c       | 25 +++++++++++++++++++++++++
 common/send-stream.c |  9 +++++++++
 common/send-stream.h |  2 ++
 4 files changed, 45 insertions(+)

diff --git a/cmds/receive-dump.c b/cmds/receive-dump.c
index 09bb947b..4ed7cbe5 100644
--- a/cmds/receive-dump.c
+++ b/cmds/receive-dump.c
@@ -332,6 +332,14 @@ static int print_encoded_write(const char *path, const void *data, u64 offset,
 			  unencoded_offset, compression, encryption);
 }
 
+static int print_fallocate(const char *path, int mode, u64 offset, u64 len,
+			   void *user)
+{
+	return PRINT_DUMP(user, path, "fallocate",
+			  "mode=%d offset=%llu len=%llu",
+			  mode, offset, len);
+}
+
 struct btrfs_send_ops btrfs_print_send_ops = {
 	.subvol = print_subvol,
 	.snapshot = print_snapshot,
@@ -355,4 +363,5 @@ struct btrfs_send_ops btrfs_print_send_ops = {
 	.utimes = print_utimes,
 	.update_extent = print_update_extent,
 	.encoded_write = print_encoded_write,
+	.fallocate = print_fallocate,
 };
diff --git a/cmds/receive.c b/cmds/receive.c
index 60f9a3fe..a902a55e 100644
--- a/cmds/receive.c
+++ b/cmds/receive.c
@@ -1260,6 +1260,30 @@ static int process_encoded_write(const char *path, const void *data, u64 offset,
 				    compression);
 }
 
+static int process_fallocate(const char *path, int mode, u64 offset, u64 len,
+			     void *user)
+{
+	int ret;
+	struct btrfs_receive *rctx = user;
+	char full_path[PATH_MAX];
+
+	ret = path_cat_out(full_path, rctx->full_subvol_path, path);
+	if (ret < 0) {
+		error("fallocate: path invalid: %s", path);
+		return ret;
+	}
+	ret = open_inode_for_write(rctx, full_path);
+	if (ret < 0)
+		return ret;
+	ret = fallocate(rctx->write_fd, mode, offset, len);
+	if (ret < 0) {
+		ret = -errno;
+		error("fallocate: fallocate on %s failed: %m", path);
+		return ret;
+	}
+	return 0;
+}
+
 static struct btrfs_send_ops send_ops = {
 	.subvol = process_subvol,
 	.snapshot = process_snapshot,
@@ -1283,6 +1307,7 @@ static struct btrfs_send_ops send_ops = {
 	.utimes = process_utimes,
 	.update_extent = process_update_extent,
 	.encoded_write = process_encoded_write,
+	.fallocate = process_fallocate,
 };
 
 static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
diff --git a/common/send-stream.c b/common/send-stream.c
index 33227d3f..db0939a2 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -373,6 +373,7 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	u64 unencoded_offset;
 	int len;
 	int xattr_len;
+	int fallocate_mode;
 
 	ret = read_cmd(sctx);
 	if (ret)
@@ -537,6 +538,14 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	case BTRFS_SEND_C_END:
 		ret = 1;
 		break;
+	case BTRFS_SEND_C_FALLOCATE:
+		TLV_GET_STRING(sctx, BTRFS_SEND_A_PATH, &path);
+		TLV_GET_U32(sctx, BTRFS_SEND_A_FALLOCATE_MODE, &fallocate_mode);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, &offset);
+		TLV_GET_U64(sctx, BTRFS_SEND_A_SIZE, &tmp);
+		ret = sctx->ops->fallocate(path, fallocate_mode, offset, tmp,
+					   sctx->user);
+		break;
 	}
 
 tlv_get_failed:
diff --git a/common/send-stream.h b/common/send-stream.h
index 44abbc9d..61a88d3d 100644
--- a/common/send-stream.h
+++ b/common/send-stream.h
@@ -57,6 +57,8 @@ struct btrfs_send_ops {
 			     u64 len, u64 unencoded_file_len, u64 unencoded_len,
 			     u64 unencoded_offset, u32 compression,
 			     u32 encryption, void *user);
+	int (*fallocate)(const char *path, int mode, u64 offset, u64 len,
+			 void *user);
 };
 
 int btrfs_read_and_process_send_stream(int fd,
-- 
2.34.0



* [PATCH v12 08/10] btrfs-progs: receive: process setflags ioctl commands
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (23 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 07/10] btrfs-progs: receive: process fallocate commands Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 09/10] btrfs-progs: send: stream v2 ioctl flags Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 10/10] btrfs-progs: receive: add tests for basic encoded_write send/receive Omar Sandoval
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

In send stream v2, send can emit a command for setting inode flags via
the setflags ioctl. Pass the flags attribute through to the ioctl call
in receive.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 cmds/receive-dump.c  |  6 ++++++
 cmds/receive.c       | 25 +++++++++++++++++++++++++
 common/send-stream.c |  7 +++++++
 common/send-stream.h |  1 +
 4 files changed, 39 insertions(+)

diff --git a/cmds/receive-dump.c b/cmds/receive-dump.c
index 4ed7cbe5..53d3e4a8 100644
--- a/cmds/receive-dump.c
+++ b/cmds/receive-dump.c
@@ -340,6 +340,11 @@ static int print_fallocate(const char *path, int mode, u64 offset, u64 len,
 			  mode, offset, len);
 }
 
+static int print_setflags(const char *path, int flags, void *user)
+{
+	return PRINT_DUMP(user, path, "setflags", "flags=%d", flags);
+}
+
 struct btrfs_send_ops btrfs_print_send_ops = {
 	.subvol = print_subvol,
 	.snapshot = print_snapshot,
@@ -364,4 +369,5 @@ struct btrfs_send_ops btrfs_print_send_ops = {
 	.update_extent = print_update_extent,
 	.encoded_write = print_encoded_write,
 	.fallocate = print_fallocate,
+	.setflags = print_setflags,
 };
diff --git a/cmds/receive.c b/cmds/receive.c
index a902a55e..ebd7289d 100644
--- a/cmds/receive.c
+++ b/cmds/receive.c
@@ -38,6 +38,7 @@
 #include <sys/types.h>
 #include <sys/uio.h>
 #include <sys/xattr.h>
+#include <linux/fs.h>
 #include <uuid/uuid.h>
 
 #include <lzo/lzo1x.h>
@@ -1284,6 +1285,29 @@ static int process_fallocate(const char *path, int mode, u64 offset, u64 len,
 	return 0;
 }
 
+static int process_setflags(const char *path, int flags, void *user)
+{
+	int ret;
+	struct btrfs_receive *rctx = user;
+	char full_path[PATH_MAX];
+
+	ret = path_cat_out(full_path, rctx->full_subvol_path, path);
+	if (ret < 0) {
+		error("setflags: path invalid: %s", path);
+		return ret;
+	}
+	ret = open_inode_for_write(rctx, full_path);
+	if (ret < 0)
+		return ret;
+	ret = ioctl(rctx->write_fd, FS_IOC_SETFLAGS, &flags);
+	if (ret < 0) {
+		ret = -errno;
+		error("setflags: setflags ioctl on %s failed: %m", path);
+		return ret;
+	}
+	return 0;
+}
+
 static struct btrfs_send_ops send_ops = {
 	.subvol = process_subvol,
 	.snapshot = process_snapshot,
@@ -1308,6 +1332,7 @@ static struct btrfs_send_ops send_ops = {
 	.update_extent = process_update_extent,
 	.encoded_write = process_encoded_write,
 	.fallocate = process_fallocate,
+	.setflags = process_setflags,
 };
 
 static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
diff --git a/common/send-stream.c b/common/send-stream.c
index db0939a2..f25450c8 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -374,6 +374,7 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	int len;
 	int xattr_len;
 	int fallocate_mode;
+	int setflags_flags;
 
 	ret = read_cmd(sctx);
 	if (ret)
@@ -546,8 +547,14 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 		ret = sctx->ops->fallocate(path, fallocate_mode, offset, tmp,
 					   sctx->user);
 		break;
+	case BTRFS_SEND_C_SETFLAGS:
+		TLV_GET_STRING(sctx, BTRFS_SEND_A_PATH, &path);
+		TLV_GET_U32(sctx, BTRFS_SEND_A_SETFLAGS_FLAGS, &setflags_flags);
+		ret = sctx->ops->setflags(path, setflags_flags, sctx->user);
+		break;
 	}
 
+
 tlv_get_failed:
 out:
 	free(path);
diff --git a/common/send-stream.h b/common/send-stream.h
index 61a88d3d..3189f889 100644
--- a/common/send-stream.h
+++ b/common/send-stream.h
@@ -59,6 +59,7 @@ struct btrfs_send_ops {
 			     u32 encryption, void *user);
 	int (*fallocate)(const char *path, int mode, u64 offset, u64 len,
 			 void *user);
+	int (*setflags)(const char *path, int flags, void *user);
 };
 
 int btrfs_read_and_process_send_stream(int fd,
-- 
2.34.0



* [PATCH v12 09/10] btrfs-progs: send: stream v2 ioctl flags
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (24 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 08/10] btrfs-progs: receive: process setflags ioctl commands Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  2021-11-17 20:19 ` [PATCH v12 10/10] btrfs-progs: receive: add tests for basic encoded_write send/receive Omar Sandoval
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

First, add a --proto option to allow specifying the desired send
protocol version. It defaults to zero, which tells the kernel to pick
the latest version. This is based on Dave Sterba's patch.

Also add a --compressed-data flag to instruct the kernel to use
encoded_write commands for compressed extents. This requires an explicit
opt-in, separate from the protocol version, because:

1. The user may not want compression on the receiving side, or may want
   a different compression algorithm/level on the receiving side.
2. It has a soft requirement for kernel support on the receiving side
   (btrfs-progs can fall back to decompressing and writing if the kernel
   doesn't support BTRFS_IOC_ENCODED_WRITE, but the user may not be
   prepared to pay that CPU cost). Going forward, since it's easier to
   update progs than the kernel, I think we'll want to make new send
   features that require kernel support opt-in, whereas anything that
   only requires a progs update can happen automatically.

Signed-off-by: Boris Burkov <boris@bur.io>
---
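Note for readers: the supported protocol version is read as text from
sysfs and falls back to 1 when it cannot be determined. The sketch below
shows only the parsing step, under stated assumptions: the helper name
parse_send_stream_version() is hypothetical, and unlike
get_sysfs_proto_supported() in this patch it resets errno before calling
strtoull() and treats input with no digits as version 1.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text read from sysfs into a protocol
 * version. Anything unparsable or outside the u32 range is treated as 1,
 * the original protocol version.
 */
static uint32_t parse_send_stream_version(const char *buf)
{
	char *end = NULL;
	unsigned long long version;

	errno = 0;	/* strtoull() does not clear errno on success */
	version = strtoull(buf, &end, 10);
	if (end == buf || errno == ERANGE || version > UINT32_MAX)
		return 1;
	return (uint32_t)version;
}
```

Falling back to 1 keeps the tool usable on kernels that predate the
sysfs file, at the cost of never requesting a newer protocol there.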
 Documentation/btrfs-send.asciidoc | 18 +++++-
 cmds/send.c                       | 92 ++++++++++++++++++++++++++++++-
 ioctl.h                           | 19 ++++++-
 kernel-shared/send.h              |  2 +-
 4 files changed, 125 insertions(+), 6 deletions(-)

diff --git a/Documentation/btrfs-send.asciidoc b/Documentation/btrfs-send.asciidoc
index 2dae6e32..5ca29b4f 100644
--- a/Documentation/btrfs-send.asciidoc
+++ b/Documentation/btrfs-send.asciidoc
@@ -56,7 +56,23 @@ send in 'NO_FILE_DATA' mode
 The output stream does not contain any file
 data and thus cannot be used to transfer changes. This mode is faster and
 is useful to show the differences in metadata.
--q|--quiet::::
+
+--proto N::
+Use the send protocol version N. The default is 0, which means to use the
+highest version supported by the running kernel. Version 1 was the original
+protocol version. Version 2 encodes file data slightly more efficiently; it is
+also required for sending compressed data directly (see '--compressed-data').
+Version 2 requires at least btrfs-progs 5.16 on both the sender and receiver
+and at least Linux 5.16 on the sender.
+
+--compressed-data::
+Send data that is compressed on the filesystem directly without decompressing
+it. If the receiver supports the 'BTRFS_IOC_ENCODED_WRITE' ioctl (in Linux
+5.16), it can also write it directly without decompressing it. Otherwise, the
+receiver will fall back to decompressing it and writing it normally. This
+requires protocol version 2 or higher.
+
+-q|--quiet::
 (deprecated) alias for global '-q' option
 -v|--verbose::
 (deprecated) alias for global '-v' option
diff --git a/cmds/send.c b/cmds/send.c
index 18102331..114dce1b 100644
--- a/cmds/send.c
+++ b/cmds/send.c
@@ -57,6 +57,8 @@ struct btrfs_send {
 	u64 clone_sources_count;
 
 	char *root_path;
+	u32 proto;
+	u32 proto_supported;
 };
 
 static int get_root_id(struct btrfs_send *sctx, const char *path, u64 *root_id)
@@ -257,6 +259,16 @@ static int do_send(struct btrfs_send *send, u64 parent_root_id,
 	memset(&io_send, 0, sizeof(io_send));
 	io_send.send_fd = pipefd[1];
 	send->send_fd = pipefd[0];
+	io_send.flags = flags;
+
+	if (send->proto_supported > 1) {
+		/*
+		 * Versioned stream supported, requesting default or specific
+		 * number.
+		 */
+		io_send.version = send->proto;
+		io_send.flags |= BTRFS_SEND_FLAG_VERSION;
+	}
 
 	if (!ret)
 		ret = pthread_create(&t_read, NULL, read_sent_data, send);
@@ -267,7 +279,6 @@ static int do_send(struct btrfs_send *send, u64 parent_root_id,
 		goto out;
 	}
 
-	io_send.flags = flags;
 	io_send.clone_sources = (__u64*)send->clone_sources;
 	io_send.clone_sources_count = send->clone_sources_count;
 	io_send.parent_root = parent_root_id;
@@ -419,6 +430,36 @@ static void free_send_info(struct btrfs_send *sctx)
 	sctx->root_path = NULL;
 }
 
+static u32 get_sysfs_proto_supported(void)
+{
+	int fd;
+	int ret;
+	char buf[32] = {};
+	char *end = NULL;
+	u64 version;
+
+	fd = sysfs_open_file("features/send_stream_version");
+	if (fd < 0) {
+		/*
+		 * No file is either no version support or old kernel with just
+		 * v1.
+		 */
+		return 1;
+	}
+	ret = sysfs_read_file(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0)
+		return 1;
+	version = strtoull(buf, &end, 10);
+	if (version == ULLONG_MAX && errno == ERANGE)
+		return 1;
+	if (version > U32_MAX) {
+		warning("sysfs/send_stream_version too big: %llu", version);
+		version = 1;
+	}
+	return version;
+}
+
 static const char * const cmd_send_usage[] = {
 	"btrfs send [-ve] [-p <parent>] [-c <clone-src>] [-f <outfile>] <subvol> [<subvol>...]",
 	"Send the subvolume(s) to stdout.",
@@ -447,6 +488,11 @@ static const char * const cmd_send_usage[] = {
 	"                 does not contain any file data and thus cannot be used",
 	"                 to transfer changes. This mode is faster and useful to",
 	"                 show the differences in metadata.",
+	"--proto N        request maximum protocol version N (default: highest",
+	"                 supported by running kernel)",
+	"--compressed-data",
+	"                 send data that is compressed on the filesystem directly",
+	"                 without decompressing it",
 	"-v|--verbose     deprecated, alias for global -v option",
 	"-q|--quiet       deprecated, alias for global -q option",
 	HELPINFO_INSERT_GLOBALS,
@@ -469,6 +515,7 @@ static int cmd_send(const struct cmd_struct *cmd, int argc, char **argv)
 	int full_send = 1;
 	int new_end_cmd_semantic = 0;
 	u64 send_flags = 0;
+	u64 proto;
 
 	memset(&send, 0, sizeof(send));
 	send.dump_fd = fileno(stdout);
@@ -487,11 +534,17 @@ static int cmd_send(const struct cmd_struct *cmd, int argc, char **argv)
 
 	optind = 0;
 	while (1) {
-		enum { GETOPT_VAL_SEND_NO_DATA = 256 };
+		enum {
+			GETOPT_VAL_SEND_NO_DATA = 256,
+			GETOPT_VAL_PROTO,
+			GETOPT_VAL_COMPRESSED_DATA,
+		};
 		static const struct option long_options[] = {
 			{ "verbose", no_argument, NULL, 'v' },
 			{ "quiet", no_argument, NULL, 'q' },
 			{ "no-data", no_argument, NULL, GETOPT_VAL_SEND_NO_DATA },
+			{ "proto", required_argument, NULL, GETOPT_VAL_PROTO },
+			{ "compressed-data", no_argument, NULL, GETOPT_VAL_COMPRESSED_DATA },
 			{ NULL, 0, NULL, 0 }
 		};
 		int c = getopt_long(argc, argv, "vqec:f:i:p:", long_options, NULL);
@@ -580,6 +633,18 @@ static int cmd_send(const struct cmd_struct *cmd, int argc, char **argv)
 		case GETOPT_VAL_SEND_NO_DATA:
 			send_flags |= BTRFS_SEND_FLAG_NO_FILE_DATA;
 			break;
+		case GETOPT_VAL_PROTO:
+			proto = arg_strtou64(optarg);
+			if (proto > U32_MAX) {
+				error("protocol version number too big %llu", proto);
+				ret = 1;
+				goto out;
+			}
+			send.proto = proto;
+			break;
+		case GETOPT_VAL_COMPRESSED_DATA:
+			send_flags |= BTRFS_SEND_FLAG_COMPRESSED;
+			break;
 		default:
 			usage_unknown_option(cmd, argv);
 		}
@@ -687,6 +752,29 @@ static int cmd_send(const struct cmd_struct *cmd, int argc, char **argv)
 	if ((send_flags & BTRFS_SEND_FLAG_NO_FILE_DATA) && bconf.verbose > 1)
 		if (bconf.verbose > 1)
 			fprintf(stderr, "Mode NO_FILE_DATA enabled\n");
+	send.proto_supported = get_sysfs_proto_supported();
+	if (send.proto_supported == 1) {
+		if (send.proto > send.proto_supported) {
+			error("requested version %u but kernel supports only %u",
+			      send.proto, send.proto_supported);
+			ret = -EPROTO;
+			goto out;
+		}
+	}
+	if (send_flags & BTRFS_SEND_FLAG_COMPRESSED) {
+		if (send.proto == 1) {
+			error("--compressed-data requires protocol version >= 2 (requested 1)");
+			ret = -EINVAL;
+			goto out;
+		} else if (send.proto == 0 && send.proto_supported < 2) {
+			error("kernel does not support --compressed-data");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+	if (bconf.verbose > 1)
+		fprintf(stderr, "Protocol version requested: %u (supported %u)\n",
+			send.proto, send.proto_supported);
 
 	for (i = optind; i < argc; i++) {
 		int is_first_subvol;
diff --git a/ioctl.h b/ioctl.h
index dcfe0b6c..2137dd5f 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -655,10 +655,24 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_received_subvol_args_32) == 192);
  */
 #define BTRFS_SEND_FLAG_OMIT_END_CMD		0x4
 
+/*
+ * Read the protocol version in the structure
+ */
+#define BTRFS_SEND_FLAG_VERSION			0x8
+
+/*
+ * Send compressed data using the ENCODED_WRITE command instead of decompressing
+ * the data and sending it with the WRITE command. This requires protocol
+ * version >= 2.
+ */
+#define BTRFS_SEND_FLAG_COMPRESSED		0x10
+
 #define BTRFS_SEND_FLAG_MASK \
 	(BTRFS_SEND_FLAG_NO_FILE_DATA | \
 	 BTRFS_SEND_FLAG_OMIT_STREAM_HEADER | \
-	 BTRFS_SEND_FLAG_OMIT_END_CMD)
+	 BTRFS_SEND_FLAG_OMIT_END_CMD | \
+	 BTRFS_SEND_FLAG_VERSION | \
+	 BTRFS_SEND_FLAG_COMPRESSED)
 
 struct btrfs_ioctl_send_args {
 	__s64 send_fd;			/* in */
@@ -666,7 +680,8 @@ struct btrfs_ioctl_send_args {
 	__u64 __user *clone_sources;	/* in */
 	__u64 parent_root;		/* in */
 	__u64 flags;			/* in */
-	__u64 reserved[4];		/* in */
+	__u32 version;			/* in */
+	__u8 reserved[28];		/* in */
 };
 /*
  * Size of structure depends on pointer width, was not caught in the early
diff --git a/kernel-shared/send.h b/kernel-shared/send.h
index 1458dd29..133eeb5a 100644
--- a/kernel-shared/send.h
+++ b/kernel-shared/send.h
@@ -31,7 +31,7 @@ extern "C" {
 #endif
 
 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
-#define BTRFS_SEND_STREAM_VERSION 1
+#define BTRFS_SEND_STREAM_VERSION 2
 
 #define BTRFS_SEND_BUF_SIZE_V1 (64 * 1024)
 
-- 
2.34.0


* [PATCH v12 10/10] btrfs-progs: receive: add tests for basic encoded_write send/receive
  2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
                   ` (25 preceding siblings ...)
  2021-11-17 20:19 ` [PATCH v12 09/10] btrfs-progs: send: stream v2 ioctl flags Omar Sandoval
@ 2021-11-17 20:19 ` Omar Sandoval
  26 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-17 20:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Boris Burkov <boris@bur.io>

Adapt the existing send/receive tests by passing '-o compress-force' to
the mount commands in a new test. After writing a few files in the
various compression formats, send/receive them with and without
--force-decompress to test both the encoded_write path and the fallback
to decode+write.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 .../052-receive-write-encoded/test.sh         | 114 ++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100755 tests/misc-tests/052-receive-write-encoded/test.sh

diff --git a/tests/misc-tests/052-receive-write-encoded/test.sh b/tests/misc-tests/052-receive-write-encoded/test.sh
new file mode 100755
index 00000000..47330281
--- /dev/null
+++ b/tests/misc-tests/052-receive-write-encoded/test.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# test that we can send and receive encoded writes for three modes of
+# transparent compression: zlib, lzo, and zstd.
+
+source "$TEST_TOP/common"
+
+check_prereq mkfs.btrfs
+check_prereq btrfs
+
+setup_root_helper
+prepare_test_dev
+
+here=`pwd`
+
+# assumes the filesystem exists, and does mount, write, snapshot, send, unmount
+# for the specified encoding option
+send_one() {
+	local str
+	local subv
+	local snap
+
+	algorithm="$1"
+	shift
+	str="$1"
+	shift
+
+	subv="subv-$algorithm"
+	snap="snap-$algorithm"
+
+	run_check_mount_test_dev "-o" "compress-force=$algorithm"
+	cd "$TEST_MNT" || _fail "cannot chdir to TEST_MNT"
+
+	run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$subv"
+	run_check $SUDO_HELPER dd if=/dev/zero of="$subv/file1" bs=1M count=1
+	run_check $SUDO_HELPER dd if=/dev/zero of="$subv/file2" bs=500K count=1
+	run_check $SUDO_HELPER "$TOP/btrfs" subvolume snapshot -r "$subv" "$snap"
+	run_check $SUDO_HELPER "$TOP/btrfs" send -f "$str" "$snap" "$@"
+
+	cd "$here" || _fail "cannot chdir back to test directory"
+	run_check_umount_test_dev
+}
+
+receive_one() {
+	local str
+	str="$1"
+	shift
+
+	run_check_mkfs_test_dev
+	run_check_mount_test_dev
+	run_check $SUDO_HELPER "$TOP/btrfs" receive "$@" -v -f "$str" "$TEST_MNT"
+	run_check_umount_test_dev
+	run_check rm -f -- "$str"
+}
+
+test_one_write_encoded() {
+	local str
+	local algorithm
+	algorithm="$1"
+	shift
+	str="$here/stream-$algorithm.stream"
+
+	run_check_mkfs_test_dev
+	send_one "$algorithm" "$str" --compressed-data
+	receive_one "$str" "$@"
+}
+
+test_one_stream_v1() {
+	local str
+	local algorithm
+	algorithm="$1"
+	shift
+	str="$here/stream-$algorithm.stream"
+
+	run_check_mkfs_test_dev
+	send_one "$algorithm" "$str" --proto 1
+	receive_one "$str" "$@"
+}
+
+test_mix_write_encoded() {
+	local strzlib
+	local strlzo
+	local strzstd
+	strzlib="$here/stream-zlib.stream"
+	strlzo="$here/stream-lzo.stream"
+	strzstd="$here/stream-zstd.stream"
+
+	run_check_mkfs_test_dev
+
+	send_one "zlib" "$strzlib" --compressed-data
+	send_one "lzo" "$strlzo" --compressed-data
+	send_one "zstd" "$strzstd" --compressed-data
+
+	receive_one "$strzlib"
+	receive_one "$strlzo"
+	receive_one "$strzstd"
+}
+
+test_one_write_encoded "zlib"
+test_one_write_encoded "lzo"
+test_one_write_encoded "zstd"
+
+# with decompression forced
+test_one_write_encoded "zlib" "--force-decompress"
+test_one_write_encoded "lzo" "--force-decompress"
+test_one_write_encoded "zstd" "--force-decompress"
+
+# send stream v1
+test_one_stream_v1 "zlib"
+test_one_stream_v1 "lzo"
+test_one_stream_v1 "zstd"
+
+# files use a mix of compression algorithms
+test_mix_write_encoded
-- 
2.34.0


* Re: [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size
  2021-11-17 20:19 ` [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size Omar Sandoval
@ 2021-11-18 14:11   ` David Sterba
  0 siblings, 0 replies; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:11 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:21PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We collect these statistics but have never used them for anything.

There was a protocol extension to put progress information into the
stream, so that should also be part of the v2 update.

* Re: [PATCH v12 13/17] btrfs: add send stream v2 definitions
  2021-11-17 20:19 ` [PATCH v12 13/17] btrfs: add send stream v2 definitions Omar Sandoval
@ 2021-11-18 14:18   ` David Sterba
  2021-11-18 19:08     ` Omar Sandoval
  2021-11-18 14:20   ` David Sterba
  1 sibling, 1 reply; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:18 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:23PM -0800, Omar Sandoval wrote:
> @@ -113,6 +121,11 @@ enum {
>  	BTRFS_SEND_A_PATH_LINK,
>  
>  	BTRFS_SEND_A_FILE_OFFSET,
> +	/*
> +	 * In send stream v2, this attribute is special: it must be the last
> +	 * attribute in a command, its header contains only the type, and its
> +	 * length is implicitly the remaining length of the command.
> +	 */
>  	BTRFS_SEND_A_DATA,

I don't like the conditional meaning of the DATA attribute; I'd rather
see a new one that's v2+. It adds complexity that's not obvious
without some context.
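
To illustrate the cost of that special case, here is a minimal userspace
sketch of how a stream parser would have to find the bounds of one
attribute. It assumes the little-endian 16-bit type/length TLV header of
send stream v1; demo_attr_bounds() and DEMO_A_DATA are hypothetical names
used only for illustration, not code from this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BTRFS_SEND_A_DATA; the value is an assumption. */
#define DEMO_A_DATA 19

/*
 * Find the header and payload size of the attribute starting at buf.
 * "remaining" is the number of bytes left in the current command.
 * Returns 0 on success, -1 if the attribute does not fit.
 */
static int demo_attr_bounds(const uint8_t *buf, size_t remaining, int proto,
			    size_t *hdr_len, size_t *data_len)
{
	uint16_t type;

	if (remaining < 2)
		return -1;
	type = (uint16_t)(buf[0] | (buf[1] << 8));

	/* v2 special case: type only, payload runs to the end of the command. */
	if (proto >= 2 && type == DEMO_A_DATA) {
		*hdr_len = 2;
		*data_len = remaining - 2;
		return 0;
	}

	/* Regular TLV: 16-bit type followed by 16-bit length. */
	if (remaining < 4)
		return -1;
	*hdr_len = 4;
	*data_len = (size_t)(buf[2] | (buf[3] << 8));
	if (*data_len > remaining - 4)
		return -1;
	return 0;
}
```

In this sketch the same attribute number is parsed two different ways
depending on proto. With a separate v2-only attribute, the proto check in
the first branch would not be needed; the type alone would select the
headerless layout.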

* Re: [PATCH v12 13/17] btrfs: add send stream v2 definitions
  2021-11-17 20:19 ` [PATCH v12 13/17] btrfs: add send stream v2 definitions Omar Sandoval
  2021-11-18 14:18   ` David Sterba
@ 2021-11-18 14:20   ` David Sterba
  1 sibling, 0 replies; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:20 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:23PM -0800, Omar Sandoval wrote:
> --- a/fs/btrfs/send.h
> +++ b/fs/btrfs/send.h
> @@ -12,7 +12,11 @@
>  #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
>  #define BTRFS_SEND_STREAM_VERSION 1
>  
> -#define BTRFS_SEND_BUF_SIZE SZ_64K
> +/*
> + * In send stream v1, no command is larger than 64k. In send stream v2, no limit
> + * should be assumed.
> + */
> +#define BTRFS_SEND_BUF_SIZE_V1 SZ_64K
>  
>  enum btrfs_tlv_type {
>  	BTRFS_TLV_U8,
> @@ -80,7 +84,10 @@ enum btrfs_send_cmd {
>  	BTRFS_SEND_C_MAX_V1 = BTRFS_SEND_C_UPDATE_EXTENT,
>  
>  	/* Version 2 */
> -	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_MAX_V1,
> +	BTRFS_SEND_C_FALLOCATE,
> +	BTRFS_SEND_C_SETFLAGS,
> +	BTRFS_SEND_C_ENCODED_WRITE,
> +	BTRFS_SEND_C_MAX_V2 = BTRFS_SEND_C_ENCODED_WRITE,

The previous patch changes the MAX_V definition to be equal to the
previous command, but that's exactly what I wanted to avoid in the
protocol definition list, to keep it linear.

* Re: [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-11-17 20:19 ` [PATCH v12 12/17] btrfs: send: fix maximum command numbering Omar Sandoval
@ 2021-11-18 14:23   ` David Sterba
  2021-11-18 18:54     ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:23 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:22PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
> __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
> version plus 1, but as written this creates gaps in the number space.
> The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
> accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
> has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
> 23 and 24 are valid commands.

The MAX definitions have the __ prefix so they're private and not meant
to be used as proper commands, so nothing should suggest there are any
commands with numbers 23 to 25 in the example.

> Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
> number. This requires repeating the command name, but it has a clearer
> meaning and avoids gaps. It also doesn't require updating
> __BTRFS_SEND_C_MAX for every new version.

It's probably a matter of taste; I'd intentionally avoid the pattern
above, i.e. repeating the previous command to define max.

> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -316,8 +316,8 @@ __maybe_unused
>  static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
>  {
>  	switch (sctx->proto) {
> -	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
> -	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
> +	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
> +	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;

This seems to be the only practical difference, < or <= .
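
For comparison, a minimal sketch of the two numbering styles using a
made-up three-command protocol. All DEMO_* names are hypothetical and the
numbers are not the real btrfs ones; the point is only how the markers
affect the values that follow them.

```c
#include <stdbool.h>

/* Style A: markers are separate enumerators, one past the last command. */
enum demo_cmd_exclusive {
	DEMO_X_UNSPEC,
	DEMO_X_WRITE,
	DEMO_X_CLONE,
	DEMO_X_END_V1,		/* 3: consumes a number of its own */
	DEMO_X_ENCODED_WRITE,	/* 4: first v2 command */
	DEMO_X_END_V2,		/* 5 */
};

/* Style B: markers alias the last command of each version. */
enum demo_cmd_inclusive {
	DEMO_I_UNSPEC,
	DEMO_I_WRITE,
	DEMO_I_CLONE,
	DEMO_I_MAX_V1 = DEMO_I_CLONE,		/* 2 */
	DEMO_I_ENCODED_WRITE,			/* 3: numbering stays dense */
	DEMO_I_MAX_V2 = DEMO_I_ENCODED_WRITE,	/* 3 */
};

static bool demo_ok_exclusive(int proto, int cmd)
{
	switch (proto) {
	case 1:	 return cmd < DEMO_X_END_V1;
	case 2:	 return cmd < DEMO_X_END_V2;
	default: return false;
	}
}

static bool demo_ok_inclusive(int proto, int cmd)
{
	switch (proto) {
	case 1:	 return cmd <= DEMO_I_MAX_V1;
	case 2:	 return cmd <= DEMO_I_MAX_V2;
	default: return false;
	}
}
```

In style A the marker itself occupies number 3, so the first v2 command
gets 4 and the v2 check accepts 3 even though no command has that number.
In style B the command numbers stay contiguous at the cost of repeating
the last command name.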

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
@ 2021-11-18 14:55   ` David Sterba
  2021-11-18 19:11     ` Omar Sandoval
  2022-01-24 21:54   ` David Sterba
  2022-01-24 22:26   ` David Sterba
  2 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:55 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> +{
> +	switch (compress_type) {
> +	case BTRFS_COMPRESS_NONE:
> +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> +	case BTRFS_COMPRESS_ZLIB:
> +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> +	case BTRFS_COMPRESS_LZO:
> +		/*
> +		 * The LZO format depends on the page size. 64k is the maximum

Should this also say it depends on the sector (not page) size?

> +		 * sectorsize (and thus page size) that we support.
> +		 */
> +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> +			return -EINVAL;
> +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> +	case BTRFS_COMPRESS_ZSTD:
> +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> +	default:
> +		return -EUCLEAN;
> +	}
> +}
> +
> +
> +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> +								path->slots[0]);
> +		if (inline_size > count) {
> +			ret = -ENOBUFS;
> +			goto out;
> +		}
> +		count = inline_size;
> +		encoded->unencoded_len = ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> +	} else {
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);

Please don't do chained initializations.

> +		ptr += iocb->ki_pos - extent_start;
> +	}
> +
> +	tmp = kmalloc(count, GFP_NOFS);
> +	if (!tmp) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	read_extent_buffer(leaf, tmp, ptr, count);
> +	btrfs_release_path(path);
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	ret = copy_to_iter(tmp, count, iter);
> +	if (ret != count)
> +		ret = -EFAULT;
> +	kfree(tmp);
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +struct btrfs_encoded_read_private {
> +	struct inode *inode;
> +	u64 file_offset;
> +	wait_queue_head_t wait;
> +	atomic_t pending;
> +	blk_status_t status;
> +	bool skip_csum;
> +};
> +
> +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> +					    struct bio *bio, int mirror_num,
> +					    unsigned long bio_flags)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	blk_status_t ret;
> +
> +	if (!priv->skip_csum) {
> +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> +	if (ret) {
> +		btrfs_bio_free_csum(bbio);
> +		return ret;
> +	}
> +
> +	atomic_inc(&priv->pending);
> +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> +	if (ret) {
> +		atomic_dec(&priv->pending);
> +		btrfs_bio_free_csum(bbio);
> +	}
> +	return ret;
> +}
> +
> +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> +{
> +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
> +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> +	struct inode *inode = priv->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 sectorsize = fs_info->sectorsize;
> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +	u64 start = priv->file_offset;
> +	u32 bio_offset = 0;
> +
> +	if (priv->skip_csum || !uptodate)
> +		return bbio->bio.bi_status;
> +
> +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> +		unsigned int i, nr_sectors, pgoff;
> +
> +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> +		pgoff = bvec->bv_offset;
> +		for (i = 0; i < nr_sectors; i++) {
> +			ASSERT(pgoff < PAGE_SIZE);
> +			if (check_data_csum(inode, bbio, bio_offset,
> +					    bvec->bv_page, pgoff, start))
> +				return BLK_STS_IOERR;
> +			start += sectorsize;
> +			bio_offset += sectorsize;
> +			pgoff += sectorsize;
> +		}
> +	}
> +	return BLK_STS_OK;
> +}
> +
> +static void btrfs_encoded_read_endio(struct bio *bio)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	blk_status_t status;
> +
> +	status = btrfs_encoded_read_verify_csum(bbio);
> +	if (status) {
> +		/*
> +		 * The memory barrier implied by the atomic_dec_return() here
> +		 * pairs with the memory barrier implied by the
> +		 * atomic_dec_return() or io_wait_event() in
> +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> +		 * write is observed before the load of status in
> +		 * btrfs_encoded_read_regular_fill_pages().
> +		 */
> +		WRITE_ONCE(priv->status, status);

WRITE_ONCE appears here three times; I wonder if it is OK to open-code
it like that. I'd suggest using a helper with a comment.

> +	}
> +	if (!atomic_dec_return(&priv->pending))
> +		wake_up(&priv->wait);
> +	btrfs_bio_free_csum(bbio);
> +	bio_put(bio);
> +}
> +
> +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> +						 u64 file_offset,
> +						 u64 disk_bytenr,
> +						 u64 disk_io_size,
> +						 struct page **pages)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_encoded_read_private priv = {
> +		.inode = inode,
> +		.file_offset = file_offset,
> +		.pending = ATOMIC_INIT(1),
> +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> +	};
> +	unsigned long i = 0;
> +	u64 cur = 0;
> +	int ret;
> +
> +	init_waitqueue_head(&priv.wait);
> +	/*
> +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> +	 * necessary.
> +	 */
> +	while (cur < disk_io_size) {
> +		struct extent_map *em;
> +		struct btrfs_io_geometry geom;
> +		struct bio *bio = NULL;
> +		u64 remaining;
> +
> +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> +					 disk_io_size - cur);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +		} else {
> +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> +						    disk_bytenr + cur, &geom);
> +			free_extent_map(em);
> +		}
> +		if (ret) {
> +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> +			break;
> +		}
> +		remaining = min(geom.len, disk_io_size - cur);
> +		while (bio || remaining) {
> +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> +
> +			if (!bio) {
> +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> +				bio->bi_iter.bi_sector =
> +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> +				bio->bi_end_io = btrfs_encoded_read_endio;
> +				bio->bi_private = &priv;
> +				bio->bi_opf = REQ_OP_READ;
> +			}
> +
> +			if (!bytes ||
> +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> +				blk_status_t status;
> +
> +				status = submit_encoded_read_bio(inode, bio, 0,
> +								 0);
> +				if (status) {
> +					WRITE_ONCE(priv.status, status);
> +					bio_put(bio);
> +					goto out;
> +				}
> +				bio = NULL;
> +				continue;
> +			}
> +
> +			i++;
> +			cur += bytes;
> +			remaining -= bytes;
> +		}
> +	}
> +
> +out:
> +	if (atomic_dec_return(&priv.pending))
> +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> +	/* See btrfs_encoded_read_endio() for ordering. */
> +	return blk_status_to_errno(READ_ONCE(priv.status));
> +}
> +
> +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  u64 start, u64 lockend,
> +					  struct extent_state **cached_state,
> +					  u64 disk_bytenr, u64 disk_io_size,
> +					  size_t count, bool compressed,
> +					  bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct page **pages;
> +	unsigned long nr_pages, i;
> +	u64 cur;
> +	size_t page_offset;
> +	ssize_t ret;
> +
> +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);

Power-of-two compile-time constants can use bitmask operations for
alignment.
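
As a sketch of what is meant, assuming a 4K page: because the divisor is
a power of two, the round-up division can be written with a shift.
DEMO_PAGE_SHIFT and both helpers are hypothetical userspace stand-ins,
not kernel code.

```c
#include <stdint.h>

/* Assumed page size of 4K; the real value depends on the architecture. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Round-up division, the generic form. */
static uint64_t demo_nr_pages_div(uint64_t bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Same result using a shift, valid only for power-of-two page sizes. */
static uint64_t demo_nr_pages_shift(uint64_t bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}
```

Both helpers return the same page count for any input that does not
overflow the addition.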

> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +
> +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> +						    disk_io_size, pages);
> +	if (ret)
> +		goto out;
> +
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	if (compressed) {
> +		i = 0;
> +		page_offset = 0;
> +	} else {
> +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> +	}
> +	cur = 0;
> +	while (cur < count) {
> +		size_t bytes = min_t(size_t, count - cur,
> +				     PAGE_SIZE - page_offset);
> +
> +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> +				      iter) != bytes) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		i++;
> +		cur += bytes;
> +		page_offset = 0;
> +	}
> +	ret = count;
> +out:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			__free_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> +			   struct btrfs_ioctl_encoded_io_args *encoded)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	ssize_t ret;
> +	size_t count = iov_iter_count(iter);
> +	u64 start, lockend, disk_bytenr, disk_io_size;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em;
> +	bool unlocked = false;
> +
> +	file_accessed(iocb->ki_filp);
> +
> +	inode_lock_shared(inode);
> +
> +	if (iocb->ki_pos >= inode->i_size) {
> +		inode_unlock_shared(inode);
> +		return 0;
> +	}
> +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> +	/*
> +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> +	 * it's compressed we know that it won't be longer than this.
> +	 */
> +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		ret = btrfs_wait_ordered_range(inode, start,
> +					       lockend - start + 1);
> +		if (ret)
> +			goto out_unlock_inode;
> +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     lockend - start + 1);
> +		if (!ordered)
> +			break;
> +		btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		cond_resched();
> +	}
> +
> +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> +			      lockend - start + 1);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_unlock_extent;
> +	}
> +
> +	if (em->block_start == EXTENT_MAP_INLINE) {
> +		u64 extent_start = em->start;
> +
> +		/*
> +		 * For inline extents we get everything we need out of the
> +		 * extent item.
> +		 */
> +		free_extent_map(em);
> +		em = NULL;
> +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> +						&cached_state, extent_start,
> +						count, encoded, &unlocked);
> +		goto out;
> +	}
> +
> +	/*
> +	 * We only want to return up to EOF even if the extent extends beyond
> +	 * that.
> +	 */
> +	encoded->len = (min_t(u64, extent_map_end(em), inode->i_size) -
> +			iocb->ki_pos);
> +	if (em->block_start == EXTENT_MAP_HOLE ||
> +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> +		disk_bytenr = EXTENT_MAP_HOLE;
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);
> +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +		disk_bytenr = em->block_start;
> +		/*
> +		 * Bail if the buffer isn't large enough to return the whole
> +		 * compressed extent.
> +		 */
> +		if (em->block_len > count) {
> +			ret = -ENOBUFS;
> +			goto out_em;
> +		}
> +		disk_io_size = count = em->block_len;
> +		encoded->unencoded_len = em->ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
> +		ret = btrfs_encoded_io_compression_from_extent(
> +							     em->compress_type);
> +		if (ret < 0)
> +			goto out_em;
> +		encoded->compression = ret;
> +	} else {
> +		disk_bytenr = em->block_start + (start - em->start);
> +		if (encoded->len > count)
> +			encoded->len = count;
> +		/*
> +		 * Don't read beyond what we locked. This also limits the page
> +		 * allocations that we'll do.
> +		 */
> +		disk_io_size = min(lockend + 1,
> +				   iocb->ki_pos + encoded->len) - start;
> +		encoded->len = encoded->unencoded_len = count =
> +			start + disk_io_size - iocb->ki_pos;
> +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> +	}
> +	free_extent_map(em);
> +	em = NULL;
> +
> +	if (disk_bytenr == EXTENT_MAP_HOLE) {
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		inode_unlock_shared(inode);
> +		unlocked = true;
> +		ret = iov_iter_zero(count, iter);
> +		if (ret != count)
> +			ret = -EFAULT;
> +	} else {
> +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> +						 &cached_state, disk_bytenr,
> +						 disk_io_size, count,
> +						 encoded->compression,
> +						 &unlocked);
> +	}
> +
> +out:
> +	if (ret >= 0)
> +		iocb->ki_pos += encoded->len;
> +out_em:
> +	free_extent_map(em);
> +out_unlock_extent:
> +	if (!unlocked)
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +out_unlock_inode:
> +	if (!unlocked)
> +		inode_unlock_shared(inode);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 05c77a1979a9..f0c575223d88 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -28,6 +28,7 @@
>  #include <linux/iversion.h>
>  #include <linux/fileattr.h>
>  #include <linux/fsverity.h>
> +#include <linux/sched/xacct.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "export.h"
> @@ -88,6 +89,22 @@ struct btrfs_ioctl_send_args_32 {
>  
>  #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
>  			       struct btrfs_ioctl_send_args_32)
> +
> +struct btrfs_ioctl_encoded_io_args_32 {
> +	compat_uptr_t iov;
> +	compat_ulong_t iovcnt;
> +	__s64 offset;
> +	__u64 flags;
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;
> +	__u32 reserved[8];
> +};
> +
> +#define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
> +				       struct btrfs_ioctl_encoded_io_args_32)
>  #endif
>  
>  /* Mask out flags that are inappropriate for the given type of inode. */
> @@ -4861,6 +4878,89 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
>  	return ret;
>  }
>  
> +static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
> +				    bool compat)
> +{
> +	struct btrfs_ioctl_encoded_io_args args = {};
> +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> +					     flags);
> +	size_t copy_end;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +	loff_t pos;
> +	struct kiocb kiocb;
> +	ssize_t ret;
> +
> +	if (!capable(CAP_SYS_ADMIN)) {
> +		ret = -EPERM;
> +		goto out_acct;
> +	}
> +
> +	if (compat) {
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +		struct btrfs_ioctl_encoded_io_args_32 args32;
> +
> +		copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
> +				       flags);
> +		if (copy_from_user(&args32, argp, copy_end)) {
> +			ret = -EFAULT;
> +			goto out_acct;
> +		}
> +		args.iov = compat_ptr(args32.iov);
> +		args.iovcnt = args32.iovcnt;
> +		args.offset = args32.offset;
> +		args.flags = args32.flags;
> +#else
> +		return -ENOTTY;
> +#endif
> +	} else {
> +		copy_end = copy_end_kernel;
> +		if (copy_from_user(&args, argp, copy_end)) {
> +			ret = -EFAULT;
> +			goto out_acct;
> +		}
> +	}
> +	if (args.flags != 0) {
> +		ret = -EINVAL;
> +		goto out_acct;
> +	}
> +
> +	ret = import_iovec(READ, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
> +			   &iov, &iter);
> +	if (ret < 0)
> +		goto out_acct;
> +
> +	if (iov_iter_count(&iter) == 0) {
> +		ret = 0;
> +		goto out_iov;
> +	}
> +	pos = args.offset;
> +	ret = rw_verify_area(READ, file, &pos, args.len);
> +	if (ret < 0)
> +		goto out_iov;
> +
> +	init_sync_kiocb(&kiocb, file);
> +	kiocb.ki_pos = pos;
> +
> +	ret = btrfs_encoded_read(&kiocb, &iter, &args);
> +	if (ret >= 0) {
> +		fsnotify_access(file);
> +		if (copy_to_user(argp + copy_end,
> +				 (char *)&args + copy_end_kernel,
> +				 sizeof(args) - copy_end_kernel))
> +			ret = -EFAULT;
> +	}
> +
> +out_iov:
> +	kfree(iov);
> +out_acct:
> +	if (ret > 0)
> +		add_rchar(current, ret);
> +	inc_syscr(current);
> +	return ret;
> +}
> +
>  long btrfs_ioctl(struct file *file, unsigned int
>  		cmd, unsigned long arg)
>  {
> @@ -5005,6 +5105,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return fsverity_ioctl_enable(file, (const void __user *)argp);
>  	case FS_IOC_MEASURE_VERITY:
>  		return fsverity_ioctl_measure(file, argp);
> +	case BTRFS_IOC_ENCODED_READ:
> +		return btrfs_ioctl_encoded_read(file, argp, false);
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +	case BTRFS_IOC_ENCODED_READ_32:
> +		return btrfs_ioctl_encoded_read(file, argp, true);
> +#endif
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.34.0

* Re: [PATCH v12 01/17] fs: export rw_verify_area()
  2021-11-17 20:19 ` [PATCH v12 01/17] fs: export rw_verify_area() Omar Sandoval
@ 2021-11-18 14:57   ` David Sterba
  2021-11-18 19:15     ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2021-11-18 14:57 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:11PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> I'm adding Btrfs ioctls to read and write compressed data, and rather
> than duplicating the checks in rw_verify_area(), let's just export it.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3244,6 +3244,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
>  		int whence, loff_t size);
>  extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
>  extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
> +extern int rw_verify_area(int, struct file *, const loff_t *, size_t);

Do you have an ack from VFS people for exporting a function from
fs/internal.h to the normal fs.h?

>  extern int generic_file_open(struct inode * inode, struct file * filp);
>  extern int nonseekable_open(struct inode * inode, struct file * filp);
>  extern int stream_open(struct inode * inode, struct file * filp);
> -- 
> 2.34.0

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2
  2021-11-17 20:19 ` [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2 Omar Sandoval
@ 2021-11-18 15:50   ` David Sterba
  2021-11-18 19:34     ` Omar Sandoval
  0 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2021-11-18 15:50 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:24PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The length field of the send stream TLV header is 16 bits. This means
> that the maximum amount of data that can be sent for one write is 64k
> minus one. However, encoded writes must be able to send the maximum
> compressed extent (128k) in one command. To support this, send stream
> version 2 encodes the DATA attribute differently: it has no length
> field, and the length is implicitly up to the end of the containing command
> (which has a 32-bit length field). Although this is necessary for
> encoded writes, normal writes can benefit from it, too.
> 
> Also add a check to enforce that the DATA attribute is last. It is only
> strictly necessary for v2, but we might as well make v1 consistent with
> it.
> 
> For v2, let's bump up the send buffer to the maximum compressed extent
> size plus 16k for the other metadata (144k total).

I'm not sure we want to set the number like that, it feels quite
limiting for potential compression enhancements.


> Since this will most
> likely be vmalloc'd (and always will be after the next commit), we round
> it up to the next page since we might as well use the rest of the page
> on systems with >16k pages.

Would it work also without the virtual mappings? For speedup it makes
sense to use vmalloc area, but as a fallback writing in smaller portions
or page by page eventually should be also possible. For that reason I
don't think we should set the maximum other than what fits to 32bit
number minus some overhead.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-11-18 14:23   ` David Sterba
@ 2021-11-18 18:54     ` Omar Sandoval
  2021-12-09 18:08       ` Omar Sandoval
  2022-01-24 22:40       ` David Sterba
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-18 18:54 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 03:23:59PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:22PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
> > __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
> > version plus 1, but as written this creates gaps in the number space.
> > The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
> > accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
> > has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
> > 23 and 24 are valid commands.
> 
> The MAX definitions have the __ prefix so they're private and not meant
> to be used as proper commands, so nothing should suggest there are any
> commands with numbers 23 to 25 in the example.
> 
> > Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
> > number. This requires repeating the command name, but it has a clearer
> > meaning and avoids gaps. It also doesn't require updating
> > __BTRFS_SEND_C_MAX for every new version.
> 
> It's probably a matter of taste, I'd intentionally avoid the pattern
> above, ie. repeating the previous command to define max.
> 
> > --- a/fs/btrfs/send.c
> > +++ b/fs/btrfs/send.c
> > @@ -316,8 +316,8 @@ __maybe_unused
> >  static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
> >  {
> >  	switch (sctx->proto) {
> > -	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
> > -	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
> > +	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
> > +	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
> 
> This seems to be the only practical difference, < or <= .

There is another practical difference, which is more significant in my
opinion: the linear style creates "gaps" in the valid commands. Consider
this, with explicit values added for clarity:

enum btrfs_send_cmd {
        BTRFS_SEND_C_UNSPEC = 0,

        /* Version 1 */
        BTRFS_SEND_C_SUBVOL = 1,
        BTRFS_SEND_C_SNAPSHOT = 2,

        BTRFS_SEND_C_MKFILE = 3,
        BTRFS_SEND_C_MKDIR = 4,
        BTRFS_SEND_C_MKNOD = 5,
        BTRFS_SEND_C_MKFIFO = 6,
        BTRFS_SEND_C_MKSOCK = 7,
        BTRFS_SEND_C_SYMLINK = 8,

        BTRFS_SEND_C_RENAME = 9,
        BTRFS_SEND_C_LINK = 10,
        BTRFS_SEND_C_UNLINK = 11,
        BTRFS_SEND_C_RMDIR = 12,

        BTRFS_SEND_C_SET_XATTR = 13,
        BTRFS_SEND_C_REMOVE_XATTR = 14,

        BTRFS_SEND_C_WRITE = 15,
        BTRFS_SEND_C_CLONE = 16,

        BTRFS_SEND_C_TRUNCATE = 17,
        BTRFS_SEND_C_CHMOD = 18,
        BTRFS_SEND_C_CHOWN = 19,
        BTRFS_SEND_C_UTIMES = 20,

        BTRFS_SEND_C_END = 21,
        BTRFS_SEND_C_UPDATE_EXTENT = 22,
        __BTRFS_SEND_C_MAX_V1 = 23,

        /* Version 2 */
        BTRFS_SEND_C_FALLOCATE = 24,
        BTRFS_SEND_C_SETFLAGS = 25,
        BTRFS_SEND_C_ENCODED_WRITE = 26,
        __BTRFS_SEND_C_MAX_V2 = 27,

        /* End */
        __BTRFS_SEND_C_MAX = 28,
};
#define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1) /* 27 */

Notice that BTRFS_SEND_C_UPDATE_EXTENT is 22 and the next valid command
is BTRFS_SEND_C_FALLOCATE, which is 24. So 23 does not correspond to an
actual command; it's a "gap". This is somewhat cosmetic, but it's an
ugly wart in the protocol.

Also consider something indexing on the command number, like the
cmd_send_size thing I got rid of in the previous patch:

	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1]

Indices 23 and 27 are wasted. It's only 16 bytes in this case, which
doesn't matter practically, but it's unpleasant.

Maybe you were aware of this and fine with it, in which case we can drop
this change. But I think the name repetition is less ugly than the gaps.
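
To make the gap argument concrete, here is a reduced sketch of the two
numbering styles. The DEMO_* names are hypothetical and only a few commands
are shown per version; this is not the kernel header, just an illustration
of how the sentinel values behave in a proto_cmd_ok()-style check.

```c
#include <assert.h>
#include <stdbool.h>

/* Linear style: each MAX sentinel consumes a value in the number space. */
enum demo_linear_cmd {
	DEMO_L_UNSPEC = 0,
	DEMO_L_WRITE = 1,		/* v1 command */
	DEMO_L_END = 2,			/* v1 command */
	__DEMO_L_MAX_V1 = 3,		/* sentinel, not a command */
	DEMO_L_FALLOCATE = 4,		/* v2 command */
	__DEMO_L_MAX_V2 = 5,		/* sentinel, not a command */
};

/* Explicit style: MAX aliases the last real command, so no gaps. */
enum demo_explicit_cmd {
	DEMO_E_UNSPEC = 0,
	DEMO_E_WRITE = 1,		/* v1 command */
	DEMO_E_END = 2,			/* v1 command */
	DEMO_E_MAX_V1 = 2,
	DEMO_E_FALLOCATE = 3,		/* v2 command */
	DEMO_E_MAX_V2 = 3,
};

/* The cost of the explicit style: the alias must be kept in sync by hand. */
static_assert(DEMO_E_MAX_V2 == DEMO_E_FALLOCATE,
	      "MAX_V2 must alias the last v2 command");

static bool demo_linear_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1: return cmd < __DEMO_L_MAX_V1;
	case 2: return cmd < __DEMO_L_MAX_V2;
	default: return false;
	}
}

static bool demo_explicit_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1: return cmd <= DEMO_E_MAX_V1;
	case 2: return cmd <= DEMO_E_MAX_V2;
	default: return false;
	}
}
```

In the linear variant, demo_linear_cmd_ok(2, 3) returns true even though 3
is only the v1 sentinel. In the explicit variant, every value accepted for
v2 above UNSPEC is a real command.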

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 13/17] btrfs: add send stream v2 definitions
  2021-11-18 14:18   ` David Sterba
@ 2021-11-18 19:08     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-18 19:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 03:18:47PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:23PM -0800, Omar Sandoval wrote:
> > @@ -113,6 +121,11 @@ enum {
> >  	BTRFS_SEND_A_PATH_LINK,
> >  
> >  	BTRFS_SEND_A_FILE_OFFSET,
> > +	/*
> > +	 * In send stream v2, this attribute is special: it must be the last
> > +	 * attribute in a command, its header contains only the type, and its
> > +	 * length is implicitly the remaining length of the command.
> > +	 */
> >  	BTRFS_SEND_A_DATA,
> 
> I don't like the conditional meaning of the DATA attribute, I'd rather
> see a new one that's v2+. It's adding a complexity that's not obvious
> without some context.

Hm, I could add a BTRFS_SEND_A_DATA2, but then we'd need something like
this on the parsing side:

diff --git a/common/send-stream.c b/common/send-stream.c
index f25450c8..d6b0c10b 100644
--- a/common/send-stream.c
+++ b/common/send-stream.c
@@ -189,7 +189,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
 
 		pos += sizeof(tlv_type);
 		data += sizeof(tlv_type);
-		if (sctx->version == 2 && tlv_type == BTRFS_SEND_A_DATA) {
+		if (tlv_type == BTRFS_SEND_A_DATA2) {
 			send_attr->tlv_len = cmd_len - pos;
 		} else {
 			if (cmd_len - pos < sizeof(__le16)) {
@@ -456,7 +456,10 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 	case BTRFS_SEND_C_WRITE:
 		TLV_GET_STRING(sctx, BTRFS_SEND_A_PATH, &path);
 		TLV_GET_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, &offset);
-		TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
+		if (sctx->cmd_attrs[BTRFS_SEND_A_DATA2].data)
+			TLV_GET(sctx, BTRFS_SEND_A_DATA2, &data, &len);
+		else
+			TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
 		ret = sctx->ops->write(path, data, offset, len, sctx->user);
 		break;
 	case BTRFS_SEND_C_ENCODED_WRITE:
@@ -476,7 +479,10 @@ static int read_and_process_cmd(struct btrfs_send_stream *sctx)
 			TLV_GET_U32(sctx, BTRFS_SEND_A_ENCRYPTION, &encryption);
 		else
 			encryption = BTRFS_ENCODED_IO_ENCRYPTION_NONE;
-		TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
+		if (sctx->cmd_attrs[BTRFS_SEND_A_DATA2].data)
+			TLV_GET(sctx, BTRFS_SEND_A_DATA2, &data, &len);
+		else
+			TLV_GET(sctx, BTRFS_SEND_A_DATA, &data, &len);
 		ret = sctx->ops->encoded_write(path, data, offset, len,
 					       unencoded_file_len,
 					       unencoded_len, unencoded_offset,

It doesn't really make reading the attribute any clearer, and then we
have to check for two attributes. But, if you prefer it this way, I can
change it.

P.S. if we stick with my way, that sctx->version == 2 should probably be
sctx->version >= 2
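
For readers following along, the version-dependent DATA rule can be shown
in isolation. This is a minimal sketch, assuming a command payload made of
little-endian u16 type / u16 length attributes; the DEMO_* attribute
numbers and the demo_find_attr() helper are hypothetical and do not match
btrfs-progs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_A_PATH 1	/* hypothetical attribute numbers */
#define DEMO_A_DATA 2

struct demo_attr {
	const uint8_t *data;
	size_t len;
};

static uint16_t demo_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Look up attribute `want` in a command payload of cmd_len bytes.
 * Returns 0 on success, -1 if the payload is malformed or the attribute
 * is missing.
 */
static int demo_find_attr(const uint8_t *buf, size_t cmd_len, int version,
			  uint16_t want, struct demo_attr *out)
{
	size_t pos = 0;

	assert(buf && out);
	while (pos < cmd_len) {
		uint16_t type;
		size_t len;

		if (cmd_len - pos < 2)
			return -1;
		type = demo_get_le16(buf + pos);
		pos += 2;
		if (version >= 2 && type == DEMO_A_DATA) {
			/* No length field: DATA runs to the end of the command. */
			len = cmd_len - pos;
		} else {
			if (cmd_len - pos < 2)
				return -1;
			len = demo_get_le16(buf + pos);
			pos += 2;
			if (cmd_len - pos < len)
				return -1;
		}
		if (type == want) {
			out->data = buf + pos;
			out->len = len;
			return 0;
		}
		pos += len;
	}
	return -1;
}
```

Because the implicit length swallows the rest of the command, nothing can
follow DATA in a v2 stream, which is why the series also enforces that
DATA is the last attribute.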

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2021-11-18 14:55   ` David Sterba
@ 2021-11-18 19:11     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-18 19:11 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 03:55:31PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> > +{
> > +	switch (compress_type) {
> > +	case BTRFS_COMPRESS_NONE:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> > +	case BTRFS_COMPRESS_ZLIB:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> > +	case BTRFS_COMPRESS_LZO:
> > +		/*
> > +		 * The LZO format depends on the page size. 64k is the maximum
> 
> > Should this also say it depends on the sector (not page) size?

Yeah, the code also needs to be changed to use the filesystem sector
size now that we support subpage LZO.

I'll clean up the other comments.
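
As a rough idea of what keying the LZO encoding on the sector size could
look like, here is a standalone sketch. The DEMO_* constants and
demo_lzo_encoding_from_sectorsize() are hypothetical, and the numeric
values are assumptions chosen only so that the 4k..64k variants are
consecutive, as the quoted "+ (PAGE_SHIFT - 12)" arithmetic requires.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed, illustrative values: one LZO encoding per power-of-two size. */
enum {
	DEMO_COMPRESSION_LZO_4K = 3,
	DEMO_COMPRESSION_LZO_64K = 7,
};

static_assert(DEMO_COMPRESSION_LZO_64K - DEMO_COMPRESSION_LZO_4K == 16 - 12,
	      "LZO encodings must be consecutive from 4k to 64k");

/* Map a sector size in bytes to an LZO encoding, or -EINVAL. */
static int demo_lzo_encoding_from_sectorsize(uint32_t sectorsize)
{
	unsigned int shift = 0;

	/* Only powers of two between 4k and 64k have an encoding. */
	if (sectorsize < 4096 || sectorsize > 65536 ||
	    (sectorsize & (sectorsize - 1)))
		return -EINVAL;
	while ((UINT32_C(1) << shift) < sectorsize)
		shift++;
	return DEMO_COMPRESSION_LZO_4K + (int)(shift - 12);
}
```

The only difference from the quoted code is the input: the filesystem's
sector size rather than the compile-time PAGE_SIZE.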

> > +		 * sectorsize (and thus page size) that we support.
> > +		 */
> > +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> > +			return -EINVAL;
> > +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> > +	case BTRFS_COMPRESS_ZSTD:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> > +	default:
> > +		return -EUCLEAN;
> > +	}
> > +}
> > +
> > +
> > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > +								path->slots[0]);
> > +		if (inline_size > count) {
> > +			ret = -ENOBUFS;
> > +			goto out;
> > +		}
> > +		count = inline_size;
> > +		encoded->unencoded_len = ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > +	} else {
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> 
> Please don't do chained initializations.
> 
> > +		ptr += iocb->ki_pos - extent_start;
> > +	}
> > +
> > +	tmp = kmalloc(count, GFP_NOFS);
> > +	if (!tmp) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	read_extent_buffer(leaf, tmp, ptr, count);
> > +	btrfs_release_path(path);
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	ret = copy_to_iter(tmp, count, iter);
> > +	if (ret != count)
> > +		ret = -EFAULT;
> > +	kfree(tmp);
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +struct btrfs_encoded_read_private {
> > +	struct inode *inode;
> > +	u64 file_offset;
> > +	wait_queue_head_t wait;
> > +	atomic_t pending;
> > +	blk_status_t status;
> > +	bool skip_csum;
> > +};
> > +
> > +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> > +					    struct bio *bio, int mirror_num,
> > +					    unsigned long bio_flags)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	blk_status_t ret;
> > +
> > +	if (!priv->skip_csum) {
> > +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > +	if (ret) {
> > +		btrfs_bio_free_csum(bbio);
> > +		return ret;
> > +	}
> > +
> > +	atomic_inc(&priv->pending);
> > +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> > +	if (ret) {
> > +		atomic_dec(&priv->pending);
> > +		btrfs_bio_free_csum(bbio);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> > +{
> > +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
> > +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	u32 sectorsize = fs_info->sectorsize;
> > +	struct bio_vec *bvec;
> > +	struct bvec_iter_all iter_all;
> > +	u64 start = priv->file_offset;
> > +	u32 bio_offset = 0;
> > +
> > +	if (priv->skip_csum || !uptodate)
> > +		return bbio->bio.bi_status;
> > +
> > +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> > +		unsigned int i, nr_sectors, pgoff;
> > +
> > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > +		pgoff = bvec->bv_offset;
> > +		for (i = 0; i < nr_sectors; i++) {
> > +			ASSERT(pgoff < PAGE_SIZE);
> > +			if (check_data_csum(inode, bbio, bio_offset,
> > +					    bvec->bv_page, pgoff, start))
> > +				return BLK_STS_IOERR;
> > +			start += sectorsize;
> > +			bio_offset += sectorsize;
> > +			pgoff += sectorsize;
> > +		}
> > +	}
> > +	return BLK_STS_OK;
> > +}
> > +
> > +static void btrfs_encoded_read_endio(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > +	blk_status_t status;
> > +
> > +	status = btrfs_encoded_read_verify_csum(bbio);
> > +	if (status) {
> > +		/*
> > +		 * The memory barrier implied by the atomic_dec_return() here
> > +		 * pairs with the memory barrier implied by the
> > +		 * atomic_dec_return() or io_wait_event() in
> > +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> > +		 * write is observed before the load of status in
> > +		 * btrfs_encoded_read_regular_fill_pages().
> > +		 */
> > +		WRITE_ONCE(priv->status, status);
> 
> The WRITE_ONCE is here 3 times, I wonder if this is ok to be opencoded
> like that. I'd suggest to use a helper with a comment.
> 
> > +	}
> > +	if (!atomic_dec_return(&priv->pending))
> > +		wake_up(&priv->wait);
> > +	btrfs_bio_free_csum(bbio);
> > +	bio_put(bio);
> > +}
> > +
> > +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> > +						 u64 file_offset,
> > +						 u64 disk_bytenr,
> > +						 u64 disk_io_size,
> > +						 struct page **pages)
> > +{
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_encoded_read_private priv = {
> > +		.inode = inode,
> > +		.file_offset = file_offset,
> > +		.pending = ATOMIC_INIT(1),
> > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > +	};
> > +	unsigned long i = 0;
> > +	u64 cur = 0;
> > +	int ret;
> > +
> > +	init_waitqueue_head(&priv.wait);
> > +	/*
> > +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> > +	 * necessary.
> > +	 */
> > +	while (cur < disk_io_size) {
> > +		struct extent_map *em;
> > +		struct btrfs_io_geometry geom;
> > +		struct bio *bio = NULL;
> > +		u64 remaining;
> > +
> > +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> > +					 disk_io_size - cur);
> > +		if (IS_ERR(em)) {
> > +			ret = PTR_ERR(em);
> > +		} else {
> > +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> > +						    disk_bytenr + cur, &geom);
> > +			free_extent_map(em);
> > +		}
> > +		if (ret) {
> > +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> > +			break;
> > +		}
> > +		remaining = min(geom.len, disk_io_size - cur);
> > +		while (bio || remaining) {
> > +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> > +
> > +			if (!bio) {
> > +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> > +				bio->bi_iter.bi_sector =
> > +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> > +				bio->bi_end_io = btrfs_encoded_read_endio;
> > +				bio->bi_private = &priv;
> > +				bio->bi_opf = REQ_OP_READ;
> > +			}
> > +
> > +			if (!bytes ||
> > +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > +				blk_status_t status;
> > +
> > +				status = submit_encoded_read_bio(inode, bio, 0,
> > +								 0);
> > +				if (status) {
> > +					WRITE_ONCE(priv.status, status);
> > +					bio_put(bio);
> > +					goto out;
> > +				}
> > +				bio = NULL;
> > +				continue;
> > +			}
> > +
> > +			i++;
> > +			cur += bytes;
> > +			remaining -= bytes;
> > +		}
> > +	}
> > +
> > +out:
> > +	if (atomic_dec_return(&priv.pending))
> > +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> > +	/* See btrfs_encoded_read_endio() for ordering. */
> > +	return blk_status_to_errno(READ_ONCE(priv.status));
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  u64 start, u64 lockend,
> > +					  struct extent_state **cached_state,
> > +					  u64 disk_bytenr, u64 disk_io_size,
> > +					  size_t count, bool compressed,
> > +					  bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct page **pages;
> > +	unsigned long nr_pages, i;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> 
> > Power of two compile time constants can use the bitmask operations for
> > alignment.
> 
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> > +						    disk_io_size, pages);
> > +	if (ret)
> > +		goto out;
> > +
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	if (compressed) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > +	}
> > +	cur = 0;
> > +	while (cur < count) {
> > +		size_t bytes = min_t(size_t, count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > +				      iter) != bytes) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = count;
> > +out:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			__free_page(pages[i]);
> > +	}
> > +	kfree(pages);
> > +	return ret;
> > +}
> > +
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > +			   struct btrfs_ioctl_encoded_io_args *encoded)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	ssize_t ret;
> > +	size_t count = iov_iter_count(iter);
> > +	u64 start, lockend, disk_bytenr, disk_io_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	bool unlocked = false;
> > +
> > +	file_accessed(iocb->ki_filp);
> > +
> > +	inode_lock_shared(inode);
> > +
> > +	if (iocb->ki_pos >= inode->i_size) {
> > +		inode_unlock_shared(inode);
> > +		return 0;
> > +	}
> > +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> > +	/*
> > +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> > +	 * it's compressed we know that it won't be longer than this.
> > +	 */
> > +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start,
> > +					       lockend - start + 1);
> > +		if (ret)
> > +			goto out_unlock_inode;
> > +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     lockend - start + 1);
> > +		if (!ordered)
> > +			break;
> > +		btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		cond_resched();
> > +	}
> > +
> > +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> > +			      lockend - start + 1);
> > +	if (IS_ERR(em)) {
> > +		ret = PTR_ERR(em);
> > +		goto out_unlock_extent;
> > +	}
> > +
> > +	if (em->block_start == EXTENT_MAP_INLINE) {
> > +		u64 extent_start = em->start;
> > +
> > +		/*
> > +		 * For inline extents we get everything we need out of the
> > +		 * extent item.
> > +		 */
> > +		free_extent_map(em);
> > +		em = NULL;
> > +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> > +						&cached_state, extent_start,
> > +						count, encoded, &unlocked);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * We only want to return up to EOF even if the extent extends beyond
> > +	 * that.
> > +	 */
> > +	encoded->len = (min_t(u64, extent_map_end(em), inode->i_size) -
> > +			iocb->ki_pos);
> > +	if (em->block_start == EXTENT_MAP_HOLE ||
> > +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> > +		disk_bytenr = EXTENT_MAP_HOLE;
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> > +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +		disk_bytenr = em->block_start;
> > +		/*
> > +		 * Bail if the buffer isn't large enough to return the whole
> > +		 * compressed extent.
> > +		 */
> > +		if (em->block_len > count) {
> > +			ret = -ENOBUFS;
> > +			goto out_em;
> > +		}
> > +		disk_io_size = count = em->block_len;
> > +		encoded->unencoded_len = em->ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
> > +		ret = btrfs_encoded_io_compression_from_extent(
> > +							     em->compress_type);
> > +		if (ret < 0)
> > +			goto out_em;
> > +		encoded->compression = ret;
> > +	} else {
> > +		disk_bytenr = em->block_start + (start - em->start);
> > +		if (encoded->len > count)
> > +			encoded->len = count;
> > +		/*
> > +		 * Don't read beyond what we locked. This also limits the page
> > +		 * allocations that we'll do.
> > +		 */
> > +		disk_io_size = min(lockend + 1,
> > +				   iocb->ki_pos + encoded->len) - start;
> > +		encoded->len = encoded->unencoded_len = count =
> > +			start + disk_io_size - iocb->ki_pos;
> > +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> > +	}
> > +	free_extent_map(em);
> > +	em = NULL;
> > +
> > +	if (disk_bytenr == EXTENT_MAP_HOLE) {
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		inode_unlock_shared(inode);
> > +		unlocked = true;
> > +		ret = iov_iter_zero(count, iter);
> > +		if (ret != count)
> > +			ret = -EFAULT;
> > +	} else {
> > +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> > +						 &cached_state, disk_bytenr,
> > +						 disk_io_size, count,
> > +						 encoded->compression,
> > +						 &unlocked);
> > +	}
> > +
> > +out:
> > +	if (ret >= 0)
> > +		iocb->ki_pos += encoded->len;
> > +out_em:
> > +	free_extent_map(em);
> > +out_unlock_extent:
> > +	if (!unlocked)
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +out_unlock_inode:
> > +	if (!unlocked)
> > +		inode_unlock_shared(inode);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * Add an entry indicating a block group or device which is pinned by a
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 05c77a1979a9..f0c575223d88 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -28,6 +28,7 @@
> >  #include <linux/iversion.h>
> >  #include <linux/fileattr.h>
> >  #include <linux/fsverity.h>
> > +#include <linux/sched/xacct.h>
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "export.h"
> > @@ -88,6 +89,22 @@ struct btrfs_ioctl_send_args_32 {
> >  
> >  #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
> >  			       struct btrfs_ioctl_send_args_32)
> > +
> > +struct btrfs_ioctl_encoded_io_args_32 {
> > +	compat_uptr_t iov;
> > +	compat_ulong_t iovcnt;
> > +	__s64 offset;
> > +	__u64 flags;
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> > +	__u32 reserved[8];
> > +};
> > +
> > +#define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
> > +				       struct btrfs_ioctl_encoded_io_args_32)
> >  #endif
> >  
> >  /* Mask out flags that are inappropriate for the given type of inode. */
> > @@ -4861,6 +4878,89 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
> >  	return ret;
> >  }
> >  
> > +static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
> > +				    bool compat)
> > +{
> > +	struct btrfs_ioctl_encoded_io_args args = {};
> > +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> > +					     flags);
> > +	size_t copy_end;
> > +	struct iovec iovstack[UIO_FASTIOV];
> > +	struct iovec *iov = iovstack;
> > +	struct iov_iter iter;
> > +	loff_t pos;
> > +	struct kiocb kiocb;
> > +	ssize_t ret;
> > +
> > +	if (!capable(CAP_SYS_ADMIN)) {
> > +		ret = -EPERM;
> > +		goto out_acct;
> > +	}
> > +
> > +	if (compat) {
> > +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> > +		struct btrfs_ioctl_encoded_io_args_32 args32;
> > +
> > +		copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
> > +				       flags);
> > +		if (copy_from_user(&args32, argp, copy_end)) {
> > +			ret = -EFAULT;
> > +			goto out_acct;
> > +		}
> > +		args.iov = compat_ptr(args32.iov);
> > +		args.iovcnt = args32.iovcnt;
> > +		args.offset = args32.offset;
> > +		args.flags = args32.flags;
> > +#else
> > +		return -ENOTTY;
> > +#endif
> > +	} else {
> > +		copy_end = copy_end_kernel;
> > +		if (copy_from_user(&args, argp, copy_end)) {
> > +			ret = -EFAULT;
> > +			goto out_acct;
> > +		}
> > +	}
> > +	if (args.flags != 0) {
> > +		ret = -EINVAL;
> > +		goto out_acct;
> > +	}
> > +
> > +	ret = import_iovec(READ, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
> > +			   &iov, &iter);
> > +	if (ret < 0)
> > +		goto out_acct;
> > +
> > +	if (iov_iter_count(&iter) == 0) {
> > +		ret = 0;
> > +		goto out_iov;
> > +	}
> > +	pos = args.offset;
> > +	ret = rw_verify_area(READ, file, &pos, args.len);
> > +	if (ret < 0)
> > +		goto out_iov;
> > +
> > +	init_sync_kiocb(&kiocb, file);
> > +	kiocb.ki_pos = pos;
> > +
> > +	ret = btrfs_encoded_read(&kiocb, &iter, &args);
> > +	if (ret >= 0) {
> > +		fsnotify_access(file);
> > +		if (copy_to_user(argp + copy_end,
> > +				 (char *)&args + copy_end_kernel,
> > +				 sizeof(args) - copy_end_kernel))
> > +			ret = -EFAULT;
> > +	}
> > +
> > +out_iov:
> > +	kfree(iov);
> > +out_acct:
> > +	if (ret > 0)
> > +		add_rchar(current, ret);
> > +	inc_syscr(current);
> > +	return ret;
> > +}
> > +
> >  long btrfs_ioctl(struct file *file, unsigned int
> >  		cmd, unsigned long arg)
> >  {
> > @@ -5005,6 +5105,12 @@ long btrfs_ioctl(struct file *file, unsigned int
> >  		return fsverity_ioctl_enable(file, (const void __user *)argp);
> >  	case FS_IOC_MEASURE_VERITY:
> >  		return fsverity_ioctl_measure(file, argp);
> > +	case BTRFS_IOC_ENCODED_READ:
> > +		return btrfs_ioctl_encoded_read(file, argp, false);
> > +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> > +	case BTRFS_IOC_ENCODED_READ_32:
> > +		return btrfs_ioctl_encoded_read(file, argp, true);
> > +#endif
> >  	}
> >  
> >  	return -ENOTTY;
> > -- 
> > 2.34.0

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 01/17] fs: export rw_verify_area()
  2021-11-18 14:57   ` David Sterba
@ 2021-11-18 19:15     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-18 19:15 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 03:57:14PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:11PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > I'm adding Btrfs ioctls to read and write compressed data, and rather
> > than duplicating the checks in rw_verify_area(), let's just export it.
> > 
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3244,6 +3244,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
> >  		int whence, loff_t size);
> >  extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
> >  extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
> > +extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
> 
> Do you have an ack from VFS people for exporting a function from
> fs/internal.h to the normal fs.h?

Nope, although I've sent it enough times that they should've nacked it
by now if they cared. I guess I didn't cc fsdevel on this version, so
I'll ping this patch on v11.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2
  2021-11-18 15:50   ` David Sterba
@ 2021-11-18 19:34     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2021-11-18 19:34 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 04:50:37PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:24PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The length field of the send stream TLV header is 16 bits. This means
> > that the maximum amount of data that can be sent for one write is 64k
> > minus one. However, encoded writes must be able to send the maximum
> > compressed extent (128k) in one command. To support this, send stream
> > version 2 encodes the DATA attribute differently: it has no length
> > field, and the length is implicitly up to the end of the containing command
> > (which has a 32-bit length field). Although this is necessary for
> > encoded writes, normal writes can benefit from it, too.
> > 
> > Also add a check to enforce that the DATA attribute is last. It is only
> > strictly necessary for v2, but we might as well make v1 consistent with
> > it.
> > 
> > For v2, let's bump up the send buffer to the maximum compressed extent
> > size plus 16k for the other metadata (144k total).
> 
> I'm not sure we want to set the number like that, it feels quite
> limiting for potential compression enhancements.

This is all we need for now, but we can always raise it in the future. I
amended the protocol and the progs send parsing code to assume no hard
limit.
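
To make the 16-bit limit concrete, here is a minimal userspace sketch of the
two DATA encodings described in the quoted commit message. It assumes a v1
attribute is a 16-bit type followed by a 16-bit length, and that a v2 DATA
attribute carries only the type, with the payload running to the end of the
command. put_data_attr() and put_le16() are hypothetical helpers written for
this illustration, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(unsigned char *p, uint16_t v)
{
	p[0] = (unsigned char)(v & 0xff);
	p[1] = (unsigned char)(v >> 8);
}

/*
 * Hypothetical helper, not the kernel's code: write a DATA attribute to out.
 * Returns the number of bytes written, or 0 if the payload cannot be encoded.
 */
static size_t put_data_attr(unsigned char *out, size_t cap, int proto,
			    uint16_t type, const void *data, size_t len)
{
	/* v1: type + length header; v2 (assumed): type only. */
	size_t hdr = (proto >= 2) ? 2 : 4;

	assert(out && data);
	if (proto < 2 && len > UINT16_MAX)
		return 0;	/* v1: the length must fit the 16-bit field */
	if (cap < hdr || len > cap - hdr)
		return 0;	/* not enough room in the send buffer */

	put_le16(out, type);
	if (proto < 2)
		put_le16(out + 2, (uint16_t)len);
	memcpy(out + hdr, data, len);
	return hdr + len;
}
```

With a 128k payload the v1 path refuses to encode, while the v2 path only
needs the two type bytes in front of the data. This is also why the DATA
attribute has to come last in a v2 command.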

> > Since this will most
> > likely be vmalloc'd (and always will be after the next commit), we round
> > it up to the next page since we might as well use the rest of the page
> > on systems with >16k pages.
> 
> Would it work also without the virtual mappings? For speedup it makes
> sense to use vmalloc area, but as a fallback writing in smaller portions
> or page by page eventually should be also possible. For that reason I
> don't think we should set the maximum other than what fits to 32bit
> number minus some overhead.

I think you're saying that we could allocate a smaller buffer and do
smaller reads that we immediately write to the send pipe/file? So
something like:

send_write() {
	write_tlv_metadata_to_pipe();
	while (written < to_write) {
		read_small_chunk();
		write_small_chunk_to_pipe();
		written += size_of_small_chunk();
	}
}

And from the protocol's point of view, it's still one big command,
although we didn't have to keep it all in memory at once.

If I'm understanding correctly, then yes, I think that's something we
could do eventually. And my description of v2 allows this:

-#define BTRFS_SEND_BUF_SIZE SZ_64K
+/*
+ * In send stream v1, no command is larger than 64k. In send stream v2, no limit
+ * should be assumed.
+ */
+#define BTRFS_SEND_BUF_SIZE_V1 SZ_64K

Although receive would have to be more intelligent about reading huge
commands.
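
The pseudocode above can be fleshed out as a small userspace sketch, under
the assumption that the "pipe" is just an in-memory sink. struct sink,
sink_write() and send_write_chunked() are hypothetical names for this
illustration and do not correspond to anything in the kernel or btrfs-progs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the send pipe/file. */
struct sink {
	unsigned char buf[1 << 18];
	size_t len;
	unsigned int writes;	/* number of separate write calls */
};

static int sink_write(struct sink *s, const void *data, size_t n)
{
	if (n > sizeof(s->buf) - s->len)
		return -1;
	memcpy(s->buf + s->len, data, n);
	s->len += n;
	s->writes++;
	return 0;
}

/*
 * Emit one logical command: the header (TLV metadata) first, then the
 * payload in pieces of at most chunk bytes, so that only one chunk has
 * to be held in memory at a time.
 */
static int send_write_chunked(struct sink *s, const void *hdr, size_t hdr_len,
			      const unsigned char *payload, size_t to_write,
			      size_t chunk)
{
	size_t written = 0;

	assert(chunk > 0);
	if (sink_write(s, hdr, hdr_len))
		return -1;
	while (written < to_write) {
		size_t n = to_write - written;

		if (n > chunk)
			n = chunk;
		if (sink_write(s, payload + written, n))
			return -1;
		written += n;
	}
	return 0;
}
```

The bytes that reach the sink are identical to what a single large write
would produce, which is why the stream format does not need to know about
the chunking.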

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-11-18 18:54     ` Omar Sandoval
@ 2021-12-09 18:08       ` Omar Sandoval
  2022-01-04 19:05         ` Omar Sandoval
  2022-01-24 22:40       ` David Sterba
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2021-12-09 18:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 10:54:16AM -0800, Omar Sandoval wrote:
> On Thu, Nov 18, 2021 at 03:23:59PM +0100, David Sterba wrote:
> > On Wed, Nov 17, 2021 at 12:19:22PM -0800, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
> > > __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
> > > version plus 1, but as written this creates gaps in the number space.
> > > The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
> > > accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
> > > has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
> > > 23 and 24 are valid commands.
> > 
> > The MAX definitions have the __ prefix so they're private and not meant
> > to be used as proper commands, so nothing should suggest there are any
> > commands with numbers 23 to 25 in the example.
> > 
> > > Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
> > > number. This requires repeating the command name, but it has a clearer
> > > meaning and avoids gaps. It also doesn't require updating
> > > __BTRFS_SEND_C_MAX for every new version.
> > 
> > It's probably a matter of taste, I'd intentionally avoid the pattern
> > above, ie. repeating the previous command to define max.
> > 
> > > --- a/fs/btrfs/send.c
> > > +++ b/fs/btrfs/send.c
> > > @@ -316,8 +316,8 @@ __maybe_unused
> > >  static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
> > >  {
> > >  	switch (sctx->proto) {
> > > -	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
> > > -	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
> > > +	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
> > > +	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
> > 
> > This seems to be the only practical difference, < or <= .
> 
> There is another practical difference, which is more significant in my
> opinion: the linear style creates "gaps" in the valid commands. Consider
> this, with explicit values added for clarity:
> 
> enum btrfs_send_cmd {
>         BTRFS_SEND_C_UNSPEC = 0,
> 
>         /* Version 1 */
>         BTRFS_SEND_C_SUBVOL = 1,
>         BTRFS_SEND_C_SNAPSHOT = 2,
> 
>         BTRFS_SEND_C_MKFILE = 3,
>         BTRFS_SEND_C_MKDIR = 4,
>         BTRFS_SEND_C_MKNOD = 5,
>         BTRFS_SEND_C_MKFIFO = 6,
>         BTRFS_SEND_C_MKSOCK = 7,
>         BTRFS_SEND_C_SYMLINK = 8,
> 
>         BTRFS_SEND_C_RENAME = 9,
>         BTRFS_SEND_C_LINK = 10,
>         BTRFS_SEND_C_UNLINK = 11,
>         BTRFS_SEND_C_RMDIR = 12,
> 
>         BTRFS_SEND_C_SET_XATTR = 13,
>         BTRFS_SEND_C_REMOVE_XATTR = 14,
> 
>         BTRFS_SEND_C_WRITE = 15,
>         BTRFS_SEND_C_CLONE = 16,
> 
>         BTRFS_SEND_C_TRUNCATE = 17,
>         BTRFS_SEND_C_CHMOD = 18,
>         BTRFS_SEND_C_CHOWN = 19,
>         BTRFS_SEND_C_UTIMES = 20,
> 
>         BTRFS_SEND_C_END = 21,
>         BTRFS_SEND_C_UPDATE_EXTENT = 22,
>         __BTRFS_SEND_C_MAX_V1 = 23,
> 
>         /* Version 2 */
>         BTRFS_SEND_C_FALLOCATE = 24,
>         BTRFS_SEND_C_SETFLAGS = 25,
>         BTRFS_SEND_C_ENCODED_WRITE = 26,
>         __BTRFS_SEND_C_MAX_V2 = 27,
> 
>         /* End */
>         __BTRFS_SEND_C_MAX = 28,
> };
> #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1) /* 27 */
> 
> Notice that BTRFS_SEND_C_UPDATE_EXTENT is 22 and the next valid command
> is BTRFS_SEND_C_FALLOCATE, which is 24. So 23 does not correspond to an
> actual command; it's a "gap". This is somewhat cosmetic, but it's an
> ugly wart in the protocol.
> 
> Also consider something indexing on the command number, like the
> cmd_send_size thing I got rid of in the previous patch:
> 
> 	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1]
> 
> Indices 23 and 27 are wasted. It's only 16 bytes in this case, which
> doesn't matter practically, but it's unpleasant.
> 
> Maybe you were aware of this and fine with it, in which case we can drop
> this change. But I think the name repetition is less ugly than the gaps.

Ping. Please let me know how you'd like me to proceed on this issue and
my other replies. Thanks!
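
For readers following the numbering argument, the two styles can be compared
in a reduced, self-contained sketch. The enums below are hypothetical
three-command stand-ins for btrfs_send_cmd (the S_ and A_ names are made up
for this illustration), and the sentinel names avoid a leading double
underscore only because that prefix is reserved in userspace C.

```c
#include <assert.h>
#include <stdbool.h>

/* Sentinel style: each per-version MAX entry consumes a value of its own. */
enum cmd_sentinel {
	S_UNSPEC = 0,
	S_WRITE,		/* 1 */
	S_END,			/* 2 */
	S_MAX_V1_PLUS_1,	/* 3: not a command */
	S_ENCODED_WRITE,	/* 4 */
	S_MAX_V2_PLUS_1,	/* 5: not a command */
};

/* Alias style: MAX repeats the last command, so numbering stays dense. */
enum cmd_alias {
	A_UNSPEC = 0,
	A_WRITE,		/* 1 */
	A_END,			/* 2 */
	A_MAX_V1 = A_END,
	A_ENCODED_WRITE,	/* 3 */
	A_MAX_V2 = A_ENCODED_WRITE,
};

static_assert(A_ENCODED_WRITE == A_MAX_V1 + 1, "no gap between versions");

static bool sentinel_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1:	 return cmd < S_MAX_V1_PLUS_1;
	case 2:	 return cmd < S_MAX_V2_PLUS_1;
	default: return false;
	}
}

static bool alias_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1:	 return cmd <= A_MAX_V1;
	case 2:	 return cmd <= A_MAX_V2;
	default: return false;
	}
}
```

In the sentinel style, value 3 passes the v2 range check although no command
has that number. In the alias style, every value accepted by the check is a
real command (or UNSPEC).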

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-12-09 18:08       ` Omar Sandoval
@ 2022-01-04 19:05         ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2022-01-04 19:05 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Thu, Dec 09, 2021 at 10:08:02AM -0800, Omar Sandoval wrote:
> On Thu, Nov 18, 2021 at 10:54:16AM -0800, Omar Sandoval wrote:
> > On Thu, Nov 18, 2021 at 03:23:59PM +0100, David Sterba wrote:
> > > On Wed, Nov 17, 2021 at 12:19:22PM -0800, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
> > > > > __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
> > > > version plus 1, but as written this creates gaps in the number space.
> > > > The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
> > > > accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
> > > > has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
> > > > 23 and 24 are valid commands.
> > > 
> > > The MAX definitions have the __ prefix so they're private and not meant
> > > to be used as proper commands, so nothing should suggest there are any
> > > commands with numbers 23 to 25 in the example.
> > > 
> > > > Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
> > > > number. This requires repeating the command name, but it has a clearer
> > > > meaning and avoids gaps. It also doesn't require updating
> > > > __BTRFS_SEND_C_MAX for every new version.
> > > 
> > > It's probably a matter of taste, I'd intentionally avoid the pattern
> > > above, ie. repeating the previous command to define max.
> > > 
> > > > --- a/fs/btrfs/send.c
> > > > +++ b/fs/btrfs/send.c
> > > > @@ -316,8 +316,8 @@ __maybe_unused
> > > >  static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
> > > >  {
> > > >  	switch (sctx->proto) {
> > > > -	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
> > > > -	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
> > > > +	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
> > > > +	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
> > > 
> > > This seems to be the only practical difference, < or <= .
> > 
> > There is another practical difference, which is more significant in my
> > opinion: the linear style creates "gaps" in the valid commands. Consider
> > this, with explicit values added for clarity:
> > 
> > enum btrfs_send_cmd {
> >         BTRFS_SEND_C_UNSPEC = 0,
> > 
> >         /* Version 1 */
> >         BTRFS_SEND_C_SUBVOL = 1,
> >         BTRFS_SEND_C_SNAPSHOT = 2,
> > 
> >         BTRFS_SEND_C_MKFILE = 3,
> >         BTRFS_SEND_C_MKDIR = 4,
> >         BTRFS_SEND_C_MKNOD = 5,
> >         BTRFS_SEND_C_MKFIFO = 6,
> >         BTRFS_SEND_C_MKSOCK = 7,
> >         BTRFS_SEND_C_SYMLINK = 8,
> > 
> >         BTRFS_SEND_C_RENAME = 9,
> >         BTRFS_SEND_C_LINK = 10,
> >         BTRFS_SEND_C_UNLINK = 11,
> >         BTRFS_SEND_C_RMDIR = 12,
> > 
> >         BTRFS_SEND_C_SET_XATTR = 13,
> >         BTRFS_SEND_C_REMOVE_XATTR = 14,
> > 
> >         BTRFS_SEND_C_WRITE = 15,
> >         BTRFS_SEND_C_CLONE = 16,
> > 
> >         BTRFS_SEND_C_TRUNCATE = 17,
> >         BTRFS_SEND_C_CHMOD = 18,
> >         BTRFS_SEND_C_CHOWN = 19,
> >         BTRFS_SEND_C_UTIMES = 20,
> > 
> >         BTRFS_SEND_C_END = 21,
> >         BTRFS_SEND_C_UPDATE_EXTENT = 22,
> >         __BTRFS_SEND_C_MAX_V1 = 23,
> > 
> >         /* Version 2 */
> >         BTRFS_SEND_C_FALLOCATE = 24,
> >         BTRFS_SEND_C_SETFLAGS = 25,
> >         BTRFS_SEND_C_ENCODED_WRITE = 26,
> >         __BTRFS_SEND_C_MAX_V2 = 27,
> > 
> >         /* End */
> >         __BTRFS_SEND_C_MAX = 28,
> > };
> > #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1) /* 27 */
> > 
> > Notice that BTRFS_SEND_C_UPDATE_EXTENT is 22 and the next valid command
> > is BTRFS_SEND_C_FALLOCATE, which is 24. So 23 does not correspond to an
> > actual command; it's a "gap". This is somewhat cosmetic, but it's an
> > ugly wart in the protocol.
> > 
> > Also consider something indexing on the command number, like the
> > cmd_send_size thing I got rid of in the previous patch:
> > 
> > 	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1]
> > 
> > Indices 23 and 27 are wasted. It's only 16 bytes in this case, which
> > doesn't matter practically, but it's unpleasant.
> > 
> > Maybe you were aware of this and fine with it, in which case we can drop
> > this change. But I think the name repetition is less ugly than the gaps.
> 
> Ping. Please let me know how you'd like me to proceed on this issue and
> my other replies. Thanks!

New year, new ping. Thanks!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
  2021-11-18 14:55   ` David Sterba
@ 2022-01-24 21:54   ` David Sterba
  2022-01-24 22:33     ` Omar Sandoval
  2022-01-24 22:26   ` David Sterba
  2 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2022-01-24 21:54 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>

> - We don't do read repair, because it turns out that read repair is
>   currently broken for compressed data.

Is there a reproducer, and a fix?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
  2021-11-18 14:55   ` David Sterba
  2022-01-24 21:54   ` David Sterba
@ 2022-01-24 22:26   ` David Sterba
  2022-01-25 21:26     ` Omar Sandoval
  2 siblings, 1 reply; 50+ messages in thread
From: David Sterba @ 2022-01-24 22:26 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There are 4 main cases:
> 
> 1. Inline extents: we copy the data straight out of the extent buffer.
> 2. Hole/preallocated extents: we fill in zeroes.
> 3. Regular, uncompressed extents: we read the sectors we need directly
>    from disk.
> 4. Regular, compressed extents: we read the entire compressed extent
>    from disk and indicate what subset of the decompressed extent is in
>    the file.
> 
> This initial implementation simplifies a few things that can be improved
> in the future:
> 
> - We hold the inode lock during the operation.
> - Cases 1, 3, and 4 allocate temporary memory to read into before
>   copying out to userspace.
> - We don't do read repair, because it turns out that read repair is
>   currently broken for compressed data.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/ctree.h |   4 +
>  fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/ioctl.c | 106 ++++++++++
>  3 files changed, 606 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2e7f74060a14..70034e33abe6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
>  void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>  					  struct page *page, u64 start,
>  					  u64 end, bool uptodate);
> +struct btrfs_ioctl_encoded_io_args;
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> +			   struct btrfs_ioctl_encoded_io_args *encoded);
> +
>  extern const struct dentry_operations btrfs_dentry_operations;
>  extern const struct iomap_ops btrfs_dio_iomap_ops;
>  extern const struct iomap_dio_ops btrfs_dio_ops;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c2efea101f61..d29e968fd18b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10525,6 +10525,502 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
>  	}
>  }
>  
> +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> +{
> +	switch (compress_type) {
> +	case BTRFS_COMPRESS_NONE:
> +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> +	case BTRFS_COMPRESS_ZLIB:
> +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> +	case BTRFS_COMPRESS_LZO:
> +		/*
> +		 * The LZO format depends on the page size. 64k is the maximum
> +		 * sectorsize (and thus page size) that we support.
> +		 */
> +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> +			return -EINVAL;
> +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> +	case BTRFS_COMPRESS_ZSTD:
> +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> +	default:
> +		return -EUCLEAN;
> +	}
> +}
> +
> +static ssize_t btrfs_encoded_read_inline(
> +				struct kiocb *iocb,
> +				struct iov_iter *iter, u64 start,
> +				u64 lockend,
> +				struct extent_state **cached_state,
> +				u64 extent_start, size_t count,
> +				struct btrfs_ioctl_encoded_io_args *encoded,
> +				bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);

Please use btrfs_inode in all internal helpers, either as parameters or
as local variable, to avoid the BTRFS_I and btrfs_sb conversions everywhere.

> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_path *path;
> +	struct extent_buffer *leaf;
> +	struct btrfs_file_extent_item *item;
> +	u64 ram_bytes;
> +	unsigned long ptr;
> +	void *tmp;
> +	ssize_t ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> +				       0);
> +	if (ret) {
> +		if (ret > 0) {
> +			/* The extent item disappeared? */
> +			ret = -EIO;
> +		}
> +		goto out;
> +	}
> +	leaf = path->nodes[0];
> +	item = btrfs_item_ptr(leaf, path->slots[0],
> +			      struct btrfs_file_extent_item);
> +
> +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> +	ptr = btrfs_file_extent_inline_start(item);
> +
> +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> +			iocb->ki_pos);

No need for the outer ( )

> +	ret = btrfs_encoded_io_compression_from_extent(
> +				 btrfs_file_extent_compression(leaf, item));
> +	if (ret < 0)
> +		goto out;
> +	encoded->compression = ret;
> +	if (encoded->compression) {
> +		size_t inline_size;
> +
> +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> +								path->slots[0]);
> +		if (inline_size > count) {
> +			ret = -ENOBUFS;
> +			goto out;
> +		}
> +		count = inline_size;
> +		encoded->unencoded_len = ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> +	} else {
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);

I'm sure I have commented on that in the past, please don't use chained
initializations. In this case something like:

		count = min_t(u64, count, encoded->len);
		encoded->len = count;
		encoded->unencoded_len = count;

> +		ptr += iocb->ki_pos - extent_start;
> +	}
> +
> +	tmp = kmalloc(count, GFP_NOFS);
> +	if (!tmp) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	read_extent_buffer(leaf, tmp, ptr, count);
> +	btrfs_release_path(path);
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	ret = copy_to_iter(tmp, count, iter);
> +	if (ret != count)
> +		ret = -EFAULT;
> +	kfree(tmp);
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +struct btrfs_encoded_read_private {
> +	struct inode *inode;

This should also be btrfs_inode

> +	u64 file_offset;
> +	wait_queue_head_t wait;
> +	atomic_t pending;
> +	blk_status_t status;
> +	bool skip_csum;
> +};
> +
> +static blk_status_t submit_encoded_read_bio(struct inode *inode,

struct btrfs_inode

> +					    struct bio *bio, int mirror_num,
> +					    unsigned long bio_flags)

bio_flags is unused here (and in the encoded patches afaics)

> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	blk_status_t ret;
> +
> +	if (!priv->skip_csum) {
> +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> +	if (ret) {
> +		btrfs_bio_free_csum(bbio);
> +		return ret;
> +	}
> +
> +	atomic_inc(&priv->pending);
> +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> +	if (ret) {
> +		atomic_dec(&priv->pending);
> +		btrfs_bio_free_csum(bbio);
> +	}
> +	return ret;
> +}
> +
> +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> +{
> +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;

	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);

> +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> +	struct inode *inode = priv->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 sectorsize = fs_info->sectorsize;
> +	struct bio_vec *bvec;
> +	struct bvec_iter_all iter_all;
> +	u64 start = priv->file_offset;
> +	u32 bio_offset = 0;
> +
> +	if (priv->skip_csum || !uptodate)
> +		return bbio->bio.bi_status;
> +
> +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> +		unsigned int i, nr_sectors, pgoff;
> +
> +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> +		pgoff = bvec->bv_offset;
> +		for (i = 0; i < nr_sectors; i++) {
> +			ASSERT(pgoff < PAGE_SIZE);
> +			if (check_data_csum(inode, bbio, bio_offset,
> +					    bvec->bv_page, pgoff, start))
> +				return BLK_STS_IOERR;
> +			start += sectorsize;
> +			bio_offset += sectorsize;
> +			pgoff += sectorsize;
> +		}
> +	}
> +	return BLK_STS_OK;
> +}
> +
> +static void btrfs_encoded_read_endio(struct bio *bio)
> +{
> +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	blk_status_t status;
> +
> +	status = btrfs_encoded_read_verify_csum(bbio);
> +	if (status) {
> +		/*
> +		 * The memory barrier implied by the atomic_dec_return() here
> +		 * pairs with the memory barrier implied by the
> +		 * atomic_dec_return() or io_wait_event() in
> +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> +		 * write is observed before the load of status in
> +		 * btrfs_encoded_read_regular_fill_pages().
> +		 */
> +		WRITE_ONCE(priv->status, status);
> +	}
> +	if (!atomic_dec_return(&priv->pending))
> +		wake_up(&priv->wait);
> +	btrfs_bio_free_csum(bbio);
> +	bio_put(bio);
> +}
> +
> +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> +						 u64 file_offset,
> +						 u64 disk_bytenr,
> +						 u64 disk_io_size,
> +						 struct page **pages)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_encoded_read_private priv = {
> +		.inode = inode,
> +		.file_offset = file_offset,
> +		.pending = ATOMIC_INIT(1),
> +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> +	};
> +	unsigned long i = 0;
> +	u64 cur = 0;
> +	int ret;
> +
> +	init_waitqueue_head(&priv.wait);
> +	/*
> +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> +	 * necessary.
> +	 */
> +	while (cur < disk_io_size) {
> +		struct extent_map *em;
> +		struct btrfs_io_geometry geom;
> +		struct bio *bio = NULL;
> +		u64 remaining;
> +
> +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> +					 disk_io_size - cur);
> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +		} else {
> +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> +						    disk_bytenr + cur, &geom);
> +			free_extent_map(em);
> +		}
> +		if (ret) {
> +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> +			break;
> +		}
> +		remaining = min(geom.len, disk_io_size - cur);
> +		while (bio || remaining) {
> +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> +
> +			if (!bio) {
> +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> +				bio->bi_iter.bi_sector =
> +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> +				bio->bi_end_io = btrfs_encoded_read_endio;
> +				bio->bi_private = &priv;
> +				bio->bi_opf = REQ_OP_READ;
> +			}
> +
> +			if (!bytes ||
> +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> +				blk_status_t status;
> +
> +				status = submit_encoded_read_bio(inode, bio, 0,
> +								 0);
> +				if (status) {
> +					WRITE_ONCE(priv.status, status);
> +					bio_put(bio);
> +					goto out;
> +				}
> +				bio = NULL;
> +				continue;
> +			}
> +
> +			i++;
> +			cur += bytes;
> +			remaining -= bytes;
> +		}
> +	}
> +
> +out:
> +	if (atomic_dec_return(&priv.pending))
> +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> +	/* See btrfs_encoded_read_endio() for ordering. */
> +	return blk_status_to_errno(READ_ONCE(priv.status));
> +}
> +
> +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> +					  struct iov_iter *iter,
> +					  u64 start, u64 lockend,
> +					  struct extent_state **cached_state,
> +					  u64 disk_bytenr, u64 disk_io_size,
> +					  size_t count, bool compressed,
> +					  bool *unlocked)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct page **pages;
> +	unsigned long nr_pages, i;
> +	u64 cur;
> +	size_t page_offset;
> +	ssize_t ret;
> +
> +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);

No HIGHMEM please.

> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +	}
> +
> +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> +						    disk_io_size, pages);
> +	if (ret)
> +		goto out;
> +
> +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> +	inode_unlock_shared(inode);
> +	*unlocked = true;
> +
> +	if (compressed) {
> +		i = 0;
> +		page_offset = 0;
> +	} else {
> +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> +	}
> +	cur = 0;
> +	while (cur < count) {
> +		size_t bytes = min_t(size_t, count - cur,
> +				     PAGE_SIZE - page_offset);
> +
> +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> +				      iter) != bytes) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		i++;
> +		cur += bytes;
> +		page_offset = 0;
> +	}
> +	ret = count;
> +out:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			__free_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
> +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> +			   struct btrfs_ioctl_encoded_io_args *encoded)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	ssize_t ret;
> +	size_t count = iov_iter_count(iter);
> +	u64 start, lockend, disk_bytenr, disk_io_size;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em;
> +	bool unlocked = false;
> +
> +	file_accessed(iocb->ki_filp);
> +
> +	inode_lock_shared(inode);

We have helpers for inode locking now, btrfs_inode_lock, that take
additional parameter for cases where we want to exclude certain
locking combinations.

This also raises the question of whether encoded read/write should be
excluded against other operations, e.g. the way deduplication has to be
excluded against mmap.

> +
> +	if (iocb->ki_pos >= inode->i_size) {
> +		inode_unlock_shared(inode);
> +		return 0;
> +	}
> +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> +	/*
> +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> +	 * it's compressed we know that it won't be longer than this.
> +	 */
> +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		ret = btrfs_wait_ordered_range(inode, start,
> +					       lockend - start + 1);
> +		if (ret)
> +			goto out_unlock_inode;
> +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     lockend - start + 1);
> +		if (!ordered)
> +			break;
> +		btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		cond_resched();
> +	}
> +
> +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> +			      lockend - start + 1);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_unlock_extent;
> +	}
> +
> +	if (em->block_start == EXTENT_MAP_INLINE) {
> +		u64 extent_start = em->start;
> +
> +		/*
> +		 * For inline extents we get everything we need out of the
> +		 * extent item.
> +		 */
> +		free_extent_map(em);
> +		em = NULL;
> +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> +						&cached_state, extent_start,
> +						count, encoded, &unlocked);
> +		goto out;
> +	}
> +
> +	/*
> +	 * We only want to return up to EOF even if the extent extends beyond
> +	 * that.
> +	 */
> +	encoded->len = (min_t(u64, extent_map_end(em), inode->i_size) -
> +			iocb->ki_pos);

no outer ( )

> +	if (em->block_start == EXTENT_MAP_HOLE ||
> +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> +		disk_bytenr = EXTENT_MAP_HOLE;
> +		encoded->len = encoded->unencoded_len = count =
> +			min_t(u64, count, encoded->len);

No chained initializations

> +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +		disk_bytenr = em->block_start;
> +		/*
> +		 * Bail if the buffer isn't large enough to return the whole
> +		 * compressed extent.
> +		 */
> +		if (em->block_len > count) {
> +			ret = -ENOBUFS;
> +			goto out_em;
> +		}
> +		disk_io_size = count = em->block_len;
> +		encoded->unencoded_len = em->ram_bytes;
> +		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
> +		ret = btrfs_encoded_io_compression_from_extent(
> +							     em->compress_type);
> +		if (ret < 0)
> +			goto out_em;
> +		encoded->compression = ret;
> +	} else {
> +		disk_bytenr = em->block_start + (start - em->start);
> +		if (encoded->len > count)
> +			encoded->len = count;
> +		/*
> +		 * Don't read beyond what we locked. This also limits the page
> +		 * allocations that we'll do.
> +		 */
> +		disk_io_size = min(lockend + 1,
> +				   iocb->ki_pos + encoded->len) - start;
> +		encoded->len = encoded->unencoded_len = count =
> +			start + disk_io_size - iocb->ki_pos;
> +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> +	}
> +	free_extent_map(em);
> +	em = NULL;
> +
> +	if (disk_bytenr == EXTENT_MAP_HOLE) {
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +		inode_unlock_shared(inode);
> +		unlocked = true;
> +		ret = iov_iter_zero(count, iter);
> +		if (ret != count)
> +			ret = -EFAULT;
> +	} else {
> +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> +						 &cached_state, disk_bytenr,
> +						 disk_io_size, count,
> +						 encoded->compression,
> +						 &unlocked);
> +	}
> +
> +out:
> +	if (ret >= 0)
> +		iocb->ki_pos += encoded->len;
> +out_em:
> +	free_extent_map(em);
> +out_unlock_extent:
> +	if (!unlocked)
> +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> +out_unlock_inode:
> +	if (!unlocked)
> +		inode_unlock_shared(inode);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 05c77a1979a9..f0c575223d88 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -28,6 +28,7 @@
>  #include <linux/iversion.h>
>  #include <linux/fileattr.h>
>  #include <linux/fsverity.h>
> +#include <linux/sched/xacct.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "export.h"
> @@ -88,6 +89,22 @@ struct btrfs_ioctl_send_args_32 {
>  
>  #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
>  			       struct btrfs_ioctl_send_args_32)
> +
> +struct btrfs_ioctl_encoded_io_args_32 {
> +	compat_uptr_t iov;
> +	compat_ulong_t iovcnt;
> +	__s64 offset;
> +	__u64 flags;
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;
> +	__u32 reserved[8];
> +};
> +
> +#define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
> +				       struct btrfs_ioctl_encoded_io_args_32)
>  #endif
>  
>  /* Mask out flags that are inappropriate for the given type of inode. */
> @@ -4861,6 +4878,89 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
>  	return ret;
>  }
>  
> +static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
> +				    bool compat)
> +{
> +	struct btrfs_ioctl_encoded_io_args args = {};
> +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> +					     flags);
> +	size_t copy_end;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +	loff_t pos;
> +	struct kiocb kiocb;
> +	ssize_t ret;
> +
> +	if (!capable(CAP_SYS_ADMIN)) {
> +		ret = -EPERM;
> +		goto out_acct;
> +	}
> +
> +	if (compat) {
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +		struct btrfs_ioctl_encoded_io_args_32 args32;
> +
> +		copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
> +				       flags);
> +		if (copy_from_user(&args32, argp, copy_end)) {
> +			ret = -EFAULT;
> +			goto out_acct;
> +		}
> +		args.iov = compat_ptr(args32.iov);
> +		args.iovcnt = args32.iovcnt;
> +		args.offset = args32.offset;
> +		args.flags = args32.flags;
> +#else
> +		return -ENOTTY;
> +#endif
> +	} else {
> +		copy_end = copy_end_kernel;
> +		if (copy_from_user(&args, argp, copy_end)) {
> +			ret = -EFAULT;
> +			goto out_acct;
> +		}
> +	}
> +	if (args.flags != 0) {
> +		ret = -EINVAL;
> +		goto out_acct;
> +	}
> +
> +	ret = import_iovec(READ, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
> +			   &iov, &iter);
> +	if (ret < 0)
> +		goto out_acct;
> +
> +	if (iov_iter_count(&iter) == 0) {
> +		ret = 0;
> +		goto out_iov;
> +	}
> +	pos = args.offset;
> +	ret = rw_verify_area(READ, file, &pos, args.len);
> +	if (ret < 0)
> +		goto out_iov;
> +
> +	init_sync_kiocb(&kiocb, file);
> +	kiocb.ki_pos = pos;
> +
> +	ret = btrfs_encoded_read(&kiocb, &iter, &args);
> +	if (ret >= 0) {
> +		fsnotify_access(file);
> +		if (copy_to_user(argp + copy_end,
> +				 (char *)&args + copy_end_kernel,
> +				 sizeof(args) - copy_end_kernel))
> +			ret = -EFAULT;
> +	}
> +
> +out_iov:
> +	kfree(iov);
> +out_acct:
> +	if (ret > 0)
> +		add_rchar(current, ret);
> +	inc_syscr(current);
> +	return ret;
> +}
> +
>  long btrfs_ioctl(struct file *file, unsigned int
>  		cmd, unsigned long arg)
>  {
> @@ -5005,6 +5105,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return fsverity_ioctl_enable(file, (const void __user *)argp);
>  	case FS_IOC_MEASURE_VERITY:
>  		return fsverity_ioctl_measure(file, argp);
> +	case BTRFS_IOC_ENCODED_READ:
> +		return btrfs_ioctl_encoded_read(file, argp, false);
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +	case BTRFS_IOC_ENCODED_READ_32:
> +		return btrfs_ioctl_encoded_read(file, argp, true);
> +#endif
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.34.0
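
The copy-out at the end of btrfs_ioctl_encoded_read() above relies on the
native and compat argument structs differing only in the input fields up to
and including flags, so that the output tail (len through reserved) has the
same size in both. The following userspace sketch checks that property. The
native layout here is an assumption that mirrors the compat struct from the
patch with pointer-sized iov and iovcnt; the names encoded_io_args,
encoded_io_args_32, and OFFSETOFEND are hypothetical stand-ins, not the
kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same idea as the kernel's offsetofend(): the offset just past MEMBER. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Assumed native layout: iov and iovcnt are pointer/long sized. */
struct encoded_io_args {
	const void *iov;
	unsigned long iovcnt;
	int64_t offset;
	uint64_t flags;
	/* Output fields filled in by the read. */
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
	uint32_t reserved[8];
};

/* Compat layout from the patch, modelling compat_uptr_t/compat_ulong_t as u32. */
struct encoded_io_args_32 {
	uint32_t iov;
	uint32_t iovcnt;
	int64_t offset;
	uint64_t flags;
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
	uint32_t reserved[8];
};

/* Size of the output tail that follows flags in the native struct. */
static size_t native_tail_size(void)
{
	return sizeof(struct encoded_io_args) -
	       OFFSETOFEND(struct encoded_io_args, flags);
}

/* Size of the output tail that follows flags in the compat struct. */
static size_t compat_tail_size(void)
{
	return sizeof(struct encoded_io_args_32) -
	       OFFSETOFEND(struct encoded_io_args_32, flags);
}

/* One copy length can serve both callers only if the tails match. */
static_assert(sizeof(struct encoded_io_args) -
	      OFFSETOFEND(struct encoded_io_args, flags) ==
	      sizeof(struct encoded_io_args_32) -
	      OFFSETOFEND(struct encoded_io_args_32, flags),
	      "output tails must have the same size");
```

With these layouts only the head (iov through flags) changes size between the
two structs, so a copy that starts at argp + copy_end and takes its length
from the native struct covers exactly the compat caller's output fields too.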

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2022-01-24 21:54   ` David Sterba
@ 2022-01-24 22:33     ` Omar Sandoval
  0 siblings, 0 replies; 50+ messages in thread
From: Omar Sandoval @ 2022-01-24 22:33 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Mon, Jan 24, 2022 at 10:54:35PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> 
> > - We don't do read repair, because it turns out that read repair is
> >   currently broken for compressed data.
> 
> Is there a reproducer, and a fix?

Reproducer:

# Create a filesystem with data duplicated at 2 physical locations on the same
# disk.
$ mkfs.btrfs -f -d dup /dev/vdb
# Mount it with compression enabled.
$ mount -o compress /dev/vdb /mnt
# Write some compressible data (octal dump of random data).
$ dd if=/dev/urandom bs=4k count=1 | od > /mnt/foo
# Force it on disk.
$ sync
# Get the locations it was written to on disk.
$ ~/repos/osandov-linux/scripts/btrfs_map_physical /mnt/foo | column -ts $'\t'
FILE OFFSET  FILE SIZE  EXTENT OFFSET  EXTENT TYPE                   LOGICAL SIZE  LOGICAL OFFSET  PHYSICAL SIZE  DEVID  PHYSICAL OFFSET
0            20480      0              regular,compression=zlib,dup  20480         298844160       8192           1      575668224
                                                                                                                  1      1005125632
# Corrupt one of the copies.
$ dd if=/dev/zero of=/dev/vdb bs=4k count=1 seek=575668224 oflag=seek_bytes
$ sync
# Now, re-read the file until we read it from the corrupted copy. To make sure
# we read from disk, we drop the page cache between each read.
$ while ! btrfs device stats /dev/vdb | grep -q 'corruption_errs\s\+[1-9]'; do echo 1 > /proc/sys/vm/drop_caches; cat /mnt/foo > /dev/null; done
$ dmesg | tail
[ 3240.222922] BTRFS info (device vdb): has skinny extents
[ 3240.235245] BTRFS info (device vdb): checking UUID tree
[ 3298.885372] bash (481): drop_caches: 1
[ 3298.924648] BTRFS warning (device vdb): csum failed root 5 ino 257 off 575676416 csum 0x8941f998 expected csum 0x05a3d0cd mirror 1
[ 3298.924657] BTRFS error (device vdb): bdev /dev/vdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 3298.926221] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 1124352)
[ 3298.926473] BTRFS info (device vdb): read error corrected: ino 257 off 4096 (dev /dev/vdb sector 1124352)
[ 3298.926516] BTRFS info (device vdb): read error corrected: ino 257 off 16384 (dev /dev/vdb sector 1124352)
[ 3298.926555] BTRFS info (device vdb): read error corrected: ino 257 off 8192 (dev /dev/vdb sector 1124352)
[ 3298.926614] BTRFS info (device vdb): read error corrected: ino 257 off 12288 (dev /dev/vdb sector 1124352)
# Now check that the copies match.
$ dd if=/dev/vdb bs=4k count=1 skip=575668224 iflag=skip_bytes status=none | sha256sum
6ffd32b49d77b9e4ae07fd1c598b8407bc4cbb2fdb7244420589703f35605996  -
$ dd if=/dev/vdb bs=4k count=1 skip=1005125632 iflag=skip_bytes status=none | sha256sum
3646164006a08d908f5cbd6131ce413c8de49566560bf7ac6bc9432ac792605d  -
# Oops, they don't. Check the corrupted copy.
$ dd if=/dev/vdb bs=4k count=1 skip=575668224 iflag=skip_bytes status=none | head
0006000 006760 154362 010427 177511 151544 074422 116513 105472
0006020 045224 041623 063647 150022 006147 155332 053640 077304
0006040 173561 102237 155233 021641 165413 114564 004351 006141
0006060 075766 007723 017005 142265 175347 110221 071117 004421
0006100 045713 040102 005447 127414 173546 075206 042537 176547
0006120 120721 117062 177257 171012 130114 031767 165144 103776
0006140 054556 152447 123212 023000 062570 103502 057476 065541
0006160 064364 151221 117125 133463 076760 133756 133026 171622
0006200 164705 031677 005544 027201 130024 177437 102433 005170
0006220 144353 136645 147416 035017 173440 121605 050052 050325
# It contains decompressed data, and not even for the correct offset: this is
# the 4th block of the file.

I don't have a fix. Rohit Singh from my team is ramping up and working
on a fix.

This bug has been present as far back as I could follow the history. I
think it's just been masked by the fact that 1) repairing the bad copies
with bad data doesn't make things "worse" and 2) the page cache caches
the good data, so once you've repaired it once, you don't see it again
until you reboot or the page is evicted.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 12/17] btrfs: send: fix maximum command numbering
  2021-11-18 18:54     ` Omar Sandoval
  2021-12-09 18:08       ` Omar Sandoval
@ 2022-01-24 22:40       ` David Sterba
  1 sibling, 0 replies; 50+ messages in thread
From: David Sterba @ 2022-01-24 22:40 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: dsterba, linux-btrfs, kernel-team

On Thu, Nov 18, 2021 at 10:54:16AM -0800, Omar Sandoval wrote:
> On Thu, Nov 18, 2021 at 03:23:59PM +0100, David Sterba wrote:
> > On Wed, Nov 17, 2021 at 12:19:22PM -0800, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added
> > > > __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
> > > version plus 1, but as written this creates gaps in the number space.
> > > The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
> > > accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
> > > has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
> > > 23 and 24 are valid commands.
> > 
> > The MAX definitions have the __ prefix so they're private and not meant
> > to be used as proper commands, so nothing should suggest there are any
> > commands with numbers 23 to 25 in the example.
> > 
> > > Instead, let's explicitly set BTRFS_SEND_C_MAX_V* to the maximum command
> > > number. This requires repeating the command name, but it has a clearer
> > > meaning and avoids gaps. It also doesn't require updating
> > > __BTRFS_SEND_C_MAX for every new version.
> > 
> > It's probably a matter of taste, I'd intentionally avoid the pattern
> > above, ie. repeating the previous command to define max.
> > 
> > > --- a/fs/btrfs/send.c
> > > +++ b/fs/btrfs/send.c
> > > @@ -316,8 +316,8 @@ __maybe_unused
> > >  static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd)
> > >  {
> > >  	switch (sctx->proto) {
> > > -	case 1:	 return cmd < __BTRFS_SEND_C_MAX_V1;
> > > -	case 2:	 return cmd < __BTRFS_SEND_C_MAX_V2;
> > > +	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
> > > +	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
> > 
> > This seems to be the only practical difference, < or <= .
> 
> There is another practical difference, which is more significant in my
> opinion: the linear style creates "gaps" in the valid commands. Consider
> this, with explicit values added for clarity:
> 
> enum btrfs_send_cmd {
>         BTRFS_SEND_C_UNSPEC = 0,
> 
>         /* Version 1 */
>         BTRFS_SEND_C_SUBVOL = 1,
>         BTRFS_SEND_C_SNAPSHOT = 2,
> 
>         BTRFS_SEND_C_MKFILE = 3,
>         BTRFS_SEND_C_MKDIR = 4,
>         BTRFS_SEND_C_MKNOD = 5,
>         BTRFS_SEND_C_MKFIFO = 6,
>         BTRFS_SEND_C_MKSOCK = 7,
>         BTRFS_SEND_C_SYMLINK = 8,
> 
>         BTRFS_SEND_C_RENAME = 9,
>         BTRFS_SEND_C_LINK = 10,
>         BTRFS_SEND_C_UNLINK = 11,
>         BTRFS_SEND_C_RMDIR = 12,
> 
>         BTRFS_SEND_C_SET_XATTR = 13,
>         BTRFS_SEND_C_REMOVE_XATTR = 14,
> 
>         BTRFS_SEND_C_WRITE = 15,
>         BTRFS_SEND_C_CLONE = 16,
> 
>         BTRFS_SEND_C_TRUNCATE = 17,
>         BTRFS_SEND_C_CHMOD = 18,
>         BTRFS_SEND_C_CHOWN = 19,
>         BTRFS_SEND_C_UTIMES = 20,
> 
>         BTRFS_SEND_C_END = 21,
>         BTRFS_SEND_C_UPDATE_EXTENT = 22,
>         __BTRFS_SEND_C_MAX_V1 = 23,
> 
>         /* Version 2 */
>         BTRFS_SEND_C_FALLOCATE = 24,
>         BTRFS_SEND_C_SETFLAGS = 25,
>         BTRFS_SEND_C_ENCODED_WRITE = 26,
>         __BTRFS_SEND_C_MAX_V2 = 27,
> 
>         /* End */
>         __BTRFS_SEND_C_MAX = 28,
> };
> #define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1) /* 27 */

So as a compromise that avoids both the gaps and repeating the last command
name in the definition, let's do it similarly to your example: number the
commands explicitly, so that the number is repeated for the MAX constants.
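
A minimal sketch of what that compromise could look like, assuming the v2
commands simply continue the numbering after 22 (the values 23 to 25 are
inferred from "avoid gaps", not taken from a posted patch). Most v1 commands
are elided, and proto_cmd_ok() takes the protocol version directly instead of
a send_ctx to keep the example self-contained:

```c
#include <assert.h>
#include <stdbool.h>

enum btrfs_send_cmd {
	BTRFS_SEND_C_UNSPEC = 0,

	/* Version 1 (commands 2..20 elided for brevity) */
	BTRFS_SEND_C_SUBVOL = 1,
	BTRFS_SEND_C_END = 21,
	BTRFS_SEND_C_UPDATE_EXTENT = 22,
	BTRFS_SEND_C_MAX_V1 = 22,	/* number repeated, not the name */

	/* Version 2 (assumed numbering, continuing without a gap) */
	BTRFS_SEND_C_FALLOCATE = 23,
	BTRFS_SEND_C_SETFLAGS = 24,
	BTRFS_SEND_C_ENCODED_WRITE = 25,
	BTRFS_SEND_C_MAX_V2 = 25,

	/* End */
	BTRFS_SEND_C_MAX = 25,
};

/* The MAX constants are real command numbers, so no value is skipped. */
static_assert(BTRFS_SEND_C_MAX_V1 + 1 == BTRFS_SEND_C_FALLOCATE,
	      "v2 commands must start right after the last v1 command");

/* Inclusive comparison, because MAX_V* is itself a valid command. */
static bool proto_cmd_ok(int proto, int cmd)
{
	switch (proto) {
	case 1:	 return cmd <= BTRFS_SEND_C_MAX_V1;
	case 2:	 return cmd <= BTRFS_SEND_C_MAX_V2;
	default: return false;
	}
}
```

The cost is that adding a command means touching two numbers (the command and
the matching MAX constant), but nothing in the enum has to name the previous
command twice.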

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2022-01-24 22:26   ` David Sterba
@ 2022-01-25 21:26     ` Omar Sandoval
  2022-02-08 20:08       ` Omar Sandoval
  2022-02-10 18:42       ` David Sterba
  0 siblings, 2 replies; 50+ messages in thread
From: Omar Sandoval @ 2022-01-25 21:26 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Mon, Jan 24, 2022 at 11:26:32PM +0100, David Sterba wrote:
> On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There are 4 main cases:
> > 
> > 1. Inline extents: we copy the data straight out of the extent buffer.
> > 2. Hole/preallocated extents: we fill in zeroes.
> > 3. Regular, uncompressed extents: we read the sectors we need directly
> >    from disk.
> > 4. Regular, compressed extents: we read the entire compressed extent
> >    from disk and indicate what subset of the decompressed extent is in
> >    the file.
> > 
> > This initial implementation simplifies a few things that can be improved
> > in the future:
> > 
> > - We hold the inode lock during the operation.
> > - Cases 1, 3, and 4 allocate temporary memory to read into before
> >   copying out to userspace.
> > - We don't do read repair, because it turns out that read repair is
> >   currently broken for compressed data.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/ctree.h |   4 +
> >  fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/ioctl.c | 106 ++++++++++
> >  3 files changed, 606 insertions(+)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 2e7f74060a14..70034e33abe6 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
> >  void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
> >  					  struct page *page, u64 start,
> >  					  u64 end, bool uptodate);
> > +struct btrfs_ioctl_encoded_io_args;
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > +			   struct btrfs_ioctl_encoded_io_args *encoded);
> > +
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  extern const struct iomap_ops btrfs_dio_iomap_ops;
> >  extern const struct iomap_dio_ops btrfs_dio_ops;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index c2efea101f61..d29e968fd18b 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -10525,6 +10525,502 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
> >  	}
> >  }
> >  
> > +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> > +{
> > +	switch (compress_type) {
> > +	case BTRFS_COMPRESS_NONE:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> > +	case BTRFS_COMPRESS_ZLIB:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> > +	case BTRFS_COMPRESS_LZO:
> > +		/*
> > +		 * The LZO format depends on the page size. 64k is the maximum
> > +		 * sectorsize (and thus page size) that we support.
> > +		 */
> > +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> > +			return -EINVAL;
> > +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> > +	case BTRFS_COMPRESS_ZSTD:
> > +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> > +	default:
> > +		return -EUCLEAN;
> > +	}
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_inline(
> > +				struct kiocb *iocb,
> > +				struct iov_iter *iter, u64 start,
> > +				u64 lockend,
> > +				struct extent_state **cached_state,
> > +				u64 extent_start, size_t count,
> > +				struct btrfs_ioctl_encoded_io_args *encoded,
> > +				bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> 
> Please use btrfs_inode in all internal helpers, either as parameters or
> as local variables, to avoid the BTRFS_I and btrfs_sb conversions everywhere.
> 
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_path *path;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_file_extent_item *item;
> > +	u64 ram_bytes;
> > +	unsigned long ptr;
> > +	void *tmp;
> > +	ssize_t ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > +				       0);
> > +	if (ret) {
> > +		if (ret > 0) {
> > +			/* The extent item disappeared? */
> > +			ret = -EIO;
> > +		}
> > +		goto out;
> > +	}
> > +	leaf = path->nodes[0];
> > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > +			      struct btrfs_file_extent_item);
> > +
> > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > +	ptr = btrfs_file_extent_inline_start(item);
> > +
> > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > +			iocb->ki_pos);
> 
> No need for the outer ( )
> 
> > +	ret = btrfs_encoded_io_compression_from_extent(
> > +				 btrfs_file_extent_compression(leaf, item));
> > +	if (ret < 0)
> > +		goto out;
> > +	encoded->compression = ret;
> > +	if (encoded->compression) {
> > +		size_t inline_size;
> > +
> > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > +								path->slots[0]);
> > +		if (inline_size > count) {
> > +			ret = -ENOBUFS;
> > +			goto out;
> > +		}
> > +		count = inline_size;
> > +		encoded->unencoded_len = ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > +	} else {
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> 
> I'm sure I have commented on that in the past; please don't use chained
> initializations. In this case something like:
> 
> 		count = min_t(u64, count, encoded->len);
> 		encoded->len = count;
> 		encoded->unencoded_len = count;
> 
> > +		ptr += iocb->ki_pos - extent_start;
> > +	}
> > +
> > +	tmp = kmalloc(count, GFP_NOFS);
> > +	if (!tmp) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	read_extent_buffer(leaf, tmp, ptr, count);
> > +	btrfs_release_path(path);
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	ret = copy_to_iter(tmp, count, iter);
> > +	if (ret != count)
> > +		ret = -EFAULT;
> > +	kfree(tmp);
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +struct btrfs_encoded_read_private {
> > +	struct inode *inode;
> 
> This should also be btrfs_inode
> 
> > +	u64 file_offset;
> > +	wait_queue_head_t wait;
> > +	atomic_t pending;
> > +	blk_status_t status;
> > +	bool skip_csum;
> > +};
> > +
> > +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> 
> struct btrfs_inode
> 
> > +					    struct bio *bio, int mirror_num,
> > +					    unsigned long bio_flags)
> 
> bio_flags is unused here (and in the encoded patches afaics)
> 
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	blk_status_t ret;
> > +
> > +	if (!priv->skip_csum) {
> > +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > +	if (ret) {
> > +		btrfs_bio_free_csum(bbio);
> > +		return ret;
> > +	}
> > +
> > +	atomic_inc(&priv->pending);
> > +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> > +	if (ret) {
> > +		atomic_dec(&priv->pending);
> > +		btrfs_bio_free_csum(bbio);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> > +{
> > +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
> 
> 	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);
> 
> > +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	u32 sectorsize = fs_info->sectorsize;
> > +	struct bio_vec *bvec;
> > +	struct bvec_iter_all iter_all;
> > +	u64 start = priv->file_offset;
> > +	u32 bio_offset = 0;
> > +
> > +	if (priv->skip_csum || !uptodate)
> > +		return bbio->bio.bi_status;
> > +
> > +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> > +		unsigned int i, nr_sectors, pgoff;
> > +
> > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > +		pgoff = bvec->bv_offset;
> > +		for (i = 0; i < nr_sectors; i++) {
> > +			ASSERT(pgoff < PAGE_SIZE);
> > +			if (check_data_csum(inode, bbio, bio_offset,
> > +					    bvec->bv_page, pgoff, start))
> > +				return BLK_STS_IOERR;
> > +			start += sectorsize;
> > +			bio_offset += sectorsize;
> > +			pgoff += sectorsize;
> > +		}
> > +	}
> > +	return BLK_STS_OK;
> > +}
> > +
> > +static void btrfs_encoded_read_endio(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > +	blk_status_t status;
> > +
> > +	status = btrfs_encoded_read_verify_csum(bbio);
> > +	if (status) {
> > +		/*
> > +		 * The memory barrier implied by the atomic_dec_return() here
> > +		 * pairs with the memory barrier implied by the
> > +		 * atomic_dec_return() or io_wait_event() in
> > +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> > +		 * write is observed before the load of status in
> > +		 * btrfs_encoded_read_regular_fill_pages().
> > +		 */
> > +		WRITE_ONCE(priv->status, status);
> > +	}
> > +	if (!atomic_dec_return(&priv->pending))
> > +		wake_up(&priv->wait);
> > +	btrfs_bio_free_csum(bbio);
> > +	bio_put(bio);
> > +}
> > +
> > +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> > +						 u64 file_offset,
> > +						 u64 disk_bytenr,
> > +						 u64 disk_io_size,
> > +						 struct page **pages)
> > +{
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_encoded_read_private priv = {
> > +		.inode = inode,
> > +		.file_offset = file_offset,
> > +		.pending = ATOMIC_INIT(1),
> > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > +	};
> > +	unsigned long i = 0;
> > +	u64 cur = 0;
> > +	int ret;
> > +
> > +	init_waitqueue_head(&priv.wait);
> > +	/*
> > +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> > +	 * necessary.
> > +	 */
> > +	while (cur < disk_io_size) {
> > +		struct extent_map *em;
> > +		struct btrfs_io_geometry geom;
> > +		struct bio *bio = NULL;
> > +		u64 remaining;
> > +
> > +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> > +					 disk_io_size - cur);
> > +		if (IS_ERR(em)) {
> > +			ret = PTR_ERR(em);
> > +		} else {
> > +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> > +						    disk_bytenr + cur, &geom);
> > +			free_extent_map(em);
> > +		}
> > +		if (ret) {
> > +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> > +			break;
> > +		}
> > +		remaining = min(geom.len, disk_io_size - cur);
> > +		while (bio || remaining) {
> > +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> > +
> > +			if (!bio) {
> > +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> > +				bio->bi_iter.bi_sector =
> > +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> > +				bio->bi_end_io = btrfs_encoded_read_endio;
> > +				bio->bi_private = &priv;
> > +				bio->bi_opf = REQ_OP_READ;
> > +			}
> > +
> > +			if (!bytes ||
> > +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > +				blk_status_t status;
> > +
> > +				status = submit_encoded_read_bio(inode, bio, 0,
> > +								 0);
> > +				if (status) {
> > +					WRITE_ONCE(priv.status, status);
> > +					bio_put(bio);
> > +					goto out;
> > +				}
> > +				bio = NULL;
> > +				continue;
> > +			}
> > +
> > +			i++;
> > +			cur += bytes;
> > +			remaining -= bytes;
> > +		}
> > +	}
> > +
> > +out:
> > +	if (atomic_dec_return(&priv.pending))
> > +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> > +	/* See btrfs_encoded_read_endio() for ordering. */
> > +	return blk_status_to_errno(READ_ONCE(priv.status));
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  u64 start, u64 lockend,
> > +					  struct extent_state **cached_state,
> > +					  u64 disk_bytenr, u64 disk_io_size,
> > +					  size_t count, bool compressed,
> > +					  bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct page **pages;
> > +	unsigned long nr_pages, i;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> 
> No HIGHMEM please.
> 
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> > +						    disk_io_size, pages);
> > +	if (ret)
> > +		goto out;
> > +
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock_shared(inode);
> > +	*unlocked = true;
> > +
> > +	if (compressed) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > +	}
> > +	cur = 0;
> > +	while (cur < count) {
> > +		size_t bytes = min_t(size_t, count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > +				      iter) != bytes) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = count;
> > +out:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			__free_page(pages[i]);
> > +	}
> > +	kfree(pages);
> > +	return ret;
> > +}
> > +
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > +			   struct btrfs_ioctl_encoded_io_args *encoded)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	ssize_t ret;
> > +	size_t count = iov_iter_count(iter);
> > +	u64 start, lockend, disk_bytenr, disk_io_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	bool unlocked = false;
> > +
> > +	file_accessed(iocb->ki_filp);
> > +
> > +	inode_lock_shared(inode);
> 
> We have helpers for inode locking now, btrfs_inode_lock, that take an
> additional parameter for cases where we want to exclude certain
> locking combinations.
> 
> Which also raises the question of whether encoded read/write should be
> excluded against some other operations, e.g. the way deduplication has to
> be excluded against mmap.

I _think_ we want to exclude mmap, but I need to think some more about
it. Other than this comment, I believe I've addressed your other
outstanding comments and rebased my branches:

https://github.com/osandov/linux/tree/btrfs-send-encoded
https://github.com/osandov/btrfs-progs/tree/send-encoded

I'm guessing you're still looking at the send parts, so I'll figure out
mmap and await further comments before resending.
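
For readers following the quoted helper
btrfs_encoded_io_compression_from_extent(): its LZO branch picks a constant
per page size by adding (PAGE_SHIFT - 12) to the 4K constant. The userspace
sketch below models just that arithmetic. The enum name and numeric values
are illustrative assumptions (only the "consecutive, one per power-of-two
page size from 4K to 64K" property matters), and the function takes the page
shift as a parameter instead of using the kernel's PAGE_SHIFT:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical values; assumed consecutive from the 4K variant upward. */
enum enc_compression {
	ENC_COMPRESSION_LZO_4K = 3,
	ENC_COMPRESSION_LZO_8K = 4,
	ENC_COMPRESSION_LZO_16K = 5,
	ENC_COMPRESSION_LZO_32K = 6,
	ENC_COMPRESSION_LZO_64K = 7,
};

/* The shift-based mapping only works if the constants are consecutive. */
static_assert(ENC_COMPRESSION_LZO_64K - ENC_COMPRESSION_LZO_4K == 16 - 12,
	      "one LZO constant per page shift from 12 (4K) to 16 (64K)");

/* Map a page shift to the matching LZO constant, or -EINVAL if out of range. */
static int lzo_compression_for_page_shift(unsigned int page_shift)
{
	if (page_shift < 12 || page_shift > 16)
		return -EINVAL;
	return ENC_COMPRESSION_LZO_4K + (int)(page_shift - 12);
}
```

The range check plays the same role as the PAGE_SIZE bounds check in the
quoted code: page sizes outside 4K to 64K have no corresponding constant.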

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2022-01-25 21:26     ` Omar Sandoval
@ 2022-02-08 20:08       ` Omar Sandoval
  2022-02-10 18:38         ` David Sterba
  2022-02-10 18:42       ` David Sterba
  1 sibling, 1 reply; 50+ messages in thread
From: Omar Sandoval @ 2022-02-08 20:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team

On Tue, Jan 25, 2022 at 01:26:21PM -0800, Omar Sandoval wrote:
> On Mon, Jan 24, 2022 at 11:26:32PM +0100, David Sterba wrote:
> > On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > There are 4 main cases:
> > > 
> > > 1. Inline extents: we copy the data straight out of the extent buffer.
> > > 2. Hole/preallocated extents: we fill in zeroes.
> > > 3. Regular, uncompressed extents: we read the sectors we need directly
> > >    from disk.
> > > 4. Regular, compressed extents: we read the entire compressed extent
> > >    from disk and indicate what subset of the decompressed extent is in
> > >    the file.
> > > 
> > > This initial implementation simplifies a few things that can be improved
> > > in the future:
> > > 
> > > - We hold the inode lock during the operation.
> > > - Cases 1, 3, and 4 allocate temporary memory to read into before
> > >   copying out to userspace.
> > > - We don't do read repair, because it turns out that read repair is
> > >   currently broken for compressed data.
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  fs/btrfs/ctree.h |   4 +
> > >  fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/btrfs/ioctl.c | 106 ++++++++++
> > >  3 files changed, 606 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 2e7f74060a14..70034e33abe6 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
> > >  void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
> > >  					  struct page *page, u64 start,
> > >  					  u64 end, bool uptodate);
> > > +struct btrfs_ioctl_encoded_io_args;
> > > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > > +			   struct btrfs_ioctl_encoded_io_args *encoded);
> > > +
> > >  extern const struct dentry_operations btrfs_dentry_operations;
> > >  extern const struct iomap_ops btrfs_dio_iomap_ops;
> > >  extern const struct iomap_dio_ops btrfs_dio_ops;
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index c2efea101f61..d29e968fd18b 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -10525,6 +10525,502 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
> > >  	}
> > >  }
> > >  
> > > +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> > > +{
> > > +	switch (compress_type) {
> > > +	case BTRFS_COMPRESS_NONE:
> > > +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> > > +	case BTRFS_COMPRESS_ZLIB:
> > > +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> > > +	case BTRFS_COMPRESS_LZO:
> > > +		/*
> > > +		 * The LZO format depends on the page size. 64k is the maximum
> > > +		 * sectorsize (and thus page size) that we support.
> > > +		 */
> > > +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> > > +			return -EINVAL;
> > > +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> > > +	case BTRFS_COMPRESS_ZSTD:
> > > +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> > > +	default:
> > > +		return -EUCLEAN;
> > > +	}
> > > +}
> > > +
> > > +static ssize_t btrfs_encoded_read_inline(
> > > +				struct kiocb *iocb,
> > > +				struct iov_iter *iter, u64 start,
> > > +				u64 lockend,
> > > +				struct extent_state **cached_state,
> > > +				u64 extent_start, size_t count,
> > > +				struct btrfs_ioctl_encoded_io_args *encoded,
> > > +				bool *unlocked)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > 
> > Please use btrfs_inode in all internal helpers, either as parameters or
> > as local variables, to avoid the BTRFS_I and btrfs_sb conversions everywhere.
> > 
> > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > +	struct btrfs_path *path;
> > > +	struct extent_buffer *leaf;
> > > +	struct btrfs_file_extent_item *item;
> > > +	u64 ram_bytes;
> > > +	unsigned long ptr;
> > > +	void *tmp;
> > > +	ssize_t ret;
> > > +
> > > +	path = btrfs_alloc_path();
> > > +	if (!path) {
> > > +		ret = -ENOMEM;
> > > +		goto out;
> > > +	}
> > > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > > +				       0);
> > > +	if (ret) {
> > > +		if (ret > 0) {
> > > +			/* The extent item disappeared? */
> > > +			ret = -EIO;
> > > +		}
> > > +		goto out;
> > > +	}
> > > +	leaf = path->nodes[0];
> > > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > > +			      struct btrfs_file_extent_item);
> > > +
> > > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > > +	ptr = btrfs_file_extent_inline_start(item);
> > > +
> > > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > > +			iocb->ki_pos);
> > 
> > No need for the outer ( )
> > 
> > > +	ret = btrfs_encoded_io_compression_from_extent(
> > > +				 btrfs_file_extent_compression(leaf, item));
> > > +	if (ret < 0)
> > > +		goto out;
> > > +	encoded->compression = ret;
> > > +	if (encoded->compression) {
> > > +		size_t inline_size;
> > > +
> > > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > > +								path->slots[0]);
> > > +		if (inline_size > count) {
> > > +			ret = -ENOBUFS;
> > > +			goto out;
> > > +		}
> > > +		count = inline_size;
> > > +		encoded->unencoded_len = ram_bytes;
> > > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > > +	} else {
> > > +		encoded->len = encoded->unencoded_len = count =
> > > +			min_t(u64, count, encoded->len);
> > 
> > I'm sure I have commented on that in the past; please don't use chained
> > initializations. In this case something like:
> > 
> > 		count = min_t(u64, count, encoded->len);
> > 		encoded->len = count;
> > 		encoded->unencoded_len = count;
> > 
> > > +		ptr += iocb->ki_pos - extent_start;
> > > +	}
> > > +
> > > +	tmp = kmalloc(count, GFP_NOFS);
> > > +	if (!tmp) {
> > > +		ret = -ENOMEM;
> > > +		goto out;
> > > +	}
> > > +	read_extent_buffer(leaf, tmp, ptr, count);
> > > +	btrfs_release_path(path);
> > > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > > +	inode_unlock_shared(inode);
> > > +	*unlocked = true;
> > > +
> > > +	ret = copy_to_iter(tmp, count, iter);
> > > +	if (ret != count)
> > > +		ret = -EFAULT;
> > > +	kfree(tmp);
> > > +out:
> > > +	btrfs_free_path(path);
> > > +	return ret;
> > > +}
> > > +
> > > +struct btrfs_encoded_read_private {
> > > +	struct inode *inode;
> > 
> > This should also be btrfs_inode
> > 
> > > +	u64 file_offset;
> > > +	wait_queue_head_t wait;
> > > +	atomic_t pending;
> > > +	blk_status_t status;
> > > +	bool skip_csum;
> > > +};
> > > +
> > > +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> > 
> > struct btrfs_inode
> > 
> > > +					    struct bio *bio, int mirror_num,
> > > +					    unsigned long bio_flags)
> > 
> > bio_flags is unused here (and in the encoded patches afaics)
> > 
> > > +{
> > > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > +	blk_status_t ret;
> > > +
> > > +	if (!priv->skip_csum) {
> > > +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > > +	if (ret) {
> > > +		btrfs_bio_free_csum(bbio);
> > > +		return ret;
> > > +	}
> > > +
> > > +	atomic_inc(&priv->pending);
> > > +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> > > +	if (ret) {
> > > +		atomic_dec(&priv->pending);
> > > +		btrfs_bio_free_csum(bbio);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> > > +{
> > > +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
> > 
> > 	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);
> > 
> > > +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> > > +	struct inode *inode = priv->inode;
> > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > +	u32 sectorsize = fs_info->sectorsize;
> > > +	struct bio_vec *bvec;
> > > +	struct bvec_iter_all iter_all;
> > > +	u64 start = priv->file_offset;
> > > +	u32 bio_offset = 0;
> > > +
> > > +	if (priv->skip_csum || !uptodate)
> > > +		return bbio->bio.bi_status;
> > > +
> > > +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> > > +		unsigned int i, nr_sectors, pgoff;
> > > +
> > > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > > +		pgoff = bvec->bv_offset;
> > > +		for (i = 0; i < nr_sectors; i++) {
> > > +			ASSERT(pgoff < PAGE_SIZE);
> > > +			if (check_data_csum(inode, bbio, bio_offset,
> > > +					    bvec->bv_page, pgoff, start))
> > > +				return BLK_STS_IOERR;
> > > +			start += sectorsize;
> > > +			bio_offset += sectorsize;
> > > +			pgoff += sectorsize;
> > > +		}
> > > +	}
> > > +	return BLK_STS_OK;
> > > +}
> > > +
> > > +static void btrfs_encoded_read_endio(struct bio *bio)
> > > +{
> > > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > > +	blk_status_t status;
> > > +
> > > +	status = btrfs_encoded_read_verify_csum(bbio);
> > > +	if (status) {
> > > +		/*
> > > +		 * The memory barrier implied by the atomic_dec_return() here
> > > +		 * pairs with the memory barrier implied by the
> > > +		 * atomic_dec_return() or io_wait_event() in
> > > +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> > > +		 * write is observed before the load of status in
> > > +		 * btrfs_encoded_read_regular_fill_pages().
> > > +		 */
> > > +		WRITE_ONCE(priv->status, status);
> > > +	}
> > > +	if (!atomic_dec_return(&priv->pending))
> > > +		wake_up(&priv->wait);
> > > +	btrfs_bio_free_csum(bbio);
> > > +	bio_put(bio);
> > > +}
> > > +
> > > +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> > > +						 u64 file_offset,
> > > +						 u64 disk_bytenr,
> > > +						 u64 disk_io_size,
> > > +						 struct page **pages)
> > > +{
> > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > +	struct btrfs_encoded_read_private priv = {
> > > +		.inode = inode,
> > > +		.file_offset = file_offset,
> > > +		.pending = ATOMIC_INIT(1),
> > > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > > +	};
> > > +	unsigned long i = 0;
> > > +	u64 cur = 0;
> > > +	int ret;
> > > +
> > > +	init_waitqueue_head(&priv.wait);
> > > +	/*
> > > +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> > > +	 * necessary.
> > > +	 */
> > > +	while (cur < disk_io_size) {
> > > +		struct extent_map *em;
> > > +		struct btrfs_io_geometry geom;
> > > +		struct bio *bio = NULL;
> > > +		u64 remaining;
> > > +
> > > +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> > > +					 disk_io_size - cur);
> > > +		if (IS_ERR(em)) {
> > > +			ret = PTR_ERR(em);
> > > +		} else {
> > > +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> > > +						    disk_bytenr + cur, &geom);
> > > +			free_extent_map(em);
> > > +		}
> > > +		if (ret) {
> > > +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> > > +			break;
> > > +		}
> > > +		remaining = min(geom.len, disk_io_size - cur);
> > > +		while (bio || remaining) {
> > > +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> > > +
> > > +			if (!bio) {
> > > +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> > > +				bio->bi_iter.bi_sector =
> > > +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> > > +				bio->bi_end_io = btrfs_encoded_read_endio;
> > > +				bio->bi_private = &priv;
> > > +				bio->bi_opf = REQ_OP_READ;
> > > +			}
> > > +
> > > +			if (!bytes ||
> > > +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > > +				blk_status_t status;
> > > +
> > > +				status = submit_encoded_read_bio(inode, bio, 0,
> > > +								 0);
> > > +				if (status) {
> > > +					WRITE_ONCE(priv.status, status);
> > > +					bio_put(bio);
> > > +					goto out;
> > > +				}
> > > +				bio = NULL;
> > > +				continue;
> > > +			}
> > > +
> > > +			i++;
> > > +			cur += bytes;
> > > +			remaining -= bytes;
> > > +		}
> > > +	}
> > > +
> > > +out:
> > > +	if (atomic_dec_return(&priv.pending))
> > > +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> > > +	/* See btrfs_encoded_read_endio() for ordering. */
> > > +	return blk_status_to_errno(READ_ONCE(priv.status));
> > > +}
> > > +
> > > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > > +					  struct iov_iter *iter,
> > > +					  u64 start, u64 lockend,
> > > +					  struct extent_state **cached_state,
> > > +					  u64 disk_bytenr, u64 disk_io_size,
> > > +					  size_t count, bool compressed,
> > > +					  bool *unlocked)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > +	struct page **pages;
> > > +	unsigned long nr_pages, i;
> > > +	u64 cur;
> > > +	size_t page_offset;
> > > +	ssize_t ret;
> > > +
> > > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> > > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > > +	if (!pages)
> > > +		return -ENOMEM;
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > 
> > No HIGHMEM please.
> > 
> > > +		if (!pages[i]) {
> > > +			ret = -ENOMEM;
> > > +			goto out;
> > > +		}
> > > +	}
> > > +
> > > +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> > > +						    disk_io_size, pages);
> > > +	if (ret)
> > > +		goto out;
> > > +
> > > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > > +	inode_unlock_shared(inode);
> > > +	*unlocked = true;
> > > +
> > > +	if (compressed) {
> > > +		i = 0;
> > > +		page_offset = 0;
> > > +	} else {
> > > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > > +	}
> > > +	cur = 0;
> > > +	while (cur < count) {
> > > +		size_t bytes = min_t(size_t, count - cur,
> > > +				     PAGE_SIZE - page_offset);
> > > +
> > > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > > +				      iter) != bytes) {
> > > +			ret = -EFAULT;
> > > +			goto out;
> > > +		}
> > > +		i++;
> > > +		cur += bytes;
> > > +		page_offset = 0;
> > > +	}
> > > +	ret = count;
> > > +out:
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		if (pages[i])
> > > +			__free_page(pages[i]);
> > > +	}
> > > +	kfree(pages);
> > > +	return ret;
> > > +}
> > > +
> > > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > > +			   struct btrfs_ioctl_encoded_io_args *encoded)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > +	ssize_t ret;
> > > +	size_t count = iov_iter_count(iter);
> > > +	u64 start, lockend, disk_bytenr, disk_io_size;
> > > +	struct extent_state *cached_state = NULL;
> > > +	struct extent_map *em;
> > > +	bool unlocked = false;
> > > +
> > > +	file_accessed(iocb->ki_filp);
> > > +
> > > +	inode_lock_shared(inode);
> > 
> > We have helpers for inode locking now, btrfs_inode_lock, that take
> > additional parameter for cases where we want to exclude certain
> > locking combinations.
> > 
> > Which also raises the question of whether encoded read/write should be
> > excluded against some other operations, e.g. the way deduplication has
> > to be excluded against mmap.
> 
> I _think_ we want to exclude mmap, but I need to think some more about
> it. Other than this comment, I believe I've addressed your other
> outstanding comments and rebased my branches:
> 
> https://github.com/osandov/linux/tree/btrfs-send-encoded
> https://github.com/osandov/btrfs-progs/tree/send-encoded
> 
> I'm guessing you're still looking at the send parts, so I'll figure out
> mmap and await further comments before resending.

After looking at the code some more, I don't think we need to exclude
mmap after all.

The only thing that can happen is that with this interleaving:

btrfs_page_mkwrite         btrfs_encoded_read
---------------------------------------------------
(enter)                    (enter)
                           btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
                           lock_extent_bits
                           read extent
                           unlock_extent_cached
                           (exit)

we read the old data from before the page was dirtied. But there are
other interleavings of a concurrent btrfs_page_mkwrite and
btrfs_encoded_read that would read the old data, like:

btrfs_page_mkwrite         btrfs_encoded_read
---------------------------------------------------
(enter)                    (enter)
                           btrfs_wait_ordered_range
                           lock_extent_bits
lock_extent_bits blocked
                           read extent
                           unlock_extent_cached
                           (exit)
lock_extent_bits returns
btrfs_page_set_dirty
unlock_extent_cached
(exit)

Or even if we were to use BTRFS_ILOCK_MMAP:

btrfs_page_mkwrite               btrfs_encoded_read
-------------------------------------------------------------------
(enter)                          (enter)
                                 btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
                                 btrfs_wait_ordered_range
                                 lock_extent_bits
                                 read extent
                                 unlock_extent_cached
                                 btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached

In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.

I pushed the change to use btrfs_inode_lock()/btrfs_inode_unlock() with
BTRFS_ILOCK_SHARED to my branch. I'll resend the series as well.
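
For illustration only, the interleavings above can be replayed with a toy
userspace model in C. This is a sketch, not kernel code, and it rests on
simplifying assumptions: the file is one extent plus one page cache page,
btrfs_wait_ordered_range is modelled as "write any dirty page back to disk",
and the store through the mapping is folded into the btrfs_page_set_dirty
step. The names toy_file, enum step, run_schedule and the three schedule
helpers are hypothetical. The extent lock is not modelled because each step
runs atomically here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { OLD_DATA = 1, NEW_DATA = 2 };

/* Toy file state: one extent on disk plus one page cache page. */
struct toy_file {
	int on_disk;     /* what an encoded read fetches */
	int page_cache;  /* what mmap users see */
	bool dirty;      /* page dirtied but not yet written back */
};

/* Steps named after rows in the interleaving tables. */
enum step {
	READ_WAIT_ORDERED,  /* btrfs_wait_ordered_range (assumed: flush dirty data) */
	READ_EXTENT,        /* "read extent": read straight from disk */
	MKWRITE_SET_DIRTY,  /* btrfs_page_set_dirty plus the store via the mapping */
};

/* Replay a schedule and return what the encoded read observed. */
int run_schedule(const enum step *steps, size_t n)
{
	struct toy_file f = {
		.on_disk = OLD_DATA,
		.page_cache = OLD_DATA,
		.dirty = false,
	};
	int seen = 0;

	assert(steps != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		switch (steps[i]) {
		case READ_WAIT_ORDERED:
			if (f.dirty) {
				f.on_disk = f.page_cache;
				f.dirty = false;
			}
			break;
		case READ_EXTENT:
			seen = f.on_disk;
			break;
		case MKWRITE_SET_DIRTY:
			f.page_cache = NEW_DATA;
			f.dirty = true;
			break;
		}
	}
	return seen;
}

/* First table: the page is dirtied between the flush and the read. */
int interleaving_mkwrite_between(void)
{
	const enum step s[] = { READ_WAIT_ORDERED, MKWRITE_SET_DIRTY, READ_EXTENT };

	return run_schedule(s, sizeof(s) / sizeof(s[0]));
}

/*
 * Second and third tables: the read finishes before the page is dirtied
 * (mkwrite blocked on the extent lock or on i_mmap_lock).
 */
int interleaving_read_first(void)
{
	const enum step s[] = { READ_WAIT_ORDERED, READ_EXTENT, MKWRITE_SET_DIRTY };

	return run_schedule(s, sizeof(s) / sizeof(s[0]));
}

/* Contrast case (not in the tables): the page is dirtied before the flush. */
int interleaving_mkwrite_before_flush(void)
{
	const enum step s[] = { MKWRITE_SET_DIRTY, READ_WAIT_ORDERED, READ_EXTENT };

	return run_schedule(s, sizeof(s) / sizeof(s[0]));
}
```

In this model both interleaving_mkwrite_between() and
interleaving_read_first() return OLD_DATA, and only the contrast schedule
returns NEW_DATA. That matches the point above: serializing against mmap
changes which side waits, but not what the read can observe.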


* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2022-02-08 20:08       ` Omar Sandoval
@ 2022-02-10 18:38         ` David Sterba
  0 siblings, 0 replies; 50+ messages in thread
From: David Sterba @ 2022-02-10 18:38 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: dsterba, linux-btrfs, kernel-team

On Tue, Feb 08, 2022 at 12:08:50PM -0800, Omar Sandoval wrote:
> On Tue, Jan 25, 2022 at 01:26:21PM -0800, Omar Sandoval wrote:
> > On Mon, Jan 24, 2022 at 11:26:32PM +0100, David Sterba wrote:
> > > On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > There are 4 main cases:
> > > > 
> > > > 1. Inline extents: we copy the data straight out of the extent buffer.
> > > > 2. Hole/preallocated extents: we fill in zeroes.
> > > > 3. Regular, uncompressed extents: we read the sectors we need directly
> > > >    from disk.
> > > > 4. Regular, compressed extents: we read the entire compressed extent
> > > >    from disk and indicate what subset of the decompressed extent is in
> > > >    the file.
> > > > 
> > > > This initial implementation simplifies a few things that can be improved
> > > > in the future:
> > > > 
> > > > - We hold the inode lock during the operation.
> > > > - Cases 1, 3, and 4 allocate temporary memory to read into before
> > > >   copying out to userspace.
> > > > - We don't do read repair, because it turns out that read repair is
> > > >   currently broken for compressed data.
> > > > 
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  fs/btrfs/ctree.h |   4 +
> > > >  fs/btrfs/inode.c | 496 +++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/btrfs/ioctl.c | 106 ++++++++++
> > > >  3 files changed, 606 insertions(+)
> > > > 
> > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > > index 2e7f74060a14..70034e33abe6 100644
> > > > --- a/fs/btrfs/ctree.h
> > > > +++ b/fs/btrfs/ctree.h
> > > > @@ -3275,6 +3275,10 @@ int btrfs_writepage_cow_fixup(struct page *page);
> > > >  void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
> > > >  					  struct page *page, u64 start,
> > > >  					  u64 end, bool uptodate);
> > > > +struct btrfs_ioctl_encoded_io_args;
> > > > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > > > +			   struct btrfs_ioctl_encoded_io_args *encoded);
> > > > +
> > > >  extern const struct dentry_operations btrfs_dentry_operations;
> > > >  extern const struct iomap_ops btrfs_dio_iomap_ops;
> > > >  extern const struct iomap_dio_ops btrfs_dio_ops;
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index c2efea101f61..d29e968fd18b 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -10525,6 +10525,502 @@ void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
> > > >  	}
> > > >  }
> > > >  
> > > > +static int btrfs_encoded_io_compression_from_extent(int compress_type)
> > > > +{
> > > > +	switch (compress_type) {
> > > > +	case BTRFS_COMPRESS_NONE:
> > > > +		return BTRFS_ENCODED_IO_COMPRESSION_NONE;
> > > > +	case BTRFS_COMPRESS_ZLIB:
> > > > +		return BTRFS_ENCODED_IO_COMPRESSION_ZLIB;
> > > > +	case BTRFS_COMPRESS_LZO:
> > > > +		/*
> > > > +		 * The LZO format depends on the page size. 64k is the maximum
> > > > +		 * sectorsize (and thus page size) that we support.
> > > > +		 */
> > > > +		if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K)
> > > > +			return -EINVAL;
> > > > +		return BTRFS_ENCODED_IO_COMPRESSION_LZO_4K + (PAGE_SHIFT - 12);
> > > > +	case BTRFS_COMPRESS_ZSTD:
> > > > +		return BTRFS_ENCODED_IO_COMPRESSION_ZSTD;
> > > > +	default:
> > > > +		return -EUCLEAN;
> > > > +	}
> > > > +}
> > > > +
> > > > +static ssize_t btrfs_encoded_read_inline(
> > > > +				struct kiocb *iocb,
> > > > +				struct iov_iter *iter, u64 start,
> > > > +				u64 lockend,
> > > > +				struct extent_state **cached_state,
> > > > +				u64 extent_start, size_t count,
> > > > +				struct btrfs_ioctl_encoded_io_args *encoded,
> > > > +				bool *unlocked)
> > > > +{
> > > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > 
> > > Please use btrfs_inode in all internal helpers, either as parameters or
> > > as local variable, to avoid the BTRFS_I and btrfs_sb conversions everywhere.
> > > 
> > > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > > +	struct btrfs_path *path;
> > > > +	struct extent_buffer *leaf;
> > > > +	struct btrfs_file_extent_item *item;
> > > > +	u64 ram_bytes;
> > > > +	unsigned long ptr;
> > > > +	void *tmp;
> > > > +	ssize_t ret;
> > > > +
> > > > +	path = btrfs_alloc_path();
> > > > +	if (!path) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > > > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > > > +				       0);
> > > > +	if (ret) {
> > > > +		if (ret > 0) {
> > > > +			/* The extent item disappeared? */
> > > > +			ret = -EIO;
> > > > +		}
> > > > +		goto out;
> > > > +	}
> > > > +	leaf = path->nodes[0];
> > > > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > > > +			      struct btrfs_file_extent_item);
> > > > +
> > > > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > > > +	ptr = btrfs_file_extent_inline_start(item);
> > > > +
> > > > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > > > +			iocb->ki_pos);
> > > 
> > > No need for the outer ( )
> > > 
> > > > +	ret = btrfs_encoded_io_compression_from_extent(
> > > > +				 btrfs_file_extent_compression(leaf, item));
> > > > +	if (ret < 0)
> > > > +		goto out;
> > > > +	encoded->compression = ret;
> > > > +	if (encoded->compression) {
> > > > +		size_t inline_size;
> > > > +
> > > > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > > > +								path->slots[0]);
> > > > +		if (inline_size > count) {
> > > > +			ret = -ENOBUFS;
> > > > +			goto out;
> > > > +		}
> > > > +		count = inline_size;
> > > > +		encoded->unencoded_len = ram_bytes;
> > > > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > > > +	} else {
> > > > +		encoded->len = encoded->unencoded_len = count =
> > > > +			min_t(u64, count, encoded->len);
> > > 
> > > I'm sure I have commented on that in the past, please don't use chained
> > > > initializations. In this case something like:
> > > 
> > > 		count = min_t(u64, count, encoded->len);
> > > 		encoded->len = count;
> > > 		encoded->unencoded_len = count;
> > > 
> > > > +		ptr += iocb->ki_pos - extent_start;
> > > > +	}
> > > > +
> > > > +	tmp = kmalloc(count, GFP_NOFS);
> > > > +	if (!tmp) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +	read_extent_buffer(leaf, tmp, ptr, count);
> > > > +	btrfs_release_path(path);
> > > > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > > > +	inode_unlock_shared(inode);
> > > > +	*unlocked = true;
> > > > +
> > > > +	ret = copy_to_iter(tmp, count, iter);
> > > > +	if (ret != count)
> > > > +		ret = -EFAULT;
> > > > +	kfree(tmp);
> > > > +out:
> > > > +	btrfs_free_path(path);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +struct btrfs_encoded_read_private {
> > > > +	struct inode *inode;
> > > 
> > > This should also be btrfs_inode
> > > 
> > > > +	u64 file_offset;
> > > > +	wait_queue_head_t wait;
> > > > +	atomic_t pending;
> > > > +	blk_status_t status;
> > > > +	bool skip_csum;
> > > > +};
> > > > +
> > > > +static blk_status_t submit_encoded_read_bio(struct inode *inode,
> > > 
> > > struct btrfs_inode
> > > 
> > > > +					    struct bio *bio, int mirror_num,
> > > > +					    unsigned long bio_flags)
> > > 
> > > bio_flags is unused here (and in the encoded patches afaics)
> > > 
> > > > +{
> > > > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > > > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > > +	blk_status_t ret;
> > > > +
> > > > +	if (!priv->skip_csum) {
> > > > +		ret = btrfs_lookup_bio_sums(inode, bio, NULL);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > > +
> > > > +	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > > > +	if (ret) {
> > > > +		btrfs_bio_free_csum(bbio);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	atomic_inc(&priv->pending);
> > > > +	ret = btrfs_map_bio(fs_info, bio, mirror_num);
> > > > +	if (ret) {
> > > > +		atomic_dec(&priv->pending);
> > > > +		btrfs_bio_free_csum(bbio);
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static blk_status_t btrfs_encoded_read_verify_csum(struct btrfs_bio *bbio)
> > > > +{
> > > > +	const bool uptodate = bbio->bio.bi_status == BLK_STS_OK;
> > > 
> > > 	const bool uptodate = (bbio->bio.bi_status == BLK_STS_OK);
> > > 
> > > > +	struct btrfs_encoded_read_private *priv = bbio->bio.bi_private;
> > > > +	struct inode *inode = priv->inode;
> > > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > > +	u32 sectorsize = fs_info->sectorsize;
> > > > +	struct bio_vec *bvec;
> > > > +	struct bvec_iter_all iter_all;
> > > > +	u64 start = priv->file_offset;
> > > > +	u32 bio_offset = 0;
> > > > +
> > > > +	if (priv->skip_csum || !uptodate)
> > > > +		return bbio->bio.bi_status;
> > > > +
> > > > +	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
> > > > +		unsigned int i, nr_sectors, pgoff;
> > > > +
> > > > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > > > +		pgoff = bvec->bv_offset;
> > > > +		for (i = 0; i < nr_sectors; i++) {
> > > > +			ASSERT(pgoff < PAGE_SIZE);
> > > > +			if (check_data_csum(inode, bbio, bio_offset,
> > > > +					    bvec->bv_page, pgoff, start))
> > > > +				return BLK_STS_IOERR;
> > > > +			start += sectorsize;
> > > > +			bio_offset += sectorsize;
> > > > +			pgoff += sectorsize;
> > > > +		}
> > > > +	}
> > > > +	return BLK_STS_OK;
> > > > +}
> > > > +
> > > > +static void btrfs_encoded_read_endio(struct bio *bio)
> > > > +{
> > > > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > > > +	struct btrfs_bio *bbio = btrfs_bio(bio);
> > > > +	blk_status_t status;
> > > > +
> > > > +	status = btrfs_encoded_read_verify_csum(bbio);
> > > > +	if (status) {
> > > > +		/*
> > > > +		 * The memory barrier implied by the atomic_dec_return() here
> > > > +		 * pairs with the memory barrier implied by the
> > > > +		 * atomic_dec_return() or io_wait_event() in
> > > > +		 * btrfs_encoded_read_regular_fill_pages() to ensure that this
> > > > +		 * write is observed before the load of status in
> > > > +		 * btrfs_encoded_read_regular_fill_pages().
> > > > +		 */
> > > > +		WRITE_ONCE(priv->status, status);
> > > > +	}
> > > > +	if (!atomic_dec_return(&priv->pending))
> > > > +		wake_up(&priv->wait);
> > > > +	btrfs_bio_free_csum(bbio);
> > > > +	bio_put(bio);
> > > > +}
> > > > +
> > > > +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode,
> > > > +						 u64 file_offset,
> > > > +						 u64 disk_bytenr,
> > > > +						 u64 disk_io_size,
> > > > +						 struct page **pages)
> > > > +{
> > > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > > +	struct btrfs_encoded_read_private priv = {
> > > > +		.inode = inode,
> > > > +		.file_offset = file_offset,
> > > > +		.pending = ATOMIC_INIT(1),
> > > > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > > > +	};
> > > > +	unsigned long i = 0;
> > > > +	u64 cur = 0;
> > > > +	int ret;
> > > > +
> > > > +	init_waitqueue_head(&priv.wait);
> > > > +	/*
> > > > +	 * Submit bios for the extent, splitting due to bio or stripe limits as
> > > > +	 * necessary.
> > > > +	 */
> > > > +	while (cur < disk_io_size) {
> > > > +		struct extent_map *em;
> > > > +		struct btrfs_io_geometry geom;
> > > > +		struct bio *bio = NULL;
> > > > +		u64 remaining;
> > > > +
> > > > +		em = btrfs_get_chunk_map(fs_info, disk_bytenr + cur,
> > > > +					 disk_io_size - cur);
> > > > +		if (IS_ERR(em)) {
> > > > +			ret = PTR_ERR(em);
> > > > +		} else {
> > > > +			ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ,
> > > > +						    disk_bytenr + cur, &geom);
> > > > +			free_extent_map(em);
> > > > +		}
> > > > +		if (ret) {
> > > > +			WRITE_ONCE(priv.status, errno_to_blk_status(ret));
> > > > +			break;
> > > > +		}
> > > > +		remaining = min(geom.len, disk_io_size - cur);
> > > > +		while (bio || remaining) {
> > > > +			size_t bytes = min_t(u64, remaining, PAGE_SIZE);
> > > > +
> > > > +			if (!bio) {
> > > > +				bio = btrfs_bio_alloc(BIO_MAX_VECS);
> > > > +				bio->bi_iter.bi_sector =
> > > > +					(disk_bytenr + cur) >> SECTOR_SHIFT;
> > > > +				bio->bi_end_io = btrfs_encoded_read_endio;
> > > > +				bio->bi_private = &priv;
> > > > +				bio->bi_opf = REQ_OP_READ;
> > > > +			}
> > > > +
> > > > +			if (!bytes ||
> > > > +			    bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > > > +				blk_status_t status;
> > > > +
> > > > +				status = submit_encoded_read_bio(inode, bio, 0,
> > > > +								 0);
> > > > +				if (status) {
> > > > +					WRITE_ONCE(priv.status, status);
> > > > +					bio_put(bio);
> > > > +					goto out;
> > > > +				}
> > > > +				bio = NULL;
> > > > +				continue;
> > > > +			}
> > > > +
> > > > +			i++;
> > > > +			cur += bytes;
> > > > +			remaining -= bytes;
> > > > +		}
> > > > +	}
> > > > +
> > > > +out:
> > > > +	if (atomic_dec_return(&priv.pending))
> > > > +		io_wait_event(priv.wait, !atomic_read(&priv.pending));
> > > > +	/* See btrfs_encoded_read_endio() for ordering. */
> > > > +	return blk_status_to_errno(READ_ONCE(priv.status));
> > > > +}
> > > > +
> > > > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > > > +					  struct iov_iter *iter,
> > > > +					  u64 start, u64 lockend,
> > > > +					  struct extent_state **cached_state,
> > > > +					  u64 disk_bytenr, u64 disk_io_size,
> > > > +					  size_t count, bool compressed,
> > > > +					  bool *unlocked)
> > > > +{
> > > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > > +	struct page **pages;
> > > > +	unsigned long nr_pages, i;
> > > > +	u64 cur;
> > > > +	size_t page_offset;
> > > > +	ssize_t ret;
> > > > +
> > > > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> > > > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > > > +	if (!pages)
> > > > +		return -ENOMEM;
> > > > +	for (i = 0; i < nr_pages; i++) {
> > > > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > > 
> > > No HIGHMEM please.
> > > 
> > > > +		if (!pages[i]) {
> > > > +			ret = -ENOMEM;
> > > > +			goto out;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	ret = btrfs_encoded_read_regular_fill_pages(inode, start, disk_bytenr,
> > > > +						    disk_io_size, pages);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > > > +	inode_unlock_shared(inode);
> > > > +	*unlocked = true;
> > > > +
> > > > +	if (compressed) {
> > > > +		i = 0;
> > > > +		page_offset = 0;
> > > > +	} else {
> > > > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > > > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > > > +	}
> > > > +	cur = 0;
> > > > +	while (cur < count) {
> > > > +		size_t bytes = min_t(size_t, count - cur,
> > > > +				     PAGE_SIZE - page_offset);
> > > > +
> > > > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > > > +				      iter) != bytes) {
> > > > +			ret = -EFAULT;
> > > > +			goto out;
> > > > +		}
> > > > +		i++;
> > > > +		cur += bytes;
> > > > +		page_offset = 0;
> > > > +	}
> > > > +	ret = count;
> > > > +out:
> > > > +	for (i = 0; i < nr_pages; i++) {
> > > > +		if (pages[i])
> > > > +			__free_page(pages[i]);
> > > > +	}
> > > > +	kfree(pages);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
> > > > +			   struct btrfs_ioctl_encoded_io_args *encoded)
> > > > +{
> > > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > > > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > > > +	ssize_t ret;
> > > > +	size_t count = iov_iter_count(iter);
> > > > +	u64 start, lockend, disk_bytenr, disk_io_size;
> > > > +	struct extent_state *cached_state = NULL;
> > > > +	struct extent_map *em;
> > > > +	bool unlocked = false;
> > > > +
> > > > +	file_accessed(iocb->ki_filp);
> > > > +
> > > > +	inode_lock_shared(inode);
> > > 
> > > We have helpers for inode locking now, btrfs_inode_lock, that take
> > > additional parameter for cases where we want to exclude certain
> > > locking combinations.
> > > 
> > > > Which also raises the question of whether encoded read/write should be
> > > > excluded against some other operations, e.g. the way deduplication has
> > > > to be excluded against mmap.
> > 
> > I _think_ we want to exclude mmap, but I need to think some more about
> > it. Other than this comment, I believe I've addressed your other
> > outstanding comments and rebased my branches:
> > 
> > https://github.com/osandov/linux/tree/btrfs-send-encoded
> > https://github.com/osandov/btrfs-progs/tree/send-encoded
> > 
> > I'm guessing you're still looking at the send parts, so I'll figure out
> > mmap and await further comments before resending.
> 
> After looking at the code some more, I don't think we need to exclude
> mmap after all.
> 
> The only thing that can happen is that with this interleaving:
> 
> btrfs_page_mkwrite         btrfs_encoded_read
> ---------------------------------------------------
> (enter)                    (enter)
>                            btrfs_wait_ordered_range
> lock_extent_bits
> btrfs_page_set_dirty
> unlock_extent_cached
> (exit)
>                            lock_extent_bits
>                            read extent
>                            unlock_extent_cached
>                            (exit)
> 
> we read the old data from before the page was dirtied. But there are
> other interleavings of a concurrent btrfs_page_mkwrite and
> btrfs_encoded_read that would read the old data, like:
> 
> btrfs_page_mkwrite         btrfs_encoded_read
> ---------------------------------------------------
> (enter)                    (enter)
>                            btrfs_wait_ordered_range
>                            lock_extent_bits
> lock_extent_bits blocked
>                            read extent
>                            unlock_extent_cached
>                            (exit)
> lock_extent_bits returns
> btrfs_page_set_dirty
> unlock_extent_cached
> (exit)
> 
> Or even if we were to use BTRFS_ILOCK_MMAP:
> 
> btrfs_page_mkwrite               btrfs_encoded_read
> -------------------------------------------------------------------
> (enter)                          (enter)
>                                  btrfs_inode_lock(BTRFS_ILOCK_MMAP)
> down_read(i_mmap_lock) (blocked)
>                                  btrfs_wait_ordered_range
>                                  lock_extent_bits
>                                  read extent
>                                  unlock_extent_cached
>                                  btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
> down_read(i_mmap_lock) returns
> lock_extent_bits
> btrfs_page_set_dirty
> unlock_extent_cached
> 
> In other words, this is inherently racy, so it's fine that we return the
> old data in this tiny window.

Thanks for the analysis. Please put it in the changelog, I think it will
be handy one day.


* Re: [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ
  2022-01-25 21:26     ` Omar Sandoval
  2022-02-08 20:08       ` Omar Sandoval
@ 2022-02-10 18:42       ` David Sterba
  1 sibling, 0 replies; 50+ messages in thread
From: David Sterba @ 2022-02-10 18:42 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: dsterba, linux-btrfs, kernel-team

On Tue, Jan 25, 2022 at 01:26:21PM -0800, Omar Sandoval wrote:
> On Mon, Jan 24, 2022 at 11:26:32PM +0100, David Sterba wrote:
> > On Wed, Nov 17, 2021 at 12:19:19PM -0800, Omar Sandoval wrote:
> > > +
> > > +	inode_lock_shared(inode);
> > 
> > We have helpers for inode locking now, btrfs_inode_lock, that take
> > additional parameter for cases where we want to exclude certain
> > locking combinations.
> > 
> > > Which also raises the question of whether encoded read/write should be
> > > excluded against some other operations, e.g. the way deduplication has
> > > to be excluded against mmap.
> 
> I _think_ we want to exclude mmap, but I need to think some more about
> it. Other than this comment, I believe I've addressed your other
> outstanding comments and rebased my branches:
> 
> https://github.com/osandov/linux/tree/btrfs-send-encoded
> https://github.com/osandov/btrfs-progs/tree/send-encoded
> 
> I'm guessing you're still looking at the send parts, so I'll figure out
> mmap and await further comments before resending.

First I want to get the encoded ioctl, so please send v13.
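
The flag-taking lock helper discussed in the quoted review can be pictured
with a userspace model. This is a sketch under assumptions, not the kernel
implementation: the MODEL_* flag values and model_* functions are
hypothetical, and the behavior modeled (a SHARED flag selecting a read lock
on the inode, an MMAP flag additionally taking the mmap lock for write) is
my reading of the helper rather than something spelled out in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only; not copied from the kernel. */
#define MODEL_ILOCK_SHARED	(1U << 0)
#define MODEL_ILOCK_MMAP	(1U << 1)

enum model_mode { MODEL_UNLOCKED, MODEL_READ, MODEL_WRITE };

struct model_inode_locks {
	enum model_mode rwsem;		/* stands in for the VFS inode lock */
	enum model_mode mmap_lock;	/* stands in for i_mmap_lock */
};

/* Take the inode lock shared or exclusive, plus the mmap lock on request. */
static void model_inode_lock(struct model_inode_locks *l, unsigned int flags)
{
	assert(l->rwsem == MODEL_UNLOCKED);
	l->rwsem = (flags & MODEL_ILOCK_SHARED) ? MODEL_READ : MODEL_WRITE;
	if (flags & MODEL_ILOCK_MMAP)
		l->mmap_lock = MODEL_WRITE;
}

/* Release in reverse order, using the same flags as the lock call. */
static void model_inode_unlock(struct model_inode_locks *l, unsigned int flags)
{
	assert(l->rwsem != MODEL_UNLOCKED);
	if (flags & MODEL_ILOCK_MMAP)
		l->mmap_lock = MODEL_UNLOCKED;
	l->rwsem = MODEL_UNLOCKED;
}

/* A page fault takes the mmap lock for read, so only a writer excludes it. */
static bool model_fault_can_proceed(const struct model_inode_locks *l)
{
	return l->mmap_lock != MODEL_WRITE;
}
```

In this model, locking with MODEL_ILOCK_SHARED alone leaves page faults
free to proceed, while adding MODEL_ILOCK_MMAP excludes them until the
matching unlock, which is the choice being weighed for the encoded read.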

Thread overview: 50+ messages
2021-11-17 20:19 [PATCH v12 00/17] btrfs: add ioctls and send/receive support for reading/writing compressed data Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 01/17] fs: export rw_verify_area() Omar Sandoval
2021-11-18 14:57   ` David Sterba
2021-11-18 19:15     ` Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 02/17] fs: export variant of generic_write_checks without iov_iter Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 03/17] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 04/17] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 05/17] btrfs: support different disk extent size for delalloc Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 06/17] btrfs: clean up cow_file_range_inline() Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 07/17] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 08/17] btrfs: add definitions + documentation for encoded I/O ioctls Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 09/17] btrfs: add BTRFS_IOC_ENCODED_READ Omar Sandoval
2021-11-18 14:55   ` David Sterba
2021-11-18 19:11     ` Omar Sandoval
2022-01-24 21:54   ` David Sterba
2022-01-24 22:33     ` Omar Sandoval
2022-01-24 22:26   ` David Sterba
2022-01-25 21:26     ` Omar Sandoval
2022-02-08 20:08       ` Omar Sandoval
2022-02-10 18:38         ` David Sterba
2022-02-10 18:42       ` David Sterba
2021-11-17 20:19 ` [PATCH v12 10/17] btrfs: add BTRFS_IOC_ENCODED_WRITE Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 11/17] btrfs: send: remove unused send_ctx::{total,cmd}_send_size Omar Sandoval
2021-11-18 14:11   ` David Sterba
2021-11-17 20:19 ` [PATCH v12 12/17] btrfs: send: fix maximum command numbering Omar Sandoval
2021-11-18 14:23   ` David Sterba
2021-11-18 18:54     ` Omar Sandoval
2021-12-09 18:08       ` Omar Sandoval
2022-01-04 19:05         ` Omar Sandoval
2022-01-24 22:40       ` David Sterba
2021-11-17 20:19 ` [PATCH v12 13/17] btrfs: add send stream v2 definitions Omar Sandoval
2021-11-18 14:18   ` David Sterba
2021-11-18 19:08     ` Omar Sandoval
2021-11-18 14:20   ` David Sterba
2021-11-17 20:19 ` [PATCH v12 14/17] btrfs: send: write larger chunks when using stream v2 Omar Sandoval
2021-11-18 15:50   ` David Sterba
2021-11-18 19:34     ` Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 15/17] btrfs: send: allocate send buffer with alloc_page() and vmap() for v2 Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 16/17] btrfs: send: send compressed extents with encoded writes Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 17/17] btrfs: send: enable support for stream v2 and compressed writes Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 01/10] btrfs-progs: receive: support v2 send stream larger tlv_len Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 02/10] btrfs-progs: receive: dynamically allocate sctx->read_buf Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 03/10] btrfs-progs: receive: support v2 send stream DATA tlv format Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 04/10] btrfs-progs: receive: add send stream v2 cmds and attrs to send.h Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 05/10] btrfs-progs: receive: process encoded_write commands Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 06/10] btrfs-progs: receive: encoded_write fallback to explicit decode and write Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 07/10] btrfs-progs: receive: process fallocate commands Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 08/10] btrfs-progs: receive: process setflags ioctl commands Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 09/10] btrfs-progs: send: stream v2 ioctl flags Omar Sandoval
2021-11-17 20:19 ` [PATCH v12 10/10] btrfs-progs: receive: add tests for basic encoded_write send/receive Omar Sandoval
