All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Btrfs: add interface for writing compressed extents directly
@ 2019-09-04 19:13 Omar Sandoval
  2019-09-04 19:13 ` [PATCH 1/2] fs: export rw_verify_area() Omar Sandoval
  2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
  0 siblings, 2 replies; 17+ messages in thread
From: Omar Sandoval @ 2019-09-04 19:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel

From: Omar Sandoval <osandov@fb.com>

Hi,

This is an update of my RFC series [1]. Patch 1 is a VFS patch to export
some permission checks. Patch 2 is the actual change.

Changes from the RFC:

- Rebased on misc-next, which now has the cleanups from patches 1-3.
- Generalized the interface to "raw writes" so that we can support,
  e.g., encrypted writes in the future.

Note that I kept the compat ioctl implementation the same based on the
discussion in [2].

1: https://lore.kernel.org/linux-btrfs/cover.1565900769.git.osandov@fb.com/
2: https://lore.kernel.org/linux-btrfs/20190903171458.GA7452@vader/

Omar Sandoval (2):
  fs: export rw_verify_area()
  btrfs: add ioctl for directly writing compressed data

 fs/btrfs/compression.c     |   6 +-
 fs/btrfs/compression.h     |  14 +--
 fs/btrfs/ctree.h           |   6 ++
 fs/btrfs/file.c            |  13 ++-
 fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
 fs/internal.h              |   5 -
 fs/read_write.c            |   1 +
 include/linux/fs.h         |   1 +
 include/uapi/linux/btrfs.h |  69 +++++++++++++
 10 files changed, 382 insertions(+), 20 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] fs: export rw_verify_area()
  2019-09-04 19:13 [PATCH 0/2] Btrfs: add interface for writing compressed extents directly Omar Sandoval
@ 2019-09-04 19:13 ` Omar Sandoval
  2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
  1 sibling, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2019-09-04 19:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel, Josef Bacik

From: Omar Sandoval <osandov@fb.com>

I'm adding a Btrfs ioctl to write compressed data, and rather than
duplicating the checks in rw_verify_area(), let's just export it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/internal.h      | 5 -----
 fs/read_write.c    | 1 +
 include/linux/fs.h | 1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 315fcd8d237c..94e1831d4c95 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -160,11 +160,6 @@ extern char *simple_dname(struct dentry *, char *, int);
 extern void dput_to_list(struct dentry *, struct list_head *);
 extern void shrink_dentry_list(struct list_head *);
 
-/*
- * read_write.c
- */
-extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
-
 /*
  * pipe.c
  */
diff --git a/fs/read_write.c b/fs/read_write.c
index 5bbf587f5bc1..76d0dd85d4f3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -399,6 +399,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	return security_file_permission(file,
 				read_write == READ ? MAY_READ : MAY_WRITE);
 }
+EXPORT_SYMBOL(rw_verify_area);
 
 static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..a9a1884768e4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3082,6 +3082,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
 		int whence, loff_t size);
 extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
 extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
+extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 extern int stream_open(struct inode * inode, struct file * filp);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-04 19:13 [PATCH 0/2] Btrfs: add interface for writing compressed extents directly Omar Sandoval
  2019-09-04 19:13 ` [PATCH 1/2] fs: export rw_verify_area() Omar Sandoval
@ 2019-09-04 19:13 ` Omar Sandoval
  2019-09-05  2:10   ` Dave Chinner
                     ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Omar Sandoval @ 2019-09-04 19:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, linux-fsdevel

From: Omar Sandoval <osandov@fb.com>

This adds an API for writing compressed data directly to the filesystem.
The use case that I have in mind is send/receive: currently, when
sending data from one compressed filesystem to another, the sending side
decompresses the data and the receiving side recompresses it before
writing it out. This is wasteful and can be avoided if we can just send
and write compressed extents. The send part will be implemented in a
separate series, as this ioctl can stand alone.

The interface is essentially pwrite(2) with some extra information:

- The input buffer contains the compressed data.
- Both the compressed and decompressed sizes of the data are given.
- The compression type (zlib, lzo, or zstd) is given.

The interface is general enough that it can be extended to encrypted or
otherwise encoded extents in the future. A more detailed description,
including restrictions and edge cases, is included in
include/uapi/linux/btrfs.h.

The implementation is similar to direct I/O: we have to flush any
ordered extents, invalidate the page cache, and do the io
tree/delalloc/extent map/ordered extent dance. From there, we can reuse
the compression code with a minor modification to distinguish the new
ioctl from writeback.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/compression.c     |   6 +-
 fs/btrfs/compression.h     |  14 +--
 fs/btrfs/ctree.h           |   6 ++
 fs/btrfs/file.c            |  13 ++-
 fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
 include/uapi/linux/btrfs.h |  69 +++++++++++++
 7 files changed, 380 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b05b361e2062..6632dd8d2e4d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
 			bio->bi_status == BLK_STS_OK);
 	cb->compressed_pages[0]->mapping = NULL;
 
-	end_compressed_writeback(inode, cb);
+	if (cb->writeback)
+		end_compressed_writeback(inode, cb);
 	/* note, our inode could be gone now */
 
 	/*
@@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags, bool writeback)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	cb->mirror_num = 0;
 	cb->compressed_pages = compressed_pages;
 	cb->compressed_len = compressed_len;
+	cb->writeback = writeback;
 	cb->orig_bio = NULL;
 	cb->nr_pages = nr_pages;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 4cb8be9ff88b..5b48eb730362 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -6,6 +6,7 @@
 #ifndef BTRFS_COMPRESSION_H
 #define BTRFS_COMPRESSION_H
 
+#include <linux/btrfs.h>
 #include <linux/sizes.h>
 
 /*
@@ -47,6 +48,9 @@ struct compressed_bio {
 	/* the compression algorithm for this bio */
 	int compress_type;
 
+	/* Whether this is a write for writeback. */
+	bool writeback;
+
 	/* number of compressed pages in the array */
 	unsigned long nr_pages;
 
@@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags, bool writeback);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
 unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
 
-enum btrfs_compression_type {
-	BTRFS_COMPRESS_NONE  = 0,
-	BTRFS_COMPRESS_ZLIB  = 1,
-	BTRFS_COMPRESS_LZO   = 2,
-	BTRFS_COMPRESS_ZSTD  = 3,
-	BTRFS_COMPRESS_TYPES = 3,
-};
-
 struct workspace_manager {
 	const struct btrfs_compress_op *ops;
 	struct list_head idle_ws;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19d669d12ca1..9fae9b3f1f62 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
+
+ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
+			struct btrfs_ioctl_raw_pwrite_args *raw);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
@@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 			   struct btrfs_inode *inode);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
+ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			    struct btrfs_ioctl_raw_pwrite_args *args);
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
 void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 			     int skip_pinned);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8fe4eb7e5045..ed23aa65b2d5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
 		inode_inc_iversion(inode);
 }
 
-static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			    struct btrfs_ioctl_raw_pwrite_args *raw)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (raw) {
+		num_written = btrfs_raw_write(iocb, from, raw);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	return num_written ? num_written : err;
 }
 
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	return btrfs_do_write_iter(iocb, from, NULL);
+}
+
 int btrfs_release_file(struct inode *inode, struct file *filp)
 {
 	struct btrfs_file_private *private = filp->private_data;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0546401bc0a..c8eaa1e5bf06 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags, true)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
 	}
 }
 
+/* Currently, this only supports raw writes of compressed data. */
+ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
+			struct btrfs_ioctl_raw_pwrite_args *raw)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_state *cached_state = NULL;
+	unsigned long nr_pages, i;
+	struct page **pages;
+	u64 disk_num_bytes, num_bytes;
+	u64 start, end;
+	struct btrfs_key ins;
+	struct extent_map *em;
+	ssize_t ret;
+
+	if (iov_iter_count(from) != raw->num_bytes) {
+		/*
+		 * The write got truncated by generic_write_checks(). We can't
+		 * do a partial raw write.
+		 */
+		return -EFBIG;
+	}
+
+	/* This should be handled higher up. */
+	ASSERT(raw->num_bytes != 0);
+
+	/* The extent size must be sane. */
+	if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
+	    raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
+	    raw->disk_num_bytes == 0)
+		return -EINVAL;
+
+	/*
+	 * The compressed data on disk must be sector-aligned. For convenience,
+	 * we extend the compressed data with zeroes if it isn't.
+	 */
+	disk_num_bytes = ALIGN(raw->disk_num_bytes, fs_info->sectorsize);
+	/*
+	 * The extent in the file must also be sector-aligned. However, we allow
+	 * a write which ends at or extends i_size to have an unaligned length;
+	 * we round up the extent size and set i_size to the given length.
+	 */
+	start = iocb->ki_pos;
+	if ((start & (fs_info->sectorsize - 1)))
+		return -EINVAL;
+	if (start + raw->num_bytes >= inode->i_size) {
+		num_bytes = ALIGN(raw->num_bytes, fs_info->sectorsize);
+	} else {
+		num_bytes = raw->num_bytes;
+		if ((num_bytes & (fs_info->sectorsize - 1)))
+			return -EINVAL;
+	}
+	end = start + num_bytes - 1;
+
+	/*
+	 * It's valid for compressed data to be larger than or the same size as
+	 * the decompressed data. However, for buffered I/O, we never write out
+	 * a compressed extent unless it's smaller than the decompressed data,
+	 * so for now, let's not allow creating such extents with the ioctl,
+	 * either.
+	 */
+	if (disk_num_bytes >= num_bytes)
+		return -EINVAL;
+
+	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *),
+			GFP_USER | __GFP_NOWARN);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long offset = i << PAGE_SHIFT, n;
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_USER | __GFP_NOWARN);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (offset < raw->disk_num_bytes) {
+			n = min_t(unsigned long, PAGE_SIZE,
+				  raw->disk_num_bytes - offset);
+			if (copy_from_user(kaddr, raw->buf + offset, n)) {
+				kunmap(pages[i]);
+				ret = -EFAULT;
+				goto out_pages;
+			}
+		} else {
+			n = 0;
+		}
+		if (n < PAGE_SIZE)
+			memset(kaddr + n, 0, PAGE_SIZE - n);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     end - start + 1);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
+				     &cached_state);
+		cond_resched();
+		ret = btrfs_wait_ordered_range(inode, start, end);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+	}
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
+					   num_bytes);
+	if (ret)
+		goto out_unlock;
+
+	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+
+	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
+			  ins.offset, ins.offset, num_bytes, raw->compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserve;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
+						num_bytes, ins.offset,
+						BTRFS_ORDERED_COMPRESSED,
+						raw->compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserve;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + raw->num_bytes > inode->i_size)
+		i_size_write(inode, start + raw->num_bytes);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
+
+	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
+					  ins.offset, pages, nr_pages, 0,
+					  false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	iocb->ki_pos += raw->num_bytes;
+	return raw->num_bytes;
+
+out_free_reserve:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
+				     true);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			put_page(pages[i]);
+	}
+	kfree(pages);
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index de730e56d3f5..c803732c9722 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -26,6 +26,7 @@
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
 #include <linux/iversion.h>
+#include <linux/sched/xacct.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -84,6 +85,20 @@ struct btrfs_ioctl_send_args_32 {
 
 #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
 			       struct btrfs_ioctl_send_args_32)
+
+struct btrfs_ioctl_raw_pwrite_args_32 {
+	__u64 offset;		/* in */
+	__u64 num_bytes;	/* in */
+	__u64 disk_num_bytes;	/* in */
+	__u8 compression;	/* in */
+	__u8 encryption;	/* in */
+	__u16 other_encoding;	/* in */
+	__u32 reserved[7];
+	compat_uptr_t buf;	/* in */
+} __attribute__ ((__packed__));
+
+#define BTRFS_IOC_RAW_PWRITE_32 _IOW(BTRFS_IOCTL_MAGIC, 63, \
+				     struct btrfs_ioctl_raw_pwrite_args_32)
 #endif
 
 static int btrfs_clone(struct inode *src, struct inode *inode,
@@ -5437,6 +5452,80 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 	return ret;
 }
 
+static int btrfs_ioctl_raw_pwrite(struct file *file, void __user *argp,
+				  bool compat)
+{
+	struct btrfs_ioctl_raw_pwrite_args args;
+	struct iov_iter iter;
+	loff_t pos;
+	struct kiocb kiocb;
+	ssize_t ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (compat) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_raw_pwrite_args_32 args32;
+
+		if (copy_from_user(&args32, argp, sizeof(args32)))
+			return -EFAULT;
+		args.offset = args32.offset;
+		args.num_bytes = args32.num_bytes;
+		args.disk_num_bytes = args32.disk_num_bytes;
+		args.compression = args32.compression;
+		args.encryption = args32.encryption;
+		args.other_encoding = args32.other_encoding;
+		memcpy(args.reserved, args32.reserved, sizeof(args.reserved));
+		args.buf = compat_ptr(args32.buf);
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		if (copy_from_user(&args, argp, sizeof(args)))
+			return -EFAULT;
+	}
+
+	/* The compression type must be valid. */
+	if (args.compression == BTRFS_COMPRESS_NONE ||
+	    args.compression > BTRFS_COMPRESS_TYPES)
+		return -EINVAL;
+	/* Reserved fields must be zero. */
+	if (args.encryption || args.other_encoding ||
+	    memchr_inv(args.reserved, 0, sizeof(args.reserved)))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(args.buf, args.disk_num_bytes)))
+		return -EFAULT;
+
+	pos = args.offset;
+	ret = rw_verify_area(WRITE, file, &pos, args.num_bytes);
+	if (ret)
+		return ret;
+
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = pos;
+	/*
+	 * This iov_iter is a lie; we only construct it so that we can use
+	 * write_iter.
+	 */
+	iov_iter_init(&iter, WRITE, NULL, 0, args.num_bytes);
+
+	file_start_write(file);
+	ret = btrfs_do_write_iter(&kiocb, &iter, &args);
+	if (ret > 0) {
+		ASSERT(ret == args.num_bytes);
+		fsnotify_modify(file);
+		add_wchar(current, ret);
+	}
+	inc_syscw(current);
+	file_end_write(file);
+	return ret < 0 ? ret : 0;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5583,6 +5672,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case BTRFS_IOC_RAW_PWRITE:
+		return btrfs_ioctl_raw_pwrite(file, argp, false);
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+	case BTRFS_IOC_RAW_PWRITE_32:
+		return btrfs_ioctl_raw_pwrite(file, argp, true);
+#endif
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 3ee0678c0a83..b4dce062265a 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -822,6 +822,73 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 		__u8 align[7];
 };
 
+enum btrfs_compression_type {
+	BTRFS_COMPRESS_NONE  = 0,
+	BTRFS_COMPRESS_ZLIB  = 1,
+	BTRFS_COMPRESS_LZO   = 2,
+	BTRFS_COMPRESS_ZSTD  = 3,
+	BTRFS_COMPRESS_TYPES = 3,
+};
+
+/*
+ * Write an extent directly to the filesystem. CAP_SYS_ADMIN is required and the
+ * file descriptor must be open for writing.
+ *
+ * Currently, this is only for writing compressed data. However, it may be
+ * extended in the future.
+ */
+struct btrfs_ioctl_raw_pwrite_args {
+	/*
+	 * Offset in file where to write. This must be aligned to the sector
+	 * size of the filesystem.
+	 */
+	__u64 offset;		/* in */
+	/*
+	 * Length of the extent in the file, in bytes. This must be aligned to
+	 * the sector size of the filesystem unless the data ends at or beyond
+	 * the current end of file; this special case is to support creating
+	 * files whose length is not aligned to the sector size.
+	 *
+	 * For a compressed extent, this is the length of the decompressed data.
+	 * It must be less than 128k (BTRFS_MAX_UNCOMPRESSED), although that
+	 * limit may increase in the future.
+	 */
+	__u64 num_bytes;	/* in */
+	/*
+	 * Length of the extent on disk, in bytes (see buf below).
+	 *
+	 * For a compressed extent, this does not need to be aligned to a
+	 * sector. It must be less than 128k (BTRFS_MAX_COMPRESSED), although
+	 * that limit may increase in the future.
+	 */
+	__u64 disk_num_bytes;	/* in */
+	/*
+	 * Compression type (enum btrfs_compression_type). Currently, this must
+	 * not be BTRFS_COMPRESS_NONE.
+	 */
+	__u8 compression;	/* in */
+	/* Encryption type. Currently, this must be zero. */
+	__u8 encryption;	/* in */
+	/* Other type of encoding. Currently, this must be zero. */
+	__u16 other_encoding;	/* in */
+	/* Reserved for future extensions. Must be zero. */
+	__u32 reserved[7];
+	/*
+	 * The raw data on disk.
+	 *
+	 * For a compressed extent, the format is as follows:
+	 *
+	 * - zlib: The extent is a single zlib stream.
+	 * - lzo: The extent is compressed page by page with LZO1X and wrapped
+	 *   according to the format documented in fs/btrfs/lzo.c.
+	 * - zstd: The extent is a single zstd stream. The windowLog compression
+	 *   parameter must be no more than 17 (ZSTD_BTRFS_MAX_WINDOWLOG).
+	 *
+	 * If the compressed data is invalid, reading will return an error.
+	 */
+	void __user *buf;	/* in */
+} __attribute__ ((__packed__));
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
@@ -946,5 +1013,7 @@ enum btrfs_err_code {
 				struct btrfs_ioctl_get_subvol_rootref_args)
 #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
 				struct btrfs_ioctl_ino_lookup_user_args)
+#define BTRFS_IOC_RAW_PWRITE _IOW(BTRFS_IOCTL_MAGIC, 63, \
+				  struct btrfs_ioctl_raw_pwrite_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
@ 2019-09-05  2:10   ` Dave Chinner
  2019-09-05 12:16     ` Johannes Thumshirn
  2019-09-06 18:19     ` Omar Sandoval
  2019-09-05 10:33   ` Nikolay Borisov
  2019-09-10 11:39   ` Filipe Manana
  2 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2019-09-05  2:10 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This adds an API for writing compressed data directly to the filesystem.
> The use case that I have in mind is send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this ioctl can stand alone.
> 
> The interface is essentially pwrite(2) with some extra information:
> 
> - The input buffer contains the compressed data.
> - Both the compressed and decompressed sizes of the data are given.
> - The compression type (zlib, lzo, or zstd) is given.

So why can't you do this with pwritev2()? Heaps of flags, and
use a second iovec to hold the decompressed size of the previous
iovec. i.e.

	iov[0].iov_base = compressed_data;
	iov[0].iov_len = compressed_size;
	iov[1].iov_base = NULL;
	iov[1].iov_len = uncompressed_size;
	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);

And you don't need to reinvent pwritev() with some whacky ioctl that
is bound to be completely screwed up is ways not noticed until
someone else tries to use it...

I'd also suggest atht if we are going to be able to write compressed
data directly, then we should be able to read them as well directly
via preadv2()....

> The interface is general enough that it can be extended to encrypted or
> otherwise encoded extents in the future. A more detailed description,
> including restrictions and edge cases, is included in
> include/uapi/linux/btrfs.h.

No thanks, that bit us on the arse -hard- with the clone interfaces
we lifted to the VFS from btrfs. Let's do it through the existing IO
paths and write a bunch of fstests to exercise it and verify the
interface's utility and the filesystem implementation correctness
before anything is merged.

> The implementation is similar to direct I/O: we have to flush any
> ordered extents, invalidate the page cache, and do the io
> tree/delalloc/extent map/ordered extent dance.

Which, to me, says that this should be a small bit of extra code
in the direct IO path that skips the compression/decompression code
and sets a few extra flags in the iocb that is passed down to the
direct IO code.

We don't need a whole new IO path just to skip a data transformation
step in the direct IO path....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
  2019-09-05  2:10   ` Dave Chinner
@ 2019-09-05 10:33   ` Nikolay Borisov
  2019-09-19  6:14     ` Omar Sandoval
  2019-09-10 11:39   ` Filipe Manana
  2 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2019-09-05 10:33 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, linux-fsdevel



On 4.09.19 г. 22:13 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This adds an API for writing compressed data directly to the filesystem.
> The use case that I have in mind is send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this ioctl can stand alone.
> 
> The interface is essentially pwrite(2) with some extra information:
> 
> - The input buffer contains the compressed data.
> - Both the compressed and decompressed sizes of the data are given.
> - The compression type (zlib, lzo, or zstd) is given.
> 
> The interface is general enough that it can be extended to encrypted or
> otherwise encoded extents in the future. A more detailed description,
> including restrictions and edge cases, is included in
> include/uapi/linux/btrfs.h.
> 
> The implementation is similar to direct I/O: we have to flush any
> ordered extents, invalidate the page cache, and do the io
> tree/delalloc/extent map/ordered extent dance. From there, we can reuse
> the compression code with a minor modification to distinguish the new
> ioctl from writeback.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>


Should we choose to continue with this interface (based on Dave's
feedback) I'd rather see the following things changed:

> ---
>  fs/btrfs/compression.c     |   6 +-
>  fs/btrfs/compression.h     |  14 +--
>  fs/btrfs/ctree.h           |   6 ++
>  fs/btrfs/file.c            |  13 ++-
>  fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
>  include/uapi/linux/btrfs.h |  69 +++++++++++++
>  7 files changed, 380 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index b05b361e2062..6632dd8d2e4d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
>  			bio->bi_status == BLK_STS_OK);
>  	cb->compressed_pages[0]->mapping = NULL;
>  
> -	end_compressed_writeback(inode, cb);
> +	if (cb->writeback)
> +		end_compressed_writeback(inode, cb);
>  	/* note, our inode could be gone now */
>  
>  	/*
> @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				 unsigned long compressed_len,
>  				 struct page **compressed_pages,
>  				 unsigned long nr_pages,
> -				 unsigned int write_flags)
> +				 unsigned int write_flags, bool writeback)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct bio *bio = NULL;
> @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  	cb->mirror_num = 0;
>  	cb->compressed_pages = compressed_pages;
>  	cb->compressed_len = compressed_len;
> +	cb->writeback = writeback;
>  	cb->orig_bio = NULL;
>  	cb->nr_pages = nr_pages;
>  
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 4cb8be9ff88b..5b48eb730362 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -6,6 +6,7 @@
>  #ifndef BTRFS_COMPRESSION_H
>  #define BTRFS_COMPRESSION_H
>  
> +#include <linux/btrfs.h>
>  #include <linux/sizes.h>
>  
>  /*
> @@ -47,6 +48,9 @@ struct compressed_bio {
>  	/* the compression algorithm for this bio */
>  	int compress_type;
>  
> +	/* Whether this is a write for writeback. */
> +	bool writeback;
> +
>  	/* number of compressed pages in the array */
>  	unsigned long nr_pages;
>  
> @@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				  unsigned long compressed_len,
>  				  struct page **compressed_pages,
>  				  unsigned long nr_pages,
> -				  unsigned int write_flags);
> +				  unsigned int write_flags, bool writeback);
>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  				 int mirror_num, unsigned long bio_flags);
>  
>  unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
>  
> -enum btrfs_compression_type {
> -	BTRFS_COMPRESS_NONE  = 0,
> -	BTRFS_COMPRESS_ZLIB  = 1,
> -	BTRFS_COMPRESS_LZO   = 2,
> -	BTRFS_COMPRESS_ZSTD  = 3,
> -	BTRFS_COMPRESS_TYPES = 3,
> -};
> -
>  struct workspace_manager {
>  	const struct btrfs_compress_op *ops;
>  	struct list_head idle_ws;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 19d669d12ca1..9fae9b3f1f62 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
>  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>  					  u64 end, int uptodate);
> +
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +			struct btrfs_ioctl_raw_pwrite_args *raw);
> +
>  extern const struct dentry_operations btrfs_dentry_operations;
>  
>  /* ioctl.c */
> @@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
>  			   struct btrfs_inode *inode);
>  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +			    struct btrfs_ioctl_raw_pwrite_args *args);
>  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
>  void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
>  			     int skip_pinned);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 8fe4eb7e5045..ed23aa65b2d5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
>  		inode_inc_iversion(inode);
>  }
>  
> -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> -				    struct iov_iter *from)
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +			    struct btrfs_ioctl_raw_pwrite_args *raw)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
> @@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (sync)
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> +	if (raw) {
> +		num_written = btrfs_raw_write(iocb, from, raw);
> +	} else if (iocb->ki_flags & IOCB_DIRECT) {
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> @@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	return num_written ? num_written : err;
>  }
>  
> +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	return btrfs_do_write_iter(iocb, from, NULL);
> +}
> +
>  int btrfs_release_file(struct inode *inode, struct file *filp)
>  {
>  	struct btrfs_file_private *private = filp->private_data;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a0546401bc0a..c8eaa1e5bf06 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  				    ins.objectid,
>  				    ins.offset, async_extent->pages,
>  				    async_extent->nr_pages,
> -				    async_chunk->write_flags)) {
> +				    async_chunk->write_flags, true)) {
>  			struct page *p = async_extent->pages[0];
>  			const u64 start = async_extent->start;
>  			const u64 end = start + async_extent->ram_size - 1;
> @@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>  	}
>  }
>  
> +/* Currently, this only supports raw writes of compressed data. */
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +			struct btrfs_ioctl_raw_pwrite_args *raw)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct extent_changeset *data_reserved = NULL;
> +	struct extent_state *cached_state = NULL;
> +	unsigned long nr_pages, i;
> +	struct page **pages;
> +	u64 disk_num_bytes, num_bytes;
> +	u64 start, end;
> +	struct btrfs_key ins;
> +	struct extent_map *em;
> +	ssize_t ret;
> +
> +	if (iov_iter_count(from) != raw->num_bytes) {
> +		/*
> +		 * The write got truncated by generic_write_checks(). We can't
> +		 * do a partial raw write.
> +		 */
> +		return -EFBIG;
> +	}
> +
> +	/* This should be handled higher up. */
> +	ASSERT(raw->num_bytes != 0);

This is already checked, indirectly via rw_verify_area, there's :

 if (unlikely((ssize_t) count < 0))
                  return retval;

So you can remove this assert.

> +
> +	/* The extent size must be sane. */
> +	if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
> +	    raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
> +	    raw->disk_num_bytes == 0)
> +		return -EINVAL;
> +
> +	/*
> +	 * The compressed data on disk must be sector-aligned. For convenience,
> +	 * we extend the compressed data with zeroes if it isn't.
> +	 */
> +	disk_num_bytes = ALIGN(raw->disk_num_bytes, fs_info->sectorsize);
> +	/*
> +	 * The extent in the file must also be sector-aligned. However, we allow
> +	 * a write which ends at or extends i_size to have an unaligned length;
> +	 * we round up the extent size and set i_size to the given length.
> +	 */
> +	start = iocb->ki_pos;
> +	if ((start & (fs_info->sectorsize - 1)))

if (!IS_ALIGNED(start, fs_info->sectorsize))

> +		return -EINVAL;
> +	if (start + raw->num_bytes >= inode->i_size) {
> +		num_bytes = ALIGN(raw->num_bytes, fs_info->sectorsize);
> +	} else {
> +		num_bytes = raw->num_bytes;
> +		if ((num_bytes & (fs_info->sectorsize - 1)))

ditto

> +			return -EINVAL;
> +	}
> +	end = start + num_bytes - 1;
> +
> +	/*
> +	 * It's valid for compressed data to be larger than or the same size as
> +	 * the decompressed data. However, for buffered I/O, we never write out
> +	 * a compressed extent unless it's smaller than the decompressed data,
> +	 * so for now, let's not allow creating such extents with the ioctl,
> +	 * either.
> +	 */
> +	if (disk_num_bytes >= num_bytes)
> +		return -EINVAL;
> +
> +	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
> +	pages = kcalloc(nr_pages, sizeof(struct page *),
> +			GFP_USER | __GFP_NOWARN);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		unsigned long offset = i << PAGE_SHIFT, n;
> +		char *kaddr;
> +
> +		pages[i] = alloc_page(GFP_USER | __GFP_NOWARN);
> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out_pages;
> +		}
> +		kaddr = kmap(pages[i]);
> +		if (offset < raw->disk_num_bytes) {
> +			n = min_t(unsigned long, PAGE_SIZE,
> +				  raw->disk_num_bytes - offset);
> +			if (copy_from_user(kaddr, raw->buf + offset, n)) {
> +				kunmap(pages[i]);
> +				ret = -EFAULT;
> +				goto out_pages;
> +			}
> +		} else {
> +			n = 0;
> +		}
> +		if (n < PAGE_SIZE)
> +			memset(kaddr + n, 0, PAGE_SIZE - n);
> +		kunmap(pages[i]);
> +	}
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		lock_extent_bits(io_tree, start, end, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     end - start + 1);
> +		if (!ordered &&
> +		    !filemap_range_has_page(inode->i_mapping, start, end))
> +			break;
> +		if (ordered)
> +			btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
> +				     &cached_state);
> +		cond_resched();
> +		ret = btrfs_wait_ordered_range(inode, start, end);
> +		if (ret)
> +			goto out_pages;
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +						    start >> PAGE_SHIFT,
> +						    end >> PAGE_SHIFT);
> +		if (ret)
> +			goto out_pages;
> +	}

Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
function + invalidate_inode_pages2_range ?

> +
> +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> +					   num_bytes);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> +	if (ret)
> +		goto out_delalloc_release;
> +
> +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> +			  ins.offset, ins.offset, num_bytes, raw->compression,
> +			  BTRFS_ORDERED_COMPRESSED);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_free_reserve;
> +	}
> +	free_extent_map(em);
> +
> +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> +						num_bytes, ins.offset,
> +						BTRFS_ORDERED_COMPRESSED,
> +						raw->compression);
> +	if (ret) {
> +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> +		goto out_free_reserve;
> +	}
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +
> +	if (start + raw->num_bytes > inode->i_size)
> +		i_size_write(inode, start + raw->num_bytes);
> +
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> +
> +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> +					  ins.offset, pages, nr_pages, 0,
> +					  false)) {
> +		struct page *page = pages[0];
> +
> +		page->mapping = inode->i_mapping;
> +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> +		page->mapping = NULL;
> +		ret = -EIO;
> +		goto out_pages;
> +	}
> +	iocb->ki_pos += raw->num_bytes;
> +	return raw->num_bytes;
> +
> +out_free_reserve:
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> +out_delalloc_release:
> +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> +				     true);
> +out_unlock:
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +out_pages:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			put_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a

<snip>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-05  2:10   ` Dave Chinner
@ 2019-09-05 12:16     ` Johannes Thumshirn
  2019-09-05 12:51       ` Johannes Thumshirn
  2019-09-06  7:46       ` Dave Chinner
  2019-09-06 18:19     ` Omar Sandoval
  1 sibling, 2 replies; 17+ messages in thread
From: Johannes Thumshirn @ 2019-09-05 12:16 UTC (permalink / raw)
  To: Dave Chinner, Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On 05/09/2019 04:10, Dave Chinner wrote:
> On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> This adds an API for writing compressed data directly to the filesystem.
>> The use case that I have in mind is send/receive: currently, when
>> sending data from one compressed filesystem to another, the sending side
>> decompresses the data and the receiving side recompresses it before
>> writing it out. This is wasteful and can be avoided if we can just send
>> and write compressed extents. The send part will be implemented in a
>> separate series, as this ioctl can stand alone.
>>
>> The interface is essentially pwrite(2) with some extra information:
>>
>> - The input buffer contains the compressed data.
>> - Both the compressed and decompressed sizes of the data are given.
>> - The compression type (zlib, lzo, or zstd) is given.
> 
> So why can't you do this with pwritev2()? Heaps of flags, and
> use a second iovec to hold the decompressed size of the previous
> iovec. i.e.
> 
> 	iov[0].iov_base = compressed_data;
> 	iov[0].iov_len = compressed_size;
> 	iov[1].iov_base = NULL;
> 	iov[1].iov_len = uncompressed_size;
> 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> 
> And you don't need to reinvent pwritev() with some whacky ioctl that
> is bound to be completely screwed up is ways not noticed until
> someone else tries to use it...
> 
> I'd also suggest atht if we are going to be able to write compressed
> data directly, then we should be able to read them as well directly
> via preadv2()....


While I'm with you on this from a design PoV, one question remains:
What to do with the file systems that do not support compression?

Currently there's only a kernel global check for known RWF_* flags in
kiocb_set_rw_flags().

So we need a way for the individual file systems to opt into the new
RWF_COMPRESSED_* flags and fail early if they're not supported, that
will cause a lot of code churn if we cannot do it in the vfs layer.

From the 52 ->write_iter callbacks in fs/ 32 are not using
generic_file_write_iter(). So we'd have to patch 33 functions (+/- 1-2
because my grep | wc fu isn't the best).

Any ideas?

Byte,
	Johannes
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-05 12:16     ` Johannes Thumshirn
@ 2019-09-05 12:51       ` Johannes Thumshirn
  2019-09-06  7:46       ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Johannes Thumshirn @ 2019-09-05 12:51 UTC (permalink / raw)
  To: Dave Chinner, Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On 05/09/2019 14:16, Johannes Thumshirn wrote:
> On 05/09/2019 04:10, Dave Chinner wrote:
>> On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> This adds an API for writing compressed data directly to the filesystem.
>>> The use case that I have in mind is send/receive: currently, when
>>> sending data from one compressed filesystem to another, the sending side
>>> decompresses the data and the receiving side recompresses it before
>>> writing it out. This is wasteful and can be avoided if we can just send
>>> and write compressed extents. The send part will be implemented in a
>>> separate series, as this ioctl can stand alone.
>>>
>>> The interface is essentially pwrite(2) with some extra information:
>>>
>>> - The input buffer contains the compressed data.
>>> - Both the compressed and decompressed sizes of the data are given.
>>> - The compression type (zlib, lzo, or zstd) is given.
>>
>> So why can't you do this with pwritev2()? Heaps of flags, and
>> use a second iovec to hold the decompressed size of the previous
>> iovec. i.e.
>>
>> 	iov[0].iov_base = compressed_data;
>> 	iov[0].iov_len = compressed_size;
>> 	iov[1].iov_base = NULL;
>> 	iov[1].iov_len = uncompressed_size;
>> 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
>>
>> And you don't need to reinvent pwritev() with some whacky ioctl that
>> is bound to be completely screwed up is ways not noticed until
>> someone else tries to use it...
>>
>> I'd also suggest atht if we are going to be able to write compressed
>> data directly, then we should be able to read them as well directly
>> via preadv2()....
> 
> 
> While I'm with you on this from a design PoV, one question remains:
> What to do with the file systems that do not support compression?
> 
> Currently there's only a kernel global check for known RWF_* flags in
> kiocb_set_rw_flags().
> 
> So we need a way for the individual file systems to opt into the new
> RWF_COMPRESSED_* flags and fail early if they're not supported, that
> will cause a lot of code churn if we cannot do it in the vfs layer.
> 
> From the 52 ->write_iter callbacks in fs/ 32 are not using
> generic_file_write_iter(). So we'd have to patch 33 functions (+/- 1-2
> because my grep | wc fu isn't the best).
> 

This (from Anthony Iliopoulos) should be sufficient:

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58a18ed11546..86f7ff0387d7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3299,7 +3299,7 @@ static loff_t btrfs_file_llseek(struct file *file,

 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-       filp->f_mode |= FMODE_NOWAIT;
+       filp->f_mode |= (FMODE_NOWAIT|FMODE_CAN_COMPRESS);
        return generic_file_open(inode, filp);
 }

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..1b59e795f448 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3357,6 +3357,11 @@ static inline int kiocb_set_rw_flags(struct kiocb
                ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
        if (flags & RWF_APPEND)
                ki->ki_flags |= IOCB_APPEND;
+       if (flags & RWF_COMPRESSED) {
+               if (!(ki->ki_filp->fmode & FMODE_CAN_COMPRESS))
+                       return -EOPNOTSUPP;
+               ki->ki_flags |= IOCB_COMPRESSED;
+       }
        return 0;
 }


-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-05 12:16     ` Johannes Thumshirn
  2019-09-05 12:51       ` Johannes Thumshirn
@ 2019-09-06  7:46       ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2019-09-06  7:46 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Omar Sandoval, linux-btrfs, kernel-team, linux-fsdevel

On Thu, Sep 05, 2019 at 02:16:37PM +0200, Johannes Thumshirn wrote:
> On 05/09/2019 04:10, Dave Chinner wrote:
> > On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> >> From: Omar Sandoval <osandov@fb.com>
> >>
> >> This adds an API for writing compressed data directly to the filesystem.
> >> The use case that I have in mind is send/receive: currently, when
> >> sending data from one compressed filesystem to another, the sending side
> >> decompresses the data and the receiving side recompresses it before
> >> writing it out. This is wasteful and can be avoided if we can just send
> >> and write compressed extents. The send part will be implemented in a
> >> separate series, as this ioctl can stand alone.
> >>
> >> The interface is essentially pwrite(2) with some extra information:
> >>
> >> - The input buffer contains the compressed data.
> >> - Both the compressed and decompressed sizes of the data are given.
> >> - The compression type (zlib, lzo, or zstd) is given.
> > 
> > So why can't you do this with pwritev2()? Heaps of flags, and
> > use a second iovec to hold the decompressed size of the previous
> > iovec. i.e.
> > 
> > 	iov[0].iov_base = compressed_data;
> > 	iov[0].iov_len = compressed_size;
> > 	iov[1].iov_base = NULL;
> > 	iov[1].iov_len = uncompressed_size;
> > 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> > 
> > And you don't need to reinvent pwritev() with some whacky ioctl that
> > is bound to be completely screwed up is ways not noticed until
> > someone else tries to use it...
> > 
> > I'd also suggest atht if we are going to be able to write compressed
> > data directly, then we should be able to read them as well directly
> > via preadv2()....
> 
> 
> While I'm with you on this from a design PoV, one question remains:
> What to do with the file systems that do not support compression?

EINVAL.

> Currently there's only a kernel global check for known RWF_* flags in
> kiocb_set_rw_flags().

That's really an implementation detail, there's lots of ways of
doing it, probably a superblock feature flag would be the easiest
way to specify support for a filesystem supporting direct read/write
of compressed data.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-05  2:10   ` Dave Chinner
  2019-09-05 12:16     ` Johannes Thumshirn
@ 2019-09-06 18:19     ` Omar Sandoval
  2019-09-06 21:07       ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Omar Sandoval @ 2019-09-06 18:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Thu, Sep 05, 2019 at 12:10:12PM +1000, Dave Chinner wrote:
> On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This adds an API for writing compressed data directly to the filesystem.
> > The use case that I have in mind is send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this ioctl can stand alone.
> > 
> > The interface is essentially pwrite(2) with some extra information:
> > 
> > - The input buffer contains the compressed data.
> > - Both the compressed and decompressed sizes of the data are given.
> > - The compression type (zlib, lzo, or zstd) is given.

Hi, Dave,

> So why can't you do this with pwritev2()? Heaps of flags, and
> use a second iovec to hold the decompressed size of the previous
> iovec. i.e.
> 
> 	iov[0].iov_base = compressed_data;
> 	iov[0].iov_len = compressed_size;
> 	iov[1].iov_base = NULL;
> 	iov[1].iov_len = uncompressed_size;
> 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> 
> And you don't need to reinvent pwritev() with some whacky ioctl that
> is bound to be completely screwed up is ways not noticed until
> someone else tries to use it...

This is a good suggestion, thanks. I hadn't considered (ab?)using iovecs
in this way.

One modification I'd make would be to put the encoding into the second
iovec and use a single RWF_ENCODED flag so that we don't have to keep
stealing from RWF_* every time we add a new compression
algorithm/encryption type/whatever:

 	iov[0].iov_base = compressed_data;
 	iov[0].iov_len = compressed_size;
 	iov[1].iov_base = (void *)IOV_ENCODING_ZLIB;
 	iov[1].iov_len = uncompressed_size;
 	pwritev2(fd, iov, 2, offset, RWF_ENCODED);

Making every other iovec a metadata iovec in this way would be a major
pain to plumb through the iov_iter and VFS code, though. Instead, we
could put the metadata in iov[0] and the encoded data in iov[1..iovcnt -
1]:

	iov[0].iov_base = (void *)IOV_ENCODING_ZLIB;
	iov[0].iov_len = unencoded_len;
	iov[1].iov_base = encoded_data1;
	iov[1].iov_len = encoded_size1;
	iov[2].iov_base = encoded_data2;
	iov[2].iov_len = encoded_size2;
 	pwritev2(fd, iov, 3, offset, RWF_ENCODED);

In my opinion, these are both reasonable interfaces. The former allows
the user to write multiple encoded "extents" at once, while the latter
allows writing a single encoded extent from scattered buffers. The
latter is much simpler to implement ;) Thoughts?

> I'd also suggest atht if we are going to be able to write compressed
> data directly, then we should be able to read them as well directly
> via preadv2()....
> 
> > The interface is general enough that it can be extended to encrypted or
> > otherwise encoded extents in the future. A more detailed description,
> > including restrictions and edge cases, is included in
> > include/uapi/linux/btrfs.h.
> 
> No thanks, that bit us on the arse -hard- with the clone interfaces
> we lifted to the VFS from btrfs. Let's do it through the existing IO
> paths and write a bunch of fstests to exercise it and verify the
> interface's utility and the filesystem implementation correctness
> before anything is merged.
> 
> > The implementation is similar to direct I/O: we have to flush any
> > ordered extents, invalidate the page cache, and do the io
> > tree/delalloc/extent map/ordered extent dance.
> 
> Which, to me, says that this should be a small bit of extra code
> in the direct IO path that skips the compression/decompression code
> and sets a few extra flags in the iocb that is passed down to the
> direct IO code.
> 
> We don't need a whole new IO path just to skip a data transformation
> step in the direct IO path....

Eh, at least for Btrfs, it's much hairier to retrofit this onto the mess
of callbacks that is __blockdev_direct_IO() than it is to have a
separate path. But that doesn't affect the interface, and other
filesystems may be able to share more code with the direct IO path.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-06 18:19     ` Omar Sandoval
@ 2019-09-06 21:07       ` Dave Chinner
  2019-09-06 21:27         ` Omar Sandoval
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2019-09-06 21:07 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Fri, Sep 06, 2019 at 11:19:49AM -0700, Omar Sandoval wrote:
> On Thu, Sep 05, 2019 at 12:10:12PM +1000, Dave Chinner wrote:
> > On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > This adds an API for writing compressed data directly to the filesystem.
> > > The use case that I have in mind is send/receive: currently, when
> > > sending data from one compressed filesystem to another, the sending side
> > > decompresses the data and the receiving side recompresses it before
> > > writing it out. This is wasteful and can be avoided if we can just send
> > > and write compressed extents. The send part will be implemented in a
> > > separate series, as this ioctl can stand alone.
> > > 
> > > The interface is essentially pwrite(2) with some extra information:
> > > 
> > > - The input buffer contains the compressed data.
> > > - Both the compressed and decompressed sizes of the data are given.
> > > - The compression type (zlib, lzo, or zstd) is given.
> 
> Hi, Dave,
> 
> > So why can't you do this with pwritev2()? Heaps of flags, and
> > use a second iovec to hold the decompressed size of the previous
> > iovec. i.e.
> > 
> > 	iov[0].iov_base = compressed_data;
> > 	iov[0].iov_len = compressed_size;
> > 	iov[1].iov_base = NULL;
> > 	iov[1].iov_len = uncompressed_size;
> > 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> > 
> > And you don't need to reinvent pwritev() with some whacky ioctl that
> > is bound to be completely screwed up is ways not noticed until
> > someone else tries to use it...
> 
> This is a good suggestion, thanks. I hadn't considered (ab?)using iovecs
> in this way.

Yeah, it is a bit of API abuse to pass per-iovec context in the next
iovec, but ISTR it being proposed in past times for other
mechanisms. I think it's far better than a whole new filesystem
private ioctl interface and structure to do what is effectively
direct IO...

> One modification I'd make would be to put the encoding into the second
> iovec and use a single RWF_ENCODED flag so that we don't have to keep
> stealing from RWF_* every time we add a new compression
> algorithm/encryption type/whatever:
> 
>  	iov[0].iov_base = compressed_data;
>  	iov[0].iov_len = compressed_size;
>  	iov[1].iov_base = (void *)IOV_ENCODING_ZLIB;
>  	iov[1].iov_len = uncompressed_size;
>  	pwritev2(fd, iov, 2, offset, RWF_ENCODED);
> 
> Making every other iovec a metadata iovec in this way would be a major
> pain to plumb through the iov_iter and VFS code, though. Instead, we
> could put the metadata in iov[0] and the encoded data in iov[1..iovcnt -
> 1]:
> 
> 	iov[0].iov_base = (void *)IOV_ENCODING_ZLIB;
> 	iov[0].iov_len = unencoded_len;
> 	iov[1].iov_base = encoded_data1;
> 	iov[1].iov_len = encoded_size1;
> 	iov[2].iov_base = encoded_data2;
> 	iov[2].iov_len = encoded_size2;
>  	pwritev2(fd, iov, 3, offset, RWF_ENCODED);
> 
> In my opinion, these are both reasonable interfaces. The former allows
> the user to write multiple encoded "extents" at once, while the latter
> allows writing a single encoded extent from scattered buffers. The
> latter is much simpler to implement ;) Thoughts?

Both reasonable, and I have no real concern about how it is done as
long as the format is well documented and works for both read and
write.

The only other thing I think we need to be careful of is that
interface works with AIO (via the RWF flag) and the new uioring async
interface  - I think thw RWF flag is all that is needed there). I
think that's another good reason for taking the preadv2/pwritev2
path, as that should all largely just work with the right iocb
frobbing in the syscall context...

> > > The implementation is similar to direct I/O: we have to flush any
> > > ordered extents, invalidate the page cache, and do the io
> > > tree/delalloc/extent map/ordered extent dance.
> > 
> > Which, to me, says that this should be a small bit of extra code
> > in the direct IO path that skips the compression/decompression code
> > and sets a few extra flags in the iocb that is passed down to the
> > direct IO code.
> > 
> > We don't need a whole new IO path just to skip a data transformation
> > step in the direct IO path....
> 
> Eh, at least for Btrfs, it's much hairier to retrofit this onto the mess
> of callbacks that is __blockdev_direct_IO() than it is to have a
> separate path. But that doesn't affect the interface, and other
> filesystems may be able to share more code with the direct IO path.

Ah, yeah, btrfs still uses that old mess. It should be much easier
when btrfs is converted to use the iomap DIO path - all the internal
compressed extent setup work for btrfs will likely be in the
->iomap_begin callout and then the rest of the DIO code treats it
like a normal data extent to do the IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-06 21:07       ` Dave Chinner
@ 2019-09-06 21:27         ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2019-09-06 21:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Sat, Sep 07, 2019 at 07:07:17AM +1000, Dave Chinner wrote:
> On Fri, Sep 06, 2019 at 11:19:49AM -0700, Omar Sandoval wrote:
> > On Thu, Sep 05, 2019 at 12:10:12PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 04, 2019 at 12:13:26PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > This adds an API for writing compressed data directly to the filesystem.
> > > > The use case that I have in mind is send/receive: currently, when
> > > > sending data from one compressed filesystem to another, the sending side
> > > > decompresses the data and the receiving side recompresses it before
> > > > writing it out. This is wasteful and can be avoided if we can just send
> > > > and write compressed extents. The send part will be implemented in a
> > > > separate series, as this ioctl can stand alone.
> > > > 
> > > > The interface is essentially pwrite(2) with some extra information:
> > > > 
> > > > - The input buffer contains the compressed data.
> > > > - Both the compressed and decompressed sizes of the data are given.
> > > > - The compression type (zlib, lzo, or zstd) is given.
> > 
> > Hi, Dave,
> > 
> > > So why can't you do this with pwritev2()? Heaps of flags, and
> > > use a second iovec to hold the decompressed size of the previous
> > > iovec. i.e.
> > > 
> > > 	iov[0].iov_base = compressed_data;
> > > 	iov[0].iov_len = compressed_size;
> > > 	iov[1].iov_base = NULL;
> > > 	iov[1].iov_len = uncompressed_size;
> > > 	pwritev2(fd, iov, 2, offset, RWF_COMPRESSED_ZLIB);
> > > 
> > > And you don't need to reinvent pwritev() with some whacky ioctl that
> > > is bound to be completely screwed up is ways not noticed until
> > > someone else tries to use it...
> > 
> > This is a good suggestion, thanks. I hadn't considered (ab?)using iovecs
> > in this way.
> 
> Yeah, it is a bit of API abuse to pass per-iovec context in the next
> iovec, but ISTR it being proposed in past times for other
> mechanisms. I think it's far better than a whole new filesystem
> private ioctl interface and structure to do what is effectively
> direct IO...
> 
> > One modification I'd make would be to put the encoding into the second
> > iovec and use a single RWF_ENCODED flag so that we don't have to keep
> > stealing from RWF_* every time we add a new compression
> > algorithm/encryption type/whatever:
> > 
> >  	iov[0].iov_base = compressed_data;
> >  	iov[0].iov_len = compressed_size;
> >  	iov[1].iov_base = (void *)IOV_ENCODING_ZLIB;
> >  	iov[1].iov_len = uncompressed_size;
> >  	pwritev2(fd, iov, 2, offset, RWF_ENCODED);
> > 
> > Making every other iovec a metadata iovec in this way would be a major
> > pain to plumb through the iov_iter and VFS code, though. Instead, we
> > could put the metadata in iov[0] and the encoded data in iov[1..iovcnt -
> > 1]:
> > 
> > 	iov[0].iov_base = (void *)IOV_ENCODING_ZLIB;
> > 	iov[0].iov_len = unencoded_len;
> > 	iov[1].iov_base = encoded_data1;
> > 	iov[1].iov_len = encoded_size1;
> > 	iov[2].iov_base = encoded_data2;
> > 	iov[2].iov_len = encoded_size2;
> >  	pwritev2(fd, iov, 3, offset, RWF_ENCODED);
> > 
> > In my opinion, these are both reasonable interfaces. The former allows
> > the user to write multiple encoded "extents" at once, while the latter
> > allows writing a single encoded extent from scattered buffers. The
> > latter is much simpler to implement ;) Thoughts?
> 
> Both reasonable, and I have no real concern about how it is done as
> long as the format is well documented and works for both read and
> write.
> 
> The only other thing I think we need to be careful of is that
> interface works with AIO (via the RWF flag) and the new uioring async
> interface  - I think thw RWF flag is all that is needed there). I
> think that's another good reason for taking the preadv2/pwritev2
> path, as that should all largely just work with the right iocb
> frobbing in the syscall context...

A symmetric interface for preadv2 would look something like this:

	iov[1].iov_base = encoded_data1;
	iov[1].iov_len = encoded_size1;
	iov[2].iov_base = encoded_data2;
	iov[2].iov_len = encoded_size2;
	preadv2(fd, iov, 3, offset, RWF_ENCODED);
	/*
	 * iov[0].iov_base gets filled in with the encoding flags,
	 * iov[0].iov_len gets filled in with unencoded length.
	 */

But, iov is passed as a const struct iovec *, so it'd be nasty to write
to it in the RWF_ENCODED case. Maybe we actually want to pass the
encoding information through an extra indirection. Something along the
lines of this for writes:

	struct encoded_rw {
		size_t unencoded_len;
		int compression;
		int encryption;
		...
	};
	
	struct encoded_rw encoded = {
		unencoded_len,
		ENCODED_RW_ZLIB,
	};
	iov[0].iov_base = &encoded;
	iov[0].iov_len = sizeof(encoded);
	iov[1].iov_base = encoded_data1;
	iov[1].iov_len = encoded_size1;
	iov[2].iov_base = encoded_data2;
	iov[2].iov_len = encoded_size2;
	pwritev2(fd, iov, 3, offset, RWF_ENCODED);

And similar for reads:

	struct encoded_rw encoded;
	iov[0].iov_base = &encoded;
	iov[0].iov_len = sizeof(encoded);
	iov[1].iov_base = encoded_data1;
	iov[1].iov_len = encoded_size1;
	iov[2].iov_base = encoded_data2;
	iov[2].iov_len = encoded_size2;
	preadv2(fd, iov, 3, offset, RWF_ENCODED);
	/* encoded gets filled in with the encoding information. */

I'll draft something with this interface.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
  2019-09-05  2:10   ` Dave Chinner
  2019-09-05 10:33   ` Nikolay Borisov
@ 2019-09-10 11:39   ` Filipe Manana
  2019-09-19  6:23     ` Omar Sandoval
  2 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2019-09-10 11:39 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Wed, Sep 4, 2019 at 8:14 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds an API for writing compressed data directly to the filesystem.
> The use case that I have in mind is send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this ioctl can stand alone.
>
> The interface is essentially pwrite(2) with some extra information:
>
> - The input buffer contains the compressed data.
> - Both the compressed and decompressed sizes of the data are given.
> - The compression type (zlib, lzo, or zstd) is given.
>
> The interface is general enough that it can be extended to encrypted or
> otherwise encoded extents in the future. A more detailed description,
> including restrictions and edge cases, is included in
> include/uapi/linux/btrfs.h.
>
> The implementation is similar to direct I/O: we have to flush any
> ordered extents, invalidate the page cache, and do the io
> tree/delalloc/extent map/ordered extent dance. From there, we can reuse
> the compression code with a minor modification to distinguish the new
> ioctl from writeback.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/compression.c     |   6 +-
>  fs/btrfs/compression.h     |  14 +--
>  fs/btrfs/ctree.h           |   6 ++
>  fs/btrfs/file.c            |  13 ++-
>  fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
>  include/uapi/linux/btrfs.h |  69 +++++++++++++
>  7 files changed, 380 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index b05b361e2062..6632dd8d2e4d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
>                         bio->bi_status == BLK_STS_OK);
>         cb->compressed_pages[0]->mapping = NULL;
>
> -       end_compressed_writeback(inode, cb);
> +       if (cb->writeback)
> +               end_compressed_writeback(inode, cb);
>         /* note, our inode could be gone now */
>
>         /*
> @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>                                  unsigned long compressed_len,
>                                  struct page **compressed_pages,
>                                  unsigned long nr_pages,
> -                                unsigned int write_flags)
> +                                unsigned int write_flags, bool writeback)
>  {
>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>         struct bio *bio = NULL;
> @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>         cb->mirror_num = 0;
>         cb->compressed_pages = compressed_pages;
>         cb->compressed_len = compressed_len;
> +       cb->writeback = writeback;
>         cb->orig_bio = NULL;
>         cb->nr_pages = nr_pages;
>
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 4cb8be9ff88b..5b48eb730362 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -6,6 +6,7 @@
>  #ifndef BTRFS_COMPRESSION_H
>  #define BTRFS_COMPRESSION_H
>
> +#include <linux/btrfs.h>
>  #include <linux/sizes.h>
>
>  /*
> @@ -47,6 +48,9 @@ struct compressed_bio {
>         /* the compression algorithm for this bio */
>         int compress_type;
>
> +       /* Whether this is a write for writeback. */
> +       bool writeback;
> +
>         /* number of compressed pages in the array */
>         unsigned long nr_pages;
>
> @@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>                                   unsigned long compressed_len,
>                                   struct page **compressed_pages,
>                                   unsigned long nr_pages,
> -                                 unsigned int write_flags);
> +                                 unsigned int write_flags, bool writeback);
>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>                                  int mirror_num, unsigned long bio_flags);
>
>  unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
>
> -enum btrfs_compression_type {
> -       BTRFS_COMPRESS_NONE  = 0,
> -       BTRFS_COMPRESS_ZLIB  = 1,
> -       BTRFS_COMPRESS_LZO   = 2,
> -       BTRFS_COMPRESS_ZSTD  = 3,
> -       BTRFS_COMPRESS_TYPES = 3,
> -};
> -
>  struct workspace_manager {
>         const struct btrfs_compress_op *ops;
>         struct list_head idle_ws;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 19d669d12ca1..9fae9b3f1f62 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
>  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>                                           u64 end, int uptodate);
> +
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +                       struct btrfs_ioctl_raw_pwrite_args *raw);
> +
>  extern const struct dentry_operations btrfs_dentry_operations;
>
>  /* ioctl.c */
> @@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
>                            struct btrfs_inode *inode);
>  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +                           struct btrfs_ioctl_raw_pwrite_args *args);
>  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
>  void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
>                              int skip_pinned);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 8fe4eb7e5045..ed23aa65b2d5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
>                 inode_inc_iversion(inode);
>  }
>
> -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> -                                   struct iov_iter *from)
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +                           struct btrfs_ioctl_raw_pwrite_args *raw)
>  {
>         struct file *file = iocb->ki_filp;
>         struct inode *inode = file_inode(file);
> @@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>         if (sync)
>                 atomic_inc(&BTRFS_I(inode)->sync_writers);
>
> -       if (iocb->ki_flags & IOCB_DIRECT) {
> +       if (raw) {
> +               num_written = btrfs_raw_write(iocb, from, raw);
> +       } else if (iocb->ki_flags & IOCB_DIRECT) {
>                 num_written = __btrfs_direct_write(iocb, from);
>         } else {
>                 num_written = btrfs_buffered_write(iocb, from);
> @@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>         return num_written ? num_written : err;
>  }
>
> +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +       return btrfs_do_write_iter(iocb, from, NULL);
> +}
> +
>  int btrfs_release_file(struct inode *inode, struct file *filp)
>  {
>         struct btrfs_file_private *private = filp->private_data;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a0546401bc0a..c8eaa1e5bf06 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>                                     ins.objectid,
>                                     ins.offset, async_extent->pages,
>                                     async_extent->nr_pages,
> -                                   async_chunk->write_flags)) {
> +                                   async_chunk->write_flags, true)) {
>                         struct page *p = async_extent->pages[0];
>                         const u64 start = async_extent->start;
>                         const u64 end = start + async_extent->ram_size - 1;
> @@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>         }
>  }
>
> +/* Currently, this only supports raw writes of compressed data. */
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +                       struct btrfs_ioctl_raw_pwrite_args *raw)
> +{
> +       struct file *file = iocb->ki_filp;
> +       struct inode *inode = file_inode(file);
> +       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +       struct btrfs_root *root = BTRFS_I(inode)->root;
> +       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +       struct extent_changeset *data_reserved = NULL;
> +       struct extent_state *cached_state = NULL;
> +       unsigned long nr_pages, i;
> +       struct page **pages;
> +       u64 disk_num_bytes, num_bytes;
> +       u64 start, end;
> +       struct btrfs_key ins;
> +       struct extent_map *em;
> +       ssize_t ret;
> +
> +       if (iov_iter_count(from) != raw->num_bytes) {
> +               /*
> +                * The write got truncated by generic_write_checks(). We can't
> +                * do a partial raw write.
> +                */
> +               return -EFBIG;
> +       }
> +
> +       /* This should be handled higher up. */
> +       ASSERT(raw->num_bytes != 0);
> +
> +       /* The extent size must be sane. */
> +       if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
> +           raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
> +           raw->disk_num_bytes == 0)
> +               return -EINVAL;

For the uncompressed size (num_bytes), even if it's within the limit,
shouldn't we validate it?
We can't trust for sure that what the user supplies is actually
correct. Can't we grab that from the compressed buffer (I suppose at
least most compression methods encode that in a header or trailer)?
Without such validation we can end up corrupt data/metadata on disk.

Thanks.

> +
> +       /*
> +        * The compressed data on disk must be sector-aligned. For convenience,
> +        * we extend the compressed data with zeroes if it isn't.
> +        */
> +       disk_num_bytes = ALIGN(raw->disk_num_bytes, fs_info->sectorsize);
> +       /*
> +        * The extent in the file must also be sector-aligned. However, we allow
> +        * a write which ends at or extends i_size to have an unaligned length;
> +        * we round up the extent size and set i_size to the given length.
> +        */
> +       start = iocb->ki_pos;
> +       if ((start & (fs_info->sectorsize - 1)))
> +               return -EINVAL;
> +       if (start + raw->num_bytes >= inode->i_size) {
> +               num_bytes = ALIGN(raw->num_bytes, fs_info->sectorsize);
> +       } else {
> +               num_bytes = raw->num_bytes;
> +               if ((num_bytes & (fs_info->sectorsize - 1)))
> +                       return -EINVAL;
> +       }
> +       end = start + num_bytes - 1;
> +
> +       /*
> +        * It's valid for compressed data to be larger than or the same size as
> +        * the decompressed data. However, for buffered I/O, we never write out
> +        * a compressed extent unless it's smaller than the decompressed data,
> +        * so for now, let's not allow creating such extents with the ioctl,
> +        * either.
> +        */
> +       if (disk_num_bytes >= num_bytes)
> +               return -EINVAL;
> +
> +       nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
> +       pages = kcalloc(nr_pages, sizeof(struct page *),
> +                       GFP_USER | __GFP_NOWARN);
> +       if (!pages)
> +               return -ENOMEM;
> +       for (i = 0; i < nr_pages; i++) {
> +               unsigned long offset = i << PAGE_SHIFT, n;
> +               char *kaddr;
> +
> +               pages[i] = alloc_page(GFP_USER | __GFP_NOWARN);
> +               if (!pages[i]) {
> +                       ret = -ENOMEM;
> +                       goto out_pages;
> +               }
> +               kaddr = kmap(pages[i]);
> +               if (offset < raw->disk_num_bytes) {
> +                       n = min_t(unsigned long, PAGE_SIZE,
> +                                 raw->disk_num_bytes - offset);
> +                       if (copy_from_user(kaddr, raw->buf + offset, n)) {
> +                               kunmap(pages[i]);
> +                               ret = -EFAULT;
> +                               goto out_pages;
> +                       }
> +               } else {
> +                       n = 0;
> +               }
> +               if (n < PAGE_SIZE)
> +                       memset(kaddr + n, 0, PAGE_SIZE - n);
> +               kunmap(pages[i]);
> +       }
> +
> +       for (;;) {
> +               struct btrfs_ordered_extent *ordered;
> +
> +               lock_extent_bits(io_tree, start, end, &cached_state);
> +               ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +                                                    end - start + 1);
> +               if (!ordered &&
> +                   !filemap_range_has_page(inode->i_mapping, start, end))
> +                       break;
> +               if (ordered)
> +                       btrfs_put_ordered_extent(ordered);
> +               unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
> +                                    &cached_state);
> +               cond_resched();
> +               ret = btrfs_wait_ordered_range(inode, start, end);
> +               if (ret)
> +                       goto out_pages;
> +               ret = invalidate_inode_pages2_range(inode->i_mapping,
> +                                                   start >> PAGE_SHIFT,
> +                                                   end >> PAGE_SHIFT);
> +               if (ret)
> +                       goto out_pages;
> +       }
> +
> +       ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> +                                          num_bytes);
> +       if (ret)
> +               goto out_unlock;
> +
> +       ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> +                                  disk_num_bytes, 0, 0, &ins, 1, 1);
> +       if (ret)
> +               goto out_delalloc_release;
> +
> +       em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> +                         ins.offset, ins.offset, num_bytes, raw->compression,
> +                         BTRFS_ORDERED_COMPRESSED);
> +       if (IS_ERR(em)) {
> +               ret = PTR_ERR(em);
> +               goto out_free_reserve;
> +       }
> +       free_extent_map(em);
> +
> +       ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> +                                               num_bytes, ins.offset,
> +                                               BTRFS_ORDERED_COMPRESSED,
> +                                               raw->compression);
> +       if (ret) {
> +               btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> +               goto out_free_reserve;
> +       }
> +       btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +
> +       if (start + raw->num_bytes > inode->i_size)
> +               i_size_write(inode, start + raw->num_bytes);
> +
> +       unlock_extent_cached(io_tree, start, end, &cached_state);
> +
> +       btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> +
> +       if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> +                                         ins.offset, pages, nr_pages, 0,
> +                                         false)) {
> +               struct page *page = pages[0];
> +
> +               page->mapping = inode->i_mapping;
> +               btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> +               page->mapping = NULL;
> +               ret = -EIO;
> +               goto out_pages;
> +       }
> +       iocb->ki_pos += raw->num_bytes;
> +       return raw->num_bytes;
> +
> +out_free_reserve:
> +       btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +       btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> +out_delalloc_release:
> +       btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> +                                    true);
> +out_unlock:
> +       unlock_extent_cached(io_tree, start, end, &cached_state);
> +out_pages:
> +       for (i = 0; i < nr_pages; i++) {
> +               if (pages[i])
> +                       put_page(pages[i]);
> +       }
> +       kfree(pages);
> +       return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index de730e56d3f5..c803732c9722 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -26,6 +26,7 @@
>  #include <linux/btrfs.h>
>  #include <linux/uaccess.h>
>  #include <linux/iversion.h>
> +#include <linux/sched/xacct.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -84,6 +85,20 @@ struct btrfs_ioctl_send_args_32 {
>
>  #define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
>                                struct btrfs_ioctl_send_args_32)
> +
> +struct btrfs_ioctl_raw_pwrite_args_32 {
> +       __u64 offset;           /* in */
> +       __u64 num_bytes;        /* in */
> +       __u64 disk_num_bytes;   /* in */
> +       __u8 compression;       /* in */
> +       __u8 encryption;        /* in */
> +       __u16 other_encoding;   /* in */
> +       __u32 reserved[7];
> +       compat_uptr_t buf;      /* in */
> +} __attribute__ ((__packed__));
> +
> +#define BTRFS_IOC_RAW_PWRITE_32 _IOW(BTRFS_IOCTL_MAGIC, 63, \
> +                                    struct btrfs_ioctl_raw_pwrite_args_32)
>  #endif
>
>  static int btrfs_clone(struct inode *src, struct inode *inode,
> @@ -5437,6 +5452,80 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
>         return ret;
>  }
>
> +static int btrfs_ioctl_raw_pwrite(struct file *file, void __user *argp,
> +                                 bool compat)
> +{
> +       struct btrfs_ioctl_raw_pwrite_args args;
> +       struct iov_iter iter;
> +       loff_t pos;
> +       struct kiocb kiocb;
> +       ssize_t ret;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +
> +       if (!(file->f_mode & FMODE_WRITE))
> +               return -EBADF;
> +
> +       if (compat) {
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +               struct btrfs_ioctl_raw_pwrite_args_32 args32;
> +
> +               if (copy_from_user(&args32, argp, sizeof(args32)))
> +                       return -EFAULT;
> +               args.offset = args32.offset;
> +               args.num_bytes = args32.num_bytes;
> +               args.disk_num_bytes = args32.disk_num_bytes;
> +               args.compression = args32.compression;
> +               args.encryption = args32.encryption;
> +               args.other_encoding = args32.other_encoding;
> +               memcpy(args.reserved, args32.reserved, sizeof(args.reserved));
> +               args.buf = compat_ptr(args32.buf);
> +#else
> +               return -ENOTTY;
> +#endif
> +       } else {
> +               if (copy_from_user(&args, argp, sizeof(args)))
> +                       return -EFAULT;
> +       }
> +
> +       /* The compression type must be valid. */
> +       if (args.compression == BTRFS_COMPRESS_NONE ||
> +           args.compression > BTRFS_COMPRESS_TYPES)
> +               return -EINVAL;
> +       /* Reserved fields must be zero. */
> +       if (args.encryption || args.other_encoding ||
> +           memchr_inv(args.reserved, 0, sizeof(args.reserved)))
> +               return -EINVAL;
> +
> +       if (unlikely(!access_ok(args.buf, args.disk_num_bytes)))
> +               return -EFAULT;
> +
> +       pos = args.offset;
> +       ret = rw_verify_area(WRITE, file, &pos, args.num_bytes);
> +       if (ret)
> +               return ret;
> +
> +       init_sync_kiocb(&kiocb, file);
> +       kiocb.ki_pos = pos;
> +       /*
> +        * This iov_iter is a lie; we only construct it so that we can use
> +        * write_iter.
> +        */
> +       iov_iter_init(&iter, WRITE, NULL, 0, args.num_bytes);
> +
> +       file_start_write(file);
> +       ret = btrfs_do_write_iter(&kiocb, &iter, &args);
> +       if (ret > 0) {
> +               ASSERT(ret == args.num_bytes);
> +               fsnotify_modify(file);
> +               add_wchar(current, ret);
> +       }
> +       inc_syscw(current);
> +       file_end_write(file);
> +       return ret < 0 ? ret : 0;
> +}
> +
>  long btrfs_ioctl(struct file *file, unsigned int
>                 cmd, unsigned long arg)
>  {
> @@ -5583,6 +5672,12 @@ long btrfs_ioctl(struct file *file, unsigned int
>                 return btrfs_ioctl_get_subvol_rootref(file, argp);
>         case BTRFS_IOC_INO_LOOKUP_USER:
>                 return btrfs_ioctl_ino_lookup_user(file, argp);
> +       case BTRFS_IOC_RAW_PWRITE:
> +               return btrfs_ioctl_raw_pwrite(file, argp, false);
> +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
> +       case BTRFS_IOC_RAW_PWRITE_32:
> +               return btrfs_ioctl_raw_pwrite(file, argp, true);
> +#endif
>         }
>
>         return -ENOTTY;
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 3ee0678c0a83..b4dce062265a 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -822,6 +822,73 @@ struct btrfs_ioctl_get_subvol_rootref_args {
>                 __u8 align[7];
>  };
>
> +enum btrfs_compression_type {
> +       BTRFS_COMPRESS_NONE  = 0,
> +       BTRFS_COMPRESS_ZLIB  = 1,
> +       BTRFS_COMPRESS_LZO   = 2,
> +       BTRFS_COMPRESS_ZSTD  = 3,
> +       BTRFS_COMPRESS_TYPES = 3,
> +};
> +
> +/*
> + * Write an extent directly to the filesystem. CAP_SYS_ADMIN is required and the
> + * file descriptor must be open for writing.
> + *
> + * Currently, this is only for writing compressed data. However, it may be
> + * extended in the future.
> + */
> +struct btrfs_ioctl_raw_pwrite_args {
> +       /*
> +        * Offset in file where to write. This must be aligned to the sector
> +        * size of the filesystem.
> +        */
> +       __u64 offset;           /* in */
> +       /*
> +        * Length of the extent in the file, in bytes. This must be aligned to
> +        * the sector size of the filesystem unless the data ends at or beyond
> +        * the current end of file; this special case is to support creating
> +        * files whose length is not aligned to the sector size.
> +        *
> +        * For a compressed extent, this is the length of the decompressed data.
> +        * It must be less than 128k (BTRFS_MAX_UNCOMPRESSED), although that
> +        * limit may increase in the future.
> +        */
> +       __u64 num_bytes;        /* in */
> +       /*
> +        * Length of the extent on disk, in bytes (see buf below).
> +        *
> +        * For a compressed extent, this does not need to be aligned to a
> +        * sector. It must be less than 128k (BTRFS_MAX_COMPRESSED), although
> +        * that limit may increase in the future.
> +        */
> +       __u64 disk_num_bytes;   /* in */
> +       /*
> +        * Compression type (enum btrfs_compression_type). Currently, this must
> +        * not be BTRFS_COMPRESS_NONE.
> +        */
> +       __u8 compression;       /* in */
> +       /* Encryption type. Currently, this must be zero. */
> +       __u8 encryption;        /* in */
> +       /* Other type of encoding. Currently, this must be zero. */
> +       __u16 other_encoding;   /* in */
> +       /* Reserved for future extensions. Must be zero. */
> +       __u32 reserved[7];
> +       /*
> +        * The raw data on disk.
> +        *
> +        * For a compressed extent, the format is as follows:
> +        *
> +        * - zlib: The extent is a single zlib stream.
> +        * - lzo: The extent is compressed page by page with LZO1X and wrapped
> +        *   according to the format documented in fs/btrfs/lzo.c.
> +        * - zstd: The extent is a single zstd stream. The windowLog compression
> +        *   parameter must be no more than 17 (ZSTD_BTRFS_MAX_WINDOWLOG).
> +        *
> +        * If the compressed data is invalid, reading will return an error.
> +        */
> +       void __user *buf;       /* in */
> +} __attribute__ ((__packed__));
> +
>  /* Error codes as returned by the kernel */
>  enum btrfs_err_code {
>         BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
> @@ -946,5 +1013,7 @@ enum btrfs_err_code {
>                                 struct btrfs_ioctl_get_subvol_rootref_args)
>  #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
>                                 struct btrfs_ioctl_ino_lookup_user_args)
> +#define BTRFS_IOC_RAW_PWRITE _IOW(BTRFS_IOCTL_MAGIC, 63, \
> +                                 struct btrfs_ioctl_raw_pwrite_args)
>
>  #endif /* _UAPI_LINUX_BTRFS_H */
> --
> 2.23.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-05 10:33   ` Nikolay Borisov
@ 2019-09-19  6:14     ` Omar Sandoval
  2019-09-19  7:46       ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:14 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Thu, Sep 05, 2019 at 01:33:56PM +0300, Nikolay Borisov wrote:
> 
> 
> On 4.09.19 г. 22:13 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This adds an API for writing compressed data directly to the filesystem.
> > The use case that I have in mind is send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this ioctl can stand alone.
> > 
> > The interface is essentially pwrite(2) with some extra information:
> > 
> > - The input buffer contains the compressed data.
> > - Both the compressed and decompressed sizes of the data are given.
> > - The compression type (zlib, lzo, or zstd) is given.
> > 
> > The interface is general enough that it can be extended to encrypted or
> > otherwise encoded extents in the future. A more detailed description,
> > including restrictions and edge cases, is included in
> > include/uapi/linux/btrfs.h.
> > 
> > The implementation is similar to direct I/O: we have to flush any
> > ordered extents, invalidate the page cache, and do the io
> > tree/delalloc/extent map/ordered extent dance. From there, we can reuse
> > the compression code with a minor modification to distinguish the new
> > ioctl from writeback.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> 
> Should we choose to continue with this interface (based on Dave's
> feedback) I'd rather see the following things changed:
> 
> > ---
> >  fs/btrfs/compression.c     |   6 +-
> >  fs/btrfs/compression.h     |  14 +--
> >  fs/btrfs/ctree.h           |   6 ++
> >  fs/btrfs/file.c            |  13 ++-
> >  fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
> >  fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
> >  include/uapi/linux/btrfs.h |  69 +++++++++++++
> >  7 files changed, 380 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > index b05b361e2062..6632dd8d2e4d 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
> >  			bio->bi_status == BLK_STS_OK);
> >  	cb->compressed_pages[0]->mapping = NULL;
> >  
> > -	end_compressed_writeback(inode, cb);
> > +	if (cb->writeback)
> > +		end_compressed_writeback(inode, cb);
> >  	/* note, our inode could be gone now */
> >  
> >  	/*
> > @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				 unsigned long compressed_len,
> >  				 struct page **compressed_pages,
> >  				 unsigned long nr_pages,
> > -				 unsigned int write_flags)
> > +				 unsigned int write_flags, bool writeback)
> >  {
> >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct bio *bio = NULL;
> > @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  	cb->mirror_num = 0;
> >  	cb->compressed_pages = compressed_pages;
> >  	cb->compressed_len = compressed_len;
> > +	cb->writeback = writeback;
> >  	cb->orig_bio = NULL;
> >  	cb->nr_pages = nr_pages;
> >  
> > diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> > index 4cb8be9ff88b..5b48eb730362 100644
> > --- a/fs/btrfs/compression.h
> > +++ b/fs/btrfs/compression.h
> > @@ -6,6 +6,7 @@
> >  #ifndef BTRFS_COMPRESSION_H
> >  #define BTRFS_COMPRESSION_H
> >  
> > +#include <linux/btrfs.h>
> >  #include <linux/sizes.h>
> >  
> >  /*
> > @@ -47,6 +48,9 @@ struct compressed_bio {
> >  	/* the compression algorithm for this bio */
> >  	int compress_type;
> >  
> > +	/* Whether this is a write for writeback. */
> > +	bool writeback;
> > +
> >  	/* number of compressed pages in the array */
> >  	unsigned long nr_pages;
> >  
> > @@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				  unsigned long compressed_len,
> >  				  struct page **compressed_pages,
> >  				  unsigned long nr_pages,
> > -				  unsigned int write_flags);
> > +				  unsigned int write_flags, bool writeback);
> >  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
> >  				 int mirror_num, unsigned long bio_flags);
> >  
> >  unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
> >  
> > -enum btrfs_compression_type {
> > -	BTRFS_COMPRESS_NONE  = 0,
> > -	BTRFS_COMPRESS_ZLIB  = 1,
> > -	BTRFS_COMPRESS_LZO   = 2,
> > -	BTRFS_COMPRESS_ZSTD  = 3,
> > -	BTRFS_COMPRESS_TYPES = 3,
> > -};
> > -
> >  struct workspace_manager {
> >  	const struct btrfs_compress_op *ops;
> >  	struct list_head idle_ws;
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 19d669d12ca1..9fae9b3f1f62 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
> >  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >  					  u64 end, int uptodate);
> > +
> > +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> > +			struct btrfs_ioctl_raw_pwrite_args *raw);
> > +
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  
> >  /* ioctl.c */
> > @@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
> >  			   struct btrfs_inode *inode);
> >  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
> >  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> > +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct btrfs_ioctl_raw_pwrite_args *args);
> >  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
> >  void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
> >  			     int skip_pinned);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 8fe4eb7e5045..ed23aa65b2d5 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
> >  		inode_inc_iversion(inode);
> >  }
> >  
> > -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > -				    struct iov_iter *from)
> > +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct btrfs_ioctl_raw_pwrite_args *raw)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file_inode(file);
> > @@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	if (sync)
> >  		atomic_inc(&BTRFS_I(inode)->sync_writers);
> >  
> > -	if (iocb->ki_flags & IOCB_DIRECT) {
> > +	if (raw) {
> > +		num_written = btrfs_raw_write(iocb, from, raw);
> > +	} else if (iocb->ki_flags & IOCB_DIRECT) {
> >  		num_written = __btrfs_direct_write(iocb, from);
> >  	} else {
> >  		num_written = btrfs_buffered_write(iocb, from);
> > @@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	return num_written ? num_written : err;
> >  }
> >  
> > +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	return btrfs_do_write_iter(iocb, from, NULL);
> > +}
> > +
> >  int btrfs_release_file(struct inode *inode, struct file *filp)
> >  {
> >  	struct btrfs_file_private *private = filp->private_data;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index a0546401bc0a..c8eaa1e5bf06 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
> >  				    ins.objectid,
> >  				    ins.offset, async_extent->pages,
> >  				    async_extent->nr_pages,
> > -				    async_chunk->write_flags)) {
> > +				    async_chunk->write_flags, true)) {
> >  			struct page *p = async_extent->pages[0];
> >  			const u64 start = async_extent->start;
> >  			const u64 end = start + async_extent->ram_size - 1;
> > @@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> >  	}
> >  }
> >  
> > +/* Currently, this only supports raw writes of compressed data. */
> > +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> > +			struct btrfs_ioctl_raw_pwrite_args *raw)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct inode *inode = file_inode(file);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct extent_changeset *data_reserved = NULL;
> > +	struct extent_state *cached_state = NULL;
> > +	unsigned long nr_pages, i;
> > +	struct page **pages;
> > +	u64 disk_num_bytes, num_bytes;
> > +	u64 start, end;
> > +	struct btrfs_key ins;
> > +	struct extent_map *em;
> > +	ssize_t ret;
> > +
> > +	if (iov_iter_count(from) != raw->num_bytes) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks(). We can't
> > +		 * do a partial raw write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +
> > +	/* This should be handled higher up. */
> > +	ASSERT(raw->num_bytes != 0);
> 
> This is already checked, indirectly via rw_verify_area, there's :
> 
>  if (unlikely((ssize_t) count < 0))
>                   return retval;
> 
> So you can remove this assert.

These are different checks. The assert is checking that we didn't get
here with a count of zero, whereas rw_verify_area() checks that the
count doesn't overflow a ssize_t. I removed the assert regardless.

> > +	/* The extent size must be sane. */
> > +	if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
> > +	    raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
> > +	    raw->disk_num_bytes == 0)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * The compressed data on disk must be sector-aligned. For convenience,
> > +	 * we extend the compressed data with zeroes if it isn't.
> > +	 */
> > +	disk_num_bytes = ALIGN(raw->disk_num_bytes, fs_info->sectorsize);
> > +	/*
> > +	 * The extent in the file must also be sector-aligned. However, we allow
> > +	 * a write which ends at or extends i_size to have an unaligned length;
> > +	 * we round up the extent size and set i_size to the given length.
> > +	 */
> > +	start = iocb->ki_pos;
> > +	if ((start & (fs_info->sectorsize - 1)))
> 
> if (!IS_ALIGNED(start, fs_info->sectorsize))

Fixed, thanks.

> > +		return -EINVAL;
> > +	if (start + raw->num_bytes >= inode->i_size) {
> > +		num_bytes = ALIGN(raw->num_bytes, fs_info->sectorsize);
> > +	} else {
> > +		num_bytes = raw->num_bytes;
> > +		if ((num_bytes & (fs_info->sectorsize - 1)))
> 
> ditto

Fixed.

> > +			return -EINVAL;
> > +	}
> > +	end = start + num_bytes - 1;
> > +
> > +	/*
> > +	 * It's valid for compressed data to be larger than or the same size as
> > +	 * the decompressed data. However, for buffered I/O, we never write out
> > +	 * a compressed extent unless it's smaller than the decompressed data,
> > +	 * so for now, let's not allow creating such extents with the ioctl,
> > +	 * either.
> > +	 */
> > +	if (disk_num_bytes >= num_bytes)
> > +		return -EINVAL;
> > +
> > +	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
> > +	pages = kcalloc(nr_pages, sizeof(struct page *),
> > +			GFP_USER | __GFP_NOWARN);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		unsigned long offset = i << PAGE_SHIFT, n;
> > +		char *kaddr;
> > +
> > +		pages[i] = alloc_page(GFP_USER | __GFP_NOWARN);
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out_pages;
> > +		}
> > +		kaddr = kmap(pages[i]);
> > +		if (offset < raw->disk_num_bytes) {
> > +			n = min_t(unsigned long, PAGE_SIZE,
> > +				  raw->disk_num_bytes - offset);
> > +			if (copy_from_user(kaddr, raw->buf + offset, n)) {
> > +				kunmap(pages[i]);
> > +				ret = -EFAULT;
> > +				goto out_pages;
> > +			}
> > +		} else {
> > +			n = 0;
> > +		}
> > +		if (n < PAGE_SIZE)
> > +			memset(kaddr + n, 0, PAGE_SIZE - n);
> > +		kunmap(pages[i]);
> > +	}
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		lock_extent_bits(io_tree, start, end, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     end - start + 1);
> > +		if (!ordered &&
> > +		    !filemap_range_has_page(inode->i_mapping, start, end))
> > +			break;
> > +		if (ordered)
> > +			btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
> > +				     &cached_state);
> > +		cond_resched();
> > +		ret = btrfs_wait_ordered_range(inode, start, end);
> > +		if (ret)
> > +			goto out_pages;
> > +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> > +						    start >> PAGE_SHIFT,
> > +						    end >> PAGE_SHIFT);
> > +		if (ret)
> > +			goto out_pages;
> > +	}
> 
> Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
> function + invalidate_inode_pages2_range ?

No, btrfs_lock_and_flush_ordered_range() doesn't write out dirty pages,
so it's not sufficient here.

Thanks for the review!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-10 11:39   ` Filipe Manana
@ 2019-09-19  6:23     ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2019-09-19  6:23 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team, linux-fsdevel

On Tue, Sep 10, 2019 at 12:39:28PM +0100, Filipe Manana wrote:
> On Wed, Sep 4, 2019 at 8:14 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > This adds an API for writing compressed data directly to the filesystem.
> > The use case that I have in mind is send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this ioctl can stand alone.
> >
> > The interface is essentially pwrite(2) with some extra information:
> >
> > - The input buffer contains the compressed data.
> > - Both the compressed and decompressed sizes of the data are given.
> > - The compression type (zlib, lzo, or zstd) is given.
> >
> > The interface is general enough that it can be extended to encrypted or
> > otherwise encoded extents in the future. A more detailed description,
> > including restrictions and edge cases, is included in
> > include/uapi/linux/btrfs.h.
> >
> > The implementation is similar to direct I/O: we have to flush any
> > ordered extents, invalidate the page cache, and do the io
> > tree/delalloc/extent map/ordered extent dance. From there, we can reuse
> > the compression code with a minor modification to distinguish the new
> > ioctl from writeback.
> >
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/compression.c     |   6 +-
> >  fs/btrfs/compression.h     |  14 +--
> >  fs/btrfs/ctree.h           |   6 ++
> >  fs/btrfs/file.c            |  13 ++-
> >  fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
> >  fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
> >  include/uapi/linux/btrfs.h |  69 +++++++++++++
> >  7 files changed, 380 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > index b05b361e2062..6632dd8d2e4d 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
> >                         bio->bi_status == BLK_STS_OK);
> >         cb->compressed_pages[0]->mapping = NULL;
> >
> > -       end_compressed_writeback(inode, cb);
> > +       if (cb->writeback)
> > +               end_compressed_writeback(inode, cb);
> >         /* note, our inode could be gone now */
> >
> >         /*
> > @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >                                  unsigned long compressed_len,
> >                                  struct page **compressed_pages,
> >                                  unsigned long nr_pages,
> > -                                unsigned int write_flags)
> > +                                unsigned int write_flags, bool writeback)
> >  {
> >         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >         struct bio *bio = NULL;
> > @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >         cb->mirror_num = 0;
> >         cb->compressed_pages = compressed_pages;
> >         cb->compressed_len = compressed_len;
> > +       cb->writeback = writeback;
> >         cb->orig_bio = NULL;
> >         cb->nr_pages = nr_pages;
> >
> > diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> > index 4cb8be9ff88b..5b48eb730362 100644
> > --- a/fs/btrfs/compression.h
> > +++ b/fs/btrfs/compression.h
> > @@ -6,6 +6,7 @@
> >  #ifndef BTRFS_COMPRESSION_H
> >  #define BTRFS_COMPRESSION_H
> >
> > +#include <linux/btrfs.h>
> >  #include <linux/sizes.h>
> >
> >  /*
> > @@ -47,6 +48,9 @@ struct compressed_bio {
> >         /* the compression algorithm for this bio */
> >         int compress_type;
> >
> > +       /* Whether this is a write for writeback. */
> > +       bool writeback;
> > +
> >         /* number of compressed pages in the array */
> >         unsigned long nr_pages;
> >
> > @@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >                                   unsigned long compressed_len,
> >                                   struct page **compressed_pages,
> >                                   unsigned long nr_pages,
> > -                                 unsigned int write_flags);
> > +                                 unsigned int write_flags, bool writeback);
> >  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
> >                                  int mirror_num, unsigned long bio_flags);
> >
> >  unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
> >
> > -enum btrfs_compression_type {
> > -       BTRFS_COMPRESS_NONE  = 0,
> > -       BTRFS_COMPRESS_ZLIB  = 1,
> > -       BTRFS_COMPRESS_LZO   = 2,
> > -       BTRFS_COMPRESS_ZSTD  = 3,
> > -       BTRFS_COMPRESS_TYPES = 3,
> > -};
> > -
> >  struct workspace_manager {
> >         const struct btrfs_compress_op *ops;
> >         struct list_head idle_ws;
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 19d669d12ca1..9fae9b3f1f62 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
> >  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >                                           u64 end, int uptodate);
> > +
> > +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> > +                       struct btrfs_ioctl_raw_pwrite_args *raw);
> > +
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >
> >  /* ioctl.c */
> > @@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
> >                            struct btrfs_inode *inode);
> >  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
> >  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> > +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> > +                           struct btrfs_ioctl_raw_pwrite_args *args);
> >  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
> >  void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
> >                              int skip_pinned);
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 8fe4eb7e5045..ed23aa65b2d5 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
> >                 inode_inc_iversion(inode);
> >  }
> >
> > -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > -                                   struct iov_iter *from)
> > +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> > +                           struct btrfs_ioctl_raw_pwrite_args *raw)
> >  {
> >         struct file *file = iocb->ki_filp;
> >         struct inode *inode = file_inode(file);
> > @@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >         if (sync)
> >                 atomic_inc(&BTRFS_I(inode)->sync_writers);
> >
> > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > +       if (raw) {
> > +               num_written = btrfs_raw_write(iocb, from, raw);
> > +       } else if (iocb->ki_flags & IOCB_DIRECT) {
> >                 num_written = __btrfs_direct_write(iocb, from);
> >         } else {
> >                 num_written = btrfs_buffered_write(iocb, from);
> > @@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >         return num_written ? num_written : err;
> >  }
> >
> > +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +       return btrfs_do_write_iter(iocb, from, NULL);
> > +}
> > +
> >  int btrfs_release_file(struct inode *inode, struct file *filp)
> >  {
> >         struct btrfs_file_private *private = filp->private_data;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index a0546401bc0a..c8eaa1e5bf06 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
> >                                     ins.objectid,
> >                                     ins.offset, async_extent->pages,
> >                                     async_extent->nr_pages,
> > -                                   async_chunk->write_flags)) {
> > +                                   async_chunk->write_flags, true)) {
> >                         struct page *p = async_extent->pages[0];
> >                         const u64 start = async_extent->start;
> >                         const u64 end = start + async_extent->ram_size - 1;
> > @@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> >         }
> >  }
> >
> > +/* Currently, this only supports raw writes of compressed data. */
> > +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> > +                       struct btrfs_ioctl_raw_pwrite_args *raw)
> > +{
> > +       struct file *file = iocb->ki_filp;
> > +       struct inode *inode = file_inode(file);
> > +       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +       struct btrfs_root *root = BTRFS_I(inode)->root;
> > +       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +       struct extent_changeset *data_reserved = NULL;
> > +       struct extent_state *cached_state = NULL;
> > +       unsigned long nr_pages, i;
> > +       struct page **pages;
> > +       u64 disk_num_bytes, num_bytes;
> > +       u64 start, end;
> > +       struct btrfs_key ins;
> > +       struct extent_map *em;
> > +       ssize_t ret;
> > +
> > +       if (iov_iter_count(from) != raw->num_bytes) {
> > +               /*
> > +                * The write got truncated by generic_write_checks(). We can't
> > +                * do a partial raw write.
> > +                */
> > +               return -EFBIG;
> > +       }
> > +
> > +       /* This should be handled higher up. */
> > +       ASSERT(raw->num_bytes != 0);
> > +
> > +       /* The extent size must be sane. */
> > +       if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
> > +           raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
> > +           raw->disk_num_bytes == 0)
> > +               return -EINVAL;
> 
> For the uncompressed size (num_bytes), even if it's within the limit,
> shouldn't we validate it?
> We can't trust for sure that what the user supplies is actually
> correct. Can't we grab that from the compressed buffer (I suppose at
> least most compression methods encode that in a header or trailer)?
> Without such validation we can end up corrupt data/metadata on disk.
> 
> Thanks.

The user could always spoof the length in the headers, as well. The only
way to really validate the length would be to decompress the data, which
defeats the purpose of this interface.

In any case, I don't think it's a big deal for the length to be invalid.
The decompression code for all three compression types already handles
the case where the data decompresses to less than num_bytes by
zero-filling the remainder of the extent and the case where the data
decompresses to more than num_bytes by truncating the data.

The user can also give us garbage data that doesn't decompress at all.
In my opinion, it's fair to document that if the compressed data is
malformed, the result of reading that data is undefined, as long as we
don't crash or leak disk contents or anything like that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-19  6:14     ` Omar Sandoval
@ 2019-09-19  7:46       ` Nikolay Borisov
  2019-09-19  7:59         ` Omar Sandoval
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2019-09-19  7:46 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: kernel-team, linux-btrfs, linux-fsdevel



On 19.09.19 г. 9:14 ч., Omar Sandoval wrote:
> On Thu, Sep 05, 2019 at 01:33:56PM +0300, Nikolay Borisov wrote:

<snip>

>>
>> Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
>> function + invalidate_inode_pages2_range ?
> 
> No, btrfs_lock_and_flush_ordered_range() doesn't write out dirty pages,
> so it's not sufficient here.

But it does - it calls btrfs_start_ordered_extent which calls
filemap_fdatawrite_range.

> 
> Thanks for the review!
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-19  7:46       ` Nikolay Borisov
@ 2019-09-19  7:59         ` Omar Sandoval
  2019-09-24 10:29           ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Omar Sandoval @ 2019-09-19  7:59 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: kernel-team, linux-btrfs, linux-fsdevel

On Thu, Sep 19, 2019 at 09:46:31AM +0200, Nikolay Borisov wrote:
> 
> 
> On 19.09.19 г. 9:14 ч., Omar Sandoval wrote:
> > On Thu, Sep 05, 2019 at 01:33:56PM +0300, Nikolay Borisov wrote:
> 
> <snip>
> 
> >>
> >> Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
> >> function + invalidate_inode_pages2_range ?
> > 
> > No, btrfs_lock_and_flush_ordered_range() doesn't write out dirty pages,
> > so it's not sufficient here.
> 
> But it does - it calls btrfs_start_ordered_extent which calls
> filemap_fdatawrite_range.

It only calls that for ranges which already have an ordered extent,
which we don't create until we're writing the dirty pages out (take a
look at where we call btrfs_add_ordered_extent()).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
  2019-09-19  7:59         ` Omar Sandoval
@ 2019-09-24 10:29           ` Nikolay Borisov
  0 siblings, 0 replies; 17+ messages in thread
From: Nikolay Borisov @ 2019-09-24 10:29 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: kernel-team, linux-btrfs, linux-fsdevel



On 19.09.19 г. 10:59 ч., Omar Sandoval wrote:
> On Thu, Sep 19, 2019 at 09:46:31AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 19.09.19 г. 9:14 ч., Omar Sandoval wrote:
>>> On Thu, Sep 05, 2019 at 01:33:56PM +0300, Nikolay Borisov wrote:
>>
>> <snip>
>>
>>>>
>>>> Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
>>>> function + invalidate_inode_pages2_range ?
>>>
>>> No, btrfs_lock_and_flush_ordered_range() doesn't write out dirty pages,
>>> so it's not sufficient here.
>>
>> But it does - it calls btrfs_start_ordered_extent which calls
>> filemap_fdatawrite_range.
> 
> It only calls that for ranges which already have an ordered extent,
> which we don't create until we're writing the dirty pages out (take a
> look at where we call btrfs_add_ordered_extent()).

Right, btrfs_wait_ordered_range actually does btrfs_fdatawrite_range to
instantiate any pending ranges and then does the lookup_first_ordered_
extent/start_ordered_extent loop.

> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2019-09-24 10:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-04 19:13 [PATCH 0/2] Btrfs: add interface for writing compressed extents directly Omar Sandoval
2019-09-04 19:13 ` [PATCH 1/2] fs: export rw_verify_area() Omar Sandoval
2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
2019-09-05  2:10   ` Dave Chinner
2019-09-05 12:16     ` Johannes Thumshirn
2019-09-05 12:51       ` Johannes Thumshirn
2019-09-06  7:46       ` Dave Chinner
2019-09-06 18:19     ` Omar Sandoval
2019-09-06 21:07       ` Dave Chinner
2019-09-06 21:27         ` Omar Sandoval
2019-09-05 10:33   ` Nikolay Borisov
2019-09-19  6:14     ` Omar Sandoval
2019-09-19  7:46       ` Nikolay Borisov
2019-09-19  7:59         ` Omar Sandoval
2019-09-24 10:29           ` Nikolay Borisov
2019-09-10 11:39   ` Filipe Manana
2019-09-19  6:23     ` Omar Sandoval

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.