All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nikolay Borisov <nborisov@suse.com>
To: Omar Sandoval <osandov@osandov.com>, linux-btrfs@vger.kernel.org
Cc: kernel-team@fb.com, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 2/2] btrfs: add ioctl for directly writing compressed data
Date: Thu, 5 Sep 2019 13:33:56 +0300	[thread overview]
Message-ID: <652f5971-2c82-e766-fde4-2076e65cf948@suse.com> (raw)
In-Reply-To: <8eae56abb90c0fe87c350322485ce8674e135074.1567623877.git.osandov@fb.com>



On 4.09.19 г. 22:13 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This adds an API for writing compressed data directly to the filesystem.
> The use case that I have in mind is send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this ioctl can stand alone.
> 
> The interface is essentially pwrite(2) with some extra information:
> 
> - The input buffer contains the compressed data.
> - Both the compressed and decompressed sizes of the data are given.
> - The compression type (zlib, lzo, or zstd) is given.
> 
> The interface is general enough that it can be extended to encrypted or
> otherwise encoded extents in the future. A more detailed description,
> including restrictions and edge cases, is included in
> include/uapi/linux/btrfs.h.
> 
> The implementation is similar to direct I/O: we have to flush any
> ordered extents, invalidate the page cache, and do the io
> tree/delalloc/extent map/ordered extent dance. From there, we can reuse
> the compression code with a minor modification to distinguish the new
> ioctl from writeback.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>


Should we choose to continue with this interface (based on Dave's
feedback) I'd rather see the following things changed:

> ---
>  fs/btrfs/compression.c     |   6 +-
>  fs/btrfs/compression.h     |  14 +--
>  fs/btrfs/ctree.h           |   6 ++
>  fs/btrfs/file.c            |  13 ++-
>  fs/btrfs/inode.c           | 192 ++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/ioctl.c           |  95 ++++++++++++++++++
>  include/uapi/linux/btrfs.h |  69 +++++++++++++
>  7 files changed, 380 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index b05b361e2062..6632dd8d2e4d 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
>  			bio->bi_status == BLK_STS_OK);
>  	cb->compressed_pages[0]->mapping = NULL;
>  
> -	end_compressed_writeback(inode, cb);
> +	if (cb->writeback)
> +		end_compressed_writeback(inode, cb);
>  	/* note, our inode could be gone now */
>  
>  	/*
> @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				 unsigned long compressed_len,
>  				 struct page **compressed_pages,
>  				 unsigned long nr_pages,
> -				 unsigned int write_flags)
> +				 unsigned int write_flags, bool writeback)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct bio *bio = NULL;
> @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  	cb->mirror_num = 0;
>  	cb->compressed_pages = compressed_pages;
>  	cb->compressed_len = compressed_len;
> +	cb->writeback = writeback;
>  	cb->orig_bio = NULL;
>  	cb->nr_pages = nr_pages;
>  
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 4cb8be9ff88b..5b48eb730362 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -6,6 +6,7 @@
>  #ifndef BTRFS_COMPRESSION_H
>  #define BTRFS_COMPRESSION_H
>  
> +#include <linux/btrfs.h>
>  #include <linux/sizes.h>
>  
>  /*
> @@ -47,6 +48,9 @@ struct compressed_bio {
>  	/* the compression algorithm for this bio */
>  	int compress_type;
>  
> +	/* Whether this is a write for writeback. */
> +	bool writeback;
> +
>  	/* number of compressed pages in the array */
>  	unsigned long nr_pages;
>  
> @@ -93,20 +97,12 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
>  				  unsigned long compressed_len,
>  				  struct page **compressed_pages,
>  				  unsigned long nr_pages,
> -				  unsigned int write_flags);
> +				  unsigned int write_flags, bool writeback);
>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  				 int mirror_num, unsigned long bio_flags);
>  
>  unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
>  
> -enum btrfs_compression_type {
> -	BTRFS_COMPRESS_NONE  = 0,
> -	BTRFS_COMPRESS_ZLIB  = 1,
> -	BTRFS_COMPRESS_LZO   = 2,
> -	BTRFS_COMPRESS_ZSTD  = 3,
> -	BTRFS_COMPRESS_TYPES = 3,
> -};
> -
>  struct workspace_manager {
>  	const struct btrfs_compress_op *ops;
>  	struct list_head idle_ws;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 19d669d12ca1..9fae9b3f1f62 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2905,6 +2905,10 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
>  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>  					  u64 end, int uptodate);
> +
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +			struct btrfs_ioctl_raw_pwrite_args *raw);
> +
>  extern const struct dentry_operations btrfs_dentry_operations;
>  
>  /* ioctl.c */
> @@ -2928,6 +2932,8 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
>  			   struct btrfs_inode *inode);
>  int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +			    struct btrfs_ioctl_raw_pwrite_args *args);
>  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
>  void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
>  			     int skip_pinned);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 8fe4eb7e5045..ed23aa65b2d5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1872,8 +1872,8 @@ static void update_time_for_write(struct inode *inode)
>  		inode_inc_iversion(inode);
>  }
>  
> -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> -				    struct iov_iter *from)
> +ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
> +			    struct btrfs_ioctl_raw_pwrite_args *raw)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
> @@ -1965,7 +1965,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (sync)
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> +	if (raw) {
> +		num_written = btrfs_raw_write(iocb, from, raw);
> +	} else if (iocb->ki_flags & IOCB_DIRECT) {
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> @@ -1996,6 +1998,11 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	return num_written ? num_written : err;
>  }
>  
> +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	return btrfs_do_write_iter(iocb, from, NULL);
> +}
> +
>  int btrfs_release_file(struct inode *inode, struct file *filp)
>  {
>  	struct btrfs_file_private *private = filp->private_data;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a0546401bc0a..c8eaa1e5bf06 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  				    ins.objectid,
>  				    ins.offset, async_extent->pages,
>  				    async_extent->nr_pages,
> -				    async_chunk->write_flags)) {
> +				    async_chunk->write_flags, true)) {
>  			struct page *p = async_extent->pages[0];
>  			const u64 start = async_extent->start;
>  			const u64 end = start + async_extent->ram_size - 1;
> @@ -10590,6 +10590,196 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
>  	}
>  }
>  
> +/* Currently, this only supports raw writes of compressed data. */
> +ssize_t btrfs_raw_write(struct kiocb *iocb, struct iov_iter *from,
> +			struct btrfs_ioctl_raw_pwrite_args *raw)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	struct extent_changeset *data_reserved = NULL;
> +	struct extent_state *cached_state = NULL;
> +	unsigned long nr_pages, i;
> +	struct page **pages;
> +	u64 disk_num_bytes, num_bytes;
> +	u64 start, end;
> +	struct btrfs_key ins;
> +	struct extent_map *em;
> +	ssize_t ret;
> +
> +	if (iov_iter_count(from) != raw->num_bytes) {
> +		/*
> +		 * The write got truncated by generic_write_checks(). We can't
> +		 * do a partial raw write.
> +		 */
> +		return -EFBIG;
> +	}
> +
> +	/* This should be handled higher up. */
> +	ASSERT(raw->num_bytes != 0);

This is already checked, indirectly via rw_verify_area, there's :

 if (unlikely((ssize_t) count < 0))
                  return retval;

So you can remove this assert.

> +
> +	/* The extent size must be sane. */
> +	if (raw->num_bytes > BTRFS_MAX_UNCOMPRESSED ||
> +	    raw->disk_num_bytes > BTRFS_MAX_COMPRESSED ||
> +	    raw->disk_num_bytes == 0)
> +		return -EINVAL;
> +
> +	/*
> +	 * The compressed data on disk must be sector-aligned. For convenience,
> +	 * we extend the compressed data with zeroes if it isn't.
> +	 */
> +	disk_num_bytes = ALIGN(raw->disk_num_bytes, fs_info->sectorsize);
> +	/*
> +	 * The extent in the file must also be sector-aligned. However, we allow
> +	 * a write which ends at or extends i_size to have an unaligned length;
> +	 * we round up the extent size and set i_size to the given length.
> +	 */
> +	start = iocb->ki_pos;
> +	if ((start & (fs_info->sectorsize - 1)))

if (!IS_ALIGNED(start, fs_info->sectorsize))

> +		return -EINVAL;
> +	if (start + raw->num_bytes >= inode->i_size) {
> +		num_bytes = ALIGN(raw->num_bytes, fs_info->sectorsize);
> +	} else {
> +		num_bytes = raw->num_bytes;
> +		if ((num_bytes & (fs_info->sectorsize - 1)))

ditto

> +			return -EINVAL;
> +	}
> +	end = start + num_bytes - 1;
> +
> +	/*
> +	 * It's valid for compressed data to be larger than or the same size as
> +	 * the decompressed data. However, for buffered I/O, we never write out
> +	 * a compressed extent unless it's smaller than the decompressed data,
> +	 * so for now, let's not allow creating such extents with the ioctl,
> +	 * either.
> +	 */
> +	if (disk_num_bytes >= num_bytes)
> +		return -EINVAL;
> +
> +	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
> +	pages = kcalloc(nr_pages, sizeof(struct page *),
> +			GFP_USER | __GFP_NOWARN);
> +	if (!pages)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_pages; i++) {
> +		unsigned long offset = i << PAGE_SHIFT, n;
> +		char *kaddr;
> +
> +		pages[i] = alloc_page(GFP_USER | __GFP_NOWARN);
> +		if (!pages[i]) {
> +			ret = -ENOMEM;
> +			goto out_pages;
> +		}
> +		kaddr = kmap(pages[i]);
> +		if (offset < raw->disk_num_bytes) {
> +			n = min_t(unsigned long, PAGE_SIZE,
> +				  raw->disk_num_bytes - offset);
> +			if (copy_from_user(kaddr, raw->buf + offset, n)) {
> +				kunmap(pages[i]);
> +				ret = -EFAULT;
> +				goto out_pages;
> +			}
> +		} else {
> +			n = 0;
> +		}
> +		if (n < PAGE_SIZE)
> +			memset(kaddr + n, 0, PAGE_SIZE - n);
> +		kunmap(pages[i]);
> +	}
> +
> +	for (;;) {
> +		struct btrfs_ordered_extent *ordered;
> +
> +		lock_extent_bits(io_tree, start, end, &cached_state);
> +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> +						     end - start + 1);
> +		if (!ordered &&
> +		    !filemap_range_has_page(inode->i_mapping, start, end))
> +			break;
> +		if (ordered)
> +			btrfs_put_ordered_extent(ordered);
> +		unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
> +				     &cached_state);
> +		cond_resched();
> +		ret = btrfs_wait_ordered_range(inode, start, end);
> +		if (ret)
> +			goto out_pages;
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +						    start >> PAGE_SHIFT,
> +						    end >> PAGE_SHIFT);
> +		if (ret)
> +			goto out_pages;
> +	}

Won't btrfs_lock_and_flush_ordered_range suffice here? Perhaps call that
function + invalidate_inode_pages2_range ?

> +
> +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> +					   num_bytes);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> +	if (ret)
> +		goto out_delalloc_release;
> +
> +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> +			  ins.offset, ins.offset, num_bytes, raw->compression,
> +			  BTRFS_ORDERED_COMPRESSED);
> +	if (IS_ERR(em)) {
> +		ret = PTR_ERR(em);
> +		goto out_free_reserve;
> +	}
> +	free_extent_map(em);
> +
> +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> +						num_bytes, ins.offset,
> +						BTRFS_ORDERED_COMPRESSED,
> +						raw->compression);
> +	if (ret) {
> +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> +		goto out_free_reserve;
> +	}
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +
> +	if (start + raw->num_bytes > inode->i_size)
> +		i_size_write(inode, start + raw->num_bytes);
> +
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> +
> +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> +					  ins.offset, pages, nr_pages, 0,
> +					  false)) {
> +		struct page *page = pages[0];
> +
> +		page->mapping = inode->i_mapping;
> +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> +		page->mapping = NULL;
> +		ret = -EIO;
> +		goto out_pages;
> +	}
> +	iocb->ki_pos += raw->num_bytes;
> +	return raw->num_bytes;
> +
> +out_free_reserve:
> +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> +out_delalloc_release:
> +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> +				     true);
> +out_unlock:
> +	unlock_extent_cached(io_tree, start, end, &cached_state);
> +out_pages:
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pages[i])
> +			put_page(pages[i]);
> +	}
> +	kfree(pages);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_SWAP
>  /*
>   * Add an entry indicating a block group or device which is pinned by a

<snip>

  parent reply	other threads:[~2019-09-05 10:34 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-04 19:13 [PATCH 0/2] Btrfs: add interface for writing compressed extents directly Omar Sandoval
2019-09-04 19:13 ` [PATCH 1/2] fs: export rw_verify_area() Omar Sandoval
2019-09-04 19:13 ` [PATCH 2/2] btrfs: add ioctl for directly writing compressed data Omar Sandoval
2019-09-05  2:10   ` Dave Chinner
2019-09-05 12:16     ` Johannes Thumshirn
2019-09-05 12:51       ` Johannes Thumshirn
2019-09-06  7:46       ` Dave Chinner
2019-09-06 18:19     ` Omar Sandoval
2019-09-06 21:07       ` Dave Chinner
2019-09-06 21:27         ` Omar Sandoval
2019-09-05 10:33   ` Nikolay Borisov [this message]
2019-09-19  6:14     ` Omar Sandoval
2019-09-19  7:46       ` Nikolay Borisov
2019-09-19  7:59         ` Omar Sandoval
2019-09-24 10:29           ` Nikolay Borisov
2019-09-10 11:39   ` Filipe Manana
2019-09-19  6:23     ` Omar Sandoval

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=652f5971-2c82-e766-fde4-2076e65cf948@suse.com \
    --to=nborisov@suse.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=osandov@osandov.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.